Hide and Seek

Late last week, I was working on a project that was responsible for reading the contents of a CSV, parsing the information, and then inserting it into the WordPress database.

But I hit a snag (as we so often do, right?): The first few rows of the CSV were working fine, but a number of the rows were failing to import.

The thing is, there appeared to be no rhyme or reason. I made sure the CSV was a raw text file and even saved a new version of the file using a raw editor twice to make sure any, um, ‘stray’ characters were being removed.

Unfortunately, it didn’t work, so rather than spend time trying to reformat the entire file, I ended up writing a small regex to strip hidden ASCII characters from the incoming information.

Remove Hidden ASCII Characters

The regex is simple:

// Remove anything that is not a letter ('a-z', 'A-Z'), a digit ('0-9'), or whitespace from the given $value
$value = preg_replace( '/[^a-zA-Z0-9\s]/', '', $value );
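To show where this fits in an import, here is a minimal sketch that runs the same regex over every field of every CSV row. The helper names `clean_value` and `read_clean_rows` are made up for illustration, the sample data is invented, and an in-memory stream stands in for the real file; the actual import code is not shown here, so treat this as one possible shape rather than what the project does.

```php
<?php
// Hypothetical helper wrapping the regex from the post.
function clean_value( string $value ): string {
    return preg_replace( '/[^a-zA-Z0-9\s]/', '', $value );
}

// Hypothetical helper: read rows from an open CSV handle and clean every field.
function read_clean_rows( $handle ): array {
    $rows = array();

    // Delimiter, enclosure, and escape are passed explicitly (assumed defaults).
    while ( false !== ( $row = fgetcsv( $handle, 0, ',', '"', '\\' ) ) ) {
        // fgetcsv() returns array( null ) for a blank line; skip those.
        if ( array( null ) === $row ) {
            continue;
        }
        $rows[] = array_map( 'clean_value', $row );
    }

    return $rows;
}

// Invented sample data: a bell character (0x07) and a stray 0xA0 byte are hidden in the second row.
$handle = fopen( 'php://memory', 'w+' );
fwrite( $handle, "Title,Author\nHello\x07 World,Tom\xA0\n" );
rewind( $handle );

$rows = read_clean_rows( $handle );
fclose( $handle );
```

Cleaning each field after `fgetcsv()` has split the row, rather than cleaning the raw line, keeps the commas and quotes that the CSV parser itself relies on.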

I tend to be a pragmatist, so I don’t often do much more than necessary to solve the problem at hand. This isn’t an excuse to be lazy – quite the opposite, really – it’s just meant to make sure the solution in my project is enough to do the job.

No more, no less.

But I also recognize that this is a problem others may encounter and my simple parsing out of everything that isn’t a letter or a number may not be enough.
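For data that needs to keep its punctuation, a gentler variant might look like the sketch below. It assumes the hidden characters are ASCII control characters or a leading UTF-8 byte order mark, which may not be true of your file, and `strip_hidden_characters` is a hypothetical name rather than anything from the gist.

```php
<?php
// Hypothetical helper: remove only characters that are invisible in an editor,
// leaving punctuation and other printable text alone.
function strip_hidden_characters( string $value ): string {
    // Drop a UTF-8 byte order mark at the very start of the value.
    $value = preg_replace( '/^\xEF\xBB\xBF/', '', $value );

    // Drop ASCII control characters, but keep tab, line feed, and carriage return.
    $value = preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $value );

    return $value;
}

// Invented sample value with a BOM and a NUL byte hidden in it.
$name = "\xEF\xBB\xBFO'Brien\x00, Jr.";

$gentle = strip_hidden_characters( $name );                // "O'Brien, Jr."
$strict = preg_replace( '/[^a-zA-Z0-9\s]/', '', $name );   // "OBrien Jr"
```

The trade-off is that this version lets through anything it does not explicitly list, so it is only appropriate when later steps escape or validate the value before it reaches the database.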

At this point, you guys know how this works:

  • I provide the code and a link to the gist
  • I ask if there are any improvements to the code you’d make in order to make it as resilient as possible
  • You guys improve the code ;)

But in all seriousness, I know that the code is just enough to work for my own needs, but I’m sure it can be improved for other use cases.

So check out the gist, make it better, and let’s help out those that ultimately encounter the same problem.


Join the conversation! 4 Comments

  1. Did you determine what kind of ‘invalid’ characters were in there? Maybe some UTF-8 or ISO-WIN1252?

    The reason I ask is that an often overlooked problem when dealing with invalid UTF-8 character sequences is that a naive stripping mechanism can open up security problems. If you do other sanity checks, make sure you run them after any character stripping.

    For example, consider a sequence like:

    <sc[invalid]ript>/*evil javascript here*/</sc[invalid]ript>

    If you were sanitizing for script tags early, then stripping invalid sequences later, you’d open up XSS possibilities.
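The ordering problem described above can be sketched in a few lines. Everything here is illustrative: `strip_non_printable` and `contains_script_tag` are hypothetical helpers, the check is deliberately naive, and a single `0x80` byte stands in for the `[invalid]` sequence.

```php
<?php
// Hypothetical stripper: drop every byte outside printable ASCII.
function strip_non_printable( string $value ): string {
    return preg_replace( '/[^\x20-\x7E]/', '', $value );
}

// Hypothetical, deliberately naive check for a script tag.
function contains_script_tag( string $value ): bool {
    return false !== stripos( $value, '<script' );
}

// 0x80 on its own is not valid UTF-8; it plays the role of [invalid].
$input = "<sc\x80ript>alert(1)</sc\x80ript>";

// Wrong order: the check runs on the raw input and finds nothing...
$flagged_before = contains_script_tag( $input );   // false

// ...then stripping reassembles the tag after the check has already passed.
$clean = strip_non_printable( $input );            // "<script>alert(1)</script>"

// Right order: strip first, then check the value that will actually be used.
$flagged_after = contains_script_tag( $clean );    // true
```

A stripper that also removes `<` and `>`, like the alphanumeric regex in the post, does not reassemble a tag this way; the risk applies to strippers that remove only the invalid bytes.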

    • It was invalid UTF-8, and the file was simple enough that stripping out anything but alphanumeric characters worked.

      Your security point is spot on, though. Hadn’t considered that so I’m glad you shared.

      Konstantin’s comment also shares a function I had no idea existed until today – so there’s that :)
