Late last week, I was working on a project that was responsible for reading the contents of a CSV, parsing the information, and then inserting it into the WordPress database.

But I hit a snag (as we so often do, right?): The first few rows of the CSV were working fine, but a number of the rows were failing to import.

The thing is, there appeared to be no rhyme or reason. I made sure the CSV was a plain text file and even re-saved it twice from a plain text editor to make sure any, um, ‘stray’ characters were removed.

Unfortunately, it didn’t work, so rather than spend time trying to reformat the entire file, I ended up writing a small regex to strip hidden ASCII characters from the incoming information.

Remove Hidden ASCII Characters

The regex is simple:

// Remove anything that is not a letter ('a-z', 'A-Z'), a digit ('0-9'), or whitespace from the given $value
$value = preg_replace( '/[^a-zA-Z0-9\s]/', '', $value );

I tend to be a pragmatist, so I don’t often do much more than necessary to solve the problem at hand. This isn’t an excuse to be lazy (quite the opposite, really); it’s just meant to keep the solution in my project focused on what the problem actually needs.

No more, no less.

But I also recognize that this is a problem others may encounter, and simply stripping out everything that isn’t a letter, a number, or whitespace may not be enough (or, since it also removes punctuation and accented characters, may be too much).
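For what it’s worth, here’s a minimal sketch of a less aggressive variation, assuming the culprits are a UTF-8 byte order mark or ASCII control characters (I haven’t confirmed that’s what was in my file). The function name strip_hidden_characters is hypothetical and only used for illustration; it leaves punctuation, accented characters, tabs, and line breaks alone.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress or PHP): removes a UTF-8
 * byte order mark and ASCII control characters from a string, while
 * leaving printable characters, punctuation, and multibyte text intact.
 *
 * Assumption: the hidden characters are a BOM or control bytes.
 */
function strip_hidden_characters( $value ) {
	// Drop the UTF-8 byte order mark (bytes EF BB BF) that some exports add.
	$value = str_replace( "\xEF\xBB\xBF", '', $value );

	// Drop control characters (0x00-0x1F and 0x7F), but keep tab (0x09),
	// line feed (0x0A), and carriage return (0x0D).
	return preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $value );
}

// Example: the BOM and the null byte are removed, the apostrophe survives.
$clean = strip_hidden_characters( "\xEF\xBB\xBFJohn\x00 O'Brien" );
echo $clean; // John O'Brien
```

Because the pattern works on single bytes below 0x80, it doesn’t touch UTF-8 multibyte characters. It also won’t catch things like zero-width spaces, so treat it as a starting point rather than a complete fix.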

At this point, you guys know how this works:

  • I provide the code and a link to the gist
  • I ask if there are any improvements to the code you’d make in order to make it as resilient as possible
  • You guys improve the code ;)

In all seriousness, I know the code is just enough to work for my own needs, but I’m sure it can be improved for other use cases.

So check out the gist, make it better, and let’s help out those who ultimately encounter the same problem.