UTF8

For those who have been into computer science for any amount of time, you’re likely familiar with Joel Spolsky, his blog Joel on Software, and/or perhaps any of his books.

A couple of years ago, I read an article called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

I’m not ashamed to admit that, at the time, it wasn’t very applicable to me. Yes, it was interesting, and yes, I cared, but I didn’t have a practical way to apply it simply because there was nothing that I was working on that warranted the information in the article.

But here was one of my biggest takeaways:

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

Fast forward a couple of years and I was working at a place where every piece of application code that we rolled out had to be internationalized because it was accessible by a variety of countries all across the world – now it was more practical (and it’s not much different than WordPress, huh?).

And now, I’m finding myself working with unicode characters in WordPress more than I ever have before.

Here’s the thing that few people talk about: Sites, themes, or HTML in general will specify a character set that can drastically affect how the content in your page is rendered.

Unicode Characters in WordPress

For those who have been doing internationalized WordPress work for some time now, you understand functions such as _e, __, and the significance of text domains. But we – well, at least I – have always taken the features of these particular functions for granted because they abstract some of the luxuries that we haven’t always had.

I will go study encodings and properly use UTF-8

See, this makes it really easy to make sure our work in WordPress is translated and thus accessible by those that speak other languages. But there’s more to it than that.

Let Me Back Up a Bit: JavaScript and Unicode Characters

I’ve been working on a project in which I’m using regular expressions in JavaScript to match certain characters, strings, and words in order to process them. The thing is, I’m using the standard unicode character set for two reasons:

  1. The content type in the browser is defined as UTF-8
  2. If the character set is different for a customer, then I leave notes letting them know that the character set will be different and will require the necessary adjustments
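One practical point worth illustrating here (my own sketch, not code from the project): JavaScript strings hold Unicode code units, and a \uXXXX escape names a code point directly in plain ASCII, so a pattern written with escapes keeps working regardless of the charset the page was served with.

```javascript
// A \uXXXX escape names a Unicode code point directly. The escape
// itself is plain ASCII, so it survives any source-file encoding.
var eAcute = '\u00E9';

console.log(eAcute === 'é');         // true
console.log(/\u00E9/.test('café'));  // true
console.log('\u003A' === ':');       // true -- a colon, by code point
```

This is why the ranges later in this post are written as escapes rather than literal characters.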

It sounds fair enough, but I’m not so sure.

I consider myself to be more of a pragmatist – if something that I’m working on is usually targeting a specific character set and the platform and/or framework in question supports internationalization, then I default to those facilities to provide the steps necessary to handle non-utf8 character set translation.

So the first question is: Assuming the requirements don’t dictate that we handle each character set, should we, as developers, be handling every single case of character sets that exist for a content type, or should we leave it up to our international friends to handle their content type?

Why We Exclude Certain Unicode Characters

Depending on your use case, you could potentially escape every single character that exists in the string that’s being parsed, but that’s not always going to be the best course of action.

To that end, there are certain reserved characters, if you will, that are similar to reserved words in computer programming languages. Some of these characters are <, >, [, ], {, }, -, * and so on.

In short, these characters perform a specific function within the context of regular expressions, so trying to match them in a regular expression without any type of escaping or encoding will actually break the regular expression.

So what are the options?

  1. Encode the characters, then decode them when processing or matching them later
  2. Blacklist them as characters that cannot be used

For this current project, I’ve been working on the second option as I don’t really need to match them for what the work requires.
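For readers weighing the first option, a common pattern (my own sketch, not code from this project) is a small helper that escapes regex metacharacters before building a pattern, so user-supplied text can be matched literally:

```javascript
// Escape characters that have special meaning inside a regular
// expression so they can be matched literally. The character class
// below covers the common regex metacharacters.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& is the matched character
}

// Build a pattern that literally matches user-supplied text:
var needle = '[draft]*';
var pattern = new RegExp(escapeRegExp(needle));

console.log(pattern.test('post title [draft]*')); // true
console.log(pattern.test('post title draft'));    // false
```

The blacklist approach avoids this step entirely, at the cost of disallowing those characters in the input.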

How To Match Unicode Characters

First off, it’s always good to have a unicode reference handy. Though a new one may always emerge, I’ve found this particular reference to be exceptionally useful:

Unicode Character Table

You see, it provides the unicode characters that correspond to the alphanumeric and special characters so that you can create regular expressions like the following:

/^[\u0000-\u002B\u0021-\u002F\u003A-\u003F\u005B-\u005F\u007B-\u007E]+/

This regular expression matches a range of characters that are reserved in regular expressions, so that they can be excluded.

Of course, this is only for the content type of utf8.

Prior to this, I was perfectly fine stripping out all non-alphanumeric characters by using \W+, but as I wanted to add flexibility and support for additional characters, specifying ranges of characters seemed easier.

Which brings me to my second question: Is this really the best way to go about approaching this, or is there a better alternative that supports each variety of content encoding or, again, do we leave this up to our multi-lingual friends?

How To Handle Unicode Characters in WordPress?

So I’ve laid out some of the issues I’ve been working through on some of the projects that I’ve had going on and this is where I’ve landed. In the post, I asked the following two questions:

  1. Assuming the requirements don’t dictate that we handle each character set, should we, as developers, be handling every single case of character sets that exist for a content type, or should we leave it up to our international friends to handle their content type?
  2. Is this really the best way to go about approaching this, or is there a better alternative that supports each variety of content encoding or, again, do we leave this up to our multi-lingual friends?

And I ask honestly because I’m looking to provide the greatest amount of flexibility and resilient code possible, and I know I’m not the first person to encounter this problem. So, for those of you who have encountered it, any advice?

1 Comment

  1. Internally, all web browsers convert all text formats to UTF-16: HTML, CSS, XML, and JavaScript. So when you have an HTML file encoded in Windows-1252 or BIG5 and a script encoded in UTF-8, they will fit happily together as long as they are served with the correct encoding information (charset) by the web server.

    There is no BOM or any other reliable way to announce the internal encoding within a JavaScript file itself – that would break concatenation.

    So in a controlled environment, there is no problem.

    \u003A is not just valid in a UTF-8 encoded file; it is a Unicode code point reference and therefore valid in any allowed encoding.

    For code delivered to an unknown audience (like a WordPress plugin) you cannot do much. If the server isn’t set up properly and doesn’t send a correct file encoding you have to stick to Unicode references. In your own projects just use real UTF-8.
