How to test an application for correct encoding (e.g. UTF-8)

Posted by Olaf on Stack Overflow See other posts from Stack Overflow or by Olaf
Published on 2009-01-25T20:20:17Z Indexed on 2010/04/01 18:53 UTC
Read the original article Hit count: 347

Encoding issues are among the one topic that have bitten me most often during development. Every platform insists on its own encoding, most likely some non-UTF-8 defaults are in the game. (I'm usually working on Linux, defaulting to UTF-8, my colleagues mostly work on german Windows, defaulting to ISO-8859-1 or some similar windows codepage)

I believe, that UTF-8 is a suitable standard for developing an i18nable application. However, in my experience encoding bugs are usually discovered late (even though I'm located in Germany and we have some special characters that along with ISO-8859-1 provide some detectable differences).

I believe that those developers with a completely non-ASCII character set (or those that know a language that uses such a character set) are getting a head start in providing test data. But there must be a way to ease this for the rest of us as well.

What [technique|tool|incentive] are people here using? How do you get your co-developers to care for these issues? How do you test for compliance? Are those tests conducted manually or automatically?

Adding one possible answer upfront:

I've recently discovered fliptitle.com (they are providing an easy way to get weird characters written "u?op ?pisdn" *) and I'm planning on using them to provide easily verifiable UTF-8 character strings (as most of the characters used there are at some weird binary encoding position) but there surely must be more systematic tests, patterns or techniques for ensuring UTF-8 compatibility/usage.

Note: Even though there's an accepted answer, I'd like to know of more techniques and patterns if there are some. Please add more answers if you have more ideas. And it has not been easy choosing only one answer for acceptance. I've chosen the regexp answer for the least expected angle to tackle the problem although there would be reasons to choose other answers as well. Too bad only one answer can be accepted.

Thank you for your input.

*) that's "upside down" written "upside down" for those that cannot see those characters due to font problems

© Stack Overflow or respective owner

Related posts about language-agnostic

Related posts about encoding