How do I best remove the Unicode characters that XHTML regards as invalid, using PHP?

Posted by Andrew Stacey on Stack Overflow
Published on 2010-04-13T07:43:56Z Indexed on 2010/04/13 10:32 UTC


I run a forum designed to support an international mathematics group. I've recently switched it to Unicode for better support of international characters. In debugging this conversion, I've discovered that not all Unicode characters are considered valid in XHTML (the relevant document appears to be http://www.w3.org/TR/unicode-xml/). One of the steps that the forum software goes through before presenting posts to the browser is an XHTML validation/sanitisation step. It seems reasonable that it should remove any Unicode characters that XHTML doesn't like at that stage.

So my question is:

Is there a standard (or best) way of doing this in PHP?

(The forum is written in PHP, by the way.)

I guess that the failsafe would be a simple str_replace. (If that's also the best way, do I need to do anything extra to make sure it works properly with Unicode?) But that would mean going carefully through the XHTML DTD (or the above-referenced W3 page) to figure out which characters to list in the search part of str_replace. So if this is the best way, has someone already done that so that I can steal, err, copy, it?
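For illustration, here is a minimal sketch of one alternative to a hand-built str_replace list: a single preg_replace with the /u (UTF-8) modifier that removes everything outside the XML 1.0 "Char" production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The function name strip_invalid_xml_chars is hypothetical, the type declarations assume PHP 7 or later, and the input is assumed to be UTF-8. This is not the forum software's own method, and it does not remove characters that the W3 page merely discourages.

```php
<?php
// Hypothetical helper; the name is illustrative and not from any library.
// Assumes PHP 7+ and UTF-8 input.
function strip_invalid_xml_chars(string $text): string
{
    // Negated character class of the XML 1.0 "Char" production.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $text);

    // preg_replace returns null on error, for example malformed UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $clean;
}

// Usage: the form feed is dropped, ordinary and non-ASCII text is kept.
echo strip_invalid_xml_chars("caf\u{E9}\f ok"), "\n";
```

A regular expression keeps the rule in one place, and the /u modifier answers the "anything extra for Unicode" worry for this approach: the subject is matched by code point rather than by byte. Plain str_replace is byte-based, which is still safe on UTF-8 as long as both the search strings and the subject are valid UTF-8.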

(Incidentally, the character that caused the problem was U+000C, the form feed, which (according to the W3 page) is valid in HTML but invalid in XHTML!)
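When tracking down a character like this one, it can help to report the offending code points rather than strip them silently. The sketch below does that under the same assumptions as the XML 1.0 "Char" production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The function name find_invalid_xml_chars is hypothetical, and the code assumes UTF-8 input, PHP 7.2 or later, and the mbstring extension (for mb_ord).

```php
<?php
// Hypothetical helper; the name is illustrative and not from any library.
// Assumes PHP 7.2+ with the mbstring extension, and UTF-8 input.
function find_invalid_xml_chars(string $text): array
{
    // Same negated XML 1.0 "Char" class; /u enables UTF-8 matching.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // preg_match_all returns false on error, for example malformed UTF-8.
    if (preg_match_all($pattern, $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Report each distinct offending character once, as U+XXXX.
    $found = [];
    foreach (array_unique($matches[0]) as $char) {
        $found[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }

    return $found;
}

// Usage: a post containing form feeds reports U+000C.
print_r(find_invalid_xml_chars("page one\fpage two\f"));
```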

© Stack Overflow or respective owner
