How do I best remove the Unicode characters that XHTML regards as invalid, using PHP?

Posted by Andrew Stacey on Stack Overflow
Published on 2010-04-13T07:43:56Z

I run a forum that supports an international mathematics group. I recently switched it to Unicode for better support of international characters. While debugging the conversion, I discovered that not all Unicode characters are valid in XHTML (the relevant reference appears to be http://www.w3.org/TR/unicode-xml/). Before presenting posts to the browser, the forum software runs an XHTML validation/sanitisation step. That seems a reasonable place to remove any Unicode characters that XHTML doesn't allow.

So my question is:

Is there a standard (or best) way of doing this in PHP?

(The forum is written in PHP, by the way.)

I guess the failsafe would be a simple str_replace. (If that is also the best way, do I need to do anything extra to make sure it works properly with Unicode?) But that would mean going carefully through the XHTML DTD (or the W3 page referenced above) to work out which characters to list in the search argument of str_replace. So if this is the best way, has someone already done that so that I can steal, err, copy, it?
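For illustration, here is a minimal sketch of the str_replace approach described above. It assumes the posts are valid UTF-8 and that the characters to drop are those outside the XML 1.0 "Char" production (the C0 controls other than tab, line feed and carriage return, plus U+FFFE and U+FFFF). The function names are hypothetical and do not come from any forum software. Under those assumptions str_replace needs nothing extra for Unicode: it works on bytes, and none of the byte sequences searched for can occur inside another UTF-8 character.

```php
<?php
// Hypothetical helper: builds the list of byte sequences to search for.
// Assumption: "invalid" means outside the XML 1.0 Char production.
function xml10_forbidden_sequences()
{
    $forbidden = array();
    for ($code = 0; $code < 0x20; $code++) {
        // Tab (U+0009), line feed (U+000A) and carriage return (U+000D) are allowed.
        if ($code === 0x09 || $code === 0x0A || $code === 0x0D) {
            continue;
        }
        $forbidden[] = chr($code);
    }
    // UTF-8 encodings of the two excluded code points U+FFFE and U+FFFF.
    $forbidden[] = "\xEF\xBF\xBE";
    $forbidden[] = "\xEF\xBF\xBF";
    return $forbidden;
}

// Hypothetical helper: removes every forbidden sequence from a UTF-8 string.
function strip_forbidden_with_str_replace($utf8Text)
{
    return str_replace(xml10_forbidden_sequences(), '', $utf8Text);
}
```

Calling strip_forbidden_with_str_replace("a\x0Cb") would drop the form feed and leave the surrounding letters untouched. This sketch does not check that the input is well-formed UTF-8.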

(Incidentally, the character that caused the problem was U+000C, the form feed, which (according to the W3 page) is valid in HTML but invalid in XHTML!)
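One possible way to do the removal in a single pass is a Unicode-aware regular expression. The sketch below assumes that the forum stores posts as UTF-8 and that "valid for XHTML" can be approximated by the XML 1.0 "Char" production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The function name strip_invalid_xml_chars is hypothetical. preg_replace with the /u modifier treats both pattern and subject as UTF-8 and returns null when the subject is malformed, which the sketch turns into an exception rather than silently discarding the post.

```php
<?php
// Hypothetical helper; the name is an assumption, not an existing PHP or forum function.
function strip_invalid_xml_chars($utf8Text)
{
    // Negated character class: match anything NOT allowed by the
    // XML 1.0 Char production. The /u modifier enables UTF-8 mode.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $utf8Text);

    // In UTF-8 mode preg_replace returns null on error, for example
    // when the input is not well-formed UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

With this sketch, a form feed such as the U+000C mentioned above is removed, while ordinary text, accented letters and mathematical symbols pass through unchanged.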

Tags: php, unicode, XHTML
