Unicode Regex; Invalid XML characters

Posted by Ambush Commander on Stack Overflow
Published on 2008-12-29T06:51:44Z Indexed on 2010/03/27 7:23 UTC


The set of valid XML characters is well known. As defined by the XML 1.0 spec, it is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

My question is whether it is possible to write a PCRE regular expression for this set (or its inverse) without hard-coding the code points, by using Unicode general categories instead. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that it improperly covers linefeeds and tabs and misses some other invalid characters.
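For comparison, here is a minimal sketch of the hard-coded baseline that the question hopes to avoid, written in Python rather than PCRE (Python's built-in `re` module does not support `\p{...}` categories, so the ranges are spelled out). The names `INVALID_XML_CHAR` and `strip_invalid_xml_chars` are hypothetical, chosen only for illustration. The last few lines use `unicodedata` to show why a category-based class over-matches: tab and U+0085 are both in category Cc, yet both are valid XML characters.

```python
import re
import unicodedata

# Negation of the XML 1.0 Char production, with the code points hard-coded:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove every character that the Char production does not allow."""
    return INVALID_XML_CHAR.sub("", text)


# Why [\p{Cc}\p{Cs}\p{Cn}] is not an exact inverse: these characters are
# valid in XML 1.0 but still fall into the Cc (control) category.
valid_but_control = [ch for ch in ("\t", "\n", "\r", "\x85")
                     if unicodedata.category(ch) == "Cc"
                     and not INVALID_XML_CHAR.search(ch)]
```

Calling `strip_invalid_xml_chars("a\x00b\tc")` drops the NUL but keeps the tab, which is the behaviour a category-only class fails to reproduce.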

© Stack Overflow or respective owner
