Mathematica regular expressions on Unicode strings.

Posted by dreeves on Stack Overflow
Published on 2010-03-25T02:32:18Z Indexed on 2010/03/25 2:33 UTC

This was a fascinating debugging experience. Can you spot the difference between the following two lines?

StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]

They do very different things when you evaluate them. It turns out that the string being replaced in the first line is a Unicode en dash (U+2013), whereas the second line uses a plain old ASCII hyphen.

In the case of the Unicode string, the regular expression doesn't match, so the string comes back unchanged. I meant the regex "[\s\S]" to mean "match any character (including newline)", but Mathematica apparently treats it as "match any ASCII character".

How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?

PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.
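
For the asciify route, here is a minimal sketch, assuming it is acceptable to simply drop every character whose code is above 127. The helper name asciify is hypothetical, not a built-in; it only relies on ToCharacterCode, Select and FromCharacterCode.

```mathematica
(* Hypothetical helper, not a built-in: keep only the characters
   whose character code is below 128, i.e. plain ASCII. *)
asciify[s_String] := FromCharacterCode[Select[ToCharacterCode[s], # < 128 &]]

(* The en dash has code 8211, so it is dropped and "ab" remains. *)
asciify["a–b"]
```

This throws information away rather than fixing the match, so it only helps if the non-ASCII characters are not needed afterwards. Replacing them with a placeholder instead of deleting them would be a small variation on the same idea.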

© Stack Overflow or respective owner
