regex to filter all but whitelisted characters from a multi-language string

Posted by jeroen on Stack Overflow
Published on 2010-03-18T20:18:43Z

I am trying to clean up a string coming from a search box on a multi-language site.

Normally I would use a regex like:

$allowed = "-+?!,.;:\w\s";
$txt_search = preg_replace("/[^" . $allowed . "]?(.*?)[^" . $allowed . "]?/iu", "$1", $_GET['txt_search']);

and that works fine for English texts.

However, now I need to do the same when the texts entered can be in any language (Russian now, Chinese in the future).

How can I clean up the string while preserving "normal texts" in the original language?

I thought about switching to a blacklist (although I'd rather not...), but at the moment the regex just completely destroys all original input.

© Stack Overflow or respective owner
