php regex word boundary matching in utf-8

Posted by dontomaso on Stack Overflow See other posts from Stack Overflow or by dontomaso
Published on 2010-03-12T13:08:32Z Indexed on 2010/03/14 14:45 UTC
Read the original article Hit count: 515

Filed under:

php

|

utf-8

|

word-boundary

|

pcre

|

preg-replace

Hi, I have the following php code in a utf-8 php file:

var_dump(setlocale(LC_CTYPE, 'de_DE.utf8', 'German_Germany.utf-8', 'de_DE', 'german'));
var_dump(mb_internal_encoding());
var_dump(mb_internal_encoding('utf-8'));
var_dump(mb_internal_encoding());
var_dump(mb_regex_encoding());
var_dump(mb_regex_encoding('utf-8'));
var_dump(mb_regex_encoding());
var_dump(preg_replace('/\bweiß\b/iu', 'weiss', 'weißbier'));

I would like the last regex to replace only full words and not parts of words.

On my windows computer, it returns:

string 'German_Germany.1252' (length=19)
string 'ISO-8859-1' (length=10)
boolean true
string 'UTF-8' (length=5)
string 'EUC-JP' (length=6)
boolean true
string 'UTF-8' (length=5)
string 'weißbier' (length=9)

On the webserver (linux), I get:

string(10) "de_DE.utf8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(9) "weissbier"

Thus, the regex works as I expected on windows but not on linux.

So the main question is, how should I write my regex to only match at word boundaries?

A secondary questions is how I can let windows know that I want to use utf-8 in my php application.

© Stack Overflow or respective owner

Related posts about php

Magento, NGINX, PHP-FPM, APC, MEMCACHED, 16gb Ram CentOS, Spiking PHP-FPM to 100% CPU

as seen on Server Fault - Search for 'Server Fault'
I have been trying to resolve my issue of spiking cpu caused by php-fpm processes. I've reduced the php-fpm config settings to: pm = ondemand pm.max_children = 12 pm.start_servers = 2 pm.min_spare_servers = 2 pm.max_spare_servers = 10 pm.max_requests = 500 php_admin_value[memory_limit] = 128M Problem… >>> More
PHP Pear Installation on CentOS

as seen on Server Fault - Search for 'Server Fault'
[root@ip ~]# yum install php-pear* Reducing CentOS-5 Testing to included packages only Finished Setting up Install Process Package 1:php-pear-1.8.1-2.el5.centos.noarch already installed and latest versio … >>> More
Apache configurations for php "AddType text/html php" or "AddType application/x-httpd-php php .php"

as seen on Server Fault - Search for 'Server Fault'
I am taking over an application server and discover that it contain the following settings: AddType text/html php Although it works, but my understanding is that it should set as following: AddType application/x-httpd-php php .php What are the key differences between the two settings?… >>> More
mod_rewrite settings causes server to throw HTTP 500 errors instead of 404

as seen on Server Fault - Search for 'Server Fault'
Hello. I have a server with VBulletin forum (working under Apache 2.2, CentOS). The default settings for it in .htaccess are as follows: RewriteEngine on RewriteCond %{HTTP_HOST} ^gsmforum\.ru RewriteRule (.*) http://www.gsmforum.ru/$1 [R=301,L] # If you are having problems or are using VirtualDocumentRoot… >>> More
Problems installing Memcache (PECL extension)

as seen on Server Fault - Search for 'Server Fault'
I have installed memcached fine, and now I will need to install PECL extension memcache. Im running RedHat x86_64 es5. The installation gives me this: downloading memcache-2.2.6.tgz ... Starting to download memcache-2.2.6.tgz (35,957 bytes) ..........done: 35,957 bytes 11 source files, building running:… >>> More

Related posts about utf-8

Why can't I change the AU_AU locale to en_US?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
/bin/bash: warning: setlocale: LC_ALL: cannot change locale ( (unset)) Generating locales... en_US.ISO-8859-1... /usr/sbin/locale-gen: line 177: warning: setlocale: LC_ALL: cannot change locale ( (unset)) done Generation complete. ganesha@ubuntu:~$ sudo update_locale LANG=en_US sudo: update_locale:… >>> More
Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm working on a english only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that. I already read the question titled "std::wstring VS std::string. It was very helpful, but I still don't quite… >>> More
Reading a plist utf-8 value as utf-16

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm working on an iphone app that needs to display superscripts and subscripts. I'm using a picker to read in data from a plist but the unicode values aren't being displayed corretly in the pickerview. Subscripts and superscripts are not being recognized. I'm assuming this is due to the encoding… >>> More
Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8

as seen on Stack Overflow - Search for 'Stack Overflow'
Consider the following problem: A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed. I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1… >>> More
How can I tell if a CSV is in UTF-7 or UTF-8

as seen on Stack Overflow - Search for 'Stack Overflow'
Excel seems to save CSV files in (what I think is) UTF-7, despite the fact that most information I have read suggest that in general, you should not UTF-7. Indeed, other applications (Text pad, which lets me choose) save things in UTF-8 (or Unicode etc, but UTF-7 is not even an option). Using .NET… >>> More