How to get all captures of subgroup matches with preg_match_all()?

Posted by hakre on Stack Overflow
Published on 2011-06-16T11:41:34Z Indexed on 2012/10/07 3:39 UTC

Filed under:

preg-match-all

Update/Note:

I think what I'm actually looking for is a way to get all captures of a group in PHP.

Referenced: PCRE regular expressions using named pattern subroutines.

(Read carefully:)

I have a string that contains a variable number of segments (simplified):

$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well

I would now like to match the segments and return them via the matches array:

$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);

This only returns the last match for capture group 2: DD.
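
To make the behaviour concrete, here is a minimal runnable version of the snippet above that dumps the matches array. Nothing is added beyond the var_export() call.

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

$result = preg_match_all($pattern, $subject, $matches);

// $result is 1: the anchored pattern matches the whole subject exactly once.
// $matches[0] holds the full match. $matches[1] and $matches[2] hold only
// the LAST iteration of the repeated groups ('DD ' and 'DD').
var_export($matches);
```

The earlier iterations (AA, BB) are overwritten each time the repeated group matches again, so they never reach $matches.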

Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

This question is a generalization.

Both the $subject and the $pattern are simplified. Naturally, with input this simple, the list AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.

But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.
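
One possible workaround within the preg_ family, sketched here under assumptions: instead of repeating the group inside the pattern, let preg_match_all() do the repetition and chain the matches with \G, while a lookahead at the start of the subject validates the whole string. The helper name captureSegments() is hypothetical, and this is only one way to approach the problem, not a general answer for nested groups.

```php
<?php
// Hypothetical helper, not part of PHP: returns every segment of a subject
// such as 'AA BB DD ', or an empty array if the subject as a whole is invalid.
function captureSegments(string $subject): array
{
    // First alternative: at the start of the subject, a lookahead checks that
    // the ENTIRE string has the form (segment space)+.
    // Second alternative: \G continues exactly where the previous match ended;
    // (?!^) stops it from matching at offset 0 without the validation.
    $pattern = '/(?:^(?=(?:[a-z]+ )+$)|\G(?!^))([a-z]+) /i';

    preg_match_all($pattern, $subject, $matches);

    return $matches[1];
}

var_export(captureSegments('AA BB DD '));
```

Each segment now comes from its own match, so capture group 1 is filled once per segment instead of being overwritten.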

For a real-life case, imagine multiple (nested) levels, each with a variable number of subpattern matches.

Example

This is an example in pseudo code to describe a bit of the background. Imagine the following:

Regular definitions of tokens:

   CHARS := [a-z]+
   PUNCT := [.,!?]
   WS := [ ]

$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).

That array is then transformed into a string, containing one character per token:

   CHARS -> "c"
   PUNCT -> "p"
   WS -> "s"
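
A minimal sketch of the two steps above (tokenizing, then building the one-character-per-token string). The names TOKEN_DEFS, tokenize() and tokenString() as well as the exact shape of a token array are assumptions made for illustration.

```php
<?php
// Hypothetical token table mirroring CHARS, PUNCT and WS from above.
const TOKEN_DEFS = [
    'CHARS' => ['regex' => '[a-z]+', 'char' => 'c'],
    'PUNCT' => ['regex' => '[.,!?]', 'char' => 'p'],
    'WS'    => ['regex' => '[ ]',    'char' => 's'],
];

// Splits $subject into tokens of the form (type, offset, text).
function tokenize(string $subject): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($subject);
    while ($offset < $length) {
        foreach (TOKEN_DEFS as $type => $def) {
            // \G anchors the attempt at $offset.
            if (preg_match('/\G' . $def['regex'] . '/', $subject, $m, 0, $offset)) {
                $tokens[] = ['type' => $type, 'offset' => $offset, 'text' => $m[0]];
                $offset += strlen($m[0]);
                continue 2; // next token
            }
        }
        throw new InvalidArgumentException("No token matches at offset $offset");
    }
    return $tokens;
}

// Maps the token array to a string with one character per token.
function tokenString(array $tokens): string
{
    $string = '';
    foreach ($tokens as $token) {
        $string .= TOKEN_DEFS[$token['type']]['char'];
    }
    return $string;
}

$tokens = tokenize('hello world!');
echo tokenString($tokens), "\n"; // prints "cscp"
```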

This makes it possible to run regular expressions over tokens (rather than character classes etc.) against the token string, where each string index corresponds to one token. E.g.

   regex: (cs)*cp

to express one or more groups of chars, separated by whitespace, followed by a punctuation token.
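
Because every token is exactly one character in the token string, a match offset in that string is also an index into the token array. The following sketch uses a hard-coded, hypothetical token list for 'hello world!' and a hypothetical helper matchTokens(). The token regex is written with `*` here so that it accepts any number of leading groups, which is an assumption of this sketch.

```php
<?php
// Hypothetical token list for the subject 'hello world!'.
$tokens = [
    ['type' => 'CHARS', 'offset' => 0,  'text' => 'hello'],
    ['type' => 'WS',    'offset' => 5,  'text' => ' '],
    ['type' => 'CHARS', 'offset' => 6,  'text' => 'world'],
    ['type' => 'PUNCT', 'offset' => 11, 'text' => '!'],
];
$tokenString = 'cscp'; // one character per token

// Runs a token regex and returns the tokens covered by the overall match,
// or null if the regex does not match.
function matchTokens(string $tokenRegex, string $tokenString, array $tokens): ?array
{
    if (!preg_match($tokenRegex, $tokenString, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    [$matched, $start] = $m[0]; // matched text and its offset
    // Offset and length in the token string equal index and count in $tokens.
    return array_slice($tokens, $start, strlen($matched));
}

$hit  = matchTokens('/(cs)*cp/', $tokenString, $tokens);
$text = implode('', array_column($hit, 'text'));
echo $text, "\n"; // prints "hello world!"
```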

Now that I can express self-defined tokens as a regex, the next step was to build the grammar. This is only an example, written in a sort of ABNF style:

   words = word | (word space)+ word
   word = CHARS+
   space = WS
   punctuation = PUNCT

If I now compile the grammar for words into a (token) regex, I would naturally like to get all subgroup matches, one for each word.

  words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+)    # words resolved to tokens
  words = (c+)|((c+)s)+(c+)                       # words resolved to regex

I was able to code up to this point. Then I ran into the problem that the subgroup matches only contained their last match.
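
The problem can be reproduced at the token level with a small sketch. The token list for 'one two three' is hard-coded and hypothetical, and the words regex is adjusted for this sketch (anchored, with every word in a capturing group). The second half shows one possible way around it for this particular grammar only: take the span of the overall match and filter the covered tokens by type. This is an assumption-laden shortcut and does not generalize to nested grammars.

```php
<?php
// Hypothetical token stream for 'one two three': c s c s c
$tokens = [
    ['type' => 'CHARS', 'offset' => 0, 'text' => 'one'],
    ['type' => 'WS',    'offset' => 3, 'text' => ' '],
    ['type' => 'CHARS', 'offset' => 4, 'text' => 'two'],
    ['type' => 'WS',    'offset' => 7, 'text' => ' '],
    ['type' => 'CHARS', 'offset' => 8, 'text' => 'three'],
];
$tokenString = 'cscsc';

// Groups: 1 = single word, 2 = (word space), 3 = word inside 2, 4 = last word.
$wordsRegex = '/^(?:(c+)|((c+)s)+(c+))$/';
preg_match($wordsRegex, $tokenString, $m, PREG_OFFSET_CAPTURE);

// Group 3 is inside the repeated group, so only its last iteration survives:
// token index 2 ('two'). Token index 0 ('one') is lost.
var_export($m[3]);

// Shortcut for this grammar: every CHARS token inside the overall match is a
// word, so slice the matched span and filter by token type.
[$all, $start] = $m[0];
$span  = array_slice($tokens, $start, strlen($all));
$words = array_filter($span, function (array $token): bool {
    return $token['type'] === 'CHARS';
});
$wordTexts = array_column($words, 'text');
```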

So I have the option either to write an automaton for the grammar myself (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that effort.

That's basically all. It is probably clear now why I simplified the question.
