Split a string by comma, quote and full stop, with a few exceptions

Posted by dunc on Stack Overflow
Published on 2012-06-16T14:55:58Z Indexed on 2012/06/16 15:16 UTC


I've got a lot of text, similar to the following paragraphs, which I'd like to split into words with the punctuation removed (apostrophes, double quotes, commas, full stops, newlines, etc.), with a few exceptions.

Initially considered endemic to the Chalakudy River system in Kerala state, southern India, but now recognised to have a wider distribution in surrounding drainages including the Periyar, Manimala, and Pamba river though the Manimala data may be questionable given it seems to be the type locality of P. denisonii.

In the Achankovil River basin it occurs sympatrically, and sometimes syntopically, with P. denisonii.

Wild stocks may have dwindled by as much as 50% in the last 15 years or so with collection for the aquarium trade largely held responsible although habitats are also being degraded by pollution from agricultural and domestic sources, plus destructive fishing methods involving explosives or organic toxins.

The text refers to P. denisonii which is a species of fish. It's an abbreviation of Genus species. I would like this reference to be one word.

So, for instance, this is the kind of array I'd like to see:

Array
(
    ...
    [44] given
    [45] it
    [46] seems
    [47] to
    [48] be
    [49] the
    [50] type
    [51] locality
    [52] of
    [53] P. denisonii
    [54] In
    [55] the
    ...
)

The only things that distinguish a species reference such as P. denisonii from a sentence boundary such as end. New are:

  • The P (for Puntius, as in the example above) is only ever a single letter, and always a capital
  • The d (the first character after the full stop and space, as in P. denisonii) is always either a lower-case letter or an apostrophe (')
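As a rough sketch, those two rules can be written as a single pattern. The helper name `looksLikeSpeciesReference` is made up for illustration, and the pattern assumes exactly one space after the full stop.

```php
<?php
// Hypothetical helper (not part of any library): returns true when $text
// contains something shaped like a species reference such as "P. denisonii".
function looksLikeSpeciesReference($text)
{
    // \b[A-Z]  a single capital letter that does not follow another word character
    // \.       a literal full stop, then one space (assumed)
    // ['a-z]   the next character is an apostrophe or a lower-case letter
    return preg_match('/\b[A-Z]\. [\'a-z]/', $text) === 1;
}

var_dump(looksLikeSpeciesReference('of P. denisonii')); // bool(true)
var_dump(looksLikeSpeciesReference('end. New'));        // bool(false)
```

The `\b` matters here: without it, the last letter of an all-capitals word followed by a full stop and a lower-case word would also match.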

What regexp can I use with preg_split to give me such an array? I've tried a simple explode( " ", $array ) but it doesn't do the job at all.
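One possible approach, offered as a sketch rather than a definitive answer: let the pattern first try to match the start of a species reference and discard that match with the PCRE backtracking verbs `(*SKIP)(*F)`, so that only the remaining runs of whitespace and punctuation act as delimiters. The function name `splitKeepingSpeciesNames` and the exact delimiter set (whitespace, comma, full stop, double quote, apostrophe) are assumptions based on the list above.

```php
<?php
// Hypothetical helper, not from the original post.
function splitKeepingSpeciesNames($text)
{
    // First alternative: a single capital, ". ", then a lower-case letter or
    // apostrophe. (*SKIP)(*F) rejects this match and resumes scanning after it,
    // so the ". " inside "P. denisonii" is never treated as a delimiter.
    // Second alternative: one or more delimiter characters (assumed set).
    $pattern = '/\b[A-Z]\. [\'a-z](*SKIP)(*F)|[\s,."\']+/';

    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by leading or
    // trailing delimiters.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$text = "type locality of P. denisonii.\n\nIn the Achankovil River basin";
print_r(splitKeepingSpeciesNames($text));
// Tokens: type, locality, of, "P. denisonii", In, the, Achankovil, River, basin
```

Two caveats follow from the stated rules. A sentence that ends in a single capital letter and is followed by a lower-case word would be kept together as well. And because the apostrophe is a delimiter, contractions such as It's are split in two.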

Thanks in advance,

© Stack Overflow or respective owner
