Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one l

Posted by Alison on Stack Overflow
Published on 2010-05-07T14:10:08Z Indexed on 2010/05/07 14:28 UTC

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of companies mentioned in that week's issue, plus the page numbers they appear on.

I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in Adobe InDesign CS3 and Adobe InCopy CS3).

I have set up a list of companies I want to search for and, using delimited regular expressions in PowerGREP, I am able to find most page numbers where a company is mentioned. However, where a company name contains two or more words, my search does not pick up instances where the name is split across more than one line.

For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the text appeared like this:

DTZ beat BNP PRE, CB [line break here]

Richard Ellis and Cushman & [line break here]

Wakefield to secure the contract. [line end here]

Could someone advise me on how to write a regular expression that will ignore white space between words and ignore line endings, OR one that will look for the words including all types of white space (i.e. uneven spaces between words, spaces at the end of lines, line endings, and tabs)? I am guessing that this information is embedded somehow in PDF files.

Here is a sample of the set of terms I have asked PowerGREP to search for:

\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b 
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b

[Note that there is a delimited hard return between the \b at the end of each phrase and the beginning of the next phrase.]
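To make the request concrete, here is a minimal sketch of the kind of matching I am after, written in Python rather than PowerGREP. It assumes the page text has already been extracted from the PDF into a plain string. The function names `phrase_to_pattern` and `find_companies` are made up for illustration. The idea is to replace each space in a company name with `\s+`, which matches any run of spaces, tabs, and line breaks. Whether PowerGREP would apply such a pattern across line boundaries presumably depends on its settings, which I have not verified.

```python
import re

def phrase_to_pattern(phrase):
    """Build a regex matching the phrase with any run of whitespace between words."""
    words = phrase.split()
    # Escape each word so characters such as "&" are treated literally,
    # then allow spaces, tabs, or line breaks between consecutive words.
    return r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"

def find_companies(text, companies):
    """Return the company names found in text, tolerating line breaks inside names."""
    return [name for name in companies if re.search(phrase_to_pattern(name), text)]

# Sample text laid out like the example above, with names split over lines.
page_text = (
    "DTZ beat BNP PRE, CB \n\n"
    "Richard Ellis and Cushman & \n\n"
    "Wakefield to secure the contract."
)
companies = ["CB Richard Ellis", "Cushman & Wakefield", "Crown Estate"]

print(find_companies(page_text, companies))
```

This sketch would not catch a name that is hyphenated across a line break, and it only works if the extracted text keeps the words in reading order.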

By the way, I am a production journalist, not usually involved in finding IT-type solutions, and I am finding it difficult to get to grips with the technical language on the PowerGREP site.

Thanks for assistance

Alison

© Stack Overflow or respective owner
