What regular expression(s) would I use to remove escaped HTML from large sets of data?

Posted by Elizabeth Buckwalter on Stack Overflow
Published on 2010-04-13T17:09:57Z

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting and how much filtering was already set up (the WP-O-Matic WordPress plugin, which uses the SimplePie library). This plugin does some basic encoding before insertion using WordPress's built-in post insert function, which also does some filtering. I've figured out most of the filters before insertion, but now I have whacko data that I need to remove.

Here is an example of the whacko data. One field has the content I want at the front, followed by this part at the end, which I want removed:

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>

&lt;img src=&quot;http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk&quot;

Notice how some of the images are escaped and some aren't. I believe this is because the last part was cut off, which made it unrecognizable as an HTML tag and caused it to be HTML-encoded.
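One possible way to handle this field, as a minimal sketch in Python: cut everything from the first feedburner tracking image to the end of the field, whether that image tag is raw (`<img`) or escaped (`&lt;img`), complete or truncated. The function name `strip_feedburner_tail` is hypothetical, and the sketch assumes nothing worth keeping ever follows the first tracking image.

```python
import re

# Matches from the first feedburner tracking image (raw or HTML-escaped)
# through the end of the field. Assumption: nothing wanted follows it.
FEEDBURNER_TAIL = re.compile(
    r'\s*(?:<|&lt;)img\s+src=(?:"|&quot;)http://feeds\.feedburner\.com/~ff/.*\Z',
    re.DOTALL,
)


def strip_feedburner_tail(text: str) -> str:
    """Return text with the trailing run of feedburner images removed."""
    return FEEDBURNER_TAIL.sub("", text)
```

Because the pattern only requires the opening of the tag, it does not matter whether the last image was cut off before its closing quote or bracket. Images from other hosts are left untouched.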

Another field contains only the following. This is now filtered before insertion, but I still have to get rid of the ones already in the database:

&lt;img src=&quot;http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg&quot; alt=&quot;post_img&quot; width=&quot;80&quot;

(all examples are on one line, but broken up for readability)
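For fragments like the one above, a more general sketch (again in Python, with the hypothetical name `strip_truncated_escaped_img`) would remove an escaped `&lt;img` that runs to the end of the field without ever reaching a closing `&gt;`. It assumes that a properly closed escaped tag is intentional content and should stay.

```python
import re

# An escaped <img> opening with no escaped closing bracket (&gt;) before
# the end of the field, i.e. a tag that was cut off mid-way.
TRUNCATED_ESCAPED_IMG = re.compile(r"&lt;img\b(?:(?!&gt;).)*\Z", re.DOTALL)


def strip_truncated_escaped_img(text: str) -> str:
    """Remove a trailing, never-closed escaped <img> fragment, if present."""
    return TRUNCATED_ESCAPED_IMG.sub("", text)
```

The `(?:(?!&gt;).)*` part consumes characters one at a time as long as the next characters are not `&gt;`, so the match can only succeed when the fragment really does run to the end of the string.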

Question: What is the best way to work with the above escaped HTML (or a portion of an HTML tag)?

I can do it in Perl, PHP, SQL, Ruby, or even Python. I believe Perl to be the best at text parsing, which is why I used the Perl tag. PHP times out on large database operations, so that's pretty much out unless I want to do batch processing and whatnot.
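To illustrate the batch-processing idea, here is a rough sketch that walks a table in id-ordered chunks and rewrites only the rows that change. Everything specific in it is an assumption for illustration: the table name `posts`, the columns `id` and `content`, and the use of Python's built-in `sqlite3` module as a stand-in for the real database driver.

```python
import re
import sqlite3

# Same idea as a truncated-tag cleanup: an escaped <img> that is never closed.
TRUNCATED_ESCAPED_IMG = re.compile(r"&lt;img\b(?:(?!&gt;).)*\Z", re.DOTALL)


def clean_in_batches(conn: sqlite3.Connection, batch_size: int = 500) -> int:
    """Clean posts.content in id-ordered batches; return the number of rows changed.

    Table and column names are hypothetical placeholders.
    """
    last_id = 0
    changed = 0
    while True:
        rows = conn.execute(
            "SELECT id, content FROM posts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return changed
        updates = []
        for row_id, content in rows:
            if content is None:
                continue  # nothing to clean in an empty field
            cleaned = TRUNCATED_ESCAPED_IMG.sub("", content)
            if cleaned != content:
                updates.append((cleaned, row_id))
        conn.executemany("UPDATE posts SET content = ? WHERE id = ?", updates)
        conn.commit()  # commit per batch so each transaction stays short
        changed += len(updates)
        last_id = rows[-1][0]
```

Paging by `id > last_id` rather than by `OFFSET` keeps each query cheap on large tables, and committing per batch means an interrupted run can simply be restarted.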

PS: One of the nice things about using WordPress's insert post function is that if you use PHP's strip_tags function to strip out all HTML, the insert post function will insert <p> at the paragraph points.

Let me know if there's anything more that I can answer.

Some articles that didn't quite answer my question: (http://stackoverflow.com/questions/2016751/remove-text-from-within-a-database-text-field) (http://stackoverflow.com/questions/462831/regular-expression-to-escape-html-ampersands-while-respecting-cdata)

© Stack Overflow or respective owner
