Using Regex, how can I remove certain characters from inside angle-brackets, leaving the characters

Posted by Iain Fraser on Stack Overflow
Published on 2010-05-12T08:01:46Z

Edit: To be clear, I am not using regex to parse the HTML; that's crazy talk! I just want to clean up a messy string of HTML so that it will parse.

Edit #2: I should also point out that the control character I'm using is a special Unicode character. It would never appear in a proper tag under normal circumstances.

Suppose I have a string of HTML that contains a bunch of control characters, and I want to remove the control characters from inside tags only, leaving the characters outside the tags alone.

For example

Here the control character is the numeral "1".

Input

The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog

Desired Output

The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog

So far I can match the tags that contain the control character, but I can't remove the control characters from them in a single regex. I guess I could run a second regex over my matches, but I'd really like to know if there's a better way.
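For reference, the two-step idea mentioned above (match each tag, then clean the matched text) might look roughly like this in Python. This is only a sketch: the function name `strip_inside_tags` is made up for illustration, the numeral "1" stands in for the real control character, and it assumes every tag is a `<...>` span with no `>` inside attribute values.

```python
import re

def strip_inside_tags(html, control):
    """Remove `control` from inside <...> spans only (hypothetical helper)."""
    # Match anything that looks like a tag: "<", any run of non-">" characters, ">".
    tag = re.compile(r"<[^>]*>")
    # The replacement callback receives each matched tag and returns it
    # with the control character stripped; text outside tags is untouched.
    return tag.sub(lambda m: m.group(0).replace(control, ""), html)

messy = ("The quick 1<strong>orange</strong> lemming "
         "<sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog")
print(strip_inside_tags(messy, "1"))
```

Because the callback runs once per matched tag, this is still a single pass over the string, even though it is not a single regex.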

My regex

Bear in mind this one only matches tags which contain the control character.

<(([^>])*?`([^>])*?)*?>
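One possible single-pattern alternative is to match each control character individually and use a lookahead to check that it sits inside a tag. The sketch below is in Python; the name `strip_with_lookahead` is hypothetical, "1" again stands in for the control character, and it assumes no stray `<` or `>` appears outside of tags (a literal `>` in the text would cause control characters before it to be removed as well).

```python
import re

def strip_with_lookahead(html, control):
    """Remove `control` wherever the next angle bracket after it is '>'."""
    # A control character is treated as "inside a tag" when a ">" follows it
    # with no "<" or ">" in between. re.escape keeps the character literal
    # even if it happens to be a regex metacharacter.
    pattern = re.compile(re.escape(control) + r"(?=[^<>]*>)")
    return pattern.sub("", html)

messy = ("The quick 1<strong>orange</strong> lemming "
         "<sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog")
print(strip_with_lookahead(messy, "1"))
```

The "1" before `<strong>` is kept because the next bracket after it is `<`, and the trailing "1the idle 1frog" is kept because no `>` follows it at all.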

Thanks very much for your time and consideration.

Iain Fraser

© Stack Overflow or respective owner
