RegEx to ignore / skip everything in html tags

Posted by Scott Sumpter on Stack Overflow
Published on 2010-04-16T16:09:10Z Indexed on 2010/04/16 16:13 UTC

I'm looking for a way to combine two regular expressions: one to catch the URLs and the other to ensure it skips text within HTML tags. See the sample text below the functions.

I need to pass in a block of news text and format it by wrapping URLs and email addresses in HTML anchor tags so users don't have to. The code below works well until the text already contains HTML tags; in that case it wraps the existing links again and doubles the tags.

There are plenty of examples that strip HTML, but I just want to ignore it, since any URL inside a tag is already linkified. Also, if there is an easier way to accomplish this, with or without regex, please let me know. None of my attempts to combine the regexes have worked.

I'm coding in ASP.NET VB but will take any workable example/direction.

Thanks!

===== Functions =============

Public Shared Function InsertHyperlinks(ByVal inText As String) As String
    Dim strBuf As String = ""
    'Mid() is 1-based while Match.Index is 0-based, hence the +1 offsets below
    Dim iStart As Integer = 1
    Dim iEnd As Integer = 1

    'RegEx to find urls and email addresses
    Dim strRegUrlEmail As String = "\b(www|http|\S+@)\S+\b"
    Dim objRegExp As New Regex(strRegUrlEmail, RegexOptions.IgnoreCase)
    'Match URLs and emails
    Dim MatchList As MatchCollection = objRegExp.Matches(inText)
    If MatchList.Count <> 0 Then

        For Each Match In MatchList
            iEnd = Match.Index
            strBuf = strBuf & Mid(inText, iStart, iEnd - iStart + 1)
            If InStr(1, Match.Value, "@") Then
                strBuf = strBuf & HrefGet(Match.Value, "EMAIL", "_BLANK")
            Else
                strBuf = strBuf & HrefGet(Match.Value, "WEB", "_BLANK")
            End If
            iStart = iEnd + Match.Length + 1
        Next
        strBuf = strBuf & Mid(inText, iStart)
        InsertHyperlinks = strBuf
    Else
        'No hyperlinks to replace
        InsertHyperlinks = inText
    End If

End Function

Shared Function HrefGet(ByVal url As String, ByVal urlType As String, ByVal Target As String) As String
    Dim strBuf As String
    strBuf = "<a href="""
    If UCase(urlType) = "WEB" Then
        If LCase(Left(url, 3)) = "www" Then
            strBuf = "<a href=""http://" & url & """ Target=""" & _
                     Target & """>" & url & "</a>"
        Else
            strBuf = "<a href=""" & url & """ Target=""" & _
                    Target & """>" & url & "</a>"
        End If
    ElseIf UCase(urlType) = "EMAIL" Then
        strBuf = "<a href=""mailto:" & url & """ Target=""" & _
                 Target & """>" & url & "</a>"
    End If
    HrefGet = strBuf
End Function
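
The index bookkeeping with Mid() in InsertHyperlinks can probably be avoided by letting Regex.Replace call a MatchEvaluator for each match. The following is only a sketch under that assumption: it reuses the pattern from the question, and the names LinkifySketch, InsertHyperlinksSimple and WrapMatch are hypothetical, not part of the original code. It does not yet skip existing HTML tags.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module LinkifySketch
    ' Same pattern as in the question.
    Private ReadOnly UrlEmail As New Regex("\b(www|http|\S+@)\S+\b", RegexOptions.IgnoreCase)

    ' Hypothetical replacement for InsertHyperlinks.
    Public Function InsertHyperlinksSimple(ByVal inText As String) As String
        ' Regex.Replace calls WrapMatch once per match and copies the
        ' unmatched text through unchanged, so no Mid() arithmetic is needed.
        Return UrlEmail.Replace(inText, AddressOf WrapMatch)
    End Function

    ' Builds the anchor tag for one match, mirroring the branches of HrefGet.
    Private Function WrapMatch(ByVal m As Match) As String
        Dim href As String
        If m.Value.Contains("@") Then
            href = "mailto:" & m.Value
        ElseIf m.Value.StartsWith("www", StringComparison.OrdinalIgnoreCase) Then
            href = "http://" & m.Value
        Else
            href = m.Value
        End If
        Return "<a href=""" & href & """ Target=""_BLANK"">" & m.Value & "</a>"
    End Function
End Module
```

If no match is found, Regex.Replace returns the input unchanged, so the separate "no hyperlinks" branch is not needed in this version.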

===== Sample Text =============

This would be the inText parameter.

Midway through the ride, we see a Skip this too. But sometimes we go here [insert normal www dot link dot com]. If you'd like to join us contact Bill Smith at [email protected]. Thanks!

Sorry, Stack Overflow won't allow multiple hyperlinks to be added.

===== End Sample Text =============
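
One possible way to combine the two expressions is a single alternation that matches existing markup first and URLs second, then returns the markup matches unchanged. This is a hedged sketch, not a full HTML parser: the module name, function names, group names (skip, link) and the demo input are all assumptions made for illustration, and the URL part is tightened from \S to [^\s<>] so a match cannot run into an adjacent tag.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module LinkifySkipTagsSketch
    ' Alternative 1: an existing <a ...>...</a> element, or any other single tag.
    Private Const SkipPart As String = "(?<skip><a\b[^>]*>.*?</a>|<[^>]+>)"
    ' Alternative 2: the URL/email pattern from the question, with \S narrowed
    ' to [^\s<>] (an assumption) so it stops at tag delimiters.
    Private Const LinkPart As String = "(?<link>\b(?:www|http|[^\s<>]+@)[^\s<>]+\b)"

    Private ReadOnly Combined As New Regex(SkipPart & "|" & LinkPart, _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Hypothetical variant of InsertHyperlinks that leaves existing markup alone.
    Public Function InsertHyperlinksSkippingTags(ByVal inText As String) As String
        Return Combined.Replace(inText, AddressOf Evaluate)
    End Function

    Private Function Evaluate(ByVal m As Match) As String
        ' Existing markup was consumed by the "skip" group: emit it untouched.
        If m.Groups("skip").Success Then
            Return m.Value
        End If

        ' Otherwise wrap the bare URL or email address.
        Dim href As String
        If m.Value.Contains("@") Then
            href = "mailto:" & m.Value
        ElseIf m.Value.StartsWith("www", StringComparison.OrdinalIgnoreCase) Then
            href = "http://" & m.Value
        Else
            href = m.Value
        End If
        Return "<a href=""" & href & """ Target=""_BLANK"">" & m.Value & "</a>"
    End Function

    Public Sub Main()
        ' Made-up input: one link that is already wrapped, one bare URL, one email.
        Dim sample As String = "Visit <a href=""http://a.example"">our site</a> " & _
                               "or www.link.example. Mail bill@example.com."
        Console.WriteLine(InsertHyperlinksSkippingTags(sample))
    End Sub
End Module
```

Because the regex engine tries the alternatives in order at each position, an existing anchor is consumed whole before the URL pattern gets a chance to see the address inside it. Malformed or nested markup can still defeat this; for arbitrary HTML a real parser would likely be the safer choice.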

© Stack Overflow or respective owner
