Hyperlink regex including http(s):// not working in C#

Posted by Rory Fitzpatrick on Stack Overflow
Published on 2010-03-12T15:42:09Z Indexed on 2010/03/12 15:57 UTC


I think this is sufficiently different from similar questions to warrant a new one.

I have the following regex to match opening hyperlink tags in HTML. It includes the http(s):// part so that mailto: links are excluded:

<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>

When I run this through Nregex (with escaping removed), it behaves correctly for the following test cases, matching the first three and rejecting the mailto: link:

<a href="http://www.bbc.co.uk">

<a href="http://bbc.co.uk">

<a href="https://www.bbc.co.uk">

<a href="mailto:[email protected]">

However, when I run this in my C# code it fails. Here is the matching code:

public static IEnumerable<string> GetUrls(this string input, string matchPattern)
{
    var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match match in matches)
    {
        yield return match.Groups["href"].Value;
    }
}

And my tests:

@"<a href=""https://www.bbc.co.uk"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(1);

@"<a href=""mailto:[email protected]"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(0);

The problem seems to be in the \\b(https?):// part that I added. Removing it makes the normal URL test pass, but then the mailto: test fails.

Can anyone shed any light?
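
One possible explanation, sketched here under an assumption the post does not confirm: the doubled quotes ("") suggest the pattern lives in a verbatim string (@"..."). In a verbatim string backslashes are not escape characters, so \\b reaches the regex engine as an escaped backslash followed by the letter b, not as the word boundary \b. The class name HrefPatternDemo, the two constant names, and the example.com address are hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; not the StringExtensions class from the question.
public static class HrefPatternDemo
{
    // Verbatim string: "" is an escaped quote and \b is passed to the
    // regex engine unchanged, where it means "word boundary".
    public const string WordBoundaryPattern =
        @"<a[^>]*?href=[""'](?<href>\b(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Same pattern with \\b. Inside a verbatim string the regex engine
    // sees \\b, which matches a literal backslash followed by 'b'.
    public const string LiteralBackslashPattern =
        @"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Non-extension variant of GetUrls that returns the captured href values.
    public static List<string> GetUrls(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups["href"].Value)
            .ToList();
    }
}
```

With these definitions, GetUrls on @"<a href=""https://www.bbc.co.uk"">bbc</a>" should return one URL for WordBoundaryPattern and none for LiteralBackslashPattern, while a mailto: link such as @"<a href=""mailto:someone@example.com"">bbc</a>" should return none for either. If the pattern is instead stored in a regular (non-verbatim) string, \\b is the correct way to write a word boundary and the cause lies elsewhere.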

© Stack Overflow or respective owner
