Parsing HTML for domain links

Posted by Hallik on Stack Overflow on 2010-05-07

I have a script that parses an HTML page for all the links within it. I am getting all of the links fine, but I want to compare them against a list of domains. A sample list contains

domains = ['www.domain.com', 'sub.domain.com']

But the links I extract may look like this:

http://domain.com
http://sub.domain.com/some/other/page

I can strip off the http:// just fine, but both of the example links above should match: the first against www.domain.com in the list, and the second against sub.domain.com.

Right now I am using urllib2 for parsing the HTML. What are my options for completing this task?
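For reference, here is a minimal sketch of one way to do the comparison (this assumes Python 3's urllib.parse; the helper names normalize and matches are illustrative, not from the original script): extract each link's hostname, strip a leading "www.", and test membership in the similarly normalized domain list.

from urllib.parse import urlparse

domains = ['www.domain.com', 'sub.domain.com']

def normalize(host):
    # Treat "www.domain.com" and "domain.com" as the same host.
    return host[4:] if host.startswith('www.') else host

# Normalized set of allowed hosts for O(1) lookups.
allowed = {normalize(d) for d in domains}

def matches(link):
    # urlparse handles the scheme, so there is no need to strip
    # "http://" by hand; .hostname also drops any port number.
    host = urlparse(link).hostname or ''
    return normalize(host) in allowed

print(matches('http://domain.com'))                      # True
print(matches('http://sub.domain.com/some/other/page'))  # True
print(matches('http://other.com/page'))                  # False

Note that this does exact-host matching after the www. normalization; if deeper subdomains such as a.sub.domain.com should also count as domain.com, you would compare dot-separated suffixes instead of testing set membership.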
