Extract anything that looks like links from a large amount of data in Python

Posted by Riz on Stack Overflow, 2010-04-18

Hi, I have around 5 GB of HTML data that I want to process to find links to a set of websites and then perform some additional filtering. Right now I use a simple regexp for each site and iterate over them, searching for matches. In my case the links can appear outside of "a" tags and be malformed in many ways (such as a "\n" in the middle of a link), so I try to grab as many "links" as I can and check them later in other scripts (so no BeautifulSoup/lxml/etc.). The problem is that my script is pretty slow, so I am thinking about ways to speed it up. I am writing a set of tests to compare different approaches, but I hope to get some advice :)
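
One cheap win worth sketching: every pass has to read the whole 5 GB, so the per-site patterns can be merged into a single alternation and the data scanned once rather than once per site. A minimal sketch, assuming a placeholder site list and a deliberately permissive URL pattern (the real domains and the newline handling would need tuning):

    import re

    # Placeholder domains -- stand-ins for the actual set of target sites.
    SITES = ["example.com", "example.org"]

    # One combined, permissive pattern: the data is scanned in a single pass.
    # [^\s"\'<>]* grabs anything that is not whitespace or an obvious HTML
    # delimiter; stricter validation is left to the later filtering scripts.
    combined = re.compile(
        r'https?://(?:www\.)?(?:' + "|".join(map(re.escape, SITES)) + r')[^\s"\'<>]*',
        re.IGNORECASE,
    )

    def find_links(text):
        # Drop newlines first so a stray "\n" in the middle of a link does
        # not split the match.
        return combined.findall(text.replace("\n", ""))

re.findall with a compiled pattern keeps all the looping in C, which is usually much faster than driving several patterns from Python one site at a time.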

Right now I am thinking about first extracting all links without filtering (maybe using a C module or a standalone app that does a plain substring search for the start and end of every link instead of a regexp) and then using regexps to match the ones I need.
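
That two-stage idea can be prototyped in pure Python before reaching for a C module, since str.find is already a fast plain substring search: locate every "http" occurrence first, then run the slower per-site regexps only on a short slice around each hit. A minimal sketch, where MAX_LINK_LEN and the example.com pattern are assumptions:

    import re

    MAX_LINK_LEN = 2048  # assumed upper bound on a link's length

    # Placeholder per-site pattern; the real script would have one per site.
    site_re = re.compile(r'https?://(?:www\.)?example\.com[^\s"\'<>]*',
                         re.IGNORECASE)

    def candidate_slices(data):
        # Fast first pass: plain substring search, no regexp involved.
        pos = data.find("http")
        while pos != -1:
            yield data[pos:pos + MAX_LINK_LEN]
            pos = data.find("http", pos + 4)

    def matching_links(data):
        # Slow second pass: regexps only ever see short candidate strings.
        for cand in candidate_slices(data):
            m = site_re.match(cand.replace("\n", ""))
            if m:
                yield m.group(0)

If the 5 GB does not fit in memory, the same scan works over an mmap-ed file or over overlapping chunks read from disk (the overlap must be at least MAX_LINK_LEN so no candidate is cut in half).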
