How can I efficiently group a large list of URLs by their host name in Perl?

Posted by jesper on Stack Overflow
Published on 2010-04-06T20:54:07Z


I have a text file that contains over one million URLs. I need to process this file and group the URLs by host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the file itself is about 300 MB). Can anyone suggest a more memory-efficient approach?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut off irrelevant things):

use strict;
use warnings;
use Storable qw(store);

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Extract the scheme + host; skip lines that don't match
    next unless $line =~ m{^(http://[^/]+)}i;
    push @{ $urls{$1} }, $line;
}

store \%urls, 'out.hash';
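One idea for shaving memory (a sketch, not measured against the real data set): every stored URL repeats its host prefix, so storing only the path portion under each host key avoids keeping the host once per URL. The helper name `group_by_host` below is hypothetical, and it assumes plain `http://` URLs as in the question:

```perl
use strict;
use warnings;

# Group URLs by host, keeping only the path part under each host key.
# The full URL can be reassembled as $host . $path when needed.
sub group_by_host {
    my @urls = @_;
    my %paths;
    for my $url (@urls) {
        next unless $url =~ m{^(http://[^/]+)(.*)}i;
        # $1 = scheme + host, $2 = path (possibly empty)
        push @{ $paths{$1} }, $2;
    }
    return \%paths;
}

my $groups = group_by_host(
    'http://www.ex1.com/a',
    'http://www.ex1.com/b',
    'http://www.ex2.com/c',
);
```

After this, `$groups->{'http://www.ex1.com'}` holds `['/a', '/b']` rather than two full URLs, so the host string is stored once per group instead of once per line.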

© Stack Overflow or respective owner
