How can I efficiently group a large list of URLs by their host name in Perl?

Posted by jesper on Stack Overflow
Published on 2010-04-06T20:54:07Z


I have a text file that contains over one million URLs. I need to process this file and group the URLs by host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the file itself is about 300 MB). Can anyone suggest a more memory-efficient approach?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut off irrelevant things):

use strict;
use warnings;
use Storable qw(store);

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Extract the scheme + host; skip lines that don't match
    next unless $line =~ m{^(http://[^/]+)}i;
    push @{ $urls{$1} }, $line;
}

store \%urls, 'out.hash';
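One idea for shaving memory (a sketch, not measured against the real data set): every stored URL repeats its host prefix, so storing only the path portion under each host key avoids keeping the host once per URL. The helper name `group_by_host` below is hypothetical, and it assumes plain `http://` URLs as in the question:

```perl
use strict;
use warnings;

# Group URLs by host, keeping only the path part under each host key.
# The full URL can be reassembled as $host . $path when needed.
sub group_by_host {
    my @urls = @_;
    my %paths;
    for my $url (@urls) {
        next unless $url =~ m{^(http://[^/]+)(.*)}i;
        # $1 = scheme + host, $2 = path (possibly empty)
        push @{ $paths{$1} }, $2;
    }
    return \%paths;
}

my $groups = group_by_host(
    'http://www.ex1.com/a',
    'http://www.ex1.com/b',
    'http://www.ex2.com/c',
);
```

After this, `$groups->{'http://www.ex1.com'}` holds `['/a', '/b']` rather than two full URLs, so the host string is stored once per group instead of once per line.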

© Stack Overflow or respective owner
