Improving the performance of grepping over a huge file

Posted by rogerio_marcio on Programmers
I have FILE_A, which has over 300K lines, and FILE_B, which has over 30M lines. I created a bash script that greps for each line of FILE_A in FILE_B and writes the result of each grep to a new file.

The whole process takes more than 5 hours.

I'm looking for suggestions on how to improve the performance of my script.

I'm using grep -F -m 1 as the grep command. FILE_A looks like this:

123456789 
123455321

and FILE_B looks like this:

123456789,123456789,730025400149993,
123455321,123455321,730025400126097,

So, with bash, I have a while loop that picks the next line from FILE_A and greps for it in FILE_B. When the pattern is found in FILE_B, I write the matching line to result.txt.

# for each key from the 300K-line file, print the first matching line from the 30M-line file
while read -r line; do
   grep -F -m1 "$line" 30MFile
done < 300KFile > result.txt
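
For comparison, I've been wondering whether a single pass that feeds all of FILE_A to one grep invocation as fixed-string patterns would be faster, since the 30M-line file would only be scanned once. A minimal sketch (this drops the per-pattern -m 1 behaviour, and I haven't benchmarked it):

# read all 300K keys as fixed-string patterns and scan the 30M-line file once
grep -F -f 300KFile 30MFile > result.txt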

Thanks a lot in advance for your help.
