Word frequency tally script is too slow

Posted by Dave Jarvis on Stack Overflow
Published on 2011-01-07T15:49:22Z Indexed on 2011/01/07 15:53 UTC


Background

I created a script that counts the frequency of words in a plain text file. The script performs the following steps:

  1. Count the frequency of words from a corpus.
  2. Retain each word in the corpus found in a dictionary.
  3. Create a comma-separated file of the frequencies.

The script is at: http://pastebin.com/VAZdeKXs
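For context, step 1 can be done with a classic shell pipeline. This is a sketch, not the pastebin script itself; `corpus.txt` is an assumed input file name:

```shell
#!/bin/sh
# Sketch of step 1: tokenize the corpus and count word frequencies.
# corpus.txt is an assumed input file name.
tr -cs '[:alpha:]' '\n' < corpus.txt |        # split on runs of non-letters
  tr '[:upper:]' '[:lower:]' |                # normalize case
  sort | uniq -c | sort -rn > frequency.txt   # count, most frequent first
```

This yields lines of the form `count word`, which is why the script below reads the word from field `$2` of frequency.txt.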

Problem

The following lines continually cycle through the dictionary to match words:

for i in $(awk '{ if ($2) print $2 }' frequency.txt); do
  grep -m 1 "^$i\$" dictionary.txt >> corpus-lexicon.txt
done

It works, but it is slow: for every word found in the corpus, it scans the dictionary from the start to decide whether the word should be kept. (The -m 1 option merely stops each scan at the first match.)

Question

How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.
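One common optimization (not from the original post) is to hand grep the whole word list at once with `-f`, so the dictionary is scanned exactly once instead of once per word. A sketch, using the file names from the script above:

```shell
#!/bin/sh
# Sketch: filter corpus words through the dictionary in a single pass.
# Assumes frequency.txt has the count in field 1 and the word in field 2.

# Extract the word list once.
awk '$2 { print $2 }' frequency.txt > words.txt

# -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
# grep loads words.txt into memory and scans dictionary.txt exactly once.
grep -Fxf words.txt dictionary.txt > corpus-lexicon.txt
```

An equivalent single-pass awk version, which builds a hash of the corpus words and then tests each dictionary line against it, is `awk 'NR==FNR { w[$2]; next } $0 in w' frequency.txt dictionary.txt > corpus-lexicon.txt`.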

Thank you!

© Stack Overflow or respective owner
