Create a term-document matrix from files

Posted by Joe on Super User
Published on 2012-12-01T14:45:27Z Indexed on 2013/10/31 16:01 UTC


I have a set of files from example001.txt to example100.txt. Each file contains a list of keywords from a superset (the superset is available if we want it).

So example001.txt might contain:

apple
banana
...
otherfruit

I'd like to process these files and produce something like a matrix: the example file names along the top row, the fruit down the side, and a '1' in a cell if the fruit appears in that file (a '0' otherwise).

An example might be...

x           example1    example2    example3
Apple          1           1           0
Banana         0           1           0
Coconut        0           1           1

Any idea how I might build some sort of command-line magic to put this together? I'm on OS X and happy with Perl or Python.
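
For illustration, here is a minimal Python sketch of one way this could be done. It assumes each file holds one keyword per line and that the files sit in the current directory. The function names (read_keywords, build_matrix, format_matrix) are made up for this example. The rows come from the union of keywords actually found in the files rather than from the superset, matching is case-sensitive, and the output is tab-separated so it can be opened in a spreadsheet.

```python
import glob
import os


def read_keywords(path):
    """Return the set of non-empty, whitespace-stripped lines in one file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def build_matrix(paths):
    """Return (column names, sorted terms, rows of 0/1 flags).

    Column names are the file names without their extension; each row
    corresponds to one term and holds a 1 where that file lists the term.
    """
    names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
    keyword_sets = [read_keywords(p) for p in paths]
    # Assumption: rows are the union of keywords seen in the files,
    # not the separate superset list mentioned in the question.
    terms = sorted(set().union(*keyword_sets))
    rows = [[int(term in kws) for kws in keyword_sets] for term in terms]
    return names, terms, rows


def format_matrix(names, terms, rows):
    """Render the matrix as tab-separated text with a header row."""
    lines = ["\t".join(["x"] + names)]
    for term, row in zip(terms, rows):
        lines.append("\t".join([term] + [str(flag) for flag in row]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumption: the files match example*.txt in the current directory.
    paths = sorted(glob.glob("example*.txt"))
    print(format_matrix(*build_matrix(paths)))
```

Saved under a name of your choosing (say matrix.py, a hypothetical name) and run from the directory containing the files, the output can be redirected to a file, for example `python matrix.py > matrix.tsv`. If rows should follow the superset instead, the `terms` list could be read from that file rather than computed from the union.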

© Super User or respective owner
