large amount of data in many text files - how to process?

Posted by Stephen on Stack Overflow
Published on 2010-05-30T05:06:28Z Indexed on 2010/05/30 5:12 UTC

Filed under: python | sql | r | large-files | large-data-volumes

Hi,

I have a large and growing amount of data (a few terabytes). It is stored in many tab-delimited flat text files (each about 30 MB). Most of the task involves reading the data and aggregating (summing/averaging, plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit large. Some candidate solutions are to:

1) write the whole thing in C (or Fortran)
2) import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
3) write the whole thing in Python

Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations or suggestions?

Thanks
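To make option (3) concrete, here is a minimal sketch of what a pure-Python pass over the files might look like. It streams each tab-delimited file row by row, so memory use stays flat regardless of total data size, keeps a running sum and count per group for rows that pass a predicate, and writes the result to SQLite. Everything specific here is an assumption for illustration and not taken from the post: the files are assumed to have a header row, and the column names `group` and `value`, the `data/*.tsv` path, the `summary` table, and the non-negativity predicate are all hypothetical.

```python
import csv
import glob
import sqlite3
from collections import defaultdict


def aggregate_files(paths, key_col, value_col, predicate):
    """Stream tab-delimited files row by row; return {key: (sum, count)}.

    Assumes every file starts with a header row naming its columns.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                if predicate(row):
                    acc = totals[row[key_col]]
                    acc[0] += float(row[value_col])
                    acc[1] += 1
    return {key: (s, n) for key, (s, n) in totals.items()}


def save_means(totals, db_path):
    """Write per-group sum, count, and mean to a SQLite table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS summary "
                "(grp TEXT PRIMARY KEY, total REAL, n INTEGER, mean REAL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO summary VALUES (?, ?, ?, ?)",
                [(k, s, n, s / n) for k, (s, n) in totals.items()],
            )
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical layout: data/*.tsv with columns "group" and "value".
    paths = sorted(glob.glob("data/*.tsv"))
    totals = aggregate_files(
        paths, "group", "value", lambda row: float(row["value"]) >= 0
    )
    save_means(totals, "summary.sqlite")
```

Because each file is independent and only sums and counts are carried between files, the same function could be run on disjoint subsets of the files in separate processes and the partial results added together afterwards. Whether that helps depends on whether disk I/O or parsing turns out to be the bottleneck.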

© Stack Overflow or respective owner
