What changes when your input is gigabyte- or terabyte-sized?

Posted by Wang on Stack Overflow, 2010-06-10.

I just took my first baby step into real scientific computing today when I was shown a data set whose smallest file is 48,000 fields by 1,600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.

I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
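
For what it's worth, here is a minimal sketch of the chunked-access pattern those libraries encourage, using the PyTables 3.x API. The file name "haplotypes.h5", the /haplotypes node, and the chunk size are hypothetical placeholders, not anything from the original question:

    import tables  # PyTables

    # "haplotypes.h5" and the /haplotypes array node are invented names.
    with tables.open_file("haplotypes.h5", mode="r") as h5:
        haps = h5.root.haplotypes        # a node handle; no data read yet
        n_rows = haps.shape[0]
        chunk = 1000                     # rows per read; tune to your RAM
        total = 0
        for start in range(0, n_rows, chunk):
            block = haps[start:start + chunk]  # only this slice is loaded
            total += block.sum()               # example per-chunk work

The point is that indexing the node pulls only the requested slice off disk, so the working set stays at chunk size no matter how large the file is.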

For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
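
One standard escape hatch is memory-mapping: the OS pages data in on demand instead of reading the file up front. Here is a rough sketch with numpy.memmap, where the file name, dtype, and shape are invented for illustration; note that on a 32-bit build the mapping itself still has to fit in the roughly 2-4 GB of address space, which is exactly the limit described above:

    import numpy as np

    # "genotypes.dat", the dtype, and the shape are hypothetical.
    data = np.memmap("genotypes.dat", dtype=np.uint8,
                     mode="r", shape=(1600, 48000))
    row = data[42]                 # touches only the pages backing row 42
    col_means = data.mean(axis=0)  # reads every page, but the OS can
                                   # evict pages as it goes, so RAM use
                                   # stays bounded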

What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python-specific.)
