Optimising speeds in HDF5 using Pytables

Posted by Sree Aurovindh on Stack Overflow
Published on 2012-03-19

The problem concerns the write speed of the computers (ten 32-bit machines) and PostgreSQL query performance. I will explain the scenario in detail.

I have about 80 GB of data (with appropriate database indexes in place). I am trying to read it from a PostgreSQL database and write it into HDF5 using PyTables. I have one table and five variable arrays in each HDF5 file. The HDF5 implementation is not multithreaded and is not enabled for symmetric multiprocessing. I have rented about 10 computers for a day and am trying to split the writing across them in order to speed up my data handling.
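
For context, here is a minimal sketch of the write path I am describing on each machine. This is not my actual code: the connection string, file name and the some_col / value columns are placeholders, and only x.train, x.qt, tr_id, q_t and q_id come from the queries shown further down.

    import numpy as np
    import psycopg2
    import tables

    # One row of the main table; the real column layout is wider than this.
    class Train(tables.IsDescription):
        tr_id = tables.Int32Col()      # primary key from x.train
        value = tables.Float64Col()    # stand-in for the remaining columns

    conn = psycopg2.connect("dbname=mydb user=me")        # placeholder DSN
    cur = conn.cursor()

    h5 = tables.open_file("part_01.h5", mode="w")         # placeholder file name
    train_tbl = h5.create_table("/", "train", Train)
    qt_arr = h5.create_earray("/", "qt", tables.Float64Atom(), shape=(0,))

    # One indexed lookup on the main table, appended as a table row.
    cur.execute("SELECT tr_id, some_col FROM x.train WHERE tr_id = %s", (1,))
    tr_id, value = cur.fetchone()
    row = train_tbl.row
    row["tr_id"], row["value"] = tr_id, value
    row.append()

    # One indexed lookup on a referring table, appended to an array.
    cur.execute("SELECT q_t FROM x.qt WHERE q_id = %s", (2,))
    qt_arr.append(np.array([cur.fetchone()[0]], dtype=np.float64))

    train_tbl.flush()
    h5.close()
    conn.close()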

As far as the PostgreSQL tables are concerned, the overall record count is 140 million, and I have five tables referring to the main table through primary/foreign keys. I am not using joins, as they are not scalable here.

So for a single record I do six lookups without joins and write the results into HDF5 format. For each record I do six inserts: one into the table and one into each of its corresponding arrays.

The queries are really simple:

select * from x.train where tr_id=1 (primary key & indexed)
select q_t from x.qt where q_id=2 (non-primary key but indexed) 

(similarly five queries)
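
To make the per-record cost concrete, the access pattern above amounts to something like this (cur is a psycopg2 cursor as in the earlier sketch; deriving q_id from the train row is a placeholder):

    def fetch_record(cur, tr_id):
        # One indexed primary-key lookup on the main table...
        cur.execute("SELECT * FROM x.train WHERE tr_id = %s", (tr_id,))
        train_row = cur.fetchone()
        # ...then five similar single-row lookups on the referring tables.
        cur.execute("SELECT q_t FROM x.qt WHERE q_id = %s", (train_row[1],))
        q_t = cur.fetchone()[0]
        # (the remaining four lookups have the same shape and are omitted)
        return train_row, q_t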

Each computer writes two HDF5 files, so the total comes to about 20 files.

Some Calculations and statistics:

Total number of records: 143,700,000
Total number of records per file: 143,700,000 / 20 = 7,185,000
Total number of records in each file: 7,185,000 × 5 = 35,925,000

Current Postgresql database config :

My current machine: 8 GB RAM with a 2nd-generation i7 processor.

I made the following changes in the PostgreSQL configuration file: shared_buffers to 2 GB and effective_cache_size to 4 GB.
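
For reference, these are the corresponding lines as they appear in postgresql.conf:

    shared_buffers = 2GB
    effective_cache_size = 4GB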

Note on current performance:

I have run it for about ten hours, and the performance so far is as follows: the total number of records written to each file is about 621,000 × 5 = 3,105,000.

The bottleneck is that I can only rent the machines for 10 hours per day (overnight), and at this speed it will take about 11 days, which is too long for my experiments.

Please suggest how I can improve this. Questions:

1. Should I use symmetric multiprocessing on those desktops (each has 2 cores with about 2 GB of RAM)? In that case, what is suggested or preferable? (A rough sketch of what I mean is below.)
2. If I change my PostgreSQL configuration file and increase the RAM, will that speed up my process?
3. Should I use multithreading? In that case, any links or pointers would be of great help.
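
To make question 1 concrete, this is the kind of per-process split I have in mind, assuming each worker process opens its own database connection and writes its own HDF5 file (since, as noted above, the HDF5/PyTables side is not safe for concurrent writes to a single file). The DSN, id ranges and file names are placeholders:

    from multiprocessing import Pool

    import psycopg2
    import tables

    def export_range(args):
        start_id, end_id, out_path = args
        conn = psycopg2.connect("dbname=mydb user=me")    # placeholder DSN
        cur = conn.cursor()
        h5 = tables.open_file(out_path, mode="w")
        arr = h5.create_earray("/", "values", tables.Float64Atom(), shape=(0,))
        for tr_id in range(start_id, end_id):
            # one of the six single-row lookups; the other five would go here too
            cur.execute("SELECT tr_id FROM x.train WHERE tr_id = %s", (tr_id,))
            row = cur.fetchone()
            if row is not None:
                arr.append([float(row[0])])
        h5.close()
        conn.close()

    if __name__ == "__main__":
        # two worker processes, one per core, each owning one output file
        chunks = [(1, 100001, "part_a.h5"), (100001, 200001, "part_b.h5")]
        Pool(processes=2).map(export_range, chunks)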

Thanks, Sree Aurovindh V

© Stack Overflow or respective owner
