pytables - Developer IT

Search Results

Search found 2 results on 1 pages for 'pytables'.

Page 1/1 | 1

Optimising speeds in HDF5 using Pytables

- by Sree Aurovindh

The problem is with respect to the writing speed of the computer (10 * 32 bit machine) and the postgresql query performance.I will explain the scenario in detail. I have data about 80 Gb (along with approprite database indexes in place). I am trying to read it from Postgresql database and writing it into HDF5 using Pytables.I have 1 table and 5 variable arrays in one hdf5 file.The implementation of Hdf5 is not multithreaded or enabled for symmetric multi processing.I have rented about 10 computers for a day and trying to write them inorder to speed up my data handling. As for as the postgresql table is concerned the overall record size is 140 million and I have 5 primary- foreign key referring tables.I am not using joins as it is not scalable So for a single lookup i do 6 lookup without joins and write them into hdf5 format. For each lookup i do 6 inserts into each of the table and its corresponding arrays. The queries are really simple select * from x.train where tr_id=1 (primary key & indexed) select q_t from x.qt where q_id=2 (non-primary key but indexed) (similarly five queries) Each computer writes two hdf5 files and hence the total count comes around 20 files. Some Calculations and statistics: Total number of records : 14,37,00,000 Total number of records per file : 143700000/20 =71,85,000 The total number of records in each file : 71,85,000 * 5 = 3,59,25,000 Current Postgresql database config : My current Machine : 8GB RAM with i7 2nd generation Processor. I made changes to the following to postgresql configuration file : shared_buffers : 2 GB effective_cache_size : 4 GB Note on current performance: I have run it for about ten hours and the performance is as follows: The total number of records written for each file is about 6,21,000 * 5 = 31,05,000 The bottle neck is that i can only rent it for 10 hours per day (overnight) and if it processes in this speed it will take about 11 days which is too high for my experiments. Please suggest me on how to improve. Questions: 1. Should i use Symmetric multi processing on those desktops(it has 2 cores with about 2 GB of RAM).In that case what is suggested or prefereable? 2. If i change my postgresql configuration file and increase the RAM will it enhance my process. 3. Should i use multi threading.. In that case any links or pointers would be of great help Thanks Sree aurovindh V

Read the article

How to efficiently store and update binary data in Mongodb?

- by Rocketman

I am storing a large binary array within a document. I wish to continually add bytes to this array and sometimes change the value of existing bytes. I was looking for some $append_bytes and $replace_bytes type of modifiers but it appears that the best I can do is $push for arrays. It seems like this would be doable by performing seek-write type operations if I had access somehow to the underlying bson on disk, but it does not appear to me that there is anyway to do this in mongodb (and probably for good reason). If I were instead to just query this binary array, edit or add to it, and then update the document by rewriting the entire field, how costly will this be? Each binary array will be on the order of 1-2MB, and updates occur once every 5 minutes and across 1000s of documents. Worse, yet there is no easy way to spread these out (in time) and they will usually be happening close to one another on the 5 minute intervals. Does anyone have a good feel for how disastrous this will be? Seems like it would be problematic. An alternative would be to store this binary data as separate files on disk, implement a thread pool to efficiently manipulate the files on disk, and reference the filename from my mongodb document. (I'm using python and pymongo so I was looking at pytables). I'd prefer to avoid this though if possible. Is there any other alternative that I am overlooking here? Thanks in advnace.