Search Results

Search found 14 results on 1 pages for 'hdf5'.

Page 1/1 | 1 

  • Optimising speeds in HDF5 using Pytables

    - by Sree Aurovindh
    The problem is with respect to the writing speed of the computer (10 * 32 bit machine) and the postgresql query performance.I will explain the scenario in detail. I have data about 80 Gb (along with approprite database indexes in place). I am trying to read it from Postgresql database and writing it into HDF5 using Pytables.I have 1 table and 5 variable arrays in one hdf5 file.The implementation of Hdf5 is not multithreaded or enabled for symmetric multi processing.I have rented about 10 computers for a day and trying to write them inorder to speed up my data handling. As for as the postgresql table is concerned the overall record size is 140 million and I have 5 primary- foreign key referring tables.I am not using joins as it is not scalable So for a single lookup i do 6 lookup without joins and write them into hdf5 format. For each lookup i do 6 inserts into each of the table and its corresponding arrays. The queries are really simple select * from x.train where tr_id=1 (primary key & indexed) select q_t from x.qt where q_id=2 (non-primary key but indexed) (similarly five queries) Each computer writes two hdf5 files and hence the total count comes around 20 files. Some Calculations and statistics: Total number of records : 14,37,00,000 Total number of records per file : 143700000/20 =71,85,000 The total number of records in each file : 71,85,000 * 5 = 3,59,25,000 Current Postgresql database config : My current Machine : 8GB RAM with i7 2nd generation Processor. I made changes to the following to postgresql configuration file : shared_buffers : 2 GB effective_cache_size : 4 GB Note on current performance: I have run it for about ten hours and the performance is as follows: The total number of records written for each file is about 6,21,000 * 5 = 31,05,000 The bottle neck is that i can only rent it for 10 hours per day (overnight) and if it processes in this speed it will take about 11 days which is too high for my experiments. Please suggest me on how to improve. Questions: 1. Should i use Symmetric multi processing on those desktops(it has 2 cores with about 2 GB of RAM).In that case what is suggested or prefereable? 2. If i change my postgresql configuration file and increase the RAM will it enhance my process. 3. Should i use multi threading.. In that case any links or pointers would be of great help Thanks Sree aurovindh V

    Read the article

  • HDF5 .Net wrapper

    - by UshaP
    I'm getting ( http://www.hdfgroup.org/projects/hdf.net/) The specified module could not be found. (Exception from HRESULT: 0x8007007E) from the dependency walker i'm seeing that SZLIBDLL.DLL is missing i tried to download it from random place but then i got another error. Does any one had that problem? i tried also vs2005 and vs2008 Thanks, Pini.

    Read the article

  • Install h5py in Mac OS X 10.6.3

    - by zyq524
    I'm trying to install h5py in Mac OS X 10.6.3. First I installed HDF5 1.8, which used the following commands: ./configure \ --prefix=/Library/Frameworks/Python.framework/Versions/Current \ --enable-shared \ --enable-production \ --enable-threadsafe \ CPPFLAGS=-I/Library/Frameworks/Python.framework/Versions/Current/include \ LDFLAGS=-L/Library/Frameworks/Python.framework/Versions/Current/lib make make check sudo make install Then install h5py: /Library/Frameworks/Python.framework/Versions/Current/bin/python \ setup.py \ build \ --api=18 \ --hdf5=/Library/Frameworks/Python.framework/Versions/Current Then I got the errors: Configure: Autodetecting HDF5 settings... Custom HDF5 dir: /Library/Frameworks/Python.framework/Versions/Current Custom API level: (1, 8) ld: warning: in detect/vers.o, file was built for unsupported file format which is not the architecture being linked (i386) ld: warning: in /Library/Frameworks/Python.framework/Versions/Current/lib/libhdf5.dylib, file was built for unsupported file format which is not the architecture being linked (i386) Undefined symbols: "_main", referenced from: start in crt1.10.5.o ld: symbol(s) not found collect2: ld returned 1 exit status Failed to compile HDF5 test program. Please check to make sure: * You have a C compiler installed * A development version of Python is installed (including header files) * A development version of HDF5 is installed (including header files) * If HDF5 is not in a default location, supply the argument --hdf5=<path> error: command 'cc' failed with exit status 1 I just updated my Xcode, I don't know whether this is because my gcc's default setting. If so, how can I get rid of this error? Thanks.

    Read the article

  • Error mpicc command not found [closed]

    - by skn
    I want to compile hdf5 but I find the following error: /hdf5/hdf5-1.6.9CC=/usr/local/openmpi/bin/mpicc ./configure /home/sknandi/Research/ Simulation/hdf5/parallel_fdf5 CC=/usr/local/openmpi/bin/mpicc: Command not found. The result of echo $PATH is /hdf5/hdf5-1.6.9echo $PATH /priv/myriad3/ayw/research/COALA/visit/bin:/usr/local/bin:/usr/bin:/bin:/pkg/linux/intel/composerxe-2011.3.174/composerxe-2011.3.174/bin/intel64:/pkg/linux/casa/x86_64:/usr/local/bin:/bin:/usr/bin:/usr/local/openmpi/bin:/pkg/linux/intel/composerxe-2011.3.174/composerxe-2011.3.174/mpirt/bin/intel64:/pkg/linux/SS12/solstudio12.2/bin:/usr/local/vanilla-pds/bin and result of which mpicc is /hdf5/hdf5-1.6.9which mpicc /usr/local/openmpi/bin/mpicc

    Read the article

  • Storage for large gridded datasets

    - by nullglob
    I am looking for a good storage format for large, gridded datasets. The application is meteorology, and we would prefer a format that is common within this field (to help exchange data with others). I don't need to deal with special data structures, and there should be a Fortran API. I am currently considering HDF5, GRIB2 and NetCDF4. How do these formats compare in terms of data compression? What are their main limitations? How steep is the learning curve? Are there any other storage formats worth investigating? I have not found a great deal of material outlining the differences and pros/cons of these formats (there is one relevant SO thread, and a presentation comparing GRIB and NetCDF).

    Read the article

  • what is a fast way to output h5py dataset to text?

    - by user362761
    I am using the h5py python package to read files in HDF5 format. (e.g. somefile.h5) I would like to write the contents of a dataset to a text file. For example, I would like to create a text file with the following contents: 1,20,31,75,142,324,78,12,3,90,8,21,1 I am able to access the dataset in python using this code: import h5py f = h5py.File('/Users/Me/Desktop/thefile.h5', 'r') group = f['/level1/level2/level3'] dset = group['dsetname'] My naive approach is too slow, because my dataset has over 20000 entries: # write all values to file for index in range(len(dset)): # do not add comma after last value if index == len(dset)-1: txtfile.write(repr(dset[index])) else: txtfile.write(repr(dset[index])+',') txtfile.close() return None Is there a faster way to write this to a file? Perhaps I could convert the dataset into a NumPy array or even a Python list, and then use some file-writing tool? (I could experiment with concatenating the values into a larger string before writing to file, but I'm hoping there's something entirely more elegant)

    Read the article

  • Why does this code leak? (simple codesnippet)

    - by Ela782
    Visual Studio shows me several leaks (a few hundred lines), in total more than a few MB. I traced it down to the following "helloWorld example". The leak disappears if I comment out the H5::DataSet.getSpace() line. #include "stdafx.h" #include <iostream> #include "cpp/H5Cpp.h" int main(int argc, char *argv[]) { _CrtSetDbgFlag ( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF ); // dump leaks at return H5::H5File myfile; try { myfile = H5::H5File("C:\\Users\\yyy\\myfile.h5", H5F_ACC_RDONLY); } catch (H5::Exception& e) { std::string msg( std::string( "Could not open HDF5 file.\n" ) + e.getCDetailMsg() ); throw msg; } H5::Group myGroup = myfile.openGroup("/so/me/group"); H5::DataSet myDS = myGroup.openDataSet("./myfloatvec"); hsize_t dims[1]; //myDS.getSpace().getSimpleExtentDims(dims, NULL); // <-- here's the leak H5::DataSpace dsp = myDS.getSpace(); // The H5::DataSpace seems to leak dsp.getSimpleExtentDims(dims, NULL); //dsp.close(); // <-- doesn't help either std::cout << "Dims: " << dims[0] << std::endl; // <-- Works as expected return 0; } Any help would be appreciated. I've been on this for hours, I hate unclean code...

    Read the article

  • No such file but the file is there!

    - by user288757
    I'm trying to compile a C++ file with some includes. My main file (well I didn't make it hdf5_getters includes a file which includes the file hdf5.h, also not my design but it's a downloaded library. Every time I try to compile it I get the error message that the file hdf5.h does not exist while it clearly does. I started reading on the internet and people say it can happen because it's a 32bit binary running on a 64bit architecture. But I'm running a 32bit Ubuntu so that can't be it... I'm out of ideas, if anyone can help me please :) This is the errormessage with commands: make hdf5_getters g++ -c -Wall -std=c++0x -O2 -c hdf5_getters.cc In file included from H5Cpp.h:20:0, from hdf5_getters.cc:34: H5Include.h:17:18: fatal error: hdf5.h: No such file or directory #include <hdf5.h> ^ compilation terminated. make: *** [hdf5_getters.o] Error 1

    Read the article

  • Overwhelmed by design patterns... where to begin?

    - by Pete
    I am writing a simple prototype code to demonstrate & profile I/O schemes (HDF4, HDF5, HDF5 using parallel IO, NetCDF, etc.) for a physics code. Since focus is on IO, the rest of the program is very simple: class Grid { public: floatArray x,y,z; }; class MyModel { public: MyModel(const int &nip1, const int &njp1, const int &nkp1, const int &numProcs); Grid grid; map<string, floatArray> plasmaVariables; }; Where floatArray is a simple class that lets me define arbitrary dimensioned arrays and do mathematical operations on them (i.e. x+y is point-wise addition). Of course, I could use better encapsulation (write accessors/setters, etc.), but that's not the concept I'm struggling with. For the I/O routines, I am envisioning applying simple inheritance: Abstract I/O class defines read & write functions to fill in the "myModel" object HDF4 derived class HDF5 HDF5 using parallel IO NetCDF etc... The code should read data in any of these formats, then write out to any of these formats. In the past, I would add an AbstractIO member to myModel and create/destroy this object depending on which I/O scheme I want. In this way, I could do something like: myModelObj.ioObj->read('input.hdf') myModelObj.ioObj->write('output.hdf') I have a bit of OOP experience but very little on the Design Patterns front, so I recently acquired the Gang of Four book "Design Patterns: Elements of Reusable Object-Oriented Software". OOP designers: Which pattern(s) would you recommend I use to integrate I/O with the myModel object? I am interested in answering this for two reasons: To learn more about design patterns in general Apply what I learn to help refactor an large old crufty/legacy physics code to be more human-readable & extensible. I am leaning towards applying the Decerator pattern to myModel, so I can attach the I/O responsibilities dynamically to myModel (i.e. whether to use HDF4, HDF5, etc.). However, I don't feel very confident that this is the best pattern to apply. Reading the Gang of Four book cover-to-cover before I start coding feels like a good way to develop an unhealthy caffeine addiction. What patterns do you recommend?

    Read the article

  • Converting large files in python

    - by Cenoc
    I have a few files that are ~64GB in size that I think I would like to convert to hdf5 format. I was wondering what the best approach for doing so would be? Reading line-by-line seems to take more than 4 hours, so I was thinking of using multiprocessing in sequence, but was hoping for some direction on what would be the most efficient way without resorting to hadoop. Any help would be very much appreciated. (and thank you in advance) EDIT: Right now I'm just doing a for line in fd: approach. After that right now I just check to make sure I'm picking out the right sort of data, which is very short; I'm not writing anywhere, and it's taking around 4 hours to complete with that. I can't read blocks of data because the blocks in this weird file format I'm reading are not standard, it switches between three different sizes... and you can only tell which by reading the first few characters of the block.

    Read the article

  • large amount of data in many text files - how to process?

    - by Stephen
    Hi, I have large amounts of data (a few terabytes) and accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks but I fear this may be a bit large. Some candidate solutions are to 1) write the whole thing in C (or Fortran) 2) import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable for pure SQL solutions) 3) write the whole thing in Python Would (3) be a bad idea? I know you can wrap C routines in Python but in this case since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations or suggestions? Thanks

    Read the article

  • what changes when your input is giga/terabyte sized?

    - by Wang
    I just took my first baby step today into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny. I write Python, so I've spent the last few hours reading about HDF5, and Numpy, and PyTable, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer. For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind. What other assumptions have I been relying in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)

    Read the article

  • Versioning freindly, extendible binary file format

    - by Bas Bossink
    In the project I'm currently working on there is a need to save a sizeable data structure to disk. Being in optimist I thought their must be a standard solution for such a problem however upto now I haven't found a solution that satisfies the following requirements: .net 2.0 support, preferably with a foss implementation version friendly (this should be interpreted as reading an old version of the format should be relatively simple if the changes in the underlying data structure are simple, say adding/dropping fields) ability to do some form of random access where part of the data can be extended after initial creation (think of this as extending intermediate results) space and time efficient (xml has been excluded as option given this requierement) Options considered so far: Protocol Buffers : was turned down by verdict of the documentation about Large Data Sets since this comment suggest adding another layer on top, this would call for additional complexity which I wish to have handled by the file format itself. HDF5,EXI : do not seem to have .net implementations SQLite : the data structure at hand would result in a pretty complex table structure that seems to heavyweight for the intended use BSON : does not appear to support requirement 3. Fast Infoset : only seems to have buyware .net implementations Any recommendations or pointers are greatly appreciated. Furthermore if you believe any of the information above is not true please provide pointers/examples to proove me wrong.

    Read the article

  • Versioning friendly, extendible binary file format

    - by Bas Bossink
    In the project I'm currently working on there is a need to save a sizable data structure to disk (edit: think dozens of MB's). Being an optimist, I thought that there must be a standard solution for such a problem; however, up to now I haven't found a solution that satisfies the following requirements: .NET 2.0 support, preferably with a FOSS implementation Version friendly (this should be interpreted as: reading an old version of the format should be relatively simple if the changes in the underlying data structure are simple, say adding/dropping fields) Ability to do some form of random access where part of the data can be extended after initial creation (think of this as extending intermediate results) Space and time efficient (XML has been excluded as option given this requirement) Options considered so far: Protocol Buffers: was turned down by verdict of the documentation about Large Data Sets - since this comment suggested adding another layer on top, this would call for additional complexity which I wish to have handled by the file format itself. HDF5,EXI: do not seem to have .net implementations SQLite/SQL Server Compact edition: the data structure at hand would result in a pretty complex table structure that seems too heavyweight for the intended use BSON: does not appear to support requirement 3. Fast Infoset: only seems to have paid .NET implementations. Any recommendations or pointers are greatly appreciated. Furthermore if you believe any of the information above is not true, please provide pointers/examples to prove me wrong.

    Read the article

1