Data Aggregation of CSV Files in Java

Posted by royB on Programmers, 2013-08-07

I have k CSV files (5 CSV files, for example); each file has m fields that together form a key, plus n value columns. I need to produce a single CSV file with the aggregated data.

I'm looking for the most efficient solution to this problem, mainly in terms of speed; I don't think we'll run into memory issues. I'd also like to know whether hashing is really a good solution, given that we'd need a 64-bit hash to keep the collision probability under 1% (we have around 30,000,000 rows per aggregation). By the birthday bound, the probability of any collision among n = 30,000,000 random 64-bit values is roughly n^2 / 2^65, which is about 2.4 x 10^-5, so 64 bits is comfortably under that 1% target.
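
For illustration, here is a minimal sketch of one possible 64-bit hash over the composite key string. FNV-1a is an arbitrary choice here, not something the question commits to; any well-distributed 64-bit hash would do:

    // FNV-1a, 64-bit: one possible hash for a composite key such as "england,london".
    // Hashes UTF-16 code units; illustrative only.
    static long fnv1a64(String key) {
        long hash = 0xcbf29ce484222325L;    // FNV offset basis
        for (int i = 0; i < key.length(); i++) {
            hash ^= key.charAt(i);
            hash *= 0x100000001b3L;         // FNV prime
        }
        return hash;
    }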

For example:

file 1: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,50,60,70,80
        a3,b2,c4,60,60,80,90 

file 2: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,30,50,90,40
        a3,b2,c4,30,70,50,90

result: f1,f2,f3,v1,v2,v3,v4  
        a1,b1,c1,80,110,160,120
        a3,b2,c4,90,130,130,180

Algorithms we have considered so far:

  1. hashing (using a ConcurrentHashMap) - a minimal sketch follows below

  2. merge-sorting the files

  3. a DB: MySQL, Hadoop, or Redis

The solution needs to be able to handle a huge amount of data (each file has more than two million rows).
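
To make option 1 concrete, here is a minimal single-threaded sketch of hash-based aggregation. It assumes the first KEY_FIELDS columns form the key and the remaining columns are integer values; the file names, column counts, and header are placeholders taken from the example above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CsvAggregator {
        static final int KEY_FIELDS = 3;   // 6 in the real data
        static final int VALUE_FIELDS = 4; // 8 in the real data

        public static void main(String[] args) throws IOException {
            Map<String, long[]> totals = new HashMap<>();
            for (String file : List.of("file1.csv", "file2.csv")) { // placeholder names
                try (BufferedReader in = Files.newBufferedReader(Paths.get(file))) {
                    in.readLine(); // skip the header row
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] cols = line.split(",");
                        // The composite key is the first KEY_FIELDS columns, rejoined.
                        String key = String.join(",", Arrays.copyOfRange(cols, 0, KEY_FIELDS));
                        long[] sums = totals.computeIfAbsent(key, k -> new long[VALUE_FIELDS]);
                        for (int i = 0; i < VALUE_FIELDS; i++) {
                            sums[i] += Long.parseLong(cols[KEY_FIELDS + i].trim());
                        }
                    }
                }
            }
            try (PrintWriter out = new PrintWriter("result.csv")) {
                out.println("f1,f2,f3,v1,v2,v3,v4"); // header for the example schema
                totals.forEach((key, sums) -> {
                    StringBuilder row = new StringBuilder(key);
                    for (long v : sums) row.append(',').append(v);
                    out.println(row);
                });
            }
        }
    }

Note that a plain HashMap is enough here, because a single thread both reads and aggregates; a ConcurrentHashMap only pays off if several files are parsed in parallel into a shared map. The output order is unspecified (HashMap); a TreeMap would keep it sorted by key.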

A better example: file 1

country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:

country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:

country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000

The key is: country,city. This is just an example; my real key consists of 6 fields, and there are 8 data columns - 14 columns in total.

We would like the solution to be the fastest possible in terms of data processing.
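
For comparison, here is a sketch of option 2, assuming each input file has already been sorted by its key (e.g. by an external sort). A k-way merge through a priority queue keeps only one row per file in memory, so it scales past RAM; it uses a Java 16 record for brevity, and the key width and file names are placeholders matching the country/city example:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class SortedCsvMerge {
        static final int KEY_FIELDS = 2; // country,city

        record Row(String key, long[] values, BufferedReader source) {}

        // Read the next data row from a file, or null at end of file.
        static Row next(BufferedReader in) throws IOException {
            String line = in.readLine();
            if (line == null) return null;
            String[] cols = line.split(",");
            long[] values = new long[cols.length - KEY_FIELDS];
            for (int i = 0; i < values.length; i++)
                values[i] = Long.parseLong(cols[KEY_FIELDS + i].trim());
            return new Row(String.join(",", Arrays.copyOfRange(cols, 0, KEY_FIELDS)), values, in);
        }

        static void refill(PriorityQueue<Row> heap, BufferedReader in) throws IOException {
            Row row = next(in);
            if (row != null) heap.add(row);
        }

        public static void main(String[] args) throws IOException {
            PriorityQueue<Row> heap = new PriorityQueue<>(Comparator.comparing(Row::key));
            for (String file : new String[] {"sorted1.csv", "sorted2.csv"}) { // placeholders
                BufferedReader in = Files.newBufferedReader(Paths.get(file));
                in.readLine(); // skip the header row
                refill(heap, in);
            }
            System.out.println("country,city,peopleNum");
            while (!heap.isEmpty()) {
                Row row = heap.poll();
                refill(heap, row.source());
                long[] sums = row.values().clone();
                // Fold in every queued row that shares the current (smallest) key.
                while (!heap.isEmpty() && heap.peek().key().equals(row.key())) {
                    Row same = heap.poll();
                    for (int i = 0; i < sums.length; i++) sums[i] += same.values()[i];
                    refill(heap, same.source());
                }
                StringBuilder outRow = new StringBuilder(row.key());
                for (long v : sums) outRow.append(',').append(v);
                System.out.println(outRow);
            }
        }
    }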
