Non-standard interaction between two tables to avoid a very large merge

Posted by riko on Stack Overflow, 2012-12-17

Suppose I have two tables A and B.

Table A has a multi-level index (a, b) and one column (ts); b uniquely determines ts.

import numpy as np
import pandas as pd

A = pd.DataFrame(
    [('a', 'x', 4),
     ('a', 'y', 6),
     ('a', 'z', 5),
     ('b', 'x', 4),
     ('b', 'z', 5),
     ('c', 'y', 6)],
    columns=['a', 'b', 'ts']).set_index(['a', 'b'])
AA = A.reset_index()

Table B is another one-column (ts) table with a non-unique index (a). The ts's are sorted within each group, i.e., B.loc[x] is sorted for each x. Moreover, for every x there is always a value in B.loc[x] that is greater than or equal to the values in A.

B = pd.DataFrame(
    dict(a=list('aaaaabbcccccc'), 
         ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')

The semantics of this is that B contains observations of occurrences of an event of the type indicated by the index.

I would like to find in B, for each value of b, the timestamp of the first occurrence of each event type at or after the timestamp specified in A. In other words, I would like a table with the same shape as A that, instead of ts, contains the minimum value in B.ts occurring at or after ts. For example, for ('a', 'y') with ts 6, the observations for a are [1, 2, 4, 5, 7], and the first one at or after 6 is 7.

So, my goal would be:

C: 
('a', 'x') 4
('a', 'y') 7
('a', 'z') 5
('b', 'x') 7
('b', 'z') 7
('c', 'y') 8

I have some working code, but it is terribly slow.

def first_at_or_after(row):
    ts_group = B.loc[[row['a']], 'ts']          # sorted observations for this event type
    # first observation >= row's ts (searchsorted defaults to side='left')
    return ts_group.iloc[np.searchsorted(ts_group.values, row['ts'])]

C = AA.assign(ts=AA.apply(first_at_or_after, axis=1)).set_index(['a', 'b'])

Profiling shows the culprit is, unsurprisingly, the per-row indexing and searchsorted lookup inside first_at_or_after. However, standard solutions using merge/join would take too much RAM in the long run.
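For reference, the merge-based version I am trying to avoid would look roughly like the sketch below (ts_a/ts_b are just the suffixes I tell the merge to produce). It materializes one row per (A-row, B-observation) pair within each a before filtering, which is exactly what blows up memory:

# Sketch of the merge/join approach I want to avoid: it builds the full
# per-a cross product of A rows and B observations before filtering.
merged = AA.merge(B.reset_index(), on='a', suffixes=('_a', '_b'))
C_merge = (merged[merged['ts_b'] >= merged['ts_a']]
           .groupby(['a', 'b'])['ts_b'].min().to_frame('ts'))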

Consider that right now I have 1,000 a's, that the average number of b's per a is roughly constant (probably 100-200), and that the number of observations per a is on the order of 300. In production I will have 1,000 times as many a's, so a full merge would be on the order of

1,000,000 x 200 x 300 = 60,000,000,000 rows

may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above.

How would I improve the performance?
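One direction I have been considering is to do a single vectorized searchsorted per event type instead of one Python-level lookup per row of A. A rough, untested sketch (lookup_group is just an illustrative name), relying on B being sorted within each group and on a match always existing:

def lookup_group(g):
    # g is the slice of AA for one event type; g.name is that type's index value
    ts_sorted = B.loc[[g.name], 'ts'].values          # sorted observations for this type
    return g.assign(ts=ts_sorted[np.searchsorted(ts_sorted, g['ts'].values)])

C = AA.groupby('a', group_keys=False).apply(lookup_group).set_index(['a', 'b'])

But I have not measured whether this avoids enough of the per-row overhead to matter at the production scale above.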
