Are there any radix/patricia/critbit trees for Python?

Posted by Andrew Dalke on Stack Overflow See other posts from Stack Overflow or by Andrew Dalke
Published on 2011-01-16T18:44:11Z Indexed on 2011/01/16 19:53 UTC
Read the original article Hit count: 389

Filed under:
|
|

I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).

My prototype uses Python's set as the obvious data type.

When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.

Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.

I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.

I can't find any, or at least not any with Python bindings. There is avltree which does what I want, but since even the pair-wise set merge take longer than I want, I suspect I want to have all my operations done in C/C++.

Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?

Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)

© Stack Overflow or respective owner

Related posts about python

Related posts about patricia-trie