Deterministic key serialization

Posted by Mike Boers on Stack Overflow See other posts from Stack Overflow or by Mike Boers
Published on 2010-06-03T14:05:08Z Indexed on 2010/06/03 14:14 UTC
Read the original article Hit count: 265

Filed under:
|
|

I'm writing a mapping class which uses SQLite as the storage backend. I am currently allowing only basestring keys but it would be nice if I could use a couple more types hopefully up to anything that is hashable (ie. same requirements as the builtin dict). To that end I would like to derive a deterministic serialization scheme.

Ideally, I would like to know if any implementation/protocol combination of pickle is deterministic for hashable objects (e.g. can only use cPickle with protocol 0).

I noticed that pickle and cPickle do not match:

>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
... 
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."

Another option is to use repr to dump and ast.literal_eval to load. This would only be valid for builtin hashable types. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):

def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))

The question for this method is if repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. This also would not work for integers where 2**32 - 1 < x < 2**64 jumping between 32 and 64 bit platforms. Are there any other conditions (ie. do strings serialize differently under different conditions)?

(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it.)

Any insights?

© Stack Overflow or respective owner

Related posts about python

Related posts about serialize