Deterministic key serialization
- by Mike Boers
I'm writing a mapping class which uses SQLite as the storage backend. I am currently allowing only basestring keys but it would be nice if I could use a couple more types hopefully up to anything that is hashable (ie. same requirements as the builtin dict). To that end I would like to derive a deterministic serialization scheme.
Ideally, I would like to know if any implementation/protocol combination of pickle is deterministic for hashable objects (e.g. can only use cPickle with protocol 0).
I noticed that pickle and cPickle do not match:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
... 
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Another option is to use repr to dump and ast.literal_eval to load. This would only be valid for builtin hashable types. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))
The question for this method is if repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. This also would not work for integers where 2**32 - 1 < x < 2**64 jumping between 32 and 64 bit platforms. Are there any other conditions (ie. do strings serialize differently under different conditions)?
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it.)
Any insights?