-
Notifications
You must be signed in to change notification settings - Fork 0
What happens when you mess with hashing
This is based on a message from Aaron Meurer on the mailing list
Python expects a few invariants with regard to hashing, but it does not enforce them. In particular, it expects that:
-
The hash of an object does not change across the object's lifetime (in other words, a hashable object should be immutable).
-
a == b
implieshash(a) == hash(b)
(note that the reverse might not hold in the case of a hash collision).
So let's look at what happens when these rules are broken. First, let's create an object with __hash__
, and try changing the hash across the object's lifetime.
>>> class Bad(object):
... def __init__(self, arg):
... self.arg = arg
... def __hash__(self):
... return hash(self.arg)
...
>>> Bad(1)
<__main__.Bad object at ...>
>>> hash(Bad(1))
1
>>> a = Bad(1)
>>> b = {a:1}
>>> a.arg = 2
>>> hash(a)
2
>>> b[a]
Traceback (most recent call last):
...
KeyError: <__main__.Bad object at ...>
Here, we implicitly changed the hash of a
by mutating the argument of
a
that is used to compute the hash. As a result, the object is no longer found in a dictionary, which uses the hash to find the object.
Note that Python doesn't prevent
me from doing this. I could make it if I want, by making __setattr__
raise AttributeError
, but even then I could forcibly change it by
modifying the object's __dict__
. This is what is meant when we say
that Python is a "consenting adults" language.
Now let's look at the second point. Bad things also happen if we change the object, but keep the hash the same:
class Bad2(object):
def __init__(self, arg):
self.arg = arg
self.hash_cache = None
def __hash__(self):
if self.hash_cache is None:
self.hash_cache = hash(self.arg)
return self.hash_cache
def __eq__(self, other):
if not isinstance(other, Bad2):
return False
return self.arg == other.arg
(this is the pattern used by SymPy's Basic
, by the way)
Here we see that finding the key in a dictionary still works:
>>> a = Bad2(1)
>>> b = Bad2(2)
>>> d = {b:2}
>>> b.arg = 1
>>> hash(b)
2
>>> d[b]
2
but it no longer keeps the invariant that sets (and dicts) keep unique objects (keys):
>>> a = Bad2(1)
>>> b = Bad2(2)
>>> hash(a)
1
>>> hash(b)
2
>>> b.arg = 1
>>> a == b
True
>>> hash(a) == hash(b)
False
>>> set([a, b])
set([<__main__.Bad2 object at ...>, <__main__.Bad2 object at ...>])
and this can lead to very subtle issues. For example, look what happens when I don't compute the hash before changing the object:
>>> a = Bad2(1)
>>> b = Bad2(2)
>>> b.arg = 1
>>> a == b
True
>>> hash(a) == hash(b)
True
>>> set([a, b])
set([<__main__.Bad2 object at ...>])
This time, the set contains one element!
So, yes, Python won't enforce it, but if you don't follow the rule
that hash(object)
remains the same for the lifetime of object and that
a == b
implies hash(a) == hash(b)
, then you will run into very subtle
issues, because very basic Python operations expect these invariants.
If you want to be able to mutate an object's properties, you have two options. First, make the object unhashable (make __hash__
raise TypeError). You won't be able to use it in sets or as keys to a dictionary, but you will be free to change the object in place however you want.
A second option is to make all mutable properties non-dependent on hashing or equality testing. This option works well if you just want to cache some internal state that doesn't inherently change the object. Both __eq__
and __hash__
should remain unchanged by changes to this state. You may also want to make sure you use proper getters and setters to prevent modification of internal state that equality testing and hashing does depend on.