GetHashCode() Pitfalls

GetHashCode() is one of those methods you should probably avoid implementing yourself. It generates a lot of confusion among developers and can lead to some hard-to-track bugs when implemented poorly. The goal of this article is to help you understand the common mistakes developers make when implementing GetHashCode() and some general misconceptions when it comes to computing hash codes for your custom types in C#. Note that this is not a guideline for hash function design and implementation. Rather, the main focus will be on GetHashCode() in .NET and more specifically:

- The difference between GetHashCode() behavior for reference and value types.
- Implementation details you need to be aware of when defining your own GetHashCode() version.

Hash Codes for Reference and Value Types

Every object in C# has a default behavior for generating an integer value called a hash code. This is the value that a hash-based container like HashSet<T> or Dictionary<K,V> uses to store and retrieve objects. The objects are grouped together by their hash codes into buckets or slots. Multiple objects can produce the same hash code; this is what’s called a hash collision.

To store an object in a hash container, its hash code is computed, which determines the correct bucket for that object to go into. To look up an object, its hash code is computed again, and the container then searches only the single bucket corresponding to that hash. This is the main benefit of hash lookups: instead of iterating through the whole collection and comparing against each element, we just compute the hash code and go directly to the bucket (which is very fast) to find the object. Once the bucket is found, it may contain multiple objects with the same hash code. This is where Object.Equals() comes into play. Each object in the bucket is compared to the lookup object via its Equals() method (or the default equality comparer, to be precise).
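To make that bucket-then-Equals() sequence concrete, here is a minimal sketch (the CollidingKey type is hypothetical, invented just for illustration) in which every instance deliberately returns the same hash code. Lookups still work, because within the shared bucket the container falls back to Equals():

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type that forces every instance into the same bucket
// by returning a constant hash code.
public class CollidingKey
{
    public int Value { get; }
    public CollidingKey(int value) => Value = value;

    public override bool Equals(object obj) =>
        obj is CollidingKey other && other.Value == Value;

    // Every instance collides on purpose.
    public override int GetHashCode() => 42;
}

public static class Demo
{
    public static void Main()
    {
        var set = new HashSet<CollidingKey> { new CollidingKey(1), new CollidingKey(2) };

        // Same bucket for everything, so Equals() decides the result.
        Console.WriteLine(set.Contains(new CollidingKey(1))); // True
        Console.WriteLine(set.Contains(new CollidingKey(3))); // False
    }
}
```

All lookups degrade to a linear scan of that one bucket here, which is exactly why collisions should be rare in a real hash function.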

GetHashCode() is defined in System.Object, meaning that every object you create inherits a default implementation from the base class. This implementation, however, is different for reference and value types, as System.ValueType defines its own rules for generating hash codes by overriding System.Object.GetHashCode(). Let’s explore these differences. For reference types, the hash code is based on the object’s identity. Imagine this is just a unique number assigned to every object by the runtime. Let’s see that by creating an example using the following simple class Person containing first and last name fields:

    public class Person
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }

Now, let’s run the following program:

    public static void Main(string[] args)
    {
        var p1 = new Person { FirstName = "FirstName", LastName = "LastName" };
        var p2 = p1;

        Console.WriteLine(p1.GetHashCode() == p2.GetHashCode()); // True
    }

This program outputs True because p1 and p2 hold a reference to the same object, i.e. one with the same identity. It’s important to understand this: the two hash codes are the same not because the two objects have the same values for first and last name. Let’s prove this with another example:

    public static void Main(string[] args)
    {
        var p1 = new Person { FirstName = "FirstName", LastName = "LastName" };
        var p2 = new Person { FirstName = "FirstName", LastName = "LastName" };

        Console.WriteLine(p1.GetHashCode() == p2.GetHashCode()); // False
    }

The output now is False. This is because p1 and p2 are references to two completely different and unrelated objects.

Let’s now turn to value types and see how the rules are different there. I’ll continue working with the Person example, but this time define it as a struct :

    public struct Person
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }

As Person is now a value type, its hash code is derived from its field values. As an implementation detail of the runtime, it is typically based on the first field of the struct, in this case FirstName. This means that if two objects of type Person have the same value for FirstName, their hash codes will match:

    public static void Main(string[] args)
    {
        var p1 = new Person { FirstName = "FirstName", LastName = "LastName" };
        var p2 = new Person { FirstName = "FirstName", LastName = "LastName" };
        Console.WriteLine(p1.GetHashCode() == p2.GetHashCode()); // True

        p1.LastName = "ChangedLastName";
        Console.WriteLine(p1.GetHashCode() == p2.GetHashCode()); // True

        p1.FirstName = "ChangedFirstName";
        Console.WriteLine(p1.GetHashCode() == p2.GetHashCode()); // False
    }

You can see that we get different hash codes only if we change the FirstName value.

UPDATE: An exception to this rule was pointed out in this Reddit discussion

GetHashCode() Implementation Rules

Equipped with the knowledge of the default GetHashCode() behavior for reference and value types, let’s now define a couple of rules we need to follow when implementing a custom version of GetHashCode(). These rules are by no means an exhaustive list of requirements for writing a good hash function in general, but they represent probably the most common GetHashCode() implementation mistakes I’ve seen. If you’re interested in hash algorithm details, e.g. how to implement a performant hash function with a uniform distribution across the hash keys, there are plenty of good sources out there. For something practical for the C# and Java audience, you can probably start with this SO thread.

Let’s move on to the rules.

Rule One

If two objects are equal, they must generate the same hash value. In other words, their GetHashCode() function should return the same result.

By the way, the reverse statement is not true. It’s perfectly fine for different objects to have the same hash code; this is even to be expected. Such cases are called hash collisions and, although we should try to minimize them, they are pretty much inevitable simply due to the pigeonhole principle.

Now, let’s define a class that breaks this rule so I can show some of the problems it can lead to. In the Person class below, two objects are considered equal when their FirstName properties match. On the other hand, GetHashCode() is calculated based only on the LastName field. So the rule is broken: if we have two Person objects with the same FirstName but different LastName values, they will be considered equal, but they’ll also return different hash codes.

    public class Person
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }

        public override bool Equals(object obj) =>
            obj is Person person && person.FirstName == FirstName;

        public override int GetHashCode() => LastName.GetHashCode();
    }

What can go wrong in this case? Plenty of things actually. This implementation leads to different equality rules based on the context the object is used in, which can lead to some unpredictable issues. Let’s see an example.

Say you want to implement a boolean function that receives a sequence of Person objects and returns true if the sequence contains duplicates and false otherwise. Someone may go for a somewhat naïve implementation like the following:

    private static bool ContainsDuplicates(IEnumerable<Person> people)
    {
        var distinctPeople = new List<Person>();
        foreach (var person in people)
        {
            if (distinctPeople.Contains(person))
            {
                return true;
            }

            distinctPeople.Add(person);
        }

        return false;
    }

This implementation may be suboptimal from a performance perspective (which will be discussed in a moment), but it works correctly. The following program outputs True, as we would expect, since we’ve defined two Person objects to be equal when they have the same first name:

    public static void Main(string[] args)
    {
        var p1 = new Person { FirstName = "FirstName", LastName = "LastNameOne" };
        var p2 = new Person { FirstName = "FirstName", LastName = "LastNameTwo" };
        var people = new List<Person> { p1, p2 };

        Console.WriteLine(ContainsDuplicates(people));
    }

So far so good. At some point, though, someone notices that this code is quite inefficient, as for every person we are essentially iterating through the whole distinctPeople collection. In Big O terms this means O(n^2) time complexity. But we know better! There is just one very small change we can make to get this code to execute in linear time O(n). You guessed it: just swap the backing data structure from a list to a hash set. We make this small change and our new version looks like this:

    private static bool ContainsDuplicates(IEnumerable<Person> people)
    {
        var distinctPeople = new HashSet<Person>();
        foreach (var person in people)
        {
            if (distinctPeople.Contains(person))
            {
                return true;
            }

            distinctPeople.Add(person);
        }

        return false;
    }

We have our super performant solution now, but we’ve introduced a bigger problem. Using the same driver code as before, our ContainsDuplicates() function now returns False. This is because a lookup in a hash-based collection like HashSet<T> uses the GetHashCode() function when checking for the existence of an object. In our example the two Person objects have different hash codes, and thus live in different buckets, because GetHashCode() takes only the LastName property into account.

This was just a simple example of a hypothetical problematic scenario, but there are endless other situations where a similar bug may surface if we don’t follow this rule. So just remember: equality and hash code generation should go hand in hand. If two objects are equal, their hash codes should be the same!
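For contrast, here is a sketch of a version that respects the rule, with Equals() and GetHashCode() derived from the same fields. It uses HashCode.Combine, which is available from .NET Core 2.1 / .NET Standard 2.1 onward; on older targets you would combine the field hashes manually:

```csharp
using System;

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    // Equality and hashing look at the same fields, so equal objects
    // are guaranteed to produce equal hash codes.
    public override bool Equals(object obj) =>
        obj is Person person &&
        person.FirstName == FirstName &&
        person.LastName == LastName;

    public override int GetHashCode() => HashCode.Combine(FirstName, LastName);
}
```

With this version, both the List<Person> and the HashSet<Person> variants of ContainsDuplicates() agree on what counts as a duplicate.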

Rule Two

Once you create an object X, X.GetHashCode() should never change during the object’s lifetime, no matter how you interact with the object and what public methods and properties you invoke. In other words, X.GetHashCode() should be an instance invariant.

For demo purposes, let’s again use our Person class, but this time with a single FirstName property. Both the Equals and GetHashCode implementations are based on the FirstName value:

    public class Person
    {
        public string FirstName { get; set; }

        public override bool Equals(object obj) =>
            obj is Person person && person.FirstName == FirstName;

        public override int GetHashCode() => FirstName.GetHashCode();
    }

Now let’s examine the following driver code:

    public static void Main(string[] args)
    {
        var distinctPeople = new HashSet<Person>();
        var person = new Person { FirstName = "FirstName" };
        distinctPeople.Add(person);

        var lookFor = new Person { FirstName = "FirstName" };
        Console.WriteLine(distinctPeople.Contains(lookFor)); // True
    }

This program outputs True because the FirstName of lookFor is the same as the FirstName of person.

Let’s see an example where we change the FirstName value after adding the person to the hash set:

    public static void Main(string[] args)
    {
        var distinctPeople = new HashSet<Person>();
        var person = new Person { FirstName = "John" };
        distinctPeople.Add(person);

        person.FirstName = "Peter";

        var lookFor = new Person { FirstName = "Peter" };
        Console.WriteLine(distinctPeople.Contains(lookFor)); // False
    }

The output here is False. Here is what happens:

1. A Person object person is created with FirstName being John.
2. The person is added to the hash set, into a bucket corresponding to the hash value of John.
3. The person’s FirstName is changed to Peter. However, the person object is still in the position in the hash set it was put in initially, corresponding to the hash value of John.
4. The lookFor object is created with the most recent name of the person, Peter.
5. When we do a lookup in the hash set for lookFor, the lookup algorithm calculates the hash value of lookFor, which in this case is based on Peter. It can’t find anything because the person that was initially stored is in a bucket corresponding to the John hash code.

So the hash set is left in an invalid state: we have an object in a bucket that doesn’t correspond to its current hash code. Such scenarios can lead to all sorts of problems causing a lot of headaches. So make sure the properties you use in your hash function can never change during the lifetime of the object! You can, for example, declare the properties as read-only and initialize them only through the constructor. Or take it even further and implement a truly immutable type for which every state-changing operation creates a fresh new object for the client.
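One way to enforce this, sketched below under the assumption that the type only needs a first name, is to make the property read-only and expose change operations that return a new instance (the WithFirstName helper is a hypothetical name invented for this sketch):

```csharp
public class Person
{
    // Read-only: can only be set through the constructor,
    // so the hash code can never drift after construction.
    public string FirstName { get; }

    public Person(string firstName) => FirstName = firstName;

    public override bool Equals(object obj) =>
        obj is Person person && person.FirstName == FirstName;

    public override int GetHashCode() => FirstName.GetHashCode();

    // "Changing" the name yields a fresh object; anything already
    // stored in a hash container stays in a valid bucket.
    public Person WithFirstName(string firstName) => new Person(firstName);
}
```

With this design, the broken scenario above becomes impossible to express: person.FirstName = "Peter" no longer compiles, and WithFirstName("Peter") returns a separate object.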

Also note that this rule applies only to reference types. Value types are copied into the hash container, so even if you change the object you hold, this will not change the one that was stored in the container. For a better understanding, imagine you have some int value, add it to a hash set and change it afterwards. You are changing your local copy, but the one stored in the hash set stays unmodified.
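The analogy in code, as a trivial sketch:

```csharp
using System;
using System.Collections.Generic;

var numbers = new HashSet<int>();
int n = 5;
numbers.Add(n);     // a copy of the value 5 is stored in the set
n = 10;             // only the local variable changes

Console.WriteLine(numbers.Contains(5));  // True: the stored copy is untouched
Console.WriteLine(numbers.Contains(10)); // False
```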

Summary

If you are about to implement GetHashCode() for your custom type, you should probably think again about whether that’s really necessary. In most cases, we already have some sort of ID as part of the object. It can be the database ID or some other unique identifier. And if you need your object to be used as a key in a hash collection, just use the ID field directly instead of the object itself. This will drastically reduce the mental burden for everyone else working with your code, or even your future self in 6 months.
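For example, assuming the object carries a unique Id (a hypothetical property added here for illustration), it can serve as the dictionary key directly, with no Equals() or GetHashCode() overrides on Person at all:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Person with a unique identifier; no overrides needed.
public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        // The int key already has well-behaved equality and hashing.
        var peopleById = new Dictionary<int, Person>();
        var person = new Person { Id = 42, FirstName = "FirstName" };
        peopleById[person.Id] = person;

        Console.WriteLine(peopleById.ContainsKey(42)); // True
    }
}
```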

Hope I managed to unravel some of the GetHashCode() pitfalls and you found this article useful!

Resources