Once again we consider some of the lesser known classes and keywords of C#. Today we will be looking at two set implementations in the System.Collections.Generic namespace: HashSet<T> and SortedSet<T>. Even though most people think of sets as mathematical constructs, they are actually very useful classes that can be used to help make your application more performant if used appropriately.
A Background From Math
In mathematical terms, a set is an unordered collection of unique items. In other words, the set {2,3,5} is identical to the set {3,5,2}. In addition, the set {2, 2, 4, 1} would be invalid because it would have a duplicate item (2). In addition, you can perform set arithmetic on sets such as:
Intersections:
The intersection of two sets is the collection of elements common to both.
Example: The intersection of {1,2,5} and {2,4,9} is the set {2}.
Unions:
The union of two sets is the collection of unique items present in either or both set.
Example: The union of {1,2,5} and {2,4,9} is {1,2,4,5,9}.
Differences:
The difference of two sets is the removal of all items from the first set that are common between the sets.
Example: The difference of {1,2,5} and {2,4,9} is {1,5}.
Supersets:
One set is a superset of a second set if it contains all elements that are in the second set.
Example: The set {1,2,5} is a superset of {1,5}.
Subsets:
One set is a subset of a second set if all the elements of that set are contained in the first set.
Example: The set {1,5} is a subset of {1,2,5}.
If We’re Not Doing Math, Why Do We Care?
Now, you may be thinking: why bother with the set classes in C# if you have no need for mathematical set manipulation? The answer is simple: they are extremely efficient ways to determine ownership in a collection.
For example, let’s say you are designing an order system that tracks the price of a particular equity, and once it reaches a certain point will trigger an order. Now, since there’s tens of thousands of equities on the markets, you don’t want to track market data for every ticker as that would be a waste of time and processing power for symbols you don’t have orders for. Thus, we just want to subscribe to the stock symbol for an equity order only if it is a symbol we are not already subscribed to.
Every time a new order comes in, we will check the list of subscriptions to see if the new order’s stock symbol is in that list. If it is, great, we already have that market data feed! If not, then and only then should we subscribe to the feed for that symbol.
So far so good, we have a collection of symbols and we want to see if a symbol is present in that collection and if not, add it. This really is the essence of set processing, but for the sake of comparison, let’s say you do a list instead:
1: // class that handles are order processing service
2: public sealed class OrderProcessor
3: {
4: // contains list of all symbols we are currently subscribed to
5: private readonly List<string> _subscriptions = new List<string>();
6:
7: ...
8: }
Now whenever you are adding a new order, it would look something like:
1: public PlaceOrderResponse PlaceOrder(Order newOrder)
2: {
3: // do some validation, of course...
4:
5: // check to see if already subscribed, if not add a subscription
6: if (!_subscriptions.Contains(newOrder.Symbol))
7: {
8: // add the symbol to the list
9: _subscriptions.Add(newOrder.Symbol);
10:
11: // do whatever magic is needed to start a subscription for the symbol
12: }
13:
14: // place the order logic!
15: }
What’s wrong with this? In short: performance! Finding an item inside a List<T> is a linear - O(n) – operation, which is not a very performant way to find if an item exists in a collection.
(I used to teach algorithms and data structures in my spare time at a local university, and when you began talking about big-O notation you could immediately begin to see eyes glossing over as if it was pure, useless theory that would not apply in the real world, but I did and still do believe it is something worth understanding well to make the best choices in computer science).
Let’s think about this: a linear operation means that as the number of items increases, the time that it takes to perform the operation tends to increase in a linear fashion. Put crudely, this means if you double the collection size, you might expect the operation to take something like the order of twice as long.
Linear operations tend to be bad for performance because they mean that to perform some operation on a collection, you must potentially “visit” every item in the collection. Consider finding an item in a List<T>: if you want to see if the list has an item, you must potentially check every item in the list before you find it or determine it’s not found.
Now, we could of course sort our list and then perform a binary search on it, but sorting is typically a linear-logarithmic complexity – O(n * log n) - and could involve temporary storage. So performing a sort after each add would probably add more time.
As an alternative, we could use a SortedList<TKey, TValue> which sorts the list on every Add(), but this has a similar level of complexity to move the items and also requires a key and value, and in our case the key is the value.
This is why sets tend to be the best choice for this type of processing: they don’t rely on separate keys and values for ordering – so they save space – and they typically don’t care about ordering – so they tend to be extremely performant.
The .NET BCL (Base Class Library) has had the HashSet<T> since .NET 3.5, but at that time it did not implement the ISet<T> interface. As of .NET 4.0, HashSet<T> implements ISet<T> and a new set, the SortedSet<T> was added that gives you a set with ordering.
HashSet<T> – For Unordered Storage of Sets
When used right, HashSet<T> is a beautiful collection, you can think of it as a simplified Dictionary<T,T>. That is, a Dictionary where the TKey and TValue refer to the same object. This is really an oversimplification, but logically it makes sense. I’ve actually seen people code a Dictionary<T,T> where they store the same thing in the key and the value, and that’s just inefficient because of the extra storage to hold both the key and the value.
As it’s name implies, the HashSet<T> uses a hashing algorithm to find the items in the set, which means it does take up some additional space, but it has lightning fast lookups! Compare the times below between HashSet<T> and List<T>:
Operation
HashSet<T>
List<T>
Add()
O(1)
O(1) at end
O(n) in middle
Remove()
O(1)
O(n)
Contains()
O(1)
O(n)
Now, these times are amortized and represent the typical case. In the very worst case, the operations could be linear if they involve a resizing of the collection – but this is true for both the List and HashSet so that’s a less of an issue when comparing the two.
The key thing to note is that in the general case, HashSet is constant time for adds, removes, and contains! This means that no matter how large the collection is, it takes roughly the exact same amount of time to find an item or determine if it’s not in the collection. Compare this to the List where almost any add or remove must rearrange potentially all the elements! And to find an item in the list (if unsorted) you must search every item in the List.
So as you can see, if you want to create an unordered collection and have very fast lookup and manipulation, the HashSet is a great collection.
And since HashSet<T> implements ICollection<T> and IEnumerable<T>, it supports nearly all the same basic operations as the List<T> and can use the System.Linq extension methods as well.
All we have to do to switch from a List<T> to a HashSet<T> is change our declaration. Since List and HashSet support many of the same members, chances are we won’t need to change much else.
1: public sealed class OrderProcessor
2: {
3: private readonly HashSet<string> _subscriptions = new HashSet<string>();
4:
5: // ...
6:
7: public PlaceOrderResponse PlaceOrder(Order newOrder)
8: {
9: // do some validation, of course...
10:
11: // check to see if already subscribed, if not add a subscription
12: if (!_subscriptions.Contains(newOrder.Symbol))
13: {
14: // add the symbol to the list
15: _subscriptions.Add(newOrder.Symbol);
16:
17: // do whatever magic is needed to start a subscription for the symbol
18: }
19:
20: // place the order logic!
21: }
22:
23: // ...
24: }
25:
Notice, we didn’t change any code other than the declaration for _subscriptions to be a HashSet<T>. Thus, we can pick up the performance improvements in this case with minimal code changes.
SortedSet<T> – Ordered Storage of Sets
Just like HashSet<T> is logically similar to Dictionary<T,T>, the SortedSet<T> is logically similar to the SortedDictionary<T,T>.
The SortedSet can be used when you want to do set operations on a collection, but you want to maintain that collection in sorted order. Now, this is not necessarily mathematically relevant, but if your collection needs do include order, this is the set to use.
So the SortedSet seems to be implemented as a binary tree (possibly a red-black tree) internally. Since binary trees are dynamic structures and non-contiguous (unlike List and SortedList) this means that inserts and deletes do not involve rearranging elements, or changing the linking of the nodes.
There is some overhead in keeping the nodes in order, but it is much smaller than a contiguous storage collection like a List<T>. Let’s compare the three:
Operation
HashSet<T>
SortedSet<T>
List<T>
Add()
O(1)
O(log n)
O(1) at end
O(n) in middle
Remove()
O(1)
O(log n)
O(n)
Contains()
O(1)
O(log n)
O(n)
The MSDN documentation seems to indicate that operations on SortedSet are O(1), but this seems to be inconsistent with its implementation and seems to be a documentation error. There’s actually a separate MSDN document (here) on SortedSet that indicates that it is, in fact, logarithmic in complexity. Let’s put it in layman’s terms: logarithmic means you can double the collection size and typically you only add a single extra “visit” to an item in the collection.
Take that in contrast to List<T>’s linear operation where if you double the size of the collection you double the “visits” to items in the collection. This is very good performance! It’s still not as performant as HashSet<T> where it always just visits one item (amortized), but for the addition of sorting this is a good thing.
Consider the following table, now this is just illustrative data of the relative complexities, but it’s enough to get the point:
Collection Size
O(1) Visits
O(log n) Visits
O(n) Visits
1
1
1
1
10
1
4
10
100
1
7
100
1000
1
10
1000
Notice that the logarithmic – O(log n) – visit count goes up very slowly compare to the linear – O(n) – visit count. This is because since the list is sorted, it can do one check in the middle of the list, determine which half of the collection the data is in, and discard the other half (binary search).
So, if you need your set to be sorted, you can use the SortedSet<T> just like the HashSet<T> and gain sorting for a small performance hit, but it’s still faster than a List<T>.
Unique Set Operations
Now, if you do want to perform more set-like operations, both implementations of ISet<T> support the following, which play back towards the mathematical set operations described before:
IntersectWith() – Performs the set intersection of two sets. Modifies the current set so that it only contains elements also in the second set.
UnionWith() – Performs a set union of two sets. Modifies the current set so it contains all elements present both in the current set and the second set.
ExceptWith() – Performs a set difference of two sets. Modifies the current set so that it removes all elements present in the second set.
IsSupersetOf() – Checks if the current set is a superset of the second set.
IsSubsetOf() – Checks if the current set is a subset of the second set.
For more information on the set operations themselves, see the MSDN description of ISet<T> (here).
What Sets Don’t Do
Don’t get me wrong, sets are not silver bullets. You don’t really want to use a set when you want separate key to value lookups, that’s what the IDictionary implementations are best for.
Also sets don’t store temporal add-order. That is, if you are adding items to the end of a list all the time, your list is ordered in terms of when items were added to it. This is something the sets don’t do naturally (though you could use a SortedSet with an IComparer with a DateTime but that’s overkill) but List<T> can.
Also, List<T> allows indexing which is a blazingly fast way to iterate through items in the collection. Iterating over all the items in a List<T> is generally much, much faster than iterating over a set.
Summary
Sets are an excellent tool for maintaining a lookup table where the item is both the key and the value. In addition, if you have need for the mathematical set operations, the C# sets support those as well.
The HashSet<T> is the set of choice if you want the fastest possible lookups but don’t care about order. In contrast the SortedSet<T> will give you a sorted collection at a slight reduction in performance.
Technorati Tags: C#,.Net,Little Wonders,BlackRabbitCoder,ISet,HashSet,SortedSet