SortedDictionary and SortedList

Posted by Simon Cooper on Simple Talk See other posts from Simple Talk or by Simon Cooper
Published on Wed, 05 Oct 2011 16:59:00 GMT Indexed on 2011/11/11 18:18 UTC
Read the original article Hit count: 409

Filed under:

Apart from Dictionary<TKey, TValue>, there's two other dictionaries in the BCL - SortedDictionary<TKey, TValue> and SortedList<TKey, TValue>. On the face of it, these two classes do the same thing - provide an IDictionary<TKey, TValue> interface where the iterator returns the items sorted by the key. So what's the difference between them, and when should you use one rather than the other? (as in my previous post, I'll assume you have some basic algorithm & datastructure knowledge)

SortedDictionary

We'll first cover SortedDictionary. This is implemented as a special sort of binary tree called a red-black tree. Essentially, it's a binary tree that uses various constraints on how the nodes of the tree can be arranged to ensure the tree is always roughly balanced (for more gory algorithmical details, see the wikipedia link above).

What I'm concerned about in this post is how the .NET SortedDictionary is actually implemented. In .NET 4, behind the scenes, the actual implementation of the tree is delegated to a SortedSet<KeyValuePair<TKey, TValue>>. One example tree might look like this:

Each node in the above tree is stored as a separate SortedSet<T>.Node object (remember, in a SortedDictionary, T is instantiated to KeyValuePair<TKey, TValue>):

class Node {
    public bool IsRed;
    public T Item;
    public SortedSet<T>.Node Left;
    public SortedSet<T>.Node Right;
}

The SortedSet only stores a reference to the root node; all the data in the tree is accessed by traversing the Left and Right node references until you reach the node you're looking for. Each individual node can be physically stored anywhere in memory; what's important is the relationship between the nodes.

This is also why there is no constructor to SortedDictionary or SortedSet that takes an integer representing the capacity; there are no internal arrays that need to be created and resized. This may seen trivial, but it's an important distinction between SortedDictionary and SortedList that I'll cover later on.

And that's pretty much it; it's a standard red-black tree. Plenty of webpages and datastructure books cover the algorithms behind the tree itself far better than I could. What's interesting is the comparions between SortedDictionary and SortedList, which I'll cover at the end.

As a side point, SortedDictionary has existed in the BCL ever since .NET 2. That means that, all through .NET 2, 3, and 3.5, there has been a bona-fide sorted set class in the BCL (called TreeSet). However, it was internal, so it couldn't be used outside System.dll. Only in .NET 4 was this class exposed as SortedSet.

SortedList

Whereas SortedDictionary didn't use any backing arrays, SortedList does. It is implemented just as the name suggests; two arrays, one containing the keys, and one the values (I've just used random letters for the values):

The items in the keys array are always guarenteed to be stored in sorted order, and the value corresponding to each key is stored in the same index as the key in the values array. In this example, the value for key item 5 is 'z', and for key item 8 is 'm'.

Whenever an item is inserted or removed from the SortedList, a binary search is run on the keys array to find the correct index, then all the items in the arrays are shifted to accomodate the new or removed item.

For example, if the key 3 was removed, a binary search would be run to find the array index the item was at, then everything above that index would be moved down by one:

and then if the key/value pair {7, 'f'} was added, a binary search would be run on the keys to find the index to insert the new item, and everything above that index would be moved up to accomodate the new item:

If another item was then added, both arrays would be resized (to a length of 10) before the new item was added to the arrays.

As you can see, any insertions or removals in the middle of the list require a proportion of the array contents to be moved; an O(n) operation. However, if the insertion or removal is at the end of the array (ie the largest key), then it's only O(log n); the cost of the binary search to determine it does actually need to be added to the end (excluding the occasional O(n) cost of resizing the arrays to fit more items).

As a side effect of using backing arrays, SortedList offers IList Keys and Values views that simply use the backing keys or values arrays, as well as various methods utilising the array index of stored items, which SortedDictionary does not (and cannot) offer.

The Comparison

So, when should you use one and not the other? Well, here's the important differences:

Memory usage

SortedDictionary and SortedList have got very different memory profiles. SortedDictionary...
1. has a memory overhead of one object instance, a bool, and two references per item. On 64-bit systems, this adds up to ~40 bytes, not including the stored item and the reference to it from the Node object.
2. stores the items in separate objects that can be spread all over the heap. This helps to keep memory fragmentation low, as the individual node objects can be allocated wherever there's a spare 60 bytes.
In contrast, SortedList...
1. has no additional overhead per item (only the reference to it in the array entries), however the backing arrays can be significantly larger than you need; every time the arrays are resized they double in size.
  
  That means that if you add 513 items to a SortedList, the backing arrays will each have a length of 1024. To conteract this, the TrimExcess method resizes the arrays back down to the actual size needed, or you can simply assign list.Capacity = list.Count.
2. stores its items in a continuous block in memory. If the list stores thousands of items, this can cause significant problems with Large Object Heap memory fragmentation as the array resizes, which SortedDictionary doesn't have.
Performance
Operations on a SortedDictionary always have O(log n) performance, regardless of where in the collection you're adding or removing items. In contrast, SortedList has O(n) performance when you're altering the middle of the collection.

If you're adding or removing from the end (ie the largest item), then performance is O(log n), same as SortedDictionary (in practice, it will likely be slightly faster, due to the array items all being in the same area in memory, also called locality of reference).

So, when should you use one and not the other? As always with these sort of things, there are no hard-and-fast rules. But generally, if you:

need to access items using their index within the collection
are populating the dictionary all at once from sorted data
aren't adding or removing keys once it's populated

then use a SortedList. But if you:

don't know how many items are going to be in the dictionary
are populating the dictionary from random, unsorted data
are adding & removing items randomly

then use a SortedDictionary.

The default (again, there's no definite rules on these sort of things!) should be to use SortedDictionary, unless there's a good reason to use SortedList, due to the bad performance of SortedList when altering the middle of the collection.

Developer IT