Enumerating Objects with .NET

In this article, I will explain the programming paradigm used within the .NET Framework Base Class Library (BCL) for enumerating objects. Pay particular attention to the section on type safety and performance, because the information presented in this section is very useful when doing all kinds of things in the runtime. In future articles, I’ll explain how the BCL exposes collections of objects. This article focuses on enumerating over the objects in a BCL collection.

Enumerating Over a Set of Items

There are many types that manage a set of items. These types must offer some mechanism that allow items to be individually enumerated. Since this programming paradigm is so common, the Frameworks designers established a standard mechanism for all collection types to adhere to. This standard mechanism comes in the form of the System.Collections.IEnumerable interface:

public interface IEnumerable {
 IEnumerator GetEnumerator();
}

Any type that implements the IEnumerable interface has a set of items that can be enumerated. So, if you understand how to use IEnumerable, then you’ll immediately know how to work with this set’s items. By the way, all of the collection types implement IEnumerable; this interface is implemented by many other types as well, including System.Array, various classes in System.Data, various classes in System.Xml, and more.

The IEnumerable interface has just one method, GetEnumerator. The intention of this method is to return a separate object whose sole purpose is to grant access to the set’s items. This separate object must implement the System.Collections.IEnumerator interface:

public interface IEnumerator {
 Boolean MoveNext();
 Object  Current { get; }
 void    Reset();
}

Each call to GetEnumerator should return a different object implementing IEnumerator. We’ll use the term enumerator to mean an instance of an object that implements the IEnumerator interface. When an enumerator is constructed, it should take a snapshot of the items currently in the set, and a single enumerator always refers to the items in its snapshot. So it’s possible that two successive calls to GetEnumerator will return two different enumerators, each with different items in their snapshots.

For example, imagine an EmailFolder type that implements a GetEnumerator method that returns a snapshot of the messages in the folder. It’s quite possible that calling GetEnumerator twice in a row will create two enumerators, each with a different snapshot, since a new e-mail message could arrive in between getting both enumerators.

There are two common techniques for an enumerator to initialize its snapshot:

  • Copy all items in the set to the enumerator. This technique has the worst performance in terms of both speed and memory, since a duplicate of the set’s items is made for the enumerator. This technique is best if you know that the set typically has a small number of items or if it is extremely important that the enumerator reflect the exact set of items as a specific point in time.
  • Pass a reference to the set to the enumerator. This technique has the lowest overhead and is therefore much more common. However, the set could be changing while items are being enumerated, which could corrupt the integrity of the enumerator. This technique is best used if the set of items remains static while enumerated.

The second technique is so much better than the first that it is the technique of choice 99% of the time. Of course, it’s easy to use this technique when the items being enumerated are static. For example, there is no problem using this technique to enumerate the states in the country since these items never change. However, for non-static sets, the potential for corrupting the enumerator’s integrity is unsatisfactory and something must be done to handle this.

Here is what the framework’s collections do: Each collection type has an instance field that contains a version number. When an instance of the collection is constructed, this version number is set to 0. Each time a method modifies the collection, the version number is incremented. When GetEnumerator is called, it constructs an enumerator; this enumerator object contains the current version number in the collection and a reference to the collection’s items (conserving time and memory). Now, every time the enumerator is accessed, it compares the version of the enumerator with the version of the collection. If the versions do not match, then the collection has changed and an InvalidOperationException is thrown. Your code must be able to catch this exception and gracefully recover if you suspect that the collection’s items may change while they are being enumerated.

So far, we’ve talked about how each enumerator object maintains a logical snapshot of the collection’s items. In addition, each enumerator object also maintains a cursor to the item currently being enumerated. Initially, the cursor is positioned just before the first item in the set, and attempting to request the item at the cursor (using IEnumerator’s read-only Current property) will raise an InvalidOperationException exception.

To access the first item in the set, you must call IEnumerator’s MoveNext method first and then access the Current property. Each call to MoveNext advances the cursor to the next item in the set. MoveNext returns false when the cursor has reached the last item in the set. At this point, you have enumerated all the items and attempting to access the Current property raises an InvalidOperationException exception. If you want, you can call the Reset method, which resets the cursor so that it is placed just before the first item in the set and you can enumerate the same items all over again from the beginning.

Using IEnumerable and IEnumerator

Now that you understand what IEnumerable and IEnumerator are all about, let’s take a look at some code that demonstrates how to use these interfaces to actually enumerate the items in a collection:

// Construct a type that manages a set of items
// This type must offer a public GetEnumerator method
SetType st = new SetType(...);

// To enumerate the set's item, we must first request 
// the enumerator
IEnumerator e = st.GetEnumerator();

// Advance the enumerator's cursor until we reach 
// the last item
while (e.MoveNext()) {

 // Cast the Current item to the proper type
 ItemType it = (ItemType) e.Current;

 // Use the item any way you want
 Console.WriteLine("Do something with this item: " + it);
}

The above programming paradigm is so common that many languages offer a shorthand way to express it. In C#, the above loop can be coded using the foreach statement. The code below is identical to the code shown above:

// Construct a type that manages a set of items
// This type must offer a public GetEnumerator method
SetType st = new SetType(...);

foreach (ItemType it in st) {
 // Use the item any way you want
 Console.WriteLine("Do something with this item: " + it);
}

As you can see, we saved ourselves a few lines of code here. Specifically, the foreach statement automatically generates the code necessary to:

  • Request an enumerator object from the SetType object (st)
  • Call the enumerator’s MoveNext method
  • Cast the value returned from the enumerator’s Current property to an ItemType (it)

Again, don’t forget that arrays also implement IEnumerable, so elements of an array may also be enumerated using the foreach statement.

Type Safety and Performance Issues

There are some issues surrounding enumerators with respect to type safety and performance. For example, the IEnumerator interface’s Current property is defined to return a System.Object. This means that each item in a collection is treated as a generic object. There are two problems with this. First, if we have a collection of System.Strings, then enumerating each string item would be returned as an object, which is then cast to a String. Sure, this cast will always work (assuming that the collection does, in fact, contains Strings). However, cast operations hurt performance since the runtime must determine whether the object you’re casting to a String is, in fact, a String. If the object were not a String, then the usual InvalidCastException would be thrown. Avoiding this cast check would improve performance.

We see the second problem when the items in a collection are value types, such as Point or Rectangle structures. Again, since IEnumerator’s Current property returns a System.Object, each value type item must be converted into its boxed equivalent. This means that an object must be allocated from the managed heap, and the structure’s value must be copied into this object. Later, this object will have to be garbage collected (GC’d). In addition, it is likely that your code will cast the boxed object to a structure variable, which will cause the object to be unboxed and another memory copy to occur. That is why enumerating value type items is particularly bad on performance.

The following code shows how to properly implement a simple collection and also demonstrates the performance issues. First, let’s define a Point value type:

public struct Point { public Int32 x, y; }

Next, let’s define a Triangle type that contains a collection of three points:

class Triangle : IEnumerable {
 // NOTE: Initialization of points left out to simplify code
 private Point[] points = new Point[3];

 public IEnumerator GetEnumerator() {
  return new Enumerator(points);
 }

 public class Enumerator : IEnumerator {
  private Point[] points;
  private int cursor;
  public Enumerator(Point[] points) {
   this.points = points; cursor = -1;
  }

  public void Reset() { cursor = -1; }
  public bool MoveNext() {
   // If not at end, advance cursor
   if (cursor < points.Length) cursor++;

   // Return false if cursor is beyond last item index
   return(!(cursor == points.Length));
  }
  public Object Current {
   get {
   if ((cursor < 0) || (cursor == points.Length))
    throw new InvalidOperationException();
    return points[cursor];
   }
  }
 }
}

Now, let's construct a Triangle and enumerate its points:

Triangle t = new Triangle();
IEnumerator e = t.GetEnumerator();
while (e.MoveNext()) {
 // Current boxes the Point and returns a reference
 // to the boxed object (which will later be GC'd)
 Object o = e.Current;

 // The cast unboxes the object and the assignment copies 
 // the value to p (an unboxed value type)
 Point p = (Point) o;
 // Use 'p' ...
}

For clarity, I avoided using the foreach statement in the code above and instead coded the loop manually. However, you get would get the exact same results if you were to use the foreach statement:

Triangle t = new Triangle();
foreach(Point p in t) {
 // Use 'p' ...
}

The code above clearly shows that enumerating a collection's value type items incurs multiple performance hits. This is ironic, since value types were originally designed into the system in order to improve performance. Fortunately, there is a way to remove these performance hits and also improve type safety. The code below shows the same Triangle class with just a few slight modifications (called out with comments):

class Triangle : IEnumerable {
 Point[] points = new Point[3];

 // New method, not in IEnumerable
 public Enumerator GetEnumerator() {
  return new Enumerator(points);
 }

 // Changed method to an
 // Explicit Interface Method Implementation
 IEnumerator IEnumerable.GetEnumerator() {
  return GetEnumerator(); // Calls non-interface method
 }

 public class Enumerator : IEnumerator {
  private Point[] points;
  private int cursor;

  public Enumerator(Point[] points) {
   this.points = points;
   cursor = -1;
  }

  public void Reset() { cursor = -1; }

  public bool MoveNext() {
   if (cursor < points.Length) {
    // Not at end, advance cursor
    cursor++;
   }
   // Return false if cursor is beyond last item index
   return(!(cursor == points.Length));
  }

  // New property, not in IEnumerator
  // Notice: return type is Point, not Object
  public Point Current {
   get { // Body is identical to previous version 
    if ((cursor < 0) || (cursor == points.Length))
     throw new InvalidOperationException();
    return points[cursor];
   }
  }

  // Changed property to an 
  // Explicit Interface Member Implementation
  Object IEnumerator.Current {
   get { return Current; } // Calls non-interface property
  }
 }
}

With these few modifications, the class becomes quite flexible. Let's examine the impact of these changes. First, to enumerate the points of a Triangle, you call GetEnumerator, as always. However, calling a Triangle's GetEnumerator method now calls a simple member function that is not implementing the IEnumerable's GetEnumerator method. Since this is a simple member function, it may have any return type it desires; instead of returning an IEnumerator, the simple GetEnumerator method returns an instance of the Triangle's nested Enumerator class.

This Enumerator class offers the standard enumerator methods, Reset and MoveNext, as well as the standard property, Current. However, there are two Current properties: one, which is of type Object and another, which is of type Point. The Current property of type Point is not an implementation of IEnumerator's Current property. When the Current property is accessed for an Enumerator object (as opposed to an IEnumerator), then the Current property of type Point is accessed. This property returns a Point value type, removing the need for any boxing or casting!

So, as long as our code accesses the Triangle type and its nested Enumerator type directly, we get a big performance boost. However, a Triangle can also be used generically by casting a Triangle to an IEnumerable and then calling GetEnumerator to get a generic IEnumerator. Using the IEnumerator will not box all the Points and return them as generic objects. Note that the Triangle is flexible enough to be used either way but when it is used generically, there is a performance hit.

You might be wondering how these changes affect C#'s foreach statement. Well, you'll be pleased to know that foreach was designed specifically to take advantage of our improved class. When compiling a foreach statement, the compiler first calls GetEnumerator for the specified collection type. As long as the type offers a GetEnumerator function, all is well; the type doesn't actually have to implement IEnumerable at all.

The type GetEnumerator that returns is used in the foreach statement. As long as this type has a MoveNext method and a Current property, the foreach statement compiles successfully. The object returned from GetEnumerator doesn't have to implement IEnumerator at all for foreach to work. In fact, the object doesn't have to offer a Reset method at all, since foreach never attempts to call Reset.

All of this means that any existing code that enumerates a Triangle's points doesn't have to be modified in any way to gain the performance improvements made to the Triangle class. Just recompile the code, and performance will be better.

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read