Changes in System.IO classes in .NET Framework 4.0

Introduction

Versions of the .NET Framework prior to 4.0 (going back to .NET 2.0) provided many APIs in the System.IO namespace for enumerating the lines in a file, the files in a directory, and so on. These APIs returned arrays, which could then be looped over to process each item.

For example, to print all the lines in a file, you can use the File.ReadAllLines API to get an array of strings (each string representing a line) and then print them by iterating over the array.

  string[] allLines = File.ReadAllLines("foo.txt");
  foreach(string line in allLines)
  {
  	Console.WriteLine(line);
  }

Similarly, to get a list of all the files and directories in a directory, you call the GetFiles and GetDirectories methods on the DirectoryInfo class.

  DirectoryInfo dirInfo = new DirectoryInfo(@"C:\Windows\System32");
  FileInfo[] arrayFileInfo = dirInfo.GetFiles();
  DirectoryInfo[] arrayDirectoryInfo = dirInfo.GetDirectories();

Issues With the Old APIs

While the above APIs worked as expected, they took a considerable performance hit when exercised on large files. The hit stems from the fact that these APIs are synchronous: the call blocks until every line in the file has been read (to populate the array). Imagine how long that takes when you are parsing a 1 GB log file. Another issue, in a memory-constrained execution environment, is the amount of memory needed to allocate the array. Even if you are only interested in the first few lines, you still pay the penalty of loading every line into memory.

New APIs in .NET Framework 4.0

To overcome these issues, in .NET Framework 4.0 the Base Class Library folks on the Common Language Runtime team built new APIs around enumerators rather than arrays. The new APIs are far more efficient because they do not read all the lines into memory at once. And since only one line at a time is read into memory, you can abort your iteration at any point without paying the up-front cost we saw in the older APIs.

The new APIs are:

  • File.ReadLines (+1 overload)
  • File.WriteAllLines (+1 overload)
  • File.AppendAllLines (+1 overload)
  • DirectoryInfo.EnumerateDirectories (+2 overloads)
  • DirectoryInfo.EnumerateFiles (+2 overloads)
  • DirectoryInfo.EnumerateFileSystemInfos (+2 overloads)
  • Directory.EnumerateDirectories (+2 overloads)
  • Directory.EnumerateFiles (+2 overloads)
  • Directory.EnumerateFileSystemEntries (+2 overloads)
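
Note that the write-side methods above gained overloads accepting an IEnumerable<string>, so lines can be streamed to disk without materializing them all in an array first. A minimal sketch of this pattern (the file name and line generator here are hypothetical, not from the article):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class StreamingWrite
{
    // Lazily produce lines; nothing beyond the current line is held in memory.
    static IEnumerable<string> NumberedLines(int count)
    {
        for (int i = 1; i <= count; i++)
            yield return "Line " + i;
    }

    static void Main()
    {
        // File.WriteAllLines consumes the enumerable one item at a time.
        File.WriteAllLines("numbered.txt", NumberedLines(5));
        Console.WriteLine(File.ReadAllLines("numbered.txt").Length); // prints 5
    }
}
```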

These new APIs work by returning an IEnumerable<T>, which is far cheaper to produce than the fully populated array returned by the earlier methods.

The application developer can iterate over the returned enumerable, avoiding the up-front disk I/O the older APIs incur.
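
For instance, pulling just the first few lines of a large file with File.ReadLines only reads as far as the iteration goes. A small sketch (the file name and sizes are illustrative, not from the article):

```csharp
using System;
using System.IO;
using System.Linq;

class FirstLines
{
    static void Main()
    {
        // Create a sample file with many lines, purely for illustration.
        File.WriteAllLines("big.txt",
            Enumerable.Range(1, 100000).Select(i => "Line " + i));

        // Take(3) stops the underlying reader after three lines;
        // the remaining lines are never loaded into memory.
        foreach (string line in File.ReadLines("big.txt").Take(3))
            Console.WriteLine(line); // prints Line 1, Line 2, Line 3
    }
}
```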

Hands On

Here is how the APIs can be used:

  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Text;
  using System.IO;
  
  namespace FileEnumerators
  {
      class Program
      {
          static void Main(string[] args)
          {
              DateTimeOffset tstart = new DateTimeOffset(DateTime.Now);
  
              string[] oldlines = File.ReadAllLines(Environment.ExpandEnvironmentVariables(@"%TEMP%\registry.reg"));
  
              DateTimeOffset tstop = new DateTimeOffset(DateTime.Now);
              TimeSpan difference = tstop - tstart;
              Console.WriteLine("Time taken with old API = " + difference.ToString());
              tstart = new DateTimeOffset(DateTime.Now);
              for (int i = 0; i < oldlines.Length; i++)
              {
                // Don't do anything. Just cycle through.
              }
              
              tstop = new DateTimeOffset(DateTime.Now);
              TimeSpan cycleDifference = tstop - tstart;
              Console.WriteLine("Cycle time taken with old API = " + cycleDifference.ToString());
              
              tstart = new DateTimeOffset(DateTime.Now);
              IEnumerable<string> allLines = File.ReadLines(Environment.ExpandEnvironmentVariables("%TEMP%\\registry.reg"));
              tstop = new DateTimeOffset(DateTime.Now);
              difference = tstop - tstart;
              Console.WriteLine("Time taken with new API = " + difference.ToString());
              tstart = new DateTimeOffset(DateTime.Now);
              foreach (string str in allLines)
              {
                // Don't do anything. Just cycle through.
              }
              tstop = new DateTimeOffset(DateTime.Now);
              cycleDifference = tstop - tstart;
              Console.WriteLine("Cycle time taken with new API = " + cycleDifference.ToString());
          }
      }
  }

The results speak for themselves. On my dual-core system (with one of the two CPUs already pegged at a constant load), I exported my registry to a temp file and ran the above code against it. On a DEBUG build, the results are below:

Time taken with old API = 00:00:11.6660156
Cycle time taken with old API = 00:00:00.0087891
Time taken with new API = 00:00:00
Cycle time taken with new API = 00:00:09.9238281

The new API (built on enumerators) does not block to read the entire contents of a large file; instead, it returns the enumerator immediately, which is why its measured time is essentially zero. The old API loads everything into memory, so it takes the initial hit; however, its cycle time is nearly zero because it never has to touch the hard drive again to get the string values.

Usage of the Performant APIs

The new APIs are very useful when you want to list all the files and directories in a directory that has a lot of content, including many sub-directories.

They are equally useful when you want to read the lines of a very large text file.
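
A sketch of the first scenario, using Directory.EnumerateFiles to walk a directory tree lazily (the starting path and search pattern are placeholders for illustration); each path is yielded as soon as it is found, so results start arriving before the whole tree has been scanned:

```csharp
using System;
using System.IO;

class EnumerateTree
{
    static void Main()
    {
        // SearchOption.AllDirectories recurses into sub-directories;
        // the enumerator streams paths instead of building an array first.
        foreach (string path in Directory.EnumerateFiles(
            ".", "*", SearchOption.AllDirectories))
        {
            Console.WriteLine(path);
            break; // stop after the first match; the rest is never scanned
        }
    }
}
```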

Summary

In this article, we saw how the new enumerator-based APIs in System.IO improve performance by eliminating the up-front cost of reading everything into an array.

About the Author

Vipul Patel

Vipul Patel is a Software Engineer currently working at Microsoft Corporation in the Office Communications Group. He previously worked on the .NET team, in the Base Class Libraries and on the Debugging and Profiling team. He can be reached at vipul_d_patel@hotmail.com
