Managed Extensions: Parsing CSV Files with Regular Expressions



Welcome to this week's installment of .NET Tips & Techniques! Each week, award-winning Architect and Lead Programmer Tom Archer demonstrates how to perform a practical .NET programming task using either C# or Managed C++ Extensions.

In my latest book, Extending MFC Applications with the .NET Framework, I devote an entire chapter to using the .NET Regular Expression classes. In that chapter, I even included a regular expression that can parse text for essentially any e-mail address format. Since the book's publication, many readers have requested my help with their regular expressions for parsing various types of data. Some of the most popular requests I receive have to do with reading comma-delimited text files (sometimes referred to as "CSV files") and handling scenarios where the data contains quotes, commas, and blanks. Therefore, in this week's installment of the .NET Tips & Techniques series, I present a very simple means of handling these cases.

Returning Comma-delimited Data in an Array

In the name of reusability, I've placed the text-parsing code into a class called Csv and provided a static method (LineToArray) that takes a comma-delimited string and returns an array of String objects, where each string represents one field of that line. That way, the Csv class's client need only call this method and then use a for loop to enumerate the array. Here is that class/method:

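using namespace System;
using namespace System::Text::RegularExpressions;

__gc class Csv
{
public:
   static String* LineToArray(String* line) __gc[]
   {
      // split on each comma that is followed by an even number of
      // double quotes (that is, a comma outside a quoted field)
      String* pattern = S",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";
      Regex* r = new Regex(pattern);

      return r->Split(line);
   }
};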
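
To see what the pattern does without a .NET build environment, here is a minimal sketch of the same split in standard C++. It is an approximation under stated assumptions: it uses std::regex (ECMAScript grammar) rather than the .NET Regex class, and the function name line_to_fields is a hypothetical stand-in for Csv::LineToArray, not part of the article's demo.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for Csv::LineToArray, written against std::regex
// (not the .NET Regex class) so it builds with any C++11 compiler.
std::vector<std::string> line_to_fields(const std::string& line) {
    // A comma counts as a delimiter only when an even number of double
    // quotes follows it, meaning the comma is not inside a quoted field.
    static const std::regex delimiter(
        R"re(,(?=(?:[^"]*"[^"]*")*(?![^"]*")))re");

    std::vector<std::string> fields;
    std::size_t start = 0;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(line.begin(), line.end(), delimiter);
         it != end; ++it) {
        const std::size_t pos = static_cast<std::size_t>(it->position());
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // skip the one-character comma match
    }
    fields.push_back(line.substr(start));  // text after the last delimiter
    return fields;
}
```

Given the line 1,"Archer, Tom",Redmond this sketch yields three fields, and the comma inside the quoted name is left alone. Note that the pattern only decides where to split: the surrounding quotes stay in the returned field, so a caller that wants the bare value still has to trim them.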

Using the StreamReader and Csv Classes

At this point, the client can focus on opening and reading the text file, calling the Csv::LineToArray method (for each line of text read), and iterating through the returned array of String objects. Reading a text file can be accomplished in several ways. I typically use the StreamReader class because my language of choice is Visual C++/MFC and this class closely mimics the interface of the MFC CStdioFile class.

The two main StreamReader methods used for reading are ReadToEnd and ReadLine. The difference between the two is that the ReadToEnd method reads the entire remaining file into a single String object, whereas the ReadLine method reads one line of text at a time (a line ends at a line feed, a carriage return, or a carriage-return/line-feed pair). When reading a text file where each record will be treated independently, you'll most likely use the ReadLine method.
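
For readers more at home with the standard C++ library, the same distinction can be sketched with iostreams. This is only a rough analogue, not the StreamReader API: read_to_end and read_lines are hypothetical helper names, and std::getline stands in for ReadLine.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Rough analogue of ReadToEnd: pull the whole stream into one string.
std::string read_to_end(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Rough analogue of calling ReadLine in a loop: one string per line.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        // std::getline stops at '\n' only, so drop a trailing '\r'
        // left behind by a carriage-return/line-feed pair.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

The first helper leaves the line terminators in the returned string, so the caller would still have to find record boundaries. The second hands back one record per element, which is why the line-at-a-time approach suits CSV data.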

The following code snippet simply opens and reads each line of text from a file (c:\data.txt):

using namespace System::IO;


StreamReader* reader = NULL;
try
{
   // load data from text (csv) file
   reader = new StreamReader(S"c:\\data.txt");
   String* data;
   String* dataArray[];
   int currRec = 0;

   // Peek returns -1 when there is no more data to read
   while (-1 != reader->Peek())
   {
      // get a single line of text
      data = reader->ReadLine();

      // call routine to place delimited
      // text into an array
      dataArray = Csv::LineToArray(data);

      // print the array of text items
      Console::WriteLine(S"Record {0} : ", __box(currRec++));
      for (int i = 0; i < dataArray->Length; i++)
         Console::WriteLine(S"[{0}] = [{1}]", __box(i), dataArray[i]);
   }
}
catch(Exception* e)
{
   // report the failure (for example, a missing file)
   Console::WriteLine(e->Message);
}
__finally
{
   // close the file whether or not an exception occurred
   if (NULL != reader) reader->Close();
}

Note the use of the StreamReader::Peek method, which doesn't alter the stream's pointer but instead returns the next character to be read. If a value of -1 is returned, that indicates that there is no more data to be read. For each line of text read, the code then calls the Csv::LineToArray method and displays the returned string array's contents.
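
The peek-then-read idiom is not specific to .NET. As a hedged illustration, the sketch below drives the same kind of loop with std::istream::peek, which likewise returns the next character without consuming it and returns an end-of-file sentinel once the data runs out. The helper name read_while_data is hypothetical, and the code is a standard C++ analogue rather than the StreamReader code from the demo.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: keeps reading lines for as long as peek()
// reports that at least one more character is available.
std::vector<std::string> read_while_data(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    // peek() returns the next character without consuming it, or
    // the traits' eof() value once the stream is exhausted.
    while (in.peek() != std::char_traits<char>::eof()) {
        std::getline(in, line);
        lines.push_back(line);
    }
    return lines;
}
```

Because the check happens before each read, an empty input produces no records at all, and a trailing newline at the end of the file does not produce a phantom empty record.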

The following figure illustrates the running of this article's demo against an included sample text file to test the scenarios mentioned at the outset.

About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.

