Managed Extensions: Parsing CSV Files with Regular Expressions

Welcome to this week's installment of .NET Tips & Techniques! Each week, award-winning Architect and Lead Programmer Tom Archer demonstrates how to perform a practical .NET programming task using either C# or Managed C++ Extensions.

In my latest book, Extending MFC Applications with the .NET Framework, I devote an entire chapter to using the .NET Regular Expression classes. In that chapter, I even included a regular expression that can parse text for essentially any e-mail address format. Since the book's publication, many readers have requested my help with their regular expressions for parsing various types of data. Some of the most popular requests I receive have to do with reading comma-delimited text files (sometimes referred to as "CSV files") and handling scenarios where the data contains quotes, commas, and blanks. Therefore, in this week's installment of the .NET Tips & Techniques series, I present a very simple means of handling these cases.

Returning Comma-delimited Data in an Array

In the name of reusability, I've placed the text-parsing code into a class called Csv and provided a static method (LineToArray) that takes a comma-delimited string and returns an array of String objects, where each string represents a row of data. That way, the Csv class's client need only call this method and then use a for loop to enumerate the array. Here is that class/method:

using namespace System::Text::RegularExpressions;

...

__gc class Csv
{
public:
   static String* LineToArray(String* line) __gc[]
   {
      String* pattern = S",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";
      Regex* r = new Regex(pattern);

      return r->Split(line);
   }
};

Using the StreamReader and Csv Classes

At this point, the client can focus on opening and reading the text file, calling the Csv::LineToArray method (for each line of text read), and iterating through the returned array of String objects. Reading a text file can be accomplished in several ways. I typically use the StreamReader class because my language of choice is Visual C++/MFC and this class closely mimics the interface of the MFC CStdioFile class.

The two main StreamReader methods used for reading are ReadToEnd and ReadLine. The difference between the two is that the ReadToEnd method is used in situations where you want to read the entire file into a String object, whereas the ReadLine method is used to read each line of text from an ASCII file (as delimited by a carriage-return/line-feed pair). When reading a text file where each record will be treated independently, you'll most likely use the ReadLine method.

The following code snippet simply opens and reads each line of text from a file (c:\data.txt):

using namespace System::IO;

...

StreamReader* reader = NULL;

try
{
   // load data from text (csv) file
   reader = new StreamReader(S"c:\\data.txt");
   String* data;
   String* dataArray[];
   int currRec = 0;

   while (0 < reader->Peek())
   {
      // get a single line of text
      data = reader->ReadLine();

      // call routine to place delimited 
      // text into an array
      dataArray = Csv::LineToArray(data);

      // print the array of text items
      Console::WriteLine(S"Record {0} : ", __box(currRec++));
      for (int i = 0; i < dataArray->Length; i++)
      Console::WriteLine(S"[{0}] = [{1}]", __box(i), dataArray[i]);
      Console::WriteLine();
   }
   reader->Close();
}
catch(Exception* e)
{
   Console::WriteLine(e->Message);
}
__finally
{
   if (NULL != reader) reader->Close();
}

Note the use of the StreamReader::Peek method, which doesn't alter the stream's pointer but instead returns the next character to be read. If a value of -1 is returned, that indicates that there is no more data to be read. For each line of text read, the code then calls the Csv::LineToArray method and displays the returned string array's contents.

The following figure illustrates the running of this article's demo against an included sample text file to test the scenarios mentioned at the outset.



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.

Downloads

Comments

  • Discard enclosing Double Quotes (") from field value

    Posted by MehdiAnis on 10/27/2008 05:08pm

    This is an excellent solution. Many thanks. I wonder, is there a way not to include begining and end " char for an enclosed field?
    
    Like in the example, [0] would be 
    [0] = text withing quote
    [0] = Mr. Text, you have a comma.
    
    So, we are taking what inside the double quotes, not including them. 
    
    by the way, I can iterate through the Array and look for beg/end "s and get rid of them. But I want the solution built into the reg expression. 
    
    Thank you.
    
    regards,
    Mehdi Anis

    Reply
  • using Pipe (|) as delimiter instead of comma(,)

    Posted by Kunal_shah on 06/27/2006 12:47pm

    The article is really helpful. However when i try to parse a file in which i have a pipe (|) as a delimitter it does not work as expected. It removes a single character at a time and puts it in the array. Could u please tell me where i am going wrong. Thank you Kunal

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: September 10, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Modern mobile applications connect systems-of-engagement (mobile apps) with systems-of-record (traditional IT) to deliver new and innovative business value. But the lifecycle for development of mobile apps is also new and different. Emerging trends in mobile development call for faster delivery of incremental features, coupled with feedback from the users of the app "in the wild". This loop of continuous delivery and continuous feedback is …

  • The first phase of API management was about realizing the business value of APIs. This next wave of API management enables the hyper-connected enterprise to drive and scale their businesses as API models become more complex and sophisticated. Today, real world product launches begin with an API program and strategy in mind. This API-first approach to development will only continue to increase, driven by an increasingly interconnected web of devices, organizations, and people. To support this rapid growth, …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds