.NET Regular Expressions and Captures

Running a regular expression against an input string yields a collection (MatchCollection) of Match objects (see Figure 1). Each Match object holds a collection (GroupCollection) of Group objects. Each Group object within the GroupCollection represents either the entire match or a sub-match that was defined via parentheses.

Figure 1: .NET Regular Expression Class Hierarchy
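
The same hierarchy can be sketched as plain data structures. What follows is a minimal model in standard C++, not the .NET classes themselves: the struct names mirror the real types (in .NET, Group derives from Capture and Match derives from Group), while the members shown and the CountCaptures helper are simplified assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-ins only; the real classes live in
// System::Text::RegularExpressions and expose many more members.
struct Capture
{
    std::string value;      // the text that was consumed
    std::size_t index = 0;  // where that text starts in the input
};

// A group carries the value/index of its most recent capture and keeps
// every capture it made along the way.
struct Group : Capture
{
    std::vector<Capture> captures;  // stands in for CaptureCollection
};

// A match is itself a group (the whole match) and owns the sub-groups.
struct Match : Group
{
    std::vector<Group> groups;      // stands in for GroupCollection; [0] is the whole match
};

using MatchCollection = std::vector<Match>;

// Hypothetical helper: total number of captures recorded across all groups.
std::size_t CountCaptures(const Match& match)
{
    std::size_t total = 0;
    for (const Group& group : match.groups)
        total += group.captures.size();
    return total;
}
```

In this model, a match whose three groups hold one, eight, and one captures would report ten captures in total.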

Although you probably won't encounter many situations where you work directly with captures (Capture and CaptureCollection), this article covers what they are and when they come into play, to present a complete picture of regular expressions.

Working with Captures

One reason that you'll rarely deal with captures is that there's usually only one capture per group. Therefore, you'll generally extract your sub-match information from the Group object. This is also why you'll generally see the terms group and capture used somewhat interchangeably. However, a regular expression will yield groups with multiple captures in a few situations, and this article focuses on those.

Take a simple expression used to extract a time-formatted value:

(?<time>(\d|\:)+)

You have a named group called time that searches for sequences of digits and colons. Imagine using that expression to parse the following value:

Tom Archer posted at 12:34:56 today

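Looking at the expression/input pair, you can see that the parser will yield three groups (including the entire match): group 0 is the entire match, group 1 is the unnamed inner group, and group 2 is the named group time, because .NET numbers unnamed groups before named ones. What is less obvious is that the second group (group 1) will contain eight captures. The following function demonstrates this: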

using namespace System;
using namespace System::Text;
using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;

...

void DisplayGroups(String* input, String* pattern, bool displayCats = false)
{
  try
  {
    StringBuilder* results = new StringBuilder();

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input);
         match->Success;
         match = match->NextMatch())
    {  
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup: Value '{0}' at Index {1}\r\n",
                              groups->Item[i]->Value,
                              __box(groups->Item[i]->Index));
        
        if (displayCats)
        {
          // for all of THIS group's captures
          CaptureCollection* captures = groups->Item[i]->Captures;
          for (int j = 0; j < captures->Count; j++)
          {
            results->AppendFormat(S"\t\tCapture: Value '{0}' at Index {1}\r\n",
                                  captures->Item[j]->Value,
                                  __box(captures->Item[j]->Index));
          }
        }
      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

Calling the function in the following manner will result in what you see in Figure 2:

DisplayGroups(S"Tom Archer posted at 12:34:56 today",
              S"(?<time>(\\d|\\:)+)",
              true);

Figure 2: Example of Groups vs. Captures

So, the obvious question is: Why so many captures? The second group (the inner parentheses in the test expression) states that a sub-match can be made on any single digit or colon:

(\d|\:)

Because the input string contains eight such instances of a digit or colon (the eight characters of 12:34:56), you get eight captures! Therefore, while I frequently find it more convenient to use the term group to characterize the result of a sub-match, sub-matches technically result in captures.
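
The count of eight can also be reproduced outside .NET. The sketch below uses standard C++ std::regex, which is a different engine and has no CaptureCollection, so it emulates one: after finding the overall match, it re-scans the matched text with the inner alternative alone and records each piece with its position in the input. The function names EmulateCaptures and LastIteration are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: lists each piece of text the inner group (\d|:)
// consumes, paired with its index in the input. This imitates what .NET
// stores in a group's CaptureCollection; it is not the .NET API.
std::vector<std::pair<std::string, std::size_t>>
EmulateCaptures(const std::string& input)
{
    std::vector<std::pair<std::string, std::size_t>> captures;

    static const std::regex outer("(\\d|:)+");  // the repeated group
    static const std::regex inner("\\d|:");     // one repetition of it

    std::smatch match;
    if (!std::regex_search(input, match, outer))
        return captures;

    const std::string whole = match[0].str();
    const std::size_t base = static_cast<std::size_t>(match.position(0));

    // Every hit of the inner pattern corresponds to one repetition of the group.
    for (std::sregex_iterator it(whole.begin(), whole.end(), inner), end;
         it != end; ++it)
    {
        captures.emplace_back(it->str(),
                              base + static_cast<std::size_t>(it->position()));
    }
    return captures;
}

// Hypothetical helper: std::regex keeps only the final repetition of a
// repeated group in its sub-match, so this returns that last piece.
std::string LastIteration(const std::string& input)
{
    static const std::regex outer("(\\d|:)+");
    std::smatch match;
    return std::regex_search(input, match, outer) ? match[1].str()
                                                  : std::string();
}
```

Called with the sample input, EmulateCaptures should return eight entries, from '1' at index 21 through '6' at index 28, while LastIteration returns "6". The latter parallels .NET, where the Value of a group with several captures is that of its last capture.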

Having said that, because groups do typically contain a single capture, you can use the two terms interchangeably unless you have a specific reason to differentiate between them—as in the example in this article.

Turning Off Captures

As previously mentioned, most of the time you will not care about captures and will extract the needed information from the group collection. In that case, I suggest simply "turning off" capturing for the groups you don't need. One way to do that is to specify the RegexOptions::ExplicitCapture option when constructing the Regex object, so that only explicitly named or numbered groups capture. Using the example DisplayGroups function, this would look like the following:

Regex* rex = new Regex(pattern, RegexOptions::ExplicitCapture);

Another way is to use the "non-capturing groups" syntax, which simply means to follow the opening parenthesis with the "?:" combination. Once again, using the example regular expression, this would look like the following:

(?<time>(?:\d|\:)+)
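
The effect of "?:" on the number of groups can be checked with a small standard C++ sketch. It rests on two assumptions: std::regex (ECMAScript grammar) stands in for the .NET engine, and because std::regex does not support named groups, the outer group is left unnamed. CountGroups is a hypothetical helper, not part of either library.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of entries a successful search produces.
// Entry 0 is the whole match and each capturing group adds one more.
// Returns 0 when the pattern does not match at all.
std::size_t CountGroups(const std::string& input, const std::string& pattern)
{
    const std::regex rex(pattern);
    std::smatch match;
    return std::regex_search(input, match, rex) ? match.size() : 0;
}
```

For the sample input, the pattern "((\\d|:)+)" should yield three entries (whole match, outer group, inner group), whereas "((?:\\d|:)+)" should yield two, because the inner parentheses now only group and no longer capture.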

Figure 3 shows the results of running the DisplayGroups function with the inner group's captures suppressed. The unnamed inner group no longer appears, and the only object in the time group's CaptureCollection is the capture for the entire group, which is also the entire match.

Figure 3: Example of Suppressing Captures

Looking Ahead

In future tips, I'll cover how to take what you've discovered so far in this series and apply it to common tasks, such as searching for and replacing text, replacing matches with substitution patterns, and finally using a very advanced expression to parse e-mail addresses of almost any format from a body of text.



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.
