Named and Non-Capturing Groups in .NET Regular Expressions

The previous article introduced the .NET Group and GroupCollection classes, explained their uses and place in the .NET regular expression class hierarchy, and gave an example of how to define and use groups to extract or isolate sub-matches of a regular expression match. This article covers two more facets of using groups: creating named groups and defining non-capturing groups.

Named Groups

As a refresher, the following is an example of a simple regular expression pattern for North American phone numbers. The parentheses define the group, isolating the area code of the phone number:

(\d{3})-\d{3}-\d{4}

The following snippet shows how to instantiate a Regex object using this pattern, enumerate the resulting matches, and extract the specific group from each match's Groups member. (The Match::Groups member is a collection of type GroupCollection that contains all the Group objects related to the match.)

String* input   = S"test: 404-555-1212 and 770-555-1212";
String* pattern = S"(\\d{3})-\\d{3}-\\d{4}";

Regex* rex = new Regex(pattern);

// for all the matches
for (Match* match = rex->Match(input);
      match->Success;
      match = match->NextMatch())
{
  MessageBox::Show(match->Groups->Item[1]->Value);
}

This works, but the index into the Groups object is hard-coded, which is a problem. (The value I used is 1 because the entire match is the first entry in the Groups collection and is referenced at index 0.) As a result, if you or someone else later changes the pattern so that the second entry in the Groups collection is no longer the area code, any code that depends on this order will extract the wrong text. This tip illustrates how to name your groups so as to isolate your code from changes to the pattern.
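To make the fragility concrete, here is a minimal sketch. It is written in Java rather than Managed C++ purely so that it can be run standalone; java.util.regex numbers groups the same way for this pattern. The class name IndexFragilityDemo, the helper firstGroupOne, and the "later edit" to the pattern are hypothetical, invented only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexFragilityDemo {
    // Returns the text of group 1 for the first match, or null if nothing matches.
    static String firstGroupOne(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "test: 404-555-1212 and 770-555-1212";

        // Original pattern: group 1 is the area code.
        System.out.println(firstGroupOne("(\\d{3})-\\d{3}-\\d{4}", input)); // prints 404

        // Hypothetical later edit: a new group is added in front to capture the label.
        // Group 1 is now the label, so the same index returns the wrong text.
        System.out.println(firstGroupOne("(\\w+): (\\d{3})-\\d{3}-\\d{4}", input)); // prints test
    }
}
```

Nothing throws in the second case; the code simply reports the wrong value, which is why this kind of breakage is easy to miss.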

The syntax for naming a group, while not very intuitive, is easy and looks like this:

(?<GROUPNAME>expression)

Therefore, to name the area code group, you would alter the pattern as follows (when the pattern is written as a C++ string literal, each backslash must be doubled, as in the earlier snippet):

(?<AreaCode>\d{3})-\d{3}-\d{4}

Now that the group is named, you just need to know how to extract it. The previous code snippet used the Groups object's Item indexer property to extract the desired group by index value. This is because the Groups object is an instance of the GroupCollection class and, like almost all collection classes, its Item property takes a numeric parameter that represents the desired object's place in the collection: 0 for the first object, 1 for the second object, and so on.

In the case of the GroupCollection::Item property, it is overloaded to also accept a value of type String that represents the actual name given to the group. Therefore, using the example where I gave the name of AreaCode to the group, I can now extract that value using either of these forms:

match->Groups->Item[1]->Value
match->Groups->Item[S"AreaCode"]->Value

The second (named group) form is preferred and strongly recommended because it better isolates the code from changes to the pattern.
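The following sketch shows lookup by name surviving a change to the pattern. It uses Java's java.util.regex as a stand-in for the .NET classes, because it accepts the same (?<GROUPNAME>expression) syntax and can be run standalone. The class name NamedGroupDemo, the helper areaCodes, and the modified pattern are assumptions made for this example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Collects the group named AreaCode from every match in the input.
    static List<String> areaCodes(String pattern, String input) {
        List<String> codes = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(input);
        while (m.find()) {
            codes.add(m.group("AreaCode")); // lookup by name, not by position
        }
        return codes;
    }

    public static void main(String[] args) {
        String input = "test: 404-555-1212 and 770-555-1212";

        // The named version of the article's pattern.
        System.out.println(areaCodes("(?<AreaCode>\\d{3})-\\d{3}-\\d{4}", input)); // prints [404, 770]

        // Hypothetical edit: an extra group now precedes the area code.
        // The numeric index of the area code changes, but its name does not.
        System.out.println(areaCodes("(\\w+:? )(?<AreaCode>\\d{3})-\\d{3}-\\d{4}", input)); // prints [404, 770]
    }
}
```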

Non-Capturing Groups

Groups are not always defined in order to create sub-matches. Sometimes, groups get created as a side effect of the parenthetical syntax used to isolate a part of the expression so that modifiers, operators, or quantifiers can act on that part. Regardless of your reason for isolating a part of the pattern, once you do it using parentheses, the regular expression parser creates a Group object for that group within each Match object's group collection (Groups).

An example might better explain what I mean. Say you have a pattern to search a string for occurrences of the words "on", "an", and "in":

((A|a)|(O|o)|(I|i))n\s

If you tested this pattern with the following function, which simply displays all the groups of each match, you'd find that each match results in five groups: one for the entire match plus one for each of the four pairs of parentheses in the pattern.

void DisplayGroups(String* input, String* pattern)
{
  try
  {
    StringBuilder* results = new StringBuilder();

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input);
         match->Success;
         match = match->NextMatch())
    {
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup {0} at {1}\r\n",
                              (groups->Item[i]->Value),
                              __box(groups->Item[i]->Index));

      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

Figure 1 shows the results of running the following code using the DisplayGroups function:

// Example usage of the DisplayGroups function
DisplayGroups(S"Tommy sat on a chair, in a room",
              S"((A|a)|(O|o)|(I|i))n\\s");

[Regex0501.jpg]

Figure 1: Using Parentheses in Patterns Always Creates Groups
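For readers who want to reproduce the group listing themselves, here is a rough equivalent of DisplayGroups sketched in Java's java.util.regex, chosen only because it runs standalone and parses this pattern the same way. The class name GroupDumpDemo and the helper groupsPerMatch are hypothetical. One difference to be aware of: Java returns null for a group that did not take part in a match, so this sketch substitutes an empty string to mirror what the .NET Group::Value property reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDumpDemo {
    // For each match, returns the text of every group, with group 0 (the whole match) first.
    static List<List<String>> groupsPerMatch(String pattern, String input) {
        List<List<String>> all = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(input);
        while (m.find()) {
            List<String> groups = new ArrayList<>();
            // groupCount() excludes group 0, so <= is needed to include the whole match.
            for (int i = 0; i <= m.groupCount(); i++) {
                String value = m.group(i);
                groups.add(value == null ? "" : value); // group did not participate
            }
            all.add(groups);
        }
        return all;
    }

    public static void main(String[] args) {
        String input = "Tommy sat on a chair, in a room";
        for (List<String> groups : groupsPerMatch("((A|a)|(O|o)|(I|i))n\\s", input)) {
            System.out.println(groups.size() + " groups: " + groups);
        }
    }
}
```

The input contains two matches ("on " and "in "), and each one carries five entries even though only two of the four parenthesized groups hold any text in a given match.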

Therefore, for efficiency, especially if you're processing huge amounts of data with your regular expressions, define each group you don't need to extract as "non-capturing" by placing ?: immediately after its opening parenthesis, as follows:

(?:(?:A|a)|(?:O|o)|(?:I|i))n\s

If you run this pattern through the DisplayGroups function, you'll see that the only group created for each match is the one that represents the entire match (see Figure 2). (There is no way to eliminate that group.)

[Regex0502.jpg]

Figure 2: The Only Groups Created Represent the Entire Match
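The effect can also be checked without a screenshot. The sketch below again uses Java's java.util.regex as a runnable stand-in, since it accepts the same (?:expression) syntax. The class name NonCapturingDemo and both helpers are hypothetical. Note that Java's groupCount() does not count the entire match, so it reports one less than the .NET Groups collection's Count would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups in the pattern, not counting group 0 (the whole match).
    static int capturingGroups(String pattern) {
        return Pattern.compile(pattern).matcher("").groupCount();
    }

    // Text of every match found in the input.
    static List<String> matches(String pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String capturing = "((A|a)|(O|o)|(I|i))n\\s";
        String nonCapturing = "(?:(?:A|a)|(?:O|o)|(?:I|i))n\\s";
        String input = "Tommy sat on a chair, in a room";

        System.out.println(capturingGroups(capturing));    // prints 4
        System.out.println(capturingGroups(nonCapturing)); // prints 0

        // Both patterns still find exactly the same text.
        System.out.println(matches(capturing, input).equals(matches(nonCapturing, input))); // prints true
    }
}
```

Switching to non-capturing groups changes only what is recorded for each match, not what the pattern matches.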

Looking Ahead

The previous two articles covered groups and group collections in regular expressions. However, one area closely tied to groups that I haven't touched on yet is captures. Therefore, the upcoming articles will explain what captures are and how they relate to matches and groups.



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.
