Using Regular Expression Groups to Isolate Sub-Matches

My most recent articles introduced several basic functions that you can perform with regular expressions, such as string splitting and using metacharacters for matching. However, before you can really get into the more powerful tasks that can be performed with regular expressions, you need to understand groups and their purpose. Grouping has two purposes:

  • Grouping several parts of a pattern together so that the entire group can be acted on via modifiers, operators, or quantifiers.
  • Denoting "sub-matches" where the pattern dictates that a part of the entire match must be isolated. (An example—which I'll show shortly—is a North American telephone number where the pattern would help locate the phone number and the grouping syntax would enable you to extract just the area code.)

In my two previous articles, I introduced the MatchCollection and Match classes. Now, look at how those classes and the classes for groups and captures fit into the .NET regular expression class hierarchy:

  • Each Regex object has a MatchCollection object (that contains Match objects).
  • Each Match object has a GroupCollection object (that contains Group objects).
  • Each Group object has a CaptureCollection object (that contains Capture objects).

The following figure (modified from my book, Extending MFC Applications with the .NET Framework) graphically illustrates the relationship among the various .NET regular expression classes.

Defining and Enumerating Groups

Groups are denoted simply by placing parentheses around the desired part of the pattern. Take a simple North American telephone number pattern as an example:

\d{3}-\d{3}-\d{4}

The \d represents any decimal digit (0-9), and the {n} quantifier tells the parser to match exactly n occurrences of the preceding item. To isolate the area code, you would simply put parentheses around that part of the pattern:

(\d{3})-\d{3}-\d{4}
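To see the effect of those parentheses in isolation, here is a minimal sketch. It uses standard C++ and std::regex rather than the .NET Regex class covered in this tip, so treat it as an analogue and not as the .NET API. The function name FirstAreaCode is hypothetical and chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Returns the area code (group 1) of the first phone number found in `text`,
// or an empty string when no phone number matches.
// Assumption: standard C++ std::regex stands in for the .NET Regex class.
std::string FirstAreaCode(const std::string& text)
{
  // The parentheses around the first \d{3} define group 1.
  static const std::regex phone(R"((\d{3})-\d{3}-\d{4})");

  std::smatch match;
  if (!std::regex_search(text, match, phone))
    return "";

  // match[0] is the whole match; match[1] is the first parenthesized group.
  return match[1].str();
}
```

Calling FirstAreaCode with a string that contains 770-555-1212 should hand back only the 770 portion, while the full phone number remains available as the match itself.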

Using the hierarchy shown in Figure 1, you can surmise that a simple loop will display all the groups from a match. The following generic function takes as its parameters an input string to parse and a pattern to use:

void DisplayGroups(String* input, String* pattern)
{
  try
  {
    StringBuilder* results = new StringBuilder();

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input);
         match->Success; 
         match = match->NextMatch())
    {
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup {0} at {1}\r\n",
                              groups->Item[i]->Value,
                              __box(groups->Item[i]->Index));
      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

Once the regular expression object has been created, the function enumerates the matches and, for each match, enumerates its groups.

To test this code, call DisplayGroups, passing it a text string containing some phone numbers and the pattern to use (remember to define the area code group with the parentheses):

DisplayGroups(S"My phone numbers are 770-555-1212 and 404-555-1212",
              S"(\\d{3})-\\d{3}-\\d{4}");

Figure 2 shows the results of using this example with the DisplayGroups function.

Figure 2: You Can Easily Enumerate the Groups of a Regular Expression Match
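If you want to reproduce this kind of enumeration outside of .NET, the following is a rough sketch of the same nested loop in standard C++ with std::regex. It is an analogue of DisplayGroups, not a port: the name DescribeGroups is hypothetical, the report is returned as a string instead of being shown in a message box, and lines end with \n rather than \r\n.

```cpp
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>

// Builds a text report listing every match and, for each match, every group.
// Assumption: std::regex stands in for the .NET Regex/Match/Group classes.
std::string DescribeGroups(const std::string& input, const std::string& pattern)
{
  const std::regex rex(pattern);
  std::ostringstream out;

  // for all the matches
  for (std::sregex_iterator it(input.begin(), input.end(), rex), end;
       it != end; ++it)
  {
    const std::smatch& match = *it;
    out << "Match " << match.str() << " at "
        << (match[0].first - input.begin()) << "\n";

    // for all of THIS match's groups (index 0 is the entire match)
    for (std::size_t i = 0; i < match.size(); ++i)
    {
      out << "\tGroup " << match.str(i) << " at "
          << (match[i].first - input.begin()) << "\n";
    }
  }
  return out.str();
}
```

For the sample input and pattern used above, each match should list two groups: the entire phone number first, followed by the area code, both starting at the same offset.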

Extracting Specific Groups

Note from Figure 2 that the first group object in each group collection is the entire match. Therefore, any groups you define (by placing parentheses in the pattern) start at the second group object, index 1, in the group collection. Because the DisplayGroups function does most of what you need, you can modify it slightly to create a function, ExtractAreaCodes, that is specific to extracting area codes from a text value:

void ExtractAreaCodes(String* input)
{
  try
  {
    StringBuilder* results = new StringBuilder();
    results->AppendFormat(S"The Area Codes for '{0}' are:\r\n\r\n", input);

    String* pattern = S"(\\d{3})-\\d{3}-\\d{4}";

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input);
         match->Success;
         match = match->NextMatch())
    {
      results->AppendFormat(S"\t{0}\r\n", match->Groups->Item[1]->Value);
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

As you can see, there are only two major changes to the function. The pattern is now hard-coded, because this function is dedicated to area codes, and the AppendFormat call on the results object now takes the following argument, which extracts the second group from the match's group collection object:

match->Groups->Item[1]->Value

Now, you can test the ExtractAreaCodes function like this:

ExtractAreaCodes(S"My phone numbers are 770-555-1212 and 404-555-1212");

Doing so yields the expected results shown in Figure 3.

Figure 3: You Can Use Standard Collection Notation to Retrieve Specific Groups from a Match

Looking Ahead

You've learned the basics of how to define groups or sub-matches within a regular expression pattern and how to enumerate all the groups of a match as well as extract a specific group. At this point, you should be able to modify the code in this tip for situations where you need to both locate a particular pattern in a string using regular expressions and then extract specific sub-matches.

One thing that's not so nice about the ExtractAreaCodes function is that the code is hard-coded to retrieve the second object from the group collection. What if the pattern changes such that another group appears before the area code? The programmer would need to change the ExtractAreaCodes function—as well as any other functions depending on the specific order of groups within the group collection. Therefore, the next tip will cover how to name groups (to avoid this code-maintenance hassle) and explain how to define "non-capturing" groups.
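To make that maintenance problem concrete, here is a small sketch in standard C++ with std::regex (again an analogue, not the .NET API). The helper GroupAt and the revised pattern, which adds an optional one-digit prefix group in front of the area code, are hypothetical and exist only to show how a numeric group index shifts when the pattern changes.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Original pattern: the area code is group 1.
const char* const kOriginalPattern = R"((\d{3})-\d{3}-\d{4})";

// Hypothetical revision: a new optional prefix group comes first,
// which pushes the area code from group 1 to group 2.
const char* const kRevisedPattern = R"((\d-)?(\d{3})-\d{3}-\d{4})";

// Returns the text of group `index` from the first match of `pattern`
// in `text`, or an empty string if there is no match or no such group.
std::string GroupAt(const std::string& text,
                    const std::string& pattern,
                    std::size_t index)
{
  const std::regex rex(pattern);
  std::smatch match;
  if (!std::regex_search(text, match, rex) || index >= match.size())
    return "";
  return match[index].str();
}
```

With the input 1-770-555-1212, asking for group 1 gives the area code under the original pattern but the prefix under the revised one, so any caller that hard-codes index 1 would quietly start returning the wrong text.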



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.
