A TR1 Tutorial: Regular Expressions

Regular expressions allow text processing in a more efficient way than procedural code, because the description of the processing is moved from code into the regular expression. Until TR1, there was no support in STL for regular expressions. Now, it has been introduced and is based on the regex from the boost library. In this article, I will try to shed some light on the regular expression features in TR1.

Regular expressions from TR1, (class regex), can work with any of these six grammars:

  • ECMAScript, default grammar and the most powerful
  • basic, POSIX Basic Regular Expressions
  • extended, POSIX Extended Regular Expressions
  • awk, POSIX awk
  • grep, POSIX grep
  • egrep, POSIX grep -E

As I already mentioned, the most powerful of the six is ECMAScript (which offers what all the other grammars offer). The new standard says that the grammar recognized by the basic_regex class is (with some exceptions) the one specified by ECMA-262. This is ECMAScript Language Specification, whose regular expressions are modeled after the ones in Perl 5.

In this article, I will not get into the details of any of these grammars. This MSDN article explains the grammar and semantics of the regular expressions. The only thing I want to point here are the so-called "capture groups." A regular expression can contain such capture groups (also called sub-expressions); their role is identifying parts of an expression, parts that can be used later. Capture groups are introduced with parenthesis, such as in (ab+|cd?). Parenthesis also override precedence.

Header <regex> defines types, algorithms and iterators under the namespace std::tr1.

Types

  • basic_regex: This is a template class that contains a regular expression; it basically implements a finite state machine, constructed based on a regular expression. There are two typedefs for this class, one for char and one for wchar_t.

    typedef basic_regex<char> regex;
    typedef basic_regex<wchar_t> wregex;
    

    It should be noted that this class does not have an implicit constructor, only an explicit one. The reason is that instantiating an object of this type is time consuming and should only be made explicitly.

  • match_results: A class that contains a sequence of matches; each element points to a subsequence that matched the capture group corresponding to the element. When the empty() method returns true or the size() method returns 0, an object of this type does not contain any match. Otherwise, when empty() returns false, size() returns 1 or a greater value and:

    • match[0]: represents the entire match
    • match[1]: represents the first match (sub_match)
    • match[2]: represents the second match, and so forth
    • prefix(): returns the string that precedes the match
    • suffix(): returns the string that follows the match

    There are several typedefs for std::tr1::match_results:

    typedef match_results<const char*> cmatch;
    typedef match_results<const wchar_t*> wcmatch;
    typedef match_results<string::const_iterator> smatch;
    typedef match_results<wstring::const_iterator> wsmatch;
    
  • sub_match: Represents a sequence of characters that match a capture group; an object of the match_results type can contain an array of objects of this type. If there was no match for a capture group, the two iterators for an object of this type are equal. There are several typedefs for this template class:

    typedef sub_match<const char*>             csub_match;
    typedef sub_match<const wchar_t*>          wcsub_match;
    typedef sub_match<string::const_iterator>  ssub_match;
    typedef sub_match<wstring::const_iterator> wssub_match;
    
  • regex_constants: Contains constants for syntax, matching, and formatting rules and error identifiers.
  • regex_error: An exception of this type is thrown to indicate an error in the construction or usage of an object of the basic_regex type.
  • regex_traits: Describes various characteristics of elements for matching; there are two specializations of this template class, for char and wchar_t:

    template <>
        class regex_traits<char>
        
    template <>
        class regex_traits<wchar_t>
    

Algorithms

  • regex_match(): Completely matches a string with a regular expression, building sub-matches for the capture groups
  • regex_search(): Matches parts of a string with a regular expression, building sub-matches for the capture groups
  • regex_replace(): Replaces all the matches from a regular expression according to a specified format; optionally, you can replace only the first match or the parts of the string that did not produce a match
  • swap(): Swaps two objects of the basic_regex or match_result types

Iterators

  • regex_iterator: A forward constant iterator for iterating through all occurrences of a pattern in a string. There are several typedefs:

    typedef regex_iterator<const char*>            cregex_iterator;
    typedef regex_iterator<const wchar_t*>         wcregex_iterator;
    typedef regex_iterator<string::const_iterator> sregex_iterator;
    typedef regex_iterator<wstring::const_iterator>
       wsregex_iterator;
    
  • regex_token_iterator: A forwards constant iterator for iterating through the capture groups of all occurrences of a pattern in a string. Conceptually, it holds a regex_iterator object that it uses to search for regular expression matches in a character sequence. There are several typedefs:

    typedef regex_token_iterator<const char*>
       cregex_token_iterator;
    typedef regex_token_iterator<const wchar_t*>
       wcregex_token_iterator;
    typedef regex_token_iterator<string::const_iterator>
       sregex_token_iterator;
    typedef regex_token_iterator<wstring::const_iterator>
       wsregex_token_iterator;
    

Matching

This section contains examples for exactly matching a string.

Example 1

The is_email_valid function returns true if an email address has a valid format (not necessary the most complete format).

#include <regex>
#include <iostream>
#include <string>

bool is_email_valid(const std::string& email)
{
   // define a regular expression
   const std::tr1::regex pattern
      ("(\\w+)(\\.|_)?(\\w*)@(\\w+)(\\.(\\w+))+");

   // try to match the string with the regular expression
   return std::tr1::regex_match(email, pattern);
}

int main()
{
   std::string email1 = "marius.bancila@domain.com";
   std::string email2 = "mariusbancila@domain.com";
   std::string email3 = "marius_b@domain.co.uk";
   std::string email4 = "marius@domain";

   std::cout << email1 << " : " << (is_email_valid(email1) ?
      "valid" : "invalid") << std::endl;
   std::cout << email2 << " : " << (is_email_valid(email2) ?
      "valid" : "invalid") << std::endl;
   std::cout << email3 << " : " << (is_email_valid(email3) ?
     "valid" : "invalid") << std::endl;
   std::cout << email4 << " : " << (is_email_valid(email4) ?
     "valid" : "invalid") << std::endl;

   return 0;
}

This program prints:

marius.bancila@domain.com : valid
mariusbancila@domain.com  : valid
marius_b@domain.co.uk     : valid
marius@domain             : invalid

By using an object of the match_results type, you can get access to the capture groups, introduced with parenthesis. If there is at least a match, match[0] represents the entire match, and match[i] (with i > 0) represents the i-th match.

Example 2

The following example identifies and prints the individual parts of an IP address.

#include <regex>
#include <iostream>
#include <string>

void show_ip_parts(const std::string& ip)
{
   // regular expression with 4 capture groups defined with
   // parenthesis (...)
   const std::tr1::regex pattern("(\\d{1,3}):(\\d{1,3}):(\\d{1,3}):
                                 (\\d{1,3})");

   // object that will contain the sequence of sub-matches
   std::tr1::match_results<std::string::const_iterator> result;

   // match the IP address with the regular expression
   bool valid = std::tr1::regex_match(ip, result, pattern);

   std::cout << ip << " \t: " << (valid ? "valid" : "invalid") 
             << std::endl;
             
   // if the IP address matched the regex, then print the parts
   if(valid)
   {
      std::cout << "b1: " << result[1] << std::endl;
      std::cout << "b2: " << result[2] << std::endl;
      std::cout << "b3: " << result[3] << std::endl;
      std::cout << "b4: " << result[4] << std::endl;
   }
}

int main()
{
   show_ip_parts("1:22:33:444");
   show_ip_parts("1:22:33:4444");
   show_ip_parts("100:200");

   return 0;
}

The program prints:

1:22:33:444     : valid
b1: 1
b2: 22
b3: 33
b4: 444
1:22:33:4444    : invalid
100:200         : invalid

Searching

regex_match tried to match the entire string with the regular expression. On the other hand, regex_search() does a string search, making partial matches of a string with a regular expression.

Example 1

In this example, you search for the first word that ends in 'day'.

int main()
{
   // regular expression
   const std::tr1::regex pattern("(\\w+day)");
   
   // the source text
   std::string weekend = "Saturday and Sunday";
   
   // sequence of string sub-matches
   std::tr1::smatch result;

   bool match = std::tr1::regex_search(weekend, result, pattern);

   if(match)
   {
      // if there was a match print it
      for(size_t i = 1; i < result.size(); ++i)
      {
         std::cout << result[i] << std::endl;
      }
   }

   return 0;
}

The output is:

Saturday

To find all the sub-matches, you have to use a token iterator, as shown in the next code sample.

Example 2

int main()
{
   // regular expression
   const std::tr1::regex pattern("\\w+day");

   // the source text
   std::string weekend = "Saturday and Sunday, but some Fridays also.";

   const std::tr1::sregex_token_iterator end;
   for (std::tr1::sregex_token_iterator i(weekend.begin(),
      weekend.end(), pattern);
      i != end;
      ++i)
   {
      std::cout << *i << std::endl;
   }
   
   return 0;
}

In this case, the output is:

Saturday
Sunday
Friday

In the preceding example, the sregex_token_iterator constructor took as arguments two iterators that delimited the text to search and a regex object representing the regular expression. In this case, the iterator points only to matches corresponding to a single capture group. To iterate over several capture groups, a second constructor is used. This constructor takes a vector whose elements represent the indexes of the capture groups to be considered.

Example 2

In this example, you extract from a text a sequence of points representing the vertices of a polygon.

struct Point
{
   int X;
   int Y;
   Point(int x, int y): X(x), Y(y){}
};

typedef std::vector<Point> Polygon;

int main()
{
   Polygon poly;
   std::string s = "Polygon: (1,2), (3,4), (5,6), (7,8)";

   const std::tr1::regex r("(\\d+),(\\d+)");

   const std::tr1::sregex_token_iterator end;
   std::vector<int> v;
   v.push_back(1);
   v.push_back(2);

   for (std::tr1::sregex_token_iterator i(s.begin(), s.end(), r, v);
      i != end;)
   {
      int x = atoi((*i).str().c_str()); ++i;
      int y = atoi((*i).str().c_str()); ++i;
      poly.push_back(Point(x, y));
   }

   for(size_t i = 0; i < poly.size(); ++i)
   {
      std::cout << "(" << poly[i].X << ", " << poly[i].Y << ") ";
   }
   std::cout << std::endl;

   return 0;
}

The output is:

(1, 2) (3, 4) (5, 6) (7, 8)

To understand how it works, I will comment the second call to push_back().

    std::vector<int> v;
    v.push_back(1);
    //v.push_back(2);

In this case, only the first capture group will be considered and the output changes to:

(1, 3) (5, 7)

If I comment the first call to push_back(), only the second capture group is considered and the output becomes:

(2, 4) (6, 8)

One aspect that must be considered is the evaluation order. If I write:

poly.push_back(
   Point(
      atoi((*i++).str().c_str()),
      atoi((*i++).str().c_str())));

The behavior is undefined, because the iterator is incremented more than once between two sequence points, which is illegal.

Transformations

You can replace a match in a string according to a pattern. This can either be a simple string, or a string representing a pattern constructed with escape characters indicating capture groups.

  • $1: What matches the first capture group
  • $2: What matches the second capture group
  • $&: What matches the whole regular expression
  • $`: What appears before the whole regex
  • $': What appears after the whole regex
  • $$: $

Example 1

The following code replaces 'a' with 'an', when the article 'a' precedes a word that starts with a vowel.

int main()
{
   // text to transform
   std::string text = "This is a element and this a unique ID.";
   
   // regular expression with two capture groups
   const std::tr1::regex pattern("(\\ba (a|e|i|u|o))+");
   
   // the pattern for the transformation, using the second
   // capture group
   std::string replace = "an $2";

   std::string newtext = std::tr1::regex_replace(text, pattern, replace);

   std::cout << newtext << std::endl;
   
   return 0;
}
This is an element and this an unique ID.

Example 2

The following code replaces only the first sub-match with a specified string.

std::string change_root(const std::string& item,
                        const std::string& newroot)
{
   // regular expression
   const std::tr1::regex pattern("\\\\?((\\w|:)*)");
   
   // transformation pattern
   std::string replacer = newroot;

   // flag that indicates to transform only the first match
   std::tr1::regex_constants::match_flag_type fonly = 
      std::tr1::regex_constants::format_first_only;

   // apply the transformation
   return std::tr1::regex_replace(item, pattern, replacer, fonly);
}

int main()
{
   std::string item1 = "\\dir\\dir2\\dir3";
   std::string item2 = "c:\\folder\\";

   std::cout << item1 << " -> " << change_root(item1, "\\dir1")
      << std::endl;
   std::cout << item2 << " -> " << change_root(item2, "d:")
      << std::endl;
   
   return 0;
}

The output is:

\dir\dir2\dir3 -> \dir1\dir2\dir3
c:\folder\ -> d:\folder\

Example 3

This example shows how to transform a string representing a date in the format DD-MM-YYYY to a string representing a date in the format YYYY-MM-DD. For the separator, I will consider any of the characters '.', '-', and '/'.

std::string format_date(const std::string& date)
{
   // regular expression
   const std::tr1::regex pattern("(\\d{1,2})(\\.|-|/)(\\d{1,2})
      (\\.|-|/)(\\d{4})");

   // transformation pattern, reverses the position of all capture groups
   std::string replacer = "$5$4$3$2$1";

   // apply the tranformation
   return std::tr1::regex_replace(date, pattern, replacer);
}

int main()
{
   std::string date1 = "1/2/2008";
   std::string date2 = "12.08.2008";

   std::cout << date1 << " -> " << format_date(date1) << std::endl;
   std::cout << date2 << " -> " << format_date(date2) << std::endl;
}
1/2/2008 ->   2008/2/1
12.08.2008 -> 2008.08.12

Conclusions

This article is an overview on the algorithms and several classes for regular expression in TR1. More detailed information about them can be found in MSDN. Unfortunately, at least for the moment, the TR1 documentation is not very elaborate; hopefully, this article will help you to clarify at least the basics.



About the Author

Marius Bancila

Marius Bancila is a Microsoft MVP for VC++. He works as a software developer for a Norwegian-based company. He is mainly focused on building desktop applications with MFC and VC#. He keeps a blog at www.mariusbancila.ro/blog, focused on Windows programming. He is the co-founder of codexpert.ro, a community for Romanian C++/VC++ programmers.