A TR1 Tutorial: Regular Expressions

Regular expressions allow text processing in a more compact way than procedural code, because the description of the processing moves from the code into the regular expression. Until TR1, the C++ standard library had no support for regular expressions. TR1 introduces it, based on the regex library from Boost. In this article, I will try to shed some light on the regular expression features in TR1.

The TR1 regular expression class, regex, can work with any of these six grammars:

  • ECMAScript, default grammar and the most powerful
  • basic, POSIX Basic Regular Expressions
  • extended, POSIX Extended Regular Expressions
  • awk, POSIX awk
  • grep, POSIX grep
  • egrep, POSIX grep -E

As I already mentioned, the most powerful of the six is ECMAScript (it offers what all the other grammars offer). The new standard says that the grammar recognized by the basic_regex class is (with some exceptions) the one specified by ECMA-262, the ECMAScript Language Specification, whose regular expressions are modeled after the ones in Perl 5.
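
As a rough illustration of how the chosen grammar changes the meaning of a pattern, the sketch below compiles patterns with different syntax flags. It assumes a C++11 or later compiler, where these facilities live in namespace std rather than std::tr1; the names full_match and example_usage are made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: compile `pattern` with the given grammar and report
// whether it matches the whole of `text`.
bool full_match(const std::string& pattern,
                const std::string& text,
                std::regex_constants::syntax_option_type grammar)
{
   const std::regex re(pattern, grammar);
   return std::regex_match(text, re);
}

void example_usage()
{
   using namespace std::regex_constants;

   // ECMAScript and POSIX extended: plain parentheses form a group
   assert(full_match("(ab)+", "abab", ECMAScript));
   assert(full_match("(ab)+", "abab", extended));

   // POSIX basic: groups are written \( \), plain parentheses are literals
   assert(full_match("\\(ab\\)*", "abab", basic));
   assert(!full_match("(ab)*", "abab", basic));
}
```

The grammar is fixed when the regex object is constructed, so the same pattern text may need to be written differently for each grammar.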

In this article, I will not get into the details of any of these grammars; the MSDN documentation explains the grammar and semantics of the regular expressions. The only thing I want to point out here is the so-called "capture groups." A regular expression can contain such capture groups (also called sub-expressions); their role is to identify parts of an expression that can be used later. Capture groups are introduced with parentheses, such as in (ab+|cd?). Parentheses also override precedence.

Header <regex> defines types, algorithms and iterators under the namespace std::tr1.

Types

  • basic_regex: This is a template class that contains a regular expression; it basically implements a finite state machine, constructed based on a regular expression. There are two typedefs for this class, one for char and one for wchar_t.

    typedef basic_regex<char> regex;
    typedef basic_regex<wchar_t> wregex;
    

    It should be noted that the constructors of this class that take a pattern are explicit, so a string is never converted to a regex implicitly. The reason is that constructing an object of this type is time consuming and should only be done deliberately.

  • match_results: A class that contains a sequence of matches; each element points to a subsequence that matched the capture group corresponding to the element. When the empty() method returns true or the size() method returns 0, an object of this type does not contain any match. Otherwise, when empty() returns false, size() returns 1 or a greater value and:

    • match[0]: represents the entire match
    • match[1]: represents the first sub-match (sub_match), the text matched by the first capture group
    • match[2]: represents the second sub-match, and so forth
    • prefix(): returns the string that precedes the match
    • suffix(): returns the string that follows the match

    There are several typedefs for std::tr1::match_results:

    typedef match_results<const char*> cmatch;
    typedef match_results<const wchar_t*> wcmatch;
    typedef match_results<string::const_iterator> smatch;
    typedef match_results<wstring::const_iterator> wsmatch;
    
  • sub_match: Represents a sequence of characters that match a capture group; an object of the match_results type can contain an array of objects of this type. If there was no match for a capture group, the two iterators for an object of this type are equal. There are several typedefs for this template class:

    typedef sub_match<const char*>             csub_match;
    typedef sub_match<const wchar_t*>          wcsub_match;
    typedef sub_match<string::const_iterator>  ssub_match;
    typedef sub_match<wstring::const_iterator> wssub_match;
    
  • regex_constants: Contains constants for syntax, matching, and formatting rules and error identifiers.
  • regex_error: An exception of this type is thrown to indicate an error in the construction or usage of an object of the basic_regex type.
  • regex_traits: Describes various characteristics of elements for matching; there are two specializations of this template class, for char and wchar_t:

    template <>
        class regex_traits<char>
        
    template <>
        class regex_traits<wchar_t>
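
The following minimal sketch shows how the match_results and sub_match members described above fit together. It assumes a C++11 or later compiler (namespace std instead of std::tr1); the function describe_first_number and its pattern are invented for illustration. The regex object is static so that the costly construction happens only once.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: describe the first number (with an optional unit)
// found in `text`, or return an empty string when nothing matches.
std::string describe_first_number(const std::string& text)
{
   // group 1: the digits; group 2: an optional unit
   static const std::regex pattern("(\\d+)(px|em)?");

   std::smatch m;
   if (!std::regex_search(text, m, pattern))
      return "";   // no match: m.empty() is true and m.size() is 0

   // one entry for the whole match plus one per capture group
   assert(m.size() == 3);

   // a group that did not take part in the match has matched == false
   const std::string unit = m[2].matched ? m[2].str() : "none";

   return "[" + m.prefix().str() + "]"
        + " number=" + m[1].str()
        + " unit=" + unit
        + " [" + m.suffix().str() + "]";
}
```

For the input "width: 42px;" this returns "[width: ] number=42 unit=px [;]", and for "size 7" the second group is reported as "none".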
    

Algorithms

  • regex_match(): Completely matches a string with a regular expression, building sub-matches for the capture groups
  • regex_search(): Matches parts of a string with a regular expression, building sub-matches for the capture groups
  • regex_replace(): Replaces all the matches of a regular expression according to a specified format; optionally, you can replace only the first match or leave out the parts of the string that did not produce a match
  • swap(): Swaps two objects of the basic_regex or match_results types
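
The difference between a complete and a partial match can be sketched as follows, assuming a C++11 or later compiler (namespace std instead of std::tr1). The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// true only when the whole of `text` is a run of digits
bool is_all_digits(const std::string& text)
{
   static const std::regex digits("\\d+");
   return std::regex_match(text, digits);
}

// true when `text` contains at least one run of digits anywhere
bool contains_digits(const std::string& text)
{
   static const std::regex digits("\\d+");
   return std::regex_search(text, digits);
}

void example_usage()
{
   assert(is_all_digits("2008"));
   assert(!is_all_digits("year 2008"));   // extra text: no complete match
   assert(contains_digits("year 2008"));  // a partial match is enough
   assert(!contains_digits("year"));
}
```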

Iterators

  • regex_iterator: A forward constant iterator for iterating through all occurrences of a pattern in a string. There are several typedefs:

    typedef regex_iterator<const char*>            cregex_iterator;
    typedef regex_iterator<const wchar_t*>         wcregex_iterator;
    typedef regex_iterator<string::const_iterator> sregex_iterator;
    typedef regex_iterator<wstring::const_iterator>
       wsregex_iterator;
    
  • regex_token_iterator: A forward constant iterator for iterating through the capture groups of all occurrences of a pattern in a string. Conceptually, it holds a regex_iterator object that it uses to search for regular expression matches in a character sequence. There are several typedefs:

    typedef regex_token_iterator<const char*>
       cregex_token_iterator;
    typedef regex_token_iterator<const wchar_t*>
       wcregex_token_iterator;
    typedef regex_token_iterator<string::const_iterator>
       sregex_token_iterator;
    typedef regex_token_iterator<wstring::const_iterator>
       wsregex_token_iterator;
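
As a small sketch of regex_iterator, the code below walks over all occurrences of a pattern and records where each one starts. It assumes a C++11 or later compiler (namespace std instead of std::tr1); the Hit struct and the find_days function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record: where a match starts and what it contains.
struct Hit
{
   std::size_t position;
   std::string text;
};

std::vector<Hit> find_days(const std::string& text)
{
   static const std::regex pattern("\\w+day");

   std::vector<Hit> hits;
   const std::sregex_iterator end;   // default-constructed: end of sequence
   for (std::sregex_iterator it(text.begin(), text.end(), pattern);
        it != end; ++it)
   {
      // *it is a whole match_results object, so position() and str()
      // are available for every occurrence
      Hit hit;
      hit.position = static_cast<std::size_t>(it->position());
      hit.text = it->str();
      hits.push_back(hit);
   }
   return hits;
}

void example_usage()
{
   const std::vector<Hit> hits = find_days("Monday, then Tuesday");
   assert(hits.size() == 2);
   assert(hits[0].position == 0 && hits[0].text == "Monday");
   assert(hits[1].position == 13 && hits[1].text == "Tuesday");
}
```

The iterator keeps references to the text and to the regex object, so both must outlive the loop.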
    

Matching

This section contains examples for exactly matching a string.

Example 1

The is_email_valid function returns true if an email address has a valid format (not necessarily the most complete format).

#include <regex>
#include <iostream>
#include <string>

bool is_email_valid(const std::string& email)
{
   // define a regular expression
   const std::tr1::regex pattern
      ("(\\w+)(\\.|_)?(\\w*)@(\\w+)(\\.(\\w+))+");

   // try to match the string with the regular expression
   return std::tr1::regex_match(email, pattern);
}

int main()
{
   std::string email1 = "marius.bancila@domain.com";
   std::string email2 = "mariusbancila@domain.com";
   std::string email3 = "marius_b@domain.co.uk";
   std::string email4 = "marius@domain";

   std::cout << email1 << " : " << (is_email_valid(email1) ?
      "valid" : "invalid") << std::endl;
   std::cout << email2 << " : " << (is_email_valid(email2) ?
      "valid" : "invalid") << std::endl;
   std::cout << email3 << " : " << (is_email_valid(email3) ?
     "valid" : "invalid") << std::endl;
   std::cout << email4 << " : " << (is_email_valid(email4) ?
     "valid" : "invalid") << std::endl;

   return 0;
}

This program prints:

marius.bancila@domain.com : valid
mariusbancila@domain.com : valid
marius_b@domain.co.uk : valid
marius@domain : invalid

By using an object of the match_results type, you can get access to the capture groups, introduced with parentheses. If there is a match, match[0] represents the entire match, and match[i] (with i > 0) represents the sub-match for the i-th capture group.

Example 2

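The following example identifies and prints the individual parts of an IP address. Note that the pattern expects the parts to be separated by ':' and checks only that each part has one to three digits, not that its value is in range.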

#include <regex>
#include <iostream>
#include <string>

void show_ip_parts(const std::string& ip)
{
   // regular expression with 4 capture groups defined with
   // parenthesis (...)
   const std::tr1::regex pattern(
      "(\\d{1,3}):(\\d{1,3}):(\\d{1,3}):(\\d{1,3})");

   // object that will contain the sequence of sub-matches
   std::tr1::match_results<std::string::const_iterator> result;

   // match the IP address with the regular expression
   bool valid = std::tr1::regex_match(ip, result, pattern);

   std::cout << ip << " \t: " << (valid ? "valid" : "invalid") 
             << std::endl;
             
   // if the IP address matched the regex, then print the parts
   if(valid)
   {
      std::cout << "b1: " << result[1] << std::endl;
      std::cout << "b2: " << result[2] << std::endl;
      std::cout << "b3: " << result[3] << std::endl;
      std::cout << "b4: " << result[4] << std::endl;
   }
}

int main()
{
   show_ip_parts("1:22:33:444");
   show_ip_parts("1:22:33:4444");
   show_ip_parts("100:200");

   return 0;
}

The program prints:

1:22:33:444     : valid
b1: 1
b2: 22
b3: 33
b4: 444
1:22:33:4444    : invalid
100:200         : invalid
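
The example above separates the parts with ':' and accepts any value of up to three digits. A possible variant for the usual dotted notation, with a numeric range check on each part, might look like the sketch below. It assumes a C++11 or later compiler (namespace std instead of std::tr1), and is_ipv4 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical helper: accepts "a.b.c.d" where every part is 0..255.
bool is_ipv4(const std::string& ip)
{
   // the dot must be escaped, otherwise it matches any character
   static const std::regex pattern(
      "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})");

   std::smatch parts;
   if (!std::regex_match(ip, parts, pattern))
      return false;

   // the pattern only checks the shape; the range is checked numerically
   for (std::size_t i = 1; i < parts.size(); ++i)
   {
      if (std::atoi(parts[i].str().c_str()) > 255)
         return false;
   }
   return true;
}

void example_usage()
{
   assert(is_ipv4("192.168.0.1"));
   assert(!is_ipv4("1.22.33.444"));   // right shape, last part out of range
   assert(!is_ipv4("1:22:33:44"));    // wrong separator
   assert(!is_ipv4("100.200"));       // too few parts
}
```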

Searching

regex_match() tries to match the entire string with the regular expression. regex_search(), on the other hand, searches within the string and reports a match when any part of it matches the regular expression.

Example 1

In this example, you search for the first word that ends in 'day'.

#include <iostream>
#include <regex>
#include <string>

int main()
{
   // regular expression
   const std::tr1::regex pattern("(\\w+day)");
   
   // the source text
   std::string weekend = "Saturday and Sunday";
   
   // sequence of string sub-matches
   std::tr1::smatch result;

   bool match = std::tr1::regex_search(weekend, result, pattern);

   if(match)
   {
      // if there was a match print it
      for(size_t i = 1; i < result.size(); ++i)
      {
         std::cout << result[i] << std::endl;
      }
   }

   return 0;
}

The output is:

Saturday

To find all the matches, not just the first one, you can use a token iterator, as shown in the next code sample.

Example 2

#include <iostream>
#include <regex>
#include <string>

int main()
{
   // regular expression
   const std::tr1::regex pattern("\\w+day");

   // the source text
   std::string weekend = "Saturday and Sunday, but some Fridays also.";

   const std::tr1::sregex_token_iterator end;
   for (std::tr1::sregex_token_iterator i(weekend.begin(),
      weekend.end(), pattern);
      i != end;
      ++i)
   {
      std::cout << *i << std::endl;
   }
   
   return 0;
}

In this case, the output is:

Saturday
Sunday
Friday

In the preceding example, the sregex_token_iterator constructor took as arguments two iterators that delimited the text to search and a regex object representing the regular expression. In this case, the iterator yields one sub-match per occurrence; by default that is sub-match 0, the entire match. To iterate over several capture groups, a second constructor is used. This constructor takes a vector whose elements represent the indexes of the capture groups to be considered.
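
Besides the vector form, the token iterator can also be given a single index as its fourth argument. The sketch below, which assumes a C++11 or later compiler (namespace std instead of std::tr1), selects capture group 1 and also uses the special index -1, which yields the text between matches. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the keys from text such as "a=1; b=2".
std::vector<std::string> keys_of(const std::string& text)
{
   static const std::regex pair("(\\w+)=(\\w+)");

   std::vector<std::string> keys;
   const std::sregex_token_iterator end;
   // index 1: yield only the first capture group of every match
   for (std::sregex_token_iterator it(text.begin(), text.end(), pair, 1);
        it != end; ++it)
   {
      keys.push_back(it->str());
   }
   return keys;
}

// Hypothetical helper: split on commas, ignoring surrounding spaces.
std::vector<std::string> split_on_commas(const std::string& text)
{
   static const std::regex comma("\\s*,\\s*");

   std::vector<std::string> parts;
   const std::sregex_token_iterator end;
   // index -1: yield the stretches of text that did NOT match
   for (std::sregex_token_iterator it(text.begin(), text.end(), comma, -1);
        it != end; ++it)
   {
      parts.push_back(it->str());
   }
   return parts;
}

void example_usage()
{
   const std::vector<std::string> keys = keys_of("a=1; b=2");
   assert(keys.size() == 2 && keys[0] == "a" && keys[1] == "b");

   const std::vector<std::string> parts = split_on_commas("red, green ,blue");
   assert(parts.size() == 3);
   assert(parts[0] == "red" && parts[1] == "green" && parts[2] == "blue");
}
```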

Example 3

In this example, you extract from a text a sequence of points representing the vertices of a polygon.

#include <cstdlib>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

struct Point
{
   int X;
   int Y;
   Point(int x, int y): X(x), Y(y){}
};

typedef std::vector<Point> Polygon;

int main()
{
   Polygon poly;
   std::string s = "Polygon: (1,2), (3,4), (5,6), (7,8)";

   const std::tr1::regex r("(\\d+),(\\d+)");

   const std::tr1::sregex_token_iterator end;
   std::vector<int> v;
   v.push_back(1);
   v.push_back(2);

   for (std::tr1::sregex_token_iterator i(s.begin(), s.end(), r, v);
      i != end;)
   {
      int x = atoi((*i).str().c_str()); ++i;
      int y = atoi((*i).str().c_str()); ++i;
      poly.push_back(Point(x, y));
   }

   for(size_t i = 0; i < poly.size(); ++i)
   {
      std::cout << "(" << poly[i].X << ", " << poly[i].Y << ") ";
   }
   std::cout << std::endl;

   return 0;
}

The output is:

(1, 2) (3, 4) (5, 6) (7, 8)

To understand how it works, I will comment out the second call to push_back().

    std::vector<int> v;
    v.push_back(1);
    //v.push_back(2);

In this case, only the first capture group will be considered and the output changes to:

(1, 3) (5, 7)

If I comment out the first call to push_back() instead, only the second capture group is considered and the output becomes:

(2, 4) (6, 8)

One aspect that must be considered is the evaluation order. If I write:

poly.push_back(
   Point(
      atoi((*i++).str().c_str()),
      atoi((*i++).str().c_str())));

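The result cannot be relied upon, because the order in which the arguments of a function call are evaluated is unspecified: either increment may happen first, so the X and Y values can end up swapped. (With a built-in type such as a raw pointer, modifying the same object twice without intervening sequencing would be undefined behavior.)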
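
One way to sidestep the evaluation-order question altogether is to read both coordinates from the same match, using a regex_iterator instead of a token iterator. This is only a sketch of that alternative, assuming a C++11 or later compiler (namespace std instead of std::tr1); parse_points is a hypothetical name.

```cpp
#include <cassert>
#include <cstdlib>
#include <regex>
#include <string>
#include <vector>

struct Point
{
   int X;
   int Y;
   Point(int x, int y): X(x), Y(y){}
};

// Hypothetical helper: extract every "x,y" pair found in `s`.
std::vector<Point> parse_points(const std::string& s)
{
   static const std::regex r("(\\d+),(\\d+)");

   std::vector<Point> poly;
   const std::sregex_iterator end;
   for (std::sregex_iterator it(s.begin(), s.end(), r); it != end; ++it)
   {
      // both groups belong to the same match, so no paired increments
      const int x = std::atoi((*it)[1].str().c_str());
      const int y = std::atoi((*it)[2].str().c_str());
      poly.push_back(Point(x, y));
   }
   return poly;
}

void example_usage()
{
   const std::vector<Point> poly =
      parse_points("Polygon: (1,2), (3,4), (5,6), (7,8)");
   assert(poly.size() == 4);
   assert(poly[0].X == 1 && poly[0].Y == 2);
   assert(poly[3].X == 7 && poly[3].Y == 8);
}
```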

Transformations

You can replace the matches in a string according to a format. The format can either be a simple string, or a string containing escape sequences that refer to capture groups and other parts of the match:

  • $1: What matches the first capture group
  • $2: What matches the second capture group
  • $&: What matches the whole regular expression
  • $`: What appears before the whole regex
  • $': What appears after the whole regex
  • $$: $
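
The effect of each escape sequence can be checked with a small sketch like the one below. It assumes a C++11 or later compiler (namespace std instead of std::tr1) and the default ECMAScript format rules; apply_format is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replace every match of `pattern` in `text`
// using the format string `fmt`.
std::string apply_format(const std::string& text,
                         const std::string& pattern,
                         const std::string& fmt)
{
   return std::regex_replace(text, std::regex(pattern), fmt);
}

void example_usage()
{
   // $1 and $2: capture groups
   assert(apply_format("John Smith", "(\\w+) (\\w+)", "$2, $1")
          == "Smith, John");

   // $&: the whole match
   assert(apply_format("abc", "b", "[$&]") == "a[b]c");

   // $`: the text before the match; $': the text after the match
   assert(apply_format("abc", "b", "[$`]") == "a[a]c");
   assert(apply_format("abc", "b", "[$']") == "a[c]c");

   // $$: a literal dollar sign
   assert(apply_format("abc", "b", "$$") == "a$c");
}
```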

Example 1

The following code replaces 'a' with 'an', when the article 'a' precedes a word that starts with a vowel.

#include <iostream>
#include <regex>
#include <string>

int main()
{
   // text to transform
   std::string text = "This is a element and this a unique ID.";
   
   // regular expression with two capture groups
   const std::tr1::regex pattern("(\\ba (a|e|i|u|o))+");
   
   // the pattern for the transformation, using the second
   // capture group
   std::string replace = "an $2";

   std::string newtext = std::tr1::regex_replace(text, pattern, replace);

   std::cout << newtext << std::endl;
   
   return 0;
}

The output is:

This is an element and this an unique ID.

Example 2

The following code replaces only the first match with a specified string.

#include <iostream>
#include <regex>
#include <string>

std::string change_root(const std::string& item,
                        const std::string& newroot)
{
   // regular expression
   const std::tr1::regex pattern("\\\\?((\\w|:)*)");
   
   // transformation pattern
   std::string replacer = newroot;

   // flag that indicates to transform only the first match
   std::tr1::regex_constants::match_flag_type fonly = 
      std::tr1::regex_constants::format_first_only;

   // apply the transformation
   return std::tr1::regex_replace(item, pattern, replacer, fonly);
}

int main()
{
   std::string item1 = "\\dir\\dir2\\dir3";
   std::string item2 = "c:\\folder\\";

   std::cout << item1 << " -> " << change_root(item1, "\\dir1")
      << std::endl;
   std::cout << item2 << " -> " << change_root(item2, "d:")
      << std::endl;
   
   return 0;
}

The output is:

\dir\dir2\dir3 -> \dir1\dir2\dir3
c:\folder\ -> d:\folder\
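
To compare the formatting flags side by side, the following sketch applies the same replacement with different flags. It assumes a C++11 or later compiler (namespace std instead of std::tr1); mask_digits is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replace runs of digits with '#', using the
// given formatting flags.
std::string mask_digits(
   const std::string& text,
   std::regex_constants::match_flag_type flags =
      std::regex_constants::format_default)
{
   static const std::regex digits("\\d+");
   return std::regex_replace(text, digits, "#", flags);
}

void example_usage()
{
   using namespace std::regex_constants;

   // default: every match is replaced, the rest is copied unchanged
   assert(mask_digits("a1b22c333") == "a#b#c#");

   // format_first_only: only the first match is replaced
   assert(mask_digits("a1b22c333", format_first_only) == "a#b22c333");

   // format_no_copy: the parts that did not match are left out
   assert(mask_digits("a1b22c333", format_no_copy) == "###");
}
```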

Example 3

This example shows how to transform a string representing a date in the format DD-MM-YYYY to a string representing a date in the format YYYY-MM-DD. For the separator, I will consider any of the characters '.', '-', and '/'.

#include <iostream>
#include <regex>
#include <string>

std::string format_date(const std::string& date)
{
   // regular expression
   const std::tr1::regex pattern(
      "(\\d{1,2})(\\.|-|/)(\\d{1,2})(\\.|-|/)(\\d{4})");

   // transformation pattern, reverses the position of all capture groups
   std::string replacer = "$5$4$3$2$1";

   // apply the transformation
   return std::tr1::regex_replace(date, pattern, replacer);
}

int main()
{
   std::string date1 = "1/2/2008";
   std::string date2 = "12.08.2008";

   std::cout << date1 << " -> " << format_date(date1) << std::endl;
   std::cout << date2 << " -> " << format_date(date2) << std::endl;
}

The output is:

1/2/2008 -> 2008/2/1
12.08.2008 -> 2008.08.12
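
A possible variation on this example, sketched below, always writes '-' as the separator and uses a back-reference (\2) so that both separators in the input must be the same character. It assumes a C++11 or later compiler, which provides these facilities in namespace std and also supports raw string literals, so the backslashes do not need to be doubled. The name to_iso_date is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: turn D-M-YYYY (with '.', '-' or '/' as the
// separator) into YYYY-M-D; input that does not match is returned as is.
std::string to_iso_date(const std::string& date)
{
   // group 1: day, group 2: separator, group 3: month, group 4: year;
   // \2 requires the second separator to equal the first one
   static const std::regex pattern(
      R"((\d{1,2})(\.|-|/)(\d{1,2})\2(\d{4}))");

   return std::regex_replace(date, pattern, "$4-$3-$1");
}

void example_usage()
{
   assert(to_iso_date("12.08.2008") == "2008-08-12");
   assert(to_iso_date("1/2/2008") == "2008-2-1");

   // mixed separators do not match, so the text is left unchanged
   assert(to_iso_date("1/2-2008") == "1/2-2008");
}
```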

Conclusions

This article is an overview of the algorithms and several classes for regular expressions in TR1. More detailed information about them can be found in MSDN. Unfortunately, at least for the moment, the TR1 documentation is not very elaborate; hopefully, this article will help you to clarify at least the basics.



About the Author

Marius Bancila

Marius Bancila is a Microsoft MVP for VC++. He works as a software developer for a Norwegian-based company. He is mainly focused on building desktop applications with MFC and VC#. He keeps a blog at www.mariusbancila.ro/blog, focused on Windows programming. He is the co-founder of codexpert.ro, a community for Romanian C++/VC++ programmers.

Comments

  • Thank You

    Posted by Xathereal on 11/24/2012 04:30am

    I am learning a lot from your article. I think I can better your email pattern, however. I couldn't get my head around it because of this: "b.b.b@b.com" does not work, but this does: "b_b_b@b.com". I didn't realise this was the case until I did some test cases, and now I understand '_' is part of 'alnum' or 'w'... I came up with this solution, and it works for all combinations I can throw at it: ((([[:w:]])|(\\.))+)@([[:w:]]+)(\\.)((([[:w:]])|(\\.))+) This combination requires a word before a '.' after '@'; therefore, the minimum you can give it is "a@a.b". However, it also works for "a.a.a.a.a.a.a.a@a.com".

  • Using raw strings

    Posted by Vilmar on 03/23/2012 04:41pm

    Congratulations. Excellent. My only suggestion is to use raw string literals in patterns. I hope that this feature is available, today, in your compiler. Thank you.

  • Great article!

    Posted by SKL on 10/17/2008 08:09pm

    I learned regular expression from your article, thank you! Correct me if I am wrong, but it seems std::string replacer = ""; in example 3 (Transformation) is not complete. I am interested in knowing what should be between the brackets...Thanks again!

  • TR1?

    Posted by JNygren on 08/13/2008 11:28am

    Right! ... What's TR1?

    • RE: TR1?

      Posted by cilu on 08/14/2008 05:01am

      TR1 = Technical Report 1 The specification for the new additions to the C++ standard. Easy to find it on web in a couple of seconds.
