Repetitive Searches

The following article is a chapter from The C++ Standard Library Extensions: A Tutorial and Reference by Pete Becker, published by Addison-Wesley Professional books. Reprinted with the publisher’s permission.

Chapter 19: Repetitive Searches

There are only two or three human stories, and they go on repeating themselves as fiercely as if they had never happened before.

OPioneers!
WillaCather

Did you spot the problem with the example program that searched for code snippets in a text file at the beginning of Chapter 18? In lines that have multiple code snippets, everything between the first "<CODE>" and the last "</CODE>" is listed as a single snippet. To separate multiple snippets, we first have to change the regular expression a bit so that it doesn’t swallow multiple snippets. In this case, we can replace the “.*”with a nongreedy repetition:

stringexpr="<CODE>(.*?)</CODE>";

Now, to resume searching after the text that matched, we have to change the code. To do that, instead of searching the entire line from the file, we use a pair of iterators that point at the contents of the line. After a match, we advance the first iterator to point at the character immediately following the match and search again.

#include <regex>
#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h>
using std::tr1::regex; usingstd::tr1::regex_search;
using std::tr1::smatch;
using std::string; using std::ifstream; using std::cout;

static void show_matches(const char *fname)
   { // scan file named by fname, line by line
   ifstream input(fname);
   string str;
   smatch match;
   string expr = "<CODE>(.*?)</CODE>";
   regex rgx(expr, regex::icase);
   while (getline(input, str))
     { // check line for match
    string::const_iterator first = str.begin();
    string::const_iterator second = str.end();
    while (regex_search(first, second, match,rgx))
       { // show match, then skip past it
       cout << match[1] << 'n' ;
       first += match.position() + match.length();
       }
     }
   }

int main(int argc, char *argv[])
   { // search for code snippets in text file
   if (argc != 2)
     { // wrong number of arguments
     cout<<"Usage:snippets<filename>n";
     returnEXIT_FAILURE;
     }
   try
     { // search the file
     show_matches(argv[1]);
     }
   catch(...)
    { // something went wrong
    cout << "Errorn";
     return EXIT_FAILURE;
     }
   return0;
   }

Example 19.1 Repeated Searches (regexiter/repeated.cpp)

Don’t be fooled, though: Repetitive searches aren’t usually that easy to write. For example, if the regular expression begins with a “^”,simply restarting the search after a match, as the previous example does, can lead to wrong answers. The following program searches the target text “abcdef” for subsequences that match the regular expression “^(abc|def)”. The only one is the initial “abc“, but the program finds two, reporting that “def” also matches.

#include <regex>
#include <iostream>
#include <string>
using std::tr1::regex; using std::tr1::regex_search;
using std::tr1::smatch;
using std::string; using std::cout;

int main()
   { // search for regular expression in text
   string str = "abcdef";
   string::const_iterator first = str.begin();
   string::const_iterator second = str.end();
   smatch match;
   string expr = "^(abc|def)";
   regex rgx(code);
   while (regex_search(first, second, match,rgx))
    { // check range for match
    cout << match[0] << 'n' ;
    first += match.position() + match.length();
    }
   return 0;
   }

Example 19.2 Naive Search Doesn’t Work (regexiter/naive.cpp)

In this chapter, we look first at the complications that any repetitive search has to allow for and the techniques for fixing problems (Section 19.1). Then we look at prewritten solutions, in the form of the class template regex_iterator (Section 19.2) and the class template regex_token_iterator (Section 19.3).

19.1 Brute-ForceSearches

In Chapter 17 we looked at several flags that you can pass to the regular expression search functions to change the details of regular expression matching. Here, we look at some of those flags again but in the context of specific problems that arise in repetitive searches. Eventually, we’ll build a search function that avoids these problems; you can judge for yourself whether that’s a better approach than using the two forms of regular expression iterator that the TR1 library provides.

19.1.1 Lost Anchors

Earlier in this chapter, we looked at a naive search function that reported two matches when applying the regular expression “^(abc|def)“to the target text “abcdef“. The problem with simply repeating the same search at a new location in the target text, as that program did, is that on the second call to regex_search, the target text is passed, effectively, as “def“, which does match the regular expression. That is, we chopped off the start of the target text but didn’t tell the search function that we had done that, so it matched the “^”at the beginning of the regular expression to the beginning of the text that we passed, even though the text was not the beginning of the target text. The solution to this problem is simply to tell the search function that we’re not at the beginning of a line, so “^”shouldn’t match. To do that, we use the flag match_not_bol for all searches except the first.

#include <regex>
#include <iostream>
using std::tr1::regex;usingstd::tr1::cmatch;
using std::tr1::regex_search;
using namespace std::tr1::regex_constants;
using std::cout;

static void search(const char *tgt, const char *expr)
  {// show all subsequences of tgt that match expr
  regex rgx(expr);
  cmatch match;
  match_flag_typeflags = match_default;
  const char *first = tgt;
  const char *last = tgt + strlen(tgt);
  for (;;)
    { // keep trying 
     if (regex_search(first, last, match, rgx, flags))
         { // show match, move past it
         cout << match.str()
            << " at offset "
            << (match[0].first - tgt) << 'n' ;
         first += match.position() + match.length();
         flags = flags | match_not_bol;
         }
else
   break;
  }
}

int main()
   { // demonstrate use of match_not_bol
   const char *expr = "^(abc|def)";
   const char *tgt = "abcdef";
   search(tgt,expr);
   return 0;
}

Example 19.3 Preserving Anchors (regexiter/search1.cpp)

19.1.2 Lost Word Boundaries

The regular expression "babc” should match the target text “abcabc” in one place: the first occurrence of the character sequence “abc“. The second “abc” doesn’t match, because it doesn’t start at a word boundary. If you try the previous search function with this regular expression and target text, it will find two matches. The problem is similar to the one with lost anchors: When we restart the search after the first match, the regular expression engine treats the start of the text as a word boundary. You might be tempted to fix that with the same approach we used before, by adding the flag match_not_-bow after a successful match. But the two cases are different: A “^”can match only at the beginning of the original target text, so it’s okay to simply disallow that match once we’ve moved away from the beginning of the text. A word boundary can occur inside the target text as well as at the beginning, so we have to be careful to disable matching the beginning of a word only when we’re not at the beginning of a word. That can be done by checking whether the last character in a match can be in a word and, if so, prohibiting matching the beginning of a word on the next pass. That solves half the problem.

The other half of the problem occurs with a regular expression like “b3“, when matched against the target text “33“. The first “3“is at a word boundary, so it should match. The second “3“is not at a word boundary, so it should not match. But the previous version of search will find that the second one matches because in the target text that’s passed for the second search, it is at the beginning of the target text. So we also need to disable matching of the end of a word when the previous character cannot be in a word.

But there’s an easier way. The regular expression engine already knows how to identify characters that can be in a word, so we don’t need to write that logic ourselves. All we need to do is tell the engine that it can look at the character in front of the target text to decide whether it’s at the beginning of a word. That’s what the flag match_prev_avail does. Of course, we should do that only when we know that a valid character is in front of the target text. Once we’ve moved forward in the target text, we know that we can look behind the current position.

#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::cmatch;
using std::tr1::regex_search;
using namespace std::tr1::regex_constants;
using std::cout;

static void search(const char *tgt,const char *expr)
   { // show all subsequences of tgt that match expr
   regex rgx(expr);
   cmatch match;
   match_flag_type flags = match_default;
   const char *first = tgt;
   const char *last = tgt + strlen(tgt);
   for (;;)
     { // keep trying 
     if (regex_search(first, last, match, rgx,flags))
       {// show match, move past it
       cout << match.str()
         << " at offset "
         << (match[0].first-tgt) << 'n';
       first += match.position() + match.length();
       flags = flags | match_not_bol | match_prev_avail;
       }
     else
       break;
     }
}
int main()
  { // demonstrate use of match_not_bol
  const char *expr = "\babc";
  const char *tgt = "abcabc";
  search(tgt, expr);
  return 0;
  }

Example 19.4 Preserving Word Boundaries (regexiter/search2.cpp)

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read