CodeGuru content and product recommendations are
editorially independent. We may make money when you click on links
to our partners.
Learn More
Normally, when you search for a sub-string in a string, the match should be exact. So if we search for a sub-string “abc” then the string being searched should contain these exact letters in the same sequence for a match to be found. We can extend this kind of search to a case insensitive search where the sub-string “abc” will find strings like “Abc”, “ABC” etc. That is, the case is ignored but the sequence of the letters should be exactly the same. Sometimes, a case insensitive search is also not enough. For example, if we want to search for numeric digit, then we basically end up searching for each digit independantly. This is where regular expressions come in to our help.
Regular expressions are text patterns that are used for string matching. Regular expressions are strings that contains a mix of plain text and special characters to indicate what kind of matching to do. Here’s a very brief turorial on using regular expressions before we move on to the code for handling regular expressions.
Suppose, we are looking for a numeric digit then the regular expression we would search for is “[0-9]”. The brackets indicate that the character being compared should match any one of the characters enclosed within the bracket. The dash (-) between 0 and 9 indicates that it is a range from 0 to 9. Therefore, this regular expression will match any character between 0 and 9, that is, any digit. If we want to search for a special character literally we must use a backslash before the special character. For example, the single character regular expression “*” matches a single asterisk. In the table below the special characters are briefly described.
Character |
Description |
^ |
Beginning of the string. The expression "^A" will match an A only at the beginning of the string. |
^ |
The caret (^) immediately following the left-bracket ([) has a different meaning. It is used to exclude the remaining characters within brackets from matching the target string. The expression "[^0-9]" indicates that the target character should not be a digit. |
$ |
The dollar sign ($) will match the end of the string. The expression "abc$" will match the sub-string "abc" only if it is at the end of the string. |
| |
The alternation character (|) allows either expression on its side to match the target string. The expression "a|b" will match a as well as b. |
. |
The dot (.) will match any character. |
* |
The asterix (*) indicates that the character to the left of the asterix in the expression should match 0 or more times. |
+ |
The plus (+) is similar to asterix but there should be at least one match of the character to the left of the + sign in the expression. |
? |
The question mark (?) matches the character to its left 0 or 1 times. |
() |
The parenthesis affects the order of pattern evaluation and also serves as a tagged expression that can be used when replacing the matched sub-string with another expression. |
[] |
Brackets ([ and ]) enclosing a set of characters indicates that any of the enclosed characters may match the target character. |
The parenthesis, besides affecting the evaluation order of the regular expression, also serves as tagged expression which is something like a temporary memory. This memory can then be used when we want to replace the found expression with a new expression. The replace expression can specify a & character which means that the & represents the sub-string that was found. So, if the sub-string that matched the regular expression is “abcd”, then a replace expression of “xyz&xyz” will change it to “xyzabcdxyz”. The replace expression can also be expressed as “xyzxyz”. The “” indicates a tagged expression representing the entire sub-string that was matched. Similarly we can have other tagged expression represented by “1”, “2” etc. Note that although the tagged expression 0 is always defined, the tagged expression 1,2 etc. are only defined if the regular expression used in the search had enough sets of parenthesis. Here are few examples.
String |
Search |
Replace |
Result |
Mr. |
(Mr)(.) |
1s2 |
Mrs. |
abc |
(a)b(c) |
&-1-2 |
abc-a-c |
bcd |
(a|b)c*d |
&-1 |
bcd-b |
abcde |
(.*)c(.*) |
&-1-2 |
abcde-ab-de |
cde |
(ab|cd)e |
&-1 |
cde-cd |
The code for regular expression search is given below. This code was derived from work done by Henry Spencer. Besides adding a C++ wrapper around C code, a couple of functions were added to enable search and replace operations.
#define NSUBEXP 10
class CRegExp
{
public:
CRegExp();
~CRegExp();
CRegExp *RegComp( const TCHAR *re );
int RegFind(const TCHAR *str);
TCHAR* GetReplaceString( const TCHAR* sReplaceExp );
int GetFindLen()
{
if( startp[0] == NULL || endp[0] == NULL )
return 0;
return endp[0] - startp[0];
}
private:
TCHAR *regnext(TCHAR *node);
void reginsert(TCHAR op, TCHAR *opnd);
int regtry(TCHAR *string);
int regmatch(TCHAR *prog);
size_t regrepeat(TCHAR *node);
TCHAR *reg(int paren, int *flagp);
TCHAR *regbranch(int *flagp);
void regtail(TCHAR *p, TCHAR *val);
void regoptail(TCHAR *p, TCHAR *val);
TCHAR *regpiece(int *flagp);
TCHAR *regatom(int *flagp);
private:
TCHAR OP(TCHAR *p) {return *p;};
TCHAR *OPERAND( TCHAR *p) {return (TCHAR*)((short *)(p+1)+1); };
void regc(TCHAR b)
{
if (bEmitCode)
else
regsize++;
};
TCHAR * regnode(TCHAR op)
{
if (!bEmitCode) {
regsize += 3;
return regcode;
}
return regcode-3;
};
private:
BOOL bEmitCode;
BOOL bCompiled;
TCHAR *sFoundText;
TCHAR *startp[NSUBEXP];
TCHAR *endp[NSUBEXP];
TCHAR regstart;
TCHAR reganch;
TCHAR *regmust;
int regmlen;
TCHAR *program;
TCHAR *regparse;
int regnpar;
TCHAR *regcode;
TCHAR regdummy[3];
long regsize;
TCHAR *reginput;
TCHAR *regbol;
};
#include "stdafx.h"
#include "RegExp.h"
#define END 0
#define BOL 1
#define EOL 2
#define ANY 3
#define ANYOF 4
#define ANYBUT 5
#define BRANCH 6
#define BACK 7
#define EXACTLY 8
#define NOTHING 9
#define STAR 10
#define PLUS 11
#define OPEN 20
#define CLOSE 30
#define FAIL(m) { regerror(m); return(NULL); }
#define ISREPN(c) ((c) == _T('*') || (c) == _T('+') || (c) == _T('?'))
#define META "^$.[()|?+*\"
#define HASWIDTH 01
#define SIMPLE 02
#define SPSTART 04
#define WORST 0
CRegExp::CRegExp()
{
bCompiled = FALSE;
program = NULL;
sFoundText = NULL;
for( int i = 0; i < NSUBEXP; i++ )
{
startp[i] = NULL;
endp[i] = NULL;
}
}
CRegExp::~CRegExp()
{
delete program;
delete sFoundText;
}
CRegExp* CRegExp::RegComp(const TCHAR *exp)
{
TCHAR *scan;
int flags;
if (exp == NULL)
return NULL;
bCompiled = TRUE;
bEmitCode = FALSE;
regparse = (TCHAR *)exp;
regnpar = 1;
regsize = 0L;
regdummy[0] = NOTHING;
regdummy[1] = regdummy[2] = 0;
regcode = regdummy;
if (reg(0, &flags) == NULL)
return(NULL);
delete program;
program = new TCHAR[regsize];
memset( program, 0, regsize * sizeof(TCHAR) );
if (program == NULL)
return NULL;
bEmitCode = TRUE;
regparse = (TCHAR *)exp;
regnpar = 1;
regcode = program;
if (reg(0, &flags) == NULL)
return NULL;
regstart = _T('');
reganch = 0;
regmust = NULL;
regmlen = 0;
scan = program;
if (OP(regnext(scan)) == END)
{
scan = OPERAND(scan);
if (OP(scan) == EXACTLY)
regstart = *OPERAND(scan);
else if (OP(scan) == BOL)
reganch = 1;
if (flags&SPSTART)
{
char *longest = NULL;
size_t len = 0;
for (; scan != NULL; scan = regnext(scan))
if (OP(scan) == EXACTLY && _tcslen(OPERAND(scan)) >= len)
{
longest = OPERAND(scan);
len = _tcslen(OPERAND(scan));
}
regmust = longest;
regmlen = (int)len;
}
}
return this;
}
TCHAR *CRegExp::reg(int paren, int *flagp)
{
char *ret;
char *br;
char *ender;
int parno;
int flags;
if (paren)
{
if (regnpar >= NSUBEXP)
{
TRACE1("Too many (). NSUBEXP is set to %dn", NSUBEXP );
return NULL;
}
parno = regnpar;
regnpar++;
ret = regnode(OPEN+parno);
}
br = regbranch(&flags);
if (br == NULL)
return(NULL);
if (paren)
regtail(ret, br);
else
ret = br;
while (*regparse == _T('|')) {
regparse++;
br = regbranch(&flags);
if (br == NULL)
return(NULL);
regtail(ret, br);
}
ender = regnode((paren) ? CLOSE+parno : END);
regtail(ret, ender);
for (br = ret; br != NULL; br = regnext(br))
regoptail(br, ender);
if (paren && *regparse++ != _T(')'))
{
TRACE0("unterminated ()n");
return NULL;
}
else if (!paren && *regparse != _T(''))
{
if (*regparse == _T(')'))
{
TRACE0("unmatched ()n");
return NULL;
}
else
{
TRACE0("internal error: junk on endn");
return NULL;
}
}
return(ret);
}
TCHAR *CRegExp::regbranch(int *flagp)
{
TCHAR *ret;
TCHAR *chain;
TCHAR *latest;
int flags;
int c;
ret = regnode(BRANCH);
chain = NULL;
while ((c = *regparse) != _T('') && c != _T('|') && c != _T(')')) {
latest = regpiece(&flags);
if (latest == NULL)
return(NULL);
if (chain == NULL)
else
regtail(chain, latest);
chain = latest;
}
if (chain == NULL)
(void) regnode(NOTHING);
return(ret);
}
TCHAR *CRegExp::regpiece(int *flagp)
{
TCHAR *ret;
TCHAR op;
TCHAR *next;
int flags;
ret = regatom(&flags);
if (ret == NULL)
return(NULL);
op = *regparse;
if (!ISREPN(op)) {
return(ret);
}
if (!(flags&HASWIDTH) && op != _T('?'))
{
TRACE0("*+ operand could be emptyn");
return NULL;
}
switch (op) {
case _T('*'): *flagp = WORST|SPSTART; break;
case _T('+'): *flagp = WORST|SPSTART|HASWIDTH; break;
case _T('?'): *flagp = WORST; break;
}
if (op == _T('*') && (flags&SIMPLE))
reginsert(STAR, ret);
else if (op == _T('*')) {
reginsert(BRANCH, ret);
regoptail(ret, regnode(BACK));
regoptail(ret, ret);
regtail(ret, regnode(BRANCH));
regtail(ret, regnode(NOTHING));
} else if (op == _T('+') && (flags&SIMPLE))
reginsert(PLUS, ret);
else if (op == _T('+')) {
next = regnode(BRANCH);
regtail(ret, next);
regtail(regnode(BACK), ret);
regtail(next, regnode(BRANCH));
regtail(ret, regnode(NOTHING));
} else if (op == _T('?')) {
reginsert(BRANCH, ret);
regtail(ret, regnode(BRANCH));
next = regnode(NOTHING);
regtail(ret, next);
regoptail(ret, next);
}
regparse++;
if (ISREPN(*regparse))
{
TRACE0("nested *?+n");
return NULL;
}
return(ret);
}
TCHAR *CRegExp::regatom(int *flagp)
{
TCHAR *ret;
int flags;
switch (*regparse++) {
case _T('^'):
ret = regnode(BOL);
break;
case _T('$'):
ret = regnode(EOL);
break;
case _T('.'):
ret = regnode(ANY);
break;
case _T('['): {
int range;
int rangeend;
int c;
if (*regparse == _T('^')) {
ret = regnode(ANYBUT);
regparse++;
} else
ret = regnode(ANYOF);
if ((c = *regparse) == _T(']') || c == _T('-')) {
regc(c);
regparse++;
}
while ((c = *regparse++) != _T('') && c != _T(']')) {
if (c != _T('-'))
regc(c);
else if ((c = *regparse) == _T(']') || c == _T(''))
regc(_T('-'));
else
{
range = (unsigned) (TCHAR)*(regparse-2);
rangeend = (unsigned) (TCHAR)c;
if (range > rangeend)
{
TRACE0("invalid [] rangen");
return NULL;
}
for (range++; range <= rangeend; range++)
regc(range);
regparse++;
}
}
regc(_T(''));
if (c != _T(']'))
{
TRACE0("unmatched []n");
return NULL;
}
break;
}
case _T('('):
ret = reg(1, &flags);
if (ret == NULL)
return(NULL);
break;
case _T(''):
case _T('|'):
case _T(')'):
TRACE0("internal error: \0|) unexpectedn");
return NULL;
break;
case _T('?'):
case _T('+'):
case _T('*'):
TRACE0("?+* follows nothingn");
return NULL;
break;
case _T('\'):
if (*regparse == _T(''))
{
TRACE0("trailing \n");
return NULL;
}
ret = regnode(EXACTLY);
regc(*regparse++);
regc(_T(''));
break;
default: {
size_t len;
TCHAR ender;
regparse--;
len = _tcscspn(regparse, META);
if (len == 0)
{
TRACE0("internal error: strcspn 0n");
return NULL;
}
ender = *(regparse+len);
if (len > 1 && ISREPN(ender))
len--;
if (len == 1)
ret = regnode(EXACTLY);
for (; len > 0; len--)
regc(*regparse++);
regc(_T(''));
break;
}
}
return(ret);
}
void CRegExp::reginsert(TCHAR op, TCHAR *opnd)
{
TCHAR *place;
if (!bEmitCode) {
regsize += 3;
return;
}
(void) memmove(opnd+3, opnd, (size_t)((regcode - opnd)*sizeof(TCHAR)));
regcode += 3;
place = opnd;
}
void CRegExp::regtail(TCHAR *p, TCHAR *val)
{
TCHAR *scan;
TCHAR *temp;
if (!bEmitCode)
return;
for (scan = p; (temp = regnext(scan)) != NULL; scan = temp)
continue;
}
void CRegExp::regoptail(TCHAR *p, TCHAR *val)
{
if (!bEmitCode || OP(p) != BRANCH)
return;
regtail(OPERAND(p), val);
}
int CRegExp::RegFind(const TCHAR *str)
{
TCHAR *string = (TCHAR *)str;
TCHAR *s;
delete sFoundText;
sFoundText = NULL;
if(string == NULL)
{
TRACE0("NULL argument to regexecn");
return(-1);
}
if (!bCompiled)
{
TRACE0("No regular expression provided yet.n");
return(-1);
}
if (regmust != NULL && _tcsstr(string, regmust) == NULL)
return(-1);
regbol = string;
if (reganch)
{
if( regtry(string) )
{
sFoundText = new TCHAR[GetFindLen()+1];
sFoundText[GetFindLen()] = _T('');
_tcsncpy(sFoundText, string, GetFindLen() );
return 0;
}
return -1;
}
if (regstart != _T(''))
{
for (s = string; s != NULL; s = _tcschr(s+1, regstart))
if (regtry(s))
{
int nPos = s-str;
sFoundText = new TCHAR[GetFindLen()+1];
sFoundText[GetFindLen()] = _T('');
_tcsncpy(sFoundText, s, GetFindLen() );
return nPos;
}
return -1;
}
else
{
for (s = string; !regtry(s); s++)
if (*s == _T(''))
return(-1);
int nPos = s-str;
sFoundText = new TCHAR[GetFindLen()+1];
sFoundText[GetFindLen()] = _T('');
_tcsncpy(sFoundText, s, GetFindLen() );
return nPos;
}
}
int CRegExp::regtry(TCHAR *string)
{
int i;
TCHAR **stp;
TCHAR **enp;
reginput = string;
stp = startp;
enp = endp;
for (i = NSUBEXP; i > 0; i--)
{
}
if (regmatch(program))
{
startp[0] = string;
endp[0] = reginput;
return(1);
}
else
return(0);
}
int CRegExp::regmatch(TCHAR *prog)
{
TCHAR *scan;
TCHAR *next;
for (scan = prog; scan != NULL; scan = next) {
next = regnext(scan);
switch (OP(scan)) {
case BOL:
if (reginput != regbol)
return(0);
break;
case EOL:
if (*reginput != _T(''))
return(0);
break;
case ANY:
if (*reginput == _T(''))
return(0);
reginput++;
break;
case EXACTLY: {
size_t len;
TCHAR *const opnd = OPERAND(scan);
if (*opnd != *reginput)
return(0);
len = _tcslen(opnd);
if (len > 1 && _tcsncmp(opnd, reginput, len) != 0)
return(0);
reginput += len;
break;
}
case ANYOF:
if (*reginput == _T('') ||
_tcschr(OPERAND(scan), *reginput) == NULL)
return(0);
reginput++;
break;
case ANYBUT:
if (*reginput == _T('') ||
_tcschr(OPERAND(scan), *reginput) != NULL)
return(0);
reginput++;
break;
case NOTHING:
break;
case BACK:
break;
case OPEN+1: case OPEN+2: case OPEN+3:
case OPEN+4: case OPEN+5: case OPEN+6:
case OPEN+7: case OPEN+8: case OPEN+9: {
const int no = OP(scan) - OPEN;
TCHAR *const input = reginput;
if (regmatch(next)) {
if (startp[no] == NULL)
startp[no] = input;
return(1);
} else
return(0);
break;
}
case CLOSE+1: case CLOSE+2: case CLOSE+3:
case CLOSE+4: case CLOSE+5: case CLOSE+6:
case CLOSE+7: case CLOSE+8: case CLOSE+9: {
const int no = OP(scan) - CLOSE;
TCHAR *const input = reginput;
if (regmatch(next)) {
if (endp[no] == NULL)
endp[no] = input;
return(1);
} else
return(0);
break;
}
case BRANCH: {
TCHAR *const save = reginput;
if (OP(next) != BRANCH)
next = OPERAND(scan);
else {
while (OP(scan) == BRANCH) {
if (regmatch(OPERAND(scan)))
return(1);
reginput = save;
scan = regnext(scan);
}
return(0);
}
break;
}
case STAR:
case PLUS: {
const TCHAR nextch =
(OP(next) == EXACTLY) ? *OPERAND(next) : _T('');
size_t no;
TCHAR *const save = reginput;
const size_t min = (OP(scan) == STAR) ? 0 : 1;
for (no = regrepeat(OPERAND(scan)) + 1; no > min; no--) {
reginput = save + no - 1;
if (nextch == _T('') || *reginput == nextch)
if (regmatch(next))
return(1);
}
return(0);
break;
}
case END:
return(1);
break;
default:
TRACE0("regexp corruptionn");
return(0);
break;
}
}
TRACE0("corrupted pointersn");
return(0);
}
size_t CRegExp::regrepeat(TCHAR *node)
{
size_t count;
TCHAR *scan;
TCHAR ch;
switch (OP(node))
{
case ANY:
return(_tcslen(reginput));
break;
case EXACTLY:
ch = *OPERAND(node);
count = 0;
for (scan = reginput; *scan == ch; scan++)
count++;
return(count);
break;
case ANYOF:
return(_tcsspn(reginput, OPERAND(node)));
break;
case ANYBUT:
return(_tcscspn(reginput, OPERAND(node)));
break;
default:
TRACE0("internal error: bad call of regrepeatn");
return(0);
break;
}
}
TCHAR *CRegExp::regnext(TCHAR *p)
{
const short &offset = *((short*)(p+1));
if (offset == 0)
return(NULL);
return((OP(p) == BACK) ? p-offset : p+offset);
}
TCHAR* CRegExp::GetReplaceString( const TCHAR* sReplaceExp )
{
TCHAR *src = (TCHAR *)sReplaceExp;
TCHAR *buf;
TCHAR c;
int no;
size_t len;
if( sReplaceExp == NULL || sFoundText == NULL )
return NULL;
int replacelen = 0;
while ((c = *src++) != _T(''))
{
if (c == _T('&'))
no = 0;
else if (c == _T('\') && isdigit(*src))
no = *src++ - _T('0');
else
no = -1;
if (no < 0)
{
if (c == _T('\') && (*src == _T('\') || *src == _T('&')))
c = *src++;
replacelen++;
}
else if (startp[no] != NULL && endp[no] != NULL &&
endp[no] > startp[no])
{
len = endp[no] - startp[no];
replacelen += len;
}
}
buf = new TCHAR[replacelen+1];
if( buf == NULL )
return NULL;
TCHAR* sReplaceStr = buf;
buf[replacelen] = _T('');
src = (TCHAR *)sReplaceExp;
while ((c = *src++) != _T(''))
{
if (c == _T('&'))
no = 0;
else if (c == _T('\') && isdigit(*src))
no = *src++ - _T('0');
else
no = -1;
if (no < 0)
{
if (c == _T('\') && (*src == _T('\') || *src == _T('&')))
c = *src++;
}
else if (startp[no] != NULL && endp[no] != NULL &&
endp[no] > startp[no])
{
len = endp[no] - startp[no];
int tagpos = startp[no] - startp[0];
_tcsncpy(buf, sFoundText + tagpos, len);
buf += len;
}
}
return sReplaceStr;
}
Here’s a function that will do global search and replace using regular expressions. Note that the CStringEx class described in the earlier section is being used here. The main reason for using CStringEx is that it provides the Replace() function which makes our task easier.
int RegSearchReplace( CStringEx& string, LPCTSTR sSearchExp,
LPCTSTR sReplaceExp )
{
int nPos = 0;
int nReplaced = 0;
CRegExp r;
LPTSTR str = (LPTSTR)(LPCTSTR)string;
r.RegComp( sSearchExp );
while( (nPos = r.RegFind((LPTSTR)str)) != -1 )
{
nReplaced++;
TCHAR *pReplaceStr = r.GetReplaceString( sReplaceExp );
int offset = str-(LPCTSTR)string+nPos;
string.Replace( offset, r.GetFindLen(),
pReplaceStr );
str = (LPTSTR)(LPCTSTR)string + offset + _tcslen(pReplaceStr);
delete pReplaceStr;
}
return nReplaced;
}