An alternative Regular Expression Class

WEBINAR: On-demand webcast

How to Boost Database Development Productivity on Linux, Docker, and Kubernetes with Microsoft SQL Server 2017 REGISTER >

This is another regular expression library. Like Zafir's it is based upon the work of Henry Spencer. I started using this a long time ago and called my class Regexp (rather than CRegExp). Actually I prefer Zafir's name but I have too much code using the other name to want to change it, so right now my class is called Regexp (change it if you like).

So why put up another version? I hear you ask. Well the two classes took the same base code and then developed to solve different problems. CRegExp is geared to Search and Replace operations whereas Regexp was written to simplify tokenisation. I wanted a class that could be given a 'program' and from that, return specific substrings from it's input. Regular expressions may not be the fastest way to parse input (though with careful anchoring they can be made so that they fail quickly if they are going to) but once you have a working library they do allow for fairly rapid coding. On the whole this is good enough, worry about making it faster once you have it working and actually know that your optimization effort isn't going unnoticed.

For example:

Regexp re( "^[\t ]*(.*)[\t ]*\\((.*)\\)" );
CString str( "wyrdrune.com!kelly (Kelly)\n" );
CString name, addr;

if ( re.Match( str ) && re.SubStrings() == 2 )
{
	name = re[2];
	addr = re[1];
}

Will give:

name == "Kelly" and addr == "wyrdrune.com!kelly"

If you decompose the regular expression you get:

^ Beginning of line anchor.
[\t ]* Any amount (that is zero or more characters) of tabs or spaces.
(.*) Field 1: A tagged expression matching any string of characters – this will be the longest string that will still allow the rest of the pattern to match.
[\t ]* Any amount of tabs or spaces.
\\( An escaped open parenthesis. The double slash is a C/C++ convention since this is the escape character and we want a literal slash to be passed through to the regular expression code. If the user were typing this sort of thing into your program they would only enter one slash. We escape the parenthesis so that it doesn’t get interpreted as a regular expression special character.
(.*) Field 2: A tagged expression matching any string of characters.
\\) An escaped closing parenthesis.

BTW: the phrase tagged regular expression refers to any part of the regular expression that is, because it was surrounded by parenthesis, accessible after a match has been made as a separate substring.  See here for more information about Regular Expression syntax.

In English, we are looking for two fields. The first will be all characters from the start of the line through to the second field (without any surrounding white space), and the second will be all characters within parenthesis following the first field.

The Class

The library itself comes as two source files, Regexp.cpp and Regexp.h. The header defines the Regexp class with the following members:

Regexp::NSUBEXP

A constant defining how many subexpressions that the library will support (usually 10), attempting to use a regular expression with more than this number will generate an error.

Regexp::Regexp()

A boring constructor, this must be initialized by assignment before anything useful can be done with it.

Regexp::Regexp( TCHAR * exp, BOOL iCase = 0 )

exp :

The regular expression itself, this format of which is defined later. The success or failure of the compilation can be discovered by using either GetErrorString() or CompiledOK().

iCase:

If TRUE the regular expression is compiled so that differences in case are ignored when matching.

Regexp::Regexp( const Regexp &r )

Construct a new regular expression taking the compiled form from another Regexp.

const Regexp::Regexp & operator=( const Regexp & r );

Assign Regexp r to the current object.

bool Regexp::Match( const TCHAR * s );

Examine the TCHAR array s with this regular expression, returning true if there is a match. This match updates the state of this Regexp object so that the substrings of the match can be obtained. The 0th substring is the substring of string that matched the whole regular expression. The others are those substrings that matched parenthesized expressions within the regular expression, with parenthesized expressions numbered in left-to-right order of their opening parentheses. If a parenthesized expression does not participate in the match at all, its length is 0. It is an error if this Regexp has not been successfully initialized.

int Regexp::SubStrings() const;

Return the number of substrings found after a successful Match().

const CString Regexp::operator[]( unsigned int i ) const;

Return the ith matched substring after a successful Match().

int Regexp::SubStart( unsigned int i ) const;

Return the starting offset of the ith matched substring from the beginning of the TCHAR array used in Match().

int Regexp::SubLength( unsigned int i ) const;

Return the length of the ith matched substring

Using the same example Regexp as before:

Regexp re( "^[\t ]*(.*)[\t ]*\\((.*)\\)" );
CString str( "wyrdrune.com!kelly (Kelly)\n" );
if ( re.Match( str ) && re.SubStrings() == 2 )
{
	ASSERT( re.SubStart(0) == 0 );
	ASSERT( re.SubLength(0) == 26 );

	ASSERT( re.SubStart(1) == 0 );
	ASSERT( re.SubLength(1) == 19 );

	ASSERT( re.SubStart(2) == 20 );
	ASSERT( re.SubLength(2) == 5 );
}

CString Regexp::GetReplaceString( LPCTSTR source ) const;

After a successful Match you can retrieve a replacement string as an alternative to building up the various substrings by hand.

Each character in the source string will be copied to the return value except for the following special characters:

&   The complete matched string (sub-string 0).
\1  Sub-string 1
... and so on until...
\9 Sub-string 9

So, taking the now ubiquitous example:

CString repl = re.GetReplacementString( "\2 == \1" );

Will give:

repl == "Kelly == wyrdrune.com!kelly";

As an implementation note: the CRegExp version of a similarly named function returned a newly allocated pointer array. Whilst this is efficient, it puts the onus upon the user of the class to delete it (correctly, with delete [] ) after it’s done with. Considering how the reference counting is implemented in the MFC CString class, passing CStrings around on the stack isn’t that expensive, the allocation only happens when the string data is initially allocated, with the ownership of the actual string data being handed from one CString instance to another as needed. Finally when the CString goes out of scope the data is deleted. This is efficient, and much more robust than having to keep track of which functions are allocators and which ones are not.

CString Regexp::GetErrorString() const;

Return a description of the most recent error caused on this Regexp. Errors include, but are not limited to, various forms of compilation errors, usually syntax errors, and calling Match when the Regexp hasn’t been initialized correctly (or at all). There are a fair number of these that should never occur if all of the Regexp use comes from your code, but where the user can type in regular expressions that you then have to compile, checking this can be very important.

bool Regexp::CompiledOK() const;

Return the status of the last regular expression compilation.

Regular Expression Syntax

A regular expression is zero or more branches, separated by '|'. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by '*', '+', or '?'. An atom followed by '*' matches a sequence of 0 or more matches of the atom. An atom followed by '+' matches a sequence of 1 or more matches of the atom. An atom followed by '?' matches a match of the atom, or the null string.

An atom is a regular expression in parentheses (matching a match for the regular expression), a range (see below), '.' (matching any single character), '^' (matching the null string at the beginning of the input string), '$' (matching the null string at the end of the input string), a '\' followed by a single character (matching that character), or a single character with no other significance (matching that character).

A range is a sequence of characters enclosed in '[]'. It normally matches any single character from the sequence. If the sequence begins with '^', it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by '-', this is shorthand for the full list of ASCII characters between them (e.g. '[0-9]' matches any decimal digit). To include a literal ']' in the sequence, make it the first character (following a possible '^'). To include a literal '-', make it the first or last character.

Ambiguity

If a regular expression could match two different parts of the input string, it will match the one which begins earliest. If both begin in the same place but match different lengths, or match the same length in different ways, life gets messier, as follows.

In general, the possibilities in a list of branches are considered in left-to-right order, the possibilities for '*', '+', and '?' are considered longest-first, nested constructs are considered from the outermost in, and concatenated constructs are considered leftmost-first. The match that will be chosen is the one that uses the earliest possibility in the first choice that has to be made. If there is more than one choice, the next will be made in the same manner (earliest possibility) subject to the decision on the first choice. And so forth.

For example, '(ab|a)b*c' could match 'abc' in one of two ways. The first choice is between 'ab' and 'a'; since 'ab' is earlier, and does lead to a successful overall match, it is chosen. Since the 'b' is already spoken for, the 'b*' must match its last possibility--the empty string--since it must respect the earlier choice.

In the particular case where the regular expression does not use `|' and does not apply `*', `+', or `?' to parenthesized subexpressions, the net effect is that the longest possible match will be chosen. So `ab*', presented with `xabbbby', will match `abbbb'. Note that if `ab*' is tried against `xabyabbbz', it will match `ab' just after `x', due to the begins-earliest rule. (In effect, the decision on where to start the match is the first choice to be made, hence subsequent choices must respect it even if this leads them to less-preferred alternatives.)

The Source

The accompanying archive contains the regexp library, as well as two separate test programs.

The first (originally enough called Test1) is a C++ port of the original test program that came with the C code. I’ve updated it to use the C++ constructs that the new library exposes. It acts as a useful sanity check and regression test when I’ve been modifying the source.

The second test is much simpler and uses the libraries substring extraction function to chop fields out of an email header, this is less of a test program and more of a simple sample.

Download Source.

A Note about Character Size

This code (and the samples) work and have been tested pretty thoroughly under Single Byte Character Sets (SBCS) and UNICODE. It will NOT work under Multi Byte Character Sets (MBCS), though it will compile which is very misleading. The problem (for anyone interested in fixing it) is that the internal representation of the ‘program’ requires a fixed size character, it manipulates this using memcpy() and memmove() without any knowledge of whether a particular element in it’s array is some internal code or a character. Making this use variable width characters would be a real pain since much more of the code would have to decode the program itself in order to determine whether a specific point in the program was looking at a operator or part of a character. Certainly this is doable, but it is more work than I want right now. The code works under UNICODE and that’s good enough for me. BTW even if the code is compiled with _MBCS it will only fail when it’s actually presented with multi-byte text, it’ll work just fine with 8-bit ASCII.

 



Comments

  • It is very good but...

    Posted by DatVT81 on 05/15/2005 11:14pm

    I have used this class and it did very well.But when I try to paser with more syntax(for example \d,\s....) it did not return output string value. How do I do?Can you teach me? Thanks,

    Reply
  • Inconsistent dll linkage

    Posted by Legacy on 09/17/2002 12:00am

    Originally posted by: Jim Willsher

    Ca anyone tell me why I get 16 errors upon compilation, all referring to inconsistent dll linkage?:

    C:\Test\Regexp.cpp(1454) : warning C4273: 'Regexp::Regexp' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1460) : warning C4273: 'Regexp::Regexp' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1466) : warning C4273: 'Regexp::Regexp' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1475) : warning C4273: '=' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1492) : warning C4273: 'Regexp::~Regexp' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1498) : warning C4273: 'Match' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1525) : warning C4273: 'GetReplaceString' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1535) : warning C4273: 'SubStrings' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1546) : warning C4273: 'SubStart' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1557) : warning C4273: 'SubLength' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1571) : warning C4273: 'CompiledOK' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1588) : warning C4273: 'safeIndex' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1593) : warning C4273: '[]' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1729) : warning C4273: 'GetErrorString' : inconsistent dll linkage. dllexport assumed.
    C:\Test\Regexp.cpp(1736) : warning C4273: 'ClearErrorString' : inconsistent dll linkage. dllexport assumed.

    I can't see anything that declares dllexport linkage in the code.

    Many thanks,

    Jim

    • inconsistent dll linkage

      Posted by avsam on 11/01/2005 12:05am

      Search AFX_EXT_CLASS and remove from the regexp.h header file. class AFX_EXT_CLASS Regexp

      Reply
    Reply
  • Case-Insensitive match problems

    Posted by Legacy on 05/15/2002 12:00am

    Originally posted by: Tim Slattery

    This RE class implements its case-insensitive match by preprocessing the pattern. The preprocess step simply finds alpha characters in the pattern and replaces them with a class containing both upper and lower case versions of the character. So "abc*" becomes "[Aa][Bb][Cc]*". The preprocessor does not make the transformation for alphas within square brackets, to avoid nesting classes.

    The most obvious consequence of this method is that alpha characters that are part of a class will never be matched without case sensitivity. The other consequence is more subtle.

    I was parsing a file that looks like a Microsoft *.ini file. It's divided into classes with lines that look like this:

    [classname]

    The "classname' should not be sensitive to case. To find a class named "Keep", my pattern looked like this:

    \[Keep\]

    The square brackets are escaped, so they are interpreted as literals. Even though I set the case-insensitive flag, the RE class did a case sensitive match, because the alpha string appears between square brackets! The preprocessor doesn't note that the square brackets have been escaped.

    Reply
  • Matching more than 1 email address

    Posted by Legacy on 02/24/2002 12:00am

    Originally posted by: Val

    I am trying to match more than 1 email addresses in test2.
    
    But if the email addresses are put together then can work. If they are all jumbled up in the string header it doesnt match. Did someone come up with a solution to this ?


    void main()
    {
    CString Header =

    _T( "From kelly Mon Jul 26 15:08:25 0600 1993 remote from wyrdrune.com\n" )
    _T( "Message-ID: <9307261508.AA01287@wyrdrune.com>\n" )
    _T( "Received: from if.com by wyrdrune.com; Mon, 26 Jul 93 15:07 MDT\n" )
    _T( "Subject: Info request\n")
    _T( "To: wyrdrune.demon.co.uk!guy\n " )
    _T( "Date: Mon, 26 Jul 1993 15:08:25 -0600 (MDT)\n" )
    _T( "From:wyrdrune.com!Kelly ( Kelly )\n" )
    _T( "In-Reply-To: <8888snx@wyrdrune.demon.co.uk> from \"Guy Gascoigne - Piggford\" at Jul 23, 93 10:03:04 pm\n" )
    _T( "Reply-To: kelly@wyrdrune.com\n" )
    _T( "Return-Path: kelly@wyrdrune.com\n" )
    _T( "X-Mailer: ELM [version 2.4 PL21]\n" )
    _T( "Content-Type: text\n" )
    _T( "Address: aa@hotmail.com ")
    _T( "Address: bb@pacific.com ")
    _T( "Subject: From Val\n" )
    _T( "Content-Length: 55706\n" )
    _T( "Address: aa@hotmail.com ")
    _T( "Address: bb@pacific.com ")
    _T( "Address: cc@swirve.com ");


    Regexp reAddress=_T( "Address: *(^\n)" );

    reAddress.Match( Header );

    printMatch(_T("Address"),reAddress);


    Val

    Reply
  • Linking errors

    Posted by Legacy on 10/04/2001 12:00am

    Originally posted by: Daniel

    Afhter including Regexp.h these two linking errors ocurred:

    error LNK2001: unresolved external symbol "__declspec(dllimport) public: __thiscall Regexp::~Regexp(void)" (__imp_??1Regexp@@QAE@XZ)

    error LNK2001: unresolved external symbol "__declspec(dllimport) public: __thiscall Regexp::Regexp(void)" (__imp_??0Regexp@@QAE@XZ)


    If you could help me i would apreciate, please it's urgent.
    thanx...

    Reply
  • Fix for changes by Ed Maher

    Posted by Legacy on 07/16/2001 12:00am

    Originally posted by: Ximon Eighteen

    When ESM made his changes to work around word alignment
    
    issues on some architectures he introduced a bug concerning
    maximum compiled code size. After his fix it is not
    possible to compile a program that would have a code size
    larger than 256 bytes. The fix is I have implemented is as
    follows :-

    In CRepProgramAccessor::regnext ...

    ESM's code:-
    short offset = (short)(((short)*(p+2)) + (256*((short)*(p+1))));

    Replacement code:-
    short offset = (short)(((short)(*(((unsigned char*)p)+1))<<8)+((short)(*(((unsigned char *)p)+2))));

    And in CRegCompiler::regtail ...

    ESM's code:-
    *(scan+1)=(char)(scanvalue / 256); //MSB
    *(scan+2)=(char)(scanvalue % 256); //LSB

    Replacement code:-
    *(scan+1)=(unsigned char)(scanvalue / 256); //MSB
    *(scan+2)=(unsigned char)(scanvalue % 256); //LSB

    This has been working without a problem now for a short
    while.

    Reply
  • Porting to UNIX systems with word-alignment requirements

    Posted by Legacy on 05/30/2001 12:00am

    Originally posted by: Ed Maher


    One small niggle, myself and a colleague ported the code to run under HP-UX, and found that we kept getting a Bus Error
    in regnext, because an arbitrary pointer (not necessarily word-aligned) was being chaged to be a pointer to a short, and then used to access the value as a short:

    const short &offset = *((short*)(p+1));

    Accessing 'offset' when p+1 is not word aligned causes the Bus Error. Word alignment seems to be required when manipulating integers on this architecture.

    I replaced the code with explicit access to each character, as below:

    static inline char* regnext( char* p )
    {
    //ESM 20010330 Casting pointers, and using them is not permitted under some Unix
    //ESM 20010330 architectures that are fussy about alignment
    //ESM 20010330 -- const short &offset = *((short*)(p+1));
    short offset = (short)(((short)*(p+2)) + (256*((short)*(p+1))));

    if (offset == 0)
    return(NULL);

    return((OP(p) == BACK) ? p-offset : p+offset);
    }

    I also changed the regtail function to explicitly store the data in the same order as it is retrieved, so there are no problems moving between big and little endian architectures:


    void CRegCompiler::regtail(char* p, char* val)
    {
    char* scan;
    char* temp;

    // Find last node.
    for (scan = p; (temp = regnext(scan)) != NULL; scan = temp)
    continue;

    //ESM 20010330 Casting pointers, and using them is not permitted under some Unix
    //ESM 20010330 architectures that are fussy about alignment
    //ESM 20010330 -- *((short *)(scan+1)) = (short)((OP(scan) == BACK) ? scan - val : val - scan);

    int scanvalue=0;
    if (OP(scan) == BACK)
    {
    scanvalue=(scan - val);
    }
    else
    {
    scanvalue=(val - scan);
    }

    *(scan+1)=(char)(scanvalue / 256); //MSB
    *(scan+2)=(char)(scanvalue % 256); //LSB
    Values=" << (int) *(scan+1) << "," << (int) *(scan+2) << endl;
    }

    Reply
  • Test1 Unicode command line parameter fix

    Posted by Legacy on 05/24/2000 12:00am

    Originally posted by: Julian W. Girouard Jr.

    First off, let me just say I love this Regexp class - very useful!
    
    

    To address the problem of Unicode command line parameters (for example, in the Test1 program), it looks like you have to use wmain instead of main if you want to access unicode command line parameters using argv. This is easy to accomplish by changing this line of code:

    void main (int argc, TCHAR ** argv)

    to this ugly thing:

    void
    #ifdef _UNICODE
    wmain
    #else
    main
    #endif
    (int argc, TCHAR ** argv)


    Works for me! :)

    -Julian
    May your RAM and good logic never fail you!

    Reply
  • Greedy expressions

    Posted by Legacy on 05/12/2000 12:00am

    Originally posted by: Dmitry

    Is there any way to make this class not be greedy?

    Reply
  • Slight leak detected

    Posted by Legacy on 02/17/2000 12:00am

    Originally posted by: Bill Bell

    I appreciate having the use of 'regexp'. However, I seem to have detected a memory leak.
    
    

    The following code is a console app generated using the VC++6 wizard and modified slightly to use one of the examples in the doc. I've pared it down almost as much as possible. When run in debug mode an assertion fails. Remove the line containing 're[1]' and 're[2]' (or eliminate debug calls) and all appears well.

    Again, my thanks for the use of this code.

    (Following stuff appears in output window:

    "Detected memory leaks!
    Dumping objects ->
    strcore.cpp(118) : {44} normal block at 0x00A71DF0, 40 bytes long.
    Data: < wyrd> 01 00 00 00 1B 00 00 00 1B 00 00 00 77 79 72 64
    Object dump complete.
    The thread 0xFFC59745 has exited with code 3 (0x3).
    The program 'C:\AGENTPROJECT\PWAFilter\try regexp\Debug\try regexp.exe' has exited with code 3 (0x3).")

    // try regexp.cpp : Defines the entry point for the console application.
    //

    #include "stdafx.h"
    #include "try regexp.h"
    #include "regexp.h"

    #ifdef _DEBUG
    #define new DEBUG_NEW
    #undef THIS_FILE
    static char THIS_FILE[] = __FILE__;
    #endif

    /////////////////////////////////////////////////////////////////////////////
    // The one and only application object

    CWinApp theApp;

    using namespace std;

    int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
    {
    int nRetCode = 0;

    // initialize MFC and print and error on failure
    if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0))
    {
    // TODO: change error code to suit your needs
    cerr << _T("Fatal Error: MFC initialization failed") << endl;
    nRetCode = 1;
    }
    else
    {

    Regexp re( "^[\t ]*(.*)[\t ]*\\((.*)\\)" );
    CString str( "wyrdrune.com!kelly (Kelly)\n" );
    char c;

    if ( re.Match( str ) && re.SubStrings() == 2 ) {
    cout << "name: " << ( LPCTSTR ) re[2] << ", addr: " << ( LPCTSTR ) re[1] << endl;
    }

    scanf ( "%s", &c );

    }

    return nRetCode;
    }

    Reply
  • Loading, Please Wait ...

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • As all sorts of data becomes available for storage, analysis and retrieval - so called 'Big Data' - there are potentially huge benefits, but equally huge challenges...
  • The agile organization needs knowledge to act on, quickly and effectively. Though many organizations are clamouring for "Big Data", not nearly as many know what to do with it...
  • Cloud-based integration solutions can be confusing. Adding to the confusion are the multiple ways IT departments can deliver such integration...

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date