Using Regular Expressions to Parse for E-Mail Addresses

This final installment in my series on using the .NET regular expressions classes from Managed C++ builds on the previous installments to create production-quality regular expression patterns for validating e-mail addresses and for parsing bodies of text for all of the e-mail addresses they contain. The first section begins with a basic pattern that, while not all-encompassing, matches the majority of e-mail addresses in a supplied input string. The remainder of the column presents two more complex patterns that catch almost any e-mail address format, and it explains the components of each pattern.

Basic E-Mail Pattern

First, examine a generic function—GetEmailAddresses—that takes as its only argument an input string and returns an array of found e-mail addresses. The e-mail regular expression pattern utilized in this function is very basic, but it “catches” the majority of e-mails you run across:

using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;
using namespace System::Collections;

...

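ArrayList* GetEmailAddresses(String* input)
{
  // Created outside the try block so that every code path returns a list
  ArrayList* al = new ArrayList();

  try
  {
    // Find every substring that matches the basic e-mail pattern.
    // Note the escaped period (\\.): an unescaped period would match
    // any character, not just a literal period.
    MatchCollection* mc =
      Regex::Matches(input,
                     S"[\\w]+@[\\w]+\\.[\\w]{2,3}");

    // Copy the text of each match into the result list
    for (int i=0; i < mc->Count; i++)
      al->Add(mc->Item[i]->Value);
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }

  // Empty if nothing matched or if an exception was caught
  return al;
}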

The first thing the GetEmailAddresses function does is construct an ArrayList object. This object is returned to the caller of the function, and it holds all of the e-mail addresses located in the passed input string. From there, the static Regex::Matches method is called with the desired e-mail pattern (which I’ll cover shortly). The result of the Matches method is a MatchCollection. The MatchCollection object is then enumerated with a for loop, each “match” (representing an e-mail address) is added to the ArrayList object, and finally the ArrayList is returned to the caller.

The GetEmailAddresses function can be used as follows where the returned ArrayList object is enumerated and each e-mail address is written to the console:

ArrayList* addrs =
  GetEmailAddresses(S"I can be reached at tom@archerconsultinggroup.com "
                    S"or info@archerconsultinggroup.com.");

for (int i = 0; i < addrs->Count; i++)
{
  Console::WriteLine(addrs->Item[i]);
}

The pattern used in the GetEmailAddresses function correctly yields the two addresses I specified in the sample call. The following is the pattern itself (with the double backslashes replaced by single backslashes, as that’s specific to C++ and not really part of the actual pattern):

[\w]+@[\w]+\.[\w]{2,3}

If you’ve read the previous installments of this series, you should be able to read this pattern. Here’s a breakdown of each component of the pattern:

  • [\w]+: The \w represents any word character (A-Z, a-z, 0-9, and the underscore). The plus modifier specifies one or more, so [\w]+ tells the parser to match on one or more word characters. The brackets are optional here; \w+ is equivalent.
  • @: Tells the parser to look for the at-sign (@) character.
  • [\w]+: As before, tells the parser to match on one or more word characters.
  • \.: Match on the period character. The period must be escaped with a backslash, because an unescaped period is a metacharacter that matches any character.
  • [\w]{2,3}: Once again, the pattern specifies a search for word characters, with the difference here being that the {x,y} modifier tells the parser to search for a specific number of word characters (two or three, in this case).
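
To see why the period deserves attention, here is a minimal sketch in standard C++ that uses std::regex rather than the .NET Regex class, so treat it as an illustration and not as the Managed C++ code from this column. The helper name find_all and the sample input are made up for this example. It runs the basic pattern twice, once with an unescaped period and once with an escaped one:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern`
// in `input`, in the order the matches occur.
std::vector<std::string> find_all(const std::string& input,
                                  const std::string& pattern)
{
    const std::regex re(pattern);   // default (ECMAScript) syntax
    std::vector<std::string> found;

    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it)
    {
        found.push_back(it->str()); // text of the whole match
    }
    return found;
}

// Raw string literals avoid the doubled backslashes.
const char* const kUnescapedDot = R"([\w]+@[\w]+.[\w]{2,3})";
const char* const kEscapedDot   = R"([\w]+@[\w]+\.[\w]{2,3})";
```

With the input "meet me at 5@noon today", the unescaped form reports 5@noon tod as an address because the period matches the space, while the escaped form finds nothing. Both forms find the two addresses in the sample sentence used earlier.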

Figure 1. Example of Parsing a Body of Text for E-Mail Address & Domain Information

Advanced E-Mail Regular Expressions Pattern

While the previous e-mail pattern catches most e-mail addresses, it is far from complete. This section shows, one step at a time, how to build a much more robust e-mail pattern that will catch just about every valid e-mail address format. To begin with, the following pattern catches “exact matches.” In other words, you shouldn’t use it to parse a document, but rather to validate a single e-mail address:

^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$

Personally, I find it easier to read a pattern by dissecting it into components and then attempting to understand each of the components as they relate to the overall pattern. Having said that, this pattern breaks down to the following parts:

  • ^: The caret character at the beginning of a pattern tells the parser to match from the beginning of the input, because the focus of this pattern is to validate a single e-mail address.
  • [^@]+: When the caret character is the first character within brackets, it tells the parser to search for everything that is not one of the specified characters. Therefore, here the pattern specifies a search for all text that is not an at-sign (@) character. (The plus sign tells the parser to find one or more of these non-at-sign characters leading up to the next component of the expression.) Because periods are not at-signs, this part also accepts a dotted name such as tom.archer.
  • @: Match on the at-sign literal.
  • ([-\w]+\.)+: This part of the pattern matches everything from the @ up to the top-level domain (for example, .com, .edu, and so on). It is repeated because a domain name can contain several period-separated labels, as in tom@mail.archerconsultinggroup.com. The first part, [-\w]+, tells the parser to find one or more word characters or dashes. The “\.” matches the literal period that ends each label. Finally, all of that is placed within parentheses and modified with the plus operator to specify one or more instances of the entire group.
  • [A-Za-z]{2,4}$: Matches the terminating part of the expression, the top-level domain. By now, this part of the pattern should be easy to read: it matches between two and four letters. The $ character tells the parser that these letters must be at the end of the input string. (In other words, $ denotes end of input, compared with ^, which denotes beginning of input.)

In order to test for “direct matches,” you need a very simple function like the following:

using namespace System::Text::RegularExpressions;
...
bool ValidateEmailAddressFormat(String* email)
{
  Regex* rex =
    new Regex(S"^[^@]+@([-\\w]+\\.)+[A-Za-z]{2,4}$");
  return rex->IsMatch(email);
}

You then can call this function like this:

bool b;

// SUCCESS
b = ValidateEmailAddressFormat("tom.archer@archerconsultinggroup.com");

// FAILURE!!
b = ValidateEmailAddressFormat("tom.archerarcherconsultinggroup.com");
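
The validation pattern is easier to judge when you run it against a few edge cases. The following is a rough sketch in standard C++ with std::regex, not the .NET classes, and the function name validate_email_format is a hypothetical stand-in for ValidateEmailAddressFormat. The pattern itself is the one shown above:

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for ValidateEmailAddressFormat.
// Returns true when the whole string has the shape of an e-mail address.
bool validate_email_format(const std::string& email)
{
    // ^ and $ anchor the pattern to the start and end of the input,
    // so regex_search only succeeds when the entire string matches.
    static const std::regex re(R"(^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$)");
    return std::regex_search(email, re);
}
```

Under these assumptions, tom.archer@archerconsultinggroup.com and tom@mail.archerconsultinggroup.com are accepted, while tom.archerarcherconsultinggroup.com (no at-sign) and tom@archer@consulting.com (two at-signs) are rejected. Two results are worth knowing about: tom@example.museum is rejected because the top-level domain is longer than four letters, and "tom archer@archerconsultinggroup.com" is accepted because [^@]+ allows any character other than an at-sign, including a space.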

Now, tweak the pattern so that it can be used to parse a document for all of its contained e-mail addresses:

([-\.\w^@]+@(?:[-\w]+\.)+[A-Za-z]{2,4})+

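The main differences between this pattern and the previous one are the following:

  • I removed the beginning-of-line metacharacter (^) because the pattern will be used to search through an entire string for all e-mail addresses (instead of being used to validate the entire string for a single e-mail address).
  • I replaced the negated class [^@]+ with [-\.\w^@]+ so that a match cannot start with arbitrary text such as spaces. Note that the caret is not the first character inside these brackets, so it is a literal caret rather than a negation; the class matches dashes, periods, word characters, carets, and at-signs.
  • I used the ?: capture inhibitor operator so that I don’t capture unneeded submatches.
  • As with the beginning-of-line metacharacter, I also removed the end-of-line metacharacter ($).
  • I implemented additional “grouping” by wrapping the whole expression in parentheses, to locate all e-mails in a provided input string.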
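
As a rough illustration of these changes, the sketch below applies the search pattern to a block of text and also counts capture groups with and without the ?: operator. It is written in standard C++ with std::regex instead of the .NET Regex class, and the names extract_addresses, capture_group_count, kSearchPattern, and kCapturingVariant are assumptions made for this example:

```cpp
#include <regex>
#include <string>
#include <vector>

// The search pattern from this section, written as a raw string literal.
const char* const kSearchPattern =
    R"(([-\.\w^@]+@(?:[-\w]+\.)+[A-Za-z]{2,4})+)";

// Same pattern without the ?: capture inhibitor, for comparison only.
const char* const kCapturingVariant =
    R"(([-\.\w^@]+@([-\w]+\.)+[A-Za-z]{2,4})+)";

// Hypothetical helper: returns every address found in `text`.
std::vector<std::string> extract_addresses(const std::string& text)
{
    static const std::regex re(kSearchPattern);
    std::vector<std::string> out;

    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        out.push_back(it->str());   // whole match, i.e. one address
    }
    return out;
}

// Hypothetical helper: number of capturing groups in a pattern.
unsigned capture_group_count(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}
```

Run against the sample sentence from the first section, extract_addresses returns the same two addresses, and the trailing period of the sentence is not included in the second one. capture_group_count reports one group for the search pattern and two for the variant without ?:, which is the unneeded submatch that the capture inhibitor avoids.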

So, the natural question at this point would be “Is this pattern guaranteed to find every single valid e-mail address?” After doing quite a bit of research on this issue, I found that an all-encompassing e-mail regular expression pattern is almost 6,000 bytes in length! However, that pattern would be necessary to catch only the minuscule percentage of e-mail addresses that the patterns illustrated in this article won’t. The two patterns that I’ve covered will catch 99 percent of all e-mail addresses.

Regular Expressions: A Lot of Ground to Cover

My original intention for a series on using the .NET regular expressions classes from Managed C++ was to simply cover some basic patterns and usages. However, the more I wrote, the more I realized needed to be covered. So, it turned out to be a much-longer-than-planned series. It covered splitting strings, finding matches within a string, using regular expression metacharacters, grouping, creating named groups, working with captures, performing advanced search-and-replace functions, and finally writing a complex e-mail pattern.

Hopefully along the way, those of you who are new to regular expressions saw just how powerful they can be. Just think of how much manual text parsing code would be necessary to parse a block of text for (almost) every conceivable e-mail address. Compare that with the single line of code it takes with regular expressions! For those who wish to learn still more about working with the .NET regular expressions classes, my book, Extending MFC Applications with the .NET Framework, provides a full 50-page chapter on the subject and introduces half a dozen demo applications with code that you can easily plug into your own production code.

Acknowledgements

I would like to thank Don J. Plaistow, a Perl and regular expressions guru who helped me tremendously when I first started learning regular expressions. Don’s help was especially valuable with regard to the e-mail patterns in this article.
