Understanding Basic Regular Expressions Patterns

When I began writing about the .NET implementation of regular expressions, I intended to focus solely on the .NET classes and not on regular expressions themselves. To that end, I started with a couple of very basic tasks that can be performed with regular expressions: splitting strings with the Regex::Split method and using the Match and MatchCollection classes to enumerate found literals or patterns.

Many readers wrote in explaining that they are new to regular expressions, and they want at least an article or two that explain some basic patterns—instead of having all the examples search for literals or unexplained patterns. Therefore, this article presents some very basic regular expression patterns to get people who are new to this area off and running.

Searching for Letters and Words

The first thing to note is that regular expression patterns are built from literal characters and metacharacters. Metacharacters are characters that carry a special meaning rather than standing for themselves: they tell the regular expression parser how to scan the input string and what to search for.

The following sections illustrate some basic patterns that utilize some of the more commonly used metacharacters. The end of the article presents a table that you can use as a metacharacter quick-reference when creating your own patterns.

Here's a basic string that all the example patterns in this article use:

John and Harry are members of the Borbon club.

Now, suppose you want to locate all the proper nouns in this sentence. Logically, you would know that you need to locate every capitalized word. In terms of a regular expression pattern, that would look like the following:

[A-Z][a-z]+[ ]*

If you're new to regular expressions, this definitely will look a bit strange at first. The following breakdown explains each of the pattern's components:

  • [A-Z]: The A-Z indicates that you're looking for any single letter within the range of a capital A to a capital Z. The square brackets define a character class, that is, a set of characters of which exactly one must appear at that position.
  • [a-z]+: Once again, a range of characters is being searched for, this time the lowercase letters a-z. However, the plus sign (+) after the right bracket indicates that the parser will search for one or more matches to the criteria specified in the brackets. In other words, the parser is being instructed to look for one or more lowercase letters.
  • [ ]*: This part of the pattern indicates that a space will be located, but the asterisk after the right bracket indicates that the parser will match zero or more spaces. This allows the pattern to properly handle the end of a string or multiple spaces between words.

So, there you have it. The following pattern simply states "Find a capital letter, followed by one or more lowercase letters, followed by any number of spaces."

[A-Z][a-z]+[ ]*

Using this pattern results in the following list of matches:

John
Harry
Borbon
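If you want to try this pattern outside of .NET, here is a minimal sketch that uses the standard C++ std::regex library as a stand-in for the .NET Regex class covered in this series. The helper name find_matches is my own invention for illustration. Because the pattern also consumes the spaces that follow each word, the sketch trims them before storing the result.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of `input` that matches
// `pattern`, in order of appearance. Trailing spaces picked up by a
// pattern ending in [ ]* are trimmed so only the word itself is stored.
std::vector<std::string> find_matches(const std::string& input,
                                      const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> results;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        std::string text = it->str();
        text.erase(text.find_last_not_of(' ') + 1);  // drop trailing spaces
        results.push_back(text);
    }
    return results;
}
```

Calling find_matches with the sample sentence and the pattern [A-Z][a-z]+[ ]* should give back John, Harry, and Borbon.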

However, the pattern has two problems. First, it will not catch capitalized abbreviations or acronyms, as it stipulates that only one capital letter will be matched, followed by lowercase letters. Therefore, the current pattern used on the following input value will not yield the match for IBM:

John and Harry are members of the Borbon club at IBM

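To fix this problem, you need to modify the pattern so that the second character class also accepts uppercase letters:

[A-Z][A-Za-z]+[ ]*

What I've inserted into the second set of brackets is the A-Z range. A character class already means "any one of these characters," so the two ranges simply sit side by side. Note that the vertical bar (|) acts as an "or" operator only outside of brackets; inside a character class it would be treated as a literal | character. Therefore, the pattern now states: "Find a single capital letter followed by one or more uppercase or lowercase letters followed by any number of spaces." The pattern will now yield the following: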

John
Harry
Borbon
IBM
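The acronym case is easy to check in isolation. The following sketch again assumes standard C++ std::regex rather than the .NET classes, and first_match is a hypothetical helper name. It returns the first match of a pattern, or an empty string when there is none.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `input` that
// matches `pattern`, or an empty string when nothing matches.
std::string first_match(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str() : std::string();
}
```

With this helper, first_match("IBM", "[A-Z][a-z]+[ ]*") comes back empty because no capital letter in IBM is followed by a lowercase one, whereas a pattern whose second class is [A-Za-z] finds IBM.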

The second problem is that the pattern doesn't handle single-letter capitalized words, such as the pronoun "I". This is the easiest problem to solve. All you need to do is replace the plus sign in the pattern with an asterisk, so that the parser knows that the capital letter may not be followed by any further letters. Using an input value of "John, Harry and I are members of the Borbon club at IBM.", the pattern would be as follows:

[A-Z][A-Za-z]*[ ]*

It would yield:

John
Harry
I
Borbon
IBM
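To see the difference between the plus sign and the asterisk, you can count the matches each version produces. This is a sketch under the same assumption as before (standard C++ std::regex instead of the .NET Regex class), and count_matches is a name made up for this example.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// found while scanning `input` from left to right.
std::ptrdiff_t count_matches(const std::string& input,
                             const std::string& pattern) {
    const std::regex re(pattern);
    return std::distance(
        std::sregex_iterator(input.begin(), input.end(), re),
        std::sregex_iterator());
}
```

For the sentence above, the + version of the pattern should report four matches and the * version five, the extra one being the single-letter word I.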


As promised, the following table contains the most commonly used metacharacters.

 Table 1: Commonly used regular expressions metacharacters

Expression      Description
.               Matches any single character except \n
[characters]    Matches a single character in the list
[^characters]   Matches a single character not in the list
[charX-charY]   Matches a single character in the specified range
\w              Matches a word character; same as [a-zA-Z_0-9] for ASCII input
\W              Matches a non-word character
\s              Matches a whitespace character, including [ \n\r\t\f]
\S              Matches a non-whitespace character
\d              Matches a decimal digit; same as [0-9] for ASCII input
\D              Matches a non-digit character
^               Matches the beginning of the input (or of each line in multiline mode)
$               Matches the end of the input (or of each line in multiline mode)
\b              Matches on a word boundary
\B              Matches anywhere except on a word boundary
*               Zero or more matches of the preceding element
+               One or more matches of the preceding element
?               Zero or one match of the preceding element
{n}             Exactly n matches of the preceding element
{n,}            At least n matches of the preceding element
{n,m}           At least n but no more than m matches of the preceding element
( )             Captures the matched substring
(?<name> )      Captures the matched substring into the group called name
|               Logical OR between alternatives
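As a quick illustration of how a few of these metacharacters combine, here is a small sketch. It assumes standard C++ std::regex, whose default syntax shares the metacharacters used here but is not identical to the .NET engine, and both function names are hypothetical. The first function uses \b and \w to pick out capitalized words without any trailing spaces; note that \w also accepts digits and underscores. The second uses ^, $, \d, and {n} to test whether an entire string is exactly four digits.

```cpp
#include <regex>
#include <string>
#include <vector>

// \b anchors each match to word boundaries and \w* takes the rest of the
// word, so punctuation and spaces never end up in the results.
std::vector<std::string> capitalized_words(const std::string& input) {
    static const std::regex re(R"(\b[A-Z]\w*\b)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}

// ^ and $ pin the pattern to the whole input; \d{4} demands exactly
// four decimal digits in between.
bool is_four_digits(const std::string& input) {
    static const std::regex re(R"(^\d{4}$)");
    return std::regex_search(input, re);
}
```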

Simply combine these metacharacters with what you learned in the previous articles on string splitting and using the Match and MatchCollection classes and you'll be surprised at how easily you can search for many basic patterns.

More Advanced Uses of Regular Expressions

At this point, you have the basic knowledge required to form regular expressions and use them in your Managed C++ code. While what you've learned thus far will work for a lot of common parsing needs, regular expressions allow you to do so much more than search for simple character patterns. For example, you can:

  • Search for email addresses, where the number of valid formats leads to very complex patterns
  • Search and replace specific patterns
  • Extract specific information, such as searching for phone numbers and then extracting only the area code

To move into these more advanced areas of regular expressions use, you'll need to know about groups and captures. Therefore, future articles will cover these areas and examine some of the tasks mentioned in this article.
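As a small preview of what a capture looks like, here is a sketch of the area code example. It assumes standard C++ std::regex rather than the .NET classes, a phone number written in the single format (ddd) ddd-dddd, and a hypothetical function name. The inner pair of parentheses is a numbered capturing group; as far as I know, std::regex does not support the named (?<name> ) form from Table 1, so the group is read back by its index instead.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first phone number of the assumed form
// "(ddd) ddd-dddd" and returns only its area code, or an empty string
// when no such number is present.
std::string area_code(const std::string& text) {
    // \( and \) match literal parentheses; the unescaped pair in between
    // captures the three digits as group 1.
    static const std::regex re(R"(\((\d{3})\) \d{3}-\d{4})");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m[1].str();
    }
    return std::string();
}
```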



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.
