Programming with Regular Expressions

Regular expressions have their own grammar, which makes them a language in their own right: a very terse one, supported in .NET by the System.Text.RegularExpressions namespace. If you haven't worked with regular expressions, you have been at a disadvantage. In short, regular expressions support string pattern match and replace operations, from the simple to the very complex.

Suppose you want to search an HTML file for mailto links, scan arbitrary text for recurring patterns (such as every use of the token foreach), or match simple patterns such as a string of digits. For all of these operations and many more, you can eliminate a tremendous amount of custom Visual Basic .NET parsing and searching code if you learn how to create a regular expression string and use the Regex class.

Regex is the class in System.Text.RegularExpressions that contains methods like IsMatch, Match, and Replace, which compare an input string to a regular expression and perform the named operation. For example, given an input string and a regular expression, IsMatch returns a Boolean result indicating whether or not the input string matches the expression. Regular expressions can be difficult to read because they are very terse, but they are worth learning because they may save you hours, days, or weeks of writing custom code. They are powerful and a good tool to have in your toolbox. This article provides an introduction to using regular expressions in Visual Basic .NET. If you need more information than is in this article, Dan Appleman has written a nice 75 page ebook available on Amazon.com for about $15. (Wait. Read this article before running off to spend your $15.)

Reviewing Regular Expression Tokens

Regular expressions have key tokens and a grammar. Everything that is not a token is a literal, and the grammar defines the order and placement of the symbols that make up a regular expression. (You can search the help files in Visual Studio .NET for a complete set of tokens and grammatical rules for regular expressions.)

For example, if you check the .NET help files you will discover that \d represents a digit. The Regex.IsMatch method will return true if the expression is \d and the string of input text contains any digits.

Imports System.Text.RegularExpressions
...
Regex.IsMatch("wer345", "\d")

Determining whether an input string contains a single digit is only marginally useful, but it still saves you from writing code that iterates over every element in a string and compares each character to the characters 0 through 9.
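To make the comparison concrete, here is a minimal sketch that sets a hand-written loop next to the one-line Regex.IsMatch call. The module and function names are illustrative and do not come from the original article; the loop uses Char.IsDigit as a stand-in for comparing against each digit character.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DigitCheckDemo
    ' Hand-written check: walk the string and test each character.
    ' (Hypothetical helper, shown only for comparison.)
    Function ContainsDigitManually(ByVal inputText As String) As Boolean
        For Each c As Char In inputText
            If Char.IsDigit(c) Then Return True
        Next
        Return False
    End Function

    ' The same question answered by a two-character regular expression.
    Function ContainsDigit(ByVal inputText As String) As Boolean
        Return Regex.IsMatch(inputText, "\d")
    End Function

    Sub Main()
        Console.WriteLine(ContainsDigitManually("wer345")) ' True
        Console.WriteLine(ContainsDigit("wer345"))         ' True
        Console.WriteLine(ContainsDigit("wer"))            ' False
    End Sub
End Module
```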

There were studies in the 1990s that suggested that code cost approximately $5 to $8 per line of code to own. String parsing code can get complicated and long-winded very quickly. By using regular expressions you can avoid writing custom string parsing and searching code by writing and testing a single regular expression.

Exploring Sample Regular Expressions

Table 1 contains several basic regular expressions and a brief description of the kind of strings that each expression will match.

Expression                    Description
\d                            A string containing a digit anywhere in the string (e.g. "a1bc").
^\d                           A string containing a digit as the first character (e.g. "1abc").
\d$                           A string containing a digit as the last character (e.g. "abc4").
^\d+$                         A string consisting entirely of digits (e.g. "012345").
\(\d\d\d\) \d\d\d-\d\d\d\d    A string containing text in the form of a US phone number (e.g. "(517) 555-1212").
\(\d{3}\) \d{3}-\d{4}         \d{3} matches three successive digits. This expression is equivalent to the previous expression.
^\D+$                         A string consisting entirely of non-digit characters (e.g. "abc-def").
^[a-zA-Z]+$                   A string consisting entirely of letters a-z or A-Z (e.g. "abcDEF"). This is narrower than the previous expression, which also accepts spaces and punctuation.

Table 1: Sample regular expressions with descriptions
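The expressions in Table 1 can be tried directly with the shared Regex.IsMatch method. The following is a small sketch; the sample strings come from the table where it provides them, and "abc-def" is an added sample used to show that ^\D+$ accepts any non-digit character, while ^[a-zA-Z]+$ accepts letters only.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TableOneDemo
    Sub Main()
        Console.WriteLine(Regex.IsMatch("a1bc", "\d"))       ' True: digit anywhere
        Console.WriteLine(Regex.IsMatch("1abc", "^\d"))      ' True: digit first
        Console.WriteLine(Regex.IsMatch("abc4", "\d$"))      ' True: digit last
        Console.WriteLine(Regex.IsMatch("012345", "^\d+$"))  ' True: all digits
        Console.WriteLine(Regex.IsMatch("0123a5", "^\d+$"))  ' False: contains a letter

        ' Both phone number expressions accept the same input.
        Console.WriteLine(Regex.IsMatch("(517) 555-1212", "\(\d\d\d\) \d\d\d-\d\d\d\d")) ' True
        Console.WriteLine(Regex.IsMatch("(517) 555-1212", "\(\d{3}\) \d{3}-\d{4}"))      ' True

        ' The hyphen is a non-digit, but it is not a letter.
        Console.WriteLine(Regex.IsMatch("abc-def", "^\D+$"))        ' True
        Console.WriteLine(Regex.IsMatch("abc-def", "^[a-zA-Z]+$"))  ' False
    End Sub
End Module
```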

The expressions in Table 1 are very simple. If you end your experimentation with the strings in Table 1, you will be shortchanging yourself. Regular expressions can perform advanced pattern matching too, and they support search and replace operations.

String Matches

You can create an instance of the Regex class, initializing the regular expression with the expressions, or you can invoke one of the Regex shared methods, passing the regular expression and the input string.
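The two styles described above might look like this in practice. This is a sketch only; the variable and function names are illustrative.

```vb
Imports System
Imports System.Text.RegularExpressions

Module InstanceVersusSharedDemo
    ' Shared method: pass the input string and the expression together.
    Function IsAllDigitsShared(ByVal inputText As String) As Boolean
        Return Regex.IsMatch(inputText, "^\d+$")
    End Function

    ' Instance: bind the expression once, then reuse it for many inputs.
    Function IsAllDigitsInstance(ByVal inputText As String) As Boolean
        Dim allDigits As New Regex("^\d+$")
        Return allDigits.IsMatch(inputText)
    End Function

    Sub Main()
        Console.WriteLine(IsAllDigitsShared("012345"))   ' True
        Console.WriteLine(IsAllDigitsInstance("012345")) ' True
        Console.WriteLine(IsAllDigitsInstance("12a45"))  ' False
    End Sub
End Module
```

The shared call is convenient for a one-off test, while an instance keeps the expression in one place when the same pattern is applied repeatedly.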

Tip: If you need to match special characters like the parenthetical tokens then you can escape them. For example, ( is a token in the regular expression language and \( would match a literal left-parenthesis.
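A short sketch of the difference escaping makes follows. The sample strings are made up for illustration. The last two lines use Regex.Escape, a helper on the Regex class that this article does not otherwise cover, to build the escaped form from literal text.

```vb
Imports System
Imports System.Text.RegularExpressions

Module EscapeDemo
    Sub Main()
        ' Unescaped parentheses only group; they do not match "(" or ")".
        Console.WriteLine(Regex.IsMatch("call 517", "(517)"))      ' True

        ' Escaped parentheses must appear literally in the input.
        Console.WriteLine(Regex.IsMatch("call 517", "\(517\)"))    ' False
        Console.WriteLine(Regex.IsMatch("call (517)", "\(517\)"))  ' True

        ' Regex.Escape produces a pattern that matches its argument literally.
        Dim literalPattern As String = Regex.Escape("(517)")
        Console.WriteLine(Regex.IsMatch("call (517)", literalPattern)) ' True
    End Sub
End Module
```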

A common and simple test is to invoke IsMatch to determine if an input string matches a predetermined expression. For example, you can define the expression (\d{1,2})/(\d{1,2})/(\d{2,4}) to test date values. There are three groups represented by the parenthetical chunks of the expression. The first group, (\d{1,2}), matches from one to two digits. The / is a literal. The second group matches up to two more digits, followed by a slash, and two to four more digits. This expression will match strings that contain what are typically dates in the format mm/dd/yyyy; dates in the format m/d/yy would match the expression too.

Invoking Regex.IsMatch("02/12/1966", "(\d{1,2})/(\d{1,2})/(\d{2,4})") would return a value of True.
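If you want the pieces of the date rather than a yes-or-no answer, the Match method exposes each parenthesized group. The sketch below assumes the same expression; the module and function names are illustrative. Group 0 is the whole match, and groups 1 through 3 are the three parenthesized parts.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DateGroupsDemo
    Const DatePattern As String = "(\d{1,2})/(\d{1,2})/(\d{2,4})"

    ' Returns the text captured by the numbered group,
    ' or an empty string when the input does not match at all.
    Function DatePart(ByVal inputText As String, ByVal groupNumber As Integer) As String
        Dim m As Match = Regex.Match(inputText, DatePattern)
        If Not m.Success Then Return String.Empty
        Return m.Groups(groupNumber).Value
    End Function

    Sub Main()
        Console.WriteLine(DatePart("02/12/1966", 1)) ' 02
        Console.WriteLine(DatePart("02/12/1966", 2)) ' 12
        Console.WriteLine(DatePart("02/12/1966", 3)) ' 1966
    End Sub
End Module
```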

Implementing Search and Replace Behavior

The parentheses are one of the grouping constructs. Another grouping construct is the (?<name>subexpression) construct, which captures the matched substring into a group called name. Captured values can be used in the Regex.Replace method.

Adding on to the expression in the previous section, and borrowing from the help documentation, we can use the named grouping construct to perform a search and replace, rearranging a date value from one format to a second format. Regex.Replace takes the input string first, then the expression, then the replacement pattern (input here stands for whatever string you want to process):

Regex.Replace(input, "\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b", _
"${day}-${month}-${year}")

The preceding Replace invocation will replace every occurrence of a string in mm/dd/yyyy form with a string value in the dd-mm-yyyy format. It is important to note that the strings may have any combination of one or two digits in the month and day positions and two to four digits in the year position. It is also worth noting that the values do not have to represent valid date values. The regular expression is simply looking for patterns of digits that resemble date-formatted strings.
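Here is a complete, runnable sketch of that replacement. The function name and sample sentences are illustrative; the expression and replacement pattern are the ones discussed above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DateReplaceDemo
    Const DatePattern As String = _
        "\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b"
    Const Replacement As String = "${day}-${month}-${year}"

    ' Rewrites every mm/dd/yyyy-style substring as dd-mm-yyyy.
    Function ReformatDates(ByVal inputText As String) As String
        Return Regex.Replace(inputText, DatePattern, Replacement)
    End Function

    Sub Main()
        ' Prints: Due 12-02-1966 and 4-7-76
        Console.WriteLine(ReformatDates("Due 02/12/1966 and 7/4/76"))

        ' Not a valid date, but the digit pattern still matches.
        ' Prints: 98-99-2002
        Console.WriteLine(ReformatDates("99/98/2002"))
    End Sub
End Module
```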

Compiled Regular Expressions

.NET contains a regular expression parser that interprets the expression when it is used to evaluate a string. For frequently used expressions, you can instruct .NET to compile the regular expression to IL, which is subsequently JIT-compiled to native machine code. (I performed some simple operations on a million input strings and measured about a 50% performance increase with compiled expressions.)

To compile a regular expression, construct an instance of the Regex class with the RegexOptions.Compiled argument. This instructs the regular expression engine to convert the expression to IL, which is ultimately converted to native machine code. The difference between regular expressions and compiled regular expressions is roughly analogous to the performance difference between interpreted languages and compiled languages. Here is an example of a regular expression that will be compiled.

Dim Expression As Regex = New Regex("^\d+$", RegexOptions.Compiled)
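Compilation only pays off when the same Regex instance is reused, so one reasonable arrangement is to build it once and apply it to many inputs. The sketch below assumes that arrangement; the module, field, and function names are illustrative, and any speedup will depend on your expressions and data.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CompiledRegexDemo
    ' Built once and reused for every input string.
    Private ReadOnly AllDigits As New Regex("^\d+$", RegexOptions.Compiled)

    ' Counts the entries that consist entirely of digits.
    Function CountAllDigitStrings(ByVal inputs As String()) As Integer
        Dim count As Integer = 0
        For Each item As String In inputs
            If AllDigits.IsMatch(item) Then count += 1
        Next
        Return count
    End Function

    Sub Main()
        Dim samples As String() = New String() {"012345", "12a", "", "7"}
        Console.WriteLine(CountAllDigitStrings(samples)) ' 2
    End Sub
End Module
```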

Summary

Regular expressions in the .NET Framework were derived from popular capabilities supported by regular expressions in languages like Perl and tools like awk and grep. (You have probably used one of these tools at one time or another.)

The purpose of introducing any new subject is to demonstrate the strengths of a product and more importantly, to save you time and energy. Regular expressions can be quite difficult to read and understand, but can eliminate the need for a significant amount of custom string parsing code.

One final note: Visual Basic .NET is a powerful development environment. You can actually define regular expressions after your program is compiled and deployed, and use .NET Reflection to dynamically emit assemblies that contain compiled regular expressions. Emitting assemblies containing regular expressions is a great way to impress your friends and create some advanced text processing tools. If you pick up a copy of my "The Visual Basic .NET Developer's Guide", available fall 2002, published by Addison-Wesley (shameless plug), then you will have access to some of these advanced capabilities.

Stay tuned for more Visual Basic .NET and send inquiries to me if you have specific questions.

About the Author

Paul Kimmel is a freelance writer for Developer.com and CodeGuru.com. Look for his recent book Visual Basic .Net Unleashed at a bookstore near you.

Online articles and print magazines provide concise information in a compact format. The current alternative is to spend $50 on an 800 page book. Sometimes you might want just 50 or 100 pages of information on a specific topic, something in between an article and a book. If you like the concept of ebooks then sound off. Contact pkimmel@softconcepts.com if you'd like to share your opinion on ebooks.


