Working with Regular Expressions in .NET

Introduction

Regular expressions provide a standard and powerful way of
matching patterns in text data. The .NET Framework exposes its regular
expression engine through the System.Text.RegularExpressions namespace. The
Regex class is the primary way for developers to perform pattern matching,
search and replace, and splitting operations on a string.
Many beginners avoid using regular expressions because of the apparently
difficult syntax. However, if your application calls for heavy pattern matching
then learning and using regular expressions over ordinary string manipulation
functions is strongly recommended. This article is intended to give beginners a
quick overview of .NET Framework’s offerings for pattern matching using regular
expressions.

Note:
This article will not teach you how to write regular expressions. It focuses
primarily on using classes from System.Text.RegularExpressions namespace. It is
assumed that you are already familiar with regular expression syntax and are
able to write basic regular expressions.

Basic Terminology

Before you go any further let’s quickly glance over the
basic terminology used in the context of regular expressions.

  • Capture : The result of a single sub-expression match
    is called a Capture. The Capture and CaptureCollection classes
    represent a single capture and a collection of captures respectively.
  • Group : A regular expression often consists of one
    or more Groups. A group is written inside parentheses within a
    regular expression (the whole regular expression itself is also considered
    a group). A single group can have zero or more captures. The Group
    and GroupCollection classes represent a single group and a collection of
    groups respectively.
  • Match : The result of a single match of a
    regular expression is called a Match. A match contains one or more
    groups. The Match and MatchCollection classes represent a single match and
    a collection of matches respectively.

Thus the relation between the regular expression related
objects is:

Regex class -> MatchCollection -> Match objects -> GroupCollection ->
Group objects -> CaptureCollection -> Capture objects

The Regex Class

The Regex class, along with a few supporting classes, represents the regular
expression engine of the .NET Framework. The Regex class allows you to perform
pattern matching, search and replace, and splitting on source strings. You
can use the Regex class in two ways: by calling its static methods, or by
instantiating it and then calling instance methods. The difference between
these two approaches is covered in the section on performance. The following
table lists some of the important methods of the Regex class along with the
purpose of each:

Method     Description

IsMatch    Determines whether a string conforms to a specified regular
           expression. Returns true if the string matches the pattern,
           otherwise false.

Match      Searches a string for a specified pattern and returns the first
           occurrence as a Match object.

Matches    Searches a string for all occurrences of a pattern and returns
           them as a MatchCollection object.

Replace    Replaces all occurrences of a pattern with a specified string
           value.

Split      Splits a string using a specified pattern as the delimiter and
           returns the parts of the string as an array.

In the following sections you are going to use many of the
methods mentioned above.

Pattern Matching Using Regex Class

In this section you will use the pattern matching abilities
of the Regex class. Begin by creating a new Console Application and import
System.Text.RegularExpressions namespace at the top.

using System.Text.RegularExpressions;

Using IsMatch() Method

In this example you will check whether a string is a valid
URL. Key in the following code in the Main() method.

static void Main(string[] args)
{
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
 
    bool success = Regex.IsMatch(source, pattern);
    if (success)
    {
        Console.WriteLine("Entered string is a valid URL!");
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
    Console.ReadLine();
}

The Main() method receives the string to be tested as a
command line argument. The pattern string variable holds the regular expression
for verifying URLs. The code then calls the static IsMatch() method of the
Regex class and passes the source and pattern strings to it. Depending on the
returned boolean value a message is displayed to the user. Note that IsMatch()
returns true if the pattern is found anywhere in the source string; to require
that the entire string is a URL, anchor the pattern with ^ and $.

You could have achieved the same result by creating an
instance of Regex class and then calling IsMatch() method on it, as shown
below:

Regex ex = new Regex(pattern);
success = ex.IsMatch(source);
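
Because IsMatch() succeeds when the pattern occurs anywhere in the input, it can help to see the difference between "contains a URL" and "is a URL" side by side. The following is a minimal sketch, not part of the original sample: the UrlCheck class and its ContainsUrl() and IsWholeUrl() methods are hypothetical names introduced only for illustration, and the pattern is the same one used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlCheck
{
    // Same URL pattern as in the article.
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // True if the input contains a URL anywhere.
    public static bool ContainsUrl(string input)
    {
        return Regex.IsMatch(input, UrlPattern);
    }

    // True only if the input as a whole matches the pattern:
    // ^ and $ anchor the match to the start and end of the string.
    public static bool IsWholeUrl(string input)
    {
        return Regex.IsMatch(input, "^(" + UrlPattern + ")$");
    }

    public static void Main()
    {
        string text = "Visit https://www.codeguru.com/ today";
        Console.WriteLine(ContainsUrl(text));                        // True
        Console.WriteLine(IsWholeUrl(text));                         // False
        Console.WriteLine(IsWholeUrl("https://www.codeguru.com/"));  // True
    }
}
```

Which of the two checks you want depends on whether you are validating a single field or scanning free text.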

Using Match() Method

In order to see how Match() method can be used, modify the
Main() method as shown below:

static void Main(string[] args)
{
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    Match match = Regex.Match(source, pattern);
    if(match.Success)
    {
        Console.WriteLine("Entered string is a valid URL!");
        Console.WriteLine("{0} Groups", match.Groups.Count);
        for(int i=0;i<match.Groups.Count;i++)
        {
            Console.WriteLine("Group {0} Value = {1} Status = {2}",
            i, match.Groups[i].Value, match.Groups[i].Success);

            Console.WriteLine("\t{0} Captures", match.Groups[i].Captures.Count);

            for (int j = 0; j < match.Groups[i].Captures.Count; j++)
            {
                Console.WriteLine("\t\t Capture {0} Value = {1} Found at = {2}",
                j, match.Groups[i].Captures[j].Value, match.Groups[i].Captures[j].Index);
            }
        }
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
    Console.ReadLine();
}

The code shown above uses the Match() method to perform pattern matching. As mentioned earlier, Match() returns an instance of the Match class that represents the first occurrence of the pattern. The Success property of the Match object tells you whether the match succeeded. A for loop then iterates through the Groups collection (a GroupCollection object) and prints each group's value and status. For each group, an inner loop iterates through its Captures collection and prints each captured value along with its index in the source string. The following figure shows a sample run of the above application.

Figure 1: A sample run of the application

Observe the above figure carefully. Our pattern contains 4
groups (three in parentheses within the regular expression, plus the whole
expression), so the Count property of the Groups collection returns 4. The first
group (the whole expression) has the value https://www.codeguru.com/. The second
group has the value s (from https). The third group has two captures, "www." and
"codeguru.", because it is repeated; its Value property holds the last capture.
Finally, the last group has the value / (the / at the end of the URL).
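
The difference between a group's Value and its Captures is easier to see in isolation. The sketch below assumes the same URL pattern and input as the sample run; GroupDemo and CapturesOf() are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Returns every value captured by the given group number, in order.
    public static string[] CapturesOf(string input, int groupNumber)
    {
        Match match = Regex.Match(input, UrlPattern);
        CaptureCollection captures = match.Groups[groupNumber].Captures;
        string[] values = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            values[i] = captures[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string url = "https://www.codeguru.com/";

        // Group 2 is ([\w-]+\.)+ ; it repeats, so it records one capture
        // per repetition.
        Console.WriteLine(string.Join(" | ", CapturesOf(url, 2))); // www. | codeguru.

        // Group.Value exposes only the last capture of the group.
        Console.WriteLine(Regex.Match(url, UrlPattern).Groups[2].Value); // codeguru.
    }
}
```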

Using Matches() Method

Matches() method is similar to Match() method but returns a
collection of Match objects (MatchCollection). You can then iterate through all
of the Match instances and see various group and capture values. The following
code illustrates how this is done:

MatchCollection matches = Regex.Matches(source, pattern);
 
foreach (Match match in matches)
{
    Console.WriteLine("Match Value = {0}",match.Value);
    Console.WriteLine("============");
    if (match.Success)
    {
        Console.WriteLine("Entered string is a valid URL!");
        Console.WriteLine("{0} Groups", match.Groups.Count);
        for (int i = 0; i < match.Groups.Count; i++)
        {
            Console.WriteLine("Group {0} Value = {1} Status = {2}",
            i, match.Groups[i].Value, match.Groups[i].Success);
            Console.WriteLine("\t{0} Captures", match.Groups[i].Captures.Count);
            for (int j = 0; j < match.Groups[i].Captures.Count; j++)
            {
                Console.WriteLine("\t\t Capture {0} Value = {1} Found at = {2}",
                j, match.Groups[i].Captures[j].Value, match.Groups[i].Captures[j].Index);
            }
        }
    }
    else
    {
        Console.WriteLine("Entered string is not a valid URL!");
    }
 
}

The following figure shows a sample run of the above code:

Figure 2: Matches() method returns two Match objects

Notice how the Matches() method has returned two Match objects
(one for http://site1.com and the other for http://site2.com).
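
When only the matched text is needed, the loop can be much shorter, since a MatchCollection contains successful matches only and the Success check is not required. The following sketch collects every URL found in a string; UrlExtractor and FindUrls() are hypothetical names, and the input string is an assumed example modeled on the two URLs mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Returns the text of every URL found in the input, in order.
    public static List<string> FindUrls(string input)
    {
        List<string> urls = new List<string>();
        foreach (Match match in Regex.Matches(input, UrlPattern))
        {
            // Every Match in the collection is a successful match.
            urls.Add(match.Value);
        }
        return urls;
    }

    public static void Main()
    {
        List<string> urls = FindUrls("Visit http://site1.com and http://site2.com");
        Console.WriteLine(urls.Count); // 2
        foreach (string url in urls)
        {
            Console.WriteLine(url);
        }
    }
}
```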

Search and Replace Using Regex Class

The Regex class not only allows you to perform pattern
matching but also allows you to search and replace strings. Consider, for
example, that you are developing a discussion forum in ASP.NET. To reduce spam and
promotional content, you want to scan forum posts made by new members for URLs
and replace each URL with a placeholder message. This can easily be done
with the search and replace abilities of the Regex class. Let’s see how.

static void Main(string[] args)
{
    string source = args[0];
    string pattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    string result = Regex.Replace(source,pattern,"[*** URLs not allowed ***]");
    Console.WriteLine(result);

    Console.ReadLine();
}

In the code fragment shown above the regular expression is
intended to find URLs in the input string. You then call the Replace() method
of the Regex class. The first parameter of the Replace() method is the string
in which you wish to perform the replacement. The second parameter is the
pattern to search for and the third parameter is the replacement string. The
Replace() method returns the resultant string after performing the replacement.
If you run the above code you should see something like this in the console
window:

Figure 3: The Replace() method of the Regex class

Notice how the URL has been replaced with the text you
specify.
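
To make the effect concrete without command line arguments, the sketch below wraps the same Replace() call in a small helper and runs it on a fixed string. UrlMasker and MaskUrls() are hypothetical names, and the input text is an assumed example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlMasker
{
    private const string UrlPattern =
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Replaces every URL in the input with a fixed placeholder.
    public static string MaskUrls(string input)
    {
        // Argument order: input string, pattern, replacement string.
        return Regex.Replace(input, UrlPattern, "[*** URLs not allowed ***]");
    }

    public static void Main()
    {
        string post = "Visit http://site1.com and http://site2.com today";
        Console.WriteLine(MaskUrls(post));
        // Visit [*** URLs not allowed ***] and [*** URLs not allowed ***] today
    }
}
```

Note that the last group of this pattern allows spaces after a /, so a URL with a path can swallow the words that follow it; tighten the character class if that matters for your input.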

Splitting Strings Using Regex

The Regex class also allows you to split an input string based
on a regular expression. Say, for example, you wish to split a date in
DD/MM/YYYY format at / so as to retrieve the individual day, month and year
values, or split a comma-separated list into its items. The Split() method of
the Regex class allows you to do just that. The following example splits a
comma-separated list:

string strFruits = "Apple,Mango,Banana";
string[] fruits = Regex.Split(strFruits, ",");
foreach(string s in fruits)
{
    Console.WriteLine(s);
}

In the above code the Split() method takes the source string
and a regular expression for searching the delimiter (, in the above example).
It then splits the string and returns an array of strings consisting of
individual elements. A sample run of the above code is shown below:

Figure 4: Splitting Strings Using Regex
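
The date scenario mentioned earlier works the same way. The sketch below splits a DD/MM/YYYY date into its parts and, as an assumption beyond the original example, uses a character class so that . and - are accepted as separators too; DateSplitter and SplitDate() are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateSplitter
{
    // Splits on any single '/', '.' or '-' character.
    public static string[] SplitDate(string date)
    {
        return Regex.Split(date, @"[/.-]");
    }

    public static void Main()
    {
        foreach (string part in SplitDate("25/12/2011"))
        {
            Console.WriteLine(part); // 25, then 12, then 2011
        }

        // The same pattern handles other separators.
        Console.WriteLine(string.Join(",", SplitDate("25-12-2011"))); // 25,12,2011
    }
}
```

For a single fixed delimiter, String.Split() is usually enough; Regex.Split() earns its keep when the delimiter is itself a pattern, as here.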

Regex Options

Most of the methods discussed above are overloaded to take a
parameter of type RegexOptions enumeration. As the name suggests, the
RegexOptions enumeration is used to indicate certain configuration options to
the regular expression engine during the pattern matching process. The
following table lists some of the important options of RegexOptions
enumeration:

Option        Description

IgnoreCase    Indicates that the pattern matching operation should ignore
              character casing.

Multiline     Indicates that ^ and $ characters are to be applied to the
              beginning and end of each line and not just the beginning and
              end of the entire source string.

Singleline    Indicates that dot (.) should match every character,
              including a newline character.

RightToLeft   Indicates that the pattern matching will be performed from
              right to left instead of left to right in a source string.

Compiled      Indicates that the regular expression is to be converted to
              MSIL code and not to regular expression internal instructions.

To illustrate how the RegexOptions enumeration can be used,
write the following code in the Main() method and observe the difference made by
the RegexOptions value.

bool success1 = Regex.IsMatch(source, "hello");
Console.WriteLine("String found? {0}",success1);
bool success2 = Regex.IsMatch(source, "hello", RegexOptions.IgnoreCase);
Console.WriteLine("String found? {0}", success2);

As you can see, the second call to the IsMatch() method uses the RegexOptions enumeration to specify that case should be ignored during pattern matching. If the source string contains the text in a different case (for example, Hello), the call to IsMatch() without any RegexOptions returns false, whereas the call with RegexOptions.IgnoreCase returns true (see the output below).

Figure 5: IsMatch() method without RegexOptions returns false; with
RegexOptions.IgnoreCase returns true

Note:
You can combine multiple RegexOptions values with the bitwise OR operator (|), like this:

bool success2 = Regex.IsMatch(source, "hello", RegexOptions.IgnoreCase | RegexOptions.Compiled);
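
The Multiline and Singleline options from the table are easy to confuse, so a small comparison may help. This is a sketch with made-up inputs; OptionsDemo and its two methods are hypothetical names chosen for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts matches of ^\w+$ , i.e. runs of word characters that span
    // everything between a ^ position and a $ position.
    public static int CountWholeLineWords(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w+$", options).Count;
    }

    // Checks whether the dot in "a.b" can match the newline in "a\nb".
    public static bool DotMatchesNewline(RegexOptions options)
    {
        return Regex.IsMatch("a\nb", "a.b", options);
    }

    public static void Main()
    {
        string twoLines = "one\ntwo";

        // Without Multiline, ^ and $ refer to the whole string only.
        Console.WriteLine(CountWholeLineWords(twoLines, RegexOptions.None));      // 0
        // With Multiline, ^ and $ also match at each line boundary.
        Console.WriteLine(CountWholeLineWords(twoLines, RegexOptions.Multiline)); // 2

        // Singleline changes only what the dot matches.
        Console.WriteLine(DotMatchesNewline(RegexOptions.None));       // False
        Console.WriteLine(DotMatchesNewline(RegexOptions.Singleline)); // True
    }
}
```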

Performance Considerations

As mentioned earlier, the Regex class provides static as
well as instance methods for pattern matching. The static methods accept the
source string and the pattern as parameters whereas the instance methods
accept only the source string (since the pattern is specified while creating the
instance itself). The following code fragment makes it clear:

//Using static method
bool success = Regex.IsMatch(source, pattern);
//Using instance method
Regex ex = new Regex(pattern);
success = ex.IsMatch(source);

When you use static methods, the regular expression engine caches the parsed regular expressions (the Regex.CacheSize property controls how many, 15 by default), so if the same pattern is used multiple times it does not have to be parsed again. Patterns passed to the Regex constructor, on the other hand, are not cached: each new Regex(pattern) call parses the pattern again. Regex instances are immutable, however, so you can create an instance once and reuse it for as many calls as you like; the cost of parsing is then paid only once. What you should avoid is creating a new instance with the same pattern again and again, for example inside a loop.

You should also be aware of the impact of
RegexOptions.Compiled on performance. If you use the RegexOptions.Compiled
option, the regular expression is converted to MSIL code and not to regular
expression internal instructions. This makes matching faster, but the
compilation itself takes extra time when the regular expression is first
created, which may increase startup time, and the generated code takes up
additional memory. The option therefore pays off mainly for patterns that are
used many times. So, you should carefully evaluate the use of the
RegexOptions.Compiled option.
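
One common way to apply these considerations is to build a Regex instance once, store it in a static readonly field, and reuse it for every call. The following is a minimal sketch of that idea rather than a benchmark or a recommendation for every case; UrlScanner and CountLinesWithUrl() are hypothetical names, and whether RegexOptions.Compiled is worth it depends on how often the pattern is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlScanner
{
    // Constructed (and compiled) once, then shared by all calls.
    // Regex instances are immutable, so sharing one is safe.
    private static readonly Regex UrlRegex = new Regex(
        @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?",
        RegexOptions.Compiled);

    // Counts the lines that contain at least one URL.
    public static int CountLinesWithUrl(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            // No pattern parsing happens here; only matching.
            if (UrlRegex.IsMatch(line))
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        string[] lines =
        {
            "see http://site1.com",
            "no link here",
            "https://www.codeguru.com/"
        };
        Console.WriteLine(CountLinesWithUrl(lines)); // 2
    }
}
```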

Summary

Regular expressions provide a standard and powerful way of
pattern matching. The Regex class represents .NET Framework’s regular
expression engine. The methods of Regex class are exposed as static as well as
instance methods. These methods allow you to perform search, replace and
splitting operations on input strings. The behavior of the regular expression
engine can be configured with the help of the RegexOptions enumeration.
