Building a Regular Expression Stream Search with the .NET Framework

String pattern matching techniques abound in computer science. Regular Expressions are probably the best-known string pattern matching syntax. Most development tools, languages, frameworks, and libraries on all platforms offer some form of Regular Expression support.

In the .NET Framework, classes in the System.Text.RegularExpressions namespace provide Regular Expression support. On a recent project, we tapped the .NET Framework's Regular Expression capabilities to search byte Streams in BizTalk POP3 messages using simple Regular Expressions. In a recent article, "Building a BizTalk Pipeline Content Enricher with SQL Server 2005," I covered the BizTalk aspects of our solution. In this article, I'm going to explain how we implemented a Regular Expression byte Stream search class to search the POP3 message supplied by BizTalk.

A Simple Search

You can read my prior article for a more complete explanation of our solution, so I'm just going to outline the pattern matching requirements here.

Repetitive, boring tasks are not something people generally excel at or care to do. Reviewing a daily log file received via email for infrequent errors is boring and repetitive, and it is exactly the kind of activity most users eventually neglect.

So, we decided to delegate review duties to a BizTalk POP3 Receive Port configured with a custom pipeline component. BizTalk supplied all but the pattern searching capability. All we needed to do was build a class to scan the byte Stream exposed by the underlying BizTalk classes for a pattern matching an error message in the log.

BizTalk custom component development leverages the capabilities of the .NET Framework. Naturally, we first turned to the .NET Framework for our solution.

Regular Expressions and Regex

A complete introduction to Regular Expressions is beyond the scope of this article. You can review the sources at the end of this article for more details. Our requirements dictated simple patterns such as searching for the word "Error." Regular Expression syntax, though, can support complicated multi-word, partial-word, and optional-word patterns.

The Regex class is the Regular Expression workhorse in the .NET Framework. Some of the Regex methods and properties appear below.

[Serializable]
public class Regex : ISerializable
{
   public Regex(string pattern);
   public Regex(string pattern, RegexOptions options);
   public RegexOptions Options { get; }
   public bool IsMatch(string input);
   public bool IsMatch(string input, int startat);
   public static bool IsMatch(string input, string pattern);
   public static bool IsMatch(string input, string pattern,
      RegexOptions options);
   public Match Match(string input);
   public Match Match(string input, int startat);
   public static Match Match(string input, string pattern);
   public Match Match(string input, int beginning, int length);
   public static Match Match(string input, string pattern,
      RegexOptions options);
   public MatchCollection Matches(string input);
   public MatchCollection Matches(string input, int startat);
   public static MatchCollection Matches(string input,
      string pattern);
   public static MatchCollection Matches(string input,
      string pattern, RegexOptions options);
   //Remaining members omitted
}

Regex sports a variety of static methods to do different types of Regular Expression string searches. I'll cover how we used Regex later in this article.

Regex can be instantiated, or you can opt to use the static methods. Instantiating the class allows you to store the Regular Expression inside the instance and reuse it across searches.
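As a minimal sketch of the two styles, the class below checks a log line for the word "Error" once with the static IsMatch method and once with a stored Regex instance. The class name RegexUsageSketch, its method names, and the case-insensitive option are illustrative assumptions, not part of our solution.

```csharp
using System.Text.RegularExpressions;

public static class RegexUsageSketch
{
   // Instance style: created once so the pattern is stored and reused.
   private static readonly Regex ErrorPattern =
      new Regex("Error", RegexOptions.IgnoreCase);

   // Static style: the pattern is supplied on every call.
   public static bool ContainsErrorStatic(string line)
   {
      return Regex.IsMatch(line, "Error", RegexOptions.IgnoreCase);
   }

   // Returns the index of the first match in the line, or -1 if none.
   public static int FindErrorIndex(string line)
   {
      Match match = ErrorPattern.Match(line);
      return match.Success ? match.Index : -1;
   }
}
```

The instance style pays off when the same pattern is applied many times, which is the case when a search repeatedly tests a moving buffer.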

As you may have noticed, though, Regex methods only accept strings. We needed to work with Streams. Luckily, the bytes read from a .NET Stream can be converted to and from strings.

Working with Streams

A complete introduction to the Stream class is beyond the scope of this article. Because our goal was to search Streams with a Regular Expression and the .NET Regex class only accepts strings, I'm going to focus on how we converted Streams to strings.

A Stream is simply a sequence of raw bytes. In .NET, the Stream class in the System.IO namespace is the base class for a variety of other Stream classes. Methods and properties of the Stream class appear below.

[Serializable]
[ComVisible(true)]
public abstract class Stream : MarshalByRefObject, IDisposable
{
   public abstract bool CanRead { get; }
   public abstract bool CanSeek { get; }
   public virtual bool CanTimeout { get; }
   public abstract bool CanWrite { get; }
   public abstract long Length { get; }
   public abstract long Position { get; set; }
   public virtual int ReadTimeout { get; set; }
   public virtual int WriteTimeout { get; set; }
   public virtual IAsyncResult BeginRead(byte[] buffer,
      int offset, int count, AsyncCallback callback, object state);
   public virtual IAsyncResult BeginWrite(byte[] buffer,
      int offset, int count, AsyncCallback callback, object state);
   public virtual void Close();
   protected virtual WaitHandle CreateWaitHandle();
   public void Dispose();
   public virtual int EndRead(IAsyncResult asyncResult);
   public virtual void EndWrite(IAsyncResult asyncResult);
   public abstract void Flush();
   public abstract int Read(byte[] buffer, int offset, int count);
   public virtual int ReadByte();
   public abstract long Seek(long offset, SeekOrigin origin);
   public abstract void SetLength(long value);
   public static Stream Synchronized(Stream stream);
   public abstract void Write(byte[] buffer, int offset,
      int count);
   public virtual void WriteByte(byte value);
}

As you can see, the methods above read and write byte data, as befits a byte stream. Converting bytes to a string is the role of the encoding classes in the System.Text namespace. In our solution, we used the ASCIIEncoding class. The following example illustrates how you can convert bytes to a string by using the ASCIIEncoding class.

//data is a byte[] holding bytes read from the Stream
ASCIIEncoding encoder = new ASCIIEncoding();
string copyValue = encoder.GetString(data);
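To show the read and the conversion together, here is a small sketch that reads a chunk from a Stream and decodes only the bytes that were actually returned. The class StreamDecodeSketch and the method ReadAscii are hypothetical names used for illustration.

```csharp
using System.IO;
using System.Text;

public static class StreamDecodeSketch
{
   // Reads up to 'count' bytes from the current Stream position and
   // decodes only the bytes that Read actually returned.
   public static string ReadAscii(Stream stream, int count)
   {
      byte[] data = new byte[count];
      int bytesRead = stream.Read(data, 0, count);

      ASCIIEncoding encoder = new ASCIIEncoding();
      return encoder.GetString(data, 0, bytesRead);
   }
}
```

Note that ASCIIEncoding only understands 7-bit characters; any byte above 127 is decoded as a question mark, which was acceptable for our plain-text log files.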

At this point, we have all the tools to compose a solution. There is, however, one other issue to address before we're ready to assemble the solution.

A Buffered Solution for Easing Ingestion

Streams can be large. Although we could have loaded an entire Stream into a string, we wanted to avoid the overhead of storing an entire Stream in memory. So, one last issue to confront is: How do you load portions of the Stream when you need to search the entire Stream for a pattern?

We chose the following approach to address the issue.

  • Create a buffer large enough to store the entire string pattern.
  • Add the next portion of the Stream to the leading end of the buffer (the end of the string) and trim the same number of bytes from the trailing end (the start of the string).

The approach works well if you know how large the target search pattern can be. This may not always be the case with all Regular Expressions, but because we were looking for simple patterns, it was a safe assumption.

We also needed to avoid making the number of characters we trim and add too large. A trim and add value that is too large relative to the size of the buffer can cut the pattern off at the end of the buffer, so the search misses it. So, in the example string below, a buffer of 7 and a trim and add value of 6 would miss the string pattern "zabcdef" embedded in the middle of the string.

Abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzhskfhsljds
   flashjdsdfkllsdfjasdfnnn

An algorithm using the values above would split the target pattern across two consecutive buffers when it reaches that part of the Stream, so neither buffer would ever contain the whole pattern.
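The relationship can be stated as a rule of thumb: a buffer of size B advanced in steps of S is guaranteed to capture a match of length P in a single buffer only when S is no larger than B - P + 1. The sketch below, which is my own illustration rather than code from our solution, computes that limit and simulates fixed windows over a string to confirm the miss described above. WindowMathSketch, MaxSafeStep, and AnyWindowContains are hypothetical names, and the simulation assumes a step of at least 1.

```csharp
using System;

public static class WindowMathSketch
{
   // Largest step that still guarantees some window fully contains
   // a match of at most maxPatternLength characters.
   public static int MaxSafeStep(int bufferSize, int maxPatternLength)
   {
      if (maxPatternLength > bufferSize)
      {
         throw new ArgumentException("Pattern cannot fit in the buffer.");
      }
      return bufferSize - maxPatternLength + 1;
   }

   // True if any window of bufferSize characters, advanced by step
   // characters at a time, contains the literal pattern.
   public static bool AnyWindowContains(string text, string pattern,
      int bufferSize, int step)
   {
      for (int start = 0; start < text.Length; start += step)
      {
         int length = Math.Min(bufferSize, text.Length - start);
         if (text.Substring(start, length).Contains(pattern))
         {
            return true;
         }
      }
      return false;
   }
}
```

With a buffer of 7 and the 7-character pattern "zabcdef", MaxSafeStep returns 1, so a step of 6 is far too large; in practice you would pick a buffer several times longer than the longest expected match so that the step can stay reasonably large.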

Now, it's time to look at our complete solution embodied in a single class called StreamSearchExpression.

StreamSearchExpression

Earlier, you learned about the relationship between the buffer, trim/add, and the patterns you are matching. Rather than making these values dynamic, class users provide the values in the class constructor. The class constructor appears below.

public StreamSearchExpression(Stream stream, string[] patterns,
   int bufferSize, int trailLeadAdd)
{
   _stream = stream;
   _patterns = patterns;
   _bufferSize = bufferSize;
   _trailLeadAdd = trailLeadAdd;
}

The Check method on StreamSearchExpression initiates the searching process. The Check method appears below.

public bool Check(out string patternMatched, out long positionEnd)
{
   bool patternPresent = false;
   StringBuilder builder = new StringBuilder();

   patternMatched = "";
   positionEnd = -1;

   if (_stream.Length > 0)
   {
      InitBuffer(builder);

      //The pattern may be right at the beginning
      patternPresent = IsMatchInBuffer(builder,
         out patternMatched, out positionEnd);

      //Otherwise, advance the buffer until a pattern matches
      //or the end of the Stream is reached
      while (!patternPresent &&
         _stream.Position != _stream.Length)
      {
         MoveBuffer(builder, _trailLeadAdd);

         patternPresent = IsMatchInBuffer(builder,
            out patternMatched, out positionEnd);
      }
   }

   return patternPresent;
}

As you can see, Check loops through the Stream, advancing the buffer until one of the patterns in the array of patterns is found. Although a Regular Expression can be written to work like an array of patterns, we opted for the array mostly to eliminate the need to write a more complicated Regular Expression.
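To illustrate the trade-off, the sketch below compares testing an array of simple patterns one at a time with joining them into a single expression using the alternation operator (|). PatternArraySketch and its method names are hypothetical, and the code is an illustration rather than our production code.

```csharp
using System.Text.RegularExpressions;

public static class PatternArraySketch
{
   // Array style: try each simple pattern in turn and return the
   // first pattern in the array that matches, or null if none does.
   public static string FirstMatchingPattern(string buffer,
      string[] patterns)
   {
      foreach (string pattern in patterns)
      {
         if (Regex.IsMatch(buffer, pattern))
         {
            return pattern;
         }
      }
      return null;
   }

   // Single-expression style: join the patterns with "|" and return
   // the text of the leftmost match in the buffer, or null if none.
   public static string FirstMatchedText(string buffer,
      string[] patterns)
   {
      Match match = Regex.Match(buffer, string.Join("|", patterns));
      return match.Success ? match.Value : null;
   }
}
```

The two styles can disagree about which pattern "wins": the array reports the first pattern in array order, whereas the joined expression reports whichever alternative occurs earliest in the text. Joining also assumes the individual patterns are simple enough not to need grouping.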

IsMatchInBuffer is straightforward, but MoveBuffer requires further discussion. The MoveBuffer function appears below.

private void MoveBuffer(StringBuilder builder, int byteCount)
{
   ASCIIEncoding encoder = new ASCIIEncoding();
   int actualReadCount = byteCount;
   int totalRead = 0;
   int read;

   //Trim the oldest characters, but never more than the buffer holds
   if (builder.Length > 0)
   {
      builder.Remove(0, Math.Min(byteCount, builder.Length));
   }

   //Don't get more than what is left
   if ((_stream.Position + actualReadCount) > _stream.Length)
   {
      actualReadCount = (int)(_stream.Length - _stream.Position);
   }

   byte[] data = new byte[actualReadCount];

   //Read can return fewer bytes than requested, so keep reading
   //until the request is filled or the Stream ends
   while (totalRead < actualReadCount &&
      (read = _stream.Read(data, totalRead,
         actualReadCount - totalRead)) > 0)
   {
      totalRead += read;
   }

   //Decode only the bytes that were actually read
   builder.Append(encoder.GetString(data, 0, totalRead));
}

The .NET documentation recommends StringBuilder for repeated string copy, append, and remove operations, so MoveBuffer performs all of its string manipulation on a StringBuilder. As discussed earlier, the ASCIIEncoding class changes the Stream bytes into a string. The buffer moves along the Stream like a sliding window: characters are removed from the trailing edge of the window (the start of the StringBuilder) and new ones are added at the leading edge (the end of the StringBuilder).
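Because InitBuffer and IsMatchInBuffer are not listed in this article, the following self-contained sketch shows one way the whole sliding-window search could fit together. It is not the StreamSearchExpression source: SlidingWindowSketch, Find, and Fill are hypothetical names, the initial fill and the per-pattern test stand in for InitBuffer and IsMatchInBuffer, and positionEnd is assumed to mean the Stream offset just past the match. It also assumes a seekable Stream, ASCII content, and a step of at least 1.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class SlidingWindowSketch
{
   // Returns true when one of the patterns is found in some window.
   public static bool Find(Stream stream, string[] patterns,
      int bufferSize, int step,
      out string patternMatched, out long positionEnd)
   {
      StringBuilder builder = new StringBuilder();
      patternMatched = "";
      positionEnd = -1;

      // Stand-in for InitBuffer: fill the window for the first time.
      Fill(stream, builder, bufferSize);

      while (builder.Length > 0)
      {
         // Stand-in for IsMatchInBuffer: test each pattern in turn.
         string window = builder.ToString();
         long windowStart = stream.Position - window.Length;
         foreach (string pattern in patterns)
         {
            Match match = Regex.Match(window, pattern);
            if (match.Success)
            {
               patternMatched = pattern;
               positionEnd = windowStart + match.Index + match.Length;
               return true;
            }
         }

         if (stream.Position >= stream.Length)
         {
            break;   // nothing left to read
         }

         // Slide the window: drop the oldest characters, append new ones.
         builder.Remove(0, Math.Min(step, builder.Length));
         Fill(stream, builder, step);
      }
      return false;
   }

   // Reads up to count bytes (looping over short reads) and appends
   // the decoded ASCII text to the builder.
   private static void Fill(Stream stream, StringBuilder builder,
      int count)
   {
      byte[] data = new byte[count];
      int total = 0;
      int read;
      while (total < count &&
         (read = stream.Read(data, total, count - total)) > 0)
      {
         total += read;
      }
      builder.Append(Encoding.ASCII.GetString(data, 0, total));
   }
}
```

Calling Find on a MemoryStream holding the alphabet example with a buffer of 16 and a step of 8 locates "zabcdef", whereas a buffer of 7 and a step of 6 walks past it, matching the earlier discussion of buffer and step sizes.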

Conclusion

Regular Expression support is common in many development tools and applications. Although .NET supports Regular Expression string search via the Regex class, it has no support for byte Streams. We developed a Stream Regular Expression search class as part of a larger effort to scan incoming email received by a POP3-configured BizTalk Receive Port.

Sources



About the Author

Jeffrey Juday

Jeff is a software developer specializing in enterprise application integration solutions utilizing BizTalk, SharePoint, WCF, WF, and SQL Server. Jeff has been developing software with Microsoft tools for more than 15 years in a variety of industries including: military, manufacturing, financial services, management consulting, and computer security. Jeff is a Microsoft BizTalk MVP. Jeff spends his spare time with his wife Sherrill and daughter Alexandra.
