Text Transformation with MGrammar and the Oslo SDK

My job involves moving data of different shapes and sizes, linking systems and business processes together. Normally, I use integration or data warehousing tools. Until I started using the "M" language in the Oslo SDK CTP, I had never considered building my own Domain Specific Language (DSL) as a tool in my repertoire.

If you've been following my Oslo SDK articles (http://www.codeguru.com/columns/experts/article.php/c15779/), you've been introduced to MSchema, MGraph, and the Repository, all components of the Oslo SDK modeling backbone. MGrammar is the third component of the "M" language. However, instead of defining data structure like MSchema, MGrammar defines data transformation, in particular the transformation of human-readable text. Continuing with the sample model I developed in the other articles, I'm going to show you how MGrammar can be employed to populate Repository data.

Oslo Overview

Oslo is composed of the following components displayed in Figure 1.

Figure 1: Oslo Architecture

Source: "Microsoft PDC 2008: A Lap Around Oslo"

  • "M", a language for composing models
  • The Repository, a SQL Server database designed for storing models
  • Quadrant, a tool for editing and viewing model data

Currently, Quadrant is only available to PDC attendees. M and the Repository come with the Oslo SDK, which is available on the Oslo Developer Center site (http://msdn.microsoft.com/en-us/oslo/default.aspx).

Oslo's goal is to deliver a foundation for building and storing models of all types. Models are application metadata formatted for runtime consumption. Separate Microsoft initiatives aim to build runtimes and tooling into applications such as Visual Studio that are Oslo model aware.

As I mentioned earlier, the M language is composed of MSchema, MGraph, and MGrammar. A complete introduction to M is beyond the scope of this article. MSchema and MGraph were covered in my prior articles; this article will acquaint you with MGrammar.

MGrammar Overview

Unlike XML, text is a natural human-consumable data medium. Although text can be semi-structured like XML, it is not standardized the way XML is. Parsing text with traditional development tools, for example to store it in a relational database, is not hard to do, but it is hard to do right. MGrammar bridges the gap between plain, human-readable, composable text data and XML, making semi-structured text parsing more approachable.

In a typical MGrammar program, a developer defines the patterns to search for in the text and how each pattern is translated into MGraph. MGraph looks a lot like inline C# collections. Oslo utilizes MGraph to populate MSchema models in the Repository.

In MGrammar, a developer defines a set of Rules for transforming text into MGraph. MGrammar has three types of Rules:

  • Token rules work and look a lot like regular expressions.
  • Syntax rules can be composed of Tokens and define the MGraph produced from text input.
  • Interleave rules define ignored text.

There are other features of MGrammar. However, a complete survey of the language is beyond the scope of this article and Rules are really the core of the language. So, I'm going to focus on Rules and, in particular, Token and Syntax rules. Using a sample, I'll illustrate how some basic Token and Syntax capabilities are employed to parse text.
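Because Interleave rules are not revisited in this article, here is a rough analogy of the idea in Python rather than MGrammar. It is only a sketch under my own assumptions: the names INTERLEAVE, WORD, and tokenize are hypothetical, and whitespace is assumed to be the text that should be ignored between meaningful tokens.

```python
import re

# Hypothetical analogy: the "interleave" pattern is whitespace that may appear
# between meaningful tokens and is discarded rather than reported.
INTERLEAVE = re.compile(r"\s+")
WORD = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Return the words in text, silently skipping interleaved whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        skipped = INTERLEAVE.match(text, pos)
        if skipped:
            # Ignored text: advance past it without emitting anything.
            pos = skipped.end()
            continue
        word = WORD.match(text, pos)
        if word is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(word.group())
        pos = word.end()
    return tokens
```

The point of the sketch is the separation of concerns: the rule describing ignorable text is declared once, and the rules describing meaningful text never have to mention it.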

Sample Overview

The sample leverages the models I built in my prior article http://www.codeguru.com/columns/experts/article.php/c15779/. Model code snippets appear below.

type Requirement : Item
{
   Description : Text?;
   ApplicationId : Integer64;
}

Requirements : Requirement* where
item.ApplicationId in MyApplications.Id;

type ServerConfiguration : Item
{
   Server : Text;
   ApplicationId : Integer64;
}

ServerConfigInfo : ServerConfiguration* where
item.ApplicationId in MyApplications.Id;
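The where clauses above constrain every ApplicationId to an Id that exists in MyApplications, an extent defined in the prior article and not repeated here. As a hedged illustration of that membership check, the following Python sketch uses hypothetical in-memory data; the class layout, the my_application_ids set, and the violations function are my own stand-ins, not part of the Oslo SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirement:
    description: Optional[str]   # mirrors Description : Text?
    application_id: int          # mirrors ApplicationId : Integer64


# Hypothetical stand-in for the Id values held by the MyApplications extent.
my_application_ids = {100, 200}


def violations(requirements, known_ids):
    """Return the requirements whose application_id is not a known Id."""
    return [r for r in requirements if r.application_id not in known_ids]
```

A Requirement with application_id 100 passes this check, while one with an Id that is absent from the set is reported as a violation.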

Figure 2 depicts the sample application running in Intellipad, a development tool shipping with the Oslo SDK.

Figure 2: Application Execution

Enter a typical phrase and the text is translated to MGraph targeting the Requirement model. Enter text beginning with Server= and the application targets the ServerConfiguration model.

I've shown what the application does. Now, I want to show how it works.

MGrammar Structure

Following is the entire MGrammar sample.

module Sample.MGrammar
{
   language ModelMGraph
   {
      syntax Main = Req | ServerConfig | Nil;

      syntax Req = InputText:MultipleTextValues =>
         Requirement { Description {InputText},
         ApplicationId  {100} };
      syntax ServerConfig = "Server="
         InputNum:MultipleTextValues =>
         ServerConfiguration { Server {InputNum},
         ApplicationId  {100} };
      //Handles empty input; without this rule, empty input causes an error
      syntax Nil = empty;

      token TextValue = "a".."z" | "A".."Z" | " ";
      token MultipleTextValues = (TextValue)+;
   }
}

As with other "M" programs, MGrammar applications are scoped to a module. Also like MSchema, MGrammar applications can import and export libraries of modules. As you can see, MGrammar supports comments. As I mentioned before, the language also supports other features, for example preprocessor directives such as #if and #define. Keep in mind, though, that the Oslo SDK is a CTP and MGrammar's current incarnation is not complete. The language keyword begins the application definition.

Main is the application entry point. Main must always be a Syntax. Text input into the MGrammar must match a defined Syntax or the application emits an error. In the example, there are three Syntaxes defining three distinct patterns: Req, ServerConfig, and Nil. Later in the article, I'll explain how a Syntax is constructed. I want to start with the Tokens.

Token Rules

The Tokens in the sample application appear below.

token TextValue = "a".."z" | "A".."Z" | " ";
token MultipleTextValues = (TextValue)+;

Tokens look and act a lot like Regular Expressions. Like Regular Expressions, Tokens can define patterns of text, numeric ranges, whitespace characters, and operators (&, |, *, +, etc.). Tokens can also include other Tokens.

In the example, TextValue defines a pattern matching a single lowercase letter, uppercase letter, or space. MultipleTextValues defines a pattern matching one or more TextValue characters, that is, any run of letters and spaces.
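To make the regular expression comparison concrete, here is a rough Python equivalent of the two Tokens. This is an analogy only, not MGrammar; the constant and function names are hypothetical, and I am assuming the whole input must match, as it does when the Token is the only thing a Syntax consumes.

```python
import re

# Rough regex equivalents of the two token rules (an analogy, not MGrammar).
TEXT_VALUE = r"[a-zA-Z ]"                      # token TextValue
MULTIPLE_TEXT_VALUES = f"(?:{TEXT_VALUE})+"    # token MultipleTextValues = (TextValue)+


def is_multiple_text_values(text):
    """True when the whole input is one or more letters or spaces."""
    return re.fullmatch(MULTIPLE_TEXT_VALUES, text) is not None
```

Note that digits and punctuation fall outside TextValue, so input such as "Server=web" does not match MultipleTextValues on its own.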

Syntax Rules

The application Syntaxes appear below.

syntax Main = Req | ServerConfig | Nil;

syntax Req = InputText:MultipleTextValues =>
   Requirement { Description {InputText}, ApplicationId  {100} };
syntax ServerConfig = "Server="
   InputNum:MultipleTextValues =>
   ServerConfiguration { Server {InputNum}, ApplicationId  {100} };
//Handles empty input; without this rule, empty input causes an error
syntax Nil = empty;

Syntaxes handle the input and resulting MGraph output, also called a Projection. Everything to the right of the "=>" symbol is the Projection (output text).

Syntaxes can include scoped variables. Variables can be used in the Projection. Syntaxes can be composed from multiple Syntaxes and Tokens.

In the example, Req uses a variable called InputText. InputText contains the full text passed into the application and is used in the MGraph output. ServerConfig utilizes a variable called InputNum in a similar fashion, capturing the text that follows "Server=".

Nil is a special case Syntax. If nothing is passed into the application, an empty Projection is generated. The keyword empty is part of MGrammar.

Main includes all of the defined Syntaxes. As I mentioned before, Main is the entry point of the application.
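The way Main chooses between its alternatives and builds a Projection can be approximated in Python. The sketch below is my own analogy under stated assumptions: the function main, the use of dictionaries to stand in for MGraph, and the use of None for the empty Projection are all hypothetical, and a real MGrammar parser does not work by trying regular expressions in order.

```python
import re

TEXT = r"[a-zA-Z ]+"   # stands in for the MultipleTextValues token


def main(text):
    """Try each alternative of Main and return its projection as a dict."""
    if text == "":                                   # syntax Nil = empty
        return None                                  # stand-in for the empty Projection
    server = re.fullmatch(f"Server=({TEXT})", text)  # syntax ServerConfig
    if server:
        return {"ServerConfiguration": {"Server": server.group(1),
                                        "ApplicationId": 100}}
    if re.fullmatch(TEXT, text):                     # syntax Req
        return {"Requirement": {"Description": text,
                                "ApplicationId": 100}}
    # Mirrors the article's point: input matching no Syntax is an error.
    raise ValueError(f"input does not match any syntax: {text!r}")
```

For example, main("Server=web") yields a ServerConfiguration entry whose Server value is "web", while main("Log every order") yields a Requirement entry whose Description is the full input.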

Running the Sample

There are some tricks to running the sample code.

First, to build and run an MGrammar application, you must select the "Sample Enabled" Intellipad. Next, you must select "Minibuffer" and type SetMode("MGMode") to enable MGrammar. The graphic below demonstrates enabling MGrammar mode.

Figure 3: Setting MGrammar Mode

Finally, you must select the "Tree Preview" option to be able to enter text and see it transformed to MGraph. The graphic below demonstrates how to do this.

Figure 4: Tree Preview Selection

Conclusion

MGrammar is a feature of the "M" programming language shipping with the Oslo SDK. MGrammar was built to make transforming semi-structured text easier. Rules are the core of MGrammar. Syntax and Token rules define the text patterns and transformations.

About the Author

Jeffrey Juday is a software developer specializing in enterprise integration solutions utilizing BizTalk, SharePoint, WCF, WF, and SQL Server. Jeff has been developing software with Microsoft tools for more than 15 years in a variety of industries including: military, manufacturing, financial services, management consulting, and computer security. Jeff is a Microsoft BizTalk MVP. Jeff spends his spare time with his wife Sherrill and daughter Alexandra. You can reach Jeff at me@jeffjuday.com.


