Parsing Is Easy: Base C Sharp Classes and Expressions Calculator

Introduction

Sometimes, you need to analyze a file and look for a simple parsing solution. For this case, I developed some base classes and have written a description about how you can make it. As example, I used an expression calculator. It is easy but a very powerful sample, I think. To develop your own parser, you only need to write rules for your file and write simple classes based on those rules. As input, you can use any source, but the default implementation used a text file. You can get different parts (as variable names, numbers, and so forth) of your file as output.

The sample application can read expressions from a file or from the text box. Expressions are not only parsed but executed too. So, you can use it as a powerful calculator.

In addition, I tried to explain how to create your own parser based on my classes.

What Parsing Is

When I once thought about parsing, I had only one idea—it is too complicated. One time, I really needed a parser for a simple language; this allows me to input complex data for an application. I had no choice and needed to learn how to write a parser. When I finished my project, I thought otherwise—it is not complicated. From time to time, I needed to write some simple parsers and I think now—it is really easy. I hope that you also could change your opinion after reading this article and trying the sample.

I don't want to tell a lot about parsing theory; you can find plenty of good books and articles. I only want to show one "working way" of how to create your own parser and use my parser classes.

First, you should find a guinea pig, because you need something to parse. For the sample ,I choose a simple "expressions language," like this one:

A = 1+2
B = (3+4) * 5
C = A / B

As you can see, it is a really easy language. In each line, you have a variable on the left side and an expression on the right side of the equals symbol. For every line, you must calculate the expression on the right and assign it to the variable on the left.

It looks very easy, doesn't it? But, if you want immediately write the program—stop it, please. You first need a formal description of your language, so you need to write grammar for your language (or set of rules and definitions). I have already done it for you:

expressions_file ::= (comment_line | expression_line)
comment_line :: = ''' | '//' text
expression_line ::= VARIABLE_NAME '=' expression |
   function_designator
expression ::= simple_expression
   [ relational_operator simple_expression [if_operator] ]
//Note: lowest priority operation
if_operator ::= '?'expression ':' expression
relational_operator ::= '=='|' !=' | '<' | '<=' | '>' | '>=
simple_expression ::= [ '+' | '-' ] term (addition_operator term )?
addition_operator ::= '+' | '-' 
term ::= factor (multiplication_operator factor)?
multiplication_operator ::= '*' | '/'
factor ::= NUMBER | VARIABLE_NAME
   | function_designator
   //Note: highest priority operation
   | '(' expression ')' | ! factor
function_designator ::= FUNCTION_NAME [ actual_parameter_list ] 
actual_parameter_list ::= '(' actual_parameter
   [ ',' actual_parameter ]? ')'
actual_parameter ::= expression
text ::= any symbols
NUMBER ::= (digit)* '.' (digit)*
VARIABLE_NAME ::= any_letter ('_' | any_letter | digit) ?
FUNCTION_NAME ::= any_letter ('_' | any_letter | digit) ?

Does it look a little bit strange and not clear? Don't worry; I'll describe it now. For the sample, I'll read you the first line aloud:

expressions_file ::= (comment_line | expression_line)*

The expressions file /expression_file/ can consist of /::=/ repeating one or more times, and the /()*/ line with comment /comment_line/ or /|/ line with expression /expression_line/. Can you understand it now? I'll repeat the explanation for some other lines:

expression ::= simple_expression
   [ relational_operator simple_expression [if_operator] ]

Expression /expression/ is defined as /::=/ simple expression /simple_expression/, after which could be the /[]/ relational operator with the simple expression /relational_operator simple_expression/ and possible /[]/ if operator /if_operator/.

actual_parameter_list ::= '(' actual_parameter
   ( ',' actual_parameter )? ')'

The function actual parameters list /actual_parameter_list/ is defined as /::=/ opening parentheses (/'('/, after which must be the actual parameter /actual_parameter/, comma /','/ and the next actual parameter /actual_parameter/ could be repeated null or more time /()?/. After the parameters, you must place closing parentheses ) /')'/.

I hope you can read the whole grammar of the "expressions language" without any problem now. As you can see, at every step you know the next part of your language. It is very important to notice in case the algorithm doesn't work very well. You describe your language from big parts to smaller ones and each next line gives you additional details.

When you define the grammar for a language, you can start to write the program that can read and "understand" your language and execute the commands. When I develop a program, I try to follow Caesar's strategy of "Divide and Conquer." I divide the program description into small parts and develop each of them separately. Following this principle, you need at least two parts of your program: one part for reading and "understanding" the language and the second part for execution. The first part you called parser, which makes lexical and syntactic analyses of language. The parser built some structures needed for future work. You split the parser into smaller parts too but I'll give you more details later.

For parsing, you can easy follow your grammar rules. As you can see, you need to extract some basic information from the input file, such as names, numbers, comments, special symbol combinations, and so on. You don't want to do that dirty work yourself, so you give it to another—to your base parser. The base parser needs to read lines from the file (scan lines), skip delimiters, and extract minimal (atomic) parts of your language. It's called a minimal part of language—a token (in linguistics, the word lexeme is often used). When you have the token, you need only to analyze the token, follow your grammar, and read the next token. At the current parsing step, you always look ahead for the next token, like after reading the current chapter, you know a title for the next one. You can compare parsing with navigation in an unknown city. First, you need a map—language grammar. Next, you could "scan" a city by walking and extracting the tokens—the street names. At every crossing, you must read the next street name and decide where you go according to the map and needed way.

Some of you can tell me: If you know the grammar of your language, you can parse it automatically with software like Yacc: http://en.wikipedia.org/wiki/Yacc/. Yes, you are right, but first you want to write our yown parser; and second, it is not easy to understand code generated by such a program. To say it more accurately, in each situation you can find the most suitable method.

I only hope that your method is suitable for you, and that you follow me. Oh, where are you? If you already lost, I'll give you a hand and show a little bit more. You take a simple expression line and split it into the tokens themselves (I have purposely added spaces after the +):

A = 1 +     2

Here is your result:

"A" - name, "=" - assign operation, "1" - number, "+" - add operation, "2" - number (notice that all spaces have been eaten. Who is hungry? I hope only your tokenizer). I extracted some lines from your grammar for comparison to show how your algorithm could "think" about the parsing path for sample expression line (a = 1+2):

expression_line ::= VARIABLE_NAME '=' expression
expression ::= simple_expression
simple_expression ::= term addition_operator term
addition_operator ::= '+'
term ::= factor
factor ::= NUMBER

I hope you learned enough now to understanding the parsing technique and the base parsing classes.

Parsing Is Easy: Base C Sharp Classes and Expressions Calculator

Base Parser Classes

You can use two classes, CToken and CBaseParser (They say using C before a class name is a bad habit, but I like it because when I declare variable name I can use "Token" as the variable name and CToken as class name. In addition, I use some prefixes for variables names: "m_" for class data members, "n" for numeric values, "e" for enumeration values, and "s" for string values.). The CBaseParser class does all the dirty work for extracting needed information and CToken to hold information. It is a pity, but I have bad and good news for you. The bad first: You can't use the CBaseParser class directly because it is declared as an abstract class so you must define functions for data reading before you use it. The good news: I already defined the function for reading data from a text file in the CFileParser class that inherited from the CBaseParser class.

public abstract class CBaseParser
public class CFileParser : CBaseParser

I defined some constants as bits for the parser setup:

public enum ESTATE

ReturnEolToken: Sometimes you need to ignore the new line symbols, sometimes not. If you set this state then you will become new line symbol as token.

NumbersPriority: In some grammars, it is not possible to distinguish between hexadecimal numbers (without any prefix) and identifiers because both could be started from a letter. When you set this state, the parser would suggest that a new extracted text chunk is a number. Notice that by default in the CBaseParser class, hexadecimal numbers are not allowed. You need to inherit a new class and override the functions:

protected virtual bool IsNumberStart (char ch)
protected virtual bool IsNumberNext  (char ch)

to allow hexadecimal numbers (you can use the helper function IsHexDigit). To tune up the parser, you can overwrite a lot of functions (for more details, look at the source code):

//test char functions
protected virtual bool IsDelimiter (char ch)
protected virtual bool IsAlfa (char ch)
protected virtual bool IsAlfaNum (char ch)
protected virtual bool IsIdentStart (char ch)
protected virtual bool IsIdentNext (char ch)
protected virtual bool IsHexDigit (char ch)
protected virtual bool IsCommentStartFound (bool bMoveReadPosition)
protected virtual bool IsCommentEndFound   (bool bMoveReadPosition)
//Get functions
protected virtual bool GetIdent   (ref string Ident)
protected virtual bool GetComment (ref string Ident)
protected virtual bool GetNumber  (ref string Ident)
//Error handling functions
protected virtual void ErrorReport ()
public virtual void Error(string Text)
//Token handling function
public virtual bool GetNextToken(ref CToken Token)

Moreover, I give you access to the internal parser data, such as line number, current parser position, and parsed line.

public long LineNumber
public int ReadPos
public string CurrentLine

You could need it for error reporting.

The second class, CToken, contains a value and the value attributes. I created two attributes: m_eTokenClass as an enumeration type and m_nTokenType as an integer type.

public class CToken
...
   private ETokenClass m_eTokenClass = ETokenClass.ttNone;
   private int m_nTokenType = 0;
   private string m_sValue = "";
...

I developed it so that you can easy add you own values in an inherited class for a more detailed description. For a sample, if the value is "A", the token class variable could be "ETokenClass.Ident" and the token type variable is 1. It is possible you want your parser to give a more detailed description and an "A" as ident could be the function name, the variable name, the class name, and so on. For this purpose, you can use any bigger number as the "(int)ETokenClass.EndOfTokenClassValues". I used this technique in the expression parser sample.

Parsing Is Easy: Base C Sharp Classes and Expressions Calculator

"Expressions Language" Parser: Sample Classes

At last, you want to finish the tedious description and start building something more interesting. You can relax playing with the expression parser application. I can promise that the sample application can solve very hard expressions like: x = 2 + 2.

For those who don't want to play, but want to know a little bit more, I'll try to explain the expressions parser classes.

[Parsing2.png]

Because I want to define extended token type values, I created the CSimpexToken class (inherited from class CToken). And, because I need certain declarations, I inherited a new class from the CBaseParser class. The new class CExpressionParser defines:

  • Needed symbols combinations, such as ">=", "= =", "!=", and so on
  • Comment declarations
  • How to read lines, because lines come from the text box
  • Which tokens you need to skip, comment tokens for sample
  • How to display errors

All these actions are specific for a concrete application, so I can't define it in the base parser class. You can only ask me why I don't skip comments in the base class. It is easy; most often, I need the comments.

To parsing your expression language, you need a lot of classes because I would like to use separate classes for every grammar construction. Don't worry about the complexity; all classes are quite easy and mostly used only one function:

void Parse (CExecutor Executor, CExpressionParser Parser,
            ref CToken Token);

You need the"Executor" parameter to store the parsing result; it holds the expressions tree and variables table. Later, you use it to calculate expressions. The "Parser" parameter holds all input data and takes care of displaying error messages. The Token parameter gives you the current token (more correctly, the next token). I especially mark it as a modifiable parameter so that you will pay attention, although others two parameters modifiable too (but for you it is not so important).

[Parsing3.png]

The big question is: How do you write such a function? With your "slaves," it is very easy. You must only look at the grammar and then write the same constructions. For the sample, I used the construction "term":

term ::= factor (multiplication_operator factor)?
multiplication_operator ::= '*' | '/'

Here is "stripped" version of of the function (I left in the grammar parsing parts only):

public override void Parse (CExecutor Executor,
   CExpressionParser Parser, ref CToken Token)
{
   m_Factor = new CSimpexFactor();
   m_Factor.Parse(Executor, Parser, ref Token);
   while ((Token.TokenType == (int)CSimpexToken.ETokenType.Multiplay)
      || (Token.TokenType  == (int)CSimpexToken.ETokenType.Divide))
      {
      if (Token.TokenType == (int)CSimpexToken.ETokenType.Multiplay)
      {
         //skip *
         Parser.GetNextToken(ref Token);
      }
      else if (Token.TokenType == (int)CSimpexToken.ETokenType.Divide)
      {
         //skip /
         Parser.GetNextToken(ref Token);
      }
      CSimpexFactor Factor = new CSimpexFactor();
      Factor.Parse(Executor, Parser, ref Token);
   }
}

You see, at first you must parse "Factor". For this hard job, you need only two lines:

m_Factor = new CSimpexFactor();
m_Factor.Parse(Executor, Parser, ref Token);

What "m_Factor.Parse" did is not important now; you can implement this function later with the same approach as the current one. It's mportant that you have the next token after a call of the Parse function. If this token is the symbol "*" or "/", you must parse the next "Factor" and repeat this process while you have one of the "multiplication operators." As you can see, you only reflected the grammar rules to code. You must know only which tokens are allowed at the current step and which way you must follow after receiving the token. Try to compare every sample class with the corresponding grammar part to see how easy it is to write such a function.

When you have finished development for all parsing functions, start to test it with different language samples. I have found a lot of errors and can bet that you can find one more error in the sample code (if you have found one drop me a mail, please).

When you can parse all samples, it is time to implement "data storage." There, you can store all parsed data in the format best suited for the following data processing. For your sample, I chose a binary tree but you can choose something else for your task. (If you promise to keep a secret, I can tell you; in the next parser sample, "Mork mdb viewer", I used dictionaries and arrays.)

In an expression, you have left and a right values with an operation. The same binary tree has the left and the right node and the operation as the parent node. I drew the two trees for the sample. It is faster than the equal expressions, but with another operation priority. If you look carefully at the your grammatical rules, you can find that operation priority already "hidden" there. The binary tree data structure is very powerful. You can do a lot of operations with it, such as converting from postfix to infix notation and contrary as an expression calculation. All stems from the node traverse method. Start from the root /topmost node/ and do an operation with the left and right nodes. If the left or right node is operational, you try to calculate the operation result in the same manner. This is work for the CExecutor class. Most other classes represent different nodes.

[Parsing4.png]

The CExrpressionNode represents the base binary node object. CValueNode represents ready-to-use values, such as 2, 3, and 5. COperationNode knows all operations and can calculate it, like 2+3 or 3*5. CVaribaleNode or CExpressionNode only know the variable or expression name. If you need the related expression, you look it up in the expression container. CExpressionNode is the same as CVaribaleNode, but it is only an internal variable like an expression in the function parameters. CFunctionNode can calculate different functions. CExecutor contains all expressions and all variables.

I hope that you feel more comfortable now if you need to parse something. As for me, I needed parsers for many file types such as PDF, RTF, EPS, and so on. You have the possibility to make it before me. In this case, I will be happy that I did not waste my time when I wrote the classes and description.

January 2008 by Alex Nek.

Version in Russian. Latest version in English can be found here.



About the Author

Alex Nek

I started my way with Fortran and going via Pascal, Delphi, C and C++. I'm falling in love with C# now.

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Specialization and efficiency are always in need. Whether it's replacing an aging roof, getting a haircut, or tuning up a car, most seek the assistance of trusted experts. The same is true in the business world, where an increasing number of companies are seeking the help of others to administer their IT systems and services. This special edition of Unleashing IT highlights a new breed of IT caretaker -- Cisco Powered service providers -- and the business advantages and operational efficiencies they …

  • Java developers know that testing code changes can be a huge pain, and waiting for an application to redeploy after a code fix can take an eternity. Wouldn't it be great if you could see your code changes immediately, fine-tune, debug, explore and deploy code without waiting for ages? In this white paper, find out how that's possible with a Java plugin that drastically changes the way you develop, test and run Java applications. Discover the advantages of this plugin, and the changes you can expect to see …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds