Sometimes, you need to analyze a file and look for a simple parsing solution. For this case, I developed some base classes and have written a description about how you can make it. As example, I used an expression calculator. It is easy but a very powerful sample, I think. To develop your own parser, you only need to write rules for your file and write simple classes based on those rules. As input, you can use any source, but the default implementation used a text file. You can get different parts (as variable names, numbers, and so forth) of your file as output.
The sample application can read expressions from a file or from the text box. Expressions are not only parsed but executed too. So, you can use it as a powerful calculator.
In addition, I tried to explain how to create your own parser based on my classes.
What Parsing Is
When I once thought about parsing, I had only one idea—it is too complicated. One time, I really needed a parser for a simple language; this allows me to input complex data for an application. I had no choice and needed to learn how to write a parser. When I finished my project, I thought otherwise—it is not complicated. From time to time, I needed to write some simple parsers and I think now—it is really easy. I hope that you also could change your opinion after reading this article and trying the sample.
I don't want to tell a lot about parsing theory; you can find plenty of good books and articles. I only want to show one "working way" of how to create your own parser and use my parser classes.
First, you should find a guinea pig, because you need something to parse. For the sample ,I choose a simple "expressions language," like this one:
A = 1+2 B = (3+4) * 5 C = A / B
As you can see, it is a really easy language. In each line, you have a variable on the left side and an expression on the right side of the equals symbol. For every line, you must calculate the expression on the right and assign it to the variable on the left.
It looks very easy, doesn't it? But, if you want immediately write the program—stop it, please. You first need a formal description of your language, so you need to write grammar for your language (or set of rules and definitions). I have already done it for you:
expressions_file ::= (comment_line | expression_line) comment_line :: = ''' | '//' text expression_line ::= VARIABLE_NAME '=' expression | function_designator expression ::= simple_expression [ relational_operator simple_expression [if_operator] ] //Note: lowest priority operation if_operator ::= '?'expression ':' expression relational_operator ::= '=='|' !=' | '<' | '<=' | '>' | '>= simple_expression ::= [ '+' | '-' ] term (addition_operator term )? addition_operator ::= '+' | '-' term ::= factor (multiplication_operator factor)? multiplication_operator ::= '*' | '/' factor ::= NUMBER | VARIABLE_NAME | function_designator //Note: highest priority operation | '(' expression ')' | ! factor function_designator ::= FUNCTION_NAME [ actual_parameter_list ] actual_parameter_list ::= '(' actual_parameter [ ',' actual_parameter ]? ')' actual_parameter ::= expression text ::= any symbols NUMBER ::= (digit)* '.' (digit)* VARIABLE_NAME ::= any_letter ('_' | any_letter | digit) ? FUNCTION_NAME ::= any_letter ('_' | any_letter | digit) ?
Does it look a little bit strange and not clear? Don't worry; I'll describe it now. For the sample, I'll read you the first line aloud:
expressions_file ::= (comment_line | expression_line)*
The expressions file /expression_file/ can consist of /::=/ repeating one or more times, and the /()*/ line with comment /comment_line/ or /|/ line with expression /expression_line/. Can you understand it now? I'll repeat the explanation for some other lines:
expression ::= simple_expression [ relational_operator simple_expression [if_operator] ]
Expression /expression/ is defined as /::=/ simple expression /simple_expression/, after which could be the // relational operator with the simple expression /relational_operator simple_expression/ and possible // if operator /if_operator/.
actual_parameter_list ::= '(' actual_parameter ( ',' actual_parameter )? ')'
The function actual parameters list /actual_parameter_list/ is defined as /::=/ opening parentheses (/'('/, after which must be the actual parameter /actual_parameter/, comma /','/ and the next actual parameter /actual_parameter/ could be repeated null or more time /()?/. After the parameters, you must place closing parentheses ) /')'/.
I hope you can read the whole grammar of the "expressions language" without any problem now. As you can see, at every step you know the next part of your language. It is very important to notice in case the algorithm doesn't work very well. You describe your language from big parts to smaller ones and each next line gives you additional details.
When you define the grammar for a language, you can start to write the program that can read and "understand" your language and execute the commands. When I develop a program, I try to follow Caesar's strategy of "Divide and Conquer." I divide the program description into small parts and develop each of them separately. Following this principle, you need at least two parts of your program: one part for reading and "understanding" the language and the second part for execution. The first part you called parser, which makes lexical and syntactic analyses of language. The parser built some structures needed for future work. You split the parser into smaller parts too but I'll give you more details later.
For parsing, you can easy follow your grammar rules. As you can see, you need to extract some basic information from the input file, such as names, numbers, comments, special symbol combinations, and so on. You don't want to do that dirty work yourself, so you give it to another—to your base parser. The base parser needs to read lines from the file (scan lines), skip delimiters, and extract minimal (atomic) parts of your language. It's called a minimal part of language—a token (in linguistics, the word lexeme is often used). When you have the token, you need only to analyze the token, follow your grammar, and read the next token. At the current parsing step, you always look ahead for the next token, like after reading the current chapter, you know a title for the next one. You can compare parsing with navigation in an unknown city. First, you need a map—language grammar. Next, you could "scan" a city by walking and extracting the tokens—the street names. At every crossing, you must read the next street name and decide where you go according to the map and needed way.
Some of you can tell me: If you know the grammar of your language, you can parse it automatically with software like Yacc: http://en.wikipedia.org/wiki/Yacc/. Yes, you are right, but first you want to write our yown parser; and second, it is not easy to understand code generated by such a program. To say it more accurately, in each situation you can find the most suitable method.
I only hope that your method is suitable for you, and that you follow me. Oh, where are you? If you already lost, I'll give you a hand and show a little bit more. You take a simple expression line and split it into the tokens themselves (I have purposely added spaces after the +):
A = 1 + 2
Here is your result:
"A" - name, "=" - assign operation, "1" - number, "+" - add operation, "2" - number (notice that all spaces have been eaten. Who is hungry? I hope only your tokenizer). I extracted some lines from your grammar for comparison to show how your algorithm could "think" about the parsing path for sample expression line (a = 1+2):
expression_line ::= VARIABLE_NAME '=' expression expression ::= simple_expression simple_expression ::= term addition_operator term addition_operator ::= '+' term ::= factor factor ::= NUMBER
I hope you learned enough now to understanding the parsing technique and the base parsing classes.