Classic Parsing with Flex and Bison

Compilers are a class of programs that translate text written in a source language (such as C++) to a target language (in other words, object code). Writing a compiler from scratch is time-consuming and the resulting product often is difficult to debug. So many developers rely on tools to produce compilers that are both efficient and maintainable. One of the oldest such tools is the Yacc compiler generator, the first official version of which was released by Stephen C. Johnson of Bell Labs in 1975.

The modern successor to Yacc is GNU Bison. (The Bison is a distant relative of a herd animal called a Yak.) Bison is available for almost any platform you can imagine. However, because Bison is useful only for programmers, its releases are very much do-it-yourself. If you are using Linux, BSD, or any POSIX-compliant version of Linux, Bison almost certainly has been installed for you and is ready to go!

If you are a Windows user or you simply demand the latest and greatest, you can visit the GNU FTP site. However, the path of least resistance for Windows users is installing the Cygwin tools. Cygwin is a port of common GNU and other Linux/Unix public domain tools for the Win32 platform, including tar, gunzip, bash, gcc, and bison in Win32 executable binary form. Windows users can access Cygwin mirror sites around the world easily with their download tool ("setup.exe") and get going with little fuss.

Flex and Bison Are Friends

Lexical analysis is the lowest level of translation activity. A lexical analyzer or scanner converts an incoming stream of characters into an outgoing stream of tokens. The parser section of a compiler then translates the input stream of tokens into a series of semantic actions. The code generator, another part of your complete compiler, is isolated from the parser section as much as possible. The parser receives its tokens from the lexical scanner as the return value of yylex(). The yylex() function assembles the tokens by matching patterns against input strings. A particular instance of a token is called a lexeme.

Expanding on my previous article "Flex Your Lexical Analysis Muscles," which explored the Lex and Flex tools for generating lexical analyzers, this article builds up your Flex muscles to produce a mini-compiler for a calculator language. The example was developed with GNU Bison 1.28 and Flex 2.5.4 pre-compiled binaries from Cygwin. Figure 1 shows the Flex and Bison interface.

Figure 1. Flex/Bison Interface

Parsers associate an action with each grammar rule. Each action may make use of the values of previous actions and return new values. The grammar is a set of rules that specify the structure of valid input. The parser uses an internal representation called a parse tree, which represents a derivation of the grammar.

Bison generates a special type of bottom-up parser called LALR(1). In a bottom-up parser, the parse tree contains both tokens and values that are propagated from the leaves up to the root. Attributes transmitted in this way are called synthesized attributes.

Parse trees produced by bottom-up parsers can be annotated by evaluating the rules at each node and passing the values from leaves to root. The sample parse tree in Figure 2 shows the derivation of the expression 3 * (4 + 5) using the expression grammar of tcalc.y (see Listing 2).

Figure 2. Derivation of 3 * ( 4 + 5 )

To use Flex with Bison, you specify the -d option to Bison to generate the include file y.tab.h, which contains definitions of all the %tokens appearing in the Bison input. This file is then incorporated into your Flex scanner (".l" file).

The Structure of a Bison Specification

Bison specification files consist of three sections: declarations, grammar rules, and programs. Each section is separated by the double percent ("%%") marker. A typical specification file is structured as follows:

declarations
%%
rules
programs

Only the rules section is required for a Bison specification. The declaration and program section may be omitted. However, the first %% marker must be present. The rules form the basis for the parser's action, so they must be dealt with first.

Rules

The rules of a Bison input grammar specify the names of terminals (in other words, tokens) and nonterminals. Any name that is not explicitly declared as a token is assumed to be a nonterminal. Each nonterminal must be associated with the left-hand side of at least one rule.

A prototype rule specification has the following general form:

nonterminal : BODY ;

where "nonterminal" is a user-specified name and BODY is a string of zero or more names and character constants. For example, consider one of the rules for an expression in the example calculator language (more on this shortly):

expr: expr PLUS term        { $$ = $1 + $3;}

The PLUS token represents the character '+', as you will see later in the lexer.l input file. The gist of this rule is that an expression plus a term can be replaced by the sum of an expression and a term when the infix PLUS operator is present. Rules that specify the same left-hand nonterminal may be grouped by the vertical bar '|' alternation operator. For example, these rules:

term: term ASTERISK factor  { $$ = $1 * $3; }
term: term SLASH factor     { $$ = $1 / $3; }
term: factor

Can be replaced with this equivalent:

term: term ASTERISK factor  { $$ = $1 * $3; }
    | term SLASH factor     { $$ = $1 / $3; }
    | factor

...which improves readability.

Declarations

All tokens used in the Bison specification file must be listed in the declarations section. Tokens that require no precedence or associativity information are simply listed as %token with no special significance to order.

One nonterminal must be designated as the start symbol. The start symbol has special significance to the parser because it represents the largest structure described by the grammar. The start symbol defaults to the left-hand side of the first rule. This default may be overridden with the %start keyword in the declaration section as follows:

%start symbol

Formally, the parser "accepts" the input if—and only if—the tokens parsed can be reduced to the start symbol when end-of-file is reached. If the end-of-file marker appears in any other context, an error has occurred. In Bison, the yylex() function is expected to return zero for an end-of-file condition.

Bison and Actions

The parser must attach actions to rules if it is to do more than simply accept or reject an input stream. The prototype form of a Bison rule specification allows actions to be placed within the body of the rule:

nonterminal : BODY ;

Actions are bracketed sets of C language statements, which are executed immediately upon recognition of the body.

To facilitate communication between parsing and actions, Bison provides a convention called pseudo-variables. The pseudo-variables $1 through $n refer to the values of the previous actions of the body (indexing from left to right). The $$ pseudo-variable represents the return value of the action. Each action can act like a function by returning its values for use by subsequent rules.

The pseudo-variables are substituted for internal Bison variables of the type yytype. The default yytype can be overridden for more complicated structures if necessary. If no action is specified, yacc will supply $$ = $1 as the default action.

Classic Parsing with Flex and Bison

A Simple Calculator Language

I chose a calculator language as the example for many reasons. First of all, nearly all programming applications need to parse expressions with plus, minus, multiply, and divide operators and parenthesis. Second, it's a universally understood problem and not specific to any particular language. Although many examples use postfix (RPN) notation, this one uses good old infix notation since everyone understands that. Specifically, the goal is for the calculator to be able to handle expressions like this:

-1.1 + 2 * ( 4 / 3 )

...and produce the correct answer!

First off, you need to determine what the tokens of the language are. Basically, you will have numbers, four operators, parenthesis, and newline (standing in for the "=" key on your calculator). As such, you can write a straightforward Flex scanner (".l" file) to process this, as Listing 1 illustrates with lexer.l:

  1 %{
  2 // lexer.l -- From tcalc: a simple calculator program
  3
  4 #include <stdio.h>
  5 #include <string.h>
  6 #include <stdlib.h>
  7 #include "calc.tab.h"
  8 extern YYSTYPE yylval;
  9
 10 %}
 11 %option noyywrap
 12 delim         [ \t]
 13 whitesp       {delim}+
 14 digit         [0-9]
 15 number        [-]?{digit}*[.]?{digit}+
 16 %%
 17 {number}  { sscanf(yytext, "%lf", &yylval); return NUMBER;}
 18 "+"       { return PLUS; }
 19 "-"       { return MINUS; }
 20 "/"       { return SLASH; }
 21 "*"       { return ASTERISK; }
 22 "("       { return LPAREN; }
 23 ")"       { return RPAREN; }
 24 "\n"      { return NEWLINE; }
 25 {whitesp} { /* No action and no return */}

Listing 1. lexer.l

Note a few things in lexer.l (Listing 1):

  1. The definition of number on line 15 is a regular expression built up from instances of the earlier expression digit on line 14. The re-use of expressions is one of Flex's most powerful features.
  2. It thoroughly converts single-character symbols into tokens even though you could have interpreted them as characters later on in Bison. I feel that it enhances maintainability not to hardcode character references in the grammar file later. Also, it would help to isolate internationalization issues, if any, in the input language.
  3. Return values are used to designate the type of token, whereas the actual value of the token is stored in yytext.

Now, take a look at the grammar—the rules and actions that involve building the parse tree. Listing 2 shows these with tcalc.y:

  1 /* tcalc.y - a four function calculator */
  2 %{
  3 #define YYSTYPE double      /* yyparse() stack type */
  4 #include <malloc.h>
  5 #include <stdlib.h>
  6 %}
  7 /* BISON Declarations */
  8 %token NEWLINE NUMBER PLUS MINUS SLASH ASTERISK LPAREN RPAREN
  9
 10 /* Grammar follows */
 11 %%
 12 input:              /* empty string */
 13     | input line
 14     ;
 15 line: NEWLINE
        | expr NEWLINE           { printf("\t%.10g\n",$1); }
 17     ;
 18 expr: expr PLUS term         { $$ = $1 + $3; }
 19     | expr MINUS term        { $$ = $1 - $3; }
 20     | term
 21     ;
 22 term: term ASTERISK factor   { $$ = $1 * $3; }
 23     | term SLASH factor      { $$ = $1 / $3; }
 24     | factor
 25     ;
 26 factor:  LPAREN expr RPAREN  { $$ = $2; }
 27       | NUMBER
 28       ;
 29 %%
 30 /*--------------------------------------------------------*/
 31 /* Additional C code */
 32 /* Error processor for yyparse */
 33 #include <stdio.h>
 34 int yyerror(char *s)        /* called by yyparse on error */
 35 {
 36     printf("%s\n",s);
 37     return(0);
 38 }
 39
 40 /*--------------------------------------------------------*/
 41 /* The controlling function */
 42 int main(void)
 43 {
 44     yyparse();
 45     exit(0);
 46 }

Listing 2. tcalc.y

Like all good C programs, the program begins in the main() function on line 42. The instruction yyparse() begins the input parsing and returns when end-of-input is reached. The output of the calculator will be displayed after each NEWLINE token is reduced on line #16. Also, notice the yyerror() function called in times of parsing failure on lines 34-38. More sophisticated state management routines are available for error handling, but this is only a simple example!

By way of screen dumps, Figure 3 offers some proof that the tcalc application works.

[Victor3.jpg]

Figure 3. TCALC Screen Dumps

Here is the batch file I used to build the project in a CMD window:

set CCOPT=/c /Z7 /W3
bison calc.y
cl %CCOPT% calc.tab.c
bison -d calc.y
flex lexer.l
cl %CCOPT% lex.yy.c
link /DEBUG calc.tab.obj lex.yy.obj /out:tcalc.exe

Rapidly Develop Useful Parsers

Yacc/Bison remains one of the top 10 tools in the compiler writer's toolbox, despite being over 30 years old. Part of its enduring appeal is the similarity of the grammar input (".y" files) to the actual BNF syntax of the language you are parsing. This contributes greatly to the maintainability of Bison-based parsers. This ease of use combined with its low footprint and high degree of portability make Bison and its friend Flex the clear winners for rapidly developing useful parsers for today's challenging translation applications.

Book Recommendation

Compilers: Principles, Techniques, and Tools (Hardcover)

by Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman (ISBN 0201100886)

The legendary "dragon book" is still the preferred text for learning compiler design methodologies. This book has a perfect mixture of theory and practice that is unequalled. Along the way, you'll use real-world tools such as Flex and Bison to help develop your projects and, of course, your understanding of compiler technologies.

[Victor4.jpg]

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.



About the Author

Victor Volkman

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries, just drop an e-mail to sysop@HAL9K.com.

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Packaged application development teams frequently operate with limited testing environments due to time and labor constraints. By virtualizing the entire application stack, packaged application development teams can deliver business results faster, at higher quality, and with lower risk.

  • As mobile devices have pushed their way into the enterprise, they have brought cloud apps along with them. This app explosion means account passwords are multiplying, which exposes corporate data and leads to help desk calls from frustrated users. This paper will discover how IT can improve user productivity, gain visibility and control over SaaS and mobile apps, and stop password sprawl. Download this white paper to learn: How you can leverage your existing AD to manage app access. Key capabilities to …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds