Classic Parsing with Flex and Bison

CodeGuru content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

Compilers are a class of programs that translate text written in a source language (such as C++) to a target language (in other words, object code). Writing a compiler from scratch is time-consuming and the resulting product often is difficult to debug. So many developers rely on tools to produce compilers that are both efficient and maintainable. One of the oldest such tools is the Yacc compiler generator, the first official version of which was released by Stephen C. Johnson of Bell Labs in 1975.

The modern successor to Yacc is GNU Bison. (The Bison is a distant relative of a herd animal called a Yak.) Bison is available for almost any platform you can imagine. However, because Bison is useful only for programmers, its releases are very much do-it-yourself. If you are using Linux, BSD, or any POSIX-compliant version of Linux, Bison almost certainly has been installed for you and is ready to go!

If you are a Windows user or you simply demand the latest and greatest, you can visit the GNU FTP site. However, the path of least resistance for Windows users is installing the Cygwin tools. Cygwin is a port of common GNU and other Linux/Unix public domain tools for the Win32 platform, including tar, gunzip, bash, gcc, and bison in Win32 executable binary form. Windows users can access Cygwin mirror sites around the world easily with their download tool (“setup.exe”) and get going with little fuss.

Flex and Bison Are Friends

Lexical analysis is the lowest level of translation activity. A lexical analyzer or scanner converts an incoming stream of characters into an outgoing stream of tokens. The parser section of a compiler then translates the input stream of tokens into a series of semantic actions. The code generator, another part of your complete compiler, is isolated from the parser section as much as possible. The parser receives its tokens from the lexical scanner as the return value of yylex(). The yylex() function assembles the tokens by matching patterns against input strings. A particular instance of a token is called a lexeme.

Expanding on my previous article “Flex Your Lexical Analysis Muscles,” which explored the Lex and Flex tools for generating lexical analyzers, this article builds up your Flex muscles to produce a mini-compiler for a calculator language. The example was developed with GNU Bison 1.28 and Flex 2.5.4 pre-compiled binaries from Cygwin. Figure 1 shows the Flex and Bison interface.

Figure 1. Flex/Bison Interface

Parsers associate an action with each grammar rule. Each action may make use of the values of previous actions and return new values. The grammar is a set of rules that specify the structure of valid input. The parser uses an internal representation called a parse tree, which represents a derivation of the grammar.

Bison generates a special type of bottom-up parser called LALR(1). In a bottom-up parser, the parse tree contains both tokens and values that are propagated from the leaves up to the root. Attributes transmitted in this way are called synthesized attributes.

Parse trees produced by bottom-up parsers can be annotated by evaluating the rules at each node and passing the values from leaves to root. The sample parse tree in Figure 2 shows the derivation of the expression 3 * (4 + 5) using the expression grammar of tcalc.y (see Listing 2).

Figure 2. Derivation of 3 * ( 4 + 5 )

To use Flex with Bison, you specify the -d option to Bison to generate the include file y.tab.h, which contains definitions of all the %tokens appearing in the Bison input. This file is then incorporated into your Flex scanner (“.l” file).

The Structure of a Bison Specification

Bison specification files consist of three sections: declarations, grammar rules, and programs. Each section is separated by the double percent (“%%”) marker. A typical specification file is structured as follows:

declarations
%%
rules
programs

Only the rules section is required for a Bison specification. The declaration and program section may be omitted. However, the first %% marker must be present. The rules form the basis for the parser’s action, so they must be dealt with first.

Rules

The rules of a Bison input grammar specify the names of terminals (in other words, tokens) and nonterminals. Any name that is not explicitly declared as a token is assumed to be a nonterminal. Each nonterminal must be associated with the left-hand side of at least one rule.

A prototype rule specification has the following general form:

nonterminal : BODY ;

where “nonterminal” is a user-specified name and BODY is a string of zero or more names and character constants. For example, consider one of the rules for an expression in the example calculator language (more on this shortly):

expr: expr PLUS term        { $$ = $1 + $3;}

The PLUS token represents the character ‘+’, as you will see later in the lexer.l input file. The gist of this rule is that an expression plus a term can be replaced by the sum of an expression and a term when the infix PLUS operator is present. Rules that specify the same left-hand nonterminal may be grouped by the vertical bar ‘|’ alternation operator. For example, these rules:

term: term ASTERISK factor  { $$ = $1 * $3; }
term: term SLASH factor     { $$ = $1 / $3; }
term: factor

Can be replaced with this equivalent:

term: term ASTERISK factor  { $$ = $1 * $3; }
    | term SLASH factor     { $$ = $1 / $3; }
    | factor

…which improves readability.

Declarations

All tokens used in the Bison specification file must be listed in the declarations section. Tokens that require no precedence or associativity information are simply listed as %token with no special significance to order.

One nonterminal must be designated as the start symbol. The start symbol has special significance to the parser because it represents the largest structure described by the grammar. The start symbol defaults to the left-hand side of the first rule. This default may be overridden with the %start keyword in the declaration section as follows:

%start symbol

Formally, the parser “accepts” the input if—and only if—the tokens parsed can be reduced to the start symbol when end-of-file is reached. If the end-of-file marker appears in any other context, an error has occurred. In Bison, the yylex() function is expected to return zero for an end-of-file condition.

Bison and Actions

The parser must attach actions to rules if it is to do more than simply accept or reject an input stream. The prototype form of a Bison rule specification allows actions to be placed within the body of the rule:

nonterminal : BODY ;

Actions are bracketed sets of C language statements, which are executed immediately upon recognition of the body.

To facilitate communication between parsing and actions, Bison provides a convention called pseudo-variables. The pseudo-variables $1 through $n refer to the values of the previous actions of the body (indexing from left to right). The $$ pseudo-variable represents the return value of the action. Each action can act like a function by returning its values for use by subsequent rules.

The pseudo-variables are substituted for internal Bison variables of the type yytype. The default yytype can be overridden for more complicated structures if necessary. If no action is specified, yacc will supply $$ = $1 as the default action.

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read