Flex Your Lexical Analysis Muscles

What programming job is more common than reading a data file, doing some manipulations, and writing it out again? Probably none! In fact, a former manager of mine with 30 years' experience in what used to be called data processing opined that all programs are but a slight variation on that theme. Perhaps more often today we read data files and write XML, but the core task is essentially the same: acting accurately on incoming character data streams in a way that is both predictable and debuggable. You can do this one format at a time, writing your own lexical analyzer for each special data file (or worse, cloning one), or you can take a step back and let a lexical analyzer generator manage the work for you.

The mother of all lexical analyzer generators is, of course, the ubiquitous Unix tool Lex. Dating from the early 1970s, it is perhaps one of the oldest compiler tools still in use. (The best source of Lex documentation is still the 1975 paper "Lex—A Lexical Analyzer Generator" by M.E. Lesk and E. Schmidt.) Its modern successor is Flex, the fast lexical analyzer generator.

Getting Flex

Flex is available for almost any platform you can imagine. However, because Flex is a tool strictly for programmers, the releases are very much do-it-yourself. If you are using Linux, BSD, or almost any other POSIX-compliant Unix, you almost certainly have Flex installed and ready to go! If you are a Windows user, or you simply demand the latest and greatest, visit the SourceForge project.

Windows users will also need WinRAR, or they must endure the headache of finding tar plus gunzip or bzip2 to uncompress the release. They will also need a working install of bash to run the configure script, plus the GNU gcc compiler. An easier alternative is simply to install the Cygwin tools. Cygwin is a port of common GNU and other Linux/Unix open-source tools to the Win32 platform. It includes tar, gunzip, bash, gcc, and flex in Win32 executable binary form, so you can simply download these and get going with little fuss. You can easily reach Cygwin mirror sites around the world with its download tool (setup.exe).

Taking the path of least resistance, I developed this article with Flex 2.5.4 pre-compiled binaries from Cygwin, although I did build a new Flex 2.5.33 with the Cygwin tools just to prove it could be done.

Introduction to Tokens and Lexemes

Suppose you're not only reading data files but reading (and perhaps interpreting) a scripting-language input file, such as Perl or VB source code. Lexical analysis is the lowest-level translation activity. The purpose of a lexical analyzer, or scanner, is to convert an incoming stream of characters into an outgoing stream of tokens. The scanner operates by matching patterns of characters into lexemes. Each pattern describes what an instance of a particular token must match. For example, a common pattern for an identifier (a user-specified variable or constant name, for example) in a script language is a letter followed by zero or more letters or digits. Some lexemes that would match this pattern are index, sum, and i47.
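
In Flex's own pattern notation, that identifier rule might look like the following sketch; the names letter, digit, and id are illustrative, not anything Flex predefines:

letter      [a-zA-Z]
digit       [0-9]
id          {letter}({letter}|{digit})*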

Things that your input stream defines as useless, such as white space and comments, are not lexemes and can be safely discarded by the scanner. Several classes of tokens are found in the definitions of most script languages. Table 1 lists some typical ones.

Table 1: Typical Tokens

Keywords             Reserved words (such as procedure and return) that cannot be redefined
Operators            Typically short strings (1–3 characters), such as /, >=, and >>=, used in expressions
Identifiers          User-specified objects, similar to keywords in form
Numbers              Integer, real, or double-precision values, as specified
Character constants  Single characters, such as c or \0
Character strings    Zero or more characters, stored differently than character constants
EOLN and EOF         Logical end-of-line and end-of-input markers

The following benefits explain why you would bother with this preliminary lexical analysis:

  1. It modularizes the design of the program, thereby decreasing the burden of software maintenance.
  2. It simplifies the grammar later by eliminating whitespace issues early.
  3. It can improve efficiency by using buffering techniques to reduce the I/O overhead.
  4. A separate lexical analyzer improves portability by isolating the portions that deal with actual input characters.

A scanner may also perform other functions while it is doing the work of tokenizing, such as producing source listings, reporting syntax errors, keeping track of source line numbers, and most often building the symbol table as it goes.

How Flex Works

Flex is a program generator: given pattern specifications for its input, it produces source code for recognizing those patterns. The specifications allow an action to be associated with each input pattern. A Flex-produced DFA (deterministic finite automaton) performs the recognition of the regular expressions. Flex deals effectively with ambiguous expressions by always choosing the longest match in the input stream; when two rules match strings of equal length, the rule listed first wins.
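
A small sketch shows how this plays out; IF and ID here are hypothetical token codes that you, not Flex, would define:

"if"            { return IF; }   /* exact keyword */
[a-z][a-z0-9]*  { return ID; }   /* general identifier */

Given the input ifdef, both patterns match a prefix, but the identifier pattern matches five characters against the keyword's two, so the longest-match rule reports ID. Given exactly if, the match lengths tie and Flex chooses the rule listed first, reporting IF.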

Lex transforms the user's table of regular expressions and actions into a function called yylex(). The yylex() function, when incorporated into your host-language program, performs each action as its associated pattern is recognized. The original Lex could produce its output as C or Ratfor (a FORTRAN preprocessor dialect) source code; Flex produces C, or C++ on request. In any case, the yylex() function builds on the highly efficient string-matching techniques of Aho and Corasick (Communications of the ACM 18(6), 1975).

The yylex() function produced by Flex generally requires time proportional to the length of the input stream: it is linear in the input and independent of the number of rules. As the number and complexity of the rules increase, yylex() tends to grow only in table size, not in running time. Speed does drop, however, when the rules require extensive forward scanning of the input, because the scanner must read ahead and then back up.
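
Trailing context is the classic example. A rule such as this sketch forces the scanner to read past the end of the lexeme and then rewind:

[0-9]+/","      { /* an integer, but only when it is followed by a comma */ }

The comma must be scanned to confirm the match, yet it is returned to the input so that a later rule can consume it; that extra scanning and rewinding is what costs time.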

Flex “wc”: A Document-Analysis Tool

The format of a Lex source file is as follows, where the items in braces ({}) are optional:

{definitions}
%%
{rules}
%%
{user subroutines}

The second %% delimiter may be omitted, but the first is required to mark the start of the rules. Experience shows that user subroutines are usually best implemented elsewhere, though it is nice to have a place for them locally when required. The rules form a table with regular expressions in the left-hand column and semantic actions to the right.

The demo project for this article implements the popular Unix wc utility, which counts words, lines, and characters in an input file. It starts with a file called wc.l, as shown below:

 1 %{
 2 // wc.l -- a simple word counting program
 3
 4 #include <stdio.h>
 5 #include <string.h>
 6
 7 static int chars, words, lines;
 8
 9 %}
10
11 %option noyywrap
12
13 alpha       [a-zA-Z]
14 word        {alpha}+
15
16 %%
17
18 {word}      { chars += strlen(yytext); ++words; }
19 \n          { ++chars; ++lines; }
20 .           { ++chars; }
21
22 %%
23
24 int main()
25 {
26    chars = words = lines = 0;
27    yylex();
28    printf("\t%d\t%d\t%d\n",lines,words,chars);
30    return 0;
30 }

The commands to build the program with Microsoft Visual Studio are:

flex wc.l
cl lex.yy.c /link /OUT:wc.exe
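
If you are using gcc under Cygwin or Linux instead, the equivalent build would be as follows (no -lfl library is needed here because wc.l supplies both %option noyywrap and its own main()):

flex wc.l
gcc -o wc lex.yy.c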

Now, test the program on the Constitution of the United States. Remember to use input redirection ("<") because the program reads from the stdin stream by default:

wc <constitution.txt
   574   (lines)       4592   (words)       27999 (chars)

You can see in the definitions section (lines 11-14) that you create named patterns that you can reference later in the rules section. You can imagine how simple it would be to add a few more definitions for things like float, identifier, string, and so forth. In the rules section (lines 18-20), you have a regular expression on the left and an action on the right. The scanner-provided variable yytext contains the matching lexeme found by the expression on the left; thus, you can call strlen() on it, as on line 18.

Simpler rules, such as the one on line 19, match fixed text and so implicitly know what the lexeme is (in this case, exactly one newline). The last rule, ".", matches any character other than a newline (newlines having already been handled by the previous rule), so every character is counted. If you wanted to return a token to the calling program after a particular lexeme was matched, you could use a return statement; for example, return FOUND_WORD; could indicate a word match, as the sketch below shows.
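
Here is a minimal sketch of that idea, assuming a hypothetical FOUND_WORD token code that you define yourself (Flex supplies no token codes of its own):

%{
/* tokens.l -- sketch: emit a token code for each word found */
#include <stdio.h>
enum { FOUND_WORD = 256 };   /* start above 255 so the code never
                                collides with a single-character token */
%}
%option noyywrap
alpha   [a-zA-Z]
word    {alpha}+
%%
{word}      { return FOUND_WORD; }
.|\n        { /* discard everything else */ }
%%
int main(void)
{
   int tok, nwords = 0;
   while ((tok = yylex()) != 0)     /* yylex() returns 0 at end of input */
      if (tok == FOUND_WORD)
         ++nwords;
   printf("%d words\n", nwords);
   return 0;
}

A parser generated by yacc or GNU Bison drives yylex() in exactly this pull fashion, calling it once for each token it needs.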

Now, you might argue that this demo doesn't do very much but, depending on how you count, it took perhaps five lines of actual "source" code. It is also a platform on which you can easily imagine building a concordance (a list of words sorted by number of uses), a readability ("fog") index, a grammar checker, a spell checker, and so on. The demo stops there simply because I want to show just enough of Flex's power to get you started. The rest is up to you!

Still Relevant After All These Years

Lex remains one of the top ten tools in the compiler writer's toolbox despite being more than 30 years old. Programmers starting out today would do well to remember how the power of regular expressions, normally confined to scripting languages such as Perl, can be leveraged with Flex in a way that is readable, maintainable, and, best of all, debuggable. Would you rather look at a small table of regular expressions or an endless gnarly mess of "if/else" statements that poorly simulate an expression?

Book Recommendation

The UNIX Programming Environment by Brian W. Kernighan and Rob Pike (ISBN 013937681X) provides a great tutorial for both Lex and yacc. It develops a desk calculator ("hoc") through several iterations so you can fully understand the complexities of developing a sophisticated application in the Unix environment. A must-read for anyone who is interested in interoperability as well.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisory board member for the Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest, and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.
