Flex Your Lexical Analysis Muscles

What programming job is more common than reading a data file, doing some manipulations, and writing it out again? Probably none! In fact, a former manager of mine who had 30 years' experience in what used to be called data processing opined that all programs were but a slight variation on that theme. Perhaps more often today we read data files and write XML, but the core need essentially is the same: to accurately act on and react to incoming character data streams in a way that is both predictable and debuggable. You can do this one at a time, writing your own lexical analyzers for each special data file (or worse, cloning them), or you can take a step back and manage your work from a lexical analyzer generator.

The mother of all lexical analyzer generators is, of course, the ubiquitous Unix tool Lex. Dating from the early 1970s, it is perhaps one of the oldest compiler tools still in use. (The best source of Lex documentation is still the 1975 paper Lex—A Lexical Analyzer Generator by M.E. Lesk and E. Schmidt.) The modern successor to Lex is GNU Flex: the Fast Lexical analyzer generator.

Getting Flex

Flex is available for almost any platform you can imagine. However, because Flex is a tool for programmers only, the releases are very much do-it-yourself. If you are using Linux, BSD, or any POSIX-compliant version of Unix, you almost certainly have Flex installed and ready to go! If you are a Windows user or you simply demand the latest and greatest, visit the SourceForge project.

Windows users also will need WinRAR, or they must go through the headaches of finding tar and gunzip or bunzip2 to unpack the source archive. They also will need a working install of bash to run the configure script, plus the GNU gcc compiler. An easier alternative would be simply installing the Cygwin tools. Cygwin is a port of common GNU and other Linux/Unix open source tools for the Win32 platform. It includes tar, gunzip, bash, gcc, and flex in Win32 executable binary form, so you can simply download these and get going with little fuss. You easily can access Cygwin mirror sites around the world with their download tool (setup.exe).

Taking the path of least resistance, I developed this article with Flex 2.5.4 pre-compiled binaries from Cygwin, although I did build a new Flex 2.5.33 with the Cygwin tools just to prove it could be done.

Introduction to Tokens and Lexemes

Suppose you're not only reading data files but reading (and perhaps interpreting) a scripting language input file, such as Perl or VB source code. Lexical analysis is the lowest-level translation activity. The purpose of a lexical analyzer, or scanner, is to convert an incoming stream of characters into an outgoing stream of tokens. The scanner operates by matching patterns of characters into lexemes. Each pattern describes what an instance of a particular token must match. For example, a common pattern for an identifier (such as a user-specified variable or constant) in a script language is a letter followed by zero or more occurrences of a letter or digit. Some lexemes that would match this pattern are index, sum, and i47.
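As a rough illustration of the identifier pattern just described, here is a minimal hand-written matcher in C. The function name identifier_length is hypothetical, and the sketch assumes plain ASCII input and treats a single letter as a valid identifier. It only shows what the pattern means; it is not how Flex works internally.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the identifier lexeme at the start of s,
 * or 0 if s does not begin with an identifier.
 * Pattern assumed for this sketch: [A-Za-z][A-Za-z0-9]* */
size_t identifier_length(const char *s)
{
    size_t n = 0;

    /* An identifier must start with a letter. */
    if (!isalpha((unsigned char)s[0]))
        return 0;

    /* Consume the leading letter and any letters or digits after it. */
    while (isalnum((unsigned char)s[n]))
        ++n;

    return n;
}
```

Given the text "index = 3", this reports a lexeme of length 5 (index); given "47i", it reports 0 because the first character is not a letter.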

Things that your input stream defines as useless, such as white space and comments, are not lexemes and can be safely discarded by the scanner. Several classes of tokens are found in the definitions of most script languages. Table 1 lists some typical ones.

Table 1: Typical Tokens

Token                Description
Keywords             Reserved words (such as procedure and return) that cannot be redefined
Operators            Typically strings (1–3 characters) such as /, >=, and >>= used in expressions
Identifiers          User-specified objects similar to keywords in form
Numbers              Integer, real, or double-precision as specified
Character constants  Single characters such as 'c' or '\0'
Character strings    Zero or more characters stored differently than character constants
EOLN and EOF         Logical end-of-line and end-of-input markers
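To make these token classes concrete, the following is a small hand-written scanner sketch in C. Every name in it (token_kind, struct token, next_token) is hypothetical, the keyword list is just the two examples from the table, numbers are limited to unsigned integers, and any other character is treated as a one-character operator. Character constants, strings, and comments are left out to keep it short.

```c
#include <ctype.h>
#include <string.h>

/* Token classes loosely following Table 1; the names are illustrative. */
enum token_kind { TOK_EOF, TOK_EOLN, TOK_KEYWORD, TOK_IDENTIFIER,
                  TOK_NUMBER, TOK_OPERATOR };

struct token {
    enum token_kind kind;
    const char *lexeme;   /* points into the input buffer */
    size_t length;        /* number of characters in the lexeme */
};

/* Returns 1 if the n characters at s spell a reserved word. */
static int is_keyword(const char *s, size_t n)
{
    static const char *const keywords[] = { "procedure", "return" };
    size_t i;

    for (i = 0; i < sizeof keywords / sizeof keywords[0]; ++i)
        if (strlen(keywords[i]) == n && strncmp(s, keywords[i], n) == 0)
            return 1;
    return 0;
}

/* Scans one token starting at *cursor and advances *cursor past it. */
struct token next_token(const char **cursor)
{
    const char *p = *cursor;
    struct token t;

    while (*p == ' ' || *p == '\t')   /* blanks are discarded, not tokens */
        ++p;

    t.lexeme = p;
    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (*p == '\n') {
        t.kind = TOK_EOLN;
        ++p;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            ++p;
        t.kind = is_keyword(t.lexeme, (size_t)(p - t.lexeme))
                     ? TOK_KEYWORD : TOK_IDENTIFIER;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            ++p;
        t.kind = TOK_NUMBER;
    } else {
        t.kind = TOK_OPERATOR;        /* any other single character */
        ++p;
    }

    t.length = (size_t)(p - t.lexeme);
    *cursor = p;
    return t;
}
```

Calling next_token() repeatedly on "return sum1 + 42\n" yields a keyword, an identifier, an operator, a number, an end-of-line token, and finally the end-of-input token. Note how the keyword check happens after the identifier pattern matches, which reflects the table's remark that identifiers are similar to keywords in form.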

The following benefits explain why you would bother with this preliminary lexical analysis:

  1. It modularizes the design of the program, thereby decreasing the burden of software maintenance.
  2. It simplifies the grammar later by eliminating whitespace issues early.
  3. It can improve efficiency by using buffering techniques to reduce the I/O overhead.
  4. A separate lexical analyzer improves portability by isolating the portions that deal with actual input characters.

A scanner may also perform other functions while it is doing the work of tokenizing, such as producing source listings, reporting syntax errors, keeping track of source line numbers, and most often building the symbol table as it goes.

How Flex Works

Flex is a program generator: given a set of pattern specifications as input, it produces source code that recognizes the corresponding regular expressions. The specifications allow an action to be associated with each input pattern. A Flex-produced DFA (deterministic finite automaton) performs the recognition. Flex deals effectively with ambiguous specifications by always choosing the longest matching string in the input stream; when two rules match the same length, the rule listed first wins.
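The longest-match rule is easy to see with operators that share a prefix. The sketch below is a hypothetical C helper (longest_operator and its operator list are assumptions made for illustration) that tries each candidate in turn and keeps the longest one. A generated scanner reaches the same answer in a single pass through its DFA rather than by trying every pattern, so treat this only as a statement of the rule.

```c
#include <string.h>

/* Returns the length of the longest operator that is a prefix of s,
 * or 0 if no operator matches. The operator list is illustrative. */
size_t longest_operator(const char *s)
{
    static const char *const operators[] = { ">", ">=", ">>", ">>=", "/", "=" };
    size_t best = 0;
    size_t i;

    for (i = 0; i < sizeof operators / sizeof operators[0]; ++i) {
        size_t n = strlen(operators[i]);

        /* Keep this candidate only if it matches and beats the best so far. */
        if (n > best && strncmp(s, operators[i], n) == 0)
            best = n;
    }
    return best;
}
```

With the input ">>= 1", the candidates ">", ">>", and ">>=" all match, and the three-character operator is chosen; with ">= x", the answer is two characters.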

Flex transforms the user's input table of regular expressions and actions into a function called yylex(). The yylex() function, when incorporated into your host-language program, performs each action as the associated pattern is recognized. Flex produces its output as C source code by default and can optionally generate a C++ scanner class instead. In either case, the yylex() function incorporates the highly efficient string matching routines of Aho and Corasick (Communications of the ACM, Vol. 18, 1975).

The yylex() function produced by Lex generally requires time proportional to the length of the input stream; that is, it is linear in the input and essentially independent of the number of rules. As the number and complexity of rules increase, yylex() tends to grow only in size. Speed suffers only when the rules require extensive forward scanning of the input.
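The linear-time claim follows from the table-driven design: each input character costs one table lookup, no matter how many rules were compiled into the table. The following C sketch shows the idea with a tiny hand-built DFA for unsigned numbers (digits, optionally followed by a dot and more digits). The table, the state numbering, and the name match_number are assumptions made for this example and are not the tables Flex actually emits.

```c
#include <stddef.h>

enum { CLASS_DIGIT, CLASS_DOT, CLASS_OTHER, CLASS_COUNT };
enum { STATE_COUNT = 4, DEAD = -1 };

/* transition[state][class]
 * States: 0 start, 1 integer digits, 2 just saw '.', 3 fraction digits. */
static const int transition[STATE_COUNT][CLASS_COUNT] = {
    /*        digit  dot   other */
    /* 0 */ { 1,     DEAD, DEAD },
    /* 1 */ { 1,     2,    DEAD },
    /* 2 */ { 3,     DEAD, DEAD },
    /* 3 */ { 3,     DEAD, DEAD },
};

/* States 1 and 3 mean "a complete number has been seen so far". */
static const int accepting[STATE_COUNT] = { 0, 1, 0, 1 };

static int char_class(char c)
{
    if (c >= '0' && c <= '9')
        return CLASS_DIGIT;
    if (c == '.')
        return CLASS_DOT;
    return CLASS_OTHER;
}

/* Returns the length of the longest number at the start of s, or 0. */
size_t match_number(const char *s)
{
    int state = 0;
    size_t i;
    size_t last_accept = 0;

    for (i = 0; s[i] != '\0'; ++i) {
        state = transition[state][char_class(s[i])];  /* one lookup per character */
        if (state == DEAD)
            break;
        if (accepting[state])
            last_accept = i + 1;   /* remember the longest match so far */
    }
    return last_accept;
}
```

On "3.14x" the function returns 4. On "7." it returns 1: the scanner reads the dot, finds no digit after it, and falls back to the last accepting position. That fallback is a small instance of the forward scanning that can cost extra time.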

Flex "wc": A Document-Analysis Tool

The format of a Lex source file is as follows, where items in braces ({}) are optional:

{definitions}
%%
{rules}
%%
{user subroutines}

The second %% delimiter may be omitted, but the first is necessary to mark the start of rules. Experience shows that the user subroutines are usually best implemented elsewhere, though it is nice to have a provision for it locally if required. The rules form a table with regular expressions in the left-hand column and semantic actions to the right.

The demo project for this article implements the popular Unix wc utility, which counts words, lines, and characters in an input file. It starts with a file called wc.l, as shown below:

 1 %{
 2 // wc.l -- a simple word counting program
 3
 4 #include <stdio.h>
 5 #include <string.h>
 6
 7 static int chars, words, lines;
 8
 9 %}
10
11 %option noyywrap
12
13 alpha       [a-zA-Z]
14 word        {alpha}+
15
16 %%
17
18 {word}      { chars += strlen(yytext); ++words; }
19 \n          { ++chars; ++lines; }
20 .           { ++chars; }
21
22 %%
23
24 int main(void)
25 {
26    chars = words = lines = 0;
27    yylex();
28    printf("\t%d\t%d\t%d\n", lines, words, chars);
29    return 0;
30 }

The commands to build the program with Microsoft Visual Studio are:

flex wc.l
cl lex.yy.c /link /OUT:wc.exe

Now, test the program on the Constitution of the United States. Remember to use the input redirection operator "<", because the program reads from the stdin stream by default:

wc <constitution.txt
   574   (lines)       4592   (words)       27999 (chars)

You can see in the definitions section (lines 11-16) that you create regular expressions that you can use later in the rules section. You can imagine how simple it would be to add a few more definitions for things like float, identifier, string, and so forth. In the rules section (lines 17-21), you have a regular expression on the left and an action on the right. The system variable yytext contains the matching lexeme that was found by the expression on the left; thus, you can use strlen() on it on line 18.

Some more trivial rules, such as the one on line 19, implicitly know what the text is (in this case, it is exactly one newline). The last rule, ".", matches any single character except a newline, so together with the newline rule it gives an accurate character count. If you wanted to return a token to the main program after a particular lexeme was matched, you could use a return statement. For example, return FOUND_WORD; could indicate a word match, and so forth.
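For comparison, here is roughly what the three rules in wc.l compute when written out by hand in C. The names struct counts and count_text are hypothetical, and the sketch works on an in-memory string rather than stdin. It is meant only to show the behavior of the rules, including the fact that a word here is strictly a run of ASCII letters.

```c
#include <stddef.h>

struct counts {
    long chars;
    long words;
    long lines;
};

/* Mirrors the {alpha} definition: [a-zA-Z] only. */
static int is_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Counts characters, words, and lines the way the wc.l rules do. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0, 0 };
    size_t i = 0;

    while (text[i] != '\0') {
        if (is_alpha(text[i])) {
            /* {word} rule: consume the whole run of letters as one lexeme. */
            while (is_alpha(text[i])) {
                ++c.chars;
                ++i;
            }
            ++c.words;
        } else {
            /* Newline rule and "." rule: each consumes exactly one character. */
            if (text[i] == '\n')
                ++c.lines;
            ++c.chars;
            ++i;
        }
    }
    return c;
}
```

For the text "We the People\nof the United States\n", this gives 2 lines, 7 words, and 35 characters. Because only letters form words, "don't" counts as two words and "i47" as one, which is also how the Flex version behaves.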

Now, you might argue that this demo doesn't do very much, but depending on how you look at it, it has perhaps fewer than five lines of "source" code. It is also a platform on which you can easily imagine building a concordance (a list of words sorted by number of usages), a readability ("fog") index, a grammar checker, a spell checker, and so on. The demo does none of that simply because I only want to show the power of Flex and get you started. The rest is up to you!

Still Relevant After All These Years

Lex remains one of the top ten tools in the compiler writer's toolbox despite being over 30 years old. Programmers starting out today would do well to keep in mind how the power of regular expressions, which is normally confined to scripting languages such as Perl, can be leveraged with Flex in a way that is readable, maintainable, and, best of all, debuggable. Would you rather look at a small table of regular expressions or an endless gnarly mess of "if/else" statements that poorly simulate an expression?

Book Recommendation

The UNIX Programming Environment by Brian W. Kernighan and Rob Pike (ISBN 013937681X) provides a great tutorial for both Lex and yacc. It develops an RPN desk calculator ("hoc") through several iterations so you can fully understand the complexities of developing a sophisticated application in the Unix environment. A must-read for anyone who is interested in interoperability as well.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisory board member for the Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest, and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.



