Lexical Analyser

In my first year university programming project one of my missions is to write a parser. (Just in case you do not know what a parser is: a parser reads user input that has a complicated structure- such as a mathematical expression or even a programming language- and make sense of it for the computer. So, your C++ compiler in its most basic form, is a parser) Parsing is mainly about advanced problems such as matching up parentheses and building expression tress, but before it can be done, there are some mundane things to take care of: skipping over spaces, recognizing that a sequence of characters such as 0.13 and + represents a real number and a plus operator respectively, and so on. These mundane tasks are usually handled separately from the parser proper, in a lexical analyser.

For example, a lexical analyser reads a sequence of characters, such as (amount-5000.0)*.13 and reports what it finds as a sequence of tokens. So, for that example, the lexical analyser reading the characters would report this sequence of 7 tokens:

left parenthesis
identifier ("amount")
minus sign
real number (5000.0)
right parenthesis
multiplication sign
real number (0.13)

This is what my CLexicalAnalyser class do. Below, is the documentation of my code (actual code in red):

Enumeration definition

enum {nilT, idT, intT, realT, stringT, errorT, endlineT, endfileT};

This enumeration type defines the token type (i.e. what type of token that particualr token is). Here is what the token means:

nilT		Do not worry about this. It is used internally by the CLexicalAnalyser class
idT		This token represents an identifier. An identifier is a sequence of characters
		that can represent variable names, class names, function names and so on. In 
		CLexicalAnalyser, identifiers are case-sensitive, first character must be an 
		alphabet (and the rest can alphabets, digits and underscore- '_' ).
intT		This token represents an integer. The difference integers and real numbers are
		the decimal point character that exist in real numbers ('.').
realT		This token represents a real number. Real number can start with a decimal point.
stringT		This token represents a string. A string is any sequence of characters enclosed 
		in quotation marks (' " ').
errorT		This token represent an unrecognised token in the lexical analysis. This can 
		happen for example, if the lexical analyser ecounters a ~!@#$ sequence which
		cannot be represented by any token (a token can be user-defined type or a 
		pre-defined type as described in this section).
endlineT	This token represent an end of line ("\r\n"). MFC's multiline edit field uses
		"\r\n" to denote a new line instead of "\n". Therefore this lexical analyser
		uses the former to define a new line.
endfileT	This token denotes that there are no more tokens.

It will be explained later how to create your own user-defined token type (for e.g. for operators and special keywords for your new exotic programming language and so on).

CToken class
This class represent a token. The important functions in this token are:

int getTokenType()
This function returns an enumeration value (as described above) of the token type. For example,
this function may return idT, intT, and so on.
CString getTokenName()
This function returns the name of the token for idT and stringT tokens. For example, if the 
CToken class represent a string called "Big", getTokenType() will return stringT and 
getTokenName() will return "Big".
int getTokenInt()
This function returns the integer value of the intT token. For example, if the CToken class
represent an integer whose value is 5000, getTokenType() will return intT and getTokenInt() will
return 5000.
double getTokenReal()
This function returns the real value of the realT token. For example, if the CToken class
represent a real number whose value is 3.14, getTokenType() will return intT and getTokenInt()
will return 3.14;

CSymbol class
This class represents a symbol. This class provides a link between the string of a token and its enumeration number. For example, you will use addOperatorT token type to represent the "+" string. The only important function in this class is one of the constructors:

CSymbol( CString string, int type);
This constructor is the only function which you will need to use. An example on how to use this
will be shown later.
bool operator < (CSymbol &Symbol);
bool operator <= (CSymbol &Symbol);
bool operator > (CSymbol &Symbol);
bool operator >= (CSymbol &Symbol);
Do not worry about these overloaded operator functions. They are used internally by the lexical
analyser to sort the symbols by length.

CLexicalAnalyser class
This class is the utmost important one. It is the lexical analyser. In the lexical analyser, the sequence of tokens are stored in a CList.

void SetCurrentPosition(POSITION pos);
POSITION GetCurrentPosition();

This 2 functions is used to set and retrieve the current token in the CList of token sequence.

void ClearAllTokens();

This function clears the CList of all the tokens.

void ResetPosition();

This function reset the pointer to the current token to the first token.

bool IsAllAplhaOrDigit(CString s,int begin, int length);

This function is used internally by CLexicalAnalyser.

BOOL IsSequenceEmpty();

This function returns true if there are no tokens in the lexical analyser. Otherwise, it returns false.

void AddToken(CToken token);

This function is used internally by the lexical analyser.

CToken NextToken();

This function moves the pointer to the current token to the next token. It returns the CToken object of that next token.

CToken GetCurrentToken();

This function returns the CToken object of the current token.

void String2TokenSequence(CString sequence);

This is the most important function of the class. As its name implies, this function takes in the string input and convert them into a CList of token sequence.

void setSymbol(CSymbol Symbol);

Just now, I mentioned how to create your own user-defined token type. This is the function that enables you to do this. To show you how to create your own user-defined token, the following example is used:

CLexicalAnalyser lex;
lex.setSymbol(CSymbol("+",addOperatorT));
lex.setSymbol(CSymbol("<=",lessThanOrEqualOperatorT));
lex.setSymbol(CSymbol("<",lessThanOperatorT));
lex.setSymbol(CSymbol("goto",gotoKeywordT));

The addOperatorT, lessThanOrEqualOperatorT, lessThanOperatorT and gotoKeywordT are just constant values. Please note that the constant values for your user-defined token types must NOT clash with the enumeration values of the existing pre-defined token types as mentioned in enumeration definition.

This example shows that it is acceptable for one operator to begin with the same sequence of characters as another operator. CLexicalAnalyser will favour the longest possible operator: it will find <= if it is there, or else < if it is there. Furthermore, operators may be juxtaposed with identifiers, keywords, integers and reals, but 2 identifiers may not be juxtaposed with each other.

Other classes
There are other classes (CExpression, CIntLiteralExpression, CRealIntLiteralExpression) and enumeration in this code that is not related to lexical analysis. These non-related classes is used in my programming project to implement a type of parsing technique called recursive descent parsing.

Example
To illustrate further on how to use a lexical analyser, I will provide with a very simple example. Suppose that you're creating a extremely simple reverse-polish-notation (RPN) calculator language which can only perform addition (By the way, HP calculators use RPN notation for their calculators). Let's suppose that the user typed in the code of the calculator language in a multi-line edit field, and that those code are stored in a CString object called inputStr. Also, let's suppose that this is what the user typed:

3.14	 4 +
VariableX VariableY +
"Hello world!" Print

The first step is to define constants for your user defined token types. In this example, the only new token type you must teach the lexical analyser to recognise is the '+' operator. So, the definition of the constant goes like this:

const addOperatorT=endfileT+1; //Note that this is done to ensure no clashing of values.

The next step is to 'teach' the lexical analyser how to recognise the '+' operator. Suppose that lex is an object of type CLexicalAnalyser. The code goes like that:

lex.setSymbol(CSymbol("+",addOperatorT));

Next, you must give the CString object to the lexical analyser so that it can create the CList of tokens.

lex.String2TokenSequence(inputStr);

Now, the lexical analysis phase has finished and parsing can begin! To parse this simple RPN calculator language, the code MAY go like this:

while(lex.GetCurrentToken().getTokenType()!=endfileT) //While not end of token sequence
{
	//While current token is newlines...
	while(lex.GetCurrentToken().getTokenType()==endlineT)
		lex.NextToken();//...go to the next token
	if(lex.GetCurrentToken().getTokenType()==intT)
	{
		doubleStack.Push((double)lex.GetCurrentToken().getTokenInt());
	}
	if(lex.GetCurrentToken().getTokenType()==addOperatorT)//Notice the user of user-defined
	{						      //token type here
		doubleStack.Push(doubleStack.Pop()+ doubleStack.Pop());
	}
	else if(lex.GetCurrentToken().getTokenType()==realT)
	{
		doubleStack.Push(lex.GetCurrentToken().getTokenReal());
	}
	else if(lex.GetCurrentToken().getTokenType()==stringT)
	{
		stringStack.Push(lex.GetCurrentToken().getTokenName());
	}
	else if(lex.GetCurrentToken().getTokenType()==idT && 
		lex.GetCurrentToken.getTokenName() == "Print")
	{
		PrintMessage(stringStack.Pop());
	}
	else if(lex.GetCurrentToken().getTokenType()==idT)
	{
		doubleStack.Push(GetVariableValue(lex.GetCurrentToken().getTokenName()));
	}
	else
	{
		PrintMessage("Syntax error!");
	}
	lex.NextToken();
}

Parsing
The CLexicalAnalyser class is only in charge of lexical analysis. Lexical analysis is only a half the story. The other half of the story is parsing (or more formally called language processing) which in itself is a very complex field in computer science. I chose a RPN language as an example because parsing an RPN language is the easiest. To write code to parse complex languages like C++ is really beyond the scope of this document. However, at the very least, the CLexicalAnalyserclass gets you started with the 'meat' of parsing without having to worry over the nitty-gritty details of string handling- the lexical analyser already analyses the string for you.

Download source -LexicalAnalyser.zip 4 KB

Last Updated: December 5, 1998



Comments

  • Discount Oakley Radar Range wholesale online

    Posted by ovzfghtio on 07/05/2013 12:01am

    FakE OaklEyS ,Oakley sunglasses is a superb brand sunglasses using the highest quality of materials, the option of Oakley, sunglasses, Oakley sunglasses, each many styles and colors. More Hollywood star deeply butterfly love wearing Oakley sunglasses, which can be understandable, this great design to make use of their butterfly-shaped frame, are you going to examine many of the mysterious and gorgeous the DIVA Oakley, sunglasses. Oakley sunglasses has become taken up the purpose of protection, contact wrapped around repel a lot more gentle and UV protection to get more sunshine than a normal light wear, and could toss a wide range of broad-brimmed head. fake Oakley flak Jacket ,Its charm build awareness from worldwide, perfect tone, plus mixed coloring help Oakley shade shopping. Oakley is really a serious movement, a lively personality behind the target of sunglasses. Oakley complex technical and brutal character, for the advantage of sportsmen and some women. Oakley sunglasses find well suited for all age groups, colors, shapes and fashion, no matter what type of a collection of sunglasses, is personally created for you. Oakley sunglasses additional protection vision, as the lens draped over your WATS remain more moderate compared to the standard lens and UV may lead to further reluctant to implement an easy wide-brimmed sun defense. fake oakley sunglasses ,Oakley shades of protective glasses, and usually enclose or protect their eyesight surrounding areas, to stop the particles in the eye to drink water or perhaps in the type of chemicals. The pros signify that your smaller amount of squinting is the same as an inferior volume of creases. Personal security and Oakley sunscreen eye. ray ban outlet ,Custom and specialized shadow on the sun, exactly like employing a number of physical activities, in excess of a lot of the parasol expensive. However, you will discover many online Oakley sunglasses available has reached an affordable price. Brand the origin of the story gives the Oakley's rich breath of life, its culture and outlook on life was deeply moved. Sac Longchamp Pas Cher ,Three-time winner of the Tour de France champion Greg LeMond choose Eyeshades functional protection products, accurate and completely rewritten the activity sunglasses. There are several of those sunglasses, and a a few different cheap OAKLEY sunglasses, various improvements. They are about equally well. Fortunately, Oakley sunglasses have different prices to meet up with their quality, but less of a challenge than trawling round shops to get online suppliers.

    Reply
  • lex

    Posted by Legacy on 01/23/2004 12:00am

    Originally posted by: Mahesh Kumar

    very very good articel for the beginner.
    yes it helps me a to get a overview to design a graphicl compiler for my sem exam.
    Mahesh Kumar. R
    bangalore -india

    "ALL THE BEST"

    Reply
  • Parsing tools

    Posted by Legacy on 06/10/2002 12:00am

    Originally posted by: Channaveeraya

    We can use parsing tools Lex & YAcc for scanning & parsing. these tools will generate highly efficient code. I feel no need o write this using MFC.

    • lexical analyzer

      Posted by kianifarrukh on 03/09/2005 09:19am

      help me about its working

      Reply
    Reply
  • Slight modification to token names...

    Posted by Legacy on 03/16/2002 12:00am

    Originally posted by: Tom

    This seems to be just what I need.
    
    However, I noticed that you only store
    the token name if the token type is a
    string or an identifier. I wanted to
    be able to recreate the original string
    from the token list; so, I made a slight
    modification so that every token now has
    the string from which it was derived stored
    as the token name. This change only involved
    commenting out one line and adding 2 others.

    Thanks for the great lex analyzer!

    Reply
  • a problem with a zero number (0 or 0.0)

    Posted by Legacy on 11/02/2001 12:00am

    Originally posted by: Jacob Bensabat

    I used CLexicalAnalyser class for decoding strings and numbers. To my surprise I dsicovered that it was unable
    to deal with a null number 0 or 0.0. It is unable to cast
    it as a integer number or a real one.
    Any solution would really be appreciated
    jacob

    Reply
  • if String2TokenSequence("1abc") what happened?

    Posted by Legacy on 04/21/1999 12:00am

    Originally posted by: Lee Dongho

    It was a good source,but I found small fault.
    When I use your String2TokenSequence function like below.

    String2TokenSequence("1abc")

    What's the result?
    it's one integer(1) and one id(abc).
    but I think it's error so I think your code is modified like
    below.

    void CLexicalAnalyser::String2TokenSequence(CString sequence)
    {
    CToken token;

    .....

    else if(isdigit(sequence[i]) || sequence[i]=='.')
    {
    token.TokenType=intT;
    token.TokenName.Empty();
    //while the character is within bounds and is either a
    //digit or decimal point...
    while(i< sequence.GetLength() /*&& (isdigit(sequence[i]) || sequence[i]=='.')*/)
    {
    if(!isdigit(sequence[i]) && sequence[i] != '.' && sequence[i] != ' ')
    {
    token.TokenType=errorT;
    for(jj=i; jj< sequence.GetLength();jj++)
    {
    if( sequence[jj] == ' '||sequence[jj] == '\t')
    break;
    }
    break;
    }

    if(sequence[i]=='.') token.TokenType=realT;
    token.TokenName+=sequence[i];
    i++;
    }
    i--;//Need to decrement it because in the for loop, i will be incremented
    ....

    Reply
  • Does your code work with standard edition

    Posted by Legacy on 01/16/1999 12:00am

    Originally posted by: Chaplin

    Terence.

    I would like to know how could your code be modified to work with
    visual c++ 4.0(standard edition).Currently I have done the editor mostly
    guided by wizard, now I would like to use your code to read the program
    written in this editor. I am trying to develop a simple compiler using
    visual c++ 4. I developed one using PASCAL, now I am migrating from PASCAL to Visual c++. I managed to parchases visual c++ 4, which I was reliably informed that it a standard edition.

    Thanx

    M.C. Chaplin
    palapyean@usa.net
    University of Botswana
    Department of Computer Science.

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: December 11, 2014 @ 1:00 p.m. ET / 10:00 a.m. PT Market pressures to move more quickly and develop innovative applications are forcing organizations to rethink how they develop and release applications. The combination of public clouds and physical back-end infrastructures are a means to get applications out faster. However, these hybrid solutions complicate DevOps adoption, with application delivery pipelines that span across complex hybrid cloud and non-cloud environments. Check out this …

  • CentreCorp is a fully integrated and diversified property management and real estate service company, specializing in the "shopping center" segment, and is one of the premier retail service providers in North America. Company executives travel a great deal, carrying a number of traveling laptops with critical current business data, and no easy way to back up to the network outside the office. Read this case study to learn how CentreCorp implemented a suite of business continuity services that included …

Most Popular Programming Stories

More for Developers

RSS Feeds