Lexical Analyser

CodeGuru content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

In my first year university programming project one of my
missions is to write a parser. (Just in case you do not know what
a parser is: a parser reads user input that has a complicated
structure- such as a mathematical expression or even a
programming language- and make sense of it for the computer. So,
your C++ compiler in its most basic form, is a parser) Parsing is
mainly about advanced problems such as matching up parentheses
and building expression tress, but before it can be done, there
are some mundane things to take care of: skipping over spaces,
recognizing that a sequence of characters such as 0.13 and +
represents a real number and a plus operator respectively, and so
on. These mundane tasks are usually handled separately from the
parser proper, in a lexical analyser.

For example, a lexical analyser reads a sequence of
characters, such as (amount-5000.0)*.13 and reports what
it finds as a sequence of tokens. So, for that example, the
lexical analyser reading the characters would report this
sequence of 7 tokens:

left parenthesis
identifier ("amount")
minus sign
real number (5000.0)
right parenthesis
multiplication sign
real number (0.13)

This is what my CLexicalAnalyser class do. Below, is
the documentation of my code (actual
code in red)
:

Enumeration
definition

enum {nilT, idT, intT, realT, stringT, errorT, endlineT, endfileT};

This enumeration type defines the
token type (i.e. what type of token that particualr token is).
Here is what the token means:

nilT		Do not worry about this. It is used internally by the CLexicalAnalyser class
idT		This token represents an identifier. An identifier is a sequence of characters
		that can represent variable names, class names, function names and so on. In
		CLexicalAnalyser, identifiers are case-sensitive, first character must be an
		alphabet (and the rest can alphabets, digits and underscore- '_' ).
intT		This token represents an integer. The difference integers and real numbers are
		the decimal point character that exist in real numbers ('.').
realT		This token represents a real number. Real number can start with a decimal point.
stringT		This token represents a string. A string is any sequence of characters enclosed
		in quotation marks (' " ').
errorT		This token represent an unrecognised token in the lexical analysis. This can
		happen for example, if the lexical analyser ecounters a ~!@#$ sequence which
		cannot be represented by any token (a token can be user-defined type or a
		pre-defined type as described in this section).
endlineT	This token represent an end of line ("rn"). MFC's multiline edit field uses
		"rn" to denote a new line instead of "n". Therefore this lexical analyser
		uses the former to define a new line.
endfileT	This token denotes that there are no more tokens.

It will be explained later how to
create your own user-defined token type (for e.g. for operators
and special keywords for your new exotic programming language and
so on).

CToken class
This class represent a
token. The important functions in this token are:

int getTokenType()

This function returns an enumeration value (as described above) of the token type. For example,
this function may return idT, intT, and so on.

CString getTokenName()

This function returns the name of the token for idT and stringT tokens. For example, if the
CToken class represent a string called "Big", getTokenType() will return stringT and 
getTokenName() will return "Big".

int getTokenInt()

This function returns the integer value of the intT token. For example, if the CToken class
represent an integer whose value is 5000, getTokenType() will return intT and getTokenInt() will
return 5000.

double getTokenReal()

This function returns the real value of the realT token. For example, if the CToken class
represent a real number whose value is 3.14, getTokenType() will return intT and getTokenInt()
will return 3.14;

CSymbol class
This class represents a
symbol. This class provides a link between the string of a token
and its enumeration number. For example, you will use
addOperatorT
token type to represent the "+" string. The only
important function in this class is one of the constructors:

CSymbol( CString string, int type);

This constructor is the only function which you will need to use. An example on how to use this
will be shown later.

bool operator < (CSymbol &Symbol);
bool operator <= (CSymbol &Symbol);
bool operator > (CSymbol &Symbol);
bool operator >= (CSymbol &Symbol);

Do not worry about these overloaded operator functions. They are used internally by the lexical
analyser to sort the symbols by length.

CLexicalAnalyser class
This class is the utmost
important one. It is the lexical analyser. In the lexical
analyser, the sequence of tokens are stored in a
CList.

void SetCurrentPosition(POSITION
pos);
POSITION GetCurrentPosition();

This 2 functions is used to set and
retrieve the current token in the
CList
of token sequence.

void ClearAllTokens();

This function clears the CList
of all the tokens.

void ResetPosition();

This function reset the pointer to
the current token to the first token.

bool IsAllAplhaOrDigit(CString s,int
begin, int length);

This function is used internally by CLexicalAnalyser.

BOOL IsSequenceEmpty();

This function returns true if there
are no tokens in the lexical analyser. Otherwise, it returns
false.

void AddToken(CToken token);

This function is used internally by
the lexical analyser.

CToken NextToken();

This function moves the pointer to
the current token to the next token. It returns the
CToken
object of that next token.

CToken GetCurrentToken();

This function returns the CToken
object of the current token.

void String2TokenSequence(CString
sequence);

This is the most important function
of the class. As its name implies, this function takes in the
string input and convert them into a
CList
of token sequence.

void setSymbol(CSymbol Symbol);

Just now, I mentioned how to create
your own user-defined token type. This is the function that
enables you to do this. To show you how to create your own
user-defined token, the following example is used:

CLexicalAnalyser lex;
lex.setSymbol(CSymbol("+",addOperatorT));
lex.setSymbol(CSymbol("<=",lessThanOrEqualOperatorT));
lex.setSymbol(CSymbol("<",lessThanOperatorT));
lex.setSymbol(CSymbol("goto",gotoKeywordT));

The addOperatorT,
lessThanOrEqualOperatorT, lessThanOperatorT
and gotoKeywordT
are just constant values. Please note that the constant values
for your user-defined token types must
NOT
clash with the enumeration values of the existing pre-defined
token types as mentioned in
enumeration
definition
.

This example shows that it is
acceptable for one operator to begin with the same sequence of
characters as another operator.
CLexicalAnalyser
will favour the longest possible operator: it will find <= if
it is there, or else < if it is there. Furthermore, operators
may be juxtaposed with identifiers, keywords, integers and reals,
but 2 identifiers may not be juxtaposed with each other.

Other classes
There are other classes (CExpression,
CIntLiteralExpression, CRealIntLiteralExpression)
and
enumeration in this code that is not related to lexical analysis.
These non-related classes is used in my programming project to
implement a type of parsing technique called recursive descent
parsing.

Example
To illustrate further on
how to use a lexical analyser, I will provide with a very simple
example. Suppose that you're creating a extremely simple
reverse-polish-notation (RPN) calculator language which can only
perform addition (By the way, HP calculators use RPN notation for
their calculators). Let's suppose that the user typed in the code
of the calculator language in a multi-line edit field, and that
those code are stored in a
CString
object called
inputStr. Also, let's
suppose that this is what the user typed:

3.14	 4 +
VariableX VariableY +
"Hello world!" Print

The first step is to define
constants for your user defined token types. In this example, the
only new token type you must teach the lexical analyser to
recognise is the '+' operator. So, the definition of the constant
goes like this:

const addOperatorT=endfileT+1;
//Note that this is done to ensure no clashing of values.

The next step is to 'teach' the
lexical analyser how to recognise the '+' operator. Suppose that

lex
is an object of type CLexicalAnalyser.
The code goes like that:

lex.setSymbol(CSymbol("+",addOperatorT));

Next, you must give the CString
object to the lexical analyser so that it can
create the
CList of tokens.

lex.String2TokenSequence(inputStr);

Now, the lexical analysis phase has
finished and parsing can begin! To parse this simple RPN
calculator language, the code
MAY
go like this:

while(lex.GetCurrentToken().getTokenType()!=endfileT) //While not end of token sequence
{
	//While current token is newlines...
	while(lex.GetCurrentToken().getTokenType()==endlineT)
		lex.NextToken();//...go to the next token

	if(lex.GetCurrentToken().getTokenType()==intT)
	{
		doubleStack.Push((double)lex.GetCurrentToken().getTokenInt());
	}
	if(lex.GetCurrentToken().getTokenType()==addOperatorT)//Notice the user of user-defined
	{						      //token type here
		doubleStack.Push(doubleStack.Pop()+ doubleStack.Pop());
	}
	else if(lex.GetCurrentToken().getTokenType()==realT)
	{
		doubleStack.Push(lex.GetCurrentToken().getTokenReal());
	}
	else if(lex.GetCurrentToken().getTokenType()==stringT)
	{
		stringStack.Push(lex.GetCurrentToken().getTokenName());
	}
	else if(lex.GetCurrentToken().getTokenType()==idT &&
		lex.GetCurrentToken.getTokenName() == "Print")
	{
		PrintMessage(stringStack.Pop());
	}
	else if(lex.GetCurrentToken().getTokenType()==idT)
	{
		doubleStack.Push(GetVariableValue(lex.GetCurrentToken().getTokenName()));
	}
	else
	{
		PrintMessage("Syntax error!");
	}
	lex.NextToken();
}

Parsing

The CLexicalAnalyser
class is only in charge of lexical analysis. Lexical analysis is
only a half the story. The other half of the story is parsing (or
more formally called language processing) which in itself is a
very complex field in computer science. I chose a RPN language as
an example because parsing an RPN language is the easiest. To
write code to parse complex languages like C++ is really beyond
the scope of this document. However, at the very least, the
CLexicalAnalyserclass gets you started with the ‘meat’ of parsing
without having to worry over the nitty-gritty details of string
handling- the lexical analyser already analyses the string for
you.

Download source -LexicalAnalyser.zip 4 KB

Last Updated: December 5, 1998

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read