CParser: A Simple File Parser

Environment: VC6

Introduction

When you need to parse a file and a "real" parser would be oversized for the job, this rather simple parser might be an alternative. As the two demo projects show, CParser is easy to use:

  • Construct a CParser
  • Add the tokens you want to search for
  • Reset() the parser each time before you start parsing
  • Either step through the file byte by byte and call CheckForToken(currentByte) for each byte,
  • or provide a callback for each token and call ParseFile(fileNameStr)

A token is a piece of text you search a file for; searching for tokens is what "parsing" means here. Example: Assume you have a text file that holds address information and each entry begins with "name = ...". The string "name = " would be a token you search for when scanning the file for address entries.

I created this class when I needed to read information from a file that had been generated previously by my own code. There was no need for syntax checking and the like because I could rely on the file-generating code not to produce faulty output. The files I had to scan were quite large (>80 MB), so reading the whole file into a string and parsing it with CString::Find() or similar methods was not an option.
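The byte-by-byte idea does not depend on MFC and can be sketched in a few lines of standard C++. The following is not the CParser source; TokenMatcher and CountToken are hypothetical names, and the matching logic is an assumption about how a single token could be tracked with nothing but a match counter. On a mismatch, the sketch re-tests the current byte as a possible start of the token, which is the fix suggested in a reader comment below.

```cpp
#include <cstddef>
#include <string>

// Hypothetical single-token matcher (illustration only, not CParser).
class TokenMatcher
{
public:
    explicit TokenMatcher(const std::string& token)
        : m_token(token), m_matched(0) {}

    void Reset() { m_matched = 0; }

    // Feed one byte; returns true when this byte completes the token.
    bool Feed(char c)
    {
        if (m_token.empty())
            return false;

        if (c != m_token[m_matched])
        {
            // Mismatch: drop the partial match, but still test the same
            // byte as the first character of a new match.
            m_matched = 0;
            if (c != m_token[0])
                return false;
        }

        ++m_matched;
        if (m_matched == m_token.size())
        {
            m_matched = 0;      // ready for the next occurrence
            return true;
        }
        return false;
    }

private:
    std::string m_token;
    std::size_t m_matched;      // number of token bytes matched so far
};

// Counts occurrences of token in text, looking at one byte at a time.
std::size_t CountToken(const std::string& text, const std::string& token)
{
    TokenMatcher matcher(token);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (matcher.Feed(text[i]))
            ++hits;
    }
    return hits;
}
```

Only the counter is kept between bytes, so memory use does not grow with the file size. Note that this simple restart is not a complete solution: a token whose beginning repeats itself (for example "aab" in the input "aaab") would still be missed and would need a Knuth-Morris-Pratt style fallback.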

Detailed Description

As mentioned before, the CParser class supports two different approaches for parsing a file. Both are described in detail below and there is a demo project for each. This section handles the aspects that are valid for both approaches.

In either case, you have to provide a set of token IDs. To do so, create an enum as shown below. Important: make sure to start with '1' because '0' is defined internally by the parser as NO_TOKEN. In your application, you would give the entries more meaningful names. The demo project that parses a file containing information about some virtual graphical objects uses entries such as TOKEN_COLOR or TOKEN_SIZE, for example.

Do not forget to #include the CParser interface header "Parser.h".

#include "Parser.h"

enum T_TokenID
{
  TOKEN_MY_FIRST_TOKEN = 1,
  TOKEN_MY_SECOND_TOKEN,
  TOKEN_MY_THIRD_TOKEN
};
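To illustrate why the IDs start at 1, here is a minimal, MFC-free sketch of a parser with the same call shape. It is not the CParser source: MiniParser, NO_TOKEN_ID, and CollectTokenIds are hypothetical names, and the matching logic is only an assumption about how such a class could work. The point is that the check function returns a plain int, so the value 0 has to stay reserved for "no token completed by this byte".

```cpp
#include <cstddef>
#include <string>
#include <vector>

const int NO_TOKEN_ID = 0;  // hypothetical name for the reserved value 0

// Hypothetical multi-token matcher (illustration only, not CParser).
class MiniParser
{
public:
    void Add(int id, const std::string& text)
    {
        Entry e;
        e.id = id;
        e.text = text;
        e.matched = 0;
        m_entries.push_back(e);
    }

    void Reset()
    {
        for (std::size_t i = 0; i < m_entries.size(); ++i)
            m_entries[i].matched = 0;
    }

    // Returns the ID of the first token completed by this byte,
    // or NO_TOKEN_ID if none was completed.
    int CheckForToken(char c)
    {
        int found = NO_TOKEN_ID;
        for (std::size_t i = 0; i < m_entries.size(); ++i)
        {
            Entry& e = m_entries[i];
            if (e.text.empty())
                continue;
            if (c != e.text[e.matched])
                e.matched = 0;              // partial match failed
            if (c == e.text[e.matched])
                ++e.matched;                // byte extends (or starts) a match
            if (e.matched == e.text.size())
            {
                e.matched = 0;
                if (found == NO_TOKEN_ID)
                    found = e.id;
            }
        }
        return found;
    }

private:
    struct Entry
    {
        int id;
        std::string text;
        std::size_t matched;
    };
    std::vector<Entry> m_entries;
};

// Feeds text byte by byte and returns the IDs in the order they were found.
std::vector<int> CollectTokenIds(MiniParser& parser, const std::string& text)
{
    std::vector<int> ids;
    parser.Reset();
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const int id = parser.CheckForToken(text[i]);
        if (id != NO_TOKEN_ID)
            ids.push_back(id);
    }
    return ids;
}
```

If a token were registered with ID 0 in a design like this, the caller could not tell a hit from a miss, which is presumably why the real class reserves that value.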

The parser-related headers and sources are:

  • Parser.h: CParser interface (header)
  • Parser.cpp: CParser implementation
  • Token.h: CToken interface (header, included by Parser.h)
  • Token.cpp: CToken implementation

These files are the same for both approaches and can be downloaded via "Download CParser sources only" in the Downloads section below. They are also included in the demo projects.

To add the parser sources to your application, open Parser.cpp and Token.cpp, choose "Compile" from the "Build" menu for each file, and confirm when asked whether to add the sources to your application.

Parse a File Using Callbacks

For each token, you have to provide a callback function. If you use a member function of a class, it must be static. The declaration would resemble this:

static void CallBackForTokenMY_FIRST_TOKEN(CStdioFile* pFile);

The implementation could look like this:

void CParserDemoDlg::CallBackForTokenMY_FIRST_TOKEN(CStdioFile* pFile)
{
  // Place your code to handle a token TOKEN_MY_FIRST_TOKEN here.
  // pFile points at the file to parse. The file pointer points
  // at the first byte after the token just found. Thus, you can
  // read in some data that follows the token here.
}
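What a callback typically does is read the value that follows the token. In MFC, CStdioFile::ReadString(CString&) would read the rest of the current line. The sketch below shows the same step in standard C++ so that it can be tried without MFC; std::istream stands in for pFile, and ReadValueAfterToken is a hypothetical helper name.

```cpp
#include <istream>
#include <sstream>
#include <string>

// 'in' plays the role of pFile and is assumed to be positioned on the
// first byte after the token, as described for the real callback.
std::string ReadValueAfterToken(std::istream& in)
{
    std::string value;
    std::getline(in, value);            // read up to the end of the line

    // Drop a trailing '\r' in case the file uses CR/LF line endings.
    if (!value.empty() && value[value.size() - 1] == '\r')
        value.erase(value.size() - 1);

    return value;
}
```

After the call, the stream is positioned at the start of the next line, so scanning for the next token can simply continue from there.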

Now, construct a CParser instance and add the tokens you want to search for. The parameters of the CParser::Add method are the token ID, the corresponding string you search for in the file, and the corresponding callback function/method.

CParser parser;

parser.Add((int)TOKEN_MY_FIRST_TOKEN,  "name = ",
           CallBackForTokenMY_FIRST_TOKEN);
parser.Add((int)TOKEN_MY_SECOND_TOKEN, "street = ",
           CallBackForTokenMY_SECOND_TOKEN);
parser.Add((int)TOKEN_MY_THIRD_TOKEN,  "phone = ",
           CallBackForTokenMY_THIRD_TOKEN);

The parser is now ready for use. You can parse a file simply by calling:

parser.ParseFile("file_to_parse.txt");

An important disadvantage of the callback approach is that callbacks have to be static methods, so they cannot access non-static members of the same class directly. The parserDemoCB project shows an example of how to work around this problem: the CListBox m_lst_itemsInFile cannot be accessed directly, so a pointer is used instead. However, if you need to access non-static members and dislike the pointer idea, you can use the alternative CheckForToken(...) approach.
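The pointer workaround can be sketched as follows. This is not the code from parserDemoCB; ItemCollector, s_pInstance, OnNameToken, and TokenCallback are hypothetical names, and a std::vector stands in for the list box so that the example compiles without MFC. The callback also takes a string instead of a CStdioFile* for the same reason.

```cpp
#include <string>
#include <vector>

// Plain function pointer type, as a static member function can provide.
typedef void (*TokenCallback)(const std::string& value);

// Hypothetical stand-in for the dialog class.
class ItemCollector
{
public:
    ItemCollector() { s_pInstance = this; }
    ~ItemCollector() { if (s_pInstance == this) s_pInstance = 0; }

    // Static callback: it has no 'this', so it reaches the object
    // through the static pointer instead.
    static void OnNameToken(const std::string& value)
    {
        if (s_pInstance != 0)
            s_pInstance->m_items.push_back(value);
    }

    const std::vector<std::string>& Items() const { return m_items; }

private:
    static ItemCollector* s_pInstance;  // the one active instance
    std::vector<std::string> m_items;   // plays the role of the list box
};

ItemCollector* ItemCollector::s_pInstance = 0;
```

The price of this design is that only one instance can receive callbacks at a time, which is usually acceptable for a single dialog but is worth keeping in mind.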

Parse a File Using CheckForToken(...) and a Switch-Case Block

To parse a file, construct a CParser instance and add the tokens you want to search for. Implementing the CheckForToken(...) approach does not make use of callbacks, so this time the CParser::Add method lacks the parameter pCallBack:

CParser parser;

parser.Add((int)TOKEN_MY_FIRST_TOKEN, "name = ");
parser.Add((int)TOKEN_MY_SECOND_TOKEN, "street = ");
parser.Add((int)TOKEN_MY_THIRD_TOKEN, "phone = ");

You can now open a file, step through it byte by byte, call CheckForToken(...) for each byte, and check whether a token was found. For better readability, no error handling is included in the sample code shown below.

CFile file;
file.Open(fileNameStr, CFile::modeRead);
BYTE buffer;
parser.Reset();
while (file.Read(&buffer, 1) == 1)
{
  T_TokenID currentToken = (T_TokenID)parser.CheckForToken(buffer);

  switch ( currentToken )
  {
  case NO_TOKEN:
    break;    // do nothing but continue searching for a token

  case TOKEN_MY_FIRST_TOKEN:
    // place your code to handle a token TOKEN_MY_FIRST_TOKEN here
    break;

  case TOKEN_MY_SECOND_TOKEN:
    // place your code to handle a token TOKEN_MY_SECOND_TOKEN here
    break;

  case TOKEN_MY_THIRD_TOKEN:
    // place your code to handle a token TOKEN_MY_THIRD_TOKEN here
    break;

  default:
    {
      ASSERT(false);    // CheckForToken(buffer) should always
                        // return a valid T_TokenID
    }
  }    // switch ( currentToken )
}
file.Close();
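The loop above issues one read call per byte. A reader comment below suggests reading in 64 KB blocks instead. A minimal sketch of that idea in standard C++ follows; ForEachByteBuffered and ByteCounter are hypothetical names, and std::istream is used in place of CFile so that the example is self-contained. With MFC, the same structure should work with CFile::Read and a larger buffer, passing each byte to CheckForToken.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Reads 'in' in blocks of blockSize bytes and passes every byte to
// onByte in file order. Returns the total number of bytes processed.
template <typename ByteHandler>
std::size_t ForEachByteBuffered(std::istream& in, std::size_t blockSize,
                                ByteHandler& onByte)
{
    if (blockSize == 0)
        return 0;

    std::vector<char> block(blockSize);
    std::size_t total = 0;
    for (;;)
    {
        in.read(&block[0], static_cast<std::streamsize>(block.size()));
        const std::streamsize got = in.gcount();   // may be a partial block
        if (got <= 0)
            break;                                  // end of input
        for (std::streamsize i = 0; i < got; ++i)
            onByte(static_cast<unsigned char>(block[static_cast<std::size_t>(i)]));
        total += static_cast<std::size_t>(got);
    }
    return total;
}

// Example handler: counts how often one byte value occurs. In a real
// application this is where the parser's check function would be called.
struct ByteCounter
{
    explicit ByteCounter(unsigned char byteToCount)
        : wanted(byteToCount), hits(0) {}
    void operator()(unsigned char b) { if (b == wanted) ++hits; }

    unsigned char wanted;
    std::size_t hits;
};
```

One consequence of buffering: when a token is found, the file position is already at the end of the current block rather than directly behind the token, so any data following the token has to be taken from the buffer and not read from the file.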

Downloads

Both demo projects parse the text file file_to_parse.txt, which also includes some explanation. When a token is found, the corresponding data is read and added to the dialog's list box.

Download demo project, demonstrating the CheckForToken(...) approach - 13 KB
Download demo project, demonstrating the callback approach - 13 KB
Download CParser sources only - 3 KB



Comments

  • parser misses token blocked by partial token

    Posted by hal@kiwisoft.co.nz on 03/01/2008 02:42pm

    You will notice that if you have a partial token immediately followed by a full token that the full token is missed. It is missed because when matching fails the match count is zeroed but the character is not then checked as a starter for a new token. For example, the token is "park" and the input stream is "parpark" and the token is missed. The solution is simply to do a match at the same time as zeroing the match count.

  • One more question

    Posted by Legacy on 01/20/2004 12:00am

    Originally posted by: Andy V

    Thanks for answering my previous question. I've tried to put the CParser into my program, but I get an error in bhytpes.h. I've looked but can't tell where this is even used. I did find it has something to do with StdAfx.h. I'm not sure what to do about this either. Sometimes I get the "cannot open StdAfx.h file" error while compiling. Any help is much appreciated.

  • Text file

    Posted by Legacy on 01/15/2004 12:00am

    Originally posted by: Andy V

    Any ideas on how you would put the results into a .txt file instead of displaying them?

  • What about using Spirit

    Posted by Legacy on 08/04/2003 12:00am

    Originally posted by: Jonathan de Halleux

    Do you know Spirit? http://spirit.sourceforge.net

    Spirit is a parser generation framework that lets you build simple parsers directly inside your C++ code.

  • A slight improvement

    Posted by Legacy on 08/01/2003 12:00am

    Originally posted by: Oktronic

    Hi Peter,
    I read your article and I would like to make a suggestion.
    I too have had to parse very large files, many of which are in the gigabyte range.
    Your method of reading one byte at a time would be seriously slow for such files. I'm sure you've noticed this with your 80 MB files. It probably takes a while to parse such large files.

    I would suggest that you try reading your files in 64k memory buffers to increase the speed. You will find your tokens very fast this way. To compare speeds, I can go through an 80 MB file and reorder data based on field parameters with adaptable buffer sizes in about 15 seconds on a middle class 2 gig machine, which is pretty fast.
    So if you find that your class is a little slower than you like, you might want to try this.
    Good work,
    Oktronic

