CParser: A Simple File Parser

Environment: VC6

Introduction

When there is the need to parse a file and a "real" parser would be oversized for the job, this rather simple parser might be an alternative. As the two demo projects show, the CParser is easy to use:

  • Construct a CParser
  • Add the tokens you want to search for
  • Reset() the parser each time before you start parsing
  • Either step through the file byte by byte and call CheckForToken(currentByte) for each byte,
  • or provide a callback for each token and call ParseFile(fileNameStr)

A token is a piece of text you search a file for; searching for tokens is what "parsing" means here. Example: Assume you have a text file that holds address information and each entry begins with "name = ...". The string "name = " would be a token you search for when scanning the file for address entries.

I created this class when I needed to read information from a file that had been generated previously. Thus, there was no need to provide any syntax checking and so forth because I could rely on the file-generating code not to produce faulty output. The files I had to scan were quite large (>80 MB), so reading the whole file into a string and parsing it with CString::Find() or similar methods was not an option.

Detailed Description

As mentioned before, the CParser class supports two different approaches for parsing a file. Both are described in detail below and there is a demo project for each. This section handles the aspects that are valid for both approaches.

In either case, you have to provide a set of token IDs. To do so, create an enum as shown below. Important: make sure to start with '1' because the parser internally defines '0' as NO_TOKEN. In your application, you would give the entries more meaningful names. For example, the demo project, which parses a file containing information about some virtual graphical objects, uses entries such as TOKEN_COLOR or TOKEN_SIZE.

Do not forget to #include the CParser interface header "Parser.h".

#include "Parser.h"

enum T_TokenID
{
  TOKEN_MY_FIRST_TOKEN = 1,
  TOKEN_MY_SECOND_TOKEN,
  TOKEN_MY_THIRD_TOKEN
};

The parser-related headers and sources are:

  • Parser.h: CParser interface (header)
  • Parser.cpp: CParser implementation
  • Token.h: CToken interface (header included by Parser.h)
  • Token.cpp: CToken implementation

These files are the same for both approaches and can be downloaded via the "Download CParser sources only" link in the Downloads section below. These files are also included in the demo projects.

To add the parser sources to your application, open Parser.cpp and Token.cpp and choose "Compile" from the "Build" menu for both files and confirm to add these sources to your application.

Parse a File Using Callbacks

For each token, you have to provide a callback function. If the callback is a member function of a class, it must be static. The declaration would resemble this:

static void CallBackForTokenMY_FIRST_TOKEN(CStdioFile* pFile);

The implementation could look like this:

void CParserDemoDlg::CallBackForTokenMY_FIRST_TOKEN(CStdioFile* pFile)
{
  // Place your code to handle a token TOKEN_MY_FIRST_TOKEN here.
  // pFile points at the file to parse. The file pointer points
  // at the first byte after the token just found. Thus, you can
  // read in some data that follows the token here.
}
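
As a rough illustration of what such a handler might do, the sketch below reads the rest of the current line as the value that follows the token. It uses a standard std::istream as a stand-in for the CStdioFile* that the parser passes to the callback. The names ReadValueAfterToken and DemoReadName are hypothetical and not part of CParser, and the code assumes a C++11 compiler rather than VC6.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of CParser: reads everything from the
// current stream position up to the end of the line. The stream stands
// in for the CStdioFile* that the callback receives.
std::string ReadValueAfterToken(std::istream& in)
{
    std::string value;
    std::getline(in, value);              // stops at '\n' and discards it
    if (!value.empty() && value.back() == '\r')
        value.pop_back();                 // tolerate Windows line endings
    return value;
}

// Example: the stream is assumed to be positioned right after "name = ".
std::string DemoReadName()
{
    std::istringstream in("Peter Miller\nstreet = Main St\n");
    return ReadValueAfterToken(in);
}
```

In an MFC callback, CStdioFile::ReadString could presumably play the role that std::getline plays here.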

Now, construct a CParser instance and add the tokens you want to search for. The parameters of the CParser::Add method are the token ID, the corresponding string you search for in the file, and the corresponding callback function/method.

CParser parser;

parser.Add((int)TOKEN_MY_FIRST_TOKEN,  "name = ",
           CallBackForTokenMY_FIRST_TOKEN);
parser.Add((int)TOKEN_MY_SECOND_TOKEN, "street = ",
           CallBackForTokenMY_SECOND_TOKEN);
parser.Add((int)TOKEN_MY_THIRD_TOKEN,  "phone = ",
           CallBackForTokenMY_THIRD_TOKEN);

The parser is now ready for use. You can parse a file simply by calling

parser.ParseFile("file_to_parse.txt");

An important disadvantage of the callback approach results from the fact that callbacks have to be static methods: a static method cannot access non-static members of the same class directly. The parserDemoCB project shows an example of how to work around this problem: The CListBox m_lst_itemsInFile cannot be accessed directly, so a pointer is used instead. However, if you need to access non-static members and dislike the pointer idea, you can use the alternative CheckForToken(...) approach.
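
The pointer workaround can be sketched in plain C++ as follows. This is not the code from parserDemoCB: DemoDialog, OnNameToken, and s_pItems are hypothetical names, a std::vector stands in for the CListBox, and the callback takes a std::string instead of a CStdioFile* to keep the example self-contained. It assumes a C++11 compiler.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a dialog class; all names are illustrative.
class DemoDialog
{
public:
    DemoDialog() { s_pItems = &m_items; }    // publish the member for callbacks
    ~DemoDialog() { s_pItems = nullptr; }    // withdraw it again

    // Static callback: it has no 'this', so it goes through the static pointer.
    static void OnNameToken(const std::string& value)
    {
        if (s_pItems != nullptr)
            s_pItems->push_back(value);
    }

    const std::vector<std::string>& Items() const { return m_items; }

private:
    std::vector<std::string> m_items;            // non-static member
    static std::vector<std::string>* s_pItems;   // valid while a dialog exists
};

std::vector<std::string>* DemoDialog::s_pItems = nullptr;
```

A single static pointer only works while at most one such object exists at a time, which is one reason to prefer the CheckForToken(...) approach when that cannot be guaranteed.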

Parse a File Using CheckForToken(...) and a Switch-Case Block

To parse a file, construct a CParser instance and add the tokens you want to search for. Implementing the CheckForToken(...) approach does not make use of callbacks, so this time the CParser::Add method lacks the parameter pCallBack:

CParser parser;

parser.Add((int)TOKEN_MY_FIRST_TOKEN, "name = ");
parser.Add((int)TOKEN_MY_SECOND_TOKEN, "street = ");
parser.Add((int)TOKEN_MY_THIRD_TOKEN, "phone = ");

You can now open a file, step through it byte by byte, call CheckForToken(...) for each byte, and check whether a token was found. For better readability, no error handling is included in the sample code shown below.

CFile file;
file.Open(fileNameStr, CFile::modeRead);
BYTE buffer;
parser.Reset();
while (file.Read(&buffer, 1) == 1)
{
  T_TokenID currentToken = (T_TokenID)parser.CheckForToken(buffer);

  switch ( currentToken )
  {
  case NO_TOKEN:
    break;    // do nothing but continue searching for a token

  case TOKEN_MY_FIRST_TOKEN:
    // place your code to handle a token TOKEN_MY_FIRST_TOKEN here
    break;

  case TOKEN_MY_SECOND_TOKEN:
    // place your code to handle a token TOKEN_MY_SECOND_TOKEN here
    break;

  case TOKEN_MY_THIRD_TOKEN:
    // place your code to handle a token TOKEN_MY_THIRD_TOKEN here
    break;

  default:
    {
      ASSERT(false);    // CheckForToken(buffer) should always
                        // return a valid T_TokenID or NO_TOKEN
    }
  }    // switch ( currentToken )
}
file.Close();
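
The loop above issues one read call per byte. A possible variation, sketched here with standard C++ streams instead of CFile, reads the file in larger blocks and still hands the bytes to the parser one at a time. ForEachByteBuffered and CountEquals are hypothetical names, the 64 KB block size is an arbitrary choice, and the onByte callable merely stands in for a call such as parser.CheckForToken(byte). A C++11 compiler is assumed.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads the stream in blocks and passes every byte to 'onByte'.
// Returns the total number of bytes processed.
template <typename ByteHandler>
std::size_t ForEachByteBuffered(std::istream& in, ByteHandler onByte)
{
    char block[64 * 1024];                        // arbitrary block size
    std::size_t total = 0;
    for (;;)
    {
        in.read(block, sizeof(block));
        const std::streamsize got = in.gcount();  // bytes actually read
        if (got <= 0)
            break;                                // end of stream or error
        for (std::streamsize i = 0; i < got; ++i)
            onByte(static_cast<unsigned char>(block[i]));
        total += static_cast<std::size_t>(got);
    }
    return total;
}

// Example: count the '=' bytes in an in-memory stream.
std::size_t CountEquals(const std::string& text)
{
    std::istringstream in(text);
    std::size_t count = 0;
    ForEachByteBuffered(in, [&count](unsigned char b) { if (b == '=') ++count; });
    return count;
}
```

One consequence of this design is that the file position no longer sits directly behind a token when it is found, so any data following the token would have to be taken from the block rather than read from the file.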

Downloads

Both demo projects parse the text file file_to_parse.txt, which also includes some explanation. When a token is found, the corresponding data is read and added to the dialog's list box.

Download demo project, demonstrating the CheckForToken(...) approach - 13 Kb
Download demo project, demonstrating the callback approach - 13 Kb
Download CParser sources only - 3 Kb



Comments

  • parser misses token blocked by partial token

    Posted by hal@kiwisoft.co.nz on 03/01/2008 02:42pm

    You will notice that if you have a partial token immediately followed by a full token, the full token is missed. It is missed because when matching fails the match count is zeroed but the character is not then checked as a starter for a new token. For example, if the token is "park" and the input stream is "parpark", the token is missed. The solution is simply to check the current character as a possible start of a new match at the same time as zeroing the match count.

  • One more question

    Posted by Legacy on 01/20/2004 12:00am

    Originally posted by: Andy V

    Thanks for answering my previous question. I've tried to put the CParser into my program, but I get an error in bhytpes.h. I've looked but can't tell where this is even used. I did find it has something to do with StdAfx.h. I'm not sure what to do about this either. Sometimes I will get the "cannot open StdAfx.h" error while compiling. Any help is much appreciated.

  • Text file

    Posted by Legacy on 01/15/2004 12:00am

    Originally posted by: Andy V

    Any ideas on how you would put the results into a .txt file instead of displaying them?

  • What about using Spirit

    Posted by Legacy on 08/04/2003 12:00am

    Originally posted by: Jonathan de Halleux

    Do you know Spirit? http://spirit.sourceforge.net

    Spirit is a parser generation framework that lets you build simple parsers directly inside your C++ code.

  • A slight improvement

    Posted by Legacy on 08/01/2003 12:00am

    Originally posted by: Oktronic

    Hi Peter,
    I read your article and I would like to make a suggestion.
    I too have had to parse very large files, many of which are in the gig range.
    Your method of reading one byte at a time would be seriously slow for such files. I'm sure you've noticed this with your 80 MB files. It probably takes a while to parse such large files.

    I would like to suggest that you try reading your files in 64k memory buffers to increase the speed. You will find your tokens very fast this way. To compare speeds, I can go through an 80 meg file and reorder data based on field parameters with adaptable buffer sizes in about 15 seconds on a middle class 2 gig machine, which is pretty fast.
    So if you find that your class is a little slower than you like, you might want to try this.
    Good work,
    Oktronic
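
The fix described in the comment "parser misses token blocked by partial token" above might look roughly like the following sketch. It is not the CParser or CToken source: SingleTokenMatcher and CountMatches are hypothetical names, the matcher handles a single token only, and the token is assumed to be non-empty.

```cpp
#include <cstddef>
#include <string>

// Hypothetical single-token matcher, not the CParser/CToken source.
// Assumes a non-empty token.
class SingleTokenMatcher
{
public:
    explicit SingleTokenMatcher(const std::string& token)
        : m_token(token), m_matched(0) {}

    // Feed one byte; returns true when the byte completes the token.
    bool Check(char c)
    {
        if (c == m_token[m_matched])
        {
            ++m_matched;
        }
        else
        {
            // Mismatch: reset the match count, but test the same byte
            // again as a possible first character of a new match.
            m_matched = (c == m_token[0]) ? 1 : 0;
        }
        if (m_matched == m_token.size())
        {
            m_matched = 0;
            return true;
        }
        return false;
    }

private:
    std::string m_token;
    std::size_t m_matched;   // number of token characters matched so far
};

// Counts how often 'token' is completed while scanning 'input'.
std::size_t CountMatches(const std::string& token, const std::string& input)
{
    SingleTokenMatcher matcher(token);
    std::size_t count = 0;
    for (std::size_t i = 0; i < input.size(); ++i)
        if (matcher.Check(input[i]))
            ++count;
    return count;
}
```

With the re-check in place, "park" is found in "parpark". Tokens whose beginning repeats within themselves can still be missed (for instance "aab" in the input "aaab"); covering those cases would need a fallback table in the style of the Knuth-Morris-Pratt algorithm.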

