AccessLog

Environment: Windows 2000/SP2, VC++ 6.0/SP3

Introduction
What you should know
What you cannot expect from this article
AccessLog
The C++ Wrapper
Read this before using the AccessLog project
Downloads

Introduction

The world of database programming includes names like Oracle and Informix, and if you are an MFC programmer you are probably familiar with terms like ODBC, DAO and Access. But what if you need a smart system to handle your data quickly? Berkeley DB is a library that lets you handle your data without resorting to the well-known names of database programming, and this article shows you how to use it in your Win32/MFC projects through a C++ wrapper.

What you should know

The sample application presented here requires a minimal knowledge of database programming and some basic notions of MFC.

What you cannot expect from this article

The AccessLog sample presented here is a utility that uses the Berkeley DB library. It is only a sample, so do not expect a general database system with SQL support, sophisticated data conversion, triggers, etc. The only purpose of this article is to show you how to use a very reliable database system under Win32 through a C++ wrapper. The C++ classes presented here to wrap the Berkeley DB library are only a convenience for using it in an XBase manner, and do not pretend to be the final solution for database programming.

AccessLog

Most Internet hosts are configured to track their activity in a log file. The same applies to web (HTTP) servers, which generate a log file containing all the requests coming from the users (the access log file).

This kind of log contains the basic information about each request, such as the IP address, the name of the requested object, its size, the return code, etc. The following shows a typical entry of the access log file (a single line in the log, wrapped here for readability), where the real IP address has been replaced with the “nnn.nnn.nnn.nnn” sequence:

nnn.nnn.nnn.nnn - - [03/Oct/2001:10:18:21 -0400]
     "GET / HTTP/1.0" 200 27870 "-" "Mozilla/4.0 (compatible;
     MSIE 5.0; Windows 98;)"

As you can see, each entry (record) of the access log contains the following fields:

Value	Description
nnn.nnn.nnn.nnn	The IP address of the user who makes the request.
[03/Oct/2001:10:18:21 -0400]	The date and time when the server receives the request. The GMT offset of the time (-0400) is relative to the time zone of the server.
GET / HTTP/1.0	The request received by the server, usually a GET command. This field contains the HTTP command (GET, HEAD, POST, etc.), the requested object (the url location) and the version of the HTTP protocol supported by the client.
200	The return code (200=Ok, 404=Not found, etc.).
27870	The size of the requested object (in bytes).
-	The “Referer” url, which indicates where the user comes from.
Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)	The “UserAgent” field, which identifies the client used by the user.

Each field of the record is separated from the next by a blank character (the bracketed date and the quoted fields contain blanks of their own), and each record is terminated by a CR/LF pair.
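Because the date and the quoted fields contain blanks, a plain split on blanks is not enough to separate the fields. The following is a minimal sketch of one way to split an entry, assuming the layout of the sample line above. The LogEntry struct and the NextToken() and ParseLogLine() functions are hypothetical names used only for illustration; they are not part of the AccessLog source, and quotes escaped inside a quoted field are not handled.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical record type for illustration; not the CLogTable record.
struct LogEntry {
    std::string ip;
    std::string date;       // without the surrounding [ ]
    std::string request;    // without the surrounding quotes
    int         code = 0;
    long        size = 0;
    std::string referer;
    std::string userAgent;
};

// Reads the next token starting at pos. A token is either a run of
// non-blank characters, or everything between [ ] or between " ".
static bool NextToken(const std::string& line, std::size_t& pos, std::string& out)
{
    while (pos < line.size() && line[pos] == ' ') ++pos;   // skip blanks
    if (pos >= line.size()) return false;

    char close = 0;
    if (line[pos] == '[') close = ']';
    else if (line[pos] == '"') close = '"';

    if (close != 0) {
        std::size_t end = line.find(close, pos + 1);
        if (end == std::string::npos) return false;        // unterminated field
        out = line.substr(pos + 1, end - pos - 1);
        pos = end + 1;
    } else {
        std::size_t end = line.find(' ', pos);
        if (end == std::string::npos) end = line.size();
        out = line.substr(pos, end - pos);
        pos = end;
    }
    return true;
}

// Parses one line of the access log; returns false if a field is missing.
bool ParseLogLine(const std::string& line, LogEntry& e)
{
    std::size_t pos = 0;
    std::string ident, user, code, size;
    if (!NextToken(line, pos, e.ip))        return false;
    if (!NextToken(line, pos, ident))       return false;  // first "-"
    if (!NextToken(line, pos, user))        return false;  // second "-"
    if (!NextToken(line, pos, e.date))      return false;
    if (!NextToken(line, pos, e.request))   return false;
    if (!NextToken(line, pos, code))        return false;
    if (!NextToken(line, pos, size))        return false;
    if (!NextToken(line, pos, e.referer))   return false;
    if (!NextToken(line, pos, e.userAgent)) return false;
    e.code = std::atoi(code.c_str());
    e.size = std::atol(size.c_str());       // a non-numeric size yields 0
    return true;
}
```

Calling ParseLogLine() on the sample entry (joined into a single line) fills the seven fields described in the table above.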

Our sample app is a brute-approach log analyzer. Do not confuse the term with some obscure hacking technique: brute-approach here simply means that the AccessLog app does not use any kind of optimization during the log analysis. There are several loops over the data, one for each report you can generate. The main purpose of the AccessLog app is not to optimize the analysis process but to show the usage of the Berkeley DB library, so do not expect sophisticated report-generation code.

To use the AccessLog app, you must first load the access log file (generated by your web server) into the table used by the program. First click the “…” button beside the “Log:” field to select the log file, then click the “…” button beside the “Table:” field to specify an existing table (or enter the name you prefer). When ready, click the “Load” button and the content of the log file will be loaded into the database.

The “HTML:” field of the “Report” section allows you to enter the name of the output report.

Now you are ready to generate the report. In the “Report” section you can choose the fields to include in the output report. The “referer by rate” and “referer by url” check boxes include the list of the referring urls from which your server has been accessed (the “Referer” field is set by the browser in the header of the HTTP request and contains the url the user is coming from). The “user agent by rate” and “user agent by name” check boxes include the list of the user agents used by the people who connect to your web site. As you know, each browser and HTTP client identifies itself with a descriptive name, like Mozilla […], Opera […], etc. Finally, the “get by rate” and “get by url” check boxes list all the url locations accessed by the users.
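As a rough illustration of what a “by rate” listing involves, here is a minimal sketch that counts how many times each value occurs and orders the result by descending count. It assumes that “by rate” means “ordered by number of hits”, which the article does not spell out. The RateEntry type and the ByRate() and CountByRate() functions are hypothetical names, not taken from the AccessLog source.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> RateEntry;   // value, number of hits

// Orders by descending count; ties are broken by value, ascending.
static bool ByRate(const RateEntry& a, const RateEntry& b)
{
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Counts the occurrences of each value and returns them ordered by rate.
std::vector<RateEntry> CountByRate(const std::vector<std::string>& values)
{
    std::map<std::string, int> counts;            // kept sorted by value
    for (std::size_t i = 0; i < values.size(); ++i)
        ++counts[values[i]];

    std::vector<RateEntry> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(), ByRate);
    return result;
}
```

Without the final sort, the entries stay in the map's own order, which is alphabetical by value and so closer to a “by name” or “by url” listing.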

The remaining fields, “by user agent”, “by url (ignore filters)” and “by ip”, allow you to add a listing that includes only the value you specify. For example, suppose you want to list all accesses to the main url location of your web server: enter the / character in the “by url (ignore filters)” field. If you want to list all the visitors using the AOL service, specify the value *aol* in the “by ip” field. Would you like to know whether some visitors use the Opera browser? Enter the value *Opera* in the “by user agent” field.

Note that the wildcard search is case sensitive, so specifying *Opera* as the content of the field is not the same as specifying *opera*.
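To make the case-sensitive behaviour concrete, here is a minimal sketch of a matcher that supports only the * wildcard. WildMatch() is a hypothetical function written for this illustration; the matching code actually used by AccessLog may differ.

```cpp
#include <cstddef>
#include <string>

// Returns true if text matches pattern, where '*' in the pattern stands
// for any run of characters (including none). Case sensitive.
bool WildMatch(const std::string& pattern, const std::string& text)
{
    std::size_t p = 0, t = 0;
    std::size_t star = std::string::npos;  // position of the last '*' seen
    std::size_t mark = 0;                  // text position to retry from

    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;                    // remember the star, try matching nothing
            mark = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;                           // literal character matches
            ++t;
        } else if (star != std::string::npos) {
            p = star + 1;                  // let the last star absorb one more char
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars
    return p == pattern.size();
}
```

With this sketch, *Opera* matches a user agent containing “Opera” but *opera* does not, and the pattern / matches only the main url location.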

The “Report Filters” section allows you to filter the output report according to the values you specify. For example, suppose you want to generate a report including all accesses to the main url of your site and to the .txt and .zip files you make available through your html pages. In that case, enter the value /,index.html,*.zip*,*.txt in the “requested url” field and select only the “get by rate” (or “get by url”) check box.

The “by url (ignore filters)” field is the only one that ignores the content of the filters during report generation.

The C++ Wrapper

The main purpose of the AccessLog sample app is to illustrate how to use the Berkeley DB library in everyday work. Do not expect a sophisticated C++ wrapper that transforms the Berkeley DB library into a general SQL database. The C++ wrapper presented here is only sample code that gives an XBase flavor to Berkeley DB. If your technical background includes languages like the legendary Clipper (from Nantucket), you will surely enjoy the wrapper. On the other hand, if you are looking for SQL, you must resort to tools like the well-known MentorSQL experiment (but do not ask me about it, because the source code is no longer available and I do not know where to find it).

The Berkeley DB library stores plain key/data pairs, so to use it in an XBase (relational) manner we need to make some adjustments. Our C++ wrapper (let me call it the CBase interface) allows you to define an index on each field of your table. The mechanism used by the CBase wrapper is very simple. Each table you define through the CBase interface contains a ‘hidden’ field used as the primary key. This field holds the unique key that identifies the record. Each index you define on the main table contains the value of the secondary key and of the primary key.

Suppose that the Table1 is your main table:

primary key	your fields (name, email, age, address, etc.) here
0000000001	Luca, lpiergentili@yahoo.com, etc.
0000000002	Pasqualino, pasqualino@settebellezze.com, etc.
0000000003	Giacomino, giacomino@bellammare.com, etc.
etc.

When you create the first index, for example the Index1 on the name field, the CBase interface creates another table with the following fields:

name field (secondary key)	primary key from the table Table1
Luca	0000000001
Pasqualino	0000000002
Giacomino	0000000003
etc.

According to the indexing mechanism, the next index, for example the Index2 on the email field, contains the following:

email field (secondary key)	primary key from the table Table1
lpiergentili@yahoo.com	0000000001
pasqualino@settebellezze.com	0000000002
giacomino@bellammare.com	0000000003
etc.

Would you like to search the table for the user with the lpiergentili@yahoo.com email? The CBase interface looks up the lpiergentili@yahoo.com value in the Index2 table and retrieves the primary key (0000000001). It then looks up the 0000000001 value in Table1, allowing you to retrieve the main record. That is all: with this very basic mechanism you can use more than one index per table. Of course, you do not need to care about the primary key details. The primary key generation is automatic, and the CBase interface keeps all the indexes synchronized with the main table.
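The two-step lookup can be sketched in a few lines. In this illustration std::map stands in for a Berkeley DB key/data file, and MiniTable, Insert() and FindByEmail() are hypothetical names; none of this is the actual CBase code. Unlike a real index, the sketch keeps only one record per secondary key.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Stand-in for illustration: each std::map plays the role of one
// key/data file. None of these names come from the CBase source.
class MiniTable {
public:
    MiniTable() : m_next(1) {}

    // Inserts a record and keeps the email index in sync.
    // Returns the automatically generated primary key.
    std::string Insert(const std::string& name, const std::string& email)
    {
        char key[16];
        std::snprintf(key, sizeof(key), "%010lu", m_next++);
        m_table[key] = name + "," + email;   // main table: primary key -> record
        m_byEmail[email] = key;              // index: secondary key -> primary key
        return key;
    }

    // Two-step lookup: the index first, then the main table.
    bool FindByEmail(const std::string& email, std::string& record) const
    {
        std::map<std::string, std::string>::const_iterator i = m_byEmail.find(email);
        if (i == m_byEmail.end()) return false;
        std::map<std::string, std::string>::const_iterator r = m_table.find(i->second);
        if (r == m_table.end()) return false;
        record = r->second;
        return true;
    }

private:
    unsigned long m_next;                          // next primary key value
    std::map<std::string, std::string> m_table;    // plays the role of Table1
    std::map<std::string, std::string> m_byEmail;  // plays the role of Index2
};
```

The caller never sees the primary key unless it asks for it, which is the point of hiding the key handling behind the interface.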

The CBase wrapper uses the CBerkeleyDB class as the basic interface to the Berkeley DB library. In the CBerkeleyDB class you can find the definitions of the structures for the record:

struct ROW {
  int     num;         // field position (0 based)
  int     ofs;         // field offset (0 based)
  char*   name;        // field name
  char    type;        // field type
  int     size;        // field size
  int     dec;         // decimals
  char*   value;       // pointer to the field content
  unsigned long flags; // filters
};

for the table:

struct TABLE {
  char    filename[_MAX_PATH+1]; // table name
  int     totfield;              // number of fields
  ROW*    row;                   // array for field definition,
                                 // must alloc at run-time
  int     totindex;              // number of indexes
  INDEX*  index;                 // array for index definition,
                                 // must alloc at run-time
  TABLE_STAT stat;               // current state
};

struct DATABASE {
  TABLE   table;                 // table definition
};

and for the index:

struct INDEX {
  char    filename[_MAX_PATH+1]; // index filename
  char*   name;                  // index name
  char*   fieldname;             // the name of the (table) 
                                 // field used as the key
  int     fieldnum;              // the number of the (table)
                                 // field used as the key (0 based)
};

The CBerkeleyDB class also contains all the basic methods required to access the table, like Open(), Create(), Close(), Insert(), Delete(), etc. To retrieve records, the CBerkeleyDB class defines the GetFirst…Last() methods, while all the (internal) primary key handling goes through the GetPrimary…() functions. The CBerkeleyDB class does not contain pure virtual methods (like SomeMethod() = 0), but you should think of it as a base class.

To use it, you must derive your own class from it, as the CBase class does. We need to access our tables in an XBase manner, without caring about the primary/secondary key details, and the CBase class exists just for this. The CBase class is a layer that hides all the low-level details of the Berkeley DB library, but its only purpose is to define a CBase object; it is not a ready-to-use class.

To use this kind of object we need to define a CTable class. The CTable class contains the CBase object used to access the Berkeley DB library and declares some pure virtual methods that must be implemented in the derived class. Think of the CTable class as a bridge between Berkeley DB and our final table.
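The layering can be pictured with a much simplified sketch. All the class names below (LowLevelDb, BaseLayer, TableBridge, DemoTable) are hypothetical stand-ins for CBerkeleyDB, CBase, CTable and a final table class; the real classes have many more members and different signatures.

```cpp
#include <string>

// Low-level access, meant to be used through derivation (the role of CBerkeleyDB).
class LowLevelDb {
public:
    virtual ~LowLevelDb() {}
    bool Open(const std::string& file) { m_file = file; return !file.empty(); }
    const std::string& FileName() const { return m_file; }
private:
    std::string m_file;
};

// Layer that would hide the primary/secondary key handling (the role of CBase).
class BaseLayer : public LowLevelDb {
};

// Bridge that owns a BaseLayer object and asks the derived class
// to describe the table (the role of CTable).
class TableBridge {
public:
    virtual ~TableBridge() {}
    virtual const char* GetTableName() = 0;     // must be defined by the final table
    virtual int         GetRecordLength() = 0;  // must be defined by the final table
    bool Open() { return m_base.Open(GetTableName()); }
    const std::string& OpenedFile() const { return m_base.FileName(); }
private:
    BaseLayer m_base;
};

// The final table only supplies the description; everything else is inherited.
class DemoTable : public TableBridge {
public:
    const char* GetTableName()    { return "demotable"; }
    int         GetRecordLength() { return 42; }
};
```

The final class only supplies the table description, while opening and accessing the data stay in the layers below it.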

Confused? Do not worry, an example will make the whole mechanism clear. To generate the HTML report with all the information about our web site, we need to define a table with the following fields: the IP address of the user, the date/time of the request, the requested url, the response code, the size of the requested object, the referer url and the user agent. To define this table we use the CLogTable class. In the CLogDatabase.h file, which holds the class definition, we must first include the required headers:

#include "CBase.h"
#include "CTable.h"

Next, we need to define the table name (see the LOG_TABLE macro) and the length of each field of the record:

#define LOG_TABLE          "logtable"
#define LOG_IP_LEN         MAX_URL
#define LOG_DATE_LEN       32
#define LOG_GET_LEN        MAX_URL
#define LOG_CODE_LEN       5
#define LOG_SIZE_LEN       10
#define LOG_REFERER_LEN    MAX_URL
#define LOG_USERAGENT_LEN  MAX_URL
#define LOG_RECORD_LENGTH  (LOG_IP_LEN + LOG_DATE_LEN + \
                            LOG_GET_LEN + LOG_CODE_LEN + \
                            LOG_SIZE_LEN + LOG_REFERER_LEN + \
                            LOG_USERAGENT_LEN)

Also, we need to define the id for all the indexes:

#define LOG_IDX_IP         0
#define LOG_IDX_GET        1
#define LOG_IDX_REFERER    2
#define LOG_IDX_USERAGENT  3
#define LOG_IDX_DATE       4

Now we can start with the CLogTable class definition:

class CLogTable : public CTable
{
private:
  // record definition
  struct RECORD {
    char    ip[LOG_IP_LEN+1];
    char    date[LOG_DATE_LEN+1];
    char    get[LOG_GET_LEN+1];
    int     code;
    long    size;
    char    referer[LOG_REFERER_LEN+1];
    char    useragent[LOG_USERAGENT_LEN+1];
  };

  CBASE_TABLE* table_struct;  // pointer to the table
                              // struct definition, see
                              // into CLogDatabase.cpp
  CBASE_INDEX* idx_struct;    // pointer to the index
                              // struct definition, see
                              // into CLogDatabase.cpp
  RECORD       record;        // the instance of our record
  char  m_szRecord[LOG_RECORD_LENGTH+1]; // internally
                              // used to store the entire
                              // record
  char  m_szTableName[_MAX_PATH+1]; // the name of the table
  char  m_szTablePath[_MAX_PATH+_MAX_FNAME+1]; // the pathname
                              // for the table
  char  m_szIndexIp[_MAX_PATH+_MAX_FNAME+1];   // the pathname
                              // for the index referring to the
                              // ip field
  char  m_szIndexGet[_MAX_PATH+_MAX_FNAME+1];   // the pathname
                              // for the index referring to the
                              // get field
  char  m_szIndexReferer[_MAX_PATH+_MAX_FNAME+1]; // the pathname
                              // for the index referring to the
                              // referer field
  char  m_szIndexUserAgent[_MAX_PATH+_MAX_FNAME+1]; // the pathname
                              // for the index referring to the
                              // user agent field
  char  m_szIndexDate[_MAX_PATH+_MAX_FNAME+1]; // the pathname
                              //  for the index referring to the
                              // date field

public:
  // ctor/dtor
  CLogTable(LPCSTR lpcszTableName =
        NULL,LPCSTR lpcszDataPath =
        NULL,BOOL bOpenTable = FALSE);
  virtual ~CLogTable();

  // must define the pure virtual members of the CTable class
  inline const char* GetClassName(void)         {return("CLogTable");}
  inline const char* GetStaticTableName(void)   {return(LOG_TABLE);}
  inline const char* GetTableName(void)         {return(m_szTableName);}
  inline const char* GetTablePathName(void)     {return(m_szTablePath);}
  inline const CBASE_TABLE* GetTableStruct(void) {return(table_struct);}
  inline const CBASE_INDEX* GetIndexStruct(void) {return(idx_struct);}
  inline const int   GetRecordLength(void)   {return(LOG_RECORD_LENGTH);}
  const char*        GetRecordAsString(void);
  inline void        ResetMemvars(void)
                           {memset(&record,'\0',sizeof(record));}
  void               GatherMemvars(void);
  void               ScatterMemvars(BOOL = TRUE);

  // methods to retrieve the field's content
  inline const char* GetField_Ip(void)        {return(record.ip);}
  inline const char* GetField_Date(void)      {return(record.date);}
  inline const char* GetField_Get(void)       {return(record.get);}
  inline int         GetField_Code(void)      {return(record.code);}
  inline long        GetField_Size(void)      {return(record.size);}
  inline const char* GetField_Referer(void)   {return(record.referer);}
  inline const char* GetField_UserAgent(void) {return(record.useragent);}

  // methods to fill the field
  inline void PutField_Ip(const char* value)
              {strcpyn(record.ip,value,sizeof(record.ip));}
  inline void PutField_Date(const char* value)
              {strcpyn(record.date,value,sizeof(record.date));}
  inline void PutField_Get(const char* value)
              {strcpyn(record.get,value,sizeof(record.get));}
  inline void PutField_Code(int value)
              {record.code = value;}
  inline void PutField_Size(long value)
              {record.size = value;}
  inline void PutField_Referer(const char* value)
              {strcpyn(record.referer,value,sizeof(record.referer));}
  inline void PutField_UserAgent(const char* value)
              {strcpyn(record.useragent,
                       value,
                       sizeof(record.useragent));}
};

The CLogTable class inherits all the methods to handle our data from the CTable class (which contains the CBase object). If you look into the CLogDatabase.cpp file, you can see that the CLogTable class only defines the structure for the table:

static CBASE_TABLE table[] = {
  {"IP",        'C', LOG_IP_LEN,        0},
  {"DATE",      'C', LOG_DATE_LEN,      0},
  {"GET",       'C', LOG_GET_LEN,       0},
  {"CODE",      'N', LOG_CODE_LEN,      0},
  {"SIZE",      'N', LOG_SIZE_LEN,      0},
  {"REFERER",   'C', LOG_REFERER_LEN,   0},
  {"USERAGENT", 'C', LOG_USERAGENT_LEN, 0},
  {NULL,         0,  0,                 0}
};

for the index:

static CBASE_INDEX idx[] = {
  {NULL, "IDX_IP",        "IP"},
  {NULL, "IDX_GET",       "GET"},
  {NULL, "IDX_REFERER",   "REFERER"},
  {NULL, "IDX_USERAGENT", "USERAGENT"},
  {NULL, "IDX_DATE",      "DATE"},
  {NULL, NULL,            NULL}
};

and implements the required methods, like GatherMemvars(), ScatterMemvars() and GetRecordAsString().
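In XBase dialects, “scatter” copies the fields of the current record into memory variables and “gather” writes the memory variables back into the record. Assuming the CLogTable methods follow that convention (their bodies are not shown in this article), the idea can be sketched with a fixed-width record. DemoRecord, Gather() and Scatter() are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

const std::size_t NAME_LEN = 8;   // width of the character field
const std::size_t CODE_LEN = 5;   // width of the numeric field

// In-memory copy of the fields (the "memvars").
struct DemoRecord {
    char name[NAME_LEN + 1];
    int  code;
};

// "Gather": pack the in-memory fields into one fixed-width string,
// character fields left-aligned, numeric fields right-aligned.
std::string Gather(const DemoRecord& r)
{
    char buf[NAME_LEN + CODE_LEN + 1];
    std::snprintf(buf, sizeof(buf), "%-*.*s%*d",
                  (int)NAME_LEN, (int)NAME_LEN, r.name,
                  (int)CODE_LEN, r.code);
    return buf;
}

// "Scatter": unpack a fixed-width string into the in-memory fields.
// A string that is too short leaves the record zeroed.
void Scatter(const std::string& s, DemoRecord& r)
{
    std::memset(&r, 0, sizeof(r));
    if (s.size() < NAME_LEN + CODE_LEN) return;

    std::string name = s.substr(0, NAME_LEN);
    name.erase(name.find_last_not_of(' ') + 1);          // strip the padding
    std::strncpy(r.name, name.c_str(), NAME_LEN);
    r.code = std::atoi(s.substr(NAME_LEN, CODE_LEN).c_str());
}
```

A record packed with Gather() and unpacked with Scatter() comes back with the same field values, provided each value fits the width of its field.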

With the CLogTable class you can now handle your data as you like. Usually, I prefer to create another class that contains a CLogTable object and performs all the logical operations I need. It is not the case here, but suppose you need to update [n] tables for each record you insert into the CLogTable. Instead of repeating the same operations each time, I prefer to create a CLogDatabaseService class that internally defines a CLogTable object and exposes methods like InsertTheRecordAndUpdateAllTheTables(). Defining a CLogDatabaseService class does not prevent you from obtaining full access to the CLogTable object, thanks to the GetTable() method, which returns a pointer to the CLogTable object. In other words, with these classes you can write code like this:

CLogDatabaseService LogService( szTableName,
                                szTablePath,
                                TRUE);   // defines an object for
                                         // complex operations on
                                         // the table
CLogTable* pLogTable = LogService.GetTable(); // obtain the pointer
                                         // to the CLogTable object
                                         // to use it directly
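The service pattern itself can be sketched without the real classes. In the following illustration, StubTable and StubService are hypothetical stand-ins for CLogTable and CLogDatabaseService; the method names other than GetTable() are assumptions, and the second table exists only to show a multi-table update.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a table class; it only stores rows in memory.
class StubTable {
public:
    void Insert(const std::string& rec) { m_rows.push_back(rec); }
    std::size_t Count() const { return m_rows.size(); }
private:
    std::vector<std::string> m_rows;
};

// Stand-in for a service class that groups the multi-table operations.
class StubService {
public:
    // One call updates the main table and the dependent one.
    void InsertAndUpdateAll(const std::string& rec)
    {
        m_log.Insert(rec);
        m_audit.Insert("inserted: " + rec);
    }

    // Direct access to the main table remains possible.
    StubTable* GetTable() { return &m_log; }

    std::size_t AuditCount() const { return m_audit.Count(); }

private:
    StubTable m_log;     // the main table
    StubTable m_audit;   // a second table kept in step with the first
};
```

Code that goes through the service keeps the tables in step, while code that uses the pointer returned by GetTable() touches the main table only, so the choice is left to the caller.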

I know that the CBase C++ wrapper cannot be fully and clearly explained in a short article, so if you like the wrapper and are thinking about using it in your apps, you need to explore the source code. The comments are not in English, but I think the code is clear enough, and the use of the Hungarian and MFC notations will help you.

Read this before using the AccessLog project

The AccessLog sample app requires the Berkeley DB library, which is not included with the source code of the project due to its size (over 1 MB). If you want to use the AccessLog app or the CBase C++ wrapper, you must first download the Berkeley DB library version 2.7.7 from the Sleepycat site and install it. Note that I wrote the C++ wrapper back in 1998, which is why the CBase interface uses Berkeley DB version 2.7.7 (the version available at that time). The CBase interface presented here does not work with the latest version of the Berkeley DB library (3.3.11), due to the changes introduced since version 2.7.7.

The AccessLog project assumes that you install the Berkeley DB library into the \BerkeleyDB directory. Also, the CBerkeleyDB.h header file contains a #pragma directive that automatically links the import library of the required DLL (BerkeleyDB.dll):

#ifdef _DEBUG
  #pragma comment(lib,"BerkeleyDB.d.lib")
#else
  #pragma comment(lib,"BerkeleyDB.lib")
#endif

so you must modify the original Berkeley DB project file and specify BerkeleyDB.dll as the name of the output DLL file (BerkeleyDB.d.dll for the debug version).

In short, do the following:

1. Download the Berkeley DB library version 2.7.7 from the link below and extract it into the default directory (\db-2.7.7).
2. Rename the \db-2.7.7 directory to \BerkeleyDB.
3. Start VC++ and open the \BerkeleyDB\build_win32\Berkeley_DB.dsw workspace.
4. If you are using version 6.0, VC++ asks whether you want to convert the project to the new format; answer yes.
5. If it is not already selected, select the DB_DLL project in the workspace.
6. From the main menu of VC++, select Project and then Settings.
7. Select the Win32 Debug entry from the Settings For combo.
8. Select the Link tab and, in the Output file name field, change Debug/libdb.dll to Debug/BerkeleyDB.d.dll.
9. Select the Win32 Release entry from the Settings For combo.
10. Select the Link tab and, in the Output file name field, change Release/libdb.dll to Release/BerkeleyDB.dll.

Now you are ready to compile and link the Berkeley DB DLL. Be sure to place the resulting .lib and .dll files in the directories where VC++ looks for library and executable files.

As you can see on the sleepycat site, the Berkeley DB library is used by various open source projects and by proprietary software vendors, including Amazon.com, Cisco, Compaq, Motorola, Netscape, etc.

The CBase interface presented here is only a wrapper around the (excellent) Berkeley DB library but, in spite of its simplicity, I have used it to handle the central database of an intranet system used in Southern Europe by one of the biggest real estate franchisors (over 1 million records, approximately 2 GB of data, with an average of 300 transactions per day). Currently, I use the CBase interface in the crawlpaper project, to handle the local database of crawled urls.

As usual, ideas, suggestions and enhancements are always welcome.

Luca Piergentili
lpiergentili@yahoo.com
http://www.geocities.com/lpiergentili/