The XML Parsing Article That Should (Not) Be Written!

Introduction

Over the years as a C++ software developer, I have occasionally had to maintain XML file formats for application project files. I found the DOM difficult to navigate and use. I have come across many articles and XML libraries which purport to be easy to use, but none is as easy as the internal XML library co-developed by my ex-coworkers, Srikumar Karaikudi Subramanian and Ali Akber Saifee. Srikumar wrote the first version, which could only read from an XML file, and Ali later added the node creation capability, which allowed the content to be saved to an XML file. However, that library is proprietary. After I left the company, I lost the use of a really easy-to-use XML library. Unlike many talented programmers out there, I am an idiot; I need an idiot-proof XML library. Too bad, LINQ to XML (XLinq) is not available in C++/CLI! I decided to reconstruct Srikumar’s and Ali’s XML library and make it open-source! I dedicate this article to Srikumar Karaikudi Subramanian and Ali Akber Saifee.

My terrible relationship with Ali Akber Saifee

Ali Akber Saifee and I are what we call “the world’s greatest arch-rivals”. While we worked together in the same company, I would take every opportunity to find ‘flaws’ in Ali and email him to expose some of his ‘problems’, carbon-copying everyone else. My arch-rival, as always, beat me with some of his best replies. Ali once offered me a chance to make peace and work together to conquer the world. But I rejected his offer, a thinly veiled plot to subjugate me! The world’s greatest arch-rivals can never work together!

Whenever I lose a friend on Facebook, I always check whether it was Ali who defriended me. The readers may ask why. Do you, the readers, know the ramifications of the world’s greatest arch-rivals defriending each other on Facebook? Answer: there can never be world peace! The readers may ask why the world’s greatest arch-rivals are on each other’s Facebook friend lists in the first place! Well, that is another story for another article on another day!

Why am I rewriting and promoting my arch-rival’s XML library? Before Ali says this, let me pre-empt him and say this myself: Imitation is the most sincere form of flattery. The truth is his XML library is really easy to use!

Some code examples first

<Books>
  <Book>
    <Price>12.990000</Price>
  </Book>
</Books>

To create the above XML, see the C++ code below.

Elmax::Element root;
root.SetDomDoc(pDoc); // An empty DOM doc is initialized beforehand.
root[L"Books"][L"Book"][L"Price"] = 12.99f;

The third line of code detects that the three elements do not exist, so the float assignment creates them, converts 12.99f to a string, and assigns that string to the Price element. To read the Price element, we just assign it to a float variable (see below).

Elmax::Element root;
root.SetDomDoc(pDoc); // An XML file is read into the DOM doc beforehand.
Elmax::Element elemPrice = root[L"Books"][L"Book"][L"Price"];
float price = 0.0f; // Fallback value if the element is missing.
if(elemPrice.Exists())
    price = elemPrice;

It is good practice to check if the price element exists, using Exists(), before reading it.
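
To make the "create on assignment" idea concrete, here is a minimal, self-contained sketch of how such a handle could behave. It is not Elmax's implementation: Node, Handle, GetFloat and GetString are hypothetical names invented for this illustration, and a plain in-memory tree stands in for the MSXML DOM. The subscript operator only records which names are missing; the nodes are created when a value is finally assigned.

```cpp
#include <cwchar>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical in-memory element; a stand-in for a real DOM node.
struct Node
{
    std::wstring text;
    std::map<std::wstring, std::unique_ptr<Node>> children;
};

// Hypothetical handle: remembers the deepest existing node plus the
// names that were requested but do not exist yet.
class Handle
{
public:
    explicit Handle(Node* node) : m_node(node) {}

    // Walk one level down. Nothing is created here; a missing child
    // is only recorded in m_missing.
    Handle operator[](const std::wstring& name) const
    {
        Handle next(*this);
        if (next.m_missing.empty())
        {
            auto it = m_node->children.find(name);
            if (it != m_node->children.end())
            {
                next.m_node = it->second.get();
                return next;
            }
        }
        next.m_missing.push_back(name);
        return next;
    }

    // True only if every name on the path was found.
    bool Exists() const { return m_missing.empty(); }

    // Assignment creates the missing nodes, then stores the value as text.
    Handle& operator=(float value)
    {
        for (const std::wstring& name : m_missing)
        {
            std::unique_ptr<Node>& child = m_node->children[name];
            child.reset(new Node());
            m_node = child.get();
        }
        m_missing.clear();

        wchar_t buf[64];
        std::swprintf(buf, 64, L"%f", value);
        m_node->text = buf;
        return *this;
    }

    // Read the text back as a float, or return the fallback if the
    // path does not exist.
    float GetFloat(float fallback) const
    {
        return Exists() ? std::wcstof(m_node->text.c_str(), nullptr)
                        : fallback;
    }

    // Read the raw text, or an empty string if the path does not exist.
    std::wstring GetString() const
    {
        return Exists() ? m_node->text : std::wstring();
    }

private:
    Node* m_node;                        // deepest node found so far
    std::vector<std::wstring> m_missing; // names still to be created
};
```

With a Node doc and a Handle root(&doc), the statement root[L"Books"][L"Book"][L"Price"] = 12.99f; creates all three nodes in one go, and Exists() on the same path returns true afterwards. The sketch uses an explicit GetFloat(fallback) getter instead of an implicit conversion to float, purely to keep the overloads simple.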

XML versus binary serialization

In this section, let us first look at the advantages of XML over binary serialization before we discuss Elmax. I will not discuss XML serialization because I am not familiar with it. Below is the simplified (version 1) file format for an online bookstore.

Version=1
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
Authors
  Author*
    Name
    AuthorID

The child elements are indented under their parent. Elements which can occur more than once are marked with an asterisk (*). The diagram below shows what the (version 1) binary serialization file format would typically look like.

Figure 1: Binary Version 1

Let’s say that in version 2 we add a Description under each Book and a Biography under each Author.

Version=2
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
    Description(new)
Authors
  Author*
    Name
    AuthorID
    Biography(new)

The diagram below shows the version 1 and version 2 binary serialization file formats. The new additions in version 2 are in lighter colors.

Figure 2: Version 2

Notice that versions 1 and 2 are binary incompatible? Below is how a plain binary file format (note: not binary serialization) would typically implement it: the new parts are appended at the end of the file.

Version=2
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
Authors
  Author*
    Name
    AuthorID
Description(new)*
Biography(new)*

Figure 3: Binary Version 2
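
The layout above can be sketched in a few lines of standard C++. This is only an illustration under my own assumptions: the byte layout (little-endian 32-bit counts, length-prefixed strings) and the names PutU32, PutString, Reader, WriteV2 and ReadV1 are all hypothetical, and only the Book part of the format is shown. The version 2 writer puts the version 1 block first and appends the descriptions at the back; the version 1 reader stops after the part it knows.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::vector<unsigned char> Buffer;

// Append a 32-bit value in little-endian byte order.
void PutU32(Buffer& buf, std::uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        buf.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
}

// Append a length-prefixed string.
void PutString(Buffer& buf, const std::string& s)
{
    PutU32(buf, static_cast<std::uint32_t>(s.size()));
    buf.insert(buf.end(), s.begin(), s.end());
}

// Sequential reader over a buffer; throws if the data is truncated.
class Reader
{
public:
    explicit Reader(const Buffer& buf) : m_buf(buf), m_pos(0) {}

    std::uint32_t GetU32()
    {
        Require(4);
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(m_buf[m_pos++]) << (8 * i);
        return v;
    }

    std::string GetString()
    {
        const std::uint32_t len = GetU32();
        Require(len);
        std::string s(m_buf.begin() + m_pos, m_buf.begin() + m_pos + len);
        m_pos += len;
        return s;
    }

private:
    void Require(std::size_t n) const
    {
        if (m_buf.size() - m_pos < n)
            throw std::runtime_error("truncated file");
    }

    const Buffer& m_buf;
    std::size_t m_pos;
};

struct Book
{
    std::string isbn;
    std::string title;
    std::string description; // new in version 2
};

// Version 2 writer: the version 1 block comes first, unchanged, and the
// new descriptions are appended at the back of the file.
Buffer WriteV2(const std::vector<Book>& books)
{
    Buffer buf;
    PutU32(buf, 2); // file format version
    PutU32(buf, static_cast<std::uint32_t>(books.size()));
    for (std::size_t i = 0; i < books.size(); ++i)
    {
        PutString(buf, books[i].isbn);
        PutString(buf, books[i].title);
    }
    for (std::size_t i = 0; i < books.size(); ++i)
        PutString(buf, books[i].description);
    return buf;
}

// Version 1 reader: knows nothing about descriptions, so it reads the
// block it understands and never looks at the rest of the buffer.
std::vector<Book> ReadV1(const Buffer& buf)
{
    Reader r(buf);
    r.GetU32(); // version number; newer files are accepted as-is
    const std::uint32_t count = r.GetU32();
    std::vector<Book> books;
    for (std::uint32_t i = 0; i < count; ++i)
    {
        Book b;
        b.isbn = r.GetString();
        b.title = r.GetString();
        books.push_back(b);
    }
    return books;
}
```

Feeding the output of WriteV2 into ReadV1 returns the books with their ISBN and title filled in and an empty description. Had the description been written inside each book record instead, the version 1 reader would have misread it as the next book's ISBN.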

In this way, version 1 of the application can still read the version 2 binary file while ignoring the new parts at the back of the file. If XML is used, then without any extra work, version 1 of the application can still read the version 2 XML file (forward compatible) while ignoring the new elements, provided that the original elements are not removed and their data types remain unchanged. And the version 2 application can read a version 1 XML file by using the old parsing code (backward compatible). The downside of XML is that parsing is slower than with a binary file format and the files take up more space, but XML files are self-describing.

Figure 4: XML Version 2
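
The reason XML gets this compatibility almost for free is that elements are looked up by name rather than by position. The sketch below illustrates the idea only: a std::map of element names to text stands in for one parsed Book element, and Fields, GetOr, ReadBookV1 and ReadBookV2 are hypothetical names of my own, not part of Elmax or MSXML.

```cpp
#include <map>
#include <string>

// Stand-in for the child elements of one parsed <Book>: name -> text.
typedef std::map<std::string, std::string> Fields;

// Look up an element by name; return the fallback if it is absent.
std::string GetOr(const Fields& f, const std::string& name,
                  const std::string& fallback)
{
    Fields::const_iterator it = f.find(name);
    return it != f.end() ? it->second : fallback;
}

struct BookV1
{
    std::string title;
    std::string price;
};

struct BookV2
{
    std::string title;
    std::string price;
    std::string desc; // new in version 2
};

// Version 1 reader: asks only for the names it knows, so extra
// elements in a newer file are simply never touched.
BookV1 ReadBookV1(const Fields& f)
{
    BookV1 b;
    b.title = GetOr(f, "Title", "");
    b.price = GetOr(f, "Price", "0.000000");
    return b;
}

// Version 2 reader: same code as version 1 plus one more lookup, with
// a default for older files that lack the element.
BookV2 ReadBookV2(const Fields& f)
{
    BookV2 b;
    b.title = GetOr(f, "Title", "");
    b.price = GetOr(f, "Price", "0.000000");
    b.desc = GetOr(f, "Desc", "");
    return b;
}
```

ReadBookV1 on a version 2 book ignores Desc (forward compatible), and ReadBookV2 on a version 1 book yields an empty description (backward compatible). Neither direction needs a version check, as long as existing elements keep their names and types.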

Below is an example of how I would implement the file format in XML, followed by a code example which creates the XML file.

<?xml version="1.0" encoding="UTF-8"?>
<All>
  <Version>1</Version>
  <Books>
    <Book ISBN="1111-1111-1111">
      <Title>How not to program!</Title>
      <Price>12.990000</Price>
      <Desc>Learn how not to program from the industry's
worst programmers! Contains lots of code examples which
programmers should avoid! Treat it as inverse education.</Desc>
      <AuthorID>111</AuthorID>
    </Book>
    <Book ISBN="2222-2222-2222">
      <Title>Caught with my pants down</Title>
      <Price>10.000000</Price>
      <Desc>Novel about extra-martial affairs</Desc>
      <AuthorID>111</AuthorID>
    </Book>
  </Books>
  <Authors>
    <Author Name="Wong Shao Voon" AuthorID="111">
      <Bio>World's most funny author!</Bio>
    </Author>
  </Authors>
</All>

#import <msxml6.dll>
using namespace MSXML2;

HRESULT CTryoutDlg::CreateAndInitDom(
    MSXML2::IXMLDOMDocumentPtr& pDoc)
{
    HRESULT hr = pDoc.CreateInstance(__uuidof(MSXML2::DOMDocument30));
    if (SUCCEEDED(hr))
    {
        // these methods should not fail so don't inspect result
        pDoc->async = VARIANT_FALSE;
        pDoc->validateOnParse = VARIANT_FALSE;
        pDoc->resolveExternals = VARIANT_FALSE;
        MSXML2::IXMLDOMProcessingInstructionPtr pi =
            pDoc->createProcessingInstruction
                (L"xml", L" version='1.0' encoding='UTF-8'");
        pDoc->appendChild(pi);
    }
    return hr;
}

bool CTryoutDlg::SaveXml(
    MSXML2::IXMLDOMDocumentPtr& pDoc,
    const std::wstring& strFilename)
{
    TCHAR szPath[MAX_PATH];

    // Bail out if the folder cannot be resolved, so that szPath
    // is never used uninitialized.
    if(FAILED(SHGetFolderPath(NULL,
        CSIDL_LOCAL_APPDATA|CSIDL_FLAG_CREATE,
        NULL,
        0,
        szPath)))
    {
        return false;
    }

    PathAppend(szPath, strFilename.c_str());

    variant_t varFile(szPath);
    return SUCCEEDED(pDoc->save(varFile));
}

void CTryoutDlg::TestWrite()
{
    MSXML2::IXMLDOMDocumentPtr pDoc;
    HRESULT hr = CreateAndInitDom(pDoc);
    if (SUCCEEDED(hr))
    {
        using namespace Elmax;
        using namespace std;
        Element root;
        root.SetConverter(NORMAL_CONV);
        root.SetDomDoc(pDoc);

        Element all = root[L"All"];
        all[L"Version"] = 1;
        Element books = all[L"Books"].CreateNew();
        Element book1 = books[L"Book"].CreateNew();
        book1.Attribute(L"ISBN") = L"1111-1111-1111";
        book1[L"Title"] = L"How not to program!";
        book1[L"Price"] = 12.99f;
        // Adjacent string literals are concatenated by the compiler.
        book1[L"Desc"] = L"Learn how not to program from the "
            L"industry's worst programmers! Contains lots of code examples "
            L"which programmers should avoid! Treat it as inverse education.";
        book1[L"AuthorID"] = 111;

        Element book2 = books[L"Book"].CreateNew();
        book2.Attribute(L"ISBN") = L"2222-2222-2222";
        book2[L"Title"] = L"Caught with my pants down";
        book2[L"Price"] = 10.00f;
        book2[L"Desc"] = L"Novel about extra-martial affairs";
        book2[L"AuthorID"] = 111;

        Element authors = all[L"Authors"].CreateNew();
        Element author = authors[L"Author"].CreateNew();
        author.Attribute(L"Name") = L"Wong Shao Voon";
        author.Attribute(L"AuthorID") = 111;
        author[L"Bio"] = L"World's most funny author!";

        std::wstring strFilename = L"Books.xml";
        SaveXml(pDoc, strFilename);
    }
}

Here is the code to read the XML which was saved by the previous code snippet. A helper class (DebugPrint) and some helper methods (CreateAndLoadXml and DeleteFile) are omitted to focus on the relevant code. They can be found in the Tryout project in the source code download.

void CTryoutDlg::TestRead()
{
    DebugPrint dp;
    MSXML2::IXMLDOMDocumentPtr pDoc;
    std::wstring strFilename = L"Books.xml";
    HRESULT hr = CreateAndLoadXml(pDoc, strFilename);
    if (SUCCEEDED(hr))
    {
        using namespace Elmax;
        using namespace std;
        Element root;
        root.SetConverter(NORMAL_CONV);
        root.SetDomDoc(pDoc);

        Element all = root[L"All"];
        if(all.Exists()==false)
        {
            dp.Print(L"Error: root does not exist!");
            return;
        }
        dp.Print(L"Version : {0}\n\n", all[L"Version"].GetInt32(0));

        dp.Print(L"Books\n");
        dp.Print(L"=====\n");
        Element books = all[L"Books"];
        if(books.Exists())
        {
            Element::collection_t vecBooks =
                books.GetCollection(L"Book");
            for(size_t i=0; i<vecBooks.size(); ++i)
            {
                dp.Print(L"ISBN: {0}\n",
                    vecBooks[i].Attribute(L"ISBN").GetString(L"Error"));
                dp.Print(L"Title: {0}\n",
                    vecBooks[i][L"Title"].GetString(L"Error"));
                dp.Print(L"Price: {0}\n",
                    vecBooks[i][L"Price"].GetFloat(0.0f));
                dp.Print(L"Desc: {0}\n",
                    vecBooks[i][L"Desc"].GetString(L"Error"));
                dp.Print(L"AuthorID: {0}\n\n",
                    vecBooks[i][L"AuthorID"].GetInt32(-1));
            }
        }

        dp.Print(L"Authors\n");
        dp.Print(L"=======\n");
        Element authors = all[L"Authors"];
        if(authors.Exists())
        {
            Element::collection_t vecAuthors =
                authors.GetCollection(L"Author");
            for(size_t i=0; i<vecAuthors.size(); ++i)
            {
                dp.Print(L"Name: {0}\n",
                    vecAuthors[i].Attribute(L"Name")
                        .GetString(L"Error"));
                dp.Print(L"AuthorID: {0}\n",
                    vecAuthors[i].Attribute(L"AuthorID").GetInt32(-1));
                dp.Print(L"Bio: {0}\n\n",
                    vecAuthors[i][L"Bio"].GetString(L"Error: No bio!"));
            }
        }
    }
    DeleteFile(strFilename);
}

This is the output after the XML is read.

Version : 1

Books
=====
ISBN: 1111-1111-1111
Title: How not to program!
Price: 12.990000
Desc: Learn how not to program from the industry's worst programmers! Contains lots of code examples which programmers should avoid! Treat it as inverse education.
AuthorID: 111

ISBN: 2222-2222-2222
Title: Caught with my pants down
Price: 10.000000
Desc: Novel about extra-martial affairs
AuthorID: 111

Authors
=======
Name: Wong Shao Voon
AuthorID: 111
Bio: World's most funny author!
