Libxml2: Everything You Need in an XML Library

Libxml2 is an XML parser and toolkit written in C and freely available for integration into your apps under the easy-to-digest MIT License. Libxml2 was originally developed for the Gnome project, but it has no dependencies on Gnome or even on the Linux platform. The library is known to be highly portable and is in use by many teams on Linux, Unix, Win32/Win64, Cygwin, MacOS, MacOS/X, and most other platforms, including embedded systems. Even though Libxml2 is written in C, there is an abundance of language bindings available, including bindings for Python, Perl, C++, C#, PHP, Pascal, Ruby, and Tcl.

As you know, XML itself is a metalanguage used to design markup languages. That is to say, it defines a grammar in which semantics and structure are added to the content using extra "markup" information enclosed between angle brackets "<" and ">". HTML is certainly the best-known markup language. HTML 4.0 itself is defined by an SGML DTD, and its XML reformulation, XHTML 1.0, is fully specified by an XML Document Type Definition (DTD).

Of course, just saying something is an XML parser doesn't imply all that much. You have to enumerate both how much and what you're going to support. As such, Libxml2 implements a number of existing standards related to markup languages. I won't bore you with the whole laundry list, but the major ones are: XML standard 1.0 including Namespaces, Base, URI, XPointer, XInclude, XPath, HTML 4.0 parser, Canonical XML 1.0, XML Schemas Part 2, xml:id, and XML Catalog working drafts. In most cases, Libxml2 tries to implement the specifications in a relatively strict way. Libxml2 has passed all 1800+ tests from the OASIS XML Testsuite.

XML documents aren't always sitting around on your local filesystem for perusal, so Libxml2 includes basic FTP and HTTP clients so you don't have to write an extra layer of code just to fetch your documents. Libxml2 exports Push (progressive) and Pull (blocking) type parser interfaces for both XML and HTML. Libxml2 can do DTD validation at parse time, on an already parsed document instance, or against an arbitrary DTD. Sister projects provide some additional goodies such as XSLT 1.0 (from libxslt), and a DOM2 implementation is also in the works.

Let's Get This Parser Started!

Although you are certainly welcome to rebuild from source to suit your own project's requirements, I found the simplest way to get parsing was through Igor Zlatkovic's dedicated libxml Win32 resource page. In addition to DLL downloads, you will also find C#, Perl (Apache), and Pascal language bindings at the bottom of Zlatkovic's page. Zlatkovic has packaged Libxml2 and related tools so you can simply take the subset you really need:

  • libxml2, the XML parser and processor
  • libxslt, the XSL and EXSL Transformations processor
  • xmlsec, the XMLSec and XMLDSig processor
  • xsldbg, the XSL Transformations debugger
  • openssl, the general crypto toolkit
  • iconv, the character encoding toolkit
  • zlib, the compression toolkit

Figure 1: libxml package dependencies

For example, libxml depends on iconv and zlib. If you run the included xmllint.exe or xmlcatalog.exe without them, you will quickly discover that you need iconv.dll (as promised in the dependency chart). Be advised, however, that Zlatkovic's downloads don't include the sample programs and data files.

For purposes of this article, I used Zlatkovic's distribution of libxml2 2.6.30+, iconv 1.9.2, and zlib 1.2.3.

How to Parse a Tree with Libxml2

You'll only look at the Document Object Model (DOM) parser because it is inherently more complex than the Simple API for XML (SAX). As you recall, DOM gives you complete tree navigation at the cost of holding the whole XML document in memory. If you don't need to navigate or modify a tree and can handle the content as it's being parsed, SAX can carry significantly less overhead.



Figure 2: DOM tree example

Specifically, you'll dissect the tree1.c example program and identify some common programming paradigms used. The purpose of this program is to parse a file to a tree, use xmlDocGetRootElement() to get the root element, and then walk the document and print all the element names in document order. This is about the easiest non-trivial sort of thing you can do in XML. For simplicity's sake, you'll assume that the XML file you want to parse is the first argument on the command line and output will go to stdout (console). Program listing follows:

 1 #include <stdio.h>
 2 #include <libxml/parser.h>
 3 #include <libxml/tree.h>
 4 #include <stdlib.h>   /* for exit() */
 5 static void print_element_names(xmlNode * a_node)
 6 {
 7    xmlNode *cur_node = NULL;
 8
 9    for (cur_node = a_node; cur_node; cur_node =
         cur_node->next) {
10       if (cur_node->type == XML_ELEMENT_NODE) {
11          printf("node type: Element, name: %s\n",
               cur_node->name);
12       }
13       print_element_names(cur_node->children);
14    }
15 }
16
17 int main(int argc, char **argv)
18 {
19    xmlDoc *doc = NULL;
20    xmlNode *root_element = NULL;
21
22    if (argc != 2)  return(1);
23
24    LIBXML_TEST_VERSION    // Macro to check API for match with
                             // the DLL we are using
25
26    /*parse the file and get the DOM */
27    if ((doc = xmlReadFile(argv[1], NULL, 0)) == NULL) {
28       printf("error: could not parse file %s\n", argv[1]);
29       exit(-1);
30    }
31
32    /*Get the root element node */
33    root_element = xmlDocGetRootElement(doc);
34    print_element_names(root_element);
35    xmlFreeDoc(doc);       // free document
36    xmlCleanupParser();    // Free globals
37    return 0;
38 }

To run the program, you assume that libxml2.dll, iconv.dll, and zlib.dll can all be found in the path or the current directory. To compile your test program, you use the following command line:

cl tree1.c /MD /Id:\iconv-1.9.2.win32\include
               /Id:\libxml2-2.6.30+.win32\include /link
               /libpath:d:\libxml2-2.6.30+.win32\lib libxml2.lib

You feed it the test data file shown below as input:

<!DOCTYPE doc [
<!ELEMENT doc (src | dest)*>
<!ELEMENT src EMPTY>
<!ELEMENT dest EMPTY>
<!ATTLIST src ref IDREF #IMPLIED>
<!ATTLIST dest id ID #IMPLIED>
]>
<doc>
   <src ref="foo"/>
   <dest id="foo"/>
   <src ref="foo"/>
</doc>

This yields the following output:

node type: Element, name: doc
node type: Element, name: src
node type: Element, name: dest
node type: Element, name: src

The program starts on Line #17 in the main() function. The LIBXML_TEST_VERSION macro is a safety check to make sure the libxml2.dll you are using at run time is in fact compatible with the version you compiled against (in other words, the headers you used).

The actual parsing takes place inside xmlReadFile() on Line #27, which returns an xmlDoc object if successful and NULL otherwise. The first parameter is the local filename or an HTTP document path (URL). The second parameter names the document encoding; passing NULL, as the listing does, lets the parser detect it. The last parameter is a set of option flags combined with bitwise OR (0 selects none), drawn from the following:

  • XML_PARSE_RECOVER: Recover on errors
  • XML_PARSE_NOENT: Substitute entities
  • XML_PARSE_DTDLOAD: Load the external subset
  • XML_PARSE_DTDATTR: Default DTD attributes
  • XML_PARSE_DTDVALID: Validate with the DTD
  • XML_PARSE_NOERROR: Suppress error reports
  • XML_PARSE_NOWARNING: Suppress warning reports
  • XML_PARSE_PEDANTIC: Pedantic error reporting
  • XML_PARSE_NOBLANKS: Remove blank nodes
  • XML_PARSE_SAX1: Use the SAX1 interface internally
  • XML_PARSE_XINCLUDE: Implement XInclude substitution
  • XML_PARSE_NONET: Forbid network access
  • XML_PARSE_NODICT: Do not reuse the context dictionary
  • XML_PARSE_NSCLEAN: Remove redundant namespace declarations
  • XML_PARSE_NOCDATA: Merge CDATA as text nodes
  • XML_PARSE_NOXINCNODE: Do not generate XINCLUDE START/END nodes
  • XML_PARSE_COMPACT: Compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree)

Most notably, XML_PARSE_COMPACT can make up for some of the memory overhead that DOM parsers are known for, provided you need the XML tree for read-only purposes. Note also that you can turn on DTD validation at this point.

Next, on Line #33, you call xmlDocGetRootElement(doc), which, as you would expect, gives you the top of the tree. You then can traverse it easily using the recursive print_element_names() function. Inside print_element_names(), Lines #5-15, there are only two things to do for each node in the sibling list: if the node is an XML_ELEMENT_NODE, you print its name; then, whatever its type, you call yourself again, this time with the children of the current node. There are actually 21 different node types, so it's worth seeing the complete list of choices:

XML_ELEMENT_NODE = 1
XML_ATTRIBUTE_NODE = 2
XML_TEXT_NODE = 3
XML_CDATA_SECTION_NODE = 4
XML_ENTITY_REF_NODE = 5
XML_ENTITY_NODE = 6
XML_PI_NODE = 7
XML_COMMENT_NODE = 8
XML_DOCUMENT_NODE = 9
XML_DOCUMENT_TYPE_NODE = 10
XML_DOCUMENT_FRAG_NODE = 11
XML_NOTATION_NODE = 12
XML_HTML_DOCUMENT_NODE = 13
XML_DTD_NODE = 14
XML_ELEMENT_DECL = 15
XML_ATTRIBUTE_DECL = 16
XML_ENTITY_DECL = 17
XML_NAMESPACE_DECL = 18
XML_XINCLUDE_START = 19
XML_XINCLUDE_END = 20
XML_DOCB_DOCUMENT_NODE = 21

In addition to the "type" value, the xmlNode contains all the critical information about each node in the tree, including navigational pointers (next, prev, parent, children), pointers to the namespace, the properties list, the node name, and of course the content itself (if any).

Conclusion

In an introductory article such as this, you can only hope to scratch the surface of what a versatile tool such as libxml2 can do for you. Libxml2 supports DTDs, Schemas, XPath, internationalization, and lots more that can make your application XML standards-compliant. As mentioned before, libxml2 comes with many language bindings so you can work in C, C++, C#, Python, Perl, or whatever you need to get the job done. Best of all, it's freely available to integrate into your apps today.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.


