This series of articles is not a complete tutorial on SAX, but rather a kick-start. Familiarity with XML is required. In this first article, you will see what SAX is, how it is supported by the Microsoft XML Parser (MSXML), and how you can create a simple SAX application.
What Is SAX?
SAX stands for Simple API for XML and is a standard for event-based (or event-driven) parsing of XML documents. When the parser encounters certain constructs, it generates an event, and an event handler or callback function is executed to handle it. Examples of events are start-of-document, end-of-document, start-of-element, end-of-element, and so forth. The XML document is processed serially and, unlike with DOM (Document Object Model), it is not kept in memory. With DOM, a tree is created in memory as the file is read and, after that, the tree is traversed to create the internally used data structures. With SAX, you create the internal data structures on the fly, without keeping the same information twice in memory. Although this may not matter for small XML documents, it can become very important for large files, such as 1 MB or 10 MB.
The XML document is parsed by an XML reader. As this reader detects different constructs, it fires events. The events are handled by a consumer component, called the Content Handler. When parsing errors occur, another set of events is fired, and they are handled by a component called the Error Handler. Finally, notifications of DTD-specific events are handled by a DTD Handler.
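The reader/handler split described above can be illustrated with a small, portable sketch. It does not use MSXML: the `ContentHandler` struct, the `parseToy` function, and the `EventLog` consumer are hypothetical names invented for this illustration, and the toy reader understands only plain `<name>` and `</name>` tags and text (no attributes, comments, or entities).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a content handler. This is NOT the MSXML
// interface; it only mirrors the idea of receiving parsing events.
struct ContentHandler {
    virtual ~ContentHandler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Toy reader (assumption: input contains only <name>, </name>, and text).
// It walks the input once, front to back, and fires an event for each
// construct it meets. No tree is ever built.
void parseToy(const std::string& xml, ContentHandler& handler) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            const std::size_t close = xml.find('>', pos);
            if (close == std::string::npos) return;  // malformed input: stop
            const std::string tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/') {
                handler.endElement(tag.substr(1));
            } else {
                handler.startElement(tag);
            }
            pos = close + 1;
        } else {
            std::size_t next = xml.find('<', pos);
            if (next == std::string::npos) next = xml.size();
            handler.characters(xml.substr(pos, next - pos));
            pos = next;
        }
    }
}

// A consumer that builds its own data structure on the fly:
// here, simply a log of the events in the order they were fired.
struct EventLog : ContentHandler {
    std::vector<std::string> events;
    void startElement(const std::string& name) override {
        events.push_back("start:" + name);
    }
    void endElement(const std::string& name) override {
        events.push_back("end:" + name);
    }
    void characters(const std::string& text) override {
        events.push_back("chars:" + text);
    }
};
```

Passing an `EventLog` and the string `<a><b>hi</b></a>` to `parseToy` records the events in document order, which is the serial processing model that SAX is built around.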
Microsoft COM Implementation of SAX
ISAXXMLReader is the COM/C++ implementation of the XML reader. It has several methods, the most important being the ones that allow you to register an event handler.
|putContentHandler||Used for registering a content handler.|
|putErrorHandler||Used for registering an error handler.|
|putDTDHandler||Used for registering a DTD handler.|
None of the three components is required, but if a handler is not registered, its events will be ignored. Only one handler of each type can be registered at a time; this can be a problem in situations where you want specialized handlers for different parts of the XML document. This will be the focus of a later article. In the meantime, you should know that you can change the registered handlers during parsing, in which case subsequent events will automatically be handled by the new handler.
The DTD handler is beyond the scope of this article, and will be ignored.
The ISAXContentHandler interface receives notification about the content of the XML document. Its methods are listed below (table taken from MSDN):
|characters||Receives notification of character data.|
|endDocument||Receives notification of the end of a document.|
|startDocument||Receives notification of the beginning of a document.|
|endElement||Receives notification of the end of an element.|
|startElement||Receives notification of the beginning of an element.|
|ignorableWhitespace||Receives notification of ignorable white space in element content. This method is not called in the current (MSXML 4.0) implementation because the SAX2 implementation is non-validating.|
|endPrefixMapping||Indicates the end of a namespace prefix that maps to a URI.|
|startPrefixMapping||Indicates the beginning of a namespace prefix that maps to a URI.|
|processingInstruction||Receives notification of a processing instruction.|
|skippedEntity||Receives notification of a skipped entity.|
The events that will be covered in this article are startElement(), endElement(), and characters().
startElement() is called each time a new element is parsed. It supports up to three names for each element: namespace URI, local name, and QName (qualified XML name). In addition, if there are attributes attached to the element, they can be accessed via an interface called ISAXAttributes.
HRESULT startElement(
    [in] const wchar_t * pwchNamespaceUri,  // The namespace URI
    [in] int cchNamespaceUri,               // The length of the namespace URI
    [in] const wchar_t * pwchLocalName,     // The local name string
    [in] int cchLocalName,                  // The length of the local name
    [in] const wchar_t * pwchQName,         // The QName, with prefix, or an empty string
    [in] int cchQName,                      // The length of the QName
    [in] ISAXAttributes * pAttributes);     // The attributes attached to the element
For each startElement() event, the paired endElement() is called (regardless of whether the element contains data or is empty). The function has the same parameters, except for the attributes interface pointer.
HRESULT endElement(
    [in] const wchar_t * pwchNamespaceUri,  // The namespace URI
    [in] int cchNamespaceUri,               // The length of the namespace URI
    [in] const wchar_t * pwchLocalName,     // The local name string
    [in] int cchLocalName,                  // The length of the local name
    [in] const wchar_t * pwchQName,         // The QName, with prefix, or an empty string
    [in] int cchQName);                     // The length of the QName
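Because every startElement() event has a matching endElement(), a handler can track where it is in the document with a simple stack. The following is a minimal sketch under stated assumptions: `ElementStack`, `HResult`, `kOk`, and `kFail` are hypothetical stand-ins so that the code compiles without the Windows SDK, and only the local-name parameters of the real signatures are kept. Note that the names arrive as a pointer plus a length, so they are copied using the length rather than treated as null-terminated strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins for HRESULT, S_OK, and E_FAIL (assumption: not the real
// Windows definitions, just enough to keep the sketch portable).
using HResult = long;
constexpr HResult kOk = 0;
constexpr HResult kFail = -1;

// Hypothetical helper that a content handler could delegate to.
class ElementStack {
public:
    // Mirrors the local-name part of startElement(): pointer plus length.
    HResult startElement(const wchar_t* pwchLocalName, int cchLocalName) {
        if (pwchLocalName == nullptr || cchLocalName < 0) return kFail;
        // Copy exactly cchLocalName characters; the buffer is not
        // guaranteed to be null-terminated at that position.
        open_.emplace_back(pwchLocalName,
                           static_cast<std::size_t>(cchLocalName));
        if (open_.size() > maxDepth_) maxDepth_ = open_.size();
        return kOk;
    }

    // Mirrors the local-name part of endElement(). Fails if the element
    // being closed is not the one most recently opened.
    HResult endElement(const wchar_t* pwchLocalName, int cchLocalName) {
        if (pwchLocalName == nullptr || cchLocalName < 0) return kFail;
        const std::wstring name(pwchLocalName,
                                static_cast<std::size_t>(cchLocalName));
        if (open_.empty() || open_.back() != name) return kFail;
        open_.pop_back();
        return kOk;
    }

    std::size_t depth() const { return open_.size(); }
    std::size_t maxDepth() const { return maxDepth_; }

private:
    std::vector<std::wstring> open_;  // names of currently open elements
    std::size_t maxDepth_ = 0;        // deepest nesting seen so far
};
```

In a real handler, the same bookkeeping would sit inside the startElement() and endElement() implementations, with the namespace URI, QName, and attributes parameters handled alongside.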
Each time the parser encounters character data, it calls the characters() method.
HRESULT characters(
    [in] const wchar_t * pwchChars,  // The character data
    [in] int cchChars);              // The length of the character string
The first argument is a pointer to the chunk of data, and the second one is the length of the actual data that should be processed. The application should not access the data beyond that limit. This method is called for all character data, including white space. ignorableWhitespace() is meant to be called for characters that can be ignored, such as white space, but without a DTD or an XML schema, the parser cannot know which characters can be ignored. Because the parser does not validate, that method is never called.
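The length rule matters in practice because the buffer passed to characters() should not be assumed to end where the data ends. The sketch below shows one way to respect it; `TextCollector` and its members are hypothetical names for this illustration, and the class is a portable helper rather than an actual ISAXContentHandler implementation. Since ignorableWhitespace() is never called, it also includes a small check the application can use to discard whitespace-only text itself.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper that a content handler could delegate to.
class TextCollector {
public:
    // Mirrors the parameters of characters(): pointer plus length.
    // Only the first cchChars characters are read.
    void characters(const wchar_t* pwchChars, int cchChars) {
        if (pwchChars == nullptr || cchChars <= 0) return;
        text_.append(pwchChars, static_cast<std::size_t>(cchChars));
    }

    // Returns the text gathered so far and resets the collector,
    // for example when the enclosing element ends.
    std::wstring takeText() {
        std::wstring out;
        out.swap(text_);
        return out;
    }

    // True if the string holds nothing but spaces, tabs, and line breaks
    // (an empty string also counts as whitespace-only).
    static bool isAllWhitespace(const std::wstring& s) {
        return s.find_first_not_of(L" \t\r\n") == std::wstring::npos;
    }

private:
    std::wstring text_;  // text accumulated across characters() calls
};
```

Appending on every call, instead of overwriting, is a deliberate choice: it keeps the collected text intact even if the character data of one element is delivered in more than one chunk.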
The ISAXAttributes interface provides a means to access the attribute list of an element. Of the methods it offers, you will use these three:
|getLength||Returns the count of attributes.|
|getName||Returns all information related to the name of an attribute at a given index.|
|getValue||Returns the text value of an attribute.|
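A typical use of these three methods is a loop: ask for the count, then fetch the name and value at each index. The sketch below demonstrates that loop against a hypothetical `FakeAttributes` class, so it can be compiled without MSXML. The parameter lists of the fake methods are modeled on the MSXML documentation as best recalled and should be treated as an assumption to verify against the official reference; `HResult`, `kOk`, `kInvalidArg`, and `collectAttributes` are likewise invented for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using HResult = long;               // stand-in for HRESULT (assumption)
constexpr HResult kOk = 0;          // plays the role of S_OK
constexpr HResult kInvalidArg = -1; // plays the role of an error code

// Hypothetical in-memory attribute list. NOT the MSXML interface.
class FakeAttributes {
public:
    void add(std::wstring localName, std::wstring value) {
        items_.emplace_back(std::move(localName), std::move(value));
    }
    HResult getLength(int* pnLength) const {
        *pnLength = static_cast<int>(items_.size());
        return kOk;
    }
    // Returns URI, local name, and QName as pointer/length pairs.
    // The fake has no namespaces, so the URI is empty and the QName
    // equals the local name.
    HResult getName(int nIndex,
                    const wchar_t** ppwchUri, int* pcchUri,
                    const wchar_t** ppwchLocalName, int* pcchLocalName,
                    const wchar_t** ppwchQName, int* pcchQName) const {
        if (!valid(nIndex)) return kInvalidArg;
        const std::wstring& n = items_[static_cast<std::size_t>(nIndex)].first;
        *ppwchUri = L"";
        *pcchUri = 0;
        *ppwchLocalName = n.data();
        *pcchLocalName = static_cast<int>(n.size());
        *ppwchQName = n.data();
        *pcchQName = static_cast<int>(n.size());
        return kOk;
    }
    HResult getValue(int nIndex,
                     const wchar_t** ppwchValue, int* pcchValue) const {
        if (!valid(nIndex)) return kInvalidArg;
        const std::wstring& v = items_[static_cast<std::size_t>(nIndex)].second;
        *ppwchValue = v.data();
        *pcchValue = static_cast<int>(v.size());
        return kOk;
    }

private:
    bool valid(int i) const {
        return i >= 0 && static_cast<std::size_t>(i) < items_.size();
    }
    std::vector<std::pair<std::wstring, std::wstring>> items_;
};

// Copies every attribute into a map keyed by local name. Attributes
// whose name or value cannot be read are skipped.
template <typename Attributes>
std::map<std::wstring, std::wstring> collectAttributes(const Attributes& attrs) {
    std::map<std::wstring, std::wstring> out;
    int count = 0;
    if (attrs.getLength(&count) != kOk) return out;
    for (int i = 0; i < count; ++i) {
        const wchar_t *uri = nullptr, *local = nullptr, *qname = nullptr;
        const wchar_t* value = nullptr;
        int uriLen = 0, localLen = 0, qnameLen = 0, valueLen = 0;
        if (attrs.getName(i, &uri, &uriLen, &local, &localLen,
                          &qname, &qnameLen) != kOk) continue;
        if (attrs.getValue(i, &value, &valueLen) != kOk) continue;
        // Copy by length; the buffers belong to the attribute list.
        out.emplace(std::wstring(local, static_cast<std::size_t>(localLen)),
                    std::wstring(value, static_cast<std::size_t>(valueLen)));
    }
    return out;
}
```

With the real interface, the same loop would be written against the ISAXAttributes pointer received in startElement(), and the strings should be copied before the handler returns if they are needed later.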