Reading XML Files with the XmlTextReader Class, Part 2

My previous article presented the XmlTextReader class and various code snippets that illustrated how to use it for sequentially reading XML documents or files, determining node types, and parsing for specific nodes and values. This week, I'll cover a few more issues regarding the XmlTextReader class, including ignoring whitespace, skipping to content, and reading attributes.

Reading All Nodes

As a refresher—and for those who haven't read the previous article—when reading an XML file using the XmlTextReader class, you are forced to read sequentially from the beginning of the file. Here's a sample code snippet where all of the nodes of a specified XML file are read, with each node's NodeType, Name, and Value property displayed:

// Simple loop to read all nodes of an XML file
try
{
  String* format = S"XmlNodeType::{0,-12}{1,-10}{2}";

  XmlTextReader* xmlreader = new XmlTextReader(fileName);
  while (xmlreader->Read())
  {
    String* out = String::Format(format,
                                 __box(xmlreader->NodeType),
                                 xmlreader->Name,
                                 xmlreader->Value);
    Console::WriteLine(out);
  }
}
catch (Exception* ex)
{
  Console::WriteLine(ex->Message);
}

As you can see, the code simply instantiates an XmlTextReader object (passing it the file name to be read) and then calls the XmlTextReader::Read method until a value of false is returned. As each node is read, the NodeType, Name, and Value properties are then accessed. Figure 1 shows a screen shot of this article's attached demo project, where all nodes are being displayed for the following file:

<?xml version="1.0" encoding="us-ascii"?>
<!-- Test comment -->
<emails>
  <email language="EN" encrypted="no">
    <from>Tom@ArcherConsultingGroup.com</from>
    <to>BillG@microsoft.com</to>
    <copies>
      <copy>Krista@ArcherConsultingGroup.com</copy>
    </copies>
    <subject>Buyout of Microsoft</subject>
    <message>Dear Bill...</message>
  </email>
</emails>

Figure 1: Displaying All Nodes of an XML File

As you can see in Figure 1, there are quite a few nodes that most applications will not care about the majority of the time. This includes things like the declaration node, comments, and whitespace. The next couple of sections show how to skip over these nodes so that your code can more efficiently concern itself only with nodes containing data (for example, element and text nodes).

Ignoring Whitespace

First, look at how to remove the whitespace nodes. The most obvious way of ignoring these nodes in your code is to insert a conditional statement in your read loop that ignores nodes of type whitespace:

// Ignore whitespace
if (xmlreader->NodeType != XmlNodeType::Whitespace)
{
  ...
}

Although this works, it's hardly efficient, especially with extremely large files, because every whitespace node is still returned to your loop only to be discarded. The XmlTextReader class therefore provides a property called XmlTextReader::WhitespaceHandling, of the enum type WhitespaceHandling, that lets you specify how whitespace is reported. The valid values are:

  • All
  • None
  • Significant

The All and None enum values specify that you want all whitespace nodes (the default) or no whitespace nodes, respectively. The Significant value specifies that only nodes of type SignificantWhitespace are returned; these occur only within an xml:space='preserve' scope.

To specify that your code does not need to read any whitespace nodes, simply set this property accordingly:

// Ignore whitespace
xmlreader->WhitespaceHandling = WhitespaceHandling::None;

// read loop

Figure 2 shows the demo application with whitespace nodes ignored.

Figure 2: Ignoring Whitespace Nodes
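
To make the three settings concrete, here is a minimal sketch of the filtering rule in standard C++. It is a stand-in written for illustration, not the .NET implementation: the NodeType and WhitespaceHandling enums are reduced copies of the real ones, and isReported, countReported, and demoWhitespaceModes are hypothetical helper names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the .NET enums; this is not the real API.
enum class NodeType { Element, Text, Whitespace, SignificantWhitespace, EndElement };
enum class WhitespaceHandling { All, Significant, None };

// Returns true if a reader in the given mode would report a node of this type.
bool isReported(NodeType type, WhitespaceHandling mode)
{
  switch (type)
  {
    case NodeType::Whitespace:
      return mode == WhitespaceHandling::All;
    case NodeType::SignificantWhitespace:
      return mode != WhitespaceHandling::None;
    default:
      return true;  // non-whitespace nodes are reported in every mode
  }
}

// Counts the nodes of a stream that survive the chosen mode.
int countReported(const std::vector<NodeType>& nodes, WhitespaceHandling mode)
{
  int count = 0;
  for (NodeType type : nodes)
    if (isReported(type, mode))
      ++count;
  return count;
}

// Usage: one five-node stream viewed under each of the three modes.
void demoWhitespaceModes()
{
  const std::vector<NodeType> nodes = {
    NodeType::Element, NodeType::Whitespace, NodeType::Text,
    NodeType::SignificantWhitespace, NodeType::EndElement };

  assert(countReported(nodes, WhitespaceHandling::All) == 5);
  assert(countReported(nodes, WhitespaceHandling::Significant) == 4);
  assert(countReported(nodes, WhitespaceHandling::None) == 3);
}
```

With the real class, the filtering happens inside the reader, so the read loop never sees the dropped nodes at all.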

Skipping to Content

Now, turn your attention to the issue of content versus non-content nodes. Content nodes are nodes of type Text (non-whitespace), CDATA, Element, EndElement, EntityReference, or EndEntity. Non-content nodes are everything else, including the declaration, processing instructions, the document type, comments, and whitespace. Once again, you could simply insert conditional logic into your read loop to ignore the unwanted nodes. However, a more efficient technique (and certainly a more maintainable one, should more non-content node types be defined in the future) is to use the XmlTextReader::MoveToContent method.

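The MoveToContent method checks whether the current node is a content node and, if it is not, skips ahead to the next content node or to the end of the file. One thing to understand is that MoveToContent doesn't actually read past the node it stops on; if the reader is already positioned on a content node, it stays there. Its sole responsibility is to skip non-content nodes, and it returns the XmlNodeType of the node it lands on, or XmlNodeType::None at the end of the file. Therefore, this method is used in conjunction with the XmlTextReader::Read method to read all content nodes as follows: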

// Read only content nodes
for (XmlNodeType nodeType = xmlreader->MoveToContent();
     XmlNodeType::None != nodeType;
     xmlreader->Read(), (nodeType = xmlreader->MoveToContent()))
{
  ...
}

Additionally, the following two generic methods (ReadFirstNode and ReadNextNode) allow the caller to specify whether to read the first/next content or non-content node:

// Generic functions to allow caller to specify whether the
// first/next node should be a content node
bool ReadFirstNode(XmlTextReader* xmlreader, bool moveToContent = true)
{
  // MoveToContent returns XmlNodeType::None once the end of the file is reached
  if (moveToContent)
    return (XmlNodeType::None != xmlreader->MoveToContent());
  else
    return xmlreader->Read();
}

bool ReadNextNode(XmlTextReader* xmlreader, bool moveToContent = true)
{
  // Step past the current node, then optionally skip any non-content nodes
  bool bMoreNodes = xmlreader->Read();
  if (bMoreNodes && moveToContent)
    bMoreNodes = (XmlNodeType::None != xmlreader->MoveToContent());

  return bMoreNodes;
}

By using these two functions, the previously presented read loop now can be updated as follows:

// Update of read loop to use ReadFirstNode and ReadNextNode helper
// methods
XmlTextReader* xmlreader = new XmlTextReader(fileName);
for (bool bMoreNodes = ReadFirstNode(xmlreader); 
     bMoreNodes;
     bMoreNodes = ReadNextNode(xmlreader))
{
  ...
}

Figure 3 shows the results of reading only content nodes that would be applicable for most purposes.

Figure 3: Reading Only "Content" Nodes
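
The interplay between Read and MoveToContent can be modeled in a few lines of standard C++. This is a sketch under stated assumptions, not the .NET class: MiniReader is a hypothetical forward-only cursor over an in-memory node list, NodeType is a reduced stand-in for XmlNodeType, and readContentNodes and demoMoveToContentStays are illustrative helper names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, reduced stand-in for XmlNodeType.
enum class NodeType { None, XmlDeclaration, Comment, Whitespace, Element, Text, EndElement };

// MiniReader is an illustrative forward-only cursor; it is NOT XmlTextReader.
class MiniReader
{
public:
  explicit MiniReader(std::vector<NodeType> nodes) : nodes_(std::move(nodes)) {}

  // Advances to the next node; returns false once the input is exhausted.
  bool read()
  {
    if (pos_ < count()) ++pos_;
    return pos_ < count();
  }

  // Type of the current node, or None before the first read and at the end.
  NodeType nodeType() const
  {
    return (pos_ >= 0 && pos_ < count()) ? nodes_[pos_] : NodeType::None;
  }

  // Stays put if the current node is content; otherwise skips forward.
  NodeType moveToContent()
  {
    if (pos_ < 0) read();  // initial state: step onto the first node
    while (pos_ < count() && !isContent(nodes_[pos_])) ++pos_;
    return nodeType();
  }

private:
  static bool isContent(NodeType t)
  {
    return t == NodeType::Element || t == NodeType::Text || t == NodeType::EndElement;
  }
  int count() const { return static_cast<int>(nodes_.size()); }

  std::vector<NodeType> nodes_;
  int pos_ = -1;
};

// Collects content nodes with the same loop shape as the article's read loop.
std::vector<NodeType> readContentNodes(MiniReader& reader)
{
  std::vector<NodeType> seen;
  for (NodeType t = reader.moveToContent();
       t != NodeType::None;
       reader.read(), t = reader.moveToContent())
  {
    seen.push_back(t);
  }
  return seen;
}

// Usage: a second moveToContent call does not advance past a content node.
void demoMoveToContentStays()
{
  MiniReader reader({ NodeType::Comment, NodeType::Element, NodeType::Text });
  assert(reader.moveToContent() == NodeType::Element);
  assert(reader.moveToContent() == NodeType::Element);
}
```

The model shows why the loop needs both calls: moveToContent alone would return the same node forever, and read alone would also stop on comments and whitespace.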

Reading Attributes

If you look at the XML file presented at the outset of this article, you'll note that the email element contains two attributes (language and encrypted). However, these items are not displayed in any of the demo application's screen captures. That's because the demo application displays the nodes returned by the Read method, and Read never stops on an attribute. Rather, attributes are reached through the element node that contains them. To determine whether an element node has attributes, you can use the XmlTextReader::HasAttributes property. If this Boolean property is true, you can then use the XmlTextReader::AttributeCount property and the XmlTextReader::MoveToAttribute method to move to each attribute defined for the node in turn. The following snippet illustrates how to read the attributes for an element node:

if (xmlreader->NodeType == XmlNodeType::Element
    && xmlreader->HasAttributes)
{
  for (int i = 0; i < xmlreader->AttributeCount; i++)
  {
    xmlreader->MoveToAttribute(i);
    // XmlTextReader::Name will contain the name of the attribute
    // XmlTextReader::Value will contain the value of the attribute
    ...
  }

  // Reposition the reader on the element that owns the attributes
  xmlreader->MoveToElement();
}

Be careful here, because when you read nodes and attributes using the XmlTextReader class, you are continually moving a cursor through the file in a forward-only manner. Therefore, depending on how you're reading your element nodes (I presented several alternatives in the previous article), make sure that you read an element's attributes while the reader is still positioned on that element, before you read the element's value. That's the order in which they appear in a well-formed XML file, and once the reader has moved on, you cannot go back for them.
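
The way the cursor moves onto an attribute and back can be sketched in standard C++ as well. Everything here is an assumption made for illustration: ElementCursor is a hypothetical model of a reader positioned on a single element, its member names only mirror the .NET ones (including the real MoveToElement method, which returns the reader to the owning element), and describeAttributes is an invented helper.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one attribute; not a .NET type.
struct Attribute
{
  std::string name;
  std::string value;
};

// ElementCursor models a reader sitting on one element. It is NOT XmlTextReader.
class ElementCursor
{
public:
  ElementCursor(std::string name, std::vector<Attribute> attrs)
    : name_(std::move(name)), attrs_(std::move(attrs)) {}

  bool hasAttributes() const { return !attrs_.empty(); }
  int attributeCount() const { return static_cast<int>(attrs_.size()); }

  // Positions the cursor on the i-th attribute of the element.
  void moveToAttribute(int i)
  {
    assert(i >= 0 && i < attributeCount());
    current_ = i;
  }

  // Returns the cursor to the element that owns the attributes.
  void moveToElement() { current_ = -1; }

  // name() and value() describe whatever the cursor currently points at.
  std::string name() const { return current_ < 0 ? name_ : attrs_[current_].name; }
  std::string value() const { return current_ < 0 ? std::string() : attrs_[current_].value; }

private:
  std::string name_;
  std::vector<Attribute> attrs_;
  int current_ = -1;  // -1 means "on the element itself"
};

// Formats the attributes as name=value pairs, then restores the cursor.
std::string describeAttributes(ElementCursor& cursor)
{
  std::string out;
  if (cursor.hasAttributes())
  {
    for (int i = 0; i < cursor.attributeCount(); i++)
    {
      cursor.moveToAttribute(i);
      if (!out.empty()) out += ' ';
      out += cursor.name() + "=" + cursor.value();
    }
    cursor.moveToElement();
  }
  return out;
}
```

Applied to the email element of the sample file, the helper would visit language first and encrypted second, and afterward name() reports the element again rather than the last attribute.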

Looking Ahead

The past three articles have covered the basics of creating XML files, reading nodes (including elements and text), parsing for specific node types and values, skipping non-content nodes, and reading attributes. Next week's article will pull all this together and present a step-by-step tutorial for writing a maintenance application using the XmlTextWriter and XmlTextReader classes.



About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.
