STL and BOOST Parsing Iterators

Introduction

Suppose we need to implement some protocol such as Radius or IPDR Streaming. The implementation will probably contain several modules; a transport module and a protocol parser are among others. The transport module gets messages (as plain byte buffers) from outer sources (via TCP/IP sockets, for example); the protocol parser decodes incoming byte buffers into protocol specific messages.

It is good practice to make these modules as general as possible to reuse them for different protocol s as it is usually done for transport modules. The parser is often considered as entirely protocol specific, i.e. the parsing procedure is based on protocol specific knowledge how a message should be de-serialized. But abstract presentation of a message as a sequence of elements (fields) helps us in making the parsing procedure general. In terms of elements/sequence the parsing process can be described as following steps:

  1. Stand at the message beginning
  2. Parse current element
  3. Find out (from the parsed element) how to jump to next element
  4. Jump to next element
  5. Repeat steps 2-4 until end of the message is reached (or an error occurred)

These steps obviously can be re-phased in term of iterators:

   protocol::message p;
   protocol::message::iterator it=p.begin();
   while(it!=p.end()){*it++;}

Here it=p.begin() stands for “Stand at the payload beginning”, *it – “Parse current element”, it++ -” Jump to next element” and it!=p.end() – “until end of the message is reached”.

This form has all protocol specific knowledge concentrated within the iterator implementation and leaves the cycle protocol agnostic. Such de-copulation between protocol details and parsing process has certain advantages: first, the same code can be reused for different protocols (all that is needed is to choose the appropriate iterator type); second, the code can work with standard algorithms from STL or BOOST libraries.

The first required step is implementation of the parsing iterator. In this article you will find out the shortest way to do so.

Let’s formulate our goals:

Goals

To implement a protocol specific parsing iterator that allows the representation of the parsing process in protocol agnostic form as:

   protocol::message p;
   protocol::message::iterator it=p.begin ();
   while(it!=p.end()){*it++;}

and (for full message parsing)

   std::for_each(p.begin (), p.end (),  parse_opt);

and (for partial message parsing, useful for a proxy server implementation)

   protocol::message::iterator it=std::find_if(p.begin (),
    p.end (), till_user_id_not_found());

We need to choose appropriate form of iterator to fit our requirements. Also we need to choose presentation of parsed message fields.

Parsed Field Presentation

Messages come from outer sources in serialized form – as plain binary buffers. The message format is protocol specific and we want to hide it from our application, hence to work with parsed data the application needs a protocol independent presentation of message content. Attribute-Value Pair (AVP) is a convenient protocol independent presentation for message fields. We use the following AVP structure in our example:

   struct avp
   {
     avp() : data(NULL), length(0) {}
     unsigned int id;
     const boost::uint8_t* data;
     size_t length;
   };

Here the attribute is identified by its ID; value is represented as a memory chunk – pointer and size. This is the most common form of value representation. The application can then convert the memory chunk to required type, for example integer or string.

What Kind of Iterator Is Needed?

We need to read elements (fields) from sequence (message). Thus we need to choose appropriate form of read access iterator from the following categories defined by STL:

  • Input iterators – can move only forward, have read only access, have special end-of-sequence mark, referred as “last” element. All iterators can be tested (==) against this “last” element. Required operators are: it++, ++it, *it, it->member, it1!=it2, it1==it2. One interesting detail: input iterator can read elements only once, it does not guarantee that subsequent call to *it operator will return the same value as first time.
  • Forward Iterators – can move only forward, have read/write access, have “last” element. Unlike input iterators any two iterators can be tested (==) against each other. The same element can be accessed more than once. Required operators are: it++, ++it, *it, it->member, it1!=it2, it1==it2
  • Bidirectional Iterators – are like Forward Iterators with additional ability move backward. Additional required operators are: it--, --it
  • Random Access Iterators – are like Bidirectional Iterators that can move to any element of sequence – they support full iterator arithmetic (like “pointer arithmetic”). Additional required operators are: it+=n, it-=n, it1=it2+n, it1=it2-n, it1=n+it2, n=it2-it3, it1>it2, it1>=it2, it1<it2, it1<=it2
  • These categories constitute a hierarchy chart - code that works with upper (weaker) category must work with lower (stronger) category (but not necessary vise versa).

    We need to select the weakest (the simplest to implement) category that meets our goals. This is obviously the Input iterator category. In particular both our algorithms std::for_each and std::find_if works with Input iterators.

    After the appropriate iterator category was found we can build the required iterator interface.

    Iterator Interface

    The required iterator interface looks like:

       struct iterator
       {
          iterator();
          iterator(const iterator& other);
          iterator& operator=(const iterator& other);
          bool operator==(const iterator& other) const;
          bool operator!=(const iterator& other) const;
          avp operator*() const;
          const avp* operator->() const;
          iterator& operator ++();
          iterator operator ++(int);
       };
    

    The default constructor iterator() creates empty iterator; the copy constructor iterator(other) creates copy of the given iterator.

    The operator==() compares the given iterator with the EOF mark.

    The de-reference avp operator*() returns the parsed current field. The operator returns the parsed result as avp object. Generally, it is more efficient to return const avp& than avp object. But we chose the last option to demonstrate some techniques used in the BOOST library.

    Unlike input iterator the forward Iterator's *it is expected to be called more than once. Hence it is more important to forward iterators than to input iterators to use efficient form of the de-reference operator.

    Note that operator*() is constant. The iterator is considered as generalized pointer and de-reference does not change its state. The iterator state is changed only when it moves to another element.

    The operator->() returns pointer to the parsed field. If the de-reference operator*() returned const avp&, the operator->() would be implemented as:

       avp* operator->() const
       {
          return &**this;
       }
    

    But in our case it is impossibe because the de-reference operator*() returns avp object and &**this; will points to temporary stack object.

    The operator++() moves to the next field.

    The operator!=() usually is implemented through the operator==() as:

       bool operator!=(const iterator& other) const
       {
          return !(*this==other);
       }
    

    And the postfix form of operator++(int) usually is implemented through the prefix form as:

       iterator operator ++(int)
       {
          iterator tmp=*this;
          ++*this;
          return tmp;
       }
    

    This example illustrates why the prefix form ++it is usually preferable over the postfix form it++: it is obviously more efficient.

    After the required interface is set we are ready to implement it.

    Implementation

    For our modeling example we chose simple message format - each incoming message consist of header and payload; the payload consists of sequential fields. Each field in its turn is composed from field header and field payload. The parsing process retrieves field payload and convert it to AVP.

    The field header contains information about its size, thus have parsed one chunk the parser knows how to move to next one. The message header contains information about total message payload size. Hence the parser knows when it reaches end of the payload.

    We use the following data format:

       #pragma pack (push ,1)
    
       /** Message binary layout
       */
       struct message_header
       {
          // full message lenght including this header
          boost::uint8_t length;
          // data that follows the header
          boost::uint8_t data[0];
       };
    
       /** Field binary layout
       */
       struct field_header
       {
          // full chunk lenght including this header
          boost::uint8_t length;
          // attribute id
          boost::uint8_t id;
          // data that follow the header
          boost::uint8_t data[0];
       };
    
       #pragma pack (pop)
    

    #pragma pack:

    Binary layouts must be aligned by 1 byte. To avoid unpleasant surprised it is good practice to surround binary layout structures with #pragma pack even if the structures are naturally aligned.

    boost::uint8_t:

    Usage of built-in C types as int, short or long within binary layout structures makes these structures not portable because the length of the types varies from platform to platform. Portable binary layout structures require exact-width integer types such as uint32_t, for example. The 1999 C Standard header <stdint.h> defines such types. The BOOST library <boost/cstdint.hpp> header wraps the types in namespace boost.

    The complete iterator implementation looks like:

       struct iterator
       {
    
          /** Default ctor creates EOF itearor:
          Points to NULL, zero bytes to parse
          */
          iterator() : m_data (NULL), m_remain(0) {}
    
          /** Copy ctor */
          iterator(const iterator& other) :
           m_data(other.m_data), m_remain(other.m_remain){}
    
          /** Assign operator */
          iterator& operator=(const iterator& other)
          {
             // do not assign to itself 
             if(&other!=this)
             {
                m_data=other.m_data;
                m_remain=other.m_remain;
             }
               return *this;
          }
    
          /** Comparison with EOF */
          bool operator==(const iterator& other) const
          {
             // compares only how many bytes
             // remain to parse 
             return m_remain==other.m_remain;
          }
    
          /** Comparison with EOF */
          bool operator!=(const iterator& other) const
          {
             // reuse
             return !(*this==other);
          }
    
          /** De-reference operator:
          converts current field to AVP
          */
          avp operator*() const
          {
             return current_avp();
          }
    
          /** Arrow operator:
          converts current field to AVP
          */
          const avp* operator->() const
          {
             // must return pointer to not local
             // variable 
             m_ret=m_impl.current_avp();
             return &ret;
    
             // will crash, of course
             //return &**this;
          }
    
          /** Move to next, prefix form ++it */
          iterator& operator ++()
          {
             // check conditions
             // if an error occured
             // mark as EOF
             if(!hdr()->length) m_remain=0;
             if(hdr()->length>m_remain) m_remain=0;
             if(hdr()->length<sizeof(field_header)) m_remain=0;
    
             // if not EOF
             // move to next
             if(m_remain)
             {
                m_remain-=hdr()->length;
                // data update must be last because
                // changes the header
                m_data+=hdr()->length;
             }
             return *this;
          }
    
          /** Move to next, postfix form it++*/
          iterator operator ++(int)
          {
             // store old state, move
             // one step and return stored state
             iterator tmp=*this;
             ++*this;
             return tmp;
          }
    
       private:
          // for iterator(const boost::uint8_t* data, size_t length)
          friend class message;
    
          /** Ctor accept data and size.
          The ctor can be called only within class message.
          */
          iterator(const boost::uint8_t* data, size_t length)
             : m_data(data), m_remain(length)
          {
          }
    
          /** Reprsent current data pointer as field poiner
          */
          const field_header* hdr() const
          {
             return reinterpret_cast<const field_header*>(m_data);
          }
    
          /** Converts current field to AVP
          */
          avp current_avp() const
          {
             // special value to return
             // when de-reference is not deined
             static const avp invalid;
    
             // check error conditions
             if(!m_data || hdr()->length>m_remain ||
                hdr()->length<sizeof(field_header))
             return invalid;
    
             // create avp from buffer
             avp ret;
             ret.data=hdr()->data;
             ret.id=hdr()->id;
             ret.length=hdr()->length-sizeof(field_header);
             return ret;
          }
    
          /** current data pointer*/
          const boost::uint8_t* m_data;
          /** how many bytes remain to parse */
          size_t m_remain;
          /** temporary variable*/
          mutable avp m_tmp;
       };
    

    The implementation must be accompanied by two methods message::begin() and message::end(). Their implementation looks like:

       iterator message::begin()
       {
          // creates iterator that points on the payload
          // beginning
          return iterator(m_data+sizeof(message_header), m_length);
       }
    
       iterator message::end()
       {
          // creates iterator that has eof mark
          return iterator();
       }
    

    Note that the begin() method uses private constructor iterator(const boost::uint8_t* data, size_t length).

    The implementation will work perfectly with hand-made cycles like:

       protocol::message p;
       protocol::message::iterator it=p.begin();
       while(it!=p.end()){*it++;}
    

    But it won't compile with STL algorithms like:

       std::for_each(p.begin(), p.end(), parse_opt);
    

    STL Algorithms

    The compiler complains that our iterator class does not have 'iterator_category' type defined. STL algorithms use this type to select appropriate algorithm implementation. For example, STL binary search algorithm can have different implementations for random access iterators and forward iterators. In the first case it will be real binary search, in the second it will be linear search actually. To choose appropriate implementation at compilation time the STL uses template specializations, one for random access iterator category, the second for forward iterator category.

    To solve the problem we can add to our class: typedef std::input_iterator_tag iterator_category;

    But it is not only required definition. It is better idea to derive our iterator from std::iterator. Have specifid required we will get all iterator traits automatically.

    std::iterator is a template defined as

       template<class _Category,
        class _Ty,
        class _Diff = ptrdiff_t,
        class _Pointer = _Ty *,
        class _Reference = _Ty&>
            struct iterator
    

    The first template parameter _Category defines template category such as std::input_iterator_tag, std::ouput_iterator_tag, std::forward_iterator_tag, std::bidireactional_iterator_tag or std::random_access_iterator_tag. The second template parameter _Ty defines type the iterator works with. The third template parameter _Diff is optional and defines pointer difference type. The fourth template parameter _Pointer is optional and can be derived from _Ty. It defines type returned by operator ->(). The last template parameter _Reference is optional and can be derived from _Ty. It defines type returned by operator*(). Because our operator*() returns type avp rather than avp&, we need to re-define this last type.

    The iterator implementation will look like:

        struct std_iterator :
        public std::iterator<std::input_iterator_tag,
          avp, ptrdiff_t, avp*, avp>
        {...};
    

    Now our iterator works perfectly not only with hand-made cycles like:

       protocol::message p;
       protocol::message::iterator it=p.begin ();
       while(it!=p.end()){*it++;}
    

    But also with STL algorithms like:

       std::for_each(p.begin(), p.end(), parse_opt);
    

    Boost Iterator Facade

    The iterator implementation has required design and realization of an interface that consists of 7 functions. Three of them (increment, compare and de-reference) perform real job, they are "essential" methods. The others are merely "decoration" needed by the STL algorithms.

    It is worth to have an iterator "molding form" class that defines "decoration" methods through "essential". Have implemented the "essential" methods we will get the rest automatically.

    Fortunately the BOOST library provides such class, this is boost::iterator_facade from #include <boost/iterator/iterator_facade.hpp>.

    The class is defined as a template:

       template <class _Derived,
        class _Value,
        class _CategoryOrTraversal,
        class _Reference  = Value&,
        class _Difference = ptrdiff_t >
          class iterator_facade;
    

    The first template parameter _Derived must be our iterator class because the iterator_facade uses the CRTP (Curiously Recurring Template Pattern) pattern.

    The _Value parameter determines the type we are iterating over. In our case it is avp.

    The _CategoryOrTraversal parameter defines iterator category. The BOOST library uses iterator classification more elaborated than STL. Possible categories are: incrementable_traversal_tag, single_pass_traversal_tag, forward_traversal_tag, bidirectional_traversal_tag. Likely STL categories, the BOOST categories constitute a hierarchy chart - code that works with upper (weaker) category must work with lower (stronger) category (but not necessary vise versa). The weakest category that fits our requirements is single_pass_traversal_tag. Note that the _CategoryOrTraversal supports "old-style" STL category as well.

    The _Reference parameter is optional and can be derived from _Value. It defines type returned by operator*(). Because our operator*() returns type avp rather than avp&, we need re-define this last type.

    The last _Difference parameter is optional. It determines how the distance between two iterators will be measured and will also be the same as std::iterator_traits<node_iterator>::difference_type. The default for _Difference is std::ptrdiff_t. It is an appropriate type, so the last parameter can be omitted.

    As a result the whole iterator implementation looks like:

       struct iterator :
        public boost::iterator_facade< boost_iterator,
        avp, boost::single_pass_traversal_tag,
        avp>
       {
          /** Default ctor creates EOF itearor:
          Points to NULL, zero bytes to parse
          */
          iterator() : m_data (NULL), m_remain(0) {}
    
       private:
          /** for access to private constructor */
          friend class message;
          /** for access to private methods increment(),
          equal and dereference()
          */
          friend class boost::iterator_core_access;
          /** Core method. Moves to next element in sequence */
          void increment()
          {
             // check conditions
             // if an error occured
             // mark as EOF
             if(!hdr()->length) m_remain=0;
             if(hdr()->length>m_remain) m_remain=0;
             if(hdr()->length<sizeof(field_header)) m_remain=0;
             // if not EOF
             // mode to next
             if(m_remain)
             {
                m_remain-=hdr()->length;
                // data update must be last because
                // changes the header
                m_data+=hdr()->length;
             }
          }
    
          /** Core method. Compares the eleemnt with EOF */
          bool equal(boost_iterator const& other) const
          {
             // compares only how many bytes
             // remain to parse 
             return m_remain==other.m_remain;
          }
    
          /** Core method. Returns parsed value */
          avp dereference() const{ return current_avp();}
    
          /** Ctor accept data and size.
          The ctor can be called only within class message.
          */
          iterator(const boost::uint8_t* data, size_t length)
           : m_data(data), m_remain(length)
          {
          }
    
          /** Reprsent current data pointer as field poiner
          */
          const field_header* hdr() const
          {
             return reinterpret_cast<const field_header*>(m_data);
          }
    
          /** Converts current field to AVP
          */
          avp current_avp() const
          {
             // special value to return
             // when de-reference is not deined
             static const avp invalid;
             // check error conditions
             if(!m_data || hdr()->length>m_remain ||
                hdr()->length<sizeof(field_header))
                return invalid;
             // create avp from buffer
             avp ret;
             ret.data=hdr()->data;
             ret.id=hdr()->id;
             ret.length=hdr()->length-sizeof(field_header);
             return ret;
          }
    
          /** current data pointer*/
          const boost::uint8_t* m_data;
          /** how many bytes remain to parse */
          size_t m_remain;
       };
    

    This implementation is more compact then the previous one and it requires implementations of only essential methods such as increment(), dereference() and equal(). The facade_iterator class provides the rest of functionality.

    Operator->() Implementation

    One interesting detail regards to implementation of the operator->(). If the template parameter _Reference were set to a reference type (for example avp&), the BOOST library would implement the operator as:

       avp* operator->() const
       {
          return &**this;
       }
    

    In this case it is safely because the operator*() returns reference to a member of the class. However if the template parameter _Reference is set to a class, the operator*() will return object itself. If the operator->() returned pointer of the operator*() result, it would return pointer of temporary object and, hence, would cause crash.

    To avoid the problem the BOOST library will use proxy member and generate code like this:

       avp* operator->() const
       {
          m_proxy=**this;
          return &m_proxy;
       }
    

    In our case if the template parameter _Reference were omitted (set to default avp& type) the facade_iterator class would implement the operator->() in the first form. That would cause memory crash. To avoid this problem we must declare the _Reference explicitly as avp.

    It is worth to consider how the BOOST library distinguishes between two these cases: "reference" and "class" data types. We will talk about the topic in the next section.

    Type Traits and Meta-functions

    Template class must work with any data type that meets a set of requirements. For example, std::set<typename T> works with any data type that has operator "less" and copy constructor/assign operator. However, there are times when the template class should use additional information about data type T for more efficient implementation. In the example above the operator->() implementation differs from "class" to "reference" data type.

    The BOOST used "traits technique" to distinguish between type features. The technique is based on partial template class specialization. Let's consider the following code:

       /** Most common form */
       template <typename T>
       struct is_reference
       {
          static const bool value=false;
       };
    
       /** Specialization for 'reference' */
       template <typename T>
       struct is_reference <T&>
       {
          static const bool value=true;
       };
    
       // the first form of the template will be chosen
       // and "false" will be assigned 
       static bool A=is_reference<int>::value;
       // the second form of the template will be chosen
       // and "true" will be assigned 
       static bool B=is_reference<int&>::value;
    

    Compiler will substitute is_reference<>::value by "true" or "false" depending on chosen template specialization. From all possible template specializations compiler will select one that fits the best: in case A the compiler will take the first form of the template; in the case B it will take the second form.

    Hence the boolean variable A becomes "false" and the boolean variable B becomes "true"; Note that these substitution are performed at compiler time and, therefore, do not affect program run-time performances.

    The is_reference<typename T> template looks like a "function" that operates on metadata and can be "invoked" at compiler time. The BOOST library adopted "meta-function" term for such templates.

    Besides is_reference<typename T> meta-function the BOOST library defines set of useful meta-functions in the <boost/type_traits.hpp> header file. The set includes, for example, remove_reference<typename T> meta-function. The meta-function removes "&" from the type definition, for example, remove_reference<int&>::type will be int.

    The full list of the type traits meta-functions can be found at: Boost Type Traits

    Test Application

    The test application implements all 3 iterator versions in parser.h file. The test.cpp makes test run for the implementations. Besides the source files the zip archive contains parser_it.vcproj project file for MS Visual Studio 2008 and Make.am file for KDevelop.

    To build the project the BOOST library must be installed and configured at your computer. The library can be downloaded from the BOOST Site.

    Conclusion

    The BOOST library provides convenient infrastructure for iterator implementation. The infrastructure includes elaborated hierarchy of iterator categories and the boost::facade_iterator template. The boost::facade_iterator template requires implementations of only essential methods (such as increment(), dereference() and equal()) and provides the rest of functionality.

    References:

    1. The C++ Standard Template Library by Plauger, Stepanov, Lee, and Musser.
    2. C++ Standard Library by Nicolai M. Josuttis.
    3. BOOST Iterators
    4. Coplien, J., Curiously Recurring Template Patterns, C++ Report, February 1995, pp. 24-27.

    More by Author

    Get the Free Newsletter!

    Subscribe to Developer Insider for top news, trends & analysis

    Must Read