Extending the STL: A Set of Ranges

Environment: C++, STL

Introduction

This article introduces an STL-compatible container that is very much like STL's own set. It can be used in a lot of situations where you would use std::set while providing extra functionality and improved performance in some cases. Some parts of this article are a bit more technical and can be skipped if you only want to use range_set. For those readers who just want to know how to use the code, you can skip to the section explaining the differences with std::set. For those who are interested in the details and the way the code works internally, start at the section "A Range."

A Range

A range [a, b] is composed of all elements between a and b, including a, but not b. This is the same convention that is used all through STL with iterators. An example is when you sort elements of a vector. The simplest code is:

std::vector<int> v;
// fill the vector with some data
std::sort(v.begin(), v.end());

v.begin() points to the first element in the vector and v.end() points to the one-past-last element in the vector. Similary, when you use the range_set, each range is specified by the first element in the range and the one-past-last element in the range.

What Is a Set of Ranges?

A set of ranges is pretty much the same thing as a set. It's a container that allows you to add and erase elements in O(log(N)) time like std::map. The advantage over std::map is that you can quickly insert a range of values [x, y) also in O(log(N)) time, whereas with std::map you would need O(log(N)(y-x)) time. This may sound a bit technical, but there are quite a few real applications where this makes sense. The essence of the difference is that when you know that the things you are storing may very well be consecutive, it may make sense to use a range_set. One simple example is storing the selected items in a list. If your list is large (say 10 thousand items), it takes O(log(2500)5000) time to insert the elements from 1 to 5000. Using a range_set, this time is shortened to just O(1) if your container is empty previously. So, if you are adding a lot of consecutive items, it really pays off.

Usage Scenarios

Personally, I have developed this range_set for use in sort of a large list view. Previously, I was storing the elements in an std::map and performance was unacceptable when the user tried to go to the first item, select it, then go to the last one and select the whole range. Adding over five thousand items just took too much time. Using one of the extension containers provided by most STLs, hash_map, resulted in only a marginal improvement. The fact remained that I needed to add 5000 integers seperately to the container. So, I came up with the range_set. In this container, you only have to call one function, namely insert_range, that will insert all 5000 integers. Moreover, it only takes a time proportional to O(log(5000)), not O(5000) like hash_map or O(log(2500)5000), like map. So, if you have large lists where you can select multiple items, a container such as range_set will be useful for storing the IDs of the selected items.

Another usage I have found for range_set is to represent ranges of Unicode characters. If, for example, you need to know which characters are used by Japanese scripts, there are several different ranges. The ranges each contain a lot of elements, but there are also several ranges of characters. A hash_map or a map would have to contain each of the different characters and they would not take advantage of the fact that a lot of these characters are adjacent. Here, range_set allows me to reduce both memory consumption and increase speed.

A third usage is more for analysis purposes. If you have some data that could be clustered into ranges, you can try adding it to a range_set and inspect the std::map it is holding as a representation. It doesn't matter in which order you add the elements; the ranges will always be as short and as few as possible. So, it will show you at a glance which ranges your data is clustered in and which strategies you could adopt for optimizing it.

Conventions and Requirements on Types

A range in the range_set is like a range of iterators in STL. This means that if you use (x, y) as the parameters for a function that takes a range, then x belongs to the range and y is one-past-the-end. For example, the range specified by 5 and 10 contains the integers 5, 6, 7, 8 and 9, but not 10.

For the needs of its representation, range_set requires that the Key type you pass it is a model of a random access iterator. In practice, this means that the type you pass needs to have the ++ and -- functions defined, as well as the operators - and +.

Apart from this, you also need a comparison function operating on the Key type. This can be the same one that is used with std::maps and more often than not, you can just leave that template argument empty and use std::less. The standard allocator will also probably suit your needs, so you can just go with the default one.

Difference with std::set

The general difference is that it has a few more requirements on the type of Key passed to it, but at the same time it has a best case and average case behaviour far superiour to std::set. The functions that let you add a range directly are the most obvious improvements.

New functions

  • insert_range
    void insert_range(const Key &val1, const Key &val2)
    This function inserts a range of values from val1 to val2 in O(log(N)) time where N is the number of ranges already contained in the range_set. The equivalent in std::map would be to call insert val2 - val1 times, which would result in a total time of O(log(N) (val2 - val1)).
  • erase_range
    void erase_range(const Key &val1, const Key &val2)
    This function is the pendant to insert_range, which deletes a range of values in O(log(N)) time.
  • intersect_range
    range_set *intersect_range(range_set &src, const Key &val1, const Key &val2)
    This returns the intersection of range_set with a fixed range of values. This function also executes in O(log(N)) time.
  • get_nth
    iterator get_nth(unsigned int n)
    This returns the Nth element from the range_set (starting at 1) in O(log(N)) time. Note that this function requires the Key type to be a model of a random access iterator.

Existing functions from std::map

  • begin
    iterator begin()
    Returns an iterator pointing to the beginning of the range_set.
  • end
    iterator end()
    Returns an iterator pointing to the end of the range_set.
  • find
    iterator find(const Key val)
    Returns an iterator to the element that contains the value val. If it is not found, it returns an iterator pointing to range_set.end()
  • insert
    iterator insert(const Key val)
    Inserts a single element into the range_set. It returns an iterator to the element inserted and executes in O(log(N)), where N is the number of distinct ranges in the range_set.
  • erase
    void erase(const Key val)
    Erases an element by key. The same function exists to delete an element by iterator.

Advanced (possibly dangerous) functions

  • quick_insert
    void quick_insert(const Key val)
    This function inserts a single element without enforcing the container invariant. That means that you can use this only if you are sure that the inserted value is neither already contained in the range_set, nor adjacent to any other range.
  • quick_insert_range
    void quick_insert_range(const Key &val1, const Key &val2)
    This function inserts a range of elements without enforcing the container invariant. Same remarks as for quick_insert.

Implementation Notes

The range_set is based on std::map and uses this to map the beginning of a range to one-past-the-end of a range. The basic invariant of the container is that the end of a range can never be one less than, equal to, or greater than the beginning of the next range. So, if you have, for example, a range_set containing the integers 5,6,7,9,10, it would store the two pairs (5, 8) and (9, 11) in the underlying std::map. Looking for a specific element is achieved by using the lower_bound function of std::map. This will give you an iterator to the biggest element that is not greater than the element you are looking for. So, say we are looking for element 6 in the range_set above. Then, lower_bound will give us an iterator to the pair (5,8) and a quick check comparing the end of the range (8) with the element we are looking for tells us that 6 is indeed contained in the range (5,8) and hence in the range_set.

We also need a custom iterator that takes the abstraction we make of consecutive elements away. Its implementation is rather trivial.

Possible Extensions

Instead of relying on std::map, using our own red-black tree for storage would improve certain operations. For the uses I've put it to, range_set is fast enough, though, so I saw no need to do this.

Downloads

Download demo project - 102 Kb
Download source - 26 Kb


Comments

  • Prior similar work

    Posted by aphillips on 05/21/2008 01:57am

    I only just noticed this but way back in 1999 I created a very similar STL compatible template class also called range_set (and also using the header file name range_set.h). I actually wrote an article on it for the May 1999 edition of The C/C++ Users' Journal (later taken over by DDJ) - see http://www.ddj.com/cpp/184403660. I trust there is no plagiarism involved here, though the similarity is striking. I will put it down to great minds thinking alike. My purpose was, like yours, to create a sparse "set" that could efficiently store millions or billions of elements, if they were added/removed in ranges. It was designed to be compatible with std::set but with the one restriction, of course, that the contained type be "rangeable", ie support +, -, ++ etc. For an appropriate type (such as integer types) it was completely compatible with std::set (apart from some obscure technical details that nobody would ever encounter in real code). In fact I wrote a test suite that tested every aspect of std::set then applied it to my range_set container to verify that it worked identically. (I actually identified some bugs in the version of std::set that shipped with VC++5 in doing that.) I am not sure how you handled this but one problem I encountered was with iterators. Since the container did not contain an instance of every element of the "contained" type T, there was in general nothing for an iterator to "point" to. I solved this by using a "fat" iterator where the iterator class itself contained an instance of T which was updated whenever the iterator moved. Over the years I have found this class very useful. In fact I use it in several places in my hex editor (see http://www.hexedit.com). For example, I use it to keep tracks of highlights in huge files (possible hundreds of GigaBytes in size).

    • Same name, similar function, different implementation

      Posted by Yves M on 02/25/2009 11:34am

      Yes, it's more something along the lines of great minds thinking alike. I found out about your article only a lot later. However there is a substantial difference in the implementations. You use a std::list as a basis, and I use an std::map. The map has the advantage that you can very quickly O(log(N)) decide whether an element belongs to the range_set or not (N being the number of distinct ranges). With a list that's not possible. In cases such as saving the selection of a listbox that doesn't matter as much. However for storing unicode ranges, it does make a difference. The other thing is that your class seems more complete. I didn't implement const interators and didn't test it as extensively for compatibility.

      Reply
    Reply
  • C++

    Posted by Legacy on 02/06/2004 12:00am

    Originally posted by: kelechi

    very good and efficient

    Reply
  • Very nice

    Posted by Legacy on 02/06/2004 12:00am

    Originally posted by: Per

    Magnificent!
    
    

    Conceptually simple and easy to use, the complicated stuff
    neatly wrapped up though a bit too readable to qualify as
    STL code ;-)

    One tip though - Compile your code before submitting it:
    TestRange.cpp(4284) : error C2143: syntax error : missing ';' before 'string'
    cout << "Time for std::set: " << ct.Stop() " seconds " << endl;
    we're missing a << here...

    Same on line 4294.

    It'd be a shame if such trivialities should distract ppl
    from seeing the exellence...

    Reply
  • very good

    Posted by Legacy on 02/06/2004 12:00am

    Originally posted by: kelechi

    Very Good and Effective
    

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • On-demand Event Event Date: September 10, 2014 Modern mobile applications connect systems-of-engagement (mobile apps) with systems-of-record (traditional IT) to deliver new and innovative business value. But the lifecycle for development of mobile apps is also new and different. Emerging trends in mobile development call for faster delivery of incremental features, coupled with feedback from the users of the app "in the wild." This loop of continuous delivery and continuous feedback is how the best mobile …

  • Java developers know that testing code changes can be a huge pain, and waiting for an application to redeploy after a code fix can take an eternity. Wouldn't it be great if you could see your code changes immediately, fine-tune, debug, explore and deploy code without waiting for ages? In this white paper, find out how that's possible with a Java plugin that drastically changes the way you develop, test and run Java applications. Discover the advantages of this plugin, and the changes you can expect to see …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds