Building the Right Environment to Support AI, Machine Learning and Deep Learning
This article introduces an STL-compatible container that is very much like STL's own set. It can be used in a lot of situations where you would use std::set while providing extra functionality and improved performance in some cases. Some parts of this article are a bit more technical and can be skipped if you only want to use range_set. For those readers who just want to know how to use the code, you can skip to the section explaining the differences with std::set. For those who are interested in the details and the way the code works internally, start at the section "A Range."
A range [a, b] is composed of all elements between a and b, including a, but not b. This is the same convention that is used all through STL with iterators. An example is when you sort elements of a vector. The simplest code is:
std::vector<int> v; // fill the vector with some data std::sort(v.begin(), v.end());
What Is a Set of Ranges?
A set of ranges is pretty much the same thing as a set. It's a container that allows you to add and erase elements in O(log(N)) time like std::map. The advantage over std::map is that you can quickly insert a range of values [x, y) also in O(log(N)) time, whereas with std::map you would need O(log(N)(y-x)) time. This may sound a bit technical, but there are quite a few real applications where this makes sense. The essence of the difference is that when you know that the things you are storing may very well be consecutive, it may make sense to use a range_set. One simple example is storing the selected items in a list. If your list is large (say 10 thousand items), it takes O(log(2500)5000) time to insert the elements from 1 to 5000. Using a range_set, this time is shortened to just O(1) if your container is empty previously. So, if you are adding a lot of consecutive items, it really pays off.
Personally, I have developed this range_set for use in sort of a large list view. Previously, I was storing the elements in an std::map and performance was unacceptable when the user tried to go to the first item, select it, then go to the last one and select the whole range. Adding over five thousand items just took too much time. Using one of the extension containers provided by most STLs, hash_map, resulted in only a marginal improvement. The fact remained that I needed to add 5000 integers seperately to the container. So, I came up with the range_set. In this container, you only have to call one function, namely insert_range, that will insert all 5000 integers. Moreover, it only takes a time proportional to O(log(5000)), not O(5000) like hash_map or O(log(2500)5000), like map. So, if you have large lists where you can select multiple items, a container such as range_set will be useful for storing the IDs of the selected items.
Another usage I have found for range_set is to represent ranges of Unicode characters. If, for example, you need to know which characters are used by Japanese scripts, there are several different ranges. The ranges each contain a lot of elements, but there are also several ranges of characters. A hash_map or a map would have to contain each of the different characters and they would not take advantage of the fact that a lot of these characters are adjacent. Here, range_set allows me to reduce both memory consumption and increase speed.
A third usage is more for analysis purposes. If you have some data that could be clustered into ranges, you can try adding it to a range_set and inspect the std::map it is holding as a representation. It doesn't matter in which order you add the elements; the ranges will always be as short and as few as possible. So, it will show you at a glance which ranges your data is clustered in and which strategies you could adopt for optimizing it.
Conventions and Requirements on Types
A range in the range_set is like a range of iterators in STL. This means that if you use (x, y) as the parameters for a function that takes a range, then x belongs to the range and y is one-past-the-end. For example, the range specified by 5 and 10 contains the integers 5, 6, 7, 8 and 9, but not 10.
For the needs of its representation, range_set requires that the Key type you pass it is a model of a random access iterator. In practice, this means that the type you pass needs to have the ++ and -- functions defined, as well as the operators - and +.
Apart from this, you also need a comparison function operating on the Key type. This can be the same one that is used with std::maps and more often than not, you can just leave that template argument empty and use std::less. The standard allocator will also probably suit your needs, so you can just go with the default one.
Difference with std::set
The general difference is that it has a few more requirements on the type of Key passed to it, but at the same time it has a best case and average case behaviour far superiour to std::set. The functions that let you add a range directly are the most obvious improvements.
void insert_range(const Key &val1, const Key &val2)
This function inserts a range of values from val1 to val2 in O(log(N)) time where N is the number of ranges already contained in the range_set. The equivalent in std::map would be to call insert val2 - val1 times, which would result in a total time of O(log(N) (val2 - val1)).
void erase_range(const Key &val1, const Key &val2)
This function is the pendant to insert_range, which deletes a range of values in O(log(N)) time.
*intersect_range(range_set &src, const Key &val1, const Key &val2)
This returns the intersection of range_set with a fixed range of values. This function also executes in O(log(N)) time.
iterator get_nth(unsigned int n) This returns the Nth element from the range_set (starting at 1) in O(log(N)) time. Note that this function requires the Key type to be a model of a random access iterator.
Existing functions from std::map
Returns an iterator pointing to the beginning of the range_set.
Returns an iterator pointing to the end of the range_set.
iterator find(const Key val)
Returns an iterator to the element that contains the value val. If it is not found, it returns an iterator pointing to range_set.end()
iterator insert(const Key val)
Inserts a single element into the range_set. It returns an iterator to the element inserted and executes in O(log(N)), where N is the number of distinct ranges in the range_set.
void erase(const Key val)
Erases an element by key. The same function exists to delete an element by iterator.
Advanced (possibly dangerous) functions
void quick_insert(const Key val)
This function inserts a single element without enforcing the container invariant. That means that you can use this only if you are sure that the inserted value is neither already contained in the range_set, nor adjacent to any other range.
void quick_insert_range(const Key &val1, const Key &val2)
This function inserts a range of elements without enforcing the container invariant. Same remarks as for quick_insert.
The range_set is based on std::map and uses this to map the beginning of a range to one-past-the-end of a range. The basic invariant of the container is that the end of a range can never be one less than, equal to, or greater than the beginning of the next range. So, if you have, for example, a range_set containing the integers 5,6,7,9,10, it would store the two pairs (5, 8) and (9, 11) in the underlying std::map. Looking for a specific element is achieved by using the lower_bound function of std::map. This will give you an iterator to the biggest element that is not greater than the element you are looking for. So, say we are looking for element 6 in the range_set above. Then, lower_bound will give us an iterator to the pair (5,8) and a quick check comparing the end of the range (8) with the element we are looking for tells us that 6 is indeed contained in the range (5,8) and hence in the range_set.
We also need a custom iterator that takes the abstraction we make of consecutive elements away. Its implementation is rather trivial.
Instead of relying on std::map, using our own red-black tree for storage would improve certain operations. For the uses I've put it to, range_set is fast enough, though, so I saw no need to do this.
DownloadsDownload demo project - 102 Kb
Download source - 26 Kb