OverviewA challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.
To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.
Inserting the following text: 1234 12345 123456 1234567 www.another.site.org/index.html www.site.org/index.html www.test.com/index2.html www.test.com/index.html www.test.com Results in this data structure +-NULL +-1234 +-NULL +-5 +-NULL +-6 +-NULL +-7 +-www. +-another.site.org/index.html +-site.org/index.html +-test.com +-NULL +-/index +-2.html +-.html
The advantage to this approach over typical linear searches is two-fold.
- The memory overhead can be reduced as redundant text is partially eliminated
- Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)
On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.
Linear search: 2.7 seconds StringTree: 0.02 secondsA reasonable improvement!
The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.
Changes to current source (zip file below):
- Fixed several small memory leaks
- Added significant functionality for searching and File / CStringArray I/O
- Compiler defs for Borland compilers
- String counting and data association per node