Virtual Developer Workshop: Containerized Development with Docker
OverviewA challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.
To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.
Inserting the following text: 1234 12345 123456 1234567 www.another.site.org/index.html www.site.org/index.html www.test.com/index2.html www.test.com/index.html www.test.com Results in this data structure +-NULL +-1234 +-NULL +-5 +-NULL +-6 +-NULL +-7 +-www. +-another.site.org/index.html +-site.org/index.html +-test.com +-NULL +-/index +-2.html +-.html
The advantage to this approach over typical linear searches is two-fold.
- The memory overhead can be reduced as redundant text is partially eliminated
- Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)
Linear search: 2.7 seconds StringTree: 0.02 secondsA reasonable improvement!
The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.
Changes to current source (zip file below):
- Fixed several small memory leaks
- Added significant functionality for searching and File / CStringArray I/O
- Compiler defs for Borland compilers
- String counting and data association per node