SHARE

Algorithm for Detecting Duplicate/Previously Encountered Strings

Environment: Straight C++ Overview A challenge I’ve seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session – to do so would […]

Written By

CodeGuru Staff

Oct 1, 2002

2 minute read

CodeGuru content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

Environment: Straight C++

Overview

A challenge I’ve seen pop up often while writing search tools is the ability to
verify that your code is not traversing the same source multiple times. A spider
traversing the web, for example, would never wish to traverse the same URL twice
in a session – to do so would merely allow circular loops and endless headaches.

To address this issue, I wrote an N-tree based search algorithm that breaks strings
down by their leading characters. Identical portions are ‘clipped’ and stored in a
single node, with the remainder of the string in similarly divided child nodes.

Example

Inserting the following text:
1234
12345
123456
1234567
www.another.site.org/index.html
www.site.org/index.html
www.test.com/index2.html
www.test.com/index.html
www.test.com
Results in this data structure
+-NULL
  +-1234
    +-NULL
    +-5
      +-NULL
      +-6
        +-NULL
        +-7
  +-www.
    +-another.site.org/index.html
    +-site.org/index.html
    +-test.com
     +-NULL
     +-/index
        +-2.html
        +-.html

The advantage to this approach over typical linear searches is two-fold.

The memory overhead can be reduced as redundant text is partially eliminated
Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.

Linear search: 2.7 seconds
StringTree: 0.02 seconds

A reasonable improvement!

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.

Updates

Changes to current source (zip file below):

Fixed several small memory leaks
Added significant functionality for searching and File / CStringArray I/O
Compiler defs for Borland compilers
String counting and data association per node

Downloads

Download demo and source code – 12 Kb