Algorithm for Detecting Duplicate/Previously Encountered Strings | CodeGuru

Algorithm for Detecting Duplicate/Previously Encountered Strings

Environment: Straight C++ Overview A challenge I’ve seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session – to do so would […]

Written By
CodeGuru Staff
CodeGuru Staff
Oct 1, 2002
2 minute read
CodeGuru content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

Environment: Straight C++

Overview

A challenge I’ve seen pop up often while writing search tools is the ability to
verify that your code is not traversing the same source multiple times. A spider
traversing the web, for example, would never wish to traverse the same URL twice
in a session – to do so would merely allow circular loops and endless headaches.

To address this issue, I wrote an N-tree based search algorithm that breaks strings
down by their leading characters. Identical portions are ‘clipped’ and stored in a
single node, with the remainder of the string in similarly divided child nodes.

Example

Inserting the following text:
1234
12345
123456
1234567
www.another.site.org/index.html
www.site.org/index.html
www.test.com/index2.html
www.test.com/index.html
www.test.com
Results in this data structure
+-NULL
  +-1234
    +-NULL
    +-5
      +-NULL
      +-6
        +-NULL
        +-7
  +-www.
    +-another.site.org/index.html
    +-site.org/index.html
    +-test.com
     +-NULL
     +-/index
        +-2.html
        +-.html

The advantage to this approach over typical linear searches is two-fold.

  1. The memory overhead can be reduced as redundant text is partially eliminated
  2. Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.

Linear search: 2.7 seconds
StringTree: 0.02 seconds

A reasonable improvement!

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.

Advertisement

Updates

Changes to current source (zip file below):

  • Fixed several small memory leaks
  • Added significant functionality for searching and File / CStringArray I/O
  • Compiler defs for Borland compilers
  • String counting and data association per node

Downloads

Download demo and source code – 12 Kb

CodeGuru Logo

CodeGuru covers topics related to Microsoft-related software development, mobile development, database management, and web application programming. In addition to tutorials and how-tos that teach programmers how to code in Microsoft-related languages and frameworks like C# and .Net, we also publish articles on software development tools, the latest in developer news, and advice for project managers. Cloud services such as Microsoft Azure and database options including SQL Server and MSSQL are also frequently covered.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved

Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.