Algorithm for Detecting Duplicate/Previously Encountered Strings

Environment: Straight C++

Overview

A challenge I've seen come up often while writing search tools is verifying that your code does not traverse the same source more than once. A spider crawling the web, for example, should never visit the same URL twice in a session; doing so invites circular loops and endless headaches.

To address this issue, I wrote an N-ary tree based search algorithm that breaks strings down by their leading characters. Portions shared by several strings are 'clipped' and stored in a single node, and the remainders of those strings are stored in child nodes that are divided in the same way.
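The 'clipping' step can be illustrated with a small sketch. This is not taken from the attached CStringTree source; the helper names commonPrefixLength and clip are hypothetical and only show the idea of separating the shared leading portion from the remainder.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: number of leading characters two strings share.
std::size_t commonPrefixLength(const std::string& a, const std::string& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Hypothetical helper: split text at position n into the clipped head
// (stored in the parent node) and the tail (stored in a child node).
// The caller must ensure n <= text.size().
std::pair<std::string, std::string> clip(const std::string& text, std::size_t n) {
    return {text.substr(0, n), text.substr(n)};
}
```

For example, "www.test.com" and "www.site.org/index.html" share four leading characters, so "www." would be stored once in a parent node, with "test.com" and "site.org/index.html" as the remainders in its children.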

Example

Inserting the following text:
1234
12345
123456
1234567
www.another.site.org/index.html
www.site.org/index.html
www.test.com/index2.html
www.test.com/index.html
www.test.com

Results in this data structure (the root holds no text, and a NULL child means that a complete string ends at its parent node):
+-NULL
  +-1234
    +-NULL
    +-5
      +-NULL
      +-6
        +-NULL
        +-7
  +-www.
    +-another.site.org/index.html
    +-site.org/index.html
    +-test.com
      +-NULL
      +-/index
        +-2.html
        +-.html
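The structure above can be reproduced with a short sketch. This is not the attached CStringTree class: the name StringTreeSketch is hypothetical, C++14 is assumed (for std::make_unique), a boolean flag stands in for the NULL child that marks the end of a string, and children are kept in insertion order rather than sorted.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a prefix-clipping string tree.
class StringTreeSketch {
public:
    // Returns true if s was added, false if it was already present.
    bool insert(const std::string& s) {
        Node* node = &root_;
        std::size_t pos = 0;  // characters of s matched so far
        while (true) {
            if (pos == s.size()) {  // s ends exactly at this node
                bool added = !node->terminal;
                node->terminal = true;
                return added;
            }
            Node* next = nullptr;
            for (auto& child : node->children) {
                if (child->text[0] == s[pos]) { next = child.get(); break; }
            }
            if (!next) {  // nothing shared: store the remainder as a new leaf
                auto leaf = std::make_unique<Node>();
                leaf->text = s.substr(pos);
                leaf->terminal = true;
                node->children.push_back(std::move(leaf));
                return true;
            }
            // Count the characters shared by the child and the rest of s.
            std::size_t k = 0;
            while (k < next->text.size() && pos + k < s.size() &&
                   next->text[k] == s[pos + k]) {
                ++k;
            }
            if (k < next->text.size()) {
                // Partial match: keep the shared part in this node and
                // push the unshared tail (with its subtree) into a new child.
                auto tail = std::make_unique<Node>();
                tail->text = next->text.substr(k);
                tail->terminal = next->terminal;
                tail->children = std::move(next->children);
                next->children.clear();
                next->text.resize(k);
                next->terminal = false;
                next->children.push_back(std::move(tail));
            }
            pos += k;
            node = next;
        }
    }

    // Returns true only if s was inserted earlier as a complete string.
    bool contains(const std::string& s) const {
        const Node* node = &root_;
        std::size_t pos = 0;
        while (pos < s.size()) {
            const Node* next = nullptr;
            for (const auto& child : node->children) {
                if (s.compare(pos, child->text.size(), child->text) == 0) {
                    next = child.get();
                    break;
                }
            }
            if (!next) return false;
            pos += next->text.size();
            node = next;
        }
        return node->terminal;
    }

private:
    struct Node {
        std::string text;       // clipped portion stored in this node
        bool terminal = false;  // stands in for the NULL child in the diagram
        std::vector<std::unique_ptr<Node>> children;
    };
    Node root_;  // empty text, like the NULL root in the diagram
};
```

Inserting the nine strings from the example in the order listed produces the same node texts as the diagram. A spider could call insert() on each URL and skip any URL for which it returns false.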

This approach has two advantages over a typical linear search:

  1. Memory overhead can be reduced, because text shared by several strings is stored only once.
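  2. Since the tree is ordered, a search follows a single path down from the root and skips every unrelated branch. I estimated this as O(Log(N)) time; strictly, the cost is bounded by the length of the string being looked up plus the siblings examined at each level, rather than by the total number of strings stored (someone please correct me if my estimate of this algorithm is incorrect!)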

On a 600 MHz Pentium III system, the following benchmarks were obtained for inserting 3,000 random strings of length 0-254 characters into a collection that already contained 12,000 strings:
Linear search: 2.7 seconds
StringTree: 0.02 seconds
A reasonable improvement: roughly 135 times faster!
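The linear-search baseline used in the benchmark is not shown here. A minimal sketch of what such a baseline might look like, assuming the strings are kept in a std::vector (the function name insertIfNewLinear is hypothetical), is:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical baseline: adds s only if no equal string is already stored.
// Returns true if s was added, false if it was a duplicate.
bool insertIfNewLinear(std::vector<std::string>& seen, const std::string& s) {
    if (std::find(seen.begin(), seen.end(), s) != seen.end()) {
        return false;  // already encountered
    }
    seen.push_back(s);
    return true;
}
```

Each call may compare s against every string already stored, so the cost of one insertion grows with the size of the collection. The tree only inspects nodes along one path, which is consistent with the gap in the timings above.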

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.

Updates

Changes to current source (zip file below):

  • Fixed several small memory leaks
  • Added significant functionality for searching and for File / CStringArray I/O
  • Added compiler definitions for Borland compilers
  • Added string counting and per-node data association

Downloads

Download demo and source code - 12 Kb

