Algorithm for Detecting Duplicate/Previously Encountered Strings

Environment: Straight C++


A challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.

To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.


Inserting the following text:

Results in this data structure

The advantage to this approach over typical linear searches is two-fold.

  1. The memory overhead can be reduced as redundant text is partially eliminated
  2. Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.
Linear search: 2.7 seconds
StringTree: 0.02 seconds
A reasonable improvement!

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.


Changes to current source (zip file below):

  • Fixed several small memory leaks
  • Added significant functionality for searching and File / CStringArray I/O
  • Compiler defs for Borland compilers
  • String counting and data association per node


Download demo and source code - 12 Kb


Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Moving from an on-premises environment to Office 365 does not remove the need to plan for disruptions or reduce the business risk requirements for protecting email services. If anything, some risks increase with a move to the cloud. Read how to ease the transition every business faces if considering or already migrating to cloud email. This white paper discusses: Setting expectations when migrating to Office 365 Understanding the implications of relying solely on Exchange Online security Necessary archiving …

  • Enterprises are increasingly looking to platform as a service (PaaS) to lower their costs and speed their time to market for new applications. Developing, deploying, and managing applications in the cloud eliminates the time and expense of managing a physical infrastructure to support them. PaaS offerings must deliver additional long-term benefits, such as a lower total cost of ownership (TCO), rapid scalability, and ease of integration, all while providing robust security and availability. This report …

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date