Algorithm for Detecting Duplicate/Previously Encountered Strings

Environment: Straight C++


A challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.

To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.


Inserting the following text:

Results in this data structure

The advantage to this approach over typical linear searches is two-fold.

  1. The memory overhead can be reduced as redundant text is partially eliminated
  2. Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.
Linear search: 2.7 seconds
StringTree: 0.02 seconds
A reasonable improvement!

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.


Changes to current source (zip file below):

  • Fixed several small memory leaks
  • Added significant functionality for searching and File / CStringArray I/O
  • Compiler defs for Borland compilers
  • String counting and data association per node


Download demo and source code - 12 Kb


Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • On-demand Event Event Date: September 23, 2015 The cloud is not just about a runtime platform for your projects – now, you can do your development in the cloud, too. Check out this webcast to learn how the cloud improves your development experience and team collaboration. Join Dana Singleterry, Principal Product Manager for Oracle Dev Tools, as he discusses how to simplify every aspect of the development lifecycle, including requirements gathering, version management, code reviews, build automation, and …

  • Thanks to the Internet of Things (IoT), physical assets are turning into participants in real-time global digital markets. The countless types of assets around us will become as easily indexed, searched and traded as any online commodity. While some industries will be tougher to transform than others – those with physical limitations, such as manufacturing, will be harder to digitize – untold economic opportunities exist for growth and advancement. Our research shows this will create a new "Economy …

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date