CodeGuru
Earthweb Search
Forums Wireless Jars Gamelan Developer.com
CodeGuru Navigation
RSS Feeds

RSSAll

RSSVC++/C++

RSS.NET/C#

RSSVB

See more EarthWeb Network feeds

follow us on Twitter

Member Sign In
User ID:
Password:
Remember Me:
Forgot Password?
Not a member?
Click here for more information and to register.

Become a Marketplace Partner

jobs.internet.com

internet.commerce
Partners & Affiliates
















Home >> Visual C++ / C++ >> C++ >> Algorithms & Formulas >> String Algorithms


Algorithm for Detecting Duplicate/Previously Encountered Strings
Rating: none

Jeremy Cattone (view profile)
October 1, 2002

Environment: Straight C++

Overview

A challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.

To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.

Example

Inserting the following text:
1234
12345
123456
1234567
www.another.site.org/index.html
www.site.org/index.html
www.test.com/index2.html
www.test.com/index.html
www.test.com

Results in this data structure
+-NULL
  +-1234
    +-NULL
    +-5
      +-NULL
      +-6
        +-NULL
        +-7
  +-www.
    +-another.site.org/index.html
    +-site.org/index.html
    +-test.com
     +-NULL
     +-/index
        +-2.html
        +-.html

(continued)



The advantage to this approach over typical linear searches is two-fold.

  1. The memory overhead can be reduced as redundant text is partially eliminated
  2. Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.

Linear search: 2.7 seconds
StringTree: 0.02 seconds
A reasonable improvement!

The attached code implements the algorithm as a CStringTree class, and provides a test / performance driver to demo the code.

Updates

Changes to current source (zip file below):

  • Fixed several small memory leaks
  • Added significant functionality for searching and File / CStringArray I/O
  • Compiler defs for Borland compilers
  • String counting and data association per node

Downloads

Download demo and source code - 12 Kb

Tools:
Add www.codeguru.com to your favorites
Add www.codeguru.com to your browser search box
IE 7 | Firefox 2.0 | Firefox 1.5.x
Receive news via our XML/RSS feed







RATE THIS ARTICLE:   Excellent  Very Good  Average  Below Average  Poor  

(You must be signed in to rank an article. Not a member? Click here to register)

Latest Comments:
mfc programming - Legacy CodeGuru (01/24/2004)
Unicode Version - Legacy CodeGuru (12/09/2003)
in certain conditions - Legacy CodeGuru (10/28/2003)
Delete - Legacy CodeGuru (04/04/2003)
Trie - Legacy CodeGuru (03/28/2003)

View All Comments
Add a Comment:
Title:
Comment:
Pre-Formatted: Check this if you want the text to display with the formatting as typed (good for source code)



(You must be signed in to comment on an article. Not a member? Click here to register)

internet.comearthweb.comDevx.commediabistro.comGraphics.com

Search:

Jupitermedia Corporation has two divisions: Jupiterimages and JupiterOnlineMedia

Jupitermedia Corporate Info

Legal Notices, Licensing, Reprints, Permissions, Privacy Policy.
Advertise | Newsletters | Tech Jobs | Shopping | E-mail Offers

Whitepapers and eBooks

Intel Whitepaper: Comparing Two- and Four-Socket Platforms for Server Virtualization
IBM Solutions Brief: Go Green With IBM System xTM And Intel
HP eBook: Simplifying SQL Server Management
IBM Contest: Are You the Next Superstar? Join the "Search for the XML Superstar" Contest to Find Out
Microsoft PDF: Top 10 Reasons to Move to Server Virtualization with Hyper-V
Microsoft PDF: Six Reasons Why Microsoft's Hyper-V Will Overtake Vmware
Microsoft Step-by-Step Guide: Hyper-V and Failover Clustering
Intel PDF: Quad-Core Impacts More Than the Data Center
Intel PDF: Virtualization Delivers Data Center Efficiency
Go Parallel Article: PDC 2008 in Review
Microsoft PDF: Top 11 Reasons to Upgrade to Windows Server 2008
Avaya Article: Communication-Enabled Mashups: Empowering Both Business Owners and IT
Intel Whitepaper: Building a Real-World Model to Assess Virtualization Platforms
  PDF: Intel Centrino Duo Processor Technology with Intel Core2 Duo Processor
Microsoft Article: Build and Run Virtual Machines with Hyper-V Server 2008
Go Parallel Article: Q&A with a TBB Junkie
IBM Whitepaper: Innovative Collaboration to Advance Your Business
Internet.com eBook: Real Life Rails
IBM eBook: The Pros and Cons of Outsourcing
Internet.com eBook: Best Practices for Developing a Web Site
IBM CXO Whitepaper: The 2008 Global CEO Study "The Enterprise of the Future"
Avaya Article: Call Control XML in Action - A CCXML Auto Attendant
IBM CXO Whitepaper: Unlocking the DNA of the Adaptable Workforce--The Global Human Capital Study 2008
Adobe Acrobat Connect Pro: Web Conferencing and eLearning Whitepapers
HP eBook: Guide to Storage Networking
MORE WHITEPAPERS, EBOOKS, AND ARTICLES