CEditDist Abstract Template Class for Edit Distance Calculation on Generic Data Types

Environment: Any modern C++ compiler

Purpose

The CEditDist class performs edit distance calculations on abstract data types.  The edit distance is defined as the minimum cost required to convert one string into another, where the conversion can include changing one character to another, deleting characters and inserting characters, with user-defined costs for each basic operation.  The algorithm is O(nm) where n and m are the lengths of the two strings compared (this is the fastest known algorithm).  Edit distance calculations are useful for finding the degree of similarity between strings, e.g. in non-exact database queries.

The package is implemented as an abstract template base class.  The programmer derives from this base class in order to define the data types used and the cost functions.  The package includes complete documentation and an example of a simple implementation.

Example

Overview:
This example will create a class that can be used to calculate the edit distance between two arrays of integers, according to the following cost function: Deleting or inserting an integer will have a cost of 5; Changing from one integer value to a different value will have a cost of 3.

Step 1: Instantiate the Base Class.
The instantiation must be explicitly performed because of the special treatment of abstract template classes in C++.  In our case, we want a base class of integers, so we will use the following statement:

template class CEditDist<int>;

Step 2: Derive from CEditDist.
We must now create a derivation from the base class.  This derivation will define the cost functions, which are virtual functions called from the base class during the edit distance calculation.

class CIntEditClass : public CEditDist<int>
{
public:
   int DeleteCost(const int& deleted, int x, int y) { return 5; };
   int InsertCost(const int& inserted,int x, int y) { return 5; };
   int ChangeCost(const int& from, const int& to, int x, int y)
         { return (from==to ? 0 : 3); };
};

Step 3: Using the Class.
To use the class, construct an object from the base class and call its EditDistance function, like so:

CIntEditDist ed;
int a[7] = {1,1,2,3,1,2,2};
int b[6] = {1,2,2,3,1,1};
int c = ed.EditDistance(a,7,b,6);

In the above example, the edit distance stored in variable c should be 11.  This is because the least expensive edit replaces the second character in a from 1 to 2, replaces the sixth character from 2 to 1, and deletes the last character in a, for a total of two changes and one deletion, with a cost of 3+3+5=11.

Downloads

Download source - 2 Kb
EditDist classes and documentation (external link) - 27Kb


Comments

  • What is it useful for?

    Posted by Legacy on 12/02/1999 12:00am

    Originally posted by: Tomaz Stih

    Just wonder could you give us a hint what this thing would be useful for (except the obvious word evaluation)...

    I can see many uses for cumulative cost evaluations, for example if I have:

    1,2,3,4,5 and
    1,2,2,3,4,5 then

    the cost from your example is:

    change 3->2 ... 3
    change 4->3 ... 3
    change 5->4 ... 3
    add 5 ... 5

    total 14.

    But if you would do it smart by having algoritm returning when the cost exceeds "last cost+the other operation" then the cost would be smarter:

    delete 2 on 3rd position ... 5

    total 5

    Also something like this should be compatible (consistent) with STL to be really useful.

    Sincerely,
    Tomaz

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Salesforce has been recognized by Gartner as a leader in this report for three years in a row. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Salesforce.com. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research …

  • Microsoft® Office 365 is a top choice for enterprises that want a cloud-based suite of productivity/ collaboration applications. With Office 365, you get access to Microsoft™ Office solutions practically anytime, anywhere, on virtually any device. It's a great option for current Microsoft users who can now build on their experience with Microsoft™ solutions while enjoying the flexibility of a cloud-based delivery. But even organizations with no previous investment in Microsoft will find that …

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date