Fuzzy Matching Demo in Access

I, just a humble Access programmer, in 2004 decided to become adventurous and glean what I could from the jungle of artificial intelligence in fuzzy matching. What an adventure!

Before proceeding further, you should know that I have attached Fuzzy1.zip, which contains StrMatching2k.mde, ApprxStrMatchingEngine.pdf, IEEESoundexV5.pdf, VBAlgorithms.txt, and How2Use.rtf. A second attachment, Fuzzy2.zip, contains Names.mdb, which holds one table, FullNameRef, that needs to be in StrMatching2k.mde. On opening, StrMatching2k.mde will look for Names.mdb in the same folder and import the FullNameRef table. I had to split the files this way because the maximum size for attached files is 500 KB, and StrMatching2k.mde zipped with the FullNameRef table exceeds that. StrMatching2k.mde is the Access database with the demo, the tables of names, and so forth.

ApprxStrMatchingEngine.pdf is a PowerPoint presentation introducing Fuzzy Matching and explaining StrMatching2k.mde, and should be read before proceeding.

VBAlgorithms.txt is the source code for the suite of algorithms used in StrMatching2k.mde.

How2Use.rtf explains how to set up StrMatching2k.mde and how to use it for your own Fuzzy Matching.

IEEESoundexV5.pdf is an abstract entitled "Improving Precision and Recall for Soundex Retrieval."

The introduction states, "In the field of tax compliance, identity matching helps find non-filers by locating individuals found in external source systems that are absent from tax systems. In Texas, solving this problem was estimated to be worth at least $43M over 18 months[NCR98]."

In the references section was, "[NCR98] NCR. State of Texas Selects NCR Data Warehouse Solution to Help it Collect $43 Million in Additional Taxes, News Release, May 18, 1998." So, it appears that the NCR Corporation was involved in this process for the state of Texas. In the abstract is an interesting overview of approximate string matching and fuzzy matching algorithms.

A statement near the end reads, "Our algorithm did somewhat better than other individual tests, but the best recall came from a fully integrated test. Using all of the phonetic algorithms, code shifts and diagrams raised recall to 96%." 96% seemed good, but not good enough for me. That is where the adventure started.

After researching various algorithms, I realized that the Fuzzy Matching algorithmic wheels had already been invented. It was now a matter of determining which algorithms to use and which were best for this challenge. So, I plowed (as opposed to "surfed") through the Internet for algorithms. Most of what I found was incomplete, buggy, or written in languages other than VB; many were in C++, C, Perl, and so on. So, I translated them into VB, debugged them, added the missing pieces, and optimized them.

What an adventure!

Then, I obtained a corpus of 17,295 names; I don't remember where I got them. I imported them into Access and established unique IDs. From this, I created a test table and moved 1200 names into it, with a field for recording the original ID. After that, I got the wife to morph the 1200 names with all kinds of typical aberrations, from transpositions and typos to maiden-name variations and middle names reduced to an initial. Wives are really good at this....

The next step was to create a table for the results from running the table of 1200 morphed names against the table of 17,295 names using the algorithms to attempt the matching.

The results table had fields for the morphed name, the original ID, the ID of the name the algorithms matched it to, the match value the algorithms gave, and a Boolean field flagging whether the match was good. A reviewer would use this flag; a query could set it to True for all matches with a match value greater than, say, 0.85. Then, another query could list all rows still flagged False so the reviewer could judge whether they might qualify as good matches.

Statistics also could be compiled from comparing original IDs to match IDs. This would be used to calculate the precision/recall percentage, and the like after the test run.
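As an illustration of the statistics step, here is a minimal Python sketch. The row layout and function name are hypothetical, not the actual schema or queries in StrMatching2k.mde; it just shows how precision and recall fall out of comparing original IDs to match IDs above a threshold.

```python
# Hypothetical sketch: tallying precision/recall from a results table.
# Each row is (original_id, matched_id, match_value). Field layout is
# illustrative only, not the actual StrMatching2k.mde schema.

def precision_recall(rows, threshold=0.85):
    accepted = [r for r in rows if r[2] >= threshold]   # flagged as good
    correct = [r for r in accepted if r[0] == r[1]]     # match ID = original ID
    precision = len(correct) / len(accepted) if accepted else 0.0
    recall = len(correct) / len(rows) if rows else 0.0
    return precision, recall

rows = [
    (1, 1, 0.97),  # correct match, above threshold
    (2, 2, 0.90),  # correct match, above threshold
    (3, 7, 0.88),  # wrong match, above threshold
    (4, 4, 0.60),  # correct match, but below threshold
]
p, r = precision_recall(rows)
```

With the sample rows, two of the three accepted matches are correct (precision 2/3), and two of the four test names were recovered (recall 0.5).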

There are two functions in the database that call the algorithms: IsSimilar() and NameSimilarity(). Unfortunately, I cannot release the source code for these because I consider them proprietary and of high value. But the source for the suite of algorithms is pure gold. Two strings to be compared are passed to IsSimilar(), which runs them through the algorithm suite and returns the match value.

The NameSimilarity() function is customized specifically to match first, middle, and last names. It calls IsSimilar() many times while parsing the names, and in the end returns the average of all the scores IsSimilar() produced for the two names being compared. NameSimilarity() is what delivers the 99% precision/recall, and it couldn't do so if IsSimilar() were not good and solid.
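Since the source for IsSimilar() and NameSimilarity() is withheld, here is only a hypothetical Python sketch of the averaging idea described above. The is_similar function below is a crude stand-in scorer, not the article's proprietary algorithm suite.

```python
# Hypothetical sketch of the averaging idea behind NameSimilarity().
# is_similar is a placeholder scorer, NOT the article's proprietary IsSimilar().

def is_similar(a, b):
    # Placeholder: exact match scores 1.0; a lone initial matching the
    # first letter of the other part scores 0.9; anything else 0.0.
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        return 0.9 if a[:1] == b[:1] else 0.0
    return 0.0

def name_similarity(name1, name2):
    # Score corresponding name parts (first, middle, last) and average.
    parts1, parts2 = name1.split(), name2.split()
    scores = [is_similar(a, b) for a, b in zip(parts1, parts2)]
    return sum(scores) / len(scores) if scores else 0.0

score = name_similarity("John Q Public", "John Quincy Public")
```

Here "John Q Public" versus "John Quincy Public" averages 1.0, 0.9, and 1.0 to roughly 0.97; the real NameSimilarity() parses names far more carefully.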

In addition to all this, there is the demo/workbench. The "Algorithm Suite" form lets you type in your own names/strings to compare and displays the match value from either the algorithm suite or an individual algorithm. You can compare two names by entering them into the first, middle, and last boxes, or simply compare two other strings, such as addresses, by entering them into the two last-name boxes.

The Algorithms

There are only four algorithms called in the IsSimilar() function:

  • Dice Coefficient
  • Levenshtein Edit Distance
  • Longest Common Subsequence
  • Double Metaphone

These algorithms are used to detect possible matches between a known good reference list and a questionable list of the same type. The method used for the Dice Coefficient is the linear q-gram (or n-gram) model.

Q (or N) is the length of the gram. Q-grams include monograms, bigrams, trigrams, and so forth.

Example bigram for Nelson: %N Ne el ls so on n#

Public Sub Bigram(ByVal pStr, BGArray())
' * Breaks pStr into bigram segments
' * Example: zantac - %z za an nt ta ac c#
Dim X As Integer, BGCount As Integer, StrLen As Integer

   pStr = "%" & pStr & "#"
   StrLen = Len(pStr)
   BGCount = StrLen - 1
   ReDim BGArray(BGCount)

   For X = 0 To BGCount - 1
      BGArray(X) = Mid(pStr, X + 1, 2)
   Next X

End Sub

Dice Coefficient

A similarity metric:

  • D is Dice Coefficient
  • SB is Shared Bigrams
  • TBg1 is total number of bigrams in Qgram1
  • TBg2 is total number of bigrams in Qgram2
  • D = (2SB)/(TBg1+TBg2)

Double the number of shared bigrams and divide by total number of bigrams in each string.

Example Bigram: zantac - %z za an nt ta ac c#

A good Dice Coefficient is a value greater than 0.33. The Dice Coefficient is used as the threshold algorithm: if the Dice value is less than 0.2, the comparison is rejected immediately and not analyzed by the other algorithms.

Public Function Dice(ByVal Qgram1, ByVal Qgram2) As Single
' *************************** Dice Coefficient ****************
' * Similarity metric
' * D is Dice coefficient
' * SB is Shared Bigrams
' * TBg1 is total number of bigrams in Qgram1
' * TBg2 is total number of bigrams in Qgram2
' * D = (2SB)/(TBg1+TBg2)
' * Double the number of shared bigrams and divide by total
' * number of bigrams in each string.
' *************************************************************
Dim QgramCount1 As Integer, QgramCount2 As Integer, BG1, BG2
Dim SharedQgrams As Integer, q1 As Integer, q2 As Integer

If IsNull(Qgram1) Or IsNull(Qgram2) Then
   Exit Function
End If

QgramCount1 = UBound(Qgram1)
QgramCount2 = UBound(Qgram2)

If QgramCount1 = 0 Or QgramCount2 = 0 Then Exit Function

For q1 = 0 To QgramCount1 ' get shared bigram count
   BG1 = Qgram1(q1)

   For q2 = 0 To QgramCount2
      BG2 = Qgram2(q2)

      If Not IsEmpty(BG1) And Not IsEmpty(BG2) Then
         If BG1 = BG2 Then
            SharedQgrams = SharedQgrams + 1
         End If
      End If
   Next q2
Next q1

Dice = CSng(Format((2 * SharedQgrams) / (QgramCount1 + _
            QgramCount2), "0.00"))

End Function
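As a quick cross-check of the formula, the same bigram/Dice computation can be sketched in a few lines of Python. This is a minimal sketch of the idea, not a port of the VBA above, though it counts shared bigrams the same brute-force way.

```python
def bigrams(s):
    # Pad with boundary markers, as in the article's Bigram() routine.
    s = "%" + s + "#"
    return [s[i:i+2] for i in range(len(s) - 1)]

def dice(a, b):
    # D = (2 * shared bigrams) / (total bigrams in a + total bigrams in b),
    # counting shared pairs with the same brute-force cross product as the VBA.
    ba, bb = bigrams(a), bigrams(b)
    shared = sum(1 for x in ba for y in bb if x == y)
    return (2 * shared) / (len(ba) + len(bb))
```

For example, "zantac" against itself scores 1.0, while "zantac" against "zentac" shares 5 of 7 bigrams on each side, giving 10/14, or about 0.71.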

Levenshtein Edit Distance

The Levenshtein Distance algorithm is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. The Levenshtein edit distance is the number of insertions, deletions, or replacements of single characters that are required to convert one string to the other.

Character transposition is detected. Transposition is given a cost of 1.

Transposition detection is taken from: Berghel, Hal; Roach, David:
"An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm"
(http://www.acm.org/~hlb/publications/asm/asm.html)

A Levenshtein Distance of 2, 3, or more characters is meaningless without evaluating the lengths of the two comparison strings as well.

Public Function LevDist(ByVal pStr1, ByVal pStr2) As Integer
'******************** Levenshtein Distance *********************
'* Levenshtein Distance algorithm is named after the Russian
'* scientist Vladimir Levenshtein, who devised the algorithm
'* in 1965
'*
'* Levenshtein edit distance is the number of insertions,
'* deletions, or replacements of single characters that are
'* required to convert one string to the other.
'*
'* Character transposition is detected in Step 6A.
'* Transposition is given a cost of 1.
'*
'* pStr1 and pStr2 are arrays.
'*
'* Normally Levenshtein edit distance is symmetric.
'* That is, LevDist(pStr1, pStr2) is the same as
'* LevDist(pStr2, pStr1).
'***************************************************************
Dim n As Integer, m As Integer, matrix() As Integer
Dim i As Integer, j As Integer, cost As Integer, t_j, s_i
Dim above As Integer, left As Integer, diag As Integer
Dim cell As Integer
Dim trans As Integer

   ' Step 1

      n = UBound(pStr1)
      m = UBound(pStr2)

         If n = 0 Then
            LevDist = m   ' pStr1 empty: distance is length of pStr2
            GoTo LevDistXit
         End If
         If m = 0 Then
            LevDist = n   ' pStr2 empty: distance is length of pStr1
            GoTo LevDistXit
         End If

      ReDim matrix(n + 1, m + 1)

   ' Step 2

   For i = 0 To n
      matrix(i, 0) = i
   Next i

   For j = 0 To m
      matrix(0, j) = j
   Next j

   ' Step 3

   For i = 1 To n

      s_i = pStr1(i - 1)

      ' Step 4

      For j = 1 To m

         t_j = pStr2(j - 1)

         ' Step 5

         cost = 0
         If Not IsEmpty(s_i) And Not IsEmpty(t_j) Then
            If s_i <> t_j Then
               cost = 1
            End If
         End If

         ' Step 6

         above = matrix(i - 1, j)
         left  = matrix(i, j - 1)
         diag  = matrix(i - 1, j - 1)
         cell  = Minimum((above + 1), (left + 1), (diag + cost))

         ' Step 6A: Cover transposition, in addition to deletion,
         ' insertion, and substitution. This step is taken from:
         ' Berghel, Hal; Roach, David: "An Extension of Ukkonen's
         ' Enhanced Dynamic Programming ASM Algorithm"
         ' (http://www.acm.org/~hlb/publications/asm/asm.html)

         If i > 1 And j > 1 Then
            ' Changed here to detect transposition of the first 2
            ' characters and return 1 instead of 2 for a cost
            trans = matrix(i - 2, j - 2) + 1

            If pStr1(i - 2) <> pStr2(j - 1) Then ' was t_j
               trans = trans + 1
            End If
            If pStr1(i - 1) <> pStr2(j - 2) Then ' was s_i
               trans = trans + 1
            End If

            If cell > trans Then
               cell = trans
            End If
         End If

         matrix(i, j) = cell
      Next j
   Next i

   ' Step 7

   LevDist = matrix(n, m)

LevDistXit:
End Function
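For comparison, here is a minimal Python sketch of edit distance with adjacent-character transposition costed at 1 (a restricted Damerau-Levenshtein in the same spirit as the VBA above, though not a line-by-line port of it).

```python
def lev_dist(s, t):
    # Edit distance: insert, delete, and substitute each cost 1, and an
    # adjacent-character transposition also costs 1.
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deleting i characters from s
    for j in range(m + 1):
        d[0][j] = j          # inserting j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[n][m]
```

With transposition costed at 1, swapping the last two letters of "nelson" to get "nelsno" is a distance of 1 rather than 2.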

Longest Common Subsequence

The LCS is calculated by the length of the longest common, not necessarily contiguous, sub-sequence of characters divided by the average character lengths of both strings. A good value for LCS is greater than or equal to 0.8.

Public Function LCS(ByVal str1, ByVal str2) As Single
'* *************** Longest Common Subsequence ******************
'* The LCS is calculated by the length of the longest common,
'* not necessarily contiguous, sub-sequence of characters
'* divided by the average character lengths of both strings.
'* In this case:  c(m, n) / (((Str1Len) + (Str2Len)) / 2))
'* LCS is symmetric.
'*
'* str1 and str2 are arrays.
'***************************************************************

Dim i As Integer, j As Integer, m As Integer, n As Integer
Dim c() As Integer, b() As Integer, X$(), Y$()
Dim Str1Len As Integer, Str2Len As Integer, SmStr(), LgStr()

   Str1Len = UBound(str1)
   Str2Len = UBound(str2)

   n = Minimum(Str1Len, Str2Len)
   m = Maximum(Str1Len, Str2Len)

   ReDim X(m)
   ReDim Y(m)
   ReDim c(m, m)
   ReDim b(m, m)

      If Str1Len > Str2Len Then
         For i = 0 To Str1Len - 1
            X(i) = str1(i)
         Next i
         For i = 0 To Str2Len - 1
            Y(i) = str2(i)
         Next i
      Else
         For i = 0 To Str2Len - 1
            X(i) = str2(i)
         Next i
         For i = 0 To Str1Len - 1
            Y(i) = str1(i)
         Next i
      End If ' Str1Len >Str2Len

      For i = 1 To m
         For j = 1 To n
            If X(i - 1) = Y(j - 1) Then
               c(i, j) = c(i - 1, j - 1) + 1
               b(i, j) = 1 ' /* from north west */
            ElseIf c(i - 1, j) >= c(i, j - 1) Then
               c(i, j) = c(i - 1, j)
               b(i, j) = 2 ' /* from north */
            Else
               c(i, j) = c(i, j - 1)
               b(i, j) = 3 ' /* from west */
            End If
         Next j
      Next i

   ' return c[m][n];

   If c(m, n) > 0 Then
      LCS = CSng(Format((c(m, n) / (((Str1Len) + _
            (Str2Len)) / 2)), "#.##"))
   Else
      LCS = 0
   End If

   Erase X
   Erase Y
   Erase c
   Erase b

End Function
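The same metric can be sketched compactly in Python: the classic dynamic-programming LCS length divided by the average length of the two strings. This is a minimal sketch of the idea, not a port of the VBA above.

```python
def lcs_similarity(s, t):
    # Classic dynamic-programming longest-common-subsequence length.
    n, m = len(s), len(t)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i-1] == t[j-1]:
                c[i][j] = c[i-1][j-1] + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])
    # Normalize by the average of the two string lengths.
    return c[n][m] / ((n + m) / 2) if n + m else 0.0
```

For example, "nelson" and "nilsen" share the subsequence "nlsn", so the score is 4 divided by an average length of 6, or about 0.67; below the 0.8 threshold mentioned above.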

Double Metaphone

This is an algorithm that phonetically encodes English words (and foreign words often heard in the United States) by reducing them to 12 consonant sounds. This mitigates matching problems caused by misspellings.

The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms.

He wrote it as a replacement for SOUNDEX. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen," and no phonetic relationship between "Adam" and "Nelson."

Double Metaphone is designed primarily to encode American English names (though it also encodes most English words well) while taking into account the fact that such words can have more than one acceptable pronunciation.

Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most likely pronunciation as well as an optional alternative pronunciation (hence the "double" in the name).

"United States" would generate this primary phonetic code: ANTTSTTS
"United Statess" would generate this primary phonetic code: ANTTSTTSS

The final step in Double Metaphone is to translate the two primary codes into match values. That must be done through one of the other algorithms. I will not show the code here because the main code and supporting code are very extensive. The code is all in VBAlgorithms.txt, included in the attachment for this article.
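To illustrate that final step, here is one way (an assumption for illustration, not the article's actual IsSimilar() logic) to turn the two primary codes from the example above into a match value, sketched with Python's standard difflib similarity ratio.

```python
import difflib

# Primary Double Metaphone codes from the article's example.
code1 = "ANTTSTTS"   # "United States"
code2 = "ANTTSTTSS"  # "United Statess"

# SequenceMatcher.ratio() = 2*M / (len(a) + len(b)), where M is the
# number of matched characters -- one possible match value for two codes.
ratio = difflib.SequenceMatcher(None, code1, code2).ratio()
```

The two codes differ by a single trailing S, so the ratio is 16/17, about 0.94; in the actual database, any of the suite's algorithms could play this role.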

Algorithmically speaking, matching is divine and 99% is sweet! What an adventure!



About the Author

George Brown

MS Access programmer.
