Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
I, just a humble Access programmer, in 2004 decided to become adventurous and glean what I could from the jungle of artificial intelligence in fuzzy matching. What an adventure!
Before proceeding further, it is good to let you know that I have attached Fuzzy1.zip, the contents of which are StrMatching2k.mde, ApprxStrMatchingEngine.pdf, IEEESoundexV5.pdf, VBAlgorithms.txt, and How2Use.rtf. Another attached file is Fuzzy2.zip. It contains the Names.mdb file that contains one table called FullNameRef that needs to be in StrMatching2k.mde. On opening StrMatching2k.mde, it will look for Names.mdb in the same folder and import the FullNameRef table. I had to do it this way because the max size for attached files is 500 Kb, and the size of StrMatching2k.mde with the FullNameRef table zipped is more than 500 Kb. StrMatching2k.mde is the Access database with the demo, tables of names, and so forth.
ApprxStrMatchingEngine.pdf is a PowerPoint presentation introducing Fuzzy Matching and explaining StrMatching2k.mde, and should be read before proceeding.
VBAlgorithms.txt is the source code for the suite of algorithms used in StrMatching2k.mde.
How2Use.rtf explains how to set up StrMatching2k.mde and how to use it for your own Fuzzy Matching.
IEEESoundexV5.pdf is an abstract entitled "Improving Precision and Recall for Soundex Retrieval."
The introduction states, "In the field of tax compliance, identity matching helps find non-filers by locating individuals found in external source systems that are absent from tax systems. In Texas, solving this problem was estimated to be worth at least $43M over 18 months[NCR98]."
In the references section was, "[NCR98] NCR. State of Texas Selects NCR Data Warehouse Solution to Help it Collect $43 Million in Additional Taxes, News Release, May 18, 1998." So, it appears that the NCR Corporation was involved in this process for the state of Texas. In the abstract is an interesting overview of approximate string matching and fuzzy matching algorithms.
A statement near the end reads, "Our algorithm did somewhat better than other individual tests, but the best recall came from a fully integrated test. Using all of the phonetic algorithms, code shifts and diagrams raised recall to 96%." 96% seemed good, but not good enough for me. That is where the adventure started.
After researching various algorithms, I came to the realization that the Fuzzy Matching algorithmic wheels were already invented. It was now a matter of determining what algorithms to use and which were the best for this challenge. So, I plowed (as opposed to "surfed") through the Internet for algorithms. What I discovered was that most of what I found were incomplete, buggy, or written in languages other than VB. Many were in C++, C, Perl, and so on. So, I translated them into VB, debugged them, added the missing pieces, and optimized them.
What an adventure!
Then, I obtained a corpus of 17,295 names; I don't remember where I got them. I imported them into Access and established unique IDs. From this, I created a test table and moved 1200 names into it with a field for recording the original ID. After that, I got the wife to morph the 1200 names with all kinds of typical aberrations from transposition, typos, maiden name variations, to middle names reduced to an initial. Wives are really good at this....
The next step was to create a table for the results from running the table of 1200 morphed names against the table of 17,295 names using the algorithms to attempt the matching.
The results table had fields for the morphed name, the original ID, the ID of the name to which the algorithms matched, the match value that the algorithms gave, and another Boolean field used to flag the name as to whether it is a good match or not. This field would be used by a reviewer and could be updated to true by a query to flag all those with a match value of say, greater than .85. Then, a query could be run to view all those with a false for a good match to determine whether they might possibly qualify as a good match.
Statistics also could be compiled from comparing original IDs to match IDs. This would be used to calculate the precision/recall percentage, and the like after the test run.
There are two functions in the database that call the algorithms, IsSimilar() and NameSimilarity(). Unfortunately, I cannot release the source code on these because I consider them proprietary and of high value. But, the source for the suite of algorithms is pure gold. Two strings to be compared are passed to the IsSimilar() function that runs the strings through the algorithm suite and returns the match value.
The NameSimilarity() function is specifically customized to deal with first, middle, and last names for matching. It calls the IsSimilar() function many times when parsing names. In the end, the NameSimilarity() function calculates the average of all the scores that the IsSimilar() function returned while parsing the two names for comparison. So, NameSimilarity() function returns the 99% precision/recall, and couldn't do so if the IsSimilar() function was not good and solid.
In addition to all this, there is the demo/workbench. The "Algorithm Suite" form will allow you to type in your own names/strings to compare and displays the match value from either the algorithm suite or an individual algorithm. You can compare two names by entering them into the first, middle, and last boxes or simply compare two other strings, such as addresses, by entering them into the two last name boxes.
There are only four algorithms called in the IsSimilar() function:
- Dice Coefficient
- Levenshtein Edit Distance
- Longest Common Subsequence
- Double Metaphone
These algorithms are used to detect a list of possible matches between a known good reference list to a questionable list of the same type. The method used for the Dice Coefficient is the linear q-gram (or n-gram) model.
Q or N would be the length of the gram.
Q-gram: monograms, bigrams, trigrams, and so forth
Example bigram for Nelson: %N Ne el ls so on n#
Public Sub Bigram(ByVal pStr, BGArray()) ' * Breaks pStr into bigram segments ' * Example: zantac - %z za an nt ta ac c# Dim X As Integer, BGCount As Integer, StrLen As Integer pStr = "%" & pStr & "#" StrLen = Len(pStr) BGCount = StrLen - 1 ReDim BGArray(BGCount) For X = 0 To BGCount - 1 BGArray(X) = Mid(pStr, X + 1, 2) Next X End Sub
A similarity metric:
- D is Dice Coefficient
- SB is Shared Bigrams
- TBg1 is total number of bigrams in Qgram1
- TBg2 is total number of bigrams in Qgram2
- D = (2SB)/(TBg1+TBg2)
Double the number of shared bigrams and divide by total number of bigrams in each string.
Example Bigram: zantac - %z za an nt ta ac c#
A good Dice Coefficient value would be a value greater than 0.33. The Dice Coefficient is used as the threshold algorithm. If the Dice value is less than 0.2, the comparisons are rejected immediately and not analyzed by the other algorithms.
Public Function Dice(ByVal Qgram1, ByVal Qgram2) As Single ' *************************** Dice Coefficient **************** ' * Similarity metric ' * D is Dice coefficient ' * SB is Shared Bigrams ' * TBg1 is total number of bigrams in Qgram1 ' * TBg2 is total number of bigrams in Qgram2 ' * D = (2SB)/(TBg1+TBg2) ' * Double the number of shared bigrams and divide by total ' * number of bigrams in each string. ' ************************************************************* Dim QgramCount1 As Integer, QgramCount2 As Integer, BG1, BG2 Dim SharedQgrams As Integer, q1 As Integer, q2 As Integer If IsNull(Qgram1) Or IsNull(Qgram2) Then Exit Function End If QgramCount1 = UBound(Qgram1) QgramCount2 = UBound(Qgram2) If QgramCount1 = 0 Or QgramCount2 = 0 Then Exit Function For q1 = 0 To QgramCount1 ' get shared bigram count BG1 = Qgram1(q1) For q2 = 0 To QgramCount2 BG2 = Qgram2(q2) If Not IsEmpty(BG1) And Not IsEmpty(BG2) Then If BG1 = BG2 Then SharedQgrams = SharedQgrams + 1 End If End If Next q2 Next q1 Dice = CSng(Format((2 * SharedQgrams) / (QgramCount1 + _ QgramCount2), "0.00")) End Function
Levenshtein Edit Distance
The Levenshtein Distance algorithm is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. The Levenshtein edit distance is the number of insertions, deletions, or replacements of single characters that are required to convert one string to the other.
Character transposition is detected. Transposition is given a cost of 1.
Transposition detection is taken from: Berghel, Hal; Roach, David:
"An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm"
A Levenshtein Distance of 2, 3, or more characters is meaningless without evaluating the lengths of the two comparison strings as well.
Public Function LevDist(ByVal pStr1, ByVal pStr2) As Integer '******************** Levenshtein Distance ********************* '* Levenshtein Distance algorithm is named after the Russian '* scientist Vladimir Levenshtein, who devised the algorithm '* in 1965 '* '* Levenshtein edit distance is the number of insertions, '* deletions, or replacements of single characters that are '* required to convert one string to the other. '* '* Character transposition is detected in Step 6A. '* Transposition is given a cost of 1. '* '* pStr1 and pStr2 are arrays. '* '* Normally Levenshtein edit distance is symmetric. '* That is, LevDist(pStr1 , pStr2) is the same as '* LevDist(pStr2 , pStr1). '* '* '* pStr1 and pStr2 are arrays. '*************************************************************** Dim n As Integer, m As Integer, matrix() As Integer Dim i As Integer, j As Integer, cost As Integer, t_j, s_i Dim above As Integer, left As Integer, diag As Integer Dim cell As Integer Dim trans As Integer ' Step 1 n = UBound(pStr1) m = UBound(pStr2) If n = 0 Then LevDist = 0 GoTo LevDistXit End If If m = 0 Then LevDist = 0 GoTo LevDistXit End If ReDim matrix(n + 1, m + 1) ' Step 2 For i = 0 To n matrix(i, 0) = i Next i For j = 0 To m matrix(0, j) = j Next j ' Step 3 For i = 1 To n s_i = pStr1(i - 1) ' Step 4 For j = 1 To m t_j = pStr2(j - 1) ' Step 5 cost = 0 If Not IsEmpty(s_i) And Not IsEmpty(t_j) Then If s_i <> t_j Then cost = 1 End If End If ' Step 6 above = matrix(i - 1, j) left = matrix(i, j - 1) diag = matrix(i - 1, j - 1) cell = Minimum((above + 1), (left + 1), (diag + cost)) ' Step 6A: Cover transposition, in addition to deletion, ' insertion and substitution. This step is taken from: ' Berghel, Hal; Roach, David: "An Extension of Ukkonen's ' Enhanced Dynamic Programming ASM Algorithm" ' (http://www.acm.org/~hlb/publications/asm/asm.html) If i > 1 And j > 1 Then ' changed here to detect transposition of first 2 ' characters and return 1 instead of 2 for a cost trans = matrix(i - 2, j - 2) + 1 If pStr1(i - 2) <> pStr2(j - 1) Then ' was t_j trans = trans + 1 End If If pStr1(i - 1) <> pStr2(j - 2) Then ' was s_i trans = trans + 1 End If If cell > trans Then cell = trans End If End If matrix(i, j) = cell Next j Next i ' Step 7 LevDist = matrix(n, m) LevDistXit: End Function