Help needed implementing fuzzy logic

  • Thread starter: John Moore

John Moore

I would like to provide the user with the ability to identify
duplicate records based on a person's first and last name. This works
OK as long as the spelling of both names matches. However, if the user
types a person's name slightly differently, it gets around my duplicate
checking code. For example, I would like to catch that Robert Smith
may be the same person as Rob Smith or Bob Smith or Robt. Smith.

I have recently investigated using a Soundex type algorithm to do this
fuzzy logic. There are numerous algorithms available but I have been
unable to find any instructions and/or code that describes how you
actually implement this process.

Any and all help or guidance is greatly appreciated. If you wish to
respond to me via email, remove the .invalid from the end.

John Moore
DSI
 
Thanks for the suggestion. I forgot to mention that I did that before
posting to the newsgroup and had no luck with the specific question I
asked in any of the items I was able to find.
 

I found the code below somewhere, a while ago, and stashed it away
(untested). Unfortunately I can't remember where I found it, or the
author's name for attribution. Obviously you can't claim this code as
your own or sell it by itself, but since it was made public I assume you
can use it freely in your applications.

'------ start of code (watch for line wraps) ------
Function Soundex(Name As String) As String

    ' Implements the SOUNDEX algorithm, reasonably efficiently.
    ' Uses an array lookup to get the Soundex code digits and a
    ' relatively fast loop to scan the name.

    ' Returns an empty string if the name does not start with a letter
    ' A-Z, or if it contains any character other than A-Z, ', -, or
    ' <Space>. Those three punctuation characters are ignored (i.e. the
    ' characters on either side of them are treated as directly adjacent).

    Static CodeLookup As Variant
    Static InitDone As Boolean
    Dim SoundTemp As String
    Dim NameTemp As String
    Dim ThisVal As Integer
    Dim PrevVal As Integer
    Dim ThisChar As Integer

    If Not InitDone Then
        ' Codes for A..Z in order: 0 = vowel or Y, -1 = H or W
        CodeLookup = Array(0, 1, 2, 3, 0, 1, 2, -1, 0, 2, 2, 4, 5, 5, 0, 1, _
                           2, 6, 2, 3, 0, 1, -1, 2, 0, 2)
        InitDone = True
    End If ' only need to do this once

    NameTemp = UCase$(Trim$(Name))
    Soundex = vbNullString
    If Len(NameTemp) = 0 Then Exit Function

    ThisChar = Asc(NameTemp)
    ' First character of name must be alpha. A plain A-Z range check is
    ' used instead of the IsCharAlpha API call, which would need its own
    ' Declare and also accepts accented letters that are not in the table.
    If ThisChar < 65 Or ThisChar > 90 Then Exit Function
    SoundTemp = Mid$(NameTemp, 1, 1)
    NameTemp = Mid$(NameTemp, 2)
    ' The table holds 26 entries, so index from LBound; this maps A to
    ' the first entry whether or not the module uses Option Base 1.
    PrevVal = CodeLookup(ThisChar - 65 + LBound(CodeLookup))

    While Len(NameTemp) > 0 And Len(SoundTemp) < 4
        ThisChar = Asc(NameTemp)

        If ThisChar >= 65 And ThisChar <= 90 Then
            ThisVal = CodeLookup(ThisChar - 65 + LBound(CodeLookup))
        ElseIf ThisChar = 32 Or ThisChar = 39 Or ThisChar = 45 Then
            ' embedded hyphens, apostrophes, and spaces are treated
            ' like H or W
            ThisVal = -1
        Else
            Exit Function ' invalid character in name
        End If

        If ThisVal = PrevVal Or ThisVal = 0 Then
            ' do nothing
        ElseIf ThisVal = -1 Then
            ThisVal = PrevVal ' H, W, and punctuation are totally "silent"
        Else
            SoundTemp = SoundTemp & CStr(ThisVal)
        End If

        PrevVal = ThisVal
        NameTemp = Mid$(NameTemp, 2)
    Wend

    ' Pad short codes with zeros to the standard four characters
    While Len(SoundTemp) < 4
        SoundTemp = SoundTemp & "0"
    Wend

    Soundex = SoundTemp

End Function
'------ end of code ------
 
Soundex wasn't designed for quite this purpose and might give you a lot
of false matches (for example, it gives Robert and Rupert the same code,
and likewise Smith and Smythe). It also puts a lot of weight on the first
character, so it doesn't match Rob and Bob.

The Metaphone algorithm might get you a bit closer. There used to be a
DLL downloadable from
http://www.programmersheaven.com/zone15/cat161/2902.htm

Another approach would be to use Levenshtein distance (the distance
between two strings is the number of single-character edits needed
to convert one into the other). Measured from Robert Smith:
    Rob Smith    3
    Bob Smith    4
    Robt. Smith  2 (if you strip out the punctuation)
There's an algorithm with VB implementation at
http://www.merriampark.com/ld.htm#VB
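
For anyone who wants to check the distances quoted above, here is a minimal sketch of the standard dynamic-programming Levenshtein calculation. It is written in Python rather than VB so it can be run as-is, and the names levenshtein and strip_punctuation are made up for this illustration. It is not the code from the linked page.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


def strip_punctuation(name: str) -> str:
    """Hypothetical helper: keep only letters, digits, and whitespace."""
    return "".join(ch for ch in name if ch.isalnum() or ch.isspace())


if __name__ == "__main__":
    target = "Robert Smith"
    for candidate in ["Rob Smith", "Bob Smith", "Robt. Smith"]:
        cleaned = strip_punctuation(candidate)
        print(candidate, levenshtein(target, cleaned))
```

The comparison is case-sensitive, so you would probably want to upper-case both names first. You would also have to pick a cut-off distance for "probably the same person", and that threshold is a judgment call the thread does not settle.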
 
John Moore wrote:
"I have recently investigated using a Soundex type algorithm to do this
fuzzy logic. There are numerous algorithms available but I have been
unable to find any instructions and/or code that describes how you
actually implement this process."


The general idea is to transform both strings (names, or whatever you
are comparing) with Soundex (or NYSIIS, Metaphone, Double Metaphone,
etc.) and then check the resulting codes for an exact match.

-Will Dwinnell
http://will.dwinnell.com
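
The transform-then-compare idea above can be sketched roughly as follows, in Python rather than VB so it can be run directly. The soundex function follows the same coding rules as the VB listing earlier in the thread (H and W are silent, a vowel separates repeated codes), with one simplification that is my own assumption: it skips non-letters instead of rejecting the name. The function candidate_duplicates and the sample records are hypothetical.

```python
from collections import defaultdict

# Soundex digit per letter: "0" marks vowels and Y, "-" marks the silent H and W.
_CODES = {}
for _letters, _digit in [("AEIOUY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                         ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
                         ("HW", "-")]:
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if name has no A-Z letters."""
    letters = [ch for ch in name.upper() if ch in _CODES]  # drop everything else
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES[letters[0]]
    for ch in letters[1:]:
        code = _CODES[ch]
        if code == "-":
            continue  # H and W are silent: prev is left unchanged
        if code != "0" and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated consonant is coded again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def candidate_duplicates(people):
    """Group (first, last) pairs whose first- and last-name codes both match."""
    groups = defaultdict(list)
    for first, last in people:
        groups[(soundex(first), soundex(last))].append((first, last))
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    people = [("Robert", "Smith"), ("Rupert", "Smythe"),
              ("Rob", "Smith"), ("Bob", "Smith")]
    for group in candidate_duplicates(people):
        print(group)
```

With these sample records, the only group reported is Robert Smith with Rupert Smythe. Rob Smith and Bob Smith each get a different first-name code, so an exact match on codes alone will not catch nicknames.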
 
John Nurick said:
The Metaphone algorithm might get you a bit closer. There used to be a
DLL downloadable from
http://www.programmersheaven.com/zone15/cat161/2902.htm


Double Metaphone would be even better. See http://aspell.sourceforge.net/metaphone/

-Lawrence Philips
Dolby Laboratories
 