Help needed implementing fuzzy logic

  • Thread starter: Tony Ciconte

Tony Ciconte

I would like to provide the user with the ability to identify
duplicate records based on a person's first and last name. This works
OK as long as the spelling of both names matches. However, if the user
types a person's name slightly differently, it gets around my duplicate
checking code. For example, I would like to catch that Robert Smith
may be the same person as Rob Smith or Bob Smith or Robt. Smith.

I have recently investigated using a Soundex-type algorithm to do this
fuzzy matching. There are numerous algorithms available, but I have been
unable to find any instructions or code that show how you
actually implement this process.

Any and all help or guidance is greatly appreciated.
 
I can't help you directly with your question, but here is a note of
caution.
You have no guarantee that Bob Smith, Rob Smith, Robert Smith and
Robt. Smith are the same person. (And is Job Smith simply a
misspelled Bob or a different person?) Indeed, any search for a
person by name is liable to return multiple, misidentified records.

Search for the person by the NameID field, not the Name.
You can use a Combo Box to narrow the search to the correct Smith by
including other identifying fields besides the Name, such as Address,
Social Security Number, Work Department, etc.
Robert Smith on Elm St. is not the same person as Bob Smith on Oak
Ave.
If the Combo Box is bound to the NameID field and you use the Street
to narrow the search, only the correct Smith's records will be
returned.
 
Email me if you would like a small A97 mdb showing techniques using double
metaphone. It converts to A2K and A2K as is.

Clive
 
Tony Ciconte said:
I would like to provide the user with the ability to identify
duplicate records based on a person's first and last name. This works
OK as long as the spelling of both names matches. However, if the user
types a person's name slightly differently, it gets around my duplicate
checking code. For example, I would like to catch that Robert Smith
may be the same person as Rob Smith or Bob Smith or Robt. Smith.

Not exactly what you are looking for but another possibility.
Create Families from Volunteers User Interface
http://www.granite.ab.ca/access/familiesui.htm

Tony

--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can
read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at
http://www.granite.ab.ca/accsmstr.htm
 
I did not mention in my original post that we do a lot more checking
after the name fields are entered. Specifically, once we have a match,
we alert the user and display all the records that match this
first/last name combination including all the address and personal
info (except SSN) you listed.

The whole purpose of my question was to help us catch more of these
potential dups and let the user decide, with appropriate address, etc.
information displayed, whether it is a dup or not.

Thanks for your help.

TC
 
Soundex was designed to convert "American Last Names" that "Sound
the Same" to a common code. It was used for the American census.
It was specifically designed to handle the case where a last
name could be spelled a few different ways by a census worker.

To use Soundex or Metaphone or any of the variants, you add an
extra field or two or four, and store the Soundex/Metaphone
code(s) in the field(s). Then when you want to check for a
match, you compare the code fields instead of comparing the
names. Because first name/last name confusion is common in
data entry, you might also want to check the "first name code(s)"
against the "last name code(s)". Just run the names through
the Soundex/Metaphone to get the codes for storage and comparison.
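
As a rough illustration of the store-and-compare approach described above, here is a minimal sketch of standard American Soundex in Python (in Access you would port the same logic to a VBA function). The helper names `name_codes` and `is_possible_duplicate` are made up for this example, and the swapped first/last check is just one way to handle the name-order confusion mentioned above.

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    # Keep only the letters A-Z, so "Robt." is treated as "ROBT".
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels and Y have no code
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]  # pad with zeros, keep four characters


def name_codes(first: str, last: str) -> tuple:
    """Codes you would store in the extra fields next to the names."""
    return soundex(first), soundex(last)


def is_possible_duplicate(a: tuple, b: tuple) -> bool:
    """Compare stored code pairs, also allowing first/last names swapped."""
    return a == b or a == (b[1], b[0])


print(soundex("Smith"), soundex("Smyth"))   # S530 S530
print(soundex("Robert"), soundex("Rob"))    # R163 R100
```

Note that "Robert" and "Rob" get different codes, so this only catches spelling variants that sound alike, not nicknames.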

You can customize the Soundex/Metaphone algorithms to return
a number of different codes (double Metaphone), or to return
codes that are more alike. Either way, you will always get both
false positives (too many matches) and false negatives (not
enough matches).

Soundex is not very good for 'non-American' last names, so for
those you would use a variant algorithm. You might also use an
'all numeric' variant instead of a Soundex Code, just so that you
could store the result in a numeric field.
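
One possible 'all numeric' encoding, sketched in Python purely as an assumption (there is no standard for this): replace the leading letter of a four-character Soundex code with its position in the alphabet and keep the three digits. The function name `soundex_to_int` is hypothetical.

```python
def soundex_to_int(code: str) -> int:
    """Turn a code like 'S530' into an integer like 19530.

    Assumed encoding: (letter position 1-26) * 1000 + three-digit part.
    """
    code = code.strip().upper()
    if len(code) != 4 or not ("A" <= code[0] <= "Z") or not code[1:].isdigit():
        raise ValueError("expected a letter followed by three digits")
    return (ord(code[0]) - ord("A") + 1) * 1000 + int(code[1:])


print(soundex_to_int("S530"))  # 19530
print(soundex_to_int("R163"))  # 18163
```

Two names then match when their integers are equal, exactly as with the text codes.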

Unfortunately, "Robert" and "Rob" do not "Sound the Same", so
Soundex/Metaphone is not going to do a good job matching those
names. A Spell Check is not going to be very good either:
spell check is designed to look for common spelling errors.

If you want to do a good job of first name matching, you need
to use code specifically written for that task (regular expression
matching with large exception lists). Unfortunately there does
not seem to be any easily available: just don't expect too much
from Soundex/Metaphone.
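
To give an idea of what an exception list looks like, here is a much simpler stand-in than the regular-expression approach mentioned above: a plain lookup table from nickname or abbreviation to a canonical first name. It is a Python sketch only; the table contents and the names `NICKNAMES`, `canonical_first_name` and `same_first_name` are made up for illustration, and a real list would need to be far larger.

```python
# Hypothetical exception list: nickname or abbreviation -> canonical name.
NICKNAMES = {
    "rob": "robert",
    "bob": "robert",
    "robt": "robert",
    "bill": "william",
    "will": "william",
}


def canonical_first_name(name: str) -> str:
    """Lower-case the name, drop a trailing period, then look it up."""
    key = name.strip().rstrip(".").lower()
    return NICKNAMES.get(key, key)  # unknown names map to themselves


def same_first_name(a: str, b: str) -> bool:
    """True when both names reduce to the same canonical first name."""
    return canonical_first_name(a) == canonical_first_name(b)


print(same_first_name("Robert", "Bob"))    # True
print(same_first_name("Robt.", "Rob"))     # True
print(same_first_name("Robert", "Bill"))   # False
```

You could store the canonical form in its own field and compare that alongside a Soundex code for the last name.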

(david)
 
Tony Ciconte said:
I did not mention in my original post that we do a lot more checking
after the name fields are entered. Specifically, once we have a match,
we alert the user and display all the records that match this
first/last name combination including all the address and personal
info (except SSN) you listed.

The whole purpose of my question was to help us catch more of these
potential dups and let the user decide, with appropriate address, etc.
information displayed, whether it is a dup or not.

Another option, which won't help the Robert/Bob problem, is to try
searching on just the first few letters of the first name and last
name. With one client's database of 10,000 names, even "jo sm" had
surprisingly few entries.
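
The prefix search can be sketched like this in Python; in an Access query the equivalent would be a Like criterion such as "jo*" on each name field. The function name `prefix_matches` and the sample records are made up for illustration.

```python
def prefix_matches(records, first_prefix, last_prefix):
    """Return (first, last) records whose names start with the prefixes.

    The comparison ignores case and surrounding spaces in the prefixes.
    """
    fp = first_prefix.strip().lower()
    lp = last_prefix.strip().lower()
    return [
        (first, last)
        for first, last in records
        if first.lower().startswith(fp) and last.lower().startswith(lp)
    ]


# Made-up sample data.
people = [("John", "Smith"), ("Joan", "Smythe"),
          ("Bob", "Smith"), ("Joe", "Jones")]
print(prefix_matches(people, "jo", "sm"))
# [('John', 'Smith'), ('Joan', 'Smythe')]
```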

Tony
 