Duplicate detection (looking for code)

  • Thread starter Thread starter Mark Andrews
  • Start date Start date
M

Mark Andrews

I would like to add code to detect for duplicates or records with similar
information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for exact
matches.

Wondering if anyone has some code they would care to share that might make
my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark
 
You may not have received any responses yet because what you are proposing
is not particularly simple.

.... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith" as
duplicating "Bobby Smith" when your database could legitimately contain two
separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the same
date!)

I suspect you'll have to create your own code that tells Access exactly when
and how to consider two records to be close enough to be a match ... and you
might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.
 
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying here
are some similar contacts "just to make sure this is not a duplicate contact
being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main Street
in Pittsburgh they would choose to quit entering data for this new contact
because they know that they are entering a duplicate or alternately they
could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and yes
it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort of
character by character comparison (if 90% of the characters match consider
it a "potential duplicate"). I need some sort of "Similar" function.

I don't think there are any potentially serious flaws with my analysis, my
analysis at this point is "hey this might be a little work, I wonder if
anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark
 
Mark

Just for the record, the folks who read and write here in the newsgroups are
not all MVPs ... and some of the best answers I've seen come from folks who
aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words (e.g.,
names) sound to compare them. Words with similar soundex scores sounds
similar. This could help with last names and street names, but I don't see
it helping with Bobby vs. Robert, or with all the embellishments that
addresses have. Again, you'd need to tell Access that Bobby and Robert are
(sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname - delivery
address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in Access
comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds entries, a
couple thousand, or a couple hundred thousand. The approach you take may
need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP
 
Jeff,

Ok searching on Soundex I think will help me dig up some code or flush out
the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will have
5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark
 
Mark

Check on-line for Allen Browne's website. He has a routine that helps limit
the number of records a combobox has to pull down by "waiting" for the first
"n" letters to get entered. This would be useful if you followed the
combobox and concatenation route.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.
 
I searched a little, my best lead right now is the code I found at:
http://www.kdkeys.net/forums/thread/6450.aspx

You do need to sign up to the forum to download it.

It's an MDE but it looks like he included the source code for most of the
fuzzy logic search algorithms.

algorithms included
- Levenshtein Edit Distance
- Dice Coefficient
- Longest Common Subsequence
- Double Metaphone

I also read soundex2 is good (soundex is a little too general).

I am by no means an expert but I did do a little searching and if you want
to do fuzzy matching of some
sort I guess you need to jump into this stuff. Perhaps even store some
algorithm results so at runtime
you can compare faster to the thousands of records you have in the db.

Maybe this will help someone else out?
Mark
 
Thanks for posting back what you found. That will undoubtedly help someone
in their (future) search.

Be aware that "thousands" of records in a combobox leads to poor response
time. Allen B's approach speeds that up considerably.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.
 
No problem. Hope it ends up working. The word Soundex got me started down
the right path.
"Fuzzy logic vba" is also a good search keyword.
 
After all, can YOU be sure that all of the following are duplicates?:

And what about my friends, Fred Brown and Fred Brown? Young Fred is no longer
living at home, but he was when I first met him...
 
Back
Top