De-Duping Lists

Allison · Dec 5, 2003

I need to de-dupe an outside mailing list against our
house membership base by company name. However, as the
outside list often has a variation of the company name we
have on file(i.e. missing the inc. or ltd.)the find
duplicates query isn't picking them up.

Is there a SQL script or a Union Query that can find
duplicates using 'any part of field' as opoosed to whole
field?

Any help would be greatly appreciated!

Douglas J. Steele · Dec 5, 2003

There isn't anything built into Access to do such a thing. I believe there
may be some 3rd party tools that may assist in this regard, but I don't have
the names of any of them. It's really not a trivial exercise!

Dirk Goldgar · Dec 6, 2003

Allison said:
I need to de-dupe an outside mailing list against our
house membership base by company name. However, as the
outside list often has a variation of the company name we
have on file(i.e. missing the inc. or ltd.)the find
duplicates query isn't picking them up.

Is there a SQL script or a Union Query that can find
duplicates using 'any part of field' as opoosed to whole
field?

Any help would be greatly appreciated!

I've done a fair amount of "de-duping" in my time, and it's always a
pain unless you're only seeking to remove exact duplicates. The degree
of success you can achieve depends on the nature of the variations that
you can consider "duplicates". You can certainly write a query that
looks for matches on one key being contained in the other. For example,
here's some "air SQL":

SELECT
OurList.*,
TheirList.*
FROM
OurList INNER JOIN TheirList
ON OurList.CompanyName Like '*' & TheirList.CompanyName &
'*'

I don't guarantee that will work -- you may have to use the InStr
function instead -- but I hope you get the idea.

Where possible, I have found it easiest to write a function that
transforms the subject field into a "lowest common denominator"; for
example, by removing insignificant text and punctuation marks so that
only the essentials are compared. For example, for company names you
might write a function that removes words like "The", "Ltd.", "Inc.",
"Corp." -- and the expanded forms of the abbreviations as well -- and
returns only what's left. Then you'd apply this function to the
CompanyName field in each table and join the tables on the function
results.

In any case, you still have to examine the candidate "dupes" returned by
the function and determine which are truly duplicates.

De-Duping Lists

Allison

Douglas J. Steele

Dirk Goldgar