Dealing with Duplicates after the fact

  • Thread starter Thread starter Robert Neville
  • Start date Start date
R

Robert Neville

My database serves as a complex contact manager, which no other
commercial product emulates. Even though my database has been
normalized, inconsistent data imports have created duplicate address
records due to misspelling and other idiosyncrasies. Multiple field
index help to prevent these occurrences, yet duplicates still occur.
Examples follow.

1501D Cloverfield Boulevard
1501-D Cloverfield Boulevard
2550 Broadway Avenue
2550 Broadway Street

Now, my challenge becomes correcting these idiosyncrasies after one
import them (I am doing stuff before the import, yet cannot completely
prevent it; see example). But the scenario is not as straight forward
as a Find Duplicate or No Match query. Addresses reside in their own
table, tblAddr, attached to the company table through a relationship
table, trelCompAddr; and don't forget the main table tblComp. In my
eyes, this situation adds to the complexity of the resolution. A query
with three tables remains non-editable.

One may suggest placing the address information with the Company
table. This resolution does not suffice since my Project (tblProj) and
Contact (tblCont) table utilize the same addresses.

Let me know if you have any additional insights since I am not query
or normalization expert. Links to examples and suggestions would aid
a great deal.
 
Hi Robert,

You're actually in a pretty strong position compared with most people
facing this impossible task: all your addresses are in one table, and
you know about string manipulation, regular expressions, and (if I
remember right) standardising or expanding abbreviations. You're also
using Access/Jet, which means it's dead easy to use custom VBA functions
in queries.

The general approach is to use the same sort of query you'd use to
return duplicate values - but instead of looking for duplicate values in
the fields themselves you use functions in the queries to transform the
actual values and compare the results. So to find ones like this
2550 Broadway Avenue
2550 Broadway Street
you'd just dump the last word and compare the remainder:

'Code starts-----------------------------------------
Public Function NearDupe1( _
ByVal First As Variant, _
ByVal Second As Variant) As Boolean
'Tests whether street number and name are the
'same, ignoring the kind of street
'Demo version only; assumes that the data passed
'matches
' ^\d+ (\w+ )+\w+$
'so does not handle addresses with no number,
'suite/apt numbers or other complications

Dim lngPos As Long

'If an argument is Null, then no match
If IsNull(First) Or IsNull(Second) Then
NearDupe1 = False
Exit Function
End If

'Trim lines at last space
lngPos = InStrRev(First, " ")
If lngPos > 0 Then
First = Left(First, lngPos - 1)
End If
lngPos = InStrRev(Second, " ")
If lngPos > 0 Then
Second = Left(Second, lngPos - 1)
End If

NearDupe1 = (First = Second)
End Function
'Code ends-----------------------------------

and then use a query along these lines to return the potential
duplicates:

SELECT A.ID, A.Street, A.City
FROM AddrNew As A INNER JOIN AddrNew As B
ON ((NearDupe1(A.Street, B.Street) = True)
AND (A.City = B.City) AND (A.ID <> B.ID));




My database serves as a complex contact manager, which no other
commercial product emulates. Even though my database has been
normalized, inconsistent data imports have created duplicate address
records due to misspelling and other idiosyncrasies. Multiple field
index help to prevent these occurrences, yet duplicates still occur.
Examples follow.

1501D Cloverfield Boulevard
1501-D Cloverfield Boulevard
2550 Broadway Avenue
2550 Broadway Street

Now, my challenge becomes correcting these idiosyncrasies after one
import them (I am doing stuff before the import, yet cannot completely
prevent it; see example). But the scenario is not as straight forward
as a Find Duplicate or No Match query. Addresses reside in their own
table, tblAddr, attached to the company table through a relationship
table, trelCompAddr; and don't forget the main table tblComp. In my
eyes, this situation adds to the complexity of the resolution. A query
with three tables remains non-editable.

One may suggest placing the address information with the Company
table. This resolution does not suffice since my Project (tblProj) and
Contact (tblCont) table utilize the same addresses.

Let me know if you have any additional insights since I am not query
or normalization expert. Links to examples and suggestions would aid
a great deal.

John Nurick [Microsoft Access MVP]

Please respond in the newgroup and not by email.
 
1) Don't try to be too clever: If you can work out a slow
solution using code, don't think you still have to work out
an optimal solution using just SQL.

A query with three tables remains non-editable.

Generally, because you have mixed up the indexes or
foreign relationships. In a Jet Database (why don't
they tell you this in the help files?), a query is
not editable unless the joined fields are indexed.
A relationship creates an index, so you don't see
this problem if you have relationships. If you have
a multi-field index, the join has to include the
first field of the index. If you have multi-field
indexes on multiple tables, the fields have to be
in the same order in both indexes.

Jet has to be able to identify WHICH record you are
editing, USING THE INDEX, so there is also a problem
if you use DISTINCT, particularly with Many to Many
joins..... (sometimes this means also including the
primary key in the select query, just so that JET
has sufficient information, even if you don't need to
know the index value yourself)

(david)
 
John,

Sorry for the late response. My motherboard failed on me. The
troubleshooting process took a greater portion of last week. Hardware
issues are not always easy to verify. The situation still has me
irate.

My question refers more to relationship tables rather than data in an
imported table. In other words, the data has already found itself into
the main tables. Now, I have to fix inconsistencies in the main
tables, which involves dealing with indexes and relationship tables.

Your suggestion is novel, yet does not account for other
idiosyncrasies like the such.

1501D Cloverfield Boulevard
1501-D Cloverfield Boulevard
1501 Cloverfeild Boulevard

These inconsistencies are not always predictable, because they involve
user input errors, which exist at the data source where I have no
control over. Please let me know if you any more suggestions.

David,

Thanks for joining the thread. You raise some interesting points about
indexes. You seem to understand the challenge dealing with this
situation. Let me give more details about the relationship in case you
could add to your insight mentioned in your reply.


TABLE
tblComp trelCompAddr tblAddr

FIELDS
CompID CompID AddrName
CompName AddrID Addr1
CompTypeID Addr2 City
Sector AddrTypeID State
Phone PostalCode
Phone2 Country
Fax2
Email
 
Let's not get ahead ourselves with an expensive commercial pattern
matching solutions. Remember this development is more personal in
nature. This situation is a more manually endeavor from my
perspective. The regular expression module, which you coached me on,
performs well. Yet they may introduce few if any duplicates about 10
for every 500 data imports. The main issue lies in data already in
the table about 15 duplicates in a thousand records (not too bad).

I need insight on creating a manual interface to change company
addresses (for current and future need). The ambiguity of the
situation stems from the complicated relationship from Company records
to address records. MS Access does not perform intuitively in this
design situation. The challenge lies with MS Access idiosyncrasies
with indexes, relationship tables, and editable queries. Ideally, one
could create an editable query with three tables and sorted by
addresses; thus allows the user to easily change the AddrID in
relationship to the CompID; yet knowing the CompName behind the
CompID.

Let use this analogy. My car has encountered a flat tire, which needs
repair. I need to change the flat tire so I could drive to a serve
station and have it fixed. Rather than fix the flat on the spot
without the proper tools; or scour the roads for puncture items (which
are viable solutions if you have the means or time).


Robert,

This is getting into the territory of commercial address-matching and
de-duping products, which sell for thousands if not tens of thousands of
dollars. AFAIK these firms throw a wide variety of techniques into the
pot including regex matching, transforms (as in the far-from-novel
suggestion in my previous post), and "dictionaries" of streets, names,
etc. They also use algorithms to estimate the reliability of the
approximate matches they identify.

For your example below, for instance, one can be virtually certain that
1501D and 1501-D are the same place unless (a) there's more than one
Cloverfield Boulevard in the city and/or ZIP code area; or (b) there's
also a 15010 Cloverfield Boulevard, in which case the D could be an OCR
error; and so on.

Here are a couple of articles that deal with approximate matching:
http://www.merriampark.com/ld.htm
http://groups.google.co.uk/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=OpdWwX2dBHA.2156@tkmsftngp03


On Tue, 08 Jul 2003 23:36:58 GMT, Robert Neville

John,

Sorry for the late response. My motherboard failed on me. The
troubleshooting process took a greater portion of last week. Hardware
issues are not always easy to verify. The situation still has me
irate.

My question refers more to relationship tables rather than data in an
imported table. In other words, the data has already found itself into
the main tables. Now, I have to fix inconsistencies in the main
tables, which involves dealing with indexes and relationship tables.

Your suggestion is novel, yet does not account for other
idiosyncrasies like the such.

1501D Cloverfield Boulevard
1501-D Cloverfield Boulevard
1501 Cloverfeild Boulevard

These inconsistencies are not always predictable, because they involve
user input errors, which exist at the data source where I have no
control over. Please let me know if you any more suggestions.

David,

Thanks for joining the thread. You raise some interesting points about
indexes. You seem to understand the challenge dealing with this
situation. Let me give more details about the relationship in case you
could add to your insight mentioned in your reply.


TABLE
tblComp trelCompAddr tblAddr

FIELDS
CompID CompID AddrName
CompName AddrID Addr1
CompTypeID Addr2 City
Sector AddrTypeID State
Phone PostalCode
Phone2 Country
Fax2
Email

John Nurick [Microsoft Access MVP]

Please respond in the newgroup and not by email.
 
Robert, I don't see an AddrID field in tblAddr. Is this
a transcription error?

(david)
 
Robert, I don't see an AddrID field in tblAddr. Is this
a transcription error?
Yeah, it's a typo. AddrID does exist in the tblAddr table.


TABLE
tblComp trelCompAddr tblAddr

FIELDS
CompID CompID AddrID
CompName AddrID AddrName
CompTypeID Addr2 Addr1
Sector AddrTypeID City
Phone State
Phone2 PostalCode
Fax2 Country
Email
 
There should be a unique index on tblComp.CompID,
on tblAddr.AddrID, and on tblCompAddr(CompID, AddrID)

I have done this by creating primary keys on those
field in the tables. I have also created a relationship
between the Comp and CompAddr tables, on CompID,
and a relationship between the CompAddr and Addr
tables, on AddrID (this creates foreign key indexes).

In this structure,
One company may have many addresses,
One Address may have many companies,
Each company can have each address only once.

I have created a query with inner joins between
Comp and CompAddr (on CompID) and between CompAddr
and Addr (on AddrID), showing all the fields in
all three tables.

All the fields in my query are updatable: I can update
the address or the company text, or I can change the
companies associated with an address, or the addresses
associated with the companies (by updating the ID fields).


(david)





If you create a query on the three tables,
 
Back
Top