How to find similar records?

  • Thread starter Thread starter ZenMasta
  • Start date Start date
Z

ZenMasta

Is there a way to find similar records where the fields are not always
exactly the same?

I thought about using the duplicates wizard but that wont work because
sometimes there are slight variations and I don't knowhow to make the
duplicate wizard not look for something exactly.

example:
biznissname 1, 123 Maple Rd. Sufolk, VA 12345 800-123-4568 831-465-1234
company one, 123 Maple Rd. Suffolk, VA 12345 800-123-4568 831-465-1234
company 1, 123 Maple Rd Suffolk, VA 12344 800-123-4567
831-465-1234

I have lots of fields I can use for comparison.
Mostly address, city state zip, phone, fax so basically if I can compare
using the other fields an employee can use good judgement or call the
customer to determin/confirm etc.
I'm not looking for an automatic way to combine the records. I just need
help finding what are probably duplicates.Tthe address/phone info will be
enough to help us decide if it is really a different customer or maybe they
just changed the name (or maybe we spelled it wrong etc)

Notice 2 of them have periods in the address, one of them spells the city
wrong, and one has a different zipcode, and another has a slightly different
phone number
All of this data is in one table.

Thanks :)
 
I forgot to mention I'm using access 2002 if that makes much difference. I
could probably use 2007 on a friends computer if that would help at all.
 
Thanks for that query Ken,
I've modified it to match my table/column names. However I'm not sure if it
is working or what.

First it seems as if I ran it about 10 minutes ago and I am unable to select
any columns (to try and sort) or use the scroll bar to browser the list.

For example if I select the State column and then try to move the cursor
outside the window it just turns into a text icon instead of an arrow, or an
hour glass. It seems pretty unresponsive. I eventually had to force quit
access because I was unable to use other forms in my DB.
 
I'm kind of a newbie here. What do you mean indexed?
Each record has a primary key.

Maybe I should also run the query on the computer hosting the MDB instead of
from the network share?

KenSheridan via AccessMonster.com said:
The EXISTS predicate is very efficient so it should perform well enough.
I've tried it against a table of 7500 rows, including the same number of
subqueries as in my example and the result was near instantaneous.

One thing which will make a difference is indexing, so make sure all the
columns on which you are comparing rows are indexed.

Ken Sheridan
Stafford, England
Thanks for that query Ken,
I've modified it to match my table/column names. However I'm not sure if
it
is working or what.

First it seems as if I ran it about 10 minutes ago and I am unable to
select
any columns (to try and sort) or use the scroll bar to browser the list.

For example if I select the State column and then try to move the cursor
outside the window it just turns into a text icon instead of an arrow, or
an
hour glass. It seems pretty unresponsive. I eventually had to force quit
access because I was unable to use other forms in my DB.
If you can work out a set of different permutations of columns you can
use
to
[quoted text clipped - 81 lines]
 
Splitting it in this way is how it should be done.

I totally agree with that statement. Of all the tips I've gotten from
the Access newsgroups, splitting the database has saved me the most
grief. BTW, Happy Christmas Ken.

James A. Fortune
(e-mail address removed)

My pool highlights for 2009:

2009 State BCA Championships: Top eight Nineball Open
2009 State BCA Championships: Top eight Eightball Open
2009 VNEA Int'l Championships: Top 16 Scotch Doubles Open
 
My DB is split already. I was running the query from the FE on my computer
rather than on the backend directly...

I decided to copy the backend and put it on my desktop and run the query
directly from my computer. Plus I didn't want to potentially cause a problem
with any live data.

I went ahead and set all the columns except primrary key of course to
indexed (yes allow duped). primary key was already index with no dupes.

After doing so and running the query it returned zero results. Here you can
see the table design view and how I set all the fields to yes allow dupes
http://buggyonpurpose.com/random/table-design-view.gif

Merry Christmas to you all too!
 
Well, I'm wasn't 100% that there are any REAL duplicates but I figured it
might atleast find similar companies in the same area.

We have lots of Churches that have the same name or even different names but
on the same street (just different building numbers) so I thought it might
atleast find those.

Since I'm working on copy of the backend I copied a a record so everything
was exactly the same except the contact id, it did not find that.

I just slightly modified your SQL to match my field names...

SELECT tblContactData.ChurchAddress, tblContactData.ChurchCity,
tblContactData.ChurchState, tblContactData.ChurchZip,
tblContactData.ChurchPhone, tblContactData.ChurchFax,
tblContactData.ContactID, tblContactData.ChurchName,
tblContactData.FirstName, tblContactData.LastName, tblContactData.DCreated
FROM tblContactData
WHERE (((tblContactData.ChurchAddress) In (SELECT [ChurchAddress] FROM
[tblContactData] As Tmp GROUP BY
[ChurchAddress],[ChurchCity],[ChurchState],[ChurchZip],[ChurchPhone],[ChurchFax]
HAVING Count(*)>1 And [ChurchCity] = [tblContactData].[ChurchCity] And
[ChurchState] = [tblContactData].[ChurchState] And [ChurchZip] =
[tblContactData].[ChurchZip] And [ChurchPhone] =
[tblContactData].[ChurchPhone] And [ChurchFax] =
[tblContactData].[ChurchFax])))
ORDER BY tblContactData.ChurchAddress, tblContactData.ChurchCity,
tblContactData.ChurchState, tblContactData.ChurchZip,
tblContactData.ChurchPhone, tblContactData.ChurchFax;
 
Well, when I copied yours I did a find replace for the fields you had with
the ones I did, that's the only thing I can recall doing.

Maybe I replaced too many things. trying again


KenSheridan via AccessMonster.com said:
I think you've got your wires crossed at some stage. That doesn't bear any
relationship to the query I posted. It's a standard 'find duplicates'
query,
but only on the basis of there being exact duplication on the complete set
of
columns by which the subquery is grouped rather than on any of a number of
differing combinations of subsets of the columns. I can only assume
you've
obtained this from some other source, but somewhere along the line its
become
confused with the one I posted. This was my original example query:

SELECT * FROM Customers AS C1
WHERE EXISTS
(SELECT *
FROM Customers AS C2
WHERE C2.CustomerID <> C1.CustomerID
AND C2.Address = C1.Address
AND C2.PhoneNumber = C1.PhoneNumber)
OR EXISTS
(SELECT *
FROM Customers AS C2
WHERE C2.CustomerID <> C1.CustomerID
AND C2.Address = C1.Address
AND C2.City = C1.City )
OR EXISTS
(SELECT *
FROM Customers AS C2
WHERE C2.CustomerID <> C1.CustomerID
AND C2.Address = C1.Address
AND C2.ZipCode = C1.ZipCode)
OR EXISTS
(SELECT *
FROM Customers AS C2
WHERE C2.CustomerID <> C1.CustomerID
AND C2.City = C1.City
AND C2.PhoneNumber = C1.PhoneNumber);

In this duplication is determined by any of the subqueries. My original
post
has a fuller explanation. You'll need to adapt this to your table and its
columns, once you have decided what possible combinations of columns might
indicate that the row duplicates the same church as in at least one other
row.
You can of course have fewer or more subqueries as appropriate to your
circumstances.

Ken Sheridan
Stafford, England
Well, I'm wasn't 100% that there are any REAL duplicates but I figured it
might atleast find similar companies in the same area.

We have lots of Churches that have the same name or even different names
but
on the same street (just different building numbers) so I thought it might
atleast find those.

Since I'm working on copy of the backend I copied a a record so everything
was exactly the same except the contact id, it did not find that.

I just slightly modified your SQL to match my field names...

SELECT tblContactData.ChurchAddress, tblContactData.ChurchCity,
tblContactData.ChurchState, tblContactData.ChurchZip,
tblContactData.ChurchPhone, tblContactData.ChurchFax,
tblContactData.ContactID, tblContactData.ChurchName,
tblContactData.FirstName, tblContactData.LastName, tblContactData.DCreated
FROM tblContactData
WHERE (((tblContactData.ChurchAddress) In (SELECT [ChurchAddress] FROM
[tblContactData] As Tmp GROUP BY
[ChurchAddress],[ChurchCity],[ChurchState],[ChurchZip],[ChurchPhone],[ChurchFax]
HAVING Count(*)>1 And [ChurchCity] = [tblContactData].[ChurchCity] And
[ChurchState] = [tblContactData].[ChurchState] And [ChurchZip] =
[tblContactData].[ChurchZip] And [ChurchPhone] =
[tblContactData].[ChurchPhone] And [ChurchFax] =
[tblContactData].[ChurchFax])))
ORDER BY tblContactData.ChurchAddress, tblContactData.ChurchCity,
tblContactData.ChurchState, tblContactData.ChurchZip,
tblContactData.ChurchPhone, tblContactData.ChurchFax;
I can't really see why you should be having problems with this. You are
sure
[quoted text clipped - 25 lines]
Merry Christmas to you all too!
 
Yeah I can obviously see now the query is nowhere near the same. I don't
know what happened there. Okay so after trying again it worked perfect and
pretty much instantly unlike the first time.

Actually it's strange because now that I remember it did work the first time
but it was slow. I remember the record count was the same when I just tried
it now as the first time when it was slow. Maybe when I tried switching the
query from sql view to design view that's where it messed up.
 
Back
Top