Find near duplicates

  • Thread starter Thread starter stuart
  • Start date Start date
S

stuart

One of the problems I have is people entering the same data one or two
days apart so I'm trying to construct a find duplicates query which
returns records where the amount and the customer name are duplicated
and the date is pretty damned close to being a duplicate (say plus or
minus 2 days).
Sorry, I probably haven't explained this very well but any help or
pointers appreciated.
 
One of the problems I have is people entering the same data one or two
days apart so I'm trying to construct a find duplicates query which
returns records where the amount and the customer name are duplicated
and the date is pretty damned close to being a duplicate (say plus or
minus 2 days).
Sorry, I probably haven't explained this very well but any help or
pointers appreciated.

Create a Self Join query by adding the table to the query grid
*twice*. Join the two instances by amount and CustomerID (I *hope*
you're not storing customer names in your sales table... if you are,
how can you be sure that the Jim Smith who bought on Tuesday is the
same person as the Jim Smith who bought stuff on Wednesday?)

Put in a calculated field:

DaysBetween: DateDiff("d",
.[SaleDate], [Table_1].[SaleDate])

and use a criterion of

IN (-2, -1, 1, 2)

to limit it to plus or minus two days and to exclude same-day sales.
 
Stuart,

SELECT M1.*
FROM MyTable AS M1
INNER JOIN
MyTable AS M2
ON M1.PrimaryKey = M2.PrimaryKey
WHERE M1.Customer = M2.Customer
AND (M1.TransactionDate
BETWEEN
(M2.TransactionDate - 2)
AND
(M2.TransactionDate + 2)

This is untested . . . just a guess on my part.


Sincerely,

Chris O.
 
John Vinson said:
(I *hope* you're not storing customer names in your sales table...

Now, would I do a thing like that?!! If I've learnt nothing else from
your posts over the years, I think some of the design stuff has sunk
in :-)
Put in a calculated field:

DaysBetween: DateDiff("d",
.[SaleDate], [Table_1].[SaleDate])

and use a criterion of

IN (-2, -1, 1, 2)

to limit it to plus or minus two days and to exclude same-day sales.


Right, I'll go away and try this. Many thanks.
 
Chris2 said:
Stuart,

SELECT M1.*
FROM MyTable AS M1
INNER JOIN
MyTable AS M2
ON M1.PrimaryKey = M2.PrimaryKey
WHERE M1.Customer = M2.Customer
AND (M1.TransactionDate
BETWEEN
(M2.TransactionDate - 2)
AND
(M2.TransactionDate + 2)

This is untested . . . just a guess on my part.

Many thanks, Chris. I'll give this a try.
Regards
Stuart
 
Chris2 said:
Stuart,

SELECT M1.*
FROM MyTable AS M1
INNER JOIN
MyTable AS M2
ON M1.PrimaryKey = M2.PrimaryKey
WHERE M1.Customer = M2.Customer
AND (M1.TransactionDate
BETWEEN
(M2.TransactionDate - 2)
AND
(M2.TransactionDate + 2)

This is untested . . . just a guess on my part.


Sincerely,

Chris O.


Thanks again for the replies. This is what happened........
===========================================
Chris's suggestion:

After a bit of trial and error this was the best I could do, and it
returns all records.

SELECT M1.*, M1.namesID, M1.transdate
FROM TBL_TRANSACTIONS AS M1 INNER JOIN TBL_TRANSACTIONS AS M2 ON
M1.transactionID = M2.transactionID
WHERE (((M1.namesID)=[M2].[namesID]) AND ((M1.transdate) Between
[M2].[transdate]-2 And [M2].[transdate]+2));
============================================
John's suggestion

This works fine but, as indicated, it excludes duplicates with the
same date, which I'd like to include if possible. I've tried tinkering
with the criteria but I then get all records returned.

SELECT DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate])
AS DaysBetween, TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID,
TBL_TRANSACTIONS.amount
FROM TBL_TRANSACTIONS INNER JOIN TBL_TRANSACTIONS AS
TBL_TRANSACTIONS_1 ON (TBL_TRANSACTIONS.amount =
TBL_TRANSACTIONS_1.amount) AND (TBL_TRANSACTIONS.namesID =
TBL_TRANSACTIONS_1.namesID)
WHERE (((DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate]))
In (-2,-1,1,2)));
============================================

The Access Wizard gives me same day only duplicates

SELECT DISTINCTROW TBL_TRANSACTIONS.transdate,
TBL_TRANSACTIONS.namesID
FROM TBL_TRANSACTIONS
WHERE (((TBL_TRANSACTIONS.transdate) In (SELECT [transdate] FROM
[TBL_TRANSACTIONS] As Tmp GROUP BY [transdate],[amount],[namesID]
HAVING Count(*)>1 And [amount] = [TBL_TRANSACTIONS].[amount] And
[namesID] = [TBL_TRANSACTIONS].[namesID])))
ORDER BY TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID;
=============================================
I hope there's a combination somewhere in here that will work. Thanks
for the help so far.
 
This works fine but, as indicated, it excludes duplicates with the
same date, which I'd like to include if possible. I've tried tinkering
with the criteria but I then get all records returned.

I'm presuming you have a TransactionID primary key in the table; just
include a criterion to ensure that the ID's are different (even if
everything else is the same or similar:

SELECT
DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate])
AS DaysBetween, TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID,
TBL_TRANSACTIONS.amount
FROM TBL_TRANSACTIONS INNER JOIN TBL_TRANSACTIONS AS
TBL_TRANSACTIONS_1 ON (TBL_TRANSACTIONS.amount =
TBL_TRANSACTIONS_1.amount) AND (TBL_TRANSACTIONS.namesID =
TBL_TRANSACTIONS_1.namesID)
WHERE
(((DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate]))
BETWEEN -2 AND 2) AND TBL_TRANSACTIONS.Transaction_ID <>
TBL_TRANSACTIONS_1.Transaction_ID;
 
See Below:

stuart said:
"Chris2" <[email protected]> wrote in message
Stuart,

SELECT M1.*
FROM MyTable AS M1
INNER JOIN
MyTable AS M2
ON M1.PrimaryKey = M2.PrimaryKey
WHERE M1.Customer = M2.Customer
AND (M1.TransactionDate
BETWEEN
(M2.TransactionDate - 2)
AND
(M2.TransactionDate + 2)

This is untested . . . just a guess on my part.


Sincerely,

Chris O.


Thanks again for the replies. This is what happened........
===========================================
Chris's suggestion:

After a bit of trial and error this was the best I could do, and it
returns all records.

SELECT M1.*, M1.namesID, M1.transdate
FROM TBL_TRANSACTIONS AS M1 INNER JOIN TBL_TRANSACTIONS AS M2 ON
M1.transactionID = M2.transactionID
WHERE (((M1.namesID)=[M2].[namesID]) AND ((M1.transdate) Between
[M2].[transdate]-2 And [M2].[transdate]+2));
============================================
John's suggestion

This works fine but, as indicated, it excludes duplicates with the
same date, which I'd like to include if possible. I've tried tinkering
with the criteria but I then get all records returned.

Yeah, I was wondering. I knew I shouldn't have posted that without actually
loading some data and testing.




SELECT DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate]
)
AS DaysBetween, TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID,
TBL_TRANSACTIONS.amount
FROM TBL_TRANSACTIONS INNER JOIN TBL_TRANSACTIONS AS
TBL_TRANSACTIONS_1 ON (TBL_TRANSACTIONS.amount =
TBL_TRANSACTIONS_1.amount) AND (TBL_TRANSACTIONS.namesID =
TBL_TRANSACTIONS_1.namesID)
WHERE (((DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transda
te]))
In (-2,-1,1,2)));
============================================

The Access Wizard gives me same day only duplicates

SELECT DISTINCTROW TBL_TRANSACTIONS.transdate,
TBL_TRANSACTIONS.namesID
FROM TBL_TRANSACTIONS
WHERE (((TBL_TRANSACTIONS.transdate) In (SELECT [transdate] FROM
[TBL_TRANSACTIONS] As Tmp GROUP BY [transdate],[amount],[namesID]
HAVING Count(*)>1 And [amount] = [TBL_TRANSACTIONS].[amount] And
[namesID] = [TBL_TRANSACTIONS].[namesID])))
ORDER BY TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID;
=============================================
I hope there's a combination somewhere in here that will work. Thanks
for the help so far.
 
Chris2 said:
Yeah, I was wondering. I knew I shouldn't have posted that without actually
loading some data and testing.
I was grateful for the suggestion anyway.
 
John Vinson said:
I'm presuming you have a TransactionID primary key in the table; just
include a criterion to ensure that the ID's are different (even if
everything else is the same or similar:

SELECT
DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate])
AS DaysBetween, TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID,
TBL_TRANSACTIONS.amount
FROM TBL_TRANSACTIONS INNER JOIN TBL_TRANSACTIONS AS
TBL_TRANSACTIONS_1 ON (TBL_TRANSACTIONS.amount =
TBL_TRANSACTIONS_1.amount) AND (TBL_TRANSACTIONS.namesID =
TBL_TRANSACTIONS_1.namesID)
WHERE
(((DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate]))
BETWEEN -2 AND 2) AND TBL_TRANSACTIONS.Transaction_ID <>
TBL_TRANSACTIONS_1.Transaction_ID;

You've lost me there I'm afraid.....
 
John Vinson said:
I'm presuming you have a TransactionID primary key in the table; just
include a criterion to ensure that the ID's are different (even if
everything else is the same or similar:

SELECT
DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate])
AS DaysBetween, TBL_TRANSACTIONS.transdate, TBL_TRANSACTIONS.namesID,
TBL_TRANSACTIONS.amount
FROM TBL_TRANSACTIONS INNER JOIN TBL_TRANSACTIONS AS
TBL_TRANSACTIONS_1 ON (TBL_TRANSACTIONS.amount =
TBL_TRANSACTIONS_1.amount) AND (TBL_TRANSACTIONS.namesID =
TBL_TRANSACTIONS_1.namesID)
WHERE
(((DateDiff("d",[TBL_TRANSACTIONS].[transdate],[TBL_TRANSACTIONS_1].[transdate]))
BETWEEN -2 AND 2) AND TBL_TRANSACTIONS.Transaction_ID <>
TBL_TRANSACTIONS_1.Transaction_ID;

You've lost me there I'm afraid.....

Well, I guess you've lost me.

Is there a Primary Key in the Transactions table? You'll need *some*
way to be able to find transactions that happen on the same day (the
datediff equal to 0) but to avoid the transaction finding itself.
That's why I suggested adding the additional criterion that the
Transaction_ID in the first instance of the table (TBL_TRANSACTIONS)
should be unequal to the Transaction_ID in the second instance
(TBL_TRANSACTIONS_1).
 
Well, I guess you've lost me.
Is there a Primary Key in the Transactions table? You'll need *some*
way to be able to find transactions that happen on the same day (the
datediff equal to 0) but to avoid the transaction finding itself.
That's why I suggested adding the additional criterion that the
Transaction_ID in the first instance of the table (TBL_TRANSACTIONS)
should be unequal to the Transaction_ID in the second instance
(TBL_TRANSACTIONS_1).
Ah, right! I'm beginning to think it'll be easier to run your original
query and the wizard exact duplicates query separately.
Many thanks for your help.
 
Back
Top