Duplicates

Bert Gale · Nov 7, 2003

How can delete duplicate records without deleting the
original one. I have many email addresses which are
duplicates, I want to somehow show those duplicates in a
query (or something) then do a mass delete, without
deleting the original one. HELP!!!

Bert

Tom Ellison · Nov 7, 2003

Dear Bert:

What is your definition of the "original" one? As a human, I think I
know what you mean, but is there a definition of this which the
computer can understand? Do you have a column showing the Date/Time
the email was received, so the the "original" on is the oldest one?

Without a way to technically define what you mean by "original" the
problem is ambiguous in terms the computer can be programmed to do.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

Bert Gale · Nov 7, 2003

Tom,

Thanks for your reply. What I mean about "Original": I
have field called Email and I've notice it has several
duplicate email address in many records. Let's say my
email address is in this table 6 six times, I want to keep
an original one and delete the remaining 5. I hope this
helps. Thanks again.

Bert Gale

Tom Ellison · Nov 7, 2003

Dear Bert:

Sorry, no, I already understood that much. Of the 6 emails, which one
is the "original." The answer to this has to be in terms of the data
recorded in the table. If there is no answer to this, then your
problem is not defined in a way that can be performed.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

Bert Gale · Nov 7, 2003

Tom,

I guess it can't just pick any record to keep and allow me
to delete the remaining ones. Well I'll guess I'll have
to delete them manually. Thanks again for your time.

Bert Gale

Tom Ellison · Nov 7, 2003

Dear Bert:

From your earlier post, I thought there was one specific record you
wanted to keep. It now sounds like you either don't care which one is
kept, or that there is no defined process to know which one to keep.

If you have a way of looking at the records and knowing which one you
want to keep, then perhaps it can be programmed to do this.

If not, or if the number of rows with which you are dealing is
minimal, you may well be best off doing it manually.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

Bert Gale · Nov 7, 2003

Tom,

Specific record, yes just one. That's right I don't care
which one gets deleted. I'm dealing with 90,000 records
and 45,000 are duplicates there is no macro or query that
I can perform to delete all duplicate records from no
specific record (example - 6 Bert Gale records to only
keep one)?

Bert Gale

Tom Ellison · Nov 7, 2003

Dear Bert:

If you don't have one, you could add an autonumber column to the
table, then you can keep only the row from each set of duplicates that
has the smallest value in the autonumber column. This would
arbitrarily allow you to select the one to keep.

With tens of thousands of rows to delete, this would be a relatively
simple and easy way to do the job.

Remember that, if there is information in other columns (besides the
one(s) used to determine duplication) you will lose the additional
information in the deleted rows, if there is any.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

Jonathan Smith · Nov 8, 2003

Please forgive me for piggybacking on another post, but I am having a
similar situation, and thought this might be the best place to start.

I have a table that contains Patient Treatment information, with a field
labeled "PatientID". I would like to delete from this table all of the
duplicate rows -- based on "PatientID", with the exception of the Most
Recent -- I have a "ModDate" field also.

Thank you.

Jon

Tom Ellison · Nov 8, 2003

Dear Jonathan:

First of all, when you say there are duplicate rows, are you saying
that this means two rows with every column identical? Or are you
defining "duplicate" as being based on a subset of all the columns?

The typical approach is to aggregate (create a totals query) grouped
on all the rows that are in your definition of what is a duplicate.
Then, for every other column in the table, you must have some rule
that exactly specifies which value will go into those columns in the
row to be retained. This can be done by aggregating (sum, min, max,
etc.) the values column-by-column or by choosing a specific row from
which the values are to be retained. If you wish to choose a specific
row from which to retain the values, then the method of selecting this
row MUST specify only one row from which to choose.

Your post suggest you have a ModDate by which you would want to choose
the row to be retained. I'll assume you want to keep the most recent
ModDated row for each PatientID.

SELECT PatientID, ModDate, <other columns go here>
FROM PatientTreatment PT
WHERE ModDate = (SELECT MAX(ModDate)
FROM PatientTreatment PT1
WHERE PT1.PatientID = PT.PatientID)

The above MAY give you just one row for each PatientID. However, if
there exists a patient where the most recent ModDate is found in more
than 1 row then ALL the rows with that PatientID and ModDate will be
represented. That would be because you have not provided sufficient
data to make this unambiguous (we would say "not unique")

The need to have PatientID / ModDate unique may be a requirement in
your database. In fact, you may have implemented this as the Primary
Key to this table, or as a Unique Key to the table. This would
prevent there being duplicates in the table on this combination. In
order to test whether there is such ambiguity in choosing the most
recent record for each patient, you could add a Unique Index on this
key. It will fail if it is not unique. If it succeeds, it not only
proves this combination is unique, but will enforce that against
future entries into the table (whether insertions or updates).

If the PatientID / ModDate does not prove to be unique, you would have
to find additional rules to determine uniquely the values to be
extracted from the existing data into the new data. The query simply
cannot be made to work until such ambiguities are removed by rules you
define. For this reason, sufficient constraints on the original
database (especially the PatientID / ModDate unique index I suggested)
are invaluable.

Finally, I suggest that, rather than delete data from the existing
table, you keep this table intact and create a new table for the data
you are preparing. Create this table empty and use the above SELECT
query (or one modified according to the rules you develop) to INSERT
into that table (an append query). Put the necessary constraints on
this table (Primary Key of PatientID for example) to prove the
integrity of the process thus created. You can then run the process
again and again as you test it without lowing any of the original
data.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

Duplicates

Bert Gale

Tom Ellison

Bert Gale

Tom Ellison

Bert Gale

Tom Ellison

Bert Gale

Tom Ellison

Jonathan Smith

Tom Ellison