Could anyone help me to remove duplicate rows?

  • Thread starter Thread starter Leon
  • Start date Start date
L

Leon

I have a table that has more than 1 milion rows so practically it is
impossible to remove all duplicate rows by hand.
Could you help me to remove those duplicate rows at all?
This table doesn't have primary key( identity ) column.

Say Table name => MyTable
Column => Col1,Col2,Col3
No primary key column here.

Thanks a lot in advance.
 
You can use the duplicate record query in Access to find them.

You can create an Access table matching the one with the duplicates and then
set on field as no duplicates and import the records. This will remove the
duplicate records.

Or using code:

SELECT Table1.Field1, Table1.ID
FROM Table1
WHERE (((Table1.Field1) In (SELECT [Field1] FROM [Table1] As Tmp GROUP BY
[Field1] HAVING Count(*)>1 )))
ORDER BY Table1.Field1;

And then just execute a delete but be careful, this returns all duplicate
records not just the duplicates of x

HTH,
 
Leon,
Are you populating the table from a database? If so, you might be able to
use DISTINCT in you SELECT statement to avoid loading them in the first
place.
Another approach (I haven't tested this!) might be to set constraints before
filling the table (a PrimaryKey comprising all 3 columns) and handle the
FillError event of the DataAdaptor

Stephen
 
Brian,

I'm not sure that would work if Field1 is not the PK. What Leon is saying,
if I understand him correctly, is he has 3 fields in the table and they are
effectively the PK. So it would be something along the lines of:

SELECT Field1 + Field2 + Field3 FROM MyTable
GROUP BY Field1 + Field2 + Field3 HAVING COUNT(*) > 1
ORDER BY Field1,Field2,Field3

That will give you one copy of each duplicate; then, for each duplicate I'd
probably save the 3 field values, delete each group of records in the
underlying table and then re-insert one record:

/* Inside a loop scanning through the above result */
@@FieldOne = Field1
@@FieldTwo = Field2
@@FieldThree = Field3
DELETE FROM MyTable WHERE Field1 = @@FieldOne AND Field2 = @@FieldTwo AND
Field3 = @@FieldThree
INSERT INTO MyTable (Field1,Field2,Field3) VALUES
(@@FieldOne,@@FieldTwo,@@FieldThree)

.... but, I'd wrap all that in a transaction for safety, to be on the safe
side.

And of course, experiment on a copy of the original table for safety's sake.

--Bob


Brian P. Hammer said:
You can use the duplicate record query in Access to find them.

You can create an Access table matching the one with the duplicates and then
set on field as no duplicates and import the records. This will remove the
duplicate records.

Or using code:

SELECT Table1.Field1, Table1.ID
FROM Table1
WHERE (((Table1.Field1) In (SELECT [Field1] FROM [Table1] As Tmp GROUP BY
[Field1] HAVING Count(*)>1 )))
ORDER BY Table1.Field1;

And then just execute a delete but be careful, this returns all duplicate
records not just the duplicates of x

HTH,
--
Brian P. Hammer
Leon said:
I have a table that has more than 1 milion rows so practically it is
impossible to remove all duplicate rows by hand.
Could you help me to remove those duplicate rows at all?
This table doesn't have primary key( identity ) column.

Say Table name => MyTable
Column => Col1,Col2,Col3
No primary key column here.

Thanks a lot in advance.
 
First of all, thanks a lot for all of you.
I think Bob Grommes's solution sounds good.
I might try this first and see how it goes.

Thanks again for all of you to help me out.

I will let you know the result.
 
What I did was
-----------------------------------------------------------
DECLARE @FieldOne as int
DECLARE @FieldTwo as varchar(50)
DECLARE @FieldThree as varchar(40)
DECLARE @FieldFour as varchar(50)
DECLARE @FieldFive as varchar(50)
DECLARE @value as int

BEGIN TRAN DELETE_Du
SELECT
@FieldOne=[1],@FieldTwo=[2],@FieldThree=[3],@FieldFour=[4],@FieldFive=[5
],@value=value FROM [5]
GROUP BY [1],[2],[3],[4],[5],value HAVING COUNT(*) > 1
While @FieldOne <> NULL
BEGIN
DELETE [5] WHERE @FieldOne=[1] and @FieldTwo=[2] and @FieldThree=[3]
and @FieldFour=[4] and @FieldFive=[5] and @value=value

INSERT INTO [5] ([1],[2],[3],[4],[5],value) VALUES
(@FieldOne,@FieldTwo,@FieldThree,@FieldFour,@FieldFive,@value)
SELECT
@FieldOne=[1],@FieldTwo=[2],@FieldThree=[3],@FieldFour=[4],@FieldFive=[5
],@value=value FROM [5]
GROUP BY [1],[2],[3],[4],[5],value HAVING COUNT(*) > 1
END

COMMIT TRAN DELETE_Du

-----------------------------------------------------------
And the message I had was "the command(s) completed successfully" but
the duplicates are still there.

What did I miss?

Thanks a lot~~~~~~~~
 
Thanks a lot for your help.
but still I have a problem.
what I did was,
------------------------------------------------------
DECLARE @FieldOne as int
DECLARE @FieldTwo as varchar(50)
DECLARE @FieldThree as varchar(40)
DECLARE @FieldFour as varchar(50)
DECLARE @FieldFive as varchar(50)
DECLARE @value as int

BEGIN TRAN DELETE_Du
SELECT
@FieldOne=[1],@FieldTwo=[2],@FieldThree=[3],@FieldFour=[4],@FieldFive=[5
],@value=value FROM [5]
GROUP BY [1],[2],[3],[4],[5],value HAVING COUNT(*) > 1
While @FieldOne <> NULL
BEGIN
DELETE [5] WHERE @FieldOne=[1] and @FieldTwo=[2] and @FieldThree=[3]
and @FieldFour=[4] and @FieldFive=[5] and @value=value

INSERT INTO [5] ([1],[2],[3],[4],[5],value) VALUES
(@FieldOne,@FieldTwo,@FieldThree,@FieldFour,@FieldFive,@value)
SELECT
@FieldOne=[1],@FieldTwo=[2],@FieldThree=[3],@FieldFour=[4],@FieldFive=[5
],@value=value FROM [5]
GROUP BY [1],[2],[3],[4],[5],value HAVING COUNT(*) > 1
END

COMMIT TRAN DELETE_Du
---------------------------------------------------------
and this end of this query, it said "The command(s) completed
successfully."
Did I miss something?

Thanks a lot..
 
This is the method I use to remove duplicate records from a table. The
first three columns in this example are assumed to be the PK.

1. Backup the table or the database. I usually just do a SELECT * INTO
FROM SOURCE_TABLE

2. Find all ROWS that have a duplicate and stage.

SELECT COL1, COL2, COL3, COL4, count(*)
into #TEMP_DATA
FROM SOURCE_TABLE
GROUP BY COL1, COL2, COL3, COL4
HAVING count(*) > 1

3. REMOVE "ALL" records which have duplicates from the source table.

DELETE SOURCE_TABLE
FROM SOURCE_TABLE a, #TEMP_DATA b
WHERE a.COL1 = b.COL1
and a.COL2 = b.COL2
and a.COL3 = b.COL3

4. INSERT THE "SINGLE" RECORDS BACK INTO THE TABLE

INSERT INTO SOURCE_TABLE (COL1, COL2, COL3, COL4)
SELECT COL1, COL2, COL3, COL4
FROM #TEMP_DATA2

5. Duplicates are now gone. No Rows returned.

SELECT COL1, COL2, COL3, COL4, count(*)
FROM SOURCE_TABLE
GROUP BY COL1, COL2, COL3, COL4
HAVING count(*) > 1

6. Delete Backup table.
 
Back
Top