Remove Duplicate - Count Number of Unique Results

Jeff · Jan 8, 2010

I used OCR software to read the Names and addresses off ~500k images. For
each image we received ID and Account Number combinations that correspond to
a Name and Address. Within the data set there are several examples of each
ID, Account, Name and Address combination.

We are interested in removing the duplicates and only having the distinct
records. The software is not 100% accurate so the Names and Addresses can be
different from result to result for the same ID/Account.

My thought was to use a majority voting approach. Below is a link to an
example of the results. There are 19 records for the ID=900000023,
Account=123456789 with varied results. Some of the results within this
ID/Account group are duplicates. I was looking to add a field to output a
count of records for ID/Account group (19) and another column to output the
count of the Name and Address result string. I could then use this along
with the score to narrow down the correct result.

https://spreadsheets.google.com/ccc?key=0At39uG1JJzvCdFpfSl9LVEpJYVFTQ1JGeUIxTTd4V0E&hl=en

Tom van Stiphout · Jan 8, 2010

On Thu, 7 Jan 2010 19:14:01 -0800, Jeff

If you already have the correct accountno + name + address in some
table and are simply trying to match the document to the correct
record, perhaps a "fuzzy string matching" algorithm will help. I wrote
a chapter about one such algorithm in this book:
www.manning.com/nielsen/

To do what you suggested, you would get the count like this:
select count(*) from myTable
group by ID, Account

select count(*) from myTable
group by Name1, Street, City, State, Zip

-Tom.
Microsoft Access MVP

Jeff · Jan 8, 2010

Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff

Tom van Stiphout · Jan 8, 2010

On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP

Jeff · Jan 8, 2010

Tom

What Norhtwind Exapme DB is this from?

THansk
Jeff

Tom van Stiphout said:
On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP

Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff

Click to expand...

.

Tom van Stiphout · Jan 12, 2010

On Fri, 8 Jan 2010 10:06:01 -0800, Jeff

I was using Access 2007.

-Tom.
Microsoft Access MVP

Tom van Stiphout said:
Tom

What Norhtwind Exapme DB is this from?

THansk
Jeff

Tom van Stiphout said:

On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP

Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff

:

On Thu, 7 Jan 2010 19:14:01 -0800, Jeff

If you already have the correct accountno + name + address in some
table and are simply trying to match the document to the correct
record, perhaps a "fuzzy string matching" algorithm will help. I wrote
a chapter about one such algorithm in this book:
www.manning.com/nielsen/

To do what you suggested, you would get the count like this:
select count(*) from myTable
group by ID, Account

select count(*) from myTable
group by Name1, Street, City, State, Zip

-Tom.
Microsoft Access MVP

I used OCR software to read the Names and addresses off ~500k images. For
each image we received ID and Account Number combinations that correspond to
a Name and Address. Within the data set there are several examples of each
ID, Account, Name and Address combination.

We are interested in removing the duplicates and only having the distinct
records. The software is not 100% accurate so the Names and Addresses can be
different from result to result for the same ID/Account.

My thought was to use a majority voting approach. Below is a link to an
example of the results. There are 19 records for the ID=900000023,
Account=123456789 with varied results. Some of the results within this
ID/Account group are duplicates. I was looking to add a field to output a
count of records for ID/Account group (19) and another column to output the
count of the Name and Address result string. I could then use this along
with the score to narrow down the correct result.

https://spreadsheets.google.com/ccc?key=0At39uG1JJzvCdFpfSl9LVEpJYVFTQ1JGeUIxTTd4V0E&hl=en
.

Click to expand...

.

Click to expand...

Selecting unique record from duplicate	2	Mar 24, 2010
Simple question about query result	2	Nov 2, 2008
unique records Access 2003	2	May 27, 2010
Duplicate Entries	4	Feb 5, 2007
Count unique in a query instead of count #	5	Oct 16, 2008
Excel Subtotal without calculate duplicate lines	4	Feb 22, 2017
Duplicate rows - Distinct Count	2	Sep 30, 2009
This has got to be an easy question?	9	Mar 4, 2010

Remove Duplicate - Count Number of Unique Results

Jeff

Tom van Stiphout

Jeff

Tom van Stiphout

Jeff

Tom van Stiphout

Ask a Question

Similar Threads