Jeff
I used OCR software to read the names and addresses off ~500k images. For
each image we received an ID and Account Number combination that corresponds
to a Name and Address. Within the data set there are several examples of each
ID, Account, Name and Address combination.
We are interested in removing the duplicates and keeping only the distinct
records. The software is not 100% accurate, so the Names and Addresses can
differ from result to result for the same ID/Account.
My thought was to use a majority voting approach. Below is a link to an
example of the results. There are 19 records for ID=900000023,
Account=123456789, with varied results. Some of the results within this
ID/Account group are duplicates. I was looking to add one field holding the
count of records in the ID/Account group (19) and another holding the count
of each distinct Name and Address result string. I could then use these
counts along with the score to narrow down the correct result.
https://spreadsheets.google.com/ccc?key=0At39uG1JJzvCdFpfSl9LVEpJYVFTQ1JGeUIxTTd4V0E&hl=en
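
To make the idea concrete, here is a rough sketch of the two counts and the vote in Python. The sample rows, the score values, and the function names are made up for illustration and are not taken from the spreadsheet. It also assumes a higher score is better and that only exact string matches count as the same result.

```python
from collections import Counter

# Hypothetical sample rows: (id, account, name_address, score).
records = [
    ("900000023", "123456789", "JOHN SMITH 12 MAIN ST", 0.91),
    ("900000023", "123456789", "JOHN SMITH 12 MAIN ST", 0.88),
    ("900000023", "123456789", "J0HN SMITH 12 MAIN ST", 0.62),
    ("900000024", "555555555", "MARY JONES 4 OAK AVE", 0.95),
]

def add_counts(rows):
    """Append the ID/Account group size and the count of each exact result string."""
    group_sizes = Counter((r[0], r[1]) for r in rows)
    string_counts = Counter((r[0], r[1], r[2]) for r in rows)
    return [
        r + (group_sizes[(r[0], r[1])], string_counts[(r[0], r[1], r[2])])
        for r in rows
    ]

def majority_vote(rows):
    """Pick one result per ID/Account: most frequent string, ties broken by score."""
    best = {}
    for id_, acct, text, score, group_n, string_n in add_counts(rows):
        key = (id_, acct)
        candidate = (string_n, score, text, group_n)
        # Compare on (string count, score) only; keep the first seen on a full tie.
        if key not in best or candidate[:2] > best[key][:2]:
            best[key] = candidate
    return {
        k: {"name_address": v[2], "votes": v[0], "group_size": v[3]}
        for k, v in best.items()
    }

winners = majority_vote(records)
```

Comparing votes against group_size gives a rough idea of how strong the majority is for each ID/Account, which could be used to flag weak groups for manual review.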