matching mismatched text

  • Thread starter Thread starter Brandi
  • Start date Start date
B

Brandi

I have 2 files. Each ahs a column that contains the name of a company I do
business with. I need to merge the files together by company name (Column A)
so I get columns A thru D of file2 merged with the same data (A thru D) in
file1. The end result would be column A from either of the 2 files in
ColumnA of file3 and B-D of file1 in B-D of file3 and B-D of file2 in E-G of
file3.

My problem is that the company name in one file might say ABC Brick Company
and the other may say ABC, Inc. or ABC Brick or ABC Brick Co. or something
else. I need some way to automatically match things up when there's an
exact match and to make suggested matches for the ones that aren't exactly
the same. In a perfect solution, the exact matches would just copy over into
file3 and maybe a popup window would come up for the one's that are closest.

I don't know how to do any of this. I'm no programmer. I'm no macro writer
either but I can take a sample and make simple adjustments. Anybody know how
to do this?

Thanks!
 
Brandi,

You can only do the exact matches easily.

Open file 1, then save it as file 3. Open up file 2. In file 3, cells
E2:G2, use the formulas

=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,2,FALSE)
=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,3,FALSE)
=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,4,FALSE)

where the XXXX is the row number of your data.

When you have finished, copy the formulas down to match your list in column
A.

Then copy columns E:G, and pastespecial values to get rid of the formulas.
With E:G still selected, use Edit | Go To... | Special.... check
"Constants" and "Errors" then press OK. Then press "Del" to clear out the
unmatched data.

All those blanks are values that you will need to match manually.... It's
not easy to match ABC Brick to ABC, especially if you also have ABC Blocks
and ABC Paints - the logic is next to impossible to program, which is why
you often get junk mail addressed to "Brandi" as well as to "Brandy" from
the same source.

HTH,
Bernie
MS Excel MVP
 
That's why I was hoping for a solution that matched the most similar names
and then somhow let me select from the slimmed down listing. no way?

Bernie Deitrick said:
Brandi,

You can only do the exact matches easily.

Open file 1, then save it as file 3. Open up file 2. In file 3, cells
E2:G2, use the formulas

=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,2,FALSE)
=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,3,FALSE)
=VLOOKUP(A2,[File2.xls]Sheet1!$A$2:$D$XXXX,4,FALSE)

where the XXXX is the row number of your data.

When you have finished, copy the formulas down to match your list in column
A.

Then copy columns E:G, and pastespecial values to get rid of the formulas.
With E:G still selected, use Edit | Go To... | Special.... check
"Constants" and "Errors" then press OK. Then press "Del" to clear out the
unmatched data.

All those blanks are values that you will need to match manually.... It's
not easy to match ABC Brick to ABC, especially if you also have ABC Blocks
and ABC Paints - the logic is next to impossible to program, which is why
you often get junk mail addressed to "Brandi" as well as to "Brandy" from
the same source.

HTH,
Bernie
MS Excel MVP

Brandi said:
I have 2 files. Each ahs a column that contains the name of a company I do
business with. I need to merge the files together by company name
(Column
A)
so I get columns A thru D of file2 merged with the same data (A thru D) in
file1. The end result would be column A from either of the 2 files in
ColumnA of file3 and B-D of file1 in B-D of file3 and B-D of file2 in
E-G
of
file3.

My problem is that the company name in one file might say ABC Brick Company
and the other may say ABC, Inc. or ABC Brick or ABC Brick Co. or something
else. I need some way to automatically match things up when there's an
exact match and to make suggested matches for the ones that aren't exactly
the same. In a perfect solution, the exact matches would just copy over into
file3 and maybe a popup window would come up for the one's that are closest.

I don't know how to do any of this. I'm no programmer. I'm no macro writer
either but I can take a sample and make simple adjustments. Anybody know how
to do this?

Thanks!
 
Harlan,

So how do you match these two entries besides manually - this is a real
world example, and they are for the same entity, and these were the records
that matched most closely:

1275 SW TV HWY POBOX 647
1725 TUALATIN VALLEY HEY

Data entry is never clean - the times I've needed to match databases, I've
found roughly 50% quickly, 25% to 40% by partial matches using various
formulas, and the other 25% to 10% by simple process of elimination.

Bernie
 
Bernie Deitrick said:
So how do you match these two entries besides manually - this is a real
world example, and they are for the same entity, and these were the records
that matched most closely:

1275 SW TV HWY POBOX 647
1725 TUALATIN VALLEY HEY
....

For cases such as this there's no alternative to manual matching, though the
transposed digits make this more of a data cleansing exercise than an
approximate text matching one. Without the road numbers, the mp function in
one of the indirectly linked articles returns a match score of 0.263158. Not
great, but it may be the closest match. FWIW, the mp function returns 0.875
for 'ABC Corp' and 'A.B.C. Company'.

Also, I'd never try matching addresses if there were company names to match,
and the OP is matching company names. Approximate matching methods can be
used to provide most probable matches. It still leaves work for the OP to
do, but it does help reduce the magnitude of the task.
 
Harlan,

What is the best method for employing the mp function? For example, let's
say that I have 2500 remaining unmatched records/rows in each of two
databases of 10 fields each.

Thanks,
Bernie
 
Bernie Deitrick wrote...
What is the best method for employing the mp function? For
example, let's say that I have 2500 remaining unmatched
records/rows in each of two databases of 10 fields each.

That there are 10 fields is immaterial. Match like fields. If you have
a closer match between the Address field in one record in table A and
the Product field in another record in table B than you do against any
of the table B records' Address field values is immaterial or suggests
that your problem is much larger than simply matching records.

So matching records means matching corresponding fields in different
tables. I needed to tell you this?

The mp function as given in the linked article can't cope with
comparing one string against an array of strings, so add another
function.


Function samp(s As String, a As Variant) As Variant
Dim x As Variant, k As Long, n As Long, rv() As Double

n = 256 'change as the whim suits you
ReDim rv(1 To n)

For Each x In a

k = k + 1

If k >= n Then
n = 2 * n
ReDim Preserve rv(1 To n)
End If

rv(k) = mp(s, CStr(x))

Next x

ReDim Preserve rv(1 To k)

samp = rv

End Function


Then append columns to the original table, searching for matches one
each field separately. If TblA A were in A1:J5000, enter the following
formulas.

K2:
=INDEX(TblB,MATCH(LARGE(samp($A2,OFFSET(TblB,1,0,
ROWS(TblB)-1,1)),COLUMN()-10),samp($A2,OFFSET(TblB,1,0,
ROWS(TblB)-1,1)),0),1)

and fill K2 right into L2:T2 to get the 10 closest matches to A2 in the
first column in TblB. Selecting the best match may still be a manual
process, but finding the most likely candidates can be automated.

Now, to be honest, this is the sort of text processing operation that's
VERY POORLY DONE IN ANY SPREADSHEET. Much, much better to use a
language like Perl to load the smaller of the two files into arrays for
each field, then cycle through the larger file finding the closest
matches. Indeed, Perl already has a few CPAN modules that handle this.
See

http://search.cpan.org/~jhi/String-Approx-3.23/Approx.pm

Metaphorically speaking, just because Excel crudely approximates flint
and steel doesn't mean one should use it when one could use matches
instead.
 
Back
Top