Comparing Similarity of Strings and Returning Best Match

  • Thread starter Thread starter AlanR
  • Start date Start date
A

AlanR

Dear all
I have been tasked with a very manual process comparing thousands of
rows of data from 2 data dumps. I have to locate a 'best match'
Product Code from one data dump and copy its value to another data
dump. One data dump has the following columns: PO No, PO line No,
Product Code, Value. The second data dump has only PO No and Product
Code. The idea is on the second data dump to look up the relevant PO
number and most similar Product code and return the value from the
first data dump. The trouble is the product codes in both lists are
slightly different. Sometimes the beginning of the code is different,
sometimes the middle and sometimes the end! However, the characters in
both types of product codes are around 90-95% similar and easily
spottable by manual comparison. To automate this, I have tried a
number of methods so far to do a closest match, for example a vlookup
using TRUE, and I have also tried the 'FuzzyMatch' user-defined-
function that I saw on Mr Excel. Unfortunately neither is providing me
with a workable solution. Therefore I thought I'd ask the experts to
see if what I want is possible via a VBA macro. If so, I would be very
grateful of your help as doing this exercise manually is very
disheartening! Thank you, AlanR.
Example of the first data dump:
PO No Line No Product Code Value
45001 1 AB-ZZZ-HHH-45T 100
45001 2 TY-55555-99-ZA 110
45001 3 CCC-MODEL-XX-YYY 120
45001 4 YYYYY_35-KLMN 130
45002 1 TY-55555-99-ZA 140
45002 2 CCC-MODEL-XX-YYY 150
45002 3 AB-ZZZ-HHH-45T 160
45003 1 YYYYY_35-KLMN 170
45003 2 AB-ZZZ-HHH-45T 180
45004 1 CCC-MODEL-XX-YYY 190
45004 2 KIT-MODEL678 200
45004 3 HYT-JJJ-TOP10 210
45004 4 AB-ZZZ-HHH-45T 220
45004 5 GTO-GTOP-25L 230
45004 6 YYYYY_35-KLMN 240
45004 7 TY-55555-99-ZA 250

Example of the second data dump (including the 4 columns I would like
to auto-populate using VBA or formulae)
PO No. Product Code Closest Match Line No. Closest
Match Product Code Similarity% Value
45001 CCC-MODEL-XX-YYY-35
45001 45-AB-ZZZ-HHH-45T
45001 YYYYY_35-MODEL-KLMN
45001 TY-55555-99C-ZA
45002 CCC-MODEL-XX-YYY-35
45002 45-AB-ZZZ-HHH-45T
45002 TY-55555-99C-ZA
45003 45-AB-ZZZ-HHH-45T
45003 YYYYY_35-MODEL-KLMN
45004 KIT-MODEL678_A
45004 CCC-MODEL-XX-YYY-35
45004 45-AB-ZZZ-HHH-45T
45004 HYT-JJJ-TOP10
45004 YYYYY_35-MODEL-KLMN
45004 GTO-GTOP-25L
45004 TY-55555-99C-ZA
 
If you ask me - you're up for an uphill struggle.
Try googling for: regexp vba
might give you some clues.
 
If you ask me - you're up for an uphill struggle.
Try googling for: regexp vba
might give you some clues.

I am assuming that the entire product code and PO are one long
string. If they are not, skip this code.
First thing you could do is use the following code.

Dim txt As String
Dim x As Variant
Dim i As Long
Dim q As Integer


txt = Cells(1, 1).Value
x = Split(txt, " ")
For i = 0 To UBound(x)
Cells(i+1, 2) = x(i)
Next i

Use this code on one string in cells(1,1). This code will break the
string at every space. So your first value of
45001 1 AB-ZZZ-HHH-45T 100 would come out as

45001
1
AB-ZZZ-HHH-45T
100

Now you can separate out the product code.

Step 2 would be to compare the product code by chunks of strings to a
database of known product codes. Consider the comparison of 45-AB-ZZZ-
HHH-45T to a known product code of AB-ZZZ-HHH-45T.

If you wrote some code to take 10 characters at a time and see if they
exist in a another string, you'd probably get it.

MyString = Cells(1, 1)
X=len(MyString)

for startval=1 to 4
for endval=11 to 15

partial=Mid(MyString, startval, endval)
' Here you would compare "partial" to all known product codes. If
TRUE then move on to next cell.
next
next


So your code would take the following strings and look for a match.
45-AB-ZZZ- FALSE
4-AB-ZZZ-H FALSE
-AB-ZZZ-HH FALSE
AB-ZZZ-HHH TRUE

If the codes are unique enough that a match of 10 characters is
sufficient, then this would work. You will need to determine exactly
how many characters in a row have to be

Good luck
 
AlanR said:
Dear all
I have been tasked with a very manual process comparing thousands of
rows of data from 2 data dumps. I have to locate a 'best match'

<snip>

From a recent post in another group ... you might get some ideas:

<quote> ... What it cant do is tell you if something is 'real'
or not. For example, does Brand X really exist or is it just a
mispelled Brand Y? You can get close with heuristic lookups and
comparisons, and I will be using Levenshtein distances etc to find
nearest fits when a fields value is unknown - but this is only a
guess. In the end a user must make the decision and 'approve' the
fields value. </quote>

http://groups.google.com/group/comp.databases.ms-access/msg/94f29c021a5cc4b4?hl=en
 
Back
Top