Extracting matching words within cells

  • Thread starter Thread starter ciara
  • Start date Start date
C

ciara

I have an Excel file filled with sentences. What I need
Excel to do is compare each sentence with the one in the
row above it, then extract the individual words that match betwee
these and put these words in another column. I've found out lots o
stuff on how to return the location (in
numbers) of matching strings, but I need the actual
words. Any suggestions
 
ciara said:
I have an Excel file filled with sentences. What I need
Excel to do is compare each sentence with the one in the
row above it, then extract the individual words that match between
these and put these words in another column. I've found out lots of
stuff on how to return the location (in
numbers) of matching strings, but I need the actual
words. Any suggestions?

This isn't something Excel or any other spreadsheet is good at. This could
be done in 3 lines of awk or Perl code. But if you insist on using Excel to
do this, you'll need to use VBA to do this only because Excel provides no
other mechanism to concatenate an *arbitrary* number of strings in a single
expression.

If you were to use a defined name Seq referring to =ROW(INDIRECT("1:1024")),
then you could generate an array of all words in one sentence (A2) also
occurring (at least as a substring) in another sentence (A1) using

=IF(COUNTIF(A1,"*"&MID(A2,SMALL(IF(MID(" "&A2,Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1)))),
SMALL(IF(MID(A2&" ",Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1))))
-SMALL(IF(MID(" "&A2,Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1)))))&"*"),
MID(A2,SMALL(IF(MID(" "&A2,Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1)))),
SMALL(IF(MID(A2&" ",Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1))))
-SMALL(IF(MID(" "&A2,Seq,1)=" ",Seq),
ROW(INDIRECT("1:"&(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1))))),"")

Yes, that's all one big, ugly formula. There's so much redundancy in it that
a udf may actually be faster to recalc.


Function CommonWords(x As String, y As String) As String
Dim i As Long, j As Long, k As Long, m As Long, n As Long
Dim w As String

m = Len(x)
n = Len(y)

If m > n Then
'swap x and y to iterate through shorter string (x)
w = x
x = y
y = w
k = m
m = n
n = k
End If

For i = 1 To m
If Mid(x, i, 1) Like "[A-Za-z]" Then
j = i
Do While Mid(x, i, 1) Like "[A-Za-z]"
i = i + 1
Loop

w = Mid(x, j, i - j)

k = InStr(1, y, w)

If k <= 0 Then
'do nothing - expedient to use this
ElseIf k = 1 And i - j = n Then
CommonWords = CommonWords & " " & w
ElseIf k = 1 And Mid(y, i - j + 1, 1) Like "[!A-Za-z]" Then
CommonWords = CommonWords & " " & w
ElseIf k + i - j > n And Mid(y, k - 1, 1) Like "[!A-Za-z]" Then
CommonWords = CommonWords & " " & w
ElseIf Mid(y, k - 1, i - j + 2) Like "[!A-Za-z]" & w & "[!A-Za-z]"
Then
CommonWords = CommonWords & " " & w
End If

End If

Next i

CommonWords = LTrim(CommonWords)
End Function
 
Hi Harlan,
Thanks so much! I really appreciate you writing the VBA for me. On
little thing, though, it only extracts the first matching word fro
each pair of sentences, rather than all of them. I'm hoping this migh
be a tiny glitch in the formula that would only take you a second t
spot, since you've already put so much work in!
Thanks,
Ciar
 
ciara said:
Thanks so much! I really appreciate you writing the VBA for me. One
little thing, though, it only extracts the first matching word from
each pair of sentences, rather than all of them. I'm hoping this might
be a tiny glitch in the formula that would only take you a second to
spot, since you've already put so much work in!


Function CommonWords( _
ByVal x As String, _
ByVal y As String, _
Optional z As Boolean = False _
) As String
'-----------------------------------------------------------------
'optional parameter z determines maximal matching (TRUE), meaning
'any word is repeated the greater number of times it appears in
'either string as long as it appears at least once in each string,
'or minimal matching (FALSE), meaning any word appearing at least
'once in each string appears only once in the result
'-----------------------------------------------------------------
Dim i As Long, j As Long
Dim kx As Long, ky As Long, nx As Long, ny As Long
Dim w As String

w = " "
nx = Len(x)
For i = 1 To nx
If Mid(x, i, 1) Like "[A-Za-z]" Then
w = w & Mid(x, i, 1)
Else
w = w & " "
End If
Next i
x = w & " "
nx = Len(x)

w = " "
ny = Len(y)
For j = 1 To ny
If Mid(y, j, 1) Like "[A-Za-z]" Then
w = w & Mid(y, j, 1)
Else
w = w & " "
End If
Next j
y = w & " "
ny = Len(y)

If nx > ny Then
'swap x and y to iterate through shorter string (x)
w = x
x = y
y = w
i = nx
nx = ny
ny = i
End If

For i = 2 To nx - 1
If Mid(x, i, 1) Like "[A-Za-z]" Then
j = i
Do While Mid(x, i, 1) Like "[A-Za-z]"
i = i + 1
Loop

w = Mid(x, j, i - j)

kx = (Len(x) - Len(Application.WorksheetFunction.Substitute(x, _
" " & w & " ", ""))) / (Len(w) + 2)
ky = (Len(y) - Len(Application.WorksheetFunction.Substitute(y, _
" " & w & " ", ""))) / (Len(w) + 2)

If InStr(1, CommonWords, w) = 0 And kx > 0 And ky > 0 Then
If z Then
For j = 1 To IIf(kx > ky, kx, ky)
CommonWords = CommonWords & " " & w
Next j
Else
CommonWords = CommonWords & " " & w
End If
End If

End If

Next i

CommonWords = LTrim(CommonWords)
End Function
 
Back
Top