find by similarity

  • Thread starter Thread starter cathal
  • Start date Start date
C

cathal

I have an 800 x 7 matrix, how can I find similar cell contents, perhaps
showing similarites in different colours?

thanks, cathal...
 
cathal said:
I have an 800 x 7 matrix, how can I find similar cell contents, perhaps
showing similarites in different colours?

This would require one hell of a lot of nontrivial programming. Aside from
at least 3 hours of your time spent writing detailed specifications, I'd
guess this would take a professional at least 10 hours to implement. Total
cost roughly $1200 or higher. Is this really worth it to you?
 
....or you could try this bit of code, which took about 10
minutes to write and about the same amount of time to run.

One limitation is that it cycles through 39 colors, so if
you have more similarities than that, the colors will
be "recycled". You could always tweak that bit to allow
more colors though.

Sub FindSimilarities()
Dim a As Range, b As Range, color As Byte
Range("A1:G800").Interior.ColorIndex = xlColorIndexNone
color = 2
For Each a In Range("A1:G800")
If a.Interior.ColorIndex = xlColorIndexNone Then
For Each b In Range("A1:G800")
If b.Value = a.Value And b.Address <>
a.Address Then
a.Interior.ColorIndex = color
b.Interior.ColorIndex = color
End If
Next b
If a.Interior.ColorIndex <> xlColorIndexNone Then
color = color + 1
If color > 40 Then color = 2
End If
Next a
End Sub

HTH,
Ryan
 
I used colors 1 to 56 (all you can use). Some colors made the cell unreadable:

Option Explicit
Sub testme01()

Dim curWks As Worksheet
Dim tempWks As Worksheet
Dim iCol As Long
Dim destCell As Range
Dim myCell As Range
Dim FoundCell As Range
Dim FirstAddress As String
Dim CellCtr As Long
Dim myRng As Range

Application.ScreenUpdating = False

Set curWks = ActiveSheet
Set tempWks = Worksheets.Add

With tempWks
.Range("a1").Value = "Header"
For iCol = 1 To 7
Set destCell = .Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0)
With curWks
.Range(.Cells(1, iCol), _
.Cells(.Rows.Count, iCol).End(xlUp)).Copy _
Destination:=destCell
End With
Next iCol

.Range("a1", .Cells(.Rows.Count, "A").End(xlUp)).AdvancedFilter _
Action:=xlFilterCopy, _
CopyToRange:=.Range("B1"), _
Unique:=True

.Columns(1).Delete

Set myRng = .Range("a2", .Cells(.Rows.Count, "a").End(xlUp))

If myRng.Rows.Count > 56 Then '56 colors
MsgBox "too many values"
Application.DisplayAlerts = False
.Delete
With Application
.DisplayAlerts = True
.ScreenUpdating = True
Exit Sub
End If
End With

CellCtr = 0
With curWks.Range("A1:G800")
For Each myCell In myRng.Cells
CellCtr = CellCtr + 1
Application.StatusBar = "processing #:" & CellCtr
myCell.Interior.ColorIndex = CellCtr
Set FoundCell = Nothing
Set FoundCell = .Find(What:=myCell.Value, After:=.Cells(1), _
LookIn:=xlValues, LookAt:=xlWhole, _
SearchOrder:=xlByRows, _
SearchDirection:=xlNext, _
MatchCase:=False)

If Not FoundCell Is Nothing Then
FirstAddress = FoundCell.Address
Do
FoundCell.Interior.ColorIndex = CellCtr
Set FoundCell = .FindNext(FoundCell)
Loop While Not FoundCell Is Nothing _
And FoundCell.Address <> FirstAddress
End If
Next myCell
End With

With Application
.StatusBar = False
.ScreenUpdating = True
End With
End Sub
 
...or you could try this bit of code, which took about 10
minutes to write and about the same amount of time to run.

One limitation is that it cycles through 39 colors, so if
you have more similarities than that, the colors will
be "recycled". You could always tweak that bit to allow
more colors though.
...

Not a bad starting point, but its real strength is pointing out why the OP needs
more precise specs. You interpretted 'showing similarites in different colours'
as color coding entire cell background colors (.Interior.ColorIndex). Another
interpretation would be matching the text colors of matching substrings (e.g.,
giving each substring that would be stored separately by L-Z compression a
separate color). Using the .Color property and the RGB function would have taken
a bit more time to implement, but it would handle thousands of colors if the OP
were using 16-bit pixel color depth. Also, working with Characters objects for
each cell in the range would also have added to the complexity.

Then there's the question of how text should match. If A1 contained 'One Two',
B1 contained 'Two Three', and C1 contained 'One Two Three', should C1 match A1
(which appears first in C1) or B1 (which is a longer match than A1) or both
somehow?

Also liked the O(N^2) algorithm. If you've already iterated through the first N
rows in the outer 'For Each a' loop, there's no reason to iterate through those
rows in the 'For Each b' loop.

If you simplify any problem sufficiently, it becomes simple to solve. Whether it
remains relevant is the big question.
 
Not a bad starting point, but its real strength is
pointing out why the OP needs
more precise specs. You interpretted 'showing similarites in different colours'
as color coding entire cell background colors
(.Interior.ColorIndex). Another
interpretation would be matching the text colors of matching substrings (e.g.,
giving each substring that would be stored separately by L-Z compression a
separate color). Using the .Color property and the RGB function would have taken
a bit more time to implement, but it would handle thousands of colors if the OP
were using 16-bit pixel color depth. Also, working with Characters objects for
each cell in the range would also have added to the complexity.

Then there's the question of how text should match. If A1 contained 'One Two',
B1 contained 'Two Three', and C1 contained 'One Two Three', should C1 match A1
(which appears first in C1) or B1 (which is a longer match than A1) or both
somehow?

Also liked the O(N^2) algorithm. If you've already iterated through the first N
rows in the outer 'For Each a' loop, there's no reason to iterate through those
rows in the 'For Each b' loop.

If you simplify any problem sufficiently, it becomes simple to solve. Whether it
remains relevant is the big question.

Yes, with a bit more time and effort I could have
implemented a more efficient algorithm and wider range of
colors (I stated those shortcomings in my previous post),
and I accept that value equality may not be what the OP
was after, but rather a looser definition of 'similarity'.
The point is that 3 hours of specification, 10 hours of
programming and $1200 total cost may have been a little
excessive. Its entirely possible that this is all the OP
needed.

And if the cathal hasn't given up following this thread, I
hope either Dave's or my answer brought you closer to the
solution you sought.

Regards,
Ryan
 
Back
Top