Frequency of Words in Excel, a little different . . . .

  • Thread starter Thread starter devon
  • Start date Start date
D

devon

Ok, here is the deal. I have 52 columns (one per week) and 10 rows
(representing top 10 sellers). The top 10 lists are store names such
as 'Mall of America', 'Bridgewater Commons', etc. In each column I
have the weeks top selling malls listed.

I want to count the number of times these mall names appear. I tried
the =COUNTIF function, but this will only count all words in a cell.
So if Bridgewater Commons is on my list 23 times but is spelled Common
sometimes or Commons other times, excel will not pick this up. It
will only pick up exactly what is in the formula, understandably.

What I'd like to do is do a =COUNTIF function just using the word
Bridgewater. Excel doesn't seem to do this weith the =COUNTIF
formula.

Any ideas? The bottom line is, we want to count how many times
specific stores made it into the top ten for the year. Still with me?
I hope I'm making sense!
 
Hi Devon,

You can use wildcards with certain functions and Countif
is one of them:

=COUNTIF(A$29:A$39,"*Bridgewater*")

This will count all cells that contain the word
Bridgewater.

Biff
 
...
...
Any ideas? The bottom line is, we want to count how many times
specific stores made it into the top ten for the year. Still with me?

This is one of the biggest problems faced by most large companies in any system
in which users (i.e., non-IT employees) enter data. You could search for
substrings, but if you have XYZ, X Y Z and X.Y.Z. all representing the same
entity, that won't work. There are fuzzy matching algorithms, but they require
some nontrivial programming. Bottom line: the most reliable way to get the
information you need from your data is for *YOU* to take responsibility for your
data quality. That means ensuring one and only one way of spelling any
customer's name.
 
Thanks Biff. Being the dummy I am, I found a better way to do this.
I did try the wildcards first but then realized I could just use the
store numbers. Major ID 10 T error on my part. :)
 
Now I have another issue. I have all the results in a column thanks
to the =COUNTIF function. I know want to sort these numbers from high
to low. When I do this, I get all sorts of errors. Any idea why, and
how I can fix the problem?
 
Harlan,

Thanks for the tip but your response didn't help fix the issue.
Luckily I'm just a temp and I'm not the one who screwed up the system.
I agree, names should be exactly the same every time they are listed.

Anyone have ideas for my sorting issue?

Thanks,
Devon
 
Hi,

Your problem needs what is called 'approximate string matching' algos.
The key concept of the field is the 'edit distance'.

Although there are fairly sophisticated algos to computer the 'edit distance',
your presentation leads me to think a simple approach could suit you well. The
one here only compares the letters occurences (whatever their respective
position in the strings). So "Hello World" would be perfectly compliant with
"dlroW olleH".

In your case (misspelled company names), I do believe you'd obtain very good
results.


Strategy:
1. Build a list of correct stores name called STORES (elsewhere in your sheet,
or even in another sheet).
2. Using a VBA fonction that will return the 'closest' name in STORES for every
name in column A.
3. Then, you could use that to replace the store name with the standard ones,
count them, etc.


So let's use a VBA function returning a 'compliance' factor (from 0 to 1, or 0%
to 100%) between two strings.

Then build a IterCmp() function that returns the indice (in STORES) of the
closest match (highest compliance factor returned) to A1.

Assuming:
Your stores in A1:Ax
Your correct store names in range STORES
In B1:
=INDEX(STORES,IterCmp(STORES,A1))

Copy to Bx

Then you do what you want with the results (count, sum, replace by value Column
A content, etc.)
Below is a simple VBA implementation (cmpTab could be improved a lot but I have
no time). But again, given long strings to identify your stores, the results
should be pretty close to what you need.


Public Function IterCmp(aRng As Range, S As String) As Long
Dim C As Range, i As Long, temp As Double, TheMax As Double
Dim V1 As Variant, len1 As Integer
TheMax = 0: i = 0
V1 = TL(S): len1 = Len(S)

For Each C In aRng
i = i + 1
temp = cmpTab(V1, len1, TL(C), Len(C))
If temp > TheMax Then TheMax = temp: IterCmp = i
Next C
End Function


Public Function cmpTab(ByRef V1 As Variant, ByRef len1 As Integer, _
ByRef V2 As Variant, ByRef len2 As Integer) As Double
Dim i As Integer, cDiff As Integer, nbLet As Integer

nbLet = IIf(len1 > len2, len1, len2)
If nbLet = 0 Then Exit Function

cDiff = 0
For i = 0 To 255
cDiff = cDiff + Abs(V1(i) - V2(i))
Next i

cmpTab = 1 - (cDiff / nbLet)
If cmpTab < 0 Then cmpTab = 0
End Function

Public Function TL(ByVal S As String) As Variant
Dim V(0 To 255), i%, j%, C%

j = Len(S)
For i = 1 To j
C = Asc(Mid(S, i, 1))
V(C) = V(C) + 1
Next i
TL = V
End Function

Regards,

Daniel M.
 
Now I have another issue. I have all the results in a column thanks
to the =COUNTIF function. I know want to sort these numbers from high
to low. When I do this, I get all sorts of errors. Any idea why, and
how I can fix the problem?
...

What errors? #VALUE! or #REF! errors? Or something else? What exactly do the
COUNTIF function calls look like? Are you using absolute range addresses?
 
Make sure the range being counted uses $ to show that it will never
change even if you move the formula 50000 rows down.

Though if you didn't, and you started the formula at the top of the
data set and copied it down, you would get incorrect counts since
every row you copy the formula down would result in 1 fewer row being
looked at by the formula (but might not affect the bottom of the data
set if you used something like A1:A5000).

make your your countif function is like +countif(a$1:a$2000,{XXXX})
 
Uhm disregard what I said about having to extend the range further
than the data set, obviously it would extend itself if it were
relational and you copied it down (and would still be whacked if you
didn't use $ and tried to sort the range, depending on the size), but
its best to use $ around the top and the bottom boundaries anyways.

:|
 
Back
Top