Anagram Algorithm - How can I improve it?

  • Thread starter Thread starter Andrew
  • Start date Start date
A

Andrew

Hi all

I am looking for your advice on a project of mine to create a fully-
featured dictionary class.

I have a source file with around 75000 words of UK English. These are
sorted, then loaded into a Dictionary object, each with a long as its
key.

I have methods (code at the end of this message) as follows:

1) CheckWordExists: Check if a given word exists. If so, returns its
position in the dictionary, as a long. If not, returns the compliment
of its nearest neighbour

2) ScrambleWord: Takes a string, and returns the letters randomly
scrambled

3) ReturnRandomWord: Returns a random word from the dictionary. (Code
not listed below)

4) FindAnagram: Given a string, invokes a loop, calling ScrambleWord.
Then passes this to CheckWordExists to determine if this is a genuine
word rather than a random string of characters. If not, passes the
string to ScrambleWord again, to retrieve another anagram and checks
if this is a real word. This loops either until a valid word is
returned (which is not the original word!) or until 100000 attempts
have been made, at which point it gives up!

Now, the first 3 routines work nicely - taking usually less than a
millisecond to return their values.

However, the FindAnagram method is rather slow. I created a test loop
which does the following:
1) Calls ReturnRandomWord to generate a random 8-character word
2) Passes this to FindAnagram to try to find an anagram
3) If this fails to find an anagram, generate another random word and
try again.

Using a couple of datetime variables and a timespan variable, I tested
this a few times, and gained the following information:

1) Each time FindAnagram runs, it takes an average of 4.3 seconds to
either find an anagram, or report failure.
2) Using 8-letter words results in finding an anagram for, on average,
1 word in 23 taking an average of 1:42 minutes.


Now, I'm wondering if I'm missing something - is there a better way to
approach this than the "brute force" method I'm using to find
anagrams?

If anyone has any thoughts or input, I'd be really grateful!

Thanks in advance
Andrew


Code: (dictionaryList is the name of the Dictionary object which
contains the list of words)
-----------------------------------------------------------
Function CheckWordExists(ByVal wordToCheck As String) As Long
'Function takes a word to find in the dictionary
'If found, will return the value of its position in the
dictionary
'If not, will return the compliment of its nearest neighbour
'NOTE- All values start at 1 - ie the first word in the
dictionary is word 1, etc
' This is so that if a word cannot be found, we never return a
-0 where the first word
' is the closest match.
Dim firstWord As String, lastWord As String
Dim firstVal As Double, lastVal As Double


wordToCheck = wordToCheck.ToLower
firstVal = 0
lastVal = dictionaryList.Count - 1
firstWord = dictionaryList.Item(CLng(firstVal)).ToLower
lastWord = dictionaryList.Item(CLng(lastVal)).ToLower
If wordToCheck >= firstWord And wordToCheck <= lastWord Then
Do Until firstWord = wordToCheck Or lastWord = wordToCheck
'Check if first and last vals are already adjacent -
we have no match
If (lastVal - firstVal) < 1 Then Exit Do
If firstWord.CompareTo(wordToCheck) < 0 Then 'first
word is before wordtocheck
If lastWord.CompareTo(wordToCheck) > 0 Then
'Last word is after wordtocheck
lastVal = firstVal + ((lastVal - firstVal) /
2) 'Move last word back to check again
Else 'Move first and last words up to check again
Dim diff As Double
diff = lastVal - firstVal
firstVal = lastVal
lastVal += diff
End If
Else 'First word is after wordtocheck
'Move firstval up by half difference between first
and last vals
firstVal = firstVal + (lastVal - firstVal) / 2
End If
firstWord =
dictionaryList.Item(CLng(firstVal)).ToLower
lastWord = dictionaryList.Item(CLng(lastVal)).ToLower
Loop
End If
'By this point, we are as close as possible to a match
If firstWord = wordToCheck Then
Return CLng(firstVal) + 1
ElseIf lastWord = wordToCheck Then
Return CLng(lastVal) + 1
Else
Dim i As Integer
On Error Resume Next

Do Until (i = Math.Min(firstWord.Length,
Math.Min(lastWord.Length, wordToCheck.Length)) - 1)
'Find whether firstWord or lastWord is the closest
match
'Do this by comparing each, letter by letter, with
wordToCheck
'until one is further away than the other.
'Stop if we run out of letters to compare!
If
Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
firstWord.Chars(i))) < _
Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
lastWord.Chars(i))) Then
Return -firstVal - 1

ElseIf
Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
firstWord.Chars(i))) > _

Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
lastWord.Chars(i))) Then
Return -lastVal - 1
End If
i = i + 1

Loop


Return -firstVal - 1 'By this point the two words are
identical to the given word,
'simply having more letters, which differ. Just return the
first one - no point checking further
End If
End Function

--------------------

Public Function ScrambleWord(ByVal OriginalWord As String) As
String
Dim NewWord(OriginalWord.Length - 1) As Char
Dim rtnString As String = String.Empty
Dim newPos As Integer

Randomize()
For i As Integer = 0 To OriginalWord.Length - 1
Do
newPos = Int(Rnd() * (OriginalWord.Length))
Loop Until NewWord(newPos) = Nothing
NewWord(newPos) = OriginalWord.Chars(i)
Next
For i As Integer = 0 To OriginalWord.Length - 1
rtnString &= NewWord(i)
Next
Return rtnString
End Function

-----------------------------------------------

Public Function FindAnagram(ByVal OriginalWord As String) As
String
Dim testWord As String = String.Empty
Dim i As Integer

Do
i = i + 1
If i > 100000 Then Return "-" 'Run 362880 checks, then
give up (value is 9 factorial)
testWord = Me.ScrambleWord(OriginalWord)
Loop Until Me.CheckWordExists(testWord) > 0 AndAlso
testWord.ToLower <> OriginalWord.ToLower

Return testWord

End Function
----------------------------------------------
 
This sounds like something very similar to a scrabble clone I am working on.
It seems that B-star search algorithm is the key to my needs and perhaps
yours as well. I haven't implemented it yet, nor have I found decent
articles on the inner workings of the algorithm.
 
Hi Andrew,
However, the FindAnagram method is rather slow. I created a test loop
which does the following: 1) Calls ReturnRandomWord to generate a
random 8-character word 2) Passes this to FindAnagram to try to find
an anagram 3) If this fails to find an anagram, generate another
random word and try again.

do you really want to find an anagram or a similar word? For similar
words, maybe also for an anagram you will have more success with a
combination of algorithms like Double Metaphone and Levensthein.


Regards,

Mathias Wuehrmann
 
Hi Andrew,


do you really want to find an anagram or a similar word? For similar
words, maybe also for an anagram you will have more success with a
combination of algorithms like Double Metaphone and Levensthein.

Regards,

Mathias Wuehrmann

No - I really want an anagram! In other words, I'm after another word
(Word B) which can be made using only and all the letters of a given
Word A. Don't mind how dissimilar they are (as long as they're the
same language... :-) )

Andrew
 
4) FindAnagram: Given a string, invokes a loop, calling ScrambleWord.
Then passes this to CheckWordExists to determine if this is a genuine
word rather than a random string of characters. If not, passes the
string to ScrambleWord again, to retrieve another anagram and checks
if this is a real word. This loops either until a valid word is
returned (which is not the original word!) or until 100000 attempts
have been made, at which point it gives up!

However, the FindAnagram method is rather slow. I created a test loop
which does the following:

For anagrams, make a second dictionary whose keys are the words of your
dictionary with letters arranged alphabetically and whose values are all the
words that map to the same key.

In other words, if your dictionary contains both 'pear' and 'pare', then the
new dictionary will contain key 'aepr' and value 'pear' and 'pare'. The
value could be the concatenation of the words, or a collection of words, or
whatever is convenient. The main idea is to clump together words that are
anagrams. You can produce this anagram dictionary with a simple loop over
your dictionary. For words that have no other anagrams, it is up to you
whether you delete them or not from the anagram dictionary.
 
AMercer said:
For anagrams, make a second dictionary whose keys are the words of your
dictionary with letters arranged alphabetically and whose values are all the
words that map to the same key.

In other words, if your dictionary contains both 'pear' and 'pare', then the
new dictionary will contain key 'aepr' and value 'pear' and 'pare'. The
value could be the concatenation of the words, or a collection of words, or
whatever is convenient. The main idea is to clump together words that are
anagrams. You can produce this anagram dictionary with a simple loop over
your dictionary. For words that have no other anagrams, it is up to you
whether you delete them or not from the anagram dictionary.

Now that is nice! I hadn't thought of approaching it that way.
 
One obvious weakness to your FindAnagram method is that it could
conceivably be checking the same combinations of letters many times.
So, it's worse that brute force really.

AMercer's suggestion is a good one if all you're looking to do is find
words that can be made from ALL of a series of letters (e.g. find six
letter word that can be made from six letters). And that seems to be
what you're looking for here.

That being said, if you wanted to generate a function that could find
the words, of any length, that could be made by using the letters in a
series of letters (something that would certainly be of value in
Scrabble like game), you could write an algorithm that systematically
generated words of varying lengths, then looked them up, after
scrambling letters to be in alphabetical order, using AMercer's
approach.
 
Andrew,

This is in fact outside my knowledge as it is about language, but as I had
to do it without a real educated one to help me in that part, then for
English and Dutch dictionary I would start removing all vowels on both side
and replacing the c, s, z by s.

In Dutch I would than replace all "ch" by g, etc. Then I would search the
words while using inside the word string the Find because that is the fasted
for that.

Cor
 
This is interesting.

There is a site called a word a day which provides anagrams - which is
similar to what you want.
http://wordsmith.org/anagram/index.html (I enjoy their daily a-word-a-day
mailings )
But since you are doing this for scrabble, wouldn't it be cool to have a
blank tile?
 
Firstly, it sounds very inefficient to just be looping and calling the
FindAnagram method up to a maximum of 10,000 times. Surely you're likely to
get repeated hits on the same jumbled array of letters that way? If it were
me, I'd design a routine that iterates through every possible combination of
arrangement of the letters in the word and I'd do it in alphabetical order
as well.

Secondly, have a SortedList of the words and check for the existence of the
word in that. The framework should find it much easier to search through
the SortedList using binary chops so you should see better performance.
Watch your heap though, with 75k+ strings loaded into two different
collection classes you'll probably start paging and that won't be good news
for lookup speed.

-Alex
 
Firstly, it sounds very inefficient to just be looping and calling the
FindAnagram method up to a maximum of 10,000 times.  Surely you're likely to
get repeated hits on the same jumbled array of letters that way?  If it were
me, I'd design a routine that iterates through every possible combination of
arrangement of the letters in the word and I'd do it in alphabetical order
as well.

Secondly, have a SortedList of the words and check for the existence of the
word in that.  The framework should find it much easier to search through
the SortedList using binary chops so you should see better performance.
Watch your heap though, with 75k+ strings loaded into two different
collection classes you'll probably start paging and that won't be good news
for lookup speed.

-Alex































- Show quoted text -

Okay. Thanks to everyone who responded - I really appreciate all your
input and you've certainly given me some things to think about.

One or two specific thoughts:
1) I'm not actually trying to create a scrabble clone, as has been
suggested, so while the ideas of finding all words (including shorter
words) that can be made out of a given set of letters, or of including
"blanks" is not directly relevant to the project. But it might be an
interesting thing to have a play with later....

2) In response to AMercer - that sounds like a good way to tackle it,
and I may give that a go as my next step...

3) In response to Kerry Moorman - no, I hadn't googled it (have now!)
because part of the challenge was to find a way without (initially)
relying on someone else's algorithm. I know you could argue that
posting for help here amounts to the same thing, but I wanted to have
a go and then learn from my mistakes rather than simply tap into some
pre-determined solution.

4) Thanks to all those who pointed out that simply looping over a
"random scrambler" routine will result in much wasted effort as the
same non-words are regenerated multiple times. This was something that
had occurred to me too, so was one of the first areas I was going to
look at in coming back to my code.

Incidentally, this seems to have struck a bit of a common chord with
people saying "I'm working on something similar..." or "I've done that
with..." . I'm sure this is an obvious question that I could just
google again (I think I just prefer the human interaction!) but are
there sites out there where (say) a few people get together to work
interactively on stuff like this, such as developing Scrabble clones,
or other similar things, just for fun and as a way of learning more
from each other?

Thanks again for your input all.

Andrew
 
Andrew,
To a great extent, the open source movement is just that - a project where
many developers collaborate to develop software.
Sourceforge http://Sourceforge.net is the meeting place for open source
projects. A search for scrabble generated 37 matches.

HS
 
Well, I have been reticent to post this but a long time ago, in the days
of VB6 (or maybe even 5 of 4, I don't remember), I wrote a little
program to "solve" the daily "Jumble" puzzles. It is limited to 6
letters so it runs fairly fast. I had a dictionary that I took all of
the words of 6 letters of less and put them into an Access database (if
I remember correctly).

I then used the code below to find all permutations of the word I wanted
to figure out. I then listed all of the matches from the dictionary on
the screen. Frequently there was only one and that was the answer to
the Jumble puzzle word.

The routine makes sure that a single letter is not used twice in the
test word and then tries to find it.

Brute force, indeed, but it was quite fast. I never had a speed
complaint. The Findword routine is the one which looked up the test
word in the database. The duplicate business was so that the same word
would not be listed on screen twice. This could happen with a word with
a repeated letter.

Here it is. **Try not to laugh at it.** It is quite old but still
works.

Mike

For p1 = 1 To 6
For p2 = 1 To 6
For p3 = 1 To 6
For p4 = 1 To 6
For p5 = 1 To 6
For p6 = 1 To 6
If p2 = p1 Or p3 = p2 Or p3 = p1 Or p4 = p3 Or p4 = p2 Or
p4 = p1 Or p5 = p4 Or p5 = p3 Or p5 = p2 Or p5 = p1 Or p6 = p5 Or p6 =
p4 Or p6 = p3 Or p6 = p2 Or p6 = p1 Then
Else
created_word = ""

testchar = Mid$(txtJumble, p1, 1)
If testchar <> " " Then created_word = created_word &
testchar
testchar = Mid$(txtJumble, p2, 1)
If testchar <> " " Then created_word = created_word &
testchar
testchar = Mid$(txtJumble, p3, 1)
If testchar <> " " Then created_word = created_word &
testchar
testchar = Mid$(txtJumble, p4, 1)
If testchar <> " " Then created_word = created_word &
testchar
testchar = Mid$(txtJumble, p5, 1)
If testchar <> " " Then created_word = created_word &
testchar
testchar = Mid$(txtJumble, p6, 1)
If testchar <> " " Then created_word = created_word &
testchar

If FindWord(created_word) Then
duplicate = False
For i = 0 To lstFound.ListCount - 1
If lstFound.List(i) = created_word Then
duplicate = True
Exit For
End If
Next i

If Not duplicate Then
lstFound.AddItem created_word
DoEvents
End If ' not duplicate
End If ' word found
End If
Next
Next
Next
Next
Next
Next

---------------------------------------
 
Back
Top