fuzzy string compare in .Net ?

  • Thread starter Thread starter Tim Mackey
  • Start date Start date
T

Tim Mackey

hi,
when a user reaches a 404 page not found on my web site, i want to give them
one or more 'best guess' links to the page they are looking for (MS do this
on their site...). i have a list of all the available pages to compare the
requested url against.
i've hunted around on google and newsgroups like alt.comp.fuzzy, but i can't
find a nice .NET API that i can just call in my code to give me a closest
match. i would define closest match as the one with the smallest 'edit
distance' to the original, i.e. the one that requires the least amount of
insertions, deletions and edits to match it exactly. i can't see how a
regular expression could help (but i may of course be wrong) since i want to
be quite tolerant of variances in the strings.
if there isn't such an API, i would have the time to implement one, if
someone gave me a few pointers in the right direction. it would want to be
efficient also, because i can easily see an algorithm going into n cubed or
worse here.

thanks for reading this
tim

blog: http://tim.mackey.ie
67d0ebfec70e8db3
 
You should try searching for "Levenshtein" on Google.
This should get you a number of good resources on String Edit Distance
methodologies.
At one point, I came across a great dotNet implementation. However, I cannot
locate it at the moment, but I will keep looking.
While this is not really what you want, it is a good read and has some very
informative and helpful links:
http://www.codeproject.com/string/dmetaphone6.asp
it might also give you some ideas on maybe implementing a less resource
intensive version of what you want.

Gerald
 
hi Gerald
many thanks for the pointer. i've got a working prototype using a c#
levenshtein implementation, which is a good start. i think i need more
flexibility than what it provides though.
for example, i'm searching for a best guess for 'consortium', the results of
EditDistance() are shown below.
result format: <EditDistance> <Compared With>:
6 contact
6 conor ryan
7 motorola
7 joe morris
8 bocse 2004
8 consortium members

it's clear that contact does require fewer character changes to change to
consortium, but the fact that "consortium members" has the exact word match
in it should put it ahead of all the others. i can obviously start catering
for manual cases like this, but it would be buggy and it probably take a
long time to get a generic solution.

i'm very happy with the efficiency of this algorithm, i calculated the edit
distance of a 25 character string with 165 other strings of variable length
in .06 seconds. are there other alternatives that are a little more clever,
without getting into genetic programming?!

many thanks
tim
 
You could perhaps take into account the length by dividing the edit distance
by the length to get an "mean edit distance per letter" result.

Patrice
 
hi Patrice
i was skeptical to go starting my own extensions to the algorithm, because i
could keep adding special cases till the cows come home, but i did what you
suggested and it works great! for the record, in my situation, a value below
0.60 is quite good. thanks for giving me the tip.
i've included the algorithm here and a sample usage below it to save anyone
looking it up.
many thanks
tim


public class Levenshtein
{
/// <summary>
/// Compute Levenshtein distance.
/// http://www.merriampark.com/ld.htm
/// </summary>
/// <returns>Distance between the two strings.
/// The larger the number, the bigger the difference.
/// </returns>
public static int CalcEditDistance (string s, string t)
{
int n = s.Length; //length of s
int m = t.Length; //length of t
int[,] d = new int[n + 1, m + 1]; // matrix
int cost; // cost
// Step 1
if(n == 0) return m;
if(m == 0) return n;
// Step 2
for(int i = 0; i <= n; d[i, 0] = i++);
for(int j = 0; j <= m; d[0, j] = j++);
// Step 3
for(int i = 1; i <= n;i++)
{
//Step 4
for(int j = 1; j <= m;j++)
{
// Step 5
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
// Step 6
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] +
1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}

// sample usage:
double distance = Levenshtein.CalcEditDistance(s, input);
double meanDistancePerLetter = (distance / s.Length);
 
Back
Top