Get words. Suggestions to improve code.

  • Thread starter: shapper

shapper

Hello,

I am trying to get all the words in a phrase and then keep only the ones
that are longer than 3 characters.
Basically, I am trying to get the words from a phrase that carry some
meaning.
By selecting only the ones longer than three characters I ignore
"and", "so", "or", etc.
Of course this is a rough approximation, but ...

I came up with the following:

char[] delimiters = new char[] { ' ', '.', ',', ';', '!', '?', '-' };
string[] wordlist = phrase.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

// Filter words
string[] wordBiggerThen3List = wordlist.Where(w => w.Length > 3).ToArray();

Those are the delimiters that I came up with.

Any suggestion to improve my code?

Filtering the words is not really the problem.
The problem is finding the best possible way to get the words with no
punctuation or spaces attached.

Thanks,
Miguel
 
You are missing at least these: ':', '\t', '\n'. In one of your other
threads, "At least one space...", Alberto Poblacion gave you a
regex that checks the word count. I rarely use regex, but it
seems like that pattern could be modified to return the words you need.
That would stop something like $433.32 from coming back as one or two
words, which is what happens with your Split method.
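
To make the $433.32 point concrete, here is a minimal sketch, assuming the original Split approach with the three extra delimiters added. The class name, method name, and sample phrase are made up for illustration and are not from the original post.

```csharp
using System;
using System.Linq;

public static class SplitDemo
{
    // Delimiters from the original post plus ':', '\t' and '\n'.
    private static readonly char[] Delimiters =
        { ' ', '.', ',', ';', '!', '?', '-', ':', '\t', '\n' };

    // Split on the delimiters, drop empty entries, keep words longer than 3 characters.
    public static string[] LongWords(string phrase)
    {
        return phrase
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length > 3)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample phrase containing a price.
        string phrase = "It cost $433.32: too much!\tReally";
        Console.WriteLine(string.Join(",", LongWords(phrase)));
        // prints: cost,$433,much,Really
    }
}
```

Because '.' is a delimiter, the price is cut into "$433" and "32", and the "$433" fragment is long enough to pass the length filter.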
 
I don't think you could do it any more efficiently.

You say "improve" as if there's a problem. Is there a problem, or do you
have some scenario in mind here?

~ Mike
 
Mike said:
You say "improve" as if there's a problem. Is there a problem, or do you
have some scenario in mind here?

I am not sure because enumerating the delimiters feels strange.

I am trying the following:

public static IEnumerable<String> Words(this String value) {

// Match runs of word characters and apostrophes between word boundaries
MatchCollection collection = Regex.Matches(value, @"\b(?:\w|')+\b");
Match[] matches = new Match[collection.Count];
collection.CopyTo(matches, 0);
return matches.Select(m => m.Value);

} // Words

And then I apply it as follows:

keywords = String.Join(",", model.Title.Words().Where(w => w.Length > 3).Select(w => w.Capitalize()).Take(5).ToArray());

It seems better now, doesn't it?

Thanks,
Miguel
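
For anyone who wants to try this end to end, here is a self-contained sketch of the extension method and the keyword pipeline. The Capitalize() helper is not shown anywhere in this thread, so the version below is only a guess at what it does (upper-case the first letter). KeywordsDemo, Keywords and the sample title are likewise invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringExtensions
{
    // Same pattern as in the post, written on one line.
    public static IEnumerable<string> Words(this string value)
    {
        return Regex.Matches(value, @"\b(?:\w|')+\b")
                    .Cast<Match>()
                    .Select(m => m.Value);
    }

    // Assumed stand-in for Capitalize(): upper-cases the first character only.
    public static string Capitalize(this string word)
    {
        if (string.IsNullOrEmpty(word)) return word;
        return char.ToUpperInvariant(word[0]) + word.Substring(1);
    }
}

public static class KeywordsDemo
{
    // First five words longer than 3 characters, capitalized and comma-joined.
    public static string Keywords(string title)
    {
        return string.Join(",", title.Words()
                                     .Where(w => w.Length > 3)
                                     .Select(w => w.Capitalize())
                                     .Take(5)
                                     .ToArray());
    }

    public static void Main()
    {
        // Hypothetical title.
        Console.WriteLine(Keywords("Don't split words, and count the price: $433.32"));
        // prints: Don't,Split,Words,Count,Price
    }
}
```

With this pattern "$433.32" is matched as "433" and "32", both of which are too short to pass the length filter, and "Don't" stays in one piece because the apostrophe is allowed inside a word.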
 
Wish I knew some RegEx!

Well, as the other poster pointed out (about the missing punctuation), a
RegEx might be the way to go. But I have no idea what that RegEx does.

Also, you need to account for the encoding of the input: is it just
ANSI/ASCII, or Unicode? In the latter case, perhaps some more
thought is required.

~ Mike
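
On the Unicode question, a small sketch may help. In .NET, \w matches Unicode letters and digits by default, while RegexOptions.ECMAScript restricts it to [a-zA-Z0-9_]. The class name, method name, and sample input below are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnicodeWordsDemo
{
    // Return every run of \w characters under the given regex options.
    public static string[] FindWords(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"\w+", options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "café naïve", written with escapes so the source file encoding does not matter.
        string input = "caf\u00E9 na\u00EFve";

        // Default behaviour: accented letters count as word characters,
        // so this yields two words.
        Console.WriteLine(FindWords(input, RegexOptions.None).Length);       // prints: 2

        // ECMAScript behaviour: only ASCII letters, digits and '_' count,
        // so the accented letters split the text into "caf", "na", "ve".
        Console.WriteLine(FindWords(input, RegexOptions.ECMAScript).Length); // prints: 3
    }
}
```

So with the default options the regex approach already copes with accented text, which a hand-written ASCII delimiter list does not address at all.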
 