Parse text into words?

  • Thread starter: jim_adams

I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim
 
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

The other option is to write a lexical analyzer (lexer). There might
be some .NET equivalents of the old reliable Lex and Flex. I'm not sure
they'd be faster in this case, and they seem like massive overkill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)
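
The Matches() idea above could be sketched like this (a sketch only; the
\w+ word pattern and the use of RegexOptions.Compiled and a Hashtable for
uniqueness are my choices, not the poster's):

```vb
Imports System.Collections
Imports System.Text.RegularExpressions

Module WordMatcher
    Sub Main()
        Dim text As String = "Hello, world! Hello again: world?"
        ' \w+ matches runs of word characters; compile once, reuse for every chunk.
        Dim wordPattern As New Regex("\w+", RegexOptions.Compiled)
        Dim seen As New Hashtable() ' tracks unique words

        For Each m As Match In wordPattern.Matches(text)
            If Not seen.ContainsKey(m.Value) Then
                seen.Add(m.Value, Nothing)
            End If
        Next
        ' seen now holds the keys: Hello, world, again
    End Sub
End Module
```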
 
Why can't you split on the space and replace the punctuation (since there
will only be a limited number of punctuation characters) with a space? This
seems to be the simplest and most efficient way to do it.

Dim x As String = veryLargeString

x = x.Replace(",", " ")
x = x.Replace(".", " ")
x = x.Replace(":", " ")
x = x.Replace(";", " ")

Dim y As String() = x.Split(" "c)
 
Jim,

If I understand you well, the combination of the VB method InStr and a
SortedList will be the quickest way to achieve what you want.

You then go through your text in a loop: each time a word is found you
update the starting point for InStr, and you set the word you found as the
key of the dictionary pair in the SortedList.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vblr7/html/vafctinstr.asp

http://msdn.microsoft.com/library/d...rlrfsystemcollectionssortedlistclasstopic.asp

Of one thing you can be sure with Regex: it will probably take at least 50
times longer than the approach above.

I hope this helps,

Cor
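
The InStr/SortedList approach might look like the sketch below (my
interpretation of the suggestion; splitting only on spaces is a
simplification, and the sample text is made up):

```vb
Imports System.Collections

Module InStrSplitter
    Sub Main()
        Dim text As String = "one two three two one"
        Dim words As New SortedList()
        Dim start As Integer = 1

        Do
            ' InStr is 1-based; find the next space from the current start.
            Dim pos As Integer = InStr(start, text, " ")
            Dim word As String
            If pos = 0 Then
                word = Mid(text, start)              ' last word in the text
            Else
                word = Mid(text, start, pos - start)
            End If
            If word.Length > 0 AndAlso Not words.ContainsKey(word) Then
                words.Add(word, Nothing)             ' the word is the key of the pair
            End If
            If pos = 0 Then Exit Do
            start = pos + 1                          ' advance the starting point
        Loop
        ' words now contains the keys: one, three, two
    End Sub
End Module
```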
 
Scott,

In the past I have suggested this as a kind of seventh alternative (more for
fun).

It works, but it is slow with huge strings, even slower than Regex.

(We tested this once in this newsgroup; maybe you will remember it now that
I write this.)

Cor
 
Travers said:
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.
The other option is to write a lexical analyzer (lexer). There might
be some .Net equivalents of the old reliable Lex and Flex. Not sure if
they'd be faster in this case, and seem like massive over kill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)

No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Process this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Done.
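
The steps above could be sketched as follows (a sketch, not the poster's
code; StringBuilder for the 'current word' and Char.IsLetter for the letter
test are my choices, and ProcessWord here just collects unique words):

```vb
Imports System.Collections
Imports System.Text

Module CharScanner
    Sub Main()
        Dim text As String = "Hello, world! Hello?"
        Dim unique As New Hashtable()
        Dim current As New StringBuilder()

        For Each c As Char In text.ToCharArray()
            If Char.IsLetter(c) Then
                current.Append(c)                    ' still inside a word
            ElseIf current.Length > 0 Then
                ProcessWord(current.ToString(), unique)
                current.Length = 0                   ' reset the 'current word'
            End If
        Next
        ' flush the final word if the text ends on a letter
        If current.Length > 0 Then ProcessWord(current.ToString(), unique)
        ' unique now holds the keys: Hello, world
    End Sub

    Sub ProcessWord(ByVal word As String, ByVal seen As Hashtable)
        If Not seen.ContainsKey(word) Then seen.Add(word, Nothing)
    End Sub
End Module
```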
 
Jim,
is it essential that ALL words are added to your array? If not, you could
probably optimise this by only processing the first few GB; maybe check how
many new words have been added for each GB, or each 10,000 words, or whatever.

My bet is that you will quite quickly find that you are adding very few
words, and these will be highly specialized ones, so you only need to
read the first few GB.

hth

guy
 
I need a list of unique words among all documents. Since many of the
documents will contain technical terms, now and then it's likely that a
new term will pop up.
 
Hi Cor,

Thanks for the tip. I was always under the impression that doing
string parsing in a loop was very inefficient, and that regex was the
"enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a BinarySearch against a sorted
array.

Maybe I should go down this brute force route.

Thanks,

Jim
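
The three numbered steps above could be sketched like this (a sketch of the
brute-force route; ArrayList.BinarySearch returns the bitwise complement of
the insertion point when the word is missing, which keeps the array sorted
as it grows, and the punctuation list and sample text are assumptions):

```vb
Imports System.Collections

Module BruteForce
    Sub Main()
        Dim text As String = "alpha, beta. alpha: gamma"
        ' 1) replace punctuation with spaces
        For Each p As String In New String() {",", ".", ":", ";", "!", "?"}
            text = text.Replace(p, " ")
        Next
        ' 2) split on spaces
        Dim parts As String() = text.Split(" "c)
        ' 3) binary-search a sorted list; insert at the complement when not found
        Dim sorted As New ArrayList()
        For Each word As String In parts
            If word.Length > 0 Then
                Dim idx As Integer = sorted.BinarySearch(word)
                If idx < 0 Then sorted.Insert(Not idx, word)
            End If
        Next
        ' sorted now holds: alpha, beta, gamma
    End Sub
End Module
```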
 
Larry said:
Travers said:
You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.
No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Process this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Um, that IS lexical analysis.
 
Travers said:
Larry said:
Travers said:
You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.

Sure, but is it going to be faster than the below?
Um, that IS lexical analysis.

My mistake.
 