Parse text into words?

  • Thread starter: jim_adams

I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim
 
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

The other option is to write a lexical analyzer (lexer). There might
be some .NET equivalents of the old reliable Lex and Flex. I'm not sure
they'd be faster in this case, and they seem like massive overkill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)
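
The Matches() idea above could be sketched like this (a sketch only; the
\w+ word pattern and the use of RegexOptions.Compiled and a Hashtable for
uniqueness are my choices, not the poster's):

```vb
Imports System.Collections
Imports System.Text.RegularExpressions

Module WordMatcher
    Sub Main()
        Dim text As String = "Hello, world! Hello again: world?"
        ' \w+ matches runs of word characters; compile once, reuse for every chunk.
        Dim wordPattern As New Regex("\w+", RegexOptions.Compiled)
        Dim seen As New Hashtable() ' tracks unique words

        For Each m As Match In wordPattern.Matches(text)
            If Not seen.ContainsKey(m.Value) Then
                seen.Add(m.Value, Nothing)
            End If
        Next
        ' seen now holds the keys: Hello, world, again
    End Sub
End Module
```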
 
Why can't you split on the space and replace the punctuation (since there
will only be a limited number of punctuation characters) with a space? This
seems to be the simplest and most efficient way to do it.

Dim x As String = veryLargeString

x = x.Replace(",", " ")
x = x.Replace(".", " ")
x = x.Replace(":", " ")
x = x.Replace(";", " ")

Dim y As String() = x.Split(" "c)
 
Jim,

If I understand you well, the combination of the VB method InStr and a
SortedList will be the quickest way to achieve what you want.

You then go through your text in a loop: each time a word is found you
update the starting point for InStr, and you set the word you found as the
key of the dictionary pair in the SortedList.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vblr7/html/vafctinstr.asp

http://msdn.microsoft.com/library/d...rlrfsystemcollectionssortedlistclasstopic.asp

Of one thing you can be sure with Regex: it will probably take at least 50
times longer than the approach above.

I hope this helps,

Cor
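
The InStr/SortedList approach might look like the sketch below (my
interpretation of the suggestion; splitting only on spaces is a
simplification, and the sample text is made up):

```vb
Imports System.Collections

Module InStrSplitter
    Sub Main()
        Dim text As String = "one two three two one"
        Dim words As New SortedList()
        Dim start As Integer = 1

        Do
            ' InStr is 1-based; find the next space from the current start.
            Dim pos As Integer = InStr(start, text, " ")
            Dim word As String
            If pos = 0 Then
                word = Mid(text, start)              ' last word in the text
            Else
                word = Mid(text, start, pos - start)
            End If
            If word.Length > 0 AndAlso Not words.ContainsKey(word) Then
                words.Add(word, Nothing)             ' the word is the key of the pair
            End If
            If pos = 0 Then Exit Do
            start = pos + 1                          ' advance the starting point
        Loop
        ' words now contains the keys: one, three, two
    End Sub
End Module
```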
 
Scott,

In the past I have suggested this as a kind of seventh alternative (more for
fun).

It works, but it is slow with huge strings, even slower than Regex.

(We tested this once in this newsgroup; maybe you will remember it now that
I write this.)

Cor
 
Travers said:
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.
The other option is to write a lexical analyzer (lexer). There might
be some .Net equivalents of the old reliable Lex and Flex. Not sure if
they'd be faster in this case, and seem like massive over kill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)

No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Process this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Done.
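
The steps above could be sketched as follows (a sketch, not the poster's
code; StringBuilder for the 'current word' and Char.IsLetter for the letter
test are my choices, and ProcessWord here just collects unique words):

```vb
Imports System.Collections
Imports System.Text

Module CharScanner
    Sub Main()
        Dim text As String = "Hello, world! Hello?"
        Dim unique As New Hashtable()
        Dim current As New StringBuilder()

        For Each c As Char In text.ToCharArray()
            If Char.IsLetter(c) Then
                current.Append(c)                    ' still inside a word
            ElseIf current.Length > 0 Then
                ProcessWord(current.ToString(), unique)
                current.Length = 0                   ' reset the 'current word'
            End If
        Next
        ' flush the final word if the text ends on a letter
        If current.Length > 0 Then ProcessWord(current.ToString(), unique)
        ' unique now holds the keys: Hello, world
    End Sub

    Sub ProcessWord(ByVal word As String, ByVal seen As Hashtable)
        If Not seen.ContainsKey(word) Then seen.Add(word, Nothing)
    End Sub
End Module
```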
 
Jim,
is it essential that ALL words are added to your array? If not, you could
probably optimise this by only processing the first few GB; maybe check how
many new words have been added for each GB, or each 10,000 words, or whatever.

My bet is that you will quite quickly find that you are adding very few
words, and these will be highly specialized ones, so you only need to
read the first few GB.

hth

guy
 
I need a list of unique words among all documents. Since many of the
documents will contain technical terms, now and then it's likely that a
new term will pop up.
 
Hi Cor,

Thanks for the tip. I was always under the impression that doing
string parsing in a loop was very inefficient, and that regex was the
"enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a BinarySearch against a sorted
array.

Maybe I should go down this brute force route.

Thanks,

Jim
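
The three numbered steps above could be sketched like this (a sketch of the
brute-force route; ArrayList.BinarySearch returns the bitwise complement of
the insertion point when the word is missing, which keeps the array sorted
as it grows, and the punctuation list and sample text are assumptions):

```vb
Imports System.Collections

Module BruteForce
    Sub Main()
        Dim text As String = "alpha, beta. alpha: gamma"
        ' 1) replace punctuation with spaces
        For Each p As String In New String() {",", ".", ":", ";", "!", "?"}
            text = text.Replace(p, " ")
        Next
        ' 2) split on spaces
        Dim parts As String() = text.Split(" "c)
        ' 3) binary-search a sorted list; insert at the complement when not found
        Dim sorted As New ArrayList()
        For Each word As String In parts
            If word.Length > 0 Then
                Dim idx As Integer = sorted.BinarySearch(word)
                If idx < 0 Then sorted.Insert(Not idx, word)
            End If
        Next
        ' sorted now holds: alpha, beta, gamma
    End Sub
End Module
```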
 
Larry said:
Travers said:
You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.
No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Process this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Um, that IS lexical analysis.
 
Travers said:
Larry said:
Travers said:
You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.

Sure, but is it going to be faster than the below?
Um, that IS lexical analysis.

My mistake.
 