Data Structure question

  • Thread starter: Tommy Vercetti

Tommy Vercetti
I have an application that loads a 150 MB text file containing 6.6
million strings. I want to run lookups against this collection and
quickly get a yes-or-no answer: is a given string in it?

This calls for a binary tree (or possibly a hash table). My first
solution was to use the STL with std::set<std::string>. However, that
took up 600 MB, which was unacceptable. I revised it to read the entire
150 MB text file into a single contiguous block and build a std::set of
pointers into that block, using a custom comparison operator that
compares until the end of a line. This worked, and it dramatically
reduced load time and RAM use; RAM use is now at 350 MB.
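
In rough outline, the pointer-based set looks something like this (a
sketch of the idea, not my exact code; it assumes the strings in the
block are terminated by '\n' or by the end of the buffer):

#include <set>

// Orders two pointers into the shared block by the line each one starts;
// a line ends at '\n' (or at the terminating '\0' of the buffer).
struct LineLess {
    bool operator()(const char* a, const char* b) const {
        while (*a == *b && *a != '\n' && *a != '\0') { ++a; ++b; }
        // Map the line terminator to '\0' so a prefix sorts first.
        char ca = (*a == '\n') ? '\0' : *a;
        char cb = (*b == '\n') ? '\0' : *b;
        return ca < cb;
    }
};

// One pointer per line, all pointing into the single 150 MB block.
std::set<const char*, LineLess> index;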

However, I'd like to reduce this RAM footprint further. Anyone have any
ideas for this kind of task? I'm thinking of implementing a special case
binary tree that is optimized for this specific task. Obviously, I only
need to balance the tree once and I don't need to worry about additions
or removals once everything is built.

Any ideas/suggestions?

thanks!
 
Tommy said:
I have an application that loads a 150 MB text file containing 6.6
million strings. [...] I'd like to reduce this RAM footprint further.
Anyone have any ideas for this kind of task?

Store the data in a contiguous buffer (with embedded nulls to demarcate
the strings) and store the pointers in a sorted std::vector instead of a
std::set. That should get you down to about 174 MB: the 150 MB of text
plus one pointer per string, with none of std::set's per-node overhead.
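
Something like the following (a sketch only: the file name is a
placeholder, and one string per line is assumed):

#include <algorithm>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

int main() {
    // Read the whole file into one contiguous buffer ("words.txt" is a
    // placeholder name).
    std::ifstream in("words.txt", std::ios::binary);
    std::string buf((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());

    // Replace each newline with an embedded NUL and record where each
    // string starts.
    std::vector<const char*> index;
    char* p = &buf[0];
    char* end = p + buf.size();
    while (p < end) {
        index.push_back(p);
        char* nl = std::find(p, end, '\n');
        if (nl == end) break;
        *nl = '\0';
        p = nl + 1;
    }

    // Sort once up front; after that, lookups are plain binary searches
    // with no per-node tree overhead.
    auto less = [](const char* a, const char* b) {
        return std::strcmp(a, b) < 0;
    };
    std::sort(index.begin(), index.end(), less);

    bool found = std::binary_search(index.begin(), index.end(),
                                    "some string", less);
    (void)found;
}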

A bit fancier: you could turn the entire dictionary into a large ternary
search tree, or perhaps build a large finite automaton that recognizes
exactly these strings. Either of those would require significant
pre-processing of the data, but I'm guessing that wouldn't necessarily
be a problem. Without detailed knowledge of the distribution of
characters in the strings it's difficult to predict how large either
data structure would be, but it may well be considerably smaller, and
quite possibly faster too.
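
For reference, a minimal ternary search tree could look like this (a
naive pointer-per-node sketch; a production version would use a compact
array layout to actually save memory):

#include <memory>

// One node per character: left/right hold siblings that sort
// before/after this character; mid descends to the next character.
struct TstNode {
    char c;
    bool isEnd = false;
    std::unique_ptr<TstNode> left, mid, right;
    explicit TstNode(char ch) : c(ch) {}
};

// Assumes non-empty, NUL-terminated keys. Insert keys in shuffled or
// median-first order to keep the tree roughly balanced.
void insert(std::unique_ptr<TstNode>& node, const char* s) {
    if (!node) node.reset(new TstNode(*s));
    if (*s < node->c)      insert(node->left, s);
    else if (*s > node->c) insert(node->right, s);
    else if (s[1] != '\0') insert(node->mid, s + 1);
    else                   node->isEnd = true;
}

bool contains(const TstNode* node, const char* s) {
    if (!node) return false;
    if (*s < node->c) return contains(node->left.get(), s);
    if (*s > node->c) return contains(node->right.get(), s);
    return (s[1] == '\0') ? node->isEnd : contains(node->mid.get(), s + 1);
}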

-cd
 
Ah, and since the collection is sorted you can still get logarithmic
search times.

Wow, those are actually really good ideas. I'm surprised. I will
investigate those solutions. I have just googled ternary search trees
and it looks promising. Thanks!
 
Tommy Vercetti said:
Wow, those are actually really good ideas. [...] I have just googled
ternary search trees and it looks promising.

A ternary search tree could still be a problem in terms of memory
consumption. If you don't need partial-match searches, a hash table is
probably the best choice. And if the ternary tree you build ends up
unbalanced, its lookups will be slower than a hash table's.
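
For example, here is a C++17 sketch that reuses the contiguous-buffer
trick: std::string_views pointing into the block are hashed, so the only
cost beyond the raw text is the table itself (the file name is a
placeholder, and one string per line is assumed):

#include <fstream>
#include <iterator>
#include <string>
#include <string_view>
#include <unordered_set>

int main() {
    // Load the file once into a contiguous buffer.
    std::ifstream in("words.txt", std::ios::binary);
    std::string buf((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());

    // Hash string_views that point into the buffer: no per-string
    // copies, only the table's node/bucket overhead on top of the text.
    std::unordered_set<std::string_view> table;
    table.reserve(6'600'000);  // avoid rehashing while building
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::size_t nl = buf.find('\n', pos);
        if (nl == std::string::npos) nl = buf.size();
        table.emplace(buf.data() + pos, nl - pos);
        pos = nl + 1;
    }

    bool found = table.count("some string") > 0;  // O(1) expected lookup
    (void)found;
}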

Leo
 