B-Tree advice needed

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

Hello
I am making a syntax highlighter for T-SQL, and I am going to hardcode the
words into it for speed's sake (I will probably have thought up enough new
features for a new version when the next version of SQL comes out).
My main algorithm is going to go through all the words in the RICHEDIT
control, and pass each valid word it finds to a word-checking function, which
I want to operate as fast as possible as it will be called frequently. For
these purposes, I was going to try and have a b-tree based algorithm. My
thinking is that if, say, the words it has to find are (hypothetically) get,
set, and select, then it could be of the form
int getwordtype(word)
{
if((word[0] == 'g') && (word[1] == 'e') && (word[2] == 't'))
return WORDTYPE_GET;
if((word[0] == 's') && (word[1] == 'e'))
{
if((word[2] == 't')) return WORDTYPE_SET;
if((word[3] == 'l') && (word[4] == 'e') && (word[5] == 'c') &&
(word[6] == 't')) return WORDTYPE_SELECT;
}
return WORDTYPE_NONE;
}

would this the fastest method? I do want to implement a b-tree.

Obviously I am not going to type out the whole algorithm with all the words
in SQL, I am going to write a program in C# to generate the .cpp file which
will hold it.

Any tips?

Thanks!
 
Bonj said:
Hello
I am making a syntax highlighter for T-SQL, and I am going to
hardcode the words into it for speed's sake (I will probably have
thought up enough new features for a new version when the next
version of SQL comes out).
My main algorithm is going to go through all the words in the RICHEDIT
control, and pass each valid word it finds to a word-checking
function, which I want to operate as fast as possible as it will be
called frequently. For these purposes, I was going to try and have a
b-tree based algorithm. My thinking is that if, say, the words it has
to find are (hypothetically) get, set, and select, then it could be
of the form
int getwordtype(word)
{
if((word[0] == 'g') && (word[1] == 'e') && (word[2] == 't'))
return WORDTYPE_GET;
if((word[0] == 's') && (word[1] == 'e'))
{
if((word[2] == 't')) return WORDTYPE_SET;
if((word[3] == 'l') && (word[4] == 'e') && (word[5] == 'c') &&
(word[6] == 't')) return WORDTYPE_SELECT;
}
return WORDTYPE_NONE;
}

would this the fastest method? I do want to implement a b-tree.

Obviously I am not going to type out the whole algorithm with all the
words in SQL, I am going to write a program in C# to generate the
.cpp file which will hold it.

What you want is a TST - Ternary Search Tree - not a b-tree.

Do some Googling and you'll find numerous references. Here's one that
describes TST algorithms in Java.

http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html

As for generating inline C++ code to implement the algorithm, you might want
to write it straightforwardly firsrt and time it. It's possible that the
all-inline version won't be faster, and could even be slower due to the
greatly increased code size and rate of branching.

-cd
 
The conventional structure used for symbol table lookup, is a hash table.
You could use std::hash_map straight out of the box.

Keith MacDonald
 
Keith said:
The conventional structure used for symbol table lookup, is a hash
table. You could use std::hash_map straight out of the box.

For a fixed set of keys (e.g. keywords in a language), a TST typically gives
far better performance than a hash table. Hash tables are very commonly
used for dynamic symbol tables, however.

-cd
 
I don't want a unique number, I want a number of my own choosing that I
specify (e.g. 1 for keywords, 2 for functions, etc).

Can it do this?
 
A TST sounds great Carl, but I am slightly confused about 2 issues:
a) Why it is ternary - implying three child nodes per node?
b) How I would rig one up for a certain set of words, such that a certain
group of words (that I define) would give one result, another group would
give another, etc.?

in order to generate a function where I can feed a word in , and get what
type it is, or -1 if it isn't found?
NOT a unique number for each word, like a hash table?
 
Bonj said:
A TST sounds great Carl, but I am slightly confused about 2 issues:
a) Why it is ternary - implying three child nodes per node?
Yes.

b) How I would rig one up for a certain set of words, such that a
certain group of words (that I define) would give one result, another
group would give another, etc.?

You build a map using a TST to implement the "key" part of the map. In leaf
nodes of the map, you store your "data".

Did you actually look at the link I posted, or do any reading on TSTs?

-cd
 
You build a map using a TST to implement the "key" part of the map. In leaf
nodes of the map, you store your "data".

Did you actually look at the link I posted, or do any reading on TSTs?

Yes, I did try to understand it, even though I neither understand, nor
trust, Java.
Although if you look at figure 2 (click on it to enlarge), if you look at
the topmost node, "t", then go down to "o", then down to "o" again, then
right to "t", it holds the "tot" data, whereas having gone from "t" to "o" to
"o" to "t", I would have expected "toot", not "tot". And I don't understand
how some nodes can be each other's parent - I thought that in a tree an arrow
could only be traversed in one direction.

But the most baffling thing for me is how do I get it into a C++ algorithm,
and how does the data get stored? I have over 1,000 words. What does it
*look* like - a huge 'if' statement, a set of classes, function pointers,
templates, what?

Sorry for my dumbness - I will understand it, it's just that if I can't get
the first principles I can't really progress.
 
Bonj said:
Yes, I did try to understand it, even though I neither understand, nor
trust, Java.

Suggestion: use std::map for starters and replace it with a TST
implementation when you've shown that the performance matters (by profiling)
and after you have a more complete picture of what you need to do.

In the meantime, do some more research:

http://www.cs.princeton.edu/~rs/strings/

There's a TST implementation that's buried inside the Boost Spirit library,
but extracting it is probably a bit much. In any case, if you want to look
at it, it's at boost/boost/spirit/symbols/impl/tst.ipp.

-cd
 
OK, I think I sort of get what this guy's on about
(http://www.ddj.com/documents/s=921/ddj9804a/)

What would the data at each node be?
In my first attempt I thought each successive level deeper down the tree
would be one successive letter into the word.
But this seems to work on the principle that each node IS a *complete* word,
am I right? Does this mean that there would be less nodes than my first
attempt, but at each node there would be a whole word comparison? I was
hoping not to have to use the str* (_tcs*) functions....

I'll tell you what was wrong with my first attempt. I was trying to get it
so the first letter would be the first level of the tree, the second letter
the second level, etc. so that it only had to make one character comparison
and it'd have eliminated 25 26ths of all the possible words. There's then an
incredibly high chance that it would only have to do one (short-circuiting)
if statement to nail the word.
In addition, all the data was embedded into the if statements of the
algorithm, a strategy I would like to keep if possible. However this seems to
load it dynamically at runtime.... is there any way round this?
The only problem was my algorithm-generating algorithm didn't know how to
treat words that were an entire subset of another word, for example "sp_help"
and "sp_help_index" - it only listed the biggest one. So I tried to split it
up into words of defined length, but the algorithm had been going all night
and the .cpp file had grown to over 2MB, so I thought that was a no-brainer.

Any thoughts?

My main issues are the speed of the actual comparisons themselves (i.e.,
what is the minimum then can consist of), and how the data is stored (would
like it to be in the algorithm, failing that in a resource, failing that,
loaded at runtime).

If the data has to be objects, can you do some sort of 'heap dump' so I can
run the code that generates the objects, then dump them to a resource file,
and deserialize them later?

I also need it case insensitive.

Thanks for your help
 
OK. (Please also see my other post)
I pretty much get how the tree operates, I just need some specifics... as in
my other post.
when you've shown that the performance matters (by profiling)

I have done: of my list of about 1,100 words, my 'first attempt' (see other
post for details of this) took 0.7 seconds to find all the words once,
against 0.9 for a control algorithm where all the strings were bunched
together in one space-delimited string, it would then add a space to the
front and back and just do one _tcsstr search on that. That works, but it is
clearly slow... and that's including the overhead of VB writing to the
database (the algorithm is in a Win32 DLL called by VB).

So I know I need a tree, and I know how it works. But I just need to know
some specifics now, particularly relating to what each specific comparison is
(see my other post).
I think the examples are too theoretical - I understand the theory. I need
an example of what one of the comparisons actually looks like - given that
I'm using std::basic_string<_TCHAR> for the strings. Does this have a <
operator, and if so, is it fast enough? Is this what I'd want to be using?

Cheers
 
Back
Top