Ternary Search Trees

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)
 
Hi,
i am assuming you have libraries to do string comparison. say you are in
node "root" and it has 3 children child1, child2, child3. if the word that
you are searching is less than child1(Which you can find using strcmp or any
other string utility functions) then go down in child1. if it is between
child1 and child2(greater than child1 and lesser than child2) then go down in
child2 else go down in child3. if this answered your question well and good!
if not get back with more details of what you are trying to do.
regards,
Lingesh

Bonj said:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)
This posting is provided "AS IS" with no warranties, and confers no rights
 
I'm trying to build a word-classifying algorithm for a syntax highlighter for
T-SQL.
i am assuming you have libraries to do string comparison.

No, that's what I don't have. The strings will be std::basic_string<_TCHAR>
but that's the only library I am using, but I want the algorithm to be as
fast as possible.
say you are in
node "root" and it has 3 children child1, child2, child3. if the word that
you are searching is less than child1(Which you can find using strcmp or any
other string utility functions) then go down in child1. if it is between
child1 and child2(greater than child1 and lesser than child2) then go down in
child2 else go down in child3. if this answered your question well and good!
if not get back with more details of what you are trying to do.

What I don't get, is what form would "root" take, what form would "child1"
take, and especially what form would "is less than" take, in this
*particular* scenario? I'm hitting brick walls because the only examples I
can find are text-book generic.
For instance, if I was searching for the type of word for sp_help, the
parameter I pass in is a string with the value "sp_help" and I want to get
back the value WT_SYSTEMPROC (where WT_SYSTEMPROC is an application-defined
constant).
Is the "root" node the first letter, "s"? Is the root node instead the very
first (alphabetically) word out of the ones I want to recognise? Can you
describe the process of traversing down the tree for this example of this
specific algorithm?
 
sorry another one, what *form* would "go down" take?
i.e. go from "sp_he" to "sp_hel", one step nearer to the child node of
"sp_help"?
(am I on the right lines?)
What about "sp_helpindex", that would be a child node of "sp_help", but
"sp_help" is a child node in itself. This is one thing that confuses me.
 
Bonj said:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?

First, I would try std::map to see if you really need to go through all
the work of writing and debugging this new data structure. Also,
boost::spirit has a symbol table class that is supposedly based on a
TST, which you might look into using. (Actually, if you're doing
parsing, you should probably look into boost::spirit. It's incredibly
nifty.)
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?

Only the first character. Otherwise you could just have a binary search
tree.
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)

A TST will basically be a binary search tree on the first character of
the string and when a match is found, the node will contain another TST
that starts with the second character, and so on until you run out of
string or something doesn't match.

So you have a structure something like:

struct tst_node
{
_TCHAR match;
struct tst_node *left, *right;
union
{
struct tst_node *ptr; /* when match != 0 */
unsigned data; /* only used when match == 0 */
} down;
};

struct tst_node *root;

And for your three examples, "sp_help", "select", and "sysobjects", the
top of the TST might look like:

root -> node0

/* first level BST */

node0 =
{
match = 's'
left = right = NULL
ptr -> node1
}

/* second level BST (advance string iterator/pointer/index) */

node1 =
{
match = 'p'
left = node2 /* less than 'p' */
right = node3 /* greater than 'p' */
ptr -> subtree with only "_help"
}

node2 =
{
match = 'e'
left = right = NULL
ptr -> subtree with only "lect"
}

node3 =
{
match = 'y'
left = right = NULL
ptr -> subtree with only "sobjects"
}

Now the subtree with only "sobjects" is going to take at least 16*9 =
144 bytes, which is quite wasteful. If you're sneaky, you can store a
sequence of characters in each node rather than a single character. You
still treat it as having only one character for the purpose of the
binary search, but if that matches, you compare the rest of the
characters in that node and give up if any do not match. But this is
just an optimization, and it makes things a lot more complex. (of
course you are looking at implementing a TST, so maybe this is important
to you)

I hope this is clear enough to be of some use. :)

-josh
 
Yes, that's great, thanks.
I'll study your answer, but it seems to englighten the situation.
Now the subtree with only "sobjects" is going to take at least 16*9 =
144 bytes, which is quite wasteful. If you're sneaky, you can store a
sequence of characters in each node rather than a single character. You
still treat it as having only one character for the purpose of the
binary search, but if that matches, you compare the rest of the
characters in that node and give up if any do not match. But this is
just an optimization, and it makes things a lot more complex. (of
course you are looking at implementing a TST, so maybe this is important
to you)

That's very good to know, cheers.
 
Just another question:
One last thing I can't quite get my head round and I'm hoping you'll be able
to help:
What would happen algorithmically in the situation, where a node has got
child nodes, but is also a root node itself? For instance, "sp_help" is a
child node, but also has child nodes relating to, say, "sp_helpindex"
???

Any thoughts, like what would the code basically do?
 
Bonj said:
Just another question:
One last thing I can't quite get my head round and I'm hoping you'll be able
to help:
What would happen algorithmically in the situation, where a node has got
child nodes, but is also a root node itself? For instance, "sp_help" is a
child node, but also has child nodes relating to, say, "sp_helpindex"
???

Any thoughts, like what would the code basically do?

The node at "sp_help" will have an entry with its matching character
equal to 0 and an entry with its maching character equal to 'i'. The
first holdes the data for "sp_help" and the second goes on to match
"sp_helpindex".

If your strings are nul terminated, you are guaranteed that a node that
corresponds to a final symbol will not have children.

-josh
 
Back
Top