One more question on TSTs

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

I almost understand TSTs, to the point where I just need to know the answer
to this:
When making a TST (in C++) that will have as its leaf nodes words that make
up SQL language and an categorising identifier for each one, and each layer
of the tree will represent comparison of a further letter within the search
string, what will happen when a particular node is a leaf node itself, but
also has leaf nodes of its own? i.e. specifically, as far as the code goes,
for this particular scenario.

for instance, the node "sp_he" has leaf nodes "sp_hel" (and possibly others)
but "sp_help" is a leaf nodes, as "sp_help" is a SQL word, BUT it also has
leaf nodes of its own, "sp_helpindex" for example.
What would I want to have going on in the C++ code to identify when this is
the case?
 
Bonj said:
I almost understand TSTs, to the point where I just need to know the
answer to this:
When making a TST (in C++) that will have as its leaf nodes words
that make up SQL language and an categorising identifier for each
one, and each layer of the tree will represent comparison of a
further letter within the search string, what will happen when a
particular node is a leaf node itself, but also has leaf nodes of its
own? i.e. specifically, as far as the code goes, for this particular
scenario.

for instance, the node "sp_he" has leaf nodes "sp_hel" (and possibly
others) but "sp_help" is a leaf nodes, as "sp_help" is a SQL word,
BUT it also has leaf nodes of its own, "sp_helpindex" for example.
What would I want to have going on in the C++ code to identify when
this is the case?

You'd store something in the data of each node that identifies it as a legal
end-node. For example, store your keyword classifier value in all potential
end-nodes, and -1 in all others. Of course, all leaf nodes are end-nodes.

If the data in your tree is dynamically allocated (I wouldn't think it would
be in your case), then you could simply store null for non-end nodes.

-cd

PS: I'm glad you stuck with it! I was going to put together a small TST
demo to post here, but just haven't had the time.
 
You'd store something in the data of each node that identifies it as a legal
end-node. For example, store your keyword classifier value in all potential
end-nodes, and -1 in all others. Of course, all leaf nodes are end-nodes.

Right, got that bit. The bit I'm still a bit muddy on is how to decide
whether the 'incoming' (i.e. in-parameter) string that's finding its way
through the tree should *use* this end-node (if it exists), or make another
comparison. For instance, if "sp_helpindex" was the 'incoming' string, then
the node that it wants to get to (its target) is the "sp_helpindex" node. It
will go via the "sp_help" node, but will somehow have to choose to make
another comparison and go down the route towards the node for "sp_helpindex",
rather than stop at "sp_help".
Further, "sp_help" will also have to go via the "sp_help" node, but will
have to stop there.
"sp_helpgobbledegook" will have to go via "sp_help" aswell, but will have to
take the decision to make another comparison with its 'g' character, and then
fall out.
My current thinking is to also pass in the length of the string aswell
(which will be known in the calling algorithm anyway) and make it so each
node knows how long a string must be to stop at it. I'm pretty sure this will
work, although I haven't got the algorithm written yet, I like to have a
clear path for the alrogithm worked out before I write it, as I program best
when I program fast!

If the data in your tree is dynamically allocated (I wouldn't think it would
be in your case), then you could simply store null for non-end nodes.

That's the other bit I really want to know. How can I construct an algorithm
that contains word nodes like this, but *isn't* dynamically generated? The
only way I can think how I'm going to do it at the moment is if I have a
'node' class (or struct). Which means populating it at runtime. Which means
loading the data when the program starts. Is there another way?
-cd

PS: I'm glad you stuck with it!

Cheers! I always find it's best when you do. The only thing that puts me off
is "this'll be useless", I'm never put off by "it'll never work" - because if
anybody can get it to work, there's no reason why I can't. And this is
something that I actually *want* as a product to use in my own day to day
work, so it can't possibly be pointless. Otherwise, if I can't see myself
using it, I couldn't possibly have the motivation to have even started it! I
think the best programs are the ones that you actually specify yourself and
write for yourself...
BTW cheers for the help on the RICHEDIT by the way, I think I've finally
mastered that (although I shouldn't speak too soon...)

I was going to put together a small TST
demo to post here, but just haven't had the time.

Let me know if/ when you do...

Cheers
 
Bonj said:
Right, got that bit. The bit I'm still a bit muddy on is how to decide
whether the 'incoming' (i.e. in-parameter) string that's finding its
way through the tree should *use* this end-node (if it exists), or
make another comparison. For instance, if "sp_helpindex" was the
'incoming' string, then the node that it wants to get to (its target)
is the "sp_helpindex" node. It will go via the "sp_help" node, but
will somehow have to choose to make another comparison and go down
the route towards the node for "sp_helpindex", rather than stop at
"sp_help".

If the next character is whitespace or end-of-text, you stop at sp_help.
Otherwise, it sp_help can't possibly match, so you continue on.
--
With best wishes,
Igor Tandetnik

"On two occasions, I have been asked [by members of Parliament], 'Pray,
Mr. Babbage, if you put into the machine wrong figures, will the right
answers come out?' I am not able to rightly apprehend the kind of
confusion of ideas that could provoke such a question." -- Charles
Babbage
 
Your criterion is very simple - do you have unconsumed letters
in your input word. If the answer is yes, you must continue -
you havent' found your word yet. Sounds logical, no?

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================
 
Yes, pretty much. That effectively compounds my logic about the length of the
word.


Cheers
 
Thanks all, for the great logic and suggestions.
But can anyone shed any light on how the data storage could be hard-coded
such that it didn't have to dynamically load the data? Or if it does, such
that the time is minimal.
My only thoughts on how this could be done at the moment are an
algorithm-generating algorithm, which spits out c++ code, which I'm sure
isn't the best way to do it.

Cheers
 
A hint: if you want efficient storage, you have to think as a C
programmer (not C++). You can do it either with pointers
to the first child and to the next sibling (NULL for child means
no children, NULL for sibling means no more siblings), or you
can use flat portable layout using offsets. Flat layout may allow
you to optimize the size for small trees, since you won't need
32 bits for offsets (with even greater impact on the upcoming
64-bit platforms!). Then depending on the approach, you either
generate a bunch of struct instances with pointers within them,
or you generate a single byte chunk (can be comprised of
WORDs/DWORDs instead if you align your data with the
offsets) with the entire tree. In either case you'd use some
code to generate your source from the original tree format.

Note: I'd advise the no-sibling pointer/offset optimization. E.g.
you have all siblings in a continuous chunk of memory and
use some tag (an invalid input char as an entry after the last
sibling for example, or a bit flag in each node) to determine the
end of the chunk. This is natural for flat memory representation.

Another advantage of the flat representation is you can load
and save the data to disk files for an alternative storage method.
Then you simply map a view of the file in your process and
use it as embedded data.

Another optimization, only possible in flat layout, is there you
can use variable size entries. If a node doesn't have children,
you don't need the offset after all. If you use the optimization
suggested to compress consecutive children with no siblings
in a single node, the data (the substring) can be varibale length.
Otherwise you'd either use a pointer for the string, or have all
nodes allocate memory sufficient for the largest string.

A word on offsets: you can use either absolute offsets from
the beginning of the whole structure, or relative offsets from
the beginning of the current node. The latter are more efficient
since they may allow you greater trees with narrower offsets.
Furthermore, even though it's natural to generate the tree in
left to right order, you can randomize the internal representation
since you use offsets to locate each chunk anyway. This may
allow you even greater trees when using narrower relative
offsets.

The top of the icing, useful mostly for larger trees where memory
is more important than speed, is that you can use progressive
Huffman encoding on each chunk to potentially minimize the
size even further.

Of course, the logic to traverse the tree will go into a C++
class, you only need C for the data itself.

As you may notice, I've done these and other embedded data
structures quite a bit in my past... :)

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================
 
A hint: if you want efficient storage, you have to think as a C
programmer (not C++). You can do it either with pointers
to the first child and to the next sibling (NULL for child means
no children, NULL for sibling means no more siblings),

What in this specific instance are you referring to as a sibling?
For instance, "sp_hel" has as one of its children, "sp_help". "sp_help" has
"sp_helpi" (leading to "sp_helpindex") as one of its children, but what would
be a sibling? Can you give me an example, 'cos I can't picture it. Or is it
just an 'extra' child, i.e. after the first one?
or you
can use flat portable layout using offsets.

Can you explain in any way how I would store and retrieve this?
Flat layout may allow
you to optimize the size for small trees

Without worrying about optimization - I can understand how I would generate
this tree of structs, when each one has a pointer to another one (or more
than one) in order to represent children. But I seem to have a mental block
figuring out how to have a flat layout when the structs contain pointers. You
see, a pointer is simply a long integer, that's pointing to a different
memory location. But that memory location is only valid for the life of the
current process (at most). So, the memory needed to store the tree of structs
may not necessarily be contigious.
What is the usual way a C programmer would get round this, in order to
persist the tree of structures as a 'block'?
, since you won't need
32 bits for offsets

I don't know what you mean by an offset...... I think I would probably
understand it if you told me how to store the tree as a 'flat' set of
structures, in a contigious memory space?
(with even greater impact on the upcoming
64-bit platforms!). Then depending on the approach, you either
generate a bunch of struct instances with pointers within them,
or you generate a single byte chunk (can be comprised of
WORDs/DWORDs instead if you align your data with the
offsets) with the entire tree. In either case you'd use some
code to generate your source from the original tree format.


You mean using an algorithm to actually achieve a level of indirection with
respect to compiling code, i.e. write an algorithm to produce C++ code? This
is what I originally tried, but when a word was a subset of another word,
like "sp_help" is of "sp_helpindex", it completely ignored "sp_help".
Probably just the way I wrote it... but is this what you would recommend I
do? If so, any tips on how I might do it better?
Note: I'd advise the no-sibling pointer/offset optimization. E.g.
you have all siblings in a continuous chunk of memory and
use some tag (an invalid input char as an entry after the last
sibling for example, or a bit flag in each node) to determine the
end of the chunk. This is natural for flat memory representation.

Another advantage of the flat representation is you can load
and save the data to disk files for an alternative storage method.
Then you simply map a view of the file in your process and
use it as embedded data.

Another optimization, only possible in flat layout, is there you
can use variable size entries. If a node doesn't have children,
you don't need the offset after all. If you use the optimization
suggested to compress consecutive children with no siblings
in a single node, the data (the substring) can be varibale length.
Otherwise you'd either use a pointer for the string, or have all
nodes allocate memory sufficient for the largest string.

A word on offsets: you can use either absolute offsets from
the beginning of the whole structure, or relative offsets from
the beginning of the current node. The latter are more efficient
since they may allow you greater trees with narrower offsets.
Furthermore, even though it's natural to generate the tree in
left to right order, you can randomize the internal representation
since you use offsets to locate each chunk anyway. This may
allow you even greater trees when using narrower relative
offsets.

oh... right ! I think I almost understand what you're going on about... is
it that you simply force all the structs to be next to each other and instead
of storing actual pointers, you just store the distance from the current one,
hence 'offset' ?
The top of the icing, useful mostly for larger trees where memory
is more important than speed, is that you can use progressive
Huffman encoding on each chunk to potentially minimize the
size even further.

Of course, the logic to traverse the tree will go into a C++
class, you only need C for the data itself.

As you may notice, I've done these and other embedded data
structures quite a bit in my past... :)

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

Bonj said:
Thanks all, for the great logic and suggestions.
But can anyone shed any light on how the data storage could be hard-coded
such that it didn't have to dynamically load the data? Or if it does, such
that the time is minimal.
My only thoughts on how this could be done at the moment are an
algorithm-generating algorithm, which spits out c++ code, which I'm sure
isn't the best way to do it.

Cheers
 
Sibling is as same as in English - your brother, sister, and in this
case a letter of an alternative word in the set. E.g. if AB and
AC are words in your language, B and C are siblings when
treated as children of A. Same with intermediate letters: for
ABC and ADE, B and D are again siblings.

I guess you grasped the concept of offset instead of a pointer
at the end of your post, so I won't go into further details (which
is a relief considering there aren't many...)

What you produce is C data, not code. With pointers it would be
a long list of:

const struct blah c_NodeN {
"data",
&c_NodeK, /* child (can be NULL instead) */
&c_NodeL /* next sibling (can be NULL instead) */
};

Note this is the least efficient representation, I'm just showing the
concept.

With flat layout it would be much simpler:

const BYTE data[] = {
// the linearized tree goes here
};

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

Bonj said:
A hint: if you want efficient storage, you have to think as a C
programmer (not C++). You can do it either with pointers
to the first child and to the next sibling (NULL for child means
no children, NULL for sibling means no more siblings),

What in this specific instance are you referring to as a sibling?
For instance, "sp_hel" has as one of its children, "sp_help". "sp_help"
has
"sp_helpi" (leading to "sp_helpindex") as one of its children, but what
would
be a sibling? Can you give me an example, 'cos I can't picture it. Or is
it
just an 'extra' child, i.e. after the first one?
or you
can use flat portable layout using offsets.

Can you explain in any way how I would store and retrieve this?
Flat layout may allow
you to optimize the size for small trees

Without worrying about optimization - I can understand how I would
generate
this tree of structs, when each one has a pointer to another one (or more
than one) in order to represent children. But I seem to have a mental
block
figuring out how to have a flat layout when the structs contain pointers.
You
see, a pointer is simply a long integer, that's pointing to a different
memory location. But that memory location is only valid for the life of
the
current process (at most). So, the memory needed to store the tree of
structs
may not necessarily be contigious.
What is the usual way a C programmer would get round this, in order to
persist the tree of structures as a 'block'?
, since you won't need
32 bits for offsets

I don't know what you mean by an offset...... I think I would probably
understand it if you told me how to store the tree as a 'flat' set of
structures, in a contigious memory space?
(with even greater impact on the upcoming
64-bit platforms!). Then depending on the approach, you either
generate a bunch of struct instances with pointers within them,
or you generate a single byte chunk (can be comprised of
WORDs/DWORDs instead if you align your data with the
offsets) with the entire tree. In either case you'd use some
code to generate your source from the original tree format.


You mean using an algorithm to actually achieve a level of indirection
with
respect to compiling code, i.e. write an algorithm to produce C++ code?
This
is what I originally tried, but when a word was a subset of another word,
like "sp_help" is of "sp_helpindex", it completely ignored "sp_help".
Probably just the way I wrote it... but is this what you would recommend I
do? If so, any tips on how I might do it better?
Note: I'd advise the no-sibling pointer/offset optimization. E.g.
you have all siblings in a continuous chunk of memory and
use some tag (an invalid input char as an entry after the last
sibling for example, or a bit flag in each node) to determine the
end of the chunk. This is natural for flat memory representation.

Another advantage of the flat representation is you can load
and save the data to disk files for an alternative storage method.
Then you simply map a view of the file in your process and
use it as embedded data.

Another optimization, only possible in flat layout, is there you
can use variable size entries. If a node doesn't have children,
you don't need the offset after all. If you use the optimization
suggested to compress consecutive children with no siblings
in a single node, the data (the substring) can be varibale length.
Otherwise you'd either use a pointer for the string, or have all
nodes allocate memory sufficient for the largest string.

A word on offsets: you can use either absolute offsets from
the beginning of the whole structure, or relative offsets from
the beginning of the current node. The latter are more efficient
since they may allow you greater trees with narrower offsets.
Furthermore, even though it's natural to generate the tree in
left to right order, you can randomize the internal representation
since you use offsets to locate each chunk anyway. This may
allow you even greater trees when using narrower relative
offsets.

oh... right ! I think I almost understand what you're going on about... is
it that you simply force all the structs to be next to each other and
instead
of storing actual pointers, you just store the distance from the current
one,
hence 'offset' ?
The top of the icing, useful mostly for larger trees where memory
is more important than speed, is that you can use progressive
Huffman encoding on each chunk to potentially minimize the
size even further.

Of course, the logic to traverse the tree will go into a C++
class, you only need C for the data itself.

As you may notice, I've done these and other embedded data
structures quite a bit in my past... :)

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

Bonj said:
Thanks all, for the great logic and suggestions.
But can anyone shed any light on how the data storage could be
hard-coded
such that it didn't have to dynamically load the data? Or if it does,
such
that the time is minimal.
My only thoughts on how this could be done at the moment are an
algorithm-generating algorithm, which spits out c++ code, which I'm
sure
isn't the best way to do it.

Cheers
 
Thanks very much, to all, for your help, I'm confident I can have a bash at
it now!


Alexander Nickolov said:
Sibling is as same as in English - your brother, sister, and in this
case a letter of an alternative word in the set. E.g. if AB and
AC are words in your language, B and C are siblings when
treated as children of A. Same with intermediate letters: for
ABC and ADE, B and D are again siblings.

I guess you grasped the concept of offset instead of a pointer
at the end of your post, so I won't go into further details (which
is a relief considering there aren't many...)

What you produce is C data, not code. With pointers it would be
a long list of:

const struct blah c_NodeN {
"data",
&c_NodeK, /* child (can be NULL instead) */
&c_NodeL /* next sibling (can be NULL instead) */
};

Note this is the least efficient representation, I'm just showing the
concept.

With flat layout it would be much simpler:

const BYTE data[] = {
// the linearized tree goes here
};

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

Bonj said:
A hint: if you want efficient storage, you have to think as a C
programmer (not C++). You can do it either with pointers
to the first child and to the next sibling (NULL for child means
no children, NULL for sibling means no more siblings),

What in this specific instance are you referring to as a sibling?
For instance, "sp_hel" has as one of its children, "sp_help". "sp_help"
has
"sp_helpi" (leading to "sp_helpindex") as one of its children, but what
would
be a sibling? Can you give me an example, 'cos I can't picture it. Or is
it
just an 'extra' child, i.e. after the first one?
or you
can use flat portable layout using offsets.

Can you explain in any way how I would store and retrieve this?
Flat layout may allow
you to optimize the size for small trees

Without worrying about optimization - I can understand how I would
generate
this tree of structs, when each one has a pointer to another one (or more
than one) in order to represent children. But I seem to have a mental
block
figuring out how to have a flat layout when the structs contain pointers.
You
see, a pointer is simply a long integer, that's pointing to a different
memory location. But that memory location is only valid for the life of
the
current process (at most). So, the memory needed to store the tree of
structs
may not necessarily be contigious.
What is the usual way a C programmer would get round this, in order to
persist the tree of structures as a 'block'?
, since you won't need
32 bits for offsets

I don't know what you mean by an offset...... I think I would probably
understand it if you told me how to store the tree as a 'flat' set of
structures, in a contigious memory space?
(with even greater impact on the upcoming
64-bit platforms!). Then depending on the approach, you either
generate a bunch of struct instances with pointers within them,
or you generate a single byte chunk (can be comprised of
WORDs/DWORDs instead if you align your data with the
offsets) with the entire tree. In either case you'd use some
code to generate your source from the original tree format.


You mean using an algorithm to actually achieve a level of indirection
with
respect to compiling code, i.e. write an algorithm to produce C++ code?
This
is what I originally tried, but when a word was a subset of another word,
like "sp_help" is of "sp_helpindex", it completely ignored "sp_help".
Probably just the way I wrote it... but is this what you would recommend I
do? If so, any tips on how I might do it better?
Note: I'd advise the no-sibling pointer/offset optimization. E.g.
you have all siblings in a continuous chunk of memory and
use some tag (an invalid input char as an entry after the last
sibling for example, or a bit flag in each node) to determine the
end of the chunk. This is natural for flat memory representation.

Another advantage of the flat representation is you can load
and save the data to disk files for an alternative storage method.
Then you simply map a view of the file in your process and
use it as embedded data.

Another optimization, only possible in flat layout, is there you
can use variable size entries. If a node doesn't have children,
you don't need the offset after all. If you use the optimization
suggested to compress consecutive children with no siblings
in a single node, the data (the substring) can be varibale length.
Otherwise you'd either use a pointer for the string, or have all
nodes allocate memory sufficient for the largest string.

A word on offsets: you can use either absolute offsets from
the beginning of the whole structure, or relative offsets from
the beginning of the current node. The latter are more efficient
since they may allow you greater trees with narrower offsets.
Furthermore, even though it's natural to generate the tree in
left to right order, you can randomize the internal representation
since you use offsets to locate each chunk anyway. This may
allow you even greater trees when using narrower relative
offsets.

oh... right ! I think I almost understand what you're going on about... is
it that you simply force all the structs to be next to each other and
instead
of storing actual pointers, you just store the distance from the current
one,
hence 'offset' ?
The top of the icing, useful mostly for larger trees where memory
is more important than speed, is that you can use progressive
Huffman encoding on each chunk to potentially minimize the
size even further.

Of course, the logic to traverse the tree will go into a C++
class, you only need C for the data itself.

As you may notice, I've done these and other embedded data
structures quite a bit in my past... :)

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: (e-mail address removed)
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

Thanks all, for the great logic and suggestions.
But can anyone shed any light on how the data storage could be
hard-coded
such that it didn't have to dynamically load the data? Or if it does,
such
that the time is minimal.
My only thoughts on how this could be done at the moment are an
algorithm-generating algorithm, which spits out c++ code, which I'm
sure
isn't the best way to do it.

Cheers
 
Back
Top