I was told "Fields are expensive, records are cheap"

  • Thread starter: Dennis
BruceM:

Your comment: It is unusual to add 30 fields to a normalized table and
still maintain the normalization, yet you seem unwilling to describe the new
need for 30 fields.

Response: That is because these fields had NOTHING to do with the
question! As I repeatedly stated, and as was repeatedly missed, the
question concerned disk access speed over the network and the speed with
which Access extracts extra values from a big record. That was the
question and the only question. Allen Browne was kind enough to answer it.

Your comment: But all we have to go on is that you have 30 years
experience, from which it seems we are to accept that the design is beyond
question.

Response: I’ve re-read this entire series of comments and nowhere in there
did I say “you are to accept that the design is beyond question”. I wish
you would point out where I said this. But again, you are taking a truly
tiny bit of background for a question and making it a mountain. The issue
you want to discuss had NOTHING to do with my original question, if you had
bothered to read it. If I remember, in my response to Duane, I said that
I’ve cleaned up other people’s messes and I’m sure other people have come
along behind me and cleaned up messes that I left behind.

Your Comment: Had you been willing to describe the database's structure it
could be that somebody here, many of whom are well-trained and very
experienced, could have suggested a different approach.

Response:
1. This question had nothing to do with the db design. As stated above,
I've always been taught the exact opposite - that "Fields are cheap, records
are expensive" since going to disk is so slow versus accessing data in memory.

This is a disk access versus memory access question.


Your comments: Had you been willing to describe the database's structure it
could be that somebody here, many of whom are well-trained and very
experienced, could have suggested a different approach. You could have
accepted or rejected the suggestions, but since that conversation did not
take place it seems you posted to argue with the contention that "fields are
expensive, records are cheap".

My response: Interesting comment. I have posted many database design
questions and I’ve either received no response from “many of whom are
well-trained and very experienced” or responses so vague that I’m left
scratching my head wondering what in the world they were talking about.
Sometimes I was able to repost those questions in other forums and receive
usable replies, and other times I just had to come up with another approach.

I will say, though, that I could not have climbed even partially up the
Access learning cliff without the help of many people on this forum.

It has been my experience that posting something in the Database group is a
measure of last resort. This question is a prime example. People want to
talk about something that is not part of the question. All this time on
everyone’s part, and NOT ONE of the people who “are well-trained and very
experienced” answered the question! What a waste!





Dennis
 
David,

I want everyone to know that I stand corrected. In my response to Bruce I
stated that no one had answered the original question in this series of
correspondences. At that point I had not read your response.

Thank you for answering the question. I also reposted the question, and
Allen Browne, you, and a few others responded to the re-post and answered
the disk access speed issue.


Your comment: A Jet/ACE data store doesn't store data in records, but in
data pages. It's not like legacy systems at all. And the fields are variable
length, so records are variable length. The efficiency of interaction with
the storage and RAM are pretty moot since the days of disk caching, when even
a write is done to RAM (and then persisted to disk when convenient). This
means there's hardly any difference between RAM manipulation and
reading/writing from/to the data store, since both are happening in RAM.

Response: I can see what you are saying, but that assumes that you are
accessing data records sequentially. What happens to all that disk caching
when I am randomly reading records, which is what I would do during data
entry (which is where I am most concerned about disk access and network
speed)? The user might need the 1st record, then the 1,000th record, then
the 5,257th record, then the 10th record, etc. Unless the entire table is
cached, what good does loading a data page do if the record that you need
next is not on that page? And if the data is cached on a server, that does
no good at all because you still have the network bottleneck.

I know about write caching and it does not affect the speed of retrieving
data (unless you do not have enough memory and your machine is thrashing),
which is why it was not part of the question.

Your comment: And when a server database is involved, the bottleneck is not
RAM vs. disk, but the network connection, which is vastly slower than not
just the RAM, but even slower than disk access.

My response: I know the network speed is MUCH, MUCH slower, which is why I
made it part of the question. Remember, my original interpretation of John’s
“Fields are expensive, records are cheap” had to do with disk access over a
network versus internal CPU processing. I thought he knew something about
the internal processes of Access that made Access very slow when processing
large records. I grossly misunderstood his comment. I thought he was saying
it is better to access the disk multiple times with small records than to
have one large Access record. He has since explained his comment and now I
know it had nothing to do with disk access speed.

Dennis
 
David,

Your comment: “When table modifications call for lots of new fields it often
means that a one-to-many relationship that ought to be set up in
multiple tables is being shoe-horned into a single table.”

It also means that you have to alter your front-end forms and
reports by adding all the fields, whereas if you're adding records
instead, whatever continuous or datasheet form you're using to
display those records will just have more records in it, so there is
no need to alter the user interface objects.

Response: I don’t understand your comment. It is absolutely true that if I
add more records, I don’t have to alter anything. But when the user asks
where they enter these new fields and where the new reports are, what am I
supposed to say? “Well, I did not add those fields, but you can enter more
data”? That makes no sense at all.

I’ve yet to figure out how to capture additional data and produce reports
with this additional data without modifying the UI or creating new reports.
Sorry, I have no idea what you mean by the above statement.

Your comment: To *me*, that's what the phrase means, that adding fields is
more complicated in terms of UI, which, to be frank, is where 90% of our time
is spent in the development process (rather than in schema design/changes).

My response: You are preaching to the choir! I fully agree.

However, this project started out as a simple membership mailing list. And
that is all it was. Once the users saw it working, their imaginations ran
wild and they wanted to start capturing additional information so they
could produce additional reports.

When this project started, I knew about the other reports and strongly
suggested that they just capture the additional fields and produce all of
the reports. However, the users kept saying no, no, no, we don’t need all
that stuff. The process is working just fine. And the boss man said – just
do the mailing labels and that’s all. So that is what I did. I designed it
for the full requirements, but only implemented the mailing label piece.

But what are you supposed to do when the user, who is paying you, says no,
we don’t need that, when you know they are going to need it? You can only
push the customer so far.

But again, none of this had anything to do with the original disk access
speed vs. Access internal process speed question.


Dennis
 
Duane Hookom

Your comment: I made the comment: I took over a survey application at a
hospital from a programmer with many more years of programming than I had at
the time.

Your response: I don’t understand what this comment has to do with the
subject.

This was all based on your statement regarding your "over 30 years
experience with relational databases".

Years of experience are not always the same as understanding and knowledge.
In your case, your knowledge may be much greater than 30 years. I'm not sure
how we would know that.

I was simply providing an example suggesting that I have followed developers
who had many years of experience without a basic understanding of properly
structured tables. In no way was I suggesting you don't understand
normalization.


My response: Ah, you are preaching to the choir. My apologies. I fully
understand what you are saying. It is very true that experience does not
always count for much. I too have come in behind very experienced people
and had to clean up the messes and convoluted logic. And as I said, I know
that I’ve had off weeks and left messes behind for someone else to clean
up, much to my embarrassment.

I’ve also had programmers with only a year or so of experience, but with a
fresh outlook, come up with some very creative and elegant solutions (which
I doubt I would have arrived at). I’ve done the same thing when I’ve gone
to a new project with much more senior people. Sometimes they cannot see
the forest for the trees.


Dennis
 
Your comment: A Jet/ACE data store doesn't store data in records,
but in data pages. It's not like legacy systems at all. And the
fields are variable length, so records are variable length. The
efficiency of interaction with the storage and RAM are pretty moot
since the days of disk caching, when even a write is done to RAM
(and then persisted to disk when convenient). This means there's
hardly any difference between RAM manipulation and reading/writing
from/to the data store, since both are happening in RAM.

Response: I can see what you are saying, but that assumes that
you are accessing data records sequentially. What happens to all
that disk caching when I am randomly reading records, which is
what I would do during data entry (which is where I am most
concerned about disk access and network speed)? The user might
need the 1st record, then the 1,000th record, then the 5,257th
record, then the 10th record, etc. Unless the entire table is
cached, what good does loading a data page do if the record that
you need next is not on that page? And if the data is cached on a
server, that does no good at all because you still have the
network bottleneck.

You seem not to have done much investigation of how Jet works.
Indexes are the key. When you request data from a table, Jet first
requests the relevant indexes (based on your criteria) and then uses
the indexes to figure out which data pages to retrieve. The
requested data is going to be brought into RAM for processing, and
since it's been requested from disk, it will be cached in Jet's
internal cache, as well as in the OS's disk cache. When you write to
the data, Jet will take care of writing to disk, which will then be
handled by the OS, and if the written data pages are in the disk
cache (and you are caching writes), the writes will be to the disk
cache and then will be persisted to disk according to your disk
cache settings (and you can also tell Jet explicitly to flush its
cache, which in turn tells the OS to flush its write cache, i.e.,
persist to disk).

This is all managed quite efficiently, so efficiently, in fact,
that there is very little reason to worry about the things you are
worrying about until you reach extremely large datasets. That's
usually not what is being used by an Access application directly
(i.e., if you have a large dataset, you're likely going to use a
different back-end than Jet/ACE).

I know about write caching and it does not affect the speed of
retrieving data (unless you do not have enough memory and your
machine is thrashing), which is why it was not part of the
question.

Writing is much slower than reading because it involves so many
different operations (writing indexes, determining if new pages need
to be allocated), as well as maintaining a set of appropriate locks
(write locks are a much bigger deal than read locks).

But disk caching is a big deal for reading, too. As I explained
above about the way Jet works, the indexes are the key to retrieving
data efficiently. Data pages in Jet 4 and ACE are 4KB in size, so
you're retrieving data from the disk in blocks whose size is very
often well-aligned with the file system's block size (the NTFS file
system uses default block sizes in increments of KB, with 4KB
being the smallest usual block size, and the next being 8, 16, etc.,
so reading 4KB data pages is likely to be very efficient, since they
will be aligned with the file system's block size). Jet retrieves
what it needs as needed, but only as much as needed.
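
To put rough numbers on that (purely illustrative figures, not taken from
any particular table): if a record averages around 200 bytes, one 4KB data
page holds roughly 20 records, so even a "random" read of a single record
pulls a page's worth of its neighbors into the Jet and OS caches for free.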

A simplified explanation of how Jet works: this SQL statement:

SELECT tblPerson.LastName, tblPerson.FirstName, tblPerson.PersonID
FROM tblPerson
WHERE tblPerson.LastName = "Fenton"

...will first retrieve the index on LastName, which will be keyed to
the Primary Key, so Jet will then retrieve the primary key index. In
both cases, only as much of the index as necessary is retrieved
(it's not a sequential scan, but a b-tree traversal), and in the
case of the primary key index, since Jet tables are clustered on the
primary key, I'm pretty sure the primary key index is actually the
real data pages (I could be wrong on that).

I don't know for certain if the non-PK indexes store a link to the
PK or if they directly reference the data page of the PK, but the
point is that it's not a sequential scan of anything, but a b-tree
index traversal, which is substantially more efficient.
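
To make that concrete, a minimal sketch (using the hypothetical tblPerson
table from the example above; the exact DDL would depend on your schema) of
the index that lets Jet do that b-tree traversal:

CREATE INDEX idxLastName
ON tblPerson (LastName)

With that index in place, the WHERE clause in the SELECT above is resolved
by walking the index pages for "Fenton" and then fetching only the data
pages those entries point to; without it, Jet would have to read every data
page in tblPerson.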
 
Your comment: “When table modifications call for lots of new
fields it often means that a one-to-many relationship that ought
to be set up in multiple tables is being shoe-horned into a single
table.”

It also means that you have to alter your front-end forms and
reports by adding all the fields, whereas if you're adding records
instead, whatever continuous or datasheet form you're using to
display those records will just have more records in it, so there
is no need to alter the user interface objects.

Response: I don’t understand your comment. It is absolutely true
that if I add more records, I don’t have to alter anything. But
when the user asks where they enter these new fields and where the
new reports are, what am I supposed to say? “Well, I did not add
those fields, but you can enter more data”? That makes no sense at
all.

I suspect you haven't looked at Duane's survey database. In the
structure there (which is a fairly extreme version of what I'm
talking about), when you add a new question to a survey, you add a
new record to the questions table, and link it to a particular
survey. This will then be displayed by the same user interface that
was used to display the other questions.
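
A rough sketch of the kind of structure I mean (the table and field names
here are invented for illustration and are not necessarily Duane's actual
schema; tblResponse holds one row per answer given):

CREATE TABLE tblSurvey
   (SurveyID AUTOINCREMENT CONSTRAINT pkSurvey PRIMARY KEY,
    SurveyName TEXT(100))

CREATE TABLE tblQuestion
   (QuestionID AUTOINCREMENT CONSTRAINT pkQuestion PRIMARY KEY,
    SurveyID LONG CONSTRAINT fkSurvey REFERENCES tblSurvey (SurveyID),
    QuestionText TEXT(255))

CREATE TABLE tblResponse
   (ResponseID AUTOINCREMENT CONSTRAINT pkResponse PRIMARY KEY,
    QuestionID LONG CONSTRAINT fkQuestion REFERENCES tblQuestion (QuestionID),
    RespondentID LONG,
    Answer TEXT(255))

Adding a question to a survey is then just one INSERT INTO tblQuestion, and
the same continuous form and report that display the existing questions
pick it up automatically; nothing about the tables, forms, or reports has
to be redesigned.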

In the structure you're suggesting, you'd have to add a field for
the question to your survey table and then alter your form for data
entry, and your report for printing out, and all of the tools you
use to analyze and summarize the results.

Now, a lot of statistical tools expect a flat-file structure, with
one record for each completed survey, so it's usually the case that
one takes the many-records structure and denormalizes it into a flat
structure that is exported for use by the statistics program. The
key there is that you don't choose your data structure based on how
you're going to output it. Instead, you choose the structure that is
going to be most extensible with the least amount of work, and then
you take that data and convert it to the structures required for
particular purposes, which is very often a denormalized structure.
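
In Access that flattening is typically done with a crosstab query;
something along these lines (using the hypothetical tblResponse sketched
above) gives one row per respondent and one column per question, ready to
be exported for the statistics program:

TRANSFORM First(tblResponse.Answer)
SELECT tblResponse.RespondentID
FROM tblResponse
GROUP BY tblResponse.RespondentID
PIVOT tblResponse.QuestionID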

I’ve yet to figure out how to capture additional data and produce
reports with this additional data without modifying the UI or
creating new reports. Sorry, I have no idea what you mean by
the above statement.

I strongly recommend you examine Duane's survey sample database.

Your comment: To *me*, that's what the phrase means, that adding
fields is more complicated in terms of UI, which, to be frank, is
where 90% of our time is spent in the development process (rather
than in schema design/changes).

My response: You are preaching to the choir! I fully agree.

However, this project started out as a simple membership mailing
list. And that is all it was. Once the users saw it working,
their imaginations ran wild and they wanted to start capturing
additional information so they could produce additional reports.

I have plenty of databases that are badly structured because they
started out small, when it seemed it wasn't necessary to choose an
industrial-strength schema, and then when they got extended, I
greatly regretted not doing it the right way on the front end.

So, the existing app is no justification for continuing to use a
less than optimal schema.

When this project started, I knew about the other reports and
strongly suggested that they just capture the additional fields
and produce all of the reports. However, the users kept saying no,
no, no, we don’t need all that stuff. The process is working just
fine. And the boss man said – just do the mailing labels and
that’s all. So that is what I did. I designed it for the full
requirements, but only implemented the mailing label piece.

But what are you supposed to do when the user, who is paying you,
says no, we don’t need that, when you know they are going to need
it? You can only push the customer so far.

The approach I'm advocating doesn't take more time to implement to
cover the initial requirements and is inherently extensible, without
any additional programming for the UI.

You've front-loaded the programming to handle the future extensions,
which is something I do, too, but the suggestion by those advocating
records over fields is that there is no extra work involved. You
don't have to do anything extra to prepare for the N new fields,
because the structure you've built handles any number of records
from the very beginning.

Have you ever implemented a many-to-many join when the current
requirement was only a many-to-one? Say the client has customers and
there's a customer type. You can store it in a field in the customer
record, but what if the client later decides that a customer can be
more than one customer type simultaneously? At that point you need a
many-to-many join between customers and customer types. If you
structured it that way from the beginning, it would handle 1 type
per customer, and then would handle as many customer types as you
needed when the scope expands.

This is the simplest example of the cheap records vs. expensive
fields equation. And the point is that as soon as the client needs
more than 1 type per customer, you have to decide how many you're
going to accommodate. You could, I guess, add a second customer type
field, but then when they need three, you'd have to add a third new
field. With the join table, you accommodate anything from 0 to
however many you could ever conceive of. It's not more complicated
to implement, and when the customer needs more than 1, there is
nothing that needs to be done (other than, perhaps, altering your UI
to remove any limitations you may have previously put on entry).
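
A rough sketch of that join-table version (assuming an existing tblCustomer
table; all names here are invented for illustration):

CREATE TABLE tblCustomerType
   (CustomerTypeID AUTOINCREMENT CONSTRAINT pkCustomerType PRIMARY KEY,
    TypeName TEXT(50))

CREATE TABLE tblCustomerTypeAssignment
   (CustomerID LONG CONSTRAINT fkCustomer
       REFERENCES tblCustomer (CustomerID),
    CustomerTypeID LONG CONSTRAINT fkCustomerType
       REFERENCES tblCustomerType (CustomerTypeID),
    CONSTRAINT pkAssignment PRIMARY KEY (CustomerID, CustomerTypeID))

One type per customer is one row in tblCustomerTypeAssignment; three types
is three rows. The tables never change, only the data does.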

But again, none of this had anything to do with the original disk
access speed vs. Access internal process speed question.

That was your red herring. You asked for an explanation of the
expression. You misinterpreted it and went down the access speed
garden path. I'm explaining what the phrase refers to, and it is, in
fact, more efficient to add records than fields (consider the number
of indexes that have to be maintained/searched).
 
Response: You are absolutely correct. The entity, at least at the
12 independent software houses that I know of, is the policy. The
policy, regardless of which line of business, has about 100 to 200
common fields.

I would love to see one of those tables. My bet is that I could make
a much more extensible structure with none of the tables having more
than 30 fields, and maybe even only one of them having that many
(most of the data would end up in tables that had only two or three
or four fields).
 
As I repeatedly stated, and as was repeatedly missed, the question
concerned disk access speed over the network and the speed with
which Access extracts extra values from a big record.

Maybe when people aren't answering the question you wish was being
answered, it's a clue that the problem is not with all the people
who are offering what you consider unacceptable answers.
 
It has been my experience that posting something in the Database
group is a measure of last resort. This question is a prime
example. People want to talk about something that is not part of
the question.

I would say that this is your fault for the way in which you cast
your question. You were responding to a rule of thumb that many
database developers use and you went off on a red herring about
something that hasn't really been relevant in schema design for
about 20 years. That those who responded to your question didn't
follow you down that garden path is not an indication of a problem
with the newsgroups -- it's an indicator that you weren't asking the
right question, or you were doing so in a manner that was
unprofitable.
 
Dennis

Your comment: Please re-read my response. It starts with the word "If". I
was describing the workload/maintenance of an overly-simplified design, and
pointing out that adding fields is expensive, in terms of the maintenance it
requires to all affected objects.

My response: Oops, you are absolutely correct. My bad.


Your comments: That would depend on what data elements the user & I agreed
were needed.

My response: Hmm, you obviously have better educated users than I do.
Normally, the users do not request a data field. I guess they sort of do.
They specify the output they want. From the output, I determine which
fields they need to input.

I’ve yet to be able to figure out how to include a field on some output
(display, report, data feed, etc.) that was never entered or calculated
within the system. So I don’t believe it is for me to decide if a field is
required or not. The system output requirements determine the need for a
field. If the user decides that they want to pay for that output item
(display, report, data feed), then the field is required. If not, then the
field is not required.

I once had a customer require that I add about 5 new data fields to the
input screen just so that they showed *only* on a single display screen /
form. I thought this was a waste, so I asked them why they wanted this
enhancement. They explained that those 5 fields, combined with the
information already on the form, enabled the customer service rep to
completely answer about 75% of calls on the first call received (versus
50% pre-enhancement). In addition, they did not have to pull the policy
file to obtain these 5 little pieces of information, nor did they have to
re-file those policy folders. This little change increased the customer
service group’s performance and drastically reduced the load on the filing
staff. I asked why this had not been done sooner. The response was that
the previous developer did not consider this an important upgrade and
never did it.

I believe that I should advise the client on the cost of a field, but only
they can decide if it is worth the cost. Therefore, we will have to agree
to disagree on whether or not the developer should be involved in the
decision as to which fields should be included.



Your comment: If you are looking for other folks' ideas, to compare them
with yours and decide what "balanced" approach would work best for you, let
us know. You asked for an assessment of John V.'s statement.

My response: No, that is not quite right either. Maybe in a roundabout way
it is. I misunderstood his remark to mean that it was better for Access
to have two smaller records in a 1:1 relationship than it was to have one
large record. My reaction was that obviously John knows something about
Access that I don’t, and I need to resolve this issue so I can create
efficient designs. Hence the question about disk access speed versus Access
extracting variables from large records.

It was all of the other respondents who brought up the data structure
issues. At *no time* did I bring up the data structure issue. (I only
responded to those who brought up the issue.) As I’ve repeatedly stated
throughout my responses, the data structure was *never* the question. I’ve
raised it on other issues, but not this one. I really don’t know how much
clearer I could have been on this issue.



Your comment: Maintenance is maintenance, whether on one object or several.
My comments were intended to offer the option of a design that would require
NO additional maintenance, since the table would grow longer, not wider.


Response: Ok, now you have my interest! I agree that maintenance is
maintenance.

I am being very serious when I’m saying that I would love to know how to
capture additional data and utilize that data on new pre-defined / formatted
forms without requiring any additional maintenance.

What is the secret to this? Please, I am being serious. How can I add new
fields and create new reports without any additional maintenance?




Your comment: Response: Again, I wish you had read my introduction where I
state I have 30 years of relational database design experience. As I stated
in my response to Rick, this was something I learned NOT to do over 30 years
ago.

That you've learned not to do this was not evident in your original post.

My Response: I guess there is a two-part answer to this question.

First, as I have repeatedly stated, the data structure was NEVER part of
the question. Given that it was never part of the question, why should I
include anything to do with a question that was never asked? That’s like
asking why I did not include the weather conditions at the time. It is
because they had nothing to do with my original question.

Per my original post: “I've always been taught the exact opposite - that
"Fields are cheap, records are expensive" *since going to disk is so slow
versus accessing data in memory*.”

People started focusing on a non-issue as far as I was concerned and ignored
the original question. I thought I stated my question pretty clearly about
disk access speed versus accessing data in memory. Also, you have to
remember that I interpreted John’s original comment to mean something that it
did not. But I did not know that at the time, hence my question.

The question was ONLY about disk access speed versus Access’ speed of
extracting variables from a long record. NEVER in my wildest imagination did
I EVER think it would become about data structures because that was NEVER the
question.

Secondly, as I stated in another response, if a doctor were to tell you
that he had been a doctor for 30 years, would you ask him if he could read
a thermometer? Of course not. When I stated I had 30 years, I figured that
would tell people that I had some experience under my belt and was asking
a more detailed and in-depth question than a less experienced person would
know to ask.



Your comment: If you want detailed suggestions/ideas/approaches, provide
detailed descriptions.

Response: In this question, I did *not* want suggestions, ideas, or
approaches. I wanted the answer to the question: is there something about
Access that causes it to be very slow with large records?

As I stated before, the really sad part of all of this is that NO ONE in
this series of questions has answered the original question of disk access
speed versus Access extracting data from a large record.

I don’t count David’s response concerning disk access speeds. That is
because his comment on this discussion was made AFTER he responded to Allen
Browne’s answer to my re-statement of my question in this forum.


Your Comment: Responding as you have could be interpreted as 'baiting'
folks, offering an incomplete description and then criticizing folks for not
inferring or knowing your detailed situation.


My Response: I take EXTREME INSULT (caps meant) at your statement that
“Responding as you have could be interpreted as 'baiting' folks”.

I am a lone developer and I believe that this forum is a GREAT PRIVILEGE!
I have FREE access to some of the greatest Access minds in the world. What
a GREAT PRIVILEGE! The access to such knowledge is something to be highly
valued and respected!

There have been so many people in this forum who have helped me slowly
climb the Access learning cliff that I don’t know how or even where to
begin to thank them. Even though I’m still somewhere near the bottom of
the Access learning cliff, I could not have gotten as high as I have
without the assistance of the knowledgeable people on the Access forums. I
try very hard to always thank the people who help me. As a way of showing
my appreciation to them, I try to answer those questions on the forum that
I am qualified to answer. I figure that if I answer the ones I know, that
gives the more knowledgeable people more time to answer the harder
questions and help just a few more people.

I have to say that I’ve learned more from the members on this forum than I
have from all of the Access books that I have read.

I go to great lengths to avoid wasting anyone’s time. Before I post a
question on this forum, I try reading my books, searching for the answer via
Google, or searching the different forums. ONLY after I have read, and
searched, and could not find the answer or did not understand the answers I
found do I post my question.

Yes, I take GREAT OFFENCE (caps meant) at your accusing me of wasting the
time of the people who are so kind as to respond to questions.

Even though all of the respondents totally misunderstood the question and
went off on irrelevant tangents, I actually respect and appreciate their
intentions and their passion.

I am very passionate about my work and I am always glad to associate with
people of equal passion. Even if I disagree with them or they don’t
understand me.



Regarding your “incomplete description” comment: I guess you are right on
that one. It was an incomplete description because it had *nothing* to do
with my disk access speed versus memory access speed question.

That is like criticizing me for not telling you that it was blue skies and
75 degrees Fahrenheit outside when I wrote the original question. It had
*nothing* to do with my question.

If I had had *any* idea that people would take an irrelevant comment and
blow it up into a huge mountain, I would *never* have included it in my
original question.


If I have any criticism, it is of the folks (including you) who, even
after I repeatedly stated that the fields were *not* an issue, would not
drop that issue. I understand that issue and I fully agree with it. Even
your last comment, “Definite expensive”, addresses an issue I never asked
about.

And as I said before, the truly sad thing is, no one in this discussion
(excluding David, because he read and responded to the other discussion
before he repeated his answer in this forum) has addressed my original
question. Even *after* I repeatedly stated that my question had to do with
disk access speed, NO ONE responded to my question.

I saw one person state that there were very experienced people willing to
help. If that is so, why is it that not one of those experienced people
ever answered the question? Even after I repeated the question?

Fortunately, Allen Browne, John Vinson, and David Fenton were kind
enough to answer this same question quite completely in another discussion on
this forum.

I really cannot believe that over 32 responses were generated regarding
something that was repeatedly stated to be *not* an issue, while the
original question was *never* answered.


Dennis
 
I misunderstood his remark to mean that it was better for Access
to have two smaller records in a 1:1 relationship than it was to
have one large record.

That was a pretty perverse reading of it. As I think you now
understand, the statement was about normalization, that a large
record very often has repeating data where each instance of the
repeating field should be a record in a separate table rather than a
field in the main table.

And the reason for this is not so much data retrieval performance
(though it is often more efficient to retrieve data from a single
field in another table than it is to query multiple fields in a
single table, not to mention multiple index maintenance vs. a single
index to maintain) as it is the cost in terms of front-end design.

I don't think it was reasonable at any point to think the 1:1 design
was being promoted by anyone.
 
I really cannot believe that over 32 responses were generated
regarding something that was repeatedly stated to be *not* an issue,
while the original question was *never* answered.

You're very close to landing in my killfile, and I'm one of the 3
you single out as praiseworthy.

Make of that what you will, but you might consider it evidence that
just because I'm still engaging in the discussion does not mean I
think you've behaved admirably throughout. You've been at times
combative, dismissive, obtuse and stubborn. You are lucky that
anybody is still reading your posts.
 