Design question (normalization versus practical)

  • Thread starter Thread starter Lars Brownies
  • Start date Start date
L

Lars Brownies

In my app I have a main table with about 22 fields of which 8 have repeating
values which, according to normalization, require their own tables which
should be linked by a meaningless ID number. Also this main table is linked
to 2 1XM tables.

On the other hand I'm quite happy with the table as it it. My forms load
very quick, the datamodel is simple, I don't have to drag multiple tables
every time I run a query. When I choose proper normalization I'm afraid to
wind up with a very large datamodel which will load much slower. Also, I
think it will make developing more cumbersome. The benefit I think will be a
smaller *.mdb. Also when you want to change a value, you only have to do it
once. But will it be worth the effort?

Any thoughts?

Thanks, Lars
 
Lars Brownies said:
In my app I have a main table with about 22 fields of which 8 have
repeating values which, according to normalization, require their own
tables which should be linked by a meaningless ID number. Also this main
table is linked to 2 1XM tables.

What happens when (not if! <g>) all of a sudden you need to expand the
repeating group to ten fields? You'll have to redesign everything, whereas
had you done it correctly in the first place (using a subform), it would be
trivial.
On the other hand I'm quite happy with the table as it it. My forms load
very quick, the datamodel is simple, I don't have to drag multiple tables
every time I run a query.

Not sure I understand this comment. Are you running ad-hoc queries that
often, as opposed to having stored queries? Seems odd for a production
database.
When I choose proper normalization I'm afraid to wind up with a very large
datamodel which will load much slower.

The fact that you'll have more tables shouldn't mean it will load any
slower. You should always be working with queries that return only the
necessary data.
Also, I think it will make developing more cumbersome.

It shouldn't (although you need to learn the proper mindset for dealing with
the model)
The benefit I think will be a smaller *.mdb. Also when you want to change
a value, you only have to do it once. But will it be worth the effort?

In my opinion, it's always worth the effort of designing the database
properly!
 
Thanks Doug. Let me explain better. See inline.

Douglas J. Steele said:
What happens when (not if! <g>) all of a sudden you need to expand the
repeating group to ten fields? You'll have to redesign everything, whereas
had you done it correctly in the first place (using a subform), it would
be trivial.

My 2 1XM tables are in subforms. But for instance in my main Person table I
have field Nationality. Many persons have the American nationality so this
value is repeated a lot. But I don't think I need a subform for that field.
I guess I'd need a nationality table and a combobox in the mainform with a
link on ID_nationality. Do you agree? Or is there another way to show this
value in the main form.
Not sure I understand this comment. Are you running ad-hoc queries that
often, as opposed to having stored queries? Seems odd for a production
database.

I have stored queries, lots of them. Seems like a lot of work to drag the
tables every time. Also I'm afraid that forms can't be updated because of
the complexity of the source query.
The fact that you'll have more tables shouldn't mean it will load any
slower. You should always be working with queries that return only the
necessary data.


It shouldn't (although you need to learn the proper mindset for dealing
with the model)

Guess I have to experiment with that.
In my opinion, it's always worth the effort of designing the database
properly!

Define properly :-)
 
In my app I have a main table with about 22 fields of which 8 have repeating
values which, according to normalization, require their own tables which
should be linked by a meaningless ID number. Also this main table is linked
to 2 1XM tables.

On the other hand I'm quite happy with the table as it it. My forms load
very quick, the datamodel is simple, I don't have to drag multiple tables
every time I run a query. When I choose proper normalization I'm afraid to
wind up with a very large datamodel which will load much slower. Also, I
think it will make developing more cumbersome. The benefit I think will be a
smaller *.mdb. Also when you want to change a value, you only have to do it
once. But will it be worth the effort?

Any thoughts?

Thanks, Lars

Douglas and Ken make some good points; let me add a couple of my own thoughts.

It is NOT obilgatory to have a numeric ID and a Text field in a "lookup
table". If (for your business rules and design decisions) a single nationality
is satisfactory, then you would benefit from having a little table of
Nationalities, a single text field table with one record for each, simply to
make data entry easier and save mistypings. The numeric ID does indeed make
changing spelling or changing the name simpler, and may be worth it just for
that.

Secondly, it is probably NOT necessary to drag in every related table every
time you create a query; in fact it certainly *isn't*. If you aren't reporting
the nationality (or this lookup field, or that lookup field) in a particular
Report, or including it in a Query, then you just... don't include that join!

You're concerned about "loading the whole datamodel"? Well... don't. There are
VERY few cases, in any reasonably big database, where it's necessary to "load
the whole datamodel" onto the screen or page at once, so I'm not sure where
this concern comes from!
 
Thanks Ken.
A 'surrogate' numeric key is only required if
the values in a column in a table are not distinct, e.g. a Cities table
needs
a CityID primary key as city names can be legitimately duplicated

I thought by using a surrogate numeric key in stead of the text itself,
you'd save database space.

Also, I though the second rule of normalization was to eliminate redundant
data, but I just reread and that's only applicable when the ID-number is
part of a multivalued key.

Nationality was a bad example, but you got my point :-)

Lars
 
Thanks John. see inline.
Douglas and Ken make some good points; let me add a couple of my own
thoughts.

It is NOT obilgatory to have a numeric ID and a Text field in a "lookup
table". If (for your business rules and design decisions) a single
nationality
is satisfactory, then you would benefit from having a little table of
Nationalities, a single text field table with one record for each, simply
to
make data entry easier and save mistypings. The numeric ID does indeed
make
changing spelling or changing the name simpler, and may be worth it just
for
that.

I see, this answers my question. I thought it was obligatory to use a
surrogate key and that it saves database space.
Secondly, it is probably NOT necessary to drag in every related table
every
time you create a query; in fact it certainly *isn't*. If you aren't
reporting
the nationality (or this lookup field, or that lookup field) in a
particular
Report, or including it in a Query, then you just... don't include that
join!

I agree. But in the example I gave a person could have surrogate keys for
instance for fields like State, City, Status, Entered_by, Category_1,
Category_2, etc. (assuming a person can only have one value for every
aforementioned field). So if you need to make a Person Report you'd need all
those tables in your query.
You're concerned about "loading the whole datamodel"? Well... don't. There
are
VERY few cases, in any reasonably big database, where it's necessary to
"load
the whole datamodel" onto the screen or page at once, so I'm not sure
where
this concern comes from!

I'm not sure where I got the idea from, but I thought that loading, let's
say 15 tables into a form, would take much longer than 2 or 3.

Lars
 
More comments inline.
Thanks John. see inline.


I see, this answers my question. I thought it was obligatory to use a
surrogate key and that it saves database space.

Well... sometimes it does and sometimes it doesn't. For a State code there are
only 71 two letter codes (counting US States, the District of Columbia,
territories and Canadian provinces). Two bytes is all you need. It can be a
tossup for other fields.
I agree. But in the example I gave a person could have surrogate keys for
instance for fields like State, City,

There's good reasons for State but again, it's a balance of issues question
about City. I have a animal shelter database based in Los Angeles - I do
indeed have a table of cities (actually the Postal Service's preferred city
name in a Zip Code table); but if you're dealing with a broader area, it may
make more sense to either enter cities by hand, or go whole hog and buy the
Postal Service's $500 zipcode database.
Status, Entered_by, Category_1,
Category_2, etc.

BEEEEPPPP!

The above examples are probably *not* repeating fields.
Category_1, Category_2 almost certainly ARE repeating fields.

If one person can be in many categories, and each category can consist of many
people, you *have a many to many relationship*, and you really should consider
modeling it as such: a table of Categories (controlled vocabulary, prevent
users from making up new categories out of their imagination), and a
CategoryMembership table with one *row* per category per person. With fields
Category_1 through Category_5 you have MUCH more complex searches (OR logic
across five fields), and a real nightmare when you have to add Category_6.

Or, if you put in up through Category_10 just in case, you're making your
table unnecessarily complex.

This is one where I'm going to strongly suggest that you're mistaken!

(assuming a person can only have one value for every
aforementioned field). So if you need to make a Person Report you'd need all
those tables in your query.


I'm not sure where I got the idea from, but I thought that loading, let's
say 15 tables into a form, would take much longer than 2 or 3.

It will. But you ordinarily won't need to; and if five of those tables are
just in the rowsource of a combo box or listbox, the burden is trivial. If you
have ten subforms, then you may need to do some tweaking (such as loading the
subform's Recordsource only when the tab page containing that subform is
activated).
 
Status, Entered_by, Category_1,
BEEEEPPPP!

The above examples are probably *not* repeating fields.
Category_1, Category_2 almost certainly ARE repeating fields.

In this particular case the status is the phase persons are currently in
(regarding the proces for applying a job). It's a field for selections in
reports. There's currently no need to keep track of the status changes, just
one value for each person. However, I do feel it's better to make this a 1xM
relation and do keep track of the status changes but it will cost me a lot
more developing time. For instance people can't use the build-in filter
option
on the subform so I have to build extra forms/code to make this work. And I
have limited amount of time for this non paid job.

Entered_by is the user who initially enters the record. There can be only
one
user per record who does this. I also have a 1xm table in which users can
type in their remarks.

Category_1 (in fact is Experience level) can have values from 1 to 3. A
person can only have one value and there's no need to keep track of changes
in this field.

Thanks.

Lars

If one person can be in many categories, and each category can consist of
many
people, you *have a many to many relationship*, and you really should
consider
modeling it as such: a table of Categories (controlled vocabulary, prevent
users from making up new categories out of their imagination), and a
CategoryMembership table with one *row* per category per person. With
fields
Category_1 through Category_5 you have MUCH more complex searches (OR
logic
across five fields), and a real nightmare when you have to add Category_6.

Or, if you put in up through Category_10 just in case, you're making your
table unnecessarily complex.

This is one where I'm going to strongly suggest that you're mistaken!



It will. But you ordinarily won't need to; and if five of those tables are
just in the rowsource of a combo box or listbox, the burden is trivial. If
you
have ten subforms, then you may need to do some tweaking (such as loading
the
subform's Recordsource only when the tab page containing that subform is
activated).
 
Thanks for this explanation with examples which helps me understand it
better.

Lars

KenSheridan via AccessMonster.com said:
Lars:

Its true that a purpose of normalization is to eliminate redundancy, but
that
in itself is not the real objective. Redundancy allows for inconsistent
data
or update anomalies, so by its elimination the integrity of the data is
protected, which is the important thing.

An example of lack of proper normalization can be found in the sample
Northwind database's Customers table. You'll see that this has City,
Region
and Country columns so we are told numerous times that São Paulo is in SP
region (as is Resende) and that SP region is in Brazil. Not only does
this
require repetitive data entry, but more importantly it opens up the risk
of
inconsistent data, e.g. it would be perfectly possible to put São Paulo in
California in one row and California in Ireland! Proper normalization
would
prevent this as the fact that São Paulo is in SP region would be stored
only
once in the database as would the fact that SP region is in Brazil and
that
California is in the USA.

An example of what at first sight might seem to be redundancy, but in fact
is
not, can also be found in Northwind. The Products table and the
OrderDetails
table both have UnitPrice columns. It might be thought that the unit
price
of a product could always be looked up from the Products table, so its
unnecessary in Order Details. However, the unit price of a product will
change over time, but each order needs to retain the price in force at the
time the order was created. Consequently a UnitPrice column is needed in
both tables; that in products holds the current price and is used to get
the
value for that in Order Details (code in the ProductID control's
AfterUpdate
event procedure in the Order Details Subform does this), which then
remains
static when the current price (in products) changes. In each case
UnitPrice
is functionally dependent solely on the whole of the key of the table, so
there is no redundancy.

Its often said that having separate columns to represent different values
of
an attribute, e.g. if we were to have separate columns British and Irish
to
cater for hybrids like me (an absurd example, I know, but it serves to
illustrate the point), is an example of lack of normalization, but its
really more a case of a breach of Codd's Rule 1, The Information Rule,
which
as restated by Date in a slightly expanded form is:

'The entire information content of the database is represented in one and
only one way, namely as explicit values in column positions in rows in
tables'.
C J Date - Introduction to Database Systems; 7th Edition; 2000

Having a separate column for each nationality is what is known as (I think
these are Date's word's also) 'encoding data as column headings', i.e. it's
the names of the column which are the data values, 'British' and 'Irish'
rather than the data being stored as values of type Nationality in
separate
rows as Codd's rule requires.

It could be said that having the multiple columns means that the table is
not
in First Normal Form:

'A relvar is in 1NF if and only if, in every legal value of that relvar,
every tuple contains exactly one value for each attribute.'

But this requires that we regard the attribute as being Nationality, but
no
such column exists of course, so while the row (loosely speaking the
equivalent of a tuple) does contain more than one value of attribute
Nationality, can we say this is the case when no such column exists? This
is
why I feel that its not really a normalization issue, but a breach at the
more fundamental level of the Information Rule. Its an interesting
philosophical point, but however we regard it, I think we can conclude
that
its not a correct approach in terms of the database relational model.

Ken Sheridan
Stafford, England

Lars said:
Thanks Ken.
A 'surrogate' numeric key is only required if
the values in a column in a table are not distinct, e.g. a Cities table
needs
a CityID primary key as city names can be legitimately duplicated

I thought by using a surrogate numeric key in stead of the text itself,
you'd save database space.

Also, I though the second rule of normalization was to eliminate redundant
data, but I just reread and that's only applicable when the ID-number is
part of a multivalued key.

Nationality was a bad example, but you got my point :-)

Lars
I could go on.
[quoted text clipped - 5 lines]
many relationship (one modelling the relationship by resolving it into
two
one-to-many relationships).
A 'surrogate' numeric key is only required if
the values in a column in a table are not distinct, e.g. a Cities table
[quoted text clipped - 67 lines]
Thanks, Lars
 
In this particular case the status is the phase persons are currently in
(regarding the proces for applying a job). It's a field for selections in
reports. There's currently no need to keep track of the status changes, just
one value for each person. However, I do feel it's better to make this a 1xM
relation and do keep track of the status changes but it will cost me a lot
more developing time. For instance people can't use the build-in filter
option
on the subform so I have to build extra forms/code to make this work. And I
have limited amount of time for this non paid job.
Entered_by is the user who initially enters the record. There can be only
one
user per record who does this. I also have a 1xm table in which users can
type in their remarks.

Status and Entered_By certainly sound like 1:n examples, and if you have a
lookup table to control the vocabulary using a combo box, with a relationship
with referential integrity enforced, then I agree that you don't need a many
to many resolver table.

Category_1 (in fact is Experience level) can have values from 1 to 3. A
person can only have one value and there's no need to keep track of changes
in this field.

But what about Category_2 and Category_3? THAT'S my objection. If you have
three different categories, isn't that ipso facto a many to many relationship?
Each Person can be in zero, one, two or three categories (one of them being
Experience Level, another being Skill Level, whatever); each category will
have many people. Aren't you in fact storing data ("what kind of category") in
a fieldname, rather than properly storing in a field?
 
But what about Category_2 and Category_3? THAT'S my objection. If you have
three different categories, isn't that ipso facto a many to many
relationship?
Each Person can be in zero, one, two or three categories (one of them
being
Experience Level, another being Skill Level, whatever); each category will
have many people. Aren't you in fact storing data ("what kind of
category") in
a fieldname, rather than properly storing in a field?

OK, I got it. Thanks for sticking with me.

Lars
 
KenSheridan said:
4. Querying the database is very much simpler.

I could go on…

I can believe that point for someone like yourself who can do complex
joins in his sleep. For most, properly normalizing the data will
increase the complexity of most of the queries required.

In:

http://groups.google.com/group/microsoft.public.access/msg/d20b387c35944260

I said:

"That's definitely great advice and a great article. However, the SQL
programmer also needs to be aware that doing things properly with a
many-to-many relationship can cause future SQL creation and maintenance
to become much more difficult if more than a few other tables start
getting joined in along with the many-to-many tables. It's not anything
that can't be overcome, but things can suddenly get a bit more complex
when you have, say, a few many-to-many joins in the mix.

For a small database, I think adding a single many-to-many relationship
is a very good way for an Access programmer to get used to the SQL that
she needs to create.

For a larger database with existing many-to-many relationships, I think
that the database planner should sit down with the schemata to convince
herself that she can create the queries that she knows she'll need in
the near future before adding another many-to-many relationship. That
way, the painful trade-offs that might come into play will become clearer.

That said, based on past experience, it's good to normalize as much as
one's SQL abilities and available time allow."

Normalization does a lot of wonderful things, but I don't count
simplifying queries as one of them. It's good that you continue to
stress the importance of normalization.

James A. Fortune
(e-mail address removed)
 
I wouldn't want to argue with you, Jim, but I observe that ignoring
relational database design principles and proper normalization is a
prescription for hair removal (by the handful) sometime in the future. It
can make it downright ugly trying to retrieve and work with data, and
sometimes darn near impossible.

That said, as a practical matter, third normal form is about as far as most
business databases need to take normalization.

Larry Linson
Microsoft Office Access MVP
 
Larry said:
I wouldn't want to argue with you, Jim, but I observe that ignoring
relational database design principles and proper normalization is a
prescription for hair removal (by the handful) sometime in the future. It
can make it downright ugly trying to retrieve and work with data, and
sometimes darn near impossible.

That said, as a practical matter, third normal form is about as far as most
business databases need to take normalization.

Larry Linson
Microsoft Office Access MVP

I've been there. I also take Ken's advice to heart. I rarely
flagellate myself over having too much normalization, except for when a
miracle must be performed. I note that the Entity Data Model (EDM)

http://en.wikipedia.org/wiki/Entity_Data_Model

has some concepts that at least try to address and solve some of the
problems caused by the growth of SQL complexity. I also note that,
based on personal experience, both under-normalization and
over-normalization have the potential to cause autodepilation. Third
normal form is a reasonable rule-of-thumb, but more is better, IMO, if
you can handle the SQL.

James A. Fortune
(e-mail address removed)
 
I've always thought it a great shame that Microsoft left the
Customers table in Northwind so badly lacking in normalization.

The worst possible examples for developing Access applications are
found in the sample databases that come with Access and in the ones
created from the database templates. It's always been a scandal, in
my opinion -- I've been ignoring them since about A97.
 
David W. Fenton said:
The worst possible examples for developing Access applications are
found in the sample databases that come with Access and in the ones
created from the database templates. It's always been a scandal, in
my opinion -- I've been ignoring them since about A97.

The product group is really promoting their templates for Access 2007, and
the implication is that they are "done right". I'm sorry that I am not able
to confirm or deny that implication*, but one reason they redid templates so
extensively is that they are also really promoting using macros instead of
VBA. If they were not well aware that they'd have a user revolt on their
hands, I think they might have gotten rid of VBA under the guise of
security.

* None of my clients / customers have moved to Access 2007, and
I see no compelling reason to try to sell them on the idea of doing
so, thus my use has been limited to just a little bit of learning

Larry
 
KenSheridan via AccessMonster.com said:
I've never liked this "3NF is usually good enough" view.
Its good enough if that's what the business model requires;
if it requires normalization to a higher level then that's
what's good enough. I know its often said, but I think
it tends to make people think that's all they have to aim
at regardless of the underlying reality. No free lunches
in this game!

I hear similar arguments from "relational purists" (often the same people
who think surrogate keys are sacrilegious), and I would not argue that there
are databases that would benefit from higher levels of normalization -- I'm
sure that some done by my fellow MVP Tom Ellison would fall in that
category.

But, I've been involved to some level with relational database since the
only one was the "System R" testbed created at Ed Codd's urging in IBM, with
my involvement increased in the 1980s with mainframe DB2 and SQL/DS, a
little with DOS Paradox, and much deeper once I got in the Access
business -- and, in all that time, I have never dealt with a single
real-world business database that would have benefitted from 4NF or higher.
So, I don't feel at all uncomfortable making a general statement as I did...
I'm not a "relational purist" nor is it "my yob" to lead young
wet-behind-the-ears developers down the path of righteousness to purist
nirvana.

I said "That said, as a practical matter, third normal form is about as far
as most business databases need to take normalization." And that is entirely
consistent with my more-than-a-few years of experience.

Larry Linson
Microsoft Office Access MVP
 
KenSheridan via AccessMonster.com said:
I've never liked this "3NF is usually good enough" view.
Its good enough if that's what the business model requires;
if it requires normalization to a higher level then that's
what's good enough. I know its often said, but I think
it tends to make people think that's all they have to aim
at regardless of the underlying reality. No free lunches
in this game!

Larry said:
I hear similar arguments from "relational purists" (often the same people
who think surrogate keys are sacrilegious), and I would not argue that there
are databases that would benefit from higher levels of normalization -- I'm
sure that some done by my fellow MVP Tom Ellison would fall in that
category.

FWIW, my take is that both are correct. There is a good reason for
aiming for 3NF, though it's almost never mentioned: The chance of a
relation in 3NF but not in higher normal form is far much smaller than
the chance of a relation being in a 1NF or 2NF but not 3NF. Why, one
could argue that HNF are 'special' in same sense as square is a special
case of rectangle.

I don't have a link handy but I do recall seeing an blurb on 'net (and
we all know everything on internet is true. ;) ) that a research survey
found that even though developers may claim to stop at 3NF, there has
been production databases that were found to be in HNF.

Make of it as you will.
 
KenSheridan said:
I can only assume then that you've never had occasion to work within a
procurement and supply context in a manufacturing industry. I began my
career in such around 40 years ago, and I can assure you that normalization
to 5NF is frequently essential for quality control purposes.

Ken Sheridan
Stafford, England

Ken,

I work in a manufacturing industry and would love to hear the specifics
of procurement and supply situations you encountered that require 5NF.
It's good to know in advance when 5NF is coming at you so that
management can be properly forewarned :-).

James A. Fortune
(e-mail address removed)
 
The product group is really promoting their templates for Access
2007, and the implication is that they are "done right".

I'm just barely getting started in A2007, so haven't had a chance to
look, but what I saw of the Contacts template didn't look
significantly different from the old one, which was held together
with chewing gum and bailing wire.

I mean, they are still naming fields with spaces in them. I thought
MS figured out that this was a bad idea. After all, with Vista they
showed that they'd figured out that Programs was better than Program
Files and Users better than Documents and Settings.

One thing I do notice is that they are smart enough to turn off the
default Allow ZLS setting.
I'm sorry that I am not able
to confirm or deny that implication*, but one reason they redid
templates so extensively is that they are also really promoting
using macros instead of VBA.

The Contacts template does use the embedded macros. I don't quite
understand the point. This is yet another epicycles response to a
problem they created in the first place by not properly thinking
through how to implement code security.

When I look at the convoluted embedded macros in this template, I
have to admire the ingenuity, and the additional functionality they
added to macros (TempVars and minimal Error Handling), but have to
ask what the point is -- the result is quite difficult to
understand. Even for a novice, VBA code is going to be more
comprehensible, I'd think.

And they don't avail themselves of the COMMENTS column to explain
what any of the lines do. That's not a very good way to construct a
sample app, seems to me.
If they were not well aware that they'd have a user revolt on
their
hands, I think they might have gotten rid of VBA under the guise
of security.

Even the Mac users want VBA in Excel, and they're getting back!
* None of my clients / customers have moved to Access 2007, and
I see no compelling reason to try to sell them on the idea of
doing so, thus my use has been limited to just a little bit
of learning

I'm getting my feet wet in the interests of being ready for the next
version. It seems to me that, like Windows 7 vs. Vista, the next
version of Access is going to be a must-have blockbuster. If you're
not up to speed on what was introduced in A2007, it may be kind of
hard to adapt to the next version.

I also need to keep in mind that in order to make an Access app look
up-to-date and spiffy like the modern apps users are encountering
elsewhere, you often have to jump through lots of hoops. A2007 has a
more "modern" appearance and several UI innovations that I think are
quite helpful (the tabbed interface and split forms seem to me to
both solve problems that were in the past much more difficult to
address).

Also, I think it would be pretty difficult were I to get a project
where the app was being built from scratch (or nearly so) to propose
building it in A2003, which is already 6 years old and the next
version of Access already well on the way. I mean, I started using
A2000 in a lot less than 5 years after it was released, and there
were a whole lot more compelling reasons to avoid the upgrade from
A97 than with A2007.

Or so it seems to me.
 
Back
Top