do all primary keys use autonumber

k said:
do all primary keys use autonumber

(Point of netiquette: It's considered good form to put your actual
question into the body of a message.)

No, of course they don't. But Autonumber is a convenient way to be sure
that you won't have any duplicate values, and an Autonumber field
occupies only about 4 bytes per record, much shorter than many other
fields you might choose to use.
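
For instance, if you prefer to create the table in SQL rather than in
the design grid, an incremental Autonumber key might be declared like
this (just a sketch; the table and field names are invented):

CREATE TABLE Contacts (
ContactID AUTOINCREMENT CONSTRAINT PK_Contacts PRIMARY KEY,
[Last Name] TEXT(50),
[First Name] TEXT(50)
)
;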

Your primary key could even consist of more than one field, for example
using both [First Name] and [Last Name] (assuming you don't have two
people named "Mary Jones").
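
In DDL terms, such a two-field key might be declared like this (again
a sketch, with invented names):

CREATE TABLE People (
[Last Name] TEXT(50) NOT NULL,
[First Name] TEXT(50) NOT NULL,
CONSTRAINT PK_People PRIMARY KEY ([Last Name], [First Name])
)
;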

If you don't use Autonumber, you'll need to get your key values from
somewhere, and Access is likely to complain to you if a new key value
matches some existing one, as they must be unique in the Table. In such
a case, you'll need to choose some other value, and if you were trying
to enter a name, you will have to modify it somehow, such as by changing
"Mary" to "Mary_1".

To avoid confusing my Autonumber values with something meaningful (such
as people's ages), I usually set the "New Values" property of the field
to Random, making it obvious that they're keys.

-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.
 
No, certainly not.

An Autonumber is convenient if you don't have a "natural" key, but it
can actually be a bad idea. It makes all records unique automatically,
so that if you have duplicate data, it can be harder to discover.

If you have some field - or combination of up to ten fields - which
uniquely identifies each record in your table, use it. A good Primary
Key candidate should meet three criteria: it should be UNIQUE - only
one record should have it; it should be STABLE - that is, it shouldn't
change very often or at all; and (less importantly) it should be SHORT
so that table indexes, etc. don't have to be overly large and/or slow.

For example, if you have a table of States (or US States and
Territories and Canadian provinces), the Postal Service two-letter
code makes a good primary key. Every state and province has a unique,
two-letter code; the codes change only rarely (Newfoundland, Nova
Scotia, and Nunavut in Canada being the last change that I remember,
and that's several years back); and it's only two characters.
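
As DDL, such a table might be declared like this (a sketch; the field
names and sizes are my own guesses):

CREATE TABLE States (
StateCode TEXT(2) CONSTRAINT PK_States PRIMARY KEY,
StateName TEXT(50) NOT NULL
)
;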

Or, if you have a table of Employees, your Human Resources department
probably already has an EmployeeID - perhaps the Social Security
Number (though this has some problems), perhaps something else. It
would be unique, stable, and short, if it's designed properly; you'd
use it as the Primary Key of the employee table, rather than adding an
autonumber which would really be a redundant duplication of its
function.
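
As a sketch (assuming a nine-character badge-style ID; the names and
the size are hypothetical):

CREATE TABLE Employees (
EmployeeID TEXT(9) CONSTRAINT PK_Employees PRIMARY KEY,
LastName TEXT(50) NOT NULL,
FirstName TEXT(50) NOT NULL
)
;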

John W. Vinson[MVP]
 
Vincent said:
Autonumber is a convenient way to be sure
that you won't have any duplicate values, and an Autonumber field
occupies only about 4 bytes per record, much shorter than many other
fields you might choose to use.

Bear in mind that PRIMARY KEY has a special meaning which makes an
autonumber the choice of last resort for the PRIMARY KEY designation.

Take the example of a table which has two 'unique' columns: a random
integer and an industry standard key, of which the ISBN is an excellent
example. In relational theory there is no issue: all keys are equal,
simply use the one most appropriate to your needs. However, with SQLs
you can only have one PRIMARY KEY per table. You are forced to choose
just one but what criteria should you use to make this choice? Both
could be given NOT NULL UNIQUE constraints and your 'uniquifier' needs
would be satisfied.

But everyone (correctly) says a table should have a PRIMARY KEY. So
what does PRIMARY KEY give you that NOT NULL UNIQUE does not? For
Access/Jet, there's only one answer: physical order on disk a.k.a. the
table's clustered index.
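
To make the two designations concrete, the book table might be
declared something like this (a sketch; the names and sizes are mine):

CREATE TABLE Books (
ID INTEGER NOT NULL CONSTRAINT UQ_Books_ID UNIQUE,
ISBN TEXT(13) NOT NULL,
Title TEXT(255),
CONSTRAINT PK_Books PRIMARY KEY (ISBN)
)
;

Both columns satisfy the 'uniquifier' requirement; the PRIMARY KEY
line is what additionally claims the clustered index.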

What makes the best clustered index? In a nutshell, the column(s)
primarily used in queries that use GROUP BY and BETWEEN constructs.
Does anyone write SQL that includes GROUP BY (autonumber_col) or
BETWEEN (this autonumber value) AND (that autonumber value)? I doubt
it.
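
A range scan over a natural key, on the other hand, is commonplace,
e.g. (invented values, using the Books sketch above):

SELECT ISBN, Title
FROM Books
WHERE ISBN BETWEEN '0596000000' AND '0596099999'
;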

In the earlier example, ISBN is the natural key and wins the PRIMARY
KEY designation hands down. What about a table where the autonumber is
the only 'unique' column?

Say, a table with just two columns, a random integer column named ID
and a last_name column. The last_name is unlikely to be unique. Then
again, last_name is far more likely than the random integer to be used
in GROUP BY or BETWEEN constructs. The best approach is to constrain
the random integer (ID) with NOT NULL UNIQUE and make the PRIMARY KEY
designation (last_name, ID), in that order. This will cause the
physical order on disk to be rebuilt on compact in last_name order.
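
As DDL, that design might be declared like this (a sketch; the sizes
are arbitrary):

CREATE TABLE Persons (
ID INTEGER NOT NULL CONSTRAINT UQ_Persons_ID UNIQUE,
last_name TEXT(50) NOT NULL,
CONSTRAINT PK_Persons PRIMARY KEY (last_name, ID)
)
;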


Choosing an autonumber as the sole PRIMARY KEY column is likely to have
the effect of a performance hit on your database because a random
integer makes for a lousy clustered index.
 
Bear in mind that PRIMARY KEY has a special meaning which makes an
autonumber the choice of last resort for the PRIMARY KEY designation.

You can look at it this way if you wish (and I consider Autonumber the
choice of first resort, because it satisfies the requirements for a
primary key, is small, and is not bound to anything else in the model,
making it practically immune to any need for revision). But you might
consider that Autonumber isn't required; it's an optional feature for
those who wish to use it.
Take the example of a table which has two 'unique' columns: a random
integer and an industry standard key, of which the ISBN is an excellent
example. In relational theory there is no issue: all keys are equal,
simply use the one most appropriate to your needs. However, with SQLs
you can only have one PRIMARY KEY per table. You are forced to choose
just one but what criteria should you use to make this choice? Both
could be given NOT NULL UNIQUE constraints and your 'uniquifier' needs
would be satisfied.

I suppose I'd choose the ISBN if the Autonumber isn't already present,
and the Autonumber if it is. That's because, if I later discover a
transcription error in one of the ISBN fields, it's a lot easier to
correct that if I don't also have to worry about other Tables linked
using it. The "Cascade Update" feature will take care of many such
changes, but not if a linked Table happens to reside in a separate
database file. (OK, this is a bit far-fetched, I admit, but I still
think that the Autonumber lets me avoid having to worry about such stuff.)
But everyone (correctly) says a table should have a PRIMARY KEY. So
what does PRIMARY KEY give you that NOT NULL UNIQUE does not? For
Access/Jet, there's only one answer: physical order on disk a.k.a. the
table's clustered index.

Even assuming that one can tease the "physical order on disk" out of a
key value, when is that useful? If you wish to know what record is the
most recently updated, you can use a date/time field, for example. If
you know of some other uses, I'd be interested.
What makes the best clustered index? In a nutshell, the column(s)
primarily used in queries that use GROUP BY and BETWEEN constructs.
Does anyone write SQL that includes GROUP BY (autonumber_col) or
BETWEEN (this autonumber value) AND (that autonumber value)? I doubt
it.

I certainly wouldn't use an Autonumber field that way. Normally, my
Autonumber fields are Random, so that I won't be tempted to use them for
anything besides acting as a primary key. But I'm not fanatical about
this -- if I had a Table in which I could use something like ISBN as a
primary key, and space were tight, I might very well choose to use that
instead of my usual random Autonumber. Maybe.

Also note that, in "Northwind Traders", the [Orders].[OrderID] field
that generates a new order number is an (incremental) Autonumber field,
so Microsoft is implicitly suggesting here that you might want to make
an Autonumber field do double duty: primary key, and source of a datum
used outside the database (in this case, it appears on invoices). But I
don't normally use them that way.

Incidentally, assuming ISBN is unique in a Table, for what possible
reason would one use GROUP BY on that field? I can possibly imagine a
BETWEEN involving ISBNs, if one is trying to correlate items in a
catalog organized by ISBN.
In the earlier example, ISBN is the natural key and wins the PRIMARY
KEY designation hands down. What about a table where the autonumber is
the only 'unique' column?

Say, a table with just two columns, a random integer column named ID
and a last_name column. The last_name is unlikely to be unique. Then
again, last_name is far more likely than the random integer to be used
in GROUP BY or BETWEEN constructs. The best approach is to constrain
the random integer (ID) with NOT NULL UNIQUE and make the PRIMARY KEY
designation (last_name, ID), in that order. This will cause the
physical order on disk to be rebuilt on compact in last_name order.

How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations? It
sounds as if you're involved in teasing apart database files without
going through Access to do it. I suggest not trying to do that, not
only for legal reasons (it might involve violation of trade-secret laws,
or of contract law -- the EULA to which you agreed when you installed
Access), but also because whatever method you use is likely to become
invalid under the next version of Access.

There are alternatives to hacking the file format. For example, you
will probably always be able to perform operations such as exporting the
contents of a Table to a flat tab-delimited text file, and importing it
into any of a myriad of other applications (including into any version
of Access).
Choosing an autonumber as the sole PRIMARY KEY column is likely to have
the effect of a performance hit on your database because a random
integer makes for a lousy clustered index.

This is a feature of Access that I'm not aware of. People are more
likely, from what I've heard (and I'm not privy to the source code of
Access, nor to the details of its algorithms), to suffer performance
hits from such mistakes as failing to define indices where they should,
and not from choosing a "lousy clustered index". Again, I'd be
interested in knowing more details if you have them.

-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.
 
How does knowing the "physical order of the disk" (assuming
you can determine what that is) help you do ordinary database
operations?

In the example of when you have one column constrained with NOT NULL
UNIQUE and another column constrained with PRIMARY KEY, how does
knowing the PRIMARY KEY help you do ordinary database operations?
Telling the DBMS 'system' (hardware, software - who cares?) that you
have designated one as PRIMARY KEY will tell the system to offer this
column special treatment. For Access/Jet, what do you think this
special treatment is?

I don't mean this rhetorically: you really need to tell me what you
think the difference is.

Here's my answer. In relational terms, they both do the same job i.e.
can't be null, must be unique, therefore may be used to uniquely
identify a row. For Jet, the PRIMARY KEY designation says, 'Build an
index on this column; not just any index, a physical index, the best
index you have in your toolbox.'

When you say 'ordinary database operations' I guess you mean
'relational stuff'. Well, to answer your question, in Jet the PRIMARY
KEY designation is about physical implementation and therefore is
outside of 'ordinary database operations'. It is about building indexes
for maximum efficiency. Indexes are non-relational by their nature (try
a CREATE INDEX in the Mimer SQL-92 Validator -
http://developer.mimer.com/validator/parser92/index.tml).

Don't ask me about file hacks, I don't know enough about Jet
implementation, mainly because I'm not interested. I want to know how
Jet does things the same as everyone else (i.e. how it complies with
the SQL-92 standard), not about how it is different in implementation.

I don't have that much knowledge about indexes either, but I do know
that I would not recommend someone use their one clustered index on a
column that never appears in grouping or range criteria.

How do I know all this? I read it in a newsgroup post. Sounded like
nothing I'd read before about Access/Jet, so I thought I'd check it
out. Guess what? It checked out
(http://support.microsoft.com/default.aspx?scid=kb;en-us;137039).
Sometimes you just have to be prepared for your paradigms to be shifted
:)
People are more
likely, from what I've heard (and I'm not privy to the source code of
Access, nor to the details of its algorithms), to suffer performance
hits from such mistakes as failing to define indices where they should
and not from choosing a "lousy clustered index".

You just summed up my point nicely. If you use your one physical
index on a unique integer column, wouldn't you be "failing to define
indices where they should" be?
 
Vincent said:
How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?

Physical order of the disk (a.k.a. the clustered index) for a table is
determined by its PRIMARY KEY designation.

Here's a quick demo of why for Jet you must choose a PRIMARY KEY
carefully.

I assume you have a 'sequence' table i.e. a table of integers e.g.

CREATE TABLE [Sequence] (seq INTEGER)
;
INSERT INTO [Sequence] VALUES (1)
;
INSERT INTO [Sequence] VALUES (2)
;
INSERT INTO [Sequence] VALUES (3)
;
etc etc.

In reality, my data was created in Excel and imported <g>.

Now create two test tables:

CREATE TABLE Test1 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (key_col)
)
;
CREATE TABLE Test2 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (data_col, key_col)
)
;

Note the difference in PK designation between them.

Now load some data using the 'sequence table':

INSERT INTO Test1 (data_col)
SELECT DT1.data_col FROM (
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
) AS DT1
;
INSERT INTO Test1 (data_col)
SELECT DT1... (as before)
;

At this point, compact the file. This has the effect of physically
rebuilding the tables based on their PRIMARY KEY designations.

To test the relative performance of the PRIMARY KEY choices, run a
query which uses a BETWEEN construct because this favours the physical
order:

SELECT key_col, data_col
FROM Test1
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
SELECT key_col, data_col
FROM Test2
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95

In my test timings, the table with PRIMARY KEY (key_col) is about 10 times
slower than the table with PRIMARY KEY (data_col, key_col).
 
I ran the test you suggested. Perhaps I missed something (details
below), but I wasn't impressed with the results I got. Please let me
know if I did miss something here.

Vincent said:
How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?

Physical order of the disk (a.k.a. the clustered index) for a table is
determined by its PRIMARY KEY designation.

Here's a quick demo of why for Jet you must choose a PRIMARY KEY
carefully.

I assume you have a 'sequence' table i.e. a table of integers e.g.

CREATE TABLE [Sequence] (seq INTEGER)
;
INSERT INTO [Sequence] VALUES (1)
;
INSERT INTO [Sequence] VALUES (2)
;
INSERT INTO [Sequence] VALUES (3)
;
etc etc.

In reality, my data was created in Excel and imported <g>.

You didn't say how many records you created. What I did was to put
100,000 records into [Sequence], with values of 1 to 100,000.
Now create two test tables:

CREATE TABLE Test1 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (key_col)
)
;
CREATE TABLE Test2 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (data_col, key_col)
)
;

I set [Test1].[key_col] and [Test2].[key_col] to be incremental
Autonumber (long integer) fields, and [Test1].[data_col] and
[Test2].[data_col] to be Number (long integer) fields. For [Test1], I
set the primary key field to be [key_col], and for [Test2] I set the
primary key to include both fields.
Note the difference in PK designation between them.

Now load some data using the 'sequence table':

INSERT INTO Test1 (data_col)
SELECT DT1.data_col FROM (
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
) AS DT1
;
INSERT INTO Test1 (data_col)
SELECT DT1... (as before)
;

For the second version, I assume you really meant

INSERT INTO Test2 (data_col)

Your SQL didn't work on my Access 2000, so what I actually did was to
define the following Queries:

[Q_ShowSeq] SQL:
SELECT S1.seq MOD 100 AS data_col
FROM Sequence AS S1
WHERE S1.seq<=50000;

[Q_UnionAll] SQL:
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq;

[Q_PopulateTest1] SQL:
INSERT INTO Test1 ( data_col )
SELECT DT1.data_col
FROM [SELECT * FROM Q_UnionAll
]. AS DT1;

and similarly for [Q_PopulateTest2].

This put 250,000 records into each of [Test1] and [Test2].
At this point, compact the file. This has the effect of physically
rebuilding the tables based on their PRIMARY KEY designations.

To test the relative performance of the PRIMARY KEY choices, run a
query which uses a BETWEEN construct because this favours the physical
order:

SELECT key_col, data_col
FROM Test1
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
SELECT key_col, data_col
FROM Test2
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95

In my test timings, the table with PRIMARY KEY (key_col) is about 10 times
slower than the table with PRIMARY KEY (data_col, key_col).

My version of your Queries (essentially identical to yours) was

[Q_SelectTest1] SQL:
SELECT [key_col], [data_col]
FROM Test1
WHERE data_col Between 10 And 15
Or data_col Between 30 And 35
Or data_col Between 50 And 55
Or data_col Between 70 And 75
Or data_col Between 90 And 95
;

and similarly for [Q_SelectTest2].

In my test timings, after compacting the database, both versions
completed essentially instantly (for example, 1.23E-08) as measured by
the Time function (system time, = days since midnight), so both were
apparently completed in a few milliseconds or less.

But looking at your description, even assuming I had included enough
records to be able to notice a difference, it appears that you forgot to
INDEX the [Test1].[data_col] field! Declaring the primary key to
include only [Test1].[key_col], omitting the field on which you were
filtering the records, would also omit putting an index onto the other
field. I imagine that that's why the Query took longer when you ran it.
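
The omission is a one-line fix (the index name is my own):

CREATE INDEX idx_Test1_data_col
ON Test1 (data_col)
;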

This is exactly what I was referring to earlier, specifically the
problem of FORGETTING to index a field that one needs to use for sorting
or filtering a recordset.

Choosing an index that fails to match the physical record order on disk
pales in comparison to this type of omission.

Bear in mind that putting lots of indices onto a Table may not improve
performance. Indices improve performance when searching and sorting,
but they reduce performance with updating (since each index must be
updated when a record is changed).

In summary, although it's perhaps useful to know that a primary key gets
slightly special treatment from Access, the improvement in performance
that you might experience is likely to be pretty marginal. Far more
important, IMHO, is organizing the information (via primary keys;
indices; helpfully named Tables, fields, and Queries; properly
human-engineered Forms; etc.) in your database so that it will do the
job that you need done and will be easy to maintain. Worrying too much
about internal implementation or "efficiency" can be counterproductive,
if it distracts you from the important parts of the design. (Worse than
any of this, of course, is failing to make the database model the
real-world system it's supposed to model, but that's another story.)

-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.
 
I agree with most of this, but suggest that one should be a bit
cautious, depending on individual circumstances.

John said:
No, certainly not.

An Autonumber is convenient if you don't have a "natural" key, but it
can actually be a bad idea. It makes all records unique automatically,
so that if you have duplicate data, it can be harder to discover.

This is a good point, but if it's important to be sure that the data are
unique, that can also be accomplished via an index that is separate from
the primary key.
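
For example, you could keep an Autonumber as the primary key and still
have Access reject duplicate names with something like this
(hypothetical names):

CREATE UNIQUE INDEX uq_People_Name
ON People ([Last Name], [First Name])
;
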
If you have some field - or combination of up to ten fields - which
uniquely identifies each record in your table, use it.
[...]

Depending on how the database is structured, this might or might not be
a good idea. I normally try to avoid using a multiple-field primary
key, merely because I think it would be a nuisance to maintain all those
fields (as parts of a foreign key) in the other Tables that refer to the
current Table. A foreign key that is merely a long integer seems a lot
tidier. Conceptually, there's no difference, but the long integer
occupies less space and (I expect) takes less time to compare when
sorting or searching. At the least, I'd want all or many of those extra
fields to have some value in the other Tables other than just being part
of a foreign key.
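
As a sketch of the single-field linkage I mean (hypothetical tables):

CREATE TABLE Customers (
CustomerID AUTOINCREMENT CONSTRAINT PK_Customers PRIMARY KEY,
CustomerName TEXT(100) NOT NULL
)
;
CREATE TABLE Orders (
OrderID AUTOINCREMENT CONSTRAINT PK_Orders PRIMARY KEY,
CustomerID INTEGER NOT NULL,
OrderDate DATETIME,
CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
REFERENCES Customers (CustomerID)
)
;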

OTOH, I might want to store those long multiple-field key values in the
other Tables if I were worried about the time required to accomplish an
indirection. Microsoft suggests avoiding linkages requiring more than
about half a dozen indirections -- this is kind of a rule of thumb, I
suppose, but it seems like good advice. So if you can use foreign-key
values directly instead of looking them up, and if you don't have to
update them very often, they might be part of a good design in your case.

My own preference is to lay on the indirect linkages (usually via
Autonumbers) pretty heavily early in the design stages, when I want to
be sure that redundant data are kept to a minimum. Then in the later
stages, when I might be more concerned with performance, I might
duplicate some of the key values to reduce indirections. Of course, if
I'm /really/ concerned about performance, I might go to SQL Server, or
ISAM on flat files, or some other system besides Access. But Access has
the advantage of being relatively easy to use and inexpensive, and
performance on modern computers usually does not have the same
importance it did in the olden days.

-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.
 
[snip]
... I normally try to avoid using a multiple-field primary key, merely because
I think it would be a nuisance to maintain all those fields (as parts of a
foreign key) in the other Tables that refer to the current Table. A foreign
key that is merely a long integer seems a lot tidier. Conceptually, there's
no difference, but the long integer occupies less space and (I expect) takes
less time to compare when sorting or searching. At the least, I'd want all or
many of those extra fields to have some value in the other Tables other than
just being part of a foreign key.
[snip]

But one could go the other way: make the multiple-field combination the
Primary Key, put a unique index on the AutoNumber, and then use the AutoNumber
as the Foreign Key in related tables.
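
In DDL terms, the arrangement might look like this (invented names;
note that the foreign key points at the UNIQUE column rather than at
the primary key):

CREATE TABLE Products (
ProductID AUTOINCREMENT CONSTRAINT UQ_Products_ID UNIQUE,
Vendor TEXT(30) NOT NULL,
PartNumber TEXT(30) NOT NULL,
CONSTRAINT PK_Products PRIMARY KEY (Vendor, PartNumber)
)
;
CREATE TABLE OrderItems (
OrderItemID AUTOINCREMENT CONSTRAINT PK_OrderItems PRIMARY KEY,
ProductID INTEGER NOT NULL,
CONSTRAINT FK_OrderItems_Products FOREIGN KEY (ProductID)
REFERENCES Products (ProductID)
)
;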

Does relational theory indicate that a FK must or should be the PK in the other
table?
 
Does relational theory indicate that a FK must or should be the PK in the other
table?

In practice it doesn't appear to make any real difference. A uniquely
indexed field works fine in relationships, AFAICS.

John W. Vinson[MVP]
 
John said:
In practice it doesn't appear to make any real difference. A uniquely
indexed field works fine in relationships, AFAICS.

John W. Vinson[MVP]

For that matter, I imagine that multiple unique keys could coexist in
one Table, causing no more trouble than that it would waste space. But
given the existence of an Autonumber field in the Table (for whatever
reason), that's likely the field I'd choose most of the time as the
foreign key to that Table from anywhere else, largely because it would
be a short field, easy to manage.

Exceptions would include natural unique keys that might be even shorter,
such as 2-letter postal abbreviations for states and provinces, assuming
I had a Table describing states (e.g., they might be sales territories,
or parts of a geography lesson).

-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.
 