G
k said:do all primary keys use autonumber
Vincent said:Autonumber is a convenient way to be sure
that you won't have any duplicate values, and an Autonumber field
occupies only about 4 bytes per record, much shorter than many other
fields you might choose to use.
Bear in mind that PRIMARY KEY has a special meaning which makes an
autonumber the choice of last resort for the PRIMARY KEY designation.
Take the example of a table which has two 'unique' columns: a random
integer and an industry standard key, of which the ISBN is an excellent
example. In relational theory there is no issue: all keys are equal,
simply use the one most appropriate to your needs. However, with SQLs
you can only have one PRIMARY KEY per table. You are forced to choose
just one but what criteria should you use to make this choice? Both
could be given NOT NULL UNIQUE constraints and your 'uniquifier' needs
would be satisfied.
But everyone (correctly) says a table should have a PRIMARY KEY. So
what does PRIMARY KEY give you that NOT NULL UNIQUE does not? For
Access/Jet, there's only one answer: physical order on disk a.k.a. the
table's clustered index.
What makes the best clustered index? In a nutshell, the column(s)
primarily used in queries that use GROUP BY and BETWEEN constructs.
Does anyone write SQL that includes GROUP BY (autonumber_col) or
BETWEEN (this autonumber value) AND (that autonumber value)? I doubt
it.
In the earlier example, ISBN is the natural key and wins the PRIMARY
KEY designation hands down. What about a table where the autonumber is
the only one 'unique' column?
Say, a table with just two columns, a random integer column named ID
and a last_name column. The last_name is unlike to be unique. Then
again, last_name is far more likely than the random integer to be used
in GROUP BY or BETWEEN constructs. The best approach is to constrain
the random integer (ID) with NOT NULL UNIQUE and make the PRIMARY KEY
designation to be (last_name, ID) in that order. This will cause the
physical order of the disk to be rebuilt on compact in last_name order.
Choosing an autonumber as the sole PRIMARY KEY column is likely to have
the effect of a performance hit on your database because a random
integer makes for a lousy clustered index.
you can determine what that is) help you do ordinary database
operations?
People are more
likely, from what I've heard (and I'm not privy to the source code of
Access, nor to the details of its algorithms), to suffer performance
hits from such mistakes as failing to define indices where they should
and not from choosing a "lousy clustered index".
Vincent said:How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?
Vincent said:How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?
Physical order of the disk (a.k.a the cluster index) for a table is
determined by its PRIMARY KEY designation.
Here's a quick demo of why for Jet you must choose a PRIMARY KEY
carefully.
I assume you have a 'sequence' table i.e. a table of integers e.g.
CREATE TABLE [Sequence] (seq INTEGER)
;
INSERT INTO [Sequence] VALUES (1)
;
INSERT INTO [Sequence] VALUES (2)
;
INSERT INTO [Sequence] VALUES (3)
;
etc etc.
In reality, my data was created in Excel and imported <g>.
Now create two test tables:
CREATE TABLE Test1 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (key_col)
)
;
CREATE TABLE Test2 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (data_col, key_col)
)
;
Note the difference in PK designation between them.
Now load some data using the 'sequence table':
INSERT INTO Test1 (data_col)
SELECT DT1.data_col FROM (
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
) AS DT1
;
INSERT INTO Test1 (data_col)
SELECT DT1... (as before)
;
At this point, compact the file. This has the effect of physically
rebuilding the tables based on their PRIMARY KEY designations.
To test the relative performance of the PRIMARY KEY choices, run a
query which uses a BETWEEN construct because this favours the physical
order:
SELECT key_col, data_col
FROM Test1
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
SELECT key_col, data_col
FROM Test2
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
In my test timings, table with PRIMARY KEY (key_col) is about 10 times
slower than the table with PRIMARY KEY (data_col, key_col).
John said:No, certainly not.
An Autonumber is convenient if you don't have a "natural" key, but it
can actually be a bad idea. It makes all records unique automatically,
so that if you have duplicate data, it can be harder to discover.
[...]If you have some field - or combination of up to ten fields - which
uniquely identifies each record in your table, use it.
[snip]... I normally try to avoid using a multiple-field primary key, merely because
I think it would be a nuisance to maintain all those fields (as parts of a
foreign key) in the other Tables that refer to the current Table. A foreign
key that is merely a long integer seems a lot tidier. Conceptually, there's
no difference, but the long integer occupies less space and (I expect) takes
less time to compare when sorting or searching. At the least, I'd want all or
many of those extra fields to have some value in the other Tables other than
just being part of a foreign key.
Does relational theory indicate that a FK must or should be the PK in the other
table?
John said:In practice it doesn't appear to make any real difference. A uniquely
indexed field works fine in relationships, AFAICS.
John W. Vinson[MVP]