Interesting idea, but I still am not sure how to deal with db level
validation. Any pointers there would be appreciated.
Well, if we're talking about database validation then let's just
consider Jet's CHECK constraints and only worry for now about ANSI
wildcard characters: % for multiple characters and _ (underscore) for a
single character.
The way I see it there are two flavours of Jet CHECK constraint: row
level and table level (but in reality they are the same animal).
Access's Validation Rules are considered to be either at the 'column'
level or at the 'table' level, though I'd argue the latter is actually
at the *row* level because one can reference multiple columns in the
same row but not other rows in the table. I think the differentiation
is based on the Validation Text property more than anything.
Basic validation of values to enforce business rules, like the one that
started this thread: "must start with a the letter 'H' followed by a
numeric value of varying length". I also assume that H01 is illegal
because it is considered equivalent to H1 and that H0 is illegal
because it is effectively 'zero'.
Jet allows use to define multiple CHECK constraints per row/table,
therefore we can follow Celko's advice in the article (up thread) and
split these into multiple rules. It would be good to set a Validation
Text property for each but because it is tied to closely to Access
we're only allowed one per *table*. The best we can do is use a
meaningful name. Remember it is our intention that we will trap input
errors with validation in the front end application so the user being
exposed to these in error messages is a 'last resort':
CREATE TABLE Employees (
employee_nbr VARCHAR(11) NOT NULL PRIMARY KEY,
CONSTRAINT employee_nbr__basic_pattern
CHECK (employee_nbr NOT LIKE 'H%[!0-9]%'),
CONSTRAINT employee_nbr__no_leading_zeros
CHECK (employee_nbr NOT LIKE 'H0%')
);
At this point, before we start building the front end, we should test
each rule e.g.
values I expect to pass:
INSERT INTO Employees (employee_nbr) VALUES ('H1');
INSERT INTO Employees (employee_nbr) VALUES ('H900');
values I expect to fail:
INSERT INTO Employees (employee_nbr) VALUES ('F111');
INSERT INTO Employees (employee_nbr) VALUES ('H1A3');
INSERT INTO Employees (employee_nbr) VALUES ('H0');
INSERT INTO Employees (employee_nbr) VALUES ('H05');
[Just to prove this is a worthy exercise, my first insert inexplicable
failed. A quick look at my code revealed a typo [!9-0], which could
have been tricky to debug later.]
Validation using multiple values on the same row. For our payroll
table, let's model periods of earnings history in the recommended way
using closed-open representation with start_date and end date pairs, a
null end date signifying the current pay period.
Basic validation of the dates is that the start date will be midnight
and the end date will be the smallest granule (for Jet this is one
second) before midnight.
Note that although end_date can be NULL there is no need to explicitly
test for NULL in validation. This is due to the nature of nulls in SQL
DDL (data declaration language e.g. table design). Whereas in SQL DML
(data declaration language e.g. queries) an UNKNOWN resulting from a
comparison with a NULL value causes rows to be removed from a
resultset, in SQL DDL the UNKNOWN cannot be known to fail the rule
therefore it is allowed to pass. This makes sense when you think of our
NULL end data as being a placeholder for a date which will certainly be
known at some time in the future, so it is right to defer validation
until the value is known. I think this is mainly due to pragmatics
though i.e. without this implicit behaviour, validation of nullable
columns would *always* have to explicitly handle NULL.
An obvious domain rule, yet one that is often missed in database
validation rules, is that the end date cannot occur before the start
date in the same row. Similarly, salary cannot be negative.
Another obvious one is that each employee there should only be a
maximum of one row with a null date, which is best enforces with a
UNIQUE constraint, because the end dates should not be duplicated
either for an employee. Start dates are similarly unique for an
employee and, because they are not nullable, makes a good compound
natural key:
CREATE TABLE EarningsHistory (
employee_nbr VARCHAR(11) NOT NULL
REFERENCES Employees (employee_nbr)
ON DELETE NO ACTION
ON UPDATE CASCADE,
start_date DATETIME DEFAULT DATE() NOT NULL,
CONSTRAINT earnings_start_date__open_interval
CHECK(
HOUR(start_date) = 0
AND MINUTE(start_date) = 0
AND SECOND(start_date) = 0),
end_date DATETIME,
CONSTRAINT earnings_end_date__one_granule_closed_interval
CHECK(
HOUR(end_date) = 23
AND MINUTE(end_date) = 59
AND SECOND(end_date) = 59),
CONSTRAINT earnings_dates_order
CHECK (start_date < end_date),
salary_amount CURRENCY NOT NULL,
CONSTRAINT earnings_salary_amount__value
CHECK (salary_amount >= 0),
UNIQUE (employee_nbr, end_date),
PRIMARY KEY (employee_nbr, start_date)
);
Once again, test immediately.
Rows I expected to pass:
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2004-01-01 00:00:00#, #2004-12-31 23:59:59#, 10000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2006-01-01 00:00:00#, NULL, 12000.00)
;
Rows I expected to fail:
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2005-01-01 00:00:00#, #2005-12-31 00:00:00#, 11000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2005-01-01 14:29:39#, #2005-12-31 23:59:59#, 11000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2005-12-01 00:00:00#, #2005-01-31 23:59:59#, 11000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2005-01-01 00:00:00#, #2005-12-31 23:59:59#, -99.99)
;
[Once again my first test insert revealed an error: a copy and paste
resulted in me testing start_date in my rule for end_date!]
The data integrity of our payroll table is quite good. If we can't get
bad values in using direct INSERTs then no front end application can.
However, the constraints are not complete to my satisfaction. For
example, this spoils the data:
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2004-11-01 00:00:00#, #2006-02-28 23:59:59#, 11000.00)
;
It passes all the existing constraints yet it creates bad (illogical)
data. Note end_date can be null, signifying the current date, so we'll
replace it with the current date in queries:
SELECT employee_nbr,
#2006-01-01# AS report_date, salary_amount
FROM EarningsHistory
WHERE #2006-01-01# BETWEEN start_date AND
IIF(end_date IS NULL, NOW(), end_date);
According to the data, the employee was earning two different amounts
simultaneously. This could result in some tricky situations so a
further constraint is required.
To write this kind of constraint I usually start with some data that
fails the rule, write a query that identifies all the bad rows, then
turn it into a constraint.
First some more bad data:
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2004-03-01 00:00:00#, #2006-04-01 23:59:59#, 15000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2004-07-01 00:00:00#, #2004-07-31 23:59:59#, 16000.00)
;
INSERT INTO EarningsHistory
(employee_nbr, start_date, end_date, salary_amount)
VALUES ('H1', #2006-05-01 00:00:00#, #2006-12-31 23:59:59#, 17000.00)
;
As everyone knows, if they think about it, that a period overlaps a
later period if its end date occurs before the later period's start
date (I think...I'm doing this all off the top of my head!) Therefore,
I think this should identify the bad data:
SELECT *
FROM EarningsHistory, EarningsHistory AS E2
WHERE EarningsHistory.employee_nbr = E2.employee_nbr
AND EarningsHistory.start_date < E2.start_date
AND
(
E2.start_date
< IIF(EarningsHistory.end_date IS NULL, NOW(),
EarningsHistory.end_date)
OR IIF(E2.end_date IS NULL,
NOW(),
E2.end_date) < IIF(EarningsHistory.end_date IS NULL,
NOW(), EarningsHistory.end_date)
)
I then start deleting the bad rows individually. The resultset should
only return zero rows when all the bad data has been removed, so I run
the query after each single row deletion:
DELETE FROM EarningsHistory WHERE employee_nbr = 'H1'
AND start_date = #2004-11-01 00:00:00#
;
DELETE FROM EarningsHistory WHERE employee_nbr = 'H1'
AND start_date = #2004-03-01 00:00:00#
;
DELETE FROM EarningsHistory WHERE employee_nbr = 'H1'
AND start_date = #2004-07-01 00:00:00#
;
DELETE FROM EarningsHistory WHERE employee_nbr = 'H1'
AND start_date = #2006-05-01 00:00:00#
;
OK, only after the final delete did the resultset return empty. I'll
assume the logic is sound and convert it to a constraint:
ALTER TABLE EarningsHistory ADD
CONSTRAINT earnings_history__no_overlapping_periods
CHECK (0 = (
SELECT *
FROM EarningsHistory, EarningsHistory AS E2
WHERE EarningsHistory.employee_nbr = E2.employee_nbr
AND EarningsHistory.start_date < E2.start_date
AND
(
E2.start_date
< IIF(EarningsHistory.end_date IS NULL, NOW(),
EarningsHistory.end_date)
OR IIF(E2.end_date IS NULL,
NOW(),
E2.end_date) < IIF(EarningsHistory.end_date IS NULL,
NOW(), EarningsHistory.end_date)
)
)
);
Try the inserts again and they should all fail.
Some further business rules spring to mind. Contiguous periods were the
salary amounts are the same is useless information indicative of an
error:
ALTER TABLE EarningsHistory ADD
CONSTRAINT earnings_history__contiguous_periods_salary_must_change
CHECK (0 = (
SELECT COUNT(*)
FROM EarningsHistory, EarningsHistory AS E2
WHERE EarningsHistory.employee_nbr =
E2.employee_nbr
AND DATEADD('s', 1,
IIF(EarningsHistory.end_date IS NULL, NOW(),
EarningsHistory.end_date)) = E2.start_date
AND EarningsHistory.salary_amount = E2.salary_amount
)
);
Another rule could disallow gaps between periods (if they still
employed but aren't being paid then add a period where the salary is
zero):
ALTER TABLE EarningsHistory ADD
CONSTRAINT earnings_history__periods_must_be_contiguous
CHECK ( 0 = (
SELECT COUNT(*)
FROM EarningsHistory AS E1
WHERE EXISTS (
SELECT *
FROM EarningsHistory AS E2
WHERE E1.employee_nbr = E2.employee_nbr
AND E1.start_date < E2.start_date)
AND NOT EXISTS (
SELECT * FROM EarningsHistory AS E2
WHERE E1.employee_nbr = E2.employee_nbr
AND DATEADD('s', 1, E1.end_date) = E2.start_date
)
)
);
Then there are CHECK constraints that can reference rows in other
tables...
It should be obvious by now that writing validation rules in the