Category : Dbase (Clipper, FoxBase, etc) Languages Source Code
Archive   : TN9002.ZIP
Filename : DATABASE.TXT
Database File Normalization
Rick Biegel
The following is a parable for our time Ä and profession. Read it and
shudder, but be of good cheer; what follows is an explanation and some
solutions.
The Story
Once upon a time, in the Land of MIS, there was a company called
DataLove Corporation.
DataLove was a suffering organization. True, it was a Fortune-500
company, with hefty profits and a seemingly bright future; but looks
can be deceiving. DataLove had an information problem.
It seems that, one year earlier, DataLove had hired a programmer named
Ernie Schnickelfritz to write a database-management program to keep
track of the company's ongoing projects, and the individual worker
assignments that these projects were comprised of. The clerical staff
had fond hopes that the new system would make their work faster,
easier, more accurate and more reliable.
What's that you say? Fat chance?
Unfortunately, in this case you are right. Within a few days of the
installation of the new program, Jane Doe, the Project Coordination
Manager, began to encounter some rather infuriating problems. She
called a meeting with the MIS Manager to discuss these new problems.
Problem # 1
"My problem," said Jane, "is this: My job is to keep track of all the
projects that the company is working on. Each project is divided up
into various assignments. Now, I can delete or change old assignments
just fine, but I've discovered I can only add ten assignments per
project! And no matter what I try, there doesn't seem to be any way
at all to do a proper seek on, or even alphabetically list out, any of
these individual assignments! You have to seek out the general
project first, and then page through the various assignments in that
project to get to the one you're looking for. Oh, and by the way, I
have also discovered that when I get rid of a particular assignment,
or even a lot of assignments, it doesn't free up any space on the disk
at all! The only way to free up disk space is to delete an entire
project! Sorry, but that just doesn't make sense to me."
That day, the MIS Manager had a brief chat with Ernie. "Fix it," he
concluded with a frown.
Three weeks later, Ernie delivered the new version of the program.
Jane was pleased. "I can enter all the assignments I like per
project, I can do a seek or sort on all of them, and removing an
assignment yields space on the disk."
But the incident left a bad aftertaste.
Problem # 2
The bad aftertaste was also a bad omen. A few weeks later, the MIS
Manager was summoned for another meeting.
"Well, what do you think of this?" said John Q. Public, the Project
Site Manager. "As you know, a project assignment is identified by its
Project Number, Employee Number and Assignment Date. But the location
of where a particular assignment is worked on depends only on its
Project Number and Employee Number. That means that there are
multiple places in the database that contain the same location, and
that has caused us a big problem.
"Sometimes we change our minds about locations. Recording these
changes in the computer ought to be simple, but it isn't! We have to
go to every assignment record for the project and employee in question
and change the assignment location field manually! That's a royal
pain! Remember last week when we shuffled the work locations of
Project A1032F1? What a mess! For instance, Jim Workaholic had over
two hundred assignments given him for the period, and thirty-one of
them were for that particular project. Jim was originally designated
to do those thirty-one assignments in Los Angeles; all we needed to do
was change them all to San Diego. But our data entry people had to go
through every one of them and make the change INDIVIDUALLY! In the
process, mistakes and omissions were made. Can you talk to Ernie
about it?"
The MIS Manager talked to Ernie about it, but this time without the
cheerfulness of their previous meeting. "You're making me look bad,"
he growled. "Fix it."
When Ernie completed the new changes, John Q. Public was pleased.
"Well, at least now we can enter one location that will fit all of a
particular project/employee/assignment class. Making changes should
now be much easier, and we won't make so many mistakes."
But the MIS Manager, being a rather unforgiving soul, was not so happy
with Ernie.
Problem # 3
A week later, the MIS Manager was spoken to by Joe Soap, the Project
Assignment Coordinator. "It's like this," he said. "Every record in
our Assignment database has its own unique Assignment Number, and a
Description field associated with it. We occasionally need to switch
Assignment Numbers between two or more particular assignments due to
administrative snafus. That's no problem. But we also have to switch
the Description field, too! Do you realize what a hassle that is?
The Description field is a Memo-type and some descriptions have as
many as fifty lines of information in them! And it shouldn't be
necessary. Each Assignment Number is unique, and so is each
Description field, and each is directly associated with the other;
there is a one-to-one correspondence between them. What I'm driving
at is that switching Assignment Number fields between records should
automatically switch the Description field as well. Otherwise, it's
stupid and inefficient."
Joe's "stupid and inefficient" remark appalled the MIS Manager. "Fix
it," he hissed at Ernie later that afternoon, "or I'll fix you!"
Soon after Ernie delivered the newly-changed program, Joe Soap was
pleased. "Now making switches is a breeze. No need to re-type those
darned Description fields!"
Nonetheless, a cloud had begun to form over Ernie's career at DataLove
Corporation. Rumor had it that the MIS Manager was looking for a new
programmer.
On that inconclusive note, our story ends. Now let's look at what
happened, and why.
The Inside Story
In order to see what went wrong with Ernie's application, it is
essential to learn about a concept known as normalization.
Normalization theory is concerned with the elimination of certain
undesirable characteristics in the design of database files and their
indexing keys. Most of these undesirable characteristics have one
overriding trait in common: data redundancy. More about that in a
moment.
Problem # 1 revisited: First Normal Form
First, let's review Jane Doe's complaints:
1. There was an arbitrary limit of ten assignments per project
that was very restricting;
2. Neither seeks nor sorts on the individual assignments were
practical;
3. Removing assignments did not yield new space on the disk
unless an entire project was removed.
When Ernie created his application, he originally created only one
database, which he named Project.DBF. He intended to put all the
application's data into it, with the rationale that this would make
things simpler and easier to code.
Field Field Name Type Width Dec Index
1 PROJECT_NO Character 7 Y
2 PROJ_NAME Character 25 N
3 START_DATE Date 8 N
4 ASSIGN_1 Character 7 N
5 EMP_NO_1 Character 8 N
6 A_DATE_1 Date 8 N
7 LOCATION_1 Character 25 N
8 DESCRPT_1 Memo 10 N
9 ASSIGN_2 Character 7 N
10 EMP_NO_2 Character 8 N
11 A_DATE_2 Date 8 N
12 LOCATION_2 Character 25 N
13 DESCRPT_2 Memo 10 N
. . . .
. . . .
. . . .
48 ASSIGN_10 Character 7 N
49 EMP_NO_10 Character 8 N
50 A_DATE_10 Date 8 N
51 LOCATION_10 Character 25 N
52 DESCRPT_10 Memo 10 N
** Total ** 621
The idea was that each record in Project.DBF would represent one
project. Since each project consisted of a series of assignments,
each with an Assignment Number, Employee Number, Assignment Date,
Location and Description, Ernie included Assign_x, Emp_No_x, A_Date_x,
Location_x and Descrpt_x fields for each assignment within a project,
up to and including the tenth assignment. "That way", he reasoned,
"I'll have all my data in one database! None of this garbage about
switching from one database to another. Simplicity, ease of
programming, and no nonsense!"
Of course, this was folly. When Jane Doe wanted to enter more than
ten assignments per project, she was frustrated by the finite Ä and
arbitrary Ä number of assignment "slots" available. And of course,
she could not do any practical sort or seek on the individual
projects, since they were clumped within records instead of
distributed one per record in a database. And what if Jane had only
wanted to enter, say, three assignments for a given project? Since
each record in a database file is the same physical length regardless
of whether any real information is contained in its fields, the same
amount of disk space is used as in a record with ten assignments in
it. Small wonder that Jane had discovered that removing assignments
did not free up space on her disk!
This latter problem is our first example of redundancy. The
arrangement is redundant because each time a record is created, an
arbitrary number of assignment fields is created, regardless of how
many will ultimately be used. The result is a lot of uselessly
repeated blank spaces and wasted disk space. Of course, this becomes
especially severe when the file gets very large!
The database design rule Ernie had violated is known as First Normal
Form. In the dBASE IV programming language, First Normal Form
mandates the following:
A database file MUST NOT have multiple occurrences of the "same" field
within one record.
Another way of stating this is: DO NOT attempt to simulate a series
of records within a record. Inserting a repetition of "like" fields
into one record is usually a fatal flaw in database design.
Note that I say usually. As in so many other rules in programming (or
life) there are exceptions. In the vast majority of cases, First
Normal Form should be fanatically adhered to.
Ernie solved Jane Doe's problem by re-organizing the data and program
code in accordance with First Normal Form. He saw that the only way
to achieve this was to break up Project.DBF into two databases,
Project.DBF and Assign.DBF:
Project.DBF
Field Field Name Type Width Dec Index
1 PROJECT_NO Character 7 Y
2 PROJ_NAME Character 25 N
3 START_DATE Date 8 N
** Total ** 41
Assign.DBF
Field Field Name Type Width Dec Index
1 PROJECT_NO Character 7 Y
2 EMP_NO Character 8 N
3 A_DATE Date 8 N
4 LOCATION Character 25 N
5 ASSIGN Character 7 N
6 DESCRPT Memo 10 N
. . . .
. . . .
. . . .
Other Fields
WHAT IS A KEY?
Keys determine the order in which records in a database can be listed
or accessed. A key can consist of a single field in a database, or
several fields combined together. For an example of the latter type,
our article includes the key Project_No+Emp_No+DTOS(A_Date); this
means that Project_No is the primary key field, Emp_No is the
secondary key field, and A_Date is the tertiary key field (converted
from a date type to a character type by the DTOS function so it can be
combined with the other two fields). In other words, the data will be
ordered on Project_No first, Emp_No second and A_Date third. To see
how this works, examine this sample listing:
PROJECT_NO Emp_No A_DATE . . . Other Fields
ABCDEF1 RTS-BRGB 03/25/88 . . .
ABCDEF1 RTS-BRGB 07/15/88 . . .
ABCDEF1 RTS-DXLM 11/08/87 . . .
ABCDEF1 RTS-DXLM 01/14/88 . . .
ABCDEF1 RTS-DXLM 12/18/88 . . .
ABCDEF1 RTS-DXLM 06/22/89 . . .
BBDEFC1 RTS-BRGB 02/11/87 . . .
BBDEFC1 RTS-BRGB 11/27/87 . . .
. . .
. . .
. . .
Notice that the first two records have the same Project_No and Emp_No
values, but have different A_Date values, with the later A_Date coming
after the earlier one. Notice also that, while the second and third
records have the same Project_No values, the third record has a higher
Emp_No value. The third record and the three records that follow it
have the same Project_No and Emp_No, with the A_Date arranged in
ascending order from one record to the next. A_Date is therefore said
to be Òordered withinÓ Emp_No. Note also that Emp_No is ordered
within Project_No.
Keys are defined and created in special files with .MDX or .NDX
extensions; these are known as Òindex filesÓ. See the dBASE IV
Language Reference manual for a more detailed explanation of how index
files are designed and created.
Let us assume that Ernie included other fields in Assign.DBF that were
not part of this or future problems.
On the occasions that the information in both files needed to be used
in tandem, Ernie related them together on the common field Project_No
with the following program code:
SELECT 2
USE Assign ORDER Project_No
SELECT 1
USE Project ORDER Project_No
SET RELATION TO Project_No INTO B
SET SKIP TO B
In this way, records in Project.DBF (all of which were unique) could
each "point" to any number of related records in Assign.DBF (which
were not necessarily unique). Since Assign.DBF was its own file, the
number of assignment records per project record was effectively
limitless. And, of course, the problem of the redundant spaces was
solved.
On the occasions that Assign.DBF was used by itself, a "complex" index
tag was used, with the following expression: Project_No + Emp_No +
DTOS(A_Date). This meant that the assignments were conveniently
indexed by Project Number, Employee Number and Assignment Date, and
could now be accessed and listed in the way that Jane had wanted.
Ernie's database design now conformed to First Normal Form. But alas,
Ernie's problems did not end there.
Problem # 2 Revisited: Second Normal Form
Now let's review John Q. Public's complaint. John discovered that
when it was necessary to change the value in a Location field, it was
often necessary to change the field more than once Ä even after the
system conformed to First Normal Form.
The three fields Project_No, Emp_No and A_Date served as a complex key
for Assign.DBF when it was not being used in tandem with Project.DBF.
It was during one of these occasions that John discovered his problem.
A database field is said to be dependent on a key when the combination
of that field and the key are unique within the database. It's an
important concept, so let's restate it another way: the field is
dependent on a key when the database has no two records with different
values in that field without also having a different value in the
key. Let's see what that means in our example.
Ernie designed the database such that any particular
Project_No+Emp_No+DTOS(A_Date) key combination could be associated
with only one Assign value, and only one Descrpt value. Put another
way, there could be no two records in which the Assignment Number or
Description fields differed, while still having the same Project
Number, Employee Number and Assignment Date simultaneously. Note in
particular that all three fields in the key are needed to identify an
Assign or a Descrpt field; as we shall see, this is of extreme
importance.
Now let's see what is different about the Location field. Recall from
our story that the Location field was dependent only upon the Project
Number and the Employee Number. The Assignment Date had nothing to do
with the choice of location. This is why Location is not a true
dependency. A true dependency occurs only if the entire key is needed
to identify the field.
To understand why this is such a problem, remember the example that
John used when making his complaint. Jim Workaholic, who had been
given over two hundred assignments, had thirty-one of these
assignments in the same city, Los Angeles. Each of these assignments
was scheduled for a different date, but that was not relevant to the
choice of location; nonetheless, each of the thirty-one Location
fields had to be filled with "Los Angeles" when the assignment records
were first enteredÄand of course, altered repetitively when the change
was made to "San Diego" later on.
The fundamental flaw in this arrangement is that a unit of information
had to occupy thirty-one places rather than oneÄanother example of
redundancy in Ernie's system. Not only was John Q. Public annoyed at
the extra work it took to keep track of his data, but space on the
disk was wasted on those thirty-one iterations of the single string
"Los Angeles".
The database design rule violated here is known as Second Normal
Form. Second Normal Form mandates the following:
The database MUST BE in First Normal Form, and every field in the
database MUST BE dependent on the entire key, and not just part of the
key.
This implies that any field in the database not so dependent on the
key belongs somewhere else. And that is what Ernie had discovered the
hard way.
Ernie solved the problem by breaking up the Assign.DBF database into
two databases, Assign.DBF and Location.dbf:
Assign.DBF
Field Field Name Type Width Dec Index
1 PPRJECT_NO Character 7 Y
2 EMP_NO Character 8 N
3 A_DATE Date 8 N
4 ASSIGN Character 7 N
5 DESCRPT Memo 10 N
. . . .
. . . .
. . . .
Other Fields
Location.DBF
Field Field Name Type Width Dec Index
1 PROJECT_NO Character 7 N
2 EMP_NO Character 8 N
3 LOCATION Character 25 N
For times when Assign.DBF and Location.dbf were to be used in tandem,
Ernie set up a Query linking the Project_No and Emp_No fields between
the two databases.
Note that there is now a true dependency between the key
Project_No+Emp_No and Location, since Location is entirely dependent
on Project Number and Employee Number. A relation like this is said
to be atomic. Simply put, an atomic relation is one which cannot be
broken down further into smaller component relations.
To see the benefits of atomization, consider the fact that now when
John Q. Public wishes to change a Location for a particular
Project_No+Emp_No, he only needs to make the change once since this
location is stored in only one place. Needless to say, this also
saves disk space, since data redundancy has been reduced.
Ernie's databases and keys now conformed to First and Second Normal
Form. But there was one last change that needed to be made.
Problem # 3 Revisited: Third Normal Form
Finally, let's look at Joe Soap's complaint. Joe found it pointless
and annoying that every time he switched Assign fields between two
records, he also had to switch Descrpt fields as well. It seems
intuitively obvious that if there is a one-to-one correspondence
between two fields, switching one should switch the otherÄespecially
if the other is a memo field with fifty or so lines to type in!
At first glance, things seem to be in order. There are true
dependencies between the key and Assign and between the key and
Descrpt; both are dependent on the entire key.
But looks can be deceiving. The one-to-one correspondence between
Assign and Descrpt has created an unwanted relationship known as a
transitive dependency.
While there is a true, valid dependency between the key and each of
the two fields in question, there is also an indirect (transitive)
dependency leading from the key to Assign and then to Descrpt. For
that matter, there is also a transitive dependency leading from the
key to Descrpt and then to Assign. In other words, although Assign
and Descrpt may be dependent on the key, they also manifest a
redundant co-dependency on each other.
And this explains Joe's exasperation. The two fields are wrongly
treated by the application as separate entities, dependent only on a
third entity, the key. Of course, this means that when the Assign
field is switched from one record to another, Descrpt must also be
switched, as if there were no direct relationship between the two
fields at allÄdespite the inconvenience and the outraged rationality
of the user.
This flaw violates the design rule known as Third Normal Form. Third
Normal Form mandates the following:
The database MUST BE in Second Normal Form, and every non-key field
MUST BE nontransitively dependent on the key.
To put it another way, all non-key fields must be fully dependent on
the key, and there must be no non-key field that is also dependent
upon another non-key field.
Ernie rectified the error by removing the Descrpt field from
Assign.DBF and putting it into a new database named AsgnDesc.DBF.
Here are the structures of these two files:
Assign.DBF
Field Field Name Type Width Dec Index
1 PROJECT_NO Character 7 Y
2 EMP_NO Character 8 N
3 A_DATE Date 8 N
4 ASSIGN Character 7 N
. . . .
. . . .
. . . .
Other Fields
AsgnDesc.DBF
Field Field Name Type Width Dec Index
1 ASSIGN Character 7 Y
2 DESCRPT Memo 10 N
For times when these two files were to be used in tandem, Ernie set a
relation between them on the Assign field.
Now the relationship between Assign and Descrpt is atomic; it can no
longer be broken down further, and there is no redundancy. Better
yet, switching the Assign field between records in Assign.DBF will
automatically cause the Descrpt fields to be logically switched as
well. Third Normal Form has been adhered to.
It should be noted that there is an even stronger version of Third
Normal Form known as Boyce-Codd Normal Form. While Third Normal Form
requires that there be no transitive dependencies between non-key
fields, Boyce-Codd Normal Form requires that there be no transitive
dependencies at all, not even between key fields.
In dBASE IV, a determinant is any field or combination of fields upon
which one or more other fields is dependent. Note, for instance, that
the key Project_No+Emp_No+DTOS(A_Date) is a determinant for Assign,
and that Assign is itself a determinant for Descrpt. Note also that
each determinant is also a key. With this new insight, we can now
state the requirements for Boyce-Codd Normal Form:
Every determinant in database, whether or not it is a part of a key,
must itself be a key.
Luckily for Ernie, the new arrangement he created with Assign.DBF and
AsgnDesc.DBF met this exacting standard.
Postscript
Now for the final part of our story. The MIS Manager replaced Ernie
with another programmer. But the ending is really not so tragic;
Ernie may have stumbled, but in the process he learned about database
normalization. Here is an overview of his new wisdom:
1. Data redundancy must be avoided; redundancy anomalies impose
extra work on users, present greater opportunities for errors
and waste disk space.
2. Data redundancy can be avoided through adherence to the Normal
Forms. There are other Normal Forms besides the ones
discussed in this article. Normal Forms 1 through 3 and
Boyce-Codd Normal Form are typically the most important for
the dBASE IV user; the others were excluded for the sake of
brevity and relevance. If the reader is interested in
learning about these other normal forms, I recommend the book
by C. J. Date included in the References section.
One final note: Ernie is now earning a six-digit figure working as an
MIS Manager for another company. It just goes to show what a little
normalization can do.
Very nice! Thank you for this wonderful archive. I wonder why I found it only now. Long live the BBS file archives!
This is so awesome! 😀 I’d be cool if you could download an entire archive of this at once, though.
But one thing that puzzles me is the “mtswslnkmcjklsdlsbdmMICROSOFT” string. There is an article about it here. It is definitely worth a read: http://www.os2museum.com/wp/mtswslnk/