Dec 092017
 
Database Design and Normalization, a paper by Dan Ehrman. Explains principles of data-normalization in relational databases, with particular reference to Paradox.
File NORMAL.ZIP from The Programmer’s Corner in
Category Paradox DBMS
Database Design and Normalization, a paper by Dan Ehrman. Explains principles of data-normalization in relational databases, with particular reference to Paradox.
File Name File Size Zip Size Zip Type
NORMAL.DOC 17787 5231 deflated

Download File NORMAL.ZIP Here

Contents of the NORMAL.DOC file


Database Design and Normalization
----------------------------------------------------------------------
Dan Ehrmann
President
Kallista, Inc.
----------------------------------------------------------------------
Copyright 1990, ALL RIGHTS RESERVED.

May not be reproduced without the express written permission
of the author. Call me if you need permission: 312-663-0101.
----------------------------------------------------------------------


Introduction

Relational databases are becoming the predominant method for storing
repetitive data in computers because they allow us to store and maintain
that data much more efficiently than other types of databases.

Relational databases are so named because they allow us to split our
data up into different tables, and then relate common information
between those tables together. As we shall see, it is more efficient to
store information in many narrow tables, than in one very wide table.
It also results in more consistent and reliable data.


Breaking data up into a number of smaller tables is called
"normalizing", after the mathematical theory which underlies relational
databases. There are 5 "normal forms", or rules for splitting data up
into separate tables. The first four of these are of general interest;
the last one is much more esoteric, and we will not be discussing it
within the scope of this paper.

If we start with a flat file layout, and then apply each of the first
three normal forms in turn, you will see how data is transformed into
the tables we have been working with.


Flat File

If we had to use one table to store all the information in a Telephone
Chargebacks System, say a product like PFS-File, Reflex or Q&A, we would
end up with a table structure not unlike the following:

Structure Of: TELEPHON

1 Staff Member A20*
2 Position A20
3 Department A20
4 Department Manager A20
5 Equipment One A30
6 Equipment Two A30
7 Equipment Three A30
8 Equipment Four A30
9 Equipment Five A30

This table assumes that no person has more than five pieces of
equipment, which are specifically described. Each Department and
Department Manager is also spelled out in detail.

But what happens if Joe Blow requires more than five pieces of
equipment? We would have to add a new field "Equipment Six" for just
that person. This is especially inefficient if the vast majority of
people in the database have just 1 or 2 items. In a very large table,
the amount of wasted space can become truly horrendous.

Here is another problem. What if we want to query for the departments
where staff people have the equipment item known as "Message Call
Answering". This item could appear in any one of 5 different fields,
(or 6, if we added a new field to allow for Joe Blow.) The query find
this list looks like this (abbreviated to fit on the page):

TELEPHONDepartmentEquipment 1Equipment 2Equipment 3Equipment 4Equipment 5
Check Message Call
Check Message Call
Check Message Call
Check Message Call
Check Message Call

Notice how the query has as many rows as there are repeating fields,
since each instance of an Equipment Item must be separately queried. We
cannot put the criteria for each of the "Equipment.." fields on the same
line, since doing this would be asking for "Message Call Answering" in
each and every "Equipment.." field.

This is a very slow query. Most databases will make 5 separate passes
through the table, finding matching records in each case. There has to
be a better way.


1st Normal Form

The first "rule" of normalizing databases says "Eliminate Repeating
Groups". For each set of related fields, make a separate table and give
that table a primary key.

What this means for our application is that we split up the data into
two tables, as follows:

Structure Of: STAFF

1 Staff Member A20*
2 Position A20
3 Department A20
4 Department Manager A20


Structure Of: EQUIPMNT

1 Staff Member A20*
2 Equipment Description A30*


All equipment descriptions are split off into the separate Equipment
table, and linked to the Staff table via the Staff Member's name.
Both tables are much narrower than the original Telephone table, and
correspondingly longer. But this is no problem. Relational Databases
are designed to handle relatively narrow but long tables much more
efficiently than wide, short tables. This is especially true if the
operations we perform on these tables use the key fields, since linking
on key fields is very quick indeed.

Notice how the Equipment table has both fields keyed. The primary
index (consisting of the concatenation of all keyed fields) must be
unique according to the theory. In instances such as the one above,
where the key fields are getting very wide, programmers generally start
using "ID" numbers to establish uniqueness between records. So we would
modify the Equipment table as follows:

Structure Of: EQUIPMNT

1 Staff Member A20*
2 Equipment ID A2*
3 Equipment Description A30

Now to perform the query described above, we link the two tables
together using Paradox's example elements, as follows:

EQUIPMNTStaff MemberEquipment Description
_xxx Message Call Answering

STAFFStaff MemberDepartment
_xxx Check

As before, the Answer table is a list of Departments where Staff Members
have this item of equipment. But this query will run far quicker than
the first one, use less memory and less disk space. If we place a
Paradox secondary index on the "Equipment Description" field, the query
will run even faster, although this is usually not necessary.

But this layout also has it problems. What if we want to change the
text associated with a specific "ID", for example, from "Message Call
Answering" to "Message Call Forwarding". In Paradox, this can be done
with a CHANGETO query, but many relational databases don't support such
an operation. If you have to perform the operation manually, and miss
some entries, you end up with inconsistent information in the database,
the same ID number referring to different descriptions. This is known
as an "Update" anomaly.

Another problem occurs if every person who has Message Call Answering is
transferred to another division, or laid off by the company. When you
delete all references to this piece of Equipment, it disappears from the
system completely! There is now no record that it ever existed, or is a
valid option to use in this system. This is known as a "Delete"
anomaly.


2nd Normal Form

The second "rule" of relational databases is designed to avoid these two
problems. If a field in a table is related to only part of a
multi-field key, remove it to a separate table.

In the example above, the Equipment Description field depends entirely
on the Equipment ID field, part of our two field key. It has no
relationship to the Staff Member field at all. If we split it off into
a third table, the resulting structure looks like this:

Structure Of: STAFF

1 Staff Member A20*
2 Position A20
3 Department A20
4 Department Manager A20


Structure Of: INVENTRY

1 Staff Member A20*
2 Equipment ID A2*


Structure Of: EQUIPMNT

1 Equipment ID A2*
2 Equipment Description A30


In this 2nd Normal Form design, the Equipment descriptions are separated
out into their own table, keyed by Equipment ID. This ID field is also
used in the INVENTRY table as part of the key.

"Update" anomalies are much easier to avoid, since the description of
each equipment item appears once and only once in the whole system. It
does not need to be changed in multiple places. "Delete" anomalies are
avoided completely, since changes to the equipment assigned to each
staff member, and to staff members themselves are insulated from the
approved equipment list.

Notice how our tables are getting narrower, with less duplication of
wide field values. Where duplication is unavoidable, we make it as
small as possible by using the ID number.

The query referred to above is now programmed as follows:

EQUIPMNTEquipment IDEquipment Description
_xxx Message Call Answering

INVENTRYStaff MemberEquipment ID
_yyy _xxx

STAFFStaff MemberDepartment
_yyy Check


This query processes as quickly as the previous one, even though it uses
one more table than before, because the links are established on key
fields. Relational databases are designed and optimized to process
keyed links between tables.


3rd Normal Form

Third normal form is similar to 2nd normal form in that it is designed
to avoid Update and Delete anomalies. But it specifically addresses
relationships in tables which have only one key field.

In the STAFF table described above, both 1st and 2nd normal forms
are satisfied: There are no repeating fields, and no multi-field
key. These is however, one problem with the structure of the table:

Each staff member has a position within the firm. Each staff member
also belongs to a department. But the Department Manager has nothing
to do with the Employee (from a relational database standpoint.) This
field "belongs" to the Department name; it is the person designed to
manage the department. It thus makes sense to split information about
Departments into a separate table.

3rd normal form specifies that if fields do not contribute to a
description of the of the table's key, they should be removed to a
separate table. The STAFF table is thus split up into two tables as
follows:

Structure Of: STAFF

1 Staff Member A20*
2 Position A20
3 Department ID A3


Structure Of: DEPTMENT

1 Department ID A3*
2 Department A20
3 Manager A20


Structure Of: INVENTRY

1 Staff Member A20*
2 Equipment ID A2*


Structure Of: EQUIPMNT

1 Equipment ID A2*
2 Equipment Description A30


(We have introduced a Department ID field for the same reason noted
above; when duplication is necessary, make the duplicated values as
small as possible.)

Our example query now requires all four tables:

EQUIPMNTEquipment IDEquipment Description
_xxx Message Call Answering

INVENTRYStaff MemberEquipment ID
_yyy _xxx

STAFFStaff MemberDepartment ID
_yyy _zzz

DEPTMENTDepartment IDDepartment
_zzz Check


This query can be interpreted as follows:

Using the EQUIPMNT table, find the Equipment ID which corresponds to
"Message Call Answering", then go to the INVENTRY table and find out
which staff members are assigned this piece of equipment. With this
information in hand, go to the STAFF table and find out which
Departments those employees are assigned to, then go to the DEPTMENT
table and extract the names associated with these department ID's.

With 3rd normal form, the information in each table pertains to the
key of that table only. The whole system is contained in four tables
with a total of 10 fields, instead of 1 table with 9 fields. But most
operations are performed on keys, so processing speed is much faster.
We can have as many pieces of equipment for each employee as we need,
with no wasted space for those people who only have 1 or 2 pieces.
Department and Equipment details are ensconsed in their own tables,
secure from Update and Delete anomalies.

Applications in a relational database design have a solid theoretical
underpinning; it is one of the reasons why relational databases are
becoming so popular.


4th Normal Form

Most applications need go no further than 3rd normal form. But there
are some situations where we need to break the information up into an
even finer modularity.

For example, what if we also want to store the speed Dial numbers which
staff members program into their systems, perhaps to send out notices
when such numbers are changed. We might simply add a field to the
INVENTRY table to maintain this information:

Structure Of: INVENTRY

1 Staff Member A20*
2 Equipment ID A2*
3 Speed Dial Numbers A4

Here is some sample data from this table:

Table Name: INVENTRY
Staff MemberEquipment IDSpeed Dial Numbers
1 Adachi, Nancy 04 1000
2 Adachi, Nancy 05 2000
3 Adachi, Nancy 11
4 Adachi, Nancy 27
5 Ahlers, Kristine 05 1000
6 Ahlers, Kristine 12 2000
7 Ahlers, Kristine 25
8 Ahlers, Kristine 28
9 Ahlers, Kristine 30
10 Alsteen, Sheri 01 1000
11 Alsteen, Sheri 04 2000
12 Alsteen, Sheri 05 3000
13 Alsteen, Sheri 4000
14 Alsteen, Sheri 5000
15 Anderson, Dave 04


Notice how the number of equipment items bears no relationship to the
number of speed dial numbers defined. For some employees, there are
more items; for others, there are more numbers.

The table above implies a relationship between "Equipment ID" and
"Speed Dial Number" for each employee. It suggests, for example, that
Nancy Adachi's equipment item "04" is in some way connected to her speed
dial number "1000". In fact, there is no relationship between these two
fields, other than that they are both attributes of the "Staff Member"
field. Each staff member can have multiple equipment items, and can
also have multiple speed dial numbers.

4th normal form requires us to isolate independent multiple
relationships. It specifies that no table should contain two or more
One-Many or Many-Many relationship unless they are also directly related
to each other.

If the additional field we need to store is the date on which that
equipment item was assigned to that staff member, there is a direct
relationship between the two fields, and they logically belong in the
same table. Another way of looking at it is that there is a One-One
relationship between a peice of equipment assigned to a staff member,
and the date it was assigned. There is a One-Many relationship between
staff members and this entity of "equipment-date". The table looks like
this:

Structure Of: INVENTRY

1 Staff Member A20*
2 Equipment ID A2*
3 Date Assigned D


But in the original example using Speed Dial numbers, we need to create
two tables, one for the equipment items assigned to that staff member,
the other to store speed dial numbers.

Structure Of: INV_EQPT

1 Staff Member A20*
2 Equipment ID A2*

Structure Of: INV_SPDL

1 Staff Member A20*
2 Speed Dial Numbers A4*


As a related issue, the situation of having dual detail tables linked to
a Master is one that Paradox does not handle well. Reporting in
particular is difficult since multi-table reports must be tied to one of
the details, and there is no easy way to link in the other detail.

---------------------------- end of paper -----------------------------


 December 9, 2017  Add comments

Leave a Reply