Dec 212017
 
A text file that explains how to eliminate duplicate name+address records when using dBase III.
File DUPCHK.ZIP from The Programmer’s Corner in
Category Dbase Source Code
A text file that explains how to eliminate duplicate name+address records when using dBase III.
File Name File Size Zip Size Zip Type
DUPCHK.TXT 12801 4861 deflated

Download File DUPCHK.ZIP Here

Contents of the DUPCHK.TXT file


Duplicate Checking

by Joe Stolz

Duplicate-checking programs play an important part in most database
management systems. In the simplest example, one of a mailing list,
duplicate entries of names lead to multiple copies of the same mailing
reaching the same house. This looks unprofessional and frustrates the
residents. Also in this era of rising postage costs, we should try to
economize and eliminate duplicates in our databases.

Duplicate-checking programs are a crucial part of many dBASE systems.
Grouped report programs are based on "duplicate checking" of a key
field. When the group key changes, a total is printed. Let's look at
some simple routines for eliminating duplicates.

Using UNIQUE

One simple solution to duplicate checking programs is to pretend they
don't exist. To avoid the necessity of running a duplicate-checking
program but to still remove the duplicate records from the file, we
could simply INDEX the file on a key field using the UNIQUE option,
and COPY the indexed file to a new file! This will allow only the
first record in a duplicate series to copy over to the new file. This
is, however, truly "quick and dirty." Typographical errors in the key
field will become very significant, and dBASE III PLUS will not treat
the records as unique. Even the slightest discrepancy between similar
records will cause a "nonexact duplicate" to be included.

A "nonexact duplicate" is defined here as a record that is an actual
duplicate of another in the database, but which contains a minor
typographical or spelling difference, in one or more fields. These
differences render the two records unique, even though they are truly
duplicates with regard to their information.

A Simple Routine

The simplest of all duplicate-checking programs simply stores
information about the first record in the file, skips to the next
record, and tests the next record for equivalency. In the simplest
case, we would index on Lastname, then run Dupcheck.PRG. We won't be
worrying about nonexact duplicates here.

Dupcheck.PRG
* ---Simple duplicate-checking program
USE FILE
INDEX ON Lastname TO Dups
DO WHILE .NOT. EOF()
* ---save the current record's key field
mlastname = Lastname
* ---move to the next record in the index
SKIP
* ---if second matches first they are dups
IF mlastname = lastname
DELETE NEXT 1
DISPLAY TO PRINT
ENDIF
ENDDO
*EOP: Dupcheck.PRG

This program uses the data file, indexes it on the last name so that
all identical last names are grouped together, and then walks through
the file sequentially. The current record's last name field is stored
to a memory variable and then the record pointer is moved to the next
record. If the key fields match, the IF loop takes effect and the
match is deleted and printed. Then the process is repeated. If the
match is true, the key fields are exactly the same and so re-storing
the key will have no effect on testing the third record of a series,
and so on. The simplicity of the program makes it compelling. When
duplicates are found, we only mark them for deletion. And then, we
also print them for further verification. Later, we can PACK the
database and truly eliminate them.

Typographical Errors

However, there are some considerations here. First and foremost is the
key field. The above example used the Lastname field. This is
dangerous since there may be many different people with the same last
name in the database whose records are not duplicates. We could,
therefore, index on Lastname + Firstname. This should work for most
occasions. However two factors come into play to complicate matters:
typographical errors (the nonexact duplicate problem) and long key
expressions.

It turns out that the longer the field length, the more prone the data
is to typographical errors. The name and address fields will be
especially problematic since these fields have the longest widths in
most databases. Our duplicate checking program will check for exact
matches in every single character of the field. If the 30th character
is even slightly different, the records will not match even though 29
preceding characters do match.

Further, it's a great insult to misspell a person's name. It suggests
the lack of the personal touch that personally addressed letters are
supposed to convey. If the street address is slightly wrong however,
no insult occurs, and the mail might still get delivered to the
correct address! This may not be professional, but the job gets done
with minimal personal affront. We want first and foremost that the
name be correctly spelled. By merely testing the Lastname field alone,
we could easily allow a mildly misspelled duplicate to slip by. dBASE
III PLUS will determine that "Smithenberg" does not equal
"Smithenbergg", no duplicate will be marked, and the same address will
get two mailings.

But, if we don't compare Lastnames at all, we may allow misspelled
names into our mailing list. We need to compare names to make sure
that they are all truly unique, even if they are misspelled. The
Soundex function (discussed later) will allow us to work around these
sort of misspelling problems, by handling nonexact comparisons of
names.

Nonexact Duplicates

Perhaps we can test for duplicates using a different technique? We
have two tasks to tackle:

Testing for nonexact duplicates

Verifying that the data is not misspelled and is accurate.

Combining key fields may give us a more refined test for the first
problem, nonexact duplicates.

Combining the Lastname with the Firstname field could yield a good
test key. However, a husband may have submitted one data-entry form
and his wife may submit a second one. Once these forms have been
entered, our combined index key Firstname + Lastname will not register
a duplicate between these two records, but two mailings will reach the
same home. A better key will incorporate the Lastname and the Address.
Presumably, if the last name and the address are identical, the
records are identical and only one mailing is necessary. If the
address is the same for two records and the names are different, then
two mailings probably should be sent anyway. In fact, this is a better
way to test.
We could also test on a small portion of Lastname and a small portion
of
Address. This will allow us to test fewer characters in both fields
(minimizing the chances of typographical error) and yet the key
expression will be unique for a particular address. Perhaps combining
the Lastname and Zip fields can also lead to accurate duplicate
identification in some cases. Some suggestions are to:

INDEX ON SUBSTR(Lastname, 1, 5) + SUBSTR(Address, 1, 6)

or

INDEX ON SUBSTR(Lastname, 1, 5) + Zip

Note: See the November 1986 issue of TechNotes for indexing on the
street portion of an address as an alternate for the above
expressions.

Soundex

One way to check for "nonexact duplicates," in name fields especially,
is to use Soundex. The classic treatment of Soundex is found in The
Art of Computer Programming, Volume 3 by Donald E. Knuth (page 392,
"Sorting and Searching"). This algorithm converts names to a numeric
equivalent.

Names that sound alike (for example Smith and Smyth) will yield an
equivalent Soundex value. Also, only a portion of the letters in a
name are evaluated for Soundex values. Small typos will often have no
effect on a Soundex value. Even very dissimilar names will evaluate to
numbers that are not far from each other in value. We could test
Soundex values and if they differ by only 1 or 2, consider the names
similar enough to mark them as duplicates, for further evaluation.

Soundex tools for use with dBASE files are available from a number of
sources. Volume one of the dBASE Programmer's Utilities (Ashton-Tate,
1985) includes Soundex.BIN. We can use this utility to REPLACE a field
with the Soundex value of the name. Then we can INDEX on the Soundex
value instead of the name field itself. Misspellings, even gross ones,
can often be deftly handled with this tool.

Verifying Duplicates

We mentioned that we have two problemsnonexact duplicates and data
verification. We have discussed techniques to solve the first, but how
do we solve the second one? Unfortunately, computers cannot be as
smart (in many ways) as human beings. A large part of data
verification must be done manually. This is why our program prints-out
the records marked for deletion so that they can be verified before
the file is PACKed.

Records will test as duplicates and will be marked for deletion; yet,
they may not necessarily be the ones you will want to delete.
Remember, when dBASE III PLUS indexes records, it orders them using
the key expression. If two records have the same key value, the record
appearing first in the index will be the one that appears physically
first in the file (the one having the lowest RECNO() value). This
record is often the preferred one and, since it comes first in the
file, it is most likely to have been kept up-to-date.

Perhaps we have three duplicate copies of a person's record in our
file. We may need to resolve all three into one single, accurate
record by comparing the three visually. A computer could never see
that two of the three records have the same Zip whereas the third has
a different one. Perhaps the first record in the series reflects the
most up-to-date balance due. Care must be taken when consolidating
records and human judgment is often necessary. Discrepancies between
seemingly duplicate records can be owing to a new address, prompting
entry of a new record. This, coupled with misspellings in names and
addresses and errors such as typos in Zip codes, are the sort of
errors that require human resolution. Fortunately, these types of
errors are random and rare; our program will help to eliminate many or
at least bring them to our attention.

Applications

Duplicate-checking programs are the engines employed in many types of
programs beyond mailing-list management.

Grouped reports check a series of records for a duplicate key fields.
Take the example of a general ledger program. To determine the
expenses for a particular account number, we index the file on the
Accountno field.

Once the file is in order we need to walk through the file and
accumulate a total as long as the account number does not change. Here
is a sample of this type of program:

DO WHILE .NOT. EOF()
* ---Initialize the variables
Macctno = Accountno
* ---totmount is at least Amount number of dollars
totmount = Amount
SKIP
* ---If the Account series continues,
IF Macctno = Accountno
* ---Accumulate the total
totmount = totmount + Amount
ELSE
* ---Else the series stopped, so print
Do Printrtn
ENDIF
ENDDO

This routine accumulates the amount field to a variable Totmount and
prints the total when the series of records having the same Accountno
changes. Then it moves on to the next account number series.

Perhaps our records are indexed both by Accountno and by date. This
way we can produce totals for a particular account, for any date
range. We can ask the user to enter the date range to apply to the
report. Our records are in increasing date order, so we may have to
skip over a series of records having dates too low for the date range
desired. The program code (see Figure 1) that accomplishes this is
slightly more complicated

A carefully designed duplicate-checking program can be used to test
your whole data-entry system. You can test the accuracy of your data
entry, the source of the data, and the number of duplicates entered.
Such testing could help in determining the source of the duplicates in
the first place, which, in turn, might lead to their prevention.

Figure 1

DO WHILE .NOT. EOF()
* ---Initialize the variables
Macctno = Accountno
* ---There may be no records in the range
totmount = 0
* ---This loop moves us through each Accountno range.
* ---Lowdate = the low date of the range
DO WHILE Datefield < Lowdate .AND. Macctno = Accountno .AND. .NOT. EOF()
* ---Skips us to the right range or to the next account
SKIP
ENDDO
* ---If the Account series continues in the
* ---right date range,
DO WHILE Macctno = Accountno .AND. Datefield < hidate
* ---Accumulate the total
totmount = totmount + Amount
SKIP
ENDDO
* ---the series stopped, so print
Do Printrtn
ENDDO


 December 21, 2017  Add comments

Leave a Reply