Category : C Source Code
Archive   : CSRC2.ZIP
Filename : DIFF.TXT

 
Output of file : DIFF.TXT contained in archive : CSRC2.ZIP

/*)BUILD $(TKBOPTIONS) = {
TASK = ...DIF
}
*/

#ifdef DOCUMENTATION
title diff Differential File Comparison
index Differential File Comparison

synopsis

diff [option] file1 file2

description

Diff compares two files, showing what must be changed to make
them identical. Either file1 or file2 (but not both) may refer
to directories. If that is the case, a file in the directory
whose name is the same as the other file argument will be used.
The standard input may be used for one of the files by replacing
the argument by "-". Except for the standard input, both files
must be on disk devices.
.s
Options:
.lm +8
.s.i -8;-b Remove trailing whitespace (blanks and tabs)
and compress all other strings of whitespace to a single blank.
.s.i -8;-c Print some context -- matching lines before
and after the non-match section. Mark non-matched sections
with "|".
.s.i -8;-i Ignore lower/upper case distinctions.
.s.i -8;-e Output is in an "editor script" format which
is compatible with the Unix 'ed' editor.
.s.lm -8
All information needed to compare the files is maintained in main
memory. This means that very large files (or fairly large files with
many differences) will cause the program to abort with an "out
of space" message. Main memory requirements (in words) are
approximately:
.s
2 * (length of file1 + length of file2)
.br
+ 3 * (number of changes)
.s
(Where "length" is the number of lines of data in each file.)
.s
The algorithm reads each file twice, once to build hash tables and once
to check for fortuitous matches (two lines that are in fact different,
but which have the same hash value). CPU time requirements include
sorting the hash tables and randomly searching memory tables for
equivalence classes. For example, on a time-shared VAX-11/780,
comparing two 1000 line files required about 30 seconds (elapsed
clock time) and about 10,000 bytes of working storage. About 90
per-cent of the time was taken up by file i/o.

diagnostics

.lm +8
.s.i -8;Warning, bad option 'x'
.s
The option is ignored.
.s.i -8;Usage ...
.s
Two input files were not specified.
.s.i -8;Can't open input file "filename". Can't continue.
.s.i -8;Out of space
.s
The program ran out of memory while comparing the two files.
.s.i -8;Can't read line nnn at xxx in file[A/B]
.s
This indicates an I/O error when seeking to the specific line.
It should not happen.
.s.i -8;Spurious match, output is not optimal.
.s
Two lines that were different yielded the same hash value. This is
harmless except that the difference output is not the minimum set of
differences between the two files. For example, instead of the output:
.s
lines 1 to 5 were changed to ...
.s
the program will print
.s
lines 1 to 3 were changed to ...
.br
lines 4 to 5 were changed to ...
.s
The program uses a CRC16 hash code. The likelihood of this error is
quite small.
.lm -8

author

The diff algorithm was developed by J. W. Hunt and M. D. McIlroy,
using a central algorithm defined by H. S. Stone.
It was published in:
.s.lm +4.nf
Hunt, J. W., and McIlroy, M. D.,
An Algorithm for Differential File Comparison,
Computing Science Technical Report #41,
Bell Laboratories, Murray Hill, NJ 07974
.s.lm -4.f

bugs

On RSX systems, diff may fail if the both
files are not "variable-length, implied carriage control"
format. The scopy program can be used to convert files
to this format. This may happen if one of the files has
been copied to a VMS V3.0 system from a system, such as RSTS/E,
that supports "stream format" files.

#endif

/*
* Diff maintains all information needed to compare the two files in main
* memory. This means that very large files (or fairly large files with
* many differences) will cause the program to abort with an "out of space"
* error. Main memory requirements (in words) are approximately:
*
* 2 * (length of file1 + length of file2) + (3 * number of changes)
*
* The diff algorithm reads each file twice (once to build hash tables and
* a second time to check for fortuitous matches), then reads the differences
* by seeking randomly within the files. CPU time requirements include
* sorting the two hash vectors and randomly searching memory tables for
* equivalence classes. For example, running in Vax compatibility
* mode, two 1000 line files with a fair number of differences took
* about 25 seconds (elapsed wall clock time) for processing. Most of this
* time was spent in the file read routines. This test required slightly
* more than 6000 words of memory for internal tables.
*
* The diff algorithm was developed by J. W. Hunt and M. D. McIlroy,
* using a central algorithm defined by H. S. Stone. The algorithm
* was described in:
*
* Hunt, J. W., and McIlroy, M. D.,
* An Algorithm for Differential File Comparison,
* Computing Science Technical Report #41,
* Bell Laboratories, Murray Hill, NJ 07974
*
* The following description is summarized from that document. While
* it has been slightly modified to correspond to the program source, the
* algorithm is essentially identical.
*
* 1. Read the input files, building two vectors containing the
* line number (serial) and hash value (hash) of each line.
* Data for fileA will be in a vector pointed to by fileA[],
* while data for fileB will be pointed to by fileB[]. The
* lengths (number of lines) of the files will be represented
* by lenA and lenB respectively. [This is slightly different
* from the published algorithm.]
*
* 2. Note initial and final sequences that have identical
* hash values to shorten subsequent processing. Note that
* the "jackpot" phase (step 9.) will examine all lines in
* the file. Next, sort each file using hash as the primary
* key and serial as the secondary key.
*
* 3. Construct an array of equivalence classes (member[1..lenB])
* where each element contains the line number in fileB and a
* flag which is True if the indicated line is the first member
* of an equivalence class. (An equivalence class is a set of
* lines which all hash to the same value. The lines themselves
* are not necessarily identical.)
*
* 4. Construct an array, class[1..lenA], where each element, I, is set to
* the index of a line, J, in fileB if member[J] is the first
* element in an equivalence class and the hash code of line[I] in
* fileA is the same as the hash code of line[J] in fileB. Class[I]
* is set to zero if no such line exists.
*
* If non-zero, class[I] now points in member[] to the beginning of
* the class of lines in fileB equivalent to line[I] in fileA.
*
* The next two steps implement the longest common subsequence algorithm.
*
* 5. Structure CANDIDATE { a, b, previous }, where a and b are line
* numbers and previous a reference to a candidate, will store
* candidate lists as they are constructed.
*
* Vector clist[] stores references to candidates. It is dimensioned
* (0..min(lenA, lenB) + 1)
*
* Initialize
* clist[0] = CANDIDATE { 0, 0, -1 };
* clist[1] = CANDIDATE { A+1, B+1, -1 };
* ktop = 1;
*
* clist[1] is a fence beyond the last usefully filled element
* and -1 is an out-of-range clist index. Ktop is the index of the
* fence. Note, because of the memory allocation used, clist[]
* is actually composed of two vectors, clist[] containing the
* candidate reference, and klist[] containing pointers to clist.
*
* 6. For (A = 1 to lenA) {
* I = class[A]; -- the index in member[]: beginning of
* -- the class of lines in fileB equivalent
* -- to this line in fileA.
* if (I is non-zero) {
* Merge each member into the candidate list
* as discussed below.
* }
*
* Unravel the chain of candidates, getting a vector of common subsequences:
*
* 7. Set all elements of match[0..lenA] to zero.
*
* 8. clist[ktop-1] points to the subsequence chain head. For each element
* in the chain, let A and B be the line number entries. Then, set
*
* match[A] = B;
*
* The non-zero elements of match[] now pick out a longest common
* subsequence chain, possibly including spurious matches due to
* hash coincidences. The pairings between the two files are:
*
* if (match[A] is non-zero) {
* line A in fileA matches line match[A] in fileB;
* }
*
* Now, read each line of fileA and fileB to check for jackpots:
*
* 9. for (A = 1 to lenA) {
* if (match[A] is nonzero) {
* if (fileA[A] is not identical to fileB[match[A]])
* match[A] = 0; -- Hash congruence
* }
* }
*
* Ignoring "squish" complications, the merge step may be defined as follows:
*
* Entry:
* clist[] Candidate pointer array
* ktop Fence beyond last filled index
* A Current index in fileA
* member[] Equivalence vector
* I The index in member[] of the first element
* of the class of lines in fileB that are
* equivalent to line[A] in fileA.
*
* 1. Let clist[R] be "an r-candidate" and C a reference to
* the last candidate found, which will always be an r-candidate.
* clist[R] will be updated with this reference once the previous
* value of clist[R] is no longer needed. Initialize:
*
* R = 0;
* C = clist[0];
*
* 2. Do steps 3 through 6 repeatedly:
*
* 3. set B to the line number in member[I];
* search clist[R..ktop] for an element, clist[S], such that
*
* clist[S-1].b < B and clist[S].b >= B
*
* Note that clist[] is ordered on clist[].b so that binary
* search will work. The search algorithm used requires the
* two "fence" entries described above.
*
* If such an element is found, perform steps 4. and 5.
*
* 4. Extend the longest common subsequence chain:
*
* If (clist[S].b > B) {
* clist[R] = C;
* R = S;
* C = candidate(A, B, clist[S - 1]);
* }
*
* 5. Extend the number of subsequences, moving the fence:
*
* If (S == ktop) {
* clist[ktop + 1] = clist[ktop]
* ktop = ktop + 1;
* break out of step 2's loop;
* }
*
* 6. I = I + 1;
* if (member[I] is the first element in another class) {
* break out of step 2's loop;
* }
* else {
* continue at step 2.
* }
*
* 7. clist[R] = C;
* exit merge subroutine.
*
* To illustrate vector contents, here is a sample run:
*
* File A:
* line 1
* line 2
* line 3
* line 4
* line 5 gets deleted
* line 6
*
* File B:
* line 1
* line 1.5 inserted
* line 2
* line 3 changed
* line 4
* line 6
*
* (For clarity, the "squish" step is omitted from the following)
*
* On entry to equiv() (after readin and sorting), the file[] vector is
* as follows (the first entry in each pair is the line number, the
* second is the hash value. Entries are sorted on hash value):
*
* FileA[] (1..lines in fileA):
* line hash
* 3 042400 6 043300 5 050026 1 102201 2 102701 4 103501
* FileB[] (1..lines in fileB):
* 6 043300 2 045600 1 102201 3 102701 5 103501 4 147166
*
*
* After Equiv has processed file[]:
*
* FileA[] (1..lines in fileA):
* line value
* 3 0 6 1 5 0 1 3 2 4 4 5
* Member[] (0..lines in fileB)
* 0 -6 -2 -1 -3 -5 -4
*
*
* After unsort() has unwound fileB:
*
* Class[] (1 .. lines in fileA):
* 3 4 0 5 0 1
*
* Within unravel(), match is built in the following order:
*
* match[6] := 6
* match[4] := 5
* match[2] := 3
* match[1] := 1
*
* Match[] (0 .. lines in fileA):
*
* 0 1 3 0 5 0 6
*
* Output is as follows:
*
* 1a2
* > line 1.5 inserted
* 3c4
* < line 3
* ---
* > line 3 changed
* 5d5
* < line 5 gets deleted
*
*
*/


  3 Responses to “Category : C Source Code
Archive   : CSRC2.ZIP
Filename : DIFF.TXT

  1. Very nice! Thank you for this wonderful archive. I wonder why I found it only now. Long live the BBS file archives!

  2. This is so awesome! 😀 I’d be cool if you could download an entire archive of this at once, though.

  3. But one thing that puzzles me is the “mtswslnkmcjklsdlsbdmMICROSOFT” string. There is an article about it here. It is definitely worth a read: http://www.os2museum.com/wp/mtswslnk/