Contents of the HUFFMAN2.DOC file
HUFFMAN2.BAS and DECODER.BAS
By Rich Geldreich 1992
The two QuickBASIC 4.5 programs you have just received can
compress and decompress files with the Huffman compression algorithm.
The first program HUFFMAN2.BAS, takes an input file and compresses it
to OUTPUT.HUF. The second program, DECODER.BAS, takes OUTPUT.HUF and
decompresses it a file specified via the command line. The two
programs should be compiled for maximum speed.
To compress EATME.DOC, use:
To decompress the compressed file to EATME2.DOC, use:
I have released these two programs to show that Huffman
compression isn't really that hard to implement. What is Huffman
compression? It's a simple compression algorithm which takes
advantage of a file's character distribution: if "A" is used more than
"B", for instance, then the code representing "A" is significantly
smaller than the code that represents "B". It's not the best, but it
isn't that bad.
Lets say we want to compress the string "ABCDDBBAAAADD". First,
we must scan the string and tally up the count for each character:
A is used 5 times
B is used 3 times
C is used 1 time
D is used 4 times
Now that we have this distribution, we must start making a binary
tree that represents all of these characters; like this:
(* = not used)
5 3 1 4
/ \ / \ / \ / \
A * B * C * D *
Now we must start combining "nodes" until there is only one node
at the top. To do this we keep on combining 2 nodes which have the
1 and 3 have the lowest frequency:
5 4 4
/ \ / \ / \
A * B C D *
4 and 4 have the lowest frequency:
/ \ / \
A * D 4
And finally, 5 and 8 have the lowest frequency:
To represent this tree, both programs use a linked list. To
represent each node, the Father() array is used. To represent the
left and right arrows LeftSon() and RightSon() arrays are used.
No that we have the combined tree, we must travel down the tree
and find out which bit sequence represents each character we wish to
compress. (A recursive procedure is used to do this.) We start at the
top, and every time we go left we send a "1" and every time we go
right we send a "0". So "A" will be represented with "1", "B" with
"001", "C" with "000", & "D" with "01". Notice that the character
used most in the original string, "A", has the smallest code size, and
the character used least, "C", has the biggest code size.
A = 1 [used in 38% of the string]
B = 001 [used in 23%]
C = 000 [used in 7%]
D = 01 [used in 31%]
"ABCDDBBAAAADD" will be compressed as:
1 001 000 01 01 001 001 1 1 1 1 01 01
10010000 10100100 11111010 1
...so the original 13 character sequence has been compressed to
just 4 bytes.
To decompress this 4 byte sequence, the decoder must have the
encoding tree handy(which must be send separately).
The encoding tree was:
The output was:
10010000 10100100 11111010 1
Start at the top. The first bit is "1"- so go left. The first
character is then "A". So far so good! Now we go back to the top and
start all over...
The second bit is "0"- so go right. We're now at "8".
The third bit is "0"- so go right again. We're now at "4".
The fourth bit is "1"- so go left. We're now at "B". "B" is the
Keep on repeating this until all of the characters have been
That's all! Have fun. The two QB programs are in the public
domain so do what you want with 'em. If you have any questions or
cool ideas, I can be contacted at the following address:
410 Market St.
Gloucester City, NJ 08030
May 29th, 1992