Dec 182017

Discussion of compresion used by LHARC. | |||
---|---|---|---|

File Name | File Size | Zip Size | Zip Type |

CMPRSN.DOC | 13583 | 4959 | deflated |

# Download File CMPRSN.ZIP Here

## Contents of the CMPRSN.DOC file

April 2, 1989

------------------------------------------------------------------------

Data Compression Algorithms of LARC and LHarc

Haruhiko Okumura*

* The author is the Sysop of the Science SIG of PV-VAN. His

address is: 12-2-404 Green Heights, 580 Nagasawa, Yokosuka

239, Japan

------------------------------------------------------------------------

1. Introduction

In the spring of 1988, I wrote a very simple data compression program

named LZSS in C language, and uploaded it to the Science SIG (forum)

of PC-VAN, Japan's biggest personal computer network.

That program was based on Storer and Szymanski's slightly modified

version of one of Lempel and Ziv's algorithms. Despite its simplic-

ity, for most files its compression outperformed the archivers then

widely used.

Kazuhiko Miki rewrote my LZSS in Turbo Pascal and assembly language,

and soon made it evolve into a complete archiver, which he named

LARC.

The first versions of LZSS and LARC were rather slow. So I rewrote

my LZSS using a binary tree, and so did Miki. Although LARC's

encoding was slower than the fastest archiver available, its decoding

was quite fast, and its algorithm was so simple that even self-

extracting files (compressed files plus decoder) it created were

usually smaller than non-self-extracting files from other archivers.

Soon many hobby programmers joined the archiver project at the forum.

Very many suggestions were made, and LARC was revised again and

again. By the summer of 1988, LARC's speed and compression have

improved so much that LARC-compressed programs were beginning to be

uploaded in many forums of PC-VAN and other networks.

In that summer I wrote another program, LZARI, which combined the

LZSS algorithm with adaptive arithmetic compression. Although it was

slower than LZSS, its compression performance was amazing.

Miki, the author of LARC, uploaded LZARI to NIFTY-Serve, another big

information network in Japan. In NIFTY-Serve, Haruyasu Yoshizaki

replaced LZARI's adaptive arithmetic coding with a version of

adaptive Huffman coding to increase speed. Based on this algorithm,

which he called LZHUF, he developed yet another archiver, LHarc.

In what follows, I will review several of these algorithms and supply

simplified codes in C language.

2. Simple coding methods

Replacing several (usually 8 or 4) "space" characters by one "tab"

character is a very primitive method for data compression. Another

simple method is run-length coding, which encodes the message

"AAABBBBAACCCC" into "3A4B2A4C", for example.

3. LZSS coding

This scheme is initiated by Ziv and Lempel [1]. A slightly modified

version is described by Storer and Szymanski [2]. An implementation

using a binary tree is proposed by Bell [3]. The algorithm is quite

simple: Keep a ring buffer, which initially contains "space"

characters only. Read several letters from the file to the buffer.

Then search the buffer for the longest string that matches the

letters just read, and send its length and position in the buffer.

If the buffer size is 4096 bytes, the position can be encoded in 12

bits. If we represent the match length in four bits, the length> pair is two bytes long. If the longest match is no more than

two characters, then we send just one character without encoding, and

restart the process with the next letter. We must send one extra bit

each time to tell the decoder whether we are sending a length> pair or an unencoded character.

The accompanying file LZSS.C is a version of this algorithm. This

implementation uses multiple binary trees to speed up the search for

the longest match. All the programs in this article are written in

draft-proposed ANSI C. I tested them with Turbo C 2.0.

4. LZW coding

This scheme was devised by Ziv and Lempel [4], and modified by Welch

[5].

The LZW coding has been adopted by most of the existing archivers,

such as ARC and PKZIP. The algorithm can be made relatively fast,

and is suitable for hardware implementation as well.

The algorithm can be outlined as follows: Prepare a table that can

contain several thousand items. Initially register in its 0th

through 255th positions the usual 256 characters. Read several

letters from the file to be encoded, and search the table for the

longest match. Suppose the longest match is given by the string

"ABC". Send the position of "ABC" in the table. Read the next

character from the file. If it is "D", then register a new string

"ABCD" in the table, and restart the process with the letter "D". If

the table becomes full, discard the oldest item or, preferably, the

least used. A Pascal program for this algorithm is given in Storer's

book [6].

5. Huffman coding

Classical Huffman coding is invented by Huffman [7]. A fairly

readable accound is given in Sedgewick [8]. Suppose the text to be

encoded is "ABABACA", with four A's, two B's, and a C. We represent

this situation as follows:

4 2 1

| | |

A B C

Combine the least frequent two characters into one, resulting in the

new frequency 2 + 1 = 3:

4 3

| / \

A B C

Repeat the above step until the whole characters combine into a tree:

7

/ \

/ 3

/ / \

A B C

Start at the top ("root") of this encoding tree, and travel to the

character you want to encode. If you go left, send a "0"; otherwise

send a "1". Thus, "A" is encoded by "0", "B" by "10", "C" by "11".

Algotether, "ABABACA" will be encoded into ten bits, "0100100110".

To decode this code, the decoder must know the encoding tree, which

must be sent separately.

A modification to this classical Huffman coding is the adaptive, or

dynamic, Huffman coding. See, e.g., Gallager [9]. In this method,

the encoder and the decoder processes the first letter of the text as

if the frequency of each character in the file were one, say. After

the first letter has been processed, both parties increment the

frequency of that character by one. For example, if the first letter

is 'C', then freq['C'] becomes two, whereas every other frequencies

are still one. Then the both parties modify the encoding tree

accordingly. Then the second letter will be encoded and decoded, and

so on.

6. Arithmetic coding

The original concept of arithmetic coding is proposed by P. Elias.

An implementation in C language is described by Witten and others

[10].

Although the Huffman coding is optimal if each character must be

encoded into a fixed (integer) number of bits, arithmetic coding wins

if no such restriction is made. As an example we shall encode "AABA"

using arithmetic coding. For simplicity suppose we know beforehand

that the probabilities for "A" and "B" to appear in the text are 3/4

and 1/4, respectively.

Initially, consider an interval:

0 <= x < 1

Since the first character is "A" whose probability is 3/4, we shrink

the interval to the lower 3/4:

0 <= x < 3/4

The next character is "A" again, so we take the lower 3/4:

0 <= x < 9/16

Next comes "B" whose probability is 1/4, so we take the upper 1/4:

27/64 <= x < 9/16

because "B" is the second element in our alphabet, {A, B}. The last

character is "A" and the interval is:

27/64 <= x < 135/256

which can be written in binary notation:

0.011011 <= x < 0.10000111

Choose from this interval any number that can be represented in

fewest bits, say 0.1, and send the bits to the right of "0."; in this

case we send only one bit, "1". Thus we have encoded four letters

into one bit! With the Huffman coding, four letters could not be

encoded into less than four bits.

To decode the code "1", we just reverse the process: First, we supply

the "0." to the right of the received code "1", resulting in "0.1" in

binary notation, or 1/2. Since this number is in the first 3/4 of

the initial interval 0 <= x < 1, the first character must be "A".

Shrink the interval into the lower 3/4. In this new interval, the

number 1/2 lies in the lower 3/4 part, so the second character is

again "A", and so on. The number of letters in the original file

must be sent separately (or a special 'EOF' character must be ap-

pended at the end of the file).

The algorithm described above requires that both the sender and

receiver know the probability distribution for the characters. The

adaptive version of the algorithm removes this restriction by first

supposing uniform or any agreed-upon distribution of characters that

approximates the true distribution, and then updating the

distribution after each character is sent and received.

7. LZARI

In each step the LZSS algorithm sends either a character or a

pair. Among these, perhaps character "e" appears

more frequently than "x", and a pair of length 3

might be commoner than one of length 18, say. Thus, if we encode the

more frequent in fewer bits and the less frequent in more bits, the

total length of the encoded text will be diminished. This

consideration suggests that we use Huffman or arithmetic coding,

preferably of adaptive kind, along with LZSS.

This is easier said than done, because there are many possible

combinations. Adaptive compression must keep

running statistics of frequency distribution. Too many items make

statistics unreliable.

What follows is not even an approximate solution to the problem posed

above, but anyway this was what I did in the summer of 1988.

I extended the character set from 256 to three-hundred or so in size,

and let characters 0 through 255 be the usual 8-bit characters,

whereas characters 253 + n represent that what follows is a position

of string of length n, where n = 3, 4 .... These extended set of

characters will be encoded with adaptive arithmetic compression.

I also observed that longest-match strings tend to be the ones that

were read relatively recently. Therefore, recent positions should be

encoded into fewer bits. Since 4096 positions are too many to encode

adaptively, I fixed the probability distribution of the positions "by

hand." The distribution function given in the accompanying LZARI.C

is rather tentative; it is not based on thorough experimentation. In

retrospect, I could encode adaptively the most significant 6 bits,

say, or perhaps by some more ingenious method adapt the parameters of

the distribution function to the running statistics.

At any rate, the present version of LZARI treats the positions rather

separately, so that the overall compression is by no means optimal.

Furthermore, the string length threshold above which strings are

coded into pairs is fixed, but logically its value

must change according to the length of the pair we

would get.

8. LZHUF

LZHUF, the algorithm of Haruyasu Yoshizaki's archiver LHarc, replaces

LZARI's adaptive arithmetic coding with adaptive Huffman. LZHUF

encodes the most significant 6 bits of the position in its 4096-byte

buffer by table lookup. More recent, and hence more probable,

positions are coded in less bits. On the other hand, the remaining 6

bits are sent verbatim. Because Huffman coding encodes each letter

into a fixed number of bits, table lookup can be easily implemented.

Though theoretically Huffman cannot exceed arithmetic compression,

the difference is very slight, and LZHUF is fairly fast.

The LZHUF.C file was written by Yoshizaki. I translated the comments

into English and made a few trivial changes to make it conform to the

ANSI C standard.

References

[1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).

[2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951

(1982).

[3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).

[4] J. Ziv and A. Lempel, IEEE Trans. IT-24, 530-536 (1978).

[5] T. A. Welch, Computer, 17, No.6, 8-19 (1984).

[6] J. A. Storer, Data Compression: Methods and Theory

(Computer Science Press, 1988).

[7] D. A. Huffman, Proc IRE 40, 1098-1101 (1952).

[8] R. Sedgewick, Algorithms, 2nd ed. (Addison-Wesley, 1988).

[9] R. G. Gallager, IEEE Trans. IT-24, 668-674 (1978).

[10] I. E. Witten, R. M. Neal, and J. G. Cleary, Commun. ACM

30, 520-540 (1987).

- end -

------------------------------------------------------------------------

Data Compression Algorithms of LARC and LHarc

Haruhiko Okumura*

* The author is the Sysop of the Science SIG of PV-VAN. His

address is: 12-2-404 Green Heights, 580 Nagasawa, Yokosuka

239, Japan

------------------------------------------------------------------------

1. Introduction

In the spring of 1988, I wrote a very simple data compression program

named LZSS in C language, and uploaded it to the Science SIG (forum)

of PC-VAN, Japan's biggest personal computer network.

That program was based on Storer and Szymanski's slightly modified

version of one of Lempel and Ziv's algorithms. Despite its simplic-

ity, for most files its compression outperformed the archivers then

widely used.

Kazuhiko Miki rewrote my LZSS in Turbo Pascal and assembly language,

and soon made it evolve into a complete archiver, which he named

LARC.

The first versions of LZSS and LARC were rather slow. So I rewrote

my LZSS using a binary tree, and so did Miki. Although LARC's

encoding was slower than the fastest archiver available, its decoding

was quite fast, and its algorithm was so simple that even self-

extracting files (compressed files plus decoder) it created were

usually smaller than non-self-extracting files from other archivers.

Soon many hobby programmers joined the archiver project at the forum.

Very many suggestions were made, and LARC was revised again and

again. By the summer of 1988, LARC's speed and compression have

improved so much that LARC-compressed programs were beginning to be

uploaded in many forums of PC-VAN and other networks.

In that summer I wrote another program, LZARI, which combined the

LZSS algorithm with adaptive arithmetic compression. Although it was

slower than LZSS, its compression performance was amazing.

Miki, the author of LARC, uploaded LZARI to NIFTY-Serve, another big

information network in Japan. In NIFTY-Serve, Haruyasu Yoshizaki

replaced LZARI's adaptive arithmetic coding with a version of

adaptive Huffman coding to increase speed. Based on this algorithm,

which he called LZHUF, he developed yet another archiver, LHarc.

In what follows, I will review several of these algorithms and supply

simplified codes in C language.

2. Simple coding methods

Replacing several (usually 8 or 4) "space" characters by one "tab"

character is a very primitive method for data compression. Another

simple method is run-length coding, which encodes the message

"AAABBBBAACCCC" into "3A4B2A4C", for example.

3. LZSS coding

This scheme is initiated by Ziv and Lempel [1]. A slightly modified

version is described by Storer and Szymanski [2]. An implementation

using a binary tree is proposed by Bell [3]. The algorithm is quite

simple: Keep a ring buffer, which initially contains "space"

characters only. Read several letters from the file to the buffer.

Then search the buffer for the longest string that matches the

letters just read, and send its length and position in the buffer.

If the buffer size is 4096 bytes, the position can be encoded in 12

bits. If we represent the match length in four bits, the

two characters, then we send just one character without encoding, and

restart the process with the next letter. We must send one extra bit

each time to tell the decoder whether we are sending a

The accompanying file LZSS.C is a version of this algorithm. This

implementation uses multiple binary trees to speed up the search for

the longest match. All the programs in this article are written in

draft-proposed ANSI C. I tested them with Turbo C 2.0.

4. LZW coding

This scheme was devised by Ziv and Lempel [4], and modified by Welch

[5].

The LZW coding has been adopted by most of the existing archivers,

such as ARC and PKZIP. The algorithm can be made relatively fast,

and is suitable for hardware implementation as well.

The algorithm can be outlined as follows: Prepare a table that can

contain several thousand items. Initially register in its 0th

through 255th positions the usual 256 characters. Read several

letters from the file to be encoded, and search the table for the

longest match. Suppose the longest match is given by the string

"ABC". Send the position of "ABC" in the table. Read the next

character from the file. If it is "D", then register a new string

"ABCD" in the table, and restart the process with the letter "D". If

the table becomes full, discard the oldest item or, preferably, the

least used. A Pascal program for this algorithm is given in Storer's

book [6].

5. Huffman coding

Classical Huffman coding is invented by Huffman [7]. A fairly

readable accound is given in Sedgewick [8]. Suppose the text to be

encoded is "ABABACA", with four A's, two B's, and a C. We represent

this situation as follows:

4 2 1

| | |

A B C

Combine the least frequent two characters into one, resulting in the

new frequency 2 + 1 = 3:

4 3

| / \

A B C

Repeat the above step until the whole characters combine into a tree:

7

/ \

/ 3

/ / \

A B C

Start at the top ("root") of this encoding tree, and travel to the

character you want to encode. If you go left, send a "0"; otherwise

send a "1". Thus, "A" is encoded by "0", "B" by "10", "C" by "11".

Algotether, "ABABACA" will be encoded into ten bits, "0100100110".

To decode this code, the decoder must know the encoding tree, which

must be sent separately.

A modification to this classical Huffman coding is the adaptive, or

dynamic, Huffman coding. See, e.g., Gallager [9]. In this method,

the encoder and the decoder processes the first letter of the text as

if the frequency of each character in the file were one, say. After

the first letter has been processed, both parties increment the

frequency of that character by one. For example, if the first letter

is 'C', then freq['C'] becomes two, whereas every other frequencies

are still one. Then the both parties modify the encoding tree

accordingly. Then the second letter will be encoded and decoded, and

so on.

6. Arithmetic coding

The original concept of arithmetic coding is proposed by P. Elias.

An implementation in C language is described by Witten and others

[10].

Although the Huffman coding is optimal if each character must be

encoded into a fixed (integer) number of bits, arithmetic coding wins

if no such restriction is made. As an example we shall encode "AABA"

using arithmetic coding. For simplicity suppose we know beforehand

that the probabilities for "A" and "B" to appear in the text are 3/4

and 1/4, respectively.

Initially, consider an interval:

0 <= x < 1

Since the first character is "A" whose probability is 3/4, we shrink

the interval to the lower 3/4:

0 <= x < 3/4

The next character is "A" again, so we take the lower 3/4:

0 <= x < 9/16

Next comes "B" whose probability is 1/4, so we take the upper 1/4:

27/64 <= x < 9/16

because "B" is the second element in our alphabet, {A, B}. The last

character is "A" and the interval is:

27/64 <= x < 135/256

which can be written in binary notation:

0.011011 <= x < 0.10000111

Choose from this interval any number that can be represented in

fewest bits, say 0.1, and send the bits to the right of "0."; in this

case we send only one bit, "1". Thus we have encoded four letters

into one bit! With the Huffman coding, four letters could not be

encoded into less than four bits.

To decode the code "1", we just reverse the process: First, we supply

the "0." to the right of the received code "1", resulting in "0.1" in

binary notation, or 1/2. Since this number is in the first 3/4 of

the initial interval 0 <= x < 1, the first character must be "A".

Shrink the interval into the lower 3/4. In this new interval, the

number 1/2 lies in the lower 3/4 part, so the second character is

again "A", and so on. The number of letters in the original file

must be sent separately (or a special 'EOF' character must be ap-

pended at the end of the file).

The algorithm described above requires that both the sender and

receiver know the probability distribution for the characters. The

adaptive version of the algorithm removes this restriction by first

supposing uniform or any agreed-upon distribution of characters that

approximates the true distribution, and then updating the

distribution after each character is sent and received.

7. LZARI

In each step the LZSS algorithm sends either a character or a

more frequently than "x", and a

might be commoner than one of length 18, say. Thus, if we encode the

more frequent in fewer bits and the less frequent in more bits, the

total length of the encoded text will be diminished. This

consideration suggests that we use Huffman or arithmetic coding,

preferably of adaptive kind, along with LZSS.

This is easier said than done, because there are many possible

running statistics of frequency distribution. Too many items make

statistics unreliable.

What follows is not even an approximate solution to the problem posed

above, but anyway this was what I did in the summer of 1988.

I extended the character set from 256 to three-hundred or so in size,

and let characters 0 through 255 be the usual 8-bit characters,

whereas characters 253 + n represent that what follows is a position

of string of length n, where n = 3, 4 .... These extended set of

characters will be encoded with adaptive arithmetic compression.

I also observed that longest-match strings tend to be the ones that

were read relatively recently. Therefore, recent positions should be

encoded into fewer bits. Since 4096 positions are too many to encode

adaptively, I fixed the probability distribution of the positions "by

hand." The distribution function given in the accompanying LZARI.C

is rather tentative; it is not based on thorough experimentation. In

retrospect, I could encode adaptively the most significant 6 bits,

say, or perhaps by some more ingenious method adapt the parameters of

the distribution function to the running statistics.

At any rate, the present version of LZARI treats the positions rather

separately, so that the overall compression is by no means optimal.

Furthermore, the string length threshold above which strings are

coded into

must change according to the length of the

would get.

8. LZHUF

LZHUF, the algorithm of Haruyasu Yoshizaki's archiver LHarc, replaces

LZARI's adaptive arithmetic coding with adaptive Huffman. LZHUF

encodes the most significant 6 bits of the position in its 4096-byte

buffer by table lookup. More recent, and hence more probable,

positions are coded in less bits. On the other hand, the remaining 6

bits are sent verbatim. Because Huffman coding encodes each letter

into a fixed number of bits, table lookup can be easily implemented.

Though theoretically Huffman cannot exceed arithmetic compression,

the difference is very slight, and LZHUF is fairly fast.

The LZHUF.C file was written by Yoshizaki. I translated the comments

into English and made a few trivial changes to make it conform to the

ANSI C standard.

References

[1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).

[2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951

(1982).

[3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).

[4] J. Ziv and A. Lempel, IEEE Trans. IT-24, 530-536 (1978).

[5] T. A. Welch, Computer, 17, No.6, 8-19 (1984).

[6] J. A. Storer, Data Compression: Methods and Theory

(Computer Science Press, 1988).

[7] D. A. Huffman, Proc IRE 40, 1098-1101 (1952).

[8] R. Sedgewick, Algorithms, 2nd ed. (Addison-Wesley, 1988).

[9] R. G. Gallager, IEEE Trans. IT-24, 668-674 (1978).

[10] I. E. Witten, R. M. Neal, and J. G. Cleary, Commun. ACM

30, 520-540 (1987).

- end -

December 18, 2017
Add comments