Errata list for Data Compression: The Complete Reference, 4th Edition

Last Updated 1 June 2014

-------

Error found by Cosmin Truta.

Page 827, line -1, Section 7.15 (AAC). The file extensions used by iTunes and the iTunes Music Store are given as "m4a and m4v". These should be changed to "m4a and m4p". Specifically, the extensions used by iTunes are m4a (unprotected audio, used when compressing the user's own CDs, but not sold on the iTunes Music Store) and m4p (protected audio, sold on the iTunes Music Store). The third format is indeed m4v, but that is for video. Another file extension, mp4, is understood by the iTunes software as generic MPEG-4 (audio, video, or both), but when creating files, the specific file extensions (m4a or m4v) are used.

-------

Errors found by Derong Bao.

Page 44, Tables 1.18 and 1.19. The text 0101=6 on the top rows of these tables should be 0110=6. (24Apr09)

Page 212, Table 3.25 (LZAP Example) should be changed to:

yabbadabbadabbadoo

Step  In    Out   New dictionary entries
  1   y     y     -
  2   a     a     256-ya
  3   b     b     257-ab
  4   b     b     258-bb
  5   a     a     259-ba
  6   d     d     260-ad
  7   ab    ab    261-da, 262-dab
  8   ba    ba    263-abb, 264-abba
  9   dab   dab   265-bad, 266-bada, 267-badab
 10   bad   bad   268-dabb, 269-dabba, 270-dabbad
 11   o     o     271-bado
 12   o     o     272-oo

--------

Errors found by Roberto Mantaci (these were found too late to be corrected in the 5th edition of the book).

Page 27, line 23. "Each pair is either an actual 16-bit measurement ... or two 8-bit signed differences. Thus ... differences can be between 0 and +/-127." Section 1.4.1 is really useful only for monochromatic images. It is hard to see how the compression of a color image can benefit from the type of run merging proposed in this section.

Page 84, line 6. "It does not take much to realize that the symbols have to have probability p_1=a," should be changed to "A possible set of probabilities for the symbols is p_1=a,...".

Page 75, last 3 lines, and page 76, first 3 lines. I feel that there is a need for a more rigorous algorithm to determine the Huffman tree with minimum variance. Here is what I have been able to locate so far:

1. A set of lecture notes (available in PDF): CS 493: Algorithms for Massive Data Sets, Huffman and Arithmetic Coding, Thursday 2/7/2002, scribe Chi Zhang, where the author says the following about our question: "This is achieved by an extension to the Huffman algorithm. When combining trees, break ties by picking the earliest produced subtrees with same smallest probability." (RM: This rule would indeed combine a_2 and a_3 first in your example, and I think it is very easy to accomplish. It suffices that the sifting of the heap does not let younger nodes move in front of older nodes of the same weight in the priority queue.) However, he asks an interesting question: "Question: Is this equivalent to breaking ties by picking 'shortest' subtrees (for some appropriate definition of shortest)?"

2. On Wikipedia, the question is answered in the last paragraph of the following quote (but the source is unclear): "If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:

• Start with as many leaves as there are symbols.

• Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).

• While there is more than one node in the queues:

• Dequeue the two nodes with the lowest weight by examining the fronts of both queues.

• Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.

• Enqueue the new node into the rear of the second queue.

• The remaining node is the root node; the tree has now been generated.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code."

(RM: This rule would also combine a_2 and a_3 first in your example.)

3. Then there is the paper "Minimum Variance Huffman Codes," SIAM J. Comput. 11(1):138-148 (1982), by Lawrence T. Kou. I have not been able to obtain the actual paper, but here is its abstract: "Huffman's well-known coding method constructs a minimum redundancy code which minimizes the expected value of the word length. In this paper, we characterize the minimum redundancy code with the minimum variance of the word length. An algorithm is given to construct such a code. It is shown that the code is in a certain sense unique. Furthermore, the code is also shown to have a strong property in that it minimizes a general class of functions of the minimum redundancy codes as long as the functions are nondecreasing with respect to the path lengths from the root to the internal nodes of the corresponding decoding trees. ©1982 Society for Industrial and Applied Mathematics"

Readers are invited to send responses to Roberto Mantaci and David Salomon.

--------

Error found by Raphaël Lemoine (this was also found too late to be corrected in the 5th edition of the book).

Page 531, lines 1--2. "any periodic function can be expressed as a..." should be "almost any periodic function can be expressed as a...."

--------

(May 2010): Error found by Bernd Herd.

Page 176, line -13. The book claims that I proposed the LZH idea.
While I would be proud if I had, I have only worked a lot with the LHA packer developed by Haruyasu Yoshizaki around 1990; in particular, I used it for quite a while to develop setup programs for Windows applications and a phonebook database CD. The file format used by LHA is called LZH. It uses a combination of LZ and Huffman coding. I have never cared much about the details. Take a look at "" for more information:

--------

(Apr 2011): Error found by Frédéric Kayser.

Page 238, line -5. "This table has codes 0--256 for..." should be "This table has codes 0--255 for..."

--------

(Nov 2011): Errors found by Rolf-Rainer Grigat.

Page 123, last line of subsection 2.14.2 (Underflow). An output of 4 should be filled with nines and a 5 should be filled with zeros (or perhaps the 4 and 5 should be swapped).

Page 124, entropy versus information. In the third and fourth paragraphs, the term "entropy" should be replaced by "information", because this is not the average information of the source model but just the information content of one possible event, computed as a logarithm.

Page 124, sixth paragraph. The proposed 19-bit number is too short. If it is padded with zeros on the right, the resulting number is smaller than Low, i.e., outside the coded interval. At least two additional 1 bits are required. The interval related to 111111101001111110011 (compared to 1111111010011111100) is completely contained between Low and High. This 21-bit number is the nearest higher integer for the information content of 20.0249 bits/symbol. Therefore the explanation in terms of the limited accuracy of Table 2.52 can be removed.

--------

(Dec 2012): Error found by Rayne Olivetti.

Page 75, line -13. "It turns out that the arbitrary decisions made in constructing the Huffman tree affect the individual codes but not the average size of the code." However, this does not seem to be correct in some cases.
Here is one example case. However, even in that case, the book's rule ("When there are more than two smallest-probability nodes, select the ones that are lowest and highest in the tree and combine them. This will combine symbols of low probability with ones of high probability, thereby reducing the total variance of the code") seems to hold.

------------

(Jun 2013): Errors found by Richard Llewelyn.

A number of errors were found on pages 303, 535, and 537 (Figure 5.5). For a detailed description of these, please download the small PDF file found here.

------------

(Apr 2014): Error found by Seetha Rama Raju sanapala.

Page 5, line -4. "In such text, each letter occurs with equal probability so assigning them fixed-size codes does not add any redundancy." But suppose you have 20 source outputs with equal probability and must use a 5-bit fixed-size code; then you have redundancy (the entropy is log2(20), about 4.32 bits/symbol, so a 5-bit fixed code wastes about 0.68 bits per symbol) which can be removed by using a variable-length code.

------------

(May 2014): Error found by Carlos Fajardo.

Page 51, line -15. [Redundancy] "It is defined as the difference between a symbol set's largest possible entropy and its actual entropy." I think that the entropy is constant for a given data set, and the set's largest possible entropy, log2(n), is a constant too; so, is the redundancy a constant? I do not think so. However, at the beginning of Section 2.2 you suggest that the redundancy is the current number of bits per symbol used to represent the data set minus the entropy of the data set. I agree with the latter definition.

------------

The most valuable thing you can make is a mistake - you can't learn anything from being perfect.
- Adam Osborne, American entrepreneur, 1939-2003
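------------

A note on the corrected Table 3.25 above: the LZAP trace can be checked mechanically. The following is a minimal Python sketch (the function name and the output format are mine, not the book's) of the LZAP rule: at each step the encoder emits the longest dictionary match and then adds to the dictionary the previous match concatenated with every nonempty prefix of the current match.

```python
def lzap_encode(s):
    """Sketch of LZAP encoding; returns the matched phrases and the
    dictionary additions (code, phrase) in the order they are made."""
    dictionary = {chr(i): i for i in range(256)}  # single chars, codes 0-255
    next_code = 256
    phrases, additions = [], []
    prev = ""
    i = 0
    while i < len(s):
        # Greedy longest match against the current dictionary.
        match = s[i]
        j = i + 1
        while j < len(s) and s[i:j + 1] in dictionary:
            match = s[i:j + 1]
            j += 1
        phrases.append(match)
        # LZAP step: add prev + every nonempty prefix of the current match.
        if prev:
            for k in range(1, len(match) + 1):
                phrase = prev + match[:k]
                if phrase not in dictionary:
                    dictionary[phrase] = next_code
                    additions.append((next_code, phrase))
                    next_code += 1
        prev = match
        i += len(match)
    return phrases, additions

phrases, additions = lzap_encode("yabbadabbadabbadoo")
```

Running this on yabbadabbadabbadoo yields the twelve phrases and the dictionary additions 256-ya through 272-oo exactly as in the corrected table.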
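------------

A note on the two-queue Huffman construction quoted from Wikipedia in the Mantaci entry above: it can be rendered as a short Python sketch. This is my own illustration, not the book's code; the function names and the five integer weights (proportional to 0.1, 0.1, 0.2, 0.2, 0.4) are assumptions. Ties between the queue fronts are broken in favor of the first queue, the rule the quote says minimizes variance; with these weights the two 0.2 leaves are combined with each other rather than with the 0.1+0.1 subtree, giving the low-variance code lengths 2, 2, 2, 3, 3.

```python
from collections import deque

def huffman_two_queues(symbols):
    """Linear-time Huffman tree from (symbol, weight) pairs that are
    pre-sorted by increasing weight, using two FIFO queues."""
    q1 = deque((w, (sym,)) for sym, w in symbols)  # leaves, sorted
    q2 = deque()                                   # combined subtrees

    def pop_lowest():
        # On equal weights, prefer q1 (the minimum-variance tie-break).
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        w1, t1 = pop_lowest()
        w2, t2 = pop_lowest()
        q2.append((w1 + w2, (t1, t2)))  # new internal node to rear of q2
    return (q1 or q2)[0][1]             # the single surviving tree

def code_lengths(tree, depth=0):
    """A leaf is a 1-tuple ('sym',); an internal node is a pair of subtrees."""
    if len(tree) == 1:
        return {tree[0]: depth}
    left, right = tree
    lengths = code_lengths(left, depth + 1)
    lengths.update(code_lengths(right, depth + 1))
    return lengths

# Weights 1,1,2,2,4 stand in for probabilities 0.1,0.1,0.2,0.2,0.4.
tree = huffman_two_queues([('a4', 1), ('a5', 1),
                           ('a2', 2), ('a3', 2), ('a1', 4)])
lengths = code_lengths(tree)  # a1, a2, a3 get length 2; a4, a5 get length 3
```

Whether this first-queue tie-break coincides with Kou's minimum-variance characterization in general is exactly the open question raised above.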