A formula based approach to Arithmetic Coding

A formula based approach to Arithmetic Coding

This article presents a formula based approach to Arithmetic Coding. It also explains the mathematical foundation of Arithmetic Coding from a different perspective.

The new approach is demonstrated using a spreadsheet by compressing and decompressing a simple string ("Hello World"). Compression using conventional approach is also demonstrated in the same spreadsheet. It can be seen that the same compressed value is obtained using both the methods.

Simply put, the spreadsheet compresses the string "Hello World" having length 11 letters to a 4 byte value 2166290293 (31.298 bits to be more exact).

Click here to download the spreadsheet. Feel free to play around with it. For arithmetic coding, radix is taken as 2 in this article, but any radix can be used.

Click here to download a formal LaTeX PDF version of this article.

Abstract

The Arithmetic Coding process involves re-calculation of intervals for each symbol that need to be encoded. This article discovers a formula based approach for calculating compressed codes and provides proof forderiving the formula from the usual approach. A spreadsheet is also provided for verification of the approach. Consequently, the similarities between Arithmetic Coding and Huffman coding are also visually illustrated.

Known facts

The basic principles of Arithmetic Coding are explained well in [2].

Referring to this article [3], for a given input string 'A', with 'N' symbols (letters) and 'n' unique symbols,

  • if each symbol is represented as ai, 'i' being the index of symbol when ordered by descending order of weights,
  • if each symbol appears ki times,
  • if weights (or probability) of each symbol is given by wi = ki / N,
we know:
  • the length (in bits) of optimal possible code for symbol ai is -log2(wi) bits, referred as h(ai).
  • the total compressed length L will be Σ[i=1 to n] -ki log2(wi) or Σ[i=1 to n] ki h(ai) bits.

This work

The formula for h(ai) has been known for several decades [1]. If this indicates the length of the compressed code (in bits), what is the 'value' contained in that length?

Surely it cannot be all 0s or all 1s, in which case the compressed value will simply be 0 or 1. Also, it cannot also take any arbitrary value, as there could be more than one symbol having the same compressed length. So the value is distinct and specific for each symbol. Let us call this 'value' as v(ai).

If the formula for this value could be discovered, it would be simply a matter of concatenating 'lengths' of 'values' to obtain the code, without having to recalculate the intervals for each symbol, as required in the common approach.

Formulae

v(ai) = (Σ[j=1 to i-1] kj) / ki for all i > 1. v(a1) = 0.
compressed_value = Σ[i=1 to N] (v(A[i]) / 2Σ[j=1 to i] h(A[j]))

Proof (derivation)

We derive the formula from the common approach, which slices the interval according to the weights wi. So for coding any symbol ai, the following value is used:
code_valuei = Σ[j=1 to i-1] (wj * interval_length)
for all i > 1. code_value1 = 0
When we use interval as 0 to 1, interval_length is equal to 1, so it becomes:
code_valuei = Σ[j=1 to i-1] wj
However, since we have taken the interval as 0 to 1, the value is a fraction and we have to scale it to get the value we are seeking. The length of this value is h(ai) bits, so we shift it left as follows to get the desired value:
v[ai] = (Σ[j=1 to i-1] wj) * 2h(ai)
v[ai] = (Σ[j=1 to i-1] wj) * 2-log2(wi)
v[ai] = (Σ[j=1 to i-1] wj) * 2log2(1/wi)
v[ai] = (Σ[j=1 to i-1] wj) * (1/wi)
v[ai] = (Σ[j=1 to i-1] wj) / wi
v[ai] = (Σ[j=1 to i-1] (kj/N)) / (ki/N)
Therefore:
v[ai] = (Σ[j=1 to i-1] kj) / ki

For all above statements, i > 1 and v(a1) is always 0.

For the above derivation, any interval could be used as long as h(ai) is then right aligned to 0. For example, if we use an interval of 0 to 4294967296 (232), the value would have to be shifted right by 32-h(ai).

If interval_end is expressed in terms of powers of 2, say 2x, then the value should be right shifted by x-h(ai). Note that when interval is 0 to 1, interval_end = 20 (x=0).

Application to other coding methods

The same formulas and approach are applicable to other coding methods such as Huffman coding, Shannon-Fano coding where 'length' and 'value' are available. The only difference is that 'length' would be a natural number in these cases.

Once the codes for symbols are obtained using the respective methods, the Frequencies need to be re(verse)-calculated according to the code lengths (ki = 2-h(ai)*N). Then, the formulas can be applied to obtain the compressed value. This is shown in a separate sheet (Huffman_coding) in the above download and only the columns with blue headings were changed. A picture for visual comparison is given under the example section below.

Example

Given A = "Hello World", then
  • N = 11, n = 8,
  • a1 = 'l', a2 = 'o', a3 = 'H', a4 = 'e', a5 = ' ', a6 = 'W', a7 = 'r', a8 = 'd', and
  • k1 = 3, k2 = 2, k3 to k8 = 1
  • w1 = 0.2727 (3/11), w2 = 0.1818 (2/11), w3 to w8 = 0.0909 (1/11)
then
  • h(a1) = 1.8745, h(a2) = 2.4594, h(a3) to h(a8) = 3.4594, and
  • L = 31.2989

which means after compression, 11 bytes will become 31.2989 bits (around 4 bytes).

By applying the formulas, we get:
  • v(a1) = 0, v(a2) = 1.5, v(a3) = 5, v(a4) = 6, v(a5) = 7, v(a6) = 8, v(a7) = 9, v(a8) = 10
  • Compressed value = 2166290392.64712 (0.50437878646122 unscaled)

The following picture visually shows placement of letters in compressed value for both Arithmetic and Huffman coding:


The values are the same as those shown in the example spreadsheets. A screenshot of the spreadsheets are also given in the last section.

Conclusion

The current work simplifies the encoding process and explains Arithmetic coding in simpler terms. However, for practical implementations, the following points need to be considered:

  • The formula based approach would heavily depend on the performance of exp2 function. It is likely to be slower than calculating the interval table for each symbol.
  • While the interval based approach does not allow parallel processing [1], the formula based approach would allow parallel processing.

Till the answers to the above points are available, this work presently serves the following purposes:

  • Understand Arithmetic Coding from a different perspective
  • Visualize positions of compressed symbols
  • Visually compare Arithmetic coding and Huffman coding
  • Precursor for further research on entropy coding

References

  1. Paul G. Howard and Jeffrey Scott Vitter, "Practical Implementations of Arithmetic Coding", Brown University, April 1992
  2. Wikipedia, "Arithmetic Coding", "https://en.wikipedia.org/wiki/Arithmetic_coding", September 2015
  3. Wikipedia, "Huffman Coding", "https://en.wikipedia.org/wiki/Huffman_coding", August 2015

Feedback from public forums

DateForumUser ID (nick)Type
12-Sep-2015xkcd Computer Science forumTubReview
10-Sep-2015Physics forumsChiroComment
8-Sep-2015The Science forumsJagellaComment
7-Sep-2015Science forumsSatoComment

Screenshots of Spreadsheets