This article presents a formula-based approach to Arithmetic Coding. It also explains the mathematical foundation of Arithmetic Coding from a different perspective.
The new approach is demonstrated in a spreadsheet by compressing and decompressing a simple string ("Hello World"). Compression using the conventional approach is also demonstrated in the same spreadsheet, and it can be seen that both methods produce the same compressed value.
Simply put, the spreadsheet compresses the 11-letter string "Hello World" to the 4-byte value 2166290293 (31.2989 bits, to be more exact).
Click here to download the spreadsheet. Feel free to play around with it. For arithmetic coding, radix is taken as 2 in this article, but any radix can be used.
Click here to download a formal LaTeX PDF version of this article.
The Arithmetic Coding process involves re-calculating intervals for each symbol to be encoded. This article presents a formula-based approach for calculating compressed codes and provides a proof deriving the formula from the usual approach. A spreadsheet is also provided for verification of the approach. As a by-product, the similarities between Arithmetic Coding and Huffman coding are also illustrated visually.
The basic principles of Arithmetic Coding are explained well in .
Referring to this article, for a given input string 'A' with 'N' symbols (letters) and 'n' unique symbols,
The formula for h(ai) has been known for several decades. If this indicates the length of the compressed code (in bits), what is the 'value' contained in that length?
Surely it cannot be all 0s or all 1s; in that case the compressed value would simply be 0 or 1. Nor can it take an arbitrary value, as there could be more than one symbol with the same compressed length. So the value is distinct and specific for each symbol. Let us call this 'value' v(ai).
If the formula for this value could be discovered, it would simply be a matter of concatenating the 'values' at their 'lengths' to obtain the code, without having to recalculate the intervals for each symbol, as the common approach requires.
v(ai) = (Σ[j=1 to i-1] kj) / ki for all i > 1. v(a1) = 0.
compressed_value = Σ[i=1 to N] ( v(A[i]) / 2^(Σ[j=1 to i] h(A[j])) )
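These two formulas can be sketched in code. The sketch below is an illustration, not the spreadsheet itself: it assumes the symbols a1..an are taken in order of first appearance (the article does not fix an ordering, and the compressed fraction depends on it, though the total bit length does not), and it uses Python floats rather than exact arithmetic.

```python
import math
from collections import Counter

def build_tables(A):
    """Symbol frequencies k_i; symbols taken in order of first appearance (assumption)."""
    counts = Counter(A)
    symbols = list(dict.fromkeys(A))  # assumed ordering a_1..a_n
    k = {s: counts[s] for s in symbols}
    return symbols, k

def v(symbols, k, s):
    """v(a_i) = (Σ[j=1 to i-1] k_j) / k_i, with v(a_1) = 0."""
    i = symbols.index(s)
    return sum(k[t] for t in symbols[:i]) / k[s]

def compressed_value(A):
    """compressed_value = Σ[i=1 to N] v(A[i]) / 2^(cumulative h); also returns total bits."""
    symbols, k = build_tables(A)
    N = len(A)
    h = {s: -math.log2(k[s] / N) for s in symbols}  # code length h(a_i) in bits
    total, cum_h = 0.0, 0.0
    for c in A:
        cum_h += h[c]
        total += v(symbols, k, c) / 2 ** cum_h
    return total, cum_h  # fraction in [0, 1) and total compressed length in bits
```

For "Hello World" the returned bit length is approximately 31.2989, matching the figure quoted above; the fraction itself differs from the spreadsheet's 2166290293 unless the same symbol ordering is used.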
Proof (derivation)
We derive the formula from the common approach, which slices the interval according to the weights wi. So for coding any symbol ai, the following value is used:
code_value(i) = Σ[j=1 to i-1] (wj * interval_length)
for all i > 1, and code_value(1) = 0.
When we use the interval 0 to 1, interval_length is equal to 1, so this becomes:
code_value(i) = Σ[j=1 to i-1] wj
However, since we have taken the interval as 0 to 1, this value is a fraction, and we have to scale it to get the value we are seeking. The length of the value is h(ai) bits, so we shift it left as follows to get the desired value:
v(ai) = (Σ[j=1 to i-1] wj) * 2^h(ai)
⇒ v(ai) = (Σ[j=1 to i-1] wj) * 2^(-log2(wi))
⇒ v(ai) = (Σ[j=1 to i-1] wj) * 2^(log2(1/wi))
⇒ v(ai) = (Σ[j=1 to i-1] wj) * (1/wi)
⇒ v(ai) = (Σ[j=1 to i-1] wj) / wi
⇒ v(ai) = (Σ[j=1 to i-1] (kj/N)) / (ki/N)
Therefore:
v(ai) = (Σ[j=1 to i-1] kj) / ki
For all of the above statements, i > 1 and v(a1) is always 0.
For the above derivation, any interval could be used, as long as the resulting value is then right-aligned to bit 0. For example, if we use the interval 0 to 4294967296 (2^32), the value would have to be shifted right by 32 − h(ai).
If interval_end is expressed as a power of 2, say 2^x, then the value should be shifted right by x − h(ai). Note that when the interval is 0 to 1, interval_end = 2^0 (x = 0).
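The derivation can be checked numerically. The sketch below (an illustration using Python floats, with symbols assumed to be in order of first appearance) computes the compressed value both ways, via conventional interval slicing and via the derived closed-form sum, and the two agree up to floating-point error:

```python
import math
from collections import Counter

def conventional_low(A):
    """Conventional arithmetic coding: repeatedly slice [low, low + range) by the weights."""
    counts = Counter(A)
    symbols = list(dict.fromkeys(A))  # assumed symbol order a_1..a_n
    N = len(A)
    w = {s: counts[s] / N for s in symbols}
    low, rng = 0.0, 1.0
    for c in A:
        i = symbols.index(c)
        low += rng * sum(w[s] for s in symbols[:i])  # Σ wj for j < i
        rng *= w[c]
    return low

def formula_value(A):
    """Closed form: Σ[i=1 to N] v(A[i]) / 2^(Σ[j=1 to i] h(A[j]))."""
    counts = Counter(A)
    symbols = list(dict.fromkeys(A))
    N = len(A)
    total, cum_h = 0.0, 0.0
    for c in A:
        i = symbols.index(c)
        vi = sum(counts[s] for s in symbols[:i]) / counts[c]  # v(a_i)
        cum_h += -math.log2(counts[c] / N)                    # h(a_i)
        total += vi / 2 ** cum_h
    return total
```

Both functions return the left end of the final interval, which is why the article's formula reproduces the conventional result.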
Application to other coding methods
The same formulas and approach are applicable to other coding methods, such as Huffman coding and Shannon-Fano coding, where a 'length' and a 'value' are available. The only difference is that the 'length' would be a natural number in these cases.
Once the codes for the symbols are obtained using the respective method, the frequencies need to be reverse-calculated from the code lengths (ki = 2^(-h(ai)) * N). Then the formulas can be applied to obtain the compressed value. This is shown in a separate sheet (Huffman_coding) in the above download; only the columns with blue headings were changed. A picture for visual comparison is given under the example section below.
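That reverse calculation can be sketched as follows. The code lengths and message length below are hypothetical examples, not the spreadsheet's values; for integer Huffman lengths the implied frequencies are ki = 2^(-h(ai)) * N, after which v(ai) follows from the same formula:

```python
N = 8  # hypothetical message length

# hypothetical Huffman code lengths for four symbols (Kraft sum = 1/2 + 1/4 + 1/8 + 1/8 = 1)
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}

# reverse-calculated frequencies: k_i = 2^(-h(a_i)) * N
k = {s: 2 ** -h * N for s, h in lengths.items()}

# v(a_i) = (Σ[j=1 to i-1] k_j) / k_i, with v(a_1) = 0
symbols = list(lengths)
v = {s: sum(k[t] for t in symbols[:i]) / k[s] for i, s in enumerate(symbols)}
```

Here v comes out as {"a": 0, "b": 2, "c": 6, "d": 7}, i.e. exactly the canonical Huffman code values 0, 10, 110, 111 at their respective lengths.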
Example
Given A = "Hello World", then
which means that after compression, the 11 bytes become 31.2989 bits (around 4 bytes).
By applying the formulas, we get:
The following picture visually shows placement of letters in compressed value for both Arithmetic and Huffman coding:
The values are the same as those shown in the example spreadsheets. Screenshots of the spreadsheets are also given in the last section.
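The 31.2989-bit figure can be checked directly, since the total compressed length is Σ ki * h(ai) = Σ ki * log2(N/ki) over the 8 distinct letters of "Hello World" (a quick check using Python floats):

```python
import math
from collections import Counter

counts = Counter("Hello World")  # l:3, o:2, all other letters (and the space):1
N = sum(counts.values())         # 11
total_bits = sum(k * math.log2(N / k) for k in counts.values())
print(round(total_bits, 4))      # -> 31.2989
```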
The current work simplifies the encoding process and explains Arithmetic coding in simpler terms. However, for practical implementations, the following points need to be considered:
Until answers to the above points are available, this work serves the following purposes:
Feedback from public forums
Screenshots of Spreadsheets