
## A formula based approach to Arithmetic Coding

This article presents a formula based approach to Arithmetic Coding and explains the mathematical foundation of Arithmetic Coding from a different perspective. The new approach is demonstrated using a spreadsheet that compresses and decompresses a simple string ("Hello World"). Compression using the conventional approach is also demonstrated in the same spreadsheet, and it can be seen that the same compressed value is obtained using both methods. Simply put, the spreadsheet compresses the string "Hello World", 11 letters long, to the 4-byte value 2166290392 (31.2989 bits, to be more exact). Click here to download the spreadsheet; feel free to play around with it. For arithmetic coding, the radix is taken as 2 in this article, but any radix can be used. Click here to download a formal LaTeX PDF version of this article.

## Abstract

The Arithmetic Coding process involves re-calculating intervals for each symbol that needs to be encoded. This article presents a formula based approach for calculating compressed codes and provides a proof deriving the formula from the usual approach. A spreadsheet is also provided for verification of the approach. As a consequence, the similarities between Arithmetic Coding and Huffman coding are also visually illustrated.

## Known facts

The basic principles of Arithmetic Coding are explained well in [2]. Referring to [3], for a given input string 'A' of N symbols containing n distinct symbols:

- if each symbol is represented as a_i, 'i' being the index of the symbol when ordered by descending order of weights,
- if each symbol appears k_i times,
- if the weight (or probability) of each symbol is given by w_i = k_i / N,

then:

- the length (in bits) of the optimal possible code for symbol a_i is -log2(w_i) bits, referred to as h(a_i), and
- the total compressed length L will be Σ_[i=1 to n] -k_i·log2(w_i), or Σ_[i=1 to n] k_i·h(a_i), bits.
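The definitions above can be checked with a short script (a sketch in Python; the variable names are mine, not from the spreadsheet). Note that Python's stable sort keeps tied symbols in order of first appearance, which matches the ordering used later in the example:

```python
from collections import Counter
from math import log2

# Weights and optimal code lengths for A = "Hello World",
# following the definitions in the 'Known facts' section.
A = "Hello World"
N = len(A)                                   # 11 symbols in total
counts = Counter(A)                          # k_i for each distinct symbol
# Order symbols by descending weight (ties keep first-appearance order).
symbols = sorted(counts, key=lambda s: -counts[s])
w = {s: counts[s] / N for s in symbols}      # w_i = k_i / N
h = {s: -log2(w[s]) for s in symbols}        # h(a_i) = -log2(w_i)
L = sum(counts[s] * h[s] for s in symbols)   # total compressed length in bits

print(symbols)          # ['l', 'o', 'H', 'e', ' ', 'W', 'r', 'd']
print(round(L, 4))      # 31.2989
```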
## This work

The formula for L above implies that each symbol a_i contributes h(a_i) bits to the compressed output. This work asks what the value of those h(a_i) bits would be. Surely it cannot be all 0s or all 1s, in which case the compressed value would simply be 0 or 1. It also cannot take any arbitrary value, as there could be more than one symbol having the same compressed length. So the value must be distinct and specific to each symbol. Let us call this 'value' v(a_i).

If the formula for this value could be discovered, obtaining the code would simply be a matter of concatenating 'values' of the corresponding 'lengths', without having to recalculate the intervals for each symbol, as required in the common approach.

## Formulae

The 'length' and 'value' for each symbol are given by:

- h(a_i) = -log2(w_i)
- v(a_i) = exp2(h(a_i)) × Σ_[j=1 to i-1] w_j, with v(a_1) = 0
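Assuming the value formula v(a_i) = exp2(h(a_i)) × Σ_[j=1 to i-1] w_j, which the derivation below arrives at, the 'values' for a descending list of weights can be sketched as follows (a minimal Python sketch; the helper name is mine):

```python
from math import log2

def values(weights):
    """Given weights w_1..w_n in descending order, return v(a_i) for each,
    using v(a_i) = exp2(h(a_i)) * (sum of w_j for j < i), h(a_i) = -log2(w_i)."""
    vs = []
    cum = 0.0                      # running sum of weights of earlier symbols
    for w in weights:
        h = -log2(w)               # optimal code length for this symbol
        vs.append((2 ** h) * cum)  # scale the cumulative weight by exp2(h)
        cum += w
    return vs

# Weights for "Hello World": 'l' = 3/11, 'o' = 2/11, six symbols at 1/11 each.
print([round(v, 4) for v in values([3/11, 2/11] + [1/11] * 6)])
# → [0.0, 1.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

These are exactly the v(a_i) values listed in the example section below.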
## Proof (derivation)

We derive the formula from the common approach, which slices the interval according to the weights w_i. So for coding any symbol a_i, the following value is used:

interval_start + interval_length × Σ_[j=1 to i-1] w_j, for all i > 1.

When we use the interval 0 to 1, interval_start is 0 and interval_length is 1, so this reduces to:

Σ_[j=1 to i-1] w_j

However, since we have taken the interval as 0 to 1, this value is a fraction, and we have to scale it to get the value we are seeking. The length of this value is h(a_i) bits, so we shift it left by h(a_i) bits to get the desired value:

v(a_i) = exp2(h(a_i)) × Σ_[j=1 to i-1] w_j

⇒ v(a_i) = exp2(-log2(w_i)) × Σ_[j=1 to i-1] w_j

⇒ v(a_i) = (1 / w_i) × Σ_[j=1 to i-1] w_j

⇒ v(a_i) = Σ_[j=1 to i-1] (w_j / w_i)

⇒ v(a_i) = Σ_[j=1 to i-1] ((k_j / N) / (k_i / N))

⇒ v(a_i) = Σ_[j=1 to i-1] (k_j / k_i)

Therefore: v(a_i) = Σ_[j=1 to i-1] k_j / k_i, with v(a_1) = 0.

For the above derivation, any interval could be used as long as it is right aligned to 0. For example, if we use an interval of 0 to 4294967296 (2^32), the value would have to be shifted right by 32 - h(a_i) bits. In general, if interval_end = 2^x, the shift is x - h(a_i); the interval 0 to 1 is simply the case interval_end = 2^0 (x = 0).

## Application to other coding methods

The same formulae and approach are applicable to other coding methods, such as Huffman coding and Shannon-Fano coding, where a 'length' and a 'value' are available for each symbol. The only difference is that 'length' would be a natural number in these cases. Once the codes for symbols are obtained using the respective methods, the frequencies need to be re(verse)-calculated from the code lengths (w_i = exp2(-length_i)).

## Example

Given A = "Hello World", then
- **N** = 11, **n** = 8,
- **a_1** = 'l', **a_2** = 'o', **a_3** = 'H', **a_4** = 'e', **a_5** = ' ', **a_6** = 'W', **a_7** = 'r', **a_8** = 'd',
- **k_1** = 3, **k_2** = 2, **k_3** to **k_8** = 1,
- **w_1** = 0.2727 (3/11), **w_2** = 0.1818 (2/11), **w_3** to **w_8** = 0.0909 (1/11),
- **h(a_1)** = 1.8745, **h(a_2)** = 2.4594, **h(a_3)** to **h(a_8)** = 3.4594, and
- **L** = **31.2989**,

which means that after compression the 11 bytes will become 31.2989 bits (around 4 bytes). By applying the formulae, we get:

- **v(a_1)** = 0, **v(a_2)** = 1.5, **v(a_3)** = 5, **v(a_4)** = 6, **v(a_5)** = 7, **v(a_6)** = 8, **v(a_7)** = 9, **v(a_8)** = 10
- **Compressed value** = 2166290392.64712 (0.50437878646122 unscaled)
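The example can be reproduced end to end. The sketch below (Python; variable names are mine) encodes "Hello World" both by concatenating the formula-based 'values' and by the conventional interval-slicing approach, and confirms they agree:

```python
from math import log2

# Symbols of "Hello World" ordered by descending weight, as in the example.
order = ['l', 'o', 'H', 'e', ' ', 'W', 'r', 'd']
k = {'l': 3, 'o': 2, 'H': 1, 'e': 1, ' ': 1, 'W': 1, 'r': 1, 'd': 1}
N = 11
w = {s: k[s] / N for s in order}              # weights w_i
h = {s: -log2(w[s]) for s in order}           # lengths h(a_i)
cum, c = {}, 0.0
for s in order:                               # cumulative weight of earlier symbols
    cum[s] = c
    c += w[s]
v = {s: (2 ** h[s]) * cum[s] for s in order}  # v(a_i) = exp2(h(a_i)) * cum

# Formula based encoding: concatenate each symbol's value at its bit offset,
# i.e. add v(a_i) shifted right by the total length emitted so far.
f, shift = 0.0, 0.0
for s in "Hello World":
    shift += h[s]
    f += v[s] * 2 ** -shift

# Conventional interval based encoding for comparison.
low, rng = 0.0, 1.0
for s in "Hello World":
    low += rng * cum[s]
    rng *= w[s]

print(round(f, 14), round(low, 14))  # both ≈ 0.50437878646122 (unscaled)
print(f * 2 ** 32)                   # ≈ 2166290392.647, the scaled 4-byte value
```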
The following picture visually shows the placement of letters in the compressed value for both Arithmetic and Huffman coding: The values are the same as those shown in the example spreadsheets. Screenshots of the spreadsheets are also given in the last section.

## Conclusion

The current work simplifies the encoding process and explains Arithmetic coding in simpler terms. However, for practical implementations, the following points need to be considered:

- The formula based approach depends heavily on the performance of the **exp2** function. It is likely to be slower than calculating the interval table for each symbol.
- While the interval based approach does not allow parallel processing [1], the formula based approach would.
Until the above points are answered, this work presently serves the following purposes:

- Understand Arithmetic Coding from a different perspective
- Visualize positions of compressed symbols
- Visually compare Arithmetic coding and Huffman coding
- Serve as a precursor for further research on entropy coding
## References

1. *Paul G. Howard and Jeffrey Scott Vitter*, "Practical Implementations of Arithmetic Coding", Brown University, April 1992
2. *Wikipedia*, "Arithmetic Coding", https://en.wikipedia.org/wiki/Arithmetic_coding, September 2015
3. *Wikipedia*, "Huffman Coding", https://en.wikipedia.org/wiki/Huffman_coding, August 2015
## Feedback from public forums
## Screenshots of Spreadsheets