Byte pair encoding

Byte pair encoding or digram coding^[1] is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal.^[2]

Byte pair encoding example

Suppose we wanted to encode the data

aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have the following data and replacement table:

ZabdZabac
Z=aa

Then we repeat the process with byte pair "ab", replacing it with Y:

ZYdZYac
Y=ab
Z=aa

We could stop here, as the only literal byte pair left occurs only once. Or we could continue the process and use recursive byte pair encoding, replacing "ZY" with "X":

XdXac
X=ZY
Y=ab
Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
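The compression loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Gage's original implementation: the function name bpe_compress and the symbols parameter (a caller-supplied pool of characters assumed not to occur in the input) are hypothetical, and it works on text strings rather than raw bytes to keep the output readable.

```python
from collections import Counter


def bpe_compress(data, symbols):
    """Repeatedly replace the most common adjacent pair with an unused symbol.

    `symbols` is an assumed pool of replacement characters that must not
    occur in `data`. Returns the compressed string and the replacement
    table as a list of (symbol, pair) tuples in the order they were made.
    """
    table = []
    for symbol in symbols:
        if symbol in data:
            raise ValueError(f"replacement symbol {symbol!r} occurs in the data")
        # Count every adjacent pair; overlapping occurrences are included.
        counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once, so stop
        # str.replace substitutes left to right without overlapping matches.
        data = data.replace(pair, symbol)
        table.append((symbol, pair))
    return data, table


compressed, table = bpe_compress("aaabdaaabac", "ZYX")
print(compressed)  # XdXac
print(table)       # [('Z', 'aa'), ('Y', 'Za'), ('X', 'Yb')]
```

In the second round "Za" and "ab" both occur twice. This sketch takes the pair it encounters first ("Za"), whereas the example above chose "ab", so the replacement table differs even though the final data is the same. Either choice is valid as long as the table is stored with the data.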

To decompress the data, simply perform the replacements in the reverse order.
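As a rough sketch of that step, the following Python function undoes the replacements starting with the most recent one. The name bpe_decompress and the table layout (a list of (symbol, pair) tuples in the order the replacements were made) are assumptions chosen for illustration.

```python
def bpe_decompress(data, table):
    """Expand replacement symbols back into pairs, newest replacement first."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data


# Replacement table from the example, in the order the replacements were made.
table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
print(bpe_decompress("XdXac", table))  # aaabdaaabac
```

The reverse order matters because later replacements can stand for pairs that contain earlier symbols: "X" expands to "ZY", which only then expands to "aa" and "ab".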

References

[1] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes. New York: Van Nostrand Reinhold, 1994. ISBN 9780442018634.

[2] "Byte Pair Encoding". http://www.csse.monash.edu.au/cluster/RJK/Compress/problem.html.


