# Byte pair encoding

Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal.
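The replacement byte has to be one that never appears in the input. A minimal sketch of how such bytes could be found, assuming the data is held in memory as a Python `bytes` object (the helper name `unused_bytes` is hypothetical):

```python
def unused_bytes(data):
    """Return every byte value (0-255) that does not occur in data."""
    present = set(data)  # iterating over bytes yields integers
    return [value for value in range(256) if value not in present]

free = unused_bytes(b"aaabdaaabac")
print(len(free))  # 252, since only a, b, c and d are in use
```

If this list is empty, no pair can be replaced without some extra escaping scheme.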

## Byte pair encoding example

Suppose we want to encode the data:

```
aaabdaaabac
```

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now we have the following data and replacement table:

```
ZabdZabac
Z=aa
```
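This first step can be checked with a few lines of Python. The snippet is only an illustration: it uses text characters in place of raw bytes and counts overlapping pairs, so each run "aaa" contributes two occurrences of "aa".

```python
from collections import Counter

data = "aaabdaaabac"

# Count every pair of adjacent characters, including overlapping ones.
pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
print(pair_counts.most_common(2))  # [('aa', 4), ('ab', 2)]

# str.replace substitutes non-overlapping occurrences from left to right.
step1 = data.replace("aa", "Z")
print(step1)  # ZabdZabac
```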

Then we repeat the process with byte pair "ab", replacing it with Y:

```
ZYdZYac
Y=ab
Z=aa
```

We could stop here, because the only remaining pair made up of two literal bytes, "ac", occurs only once. Or we could continue the process and use recursive byte pair encoding, replacing "ZY" with "X":

```
XdXac
X=ZY
Y=ab
Z=aa
```

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
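The whole procedure can be sketched as a loop. This is an illustrative sketch, not Gage's original implementation: it works on text rather than raw bytes, the function name `bpe_compress` and the `unused_symbols` parameter are hypothetical, and ties between equally frequent pairs are broken by character order, an arbitrary choice that happens to reproduce the table above.

```python
from collections import Counter

def bpe_compress(data, unused_symbols):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    Returns the compressed string and the replacement table as a list of
    (symbol, pair) tuples in the order the replacements were made.
    """
    table = []
    symbols = iter(unused_symbols)
    while True:
        # Count adjacent pairs (overlapping occurrences included).
        pairs = Counter(a + b for a, b in zip(data, data[1:]))
        if not pairs:
            break
        # Highest count wins; ties go to the pair that sorts last (assumption).
        pair, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # no pair occurs more than once
        symbol = next(symbols, None)
        if symbol is None:
            break  # ran out of unused symbols
        data = data.replace(pair, symbol)
        table.append((symbol, pair))
    return data, table

compressed, table = bpe_compress("aaabdaaabac", "ZYX")
print(compressed)  # XdXac
print(table)       # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

Because overlapping pairs are counted, the count can overstate how many replacements actually happen (for example in "aaa"). A more careful version would count non-overlapping occurrences instead.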

To decompress the data, perform the replacements in reverse order, expanding the most recently created symbol first: X, then Y, then Z.
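A minimal sketch of decompression for the example above, assuming the table is kept as a list of (symbol, pair) tuples in the order the replacements were created (the name `bpe_decompress` is hypothetical):

```python
def bpe_decompress(data, table):
    """Undo the replacements, newest first, to rebuild the original data."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data

table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
original = bpe_decompress("XdXac", table)
print(original)  # aaabdaaabac
```

Expanding X first yields "ZYdZYac", then Y gives "ZabdZabac", and finally Z restores the input.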

Wikimedia Foundation. 2010.
