Data differencing

Data differencing

In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ("patching" the source with the difference to produce the target).

Contents

Examples

One of the best-known examples of data differencing is the diff utility, which produces line-by-line differences of text files (and in some implementations, binary files, thus being a general-purpose differencing tool). Differencing of general binary files goes under the rubric of delta encoding, with a widely-used example being the algorithm used in rsync. A standardized generic differencing format is VCDIFF, implemented in such utilities as Xdelta version 3. A high-efficiency (small patch files) differencing program is bsdiff, which is based on bzip2 compression, demonstrating the close connection between differencing and compression.

Concerns

Main concerns for data differencing are usability and space efficiency (patch size).

If one simply wishes to reconstruct the target given the source and patch, one may simply include the entire target in the patch and "apply" the patch by discarding the source and outputting the target that has been included in the patch; similarly, if the source and target have the same size one may create a simple patch by XORing source and target. In both these cases, the patch will be as large as the target. As these examples show, if the only concern is reconstruction of target, this is easily done, at the expense of a large patch, and the main concern for general-purpose binary differencing is reducing the patch size.

For structured data especially, one has other concerns, which largely fall under "usability" – for example, if one is comparing two documents, one generally wishes to know which sections have changed, or if some sections have been moved around – one wishes to understand how the documents differ. For instance "here 'cat' was changed to 'dog', and paragraph 13 was moved to paragraph 14". One may also wish to have robust differences – for example, if two documents A and B differ in paragraph 13, one may wish to be able to apply this patch even if one has changed paragraph 7 of A. An example of this is in diff, which shows which lines changed, and where the context format allows robustness and improves human readability.

Other concerns include computational efficiency, as for data compression – finding a small patch can be very time and memory intensive.

Best results occur when one has knowledge of the data being compared and other constraints: diff is designed for line-oriented text files, particularly source code, and works best for these; the rsync algorithm is used based on source and target being across a network from each other and communication being slow, so it minimizes data that must be transmitted; and the updates for Google Chrome use an algorithm customized to the archive and executable format of the program's data.[1][2]

Connection with data compression

Data compression can be seen as a special case of data differencing[3][4] – data differencing consists of producing a difference given a source and a target, with patching producing a target given a source and a difference, while data compression consists of producing a compressed file given a target, and decompression consists of producing a target given only a compressed file. Thus, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.

When one wishes to emphasize the connection, one may use the term differential compression to refer to data differencing.

A dictionary translating between the terminology of the two fields is given as:

compression differencing
none source
uncompressed target
compressed difference, delta
compression differencing
decompression patching

See also

References

  1. ^ Chromium Blog: Smaller is Faster (and Safer Too)
  2. ^ Software Updates: Courgette (The Chromium Projects)
  3. ^ RFC 3284
  4. ^ Korn, D.G.; Vo, K.P. (1995), B. Krishnamurthy, ed., Vdelta: Differencing and Compression, Practical Reusable Unix Software, John Wiley & Sons 

Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

  • Data compression — Source coding redirects here. For the term in computer programming, see Source code. In computer science and information theory, data compression, source coding or bit rate reduction is the process of encoding information using fewer bits than… …   Wikipedia

  • Kullback–Leibler divergence — In probability theory and information theory, the Kullback–Leibler divergence[1][2][3] (also information divergence, information gain, relative entropy, or KLIC) is a non symmetric measure of the difference between two probability distributions P …   Wikipedia

  • Delta encoding — Not to be confused with Elias delta coding. Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding… …   Wikipedia

  • diff — This article is about the file comparison utility. For other uses, see DIFF (disambiguation). Diffs redirects here. For the American punk rock group, see The Diffs. In computing, diff is a file comparison utility that outputs the differences… …   Wikipedia

  • Global Positioning System — GPS redirects here. For other uses, see GPS (disambiguation). Geodesy Fundamentals …   Wikipedia

  • Geo-replication — systems improve the distribution of data across geographically distributed data networks. This enables improved end user experience of data heavy applications such as web portals. Geo replication can be achieved using software, hardware or a… …   Wikipedia

  • List of software that supports Office Open XML — Office Open XML Office Open XML file formats Open Packaging Conventions Open Specification Promise Vector Markup Language Office Open XML software Comparison of Office Open XML software Office Open XML standardization This is an overview of… …   Wikipedia

  • Stochastic drift — In probability theory, stochastic drift is the change of the average value of a stochastic (random) process. A related term is the drift rate which is the rate at which the average changes. This is in contrast to the random fluctuations about… …   Wikipedia

  • Autoregressive integrated moving average — In statistics, an autoregressive integrated moving average (ARIMA) model is a generalisation of an autoregressive moving average or (ARMA) model. These models are fitted to time series data either to better understand the data or to predict… …   Wikipedia

  • MagicDraw — Class diagram in MagicDraw 17.0 Developer(s) No Magic, Inc …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”