Compressed data structure


Compressed data structure

The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data structure for the problem, but whose size can be substantially smaller. The size of the compressed data structure is typically highly dependent upon the entropy of the data being represented.

Important examples of compressed data structures include the compressed suffix array[1][2] and the FM-index,[3] both of which can represent an arbitrary text of characters T for pattern matching. Given any input pattern P, they support the operation of finding if and where P appears in T. The search time is proportional to the sum of the length of pattern P, a very slow-growing function of the length of the text T, and the number of reported matches. The space they occupy is roughly equal to the size of the text T in entropy-compressed form, such as that obtained by Prediction by Partial Matching or gzip. Moreover, both data structures are self-indexing, in that they can reconstruct the text T in a random access manner, and thus the underlying text T can be discarded. In other words, they simultaneously provide a compressed and quickly searchable representation of the text T. They represent a substantial space improvement over the conventional suffix tree and suffix array, which occupy many times more space than the size of T. They also support searching for arbitrary patterns, as opposed to the inverted index, which can support only word-based searches. In addition, inverted indexes do not have the self-indexing feature.

An important related notion is that of a succinct data structure, which uses space roughly equal to the information-theoretic minimum, which is a worst-case notion of the space needed to represent the data. In contrast, the size of a compressed data structure depends upon the particular data being represented. When the data are compressible, as is often the case in practice for natural language text, the compressed data structure can occupy substantially less space than the information-theoretic minimum.

References

  1. ^ R. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching], Proceedings of the 32nd ACM Symposium on Theory of Computing, May 2000, 397-406. Journal version in SIAM Journal on Computing, 35(2), 2005, 378-407.
  2. ^ R. Grossi, A. Gupta, and J. S. Vitter, High-Order Entropy-Compressed Text Indexes, Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms, January 2003, 841-850.
  3. ^ P. Ferragina and G. Manzini, Opportunistic Data Structures with Applications, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, November 2000, 390-398. Journal version in Indexing Compressed Text, Journal of the ACM, 52(4), 2005, 552-581.

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Compressed suffix array — The Compressed Suffix Array[1][2] is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T. For an input pattern P of m… …   Wikipedia

  • Data set (IBM mainframe) — This article is about mainframe computer file. For a general meaning in computing field, see Data set. data set (archaic), dataset (preferred), is a computer file having a record organization. The term pertains to the IBM mainframe operating… …   Wikipedia

  • Compressed-air energy storage — (CAES) refers to the compression of airto be used later as energy source. At utility scale, it can be stored during periods of low energy demand (off peak), for use in meeting periods of higher demand (peak load). Alternatively it can be used to… …   Wikipedia

  • Compressed Air Foam System — A Compressed Air Foam System for hand hose, abbreviated CAFS, is a system used in firefighting to deliver fire retardant foam for the purpose of extinguishing a fire or protecting unburned areas from becoming involved in flame.DescriptionRon… …   Wikipedia

  • CEBAF On-line Data Acquisition — The Continuous Electron Beam Accelerator Facility (CEBAF) at Jefferson Lab is an electron beam particle accelerator constructed using funding from the United States Department of Energy in the city of Newport News, Virginia.DescriptionThe… …   Wikipedia

  • Multi-function structure — Multi function material is a composite material. The traditional approach to the development of structures is to address the loadcarrying function and other functional requirements separately. Recently, however, there has been increased interest… …   Wikipedia

  • Straight-line grammar — Straight line grammars (SLG) are formal grammars that do not branch (every non terminal has only one associated production rule) nor loop (if non terminal A appears in a derivation of B, B does not appear in a derivation of A). Such grammars… …   Wikipedia

  • ZIP (file format) — unzip redirects here. For the program, see Info ZIP. ZIP Filename extension .zip .zipx (newer compression algorithms) Internet media type application/zip Uniform Type Identifier com.pkware.zip archive Magic …   Wikipedia

  • BMP file format — Windows Bitmap Filename extension .bmp or .dib Internet media type image/x ms bmp (unofficial) or image/x bmp (unofficial) Type code BMP BMPf BMPp Uniform Type Identifier com.microsoft.bmp …   Wikipedia

  • NTFS — Developer Microsoft Full name New Technology File System[1] Introduced July 1993 (Windows NT 3.1) Partition identifier 0x07 (MBR) EBD0A0A2 B9E5 4433 87C0 68B6 …   Wikipedia


Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”

We are using cookies for the best presentation of our site. Continuing to use this site, you agree with this.