Hash function

Hash function

A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer, that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks — such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomizing functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements. The HashKeeper database maintained by the National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

Applications

Hash tables

Hash functions are mostly used in hash tables, to quickly locate a data record (for example, a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to the index of a slot in the table where the corresponding record is supposedly stored.

In general, a hashing function may map several different keys to the same hash value. Therefore, each slot of a hash table contains (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called "bucket indices".

Thus, the hash function only hints at the record's location — it only tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.

In the Java programming language, for example, the Object parent class provides a standard hashCode() method that is required to generate a 32-bit integer hash value of its object. This method is used in several hash-table-based classes, such as HashMap and HashSet.

Finding duplicate records

To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an index into a table "T", and collect in each bucket "T" ["i"] a list of the numbers of all records with the same hash value "i". Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket "T" ["i"] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).

Finding similar records

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most "m", where "m" is a small integer (say, 1 or 2). If one builds a table of "T" of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket "T" ["i"] against those in buckets "T" ["i"+"k"] where "k" ranges between -"m" and "m".

This class includes the so-called acoustic fingerprint algorithms, that are used to locate similar-sounding entries in large collection of audio files (as in the MusicBrainz song labeling service). For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to "trivial" changes such as timing and volume changes, compression, etc. [http://citeseer.ist.psu.edu/rd/11787382%2C504088%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/25861/http:zSzzSzwww.extra.research.philips.comzSznatlabzSzdownloadzSzaudiofpzSzcbmi01audiohashv1.0.pdf/haitsma01robust.pdf "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen"] ] .

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to locate close pairs of points in the plane or in three-dimensional space, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of "cells". The table is often an array with two or more indices (called a "bucket grids"), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or "the grid method".

Properties

Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. Note that different requirements apply to the other related concepts (cryptographic hash functions, checksums, etc.).

Low cost

The cost of computing a hash function must be small enough to make a hashing-based solution advantageous other approaches. For instance, binary search can locate an item in a table of "n" items with log2 "n" key comparisons. Therefore, a hash-table solution will be more efficient than binary search only if computing the hash function for one key costs less than performing log2 "n" key comparisons.

Determinism

A hash function must be deterministic — meaning that two identical or equivalent inputs must generate the same hash value.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of "collisions" — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that this criterion only requires the value to be "uniformly distributed", not "random" in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of "m" records is hashed to "n" table slots, the probability of a bucket receiving many more than "m/n" records should be vanishingly small. In particular, if "m" is less than "n", very few buckets should have more than one or two records. (Ideally, no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if "n" is much larger than "m" (see the birthday paradox).

Variable range

In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters — the input data "z", and the number "n" of allowed hash values.

Data normalization

In some applications, the input data may contain features that are irrelevant for comparison purposes. When looking up a personal name, for instance, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value.

Continuity

A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that use linear search.

Hash function algorithms

The choice of a hashing function depends strongly on the nature of the input data, and their probability distribution in the intended application.

Trivial hash function

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer in binary notation) as the hashed value. Invalid data values (such as the country code 'xx' or the zip code 00000) may be left undefined in the table, or mapped to some appropriate 'null' value. The cost of computing this "trivial" (identity) hash function is effectively zero.

The meaning of 'small enough' depends on how much memory is available for the hash table. A typical PC (as of 2008) might have a gigabyte of available memory, meaning that hash values of up to 30 bits could be accommodated. However, there are many applications that can get by with much less. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternate form of that character ('A' for 'a', '8' for '8', etc.). If each character is stored in 8 bits (as in ASCII or ISO Latin 1), the table has only 28 = 256 entries; in the case of Unicode characters, the table would have 216 = 65536 entries. The same technique can be used to map two-letter country codes like 'us' or 'za' to country names (65536 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc.

Injective and perfect hashing

The ideal hashing function should be injective — that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search.

An injective hash function whose range is all integers between 0 and "n"−1, where "n" is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots.

Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values and therefore cannot be injective.

Hashing uniformly distributed data

If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.), and each input may independently occur with uniform probability, then a hash function need only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer "z" in the range 0 to "N"−1, and the output must be an integer "h" in the range 0 to "n"−1, where "N" is much larger than "n". Then the hash function could be "h" = "z" mod "n" (the remainder of "z" divided by "n"), or "h" = ("z" × "n") ÷ "N" (the value "z" scaled down by "n"/"N" and truncated to an integer), or many other formulas.

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if "n" is 10000 or so, the division formula ("z" × "n") ÷ "N", which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula "z" mod "n", which is quite sensitive to the trailing digits, may still yield a fairly even distribution of hash values.

When the data values are long (or variable-length) character strings — such as personal names, web page addresses, or mail messages — their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string — and depends on each character in a different way.

pecial-purpose hash functions

In many such cases, one can design a special-purpose (heuristic) hash function that yield many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part "k" of the file name and returns "k" mod "n" would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with different distribution.

Hashing with checksum functions

One can obtain good general-purpose hash functions for string data by adapting certain checksum or fingerprinting algorithms. Some of those algorithms will map arbitrary long string data "z", with any typical real-world distribution — no matter how non-uniform and dependent — to a fixed length bit string, with a fairly uniform distribution. This string can be interpreted as a binary integer "k", and turned into a hash value by the formula "h" = "k" mod "n".

This method will produce a fairly even distribution of hash values, as long as the hash range size "n" is small compared to the range of the checksum function. Bob Jenkins' [http://burtleburtle.net/bob/c/lookup3.c LOOKUP3] algorithm Citation | title=Algorithm Alley | contribution=Hash Functions | first=Bob | last1=Jenkins | journal=Dr. Dobb's Journal | date=September, 1997 |url=http://www.ddj.com/184410284 ] uses a 32-bit checksum. A 64-bit checksum should provide adequate hashing for tables of any feasible size.

Hashing with cryptographic hash functions

Some cryptographic hash functions, such as MD5, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions. However, the uniformity advantage may be too small to offset their much higher cost.

Origins of the term

The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix".
Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology. [cite book | author=Knuth, Donald | year=1973 | title=The Art of Computer Programming, volume 3, Sorting and Searching | pages=506-542 ]

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.

See also

*Universal hashing
*Cryptography
*Cryptographic hash function
*HMAC
*Geometric hashing
*Distributed hash table
*Perfect hash function
*Linear hash
*Rolling hash
*Rabin-Karp string search algorithm
*Zobrist hashing
*Bloom filter
*Hash table
*Hash list
*Hash tree
*Coalesced hashing
*Transposition table
*List of hash functions

Notes

# In the remainder of this article, the term "function" is used to refer to algorithms as well as the functions they compute.

References

External links

* [http://www.partow.net/programming/hashfunctions/index.html General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby)]
* [http://burtleburtle.net/bob/hash/index.html Hash Functions and Block Ciphers by Bob Jenkins]
* [http://www.concentric.net/~Ttwang/tech/inthash.htm Integer Hash Function] by Thomas Wang
* [http://www.geocities.com/drone115b/Goulburn06.pdf The Goulburn Hashing Function] (PDF) by Mayur Patel
* [http://www.azillionmonkeys.com/qed/hash.html Hash Functions] by Paul Hsieh
* [http://herbert.gandraxa.com/herbert/hsh.asp HSH 11/13] by Herbert Glarner
* [http://www.paulschou.com/tools/xlate/ Online Char (ASCII), HEX, Binary, Base64, etc... Encoder/Decoder with MD2, MD4, MD5, SHA1+2, etc. hashing algorithms]
* [http://isthe.com/chongo/tech/comp/fnv/ FNV] Fowler, Noll, Vo Hash Function
* [http://www.sinfocol.org/herramientas/hashes.php Hash Generator] Online Hash Generator (md2,md4,md5,sha1,tiger,snefru,ripemd,whirlpool,haval...)
* [http://hash.stephan-brumme.com/ Ajax-based Hash Generator] Online Hash Generator with instant hash computation while typing
* [http://www.qdecoder.org/goto/qHash.html qDecoder's C/C++ hash functions] — opensource library


Wikimedia Foundation. 2010.

Look at other dictionaries:

  • hash function —    A function that maps a data item to a numeric value by use of a transformation. A hash function can convert a number that has meaning to a user, such as a key or other identifier, into a value for the location of that data in a structure such… …   Dictionary of networking

  • hash function — maišos funkcija statusas T sritis informatika apibrėžtis Funkcija, iš duomenų ↑įrašo arba jo ↑rakto (1), apskaičiuojanti ↑maišos reikšmę. Funkcijos algoritmas priklauso nuo to, kam bus naudojama jos reikšmė. Pavyzdžiui, jeigu projektuojama įrašų… …   Enciklopedinis kompiuterijos žodynas

  • hash function — one way function (not reversible) that changes input data in to a unique digital code (also used in information security as a digital signature that identifies the sender and verifies the contents of the message) …   English contemporary dictionary

  • hash function — noun an algorithm that generates a numeric, or fixed size character output from a variable sized piece of text or other data; used in database table queries, cryptography and in error checking …   Wiktionary

  • Cryptographic hash function — A cryptographic hash function (specifically, SHA 1) at work. Note that even small changes in the source input (here in the word over ) drastically change the resulting output, by the so called avalanche effect. A cryptographic hash function is a… …   Wikipedia

  • NIST hash function competition — Cryptography portal The NIST hash function competition is an open competition held by the US National Institute of Standards and Technology for a new SHA 3 function to replace the older SHA 1 and SHA 2, which was formally announced in the Federal …   Wikipedia

  • Perfect hash function — A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no… …   Wikipedia

  • NIST hash function competition — La NIST hash function competition est une compétition organisée par la NIST afin de trouver une nouvelle fonction de hachage (SHA 3) destinée à remplacer les anciennes fonctions SHA 1 et SHA 2. Sommaire 1 Participants 1.1 Finalistes 1.2 …   Wikipédia en Français

  • GOST (hash function) — The GOST hash function, defined in the standards GOST R 34.11 94 and GOST 34.311 95, is a 256 bit cryptographic hash function. It was initially defined in the Russia s national standard GOST R 34.11 94 Information Technology Cryptographic… …   Wikipedia

  • The Hash Function Hamsi — Содержание 1 Краткое описание 2 История 3 The Hash Funtion Hamsi 3.1 Общая структу …   Википедия

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”