External sorting


External sorting is a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted does not fit into the main memory of a computing device (usually RAM) and must instead reside in a slower kind of memory (usually a hard drive).

With a careful implementation, external sorting can be done in place (with no additional disk space required).

External mergesort

One example of external sorting is the external mergesort algorithm. For example, to sort 900 megabytes of data using only 100 megabytes of RAM:
# Read 100 MB of the data into main memory and sort it by some conventional method (usually quicksort).
# Write the sorted data to disk.
# Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (9 chunks in total), which now need to be merged into a single output file.
# Read the first 10 MB of each sorted chunk into input buffers in main memory (90 MB total) and allocate the remaining 10 MB for an output buffer.
# Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer is full, write it to the final sorted file and empty it. Whenever any of the 9 input buffers becomes empty, refill it with the next 10 MB of its associated 100 MB sorted chunk; if the chunk has no more data, mark the buffer as exhausted and exclude it from the merge.

This algorithm can be generalized by assuming that the amount of data to be sorted exceeds the available memory by a factor of "K". Then "K" chunks of data need to be sorted and a "K"-way merge has to be completed. If "X" is the amount of main memory available, there will be "K" input buffers and 1 output buffer of size "X"/("K"+1) each. Depending on various factors (how fast the hard drive is, what the value of "K" is), better performance can be achieved if the output buffer is made larger (for example, twice as large as one input buffer).
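
The two phases described above (sorting memory-sized chunks, then a "K"-way merge) can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the function name external_sort and the parameter chunk_lines are hypothetical, the input is assumed to be a text file with one integer per line and no blank lines, and the explicit 10 MB buffers of the example are replaced by Python's default file buffering.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(input_path, output_path, chunk_lines=100_000):
    """Sort a text file of one integer per line using bounded memory.

    chunk_lines (an assumed parameter) stands in for "what fits in RAM".
    Returns the number of sorted chunks written, i.e. the K of the merge.
    """
    chunk_paths = []
    try:
        # Phase 1: read one chunk at a time, sort it in memory,
        # and write it to its own temporary file.
        with open(input_path) as src:
            while True:
                chunk = [int(line) for line in islice(src, chunk_lines)]
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(f"{value}\n" for value in chunk)
                chunk_paths.append(path)

        # Phase 2: K-way merge. Each chunk is read lazily, so only the
        # file buffers and one pending value per chunk are held in memory.
        runs = [open(path) for path in chunk_paths]
        try:
            streams = [(int(line) for line in run) for run in runs]
            with open(output_path, "w") as dst:
                dst.writelines(
                    f"{value}\n" for value in heapq.merge(*streams)
                )
        finally:
            for run in runs:
                run.close()
    finally:
        # Remove the temporary chunk files whether or not the sort succeeded.
        for path in chunk_paths:
            os.remove(path)
    return len(chunk_paths)
```

Here heapq.merge performs the "K"-way merge by repeatedly taking the smallest pending value across all chunks, and an exhausted chunk simply drops out of the merge, which corresponds to marking its input buffer as exhausted in step 5.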

In the example, a single-pass merge was used. If the ratio of data to available main memory is particularly large, a multi-pass sort is preferable. For example, merge only the first half of the sorted chunks, then the other half; the problem is then reduced to merging just two sorted chunks. The exact number of passes depends on this ratio, as well as on the physical characteristics of the hard drive (transfer rate and seek time). As a rule of thumb, it is inadvisable to perform a merge of more than 20 to 30 ways in a single pass. [citation needed]
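
The multi-pass idea can be illustrated with a short Python sketch. It is only a model: in-memory lists stand in for sorted chunks on disk, and the names merge_in_passes and max_fan_in are hypothetical. Each pass merges groups of at most max_fan_in chunks, and passes repeat until a single chunk remains.

```python
import heapq


def merge_in_passes(runs, max_fan_in):
    """Merge sorted runs, combining at most max_fan_in runs at a time.

    Returns the fully merged list and the number of passes performed.
    """
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    runs = [list(run) for run in runs]
    passes = 0
    while len(runs) > 1:
        # One pass: merge each group of up to max_fan_in neighbouring runs.
        runs = [
            list(heapq.merge(*runs[i:i + max_fan_in]))
            for i in range(0, len(runs), max_fan_in)
        ]
        passes += 1
    return (runs[0] if runs else []), passes
```

With 9 sorted chunks and max_fan_in set to 5, the first pass leaves two chunks (one merged from five, one from four) and the second pass merges those two, which mirrors the "first half, then the other half" example above. A smaller fan-in gives each input buffer more memory but costs additional passes over the data.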

External links

* [http://www.softpanorama.org/Tools/sort.shtml A description of the unix 'Sort' command]
* [http://cis.stvincent.edu/html/tutorials/swd/extsort/extsort.html An external mergesort example]
* [http://sourceforge.net/projects/kwaymerge A K-Way Merge Implementation]

References

* Donald Knuth. "The Art of Computer Programming", Volume 3: "Sorting and Searching", Second Edition. Addison-Wesley, 1998. ISBN 0-201-89685-0. Section 5.4: External Sorting, pp.248–379.
* Ellis Horowitz and Sartaj Sahni. "Fundamentals of Data Structures", H. Freeman & Co. ISBN 0-716-78042-9.


Wikimedia Foundation. 2010.
