Needleman-Wunsch algorithm

Needleman-Wunsch algorithm

The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul Needleman and Christian Wunschcite journal | journal=J Mol Biol | volume=48 | issue=3 | pages=443-53 | date=1970 | author=Needleman SB, Wunsch CD. | title=A general method applicable to the search for similarities in the amino acid sequence of two proteins | url=http://linkinghub.elsevier.com/retrieve/pii/0022-2836(70)90057-4 | pmid=5420325 | doi = 10.1016/0022-2836(70)90057-4 ] .

The Needleman–Wunsch algorithm is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.

Scores for aligned characters are specified by a similarity matrix. Here, S(i, j) is the similarity of characters i and j. It uses a linear gap penalty, here called d.

For example, if the similarity matrix was

-AGCT
A10-1-3-4
G-17-5-3
C-3-590
T-4-308
then the alignment: AGACTAGTTAC CGA---GACGTwith a gap penalty of -5, would have the following score... S(A,C) + S(G,G) + S(A,A) + 3 imes d + S(G,G) + S(T,A) + S(T,C) + S(A,G) + S(C,T) = -3 + 7 + 10 - 3 imes 5 + 7 + -4 + 0 + -1 + 0 = 1

To find the alignment with the highest score, a two-dimensional array (or matrix) is allocated. This matrix is often called the F matrix, and its (i,j)th entry is often denoted F_{ij} There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the running time of the algorithm is O(nm) and the amount of memory used is in O(nm). (However, there is a modified version of the algorithm which uses only O(m + n) space, at the cost of a higher running time. This modification is in fact a general technique which applies to many dynamic programming algorithms; this method was introduced in Hirschberg's algorithm for solving the longest common subsequence problem.)

As the algorithm progresses, the F_{ij} will be assigned to be the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows. Basis: F_{0j} = d*j F_{i0} = d*i Recursion, based on the principle of optimality: F_{ij} = max(F_{i-1,j-1} + S(A_i, B_j), F_{i,j-1} + d, F_{i-1,j} + d)

The pseudo-code for the algorithm to compute the F matrix therefore looks like this (the sequence indexes start at 1, the F array starts at 0 to include the boundary values defined above): for i=0 to length(A) F(i,0) <- d*i for j=0 to length(B) F(0,j) <- d*j for i=1 to length(A) for j = 1 to length(B) { Choice1 <- F(i-1,j-1) + S(A(i), B(j)) Choice2 <- F(i-1, j) + d Choice3 <- F(i, j-1) + d F(i,j) <- max(Choice1, Choice2, Choice3) }Once the F matrix is computed, note that the bottom right hand corner of the matrix is the maximum score for any alignments. To compute which alignment actually gives this score, you can start from the bottom right cell, and compare the value with the three possible sources(Choice1, Choice2, and Choice3 above) to see which it came from. If Choice1, then A(i) and B(i) are aligned, if Choice2, then A(i) is aligned with a gap, and if Choice3, then B(i) is aligned with a gap. (In general several choices may have the same value, leading to alternative optimal alignments.) AlignmentA <- "" AlignmentB <- "" i <- length(A) j <- length(B) while (i > 0 AND j > 0) { Score <- F(i,j) ScoreDiag <- F(i - 1, j - 1) ScoreUp <- F(i, j - 1) ScoreLeft <- F(i - 1, j) if (Score = ScoreDiag + S(A(i), B(j))) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- B(j-1) + AlignmentB i <- i - 1 j <- j - 1 } else if (Score = ScoreLeft + d) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- "-" + AlignmentB i <- i - 1 } otherwise (Score = ScoreUp + d) { AlignmentA <- "-" + AlignmentA AlignmentB <- B(j-1) + AlignmentB j <- j - 1 } } while (i > 0) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- "-" + AlignmentB i <- i - 1 } while (j > 0) { AlignmentA <- "-" + AlignmentA AlignmentB <- B(j-1) + AlignmentB j <- j - 1 }

References

External links

* [http://www.bigbold.com/snippets/posts/show/2199 Needleman-Wunsch Algorithm as Ruby Code]
* [http://www25.brinkster.com/denshade/NeedlemanWunsch.java.htm Java Implementation of the Needleman-Wunsch Algorithm]
* [http://baba.sourceforge.net/ B.A.B.A.] &mdash; an applet (with source) which visually explains the algorithm.
* [http://www.ludwig.edu.au/course/lectures2005/Likic.pdf A clear explanation of NW and its applications to sequence alignment]

ee also

*Smith-Waterman algorithm
*BLAST
*Levenshtein distance
*Dynamic time warping


Wikimedia Foundation. 2010.

Игры ⚽ Нужно решить контрольную?

Look at other dictionaries:

  • Needleman–Wunsch algorithm — The Needleman–Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and… …   Wikipedia

  • Algorithme de Needleman-Wunsch — L algorithme de Needleman Wunsch effectue un alignement global maximal de deux chaînes de caractères (appelées ici A et B). Il est couramment utilisé en bio informatique pour aligner des séquences de protéines ou de nucléotides. L algorithme a… …   Wikipédia en Français

  • Hirschberg's algorithm — Hirschberg s algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the least cost sequence alignment between two strings, where cost is measured as Levenshtein edit distance, defined to be the sumof… …   Wikipedia

  • Smith-Waterman algorithm — The Smith Waterman algorithm is a well known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith Waterman… …   Wikipedia

  • Saul Needleman — Saul B. Needleman is a bioinformaticist. Along with Christian Wunsch, in 1970 he published A general method applicable to the search for similarities in the amino acid sequence of two proteins , which introduced the Needleman Wunsch algorithm.… …   Wikipedia

  • Christian Wunsch — Christian D. Wunsch is a bioinformaticist. Along with Saul Needleman, in 1970 he published A general method applicable to the search for similarities in the amino acid sequence of two proteins , which introduced the Needleman Wunsch algorithm.… …   Wikipedia

  • Алгоритм Нидлмана — Алгоритм Нидлмана  Вунша  это алгоритм для выполнения выравнивания двух последовательностей (будем называть их и ), который используется в биоинформатике при построении выравниваний аминокислотных или нуклеотидных последовательностей.… …   Википедия

  • Sequence alignment — In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1]… …   Wikipedia

  • List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”