Sequence assembly

Sequence assembly: In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. Besides the confusion introduced by shredding the book, the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.

Contents

1 Genome assemblers

2 EST assemblers

3 De-novo vs. mapping assembly

4 Influence of technological changes

5 Greedy algorithm

6 Available assemblers

7 See also

8 References

Genome assemblers

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids to bacteria and finally eukaryotes), the assembly programs used in these genome projects needed to increasingly employ more and more sophisticated strategies to handle:

terabytes of sequencing data which need processing on computing clusters;

identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;

and errors in the fragments from the sequencing instruments, which can confound assembly.

Faced with the challenge of assembling the first larger eukaryotic genomes, the fruit fly Drosophila melanogaster, in 2000 and the human genome just a year later, scientists developed assemblers like Celera Assembler^[1] and Arachne^[2] able to handle genomes of 100-300 million base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS^[3] was launched to bring together all the innovations in genome assembly technology under the open source framework.

EST assemblers

EST assembly differs from genome assembly in several ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent only a subset of the whole genome. At a first glance, underlying algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, mainly in the inter-genic parts. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed in very high amounts (housekeeping genes), which again leads to the problem of similar sequences present in high amounts in the data set to be assembled.

Furthermore, genes sometimes overlap in the genome (sense-antisense transcription), and should ideally still be assembled separately. EST assembly is also complicated by features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphism, recoding, and post-transcriptional modification.

De-novo vs. mapping assembly

In sequence assembly, two different types can be distinguished:

de-novo: assembling reads together so that they form a new, previously unknown sequence

mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm need to compare every read with every other read (an operation that has a complexity of O(n²) but can be reduced to O(n log(n))^{[citation needed]}. Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as template (perhaps with the names of the main characters and a few locations changed), the de-novo assemblies are more hardcore in a sense as one would not know beforehand whether this would become a science book, or a novel, or a catalogue etc.

Influence of technological changes

The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.

In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.

In 1975, the Dideoxy termination method (also known as Sanger sequencing) was invented and until shortly after 2000, the technology was improved up to a point were fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads

are about 800–900 bases long

contain sequencing artifacts like sequencing and cloning vectors

have error rates between 0.5 and 10%

With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger ones like the human genome with approximately 35 million reads needed already large computing farms and distributed computing.

By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing methods generated reads much shorter than from Sanger sequencing: initially about 100 bases, now 400-500 bases. However, due to the much higher throughput and lower cost than Sanger sequencing, the adoption of this technology by genome centers pushed development of sequence assemblers to deal with this new type of sequences. The sheer amount of data coupled with technology specific error patterns in the reads delayed development of assemblers, at the beginning in 2004 only the Newbler assembler from 454 was available. Presented in mid 2007,^[4] the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler who could assemble 454 reads and mixtures of 454 reads and Sanger reads; using sequences from different sequencing technologies was subsequently coined hybrid assembly.

Since 2006, the Illumina (previously Solexa) technology is available and able to generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina initially was limited to a length of only 36 bases, making it less suitable for de novo assembly, but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3-400bp clone. Presented by the end of 2007, the SHARCGS assembler^[5] by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads, quickly followed by a number of others.

Later, new technologies like SOLiD from Applied Biosystems are released, and new technologies (e.g. IonTorrent, PacBio) continue to emerge at a rapid rate.

Greedy algorithm

Given a set of sequence fragments the object is to find the Shortest common supersequence.

calculate pairwise alignments of all fragments

choose two fragments with the largest overlap

merge chosen fragments

repeat step 2. and 3. until only one fragment is left

The result is a suboptimal solution to the problem.

Available assemblers

The following table lists assemblers that have a de-novo assembly capability on at least one of the supported technologies.^[6]

Name Type Technologies Author Presented /
Last updated
Licence* Homepage

ABySS (large) genomes Solexa, SOLiD Simpson, J. et al. 2008 / 2011 OS link

ALLPATHS-LG (large) genomes Solexa, SOLiD Gnerre, S. et al. 2011 OS link

AMOS genomes Sanger, 454 Salzberg, S. et al. 2002? / 2008? OS link

Celera WGA Assembler / CABOG (large) genomes Sanger, 454, Solexa Myers, G. et al.; Miller G. et al. 2004 / 2010 OS link

CLC Genomics Workbench genomes Sanger, 454, Solexa, SOLiD CLC bio 2008 / 2010 C link

Cortex genomes Solexa, SOLiD Iqbal, Z. et al. 2011 OS link

DNA Dragon genomes Illumina, SOLiD, Complete Genomics, 454, Sanger SequentiX 2011 C link

DNAnexus genomes Illumina, SOLiD, Complete Genomics DNAnexus 2011 C link

Edena genomes Solexa D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. 2008 C link

Euler genomes Sanger, 454 (,Solexa ?) Pevzner, P. et al. 2001 / 2006? (C / NC-A?) link

Euler-sr genomes 454, Solexa Chaisson, MJ. et al. 2008 NC-A link

Forge (large) genomes, EST, metagenomes 454, Solexa , SOLID, Sanger Platt, DM, Evers, D. 2010 OS link

Geneious genomes Sanger, 454, Solexa Biomatters Ltd 2009 / 2010 C link

Graph Constructor (large) genomes Sanger, 454, Solexa, SOLiD Convey Computer Corporation 2011 C link

IDBA (Iterative De Bruijn graph short read Assembler) (large) genomes Sanger,454,Solexa Yu Peng, Henry C. M. Leung, Siu-Ming Yiu, Francis Y. L. Chin 2010 (C / NC-A?) link

MIRA (Mimicking Intelligent Read Assembly) genomes, ESTs Sanger, 454, Solexa Chevreux, B. 1998 / 2011 OS link

NextGENe (small genomes?) 454, Solexa, SOLiD Softgenetics 2008 C link

Newbler genomes, ESTs 454, Sanger 454/Roche 2009 C link

PASHA (large) genomes Illumina Liu, Schmidt, Maskell 2011 OS link

Phrap genomes Sanger, 454 Green, P. 2002 / 2003 / 2008 C / NC-A link

TIGR Assembler genomic Sanger - 1995 / 2003 OS link

Ray^[7] genomes Illumina, mix of Illumina and 454, paired or not Sébastien Boisvert, François Laviolette & Jacques Corbeil. 2010 OS [GNU General Public License] link

Sequencher genomes traditional and next generation sequence data Gene Codes Corporation 1991 / 2009 / 2011 C link

SeqMan NGen (large) genomes, exomes, transcriptomes, metagenomes, ESTs Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger DNASTAR 2007 / 2011 C link

SHARCGS (small) genomes Solexa Dohm et al. 2007 / 2007 OS link

SOPRA genomes Illumina, SOLiD, Sanger, 454 Dayarian, A. et al. 2010 / 2011 OS link

SSAKE (small) genomes Solexa (SOLiD? Helicos?) Warren, R. et al. 2007 / 2007 OS link

SOAPdenovo genomes Solexa Li, R. et al. 2009 / 2009 Closed link

Staden gap4 package BACs (, small genomes?) Sanger Staden et al. 1991 / 2008 OS link

Taipan (small) genomes Illumina Schmidt, B. et al. 2009 OS link

VCAKE (small) genomes Solexa (SOLiD?, Helicos?) Jeck, W. et al. 2007 / 2007 OS link

Phusion assembler (large) genomes Sanger Mullikin JC, et al. 2003 OS link

Quality Value Guided SRA (QSRA) genomes Sanger, Solexa Bryant DW, et al. 2009 OS link

Velvet (small) genomes Sanger, 454, Solexa, SOLiD Zerbino, D. et al. 2007 / 2009 OS link

*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics; Brackets = unclear, but most likely C / NC-A

See also

Sequence alignment

Genome assembly

GenomeABC : A server for Benchmarking of Genome Assemblers.

References

^ Myers EW, Sutton GG, Delcher AL, et al. (March 2000). "A whole-genome assembly of Drosophila". Science 287 (5461): 2196–204. doi:10.1126/science.287.5461.2196. PMID 10731133. http://www.sciencemag.org/cgi/pmidlookup?view=long&pmid=10731133.

^ Batzoglou S, Jaffe DB, Stanley K, et al. (January 2002). "ARACHNE: a whole-genome shotgun assembler". Genome Res. 12 (1): 177–89. doi:10.1101/gr.208902. PMC 155255. PMID 11779843. http://www.genome.org/cgi/pmidlookup?view=long&pmid=11779843.

^ AMOS page with links to various papers

^ Copy in Google groups of the post announcing MIRA 2.9.8 hybrid version in the bionet.software Usenet group

^ Dohm JC, Lottaz C, Borodina T, Himmelbauer H (November 2007). "SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing". Genome Res. 17 (11): 1697–706. doi:10.1101/gr.6435207. PMC 2045152. PMID 17908823. http://www.genome.org/cgi/pmidlookup?view=long&pmid=17908823.

^ list of software including mapping assemblers in the SeqAnswers discussion forum.

^ Boisvert S, Laviolette F, Corbeil J. (October 2010). "Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.". J Comput Biol. 17 (11): 1519–33. doi:10.1089/cmb.2009.0238. PMID 20958248. http://www.liebertonline.com/doi/abs/10.1089/cmb.2009.0238.

Categories:
Bioinformatics

Name	Type	Technologies	Author	Presented / Last updated	Licence*	Homepage
ABySS	(large) genomes	Solexa, SOLiD	Simpson, J. et al.	2008 / 2011	OS	link
ALLPATHS-LG	(large) genomes	Solexa, SOLiD	Gnerre, S. et al.	2011	OS	link
AMOS	genomes	Sanger, 454	Salzberg, S. et al.	2002? / 2008?	OS	link
Celera WGA Assembler / CABOG	(large) genomes	Sanger, 454, Solexa	Myers, G. et al.; Miller G. et al.	2004 / 2010	OS	link
CLC Genomics Workbench	genomes	Sanger, 454, Solexa, SOLiD	CLC bio	2008 / 2010	C	link
Cortex	genomes	Solexa, SOLiD	Iqbal, Z. et al.	2011	OS	link
DNA Dragon	genomes	Illumina, SOLiD, Complete Genomics, 454, Sanger	SequentiX	2011	C	link
DNAnexus	genomes	Illumina, SOLiD, Complete Genomics	DNAnexus	2011	C	link
Edena	genomes	Solexa	D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel.	2008	C	link
Euler	genomes	Sanger, 454 (,Solexa ?)	Pevzner, P. et al.	2001 / 2006?	(C / NC-A?)	link
Euler-sr	genomes	454, Solexa	Chaisson, MJ. et al.	2008	NC-A	link
Forge	(large) genomes, EST, metagenomes	454, Solexa , SOLID, Sanger	Platt, DM, Evers, D.	2010	OS	link
Geneious	genomes	Sanger, 454, Solexa	Biomatters Ltd	2009 / 2010	C	link
Graph Constructor	(large) genomes	Sanger, 454, Solexa, SOLiD	Convey Computer Corporation	2011	C	link
IDBA (Iterative De Bruijn graph short read Assembler)	(large) genomes	Sanger,454,Solexa	Yu Peng, Henry C. M. Leung, Siu-Ming Yiu, Francis Y. L. Chin	2010	(C / NC-A?)	link
MIRA (Mimicking Intelligent Read Assembly)	genomes, ESTs	Sanger, 454, Solexa	Chevreux, B.	1998 / 2011	OS	link
NextGENe	(small genomes?)	454, Solexa, SOLiD	Softgenetics	2008	C	link
Newbler	genomes, ESTs	454, Sanger	454/Roche	2009	C	link
PASHA	(large) genomes	Illumina	Liu, Schmidt, Maskell	2011	OS	link
Phrap	genomes	Sanger, 454	Green, P.	2002 / 2003 / 2008	C / NC-A	link
TIGR Assembler	genomic	Sanger	-	1995 / 2003	OS	link
Ray^[7]	genomes	Illumina, mix of Illumina and 454, paired or not	Sébastien Boisvert, François Laviolette & Jacques Corbeil.	2010	OS [GNU General Public License]	link
Sequencher	genomes	traditional and next generation sequence data	Gene Codes Corporation	1991 / 2009 / 2011	C	link
SeqMan NGen	(large) genomes, exomes, transcriptomes, metagenomes, ESTs	Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger	DNASTAR	2007 / 2011	C	link
SHARCGS	(small) genomes	Solexa	Dohm et al.	2007 / 2007	OS	link
SOPRA	genomes	Illumina, SOLiD, Sanger, 454	Dayarian, A. et al.	2010 / 2011	OS	link
SSAKE	(small) genomes	Solexa (SOLiD? Helicos?)	Warren, R. et al.	2007 / 2007	OS	link
SOAPdenovo	genomes	Solexa	Li, R. et al.	2009 / 2009	Closed	link
Staden gap4 package	BACs (, small genomes?)	Sanger	Staden et al.	1991 / 2008	OS	link
Taipan	(small) genomes	Illumina	Schmidt, B. et al.	2009	OS	link
VCAKE	(small) genomes	Solexa (SOLiD?, Helicos?)	Jeck, W. et al.	2007 / 2007	OS	link
Phusion assembler	(large) genomes	Sanger	Mullikin JC, et al.	2003	OS	link
Quality Value Guided SRA (QSRA)	genomes	Sanger, Solexa	Bryant DW, et al.	2009	OS	link
Velvet	(small) genomes	Sanger, 454, Solexa, SOLiD	Zerbino, D. et al.	2007 / 2009	OS	link
*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics; Brackets = unclear, but most likely C / NC-A

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

Assembly — may refer to:Biology* Sequence assembly * Genome assembly * Neuronal assembly and Cell assembly theoryPolitics* Deliberative assembly ** General Assembly ** House of Assembly ** Legislative Assembly ** Roman assemblies ** National Assembly **… … Wikipedia
Sequence-tagged site — A sequence tagged site (or STS) is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known.STSs can be easily detected by the polymerase chain reaction (PCR) using… … Wikipedia
Assembly language — See the terminology section below for information regarding inconsistent use of the terms assembly and assembler. Motorola MC6800 Assembly Language An assembly language is a low level programming language for computers, microprocessors,… … Wikipedia
Assembly of the International Space Station — An artist s impression of the International Space Station s configuration as of March 2011 … Wikipedia
Sequence alignment — In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1]… … Wikipedia
assembly language — a computer language most of whose expressions are symbolic equivalents of the machine language instructions of a particular computer. [1960 65] * * * Type of low level computer programming language consisting mostly of symbolic equivalents of a… … Universalium
Assembly, DNA — The process of putting fragments of DNA that have been sequenced into their correct chromosomal positions. The pieces of DNA are assembled to reconstitute the sequence of the chromosome from which they came … Medical dictionary
assembly line — assem′bly line n. an arrangement of machines, tools, and workers in which a product is assembled in a particular sequence as it is moved along a direct line or route • Etymology: 1910–15, amer … From formal English to slang
Just in Sequence — (JIS) is an inventory strategy that is at once Just In Time and in sequence, that is, component parts arrive at a production line moments before they are needed and in the sequence in which they will be used. The manufacturing sequence signal is… … Wikipedia
Polymerase cycling assembly — (or PCA) is a method for the assembly of large DNA oligonucleotides from shorter fragments. The process uses the same technology as PCR, but takes advantage of DNA hybridization and annealing as well as DNA polymerase to amplify a complete… … Wikipedia

Academic Dictionaries and Encyclopedias

Sequence assembly

Contents

Genome assemblers

EST assemblers

De-novo vs. mapping assembly

Influence of technological changes

Greedy algorithm

Available assemblers

See also

References

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Sequence assembly

Contents

Genome assemblers

EST assemblers

De-novo vs. mapping assembly

Influence of technological changes

Greedy algorithm

Available assemblers

See also

References

Look at other dictionaries:

Share the article and excerpts

Direct link