Here is a list of the most commonly used assemblers for genomic reads.
The list is extensive but by no means complete. I will try to update it as soon as I come
across a new one. Please help me keep the list up to date if you come across any new
and interesting assembler I have missed.
ABySS: http://www.bcgsc.ca/downloads/abyss/ : Widespread
adoption of massively parallel deoxyribonucleic acid (DNA) sequencing
instruments has prompted the recent development of de novo short read assembly
algorithms. A common shortcoming of the available tools is their inability to
efficiently assemble vast amounts of data generated from large-scale sequencing
projects, such as the sequencing of individual human genomes to catalog natural
genetic variation. To address this limitation, we developed ABySS (Assembly By
Short Sequences), a parallelized sequence assembler. As a demonstration of the
capability of our software, we assembled 3.5 billion paired-end reads from the
genome of an African male publicly released by Illumina, Inc. Approximately
2.76 million contigs ≥100 base pairs (bp) in length were created with an N50
size of 1499 bp, representing 68% of the reference human genome. Analysis of
these contigs identified polymorphic and novel sequences not present in the
human reference assembly, which were validated by alignment to alternate human
assemblies and to other primate genomes. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2694472/
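Several of the abstracts in this list report an N50 size. As a quick illustration, here is a minimal sketch of how N50 is commonly computed: the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases. The function name and the example lengths are made up for illustration and do not come from any of the assemblers listed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length reaches half the total
        if running >= half:
            return length
    return 0


# Hypothetical contig lengths, for illustration only
contig_lengths = [100, 200, 300, 400, 500]
print(n50(contig_lengths))
```

Note that individual tools may apply a minimum contig length before computing the statistic, so reported values are not always directly comparable.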
SOAPdenovo: http://soap.genomics.org.cn/ : Next-generation massively parallel DNA
sequencing technologies provide ultrahigh throughput at a substantially lower
unit data cost; however, the data are very short read length sequences, making
de novo assembly extremely challenging. Here, we describe a novel method for de
novo assembly of large genomes from short read sequences. We successfully
assembled both the Asian and African human genome sequences, achieving an N50
contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb,
respectively. The development of this de novo short read assembly method creates
new opportunities for building reference sequences and carrying out accurate
analyses of unexplored genomes in a cost-effective way. http://genome.cshlp.org/content/early/2009/12/16/gr.097261.109
Velvet: http://www.ebi.ac.uk/~zerbino/velvet : We have developed a new set of algorithms,
collectively called “Velvet,” to manipulate de Bruijn graphs for genomic
sequence assembly. A de Bruijn graph is a compact representation based on short
words (k-mers) that is ideal for high coverage, very short read
(25–50 bp) data sets. Applying Velvet to very short reads and paired-ends
information only, one can produce contigs of significant length, up to 50-kb
N50 length in simulations of prokaryotic data and 3-kb N50 on simulated
mammalian BACs. When applied to real Solexa data sets without read pairs,
Velvet generated contigs of ∼8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement
with our simulated results without read-pair information. Velvet represents a
new approach to assembly that can leverage very short reads in combination with
read pairs to produce useful assemblies. http://genome.cshlp.org/content/18/5/821.short
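To make the de Bruijn graph idea a bit more concrete, here is a minimal textbook-style sketch, not Velvet's actual data structure. It assumes the common formulation in which nodes are (k-1)-mers and every k-mer in a read adds an edge from its prefix to its suffix; reverse complements and error handling are ignored. The reads and the value of k are made up.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two hypothetical overlapping reads
reads = ["ACGTC", "CGTCA"]
for node, successors in sorted(de_bruijn(reads, 3).items()):
    print(node, "->", sorted(successors))
```

Because overlapping reads share k-mers, they collapse onto the same edges, which is why the representation stays compact at high coverage.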
ALLPATHS-LG: ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/: Massively parallel DNA sequencing technologies
are revolutionizing genomics by making it possible to generate billions of
relatively short (~100-base) sequence reads at very low cost. Whereas such data
can be readily used for a wide range of biomedical applications, it has proven
difficult to use them to generate high-quality de novo genome assemblies of large,
repeat-rich vertebrate genomes. To date, the genome assemblies generated from
such data have fallen far short of those obtained with the older (but much more
expensive) capillary-based sequencing approach. Here, we report the development
of an algorithm for genome assembly, ALLPATHS-LG, and its application to
massively parallel DNA sequence data from the human and mouse genomes,
generated on the Illumina platform. The resulting draft genome assemblies have
good accuracy, short-range contiguity, long-range connectivity, and coverage of
the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold
sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those
obtained with capillary-based sequencing. The combination of improved sequencing
technology and improved computational methods should now make it possible to
increase dramatically the de novo sequencing of large genomes.
The ALLPATHS-LG
program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd. http://www.pnas.org/content/108/4/1513.short
Bambus2: http://amos.sf.net: Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources. Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly. Availability: Bambus 2 is open source and available from http://amos.sf.net. http://bioinformatics.oxfordjournals.org/content/27/21/2964.short
Newbler: http://454.com/contact-us/software-request.asp : In the last year, high-throughput sequencing
technologies have progressed from proof-of-concept to production quality. While
these methods produce high-quality reads, they have yet to produce reads
comparable in length to Sanger-based sequencing. Current fragment assembly
algorithms have been implemented and optimized for mate-paired Sanger-based
reads, and thus do not perform well on short reads produced by short read
technologies. We present a new Eulerian assembler that generates nearly optimal
short read assemblies of bacterial genomes and describe an approach to assemble
reads in the case of the popular hybrid protocol when short and long
Sanger-based reads are combined. http://genome.cshlp.org/content/18/2/324.full
MIRA: http://www.chevreux.org/mira_downloads.html : We present an EST sequence assembler that
specializes in reconstruction of pristine mRNA transcripts, while at the same
time detecting and classifying single nucleotide polymorphisms (SNPs) occurring
in different variations thereof. The assembler uses iterative multipass strategies
centered on high-confidence regions within sequences and has a fallback
strategy for using low-confidence regions when needed. It features special
functions to assemble high numbers of highly similar sequences without prior
masking, an automatic editor that edits and analyzes alignments by inspecting
the underlying traces, and detection and classification of sequence properties
like SNPs with a high specificity and a sensitivity down to one mutation per
sequence. In addition, it includes possibilities to use incorrectly
preprocessed sequences, routines to make use of additional sequencing
information such as base-error probabilities, template insert sizes, strain
information, etc., and functions to detect and resolve possible misassemblies.
The assembler is routinely used for such various tasks as mutation detection in
different cell types, similarity analysis of transcripts between organisms, and
pristine assembly of sequences from various sources for oligo design in
clinical microarray experiments. http://genome.cshlp.org/content/14/6/1147
Euler-USR: http://euler-assembler.ucsd.edu/portal/ : Increasing read length is currently viewed as
the crucial condition for fragment assembly with next-generation sequencing
technologies. However, introducing mate-paired reads (separated by a gap of
length, GapLength) opens a possibility to transform short mate-pairs into long
mate-reads of length ≈ GapLength, and thus raises the question as to whether
the read length (as opposed to GapLength) even matters. We describe a new tool,
EULER-USR, for assembling mate-paired short reads and use it to analyze the question
of whether the read length matters. We further complement the ongoing
experimental efforts to maximize read length by a new computational approach
for increasing the effective read length. While the common practice is to trim
the error-prone tails of the reads, we present an approach that substitutes
trimming with error correction using repeat graphs. An important and
counterintuitive implication of this result is that one may extend sequencing
reactions that degrade with length “past their prime” to where the error rate
grows above what is normally acceptable for fragment assembly. http://genome.cshlp.org/content/19/2/336.full
Celera Assembler: http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-7.0/
Minia: http://minia.genouest.org/files/minia-1.6088.tar.gz : Minia is a short-read
assembler based on a de Bruijn graph, capable of assembling a human genome on a
desktop computer in a day. The output of Minia is a set of contigs. Minia
produces results of similar contiguity and accuracy to other de Bruijn
assemblers (e.g. Velvet). http://minia.genouest.org/files/minia.pdf
Ray: http://sourceforge.net/projects/denovoassembler/files/ : An accurate genome sequence of a
desired species is now a pre-requisite for genome research. An important step in
obtaining a high-quality genome sequence is to correctly assemble short reads
into longer sequences accurately representing contiguous genomic regions.
Current sequencing technologies continue to offer increases in throughput, and
corresponding reductions in cost and time. Unfortunately, the benefit of
obtaining a large number of reads is complicated by sequencing errors, with
different biases being observed with each platform. Although software is
available to assemble reads for each individual system, no procedure has been
proposed for high-quality simultaneous assembly based on reads from a mix of
different technologies. In this paper, we describe a parallel short-read
assembler, called Ray, which has been developed to assemble reads obtained from
a combination of sequencing platforms. We compared its performance to other
assemblers on simulated and real datasets. We used a combination of Roche/454
and Illumina reads to assemble three different genomes. We showed that mixing
sequencing technologies systematically reduces the number of contigs and the
number of errors. Because of its open nature, this new tool will hopefully
serve as a basis to develop an assembler that can be of universal utilization
(availability: http://deNovoAssembler.sf.Net/).
For online Supplementary Material, see www.liebertonline.com.
http://online.liebertpub.com/doi/abs/10.1089/cmb.2009.0238
Edena: http://www.genomic.ch/edena : Novel high-throughput DNA sequencing technologies
allow researchers to characterize a bacterial genome during a single experiment
and at a moderate cost. However, the increase in sequencing throughput that is
allowed by using such platforms is obtained at the expense of individual
sequence read length, which must be assembled into longer contigs to be
exploitable. This study focuses on the Illumina sequencing platform that
produces millions of very short sequences that are 35 bases in length. We
propose a de novo assembler software that is dedicated to process such data.
Based on a classical overlap graph representation and on the detection of
potentially spurious reads, our software generates a set of accurate contigs of
several kilobases that cover most of the bacterial genome. The assembly results
were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their
published genomes acquired by conventional sequencing of 1.5- to 3.0-kb
fragments. We also provide indications that the broad coverage achieved by
high-throughput sequencing might allow for the detection of clonal
polymorphisms in the set of DNA molecules being sequenced. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2336802/
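The overlap graph mentioned above can be sketched in a few lines. This is a naive all-against-all version for illustration only, assuming exact suffix-prefix overlaps and a hypothetical `min_overlap` parameter; Edena's real implementation and its handling of spurious reads are not reproduced here.

```python
def overlap_graph(reads, min_overlap):
    """Return {(a, b): overlap_length} for reads where a suffix of a equals a prefix of b."""
    edges = {}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            # Try the longest possible proper overlap first
            for olen in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a.endswith(b[:olen]):
                    edges[(a, b)] = olen
                    break
    return edges


# Hypothetical reads, for illustration only
reads = ["ACGTA", "GTACC", "ACCTT"]
for (a, b), olen in overlap_graph(reads, 3).items():
    print(a, "->", b, "overlap", olen)
```

The quadratic loop is fine for a toy example; real assemblers index the reads so that overlaps can be found without comparing every pair.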
MSR-CA: http://www.genome.umd.edu/SR_CA_MANUAL.htm : The MSR-CA assembler combines the benefits of
deBruijn graph and Overlap-Layout-Consensus assembly approaches. The
strength of the deBruijn graph approach is in its ability to quickly create a
graph representation of the genome assembly from the deep coverage short read
data. However, in most cases the graph is extremely complex and it is hard
to find a way to recover the original genome sequence from simply traversing
it. On the other hand, overlap-layout-consensus is better suited for longer
reads with high coverage, and since it usually relies on overlaps of 40 bases
or longer, it is better for resolving short repetitive structures.
SGA: https://github.com/jts/sga : De novo genome sequence assembly is important
both to generate new sequence assemblies for previously uncharacterized genomes
and to identify the genome sequence of individuals in a reference-unbiased way.
We present memory efficient data structures and algorithms for assembly using
the FM-index derived from the compressed Burrows-Wheeler transform, and a new
assembler based on these called SGA (String Graph Assembler). We describe
algorithms to error correct, assemble and scaffold large sets of sequence data.
SGA uses the overlap-based string graph model of assembly, unlike most de novo
assemblers that rely on de Bruijn graphs, and is simply parallelizable. We
demonstrate the error correction and assembly performance of SGA on 1.2 billion
sequence reads from a human genome, which we are able to assemble using 54 GB
of memory. The resulting contigs are highly accurate and contiguous, while
covering 95% of the reference genome (excluding contigs less than 200bp in
length). Because of the low memory requirements and parallelization without
requiring inter-process communication, SGA provides the first practical
assembler to our knowledge for a mammalian-sized genome on a low-end computing
cluster. http://genome.cshlp.org/content/early/2011/12/07/gr.126953.111
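For readers unfamiliar with the FM-index, here is a small sketch of the core idea: build the Burrows-Wheeler transform of a text and count pattern occurrences with backward search. It is deliberately naive (sorted rotations, linear-time rank queries) and shows only the general technique, not SGA's compressed data structures or its overlap computation. The function names and the example sequence are my own.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end-of-text sentinel."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text via FM-index backward search."""
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that sort strictly before c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; a real FM-index precomputes this
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical sequence, for illustration only
index = bwt("GATTACATTA")
print(count_occurrences(index, "ATTA"))
```

The memory savings described in the abstract come from storing the transform in compressed form and sampling the rank table, neither of which this sketch attempts.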
SSAKE: http://www.bcgsc.ca/bioinfo/software/ssake : Novel
DNA sequencing technologies with the potential for up to three orders of magnitude
more sequence throughput than conventional Sanger sequencing are emerging. The
instrument now available from Solexa Ltd. produces millions of short DNA
sequences of 25 nt each. Due to ubiquitous repeats in large genomes and the
inability of short sequences to uniquely and unambiguously characterize them,
the short read length limits applicability for de novo sequencing. However,
given the sequencing depth and the throughput of this instrument, stringent
assembly of highly identical sequences can be achieved. We describe SSAKE, a
tool for aggressively assembling millions of short nucleotide sequences by
progressively searching through a prefix tree for the longest possible overlap
between any two sequences. SSAKE is designed to help leverage the information
from short sequence reads by stringently assembling them into contiguous
sequences that can be used to characterize novel sequencing targets. Availability:
http://www.bcgsc.ca/bioinfo/software/ssake. http://bioinformatics.oxfordjournals.org/content/23/4/500.full
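The greedy strategy described above (repeatedly extend a contig with the read that has the longest overlap) can be sketched as follows. This is a simplified illustration under my own assumptions: it extends in one direction only, requires exact overlaps, and scans a plain list instead of SSAKE's prefix tree. `greedy_extend` and its parameters are hypothetical names.

```python
def greedy_extend(seed, reads, min_overlap):
    """Extend seed to the right, always choosing the read with the longest exact overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    contig = seed
    unused = [read for read in reads if read != seed]
    extended = True
    while extended:
        extended = False
        longest_read = max((len(read) for read in unused), default=0)
        # Longest overlaps are tried first; an overlap cannot exceed the contig length
        for olen in range(min(longest_read - 1, len(contig)), min_overlap - 1, -1):
            suffix = contig[-olen:]
            matches = [r for r in unused if len(r) > olen and r.startswith(suffix)]
            if matches:
                read = matches[0]
                contig += read[olen:]   # append only the non-overlapping part
                unused.remove(read)     # each read is used at most once
                extended = True
                break
    return contig


# Hypothetical reads, for illustration only
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(greedy_extend("ACGTAC", reads, min_overlap=3))
```

Removing each read once it has been used guarantees termination, but a greedy choice like this can still take a wrong turn at repeats, which is the limitation the abstract alludes to.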
VCAKE: http://sourceforge.net/projects/vcake/ : Inexpensive
de novo genome sequencing, particularly in organisms with small genomes, is now
possible using several new sequencing technologies. Some of these technologies,
such as that from Illumina's Solexa Sequencing, produce high genomic coverage
by generating a very large number of small reads (∼30 bp). While prior work
shows that partial assembly can be performed by k-mer extension in error-free
reads, this algorithm is unsuccessful with the sequencing error rates found in
practice. We present VCAKE (Verified Consensus Assembly by K-mer Extension), a
modification of simple k-mer extension that overcomes error by using high depth
coverage. Though it is a simple modification of a previous approach, we show
significant improvements in assembly results on simulated and experimental
datasets that include error. Availability: http://152.2.15.114/~labweb/VCAKE. http://bioinformatics.oxfordjournals.org/content/23/21/2942.long
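The idea of overcoming errors through depth of coverage can be illustrated with a voting scheme: every read that overlaps the end of the contig votes for the next base, and the contig is extended only when one base clearly wins. The sketch below is a loose illustration of that principle, not VCAKE's published algorithm; the thresholds `min_votes` and `max_length` and both function names are assumptions of mine.

```python
from collections import Counter


def vote_next_base(contig, reads, min_overlap):
    """Let each read whose prefix matches the contig's end vote for the following base."""
    votes = Counter()
    for read in reads:
        # Use the longest overlap that leaves at least one base to vote with
        for olen in range(min(len(read) - 1, len(contig)), min_overlap - 1, -1):
            if contig.endswith(read[:olen]):
                votes[read[olen]] += 1
                break
    return votes


def extend_by_voting(seed, reads, min_overlap, min_votes=2, max_length=1000):
    """Extend seed one base at a time while a single base wins the vote."""
    contig = seed
    while len(contig) < max_length:  # guard against endless extension in repeats
        ranked = vote_next_base(contig, reads, min_overlap).most_common(2)
        if not ranked or ranked[0][1] < min_votes:
            break
        if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
            break  # tie: no clear consensus
        contig += ranked[0][0]
    return contig


# Hypothetical reads; "CGTC" plays the role of a read with a sequencing error
reads = ["CGTA", "CGTA", "CGTC", "GTAG", "GTAG"]
print(extend_by_voting("ACGT", reads, min_overlap=3))
```

Here the erroneous read is simply outvoted, which is the intuition behind using high coverage to tolerate errors.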
QSRA: http://qsra.cgrb.oregonstate.edu : New
rapid high-throughput sequencing technologies have sparked the creation of a
new class of assembler. Since all high-throughput sequencing platforms
incorporate errors in their output, short-read assemblers must be designed to
account for this error while utilizing all available data. Results: We have
designed and implemented an assembler, Quality-value guided Short Read
Assembler, created to take advantage of quality-value scores as a further
method of dealing with error. Compared to previous published algorithms, our
assembler shows significant improvements not only in speed but also in output
quality. Conclusion: QSRA generally produced the highest genomic coverage,
while being faster than VCAKE. QSRA is extremely competitive in its longest
contig and N50/N80 contig lengths, producing results of similar quality to
those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo
assembly of complex genomes, improving upon the original VCAKE algorithm by not
only drastically reducing runtimes but also increasing the viability of the
assembly algorithm through further error handling capabilities. http://www.biomedcentral.com/1471-2105/10/69
SHARCGS: http://sharcgs.molgen.mpg.de : The latest revolution in the DNA sequencing
field has been brought about by the development of automated sequencers that
are capable of generating giga base pair data sets quickly and at low cost.
Applications of such technologies seem to be limited to resequencing and
transcript discovery, due to the shortness of the generated reads. In order to
extend the fields of application to de novo sequencing, we developed the
SHARCGS algorithm to assemble short-read (25–40-mer) data with high accuracy
and speed. The efficiency of SHARCGS was tested on BAC inserts from three
eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes
>20 kbp for Drosophila and Arabidopsis and >4 kbp for human in
simulations taking missing reads and wrong base calls into account. We
assembled 949,974 contigs with length >50 bp, and only one single contig
could not be aligned error-free against the reference sequences. We generated
36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing
instrument and assembled 937 contigs covering 98% of the genome with an N50
size of 3.7 kbp. With the exception of five contigs that differ in 1–4
positions relative to the reference sequence, all contigs matched the genome
error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel
sequencing technologies by assembling sequence contigs de novo with high
confidence and by outperforming existing assembly algorithms in terms of speed
and accuracy. http://genome.cshlp.org/content/early/2007/10/01/gr.6435207
CABOG: http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page : The emergence of next-generation sequencing
platforms led to resurgence of research in whole-genome shotgun assembly
algorithms and software. DNA sequencing data from the Roche 454,
Illumina/Solexa, and ABI SOLiD platforms typically present shorter read
lengths, higher coverage, and different error profiles compared with Sanger
sequencing data. Since 2005, several assembly software packages have been
created or revised specifically for de novo assembly of next-generation
sequencing data. This review summarizes and compares the published descriptions
of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler,
Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two
standard methods known as the de Bruijn graph approach and the
overlap/layout/consensus approach to assembly. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646/
Shorty: http://www.cs.sunysb.edu/~skiena/shorty : New
short-read sequencing technologies produce enormous volumes of 25–30 base
paired-end reads. The resulting reads have vastly different characteristics
than produced by Sanger sequencing, and require different approaches than the
previous generation of sequence assemblers. In this paper, we present a
short-read de novo assembler particularly targeted at the new ABI SOLiD
sequencing technology. Results: This paper presents what we believe to be the first de novo
sequence assembly results on real data from the emerging SOLiD
platform, introduced by Applied Biosystems. Our assembler SHORTY
augments short-paired reads using a trivially small number (5 – 10) of seeds of
length 300 – 500 bp. These seeds enable us to produce significant assemblies
using short-read coverage no more than 100×, which can be obtained in a single
run of these high-capacity sequencers. SHORTY exploits two ideas which we
believe to be of interest to the short-read assembly community: (1) using
single seed reads to crystallize assemblies, and (2) estimating intercontig
distances accurately from multiple spanning paired-end reads. Conclusion: We demonstrate effective
assemblies (N50 contig sizes ~40 kb) of three different bacterial species using
simulated SOLiD data. Sequencing artifacts limit our performance on real data;
however, our results on this data are substantially better than those achieved
by competing assemblers. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648751/
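The second idea, estimating the distance between two contigs from several spanning read pairs, can be sketched roughly as follows. I do not know the details of SHORTY's estimator, so this is only a plausible simplification: each pair yields one gap estimate (nominal insert size minus the bases the pair covers on each contig), and the estimates are averaged. All names, coordinates, and the insert size are hypothetical.

```python
from statistics import mean


def estimate_gap(spanning_pairs, left_contig_length, insert_size):
    """Average the gap implied by each read pair that spans two contigs.

    spanning_pairs holds (left_start, right_end) tuples: the start of the first
    read on the left contig and the end of its mate on the right contig.
    """
    estimates = []
    for left_start, right_end in spanning_pairs:
        covered_left = left_contig_length - left_start  # bases up to the left contig's end
        covered_right = right_end                       # bases from the right contig's start
        estimates.append(insert_size - covered_left - covered_right)
    return mean(estimates)


# Hypothetical pairs spanning a 1000 bp left contig, nominal insert size 500 bp
pairs = [(800, 200), (700, 110), (900, 290)]
print(estimate_gap(pairs, left_contig_length=1000, insert_size=500))
```

Averaging over many pairs reduces the effect of insert-size variation in any single pair, which is presumably why multiple spanning pairs give a more accurate distance.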
Taipan: http://taipan.sourceforge.net : Summary: The shorter and vastly more numerous reads produced by
second-generation sequencing technologies require new tools that can assemble
massive numbers of reads in reasonable time. Existing short-read assembly tools
can be classified into two categories: greedy
extension-based and graph-based. While
the graph-based approaches are generally superior in terms of assembly quality,
the computer resources required for building and storing a huge graph are very
high. In this article, we present Taipan,
an assembly algorithm which can be viewed as a hybrid of these two approaches.
Taipan uses greedy extensions for contig construction but at each step realizes
enough of the corresponding read graph to make better decisions as to how
assembly should continue. We show that this approach can achieve an assembly
quality at least as good as the graph-based approaches used in the popular
Edena and Velvet assembly tools using a moderate amount of computing resources.
Availability and Implementation: Source code in C running on Linux is freely available at http://taipan.sourceforge.net. http://bioinformatics.oxfordjournals.org/content/25/17/2279.long
PCAP long-read assembler: http://seq.cs.iastate.edu/pcap.html : This unit describes how to use the Parallel
Contig Assembly Program (PCAP) to assemble the data produced by a whole-genome
shotgun sequencing project. We present a basic protocol for using PCAP on a
multiprocessor computer in a 300-Mb genome assembly project. A support protocol
to prepare input files for PCAP is also described. Another basic protocol for
using PCAP on a distributed cluster of computers in a 3-Gb genome assembly
project is presented, in addition to suggestions for understanding results from
PCAP. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1103s11/abstract;jsessionid=D1E990D19BC5B53F5145818C47152BE5.f04t03
Seqcons: http://www.seqan.de/uploads/media/MicroRazerS.zip : Motivation: Novel high-throughput sequencing technologies pose new algorithmic
challenges in handling massive amounts of short-read, high-coverage data. A
robust and versatile consensus tool is of particular interest for such data
since a sound multi-read alignment is a prerequisite for variation analyses,
accurate genome assemblies and insert sequencing. Results: A multi-read alignment algorithm for de novo or reference-guided genome assembly is
presented. The program identifies segments shared by multiple reads and then
aligns these segments using a consistency-enhanced alignment graph. On real de novo sequencing data obtained from the
newly established NCBI Short Read Archive, the program performs similarly in
quality to other comparable programs. On more challenging simulated datasets
for insert sequencing and variation analyses, our program outperforms the other
tools. Availability: The consensus program can be
downloaded from http://www.seqan.de/projects/consensus.html.
It can be used stand-alone or in conjunction with the Celera Assembler. Both
application scenarios as well as the usage of the tool are described in the
documentation. http://bioinformatics.oxfordjournals.org/content/25/9/1118.abstract