Bioinformatics Tools


Saturday, July 5, 2014

Metagenomic Assemblers:

Here is a list of the most commonly used assemblers for metagenomic reads. The list is extensive but by no means complete. I will try to update it as soon as I come across a new one. Help me keep the list updated if you come across any new and interesting assembler I have missed.

MetaVelvet : http://metavelvet.dna.bio.keio.ac.jp/ : An extension of Velvet assembler to de novo metagenome assembly from short sequence reads: An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes. http://nar.oxfordjournals.org/content/40/20/e155.short
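The core trick here, splitting a shared de Bruijn graph into per-species subgraphs by k-mer coverage, is easy to illustrate. Below is a toy Python sketch of coverage-based partitioning (my own illustration, not MetaVelvet's code; the reads, k-mer size and threshold are made up):

```python
from collections import Counter

def kmer_coverage(reads, k=21):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_by_coverage(counts, threshold):
    """Partition k-mers into putative high- and low-abundance subsets.

    MetaVelvet does this on de Bruijn graph nodes using automatically
    detected coverage peaks; here a single hand-picked threshold stands
    in for that step.
    """
    high = {km for km, c in counts.items() if c >= threshold}
    low = {km for km, c in counts.items() if c < threshold}
    return high, low

# Toy reads from an "abundant" and a "rare" genome (illustrative only).
reads = ["ACGTACGTACGTACGTACGTACGT"] * 50 + ["TTGCAATTGCAATTGCAATTGCAA"] * 5
high, low = split_by_coverage(kmer_coverage(reads, k=11), threshold=20)
print(len(high), "high-coverage k-mers,", len(low), "low-coverage k-mers")
```

MetaVelvet itself works on graph nodes rather than raw k-mers and also uses graph connectivity, but the principle is the same.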

MetAMOS: https://github.com/marbl/metAMOS a metagenomic assembly and analysis pipeline for AMOS: We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053804/

IDBA-UD: http://www.cs.hku.hk/~alse/idba_ud a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs. Results: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to high-confidence contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets shows that IDBA-UD can reconstruct longer contigs with higher accuracy. Availability: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud. http://bioinformatics.oxfordjournals.org/content/28/11/1420.short

Meta-IDBA: http://www.cs.hku.hk/~alse/metaidba a de Novo assembler for metagenomic data. Motivation: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling of a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated. Results: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species, using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and Abyss for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. Availability: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba. http://bioinformatics.oxfordjournals.org/content/27/13/i94.short

Ray Meta: http://denovoassembler.sf.net: Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights for specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net. : http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4056372/

MAP: http://bioinfo.ctb.pku.edu.cn/MAP/ Motivation: A high-quality assembly of reads generated from shotgun sequencing is a substantial step in metagenome projects. Although traditional assemblers have been employed in initial analysis of metagenomes, they cannot surmount the challenges created by the features of metagenomic data. Result: We present a de novo assembly approach and its implementation named MAP (metagenomic assembly program). Based on an improved overlap/layout/consensus (OLC) strategy incorporated with several special algorithms, MAP uses the mate-pair information, making it more applicable to the shotgun DNA reads (recommended as >200 bp) currently widely used in metagenome projects. Results of extensive tests on simulated data show that MAP can be superior to both Celera and Phrap for typical longer reads by Sanger sequencing, as well as having an evident advantage over Celera, Newbler and the newest Genovo for typical shorter reads by 454 sequencing. Availability and implementation: The source code of MAP is distributed as open source under the GNU GPL license; the MAP program and all simulated datasets are freely available at http://bioinfo.ctb.pku.edu.cn/MAP/. http://bioinformatics.oxfordjournals.org/content/28/11/1455.short

Genovo: http://cs.stanford.edu/group/genovo/ : Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A nonparametric prior accounts for the unknown number of genomes in the sample. Inference is performed by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs in a series of synthetic experiments and across nine metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, even for low-abundance sequences, and yield a higher assembly score. http://online.liebertpub.com/doi/abs/10.1089/cmb.2010.0244  

Extended Genovo: http://xgenovo.dna.bio.keio.ac.jp Metagenomes present assembly challenges when assembling multiple genomes from mixed reads of multiple species; an assembler designed for single genomes does not adapt well to this case. Genovo is a de novo assembler for metagenomes built on a generative probabilistic model. Genovo assembles all reads without discarding any in a preprocessing step, and is therefore able to extract more information from metagenomic data and, in principle, generate better assembly results. Paired-end sequencing is now widely used, yet Genovo was designed for 454 single-end reads. In this research, we extended Genovo by incorporating paired-end information, naming the result Xgenovo, so that it generates higher-quality assemblies with paired-end reads. First, we added a bonus parameter in the Chinese Restaurant Process that serves as the prior accounting for the unknown number of genomes in the sample. This bonus parameter encourages a pair of reads to end up in the same contig, as an effort to avoid chimeric contigs. Second, we modified the sampling of a read's location within a contig: we used a relative distance for the number of trials in the symmetric geometric distribution instead of the distance between the offset and the center of the contig used in Genovo. With this relative distance, a read sampled at the appropriate location has higher probability, so reads are more likely to be mapped to the correct locations. Results of extensive experiments on simulated metagenomic datasets, from simple to complex, with species coverage following uniform and lognormal distributions, showed that Xgenovo can be superior to the original Genovo and to the recently proposed metagenome assembler for 454 reads, MAP. Xgenovo successfully generated longer N50 than Genovo and MAP while maintaining assembly quality, even for very complex metagenomic datasets consisting of 115 species. Xgenovo also demonstrated the potential to decrease the computational cost. The software and all simulated datasets are publicly available online at http://xgenovo.dna.bio.keio.ac.jp. https://peerj.com/articles/196/
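Since both Genovo and Xgenovo lean on a Chinese Restaurant Process prior to handle the unknown number of genomes, here is a tiny Python illustration of that prior (a conceptual sketch only, not the authors' implementation; alpha and the read count are arbitrary):

```python
import random

def crp_assign(n_reads, alpha=1.0, seed=0):
    """Assign reads to clusters with a Chinese Restaurant Process prior.

    Read i joins existing cluster j with probability |cluster_j| / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha), so the number
    of clusters (candidate genomes/contigs) is not fixed in advance.
    """
    random.seed(seed)
    clusters = []      # current cluster sizes
    assignments = []
    for i in range(n_reads):
        weights = clusters + [alpha]   # last slot = "open a new cluster"
        r = random.uniform(0, i + alpha)
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if j == len(clusters):
            clusters.append(1)         # new cluster ("new table")
        else:
            clusters[j] += 1           # join an existing cluster
        assignments.append(j)
    return clusters, assignments

sizes, _ = crp_assign(1000, alpha=2.0)
print(len(sizes), "clusters, largest holds", max(sizes), "reads")
```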

SmashCommunity: a metagenomic annotation and analysis tool. SmashCommunity is a stand-alone metagenomic annotation and analysis pipeline suitable for data from Sanger and 454 sequencing technologies. It supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare compositions of multiple metagenomes and to produce intuitive visual representations of such analyses. Availability: SmashCommunity source code and documentation are available at http://www.bork.embl.de/software/smash: http://bioinformatics.oxfordjournals.org/content/26/23/2977.short

Bambus 2: http://amos.sf.net. Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources. Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly. Availability: Bambus 2 is open source and available from http://amos.sf.net. http://bioinformatics.oxfordjournals.org/content/27/21/2964.short


MetaCAA: https://metagenomics.atc.tcs.com/MetaCAA A clustering-aided methodology for efficient assembly of metagenomic datasets. A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA — a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA. http://www.sciencedirect.com/science/article/pii/S0888754314000135

Genomic Assemblers:

Here is a list of the most commonly used assemblers for genomic reads. The list is extensive but by no means complete. I will try to update it as soon as I come across a new one. Help me keep the list updated if you come across any new and interesting assembler I have missed.

ABySS: http://www.bcgsc.ca/downloads/abyss/ : Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2694472/
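Since N50 comes up in almost every abstract on this page, here is a small helper to compute it from a list of contig lengths (a generic utility, not taken from any of the tools listed):

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # 400: 500 + 400 = 900 >= 1500 / 2
```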

SOAPdenovo: http://soap.genomics.org.cn/ : Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way. http://genome.cshlp.org/content/early/2009/12/16/gr.097261.109

Velvet: http://www.ebi.ac.uk/~zerbino/velvet : We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25–50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies. http://genome.cshlp.org/content/18/5/821.short
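For readers new to the de Bruijn graph idea that Velvet (and several assemblers above) is built on, here is a minimal Python sketch: reads are chopped into k-mers and each k-mer becomes an edge between its (k-1)-mer prefix and suffix. This is only the bare graph construction, without Velvet's error removal, tip clipping or scaffolding:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k=5):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix -> suffix
    return graph

graph = de_bruijn_graph(["ACGTACGGT", "GTACGGTAC"], k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(successors)))
```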

ALLPATHS-LG: ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/: Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. 

Bambus2: http://amos.sf.net: Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources. Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly. Availability: Bambus 2 is open source and available from http://amos.sf.net. http://bioinformatics.oxfordjournals.org/content/27/21/2964.short

Newbler: http://454.com/contact-us/software-request.asp : In the last year, high-throughput sequencing technologies have progressed from proof-of-concept to production quality. While these methods produce high-quality reads, they have yet to produce reads comparable in length to Sanger-based sequencing. Current fragment assembly algorithms have been implemented and optimized for mate-paired Sanger-based reads, and thus do not perform well on short reads produced by short read technologies. We present a new Eulerian assembler that generates nearly optimal short read assemblies of bacterial genomes and describe an approach to assemble reads in the case of the popular hybrid protocol when short and long Sanger-based reads are combined. http://genome.cshlp.org/content/18/2/324.full

MIRA: http://www.chevreux.org/mira_downloads.html We present an EST sequence assembler that specializes in reconstruction of pristine mRNA transcripts, while at the same time detecting and classifying single nucleotide polymorphisms (SNPs) occurring in different variations thereof. The assembler uses iterative multipass strategies centered on high-confidence regions within sequences and has a fallback strategy for using low-confidence regions when needed. It features special functions to assemble high numbers of highly similar sequences without prior masking, an automatic editor that edits and analyzes alignments by inspecting the underlying traces, and detection and classification of sequence properties like SNPs with a high specificity and a sensitivity down to one mutation per sequence. In addition, it includes possibilities to use incorrectly preprocessed sequences, routines to make use of additional sequencing information such as base-error probabilities, template insert sizes, strain information, etc., and functions to detect and resolve possible misassemblies. The assembler is routinely used for such various tasks as mutation detection in different cell types, similarity analysis of transcripts between organisms, and pristine assembly of sequences from various sources for oligo design in clinical microarray experiments. http://genome.cshlp.org/content/14/6/1147

Euler-USR: http://euler-assembler.ucsd.edu/portal/ : Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length, GapLength) opens a possibility to transform short mate-pairs into long mate-reads of length ≈ GapLength, and thus raises the question as to whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze the question of whether the read length matters. We further complement the ongoing experimental efforts to maximize read length by a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that substitutes trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length “past their prime” to where the error rate grows above what is normally acceptable for fragment assembly. http://genome.cshlp.org/content/19/2/336.full


Minia: http://minia.genouest.org/files/minia-1.6088.tar.gz  : Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).  http://minia.genouest.org/files/minia.pdf

Ray: http://sourceforge.net/projects/denovoassembler/files/ : An accurate genome sequence of a desired species is now a pre-requisite for genome research. An important step in obtaining a high-quality genome sequence is to correctly assemble short reads into longer sequences accurately representing contiguous genomic regions. Current sequencing technologies continue to offer increases in throughput, and corresponding reductions in cost and time. Unfortunately, the benefit of obtaining a large number of reads is complicated by sequencing errors, with different biases being observed with each platform. Although software is available to assemble reads for each individual system, no procedure has been proposed for high-quality simultaneous assembly based on reads from a mix of different technologies. In this paper, we describe a parallel short-read assembler, called Ray, which has been developed to assemble reads obtained from a combination of sequencing platforms. We compared its performance to other assemblers on simulated and real datasets. We used a combination of Roche/454 and Illumina reads to assemble three different genomes. We showed that mixing sequencing technologies systematically reduces the number of contigs and the number of errors. Because of its open nature, this new tool will hopefully serve as a basis to develop an assembler that can be of universal utilization (availability: http://deNovoAssembler.sf.Net/). For online Supplementary Material, see www.liebertonline.com. http://online.liebertpub.com/doi/abs/10.1089/cmb.2009.0238

Edena: http://www.genomic.ch/edena : Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput that is allowed by using such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform that produces millions of very short sequences that are 35 bases in length. We propose a de novo assembler software that is dedicated to process such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their published genomes acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2336802/

MSR-CA: http://www.genome.umd.edu/SR_CA_MANUAL.htm : The MSR-CA assembler combines the benefits of the de Bruijn graph and overlap-layout-consensus assembly approaches. The strength of the de Bruijn graph approach is its ability to quickly create a graph representation of the genome assembly from deep-coverage short-read data. However, in most cases the graph is extremely complex and it is hard to recover the original genome sequence by simply traversing it. On the other hand, overlap-layout-consensus is better suited for longer reads with high coverage, and since it usually relies on overlaps of 40 bases or longer, it is better at resolving short repetitive structures.

SGA: https://github.com/jts/sga De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error correct, assemble and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs less than 200bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster. http://genome.cshlp.org/content/early/2011/12/07/gr.126953.111
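SGA's low memory use comes from indexing the reads with an FM-index built on the Burrows-Wheeler transform. A naive BWT (fine for a toy string, hopeless for real read sets) shows what the transform is; production code builds it via suffix arrays and succinct data structures:

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: sort all rotations of text + '$'
    and take the last column. Real assemblers build this via suffix arrays."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACGTACGT"))  # last column of the sorted rotation matrix
```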

SSAKE: http://www.bcgsc.ca/bioinfo/software/ssake : Novel DNA sequencing technologies with the potential for up to three orders of magnitude more sequence throughput than conventional Sanger sequencing are emerging. The instrument now available from Solexa Ltd produces millions of short DNA sequences of 25 nt each. Due to ubiquitous repeats in large genomes and the inability of short sequences to uniquely and unambiguously characterize them, the short read length limits applicability for de novo sequencing. However, given the sequencing depth and the throughput of this instrument, stringent assembly of highly identical sequences can be achieved. We describe SSAKE, a tool for aggressively assembling millions of short nucleotide sequences by progressively searching through a prefix tree for the longest possible overlap between any two sequences. SSAKE is designed to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets. Availability: http://www.bcgsc.ca/bioinfo/software/ssake. http://bioinformatics.oxfordjournals.org/content/23/4/500.full
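SSAKE's greedy 3' extension is easy to caricature in Python. The sketch below ignores SSAKE's prefix tree, coverage checks and stringency settings and just appends whichever read has the longest overlap with the growing contig (the reads and minimum overlap are made up):

```python
def longest_overlap(contig, read, min_overlap=10):
    """Length of the longest suffix of `contig` that is a prefix of `read`."""
    max_len = min(len(contig), len(read))
    for length in range(max_len, min_overlap - 1, -1):
        if contig.endswith(read[:length]):
            return length
    return 0

def greedy_extend(seed, reads, min_overlap=10):
    """Repeatedly append the read giving the longest overlap with the contig end."""
    contig = seed
    pool = set(reads)
    while True:
        best, best_len = None, 0
        for read in pool:
            ov = longest_overlap(contig, read, min_overlap)
            if ov > best_len:
                best, best_len = read, ov
        if best is None:
            return contig
        contig += best[best_len:]   # append only the non-overlapping tail
        pool.discard(best)

reads = ["GGTACGTTTAACCA", "TTAACCAGGGCATT"]
print(greedy_extend("ACGTACGGTACGTT", reads, min_overlap=6))
```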

VCAKE: http://sourceforge.net/projects/vcake/ : Inexpensive de novo genome sequencing, particularly in organisms with small genomes, is now possible using several new sequencing technologies. Some of these technologies such as that from Illumina's Solexa Sequencing, produce high genomic coverage by generating a very large number of small reads (30 bp). While prior work shows that partial assembly can be performed by k-mer extension in error-free reads, this algorithm is unsuccessful with the sequencing error rates found in practice. We present VCAKE (Verified Consensus Assembly by K-mer Extension), a modification of simple k-mer extension that overcomes error by using high depth coverage. Though it is a simple modification of a previous approach, we show significant improvements in assembly results on simulated and experimental datasets that include error. Availability: http://152.2.15.114/~labweb/VCAKE. http://bioinformatics.oxfordjournals.org/content/23/21/2942.long

QSRA: http://qsra.cgrb.oregonstate.edu : New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results: We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previous published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion: QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities. http://www.biomedcentral.com/1471-2105/10/69

SHARCGS: http://sharcgs.molgen.mpg.de : The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25–40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only one single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1–4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy. http://genome.cshlp.org/content/early/2007/10/01/gr.6435207

CABOG: http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page : The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646/

Shorty: http://www.cs.sunysb.edu/~skiena/shorty : New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics than those produced by Sanger sequencing, and require different approaches than the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology. Results: This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short paired reads using a trivially small number (5–10) of seeds of length 300–500 bp. These seeds enable us to produce significant assemblies using short-read coverage no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads. Conclusion: We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data; however, our results on this data are substantially better than those achieved by competing assemblers. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648751/

Taipan: http://taipan.sourceforge.net : Summary: The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources. Availability and Implementation: Source code in C running on Linux is freely available at http://taipan.sourceforge.net http://bioinformatics.oxfordjournals.org/content/25/17/2279.long

PCAP long-read assembler: http://seq.cs.iastate.edu/pcap.html : This unit describes how to use the Parallel Contig Assembly Program (PCAP) to assemble the data produced by a whole-genome shotgun sequencing project. We present a basic protocol for using PCAP on a multiprocessor computer in a 300-Mb genome assembly project. A support protocol to prepare input files for PCAP is also described. Another basic protocol for using PCAP on a distributed cluster of computers in a 3-Gb genome assembly project is presented, in addition to suggestions for understanding results from PCAP. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1103s11/abstract



Seqcons: http://www.seqan.de/uploads/media/MicroRazerS.zip : Motivation: Novel high-throughput sequencing technologies pose new algorithmic challenges in handling massive amounts of short-read, high-coverage data. A robust and versatile consensus tool is of particular interest for such data since a sound multi-read alignment is a prerequisite for variation analyses, accurate genome assemblies and insert sequencing. Results: A multi-read alignment algorithm for de novo or reference-guided genome assembly is presented. The program identifies segments shared by multiple reads and then aligns these segments using a consistency-enhanced alignment graph. On real de novo sequencing data obtained from the newly established NCBI Short Read Archive, the program performs similarly in quality to other comparable programs. On more challenging simulated datasets for insert sequencing and variation analyses, our program outperforms the other tools. Availability: The consensus program can be downloaded from http://www.seqan.de/projects/consensus.html. It can be used stand-alone or in conjunction with the Celera Assembler. Both application scenarios as well as the usage of the tool are described in the documentation. http://bioinformatics.oxfordjournals.org/content/25/9/1118.abstract

Wednesday, April 30, 2014

How to make a protein soluble?

Cloning, expression and purification of difficult to clone, express and purify proteins in E. coli 


I have received some emails about the expression of difficult-to-purify proteins, so I thought of putting together a short list of dos and don'ts. For the purely bioinformatic people, please bear with me for a couple of posts. First of all, it is important to know your protein: gather as much information about it as you can. All those small pieces of information help a lot if kept in mind while designing the strategy for cloning, expression and purification. Also be informed about the source of the protein, whether eukaryotic, prokaryotic or something else. Basic parameters like the size of the protein, pI and amino acid composition play a vital role in designing the strategy. Here are some tools for finding such information that I have compiled on this blog before: http://bioinformatictools.blogspot.in/2014/04/functional-annotation-of-hypothetical.html and http://bioinformatictools.blogspot.in/2011/11/in-silico-characterization-of-proteins.html. Look for other sources too; the main theme is to find out as much about the protein as you can. I am not a big fan of purifying proteins under denaturing conditions. There are lots of questions that are difficult to answer if the protein needs to be refolded from denaturing conditions: has the protein folded properly, and is this the way the protein is natively folded rather than just a random refolding? These are difficult to demonstrate experimentally unless you already have some assay in mind. Since I have tried that too, I will end by suggesting what I have learned on that front.

Downstream experimental procedures: Before designing a strategy for cloning, expression and purification of a protein, it is wise to determine the downstream experimental procedures you are going to perform, since the strategy mainly depends on this. At times it is possible to purify a protein in soluble form in very small amounts from a very large culture (which is fine if you need only a small amount of protein for downstream experiments), in which case you need not go through all the standardization experiments with trials in different vectors and host cells. However, if a large amount of protein is required (such as for crystallization experiments), it is advisable to optimize the purification process as a whole.

Read as much as you can: There are various resources with suggestions for cloning, expressing and purifying a protein in the soluble fraction (e.g. the QIAexpress handbook). But keep in mind that it is easy to suggest wet-lab work and quite another thing to find the time and energy to perform the experiments, so try what you think is logical and, more importantly, what is easily available (do-able) for you.

Membrane or membrane-associated protein: Check whether the selected protein is a membrane or membrane-associated protein. This can be done using subcellular localization tools, some of which are listed here: http://bioinformatictools.blogspot.in/2007/09/predicting-subcellular-localization-of.html. Also check whether the protein has a transmembrane domain (TMHMM http://www.cbs.dtu.dk/services/TMHMM/) or a signal peptide (SignalP http://www.cbs.dtu.dk/services/SignalP/). These are hydrophobic regions, and membrane proteins are tough to get in soluble form until the transmembrane or signal peptide part is removed. It is logical to remove the initial (normally N-terminal) transmembrane or signal peptide part to obtain the functional domain or domains in soluble form. (I had a similar problem with a protein I was working on; once I removed the signal peptide and transmembrane domain, everything was solved: the protein went into the soluble fraction, purified like a charm, and even crystallized.)
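Once TMHMM/SignalP tell you where the hydrophobic N-terminal part ends, removing it before cloning is just a matter of slicing the sequence. A minimal Biopython sketch, assuming you already know the first residue of the soluble region (the file names and the coordinate below are placeholders):

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Hypothetical inputs: full-length protein FASTA and the first residue of the
# soluble region, taken from the SignalP/TMHMM output (1-based position).
record = SeqIO.read("full_length_protein.fasta", "fasta")
soluble_start = 35   # e.g. first residue after the predicted signal peptide / TM helix

truncated = SeqRecord(
    record.seq[soluble_start - 1:],          # drop the hydrophobic N-terminus
    id=record.id + "_trunc",
    description="signal peptide / TM region removed before cloning",
)
SeqIO.write(truncated, "truncated_protein.fasta", "fasta")
```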

Check for functional domains in the protein, if any: This will help in determining the probable function the protein might have. It will also point to other proteins with similar domains and how they behave with respect to cloning, expression and purification in E. coli. If you can find a protein with a similar domain, use its cloning, expression and purification protocol as a starting point for your target protein. Also, for some proteins the sequence-based properties change with the addition of a tag; keep this in mind too, since it might lead to a change in pI and so on.
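To see how a tag shifts the predicted pI (and molecular weight), Biopython's ProtParam module is enough; the target sequence below is a placeholder and the tag shown is a typical N-terminal His-tag linker:

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder target sequence
his_tag = "MGSSHHHHHHSSGLVPRGSH"                 # a common N-terminal His-tag linker

for label, seq in [("untagged", protein), ("His-tagged", his_tag + protein[1:])]:
    # protein[1:] drops the target's own initiator Met, since the tag supplies one
    pa = ProteinAnalysis(seq)
    print(f"{label}: pI = {pa.isoelectric_point():.2f}, MW = {pa.molecular_weight():.0f} Da")
```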

Optimize the temperature: Try different temperatures for growth and induction. The induction temperature is the more crucial one.
  1. Try growing cells at 37 °C and induction at 37 °C.
  2. Try growing cells at 37 °C and induction at 25 °C for a long time.
  3. Try growing cells at 37 °C and induction at 16 °C for a long time.
  4. Try growing cells at 25 °C and induction at 16 °C for a long time.
  5. Try growing cells at 37 °C followed by chilling at 16 °C at least one hour before induction.

Low temperature decreases the rate of protein synthesis, and usually more soluble protein is obtained. Also, if the temperature is reduced before induction, the protein is more likely to end up in the soluble fraction; it somehow diverts the protein away from the pathway into inclusion bodies (sorry, I do not know exactly how).

Optimize the IPTG concentration: It is a good idea to check a gradient at small scale for the amount of IPTG (using a range of 0.1, 0.2, 0.3 … mM) required for an optimal expression level of the protein. Normally, IPTG is required at very low levels for optimal expression, and using a higher concentration is not only costly but also does not show much improvement in the expression level of the protein.

Use a large tag, but make sure you have an arrangement to remove it once you have the protein: Larger tags like the intein tag, His-SUMO, GST and MBP (maltose-binding protein) are known to increase the solubility of proteins; use them if you have the corresponding vectors easily available.

Change the vector: Using a weaker promoter (e.g. trc instead of T7) and a lower-copy-number plasmid normally increases the chance of purifying the protein in the soluble fraction. Also, N- and/or C-terminal tags (in various vectors) affect the solubility of the protein, especially for proteins whose folding depends on either terminus.

Change the host cells: Some E. coli strains are better at handling toxic or membrane proteins than others. I had very good experience working with the C41 and C43 strains, which I came to know about through this paper: http://www.ncbi.nlm.nih.gov/pubmed/15294299. There are also pLysS versions of these strains; I did not try them, but you can read up and try. Other strains like Rosetta etc. might also be worth trying, depending on which strains you can get your hands on (so beg, borrow or steal ;)). For a new protein I usually make as many changes as I can, one by one, at small scale and then move the best conditions to large scale. Also, check whether your protein uses codons that are rarely used in E. coli; you can check rare codon usage with various software tools, or with the small do-it-yourself sketch given after these tips.

Change the culture media: After changing and optimizing as many parameters as I could, I was still getting low levels of protein in the soluble fraction in LB medium. I read somewhere that someone had a good yield with Terrific Broth, tried it, and it gave way more protein in the soluble fraction. I was happy to use it thereafter for any protein I had to purify.

Use auto-induction media: It is worthwhile trying auto-induction. The idea is that instead of adding an inducing agent like IPTG, you let the culture switch carbon sources on its own: the medium contains both glucose and lactose, and as the glucose is depleted the cells start taking up and metabolizing lactose, which induces the lac-controlled T7 system and hence the promoter on your expression vector, giving a much more gradual expression than a bolus of IPTG.
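As mentioned in the host-cell tip above, checking rare codon usage does not need anything fancy; a few lines of Python will do. The rare-codon set below is only an illustrative subset of codons commonly described as rare in E. coli, not a complete table, and the test sequence is made up:

```python
from collections import Counter

# Codons frequently listed as rare in E. coli K-12 (illustrative subset only).
RARE_CODONS = {"AGG", "AGA", "CGA", "CTA", "ATA", "CCC", "GGA"}

def rare_codon_report(cds):
    """Count rare codons in a coding sequence (read in frame, triplet by triplet)."""
    codons = [cds[i:i + 3].upper() for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in RARE_CODONS)
    frac = sum(counts.values()) / len(codons) if codons else 0.0
    return counts, frac

counts, frac = rare_codon_report("ATGAGGAGACTAATACCCGGATAA")
print(counts, f"{frac:.1%} of codons are rare")
```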

Functional Annotation of Hypothetical Proteins

Experimental work, though time-consuming, is the direct approach to the functional annotation of hypothetical proteins; however, at times it is difficult to decide on the experimental design for a relatively new class of protein. With the increasing size and quality of the various protein databases, it is becoming easier to work out an experimental design for the probable function of a protein. The following steps can be used in choosing the type of experimental analysis to be performed and the substrate to be used during laboratory tests.

1. If the protein is predicted to be an enzyme, BLAST results normally indicate its closely related proteins, which can be consulted for the experimental procedures to be performed (look for the papers on those matching hits, which might indicate the type of related function the protein may perform). A small script for running such a query from Biopython is sketched after this list.

2. With the growing domain databases, it is possible to analyze the protein domain-wise, indicating its ability to perform certain kinds of biochemical reactions, if any. The NCBI Conserved Domain Database (CDD), Pfam and InterProScan cover a large number of conserved domains that define functional classes. The presence of a certain domain is also indicative of the possible activity of the protein, and therefore helps in choosing the type of substrate to use when defining its biochemical activity in the laboratory.

3. Composition-based analysis of the protein: there are various bioinformatics tools available online for amino-acid-composition-based analysis (e.g. ProtParam, SPAAN, MP3), reporting properties that help characterize the protein and later help with its functional annotation.

4. Homology-based modeling: this is an important step in the functional annotation of a protein based on its structure, though it may be difficult for proteins with low identity (<30%) to already solved crystal structures. However, a good homology model can be an important step towards functional annotation. Likewise, secondary and tertiary structure prediction can point to similar functional categories and thereby help in designing the relevant experimental assays. Some of the commonly used homology modeling tools are listed here: http://bioinformatictools.blogspot.in/2012/01/homology-modeling-of-proteins.html.

5. Phylogenetic analysis: phylogenetic analysis not only shows the evolutionary divergence of the protein but also acts as an important step towards assessing its functional conservation. It helps in determining the degree of functional similarity with related homologous proteins, and thus in choosing appropriate experimental assays for functional annotation. Combined with molecular dynamics simulation, it also helps in the in-silico assessment of a substrate's ability to bind the protein; in fact, it can cut a large number of candidate substrates down to the top hits, helping to prioritize the experimental analysis and saving time and resources. A quick way to build a first-pass tree is sketched after this list.

6. It is sometimes difficult to work with novel proteins for which relevant data are almost negligible worldwide; in that case you may have to wait until more information becomes available.
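If you want to script step 1, Biopython can submit the query to NCBI BLAST and pull back the top hits. This is a minimal sketch (it makes a network call to NCBI, so it is slow, and the query sequence is a placeholder):

```python
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder protein sequence
handle = NCBIWWW.qblast("blastp", "nr", query)  # network call to NCBI BLAST
record = NCBIXML.read(handle)

for alignment in record.alignments[:5]:         # top 5 hits
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E = {best_hsp.expect:.2e}")
```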
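And for step 5, a quick first-pass tree from an alignment of the homologs can be built with Biopython's distance-based constructor (the alignment file name is a placeholder; for publication-quality phylogenies use a dedicated tool such as RAxML or IQ-TREE):

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("homologs_aligned.fasta", "fasta")   # placeholder aligned FASTA
calculator = DistanceCalculator("identity")                   # simple identity distances
distance_matrix = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)                        # neighbour-joining tree
Phylo.draw_ascii(tree)
```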


Let me know if you have more suggestions to add.