Thursday, February 5, 2015

MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data

Another work from our group which came sometime back. Please do write to ( in case of any problem. Comments are welcome.

The identification of virulent proteins in any de-novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenome of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both the above tasks is the identification of virulent proteins since a significant proportion of genomic and metagenomic proteins are novel and yet unannotated. The currently available tools which carry out the identification of virulent proteins provide limited accuracy and cannot be used on large datasets. 

Therefore, we have developed an MP3 standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 is developed using an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out highly fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. 

To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at​ex.php.

16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets

A recent article from our lab. Please explore and write back to ( in case of any problem. Comments are welcome. 

The sequencing of 16S rRNA gene is commonly performed to estimate the microbial diversity in a metagenomic study. The rapid developments in genome sequencing technologies have shifted the focus on sequencing the selected hypervariable regions (HVRs) of 16S rRNA gene instead of sequencing the complete gene. The recent metagenomic projects involve the sequencing of only a single HVR or a combination of two or more HVRs. At present there is no specialized method available for the correct identification and classification of species using short variable 16S rRNA sequences. Therefore, we have developed 16S Classifier using a machine learning method, Random Forest, for faster and accurate taxonomic classification of short hypervariable regions of 16S rRNA sequence. It displayed the precision values of up to 0.91 on training datasets and the precision values of up to 0.98 on the first test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. 16S classifier displayed up to 42.9%, 40.7%, 41.0%, 57.9% and 73.8% higher accuracy at phylum, class, order, family and genus levels, respectively, as compared to the commonly used RDP classifier program. In addition, it is 7.5 times faster than RDP Classifier and 800 times faster than BLAST. 16S classifier can be easily used with the QIIME pipeline which is commonly used for the 16S rRNA analysis.

To the best of our knowledge, 16S Classifier is the only available tool which can carry out the efficient, sensitive and accurate taxonomic assignment of any of the 16S rRNA hypervariable regions which are commonly used in metagenomic projects. In the case of complete 16S rRNA also, it displayed exceptional (precision of 0.97) performance on the test dataset. Thus, the wide usage of this tool is anticipated in different metagenomic projects. 16S Classifier is available freely at

Instructions for running the stand-alone version of 16S Classifier on the Linux PC.
1. User can download zip file of a particular hypervariable region or complete 16S, which is freely available at
2. Extract the zipped file which contains a model file (*.Rdata), a script file (*.sh) and an exe file (16sclassifier.exe).
Other dependencies:
1. User has to install R from the following link
2. intall Randomforest by typing the following commands in terminal  R  and install.packages ('randomForest')
# Command line usage #
./16sclassifier.exe 'queryfile' 'modelname'
The query file should be in Fasta format and the model name could be v2, v3, v4, v5, v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 and Complete16S.

Saturday, July 5, 2014

Metagenomic Assemblers:

Here is the list of most commonly used assembler for metagenomics reads used. The list is extensive but by no means is complete. I will try to update as soon as I come across a new one. Help me keeping the list updated if you come across any new and interesting assembler I have missed.

MetaVelvet : : An extension of Velvet assembler to de novo metagenome assembly from short sequence reads: An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.

MetAMOS: a metagenomic assembly and analysis pipeline for AMOS: We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from:

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs. Results: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depthrelative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to highconfident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets, shows that IDBA-UD can reconstruct longer contigs with higher accuracy. Availability: The IDBA-UD toolkit is available at our website

Meta-IDBA: a de Novo assembler for metagenomic data. Motivation: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling of a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated. Results: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species, using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and Abyss for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. Availability: Meta-IDBA toolkit is available at our website

Ray Meta: Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights for specific environments. Ray Meta is open source and available at :

MAP: Motivation: A high-quality assembly of reads generated from shotgun sequencing is a substantial step in metagenome projects. Although traditional assemblers have been employed in initial analysis of metagenomes, they cannot surmount the challenges created by the features of metagenomic data. Result: We present a de novo assembly approach and its implementation named MAP (metagenomic assembly program). Based on an improved overlap/layout/consensus (OLC) strategy incorporated with several special algorithms, MAP uses the mate pair information, resulting in being more applicable to shotgun DNA reads (recommended as >200 bp) currently widely used in metagenome projects. Results of extensive tests on simulated data show that MAP can be superior to both Celera and Phrap for typical longer reads by Sanger sequencing, as well as has an evident advantage over Celera, Newbler and the newest Genovo, for typical shorter reads by 454 sequencing. Availability and implementation: The source code of MAP is distributed as open source under the GNU GPL license, the MAP program and all simulated datasets can be freely available at

Genovo: : Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A nonparametric prior accounts for the unknown number of genomes in the sample. Inference is performed by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs in a series of synthetic experiments and across nine metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, even for low-abundance sequences, and yield a higher assembly score.  

Extended Genovo: Metagenomes present assembly challenges, when assembling multiple genomes from mixed reads of multiple species. An assembler for single genomes can’t adapt well when applied in this case. A metagenomic assembler, Genovo, is a de novo assembler for metagenomes under a generative probabilistic model. Genovo assembles all reads without discarding any reads in a preprocessing step, and is therefore able to extract more information from metagenomic data and, in principle, generate better assembly results. Paired end sequencing is currently widely-used yet Genovo was designed for 454 single end reads. In this research, we attempted to extend Genovo by incorporating paired-end information, named Xgenovo, so that it generates higher quality assemblies with paired end reads. First, we extended Genovo by adding a bonus parameter in the Chinese Restaurant Process used to get prior accounts for the unknown number of genomes in the sample. This bonus parameter intends for a pair of reads to be in the same contig and as an effort to solve chimera contig case. Second, we modified the sampling process of the location of a read in a contig. We used relative distance for the number of trials in the symmetric geometric distribution instead of using distance between the offset and the center of contig used in Genovo. Using this relative distance, a read sampled in the appropriate location has higher probability. Therefore a read will be mapped in the correct location. Results of extensive experiments on simulated metagenomic datasets from simple to complex with species coverage setting following uniform and lognormal distribution showed that Xgenovo can be superior to the original Genovo and the recently proposed metagenome assembler for 454 reads, MAP. Xgenovo successfully generated longer N50 than Genovo and MAP while maintaining the assembly quality even for very complex metagenomic datasets consisting of 115 species. Xgenovo also demonstrated the potential to decrease the computational cost. This means that our strategy worked well. The software and all simulated datasets are publicly available online at

SmashCommunity: a metagenomic annotation and analysis tool. SmashCommunity is a stand-alone metagenomic annotation and analysis pipeline suitable for data from Sanger and 454 sequencing technologies. It supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare compositions of multiple metagenomes and to produce intuitive visual representations of such analyses. Availability: SmashCommunity source code and documentation are available at

Bambus 2: Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources. Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly. Availability: Bambus 2 is open source and available from

MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets. A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA — a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at