Thursday, February 5, 2015

MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data

Another paper from our group, published some time back. Please write to us in case of any problem. Comments are welcome.

The identification of virulent proteins in any de novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenomes of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both tasks is the identification of virulent proteins, since a significant proportion of genomic and metagenomic proteins are novel and as yet unannotated. The currently available tools for identifying virulent proteins provide limited accuracy and cannot be used on large datasets.

Therefore, we have developed MP3, a standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 uses an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on a blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets.
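For readers who want to see how the four reported measures relate to each other, here is a minimal sketch in Python that computes Sensitivity, Specificity, accuracy and MCC from a confusion matrix. The counts used below are hypothetical: they assume equal-sized positive and negative sets of 100 proteins each, which is not stated in this post, and were chosen only because they reproduce the figures reported for the complete-protein blind dataset. The function name binary_metrics is ours and is not part of MP3.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and MCC for a binary classifier."""
    sensitivity = tp / (tp + fn)          # fraction of pathogenic proteins recovered
    specificity = tn / (tn + fp)          # fraction of non-pathogenic proteins rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (assumption: 100 positives and 100 negatives).
m = binary_metrics(tp=92, fn=8, tn=100, fp=0)
print(f"Sensitivity {m['sensitivity']:.2f}, Specificity {m['specificity']:.2f}, "
      f"MCC {m['mcc']:.2f}, accuracy {m['accuracy']:.2f}")
# prints: Sensitivity 0.92, Specificity 1.00, MCC 0.92, accuracy 0.96
```

Under that balanced-set assumption the four numbers are mutually consistent. The Blind A and Blind B figures do not fit a balanced split in the same way, so those datasets presumably have unequal class sizes.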

To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at​ex.php.

16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets

A recent article from our lab. Please explore it and write to us in case of any problem. Comments are welcome.

The sequencing of the 16S rRNA gene is commonly performed to estimate the microbial diversity in a metagenomic study. The rapid developments in genome sequencing technologies have shifted the focus to sequencing selected hypervariable regions (HVRs) of the 16S rRNA gene instead of the complete gene. Recent metagenomic projects sequence only a single HVR or a combination of two or more HVRs. At present there is no specialized method available for the correct identification and classification of species using short variable 16S rRNA sequences. Therefore, we have developed 16S Classifier using a machine learning method, Random Forest, for fast and accurate taxonomic classification of short hypervariable regions of the 16S rRNA sequence. It displayed precision values of up to 0.91 on training datasets and up to 0.98 on the first test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. 16S Classifier displayed up to 42.9%, 40.7%, 41.0%, 57.9% and 73.8% higher accuracy at the phylum, class, order, family and genus levels, respectively, compared to the commonly used RDP Classifier program. In addition, it is 7.5 times faster than RDP Classifier and 800 times faster than BLAST. 16S Classifier can be easily used with the QIIME pipeline, which is commonly used for 16S rRNA analysis.

To the best of our knowledge, 16S Classifier is the only available tool which can carry out efficient, sensitive and accurate taxonomic assignment of any of the 16S rRNA hypervariable regions commonly used in metagenomic projects. It also performed exceptionally well on complete 16S rRNA sequences, with a precision of 0.97 on the test dataset. Thus, wide usage of this tool is anticipated in different metagenomic projects. 16S Classifier is available freely at

Instructions for running the stand-alone version of 16S Classifier on a Linux PC:
1. Download the zip file for a particular hypervariable region or for the complete 16S, which is freely available at
2. Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh) and an executable file (16sclassifier.exe).
Other dependencies:
1. Install R from the following link
2. Install the randomForest package: start R by typing R in a terminal, then run install.packages('randomForest')
# Command line usage #
./16sclassifier.exe 'queryfile' 'modelname'
The query file should be in FASTA format, and the model name must be one of v2, v3, v4, v5, v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 or Complete16S.
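If you call the tool from a script, it can help to check the inputs before launching it. The following is a rough Python sketch, not part of the 16S Classifier distribution: the helper names (looks_like_fasta, build_command, classify) are hypothetical, the FASTA check is only a quick sanity test, and the executable path assumes you run from the directory where the zip file was extracted. The model names and the argument order are taken from the usage line above.

```python
import subprocess
from pathlib import Path

# Model names exactly as listed in the usage instructions.
MODELS = ["v2", "v3", "v4", "v5", "v6", "v7", "v8",
          "v23", "v34", "v35", "v45", "v56", "v67", "v78", "Complete16S"]


def looks_like_fasta(text):
    """Quick sanity check: a '>' header first, then at least one sequence line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    return any(not ln.startswith(">") for ln in lines[1:])


def build_command(query_file, model, exe="./16sclassifier.exe"):
    """Build the argument list: executable, query file, model name."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    return [exe, str(query_file), model]


def classify(query_file, model):
    """Validate the inputs, then run the classifier (assumes it is in the current directory)."""
    if not looks_like_fasta(Path(query_file).read_text()):
        raise ValueError(f"{query_file} does not look like a FASTA file")
    return subprocess.run(build_command(query_file, model), check=True)
```

For example, classify("reads.fasta", "v4") would run ./16sclassifier.exe reads.fasta v4, where reads.fasta is a placeholder file name.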