Thursday, July 16, 2015

Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences

Another recent publication from our lab.

Citation: 

Sharma, A. K., Gupta, A., Kumar, S., Dhakan, D. B., & Sharma, V. K. (2015). Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences. Genomics.

Functional annotation of gigantic metagenomic data is one of the major time-consuming and computationally demanding tasks, and is currently a bottleneck for efficient analysis. The commonly used homology-based methods to functionally annotate and classify proteins are extremely slow. Therefore, to achieve faster and accurate functional annotation, we have developed an orthology-based functional classifier, 'Woods', using a combination of machine learning and similarity-based approaches. Woods displayed a precision of 98.79% on an independent genomic dataset, 96.66% on a simulated metagenomic dataset and >97% on two real metagenomic datasets. In addition, it performed >87 times faster than BLAST on the two real metagenomic datasets. Woods can be used as a highly efficient and accurate classifier whose high-throughput capability makes it usable on large metagenomic datasets.

The Woods web server is freely accessible at http://metagenomics.iiserb.ac.in/woods/index.php and http://metabiosys.iiserb.ac.in/woods/index.php. The standalone version of Woods can be downloaded from the above web servers and usage instructions are provided in Text S1 and also in the Tutorial section of the web server.


For further information and queries, please contact Ashok K. Sharma directly (ashok@iiserb.ac.in).

Thursday, February 5, 2015

MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data

Another work from our group, published some time ago. Please write to ankitgmeister@gmail.com if you run into any problems. Comments are welcome.

The identification of virulent proteins in any de-novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenome of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both the above tasks is the identification of virulent proteins since a significant proportion of genomic and metagenomic proteins are novel and yet unannotated. The currently available tools which carry out the identification of virulent proteins provide limited accuracy and cannot be used on large datasets. 

Therefore, we have developed MP3, a standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 is developed using an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on a blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets.

To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.

16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets



A recent article from our lab. Please explore the tool and write to ashok@iiserb.ac.in if you run into any problems. Comments are welcome.
 

The sequencing of the 16S rRNA gene is commonly performed to estimate the microbial diversity in a metagenomic study. The rapid developments in genome sequencing technologies have shifted the focus to sequencing selected hypervariable regions (HVRs) of the 16S rRNA gene instead of sequencing the complete gene. Recent metagenomic projects involve the sequencing of only a single HVR or a combination of two or more HVRs. At present there is no specialized method available for the correct identification and classification of species using short variable 16S rRNA sequences. Therefore, we have developed 16S Classifier using a machine learning method, Random Forest, for faster and accurate taxonomic classification of short hypervariable regions of the 16S rRNA sequence. It displayed precision values of up to 0.91 on training datasets and precision values of up to 0.98 on the first test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. 16S Classifier displayed up to 42.9%, 40.7%, 41.0%, 57.9% and 73.8% higher accuracy at phylum, class, order, family and genus levels, respectively, as compared to the commonly used RDP Classifier program. In addition, it is 7.5 times faster than RDP Classifier and 800 times faster than BLAST. 16S Classifier can be easily used with the QIIME pipeline, which is commonly used for 16S rRNA analysis.

To the best of our knowledge, 16S Classifier is the only available tool which can carry out efficient, sensitive and accurate taxonomic assignment of any of the 16S rRNA hypervariable regions which are commonly used in metagenomic projects. On complete 16S rRNA sequences it also displayed exceptional performance (precision of 0.97) on the test dataset. Thus, the wide usage of this tool is anticipated in different metagenomic projects. 16S Classifier is freely available at http://metagenomics.iiserb.ac.in/16Sclassifier.


Instructions for running the stand-alone version of 16S Classifier on a Linux PC:
1. Download the zip file for a particular hypervariable region or for the complete 16S gene, freely available at http://metagenomics.iiserb.ac.in/16Sclassifier/download.html
2. Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh) and an executable file (16sclassifier.exe).
Other dependencies:
1. Install R from http://cran.r-project.org/
2. Install the randomForest package: start R by typing R in a terminal, then run install.packages('randomForest')
# Command line usage #
./16sclassifier.exe 'queryfile' 'modelname'
The query file should be in FASTA format, and the model name must be one of v2, v3, v4, v5, v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 or Complete16S.
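
The invocation above can be wrapped in a small script that checks the arguments before calling the executable. The following is a minimal Python sketch and is not part of the 16S Classifier distribution: the helper names (looks_like_fasta, build_command, run_classifier) are hypothetical, the FASTA check is a rough assumption rather than the tool's own validation, and the only details taken from the instructions are the executable name, the argument order and the list of model names.

```python
import subprocess

# Model names listed in the usage instructions.
VALID_MODELS = {
    "v2", "v3", "v4", "v5", "v6", "v7", "v8",
    "v23", "v34", "v35", "v45", "v56", "v67", "v78",
    "Complete16S",
}


def looks_like_fasta(text):
    """Rough FASTA check (an assumption, not the tool's own validation):
    the first non-empty line is a '>' header and at least one
    sequence line follows it."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    return any(not ln.startswith(">") for ln in lines[1:])


def build_command(query_file, model_name, exe="./16sclassifier.exe"):
    """Return the argument list for the documented command line,
    rejecting model names that are not in the documented list."""
    if model_name not in VALID_MODELS:
        raise ValueError("unknown model name: " + model_name)
    return [exe, query_file, model_name]


def run_classifier(query_file, model_name):
    """Validate the inputs, then run the executable from the
    current directory (where the zip file was extracted)."""
    with open(query_file) as handle:
        if not looks_like_fasta(handle.read()):
            raise ValueError(query_file + " does not look like FASTA")
    subprocess.run(build_command(query_file, model_name), check=True)
```

For example, run_classifier("reads.fasta", "v34") would run ./16sclassifier.exe reads.fasta v34, where reads.fasta is a hypothetical query file placed in the directory containing the extracted files.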


