Bioinformatics Tools

Sunday, February 26, 2012

Pharmacological compound databases

ZINC: http://zinc.docking.org/ ZINC is a free database of commercially available compounds for virtual screening. It contains over 14 million purchasable compounds in ready-to-dock, 3D formats. ZINC is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF). To cite ZINC, please reference: Irwin and Shoichet, J. Chem. Inf. Model. 2005;45(1):177-82. ZINC is financially supported by NIGMS (GM71896).

PubChem: http://pubchem.ncbi.nlm.nih.gov/ PubChem, released in 2004, provides information on the biological activities of small molecules. PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system: PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. More information about using each component database may be found through the links on the homepage. Links from PubChem's chemical structure records to other Entrez databases provide information on biological properties. These include links to PubMed scientific literature and NCBI's protein 3D structure resource. Links to PubChem's bioassay database present the results of biological screening. Links to depositor web sites provide further information. A PubChem FTP site, Download Facility, Power User Gateway (PUG), Standardization Service, Score Matrix Service, Structure Clustering, and Deposition Gateway are also available. PubChem provides tips and example code that allow users to add the free PubChem search tool to their own sites. A PubChem publication site provides links to published articles.

The DrugBank database: http://www.drugbank.ca/ is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6712 drug entries including 1441 FDA-approved small molecule drugs, 134 FDA-approved biotech (protein/peptide) drugs, 83 nutraceuticals and 5086 experimental drugs. Additionally, 4231 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. DrugBank is supported by David Wishart, Departments of Computing Science & Biological Sciences, University of Alberta. DrugBank is also supported by The Metabolomics Innovation Centre, a Genome Canada-funded core facility serving the scientific community and industry with world-class expertise and cutting-edge technologies in metabolomics. 

ChemDB: http://cdb.ics.uci.edu/index.htm ChemDB offers the following components. ChemicalSearch (Find Chemicals by Various Criteria): find a chemical by basic criteria like molecular weight and predicted logP, or by the more abstract notion of structural similarity. Virtual Chemical Space (Retro-Synthesis and Combinatorial Library Design): interactively deconstruct target compounds into component precursors and reconstruct similar building-blocks into combinatorial libraries representing the "virtual chemical space" near the target compound. Reaction Explorer (Synthesis Explorer and Mechanism Explorer): an interactive system for learning and practicing reactions, syntheses and mechanisms in organic chemistry, with advanced support for the automatic generation of random problems, curved-arrow mechanism diagrams, and inquiry-based learning. Datasets (For Machine Learning and Searching Experiments): various available chemical datasets annotated with interesting properties to train and test machine-learning prediction and searching methods. Supplements (Articles and Support Material): online articles relating to the system, with the supplementary data and figures referenced in them.

The Chapman & Hall/CRC Chemical Database is a structured database holding information on chemical substances. It includes descriptive and numerical data on chemical, physical and biological properties of compounds; systematic and common names of compounds; literature references; structure diagrams and their associated connection tables. The Dictionary of Natural Products Online is a subset of this database and includes all compounds contained in the Dictionary of Natural Products (Main Work and Supplements). The Dictionary of Natural Products (DNP) is the only comprehensive and fully-edited database on natural products. It arose as a daughter product of the well-known Dictionary of Organic Compounds (DOC) which, since its inception in the 1930s has, through successive editions, always been a leading source of natural product information. In the early 1980s, following the publication of the Fifth Edition of DOC, the first to be founded on database methods, the Editors and contributors for the various classes of natural products embarked on a programme of enlargement, rationalisation and classification of the natural product entries, while at the same time keeping the coverage up-to-date. In 1992 the results of this major project, which had grown to match DOC in size, were separately published in both book (7 volumes) and CD-ROM format, leaving DOC with coverage of only the most widely distributed and/or practically important natural products. DNP compilation has since continued unabated by a combination of an exhaustive survey of current literature and of historical sources such as reviews to pick up minor natural products and items of data previously overlooked. The compilation of DNP is undertaken by a team of academics and freelancers who work closely with the in-house editorial staff at Chapman & Hall. Each contributor specialises in a particular natural product class (e.g. alkaloids) and is able to reorganise and classify the data in the light of new research so as to present it in the most consistent and logical manner possible. Thus the compilation team is able to reconcile errors and inconsistencies. The resulting on-line version represents an extremely well organised dictionary documenting virtually every known natural product. A valuable feature of the design is that closely related natural products (e.g. where one is a glycoside or simple ester of another) are organised into the same entry, thus simplifying and bringing out the underlying structural and biosynthetic relationships of the compounds. Structure diagrams are drawn and numbered in the most consistent way according to best stereochemical and biogenetic relationships. In addition, every natural product is indexed by structural/biogenetic type under one of more than 1000 headings, allowing the rapid location of all compounds in the category, even where they have undergone biogenetic modification and no longer share exactly the same skeleton. There is extensive (but not complete) coverage of natural products of unknown structure, and the coverage of these is currently being enhanced by various retrospective searches.

ChemSpider: http://www.chemspider.com/ is a free chemical structure database providing fast text and structure search access to over 26 million structures from hundreds of data sources.

ChemBank: http://chembank.broadinstitute.org/ is a public, web-based informatics environment created by the Broad Institute's Chemical Biology Program and funded in large part by the National Cancer Institute's Initiative for Chemical Genetics (ICG). This knowledge environment includes freely available data derived from small molecules and small-molecule screens, and resources for studying the data so that biological and medical insights can be gained. ChemBank is intended to guide chemists synthesizing novel compounds or libraries, to assist biologists searching for small molecules that perturb specific biological pathways, and to catalyze the process by which drug hunters discover new and effective medicines. ChemBank stores an increasingly varied set of cell measurements derived from, among other biological objects, cell lines treated with small molecules. Analysis tools are available and are being developed that allow the relationships between cell states, cell measurements and small molecules to be determined. Currently, ChemBank stores information on hundreds of thousands of small molecules and hundreds of biomedically relevant assays that have been performed at the ICG in collaborations involving biomedical researchers worldwide. These scientists have agreed to perform their experiments in an open data-sharing environment. The goals of ChemBank are to provide life scientists unfettered access to biomedically relevant data and tools heretofore available almost exclusively in the private sector. We intend for ChemBank to be a planning and discovery tool for chemists, biologists, and drug hunters anywhere, with the only necessities being a computer, access to the Internet, and a desire to extract knowledge from public experiments whose greatest value is likely to reside in their collective sum.

SuperDrug: http://bioinf.charite.de/superdrug/ Different resources exist for experimentally determined and computed three-dimensional (3D) structures of low molecular weight compounds, but for approved drugs no free, publicly accessible source of 3D structures and conformers is available. Furthermore, for selection purposes or for correlating structural similarity with medical application, it would be desirable to assign the Anatomical Therapeutic Chemical (ATC) classification codes to each structure according to the WHO scheme. The database contains approximately 2500 3D structures of active ingredients of essential marketed drugs. To account for structural flexibility they are represented by 10^5 structural conformers. A web-query system enables searches by drug name, synonyms, trade name, trivial name, formula, CAS number, ATC code etc. 2D-similarity screening (Tanimoto coefficients) and an automatic 3D-superposition procedure based on conformational representation are implemented. Drug structures above a similarity threshold, as well as superimposed conformers, can be retrieved in the mol file format via a graphical interface. For academic use the system is accessible at http://bioinf.charite.de/superdrug . The retrieval system requires the free browser plugin 'chime' from MDL for visualization.
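As a rough illustration of the 2D-similarity screening mentioned above: the Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes fingerprints are given as Python sets of "on" bit positions. The function names, the toy fingerprints and the 0.7 threshold are made up for illustration and are not taken from SuperDrug.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def screen(query, library, threshold=0.7):
    """Return (name, score) pairs scoring at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints, for illustration only.
    query = {1, 4, 7, 9}
    library = {"drug_a": {1, 4, 7, 9, 12}, "drug_b": {2, 3, 5}}
    print(screen(query, library))
```

Real fingerprints would come from a cheminformatics toolkit; only the scoring step is shown here.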

Ligand Expo: http://ligand-expo.rutgers.edu/ Ligand Expo (formerly Ligand Depot) provides chemical and structural information about small molecules within the structure entries of the Protein Data Bank. Tools are provided to search the PDB dictionary for chemical components, to identify structure entries containing particular small molecules, and to download the 3D structures of the small molecule components in the PDB entry. A sketch tool is also provided for building new chemical definitions from reported PDB chemical components.

Schrödinger has made available a set of the ligand decoys used in Glide enrichment studies. 1K Drug-Like Ligand Decoys Set: This collection of ligands was created by selecting 1000 ligands from a one million compound library that were chosen to exhibit "drug-like" properties. Creation and application of the ligand set is presented in the following publications: 

Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shaw, D. E.; Shelley, M.; Perry, J. K.; Francis, P.; Shenkin, P. S, "Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy", J. Med. Chem. 2004, 47, 1739-1749.

Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L., "Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening", J. Med. Chem. 2004, 47, 1750-1759.

The SuperLigands: http://bioinf-tomcat.charite.de/superligands/ The SuperLigands is an encyclopedia that is dedicated to a ligand oriented view of the protein structural space. The database contains small molecule structures occurring as ligands in the Protein Data Bank. SuperLigands integrates different information about drug-likeness or binding properties. A 3D superpositioning algorithm is implemented that allows screening all ligands for possible scaffold hoppers as well as a 2D similarity screen for compounds based on fingerprints.

ChEBI: http://www.ebi.ac.uk/chebi/ Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds. The term ‘molecular entity’ refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI incorporates an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI uses nomenclature, symbolism and terminology endorsed by international scientific bodies, including the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB).

Molecules directly encoded by the genome (e.g. nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. All data in the database is non-proprietary or is derived from a non-proprietary source. It is thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.

Wednesday, February 1, 2012

Working with multiple urls from text file to tabs and vice versa

Copy or save multiple tabs to a text file

There are various methods; I use this one for Firefox:

1. Install the send-tab-urls add-on for Firefox: https://addons.mozilla.org/en-US/firefox/addon/send-tab-urls/

2. Open all the URLs in different tabs.

3. Go to File --> Send tab urls --> select your options and send to clipboard.

4. The URLs open in all tabs will be copied to your clipboard. You can paste them into a text file and save them, or reuse them to open multiple tabs again.

Use: While looking for published articles on PubMed or Google, you may want to save the list of relevant articles for each search as a text file and reuse it whenever required. At home I did not have access to various journals, so I used to save the links in text files and then, at the institute, open them all and save the articles I needed.

Open multiple urls in different tabs from a text file

1. Copy all the URLs into a text file.

2. Open http://www.urlopener.com/index.php

3. Paste them into the space provided.

4. Click submit and then click open all.

Use: Opening all the URLs in one go, for faster work. It makes sense to open a new Firefox window for this so that you don't clutter your ongoing work with the multiple URLs.
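If you prefer not to rely on a website, the same idea can be sketched in a few lines of Python using the standard-library webbrowser module. This is only a minimal sketch: the script name open_urls.py and the file name urls.txt are hypothetical, and the code assumes the text file holds one URL per line.

```python
import sys
import webbrowser


def read_urls(text):
    """Return the lines of text that start with http:// or https://."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls


def open_all(urls):
    """Ask the default browser to open each URL in a new tab."""
    for url in urls:
        webbrowser.open_new_tab(url)


if __name__ == "__main__":
    # Usage (hypothetical file names): python open_urls.py urls.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        open_all(read_urls(handle.read()))
```

Blank lines and anything that does not look like a web address are skipped, so a list pasted from the clipboard can be used as it is.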

Meanwhile, I also found a very useful tool for comparing lists and making Venn diagrams:

Please cite: Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams. http://bioinfogp.cnb.csic.es/tools/venny/index.html
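For a quick check without drawing the diagram, the regions of a two-list Venn comparison can be computed with Python sets. This is only a sketch of the underlying idea, not how VENNY itself works. The function name and the example gene lists are made up for illustration.

```python
def compare_lists(list_a, list_b):
    """Split the items of two lists into the three regions of a two-set Venn diagram."""
    a, b = set(list_a), set(list_b)
    return {
        "only_a": sorted(a - b),  # items found only in the first list
        "both": sorted(a & b),    # items shared by the two lists
        "only_b": sorted(b - a),  # items found only in the second list
    }


if __name__ == "__main__":
    # Hypothetical gene lists, for illustration only.
    regions = compare_lists(["TP53", "BRCA1", "EGFR"], ["EGFR", "MYC", "TP53"])
    for name, items in regions.items():
        print(name, len(items), items)
```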

Happy surfing.

Sunday, January 1, 2012

Homology modeling of proteins


CPHmodels: http://www.cbs.dtu.dk/services/CPHmodels/ CPHmodels-3.0 is a web server that predicts protein 3D structure by single-template homology modeling. The server employs a hybrid of the scoring functions of CPHmodels-2.0 and a novel remote homology-modeling algorithm. It first attempts to model a query sequence using the fast CPHmodels-2.0 profile-profile scoring function, which is suitable for close homology modeling. The new, computationally costly remote homology-modeling algorithm is engaged only if no suitable PDB template is identified in the initial search. CPHmodels-3.0 was benchmarked in the CASP8 competition and produced models for 94% of the targets (117 out of 128); 74% of these were predicted as high-reliability models (87 out of 117). These achieved an average RMSD of 4.6 Å when superimposed on the 3D structure. The remaining 26% low-reliability models (30 out of 117) could be superimposed on the true 3D structure with an average RMSD of 9.3 Å. These performance values place the CPHmodels-3.0 method in the group of high-performing 3D-prediction tools. Besides its accuracy, one of the important features of the method is its speed. For most queries, the response time of the server is less than 20 minutes. The web server is available at http://www.cbs.dtu.dk/services/CPHmodels/.
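For readers unfamiliar with the RMSD values quoted above: the root-mean-square deviation is the square root of the mean squared distance between matching atoms of two structures. The sketch below assumes the two coordinate sets are already superimposed and list the same atoms in the same order. It does not perform the superposition itself, and the function name is made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points; no superposition is done."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two toy "structures" of two atoms each; coordinates in Å.
    model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    native = [(0.0, 0.0, 0.5), (1.5, 0.5, 0.0)]
    print(round(rmsd(model, native), 3))
```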

MODELLER: http://www.salilab.org/modeller/ MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1,2). The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints (3,4), and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, comparison of protein structures, etc. MODELLER is available for download for most Unix/Linux systems, Windows, and Mac.

SWISS-MODEL: http://swissmodel.expasy.org/ SWISS-MODEL is a fully automated protein structure homology-modeling server, accessible via the ExPASy web server, or from the program DeepView (Swiss Pdb-Viewer). The purpose of this server is to make Protein Modeling accessible to all biochemists and molecular biologists worldwide.

Phyre2: http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index Phyre2 (Protein Homology/AnalogY Recognition Engine; pronounced as 'fire') is a web-based service for protein structure prediction that is free for non-commercial use. Phyre is among the most popular methods for protein structure prediction, having been cited over 1000 times. Like other remote homology recognition (protein threading) techniques, it can regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed (funded by the BBSRC) to ensure a user-friendly interface for users inexpert in protein structure prediction methods.

HHpred: http://toolkit.tuebingen.mpg.de/hhpred The primary aim in developing HHpred was to provide biologists with a method for sequence database searching and structure prediction that is as easy to use as BLAST or PSI-BLAST and that is at the same time much more sensitive in finding remote homologs. In fact, HHpred's sensitivity is competitive with the most powerful servers for structure prediction currently available. HHpred is the first server that is based on the pairwise comparison of profile hidden Markov models (HMMs). Whereas most conventional sequence search methods search sequence databases such as UniProt or the NR, HHpred searches alignment databases, like Pfam or SMART. This greatly simplifies the list of hits to a number of sequence families instead of a clutter of single sequences. All major publicly available profile and alignment databases are available through HHpred. HHpred accepts a single query sequence or a multiple alignment as input. Within only a few minutes it returns the search results in an easy-to-read format similar to that of PSI-BLAST. Search options include local or global alignment and scoring secondary structure similarity. HHpred can produce pairwise query-template sequence alignments, merged query-template multiple alignments (e.g. for transitive searches), as well as 3D structural models calculated by the MODELLER software from HHpred alignments.

LOMETS: http://zhanglab.ccmb.med.umich.edu/LOMETS/ LOMETS (Local Meta-Threading-Server) is an online web service for protein structure prediction. It generates 3D models by collecting high-scoring target-to-template alignments from 8 locally installed threading programs (FUGUE, HHsearch, MUSTER, PPA, PROSPECT2, SAM-T02, SPARKS, SP3). A detailed description of the server can be found in the Readme file.

MODBASE: MODBASE (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by MODPIPE, an automated modeling pipeline that relies primarily on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE currently contains 5 152 695 reliable models for domains in 1 593 209 unique protein sequences; only models based on statistically significant alignments and/or models assessed to have the correct fold are included. MODBASE also allows users to calculate comparative models on demand, through an interface to the MODWEB modeling server (http://salilab.org/modweb). Other resources integrated with MODBASE include databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites (LIGBASE), predicted ligand binding sites (AnnoLyze), structurally defined binary domain interfaces (PIBASE) and annotated single nucleotide polymorphisms and somatic mutations found in human proteins (LS-SNP, LS-Mut). MODBASE models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/).

Robetta: http://www.robetta.org/ Robetta provides both ab initio and comparative models of protein domains. It uses the ROSETTA fragment insertion method (Simons et al. (1997) J Mol Biol. 268:209-225). Domains without a detectable PDB homolog are modeled with the Rosetta de novo protocol (Bonneau et al. (2002) J Mol Biol. 322:65-78). Comparative models are built from Parent PDBs detected by UW-PDB-BLAST or HHSEARCH and aligned by various methods which include HHSEARCH, Compass, and Promals. Loop regions are assembled from fragments and optimized to fit the aligned template structure (Rohl et al. (2004) Proteins 55:656-677). The procedure is fully automated. Robetta is evaluated in the blind benchmarking experiment CASP. Robetta uses ROSETTA software which is developed and maintained by the Rosetta Commons

chunk-TASSER: http://cssb.biology.gatech.edu/skolnick/webservice/chunk-TASSER/index.html A protein structure prediction method that combines threading templates from SP3 with ab initio folded chunk structures (three consecutive segments of regular secondary structure). It is better suited to extremely hard targets.

PSiFR (Protein Structure and Function prediction Resource): http://psifr.cssb.biology.gatech.edu/ PSiFR provides integrated tools for protein tertiary structure prediction and for structure- and sequence-based function annotation. The various methods used are described below:

Protein structure prediction methods 
TASSER (Threading/ASSembly/Refinement) is an automated protein structure prediction and modeling method. TASSER employs a hierarchical approach consisting of template identification by threading, followed by tertiary structure assembly by rearranging continuous template fragments (Zhang, Y. and Skolnick, J., 2004, PNAS). 

TASSER-Lite is a comparative protein tertiary structure modeling tool. It is presently optimized for the modeling of single-domain (41-200 residues) homologous protein sequences; that is, proteins with a sequence identity greater than 25% with respect to the best threading template (Pandit et al., 2006, Biophysical Journal). The templates for the modeling of the query sequence are identified using the threading program PROSPECTOR_3 (Skolnick et al., 2004, Proteins). Subsequently, the structure is refined using the TASSER program with optimized parameters.

METATASSER is a protein tertiary structure prediction method that employs the 3D-Jury approach to select threading templates from SPARKS2 (Zhou H. and Zhou Y., 2004, Proteins), SP3 (Zhou H. and Zhou Y., 2005, Proteins) and PROSPECTOR_3 (Skolnick et al., 2004, Proteins), which provide aligned fragments and tertiary restraints as input to the TASSER procedure to generate full-length models. In the CASP7 and CASP8 assessments of server performance, METATASSER was among the top-performing servers (Zhou et al., 2007, Proteins; Zhou et al., 2009, Proteins (in press)).

ESyPred3D: http://www.fundp.ac.be/sciences/biologie/urbm/bioinfo/esypred/ ESyPred3D is a new automated homology modeling program. The method benefits from the improved alignment performance of a new alignment strategy using neural networks. Alignments are obtained by combining, weighting and screening the results of several multiple alignment programs. The final three-dimensional structure is built using the modeling package MODELLER.

Protein Model Portal (PMP): http://www.proteinmodelportal.org/ PMP gives access to various models computed by comparative modeling methods provided by different partner sites, and provides access to various interactive services for model building, and quality assessment.

ProModel: http://www.vlifesciences.com/products/VLifeMDS/Protein_Modeller.php ProModel is a complete package for modeling proteins, whose crystal structure is unknown based on the amino acid sequences of a close homologue. ProModel allows homology modeling from either a selected template or a user defined template. Users can perform an automated homology modeling simply by reading in the template file or can perform a knowledge based manual modeling by specific loop insertions or by changing specific amino acid residues. A local BLAST speeds up the process of modeling. ProModel enables an exhaustive analysis of the target protein structure, active site and channels. The user can conveniently view, edit and superimpose proteins with ProModel. Facilities to distribute the secondary structure elements, distribute the Phi-Psi angles of residues in Ramachandran plot, identify and visualize cavities and channels make it a very useful product. ProModel is available for both Linux and Windows® operating systems.

SCWRL4: http://dunbrack.fccc.edu/scwrl4/index.php SCWRL4 is based on a new algorithm and new potential function that results in improved accuracy at reasonable speed. This has been achieved through: 1) a new backbone-dependent rotamer library based on kernel density estimates; 2) averaging over samples of conformations about the positions in the rotamer library; 3) a fast anisotropic hydrogen bonding function; 4) a short-range, soft van der Waals atom-atom interaction potential; 5) fast collision detection using k-discrete oriented polytopes; 6) a tree decomposition algorithm to solve the combinatorial problem; and 7) optimization of all parameters by determining the interaction graph within the crystal environment using symmetry operators of the crystallographic space group. Accuracies as a function of electron density of the side chains demonstrate that side chains with higher electron density are easier to predict than those with low electron density and presumed conformational disorder. For a testing set of 379 proteins, 86% of chi1 angles and 75% of chi1+2 are predicted correctly within 40 degrees of the X-ray positions. Among side chains with higher electron density (25th-100th percentile), these numbers rise to 89% and 80%. The new program maintains its simple command-line interface, designed for homology modeling. To achieve higher accuracy, SCWRL4 is somewhat slower than SCWRL3 when run in the default flexible rotamer model (FRM) by a factor of 3-6, depending on the protein. When run in the rigid rotamer model (RRM), SCWRL4 is about the same speed as SCWRL3. In both cases, SCWRL4 will converge on very large proteins or protein complexes or those with very dense interaction graphs, while SCWRL3 sometimes would not. The SCWRL4 paper has been published in Proteins: Structure, Function, Bioinformatics. A reprint is available. Please cite the paper: G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack, Jr. 
Improved prediction of protein side-chain conformations with SCWRL4. Proteins (2009). 
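To make the "within 40 degrees" criterion above concrete, here is a minimal sketch of how such an accuracy figure could be computed from predicted and observed chi angles. Dihedral angles wrap around at ±180 degrees, so the difference has to be taken on a circle. The function names are made up for illustration, counting a difference of exactly 40 degrees as correct is an assumption, and this is not SCWRL4's own evaluation code.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return 360.0 - d if d > 180.0 else d


def fraction_within(predicted, observed, tolerance=40.0):
    """Fraction of angles predicted within `tolerance` degrees of the observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty angle lists of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if angle_diff(p, o) <= tolerance)
    return hits / len(predicted)


if __name__ == "__main__":
    # Toy chi1 angles in degrees; 170 and -170 are only 20 degrees apart.
    predicted = [60.0, 180.0, -60.0, 170.0]
    observed = [65.0, -175.0, 60.0, -170.0]
    print(fraction_within(predicted, observed))
```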

VADAR: http://vadar.wishartlab.com/ VADAR (Volume, Area, Dihedral Angle Reporter) is a compilation of more than 15 different algorithms and programs for analyzing and assessing peptide and protein structures from their PDB coordinate data. The results have been validated through extensive comparison to published data and careful visual inspection. The VADAR web server supports the submission of either PDB formatted files or PDB accession numbers. VADAR produces extensive tables and high quality graphs for quantitatively and qualitatively assessing protein structures determined by X-ray crystallography, NMR spectroscopy, 3D-threading or homology modelling. Please cite the following: Leigh Willard, Anuj Ranjan, Haiyan Zhang, Hassan Monzavi, Robert F. Boyko, Brian D. Sykes, and David S. Wishart, "VADAR: a web server for quantitative evaluation of protein structure quality", Nucleic Acids Res. 2003 July 1; 31 (13): 3316-3319.

IntFOLD: http://www.reading.ac.uk/bioinf/IntFOLD/ The IntFOLD server provides a unified interface for tertiary structure prediction/3D modeling, 3D model quality assessment, intrinsic disorder prediction, domain prediction, and prediction of protein-ligand binding residues.

PEPstr: http://www.imtech.res.in/raghava/pepstr/ The Pepstr server predicts the tertiary structure of small peptides with sequence lengths between 7 and 25 residues. The prediction strategy is based on the realization that the β-turn is an important and consistent feature of small peptides, in addition to regular structures. Thus, the method uses both the regular secondary structure information predicted by PSIPRED and the β-turn information predicted by BetaTurns. The side-chain angles are placed using a standard backbone-dependent rotamer library. The structure is further refined with energy minimization and molecular dynamics simulations using Amber version 6.

BSR: http://cssb.biology.gatech.edu/BSR Binding Site Refinement employs a new template-based method for the local refinement of ligand-binding regions in protein models using closely as well as distantly related templates identified by threading. A Support Vector Regression (SVR) model is used to select likely correct binding site geometries in a large ensemble of multiple receptor conformations. The SVR model employs several scoring functions that impose geometrical restraints on the Cα positions, account for a specific chemical environment within a binding site and optimize the interactions with putative ligands.

KeyRecep: http://www.immd.co.jp/en/product_2.html KeyRecep is the best-suited solution for rational molecular design when the 3D structure of the target protein is unknown. Users can estimate the characteristics of the binding site of the target protein by superposing multiple active compounds in 3D space so that the physicochemical properties of the compounds match maximally with each other. (Estimation of virtual receptor model) Users can also examine relationship between chemical structures and the activities based on the multiple regression analysis with indices of conformity of each compound to the virtual receptor model and the activity values. (3D-SAR function) For compounds whose activities are unknown, users can estimate the activities based on the indices of conformity to the virtual receptor model and can perform virtual screening. (DB search function) KeyRecep rationally and strategically accelerates the molecular design projects based on hit compounds discovered by high throughput screening (HTS) or based on information on compounds from literature or patents. KeyRecep facilitates the structural expansion of such compounds to obtain lead compounds and further drug candidates.

PROTEUS2: http://wks16338.biology.ualberta.ca/proteus2/ PROTEUS2 is a web server designed to support comprehensive protein structure prediction and structure-based annotation. PROTEUS2 accepts either single sequences (for directed studies) or multiple sequences (for whole proteome annotation) and predicts the secondary and, if possible, tertiary structure of the query protein(s). Unlike most other tools or servers, PROTEUS2 bundles signal peptide identification, transmembrane helix prediction, transmembrane β-strand prediction, secondary structure prediction (for soluble proteins) and homology modeling (i.e. 3D structure generation) into a single prediction pipeline. Using a combination of progressive multi-sequence alignment, structure-based mapping, hidden Markov models, multi-component neural nets and up-to-date databases of known secondary structure assignments, PROTEUS2 is able to achieve among the highest reported levels of predictive accuracy for signal peptides (Q2=94%), membrane spanning helices (Q2=87%) and secondary structure (Q3 score of 81.3% ). PROTEUS2's homology modeling services also provide high quality 3D models that compare favorably with those generated by SWISS-MODEL (within 0.2 Å RMSD). The average PROTEUS2 prediction takes ~2 minutes per query sequence. Source code is also freely available here.

PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/ is a simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated - BLAST). Using a very stringent cross validation method to evaluate the method's performance, PSIPRED 2.6 achieves an average Q3 score of 80.7%. Predictions produced by PSIPRED were also submitted to the CASP4 evaluation and assessed during the CASP4 meeting, which took place in December 2000 at Asilomar. PSIPRED 2.0 achieved an average Q3 score of 80.6% across all 40 submitted target domains with no obvious sequence similarity to structures present in PDB, which ranked PSIPRED top out of 20 evaluated methods (an earlier version of PSIPRED was also ranked top in CASP3, held in 1998). It is important to realize, however, that due to the small sample sizes, the results from CASP are not statistically significant, although they do give a rough guide as to the current "state of the art". For a more reliable evaluation, the EVA web site at Columbia University provides a continuous evaluation. Also see the EVA servlet to visualize a breakdown of specific types of errors made by PSIPRED and other secondary structure prediction methods. Note that at the time of writing, the EVA site is no longer being updated. The PSIPRED V2.6 software can be downloaded from the PSIPRED site. Please note that you should read the license terms given in the README file if you wish to incorporate PSIPRED in another program or Web server. Older releases of PSIPRED can also be downloaded there.
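The Q3 score quoted above is the percentage of residues whose three-state assignment (helix H, strand E, coil C) agrees with the reference assignment. A minimal sketch in Python, assuming the prediction and the reference are given as equal-length strings of H/E/C characters; the function name q3_score is only for illustration and is not part of PSIPRED:

```python
def q3_score(predicted, observed):
    """Return the percentage of positions where two three-state strings agree."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and reference must have the same length")
    if not observed:
        raise ValueError("empty input")
    # Count residues whose predicted state equals the reference state
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)

# Example call (hypothetical strings):
# q3_score("HHHHCCEE", "HHHCCCEE")
```

A Q3 of 80% therefore means that roughly four out of five residues were placed in the right state; it says nothing about where in the sequence the errors fall.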

I-TASSER : http://zhanglab.ccmb.med.umich.edu/I-TASSER/ server is an Internet service for protein structure and function predictions. 3D models are built based on multiple-threading alignments by LOMETS and iterative TASSER assembly simulations; function insights are then derived by matching the predicted models with protein function databases. I-TASSER (as 'Zhang-Server') was ranked as the No 1 server for protein structure prediction in recent CASP7, CASP8 and CASP9 experiments. It was also ranked as the best for function prediction in CASP9. The server is in active development with the goal to provide the most accurate structural and functional predictions using state-of-the-art algorithms.

JPred: http://www.compbio.dundee.ac.uk/www-jpred/ Jpred is a protein secondary structure prediction server that has been in operation since approximately 1998. Jpred incorporates the Jnet algorithm in order to make more accurate predictions. In addition to protein secondary structure, Jpred also makes predictions of solvent accessibility and coiled-coil regions (Lupas method). The current version of Jpred (v3) follows on from previous versions of Jpred developed and maintained by James Cuff and Jonathan Barber.

Verifying your modeled protein with online servers: 

Structural Analysis and Verification Server (SAVES): http://nihserver.mbi.ucla.edu/SAVES/ SAVES uses the following programs to check the quality of protein structures:

  • Procheck: Checks the stereochemical quality of a protein structure by analyzing residue-by-residue geometry and overall structure geometry.
  • What_Check: Derived from a subset of protein verification tools from the WHATIF program (Vriend, 1990), this does extensive checking of many stereochemical parameters of the residues in the model.
  • ERRAT: Analyzes the statistics of non-bonded interactions between different atom types and plots the value of the error function versus position of a 9-residue sliding window, calculated by a comparison with statistics from highly refined structures.
  • Verify3D: Determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D) by assigning a structural class based on its location and environment (alpha, beta, loop, polar, nonpolar, etc.) and comparing the results to good structures.
  • Prove: Calculates the volumes of atoms in macromolecules using an algorithm which treats the atoms like hard spheres and calculates a statistical Z-score deviation for the model from highly resolved (2.0 Å or better) and refined (R-factor of 0.2 or better) PDB-deposited structures.

COLORADO-3D: http://asia2.genesilico.pl/colorado3d/ COLORADO-3D is a web tool that greatly facilitates the visual analysis of various features in three-dimensional protein structures, directly at the level of the protein structure, with the aid of commonly used viewers such as RASMOL or SWISSPDBVIEWER. Among the features most important to the structural biologist that the server can display in color are potential errors in protein structure (detected by ANOLEA, PROSA, PROVE, VERIFY3D), regions buried in the protein core and inaccessible to the solvent, and regions of high or low sequence conservation (e.g. detected by RATE4SITE). In particular, COLORADO-3D may serve to visualize the results of quality assessment of a protein structure at various stages of model building and refinement (both in the case of experimental structure determination and homology modeling).

Tuesday, December 27, 2011

Circular dichroism code to help in data analysis

I was looking for code to rearrange the thermal melt data I get from CD (Circular Dichroism). I could not find any code to convert .jsw files to CSV in batch, and JASCO’s Spectrum Analysis software does not help with that either; let me know if there is a batch conversion option for .jsw files. For now you have to convert each .jsw file to CSV individually and collect the CSV files in one folder. Once that is done, the code given below gathers the data from all of those files into a single file, which makes the data analysis easier. It copies the data from 350 nm to 200 nm from every file into one file, using each file name as the column header for mdeg and tension (HV).

Steps:

1. Install Python if you do not already have it (http://www.python.org/getit/).

2. Copy all the CSV files into one folder, keeping their names.

3. Write the names of the CSV files in a text file and save it as file_name.txt in the same folder as your data and code.

    a. One way to do this: get to the MS-DOS prompt or the Windows command line and navigate to the directory whose contents you want to list. If you're new to the command line, familiarize yourself with the cd command and the dir command. Once in that directory, type this command: dir /b > file_name.txt

    b. Open the newly created file_name.txt in the same folder and check the file names. If file_name.txt itself is listed, remove that entry so that only the CSV file names remain in the text file.

4. Copy the code given below into Notepad and save it as a .py file (it is Python code) in the same folder.

5. Right-click the Python file, open it in Python IDLE and run it (press F5).

6. You will get a result file named final_file.txt. It is a tab-delimited text file with your mdeg and HV data sorted from 350 nm to 200 nm; open it with Excel. You can change the code to suit your needs. For example, if your data run from 260 nm to 200 nm, change x=range(151) to x=range(61) and outfile.write(str(350-j)) to outfile.write(str(260-j)).

7. Hope that helps. Thank Rhishikesh Bargaje (he wrote the code for me) if it works, and write to me if you face a problem; I can try to help.



Code:



# Read the list of CSV file names, skipping blank lines
infile = open('file_name.txt','r')
s = [name.strip() for name in infile.read().split('\n') if name.strip()]
infile.close()

# Read the XYDATA block of every CSV file once
tables = []
for i in s:
    infile = open(i,'r')
    t = infile.read().split('XYDATA\n')
    infile.close()
    tables.append([row.split(',') for row in t[1].split('\n\n')[0].split('\n')])

outfile = open('final_file.txt','w')

# Header row: one mdeg column and one HV column per file
outfile.write('Wavelength')
for k in s:
    label = k.replace('.csv','').replace(' ','_')
    outfile.write('\t' + label + '_mdeg')
    outfile.write('\t' + label + '_HV')
outfile.write('\n')

# One row per wavelength, 350 nm down to 200 nm (151 points)
x = range(151)
for j in x:
    outfile.write(str(350-j))
    for rows in tables:
        outfile.write('\t' + rows[j][1] + '\t' + rows[j][2])
    outfile.write('\n')
outfile.close()


##end of the code##
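The two numbers changed in step 6 follow directly from the scan range: with one data point per nm (an assumption that matches the 151 rows used for 350 nm to 200 nm), the row count is start minus end plus one. A small sketch of that arithmetic; the helper name scan_rows is hypothetical and is not part of the script above:

```python
def scan_rows(start_nm, end_nm):
    """Return (row_count, wavelengths) for a scan from start_nm down to end_nm.

    Assumes one data point per nm, as in the 350 nm to 200 nm example.
    """
    if start_nm < end_nm:
        raise ValueError("start_nm must be the higher wavelength")
    count = start_nm - end_nm + 1
    # Same labelling as outfile.write(str(350-j)) in the script
    wavelengths = [start_nm - j for j in range(count)]
    return count, wavelengths

# Example call: scan_rows(260, 200) gives the value to use in range(...)
```

If your instrument was set to a different data pitch (for example 0.5 nm), the row count changes accordingly and this simple formula no longer applies.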

Alternatively, if you are acquainted with R (download it from http://cran.r-project.org/ if you haven't), you can use the following script to get the same result, with the columns labelled by temperature for a thermal melt from 10 degrees to 70 degrees. Edit the code to customize it for your use if needed. Note that this R code does not need the file_name.txt listing, and that it may not work properly if there are other files in the data folder. Thank Shrikant if you find it useful.

Code:

 ##Start of the code##

CSV_Files=list.files(path=".",pattern="\\.csv",full.names=FALSE);
ResultantMatrix=matrix(nrow=151);
ResultantMatrix[,1]=c(350:200);
for(i in 1:length(CSV_Files))
{
    Current_File=read.table(CSV_Files[[i]],header=FALSE,blank.lines.skip=FALSE);
    tempM=matrix(nrow=151,ncol=2);
    k=1;
    for(j in 21:171)
    {
        temp=strsplit(as.character(Current_File[j,1]),split=",");
        tempM[k,1]=temp[[1]][2];
        tempM[k,2]=temp[[1]][3];
        k=k+1;
       
    }
    t=as.numeric(gsub(".*(\\d+.+?)\\.csv","\\1",CSV_Files[[i]]))+9;
    colnames(tempM)=c(t,t);
    ResultantMatrix=cbind(ResultantMatrix,tempM);
   
}

write.csv(ResultantMatrix,file="Result.csv");

##End of the code##

Sunday, December 4, 2011

Protein-Protein Docking Servers

I was looking for protein-protein docking servers to use in my study; here is a list of online servers that are popular and commonly used. Other software also gives good results for protein-protein docking, but I have not listed it here because I am still compiling that list; I will post it as soon as I am done. Have fun.

ClusPro: (http://nrc.bu.edu/cluster) represents the first fully automated, web-based program for the computational docking of protein structures. Users may upload the coordinate files of two protein structures through ClusPro's web interface, or enter the PDB codes of the respective structures, which ClusPro will then download from the PDB server (http://www.rcsb.org/pdb/). The docking algorithms evaluate billions of putative complexes, retaining a preset number with favorable surface complementarities. A filtering method is then applied to this set of structures, selecting those with good electrostatic and desolvation free energies for further clustering. The program output is a short list of putative complexes ranked according to their clustering properties, which is automatically sent back to the user via email.

RosettaDock: The RosettaDock protein-protein docking server predicts the structure of protein complexes given the structures of the individual components and an approximate binding orientation. The server uses the Rosetta 2.1 protein structure modeling suite. The RosettaDock server (http://rosettadock.graylab.jhu.edu) identifies low-energy conformations of a protein–protein interaction near a given starting configuration by optimizing rigid-body orientation and side-chain conformations. The server requires two protein structures as inputs and a starting location for the search. RosettaDock generates 1000 independent structures, and the server returns pictures, coordinate files and detailed scoring information for the 10 top-scoring models. A plot of the total energy of each of the 1000 models created shows the presence or absence of an energetic binding funnel. RosettaDock has been validated on the docking benchmark set and through the Critical Assessment of PRedicted Interactions blind prediction challenge.

ZDOCK, RDOCK: ZDOCK uses a fast Fourier transform to search all possible binding modes for the proteins, evaluating based on shape complementarity, desolvation energy, and electrostatics. The top 2000 predictions from ZDOCK are then given to RDOCK where they are minimized by CHARMM to improve the energies and eliminate clashes, and then the electrostatic and desolvation energies are recomputed by RDOCK (in a more detailed fashion than the calculations performed by ZDOCK). We then tested these programs with a benchmark of 49 non-redundant unbound test cases, where we identified a near-native structure (within 2.5 angstrom from the experimental structure) as the top prediction for 37% of the test cases, and within the top 4 predictions for 49% of the test cases. The superior performance of ZDOCK and RDOCK has also been demonstrated in a community-wide protein docking blind test, CAPRI. Check this out for more details. All software, as well as the benchmark is freely available to academic users. For basic information on running ZDOCK, see this site.
 
GPU.proton.DOCK: (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state of the art service for in silico prediction of protein–protein interactions via rigorous and ultrafast docking code. It is unique in providing stringent account of electrostatic interactions self-consistency and proton equilibria mutual effects of docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms, a step toward more reliable and high accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging physics of the interactions with analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to making the interface intuitive and user-friendly. The input consists of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for addition of non-polypeptide charges, extra ionogenic groups with intrinsic pKa values or fixed ions. The output consists of docked complexes in PDB format as well as interactive visualization in a molecular viewer. GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/.

GRAMM-X: Protein docking software GRAMM-X and its web interface (http://vakser.bioinformatics.ku.edu/resources/gramm/grammx) extend the original GRAMM Fast Fourier Transformation methodology by employing smoothed potentials, refinement stage, and knowledge-based scoring. The web server frees users from complex installation of database-dependent parallel software and maintaining large hardware resources needed for protein docking simulations. Docking problems submitted to GRAMM-X server are processed by a 320 processor Linux cluster. The server was extensively tested by benchmarking, several months of public use, and participation in the CAPRI server track.

HexServer: HexServer (http://hexserver.loria.fr/) is the first Fourier transform (FFT)-based protein docking server to be powered by graphics processors. Using two graphics processors simultaneously, a typical 6D docking run takes 15 s, which is up to two orders of magnitude faster than conventional FFT-based docking approaches using comparable resolution and scoring functions. The server requires two protein structures in PDB format to be uploaded, and it produces a ranked list of up to 1000 docking predictions. Knowledge of one or both protein binding sites may be used to focus and shorten the calculation when such information is available. The first 20 predictions may be accessed individually, and a single file of all predicted orientations may be downloaded as a compressed multi-model PDB file. The server is publicly available and does not require any registration or identification by the user.

3D-Garden: a system for modelling protein–protein complexes based on conformational refinement of ensembles generated with the marching cubes algorithm. 3DGarden is an integrated software suite for performing protein-protein and protein-polynucleotide docking. For any pair of biomolecules structures specified by the user, 3DGarden's primary function is to generate an ensemble of putative complexed structures and rank them. The highest-ranking candidates constitute predictions for the structure of the complex. 3DGarden cannot be used to decide whether or not a particular pair of biomolecules interacts. Complexes of protein and nucleic acid chains can also be specified as individual interactors for docking purposes.

Wednesday, November 23, 2011

Folder list to text file, text file to folders

How to make folders with names from a text file?

You could do this:
1. Make sure all your entries are in column A of your spreadsheet.
2. Edit/copy column A
3. Click Start / Run / notepad c:\folders.txt {OK}
4. Click Edit / paste. You now have a text file with all the folder names inside.
5. Click Start / run / cmd {OK}
6. Type this test command:
for /F "tokens=*" %* in (c:\folders.txt) do @echo md "D:\My Folders\%*"
{Enter}

If you're happy with the result, make it happen by typing this command:
for /F "tokens=*" %* in (c:\folders.txt) do @md "D:\My Folders\%*"
{Enter}
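If you already have Python installed, the two commands above can be sketched as a small script. This is only an illustration under stated assumptions: the function name make_folders and its dry_run switch are hypothetical, the list file is assumed to hold one folder name per line, and the paths in the usage comment are the ones from the example above.

```python
import os

def make_folders(list_path, parent_dir, dry_run=True):
    """Create one folder under parent_dir for every non-empty line in list_path.

    With dry_run=True nothing is created; the planned paths are only returned,
    which plays the role of the '@echo md' test command.
    """
    planned = []
    with open(list_path, "r") as handle:
        for line in handle:
            name = line.strip()
            if not name:
                continue  # skip blank lines
            target = os.path.join(parent_dir, name)
            if not dry_run:
                # exist_ok avoids an error if the folder is already there
                os.makedirs(target, exist_ok=True)
            planned.append(target)
    return planned

# Usage (paths taken from the example above):
# print(make_folders(r"c:\folders.txt", r"D:\My Folders"))        # test run
# make_folders(r"c:\folders.txt", r"D:\My Folders", dry_run=False)  # create
```

As with the command-line version, run the test call first and only switch dry_run off once the printed paths look right.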



How do I print a listing of files in a directory?

    Get to the MS-DOS prompt or the Windows command line.
    Navigate to the directory you wish to print the contents of. If you're new to the command line, familiarize yourself with the cd command and the dir command.
    Once in the directory you wish to print the contents of, type one of the below commands.

    dir > print.txt

    The above command will take the list of all the files and all of the information about the files, including size, modified date, etc., and send that output to the print.txt file in the current directory.

    dir /b > print.txt

    This command would print only the file names and not the file information of the files in the current directory.

    dir /s /b > print.txt

    This command would print only the file names of the files in the current directory and any other files in the directories in the current directory.

    After doing any of the above steps the print.txt file is created. Open this file in any text editor (e.g. Notepad) and print the file. You can also do this from the command prompt by typing notepad print.txt.
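The same listings can also be produced from Python, which is handy when the list feeds another script. A minimal sketch, assuming you only want files (unlike dir /b, it skips subfolder names); the function names list_files and write_listing are hypothetical:

```python
import os

def list_files(directory, recursive=False):
    """Return file names in directory, or full paths including subfolders."""
    if not recursive:
        # Roughly 'dir /b', restricted to files
        return sorted(
            name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))
        )
    found = []
    # Roughly 'dir /s /b': walk every subfolder and keep full paths
    for root, _dirs, files in os.walk(directory):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)

def write_listing(directory, out_path, recursive=False):
    """Write the listing to out_path, one entry per line, and return it."""
    entries = list_files(directory, recursive)
    with open(out_path, "w") as handle:
        for entry in entries:
            handle.write(entry + "\n")
    return entries

# Usage: write_listing(".", "print.txt")
```

Because the listing is collected before the output file is opened, a newly created print.txt does not list itself.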

Saturday, November 5, 2011

In-silico characterization of proteins

BLAST: In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller at the NIH and was published in the Journal of Molecular Biology in 1990.

CDD search: Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

PFAM: The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

TMHMM: A variety of tools are available to predict the topology of transmembrane proteins. To date no independent evaluation of the performance of these tools has been published. A better understanding of the strengths and weaknesses of the different tools would guide both the biologist and the bioinformatician to make better predictions of membrane protein topology.

SignalP: SignalP 4.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. 

STRING: STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources, namely genomic context, high-throughput experiments, coexpression, and previous knowledge. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 5'214'234 proteins from 1133 organisms.

PROTPARAM: ProtParam is a tool which allows the computation of various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL or for a user-entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
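Two of the parameters listed above are simple enough to compute by hand. The sketch below is not ProtParam's own code; it assumes a plain one-letter sequence of the 20 standard amino acids and uses hypothetical function names. The amino acid composition is the percentage of each residue, and GRAVY is the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(sequence):
    """Return the percentage of each residue type present in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}

def gravy(sequence):
    """Return the grand average of hydropathicity (mean Kyte-Doolittle value)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError("non-standard residues: " + "".join(sorted(unknown)))
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

# Usage (hypothetical peptide): gravy("MKTAYIAKQR")
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value an overall hydrophilic one.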

PROSITE: Search your query sequence for protein motifs, rapidly compare your query protein sequence against all patterns stored in the PROSITE pattern database and determine what the function of an uncharacterised protein is. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.
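To see what scanning a sequence against a pattern involves, a PROSITE pattern can be translated into a regular expression. The following is a rough sketch and not the PROSITE scanner itself: the function names are hypothetical, and it assumes the common pattern syntax (elements separated by '-', x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, '<' and '>' as sequence-end anchors outside brackets). The rarer case of an anchor inside square brackets is not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                     # element separators
    regex = regex.replace("x", ".")                    # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {..} = forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # (n,m) = repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # sequence-end anchors
    return regex

def find_motifs(pattern, sequence):
    """Return (start, match) pairs for every occurrence, including overlapping ones."""
    regex = prosite_to_regex(pattern)
    # A lookahead with a capture group lets matches overlap
    scanner = re.compile("(?=(" + regex + "))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence.upper())]

# Usage with the N-glycosylation site pattern N-{P}-[ST]-{P}:
# find_motifs("N-{P}-[ST]-{P}", "ANGTANGSA")
```

Start positions are zero-based, so add one when comparing with residue numbering in sequence databases.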

InterPro: InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

GlobPlot Webservice:

Prediction of disorder:

  • DisEMBL - DisEMBL is our neural network based predictor.
  • DISOPRED - Predictor from David Jones' lab.

Function prediction in non-globular protein space:

  • ELM - The Eukaryotic Linear Motif Resource.
  • NetworKIN - Systematic Discovery of In Vivo Phosphorylation Networks.

Thesis on disorder and linear motifs

Function prediction in globular protein space:

  • SMART - SMART/Pfam domains

Domain boundaries:

  • DomCut - A domain boundary detector
  • DomPred - Domain predictor from David Jones' lab.

Synthetic Biology

Synthetic Biology Project @ SLRI - Applying GlobPlot.


Resources

Subcellular localization predictors:
Subcellular localization databases: