Bioinformatics Tools

Wednesday, January 29, 2014

LINUX BASICS FOR BEGINNERS AND ADVANCED USERS

I had been trying to learn Linux for a long time, and I realized it is not as difficult as it seems. I have compiled some of the basic commands that you need to learn to step into this beautiful world. The list is not complete: I have deliberately skipped grep, awk and sed, as I think they deserve a bit more attention. I will keep updating the list as soon as I can. Do look around too; some good video tutorials are also available that aid in learning. I have also made this available as a simple printable PDF.

LINUX BASICS FOR ADVANCED USERS - Download Here

Basic commands: Type these commands on the terminal at least two or three times to practice. It is also advisable to read the man (help) page of each command at least once.

####################################################

Command		Function
man <command>	:	Displays the manual or help page for a command, always read it in full at least once
cd	:	Change directory, to navigate from one directory to another
cd -	:	Toggle between previous directory and current directory
ls	:	List files or folders, with several arguments gives detailed information about the files and folders, try: ls -ltrh
date	:	Displays current date and time
ssh	:	login to remote host
passwd	:	Change password
bc	:	Opens calculator
cp	:	Copies files or directories, usage: cp file.txt /directory
mv	:	Moves file, rename file, usage: mv file1.txt file2.txt (renames file1.txt to file2.txt)
pwd	:	Present working directory
mkdir	:	Make a new directory, usage: mkdir dir1 ; creates directory dir1
rmdir	:	Removes empty directory
cal	:	Opens calendar
echo	:	Displays message, usage: echo "hi" ; displays hi
printf	:	Alternative to echo, displays message
script	:	Records your session
uname	:	Know about your machine, usage: uname -a
tty	:	Know your terminal
stty	:	Displays and sets terminal characteristics
gzip	:	Compressing files
gunzip	:	Decompress files
tar	:	Archival program
head <filename>	:	Displays beginning of a file
tail <filename>	:	Displays end of a file
exit	:	Exits terminal
who	:	Who is logged into the system
whoami	:	Username
hostname	:	Machine name
rm -r	:	Removes directory including its content recursively (be careful)
ctrl+w	:	Cuts the last word with keyboard while typing on terminal
ctrl+y	:	Paste with keyboard
ctrl+a	:	Cursor at the beginning of the line while typing on terminal
ctrl+e	:	End of the line while typing on terminal
ctrl+k	:	Cuts to the end of the line while typing on terminal
ctrl+c	:	Interrupts (usually kills) the process currently running in the foreground
history	:	History of all the commands on terminal
~	:	Home directory, you can do [ cd ~ ] to go to the home directory
wc	:	Word count, can be combined (piped through) various other commands (it gives results as lines, words and characters)
wc -c	:	Counts the number of bytes
wc -w	:	Counts the number of words
wc -l	:	Counts the number of lines
free -m	:	Free memory (ram) in mb
free -g	:	Free memory (ram) in gb
df	:	Disk space
du -sch <dir>	:	Disk usage of the given directory
du -sch *	:	Disk usage of individual files or directories
du -sch * | sort -hr	:	Disk usage of individual files or directories sorted by size, largest first (sort -h understands the human-readable sizes printed by du -h)
w	:	Who is logged onto the system and what are they doing
ps	:	Processes run by the current user in this terminal
ps -e	:	All the processes running in the system, also used with argument -a, -x, read man ps
ps -o %t -p <pid>	:	How long the process has been running
ctrl+z	:	Suspends the process currently running in the foreground
bg	:	Resumes a suspended process in the background
fg	:	Brings a background process to the foreground
kill <pid>	:	Kills a process with the given process id
kill -9 <pid>	:	Violently kills a given process id giving it no time to cleanup
kill -l	:	List all signals that can be sent to a process
kill -s STOP <pid>	:	Suspends a process
kill -s CONT <pid>	:	Resumes a suspended process
renice -n <value> -p <pid>	:	Changes the niceness (scheduling priority) of a running process; values range from -20 (highest priority) to 19 (lowest priority), the default is 0, and only root can lower the value; it is said to be how nice the process is, i.e. how much CPU time it leaves for other processes to run
cmp <file1> <file2>	:	Compares two given files
diff <file1> <file2>	:	Find the differences between two files, compares two files, argument -w ignores whitespace
mount	:	Mounts drive, usage: mount /media/usb
umount	:	Unmounts drive, usage: umount /media/usb
eject	:	Ejects cd rom, usage: eject /media/cdrom
join	:	Combines lines from two files on a common field, usage: join file1.txt file2.txt
tr a-z A-Z < file.txt	:	Translates one case to another, i.e. lowercase to uppercase
tr A-Z a-z < file.txt	:	Translates one case to another, i.e. uppercase to lowercase
xargs	:	Takes output of one command and passes it as arguments to another command, usage: cat urllist.txt | xargs wget -c
sort	:	Sorts the lines of a text file in ascending order, usage: sort file.txt ; argument -r sorts in descending order, usage: sort -r file.txt ; argument -t sets the delimiter (colon, space, comma, tab etc.), usage: sort -t: file.txt ; argument -k sorts on a particular field, I recommend reading more on this, important and very helpful, used in combination with other commands [ man sort ]
uniq	:	Removes duplicate entries from sorted files, therefore mostly used in combination with sort; for uniq to work, all the duplicate entries must be on adjacent lines, usage: sort names.txt | uniq ; sort -u names.txt ; to count duplicate lines: sort file.txt | uniq -c ; to display only duplicate entries with their counts: sort file.txt | uniq -cd
cut	:	Used to display only specific columns from a text file or other command outputs, e.g. to display the first field from a colon delimited file : cut -d: -f 1 file.txt ; to display the first and third fields from a colon delimited file : cut -d: -f 1,3 file.txt
find	:	Find files, usage: find [pathname] [condition], arguments it takes are -name (for names of the files), -type (for type of file), -size (for size of file, e.g. +100M), -mtime (modified time in days, e.g. +60 or -2), -exec (runs a command on each file found)
stat	:	Used to check the status or properties of a single file or a file system, usage: stat /etc/file.txt ; argument -f, e.g. stat -f / (displays the status of the file system, i.e. size/total/free/available)
ac -d	:	Displays statistics about user connect time, argument -d breaks the output down by day, usage: ac -d username
&	:	At the end of the command executes job in the background but if you logout the job will be killed
nohup	:	At the beginning of the command, combined with an ampersand (&) at the end, runs the job in the background, usage: nohup ./script.sh & ; the job keeps running in the background even if you log out
screen	:	Keeps a terminal session alive after you log out: start it with screen, detach with ctrl+a d, and reattach later with screen -r to get the session back as it was when you left
at	:	Schedule a job, e.g. at -f backup.sh 10am tomorrow
watch	:	To execute command continuously at certain intervals, usage: watch df -h
split	:	Splits a given (big) file into pieces according to its arguments, e.g. by file size (-b) or by number of lines (-l)
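
As a small illustration of how several of the commands above combine through pipes, the sketch below counts how often each value occurs in one column of a colon-delimited file. The file users.txt and its contents are made up for this example.

```shell
# Create a small colon-delimited file in a temporary directory (made-up data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/users.txt" <<'EOF'
alice:bash:staff
bob:zsh:staff
carol:bash:admin
dave:bash:staff
EOF

# Take the second field, sort it so duplicates are adjacent,
# count each distinct value, then list the most frequent first.
shell_counts=$(cut -d: -f2 "$tmpdir/users.txt" | sort | uniq -c | sort -nr)
printf '%s\n' "$shell_counts"

# Count the lines in the file; tr strips the padding some wc versions add.
line_count=$(wc -l < "$tmpdir/users.txt" | tr -d ' ')
echo "lines: $line_count"
```

The sort before uniq matters: uniq only collapses duplicates that sit on adjacent lines.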

Tuesday, November 26, 2013

De-Novo Transcriptome Assembly and Annotation

Introduction

Transcriptome sequencing (RNA-seq) helps to measure gene expression, reconstruct transcripts, detect SNPs and identify alternative splicing. There are two assembly approaches: genome dependent (alignment to a reference genome) and genome independent (de novo assembly). De novo assembly reconstructs sequences directly from the reads, without a reference, which makes it suitable for a newly sequenced organism. It is necessary because reference-based assembly requires a reference genome, so that approach is of little use for organisms whose reference genome is partial or missing. In addition, available genome sequences are often incomplete, fragmented or altered. A further disadvantage of genome assembly compared with transcriptome assembly is its inability to account for structural alterations of mRNA transcripts, such as alternative splicing.

In RNA-seq, mRNA is first extracted and purified from the cell and then reverse transcribed to create a cDNA library. This cDNA is fragmented into various lengths and sequenced using high-throughput sequencing techniques.

The idea behind assembly is simple: cDNA sequence reads are assembled into transcripts using a short read assembler; here we are using 'SOAPdenovo-Trans'. Short read assemblers generally use one of two basic algorithms. Overlap graphs compute the pairwise overlaps between reads and capture this information in a graph. De Bruijn graphs break the reads into smaller sequences of DNA, called k-mers, and capture overlaps of length k-1 between these k-mers rather than between reads. As the number of reads keeps growing, it becomes difficult to determine which reads should be joined into contiguous sequences (contigs). The de Bruijn graph addresses this problem: each node is a k-mer of fixed length, and two nodes are connected by an edge if they overlap by k-1 nucleotides.
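
To make the k-mer idea concrete, here is a minimal shell sketch that breaks one read into k-mers and lists the edges between consecutive k-mers, which overlap by k-1 bases. The function names kmers and edges and the toy read ATGGCGT are invented for illustration; this is not how SOAPdenovo-Trans is implemented, and a real assembler would also merge identical k-mers coming from different reads into a single node.

```shell
# Print every k-mer of a sequence, one per line.
# Usage: kmers SEQUENCE K
kmers() {
  kmers_seq=$1
  kmers_k=$2
  kmers_i=1
  # Slide a window of width K along the sequence (cut positions are 1-based).
  while [ $((kmers_i + kmers_k - 1)) -le "${#kmers_seq}" ]; do
    printf '%s\n' "$kmers_seq" | cut -c "${kmers_i}-$((kmers_i + kmers_k - 1))"
    kmers_i=$((kmers_i + 1))
  done
}

# Print the edges contributed by one read: consecutive k-mers
# share k-1 bases, so each adjacent pair is connected.
# Usage: edges SEQUENCE K
edges() {
  kmers "$1" "$2" | {
    prev=""
    while read -r kmer; do
      if [ -n "$prev" ]; then
        printf '%s -> %s\n' "$prev" "$kmer"
      fi
      prev=$kmer
    done
  }
}

edges ATGGCGT 3
```

For the toy read and k = 3 this lists four edges, the first being ATG -> TGG, where the shared k-1 bases are TG.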

De novo assembly work flow

Input files

Paired-end RNA-seq reads of P. chabaudi in fastq format

File 1: ERR306016_1.fastq

File 2: ERR306016_2.fastq

Exercise 1. Quality control check using fastQC

Shell command for running fastQC on the read file:

./fastqc ERR306016_1.fastq

A. Removal of adaptor sequences using FASTX-Toolkit

Shell command:

./fastx_clipper -a <adapter_sequence> -l <minimum_length> -C -i ERR306016_1.fastq -o new_ERR306016_1.fastq

B. Run Quality check again on trimmed files i.e. new_ERR306016_1.fastq and new_ERR306016_2.fastq

Shell command:

./fastqc new_ERR306016_1.fastq new_ERR306016_2.fastq

Repeat Exercise 1 for ERR306016_2.fastq

Exercise 2. De novo assembly of reads as well as RPKM calculation using SOAPdenovo-Trans

Shell command:

./SOAPdenovo-trans-127mer all -K 23 -R -s sample.conf -o plas_R
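
The command above reads its library description from sample.conf, which is not shown in this post. A minimal sketch of such a file is given below, following the configuration layout documented for SOAPdenovo. The values of max_rd_len and avg_ins are placeholders I have assumed, and using the trimmed read files as input is also an assumption; adjust all three to your own data.

```shell
# Write a minimal configuration file for a single paired-end library.
# max_rd_len, avg_ins and the input file names are assumptions, not values from this post.
cat > sample.conf <<'EOF'
# maximal read length (assumed value)
max_rd_len=100
[LIB]
# average insert size of the library (assumed value)
avg_ins=200
# 0 = reads do not need to be reverse-complemented
reverse_seq=0
# 3 = use the reads for both contig and scaffold assembly
asm_flags=3
# paired-end FASTQ files; q2 must directly follow its q1
q1=new_ERR306016_1.fastq
q2=new_ERR306016_2.fastq
EOF
```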

Clustering of the resulting contigs (the transcriptome) using TGICL

Shell command:

./tgicl plas_R.contig

The result is a set of clustered contig indexes and singleton indexes (contigs that are not in any cluster), plus an asm_1 directory containing the CAP3 assembly result. If no singleton file is produced, all contigs were clustered. The contig and singlets sequence files inside the asm_1 directory are merged using the Linux 'cat' command.

Shell command:

cat contig singlets > contig_singlets.txt

Retrieve the indexes from the cluster file using 'grep' and 'sed'

grep -v 'CL' plas_R.contig_clusters > contig_index.txt

Replace each tab with a newline using the 'sed' command, so that every index is on its own line.

sed 's/\t/\n/g' contig_index.txt > contig_index_new.txt
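
To see what these two steps do, the sketch below runs them on a tiny made-up cluster file. The layout of toy_clusters (a header line starting with '>CL', followed by a line of tab-separated contig IDs) is an assumption for illustration, and tr is used as a portable stand-in for the sed substitution, which relies on GNU sed understanding \t and \n.

```shell
tmpdir=$(mktemp -d)

# Made-up cluster file: a '>CL...' header, then tab-separated member IDs.
printf '>CL1\t2\nC1\tC5\n>CL2\t3\nC2\tC7\tC9\n' > "$tmpdir/toy_clusters"

# Drop the header lines, then put every ID on its own line.
grep -v 'CL' "$tmpdir/toy_clusters" | tr '\t' '\n' > "$tmpdir/toy_index.txt"

cat "$tmpdir/toy_index.txt"
```

Note that grep -v 'CL' removes any line containing the text CL, so this only works when the contig IDs themselves never contain it.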

Retrieve the contigs that were not clustered (the singletons) from the SOAPdenovo-trans assembled contig file 'plas_R.contig' by excluding the lines that match the clustered indexes

grep -Fxv -f contig_index_new.txt plas_R.contig > final_singletons.txt

Merge the 'contig_singlets.txt' into "final_singletons.txt" using ‘cat’ command

cat contig_singlets.txt final_singletons.txt > final_assembled_transcriptome.txt

Exercise 3. Read mapping using SeqMap

Shell command:

./seqmap 2 ERR306016_1.fastq plas_R_contig.fasta seqmap_output.txt /eland:3 /available_memory:8000

Exercise 4. Annotation using standalone ncbi-BLAST and online tool KOBAS

Shell command:

./makeblastdb -in /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -input_type fasta -title 'plasmo_db' -dbtype prot

./blastx -db /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -query /home/user/Desktop/NGS_Workshop/jyoti/plas_R.contig -out kobas_input.fasta -outfmt 6
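
With -outfmt 6, BLAST writes one hit per line in 12 tab-separated columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Before uploading the file, it can be useful to check how many contigs received a hit at all. The sketch below does this on a made-up three-line file, toy_hits.tsv; the IDs and values in it are invented.

```shell
tmpdir=$(mktemp -d)

# Made-up hits in the 12-column tabular layout (two hits for C1, one for C2).
printf 'C1\tP1\t98.0\t100\t2\t0\t1\t300\t1\t100\t1e-50\t200\n'  > "$tmpdir/toy_hits.tsv"
printf 'C1\tP2\t80.0\t90\t18\t0\t1\t270\t5\t94\t1e-20\t90\n'   >> "$tmpdir/toy_hits.tsv"
printf 'C2\tP3\t75.0\t60\t15\t0\t10\t189\t1\t60\t1e-05\t50\n'  >> "$tmpdir/toy_hits.tsv"

# Number of distinct query contigs with at least one hit (column 1 is the query ID).
queries_with_hits=$(cut -f1 "$tmpdir/toy_hits.tsv" | sort -u | wc -l | tr -d ' ')
echo "queries with hits: $queries_with_hits"

# Number of hits per query, most hits first.
cut -f1 "$tmpdir/toy_hits.tsv" | sort | uniq -c | sort -nr
```

On the real output, replace the toy file with kobas_input.fasta.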

Run KOBAS using 'kobas_input.fasta'

KOBAS is available on the following url:

http://kobas.cbi.pku.edu.cn/home.do

Thursday, November 21, 2013

The structure of Rv3717 reveals a novel amidase from Mycobacterium tuberculosis

The article is OPEN ACCESS:

http://journals.iucr.org/d/issues/2013/12/00/lv5048/index.html


A. Kumar, S. Kumar, D. Kumar, A. Mishra, R. P. Dewangan, P. Shrivastava, S. Ramachandran and B. Taneja

Abstract: Bacterial N-acetylmuramoyl-L-alanine amidases are cell-wall hydrolases that hydrolyze the bond between N-acetylmuramic acid and L-alanine in cell-wall glycopeptides. Rv3717 of Mycobacterium tuberculosis has been identified as a unique autolysin that lacks a cell-wall-binding domain (CBD) and its structure has been determined to 1.7 Å resolution by the Pt-SAD phasing method. Rv3717 possesses an α/β-fold and is a zinc-dependent hydrolase. The structure reveals a short flexible hairpin turn that partially occludes the active site and may be involved in autoregulation. This type of autoregulation of activity of PG hydrolases has been observed in Bartonella henselae amidase (AmiB) and may be a general mechanism used by some of the redundant amidases to regulate cell-wall hydrolase activity in bacteria. Rv3717 utilizes its net positive charge for substrate binding and exhibits activity towards a broad spectrum of substrate cell walls. The enzymatic activity of Rv3717 was confirmed by isolation and identification of its enzymatic products by LC/MS. These studies indicate that Rv3717, an N-acetylmuramoyl-L-alanine amidase from M. tuberculosis, represents a new family of lytic amidases that do not have a separate CBD and are regulated conformationally.

PDB reference: 4lq6


Monday, November 4, 2013

Integrated gene co-expression network analysis in the growth phase of Mycobacterium tuberculosis reveals new potential drug targets.

http://pubs.rsc.org/en/content/articlelanding/2013/mb/c3mb70278b#!divAbstract

We have carried out weighted gene co-expression network analysis of Mycobacterium tuberculosis to gain insights into gene expression architecture during log phase growth. The differentially expressed genes between at least one pair of 11 different M. tuberculosis strains as source of biological variability were used for co-expression network analysis. This data included genes with highest coefficient of variation in expression. Five distinct modules were identified using topological overlap based clustering. All the modules together showed significant enrichment in biological processes: fatty acid biosynthesis, cell membrane, intracellular membrane bound organelle, DNA replication, Quinone biosynthesis, cell shape and peptidoglycan biosynthesis, ribosome and structural constituents of ribosome and transposition. We then extracted the co-expressed connections which were supported either by transcriptional regulatory network or STRING database or high edge weight of topological overlap. The genes trpC, nadC, pitA, Rv3404c, atpA, pknA, Rv0996, purB, Rv2106 and Rv0796 emerged as top hub genes. After overlaying this network on the iNJ661 metabolic network, the reactions catalyzed by 15 highly connected metabolic genes were knocked down in silico and evaluated by Flux Balance Analysis. The results showed that in 12 out of 15 cases, in 11 more than 50% of reactions catalyzed by genes connected through co-expressed connections also had altered fluxes. The modules ‘Turquoise’, ‘Blue’ and ‘Red’ also showed enrichment in essential genes. We could map 152 of the previously known or proposed drug targets in these modules and identified 15 new potential drug targets based on their high degree of co-expressed connections and strong correlation with module eigengenes.

Friday, August 2, 2013

Identification of Novel Adhesins of M. tuberculosis H37Rv Using Integrated Approach of Multiple Computational Algorithms and Experimental Analysis.

My recent publication:

Identification of Novel Adhesins of M. tuberculosis H37Rv Using Integrated Approach of Multiple Computational Algorithms and Experimental Analysis

Authors:

Sanjiv Kumar, Bhanwar Lal Puniya, Shahila Parween, Pradip Nahar, Srinivasan Ramachandran

Abstract

Pathogenic bacteria interacting with eukaryotic host express adhesins on their surface. These adhesins aid in bacterial attachment to the host cell receptors during colonization. A few adhesins such as Heparin binding hemagglutinin adhesin (HBHA), Apa, Malate Synthase of M. tuberculosis have been identified using specific experimental interaction models based on the biological knowledge of the pathogen. In the present work, we carried out computational screening for adhesins of M. tuberculosis. We used an integrated computational approach using SPAAN for predicting adhesins, PSORTb, SubLoc and LocTree for extracellular localization, and BLAST for verifying non-similarity to human proteins. These steps are among the first of reverse vaccinology. Multiple claims and attacks from different algorithms were processed through argumentative approach. Additional filtration criteria included selection for proteins with low molecular weights and absence of literature reports. We examined binding potential of the selected proteins using an image based ELISA. The protein Rv2599 (membrane protein) binds to human fibronectin, laminin and collagen. Rv3717 (N-acetylmuramoyl-L-alanine amidase) and Rv0309 (L,D-transpeptidase) bind to fibronectin and laminin. We report Rv2599 (membrane protein), Rv0309 and Rv3717 as novel adhesins of M. tuberculosis H37Rv. Our results expand the number of known adhesins of M. tuberculosis and suggest their regulated expression in different stages.

Read Full article here (Free)

Identification of Novel Adhesins of M. tuberculosis H37Rv Using Integrated Approach of Multiple Computational Algorithms and Experimental Analysis