Tuesday, November 26, 2013

De-Novo Transcriptome Assembly and Annotation


ranscriptome sequencing (RNA-seq) helps to find gene expression, reconstruct the transcripts, SNP detection and alternate splicing. There are two assembly approaches i.e. Genome dependent (alignment to reference genome) and genome independent (de novo assembly). De-novo assembly is the process of constructing a reference genome sequence for a newly sequenced organism. And it is necessary because for reference based sequencing reference genome is required and this approach is not useful for organism having partial or missing reference genome. Secondly genome sequences are incomplete, fragmented and altered.  The disadvantage of genome assembly over transcriptome assembly is its inability to account for structural alterations of mRNA transcripts, i.e. alternative splicing.

In RNA-seq first mRNA is extracted and purified from cell and then reverse transcribed to create cDNA library with the help of high-throughput sequencing techniques. This cDNA is fragmented into various lengths.

Algorithm behind assembly is so simple, cDNA sequence reads are assembled into transcripts by using any short read assembler, here we are using ‘SOAPdenovo-Trans’. Short read assemblers generally one of two basic algorithms: overlap graphs – compute pairwise overlap between the reads and capture this information in a graph. De-Bruijin graph: breaks the reads into smaller sequences of DNA, called K-mers, and captures overlaps of length k-1 between these k-mers not between reads. As the number of reads are growing day by day and it is getting difficult to determine which read should be joined to contiguous sequence contigs, so, de-Bruijin is the solution of this problem in which a node is defined by fixed length of k-mer and nodes are connected by edges, if they overlap by k-1 nucleotide.

 De novo assembly work flow
Input files

Paired-end RNA-seq reads of P.chabaudi in fastq format

File 1:         ERR306016_1.fastq
File 2:         ERR306016_2.fastq

Exercise 1. Quality control check using fastQC

Shell command for running fastQC on the read file:

./fastqc ERR306016_1.fastq

A.    Removal of adaptor sequences using FASTX-Toolkit

Shell command:

./fastx_clipper -a -l -C -i ERR306016_1.fastq -o new_ERR306016_1.fastq

B.    Run Quality check again on trimmed files i.e. new_ERR306016_1.fastq and new_ERR306016_2.fastq

Shell command:


Repeat Exercise 1 for ERR306016_2.fastq
Exercise 2. De novo assembly of reads as well as RPKM calculation using

Shell command:

./SOAPdenovo-trans-127mer all -k 23 -R -s sample.conf  -o plas_R

Clustering of resulted contigs or transcriptome using TGICL

Shell command:

./tgicl plas_R.contig

Result will be clustered contig indexes and singletons indexes (contigs which are not in cluster) and asm_1 directory containing CAP3 assembly result. If no singleton file form means all contigs are clustered. Contig sequence and Singlets sequence files inside the asm_1 directory are merged together using linux 'cat' command

Shell command:

cat contig singlets > contig_singlets.txt

Retreive the indexes from the cluster file by using 'grep' and ‘sed’

grep -v 'CL' plas_R.contig_clusters > contig_index.txt

Substitute the tab space with next line using 'sed' command.

sed 's/\t/\n/g' contig_index.txt > contig_index_new.txt

Retreive the clustered contig sequences from SOAPdenovo-trans assembled contig file 'plas_R.contig'

grep -Fxv -f contig_index_new.txt plas_R.contig > final_singletons.txt

Merge the 'contig_singlets.txt' into "final_singletons.txt" using ‘cat’ command

cat contig_singlets.txt final_singletons.txt > final_assembled_transcriptome.txt

Exercise 3. Read mapping using SeqMap

Shell command:

./seqmap 2 ERR306016_1.fastq plas_R_contig.fasta > seqmap_output.txt /eland:3 /available_memory:8000

Exercise 4. Annotation using standalone ncbi-BLAST and online tool KOBAS

Shell command:

./makeblastdb -in /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -input type 'fasta' -title 'plasmo_db' -dbtype 'prot'

./blastx -db /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -query /home/user/Desktop/NGS_Workshop/jyoti/plas_R.contig/ -out /kobas_input.fasta/ -outfmt 6

Run KOBAS using 'kobas_input.fasta'

KOBAS is available on the following url:


Thursday, November 21, 2013

The structure of Rv3717 reveals a novel amidase from Mycobacterium tuberculosis

The article is OPEN ACCESS: 


The structure of Rv3717 reveals a novel amidase from Mycobacterium tuberculosis

A. KumarS. KumarD. KumarA. MishraR. P. DewanganP. ShrivastavaS. Ramachandran and B. Taneja

Abstract: Bacterial N-acetylmuramoyl-L-alanine amidases are cell-wall hydrolases that hydrolyze the bond between N-acetylmuramic acid and L-alanine in cell-wall glycopeptides. Rv3717 ofMycobacterium tuberculosis has been identified as a unique autolysin that lacks a cell-wall-binding domain (CBD) and its structure has been determined to 1.7 Å resolution by the Pt-­SAD phasing method. Rv3717 possesses an [alpha]/[beta]-fold and is a zinc-dependent hydrolase. The structure reveals a short flexible hairpin turn that partially occludes the active site and may be involved in autoregulation. This type of autoregulation of activity of PG hydrolases has been observed in Bartonella henselae amidase (AmiB) and may be a general mechanism used by some of the redundant amidases to regulate cell-wall hydrolase activity in bacteria. Rv3717 utilizes its net positive charge for substrate binding and exhibits activity towards a broad spectrum of substrate cell walls. The enzymatic activity of Rv3717 was confirmed by isolation and identification of its enzymatic products by LC/MS. These studies indicate that Rv3717, an N-acetylmuramoyl-L-alanine amidase from M. tuberculosis, represents a new family of lytic amidases that do not have a separate CBD and are regulated conformationally.

PDB reference: 4lq6

The article is OPEN ACCESS


Monday, November 4, 2013

Integrated gene co-expression network analysis in the growth phase of Mycobacterium tuberculosis reveals new potential drug targets.


We have carried out weighted gene co-expression network analysis of Mycobacterium tuberculosis to gain insights into gene expression architecture during log phase growth. The differentially expressed genes between at least one pair of 11 different M. tuberculosis strains as source of biological variability were used for co-expression network analysis. This data included genes with highest coefficient of variation in expression. Five distinct modules were identified using topological overlap based clustering. All the modules together showed significant enrichment in biological processes: fatty acid biosynthesis, cell membrane, intracellular membrane bound organelle, DNA replication, Quinone biosynthesis, cell shape and peptidoglycan biosynthesis, ribosome and structural constituents of ribosome and transposition. We then extracted the co-expressed connections which were supported either by transcriptional regulatory network or STRING database or high edge weight of topological overlap. The genes trpC, nadC, pitA, Rv3404c, atpA, pknA, Rv0996, purB, Rv2106 and Rv0796 emerged as top hub genes. After overlaying this network on the iNJ661 metabolic network, the reactions catalyzed by 15 highly connected metabolic genes were knocked down in silico and evaluated by Flux Balance Analysis. The results showed that in 12 out of 15 cases, in 11 more than 50% of reactions catalyzed by genes connected through co-expressed connections also had altered fluxes. The modules ‘Turquoise’, ‘Blue’ and ‘Red’ also showed enrichment in essential genes. We could map 152 of the previously known or proposed drug targets in these modules and identified 15 new potential drug targets based on their high degree of co-expressed connections and strong correlation with module eigengenes.