Bioinformatics Tools: De-Novo Transcriptome Assembly and Annotation

Tuesday, November 26, 2013

De-Novo Transcriptome Assembly and Annotation

Introduction

Transcriptome sequencing (RNA-seq) is used to measure gene expression, reconstruct transcripts, detect SNPs, and identify alternative splicing. There are two assembly approaches: genome-dependent (reads are aligned to a reference genome) and genome-independent (de novo assembly). De novo assembly is the process of constructing a reference sequence for a newly sequenced organism directly from the reads; here, a reference set of transcripts. It is necessary because the reference-based approach requires a reference genome and is of little use for an organism whose reference genome is partial or missing. In addition, available genome sequences are often incomplete, fragmented, or altered. A further disadvantage of relying on a genome assembly rather than a transcriptome assembly is that it cannot account for structural alterations of mRNA transcripts, such as alternative splicing.

In RNA-seq, mRNA is first extracted and purified from the cells and then reverse transcribed to create a cDNA library. The cDNA is fragmented into pieces of various lengths, which are then read with a high-throughput sequencing technique.

The idea behind assembly is simple: cDNA sequence reads are assembled into transcripts using a short-read assembler; here we use 'SOAPdenovo-Trans'. Short-read assemblers generally use one of two basic algorithms. Overlap graphs: compute the pairwise overlaps between reads and capture this information in a graph. De Bruijn graphs: break the reads into smaller sequences of DNA, called k-mers, and capture overlaps of length k-1 between these k-mers rather than between reads. As the number of reads grows, it becomes increasingly difficult to determine which reads should be joined into contiguous sequences (contigs). The de Bruijn graph addresses this problem: each node is a k-mer of fixed length, and two nodes are connected by an edge if they overlap by k-1 nucleotides.
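The k-mer idea can be illustrated on a single toy read. The following is a minimal sketch, not how SOAPdenovo-Trans is implemented: the read, the value of k, and the helper names kmers and edges are all hypothetical, and a real assembler also handles reverse complements, coverage and sequencing errors.

```shell
# Toy read and k-mer size (hypothetical values, chosen only for illustration).
read_seq="ATGGCGTGCA"
k=4

# kmers SEQ K: print every k-mer of SEQ, one per line.
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '
        { for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# edges SEQ K: print "A -> B" for every pair of distinct k-mers (nodes)
# where the last k-1 bases of A equal the first k-1 bases of B.
edges() {
    kmers "$1" "$2" | sort -u | awk -v k="$2" '
        { node[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++)
                for (j = 1; j <= NR; j++)
                    if (substr(node[i], 2) == substr(node[j], 1, k - 1))
                        print node[i] " -> " node[j]
        }'
}

kmers "$read_seq" "$k"
edges "$read_seq" "$k"
```

A read of length L yields L - k + 1 k-mers, so the 10-base toy read gives 7 k-mers for k = 4. Walking the printed edges from one node to the next spells the read back out, which is what an assembler does across the k-mers of millions of reads.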

De novo assembly workflow

Input files

Paired-end RNA-seq reads of P. chabaudi in FASTQ format

File 1: ERR306016_1.fastq

File 2: ERR306016_2.fastq

Exercise 1. Quality control check using FastQC

Shell command for running FastQC on the read file:

./fastqc ERR306016_1.fastq

A. Removal of adaptor sequences using FASTX-Toolkit

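Shell command (ADAPTER is the adaptor sequence to clip and MIN_LENGTH is the minimum read length to keep):

./fastx_clipper -a ADAPTER -l MIN_LENGTH -C -i ERR306016_1.fastq -o new_ERR306016_1.fastq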

B. Run the quality check again on the trimmed files, i.e. new_ERR306016_1.fastq and new_ERR306016_2.fastq

Shell command:

./fastqc new_ERR306016_1.fastq new_ERR306016_2.fastq

Repeat Exercise 1 for ERR306016_2.fastq
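Because each mate file is clipped separately, the two trimmed files may no longer hold the same number of reads, which matters for a paired-end assembly. A quick check could look like the sketch below. It assumes plain four-line FASTQ records; count_reads is a hypothetical helper name, not part of FASTX-Toolkit or FastQC.

```shell
# count_reads FILE: number of records in a FASTQ file,
# assuming exactly four lines per record (no wrapped sequences).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Report the read count of each trimmed file, if it exists.
for f in new_ERR306016_1.fastq new_ERR306016_2.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    fi
done
```

If the two counts differ, the mates are out of step and would need to be re-paired before being given to the assembler as a paired-end library.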

Exercise 2. De novo assembly of reads as well as RPKM calculation using SOAPdenovo-Trans

Shell command:

./SOAPdenovo-trans-127mer all -K 23 -R -s sample.conf -o plas_R
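The command reads its library description from sample.conf, whose contents are not shown here. A minimal sketch of such a file follows, written with a shell here-document. The key names follow the SOAPdenovo-Trans configuration format, but every numeric value is a placeholder, and the use of the trimmed read files is an assumption; adjust both to the actual library.

```shell
# Write a minimal SOAPdenovo-Trans configuration file.
# All numeric values below are placeholders, not values from this workshop.
cat > sample.conf <<'EOF'
# maximal read length (placeholder)
max_rd_len=100
[LIB]
# average insert size of the library (placeholder)
avg_ins=200
# 0 = forward-reverse paired-end library
reverse_seq=0
# 3 = use the reads for both contig and scaffold assembly
asm_flags=3
# minimum aligned length for a read to be placed on a contig
map_len=32
# paired-end FASTQ files, read 1 and read 2 (assumed: the trimmed files)
q1=new_ERR306016_1.fastq
q2=new_ERR306016_2.fastq
EOF
```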

Clustering of the resulting contigs (the transcriptome) using TGICL

Shell command:

./tgicl plas_R.contig

The result consists of the clustered contig indexes, the singleton indexes (contigs that are not in any cluster), and an asm_1 directory containing the CAP3 assembly result. If no singleton file is produced, all contigs were clustered. The contig and singlets sequence files inside the asm_1 directory are merged using the Linux 'cat' command.

Shell command:

cat contig singlets > contig_singlets.txt

Retrieve the indexes from the cluster file using 'grep' and 'sed'

grep -v 'CL' plas_R.contig_clusters > contig_index.txt

Replace each tab with a newline using the 'sed' command.

sed 's/\t/\n/g' contig_index.txt > contig_index_new.txt

Remove the clustered contigs from the SOAPdenovo-Trans assembled contig file 'plas_R.contig', keeping the contigs that were not clustered

grep -Fxv -f contig_index_new.txt plas_R.contig > final_singletons.txt

Merge 'contig_singlets.txt' and 'final_singletons.txt' using the 'cat' command

cat contig_singlets.txt final_singletons.txt > final_assembled_transcriptome.txt
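The 'grep -Fxv' step above works line by line: it can only drop lines that are identical to an index, so the sequence lines of a clustered contig remain in the output. A record-aware alternative is sketched below. It assumes that each FASTA header starts with '>' followed by the contig ID as its first word, and that the index file holds one ID per line; drop_fasta_records and the toy file names are hypothetical.

```shell
# drop_fasta_records IDS FASTA
# Print every FASTA record whose ID (first word after '>') is NOT listed in IDS.
drop_fasta_records() {
    awk '
        FILENAME == ARGV[1] { if ($1 != "") skip[$1] = 1; next }  # load IDs to drop
        /^>/ { id = substr($1, 2); keep = !(id in skip) }         # decide per record
        keep                                                      # print header and sequence
    ' "$1" "$2"
}

# Toy data: contigs C1 and C3 are "clustered", C2 is not.
printf '%s\n' 'C1' 'C3' > toy_ids.txt
printf '%s\n' '>C1 length 8' 'ACGTACGT' '>C2 length 4' 'GGCC' '>C3 length 4' 'TTAA' > toy_contigs.fa

drop_fasta_records toy_ids.txt toy_contigs.fa
```

On the toy data only the C2 record is printed, header and sequence together. On the real files the call would take contig_index_new.txt and plas_R.contig, provided the headers follow the assumed layout.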

Exercise 3. Read mapping using SeqMap

Shell command:

./seqmap 2 ERR306016_1.fastq plas_R_contig.fasta seqmap_output.txt /eland:3 /available_memory:8000
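Once reads are mapped to contigs, the RPKM mentioned in Exercise 2 can also be worked out by hand from per-contig read counts. The sketch below uses the standard definition (reads per kilobase of transcript per million mapped reads); whether SOAPdenovo-Trans computes its value in exactly this way is not confirmed here. The function name rpkm and the example numbers are hypothetical.

```shell
# rpkm READS LENGTH TOTAL
#   READS  = reads mapped to the contig
#   LENGTH = contig length in bases
#   TOTAL  = total number of mapped reads in the sample
# RPKM = READS * 10^9 / (LENGTH * TOTAL)
rpkm() {
    awk -v c="$1" -v len="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", (c * 1e9) / (len * n) }'
}

# Example: 500 reads on a 2000-base contig, 10 million mapped reads in total.
rpkm 500 2000 10000000
```

For the example numbers the result is 25.00: 500 reads over 2 kilobases is 250 reads per kilobase, divided by 10 million-read units.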

Exercise 4. Annotation using standalone NCBI BLAST and the online tool KOBAS

Shell command:

./makeblastdb -in /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -input_type 'fasta' -title 'plasmo_db' -dbtype 'prot'

./blastx -db /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -query /home/user/Desktop/NGS_Workshop/jyoti/plas_R.contig -out kobas_input.fasta -outfmt 6
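With '-outfmt 6' the BLAST result is a tab-separated table with twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. A contig often has several hits, so it can be useful to look at only the best one per contig before annotation. The sketch below keeps the hit with the highest bit score for each query; best_hits and the toy file are hypothetical, and this filtering is an optional inspection step, not something the workflow requires.

```shell
# best_hits FILE: for each query (column 1) keep the line with the
# highest bit score (column 12); queries are printed in first-seen order.
best_hits() {
    awk -F '\t' '
        !($1 in best) || $12 + 0 > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}

# Toy table in outfmt 6 layout: contig c1 has two hits, c2 has one.
{
    printf 'c1\tP1\t95.0\t50\t2\t0\t1\t150\t1\t50\t1e-20\t98.5\n'
    printf 'c1\tP2\t99.0\t60\t1\t0\t1\t180\t1\t60\t1e-30\t120\n'
    printf 'c2\tP3\t80.0\t40\t8\t0\t1\t120\t1\t40\t1e-05\t45.1\n'
} > toy_blast.tsv

best_hits toy_blast.tsv
```

On the toy table the c1 line for P2 and the c2 line for P3 are kept.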

Run KOBAS using 'kobas_input.fasta'

KOBAS is available at the following URL:

http://kobas.cbi.pku.edu.cn/home.do
