Here I am compiling some of the short simple tips and tricks for use to ease out some of Bioinformatics work, I would also be thankful if you can contribute to it, if you can come across some. Write them as comment and I will update them.
#####################################
Running single or multiple commands on all files or selected files in a directory.
For running a command on number of files in a directory, use this:
CODE:
for i in *; do command; done;
Wherein * will select all the files to act upon, you can use *.fasta to work only with the fasta files, etc.
command is the linux command you wish to act on all the selected files, where in each selected file is represented by or recalled by "$i" and can also be used to generate output with the filename added to it.
Example:
for i in *.fasta; do grep ">gi" "$i" > "$i".output; done;
Similarly to copy a file from anywhere (previous directory in this example) in multiple directories
for i in *; do cp ../program.pl "$i" ; done;
##################################################
For working on multiples files, like removing, copying etc. on every directory inside a directory or multiple directories do the following:
To remove all the fasta files from the folder or sub folders
find /pathtofolder/ -type f -iname "*.fasta" -exec rm {} \;
This will find all the files
find /pathtofolder/ -type f -iname "*.fasta"
and this will remove these files
-exec rm {} \;
##################################################
To remove files containing a particular text in them from various folders inside the directory:
For example you have downloaded all genomes *.fna from NCBI Genomes for Bacterial Genomes and you just wish to use *.fna from the bacterial genomes but not from the plasmids which you wish to remove.
grep -lrIZ plasmid . | xargs -0 rm -f --
This will remove all files containing plasmid in it.
##################################################
To grep the count (say fasta) in various files inside a directory:
grep -c ">" */*/*.fasta > reads_count
This will count the number of reads from each file in those directories along with the file name and that can be redirected to a file.
NOTE: Change */*/*.fasta depending upon the number of directories inside i.e. */*.fasta will work if the files are in one directory below the present directory.
##################################################
Calculating Mean Length of FASTA Sequences
Here's a shot simple code for calculating the mean length of a FASTA sequence.
Code:
awk '{/>/&&++a||b+=length()}END{print b/a}' sequence.fasta > average
Alternate faster methods are available, read here BIOSTARS
##################################################
Convert FASTQ to FASTA:
Code:
sed -n ' 1~4s/^@/>/p; 2~4p ' file.fastq > file.fasta
sed 's/\ /\_/g' seqfile > seqfile_ed (to make unique ids in flash assembled sequences)
##################################################
Schedule one job after another
Run one script after another in such a way that second script starts after finishing first one. Without using Pipe | or ampercent && i.e. the first process is already running and you want second one to start after the first one finishes. And this can be done in different folder in case the output of second script will affect the output of first script. So run this on any folder you wish to.
Code:
while ps -p $PID; do sleep 1; done; script2
Explanation and Example:
while ps -p 4437; do sleep 1; done; echo "It Works"
Where $PID is the process id of the already running job (add PID number)
script2 is your script you wish to run after first script ends
sleep 1 is sleep for one second (SUFFIX may be ‘s’ for seconds (the default), ‘m’ for minutes, ‘h’ for hours or ‘d’ for days, read man sleep)
##################################################
Sample command for formatdb and blastall
FORMAT_DB:
nohup time /home/sanjiv/Softwares/blast-2.2.26/bin/formatdb -p T -i input.database &
BLAST_ALL:
Protein BLAST (blastp)
nohup /home/sanjiv/Softwares/blast-2.2.22/bin/blastall -p blastp -d location_of_database.fa -i location_of_input_file.fasta -o location_of_output_file.blastall.out -a 30 -e 0.000001 &
Nucleotide BLAST (blastx)
nohup /home/sanjiv/Softwares/blast-2.2.22/bin/blastall -p blastx -d location_of_database.fa -i location_of_input_file.fasta -o location_of_output_file.blastall.out -a 30 -e 0.000001 &
##################################################
Fetch FASTA sequences using identifier from a file.
CODE:
grep -A1 -f list mainfile > fastaofidsfromlist
Explanation:
To fetch selected fasta sequence from huge file. Keep your identifiers or fasta headers in 'list' (i.e. like Rv ids). 'mainfile' is the original file from where you have to fetch the fasta. 'fastaofidsfromlist' is the your file with fasta sequences of the ids you gave as list. Issue the following command.
CAUTION: For this to work properly, your mainfile should have sequences in one line only, without line breaks. Check output file before use.
##################################################
Convert multi-line fasta to single line fasta
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
##################################################
Using BLAST+
Make blast database
makeblastdb -in proteases.faa -dbtype 'prot' -out CJProteases
BLAST
blastp -query proteins.fasta -db DATABASE -out OUTPUT_File -evalue 1e-30 -outfmt 7
#####################################
Running single or multiple commands on all files or selected files in a directory.
For running a command on number of files in a directory, use this:
CODE:
for i in *; do command; done;
Wherein * will select all the files to act upon, you can use *.fasta to work only with the fasta files, etc.
command is the linux command you wish to act on all the selected files, where in each selected file is represented by or recalled by "$i" and can also be used to generate output with the filename added to it.
Example:
for i in *.fasta; do grep ">gi" "$i" > "$i".output; done;
Similarly to copy a file from anywhere (previous directory in this example) in multiple directories
for i in *; do cp ../program.pl "$i" ; done;
##################################################
For working on multiples files, like removing, copying etc. on every directory inside a directory or multiple directories do the following:
To remove all the fasta files from the folder or sub folders
find /pathtofolder/ -type f -iname "*.fasta" -exec rm {} \;
This will find all the files
find /pathtofolder/ -type f -iname "*.fasta"
and this will remove these files
-exec rm {} \;
##################################################
To remove files containing a particular text in them from various folders inside the directory:
For example you have downloaded all genomes *.fna from NCBI Genomes for Bacterial Genomes and you just wish to use *.fna from the bacterial genomes but not from the plasmids which you wish to remove.
grep -lrIZ plasmid . | xargs -0 rm -f --
This will remove all files containing plasmid in it.
##################################################
To grep the count (say fasta) in various files inside a directory:
grep -c ">" */*/*.fasta > reads_count
This will count the number of reads from each file in those directories along with the file name and that can be redirected to a file.
NOTE: Change */*/*.fasta depending upon the number of directories inside i.e. */*.fasta will work if the files are in one directory below the present directory.
##################################################
Calculating Mean Length of FASTA Sequences
Here's a shot simple code for calculating the mean length of a FASTA sequence.
Code:
awk '{/>/&&++a||b+=length()}END{print b/a}' sequence.fasta > average
Alternate faster methods are available, read here BIOSTARS
##################################################
Convert FASTQ to FASTA:
Code:
sed -n ' 1~4s/^@/>/p; 2~4p ' file.fastq > file.fasta
sed 's/\ /\_/g' seqfile > seqfile_ed (to make unique ids in flash assembled sequences)
##################################################
Schedule one job after another
Run one script after another in such a way that second script starts after finishing first one. Without using Pipe | or ampercent && i.e. the first process is already running and you want second one to start after the first one finishes. And this can be done in different folder in case the output of second script will affect the output of first script. So run this on any folder you wish to.
Code:
while ps -p $PID; do sleep 1; done; script2
Explanation and Example:
while ps -p 4437; do sleep 1; done; echo "It Works"
Where $PID is the process id of the already running job (add PID number)
script2 is your script you wish to run after first script ends
sleep 1 is sleep for one second (SUFFIX may be ‘s’ for seconds (the default), ‘m’ for minutes, ‘h’ for hours or ‘d’ for days, read man sleep)
##################################################
Sample command for formatdb and blastall
FORMAT_DB:
nohup time /home/sanjiv/Softwares/blast-2.2.26/bin/formatdb -p T -i input.database &
BLAST_ALL:
Protein BLAST (blastp)
nohup /home/sanjiv/Softwares/blast-2.2.22/bin/blastall -p blastp -d location_of_database.fa -i location_of_input_file.fasta -o location_of_output_file.blastall.out -a 30 -e 0.000001 &
Nucleotide BLAST (blastx)
nohup /home/sanjiv/Softwares/blast-2.2.22/bin/blastall -p blastx -d location_of_database.fa -i location_of_input_file.fasta -o location_of_output_file.blastall.out -a 30 -e 0.000001 &
##################################################
Fetch FASTA sequences using identifier from a file.
CODE:
grep -A1 -f list mainfile > fastaofidsfromlist
Explanation:
To fetch selected fasta sequence from huge file. Keep your identifiers or fasta headers in 'list' (i.e. like Rv ids). 'mainfile' is the original file from where you have to fetch the fasta. 'fastaofidsfromlist' is the your file with fasta sequences of the ids you gave as list. Issue the following command.
CAUTION: For this to work properly, your mainfile should have sequences in one line only, without line breaks. Check output file before use.
##################################################
Convert multi-line fasta to single line fasta
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
##################################################
Using BLAST+
Make blast database
makeblastdb -in proteases.faa -dbtype 'prot' -out CJProteases
BLAST
blastp -query proteins.fasta -db DATABASE -out OUTPUT_File -evalue 1e-30 -outfmt 7
No comments:
Post a Comment