A search for solutions: July 2016

Thursday, July 28, 2016

Pathway enrichment analysis - using KOBAS 2.0

Of late, I needed to do pathway enrichment analysis for a list of genes that have been differentially expressed (DEG). These are different approaches for this task, this was mine using KOBAS 2.0 which identifies statistically significant pathways from the fasta sequences.

Cons:
1) Online is limited to 500 seqs.
2) Databases are too huge to locally install.

Pros:
1) Can do a local blast and upload in the webserver to generate the results.

For this, I initially downloaded complete KO file (as mentioned in this page). But the size was huge to run the blastx (~6 gb). Since, I am working on a fish transcriptome, I used Homo sapeins peptide fasta file and ran a local blast.

$ wget http://kobas.cbi.pku.edu.cn/download/seq_pep/hsa.pep.fasta.gz
$ gunzip hsa.pep.fasta.gz
$ /data/software/ncbi-blast-2.2.31+-2.x86_64/bin/makeblastdb -in hsa.pep.fasta -dbtype prot -out dre.pep.KEGGDB -parse_seqids
$ time nohup /data/software/ncbi-blast-2.2.31+-2.x86_64/bin/blastx -db /data/rgg_data/Databases/KOBAS/hsa.pep.KEGGDB -query SB_GeneSet.fasta -evalue 0.000001 -outfmt 6 -out SB_Genes_vs_hsa_pep_KEGGDB.blastx.tab.txt -num_threads 23 -max_target_seqs 10 &

At first I tried with Danio rerio, because my species of interest is also fish. But then KOBAS failed repeatedly. It do not know if it is database issue or Danio rerio peptide set, not as extensively studied as Homo Sapiens.

So, ran annotate and identify steps, to retrive the significant pathways. I uploaded 6.5 M blastx output file with success. Once this is done, I have to sort according to corrected p-value to select most enriched pathways in my DEGset.

That it!

Edit2: July 29, 2016

Once, the raw file of the enriched pathways are downloaded, we have to find the pathways with corrected p-value < 0.05.

So, I used the following command to filter the file and retrive significant pathways from KOBAS identify output file:

$ cat SW_GILLS_65_vs_FW_GILLS_65_HSA_UpRegul.identify | sed 's/ /_/g' | sed '/^$/d' |egrep -v "#|Gene_ontology" | awk 'OFS="\t" { print $1,$2,$3} {printf("%.8f\n", $7)}' | paste - - | head -20 | awk '$4 < 0.05 {print $0}' | column -t ## For significant Pathways

$ cat SW-Gills_65_vs_FW_Gills_65.identify | sed 's/ /_/g' | sed '/^$/d' | grep 'Gene_Ontology' | grep -v "#" | awk 'OFS="\t" { print $1,$3} {printf("%.8f\n", $7)}' | paste  - - | awk '$3 < 0.05 {print $0}' | column -t ## For significant GO terms

Thursday, July 21, 2016

Finding the absolute path of a symlink file

If a file is in your present working directory, I could easily find the absolute path of the file by typing `pwd` in the terminal.

But, how would I deal with the following situation? (scroll to the complete right)

The symlink is not made of absolute path. But, I want to know the file's absolute path with a single command.

$ ls -lh Q1527_v2*
lrwxrwxrwx. 1 prakkisr bio-info   71 Jun 27 10:27 Q1527_v2_scaffolds_63ContaminantRemoved_facheck.fasta -> ../SIPE_LIB_REAPR/Q1527_v2_scaffolds_63ContaminantRemoved_facheck.fasta

I searched across and found a beautiful utility called readlink

$ readlink -f ../SIPE_LIB_REAPR/Q1527_v2_scaffolds_63ContaminantRemoved_facheck.fasta
/stor2/Genome/scaffolds/Q1527_v2_63ContaminantsRemoved/Reapr/SIPE_LIB_REAPR/Q1527_v2_scaffolds_63ContaminantRemoved_facheck.fasta

Tuesday, July 19, 2016

Reverse complement fastq to fastq/a

When dealing with long insert mate pairs, some tools like Reapr might need the reads to be reverse complemented fastq in order to run the tool. So, I started searching for some softwares which can actually do what I wanted that.

One of the tool, I most of the times used was fastx_toolkit. It has a utility called fastx_reverse_complement by which I could convert all fastq to rc fastq files.

FASTQ to REV-COMP FASTQ:

$ fastx_reverse_complement -i sample.fastq -o rc_sample_fastx_tk.fastq

FASTQ to REV-COMP FASTA:

Sometimes, we might have to reverse complement fastq and need to convert to fasta file. This could be achieved by running revseq tool in EMBOSS with the following command:

$ revseq -sequence sample.fastq -outseq rc_sample.fasta -notag

* If there is a following error

fastx_reverse_complement: Invalid quality score value (char '0' ord 48 quality value -16) on line 4

Try pasting "-Q33" (without quotes) after fastx_reverse_complement.

Monday, July 18, 2016

Grep for first occurrence for multiple strings

Inspired by a post from stackexchange. I posted this as an answer there too.

$ cat somelog.log
ADWN    1259    11:00   B23
ADWN    3009    12:00   B19
DDWN     723    11:30   B04
ADWN    1589    14:20   B12
ADWN    1259    11:10   B23
DDWN    2534    13:00   B16
ADWN    3009    11:50   B14

Using "for" loop and "grep" we can do the following:

for i in $ (cut -d " " -f1 somelog.log | sort -u); do LC_ALL=C fgrep -m1 "$i" somelog.log; done

Explanation:

$ cut -d " " -f1 somelog.log | sort -u # will result in unique identifiers

ADWN
DDWN

$ LC_ALL=C fgrep -m1 "$i" somelog.log # "m" option in grep will match the first pattern

1) for loop matches each pattern using grep and picks the first pattern.

2) LC_ALL=C is used to make search faster

3) fgrep is to match fixed strings instead of regular expressions.

fastest way to count reads in fastq file - use parallel

Unzipped fastq files with 16 cores, took very minute time to count number of reads.

$ time parallel "echo {} && cat {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert*.fastq

500bp_insert_lane1_2_read2PE_single_appended.Genome.completely.cleaned.fastq_ORPHAN.fastq
408600
500bp_insert_lane1_2_read1PE_single_appended.Genome.completely.cleaned.fastq_ORPHAN.fastq
2668085
500bp_insert_lane1_2_read2PE_single_appended.Genome.completely.cleaned.fastq_matched.fastq
148713855
500bp_insert_lane1_2_read1PE_single_appended.Genome.completely.cleaned.fastq_matched.fastq
148713855

real    0m19.850s
user    0m13.144s
sys     0m56.604s

But the zipped files of size around ~ 23 Gb took a bit of around ~5 min.

$time parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert_lane*_read1.fastq.gz

500bp_insert_lane1_read1.fastq.gz
75640549

500bp_insert_lane2_read1.fastq.gz
76849862

real    2m40.358s
user    5m6.051s
sys     0m24.710s

$ time parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: 500bp_insert_lane*_read2.fastq.gz

500bp_insert_lane2_read2.fastq.gz
76849862
500bp_insert_lane1_read2.fastq.gz
75640549

real    2m49.668s
user    5m17.894s
sys     0m28.807s

Thursday, July 7, 2016

Rename the fasta file with the first sequence header

Suppose there are multiple fasta files with single sequence. We want to rename the fasta file with the header of the single sequence present in the fasta file. We can run the following command.

Place all the fasta files you want to change the file names in one directory and run the following command.

for i in *; do if [ ! -f $i ]; then echo "skipping $i"; else newname=`head -1 $i | sed 's/^\s*\([a-zA-Z0-9]\+\).*$/\1/'`; [ -n "$newname" ] ; mv -i $i $newname.fasta || echo "error at: $i"; fi; done | rename s/ // *.fasta

Monday, July 4, 2016

Problem installing Cogent - COding GENome reconstruction Tool - SOLVED

I was trying to run the Cogent on data, and was using this tutorial for instructions.Everything was smooth, until i experienced this error:

$ make

gcc -o ssw_test ssw.o main.c -Wall -O3 -pipe -lm -lz/usr/bin/ld: cannot find -lgcc_scollect2: error: ld returned 1 exit statusmake: *** [ssw_test] Error 1

I googled and tried different answers, but finally this one worked:

$ locate libgcc_s.so.1

/lib/x86_64-linux-gnu/libgcc_s.so.1

if you see directory and their paths listed, then make softlink in the same folder

$ ln -s /lib/x86_64-linux-gnu/libgcc_s.so.1 /lib/x86_64-linux-gnu/libgcc_s.so

This ran make command without any error

$ make

gcc -o ssw_test ssw.o main.c -Wall -O3 -pipe -lm -lz
gcc -o example_c ssw.o example.c -Wall -O3 -pipe -lm -lz
g++ -c -o ssw_cpp.o ssw_cpp.cpp -Wall -O3 -pipe
g++ -o example_cpp example.cpp ssw.o ssw_cpp.o -Wall -O3 -pipe -lm -lz

gcc -Wall -O3 -pipe -fPIC -shared -rdynamic -o libssw.so ssw.c

Installing Java in CentOS

1) Search for available java versions to install using the following command

$ yum search java | grep 'java-'

...
java-1.6.0-openjdk.x86_64 : OpenJDK Runtime Environment
java-1.6.0-openjdk-demo.x86_64 : OpenJDK Demos
java-1.6.0-openjdk-devel.x86_64 : OpenJDK Development Environment
java-1.6.0-openjdk-javadoc.x86_64 : OpenJDK API Documentation
java-1.6.0-openjdk-src.x86_64 : OpenJDK Source Bundle
java-1.7.0-openjdk.x86_64 : OpenJDK Runtime Environment
java-1.7.0-openjdk-demo.x86_64 : OpenJDK Demos
java-1.7.0-openjdk-devel.x86_64 : OpenJDK Development Environment
java-1.7.0-openjdk-javadoc.noarch : OpenJDK API Documentation
java-1.7.0-openjdk-src.x86_64 : OpenJDK Source Bundle
java-1.8.0-openjdk.x86_64 : OpenJDK Runtime Environment
java-1.8.0-openjdk-debug.x86_64 : OpenJDK Runtime Environment with full debug on
java-1.8.0-openjdk-demo.x86_64 : OpenJDK Demos
java-1.8.0-openjdk-demo-debug.x86_64 : OpenJDK Demos with full debug on
java-1.8.0-openjdk-devel.x86_64 : OpenJDK Development Environment
java-1.8.0-openjdk-devel-debug.x86_64 : OpenJDK Development Environment with
java-1.8.0-openjdk-headless.x86_64 : OpenJDK Runtime Environment
java-1.8.0-openjdk-headless-debug.x86_64 : OpenJDK Runtime Environment with full
java-1.8.0-openjdk-javadoc.noarch : OpenJDK API Documentation
java-1.8.0-openjdk-javadoc-debug.noarch : OpenJDK API Documentation for packages
java-1.8.0-openjdk-src.x86_64 : OpenJDK Source Bundle
java-1.8.0-openjdk-src-debug.x86_64 : OpenJDK Source Bundle for packages with

...

I chose java-1.8.0-openjdk.x86_64 since it appears to be the latest.

Now install,

$ sudo yum install java-1.8.0-openjdk.x86_64

...
...
...
Resolving Dependencies
--> Running transaction check
---> Package java-1.8.0-openjdk.x86_64 1:1.8.0.91-1.b14.el6 will be installed
--> Processing Dependency: java-1.8.0-openjdk-headless = 1:1.8.0.91-1.b14.el6 for package: 1:java-1.8.0-openjdk-1.8.0.91-1.b14.el6.x86_64
--> Running transaction check
---> Package java-1.8.0-openjdk-headless.x86_64 1:1.8.0.91-1.b14.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

===============================================================================================================================================================
Package Arch Version Repository Size
===============================================================================================================================================================
Installing:
java-1.8.0-openjdk x86_64 1:1.8.0.91-1.b14.el6 base 195 k
Installing for dependencies:
java-1.8.0-openjdk-headless x86_64 1:1.8.0.91-1.b14.el6 base 32 M

Transaction Summary
===============================================================================================================================================================
Install 2 Package(s)

Total download size: 32 M
Installed size: 102 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): java-1.8.0-openjdk-1.8.0.91-1.b14.el6.x86_64.rpm | 195 kB 00:00
(2/2): java-1.8.0-openjdk-headless-1.8.0.91-1.b14.el6.x86_64.rpm | 32 MB 00:04
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 6.5 MB/s | 32 MB 00:04
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : 1:java-1.8.0-openjdk-headless-1.8.0.91-1.b14.el6.x86_64 1/2
Installing : 1:java-1.8.0-openjdk-1.8.0.91-1.b14.el6.x86_64 2/2
Verifying : 1:java-1.8.0-openjdk-headless-1.8.0.91-1.b14.el6.x86_64 1/2
Verifying : 1:java-1.8.0-openjdk-1.8.0.91-1.b14.el6.x86_64 2/2

Installed:
java-1.8.0-openjdk.x86_64 1:1.8.0.91-1.b14.el6

Dependency Installed:
java-1.8.0-openjdk-headless.x86_64 1:1.8.0.91-1.b14.el6

Complete!

You will see something like this. I have pasted only few lines from bottom.

Now checking java version in terminal:

$ java -version

openjdk version "1.8.0_91"

OpenJDK Runtime Environment (build 1.8.0_91-b14)

OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Friday, July 1, 2016

Bioinformatics beginner commands - working with fasta files

PhD or Industry Job - My impressions so far

PhD

Advantages:
1) Chance for travelling globally
2) More independence to do research
3) Better salary than Masters
4) Have to write grants but not much travel required

Disadvantages:
1) Needs more mental strength
2) Not much financial improvement
3) Not a rocket career
4) Same as a master student in industry with 4 years experience

Industry Job

Advantages:
1) Learning skills at a faster pace because of commercial setting
2) Build the network and confidence
3) Academic salary not even comparable to industrial salary
4) Might have to travel or be customer oriented but no need to worry about next grants
5) Restricted to certain continent

Disadvantages:
1) May require travelling
2) If not a enthusiastic person to make new connection, very difficult to carry on
3) Comparatively more work pressure because of deadlines