Thursday, July 28, 2016

Pathway enrichment analysis - using KOBAS 2.0

Of late, I needed to do pathway enrichment analysis for a list of genes that have been differentially expressed (DEG). These are different approaches for this task, this was mine using KOBAS 2.0 which identifies statistically significant pathways from the fasta sequences.

Cons:
1) Online is limited to 500 seqs.
2) Databases are too huge to locally install.

Pros:
1) Can do a local blast and upload in the webserver to generate the results.

For this, I initially downloaded complete KO file (as mentioned in this page). But the size was huge to run the blastx (~6 gb). Since, I am working on a fish transcriptome, I used Homo sapeins peptide fasta file and ran a local blast.

1
2
3
4
$ wget http://kobas.cbi.pku.edu.cn/download/seq_pep/hsa.pep.fasta.gz
$ gunzip hsa.pep.fasta.gz
$ /data/software/ncbi-blast-2.2.31+-2.x86_64/bin/makeblastdb -in hsa.pep.fasta -dbtype prot -out dre.pep.KEGGDB -parse_seqids
$ time nohup /data/software/ncbi-blast-2.2.31+-2.x86_64/bin/blastx -db /data/rgg_data/Databases/KOBAS/hsa.pep.KEGGDB -query SB_GeneSet.fasta -evalue 0.000001 -outfmt 6 -out SB_Genes_vs_hsa_pep_KEGGDB.blastx.tab.txt -num_threads 23 -max_target_seqs 10 &


At first I tried with Danio rerio, because my species of interest is also fish. But then KOBAS failed repeatedly. It do not know if it is database issue or Danio rerio peptide set, not as extensively studied as Homo Sapiens.

So, ran annotate and identify steps, to retrive the significant pathways. I uploaded 6.5 M blastx output file with success. Once this is done, I have to sort according to corrected p-value to select most enriched pathways in my DEGset.

That it!

Edit2: July 29, 2016

Once, the raw file of the enriched pathways are downloaded, we have to find the pathways with corrected p-value < 0.05.

So, I used the following command to filter the file and retrive significant pathways from KOBAS identify output file:


1
$ cat SW_GILLS_65_vs_FW_GILLS_65_HSA_UpRegul.identify | sed 's/ /_/g' | sed '/^$/d' |egrep -v "#|Gene_ontology" | awk 'OFS="\t" { print $1,$2,$3} {printf("%.8f\n", $7)}' | paste - - | head -20 | awk '$4 < 0.05 {print $0}' | column -t ## For significant Pathways
$ cat SW-Gills_65_vs_FW_Gills_65.identify | sed 's/ /_/g' | sed '/^$/d' | grep 'Gene_Ontology' | grep -v "#" | awk 'OFS="\t" { print $1,$3} {printf("%.8f\n", $7)}' | paste  - - | awk '$3 < 0.05 {print $0}' | column -t ## For significant GO terms
 

No comments:

Post a Comment