Wednesday, November 8, 2017

Gubbins Phylogeny construction : Segmentation fault with some input files

 ***Note: There is a newer version gubbins (v2.3.4) which resolves the issue according to authors. Please find this post from the gubbins github page for more details.

This issue is already reported in gubbins github page for older versions, where it runs successfully for a set of fasta files and not for other set. Could not find much documentation to trace and resolve this error. This is the step which is throwing me the error:

gubbins -r -v seqnew.fa.gaps.vcf -a 100 -b 10000 -f test_gubbins/tmp/seqnew.fa -t seqnew.fa.iteration_1 -m 3 seqnew.fa.gaps.snp_sites.aln
Failed while running Gubbins. Please ensure you have enough free memory
After the parsnp alignment, i used the parsnp.fasta (seqnew.fasta here), to run the following steps. By breaking the alignment into shorter chunks I could successfully run gubbins. After the gubbins is run for all chunks, I combined them and plotted a phylogenetic tree. Commands I used are:

1) Formatting the alignment fasta file



$ sed -e '/^>/s/$/@/' -e 's/^>/#/' seqnew.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '      ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >seqnew.oneline.fa  ##fasta file in oneline 
 
$ fasta_formatter -i seqnew.fa -o seqnew.oneline.fa -w 0 ##convert multiple fasta file to single line fasta file

2) Split the alignment file

$ perl splitAlignment.pl

-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_0
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_1
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_2
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_3
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_4
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_5
-rw-rw-r-- 1 ttsh ttsh 2.8M Nov  9 11:00 seqnew.oneline.fa_6

splitAlignment.pl can be found at this github page.

3) Run Gubbins on chunk of sequences


$ for d in {0..6}; do time nohup run_gubbins.py --prefix postGubbins_ecloacae_$d --thread 12 --verbose --tree_builder fasttree seqnew.oneline.fa_$d >>gubbins.log 2>&1; done

4) Convert the alignment file into one line fasta format using fastx_toolkit


$ for d in {0..6}; do fasta_formatter -i postGubbins_ecloacae_$d.filtered_polymorphic_sites.fasta -o postGubbins_ecloacae_$d.filtered_polymorphic_sites.fasta_fastx -w 0; done

-rw-rw-r-- 1 ttsh ttsh 1.2M Nov  9 11:31 postGubbins_ecloacae_0.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_1.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.5M Nov  9 11:31 postGubbins_ecloacae_2.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_3.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_4.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.2M Nov  9 11:31 postGubbins_ecloacae_5.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 397K Nov  9 11:31 postGubbins_ecloacae_6.filtered_polymorphic_sites.fasta_fastx

5) Combine the alignment files


$ perl combinedAlignedfasta.pl 

combinedAlignedfasta.pl  can be found at this github page.

6) You can skip the 7th step by converting aligned fasta file to newick or .tre file using  phylogenetic tree inferring softwares like fasttree or RaxML.


$ FastTree -nt postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.fasta > postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre

6) Installed QIIME and from QIIME ran the make phylogeny script


$ qiime (enter into qiime environment)

$  qiime > make_phylogeny.py -i postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.fasta -o postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre

7) Used MEGA software to open the .tre file and plot the phylogenetic tree.

8) To change fasta file name to phyloID names, used a script taken from biostars.


$ while IFS= read -r aline; do filename=`echo "$aline" | cut -f1 -d "     " `; phyloID=`echo "$aline" | cut -f2 -d "  " `; sed -i "s/$filename/$phyloID/" postGubbins_seqnew_ecloacae.filtered_polymorphic_sites_Final_Parsed.nwk ; echo "$aline";  done < ~/Datta/7International_NDM/International_NDM_Proj_Filenames_PhyloIDs.txt 
  
$ perl script.pl PhyloIDs.txt postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre >postGubbinsChunk.filtered_polymorphic_sitesv2.tre

Note: *" " in the above one liner of cut command are tabs

updated on June 29th, 2018

Code: hilite.me, lang: CSS, CSS: border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;, Style: Perldoc

3 comments:

  1. Tested this gubbins_in_chunks pipeline with a dataset that has complete Gubbins result.
    The gubbins_in_chunks result captures ~96% of the polymorphic sites captured by the full result.
    Thank you Datta! :)

    ReplyDelete
  2. After parsnp I didn't find any .fasta files. There are parsnpAligner.ini parsnpAligner.log parsnp.ggr parsnp.tree parsnp.xmfa . How did you generated the .fasta file? I tried to convert the .xmfa file to .fasta but that file is not compatible for gubbins.

    ReplyDelete
    Replies
    1. I converted the xmfa to fasta file using sequence scripts available here:

      https://github.com/kjolley/seq_scripts/blob/master/xmfa2fasta.pl

      The output of this script is fed to gubbins

      Delete