Sunday, November 12, 2017

Replace a list of patterns with another list


$ while IFS= read -r aline; do filename=`echo "$aline" \| 
cut -f1 -d "        " `; phyloID=`echo "$aline" \| 
cut -f2 -d "  " `; 
sed -i "s/$filename/$phyloID/"  file_with_pattern_to_be_replaced.txt ; 
echo "$aline";  done < id_values.txt


## "id_values.txt" - contains pattern and replacement, separated by tab

## "file_with_pattern_to_be_replaced.txt" - replace patterns in this file

Wednesday, November 8, 2017

Gubbins Phylogeny construction : Segmentation fault with some input files

 ***Note: There is a newer version gubbins (v2.3.4) which resolves the issue according to authors. Please find this post from the gubbins github page for more details.

This issue is already reported in gubbins github page for older versions, where it runs successfully for a set of fasta files and not for other set. Could not find much documentation to trace and resolve this error. This is the step which is throwing me the error:

gubbins -r -v seqnew.fa.gaps.vcf -a 100 -b 10000 -f test_gubbins/tmp/seqnew.fa -t seqnew.fa.iteration_1 -m 3 seqnew.fa.gaps.snp_sites.aln
Failed while running Gubbins. Please ensure you have enough free memory
After the parsnp alignment, i used the parsnp.fasta (seqnew.fasta here), to run the following steps. By breaking the alignment into shorter chunks I could successfully run gubbins. After the gubbins is run for all chunks, I combined them and plotted a phylogenetic tree. Commands I used are:

1) Formatting the alignment fasta file



$ sed -e '/^>/s/$/@/' -e 's/^>/#/' seqnew.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '      ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >seqnew.oneline.fa  ##fasta file in oneline 
 
$ fasta_formatter -i seqnew.fa -o seqnew.oneline.fa -w 0 ##convert multiple fasta file to single line fasta file

2) Split the alignment file

$ perl splitAlignment.pl

-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_0
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_1
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_2
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_3
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_4
-rw-rw-r-- 1 ttsh ttsh 7.7M Nov  9 11:00 seqnew.oneline.fa_5
-rw-rw-r-- 1 ttsh ttsh 2.8M Nov  9 11:00 seqnew.oneline.fa_6

splitAlignment.pl can be found at this github page.

3) Run Gubbins on chunk of sequences


$ for d in {0..6}; do time nohup run_gubbins.py --prefix postGubbins_ecloacae_$d --thread 12 --verbose --tree_builder fasttree seqnew.oneline.fa_$d >>gubbins.log 2>&1; done

4) Convert the alignment file into one line fasta format using fastx_toolkit


$ for d in {0..6}; do fasta_formatter -i postGubbins_ecloacae_$d.filtered_polymorphic_sites.fasta -o postGubbins_ecloacae_$d.filtered_polymorphic_sites.fasta_fastx -w 0; done

-rw-rw-r-- 1 ttsh ttsh 1.2M Nov  9 11:31 postGubbins_ecloacae_0.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_1.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.5M Nov  9 11:31 postGubbins_ecloacae_2.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_3.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.3M Nov  9 11:31 postGubbins_ecloacae_4.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 1.2M Nov  9 11:31 postGubbins_ecloacae_5.filtered_polymorphic_sites.fasta_fastx
-rw-rw-r-- 1 ttsh ttsh 397K Nov  9 11:31 postGubbins_ecloacae_6.filtered_polymorphic_sites.fasta_fastx

5) Combine the alignment files


$ perl combinedAlignedfasta.pl 

combinedAlignedfasta.pl  can be found at this github page.

6) You can skip the 7th step by converting aligned fasta file to newick or .tre file using  phylogenetic tree inferring softwares like fasttree or RaxML.


$ FastTree -nt postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.fasta > postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre

6) Installed QIIME and from QIIME ran the make phylogeny script


$ qiime (enter into qiime environment)

$  qiime > make_phylogeny.py -i postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.fasta -o postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre

7) Used MEGA software to open the .tre file and plot the phylogenetic tree.

8) To change fasta file name to phyloID names, used a script taken from biostars.


$ while IFS= read -r aline; do filename=`echo "$aline" | cut -f1 -d "     " `; phyloID=`echo "$aline" | cut -f2 -d "  " `; sed -i "s/$filename/$phyloID/" postGubbins_seqnew_ecloacae.filtered_polymorphic_sites_Final_Parsed.nwk ; echo "$aline";  done < ~/Datta/7International_NDM/International_NDM_Proj_Filenames_PhyloIDs.txt 
  
$ perl script.pl PhyloIDs.txt postGubbins_seqnew_ecloacae.filtered_polymorphic_sites.tre >postGubbinsChunk.filtered_polymorphic_sitesv2.tre

Note: *" " in the above one liner of cut command are tabs

updated on June 29th, 2018

Code: hilite.me, lang: CSS, CSS: border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;, Style: Perldoc