Wednesday, January 12, 2022

Extract accessory genes as FASTA files from the Roary output for downstream functional characterization


1. First, generate the list of locus_tags of the accessory genes (output file: pan_genome_results)
query_pan_genome -a complement -g clustered_proteins *.gff
mv pan_genome_results accessory_genome_locus_tags.list
2. Use the locus_tags of the accessory genes to extract the sequences from the Prokka folder
grep -F 'locus_tag=' *.gff >Combined_local_gffs.txt # collect the 'locus_tag' lines from all GFF files into one file, to speed up the lookups below

mkdir -p accessory_gene_representatives # make sure the output folder exists
time for locus_tag in $(awk '{print $2}' accessory_genome_locus_tags.list); do
# extract the sample name (folder name) where the gene is present
file=$(grep -F -m1 "$locus_tag" Combined_local_gffs.txt | awk -F':' '{print $1}' | sed 's/\.gff$//');
# pull the matching record out of that sample's .ffn file
awk -v var="$locus_tag" 'BEGIN {RS=">"} $0~var {printf ">%s", $0}' Prokka/MySampleFolders/"$file"_prokka_out/"$file".ffn >accessory_gene_representatives/"$locus_tag".fasta;
done
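
Because the loop writes one file per locus_tag, a lookup that fails (for example, a sample name that does not match a Prokka folder) leaves an empty or missing FASTA file. A minimal sketch of a sanity check follows; the function name report_missing is my own, and it assumes the same list format (locus_tag in column 2) and output folder used above.

```shell
# Print every locus_tag (column 2 of the list) whose FASTA file
# is missing or empty in the given output folder.
# report_missing is a hypothetical helper, not part of Roary or Prokka.
report_missing() {
    list=$1
    out_dir=$2
    awk '{print $2}' "$list" | while read -r tag; do
        # -s is true only if the file exists and has a size greater than zero
        [ -s "$out_dir/$tag.fasta" ] || printf '%s\n' "$tag"
    done
}
```

Calling report_missing accessory_genome_locus_tags.list accessory_gene_representatives should print nothing if every gene was extracted.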

The per-gene FASTA files can then be combined into a single file and functionally characterised in downstream analysis.
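
One way to do the combining step is sketched below. The function name combine_fastas and the output file name in the usage note are my own choices; the sketch assumes the per-gene files all end in .fasta and sit directly in one folder.

```shell
# Concatenate all non-empty *.fasta files in a folder into one multi-FASTA file.
# combine_fastas is a hypothetical helper name.
combine_fastas() {
    in_dir=$1
    out_file=$2
    : > "$out_file"                      # start from an empty output file
    for f in "$in_dir"/*.fasta; do
        # skip empty files (and the literal pattern if nothing matches)
        if [ -s "$f" ]; then
            cat "$f" >> "$out_file"
        fi
    done
}
```

For example, combine_fastas accessory_gene_representatives accessory_genes_combined.fasta writes the combined file, and grep -c '^>' accessory_genes_combined.fasta then counts the sequences in it. Write the output file outside the input folder so it is not picked up on a second run.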

Extracting 5477 genes took me about 19 minutes this way.