Thursday, April 25, 2019

Extract a specific gene contig from multiple assemblies

1)  Copy and paste a gene (for example NDM-1) from a database in text file - NDM-1.fa

2) Create blast DB for NDM gene using following command

$ makeblastdb -in NDM-1.fa -dbtype nucl -out NDM-1.fa.DB -parse_seqids

3) BLAST the assemblies against the NDM gene

$ for d in $(ls */assembly.fasta); do \
sample=`echo $d | sed 's/\/assembly.fasta//g'`; \
echo "$sample"; sed -i "s/>/>"$sample"_/g" $d; \
sed -i 's/ /_/g' $d; \
~/sw/ncbi-blast-2.7.1+/bin/blastn -db NDM-1.fa.DB -query $d -outfmt '6 qseqid sseqid pident nident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen' -num_threads 4 -out "$sample"_vs_NDM_blastresults.txt; done

This would rename all the contigs in the assembly.fasta with the directory name which holds assembly.fasta

4) cat *.txt | column -t

THY1079 NDM-1_JQ080305_1-813_813 100 813 813 0 0 48157 48969 813 1 0 1502 91643 813
THY1219 NDM-1_JQ080305_1-813_813 100 813 813 0 0 4586 5398 813 1 0 1502 13505 813
THY1250 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 14408 15220 1 813 0 1491 88353 813
THY1357 NDM-1_JQ080305_1-813_813 100 813 813 0 0 78370 79182 813 1 0 1502 177751 813
THY1440 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 14408 15220 1 813 0 1491 96544 813
THY1669 NDM-1_JQ080305_1-813_813 100 813 813 0 0 6747 7559 813 1 0 1502 118064 813
THY1758 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 5708 6520 1 813 0 1491 62334 813
THY1916 NDM-1_JQ080305_1-813_813 100 813 813 0 0 19307 20119 813 1 0 1502 278408 813
THY1940 NDM-1_JQ080305_1-813_813 100 813 813 0 0 3171762 3172574 1 813 0 1502 4879405 813
THY1968 NDM-1_JQ080305_1-813_813 100 813 813 0 0 3173593 3174405 1 813 0 1502 4880726 813
THY285 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 25547 26359 1 813 0 1491 46161 813
THY337 NDM-1_JQ080305_1-813_813 100 813 813 0 0 57706 58518 1 813 0 1502 88057 813
THY708 NDM-1_JQ080305_1-813_813 100 813 813 0 0 302 1114 1 813 0 1502 14280 813
THY807 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 2469 3281 1 813 0 1491 5469 813
THY924 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 36917 37729 1 813 0 1491 75214 813
THY924 NDM-1_JQ080305_1-813_813 99.754 811 813 2 0 20047 20859 813 1 0 1491 70472 813


5) extract the NDM_gene from the assemblies

fgrep -A1 -f <(cat *.txt | cut -f1) <(cat */assembly.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}') >NDM_positive_contigs.fasta

<(cat *.txt | cut -f1) -process substitution operator - gives the list of contig names


cat */assembly.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' - this woud convert all the multiline fasta to single line fasta

-A1 get the single line after the match

No comments:

Post a Comment