Tuesday, February 13, 2018

Merged entries in the non-redundant Nucleotide (nt) database

I ran the following command to extract a sequence from the nt database:


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence >KP900017.1 Raoultella ornithinolytica strain YNKP001 plasmid pYNKP001-NDM, complete sequence

I noticed that this is a merged entry in the 'non-redundant' Nucleotide (nt) database: two records with identical nucleotide sequences are combined under a single FASTA entry, with both deflines joined on one header line.
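To see at a glance which accessions share a merged entry, the header can be split into its members. The snippet below is only a rough sketch: `list_merged_ids` is a hypothetical helper name, and it assumes the members are joined by " >" exactly as in the output above, and that " >" never occurs inside a title.

```shell
# Hypothetical helper: print every accession found on a merged header line.
# Assumption: members are separated by " >" as in the blastdbcmd output above.
list_merged_ids() {
    # Drop the leading ">", split on " >", and keep the first word of each member.
    sed 's/^>//' | awk -F' >' '{
        for (i = 1; i <= NF; i++) {
            split($i, words, " ")
            print words[1]
        }
    }'
}

header='>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence >KP900017.1 Raoultella ornithinolytica strain YNKP001 plasmid pYNKP001-NDM, complete sequence'

merged_ids=$(printf '%s\n' "$header" | list_merged_ids)
printf '%s\n' "$merged_ids"
```

On a real database, the same function could take its input from `blastdbcmd ... | grep '>'` instead of the hard-coded header.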

When I queried the second identifier in the merged entry, blastdbcmd returned the same result:

$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KP900017.1 | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence >KP900017.1 Raoultella ornithinolytica strain YNKP001 plasmid pYNKP001-NDM, complete sequence

Out of curiosity, I checked whether the two entries are the same length. They are: both are 41190 bases long (the last column of the wc output is the character count).


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KP900017.1 | grep -v '>' | tr -d '\n' | wc 

      0       1   41190

$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 | grep -v '>' | tr -d '\n' | wc

      0       1   41190
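The length check above can be wrapped in a small function so that only the character count is printed. This is a minimal sketch: `seq_length` is a hypothetical helper name, and the toy record stands in for real blastdbcmd output so that the snippet runs without a database.

```shell
# Hypothetical helper: print the number of residues in FASTA read from stdin.
seq_length() {
    # Drop header lines, join the sequence lines, count characters,
    # and strip the padding that some wc implementations add.
    grep -v '>' | tr -d '\n' | wc -c | tr -d ' '
}

# Toy record used instead of a real database entry.
toy_len=$(printf '>toy record\nACGT\nAC\n' | seq_length)
echo "$toy_len"

# With a database at hand, the call would look like:
#   blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KP900017.1 | seq_length
```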

Curious again whether the two sequences match exactly, I BLASTed them against each other. Unsurprisingly, they share 100% identity. See the BLAST screenshots below:





So, how do I sort this out? I need only one annotation per sequence for my analysis, and I do not want the second identifier to confuse my downstream scripts.

After searching around, I found that the problem can be solved by adding the "-target_only" option to the command. I believe the same situation can arise in the nr database too, but switching this option on should help there as well.


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 -target_only | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence
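If the sequences have already been dumped to a FASTA file without "-target_only", the headers can also be trimmed afterwards. The sketch below is not part of BLAST: `keep_first_defline` is a hypothetical helper, and it assumes that merged members are joined by " >" and that " >" never appears inside a title. Unlike "-target_only", it always keeps the first member of the merged entry, which is not necessarily the identifier that was queried.

```shell
# Hypothetical helper: keep only the first defline of each merged header.
# Assumption: merged members are separated by " >" on the header line.
keep_first_defline() {
    # Only lines starting with ">" are edited; sequence lines pass through.
    sed '/^>/ s/ >.*$//'
}

# Toy merged record used instead of real blastdbcmd output.
cleaned=$(printf '>A.1 first title >B.1 second title\nACGT\n' | keep_first_defline)
printf '%s\n' "$cleaned"
```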
