Tuesday, July 19, 2022

Calculate the frequency for a range of nucleotides aligned to reference from a BAM file

I wanted to find out the frequency for a range of nucleotides from multiple reads aligned to reference sequence from a BAM file. 

As you see from figure below, I want to find the frequency of nucleotides from position 13 to 15 and give their count.


For this, I use samtools mpileup command with a few bash commands as below.

samtools mpileup snps.bam -Q 0 -r CY116646.1:13-15 \| # Extract aligned region
sed 's/\^\]//g' \|  # "^]" signifies a base that was at the start of a read 
awk '{print $5}' \| # Read bases
sed -e 's/./&,/g' -e 's/,$//g' >infile && # temp file
ruby -F, -lane 'BEGIN{$a=[]};$a.concat([$F]);END{$a.transpose.each{|x|puts x.join(",")}}' infile \| # pass temp file to ruby oneliner
sed 's/,//g' \| #remove delimiter
sort \| 
uniq -c # Count


This results in:
[mpileup] 1 samples in 1 input files
     10 GGT
      1 GTT


EDIT 1:

But this would not work if the mpileup bases are not of same length.

This could be achived using a R library package. All we need is input bam file and coordinates of the sequence.

1
2
3
library(GenomicAlignments)    
stacked <- stackStringsFromBam("snps.bam", param = GRanges(CY116646.1:13-15))
write.table(stacked, Aligned_Reads_at_Region_Of_Interest.txt, quote = FALSE, col.names = F, row.names = F)


If you have multiple positions, then you can run the following R script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  fileName <- "regions.list"
  conn <- file(fileName, open = "r")
  linn <- readLines(conn)

  for (i in 1:length(linn)) {
    stacked <- stackStringsFromBam("snps.bam", param = GRanges(linn[i]))
    write.table(
      stacked,
      paste0("Region", linn[i], ".txt"),
      quote = FALSE,
      col.names = F,
      row.names = F
    )
  }
  close(conn)

Some reference posts:

https://www.biostars.org/p/259690/

Sunday, July 17, 2022

Useful Blast Commands and one-liners

 # Blast command with additional details such as query length and subject length

blastn -db /storage/data/Storage/nt/nt/nt -query query.fa -outfmt '6 qseqid sseqid pident nident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen' -out query.fa_vs_nt_with_stitle_blastresults.txt -num_threads 48

# To get title of subject, For example: Enterobacter_aerogenes_strain_G7,_complete_genome 

blastn -db /storage/data/Storage/nt/nt/nt -query query.fa -outfmt '6 qseqid sseqid pident nident length mismatch gapopen qstart qend sstart send evalue bitscore stitle qlen slen' -out query.fa_vs_nt_with_stitle_blastresults.txt -num_threads 48

# Tophits

$ awk '!x[$1]++' query.fa_vs_nt_with_stitle_blastresults.txt

# Tophits with plasmid assignment

$ awk '!x[$1]++' query.fa_vs_nt_with_stitle_blastresults.txt | sed 's/ /_/g' | awk '{print $0 "\t" $5*100/$(NF-1)}' | column -t | grep 'plasmid'

Thursday, July 14, 2022

Running MLST tool on individual contigs in fasta vs whole genome assembly fasta

This is more of a kind of observation.

When MLST tool was used to assign species and ST on assembled contigs as a whole file the assignment was ecoli species and ST 448. 

But when assembled file is separated into multiple contigs, only one contig was assigned with species "senterica", without ST assignment. The rest of the contigs were not assigned by any species and ST. 

It appears, running all the contigs together in a file gives MLST tool more confidence than running the tool on each single contig.