Tuesday, July 19, 2022

Calculate the frequency for a range of nucleotides aligned to reference from a BAM file

I wanted to find out the frequency for a range of nucleotides from multiple reads aligned to reference sequence from a BAM file. 

As you see from figure below, I want to find the frequency of nucleotides from position 13 to 15 and give their count.


For this, I use samtools mpileup command with a few bash commands as below.

samtools mpileup snps.bam -Q 0 -r CY116646.1:13-15 \| # Extract aligned region
sed 's/\^\]//g' \|  # "^]" signifies a base that was at the start of a read 
awk '{print $5}' \| # Read bases
sed -e 's/./&,/g' -e 's/,$//g' >infile && # temp file
ruby -F, -lane 'BEGIN{$a=[]};$a.concat([$F]);END{$a.transpose.each{|x|puts x.join(",")}}' infile \| # pass temp file to ruby oneliner
sed 's/,//g' \| #remove delimiter
sort \| 
uniq -c # Count


This results in:
[mpileup] 1 samples in 1 input files
     10 GGT
      1 GTT


EDIT 1:

But this would not work if the mpileup bases are not of same length.

This could be achived using a R library package. All we need is input bam file and coordinates of the sequence.

1
2
3
library(GenomicAlignments)    
stacked <- stackStringsFromBam("snps.bam", param = GRanges(CY116646.1:13-15))
write.table(stacked, Aligned_Reads_at_Region_Of_Interest.txt, quote = FALSE, col.names = F, row.names = F)


If you have multiple positions, then you can run the following R script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  fileName <- "regions.list"
  conn <- file(fileName, open = "r")
  linn <- readLines(conn)

  for (i in 1:length(linn)) {
    stacked <- stackStringsFromBam("snps.bam", param = GRanges(linn[i]))
    write.table(
      stacked,
      paste0("Region", linn[i], ".txt"),
      quote = FALSE,
      col.names = F,
      row.names = F
    )
  }
  close(conn)

Some reference posts:

https://www.biostars.org/p/259690/

No comments:

Post a Comment