I wanted to find out the frequency for a range of nucleotides from multiple reads aligned to reference sequence from a BAM file.
As you see from figure below, I want to find the frequency of nucleotides from position 13 to 15 and give their count.
samtools mpileup snps.bam -Q 0 -r CY116646.1:13-15 \| # Extract aligned region
sed 's/\^\]//g' \| # "^]" signifies a base that was at the start of a read
awk '{print $5}' \| # Read bases
sed -e 's/./&,/g' -e 's/,$//g' >infile && # temp file
ruby -F, -lane 'BEGIN{$a=[]};$a.concat([$F]);END{$a.transpose.each{|x|puts x.join(",")}}' infile \| # pass temp file to ruby oneliner
sed 's/,//g' \| #remove delimiter
sort \|
uniq -c # Count
This results in:
[mpileup] 1 samples in 1 input files
10 GGT
1 GTT
EDIT 1:
But this would not work if the mpileup bases are not of same length.
This could be achived using a R library package. All we need is input bam file and coordinates of the sequence.
1 2 3 | library(GenomicAlignments) stacked <- stackStringsFromBam("snps.bam", param = GRanges(CY116646.1:13-15)) write.table(stacked, Aligned_Reads_at_Region_Of_Interest.txt, quote = FALSE, col.names = F, row.names = F) |
If you have multiple positions, then you can run the following R script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | fileName <- "regions.list" conn <- file(fileName, open = "r") linn <- readLines(conn) for (i in 1:length(linn)) { stacked <- stackStringsFromBam("snps.bam", param = GRanges(linn[i])) write.table( stacked, paste0("Region", linn[i], ".txt"), quote = FALSE, col.names = F, row.names = F ) } close(conn) |
Some reference posts:
https://www.biostars.org/p/259690/