I wanted to find out the inward and outward orientation pairs from a bam file. So, I checked a few tools. Two of them were bamcheck and samtools stats. The reason for doing so, if the bam shows that there are more number of outward oriented pairs, that means there are higher chances that the region has some problem associated with it.
But, I was stuck because both these tools gave different outputs. So, I myself wanted to find out which is one more reliable.
This subsequently led me run the following steps.
1) Split the bam into smaller bam files, (1 scaffold and its associated mapped reads)
$ bamtools split -in 500bp_insert_mapping_subset2_sorted.bam -reference
$ ls 500bp_insert_mapping_subset2_sorted.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam 500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam
2) Count the inward and oriented pairs, using both bamcheck
$ for i in *bam; do echo "$i"; bamcheck $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t 500bp_insert_mapping_subset2_sorted.bam SN inward oriented pairs: 95506 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam SN inward oriented pairs: 2801 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam SN inward oriented pairs: 2861 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam SN inward oriented pairs: 2473 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam SN inward oriented pairs: 2844 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam SN inward oriented pairs: 2620 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam SN inward oriented pairs: 3212 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam SN inward oriented pairs: 3543 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam SN inward oriented pairs: 3018 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam SN inward oriented pairs: 3671 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam SN inward oriented pairs: 3530 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam SN inward oriented pairs: 3093 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam SN inward oriented pairs: 2951 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam SN inward oriented pairs: 1545 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam SN inward oriented pairs: 3242 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam SN inward oriented pairs: 2252 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam SN inward oriented pairs: 3876 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam SN inward oriented pairs: 3718 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam SN inward oriented pairs: 4269 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam SN inward oriented pairs: 2493 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam SN inward oriented pairs: 3140 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam SN inward oriented pairs: 4837 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam SN inward oriented pairs: 6686 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam SN inward oriented pairs: 2668 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam SN inward oriented pairs: 3145 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam SN inward oriented pairs: 2622 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam SN inward oriented pairs: 2729 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam SN inward oriented pairs: 2880 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam SN inward oriented pairs: 2668 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam SN inward oriented pairs: 4305 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam SN inward oriented pairs: 1814 SN outward oriented pairs: 0
2) Count the inward and oriented pairs, using samtools stats
$ for i in *bam; do echo "$i"; ~/Documents/samtools-1.2/samtools stats $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t 500bp_insert_mapping_subset2_sorted.bam SN inward oriented pairs: 96131 SN outward oriented pairs: 295 500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam SN inward oriented pairs: 2807 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam SN inward oriented pairs: 2861 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam SN inward oriented pairs: 2474 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam SN inward oriented pairs: 2845 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam SN inward oriented pairs: 2622 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam SN inward oriented pairs: 3221 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam SN inward oriented pairs: 3553 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam SN inward oriented pairs: 3020 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam SN inward oriented pairs: 3677 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam SN inward oriented pairs: 3545 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam SN inward oriented pairs: 3097 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam SN inward oriented pairs: 2961 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam SN inward oriented pairs: 1548 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam SN inward oriented pairs: 3251 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam SN inward oriented pairs: 2272 SN outward oriented pairs: 24 500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam SN inward oriented pairs: 3885 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam SN inward oriented pairs: 3727 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam SN inward oriented pairs: 4279 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam SN inward oriented pairs: 2850 SN outward oriented pairs: 227 500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam SN inward oriented pairs: 3176 SN outward oriented pairs: 14 500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam SN inward oriented pairs: 4841 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam SN inward oriented pairs: 6735 SN outward oriented pairs: 24 500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam SN inward oriented pairs: 2679 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam SN inward oriented pairs: 3150 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam SN inward oriented pairs: 2625 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam SN inward oriented pairs: 2732 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam SN inward oriented pairs: 2885 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam SN inward oriented pairs: 2671 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam SN inward oriented pairs: 4312 SN outward oriented pairs: 0 500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam SN inward oriented pairs: 1830 SN outward oriented pairs: 6
3) I selected this small BAM file: 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
Bamcheck stats
$ bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam | head -28 # This file was produced by bamcheck (2012-04-24) # The command line was: bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam # Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part. SN sequences: 14606 SN is paired: 1 SN is sorted: 1 SN 1st fragments: 7843 SN last fragments: 6763 SN reads mapped: 14606 SN reads unmapped: 0 SN reads unpaired: 1234 SN reads paired: 13372 SN reads duplicated: 0 SN reads MQ0: 9217 SN total length: 1437208 SN bases mapped: 1437208 SN bases mapped (cigar): 1436380 SN bases trimmed: 0 SN bases duplicated: 0 SN mismatches: 0 SN error rate: 0.000000e+00 SN average length: 98 SN maximum length: 101 SN average quality: 36.8 SN insert size average: 497.2 SN insert size standard deviation: 29.1 SN inward oriented pairs: 6686 SN outward oriented pairs: 0
$ samtools stats 500bp_insert_mapping_subset2_sorted.REF__unitig_2151_outties_filtered.bam | head -37 SN raw total sequences: 14606 SN filtered sequences: 0 SN sequences: 14606 SN is sorted: 1 SN 1st fragments: 7843 SN last fragments: 6763 SN reads mapped: 14606 SN reads mapped and paired: 13526 # paired-end technology bit set + both mates mapped SN reads unmapped: 0 SN reads properly paired: 13372 # proper-pair bit set SN reads paired: 13526 # paired-end technology bit set SN reads duplicated: 0 # PCR or optical duplicate bit set SN reads MQ0: 9217 # mapped and MQ=0 SN reads QC failed: 0 SN non-primary alignments: 0 SN total length: 1437208 # ignores clipping SN bases mapped: 1437208 # ignores clipping SN bases mapped (cigar): 1436380 # more accurate SN bases trimmed: 0 SN bases duplicated: 0 SN mismatches: 0 # from NM fields SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar) SN average length: 98 SN maximum length: 101 SN average quality: 36.8 SN insert size average: 491.7 SN insert size standard deviation: 29.4 SN inward oriented pairs: 6735 SN outward oriented pairs: 24
Observe here, (6735+24)*2 != 13,526 but equals to 13,518 paired reads. (check from step 7)
4) First take out all the mapped read pairs (irrespective of direction) from the bam and convert into sorted bed file.
$ fgrep -f <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam \| cut -f4 | cut -d "_" -f1 \| sort | uniq -c | sort -n \| fgrep ' 2' | sed 's/ 2 //g') <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam) \| sort -k4,4 >500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed
5) Checking the numbers of reads paired
$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| awk '{print $4,$2,$3,$6}' \| paste - - | wc -l
6763 (reads paired: 13526) - This number is same as the number generated by samtools stats see above SN reads paired: 13526.
But the bamcheck was giving a different result (SN reads paired: 13372)
But the bamcheck was giving a different result (SN reads paired: 13372)
6) Now, from the read pairs, I want to know how many are
- Inward oriented pairs (--> <--)
- Outward oriented pairs (<-- -->)
- Same direction pairs (-->--> or <--<--)
7) At first I tried running the following command, which gave me slightly higher number than the samtools stats given number
$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| awk '{print $4,$2,$3,$6}' | paste - - \| awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else print $0;}' \| sed 's/-/<@@@/g' | sed 's/+/###>/g' | awk '{print $1,$2,$3,$5,$6,$7,$4$8}' \| column -t | awk '{print $NF}' | sort | uniq -c
27 <--()--> 6736 -->()<--
(27 + 6736)*2 =13,526 paired reads.
But, samtools stats gives (6735+24 ) * 2 = 13518 reads are paired reads. I tried understand deeply and scrutinizing the bed file I found that 3 reads pairs are overlapping at the same position for outward oriented pairs and 1 overlapping read pair from the inward oriented pairs.
From overlapping, I mean the coordinates of both reads are mapping around the same point.
These were problematic ones!
From overlapping, I mean the coordinates of both reads are mapping around the same point.
HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/1 8700 8798 HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/2 8700 8798 <-- --> HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/1 8702 8796 HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/2 8702 8794 <-- --> HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/1 8717 8814 HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/2 8717 8814 <-- -->
These were problematic ones!
8) So, I made slight change in the above command
which is same as samtools stats command.
I have checked these reads in a visualization tool called tablet. It shows the directions of the pairs when the bam file is loaded into the tool.
$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| awk '{print $4,$2,$3,$6}' | paste - - \| awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else if ( $2 == $6) next; else print $0;}' \| sed 's/-/<--/g' | sed 's/+/-->/g' \| awk '{print $1,$2,$3,$5,$6,$7,$4"()"$8}' \| column -t | awk '{print $NF}' \| sort | uniq -c 24 <--()--> 6735 -->()<--
which is same as samtools stats command.
I have checked these reads in a visualization tool called tablet. It shows the directions of the pairs when the bam file is loaded into the tool.
**Note:
The final script I used is not a one-liner to find the orientation of paired reads for bigger bam files, because other conditions have to be taken in consideration too. Such as overlapping pairs with different starting coordinates (in my script I used only same coordinates of both read pairs to report).
No comments:
Post a Comment