Monday, August 22, 2016

Pilon Error - Inappropriate call if not paired read

I was running Pilon to improve my genome assembly using pilon-1.16.jar. 

time nohup java -jar /data/software/pilon-1.16.jar --genome genome.fasta --frags SIPE1_filtered_sorted.bam --frags SIPE2_filtered_sorted.bam --frags SIPE3_filtered_sorted.bam --frags SIPE4_filtered_sorted.bam --frags SIPE5_filtered_sorted.bam --frags SIPE6_filtered_sorted.bam --jumps LIMP1_filtered_sorted.bam --jumps LIMP4c_filtered_sorted.bam --output Pilon_Output --changes --threads 22 --verbose &

Pilon version 1.16 Mon Dec 7 09:45:45 2015 -0500

Genome: genome.fasta
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.IllegalStateException: Inappropriate call if not paired read
    at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:648)
    at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:656)
    at org.broadinstitute.pilon.BamFile$$anonfun$scan$1.apply(BamFile.scala:273)
    at org.broadinstitute.pilon.BamFile$$anonfun$scan$1.apply(BamFile.scala:268)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at org.broadinstitute.pilon.BamFile.scan(BamFile.scala:268)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$4.apply(GenomeFile.scala:92)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$4.apply(GenomeFile.scala:92)
    at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
    at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
    at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
    at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Elsewhere I read files with reads mapped in different directions is causing the problem. So, I tried Pilon1.16 with only short insert libraries direction, it worked for a bit of time but eventually failed!

Since, I wanted to run pilon using both paired end and mate-pair libraries, I tried downloading the newest version of Pilonv1.18 which ALSO DID NOT SOLVE the above error.

Solution that worked for me:

So after exchannging few mails with the author, I finally extracted the reads whose pairs also mapped elsewhere in the genome and generated a BAM file. This time alone Pilon worked and still running without error! Let us see..if it will last! :)

I used the following command from the samtools.

samtools view -b -f 0x2 LIMP1.bam | samtools sort - LIMP1_mappedPairs_sorted.bam

Wednesday, August 10, 2016

Comparing bamcheck and samtools stats - finding inward and outward oriented read pairs from bam file

In addition to the biostar post, I am adding more details in this blog post.

I wanted to find out the inward and outward orientation pairs from a bam file. So, I checked a few tools. Two of them were bamcheck and samtools stats. The reason for doing so, if the bam shows that there are more number of outward oriented pairs, that means there are higher chances that the region has some problem associated with it.

But, I was stuck because both these tools gave different outputs. So, I myself wanted to find out which is one more reliable.





This subsequently led me run the following steps.

1) Split the bam into smaller bam files, (1 scaffold and its associated mapped reads)

$ bamtools split -in 500bp_insert_mapping_subset2_sorted.bam -reference

$ ls
500bp_insert_mapping_subset2_sorted.bam                   500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam

2) Count the inward and oriented pairs, using both bamcheck 

$ for i in *bam; do echo "$i"; bamcheck $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  95506  SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2801   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2473   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2844   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2620   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3212   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3543   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3018   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3530   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3093   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2951   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3242   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2252   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3876   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3718   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4269   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2493   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3140   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4837   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6686   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3145   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2729   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2880   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4305   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1814   SN  outward  oriented  pairs:  0

2) Count the inward and oriented pairs, using samtools stats


$ for i in *bam; do echo "$i"; ~/Documents/samtools-1.2/samtools stats $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  96131  SN  outward  oriented  pairs:  295
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2807   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2474   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2845   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3221   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3553   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3020   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3677   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3097   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2961   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1548   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3251   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2272   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3727   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4279   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2850   SN  outward  oriented  pairs:  227
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3176   SN  outward  oriented  pairs:  14
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4841   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6735   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2679   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3150   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2625   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2732   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4312   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1830   SN  outward  oriented  pairs:  6

3) I selected this small BAM file: 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam

Bamcheck stats

$ bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam | head -28
# This file was produced by bamcheck (2012-04-24)
# The command line was:  bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN sequences: 14606
SN is paired: 1
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads unmapped: 0
SN reads unpaired: 1234
SN reads paired: 13372
SN reads duplicated: 0
SN reads MQ0: 9217
SN total length: 1437208
SN bases mapped: 1437208
SN bases mapped (cigar): 1436380
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0
SN error rate: 0.000000e+00
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 497.2
SN insert size standard deviation: 29.1
SN inward oriented pairs: 6686
SN outward oriented pairs: 0

Samtools stats

$ samtools stats 500bp_insert_mapping_subset2_sorted.REF__unitig_2151_outties_filtered.bam | head -37
SN raw total sequences: 14606
SN filtered sequences: 0
SN sequences: 14606
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads mapped and paired: 13526 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 13372 # proper-pair bit set
SN reads paired: 13526 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 9217 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 1437208 # ignores clipping
SN bases mapped: 1437208 # ignores clipping
SN bases mapped (cigar): 1436380 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0 # from NM fields
SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar)
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 491.7
SN insert size standard deviation: 29.4
SN inward oriented pairs: 6735
SN outward oriented pairs: 24

Observe here, (6735+24)*2 != 13,526 but equals to 13,518 paired reads. (check from step 7)

4) First take out all the mapped read pairs (irrespective of direction) from the bam and convert into sorted bed file. 

$ fgrep -f <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam \|
cut -f4 | cut -d "_" -f1 \|
sort | uniq -c | sort -n \| 
fgrep '      2' | sed 's/      2 //g') <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam) \|
sort -k4,4 >500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed

5) Checking the numbers of reads paired

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| 
awk '{print $4,$2,$3,$6}' \| 
paste - - | wc -l

6763 (reads paired: 13526) - This number is same as the number generated by samtools stats see above SN reads paired: 13526.

But the bamcheck was giving a different result (SN reads paired: 13372)

6) Now, from the read pairs, I want to know how many are 

  1. Inward oriented pairs (--> <--)
  2. Outward oriented pairs (<-- -->)
  3. Same direction pairs (-->--> or <--<--)
7) At first I tried running the following command, which gave me slightly higher number than the samtools stats given number

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else print $0;}' \|
sed 's/-/<@@@/g' | sed 's/+/###>/g' | awk '{print $1,$2,$3,$5,$6,$7,$4$8}' \|
column -t | awk '{print $NF}' | sort | uniq -c

    27  <--()-->
   6736 -->()<--

(27 + 6736)*2 =13,526 paired reads.

But, samtools stats gives (6735+24 ) * 2 = 13518 reads are paired reads. I tried understand deeply and scrutinizing the bed file I found that 3 reads pairs are overlapping at the same position for outward oriented pairs and 1 overlapping read pair from the inward oriented pairs.

From overlapping, I mean the coordinates of both reads are mapping around the same point.

HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/1   8700   8798   HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/2   8700   8798   <-- -->
HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/1   8702   8796   HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/2   8702   8794   <-- -->
HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/1   8717   8814   HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/2   8717   8814   <-- -->

These were problematic ones!

8) So, I made slight change in the above command


$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else if ( $2 == $6) next; else print $0;}' \|
sed 's/-/<--/g' | sed 's/+/-->/g' \|
awk '{print $1,$2,$3,$5,$6,$7,$4"()"$8}' \|
column -t | awk '{print $NF}' \|
sort | uniq -c

     24 <--()-->
   6735 -->()<--

which is same as samtools stats command.

I have checked these reads in a visualization tool called tablet. It shows the directions of the pairs when the bam file is loaded into the tool.

**Note:

The final script I used is not a one-liner to find the orientation of paired reads for bigger bam files, because other conditions have to be taken in consideration too. Such as overlapping pairs with different starting coordinates (in my script I used only same coordinates of both read pairs to report).

Friday, August 5, 2016

NCBI Blast error - with Pipe symbols in fasta header

I am running NCBI blast version 2.2.28 for my analysis.  I encountered an error while making database for my local blast.

Error:

$ ~/Documents/Blast/makeblastdb -in HSA_GRCh37.p13_EnsGenes.fasta -dbtype nucl -out HSA_GRCh37.p13_EnsGenes.BlastDB -parse_seqids

Building a new DB, current time: 08/05/2016 11:11:15
New DB name:   HSA_GRCh37.p13_EnsGenes.BlastDB
New DB title:  HSA_GRCh37.p13_EnsGenes.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B

No volumes were created because no sequences were found.


Error: NCBI C++ Exception:

    "/am/ncbiapdata/release/blast/src/2.2.25/Linux64-Suse-icc/c++/ICC1010-ReleaseMT64--Linux64-Suse-icc/../src/objects/seq/../seqloc/Seq_id.cpp", line 1637: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type ENSG00000003137

Fasta file was okay but the header is the problem. It contained pipe symbol "|". So replaced all the pipes with an underscore "_"


$ sed -i 's/|/_/g' HSA_GRCh37.p13_EnsGenes.fasta

Database was created succesfully.

$ ~/Documents/Blast/makeblastdb -in HSA_GRCh37.p13_EnsGenes.fasta -dbtype nucl -out HSA_GRCh37.p13_EnsGenes.BlastDB -parse_seqids


Building a new DB, current time: 08/05/2016 11:12:05

New DB name:   HSA_GRCh37.p13_EnsGenes.BlastDB
New DB title:  HSA_GRCh37.p13_EnsGenes.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 215647 sequences in 15.894 seconds.


Monday, August 1, 2016

Categorizing the mapped reads



The mapped reads primarily fall in above four categories. 1) Properly paired, 2) Improperly paired, Broken reads [3) mates map on different scaffolds, 4) only one of the mate maps]

For my convenience, I have used the samtools commands in a script to categorize the reads mapped in pairs (PE reads) on the reference. Please find it here.

Running the script will list out the number of reads. But it takes more time with very large BAM files.

./SAM_stats.sh file_coordinate_sorted.bam
Properly paired with correct insert distances: 191012
Wrong insert Distance or mates inverted: 1858
Broken Reads: 6284
Out of 6284 broken reads 6100 does not have a mate mapped
Out of 6284 broken reads, 92 pairs fall on different contigs