Thursday, September 8, 2016

Reapr - Sampling fragment coverage - didn't get any coverage


 While running Reapr  on my genome using for the long insert libraries I received the following error.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
$ Reapr_1.0.18/reapr scaffolds.fasta LIMP_8kb_long_mapped.bam LIMP_8kb_Reapr_out perfect
[REAPR pipeline] Running facheck
[REAPR pipeline] Running preprocess
[REAPR preprocess] sampling region scaffold1_size3169106:5458-6874.  Already sampled 0 bases
[REAPR preprocess] sampling region scaffold1_size3169106:6895-18716.  Already sampled 1417 bases
[REAPR preprocess] sampling region scaffold1_size3169106:30623-57811.  Already sampled 13239 bases
[REAPR preprocess] sampling region scaffold1_size3169106:57813-98842.  Already sampled 40428 bases
[REAPR preprocess] sampling region scaffold1_size3169106:98863-108371.  Already sampled 81458 bases
[REAPR preprocess] sampling region scaffold1_size3169106:108392-108786.  Already sampled 90967 bases
[REAPR preprocess] sampling region scaffold1_size3169106:108807-132036.  Already sampled 91362 bases
[REAPR preprocess] sampling region scaffold1_size3169106:153000-153621.  Already sampled 114592 bases
[REAPR preprocess] sampling region scaffold1_size3169106:153642-254312.  Already sampled 115214 bases
[REAPR preprocess] sampling region scaffold1_size3169106:254314-257386.  Already sampled 215885 bases
[REAPR preprocess] sampling region scaffold1_size3169106:257407-308464.  Already sampled 218958 bases
[REAPR preprocess] sampling region scaffold1_size3169106:309627-314663.  Already sampled 270016 bases
[REAPR preprocess] sampling region scaffold1_size3169106:314684-355810.  Already sampled 275053 bases
[REAPR preprocess] sampling region scaffold1_size3169106:355831-356388.  Already sampled 316180 bases
[REAPR preprocess] sampling region scaffold1_size3169106:356424-360553.  Already sampled 316738 bases
[REAPR preprocess] sampling region scaffold1_size3169106:362353-365556.  Already sampled 320868 bases
[REAPR preprocess] sampling region scaffold1_size3169106:365596-383718.  Already sampled 324072 bases
[REAPR preprocess] sampling region scaffold1_size3169106:393560-401076.  Already sampled 342195 bases
[REAPR preprocess] sampling region scaffold1_size3169106:401097-405358.  Already sampled 349712 bases
[REAPR preprocess] sampling region scaffold1_size3169106:405379-423388.  Already sampled 353974 bases
[REAPR preprocess] sampling region scaffold1_size3169106:423445-426298.  Already sampled 371984 bases
[REAPR preprocess] sampling region scaffold1_size3169106:426319-435419.  Already sampled 374838 bases
[REAPR preprocess] sampling region scaffold1_size3169106:435440-439265.  Already sampled 383939 bases
[REAPR preprocess] sampling region scaffold1_size3169106:439286-440463.  Already sampled 387765 bases
[REAPR preprocess] sampling region scaffold1_size3169106:440484-443034.  Already sampled 388943 bases
[REAPR preprocess] sampling region scaffold1_size3169106:443036-506052.  Already sampled 391494 bases
[REAPR preprocess] sampling region scaffold1_size3169106:506073-512313.  Already sampled 454511 bases
[REAPR preprocess] sampling region scaffold1_size3169106:512315-521665.  Already sampled 460752 bases
[REAPR preprocess] sampling region scaffold1_size3169106:526492-538394.  Already sampled 470103 bases
[REAPR preprocess] sampling region scaffold1_size3169106:538415-571424.  Already sampled 482006 bases
[REAPR preprocess] sampling region scaffold1_size3169106:571426-583083.  Already sampled 515016 bases
[REAPR preprocess] sampling region scaffold1_size3169106:583085-585679.  Already sampled 526674 bases
[REAPR preprocess] sampling region scaffold1_size3169106:585700-640937.  Already sampled 529269 bases
[REAPR preprocess] sampling region scaffold1_size3169106:640982-662781.  Already sampled 584507 bases
[REAPR preprocess] sampling region scaffold1_size3169106:662783-742983.  Already sampled 606307 bases
[REAPR preprocess] sampling region scaffold1_size3169106:743047-760078.  Already sampled 686508 bases
[REAPR preprocess] sampling region scaffold1_size3169106:765101-815949.  Already sampled 703540 bases
[REAPR preprocess] sampling region scaffold1_size3169106:815971-820741.  Already sampled 754389 bases
[REAPR preprocess] sampling region scaffold1_size3169106:820912-821293.  Already sampled 759160 bases
[REAPR preprocess] sampling region scaffold1_size3169106:821314-865486.  Already sampled 759542 bases
[REAPR preprocess] sampling region scaffold1_size3169106:865488-908573.  Already sampled 803715 bases
[REAPR preprocess] sampling region scaffold1_size3169106:909261-942195.  Already sampled 846801 bases
[REAPR preprocess] sampling region scaffold1_size3169106:942216-986026.  Already sampled 879736 bases
[REAPR preprocess] sampling region scaffold1_size3169106:986047-991833.  Already sampled 923547 bases
[REAPR preprocess] sampling region scaffold1_size3169106:991854-1003751.  Already sampled 929334 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1003772-1027843.  Already sampled 941232 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1029659-1034131.  Already sampled 965304 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1034152-1034594.  Already sampled 969777 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1034615-1057662.  Already sampled 970220 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1057683-1063521.  Already sampled 993268 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1063542-1064434.  Already sampled 999107 bases
[REAPR preprocess] Something went wrong sampling fragment coverage - didn't get any coverage.
Mon Jul 18 11:42:50 SGT 2016


Solution that worked for me:


I changed the orientation of the mate pairs reads by reverse complementing before using REAPR since by default Reapr assumes that the orientation of the reads is FR. To perform RC on fastq files, I used fastx_reverse_complement. Check this page for more details.

Wednesday, September 7, 2016

Extracting flanking regions of a marker sequence

Tools used : Blast Alignment tool, Bedtools,

Step 1: As a first step, I blasted my marker sequences against the reference genome using following.

Note: Since, my marker sequences are very small in length ( 17- 26 nt), I used a blastn (not megablast) and higher e-value which in my case was 1000 and the results are listed in tabular format.

~/Documents/Blast/blastn  -task blastn  -query Markers.txt -db ~/db/Trichecus_manatus/Trichechus_manatus_Lat1.0_chrUnMasked -out Markers_vs_Trichechus_Unmasked.txt -evalue 1000 -outfmt "6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_descriptions 10 -num_alignments 10

Step 2: From the blast results, I chose a cut off to find if atleast 90% of the marker seqs are aligned to the reference.

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90'

Step 3: Now that I have the coordinates in bed format obtained from blast, I want to generate coordinates with the number of flanking bases before and after the marker (in my case it is 100+/- nt).

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $3,$12,$13}' | awk '{if ($2 > $3) print $1,($3-100),($2+100); else if($3 > $2) print $1,($2-100),($3+100);}' | sed 's/ /\t/g' >Markers.bed

$ head Markers.bed
gi|460718269|ref|NW_004444026.1|    8607288    8607509
gi|460712402|ref|NW_004445196.1|    1258    1482
gi|460713408|ref|NW_004444190.1|    1238289    1238513
gi|460718268|ref|NW_004444027.1|    4548875    4549099
gi|460713515|ref|NW_004444083.1|    1935238    1935453
gi|460718331|ref|NW_004443964.1|    21821180    21821395
gi|460726833|ref|NW_004443938.1|    8634324    8634539
gi|460713195|ref|NW_004444403.1|    45475    45689
gi|460713273|ref|NW_004444325.1|    69053    69267
gi|460713348|ref|NW_004444250.1|    98323    98537

Step 4: Using this Markers.bed file, I extracted the marker sequence along with its flanking regions using bedtools - getfasta.

 $ bedtools getfasta -fi ~/db/Trichecus_manatus/tma_ref_TriManLat1.0_chrUn.fa -bed Markers.bed -fo Trichechum_Markers_withFlankingRegions.fasta

The result in the Trichechum_Markers_withFlankingRegions.fasta

>gi|460718269|ref|NW_004444026.1|:8607288-8607509
TGTGTGTATGCCTAGCTGTATGTCTCTATCTGT........
>gi|460718269|ref|NW_004444026.1|:8607130-8607354
ATCTTCATCAGTAAGTACTTCAAGTGCTCTTCATT........
>gi|460712402|ref|NW_004445196.1|:1258-1482
GTATACTGAGTACTTTGAAGTCAGGAAAGACGTGACT........
>gi|460713408|ref|NW_004444190.1|:1238289-1238513
AAAGGGATATTACATGGTTTAAAATTAGGAAAGGCAC........
>gi|460718268|ref|NW_004444027.1|:4548875-4549099
GGATACTATGATTTCAAGTCAAGAAACGAGTGTGTCA........

The information in the header now only contains:

gi|460718269|ref|NW_004444026.1|:8607288-8607509
                 gene id                            coordinates 

But, I would like to add information of marker also in the header line like this:


Marker1#gi|460718269|ref|NW_004444026.1|:8607288-8607509
                               gene id                             coordinates


Step 5: So, by running the following commands a file with both marker information along with genome region and coordinates is generated.

 $ cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $1,$3,$12,$13}' | awk '{if ($3 > $4) print $1"#"$2":"($4-100)"-"($3+100); else if($4 > $3) print $1"#"$2":"($3-100)"-"($4+100);}' >Marker_Genome_Coords.txt

which looks like this:
$ head Marker_Genome_Coords.txt
TmaE11_AF223658_177–197_F#gi|460718269|ref|NW_004444026.1|:8607288-8607509
TmaE11_AF223658_R#gi|460718269|ref|NW_004444026.1|:8607130-8607354
TmaE11_AF223658_R#gi|460712402|ref|NW_004445196.1|:1258-1482
TmaE11_AF223658_R#gi|460713408|ref|NW_004444190.1|:1238289-1238513
TmaE11_AF223658_R#gi|460718268|ref|NW_004444027.1|:4548875-4549099
TmaE01_EF191343_272–280_F#gi|460713515|ref|NW_004444083.1|:1935238-1935453
TmaE01_EF191343_272–280_F#gi|460718277|ref|NW_004444018.1|:3212423-3212638
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:19459039-19459254
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:10019186-10019400
TmaE01_EF191343_272–280_F#gi|460718331|ref|NW_004443964.1|:21821180-21821395

Step 6: Using both these files, Marker_Genome_Coords.txt and Trichechum_Markers_withFlankingRegions.fasta, I ran the following perl script:

#!/usr/bin/perl
## add the marker information to the header line in the fasta file
open MARKERFILE,"Marker_Genome_Coords.txt";

%MarkerHash=();

 while(<MARKERFILE>)

    {

    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space

    @array=split("#",$_);

    $MarkerHash{$array[1]}="$array[0]";

    }

open FASTA,"Trichechum_Markers_withFlankingRegions.fasta";

 

    while(<FASTA>)

    {

    $id=$_;

    $seq=<FASTA>;

    if($id=~/^\>(.+)\s*$/ && (exists($MarkerHash{$1})))

        {

        $alt_id="$MarkerHash{$1}#$1";

        print "\>$alt_id\n$seq";

        }

    }

    close(MARKERFILE);

    close(FASTA);

This finally gave me a fasta file with marker, genome region and corresponding coordinates of flanking regions.

>TmaA03_AF223651_R#gi|460713300|ref|NW_004444298.1|:346653-346872
CCTGGCTCATCAGT....
>TmaA03_AF223651_R#gi|460718350|ref|NW_004443945.1|:25974611-25974830
TCAGATTTTTAGTGG....
>TmaA03_AF223651_R#gi|460713360|ref|NW_004444238.1|:217351-217569
TCTTCTTTTATTTAAAT....

So, thats it!

Monday, August 22, 2016

Pilon Error - Inappropriate call if not paired read

I was running Pilon to improve my genome assembly using pilon-1.16.jar. 

time nohup java -jar /data/software/pilon-1.16.jar --genome genome.fasta --frags SIPE1_filtered_sorted.bam --frags SIPE2_filtered_sorted.bam --frags SIPE3_filtered_sorted.bam --frags SIPE4_filtered_sorted.bam --frags SIPE5_filtered_sorted.bam --frags SIPE6_filtered_sorted.bam --jumps LIMP1_filtered_sorted.bam --jumps LIMP4c_filtered_sorted.bam --output Pilon_Output --changes --threads 22 --verbose &

Pilon version 1.16 Mon Dec 7 09:45:45 2015 -0500

Genome: genome.fasta
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.IllegalStateException: Inappropriate call if not paired read
    at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:648)
    at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:656)
    at org.broadinstitute.pilon.BamFile$$anonfun$scan$1.apply(BamFile.scala:273)
    at org.broadinstitute.pilon.BamFile$$anonfun$scan$1.apply(BamFile.scala:268)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at org.broadinstitute.pilon.BamFile.scan(BamFile.scala:268)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$4.apply(GenomeFile.scala:92)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$4.apply(GenomeFile.scala:92)
    at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
    at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
    at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
    at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Elsewhere I read files with reads mapped in different directions is causing the problem. So, I tried Pilon1.16 with only short insert libraries direction, it worked for a bit of time but eventually failed!

Since, I wanted to run pilon using both paired end and mate-pair libraries, I tried downloading the newest version of Pilonv1.18 which ALSO DID NOT SOLVE the above error.

Solution that worked for me:

So after exchannging few mails with the author, I finally extracted the reads whose pairs also mapped elsewhere in the genome and generated a BAM file. This time alone Pilon worked and still running without error! Let us see..if it will last! :)

I used the following command from the samtools.

samtools view -b -f 0x2 LIMP1.bam | samtools sort - LIMP1_mappedPairs_sorted.bam

Wednesday, August 10, 2016

Comparing bamcheck and samtools stats - finding inward and outward oriented read pairs from bam file

In addition to the biostar post, I am adding more details in this blog post.

I wanted to find out the inward and outward orientation pairs from a bam file. So, I checked a few tools. Two of them were bamcheck and samtools stats. The reason for doing so, if the bam shows that there are more number of outward oriented pairs, that means there are higher chances that the region has some problem associated with it.

But, I was stuck because both these tools gave different outputs. So, I myself wanted to find out which is one more reliable.





This subsequently led me run the following steps.

1) Split the bam into smaller bam files, (1 scaffold and its associated mapped reads)

$ bamtools split -in 500bp_insert_mapping_subset2_sorted.bam -reference

$ ls
500bp_insert_mapping_subset2_sorted.bam                   500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam

2) Count the inward and oriented pairs, using both bamcheck 

$ for i in *bam; do echo "$i"; bamcheck $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  95506  SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2801   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2473   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2844   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2620   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3212   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3543   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3018   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3530   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3093   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2951   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3242   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2252   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3876   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3718   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4269   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2493   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3140   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4837   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6686   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3145   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2729   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2880   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4305   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1814   SN  outward  oriented  pairs:  0

2) Count the inward and oriented pairs, using samtools stats


$ for i in *bam; do echo "$i"; ~/Documents/samtools-1.2/samtools stats $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  96131  SN  outward  oriented  pairs:  295
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2807   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2474   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2845   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3221   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3553   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3020   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3677   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3097   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2961   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1548   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3251   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2272   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3727   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4279   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2850   SN  outward  oriented  pairs:  227
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3176   SN  outward  oriented  pairs:  14
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4841   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6735   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2679   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3150   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2625   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2732   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4312   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1830   SN  outward  oriented  pairs:  6

3) I selected this small BAM file: 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam

Bamcheck stats

$ bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam | head -28
# This file was produced by bamcheck (2012-04-24)
# The command line was:  bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN sequences: 14606
SN is paired: 1
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads unmapped: 0
SN reads unpaired: 1234
SN reads paired: 13372
SN reads duplicated: 0
SN reads MQ0: 9217
SN total length: 1437208
SN bases mapped: 1437208
SN bases mapped (cigar): 1436380
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0
SN error rate: 0.000000e+00
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 497.2
SN insert size standard deviation: 29.1
SN inward oriented pairs: 6686
SN outward oriented pairs: 0

Samtools stats

$ samtools stats 500bp_insert_mapping_subset2_sorted.REF__unitig_2151_outties_filtered.bam | head -37
SN raw total sequences: 14606
SN filtered sequences: 0
SN sequences: 14606
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads mapped and paired: 13526 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 13372 # proper-pair bit set
SN reads paired: 13526 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 9217 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 1437208 # ignores clipping
SN bases mapped: 1437208 # ignores clipping
SN bases mapped (cigar): 1436380 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0 # from NM fields
SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar)
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 491.7
SN insert size standard deviation: 29.4
SN inward oriented pairs: 6735
SN outward oriented pairs: 24

Observe here, (6735+24)*2 != 13,526 but equals to 13,518 paired reads. (check from step 7)

4) First take out all the mapped read pairs (irrespective of direction) from the bam and convert into sorted bed file. 

$ fgrep -f <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam \|
cut -f4 | cut -d "_" -f1 \|
sort | uniq -c | sort -n \| 
fgrep '      2' | sed 's/      2 //g') <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam) \|
sort -k4,4 >500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed

5) Checking the numbers of reads paired

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| 
awk '{print $4,$2,$3,$6}' \| 
paste - - | wc -l

6763 (reads paired: 13526) - This number is same as the number generated by samtools stats see above SN reads paired: 13526.

But the bamcheck was giving a different result (SN reads paired: 13372)

6) Now, from the read pairs, I want to know how many are 

  1. Inward oriented pairs (--> <--)
  2. Outward oriented pairs (<-- -->)
  3. Same direction pairs (-->--> or <--<--)
7) At first I tried running the following command, which gave me slightly higher number than the samtools stats given number

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else print $0;}' \|
sed 's/-/<@@@/g' | sed 's/+/###>/g' | awk '{print $1,$2,$3,$5,$6,$7,$4$8}' \|
column -t | awk '{print $NF}' | sort | uniq -c

    27  <--()-->
   6736 -->()<--

(27 + 6736)*2 =13,526 paired reads.

But, samtools stats gives (6735+24 ) * 2 = 13518 reads are paired reads. I tried understand deeply and scrutinizing the bed file I found that 3 reads pairs are overlapping at the same position for outward oriented pairs and 1 overlapping read pair from the inward oriented pairs.

From overlapping, I mean the coordinates of both reads are mapping around the same point.

HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/1   8700   8798   HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/2   8700   8798   <-- -->
HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/1   8702   8796   HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/2   8702   8794   <-- -->
HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/1   8717   8814   HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/2   8717   8814   <-- -->

These were problematic ones!

8) So, I made slight change in the above command


$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else if ( $2 == $6) next; else print $0;}' \|
sed 's/-/<--/g' | sed 's/+/-->/g' \|
awk '{print $1,$2,$3,$5,$6,$7,$4"()"$8}' \|
column -t | awk '{print $NF}' \|
sort | uniq -c

     24 <--()-->
   6735 -->()<--

which is same as samtools stats command.

I have checked these reads in a visualization tool called tablet. It shows the directions of the pairs when the bam file is loaded into the tool.

**Note:

The final script I used is not a one-liner to find the orientation of paired reads for bigger bam files, because other conditions have to be taken in consideration too. Such as overlapping pairs with different starting coordinates (in my script I used only same coordinates of both read pairs to report).

Friday, August 5, 2016

NCBI Blast error - with Pipe symbols in fasta header

I am running NCBI blast version 2.2.28 for my analysis.  I encountered an error while making database for my local blast.

Error:

$ ~/Documents/Blast/makeblastdb -in HSA_GRCh37.p13_EnsGenes.fasta -dbtype nucl -out HSA_GRCh37.p13_EnsGenes.BlastDB -parse_seqids

Building a new DB, current time: 08/05/2016 11:11:15
New DB name:   HSA_GRCh37.p13_EnsGenes.BlastDB
New DB title:  HSA_GRCh37.p13_EnsGenes.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B

No volumes were created because no sequences were found.


Error: NCBI C++ Exception:

    "/am/ncbiapdata/release/blast/src/2.2.25/Linux64-Suse-icc/c++/ICC1010-ReleaseMT64--Linux64-Suse-icc/../src/objects/seq/../seqloc/Seq_id.cpp", line 1637: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type ENSG00000003137

Fasta file was okay but the header is the problem. It contained pipe symbol "|". So replaced all the pipes with an underscore "_"


$ sed -i 's/|/_/g' HSA_GRCh37.p13_EnsGenes.fasta

Database was created succesfully.

$ ~/Documents/Blast/makeblastdb -in HSA_GRCh37.p13_EnsGenes.fasta -dbtype nucl -out HSA_GRCh37.p13_EnsGenes.BlastDB -parse_seqids


Building a new DB, current time: 08/05/2016 11:12:05

New DB name:   HSA_GRCh37.p13_EnsGenes.BlastDB
New DB title:  HSA_GRCh37.p13_EnsGenes.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 215647 sequences in 15.894 seconds.