Thursday, September 8, 2016

Reapr - Sampling fragment coverage - didn't get any coverage


 While running Reapr  on my genome using for the long insert libraries I received the following error.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
$ Reapr_1.0.18/reapr scaffolds.fasta LIMP_8kb_long_mapped.bam LIMP_8kb_Reapr_out perfect
[REAPR pipeline] Running facheck
[REAPR pipeline] Running preprocess
[REAPR preprocess] sampling region scaffold1_size3169106:5458-6874.  Already sampled 0 bases
[REAPR preprocess] sampling region scaffold1_size3169106:6895-18716.  Already sampled 1417 bases
[REAPR preprocess] sampling region scaffold1_size3169106:30623-57811.  Already sampled 13239 bases
[REAPR preprocess] sampling region scaffold1_size3169106:57813-98842.  Already sampled 40428 bases
[REAPR preprocess] sampling region scaffold1_size3169106:98863-108371.  Already sampled 81458 bases
[REAPR preprocess] sampling region scaffold1_size3169106:108392-108786.  Already sampled 90967 bases
[REAPR preprocess] sampling region scaffold1_size3169106:108807-132036.  Already sampled 91362 bases
[REAPR preprocess] sampling region scaffold1_size3169106:153000-153621.  Already sampled 114592 bases
[REAPR preprocess] sampling region scaffold1_size3169106:153642-254312.  Already sampled 115214 bases
[REAPR preprocess] sampling region scaffold1_size3169106:254314-257386.  Already sampled 215885 bases
[REAPR preprocess] sampling region scaffold1_size3169106:257407-308464.  Already sampled 218958 bases
[REAPR preprocess] sampling region scaffold1_size3169106:309627-314663.  Already sampled 270016 bases
[REAPR preprocess] sampling region scaffold1_size3169106:314684-355810.  Already sampled 275053 bases
[REAPR preprocess] sampling region scaffold1_size3169106:355831-356388.  Already sampled 316180 bases
[REAPR preprocess] sampling region scaffold1_size3169106:356424-360553.  Already sampled 316738 bases
[REAPR preprocess] sampling region scaffold1_size3169106:362353-365556.  Already sampled 320868 bases
[REAPR preprocess] sampling region scaffold1_size3169106:365596-383718.  Already sampled 324072 bases
[REAPR preprocess] sampling region scaffold1_size3169106:393560-401076.  Already sampled 342195 bases
[REAPR preprocess] sampling region scaffold1_size3169106:401097-405358.  Already sampled 349712 bases
[REAPR preprocess] sampling region scaffold1_size3169106:405379-423388.  Already sampled 353974 bases
[REAPR preprocess] sampling region scaffold1_size3169106:423445-426298.  Already sampled 371984 bases
[REAPR preprocess] sampling region scaffold1_size3169106:426319-435419.  Already sampled 374838 bases
[REAPR preprocess] sampling region scaffold1_size3169106:435440-439265.  Already sampled 383939 bases
[REAPR preprocess] sampling region scaffold1_size3169106:439286-440463.  Already sampled 387765 bases
[REAPR preprocess] sampling region scaffold1_size3169106:440484-443034.  Already sampled 388943 bases
[REAPR preprocess] sampling region scaffold1_size3169106:443036-506052.  Already sampled 391494 bases
[REAPR preprocess] sampling region scaffold1_size3169106:506073-512313.  Already sampled 454511 bases
[REAPR preprocess] sampling region scaffold1_size3169106:512315-521665.  Already sampled 460752 bases
[REAPR preprocess] sampling region scaffold1_size3169106:526492-538394.  Already sampled 470103 bases
[REAPR preprocess] sampling region scaffold1_size3169106:538415-571424.  Already sampled 482006 bases
[REAPR preprocess] sampling region scaffold1_size3169106:571426-583083.  Already sampled 515016 bases
[REAPR preprocess] sampling region scaffold1_size3169106:583085-585679.  Already sampled 526674 bases
[REAPR preprocess] sampling region scaffold1_size3169106:585700-640937.  Already sampled 529269 bases
[REAPR preprocess] sampling region scaffold1_size3169106:640982-662781.  Already sampled 584507 bases
[REAPR preprocess] sampling region scaffold1_size3169106:662783-742983.  Already sampled 606307 bases
[REAPR preprocess] sampling region scaffold1_size3169106:743047-760078.  Already sampled 686508 bases
[REAPR preprocess] sampling region scaffold1_size3169106:765101-815949.  Already sampled 703540 bases
[REAPR preprocess] sampling region scaffold1_size3169106:815971-820741.  Already sampled 754389 bases
[REAPR preprocess] sampling region scaffold1_size3169106:820912-821293.  Already sampled 759160 bases
[REAPR preprocess] sampling region scaffold1_size3169106:821314-865486.  Already sampled 759542 bases
[REAPR preprocess] sampling region scaffold1_size3169106:865488-908573.  Already sampled 803715 bases
[REAPR preprocess] sampling region scaffold1_size3169106:909261-942195.  Already sampled 846801 bases
[REAPR preprocess] sampling region scaffold1_size3169106:942216-986026.  Already sampled 879736 bases
[REAPR preprocess] sampling region scaffold1_size3169106:986047-991833.  Already sampled 923547 bases
[REAPR preprocess] sampling region scaffold1_size3169106:991854-1003751.  Already sampled 929334 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1003772-1027843.  Already sampled 941232 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1029659-1034131.  Already sampled 965304 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1034152-1034594.  Already sampled 969777 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1034615-1057662.  Already sampled 970220 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1057683-1063521.  Already sampled 993268 bases
[REAPR preprocess] sampling region scaffold1_size3169106:1063542-1064434.  Already sampled 999107 bases
[REAPR preprocess] Something went wrong sampling fragment coverage - didn't get any coverage.
Mon Jul 18 11:42:50 SGT 2016


Solution that worked for me:


I changed the orientation of the mate pairs reads by reverse complementing before using REAPR since by default Reapr assumes that the orientation of the reads is FR. To perform RC on fastq files, I used fastx_reverse_complement. Check this page for more details.

Wednesday, September 7, 2016

Extracting flanking regions of a marker sequence

Tools used : Blast Alignment tool, Bedtools,

Step 1: As a first step, I blasted my marker sequences against the reference genome using following.

Note: Since, my marker sequences are very small in length ( 17- 26 nt), I used a blastn (not megablast) and higher e-value which in my case was 1000 and the results are listed in tabular format.

~/Documents/Blast/blastn  -task blastn  -query Markers.txt -db ~/db/Trichecus_manatus/Trichechus_manatus_Lat1.0_chrUnMasked -out Markers_vs_Trichechus_Unmasked.txt -evalue 1000 -outfmt "6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_descriptions 10 -num_alignments 10

Step 2: From the blast results, I chose a cut off to find if atleast 90% of the marker seqs are aligned to the reference.

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90'

Step 3: Now that I have the coordinates in bed format obtained from blast, I want to generate coordinates with the number of flanking bases before and after the marker (in my case it is 100+/- nt).

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $3,$12,$13}' | awk '{if ($2 > $3) print $1,($3-100),($2+100); else if($3 > $2) print $1,($2-100),($3+100);}' | sed 's/ /\t/g' >Markers.bed

$ head Markers.bed
gi|460718269|ref|NW_004444026.1|    8607288    8607509
gi|460712402|ref|NW_004445196.1|    1258    1482
gi|460713408|ref|NW_004444190.1|    1238289    1238513
gi|460718268|ref|NW_004444027.1|    4548875    4549099
gi|460713515|ref|NW_004444083.1|    1935238    1935453
gi|460718331|ref|NW_004443964.1|    21821180    21821395
gi|460726833|ref|NW_004443938.1|    8634324    8634539
gi|460713195|ref|NW_004444403.1|    45475    45689
gi|460713273|ref|NW_004444325.1|    69053    69267
gi|460713348|ref|NW_004444250.1|    98323    98537

Step 4: Using this Markers.bed file, I extracted the marker sequence along with its flanking regions using bedtools - getfasta.

 $ bedtools getfasta -fi ~/db/Trichecus_manatus/tma_ref_TriManLat1.0_chrUn.fa -bed Markers.bed -fo Trichechum_Markers_withFlankingRegions.fasta

The result in the Trichechum_Markers_withFlankingRegions.fasta

>gi|460718269|ref|NW_004444026.1|:8607288-8607509
TGTGTGTATGCCTAGCTGTATGTCTCTATCTGT........
>gi|460718269|ref|NW_004444026.1|:8607130-8607354
ATCTTCATCAGTAAGTACTTCAAGTGCTCTTCATT........
>gi|460712402|ref|NW_004445196.1|:1258-1482
GTATACTGAGTACTTTGAAGTCAGGAAAGACGTGACT........
>gi|460713408|ref|NW_004444190.1|:1238289-1238513
AAAGGGATATTACATGGTTTAAAATTAGGAAAGGCAC........
>gi|460718268|ref|NW_004444027.1|:4548875-4549099
GGATACTATGATTTCAAGTCAAGAAACGAGTGTGTCA........

The information in the header now only contains:

gi|460718269|ref|NW_004444026.1|:8607288-8607509
                 gene id                            coordinates 

But, I would like to add information of marker also in the header line like this:


Marker1#gi|460718269|ref|NW_004444026.1|:8607288-8607509
                               gene id                             coordinates


Step 5: So, by running the following commands a file with both marker information along with genome region and coordinates is generated.

 $ cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $1,$3,$12,$13}' | awk '{if ($3 > $4) print $1"#"$2":"($4-100)"-"($3+100); else if($4 > $3) print $1"#"$2":"($3-100)"-"($4+100);}' >Marker_Genome_Coords.txt

which looks like this:
$ head Marker_Genome_Coords.txt
TmaE11_AF223658_177–197_F#gi|460718269|ref|NW_004444026.1|:8607288-8607509
TmaE11_AF223658_R#gi|460718269|ref|NW_004444026.1|:8607130-8607354
TmaE11_AF223658_R#gi|460712402|ref|NW_004445196.1|:1258-1482
TmaE11_AF223658_R#gi|460713408|ref|NW_004444190.1|:1238289-1238513
TmaE11_AF223658_R#gi|460718268|ref|NW_004444027.1|:4548875-4549099
TmaE01_EF191343_272–280_F#gi|460713515|ref|NW_004444083.1|:1935238-1935453
TmaE01_EF191343_272–280_F#gi|460718277|ref|NW_004444018.1|:3212423-3212638
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:19459039-19459254
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:10019186-10019400
TmaE01_EF191343_272–280_F#gi|460718331|ref|NW_004443964.1|:21821180-21821395

Step 6: Using both these files, Marker_Genome_Coords.txt and Trichechum_Markers_withFlankingRegions.fasta, I ran the following perl script:

#!/usr/bin/perl
## add the marker information to the header line in the fasta file
open MARKERFILE,"Marker_Genome_Coords.txt";

%MarkerHash=();

 while(<MARKERFILE>)

    {

    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space

    @array=split("#",$_);

    $MarkerHash{$array[1]}="$array[0]";

    }

open FASTA,"Trichechum_Markers_withFlankingRegions.fasta";

 

    while(<FASTA>)

    {

    $id=$_;

    $seq=<FASTA>;

    if($id=~/^\>(.+)\s*$/ && (exists($MarkerHash{$1})))

        {

        $alt_id="$MarkerHash{$1}#$1";

        print "\>$alt_id\n$seq";

        }

    }

    close(MARKERFILE);

    close(FASTA);

This finally gave me a fasta file with marker, genome region and corresponding coordinates of flanking regions.

>TmaA03_AF223651_R#gi|460713300|ref|NW_004444298.1|:346653-346872
CCTGGCTCATCAGT....
>TmaA03_AF223651_R#gi|460718350|ref|NW_004443945.1|:25974611-25974830
TCAGATTTTTAGTGG....
>TmaA03_AF223651_R#gi|460713360|ref|NW_004444238.1|:217351-217569
TCTTCTTTTATTTAAAT....

So, thats it!