Wednesday, September 7, 2016

Extracting flanking regions of a marker sequence

Tools used : Blast Alignment tool, Bedtools,

Step 1: As a first step, I blasted my marker sequences against the reference genome using following.

Note: Since, my marker sequences are very small in length ( 17- 26 nt), I used a blastn (not megablast) and higher e-value which in my case was 1000 and the results are listed in tabular format.

~/Documents/Blast/blastn  -task blastn  -query Markers.txt -db ~/db/Trichecus_manatus/Trichechus_manatus_Lat1.0_chrUnMasked -out Markers_vs_Trichechus_Unmasked.txt -evalue 1000 -outfmt "6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_descriptions 10 -num_alignments 10

Step 2: From the blast results, I chose a cut off to find if atleast 90% of the marker seqs are aligned to the reference.

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90'

Step 3: Now that I have the coordinates in bed format obtained from blast, I want to generate coordinates with the number of flanking bases before and after the marker (in my case it is 100+/- nt).

cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $3,$12,$13}' | awk '{if ($2 > $3) print $1,($3-100),($2+100); else if($3 > $2) print $1,($2-100),($3+100);}' | sed 's/ /\t/g' >Markers.bed

$ head Markers.bed
gi|460718269|ref|NW_004444026.1|    8607288    8607509
gi|460712402|ref|NW_004445196.1|    1258    1482
gi|460713408|ref|NW_004444190.1|    1238289    1238513
gi|460718268|ref|NW_004444027.1|    4548875    4549099
gi|460713515|ref|NW_004444083.1|    1935238    1935453
gi|460718331|ref|NW_004443964.1|    21821180    21821395
gi|460726833|ref|NW_004443938.1|    8634324    8634539
gi|460713195|ref|NW_004444403.1|    45475    45689
gi|460713273|ref|NW_004444325.1|    69053    69267
gi|460713348|ref|NW_004444250.1|    98323    98537

Step 4: Using this Markers.bed file, I extracted the marker sequence along with its flanking regions using bedtools - getfasta.

 $ bedtools getfasta -fi ~/db/Trichecus_manatus/tma_ref_TriManLat1.0_chrUn.fa -bed Markers.bed -fo Trichechum_Markers_withFlankingRegions.fasta

The result in the Trichechum_Markers_withFlankingRegions.fasta

>gi|460718269|ref|NW_004444026.1|:8607288-8607509
TGTGTGTATGCCTAGCTGTATGTCTCTATCTGT........
>gi|460718269|ref|NW_004444026.1|:8607130-8607354
ATCTTCATCAGTAAGTACTTCAAGTGCTCTTCATT........
>gi|460712402|ref|NW_004445196.1|:1258-1482
GTATACTGAGTACTTTGAAGTCAGGAAAGACGTGACT........
>gi|460713408|ref|NW_004444190.1|:1238289-1238513
AAAGGGATATTACATGGTTTAAAATTAGGAAAGGCAC........
>gi|460718268|ref|NW_004444027.1|:4548875-4549099
GGATACTATGATTTCAAGTCAAGAAACGAGTGTGTCA........

The information in the header now only contains:

gi|460718269|ref|NW_004444026.1|:8607288-8607509
                 gene id                            coordinates 

But, I would like to add information of marker also in the header line like this:


Marker1#gi|460718269|ref|NW_004444026.1|:8607288-8607509
                               gene id                             coordinates


Step 5: So, by running the following commands a file with both marker information along with genome region and coordinates is generated.

 $ cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $1,$3,$12,$13}' | awk '{if ($3 > $4) print $1"#"$2":"($4-100)"-"($3+100); else if($4 > $3) print $1"#"$2":"($3-100)"-"($4+100);}' >Marker_Genome_Coords.txt

which looks like this:
$ head Marker_Genome_Coords.txt
TmaE11_AF223658_177–197_F#gi|460718269|ref|NW_004444026.1|:8607288-8607509
TmaE11_AF223658_R#gi|460718269|ref|NW_004444026.1|:8607130-8607354
TmaE11_AF223658_R#gi|460712402|ref|NW_004445196.1|:1258-1482
TmaE11_AF223658_R#gi|460713408|ref|NW_004444190.1|:1238289-1238513
TmaE11_AF223658_R#gi|460718268|ref|NW_004444027.1|:4548875-4549099
TmaE01_EF191343_272–280_F#gi|460713515|ref|NW_004444083.1|:1935238-1935453
TmaE01_EF191343_272–280_F#gi|460718277|ref|NW_004444018.1|:3212423-3212638
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:19459039-19459254
TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:10019186-10019400
TmaE01_EF191343_272–280_F#gi|460718331|ref|NW_004443964.1|:21821180-21821395

Step 6: Using both these files, Marker_Genome_Coords.txt and Trichechum_Markers_withFlankingRegions.fasta, I ran the following perl script:

#!/usr/bin/perl
## add the marker information to the header line in the fasta file
open MARKERFILE,"Marker_Genome_Coords.txt";

%MarkerHash=();

 while(<MARKERFILE>)

    {

    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space

    @array=split("#",$_);

    $MarkerHash{$array[1]}="$array[0]";

    }

open FASTA,"Trichechum_Markers_withFlankingRegions.fasta";

 

    while(<FASTA>)

    {

    $id=$_;

    $seq=<FASTA>;

    if($id=~/^\>(.+)\s*$/ && (exists($MarkerHash{$1})))

        {

        $alt_id="$MarkerHash{$1}#$1";

        print "\>$alt_id\n$seq";

        }

    }

    close(MARKERFILE);

    close(FASTA);

This finally gave me a fasta file with marker, genome region and corresponding coordinates of flanking regions.

>TmaA03_AF223651_R#gi|460713300|ref|NW_004444298.1|:346653-346872
CCTGGCTCATCAGT....
>TmaA03_AF223651_R#gi|460718350|ref|NW_004443945.1|:25974611-25974830
TCAGATTTTTAGTGG....
>TmaA03_AF223651_R#gi|460713360|ref|NW_004444238.1|:217351-217569
TCTTCTTTTATTTAAAT....

So, thats it!

No comments:

Post a Comment