Step 1: As a first step, I blasted my marker sequences against the reference genome using following.
Note: Since, my marker sequences are very small in length ( 17- 26 nt), I used a blastn (not megablast) and higher e-value which in my case was 1000 and the results are listed in tabular format.
~/Documents/Blast/blastn -task blastn -query Markers.txt -db ~/db/Trichecus_manatus/Trichechus_manatus_Lat1.0_chrUnMasked -out Markers_vs_Trichechus_Unmasked.txt -evalue 1000 -outfmt "6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_descriptions 10 -num_alignments 10
Step 2: From the blast results, I chose a cut off to find if atleast 90% of the marker seqs are aligned to the reference.
cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90'
Step 3: Now that I have the coordinates in bed format obtained from blast, I want to generate coordinates with the number of flanking bases before and after the marker (in my case it is 100+/- nt).
cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $3,$12,$13}' | awk '{if ($2 > $3) print $1,($3-100),($2+100); else if($3 > $2) print $1,($2-100),($3+100);}' | sed 's/ /\t/g' >Markers.bed
$ head Markers.bed gi|460718269|ref|NW_004444026.1| 8607288 8607509 gi|460712402|ref|NW_004445196.1| 1258 1482 gi|460713408|ref|NW_004444190.1| 1238289 1238513 gi|460718268|ref|NW_004444027.1| 4548875 4549099 gi|460713515|ref|NW_004444083.1| 1935238 1935453 gi|460718331|ref|NW_004443964.1| 21821180 21821395 gi|460726833|ref|NW_004443938.1| 8634324 8634539 gi|460713195|ref|NW_004444403.1| 45475 45689 gi|460713273|ref|NW_004444325.1| 69053 69267 gi|460713348|ref|NW_004444250.1| 98323 98537
Step 4: Using this Markers.bed file, I extracted the marker sequence along with its flanking regions using bedtools - getfasta.
$ bedtools getfasta -fi ~/db/Trichecus_manatus/tma_ref_TriManLat1.0_chrUn.fa -bed Markers.bed -fo Trichechum_Markers_withFlankingRegions.fasta
The result in the Trichechum_Markers_withFlankingRegions.fasta
>gi|460718269|ref|NW_004444026.1|:8607288-8607509 TGTGTGTATGCCTAGCTGTATGTCTCTATCTGT........ >gi|460718269|ref|NW_004444026.1|:8607130-8607354 ATCTTCATCAGTAAGTACTTCAAGTGCTCTTCATT........ >gi|460712402|ref|NW_004445196.1|:1258-1482 GTATACTGAGTACTTTGAAGTCAGGAAAGACGTGACT........ >gi|460713408|ref|NW_004444190.1|:1238289-1238513 AAAGGGATATTACATGGTTTAAAATTAGGAAAGGCAC........ >gi|460718268|ref|NW_004444027.1|:4548875-4549099 GGATACTATGATTTCAAGTCAAGAAACGAGTGTGTCA........
The information in the header now only contains:
gi|460718269|ref|NW_004444026.1|:8607288-8607509
gene id coordinates
But, I would like to add information of marker also in the header line like this:
Marker1#gi|460718269|ref|NW_004444026.1|:8607288-8607509
gene id coordinates
Step 5: So, by running the following commands a file with both marker information along with genome region and coordinates is generated.
$ cat Markers_vs_Trichechus_Unmasked.txt | awk '{ print $0, $6*100/$2 }' | column -t | awk '$16 >=90' | awk '{print $1,$3,$12,$13}' | awk '{if ($3 > $4) print $1"#"$2":"($4-100)"-"($3+100); else if($4 > $3) print $1"#"$2":"($3-100)"-"($4+100);}' >Marker_Genome_Coords.txt
which looks like this:
$ head Marker_Genome_Coords.txt TmaE11_AF223658_177–197_F#gi|460718269|ref|NW_004444026.1|:8607288-8607509 TmaE11_AF223658_R#gi|460718269|ref|NW_004444026.1|:8607130-8607354 TmaE11_AF223658_R#gi|460712402|ref|NW_004445196.1|:1258-1482 TmaE11_AF223658_R#gi|460713408|ref|NW_004444190.1|:1238289-1238513 TmaE11_AF223658_R#gi|460718268|ref|NW_004444027.1|:4548875-4549099 TmaE01_EF191343_272–280_F#gi|460713515|ref|NW_004444083.1|:1935238-1935453 TmaE01_EF191343_272–280_F#gi|460718277|ref|NW_004444018.1|:3212423-3212638 TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:19459039-19459254 TmaE01_EF191343_272–280_F#gi|460718319|ref|NW_004443976.1|:10019186-10019400 TmaE01_EF191343_272–280_F#gi|460718331|ref|NW_004443964.1|:21821180-21821395
Step 6: Using both these files, Marker_Genome_Coords.txt and Trichechum_Markers_withFlankingRegions.fasta, I ran the following perl script:
#!/usr/bin/perl ## add the marker information to the header line in the fasta file open MARKERFILE,"Marker_Genome_Coords.txt"; %MarkerHash=(); while(<MARKERFILE>) { $_ =~ s/^\s+|\s+$//; #remove leading and trailing space @array=split("#",$_); $MarkerHash{$array[1]}="$array[0]"; } open FASTA,"Trichechum_Markers_withFlankingRegions.fasta"; while(<FASTA>) { $id=$_; $seq=<FASTA>; if($id=~/^\>(.+)\s*$/ && (exists($MarkerHash{$1}))) { $alt_id="$MarkerHash{$1}#$1"; print "\>$alt_id\n$seq"; } } close(MARKERFILE); close(FASTA);
This finally gave me a fasta file with marker, genome region and corresponding coordinates of flanking regions.
>TmaA03_AF223651_R#gi|460713300|ref|NW_004444298.1|:346653-346872 CCTGGCTCATCAGT.... >TmaA03_AF223651_R#gi|460718350|ref|NW_004443945.1|:25974611-25974830 TCAGATTTTTAGTGG.... >TmaA03_AF223651_R#gi|460713360|ref|NW_004444238.1|:217351-217569 TCTTCTTTTATTTAAAT....
So, thats it!
No comments:
Post a Comment