Friday, September 15, 2017

Converting NCBI annotation from GFF3 to BED12

boxed Language: Python, Style: perldoc

Recently, I needed to convert a gff3 file from NCBI to bed12 (bed file with 12 columns) format for UCSC browser purposes. The BED12 file is not available in UCSC since its latest genome is processed with the annotation. All, I needed to do was to convert gff3 to bed12. These are the steps involved:

Direct conversion (from GFF3 to BED12) using GFFtools 


1. (older version of script)

$ gff_to_bed.py ../ref_ASM185804v2_top_level.gff3 > ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed

$ head ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed
NC_031971.1 6018623 6018697 rna15273 . - 6018623 6018697 0 1 74, 0,
NW_017615921.1 193623 198461 rna67841 . - 193623 198461 0 6 633,85,89,39,44,68, 0,1075,1280,4477,4625,4770,
NW_017615921.1 193623 198461 rna67842 . - 193623 198461 0 7 633,85,89,354,39,44,68, 0,1075,1280,2684,4477,4625,4770,
NC_031968.1 33369464 33374965 rna7798 . - 33369464 33374965 0 2 1181,494, 0,5007,
NC_031980.1 33188366 33210198 rna43547 . + 33188366 33210198 0 4 916,183,9,4122, 0,12798,16099,17710,
NW_017615939.1 319272 321966 rna68222 . + 319272 321966 0 3 1350,253,749, 0,1526,1945,
NC_031976.1 23579551 23595921 rna31865 . + 23579551 23595921 0 26 115,113,87,49,163,120,258,731,455,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927,
NC_031976.1 23579551 23595921 rna31866 . + 23579551 23595921 0 26 115,113,87,49,163,120,258,731,455,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,395, 0,577,3241,4303,4441,4760,5592,6386,7302,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15975,
NC_031976.1 23579551 23595921 rna31867 . + 23579551 23595921 0 27 115,113,87,49,163,120,258,731,138,227,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7530,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927,
NC_031976.1 23579551 23595921 rna31868 . + 23579551 23595921 0 27 115,113,87,49,163,120,222,731,138,227,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7530,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927,

$ wc -l  ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed
71103 ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed

2. (new version of script) - throws error
$ python gff_to_bed_latest.py ref_ASM185804v2_top_level.gff3 >out3.bed
Traceback (most recent call last):
  File "gff_to_bed_latest.py", line 112, in <module>
    __main__() 
  File "gff_to_bed_latest.py", line 109, in __main__
    writeBED(Transcriptdb)
  File "gff_to_bed_latest.py", line 66, in writeBED
    score = ent1['transcript_score'][idx] if ent1['transcript_score'].any() else score ## getting the transcript score 
ValueError: no field of name transcript_score


Raised an issue on the above error : https://github.com/vipints/GFFtools-GX/issues/9?_pjax=%23js-repo-pjax-container

Indirect conversion (from GFF3 to GTF to BED12)


1. Converting gff3 to genePred to bed12 - UCSC way

gff3ToGenePred - Convert a GFF3 file to a genePred file

$ ./gff3ToGenePred ../ref_ASM185804v2_top_level.gff3 ref_ASM185804v2_top_level.GenePred

genePredToBed - Convert from genePred to bed format.

$ ./genePredToBed ref_ASM185804v2_top_level.GenePred ref_ASM185804v2_top_level_genePredToBed.bed

This has generated me BED12 format successfully.

$ tail ref_ASM185804v2_top_level_genePredToBed.bed
NC_031965.1 434912 448252 rna12 0 - 436871 444514 0 5 3105,384,137,588,456, 0,9275,9991,10219,12884,
NC_031965.1 434912 448252 rna11 0 - 436871 444514 0 5 3105,384,137,502,456, 0,9275,9991,10219,12884,
NC_031965.1 339041 345377 rna10 0 - 341012 345350 0 2 94,5970, 0,366,
NC_031965.1 211588 212299 rna7 0 + 212021 212282 0 2 86,367, 0,344,
NC_031965.1 137999 139964 rna6 0 + 137999 139964 0 4 34,758,655,65, 0,137,1024,1900,

$ wc -l ref_ASM185804v2_top_level_genePredToBed.bed
63081 ref_ASM185804v2_top_level_genePredToBed.bed



2. Converting gff3 to gtf to bed12 

$ gffread ../ref_ASM185804v2_top_level.gff3 -T -o ref_ASM185804v2_top_level.gtf

$ ./cufflinks_gtf2bed.py ref_ASM185804v2_top_level.gtf >ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed

$ head ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed 
NC_013663.1 0 69 rna70978 100 + 0 69 255,0,0 1 69 0
NC_013663.1 69 1013 rna70979 100 + 69 1013 255,0,0 1 944 0
NC_013663.1 1013 1085 rna70980 100 + 1013 1085 255,0,0 1 72 0
NC_013663.1 1085 2784 rna70981 100 + 1085 2784 255,0,0 1 1699 0
NC_013663.1 2784 2858 rna70982 100 + 2784 2858 255,0,0 1 74 0
NC_013663.1 3836 3906 rna70983 100 + 3836 3906 255,0,0 1 70 0
NC_013663.1 3905 3976 rna70984 100 - 3905 3976 255,0,0 1 71 0
NC_013663.1 3976 4045 rna70985 100 + 3976 4045 255,0,0 1 69 0
NC_013663.1 5091 5163 rna70986 100 + 5091 5163 255,0,0 1 72 0
NC_013663.1 5164 5233 rna70987 100 - 5164 5233 255,0,0 1 69 0

$ wc -l ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed
71912 ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed

Now, all three methods generated three BED12 format files but the number of records are different. Let's check this by counting number of lines:


$ wc -l */*bed
   71103 direct/ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed
   71912 indirect/ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed
   63081 indirect/ref_ASM185804v2_top_level_genePredToBed.bed

Number of records common to all three bed files (based on intervals):


$ cut -f2,3,4 */*bed | sort | uniq -c | sort -nk1,1 | fgrep ' 3 ' >common_to_threeBED_files_with_intervals.txt 
$ wc -l common_to_threeBED_files_with_intervals.txt
62070 common_to_threeBED_files_with_intervals.txt

Some records seem cleary missing using different conversion tools. I am not sure, if we can use these common 62070 features. But then, BED12 file could be generated from GFF3 file this way.

Observations:
  1. gtf2bed threw following error. The gtf file was generated from gffread script

    $ gtf2bed < ref_ASM185804v2_top_level.gtf
    Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

  2. Tried to eliminate the above error using the solution in biostar post from the author of script. Still without success.
  3. $ awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' ref_ASM185804v2_top_level.gtf | gtf2bed -
    Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
    

  4. Wanted to generate bed12 using this github script. Using the binary script gtfToGenePred from UCSC could not generate the required output
  5. $ ./gtfToGenePred -genePredExt -geneNameAsName2 ref_ASM185804v2_top_level.gtf gene.tmp
    ref_ASM185804v2_top_level.gtf doesn't appear to be a GTF file (GFF not supported by this program)

Sunday, September 3, 2017

Copying multiple files from a list to another directory

I have a list of filenames in a file "list.txt".

  1. I want to find out the absolute location of these files
  2. and then want to copy to another folder.
I did the following:


$ time for d in $(cat list.txt); do cp `find /media/My_Book/DATA/Raw/ -name "$d"` /media/F3F9/files/ ; done