Recently, I needed to convert a gff3 file from NCBI to bed12 (bed file with 12 columns) format for UCSC browser purposes. The BED12 file is not available in UCSC since its latest genome is processed with the annotation. All, I needed to do was to convert gff3 to bed12. These are the steps involved:
Direct conversion (from GFF3 to BED12) using GFFtools
1. (older version of script)
$ gff_to_bed.py ../ref_ASM185804v2_top_level.gff3 > ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed
$ head ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed NC_031971.1 6018623 6018697 rna15273 . - 6018623 6018697 0 1 74, 0, NW_017615921.1 193623 198461 rna67841 . - 193623 198461 0 6 633,85,89,39,44,68, 0,1075,1280,4477,4625,4770, NW_017615921.1 193623 198461 rna67842 . - 193623 198461 0 7 633,85,89,354,39,44,68, 0,1075,1280,2684,4477,4625,4770, NC_031968.1 33369464 33374965 rna7798 . - 33369464 33374965 0 2 1181,494, 0,5007, NC_031980.1 33188366 33210198 rna43547 . + 33188366 33210198 0 4 916,183,9,4122, 0,12798,16099,17710, NW_017615939.1 319272 321966 rna68222 . + 319272 321966 0 3 1350,253,749, 0,1526,1945, NC_031976.1 23579551 23595921 rna31865 . + 23579551 23595921 0 26 115,113,87,49,163,120,258,731,455,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927, NC_031976.1 23579551 23595921 rna31866 . + 23579551 23595921 0 26 115,113,87,49,163,120,258,731,455,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,395, 0,577,3241,4303,4441,4760,5592,6386,7302,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15975, NC_031976.1 23579551 23595921 rna31867 . + 23579551 23595921 0 27 115,113,87,49,163,120,258,731,138,227,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7530,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927, NC_031976.1 23579551 23595921 rna31868 . + 23579551 23595921 0 27 115,113,87,49,163,120,222,731,138,227,528,155,83,94,389,156,236,250,218,133,141,155,97,167,350,147,443, 0,577,3241,4303,4441,4760,5592,6386,7302,7530,7853,8486,10035,10364,10572,11048,11291,11621,11953,12301,13122,13352,13722,14853,15113,15669,15927,
$ wc -l ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed 71103 ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed
2. (new version of script) - throws error
$ python gff_to_bed_latest.py ref_ASM185804v2_top_level.gff3 >out3.bed Traceback (most recent call last): File "gff_to_bed_latest.py", line 112, in <module> __main__() File "gff_to_bed_latest.py", line 109, in __main__ writeBED(Transcriptdb) File "gff_to_bed_latest.py", line 66, in writeBED score = ent1['transcript_score'][idx] if ent1['transcript_score'].any() else score ## getting the transcript score ValueError: no field of name transcript_score
Raised an issue on the above error : https://github.com/vipints/GFFtools-GX/issues/9?_pjax=%23js-repo-pjax-container
Indirect conversion (from GFF3 to GTF to BED12)
1. Converting gff3 to genePred to bed12 - UCSC way
gff3ToGenePred - Convert a GFF3 file to a genePred file
$ ./gff3ToGenePred ../ref_ASM185804v2_top_level.gff3 ref_ASM185804v2_top_level.GenePred
genePredToBed - Convert from genePred to bed format.
$ ./genePredToBed ref_ASM185804v2_top_level.GenePred ref_ASM185804v2_top_level_genePredToBed.bed
This has generated me BED12 format successfully.
$ tail ref_ASM185804v2_top_level_genePredToBed.bed
NC_031965.1 434912 448252 rna12 0 - 436871 444514 0 5 3105,384,137,588,456, 0,9275,9991,10219,12884, NC_031965.1 434912 448252 rna11 0 - 436871 444514 0 5 3105,384,137,502,456, 0,9275,9991,10219,12884, NC_031965.1 339041 345377 rna10 0 - 341012 345350 0 2 94,5970, 0,366, NC_031965.1 211588 212299 rna7 0 + 212021 212282 0 2 86,367, 0,344, NC_031965.1 137999 139964 rna6 0 + 137999 139964 0 4 34,758,655,65, 0,137,1024,1900,
$ wc -l ref_ASM185804v2_top_level_genePredToBed.bed 63081 ref_ASM185804v2_top_level_genePredToBed.bed
2. Converting gff3 to gtf to bed12
$ gffread ../ref_ASM185804v2_top_level.gff3 -T -o ref_ASM185804v2_top_level.gtf
$ ./cufflinks_gtf2bed.py ref_ASM185804v2_top_level.gtf >ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed
$ head ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed NC_013663.1 0 69 rna70978 100 + 0 69 255,0,0 1 69 0 NC_013663.1 69 1013 rna70979 100 + 69 1013 255,0,0 1 944 0 NC_013663.1 1013 1085 rna70980 100 + 1013 1085 255,0,0 1 72 0 NC_013663.1 1085 2784 rna70981 100 + 1085 2784 255,0,0 1 1699 0 NC_013663.1 2784 2858 rna70982 100 + 2784 2858 255,0,0 1 74 0 NC_013663.1 3836 3906 rna70983 100 + 3836 3906 255,0,0 1 70 0 NC_013663.1 3905 3976 rna70984 100 - 3905 3976 255,0,0 1 71 0 NC_013663.1 3976 4045 rna70985 100 + 3976 4045 255,0,0 1 69 0 NC_013663.1 5091 5163 rna70986 100 + 5091 5163 255,0,0 1 72 0 NC_013663.1 5164 5233 rna70987 100 - 5164 5233 255,0,0 1 69 0
$ wc -l ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed 71912 ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed
Now, all three methods generated three BED12 format files but the number of records are different. Let's check this by counting number of lines:
$ wc -l */*bed 71103 direct/ref_ASM185804v2_top_level_gff_to_bed_frm_Vipin.bed 71912 indirect/ref_ASM185804v2_top_level_cufflinks_gtf2bed.bed 63081 indirect/ref_ASM185804v2_top_level_genePredToBed.bed
Number of records common to all three bed files (based on intervals):
$ cut -f2,3,4 */*bed | sort | uniq -c | sort -nk1,1 | fgrep ' 3 ' >common_to_threeBED_files_with_intervals.txt $ wc -l common_to_threeBED_files_with_intervals.txt 62070 common_to_threeBED_files_with_intervals.txt
Some records seem cleary missing using different conversion tools. I am not sure, if we can use these common 62070 features. But then, BED12 file could be generated from GFF3 file this way.
Observations:
- gtf2bed threw following error. The gtf file was generated from gffread script
$ gtf2bed < ref_ASM185804v2_top_level.gtf Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
- Tried to eliminate the above error using the solution in biostar post from the author of script. Still without success.
- Wanted to generate bed12 using this github script. Using the binary script gtfToGenePred from UCSC could not generate the required output
$ awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' ref_ASM185804v2_top_level.gtf | gtf2bed - Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
$ ./gtfToGenePred -genePredExt -geneNameAsName2 ref_ASM185804v2_top_level.gtf gene.tmp ref_ASM185804v2_top_level.gtf doesn't appear to be a GTF file (GFF not supported by this program)