Tuesday, August 14, 2012

convert bed to gtf, gtf to bed

bed --> gtf:

bedToGenePred input.bed stdout | genePredToGtf file stdin output.gtf

Here is an example from UCSC wiki (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format):

Some gene tracks are in a bed format in the database, perhaps with extra columns past the standard bed format. In this case, extract the standard bed columns, convert it to a genePred and then to a gtf. For example

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 | bedToGenePred stdin stdout | genePredToGtf file stdin wgRna.gtf

gtf --> bed:

download gtf2bed.pl from Eric's site:

perl gtf2bed.pl input.gtf > output.bed

If simply converting gtf to bed6 format (no exons/intron info), the following bash line can work:
fgrep -w transcript gencode.v17.annotation.gtf | sed 's/[";]//g;' | awk '{OFS="\t"; print $1, $4-4,$5,$12,0,$7}'
or a bed9 format with additional info on gene/type/name
fgrep -w transcript gencode.v17.annotation.gtf | sed 's/[";]//g;' | awk '{OFS="\t"; print $1, $4-4,$5,$12,0,$7,$18,$14,$10}' 

Also, I edited Eric's code a bit to output a more meaningful gene ID, e,g.

($transid) = $f[8]=~ /transcript_id "([^"]+)"/;
($geneid) = $f[8]=~ /gene_id "([^"]+)"/;
($gene_type) = $f[8]=~ /gene_type "([^"]+)"/;
($gene_name) = $f[8]=~ /gene_name "([^"]+)"/;
($trans_type) = $f[8]=~ /transcript_type "([^"]+)"/;


  1. There is a bug in the gtf2bed.pl script. See https://code.google.com/p/ea-utils/issues/list

    1. It's already clarified as not a bug. see https://code.google.com/p/ea-utils/issues/detail?id=28