Monday, November 19, 2012

get intron, UTR, CDS from bed12 format

Input: gene annotation in BED12 format (If your input is GTF/GFF, you want to convert into bed12 ahead, see my another post for howto)
Output: intron/UTR/CDS in bed12 format (which can be further converted into bed6 via "bedtools bed12tobed6" command)

# Introns (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split(\$11,a,","); split(\$12,b,","); A=""; B=""; for(i=1;i<length(a)-1;i++) {A=A""(b[i+1]-b[i]-a[i])",";B=B""(b[i]+a[i]-(b[1]+a[1]))",";} if(\$10>1) print \$1,\$2+a[1], \$3-a[length(a)-1], \$4,\$5,\$6,\$2+a[1], \$3-a[length(a)-1],\$9,\$10-1,A,B;}'

# 5´ UTR (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split(\$11,a,","); split(\$12,b,","); A=""; B=""; if(\$7==\$8) next; if(\$6=="+" && \$2<\$7) {for(i=1;i<length(a);i++) if((\$2+b[i]+a[i])<=\$7) {A=A""a[i]",";B=B""b[i]",";} else {A=A""(\$7-\$2-b[i])",";B=B""b[i]","; break; } print \$1,\$2,\$7,\$4,\$5,\$6,\$2,\$7,\$9,i,A,B;} if(\$6=="-" && \$8<\$3) {for(i=length(a)-1;i>0;i--) if((\$2+b[i])>=\$8) {A=a[i]","A;B=(\$2+b[i]-\$8)","B;} else {A=(\$2+b[i]+a[i]-\$8)","A;B=0","B; break; } print \$1,\$8,\$3,\$4,\$5,\$6,\$8,\$3,\$9,length(a)-i,A,B;}}'
## Update: the code crossed above did not consider the special case that start codon is spliced (e.g. in intron within in the start codon).
awk '{OFS="\t";split(\$11,blockSizes,","); split(\$12,blockStarts,","); blockCount=\$10;A=""; B=""; if(\$7==\$8) next;N=0;if(\$6=="+" && \$2<\$7) {start=\$2;end=\$7; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i]+blockSizes[i])<=\$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=(\$2+blockStarts[i]+blockSizes[i]); N++;} else { if((\$2+blockStarts[i])<\$7) {A=A""(\$7-\$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=\$7;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;} if(\$6=="-" && \$8<\$3) {start=\$8;end=\$3; for(i=blockCount;i>0;i--) if((\$2+blockStarts[i])>=\$8) {A=blockSizes[i]","A;B=(\$2+blockStarts[i]-\$8)","B;start=(\$2+blockStarts[i]); N++;} else {if((\$2+blockStarts[i]+blockSizes[i])>\$8) {A=(\$2+blockStarts[i]+blockSizes[i]-\$8)","A;B=0","B; N++; start=\$8;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;}}'

## Update (2017-Jan-28, see comment below)
awk '{OFS="\t";split(\$11,blockSizes,","); split(\$12,blockStarts,","); blockCount=\$10;A=""; B=""; if(\$7==\$8) next;N=0;if(\$6=="+" && \$2<\$7) {start=\$2;end=\$7; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i]+blockSizes[i])<=\$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=(\$2+blockStarts[i]+blockSizes[i]); N++;} else { if((\$2+blockStarts[i])<\$7) {A=A""(\$7-\$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=\$7;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;} if(\$6=="-" && \$8<\$3) {start=\$8;end=\$3; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i])>=\$8) {if(start==0) {A=blockSizes[i];B=0; start=\$2+blockStarts[i];} else {A=A","blockSizes[i];B=B","(\$2+blockStarts[i]-start);} N++;} else { if((\$2+blockStarts[i]+blockSizes[i])>\$8) { A=(\$2+blockStarts[i]+blockSizes[i]-\$8);B=0; N++; start=\$8;} if((\$2+blockStarts[i]+blockSizes[i])==\$8) start=0;} print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;}}'

# CDS (if any) of gene annotation in BED12 format
grep "protein_coding\.protein_coding" annotation.bed | awk '{OFS="\t";split(\$11,a,","); split(\$12,b,","); A=""; B=""; if(\$7==\$8) next; j=0; for(i=1;i<length(a);i++) if((\$2+b[i]+a[i])>\$7 && (\$2+b[i])<\$8) {j++; start=\$2+b[i]-\$7; size=a[i]; if((\$2+b[i])<=\$7) {start=0;size=size-(\$7-(\$2+b[i]));} if((\$2+a[i]+b[i])>=\$8) {size=size-(\$2+a[i]+b[i]-\$8);} A=A""size",";B=B""start",";} print \$1,\$7,\$8,\$4,\$5,\$6,\$7,\$8,\$9,j,A,B;}'

### Note1: The thickStart and thickEnd in the bed12 format don't always indicate CDS. "When there is no thick part, thickStart and thickEnd are usually set to the chromStart position." So, we cannot use bed12 to infer CDS (or coding exons), esp. for lincRNA. The correct way is to grep all CDS lines from the GTF file, or directly using UCSC Table Browser to download coding exons.

### Note2: The "Polymorphic pseudogene" in GENCODE can also have a protein-coding transcript, even though the gene itself is classified as a pseudogene (YES, polymorphic pseudogenes are also pseudogenes, they "are coding gene that are pseudogenic due to the presence of a polymorphic premature stop codon in the reference genome" (http://www.genomebiology.com/2012/13/9/R51). For more details, see https://gencodegenes.wordpress.com/toolbox/.

# 3´ UTR (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split(\$11,blockSizes,","); split(\$12,blockStarts,","); blockCount=\$10;A=""; B=""; if(\$7==\$8) next;N=0;if(\$6=="-" && \$2<\$7) {start=\$2;end=\$7; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i]+blockSizes[i])<=\$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=(\$2+blockStarts[i]+blockSizes[i]); N++;} else { if((\$2+blockStarts[i])<\$7) {A=A""(\$7-\$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=\$7;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;} if(\$6=="+" && \$8<\$3) {start=\$8;end=\$3; for(i=blockCount;i>0;i--) if((\$2+blockStarts[i])>=\$8) {A=blockSizes[i]","A;B=(\$2+blockStarts[i]-\$8)","B;start=(\$2+blockStarts[i]); N++;} else {if((\$2+blockStarts[i]+blockSizes[i])>\$8) {A=(\$2+blockStarts[i]+blockSizes[i]-\$8)","A;B=0","B; N++; start=\$8;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;}}'

cat annotation.bed | awk '{OFS="\t";split(\$11,blockSizes,","); split(\$12,blockStarts,","); blockCount=\$10;A=""; B=""; if(\$7==\$8) next;N=0;if(\$6=="-" && \$2<\$7) {start=\$2;end=\$7; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i]+blockSizes[i])<=\$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=(\$2+blockStarts[i]+blockSizes[i]); N++;} else { if((\$2+blockStarts[i])<\$7) {A=A""(\$7-\$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=\$7;} break; } print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;} if(\$6=="+" && \$8<\$3) {start=\$8;end=\$3; for(i=1;i<=blockCount;i++) if((\$2+blockStarts[i])>\$8) { if(start==0) {A=blockSizes[i];B=0; start=\$2+blockStarts[i];} else {A=A","blockSizes[i];B=B","(\$2+blockStarts[i]-start);} N++; } else { if((\$2+blockStarts[i]+blockSizes[i])>\$8) { A=(\$2+blockStarts[i]+blockSizes[i]-\$8);B=0; N++; start=\$8;} if((\$2+blockStarts[i]+blockSizes[i])==\$8) start=0;} print \$1,start,end,\$4,\$5,\$6,start,end,\$9,N,A,B;}}'

Update: Thanks to Heather's comment below, when the thickEnd (end of CDS) is at the end of an exon (or thickStart at the first nt of an exon), it will generate an invalid BED12 format. It's fixed now.  (2017-Jan-28)

BTW, the above codes are integrated into a neat script in Github: https://github.com/sterding/RNAseq/blob/master/bin/bed12toAnnotation.awk

Also, I fixed some bugs in gtf2bed (originally by Erik) and host here:
https://github.com/sterding/RNAseq/blob/master/bin/gtf2bed

1. hi dong, your blog is helpful for me, and thank you very much. For get UTR bed12 file, there are some genes which have only 5'UTR or 3'UTR, but it seems that your stript ignores the set of genes. What is your idea? I'm a newer in bioinformatics, so it may be wrong! hehe!

1. Hi, Huanwei. Glad to hear that this blog is helpful for you. What do you mean by "some genes which have only 5'UTR or 3'UTR"? (1) They just have UTR (no CDS) in the annotation; (2) They just have one of 5UTR and 3UTR, not both. For case (1), I think they should be annotated as non-coding RNA, and the "UTR" doesn't make sense since it's not translated into protein at all. We should ignore this case. For the (2) case, if only 5UTR or 3UTR exist, I think my script can report whatever it has. Did I answer your question?

2. The outputted 3utr or 5utr bed12 files do not convert fully to fasta. I use bedtools getfasta, and it exists in the middle when it encounters the following issue: Error: cannot construct subsequence with negative offset or length < 1. I do not have this problem with the cds bed12. Do you know how to fix the bed12 to allow conversation to fasta that would run through completion by either solving the issue above or omitting entries which have this issue?

1. Thanks for showing interests to my code snip. Could you please send me example of your input bed12? If you have the name of the record which reproduce the error message, that's even better.

3. I found an example of where the 3' UTR script does not give the correct results. This happens when the 3' UTR starts at the exact start of the second exon ("thickEnd" equal to the end of the first exon).

BED line:

chr2 174297858 174346742 NM_201618 0 + 174298039 174300217 0 12 2359,73,45,55,120,98,55,74,180,131,68,501, 0,36353,38888,43794,43951,45341,47216,47369,47524,47790,48131,48383,

1. Hi Heather. Thanks for reporting the bug. It's fixed now.

4. Henry2:48 PM

Hello,
Thanks for sharing all these commands!
I just tried the cds extraction, and I found it probably has a problem when the stop codon in the last exon, and I got the results with the last exon missing.

Example input:
chr1 4776379 4785709 Gene 0 - 4776463 4785677 0 5 422,124,166,155,137 0,1145,6188,7571,9193

Example output:
chr1 4776463 4785677 Gene 0 - 4776463 4785677 0 4 338,124,166,155, 0,1061,6104,7487,

The correct output I think:
chr1 4776463 4785677 Gene 0.0 - 4776463 4785677 0 5 338,124,166,155,105, 0,1061,6104,7487,9109,

I am new to bioinfo, so please point me out if I am wrong. Thank you!

1. Henry2:52 PM

Well, since the gene is on '-' strand, actually it was first exon missing.