We have further updated the protein impact predictions (i.e. whether inclusion disrupts the ORF, whether the event is in a UTR, etc.). We have also updated the files in the Downloads section, so you are encouraged to download the new version (v2; formerly v1.4.2). The main changes are:
- For INT events:
a) We have a new category, "ORF disruption upon sequence inclusion (1st CDS intron)", to account for cases in which the first intron in the CDS introduces a premature termination codon (PTC) when retained and is not also the last CDS intron (as it would be in genes with only one CDS intron). Although, to our knowledge, there are no specific analyses investigating their actual protein impact, these introns are peculiar in some respects, since NMD is inefficient near the ATG (Nat Genet. 2016 Oct;48(10):1112-8) and there are several reports of protein isoforms with alternative N-termini generated by retention of the first CDS intron and translation from an intronic in-frame ATG. As such, they were previously labeled as "Alternative protein isoforms". However, this is most likely inaccurate, and therefore we now refer to them more explicitly as (ORF-disrupting) 1st CDS introns.
b) In previous versions, to identify PTCs, we only translated the intron in-frame. If no PTCs were found, the event was labeled as "Alternative protein isoforms" if its length was a multiple of three, and as frame-shifting (and thus "ORF disruption upon sequence inclusion") if not. Now, the downstream exon is also translated in search of PTCs, and the NMD predictions (based on the 50-55 nt rule) are made accordingly. This has little impact on most mammalian introns, but affects a few small introns.
- For ALTA/ALTD events:
The impact of these events is particularly complicated to predict, unless any of the splice sites maps to an annotated transcript. We have now dug more into those non-annotated cases and added an additional (more conservative) category "In the CDS, with uncertain impact", to account for cases in the CDS for which it is not possible to make confident predictions. This category also exists for uncertain cases in EX and INT events.
- For EX events:
a) Frame-shifting exons (i.e. whose length is not a multiple of 3 nt) that are the second-to-last CDS exon in the "reference" transcript are now considered "Alternative protein isoforms (No Ref, Alt. Stop)". That is, when excluded, they create an alternative protein isoform that is not the reference one and has an alternative C-terminus.
b) The impact of some EX6* events (from vast-tools v2, not yet included in VastDB) was incorrectly predicted; this has now been corrected.
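
To illustrate the updated INT logic described above (translating the retained intron and then the downstream exon in search of PTCs, followed by the NMD check), here is a minimal Python sketch. It is not the actual VastDB code: the function names, the argument layout, the returned fields and the 50 nt threshold are all assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(seq):
    """Return the index of the first base of the first in-frame stop codon, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def assess_retained_intron(upstream_cds, intron, downstream_exon,
                           nt_to_last_junction=None, nmd_min_distance=50):
    """Hypothetical helper: scan a retained intron and its downstream exon for a PTC.

    upstream_cds        -- coding sequence from the ATG to the end of the upstream exon
    intron              -- sequence of the retained intron
    downstream_exon     -- sequence of the exon immediately downstream of the intron
    nt_to_last_junction -- spliced nt between the end of downstream_exon and the last
                           exon-exon junction; None if downstream_exon is the last exon
    nmd_min_distance    -- assumed threshold for the 50-55 nt rule
    """
    upstream_cds, intron, downstream_exon = (
        s.upper() for s in (upstream_cds, intron, downstream_exon))

    # Bases of an incomplete codon at the end of the upstream exon are carried over
    phase = len(upstream_cds) % 3
    carried = upstream_cds[len(upstream_cds) - phase:]
    readthrough = carried + intron + downstream_exon

    result = {
        "frame_preserved": len(intron) % 3 == 0,
        "first_stop_location": None,
        "premature": False,
        "nmd_predicted": False,
    }
    stop = first_in_frame_stop(readthrough)
    if stop is None:
        return result

    # A stop codon that starts before the end of the intron is attributed to the intron
    in_intron = stop < phase + len(intron)
    result["first_stop_location"] = "intron" if in_intron else "downstream_exon"

    # With a preserved frame, the downstream exon is read in its reference frame,
    # so a stop found there is not treated as premature in this sketch
    result["premature"] = in_intron or not result["frame_preserved"]

    # 50-55 nt rule: NMD is predicted when the PTC lies far enough upstream
    # of the last exon-exon junction
    if result["premature"] and nt_to_last_junction is not None:
        distance = len(readthrough) - (stop + 3) + nt_to_last_junction
        result["nmd_predicted"] = distance > nmd_min_distance
    return result
```

For a small intron whose length is not a multiple of three and that contains no in-frame stop itself, the first stop is found only once the downstream exon is translated, which is the situation the updated predictions address.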
Of course, if you detect that any predictions are incorrect, please let us know and we will try to improve them further!
The VastDB team