FAQ

From VastDB

Revision as of 12:25, 22 December 2020 by Mirimia

How do I report an issue or share a suggestion?

Please send any comments or suggestions to vastdb@googlegroups.com. We will normally get back to you within the same day.


How is the inclusion level (PSI) of a given AS event quantified?

AS event quantification is performed using vast-tools. vast-tools uses different modules to quantify cassette exons, microexons, alternative 5' and 3' splice sites, and intron retention (reflected in the ‘vast-tools module’ field in the ‘VastDB Features’ section of each event). For detailed information about how the quantification works, please refer to the Supplementary Information of Irimia et al., Cell 2014. Current inclusion data in VastDB corresponds to vast-tools v2.5.1.


How is the gene expression (GE) of a given gene quantified?

GE quantification is also performed using vast-tools. vast-tools maps the first 50 nucleotides of each read (longer reads are trimmed to 50 nucleotides, and only the forward read is used for paired-end data) to a library with one reference transcript per gene. GE levels are provided using the cRPKM metric (corrected [for mappability] Reads Per Kilobasepair and Million mapped reads), as detailed in Labbé et al., Stem Cells 2012. cRPKM can be converted to TPM by applying the following formula: TPM = 10^6 * cRPKM / sum_all(cRPKM), where sum_all(cRPKM) is the sum of the cRPKM values of all genes in the sample. Moreover, vast-tools can provide tables with TPMs and raw counts.
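As an illustration of the conversion formula above, here is a minimal Python sketch. The function name and the example gene names and values are hypothetical; it assumes you already hold the cRPKM values of all genes of a single sample in a dictionary.

```python
def crpkm_to_tpm(crpkm_by_gene):
    """Convert the cRPKM values of one sample to TPM.

    Implements TPM = 10^6 * cRPKM / sum_all(cRPKM), where the sum runs
    over every gene in the same sample.
    """
    total = sum(crpkm_by_gene.values())
    if total <= 0:
        raise ValueError("the sum of cRPKM values must be positive")
    return {gene: 1e6 * value / total for gene, value in crpkm_by_gene.items()}


# Hypothetical sample with three genes (illustrative values only).
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = crpkm_to_tpm(sample)
for gene, value in tpm.items():
    print(gene, round(value, 1))
```

By construction, the TPM values of a sample add up to 10^6, so the conversion has to be done per sample and with the full gene table, not with a filtered subset.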


How is AS event conservation assessed?

Currently, AS event orthology relationships in VastDB are inferred using a combination of a liftOver-based approach (as described in the Supplementary Information of Irimia et al., Cell 2014) and a new software tool that we are currently developing (ExOrthist). The liftOver-based approach is mainly used for closely related species (mammals and chicken) and for non-exon skipping events. ExOrthist complements exon orthologies for danRer10 and dm6.


Why is there data for two assemblies in some species and how can I benefit from it?

For some species (human, mouse and chicken), we have used two independent VASTDB libraries in vast-tools to quantify GE and AS. Since these libraries are built and used independently from each other, the quantifications from the two assemblies (e.g. one for hg19 and another for hg38) may not fully coincide, or an event may be covered in only one of the assemblies. Therefore, you can use the two datasets as semi-independent validations: if the two assemblies show the same regulatory pattern for the same EventID, the result is more reliable.


What AS events are displayed in the VastDB UCSC track?

VastDB displays AS events detected and quantified in vast-tools that show a minimal level of alternative usage. This cut-off depends on the species and AS event type (normally, INT and ALTA/ALTD events are required to have larger PSI variations to be displayed). In addition, events for which we have annotated an experimentally validated function (including for any of their orthologs) are also displayed. If you are interested in an event that is not displayed, you can look for it directly using the search box on the main page.


What do the colors and block thickness in the UCSC track mean?

The colors signify the different types of AS events, whereas the block thickness indicates the type of sequence.

  • For any individual cassette exon event (including microexons), the C1, A and C2 exons are each represented. The alternative exon (A) thus corresponds to the exon in between.
    • Blue: simple cassette exon. “Simple” is defined as cassette exons for which ≥95% of the reads used to quantify their PSI come from the three reference exon-exon junctions, which are C1A, AC2 and C1C2. It corresponds to “S” or “MIC_S” in ‘Average complexity’.
    • Purple: cassette exon event of intermediate complexity. This is defined as those alternative exons for which ≥50% and <95% of the reads used to quantify their PSI come from the three reference exon-exon junctions. It corresponds to “C1” or “C2” in ‘Average complexity’.
    • Red: complex cassette exon event, for which <50% of the reads used to quantify their PSI come from the three reference exon-exon junctions. It corresponds to “C3”, “ME” or “MIC_M” in ‘Average complexity’.
    • Black: groups multiple neighboring cassette exon events. Black tracks are only informative and do not link to any page in VastDB.
  • For Intron Retention events: Orange track. Thick blocks correspond to the intronic sequence, and the thin blocks to the adjoining exons (C1 and C2).
  • For Alternative 3' and 5' splice site choice events: Dark Green and Light Green, respectively. In both cases, the thick block corresponds to the alternative sequence, whereas the thin blocks are the constant exonic sequences (C1 and C2). For these events, at least two tracks are shown: one for sequence exclusion (the most internal splice site; EventID-1/N) and one for sequence inclusion.
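The color rules for cassette exons listed above can be summarized in a short Python sketch. The function name is hypothetical, and the sketch assumes that the boundary value of 95% is assigned to the "simple" (blue) class; it is meant only as a reading aid, not as the code used to build the track.

```python
def cassette_exon_color(reference_fraction):
    """Return the track color for a cassette exon event.

    reference_fraction: fraction (0 to 1) of the reads used to quantify
    the PSI that map to the three reference junctions (C1A, AC2, C1C2).
    """
    if not 0.0 <= reference_fraction <= 1.0:
        raise ValueError("reference_fraction must be between 0 and 1")
    if reference_fraction >= 0.95:
        return "blue"    # simple: "S" or "MIC_S"
    if reference_fraction >= 0.50:
        return "purple"  # intermediate complexity: "C1" or "C2"
    return "red"         # complex: "C3", "ME" or "MIC_M"


for fraction in (0.99, 0.80, 0.30):
    print(fraction, cassette_exon_color(fraction))
```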


How are the splice site scores calculated?

These scores were calculated using score5.pl and score3.pl from Yeo and Burge, 2004. This method uses a position weight matrix and calculates deviation from the consensus. For 5’ splice sites, three exonic and six intronic positions surrounding the exon-intron junction were analyzed (9 nt in total), and for 3’ splice sites, 20 intronic and three exonic positions were analyzed (23 nt in total).
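To make the analyzed windows concrete, the following Python sketch extracts the 9-nt and 23-nt sequences around an exon from a transcribed-strand sequence. It only prepares the input windows and does not reproduce the scoring itself; the function names, the coordinate convention (0-based, end-exclusive) and the toy sequence are assumptions made for this example.

```python
def five_prime_window(seq, exon_end):
    """Return 3 exonic + 6 intronic nt around the exon-intron junction.

    exon_end: 0-based index of the first intronic base after the exon.
    """
    start, stop = exon_end - 3, exon_end + 6
    if start < 0 or stop > len(seq):
        raise ValueError("not enough flanking sequence for the 5' window")
    return seq[start:stop]


def three_prime_window(seq, exon_start):
    """Return 20 intronic + 3 exonic nt around the intron-exon junction.

    exon_start: 0-based index of the first exonic base.
    """
    start, stop = exon_start - 20, exon_start + 3
    if start < 0 or stop > len(seq):
        raise ValueError("not enough flanking sequence for the 3' window")
    return seq[start:stop]


# Toy sequence: 20-nt upstream intron, 9-nt exon, 10-nt downstream intron.
intron_up = "T" * 18 + "AG"
exon = "ATGGCCCAG"
intron_down = "GTAAGTCCCC"
seq = intron_up + exon + intron_down

print(three_prime_window(seq, exon_start=20))
print(five_prime_window(seq, exon_end=29))
```

For events on the minus strand, the genomic sequence would need to be reverse-complemented before applying these functions.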


How is the impact on the ORF predicted?

The pipeline to predict ORF impact is largely as described in Irimia et al., 2014. Currently, VastDB displays version 3 of these predictions, which can be downloaded from the Downloads page. When using this information, please keep in mind:

  • The prediction is based on the impact that the specific alternative sequence is likely to have when included or excluded from the transcript in isolation. That is, if there are other associated AS events (e.g. mutually exclusive or coordinated exons) the prediction may not be accurate.
  • Like any other prediction, our annotations may be inaccurate. Please check your results carefully and, as with any other dataset in VastDB, use at your own risk.


How should I interpret the domain information?

Domain information is currently only available for cassette exons. When an exon (either C1, A or C2) overlaps a PROSITE or PFAM domain, the following information is shown:


Dom_ID = Dom_Name = Type_Overlap(%Dom_Overlap = %Exon_Overlap)


The meaning of each field is explained below:

  • Dom_ID: Domain ID in either PROSITE or PFAM databases. For PROSITE, domains with ID P0* (high frequency motifs) are excluded.
  • Dom_Name: Domain name as provided by PROSITE or PFAM databases.
  • Type_Overlap: There are four possible ways in which an exon can overlap a protein domain:
    • The whole exonic sequence fully overlaps with a domain (FE, Full Exon).
    • The whole domain is fully encoded within an exon (WD, Whole Domain).
    • The upstream (5') part of the exon overlaps the domain (PU, Partial Upstream).
    • The downstream (3') part of the exon overlaps the domain (PD, Partial Downstream).
  • %Dom_Overlap: percent of the domain encoded by the exon.
  • %Exon_Overlap: percent of the exon that overlaps the domain.
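If you need to process these annotations programmatically, a small parser along the following lines may help. This is a sketch that assumes the entries follow the pattern Dom_ID = Dom_Name = Type_Overlap(%Dom_Overlap = %Exon_Overlap) literally, with optional spaces and optional percent signs; the example strings and all names in the code are hypothetical, so check them against the actual entries before relying on it.

```python
import re
from typing import NamedTuple


class DomainOverlap(NamedTuple):
    dom_id: str
    dom_name: str
    type_overlap: str   # FE, WD, PU or PD
    dom_overlap: float  # percent of the domain encoded by the exon
    exon_overlap: float  # percent of the exon that overlaps the domain


# Assumed layout: ID = NAME = TYPE(dom% = exon%), spaces and "%" optional.
_PATTERN = re.compile(
    r"^\s*(?P<dom_id>[^=\s]+)\s*=\s*(?P<dom_name>[^=]+?)\s*=\s*"
    r"(?P<type>FE|WD|PU|PD)\s*\(\s*(?P<dom>[\d.]+)%?\s*=\s*"
    r"(?P<exon>[\d.]+)%?\s*\)\s*$"
)


def parse_domain_overlap(text):
    """Parse one domain annotation string into a DomainOverlap record."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized domain annotation: {text!r}")
    return DomainOverlap(
        match["dom_id"],
        match["dom_name"],
        match["type"],
        float(match["dom"]),
        float(match["exon"]),
    )


# Hypothetical example entries.
print(parse_domain_overlap("PF00069=Pkinase=FE(12.5=100)"))
print(parse_domain_overlap("PS50011 = PROTEIN_KINASE_DOM = PU(30.1 = 45)"))
```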


How are the primers for RT-PCR validation designed?

Primers are designed automatically using Primer3 (optimal primer length = 21 nt; optimal Tm = 61 °C). As a general rule, primers are located in the C1 and C2 exonic sequences, so two RT-PCR products will be produced: a shorter one (from C1 to C2, skipping the A sequence) and a longer one (including the A sequence). Their sizes are provided in ‘Band lengths’. To minimize PCR amplification bias towards shorter amplicons (i.e. over-representation of the skipping form) and, at the same time, optimize the visualization in agarose gels, primers are designed based on the size relationship between the two predicted amplicons. This is based on the following rules, where LE is the length of the alternative sequence:

  • Alternative sequence LE < 15 nt => optimal skipping band size = 100 nt.
  • Alternative sequence 15 ≤ LE < 25 nt => optimal skipping band size = 110 nt.
  • Alternative sequence 25 ≤ LE < 40 nt => optimal skipping band size = 120 nt.
  • Alternative sequence 40 ≤ LE < 65 nt => optimal skipping band size = 140 nt.
  • Alternative sequence 65 ≤ LE < 100 nt => optimal skipping band size = 175 nt.
  • Alternative sequence 100 ≤ LE < 200 nt => optimal skipping band size = 250 nt.
  • Alternative sequence 200 ≤ LE < 300 nt => optimal skipping band size = 300 nt.
  • Alternative sequence 300 ≤ LE < 1000 nt => optimal skipping band size = 350 nt.
  • Alternative sequence LE ≥ 1000 nt => primers not designed. A three-primer strategy is recommended.
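The rules above amount to a simple lookup table, sketched here in Python. The names are hypothetical, and the sketch assumes that an alternative sequence of exactly 1000 nt falls in the "primers not designed" class, which the list does not state explicitly.

```python
# (exclusive upper bound of LE in nt, optimal skipping band size in nt)
SKIPPING_BAND_RULES = [
    (15, 100),
    (25, 110),
    (40, 120),
    (65, 140),
    (100, 175),
    (200, 250),
    (300, 300),
    (1000, 350),
]


def optimal_skipping_band(le):
    """Return the optimal skipping band size for an alternative sequence.

    le: length of the alternative sequence in nt.
    Returns None when no primers are designed (assumed for le >= 1000).
    """
    if le < 0:
        raise ValueError("sequence length cannot be negative")
    for upper_bound, band_size in SKIPPING_BAND_RULES:
        if le < upper_bound:
            return band_size
    return None


for length in (9, 27, 150, 1500):
    print(length, optimal_skipping_band(length))
```

The expected inclusion band is then the skipping band size plus the length of the alternative sequence.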


What are the quality scores (QC) in the PSI plots?

As provided by vast-tools; from the README: Quality scores, and number of corrected inclusion and exclusion reads (qual@inc,exc):

  • Score 1: Read coverage, based on actual reads (as used in Irimia et al., Cell 2014):
    • For EX: OK/LOW/VLOW: (i) ≥20/15/10 actual reads (i.e. before mappability correction) mapping to all exclusion splice junctions, OR (ii) ≥20/15/10 actual reads mapping to one of the two groups of inclusion splice junctions (upstream or downstream of the alternative exon), and ≥15/10/5 to the other group of inclusion splice junctions.
    • For EX (microexon module): OK/LOW/VLOW: (i) ≥20/15/10 actual reads mapping to the sum of exclusion splice junctions, OR (ii) ≥20/15/10 actual reads mapping to the sum of inclusion splice junctions.
    • For INT: OK/LOW/VLOW: (i) ≥20/15/10 actual reads mapping to the sum of skipping splice junctions, OR (ii) ≥20/15/10 actual reads mapping to one of the two inclusion exon-intron junctions (the 5' or 3' of the intron), and ≥15/10/5 to the other inclusion junction.
    • For ALTD and ALTA: OK/LOW/VLOW: (i) ≥40/20/10 actual reads mapping to the sum of all splice junctions involved in the specific event.
    • For any type of event: SOK: same thresholds as OK, but a total number of reads ≥100.
    • For any type of event: N: does not meet the minimum threshold (VLOW).
  • Score 2: Read coverage, based on corrected reads (similar values as per Score 1).
  • Score 3: Read coverage, based on uncorrected reads mapping only to the reference C1A, AC2 or C1C2 splice junctions (similar values as per Score 1). Always NA for intron retention events.
  • Score 4: Imbalance of reads mapping to inclusion splice junctions (only for exon skipping events quantified by the splice site-based or transcript-based modules). For intron retention events, numbers of reads mapping to the upstream exon-intron junction, downstream intron-exon junction, and exon-exon junction in the format A=B=C.
    • OK: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream of the alternative exon is < 2.
    • B1: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream of the alternative exon is > 2 but < 5.
    • B2: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream of the alternative exon is > 5.
    • Bl/Bn: low/no read coverage for splice junctions supporting inclusion.
  • Score 5: Complexity of the event (only for exon skipping events quantified by the splice site-based or transcript-based modules); For intron retention events, p-value of a binomial test of balance between reads mapping to the upstream and downstream exon-intron junctions, modified by reads mapping to a 200-bp window in the centre of the intron (see Braunschweig et al., 2014).
    • S: percent of complex reads (i.e. those inclusion- and exclusion-supporting reads that do not map to the reference C1A, AC2 or C1C2 splice junctions) is < 5%.
    • C1: percent of complex reads is > 5% but < 20%.
    • C2: percent of complex reads is > 20% but < 50%.
    • C3: percent of complex reads is > 50%.
    • NA: low coverage event.
  • inc,exc: total number of reads, corrected for mappability, supporting inclusion and exclusion.
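As a reading aid for Score 1, the thresholds for EX events can be written as a small Python function. This is a sketch of the rules as worded above, not the vast-tools implementation: the names are hypothetical, and it assumes that the "total number of reads" required for SOK is the sum of exclusion and inclusion reads.

```python
# (label, main threshold, threshold for the weaker inclusion group)
EX_THRESHOLDS = [("OK", 20, 15), ("LOW", 15, 10), ("VLOW", 10, 5)]


def ex_coverage_score(exc, inc_up, inc_down):
    """Classify read coverage (Score 1) for an EX event.

    exc: actual reads mapping to all exclusion splice junctions.
    inc_up, inc_down: actual reads mapping to the inclusion junctions
    upstream and downstream of the alternative exon.
    """
    stronger = max(inc_up, inc_down)
    weaker = min(inc_up, inc_down)
    for label, main, secondary in EX_THRESHOLDS:
        # Rule (i): enough exclusion reads, OR rule (ii): enough reads
        # in one inclusion group and a lower minimum in the other.
        if exc >= main or (stronger >= main and weaker >= secondary):
            # Assumption: SOK uses the sum of all reads as the total.
            if label == "OK" and exc + inc_up + inc_down >= 100:
                return "SOK"
            return label
    return "N"


for counts in [(60, 30, 30), (0, 20, 15), (0, 20, 14), (0, 12, 5), (5, 5, 4)]:
    print(counts, ex_coverage_score(*counts))
```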


Where do the protein structures come from and what do the different colors mean?

ENSEMBL protein isoforms including at least one of the C1, A and C2 exons for cassette exon events have been mapped to protein structures from the same gene in the Protein Data Bank (PDB) using sequence alignment. The best structural match is shown in the database, prioritizing structures containing the A exon.

For cassette exon events with no PDB hits, the structure of the longest ENSEMBL protein isoform was modeled using Phyre2 (Kelley et al. 2015).

Red residues correspond to the A exon of the event, while bright orange corresponds to the C1 exon and pale orange to the C2 exon. The rest of the protein is shown in grey in the case of structures retrieved from the PDB, and in light blue for models.


Where does the VastDB logo come from?

The image depicts an alternative exon (yellow) as the bridge between a neuron and a myocyte (red). These are two of the tissue types with the most distinctive alternative splicing signatures in the species included in the database. The image is an original design by Yamile Márquez.