Among all the steps of gene expression, alternative splicing (AS) provides perhaps the largest potential for molecular diversity and controlled regulation in the cell. Genes are transcribed into pre-mRNA molecules that require extensive processing. For most genes, this processing involves the removal of introns through splicing. Multiple molecular complexes, composed of RNA-binding proteins (RBPs), structural RNAs, and other protein factors, bind to the pre-mRNA at various locations (RNA-binding motifs) and mediate the splicing process. Moreover, different mature RNA molecules can be produced from the same pre-mRNA through the mechanism of AS. AS takes place through controlled changes in the expression and activity of the complexes acting on the regulatory sequences of the pre-mRNA, or as a consequence of alterations in these complexes and motifs. AS is therefore a critical mechanism not only in normal physiological processes, but also in multiple pathologies, including cancer.
Alternative splicing has been observed during cell differentiation, development, between cell and tissue types and between normal and tumor tissues. These changes in splicing can occur without measurable changes in gene expression. Accordingly, there are changes in the RNA molecules produced from a gene locus that could impact the function of the gene and affect cell function without necessarily leading to a significant change in the amount of RNA output from that gene locus.
Alternative splicing is usually described in terms of simple local variations of the exon-intron structures of genes. These variations are described as alternative splicing events, each of which describes a dichotomy between two possible outcomes. These are generally classified as:
There can also be variations in the first and last exon, which, although also related to splicing, can also be determined by transcriptional regulation.
The splicing pattern of an alternative splicing event is usually evaluated in terms of an inclusion level, also called
PSI (percent or proportion spliced in). This is a value from 0 to 1 that represents the fraction of RNA molecules
from that gene that include a given form of the splicing event. We will work with exon-skipping events. In this case, the PSI value
represents the proportion of RNA molecules that include the exon from all RNA molecules that include or exclude the exon.
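The definition of PSI for an exon-skipping event can be written as a short function. This is a minimal sketch: the arguments `inclusion` and `exclusion` are hypothetical abundance values (for example, summed abundances of the transcripts that include or skip the exon), not names taken from SUPPA.

```python
def psi(inclusion, exclusion):
    """PSI = inclusion / (inclusion + exclusion) for one exon-skipping event.

    `inclusion` and `exclusion` are hypothetical abundances of the RNA
    molecules that include or skip the exon. Returns NaN when neither
    form is observed, since PSI is then undefined.
    """
    total = inclusion + exclusion
    if total == 0:
        return float("nan")
    return inclusion / total


# 30 units including the exon and 10 units skipping it
print(psi(30.0, 10.0))
```

Returning NaN for unexpressed events keeps them distinguishable from events with a genuine PSI of 0.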
We have calculated PSI values for exon-skipping events using SUPPA for human (Gencode annotation v23, hg38 assembly version) for a large number of tissue samples from the GTEx project. The events are encoded in the form, e.g.:
where the event identifier contains the Ensembl gene ID, the SE to denote it is an exon skipping event, and the positions of the regulated middle exon and the adjacent exonic positions:
In particular, we have obtained two datasets, one for different compartments from brain and one from different organs and tissues (heart, kidney, liver, lung, muscle and nerve), from different individuals.
69 Amygdala
83 Anterior_Cingulate_Cortex_(Ba24)
109 Caudate_(Basal_Ganglia)
97 Cerebellar_Hemisphere
119 Cerebellum
105 Cortex
102 Frontal_Cortex_(Ba9)
84 Hippocampus
82 Hypothalamus
104 Nucleus_Accumbens_(Basal_Ganglia)
81 Putamen_(Basal_Ganglia)
60 Spinal_Cord_(Cervical_C-1)
57 Substantia_Nigra
377 Heart
28 Kidney
110 Liver
288 Lung
396 Muscle
278 Nerve
The PSI values are given in a file with a header containing all sample IDs separated by a tab:
GTEX-ZVT4-1326-SM-5NQ8E GTEX-XXEK-0626-SM-4BRWE GTEX-139YR-2526-SM-5IJC6 ...
Followed by as many lines as SE events, where the first column is the event ID and the following columns are the PSI values in the samples using the same order as the header:
ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+ 1.0 1.0 0.88 ...
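The file layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function names `parse_event_id`, `to_float` and `read_psi_table` are hypothetical, fields are assumed to be tab-separated as stated, and any entry that cannot be parsed as a number (for instance a missing value) is turned into NaN.

```python
import io
import math


def parse_event_id(event_id):
    """Split an SE event ID into gene, type, chromosome, coordinates, strand."""
    gene_id, rest = event_id.split(";", 1)
    event_type, chrom, first, second, strand = rest.split(":")
    coords = [tuple(int(x) for x in pair.split("-")) for pair in (first, second)]
    return {"gene_id": gene_id, "type": event_type, "chrom": chrom,
            "coords": coords, "strand": strand}


def to_float(text):
    """Convert a PSI field to float; unparseable entries become NaN (assumption)."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def read_psi_table(handle):
    """Return (sample_ids, {event_id: [psi, ...]}) from a tab-separated file."""
    sample_ids = next(handle).rstrip("\n").split("\t")
    psi = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        psi[fields[0]] = [to_float(x) for x in fields[1:]]
    return sample_ids, psi


# Small in-memory stand-in for the real file, built from the lines shown above
demo = io.StringIO(
    "GTEX-ZVT4-1326-SM-5NQ8E\tGTEX-XXEK-0626-SM-4BRWE\tGTEX-139YR-2526-SM-5IJC6\n"
    "ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+\t1.0\t1.0\t0.88\n"
)
samples, psi = read_psi_table(demo)
event = parse_event_id(next(iter(psi)))
print(samples[0], event["gene_id"], event["chrom"], event["coords"], event["strand"])
```

For the real data, `demo` would be replaced by an open file handle.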
The description of the samples is given in a text file without a header:
GTEX-13SLW-0011-R4b-SM-5S2W2 Brain_-_Amygdala Brain Normal_Tissue Male GTEX
GTEX-13RTJ-0011-R4b-SM-5PNX1 Brain_-_Amygdala Brain Normal_Tissue Male GTEX
..
where the order of the items is:
sample_id detailed_category primary_site sample_type gender study
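The sample description file can be loaded into a lookup table keyed by sample ID, which is what is needed to attach a class label to each column of the PSI file. The sketch below assumes whitespace-separated fields in the order listed above; `Sample` and `read_samples` are hypothetical names.

```python
import io
from collections import namedtuple

# Field order as listed in the text above
Sample = namedtuple(
    "Sample",
    "sample_id detailed_category primary_site sample_type gender study",
)


def read_samples(handle):
    """Return {sample_id: Sample} from a header-less sample description file."""
    samples = {}
    for line in handle:
        fields = line.split()
        if len(fields) != len(Sample._fields):
            continue  # skip blank or malformed lines
        record = Sample(*fields)
        samples[record.sample_id] = record
    return samples


# In-memory stand-in built from the two example lines shown above
demo_meta = io.StringIO(
    "GTEX-13SLW-0011-R4b-SM-5S2W2 Brain_-_Amygdala Brain Normal_Tissue Male GTEX\n"
    "GTEX-13RTJ-0011-R4b-SM-5PNX1 Brain_-_Amygdala Brain Normal_Tissue Male GTEX\n"
)
meta = read_samples(demo_meta)
print(meta["GTEX-13SLW-0011-R4b-SM-5S2W2"].detailed_category)
```

Which field serves as the class label depends on the dataset chosen; for the brain dataset, `detailed_category` is the one that distinguishes the regions in the example lines.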
The objective of this assignment is to build a Naive Bayes model to predict the tissue type or brain region from the pattern of splicing of the samples. You will have to choose one of these two datasets.
P(e_up|heart), P(e_down|heart)
P(e_up|liver), P(e_down|liver)
...
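One possible way to estimate these conditional probabilities for a single event is sketched below. It rests on assumptions that are not fixed by the text: an event is called "up" in a sample when its PSI exceeds a hypothetical `threshold` of 0.5, samples with undefined PSI are skipped, and a pseudocount is added so that no probability is exactly 0 or 1. The function name `estimate_up_probs` is also hypothetical.

```python
import math
from collections import defaultdict


def estimate_up_probs(psi_values, labels, threshold=0.5, pseudocount=1.0):
    """Return {label: P(e_up | label)} for one event.

    threshold and pseudocount are illustrative assumptions, not values
    prescribed by the assignment. P(e_down | label) is 1 - P(e_up | label).
    """
    up = defaultdict(float)
    total = defaultdict(float)
    for value, label in zip(psi_values, labels):
        if math.isnan(value):
            continue  # PSI undefined in this sample
        total[label] += 1
        if value > threshold:
            up[label] += 1
    return {
        label: (up[label] + pseudocount) / (total[label] + 2 * pseudocount)
        for label in total
    }


psi_values = [0.9, 0.8, 0.1, 0.2, 0.3]
labels = ["heart", "heart", "heart", "liver", "liver"]
probs = estimate_up_probs(psi_values, labels)
print(probs)
```

The pseudocount matters later on: a probability of exactly zero would make the logarithm undefined when scoring a sample.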
MI(S,A) = H(S) - H(S|A)
where the conditional entropy is calculated as (see the course slides):
H(S|A) = - [ P(heart, e_up) log2 ( P(heart, e_up) / P(e_up) ) + P(heart, e_down) log2 ( P(heart, e_down) / P(e_down) ) + P(liver, e_up) log2 ( P(liver, e_up) / P(e_up) ) + P(liver, e_down) log2 ( P(liver, e_down) / P(e_down) ) + ... ]
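The mutual information of one event can be computed directly from paired lists of tissue labels and event states. This is a minimal sketch: the function name `mutual_information` is hypothetical, probabilities are estimated as plain frequencies, and the conditional entropy is taken with a leading minus sign, H(S|A) = - sum over (s, a) of P(s, a) log2( P(s, a) / P(a) ), so that entropies are non-negative.

```python
import math
from collections import Counter


def mutual_information(labels, states):
    """MI(S, A) = H(S) - H(S|A), using frequency estimates and log base 2."""
    n = len(labels)
    joint = Counter(zip(labels, states))   # counts of (tissue, state) pairs
    count_s = Counter(labels)              # counts per tissue
    count_a = Counter(states)              # counts per event state
    h_s = -sum((c / n) * math.log2(c / n) for c in count_s.values())
    h_s_given_a = -sum(
        (c / n) * math.log2(c / count_a[a]) for (s, a), c in joint.items()
    )
    return h_s - h_s_given_a


# An event whose state determines the tissue: one bit of information
informative = mutual_information(
    ["heart", "heart", "liver", "liver"], ["up", "up", "down", "down"]
)
# An event whose state is independent of the tissue: no information
uninformative = mutual_information(
    ["heart", "heart", "liver", "liver"], ["up", "down", "up", "down"]
)
print(informative, uninformative)
```

Only observed (tissue, state) pairs enter the sum, so terms with zero probability are skipped, following the usual convention that 0 log 0 = 0.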
score    prediction  label   sample
-20.04   heart       heart   id1
-30.03   heart       kidney  id2
-21.32   liver       liver   id3
...
Since we are going to multiply many probabilities, the result will in general be a very small number. This can become a problem because floating-point numbers have limited precision and range, so the product can underflow to zero. A solution is to work with the logarithms of the probabilities: the products become sums, and the maximization procedure to select the best hypothesis remains the same, since the logarithm is monotonically increasing.
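The log-space scoring can be sketched as follows. The data layout is an assumption made for illustration: `model[tissue][event]` holds P(e_up | tissue), `priors[tissue]` holds P(tissue), and a sample is a dictionary mapping each event to "up" or "down". The names `log_score` and `classify` are hypothetical.

```python
import math


def log_score(states, up_probs, prior):
    """log P(tissue) + sum over events of log P(state | tissue)."""
    score = math.log(prior)
    for event, state in states.items():
        p_up = up_probs[event]
        score += math.log(p_up if state == "up" else 1.0 - p_up)
    return score


def classify(states, model, priors):
    """Return (best_tissue, best_log_score) for one sample."""
    scores = {t: log_score(states, model[t], priors[t]) for t in model}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Why logs are needed: 200 factors of 0.01 underflow to zero as a product
product = 1.0
for _ in range(200):
    product *= 0.01
log_sum = 200 * math.log(0.01)  # the same quantity in log space stays finite

# Toy model with two events and two tissues (made-up probabilities)
model = {
    "heart": {"e1": 0.9, "e2": 0.2},
    "liver": {"e1": 0.1, "e2": 0.8},
}
priors = {"heart": 0.5, "liver": 0.5}
sample = {"e1": "up", "e2": "down"}
prediction, score = classify(sample, model, priors)
print(product, log_sum)
print(f"{score:.2f}\t{prediction}")
```

The probabilities in `model` must lie strictly between 0 and 1 for the logarithms to be defined, which is one reason to use pseudocounts when estimating them.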
Alamancos et al. (2015). Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA, 21(9), 1521-1531.
GTEx Consortium. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648-660.
Barbosa-Morais et al. (2012). The evolutionary landscape of alternative splicing in vertebrate species. Science, 338(6114), 1587-1593.
Merkin et al. (2012). Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science, 338(6114), 1593-1599.
Wang et al. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470-476.
Pan et al. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40(12), 1413-1415.
Yeo, G., Holste, D., Kreiman, G., & Burge, C. B. (2004). Variation in alternative splicing across human tissues. Genome biology, 5(10), R74.