Advanced Genome Bioinformatics

Naive-Bayes Assignment

Modeling tissue types with a Naive-Bayes model


Among all the steps of gene expression, alternative splicing (AS) provides perhaps the largest potential for molecular diversity and controlled regulation in the cell. Genes are transcribed into pre-mRNA molecules that require extensive processing. For most genes, this processing involves the removal of introns through splicing. Multiple molecular complexes, composed of RNA-binding proteins (RBPs), structural RNAs, and other protein factors, bind to the pre-mRNA at various locations (RNA-binding motifs) and mediate the splicing process. Through AS, different mature RNA molecules can be produced from the same pre-mRNA. AS arises from controlled changes in the expression and activity of the complexes that act on the regulatory sequences of the pre-mRNA, or as a consequence of alterations in these complexes and motifs. AS is therefore a critical mechanism not only in normal physiological processes, but also in multiple pathologies, including cancer.

\includegraphics{alternative splicing}

Changes in alternative splicing have been observed during cell differentiation and development, between cell and tissue types, and between normal and tumor tissues. These changes can occur without measurable changes in gene expression. That is, the RNA molecules produced from a gene locus can change in ways that alter the function of the gene and affect cell function, without a significant change in the total amount of RNA produced from that locus.

Alternative splicing is usually described in terms of simple local variations of the exon-intron structures of genes. These variations are called alternative splicing events, each of which describes a dichotomy between two possible outcomes. They are generally classified as:

\includegraphics{alternative splicing events}

There can also be variations in the first and last exons which, although related to splicing, can also be determined by transcriptional regulation.

The splicing pattern of an alternative splicing event is usually evaluated in terms of an inclusion level, also called PSI (percent or proportion spliced in). This is a value from 0 to 1 that represents the fraction of RNA molecules from that gene that include a given form of the splicing event. We will work with exon-skipping events. In this case, the PSI value is the proportion of RNA molecules that include the exon, out of all RNA molecules that either include or exclude it.
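The definition of PSI for an exon-skipping event can be written as a one-line ratio. The following is a minimal sketch: the function name `psi` and the idea of passing two abundance values (for example, summed transcript abundances of the including and excluding isoforms) are assumptions for illustration, not a description of how SUPPA is implemented.

```python
import math


def psi(inclusion: float, exclusion: float) -> float:
    """Proportion spliced in: inclusion / (inclusion + exclusion).

    `inclusion` and `exclusion` are hypothetical abundance values for the
    RNA molecules that include or skip the exon, respectively.
    """
    total = inclusion + exclusion
    if total == 0:
        # PSI is undefined when neither form is observed.
        return math.nan
    return inclusion / total


print(psi(30.0, 10.0))  # 0.75
```

An event with no observed molecules has no defined PSI, which is one plausible origin of the missing values mentioned later in the assignment.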


We have calculated PSI values for exon-skipping events using SUPPA for human (GENCODE annotation v23, hg38 assembly) for a large number of tissue samples from the GTEx project. The events are encoded in the form, e.g.:

ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+


where the event identifier contains the Ensembl gene ID, the tag SE to denote an exon-skipping event, and the positions of the regulated middle exon and the adjacent exonic positions:

\includegraphics{exon skipping event}
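An event identifier can be split into its components with plain string operations. The sketch below assumes that the two coordinate pairs follow the order e1-s2:e2-s3 (end of the upstream exon, start and end of the skipped exon, start of the downstream exon); please check this against the figure above. The names `SEEvent` and `parse_se_event` are hypothetical.

```python
from typing import NamedTuple


class SEEvent(NamedTuple):
    gene_id: str
    chrom: str
    upstream_exon_end: int      # e1 (assumed meaning)
    exon_start: int             # s2, start of the skipped exon (assumed)
    exon_end: int               # e2, end of the skipped exon (assumed)
    downstream_exon_start: int  # s3 (assumed meaning)
    strand: str


def parse_se_event(event_id: str) -> SEEvent:
    """Parse an ID such as 'GENE;SE:chr:e1-s2:e2-s3:strand'."""
    gene_id, rest = event_id.split(";", 1)
    event_type, chrom, first, second, strand = rest.split(":")
    if event_type != "SE":
        raise ValueError(f"not an exon-skipping event: {event_id}")
    e1, s2 = (int(x) for x in first.split("-"))
    e2, s3 = (int(x) for x in second.split("-"))
    return SEEvent(gene_id, chrom, e1, s2, e2, s3, strand)


event = parse_se_event(
    "ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+"
)
print(event.gene_id, event.chrom, event.exon_start, event.exon_end, event.strand)
```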

In particular, we have obtained two datasets, one for different compartments from brain and one from different organs and tissues (heart, kidney, liver, lung, muscle and nerve), from different individuals.

Brain dataset

Multi-organ dataset

The PSI values are given in a file with a header containing all sample IDs separated by a tab:

GTEX-ZVT4-1326-SM-5NQ8E  GTEX-XXEK-0626-SM-4BRWE GTEX-139YR-2526-SM-5IJC6 ...

Followed by as many lines as SE events, where the first column is the event ID and the following columns are the PSI values in the samples using the same order as the header:

ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+      1.0       1.0       0.88  ...
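The PSI file can be read with the standard library alone. Note that the header has one field fewer than the data lines, because it lists only the sample IDs. The sketch below is one possible reader: the function names are hypothetical, and treating both `NA` and `nan` as missing values is an assumption about how missing PSI values are written in the file.

```python
import io
import math


def parse_psi(text):
    """Convert one field to a float, or None if it is missing."""
    try:
        value = float(text)
    except ValueError:  # e.g. "NA"
        return None
    return None if math.isnan(value) else value


def read_psi_table(handle):
    """Return (sample_ids, {event_id: [psi or None, ...]})."""
    sample_ids = handle.readline().split()
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        event_id, values = fields[0], fields[1:]
        if len(values) != len(sample_ids):
            raise ValueError(f"wrong number of columns for {event_id}")
        table[event_id] = [parse_psi(v) for v in values]
    return sample_ids, table


# Tiny in-memory example; a real call would pass an open file instead.
demo = io.StringIO(
    "GTEX-ZVT4-1326-SM-5NQ8E\tGTEX-XXEK-0626-SM-4BRWE\tGTEX-139YR-2526-SM-5IJC6\n"
    "ENSG00000003509.15;SE:chr2:37231760-37232106:37232266-37236096:+\t1.0\tNA\t0.88\n"
)
samples, psi_table = read_psi_table(demo)
print(samples)
print(psi_table)
```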

The description of the samples is given in a text file without a header:

GTEX-13SLW-0011-R4b-SM-5S2W2    Brain_-_Amygdala        Brain   Normal_Tissue   Male    GTEX
GTEX-13RTJ-0011-R4b-SM-5PNX1    Brain_-_Amygdala        Brain   Normal_Tissue   Male    GTEX

where the order of the items is:

sample_id  detailed_category    primary_site   sample_type    gender study
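The sample description file can be loaded into a dictionary keyed by sample ID, so that each column of the PSI table can be matched to its class label. This is a sketch with hypothetical names (`FIELDS`, `read_sample_table`). Which column to use as the class is an assumption: presumably `detailed_category` for the brain dataset and `primary_site` for the multi-organ dataset, but check the actual files.

```python
import io

# Column order as given in the assignment text.
FIELDS = (
    "sample_id",
    "detailed_category",
    "primary_site",
    "sample_type",
    "gender",
    "study",
)


def read_sample_table(handle):
    """Return {sample_id: {field_name: value}} from a header-less file."""
    samples = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns: {line!r}")
        record = dict(zip(FIELDS, values))
        samples[record["sample_id"]] = record
    return samples


# Tiny in-memory example; a real call would pass an open file instead.
demo = io.StringIO(
    "GTEX-13SLW-0011-R4b-SM-5S2W2\tBrain_-_Amygdala\tBrain\tNormal_Tissue\tMale\tGTEX\n"
    "GTEX-13RTJ-0011-R4b-SM-5PNX1\tBrain_-_Amygdala\tBrain\tNormal_Tissue\tMale\tGTEX\n"
)
sample_info = read_sample_table(demo)
print(sample_info["GTEX-13SLW-0011-R4b-SM-5S2W2"]["detailed_category"])
```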


The objective of this assignment is to build a Naive Bayes model to predict the tissue type or brain region from the pattern of splicing of the samples. You will have to choose one of these two datasets.

  1. The class values for the classification will be the brain region (amygdala, etc.) in the first set, or the tissue type (heart, etc.) in the second.
  2. Separate the datasets into sets for training and testing. A subset of samples will be used for training and a different subset of samples for testing. Make sure that you use the same number of samples from each tissue type or brain region for training. For testing, balanced sets are relevant to build proper ROC curves, but in general you can evaluate the accuracy on the rest of the samples.
  3. The attributes that describe each sample will be the PSI values of the SE events. We will discretize the PSI values into two possible values: up if PSI > 0.5 and down if PSI < 0.5. You can discard the cases that do not have a number assigned, i.e. NA.
  4. For each class (brain region / tissue type), and for each attribute (SE event), you will have to measure the likelihoods (proportions) of each value up/down in each class. For instance, for event e, we will measure:
    P(e_up|heart), P(e_down|heart) 
    P(e_up|liver), P(e_down|liver)

  5. Using as attributes the discretized inclusion of events, and using Mutual Information (Information Gain) on the training set, determine which attributes are the most informative to separate between the tissue types (or brain regions). The Mutual Information provides a single value per event, which gives a sense of how well the discretized PSI (attribute value) is associated with the tissue types (class values). Recall that for a given set of classification values S and an attribute A (event), the Mutual Information is defined as:
    MI(S,A) = H(S) - H(S|A)

    where H(S) is the entropy of the class values and the conditional entropy H(S|A) is calculated as (see the course slides):

    H(S|A) = - [ P(heart, e_up) log2 ( P(heart, e_up) / P(e_up) )  +  P(heart, e_down) log2 ( P(heart, e_down) / P(e_down) )
               + P(liver, e_up) log2 ( P(liver, e_up) / P(e_up) )  +  P(liver, e_down) log2 ( P(liver, e_down) / P(e_down) )
               + ... ]

  6. Using the best predictive attributes, build a Naive Bayes model with the training set and use it to predict the tissue type / brain region on the testing set. The output of the program should be the resulting classification for each test case using the Naive Bayes classifier, together with a score and the real label. Remember that the scores can be transformed into a probability. The output should be of the form, e.g.:
    score    prediction   label   sample
    -20.04   heart        heart    id1
    -30.03   heart        kidney   id2
    -21.32   liver        liver    id3
  7. Consider the use of pseudocounts and discuss whether they are necessary or not.
  8. Determine the accuracy of the model by computing the coincidences and the discrepancies between the predictions and the actual labels. You can calculate the following quantities:

  9. Discuss the choice of a score (or probability) cut-off to select your predictions and thereby reduce the number of false positives. Can you find an optimal cut-off?
  10. Discuss whether this is a good classifier or not. Can you propose a way to improve the classifier?
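Steps 3 to 5 above (discretization, likelihood estimation, and Mutual Information) can be sketched for a single event as follows. This is an illustrative sketch, not a reference solution: all function names are hypothetical, a PSI of exactly 0.5 is treated as missing because the assignment does not define that case, and the default pseudocount of 1 is only one possible choice. The Mutual Information is computed over the samples with a defined value for the event.

```python
import math
from collections import Counter


def discretize(psi):
    """Map PSI to 'up' or 'down'; None for missing values.

    Assumption: PSI == 0.5 is treated as missing (not defined in the text).
    """
    if psi is None or psi == 0.5:
        return None
    return "up" if psi > 0.5 else "down"


def likelihoods(values, labels, pseudocount=1.0):
    """Estimate P(value | class) for one event with additive smoothing.

    With pseudocount=0 a class without observations would divide by zero.
    """
    counts = {c: Counter() for c in set(labels)}
    for value, label in zip(values, labels):
        if value is not None:
            counts[label][value] += 1
    table = {}
    for label, cnt in counts.items():
        total = cnt["up"] + cnt["down"] + 2 * pseudocount
        table[label] = {
            state: (cnt[state] + pseudocount) / total
            for state in ("up", "down")
        }
    return table


def entropy(probs):
    """Shannon entropy in bits, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(values, labels):
    """MI(S, A) = H(S) - H(S|A) for one discretized event."""
    pairs = [(c, v) for v, c in zip(values, labels) if v is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    class_counts = Counter(c for c, _ in pairs)
    value_counts = Counter(v for _, v in pairs)
    h_s = entropy(k / n for k in class_counts.values())
    # P(c, v) / P(v) simplifies to count(c, v) / count(v).
    h_s_given_a = -sum(
        (k / n) * math.log2(k / value_counts[v]) for (_, v), k in joint.items()
    )
    return h_s - h_s_given_a


labels = ["heart"] * 4 + ["liver"] * 4
informative = [discretize(p) for p in (0.9, 0.8, 0.95, 0.7, 0.1, 0.2, 0.05, 0.3)]
uninformative = [discretize(p) for p in (0.9, 0.1, 0.8, 0.2, 0.9, 0.1, 0.8, 0.2)]
print(mutual_information(informative, labels))    # 1.0
print(mutual_information(uninformative, labels))  # 0.0
print(likelihoods(informative, labels)["heart"])
```

In the toy data, the first event is always up in heart and always down in liver. Without a pseudocount, P(e_down|heart) would be exactly 0, and a single down observation in a heart test sample would rule out the heart class regardless of all other events; this is the situation that step 7 asks you to discuss.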


Since we are going to multiply many probabilities, we will in general obtain very small numbers. This can become a problem because floating-point numbers have a limited range and precision, so the product can underflow to zero. A solution to this problem is to work with the logarithm of the probabilities. The products become sums, and because the logarithm is a monotonically increasing function, the maximization procedure to select the best hypothesis remains the same.
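The following sketch illustrates both the underflow problem and a log-space Naive Bayes prediction. It assumes a model stored as a nested dictionary `model[event][class][value]` holding the likelihoods P(value | class), plus class priors; these structures, the function names, and the toy numbers are hypothetical. The log scores are converted to posterior probabilities by subtracting the maximum score before exponentiating, which avoids underflow in the normalization.

```python
import math

# A product of 2000 probabilities of 0.1 underflows to exactly 0.0,
# while the sum of their logarithms is a perfectly ordinary number.
underflow = math.prod([0.1] * 2000)
log_sum = sum(math.log(0.1) for _ in range(2000))


def log_scores(sample, model, priors):
    """Return {class: log P(class) + sum over events of log P(value | class)}."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for event, value in sample.items():
            if value is None:
                continue  # skip events with missing PSI in this sample
            score += math.log(model[event][label][value])
        scores[label] = score
    return scores


def posteriors(scores):
    """Normalize log scores into probabilities (log-sum-exp trick)."""
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def predict(sample, model, priors):
    """Return (best class, its log score, its posterior probability)."""
    scores = log_scores(sample, model, priors)
    best = max(scores, key=scores.get)
    return best, scores[best], posteriors(scores)[best]


# Toy model with two events and two classes (made-up likelihoods).
model = {
    "e1": {"heart": {"up": 0.9, "down": 0.1}, "liver": {"up": 0.2, "down": 0.8}},
    "e2": {"heart": {"up": 0.7, "down": 0.3}, "liver": {"up": 0.1, "down": 0.9}},
}
priors = {"heart": 0.5, "liver": 0.5}
sample = {"e1": "up", "e2": "up"}

label, score, prob = predict(sample, model, priors)
print(f"{score:.2f}\t{label}\t{prob:.3f}")
```

Here the score is the natural logarithm of the unnormalized posterior, so more negative values indicate less likely assignments; the base of the logarithm does not affect which class is selected.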


Alamancos et al. (2015). Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA, 21(9), 1521-1531.

GTEx Consortium. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648-660.

Barbosa-Morais et al. (2012). The evolutionary landscape of alternative splicing in vertebrate species. Science, 338(6114), 1587-1593.

Merkin et al. (2012). Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science, 338(6114), 1593-1599.

Wang et al. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470-476.

Pan et al. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40(12), 1413-1415.

Yeo, G., Holste, D., Kreiman, G., & Burge, C. B. (2004). Variation in alternative splicing across human tissues. Genome biology, 5(10), R74.