Advanced Genome Bioinformatics

HMM Assignment

Modeling 5' splice-site selection with a hidden Markov model


This project consists in building a hidden Markov model (HMM) of a 5' splice-site (donor site) You will implement a simple model and introduce various extensions corresponding to

The goal of this exercise is to implement a generic Viterbi algorithm that is valid for all variations of the model, to train some of the parameters of the model with real data and to perform an accuracy evaluation.

The simplest model for a 5'ss only considers the position of the splice-site (see Eddy Nat. Biotech 2004):


5' splice sites with the greatest sequence complementarity to the U1 snRNP molecule are more frequently selected in experimental assays. We can thus define a consensus from those that are often selected by the U1 snRNP and those stretches of nucleotides whose composition is more similar to the consensus will be more likely to be real splice sites:


Besides the 5'ss, 3'ss, BP and PPT signals, there can be other signals: splicing enhancers and silencers. They could occur in exons and introns, and they are binding sites of RNA binding proteins (RBPs) that facilitate (enhancer) or repress (silencer) the recognition of splice sites. These regulators can bind on exons and facilitate the recognition of splice sites and branch points with weak signals. These could help in the prediction of branch points that have a signal different from the consensus.

The U1 small nuclear ribonucleoprotein (U1 snRNP) binds to the pre-mRNA 5' splice site at early stages of spliceosome assembly. Recruitment of the U1 snRNP to weak 5' splice-sites can be promoted by the binding of the splicing factor TIA1 to Uridine-rich sequences downstream of the 5' splice-site (Forch et al. 2002). This is depicted below (Figures from Forch et al. 2002 and Izquierdo et al. 2005):

Image tia1u1u2af Image tia1
(a) (b)

where in (b) the numbers 1, 2 and 3 denote the RNA recognition motifs (RRMs) in the TIA1 protein from which 2 and 3 recognize a uridine(U)-rich stretch downstream of the 5' splice site sequence. Although the U-rich region is depicted immediately downstream of the splice site, this can also occur further downstream.

Other auxiliary factors can bind to the exon. For instance, SR proteins are a family of proteins that binding RNA, frequently on exons, and work as auxiliary factors in the splicing reaction. SR proteins can bind to exonic elements called exonic splicing enhancers (ESEs) and recruit U2AF35 to an upstream 3′ ss or U1-70K to the downstream 5′ ss, thereby facilitating the splicing reaction (Image from Long and Caceres 2009):

Image SR-proteins Image SR-proteins
(a) (b)

SR proteins bound to ESEs can also promote splice-site selection by antagonizing the action of the splicing repressors (often hnRNP proteins) bound to exonic splicing silencers (ESSs) or introni splicing silencers (ISSs), which block 3′ss or 5'ss selection. SR proteins can facilitate intron bridging interactions by binding, via the RS domain, to U170K and U2AF35, thereby juxtaposing the 5′ss and 3′ss. ESEs are often defined by GA-rich motifs (see Fairbrother et al. 2002).


These are the main objectives:

  1. Implement the Viterbi algorithm and test it with the toy donor splice site model.


  2. Change the model into a donor site (5' splice site) model that considers the binding of the U1 snRNP, by extending the number of states that describe the exon-intron boundary. Use an entropy-based measure (Information Content, Information Gain, Mutual Information) to determine how many positions you should used for your model. Estimate the emission probabilities for these new positions using the training data set. The Viterbi algorithm should be the same, you should only change the model.


  3. Incorporate into the previous model a state describing the presence of a binding site for a splicing regulator (TIA1) binding to a Uridine-rich motid downstream of the donor site. This extension can be a single TIA1 state emitting mostly T's, or two or more TIA1 states emitting mostly T's and connected between themselves.


  4. Incorporate into the previous model a state describing the presence of a binding site for an SR-protein site (an AG-rich sequence) on the exon upstream of the donor site. As before, this extension could be a single or an array of 2 or more states emitting AG-rich sequences.


  5. Make an assesment of the performance of the model using accuracy measures, separating the data into training and testing. Do you find any improvements between models? The accuracy measures could be based on:

Datasets: The data file contains 121129 5'ss extracted from the human annotation. Each line contains 20nt from the exon and 100nt from the intron. That is, the 5'ss is at position 21 of each sequence. You will have to split this file for training and testing. If the file is too large, you can work with a random subset of the splice-sites.

The Viterbi algorithm

You should implement the Viterbi algorithm such that given an input DNA sequence, it calculates the optimal path of states (the sequence of hidden states) that maximizes the joint probability of the model and the DNA sequence. Try to do it first with the DNA sequence of Eddy's article to test whether it works:



A possible data structure in PERL to represent this model could be as follows:

my %hmm = ( states    => ["begin", "exon", "donor", "intron"],
            emp       => [],
            trp       => [],
            initstate =>  0  );
$hmm{emp} = [ [0.00, 0.00, 0.00, 0.00],   # begin is a silent state
              [0.25, 0.25, 0.25, 0.25],   # exon nuc are equiprobable
              [0.05, 0.00, 0.95, 0.00],   # donor sites are mostly G
              [0.40, 0.10, 0.10, 0.40] ]; # introns are A-T rich
$hmm{trp} = [ [0.0, 1.0, 0.0, 0.0],       # outgoing from begin
              [0.0, 0.9, 0.1, 0.0],       # outgoing from exon
              [0.0, 0.0, 0.0, 1.0],       # outgoing from donor
              [0.0, 0.0, 0.0, 1.0] ];     # outgoing from intron

where the model is a hash with a vector of states, a matrix of emission probabilities (emp) and a matrix of transition probabilities (trp). For simplicity, we have neglected to include the "end" state in the model, which is not strictly necessary in this case. Accordingly, there is only one single transition going out from the intron state which goes into itself, thus with probability 1. The keys contain the following elements:


R. Durbin, S.R. Eddy, A. Krogh and G. Mitchinson. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. (available at the UPF library)

S.R. Eddy. What is a hidden Markov model ? Nat Biotech, 22(10):1315-1316, 2004.

Fairbrother WG, Yeh RF, Sharp PA, Burge CB. Predictive identification of exonic splicing enhancers in human genes. Science. 2002 Aug 9;297(5583):1007-13.

P. Forch, et al. The splicing regulator TIA-1 interacts with U1-C to promote U1 snRNP recruitment to 5' splice sites. EMBO J, 21(24):6882-6892.

Long JC, Caceres JF. The SR protein family of splicing factors: master regulators of gene expression. Biochem J. 2009 Jan 1;417(1):15-27.

J.M. Izquierdo, et al Regulation of Fas alternative splicing by antagonistic effects of TIA-1 and PTB on exon definition. Mol Cell, 19:475-484, 2005.