Advanced Genome Bioinformatics

HMM Assignment

Modeling U12 Branch-Point selection with a hidden Markov model


This project consists in the development of a hidden Markov model (HMM) to predict Branch-Point (BP) in introns of the type U12. You will implement a simple model and introduce various extensions corresponding to:

The goal of this assignment is to implement a generic Viterbi algorithm that is valid for all variations of the model, to train some of the parameters of the model with real data and to perform an accuracy evaluation.

There are Two different intron types in eukaryotes, each processed by their own dedicated spliceosome: the U2-type (or major) introns, which constitute the majority of introns (99%), and the U12-type (or minor) introns, which constitute about 1% of all introns in the human genome. The Branch Point (BP) is an essential splicing regulatory signal for the recognition of introns and splice-sites. Importantly, the splice-sites of U2-introns are usually recognized independently, whereas in the U12 introns, both splice-sites are recognised together.


U2 and U12 introns have different mechanisms of recognition and therefore different BP consensus sequences.


U12 introns show a stronger BP consensus and a shorter (almost non-existent) Poly-pyrimidine tract (PPT) and a shorter distance between the BP and the 3'ss. In contrast, U2 BPs have a weaker consensus. The distance to the 3'ss can be longer, but most of the U2 BPs lie between 15nt and 50nt upstream of the 3'ss.

There are two types of U12 introns: those of type GT-AG (sometimes also GC-AG) and those of type AT-AC.


Both U12 intron types (GT-AG or AT-AC) have similar BP consensus and weak PPT signal. The 5' splice-site is also quite different from the U2 introns.

Besides the 5'ss, 3'ss, BP and PPT signals, there can be other signals: splicing enhancers and silencers. They could occur in exons and introns, and they are binding sites of RNA binding proteins (RBPs) that facilitate (enhancer) or repress (silencer) the recognition of splice sites.

Image SR-proteins Image SR-proteins
(a) (b)

In particular, these regulators can bind on exons and facilitate the recognition of splice sites and branch points with weak signals. These could help in the prediction of branch points that have a signal different from the consensus. U12 introns usually have different auxiliary factors compared to U2 introns.



These are the main objectives:

  1. The aim of the assignment is to build an HMM model of the U12 branch point and implement the Viterbi algorithm to predict BPs on new sequences. Since in U12 introns the 5'ss and 3'ss are recognized together, we will build a model that includes both sites besides the BP.
  2. Start by implementing Viterbi and testing it with the simplest model:

    \includegraphics{U2 BP model 1}

  3. Change the model by extending the number of states that describe the branch point and the intron-exon boundaries (5'ss and 3'ss). Use an entropy-based measure (Information Content, Information Gain, Mutual Information) to determine how many positions you should use for your model. The Viterbi algorithm should be the same, you should only change the model.

    \includegraphics{U2 BP model 2}

  4. Additionally, incorporate into the previous model a state describing the possible presence of a splicing enhancer on the exon.
  5. There are two types of U12 introns: AT-AC and GT-AG/GC-AG. Accordingly, you should consider both possibilites.
  6. For each model, you should separate the data into training and testing, and make an assesment of the performance of the model using accuracy measures. Do you find any improvements between models? The accuracy measures could be based on:
A possible data structure in PERL to represent the simplest model could be as follows:

my %hmm = ( states    => ["begin", "exon1", "G", "T", "intron", "bp", "ppt", "A", "G", "exon2"],
            emp       => [],
            trp       => [],
            initstate =>  0  );
# order is horzontally: A, C, G, T, 
# vertically: begin, exon, G, T, intron, bp, ppt, A, G, exon
$hmm{emp} = [ [0.00, 0.00, 0.00, 0.00],    # begin is a silent state
              [0.20, 0.30, 0.30, 0.20],    # exon (G+C rich)
              [0.00, 0.00, 1.00, 0.00],    # G
              [0.00, 0.30, 0.00, 0.70],    # T/C
              [0.25, 0.25, 0.25, 0.25],    # here you assume intron is uniform
              [1.00, 0.00, 0.00, 0.00],    # BP is one single position here
              [0.00, 0.55, 0.10, 0.35],    # PPT is TC-rich but allows G
              [1.00, 0.00, 0.00, 0.00],    # A
              [0.00, 0.00, 1.00, 0.00],    # G
              [0.20, 0.30, 0.30, 0.20] ];  # exon (G+C rich)

# symmetric matrix for the transitions between states
$hmm{trp} = [ [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],       # from begin jumps to exon
              [0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],       # outgoing from exon
              [0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],       # outgoing from G
              [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0],       # outgoing from T
              [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],       # outgoing from BP A
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0],       # outgoing from PPT
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],       # outgoing from A
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],       # outgoing from G
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] ];     # outgoing from exon

The model is written as a hash with a vector of states, a matrix of emission probabilities (emp) and a matrix of transition probabilities (trp). For simplicity, we have neglected to include the "end" state in the model, which is not strictly necessary in this case. Accordingly, there is only one single transition going out from the exon state which goes into itself, thus with probability 1. The keys contain the following elements:

Datasets: The data file contains 846 U12 introns with the positions of the intron in the human genome, the full intron sequence with 20nt extra on either side, and the position of the BP A (distance from the 3'ss):

where the order is

chromosome name
start of intron
end of intron
Position of the BP A ( distance from the 3'ss)
Sequence: 20nt + intron + 20nt


Turunen JJ, Niemela EH, Verma B, Frilander MJ. The significant other: splicing by the minor spliceosome. Wiley Interdiscip Rev RNA. 2013 Jan-Feb;4(1):61-76. PMID: 23074130

Levine A, Durbin R. A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. 2001 Oct 1;29(19):4006-13. PMID: 11574683

Corvelo A, Hallegger M, Smith CW, Eyras E. Genome-wide association between branch point properties and alternative splicing. PLoS Comput Biol. 2010 Nov 24;6(11):e1001016.PMID: 21124863

Gooding C, Clark F, Wollerton MC, Grellscheid SN, Groom H, Smith CW. A class of human exons with predicted distant branch points revealed by analysis of AG dinucleotide exclusion zones. Genome Biol. 2006;7(1):R1. PMID: 16507133

Yeo GW, Van Nostrand EL, Liang TY. Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet. 2007 May 25;3(5):e85. Erratum in: PLoS Genet. 2007 Jul;3(7):e122. PMID: 17530930

Long JC, Caceres JF. The SR protein family of splicing factors: master regulators of gene expression. Biochem J. 2009 Jan 1;417(1):15-27.

S.R. Eddy. What is a hidden Markov model ? Nat Biotech, 22(10):1315-1316, 2004.