Advanced Genome Bioinformatics

HMM Assignment


Modeling U2 Branch-Point selection with a hidden Markov model

Introduction

This project consists in the development of a hidden Markov model (HMM) to predict the Branch-Point (BP) in U2 introns. You will implement a simple model and introduce various extensions corresponding to:

The goal of this assignment is to implement a generic Viterbi algorithm that is valid for all variations of the model, to train some of the parameters of the model with real data and to perform an accuracy evaluation.

There are two different intron types in eukaryotes, each processed by their own dedicated spliceosome: the U2-type (or major) introns, which constitute the majority of introns (99%), and the U12-type (or minor) introns, which constitute about 1% of all introns in the human genome. The Branch Point (BP) is an essential splicing regulatory signal for the recognition of introns and splice-sites. Importantly, the splice-sites in U2-introns are usually recognized independently of each other, whereas the both splice-sites are recognised together.

\includegraphics{U2_and_U12_introns}

U2 and U12 introns have different mechanisms of recognition and therefore different BP consensus sequences.

\includegraphics{U2_and_U12_signals}

U2 introns are generally GT-AG, although sometimes can be GC-AG. Most of the U2 BPs are between 15nt and 80nt upstream of the 3'ss. They can have a variable position, as short as 6nt upstream and as far as >300nt upstream. Also, there are generally multiple BPs per U2 intron. Downstream of the U2 BP there is a PPT that can be quite long.

Besides the 5'ss, 3'ss, BP and PPT signals, there can be other signals: splicing enhancers and silencers. They could occur in exons and introns, and they are binding sites of RNA binding proteins (RBPs) that facilitate (enhancer) or repress (silencer) the recognition of splice sites.


Image SR-proteins Image SR-proteins
(a) (b)

In particular, these regulators can bind on exons and facilitate the recognition of splice sites and branch points with weak signals. These could help in the prediction of branch points that have a signal different from the consensus.


\includegraphics{Regulators}

Objectives

These are the main objectives:

  1. The aim of the assignment is to build an HMM model of the U2 branch point and implement the Viterbi algorithm to predict BPs on new sequences.
  2. Start by implementing the Viterbi algorithm and testing it with the simplest model:


    \includegraphics{U2 BP model 1}

  3. Change the model by extending the number of states that describe the branch point and the intron-exon boundary (3'ss). Use an entropy-based measure (Information Content, Information Gain, Mutual Information) to determine how many positions you should used for your model. to determine how many positions should you use for these positions. The Viterbi algorithm should be the same, you should only change the model.


    \includegraphics{U2 BP model 2}

  4. Incorporate into the previous model a state describing the presence of a splicing enhancer on the exon. Additionally, add one or more states to model spacers: regions between motifs. This will allow to model longer regions and strong motifs surrounded by uniform distributions of nucleotides:


    \includegraphics{U2 BP model 2}

  5. For each model, you should separate the data into training and testing, and make an assesment of the performance of the model using accuracy measures. Do you find any improvements between models? The accuracy measures could be based on:

A possible data structure in PERL to represent the simplest model could be as follows:

my %hmm = ( states    => ["begin", "intron", "bp", "ppt", "A", "G", "exon"],
            emp       => [],
            trp       => [],
            initstate =>  0  );
 
# order is horzontally: A, C, G, T, 
# vertically: begin, intron, bp, ppt, A, G
$hmm{emp} = [ [0.00, 0.00, 0.00, 0.00],    # begin is a silent state
              [0.25, 0.25, 0.25, 0.25],    # here you assume intron is uniform
              [1.00, 0.00, 0.00, 0.00],    # BP is one single position here
              [0.00, 0.55, 0.10, 0.35],    # PPT is TC-rich but allows G
              [1.00, 0.00, 0.00, 0.00],    # A
              [0.00, 0.00, 1.00, 0.00],    # G
              [0.20, 0.30, 0.30, 0.20] ]; # exons are GC-rich

# symmetric matrix for the transitions between states
$hmm{trp} = [ [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],       # from begin jumps to intron
              [0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0],       # outgoing from intron
              [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],       # outgoing from BP A
              [0.0, 0.0, 0.0, 8.0, 2.0, 0.0, 0.0],       # outgoing from PPT
              [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],       # outgoing from A
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],       # outgoing from G
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] ];     # outgoing from exon

The model is written as a hash with a vector of states, a matrix of emission probabilities (emp) and a matrix of transition probabilities (trp). For simplicity, we have neglected to include the "end" state in the model, which is not strictly necessary in this case. Accordingly, there is only one single transition going out from the exon state which goes into itself, thus with probability 1. The keys contain the following elements:

Datasets: The data file contains a set of introns with experimentally validated BPs. Each line has 100nt from the intron and 20nt from the exon.

chr18  196768  197615  ENSG00000101557.10:USP14  +   79  AAGGAAACCACTTTCTTGCCTAGGAGACAGTGATTATTTTGGAAGAACTTTGTAGATCATTCACTGCCAGGATTACTT...
chr18  198133  199201  ENSG00000101557.10:USP14  +   87  ATGATACTATTAATTTTAATGGAATTAGGATAATGTATTATATTCAAGAGTATATAACAAATTGAATACCCGTTGGCA...
...

The format is:

chromosome name
start of intron
end of intron
strand
position of the BP A (in the 100+20 nt sequence)
100+20 nt sequence

References

Turunen JJ, Niemela EH, Verma B, Frilander MJ. The significant other: splicing by the minor spliceosome. Wiley Interdiscip Rev RNA. 2013 Jan-Feb;4(1):61-76. PMID: 23074130

Corvelo A, Hallegger M, Smith CW, Eyras E. Genome-wide association between branch point properties and alternative splicing. PLoS Comput Biol. 2010 Nov 24;6(11):e1001016.PMID: 21124863

Gooding C, Clark F, Wollerton MC, Grellscheid SN, Groom H, Smith CW. A class of human exons with predicted distant branch points revealed by analysis of AG dinucleotide exclusion zones. Genome Biol. 2006;7(1):R1. PMID: 16507133

Yeo GW, Van Nostrand EL, Liang TY. Discovery and analysis of evolutionarily conserved intronic splicing regulatory elements. PLoS Genet. 2007 May 25;3(5):e85. Erratum in: PLoS Genet. 2007 Jul;3(7):e122. PMID: 17530930

Long JC, Caceres JF. The SR protein family of splicing factors: master regulators of gene expression. Biochem J. 2009 Jan 1;417(1):15-27.

S.R. Eddy. What is a hidden Markov model ? Nat Biotech, 22(10):1315-1316, 2004.