Advanced Genome Bioinformatics

Markov model Assignment

Modeling the binding sites for RNA binding proteins with a Markov model


Nearly all protein-encoding human genes have multiple exons that are combined in alternative ways to produce distinct mRNAs, often in an organ-specific, tissue-specific, or cell type specific manner. Definition of intron borders often requires the collaboration of RNA-binding proteins (RBPs), such as serine arginine (SR) and heterogeneous nuclear RNPs (hnRNPs), which interact with specific exonic or intronic sequence elements usually located in the vicinity of splice sites. Additionally, there are RBPs that are particularly relevant in development or in tissue-specific splicing regulation.

For instance, binding sites for RBFOX1 (A2BP1) and RBOFX2 (RBM9) proteins have been found in the introns proximal to differentially spliced exons in brain and muscle tissues. In general, these binding sites were found more frequently in introns upstream of exons that were differentially excluded in those tissues, and in introns downstream of exons that were differentially included. Similarly, SRRM4 binding sites were found upstream of exons that were more included in differentiated neurons, compared with undifferentiated neurons.

Multiple biochemical and computational strategies have been developed to study how RBPs bind RNA. A strategy that is frequently applied consists in biochemically purifying a specific RBP from a cell extract and identifying its associated RNA targets using microarrays or RNA deep sequencing. There are several variations of this approach. The most common approach, CLIP-Seq, involves cross-linking with UV light of the RBP to its cognate RNAs, followed by immunoprecipitation and deep sequencing of RNA (Konig et al. 2012):

CLIP techniques

Sequencing reads are mapped back to the genome or the transcriptome. Significant clusters of reads, also called peaks, are calculated by estimating the enrichment with respect to a background of RNA expression, or with respect to an expected distribution of CLIP-Seq reads in the same region (the pre-mRNA where the RBP is expected to bind). This allows the determination of candidate targets for an RBP (See Konig et al. 2012):

CLIP peaks

These techniques have been applied before to multiple RNA binding proteins (RBPs) to find their binding specificities.


You will have to choose between one of these datasets obtained from the ENCODE project:

Each of these datasets is composed of sequences with the CLIP regions (regions of the genome with significant CLIP signal) plus 20nt on either side: 20nt + CLIP-region + 20nt. Bear in mind that the CLIP region can be variable and very short.

  • Separate the datasets into training and testing.
  • Develop a program that will read these sequences and will estimate a background Markov model of order k from the flanking sequences, and a signal Markov model of order k for the CLIP regions.
  • Note that Markov models of order k, with k>1, can be seen as a Markov model of order 1 over the alphabet of k-tuples. For instance, a Markov model of order 2 for RNA can be treated as a Markov model of order 1 over the alphabet:
        AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT

    In this case, it is important to process one character at a time. Accordingly, the sequence ACGGT would be processed:

        AC CG GG GT

    Accordingly, we will calculate the signal and the background model as probability distributions over k-mers, using the parts of the sequence corresponding to the CLIP regions and the parts corresponding to the flanking regions. The output of this first program will be the list of probabilities for k-mers for signal and background. That is, inside and outside the CLIP regions. This first program should accept k as input.

  • Develop a second program to scan the test sequences to predict the binding regions with your Markov model. This program will read the estimated k-mer probabilities from the previous step and the set of test sequences (different from the training sequences) and will run a sliding window of length l along the sequence to score for the potential to detect an RBFOX2 binding site. This program should also accept a length l as input. Decide on a score cutoff to determine the windows that can be considered potential binding sites.

  • Calculate the precision and recall of the predictions using the following definitions:

  • Note that since the flanking regions are the negatives, these will cover in general many more nucleotides than the positives. Accordingly, some accuracy measures will be understimated, like the false positive rate. For this reason, the precision (proportion of predicted positions that are correct) represent a better estimate of the actual accuracy of the method.

  • Calculate the accuracy of the method as a function of the length l of the search window and as a function of the order k of the Markov model. Can you find an optimal window length l or an optimal k?


    Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, Gueroussov S, Albu M, Zheng H, Yang A, Na H, Irimia M, Matzat LH, Dale RK, Smith SA, Yarosh CA, Kelly SM, Nabet B, Mecenas D, Li W, Laishram RS, Qiao M, Lipshitz HD, Piano F, Corbett AH, Carstens RP, Frey BJ, Anderson RA, Lynch KW, Penalva LO, Lei EP, Fraser AG, Blencowe BJ, Morris QD, Hughes TR. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013 Jul 11;499(7457):172-7.

    Sundararaman B, Zhan L, Blue SM, Stanton R, Elkins K, Olson S, Wei X, Van Nostrand EL, Pratt GA, Huelga SC, Smalec BM, Wang X, Hong EL, Davidson JM, Lécuyer E, Graveley BR, Yeo GW. Resources for the Comprehensive Discovery of Functional RNA Elements. Mol Cell. 2016 Mar 17;61(6):903-13.

    Lovci MT, Ghanem D, Marr H, Arnold J, Gee S, Parra M, Liang TY, Stark TJ, Gehman LT, Hoon S, Massirer KB, Pratt GA, Black DL, Gray JW, Conboy JG, Yeo GW. Rbfox proteins regulate alternative mRNA splicing through evolutionarily conserved RNA bridges. Nat Struct Mol Biol. 2013 Dec;20(12):1434-42.

    Konig J, Zarnack K, Luscombe NM, Ule J. Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012 Jan 18;13(2):77-83.

    Yeo GW, Coufal NG, Liang TY, Peng GE, Fu XD, Gage FH. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat Struct Mol Biol. 2009 Feb;16(2):130-7.