Nearly all protein-encoding human genes have multiple exons that are combined in alternative ways to produce distinct mRNAs, often in an organ-specific, tissue-specific, or cell type specific manner. Definition of intron borders often requires the collaboration of RNA-binding proteins (RBPs), such as serine arginine (SR) and heterogeneous nuclear RNPs (hnRNPs), which interact with specific exonic or intronic sequence elements usually located in the vicinity of splice sites. Additionally, there are RBPs that are particularly relevant in development or in tissue-specific splicing regulation.
For instance, binding sites for RBFOX1 (A2BP1) and RBOFX2 (RBM9) proteins have been found in the introns proximal to differentially spliced exons in brain and muscle tissues. In general, these binding sites were found more frequently in introns upstream of exons that were differentially excluded in those tissues, and in introns downstream of exons that were differentially included. Similarly, SRRM4 binding sites were found upstream of exons that were more included in differentiated neurons, compared with undifferentiated neurons.
Multiple biochemical and computational strategies have been developed to study how RBPs bind RNA. A strategy that is frequently applied consists in biochemically purifying a specific RBP from a cell extract and identifying its associated RNA targets using microarrays or RNA deep sequencing. There are several variations of this approach. The most common approach, CLIP-Seq, involves cross-linking with UV light of the RBP to its cognate RNAs, followed by immunoprecipitation and deep sequencing of RNA (Konig et al. 2012):
Sequencing reads are mapped back to the genome or the transcriptome. Significant clusters of reads, also called peaks, are calculated by estimating the enrichment with respect to a background of RNA expression, or with respect to an expected distribution of CLIP-Seq reads in the same region (the pre-mRNA where the RBP is expected to bind). This allows the determination of candidate targets for an RBP (See Konig et al. 2012):
These techniques have been applied before to multiple RNA binding proteins (RBPs) to find their binding specificities.
You will have to choose between one of these datasets obtained from the ENCODE project:
Each of these datasets is composed of sequences with the CLIP regions (regions of the genome with significant CLIP signal) plus 20nt on either side: 20nt + CLIP-region + 20nt. Bear in mind that the CLIP region can be variable and very short.
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
In this case, it is important to process one character at a time. Accordingly, the sequence ACGGT would be processed:
AC CG GG GT
Accordingly, we will calculate the signal and the background model as probability distributions over k-mers, using the parts of the sequence corresponding to the CLIP regions and the parts corresponding to the flanking regions. The output of this first program will be the list of probabilities for k-mers for signal and background. That is, inside and outside the CLIP regions. This first program should accept k as input.
Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, Gueroussov S, Albu M, Zheng H, Yang A, Na H, Irimia M, Matzat LH, Dale RK, Smith SA, Yarosh CA, Kelly SM, Nabet B, Mecenas D, Li W, Laishram RS, Qiao M, Lipshitz HD, Piano F, Corbett AH, Carstens RP, Frey BJ, Anderson RA, Lynch KW, Penalva LO, Lei EP, Fraser AG, Blencowe BJ, Morris QD, Hughes TR. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013 Jul 11;499(7457):172-7.
Sundararaman B, Zhan L, Blue SM, Stanton R, Elkins K, Olson S, Wei X, Van Nostrand EL, Pratt GA, Huelga SC, Smalec BM, Wang X, Hong EL, Davidson JM, Lécuyer E, Graveley BR, Yeo GW. Resources for the Comprehensive Discovery of Functional RNA Elements. Mol Cell. 2016 Mar 17;61(6):903-13.
Lovci MT, Ghanem D, Marr H, Arnold J, Gee S, Parra M, Liang TY, Stark TJ, Gehman LT, Hoon S, Massirer KB, Pratt GA, Black DL, Gray JW, Conboy JG, Yeo GW. Rbfox proteins regulate alternative mRNA splicing through evolutionarily conserved RNA bridges. Nat Struct Mol Biol. 2013 Dec;20(12):1434-42.
Konig J, Zarnack K, Luscombe NM, Ule J. Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012 Jan 18;13(2):77-83.
Yeo GW, Coufal NG, Liang TY, Peng GE, Fu XD, Gage FH. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat Struct Mol Biol. 2009 Feb;16(2):130-7.