Topics In Mathematics For Undergraduates
Topics In Mathematics For Undergraduates MA 49000
Popular in Course
Popular in Mathematics (M)
verified elite notetaker
MA 162 Pl Anly Geo Calc II
verified elite notetaker
verified elite notetaker
One Day of Notes
verified elite notetaker
verified elite notetaker
verified elite notetaker
This 6 page Class Notes was uploaded by Dorothea Bode on Saturday September 19, 2015. The Class Notes belongs to MA 49000 at Purdue University taught by Staff in Fall. Since its upload, it has received 37 views. For similar materials see /class/208121/ma-49000-purdue-university in Mathematics (M) at Purdue University.
Reviews for Topics In Mathematics For Undergraduates
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/19/15
BIOL 495S CS 490B MATH 490B STAT 490B Spring 2002 Jan 14 and 16 lectures Probabilities and probabilistic models Reading S M Ross Introduction to probability models 7th ed Chapter 1 Reference R Durbin S Eddy A Krogh and G Mitchison 1998 Biological sequence analysis Probabilistic models of proteins and nucleic acids Section 13 and 31 Probabilities Let us consider a very simple example A familiar probabilistic system with a set of discrete outcomes is the roll of a sixsided die To de ne probabilities we must first de ne the space of possible events In this example there are six events for a roll of a die face 16 Following the annotations ofthe reading Ross S 123456 A model ofa roll of a die possible loaded would have six parameters p1p6 the probability of rolling i is p To be probabilities the parameters pl must satisfy the conditions that pl 2 0 and 2171 1 For example suppose that all six numbers are equally likely to appear a fair die then we will have 1 p1p2p3p4p5p6g Another example closer to our biological subject matter is to define probabilities of amino acids or nucleotides For instance at a position of a protein sequence we assume an amino acids at occurs at random with probability qa In another words the amino acid at this position is possible to be one of the twenty types each type a has probability qa to occur The probability qa is a number between 0 and 1 and the sum of all twenty probabilities equals to 1 C F and r Suppose that we toss two dice There are totally 36 possible outcomes combing the possible numbers of the first and second toss Suppose that each of the 36 possible outcomes is equally likely to occur hence has probability 1 36 If we observe that the first die is a four then given this information what is the probability that the sum of the two dice equals six Given that the initial die is a four it follows that there can be at most six possible outcomes of our experiment namely 41 42 43 44 45 and 46 Since each of these outcomes originally had the same probability of occurring they should still have equal probabilities That is given that the first die is a four then the conditional probability of each of the outcomes 41 42 43 44 45 and 46 is 16 while the conditional probability of the other 30 points is 0 Hence the desired probability will be 16 A conditional probability is the probability that one event will occur given that we already know that some other events have occurred If we let E and F denote respectively the event that the sum of the dice is siX and the event that the first die is a four then the probability just obtained is the conditional probability that E occurs given that F has occurred and is denoted by P E l F A general formula for PE l F which is valid for all events E and F when PF gt 0 is P EF PE l F M 1 P F where PEF is the probability of the intersection of E and F or the probability of both E and F occur Back to the previous example PEF is the probability of the first die is a four and the sum of the two dice equals siX and PF is the probability the first die is a four Therefore PEF Ptwo dice are 42 i l P F 6 PE F 13621 16 6 By de ning conditional probability we can also write the probability of both E andF occurrence as PEF PE l FPF Similarly we have PEF PF l EPE if PE gt 0 I ndependency Two events E and F are said to be independent if PEF PEPF By Equation 1 this implies thatE andF are independent if PE l F PE That is E and F are independent if knowledge thatF has occurred does not affect the probability that E occurs That is the occurrence of E is independent of whether or notF occurs Two events E and F that are not independent are said to be dependent As we will see in the followings some models for nucleotide sequence or protein sequence assume a nucleotide or amino acid occurs independent of other residues in the sequence Therefore the probability of the sequence will be the product of the probabilities of residues in the sequence Probabilistic models of sequences When we talk about a model normally we mean a system that simulates the object under consideration A probabilistic model is one that produces different outcomes with different probabilities A probabilistic model can therefore simulate a whole class of objects assigning each an associated probability In our case the objects will normally be sequences and a model might describe a family of related sequences A model of a sequence of three consecutive rolls of a die might be that they were all independent so that the probability of sequence 163 would be the product of the individual probabilities p1 p6 p3 For a fair die each number has probability 16 to occur in a roll The probability of sequence 163 from three independent consecutive rolls is then 1216 Consider a second example of a simple model of any protein or DNA sequence Biological sequences are strings from a nite alphabet of residues generally either four nucleotides or twenty amino acids Assume that a residue a occurs at random with probability qa independent of all other residues in the sequence If the protein or DNA sequence is denoted x1xn the probability of the whole sequence is then the product qxlqx2 max HIquotqu This is often used as random sequence model ofa baselevel model or null hypothesis to compare other models against Maximum likelihood estimation The parameters for a probabilistic model are typically estimated from large sets of trusted examples often called a training set For instance the probability qa for amino acid a can be estimated as the observed frequency of residues in a database of known protein sequences such as SWISSPROT We obtain the twenty frequencies from counting up some twenty million individual residues in the database and thus we have so much data that as long as the training sequences are not systematically biased towards a peculiar residue composition we expect the frequencies to be reasonable estimates of the underlying probabilities of our model This way of estimating models is called maximum likelihood estimation because it can be shown that using the frequencies with which the amino acids occur in the database as the probabilities qa maximizes the total probability of all the sequences given the model the likelihood In general given a model with parameters 9 and a set of data D the maximum likelihood estimate for 9 is that value which maximizes PD l 9 Markov chains Markov chain is an excellent statistical tool for modeling a sequence of any type because a Markov chain is simply a sequence of events that occur one after another The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events For example the identity of a DNA base might depend only on the previous base a more general model compared to independent bases Let us use a simple example to introduce Markov models In the human genome wherever the dinucleotide CG occurs frequently written CpG to distinguish it from the CG base pair across the two stands the C nucleotide cytosine is typically chemically modified There is a relatively high chance of this modification that mutates C into a T with the consequence that in general CpG dinucleotides are rarer in the genome than would be expected from the independent probabilities of C and G For biologically important reasons the mutation modification process is suppressed in short stretches of the genome such as around the promoters or start regions of many genes In these regions we see many more CpG dinucleotides than elsewhere Such regions are called CpG islands What sort of probabilistic model might we use for CpG island regions We know that dinucleotides are important We therefore want a model that generates sequences in which the probability of a symbol depends on the previous symbol The simplest such model is a classical Markov chain We can show a Markov chain graphically as a collection of states each of which corresponds to a particular residue with arrows between the states A Markov chain for DNA can be drawn like this where there is a state for each of the four letters A C G and T in the DNA alphabet A probability parameter is associated with each arrow in the figure which determines the probability of a certain residue following another residue or one state following another state These probability parameters are called the transition probabilities which we will write p52 PS Pxtlx1 3 where st indicate one of the four nucleotides Markov models for the CpG island example are illustrated below From a set of human DNA sequences we extracted a total of 48 putative CpG islands and derived two Markov chain models one for the regions labeled as CpG islands the model and the other from the remainder of the sequence the model The transition probabilities for each model were set using the equation C 2 Ps2 csr and its analogue for p where c is the number of times letter I followed letter s in the island regions These are the maximum likelihood estimators for the transition probabilities The resulting tables are A C G T A C G T 0180 0274 0426 0120 0171 0368 0274 0188 0161 0339 0375 0125 0079 0355 0384 0182 0300 0205 0285 0210 0322 0298 0078 0302 0248 0246 0298 0208 0177 0239 0292 0292 1 QODgt 1 QODgt where the first row in each case contains the frequencies with which an A is followed by each of the four bases and so on for the other rows so each row sums to one These numbers are not the same for example G following A is much more common than T following A Notice also that the tables are asymmetric In both tables the probability for G following C is lower than that for C following G But the transition probabilities of C9G and G9C are much higher in the island model than those in the nonisland model The different transition probabilities make the CpG island to be distinguished from other regions For any probability model of sequences we can write the probability of the sequence denoted by x x1x2xL71xL as PxPx1x2xL71xL 2 PxL i x15 x2quotquot erI Perl i x1 9 x2 quotquot5 x1172 ixl by applying PEF PF l E PE many times The key property of a Markov chain is that the probability of each symbol x1 depends only on the value of the preceding symbol x11 not on the entire previous sequence ie 1309 l x1quotquot9x171 Pxx l x171 prx The equation 2 therefore becomes Px Px1x2 xHxL ixl xl quotquot5 xLil PxLil ixl x2 quotquot5 x1172 ixl PxL ierlPerl i erz Px2 l x1Px1 Although we have derived this equation in the context of CpG islands in DNA sequences it is in fact the general equation for the probability of a speci c sequence from any Markov chain Equation 3 is used to calculate the likelihood of a sequence For instance we use 3 to obtain the likelihood values of a sequence under model and model respectively If the likelihood value from model is much higher than that from model we may conclude the sequence is a CpG island region see exercise 5 Exercises Due on Wednesday Jan 23rd 1 In a DNA sequence set there are totally 2 X 104 nucleotides among which 6 X 103 are adenine 4 gtlt103 are guanine and 8 X 103 are thymine Use this data set to estimate the probabilities for the four types of nucleotides 2 A protein sequence is 500 amino acids length Assume amino acids occur independently in the sequence If we are given that the probability for leucine is qL 004 how many leucine residues we are expected to see in this sequence 3 Prove Bayes theorem that m F PF l EPE P F PF l E P E PF l EPEPF l E PE where all the conditional probabilities are assumed well de ned 4 Suppose that the chance of rain tomorrow depends on previous weather conditions only through whether or not it is raining today Suppose also that if it rains today then it will rain tomorrow with probability 06 and if it does not rain today then it will rain tomorrow with probability 03 Show that the process is a twostate Markov chain Calculate all the transition probabilities 5 Consider a nucleotide sequence AGCGCG a Use both the Markov model and the Markov model de ned in the text to obtain two probabilities of this sequence When applying Equation 3 we assume the start residue is equally likely to be each of the four nucleotides The transition probabilities are from the table in the text b Compare the probabilities calculated in a What region we can predict for this sequence CpG island or normal region
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'