### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# 233 Class Note for MA 49000 at Purdue

### View Full Document

## 27

## 0

## Popular in Course

## Popular in Department

This 6 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 27 views.

## Similar to Course at Purdue

## Reviews for 233 Class Note for MA 49000 at Purdue

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 02/06/15

BIOL 495S CS 490B MATH 490B STAT 490B Spring 2002 Jan 14 and 16 lectures Probabilities and probabilistic models Reading S M Ross Introduction to probability models 7th ed Chapter 1 Reference R Durbin S Eddy A Krogh and G Mitchison 1998 Biological sequence analysis Probabilistic models of proteins and nucleic acids Section 13 and 31 Probabilities Let us consider a very simple example A familiar probabilistic system with a set of discrete outcomes is the roll of a sixsided die To de ne probabilities we must first de ne the space of possible events In this example there are six events for a roll of a die face 16 Following the annotations ofthe reading Ross S 123456 A model ofa roll of a die possible loaded would have six parameters p1p6 the probability of rolling i is p To be probabilities the parameters pl must satisfy the conditions that pl 2 0 and 2171 1 For example suppose that all six numbers are equally likely to appear a fair die then we will have 1 p1p2 p3 p4 p5 p6 zg Another example closer to our biological subject matter is to define probabilities of amino acids or nucleotides For instance at a position of a protein sequence we assume an amino acids a occurs at random with probability an In another words the amino acid at this position is possible to be one of the twenty types each type a has probability an to occur The probability an is a number between 0 and 1 and the sum of all twenty probabilities equals to 1 Conditional probabilities anal independency Suppose that we toss two dice There are totally 36 possible outcomes combing the possible numbers of the first and second toss Suppose that each of the 36 possible outcomes is equally likely to occur hence has probability 1 36 If we observe that the first die is a four then given this information what is the probability that the sum of the two dice equals six Given that the initial die is a four it follows that there can be at most six possible outcomes of our experiment namely 41 42 43 44 45 and 46 Since each of these outcomes originally had the same probability of occurring they should still have equal probabilities That is given that the first die is a four then the conditional probability of each of the outcomes 41 42 43 44 45 and 46 is 16 while the conditional probability of the other 30 points is 0 Hence the desired probability will be 16 A conditional probability is the probability that one event will occur given that we already know that some other events have occurred If we let E and F denote respectively the event that the sum of the dice is siX and the event that the first die is a four then the probability just obtained is the conditional probability that E occurs given that F has occurred and is denoted by P E l F A general formula for PE l F which is valid for all events E and F when PF gt 0 is P EF PE l F 1 P F where PEF is the probability of the intersection of E and F or the probability of both E and F occur Back to the previous example PEF is the probability of the first die is a four and the sum of the two dice equals siX and PF is the probability the first die is a four Therefore PEF Ptwo dice are 42 PF PE l F Ni l 1 6 6 By de ning conditional probability we can also write the probability of both E andF occurrence as PEF PE l FPF Similarly we have PEF PF l EPE if PE gt 0 I ndependency Two events E and F are said to be independent if PEF PEPF By Equation 1 this implies thatE andF are independent if PE l F PE That is E and F are independent if knowledge thatF has occurred does not affect the probability that E occurs That is the occurrence of E is independent of whether or notF occurs Two events E and F that are not independent are said to be dependent As we will see in the followings some models for nucleotide sequence or protein sequence assume a nucleotide or amino acid occurs independent of other residues in the sequence Therefore the probability of the sequence will be the product of the probabilities of residues in the sequence Probabilistic models of sequences When we talk about a model normally we mean a system that simulates the object under consideration A probabilistic model is one that produces different outcomes with different probabilities A probabilistic model can therefore simulate a whole class of objects assigning each an associated probability In our case the objects will normally be sequences and a model might describe a family of related sequences A model of a sequence of three consecutive rolls of a die might be that they were all independent so that the probability of sequence 163 would be the product of the individual probabilities p1 p6 p3 For a fair die each number has probability 16 to occur in a roll The probability of sequence 163 from three independent consecutive rolls is then 1216 Consider a second example of a simple model of any protein or DNA sequence Biological sequences are strings from a nite alphabet of residues generally either four nucleotides or twenty amino acids Assume that a residue a occurs at random with probability qa independent of all other residues in the sequence If the protein or DNA sequence is denoted x1xn the probability of the whole sequence is then the product qxlqx2 maxquot FIan This is often used as random sequence model ofa baselevel model or null hypothesis to compare other models against Maximum likelihood estimation The parameters for a probabilistic model are typically estimated from large sets of trusted examples often called a training set For instance the probability qa for amino acid a can be estimated as the observed frequency of residues in a database of known protein sequences such as SWISSPROT We obtain the twenty frequencies from counting up some twenty million individual residues in the database and thus we have so much data that as long as the training sequences are not systematically biased towards a peculiar residue composition we expect the frequencies to be reasonable estimates of the underlying probabilities of our model This way of estimating models is called maximum likelihood estimation because it can be shown that using the frequencies with which the amino acids occur in the database as the probabilities qa maximizes the total probability of all the sequences given the model the likelihood In general given a model with parameters 9 and a set of data D the maximum likelihood estimate for 9 is that value which maximizes PD l 9 Markov chains Markov chain is an excellent statistical tool for modeling a sequence of any type because a Markov chain is simply a sequence of events that occur one after another The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events For example the identity of a DNA base might depend only on the previous base a more general model compared to independent bases Let us use a simple example to introduce Markov models In the human genome wherever the dinucleotide CG occurs frequently written CpG to distinguish it from the CG base pair across the two stands the C nucleotide cytosine is typically chemically modified There is a relatively high chance of this modification that mutates C into a T with the consequence that in general CpG dinucleotides are rarer in the genome than would be expected from the independent probabilities of C and G For biologically important reasons the mutation modification process is suppressed in short stretches of the genome such as around the promoters or start regions of many genes In these regions we see many more CpG dinucleotides than elsewhere Such regions are called CpG islands What sort of probabilistic model might we use for CpG island regions We know that dinucleotides are important We therefore want a model that generates sequences in which the probability of a symbol depends on the previous symbol The simplest such model is a classical Markov chain We can show a Markov chain graphically as a collection of states each of which corresponds to a particular residue with arrows between the states A Markov chain for DNA can be drawn like this where there is a state for each of the four letters A C G and T in the DNA alphabet A probability parameter is associated with each arrow in the figure which determines the probability of a certain residue following another residue or one state following another state These probability parameters are called the transition probabilities which we will write p52 PS Pxtlx1 3 where st indicate one of the four nucleotides Markov models for the CpG island example are illustrated below From a set of human DNA sequences we extracted a total of 48 putative CpG islands and derived two Markov chain models one for the regions labeled as CpG islands the model and the other from the remainder of the sequence the model The transition probabilities for each model were set using the equation C p 2 tvcm and its analogue for p where c is the number of times letter I followed letter s in the island regions These are the maximum likelihood estimators for the transition probabilities The resulting tables are A C G T A C G T 0180 0274 0426 0120 0171 0368 0274 0188 0161 0339 0375 0125 0079 0355 0384 0182 0300 0205 0285 0210 0322 0298 0078 0302 0248 0246 0298 0208 0177 0239 0292 0292 1 QODgt 1 QODgt where the first row in each case contains the frequencies with which an A is followed by each of the four bases and so on for the other rows so each row sums to one These numbers are not the same for example G following A is much more common than T following A Notice also that the tables are asymmetric In both tables the probability for G following C is lower than that for C following G But the transition probabilities of C9G and G9C are much higher in the island model than those in the nonisland model The different transition probabilities make the CpG island to be distinguished from other regions For any probability model of sequences we can write the probability of the sequence denoted by x x1x2xL71xL as PxPx1x2xL71xL 2 PxL i x15 x2quotquot erI Perl i x1 9 x2 quotquot5 x1172 ixl by applying PEF PF l E PE many times The key property of a Markov chain is that the probability of each symbol x1 depends only on the value of the preceding symbol x11 not on the entire previous sequence ie 1309 l x1quotquot9x171 Pxx l x171 prx The equation 2 therefore becomes Px Px1x2 xHxL i x1 xZquotquot xLil PxLil ixl x2 quotquot5 x1172 ixl PxL ierlPerl i erz Px2 l x1Px1 Although we have derived this equation in the context of CpG islands in DNA sequences it is in fact the general equation for the probability of a speci c sequence from any Markov chain Equation 3 is used to calculate the likelihood of a sequence For instance we use 3 to obtain the likelihood values of a sequence under model and model respectively If the likelihood value from model is much higher than that from model we may conclude the sequence is a CpG island region see exercise 5 Exercises Due on Wednesday Jan 23rd 1 In a DNA sequence set there are totally 2 X 104 nucleotides among which 6 X 103 are adenine 4 gtlt103 are guanine and 8 X 103 are thymine Use this data set to estimate the probabilities for the four types of nucleotides 2 A protein sequence is 500 amino acids length Assume amino acids occur independently in the sequence If we are given that the probability for leucine is qL 004 how many leucine residues we are expected to see in this sequence 3 Prove Bayes theorem that P F l E P E P F PF l E P E PF l EPEPF l E PE where all the conditional probabilities are assumed well de ned 4 Suppose that the chance of rain tomorrow depends on previous weather conditions only through whether or not it is raining today Suppose also that if it rains today then it will rain tomorrow with probability 06 and if it does not rain today then it will rain tomorrow with probability 03 Show that the process is a twostate Markov chain Calculate all the transition probabilities PElF 5 Consider a nucleotide sequence AGCGCG a Use both the Markov model and the Markov model de ned in the text to obtain two probabilities of this sequence When applying Equation 3 we assume the start residue is equally likely to be each of the four nucleotides The transition probabilities are from the table in the text b Compare the probabilities calculated in a What region we can predict for this sequence CpG island or normal region

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "When you're taking detailed notes and trying to help everyone else out in the class, it really helps you learn and understand the material...plus I made $280 on my first study guide!"

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.