# Topics In Statistics For Undergraduates STAT 49000

Purdue

These 5-page class notes were uploaded by Bailey Macejkovic on Saturday, September 19, 2015. They belong to STAT 49000 at Purdue University, taught by Staff in Fall. Since their upload, they have received 46 views.

Date Created: 09/19/15

BIOL 495S / CS 490B / MATH 490B / STAT 490B, Spring 2002. Feb 4 and 6 lectures.

Reference: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison (1998), *Biological sequence analysis: Probabilistic models of proteins and nucleic acids*, Sections 2.2, 2.8, and 3.2. The textbook: *A primer of genome science*, pp. 95-97.

## Defining substitution matrices for sequence alignment

We need score terms for each aligned residue pair. A biologist with a good intuition for proteins could invent a set of 210 scoring terms for all possible pairs of amino acids, but it is extremely useful to have a guiding theory for what the scores mean. We will derive substitution scores from a probabilistic model.

Let us establish some notation. We will be considering a pair of sequences, $x$ and $y$, of lengths $n$ and $m$ respectively. Let $x_i$ be the $i$th symbol in $x$ and $y_j$ be the $j$th symbol of $y$. These symbols come from some alphabet $\mathcal{A}$: in the case of DNA this is the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids. We denote symbols from this alphabet by lowercase letters like $a$, $b$.

Let us first consider ungapped global pairwise alignments, that is, two completely aligned equal-length sequences. The unrelated or random model $R$ gives the probability

$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j},$$

with the assumption that letter $a$ occurs independently with frequency $q_a$. In the alternative match model $M$, aligned pairs of residues occur with a joint probability $p_{ab}$. This value $p_{ab}$ can be thought of as the probability that the residues $a$ and $b$ have been derived from a common ancestor. The probability for the whole alignment is

$$P(x, y \mid M) = \prod_i p_{x_i y_i}.$$

The ratio of these two likelihoods is known as the odds ratio:

$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}.$$

We take the logarithm of this ratio (the log-odds ratio) to obtain the score of aligning the pair of sequences:

$$S = \sum_i s(x_i, y_i), \qquad \text{where} \qquad s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$

is the log likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair. Arranging the $s(a, b)$ scores in a matrix, we get the substitution matrix, for instance the BLOSUM50 matrix.

The score matrix is obtained from probabilities, which raises the issue of how to estimate those probabilities. A simple and obvious approach is to count the frequencies of aligned residue pairs in confirmed alignments and to set the probabilities $p_{ab}$ and $q_a$ to the normalized frequencies. The BLOSUM matrices are derived by this approach. In detail, they were derived from a set of aligned, ungapped regions from protein families, called the BLOCKS database. The sequences from each block were clustered, putting two sequences into the same cluster whenever their percentage of identical residues exceeded some level $L$%. Then we calculate the frequencies $A_{ab}$ of observing residue $a$ in one cluster aligned against residue $b$ in another cluster, correcting for the size of the clusters by weighting each occurrence by $1/(n_1 n_2)$, where $n_1$ and $n_2$ are the respective cluster sizes. From $A_{ab}$ the probabilities are estimated by

$$q_a = \frac{\sum_b A_{ab}}{\sum_{cd} A_{cd}},$$

i.e. the fraction of pairings that include an $a$, and

$$p_{ab} = \frac{A_{ab}}{\sum_{cd} A_{cd}},$$

i.e. the fraction of pairings between $a$ and $b$ out of all observed pairings. Then the score matrix entries were derived using $s(a, b) = \log \dfrac{p_{ab}}{q_a q_b}$. The resulting log-odds score matrices were scaled and rounded to the nearest integer value. For $L = 62$ and $L = 50$ we get the BLOSUM62 and BLOSUM50 substitution matrices respectively. BLOSUM62 is standard for ungapped matching and BLOSUM50 for alignment with gaps. Note that lower $L$ values correspond to more distantly related sequences and are therefore suited to searches for less similar sequences.

## Hidden Markov models

In the CpG island example, we want to build a single model that incorporates both Markov chains, the "+" model and the "−" model. A motivation for this is to answer a question like: how do we find the CpG islands in a long unannotated sequence? We want to have both Markov chains present in the same model, with a small probability of switching from one chain to the other at each transition point. We have to introduce two states corresponding to each nucleotide symbol. We now have $A_+, C_+, G_+, T_+$, which emit A, C, G, T respectively in CpG island regions, and $A_-, C_-, G_-, T_-$ correspondingly in non-island regions.

The transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original component model, but there is also a small but finite chance of switching into the other component. Overall there is more chance of switching from "+" to "−" than vice versa, so if left to run free, the model will spend more of its time in the "−" non-island states than in the island states.

The relabelling of the states is the critical step. The essential difference between a Markov chain and a hidden Markov model is that for a hidden Markov model there is not a one-to-one correspondence between the states and the symbols. It is no longer possible to tell what state the model was in when $x_i$ was generated just by looking at $x_i$. In our example there is no way to tell, by looking at a single symbol C in isolation, whether it was emitted by state $C_+$ or state $C_-$.

Let us formalize the notation for hidden Markov models. We have to distinguish the sequence of states from the sequence of symbols. Let us call the state sequence the path $\pi$. The path itself follows a simple Markov chain. The $i$th state in the path is called $\pi_i$. The chain is characterized by the parameters

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k).$$

Another set of parameters is the emission probabilities $e_k(b)$ for emitting symbol $b$ from state $k$:

$$e_k(b) = P(x_i = b \mid \pi_i = k).$$

For the CpG island model the emission probabilities are all 0 or 1.

The reason for the name "emission probabilities" is that hidden Markov models generate, or emit, sequences. The states are hidden, though. A sequence can be generated from an HMM as follows. First a state $\pi_1$ is chosen according to the starting probabilities, denoted by $a_{0i}$. In that state an observation $x_1$ is emitted according to the distribution $e_{\pi_1}$ for that state. Then a new state $\pi_2$ is chosen according to the transition probabilities $a_{\pi_1 i}$, and so forth. This way a sequence of random, artificial observations is generated. Therefore we will sometimes say that $P(x)$ is the probability that $x$ was generated by the model.

It is now easy to write down the joint probability of an observed sequence $x$ of length $L$ and a state sequence $\pi$:

$$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \tag{4}$$

where we require $\pi_{L+1} = 0$. For example, the probability of sequence CGCG being emitted by the state sequence $(C_+, G_-, C_-, G_+)$ in the model is

$$a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0}.$$

Equation (4) is the HMM analogue of Equation (3) in the Markov chain model (refer to the lecture notes). However, it is not so useful in practice because in general we do not know the path. In the following sections we will study algorithms to estimate the path and parameters for an HMM.

## An example of the HMM

Consider the CpG island example again. We will use another state labeling approach. We now define two hidden states: every nucleotide belongs to either a normal region N or to a CpG island region R. We use an HMM to describe a long sequence with both normal and CpG island regions. A symbol sequence may look like

TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA

with a corresponding state sequence [state sequence not preserved]. The underlined regions are CpG islands according to the state sequence. The HMM describing these data is represented as in the figure below.

[Figure: two-state HMM with states Normal (N) and CpG island (R). Transition probabilities: N→N 0.9, N→R 0.1, R→R 0.8, R→N 0.2. Emission probabilities:]

| Nucleotide | Normal (N) | CpG island (R) |
|---|---|---|
| A | 0.3 | 0.1 |
| T | 0.3 | 0.1 |
| G | 0.2 | 0.4 |
| C | 0.2 | 0.4 |

The states of the HMM are the two categories N and R. Transition probabilities govern the assignment of states from one position to the next. In the current example, if the present state is N, the following position will be N with probability 0.9 and R with probability 0.1. The four nucleotides in a sequence appear in each state in accordance with the corresponding emission probabilities.

Consider a simple sequence TGCC. One possible way that this sequence could arise is from the set of hidden states NNNN. Given the hidden states, the probability of generating (emitting) the observed sequence is

$$P(\text{TGCC} \mid \text{NNNN}) = 0.3 \times 0.2 \times 0.2 \times 0.2 = 0.0024,$$

whereas the joint probability of the sequence being emitted by the states NNNN is $P(\text{TGCC}, \text{NNNN}) = P(\text{TGCC} \mid \text{NNNN})\, P(\text{NNNN})$. Assuming the first nucleotide is always in a normal region, the probability is calculated as

$$P(\text{TGCC}, \text{NNNN}) = P(\text{TGCC} \mid \text{NNNN})\, P(\text{NNNN}) = 0.0024 \times 0.9 \times 0.9 \times 0.9 \approx 0.00175,$$

where $P(\text{NNNN})$ is computed from the Markov transitions.

Summing over all the possible state paths, we obtain the probability $P(\text{TGCC})$ from the HMM:

$$
\begin{aligned}
P(\text{TGCC}) ={} & P(\text{TGCC} \mid \text{NNNN})\,P(\text{NNNN}) + P(\text{TGCC} \mid \text{NNNR})\,P(\text{NNNR}) \\
& + P(\text{TGCC} \mid \text{NNRN})\,P(\text{NNRN}) + P(\text{TGCC} \mid \text{NRNN})\,P(\text{NRNN}) \\
& + P(\text{TGCC} \mid \text{NNRR})\,P(\text{NNRR}) + P(\text{TGCC} \mid \text{NRNR})\,P(\text{NRNR}) \\
& + P(\text{TGCC} \mid \text{NRRN})\,P(\text{NRRN}) + P(\text{TGCC} \mid \text{NRRR})\,P(\text{NRRR}).
\end{aligned}
$$

Once we have computed the joint probability of the sequence for every possible path, we can use the path that contributes the maximum probability as our best estimate of the unknown hidden states. For the sample sequence, one finds that the most probable path is in fact NNNN, whose probability is slightly higher than that of the path NRRR. If the fifth nucleotide in the series were also a G or C, the path NRRRR would be more likely than NNNNN, providing evidence for the existence of a CpG island.
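
The path-by-path calculation for the two-state N/R example can be checked with a short brute-force sketch in Python. The function and variable names below are illustrative and do not come from the notes. The transition values R→R = 0.8 and R→N = 0.2 are read from the figure rather than stated in the prose, so treat them as an assumption. As in the text, the first nucleotide is assumed to be in state N, and no end-state transition is modelled.

```python
from itertools import product

# Parameters of the two-state example (N = normal, R = CpG island).
# The R->R and R->N values are taken from the figure (assumption).
TRANSITION = {("N", "N"): 0.9, ("N", "R"): 0.1,
              ("R", "R"): 0.8, ("R", "N"): 0.2}
EMISSION = {"N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
            "R": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}


def joint_probability(seq, path):
    """P(seq, path), treating the first state of the path as given."""
    prob = EMISSION[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= TRANSITION[(path[i - 1], path[i])] * EMISSION[path[i]][seq[i]]
    return prob


def all_paths(length, first_state="N"):
    """Enumerate every state path of the given length that starts in first_state."""
    for tail in product("NR", repeat=length - 1):
        yield first_state + "".join(tail)


def sequence_probability(seq):
    """P(seq): sum of the joint probability over all paths starting in N."""
    return sum(joint_probability(seq, p) for p in all_paths(len(seq)))


def best_path(seq):
    """Most probable path by exhaustive search (fine only for short sequences)."""
    return max(all_paths(len(seq)), key=lambda p: joint_probability(seq, p))


if __name__ == "__main__":
    for p in all_paths(4):
        print(p, joint_probability("TGCC", p))
    print("P(TGCC) =", sequence_probability("TGCC"))
    print("best path for TGCC :", best_path("TGCC"))
    print("best path for TGCCG:", best_path("TGCCG"))
```

Exhaustive enumeration grows as $2^{L-1}$ in the sequence length $L$, so it is only useful for checking small examples like this one by hand.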
