# 265 Class Note for STAT 49000 at Purdue


This 5 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 22 views.


Date Created: 02/06/15
BIOL 495S / CS 490B / MATH 490B / STAT 490B, Spring 2002. Lectures of Feb 4 and 6.

Reference: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison (1998), Biological sequence analysis: Probabilistic models of proteins and nucleic acids, Sections 2.2, 2.8, and 3.2. The textbook: A primer of genome science, pp. 95-97.

## Defining substitution matrices for sequence alignment

We need score terms for each aligned residue pair. A biologist with a good intuition for proteins could invent a set of 210 scoring terms for all possible pairs of amino acids, but it is extremely useful to have a guiding theory for what the scores mean. We will derive substitution scores from a probabilistic model.

Let us establish some notation. We will be considering a pair of sequences, $x$ and $y$, of lengths $n$ and $m$, respectively. Let $x_i$ be the $i$th symbol in $x$ and $y_j$ be the $j$th symbol of $y$. These symbols come from some alphabet $\mathcal{A}$; in the case of DNA this is the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids. We denote symbols from this alphabet by lowercase letters like $a$, $b$.

Let us first consider ungapped global pairwise alignments, that is, two completely aligned equal-length sequences. The unrelated or random model $R$ gives the probability

$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j},$$

with the assumption that letter $a$ occurs independently with frequency $q_a$. In the alternative match model $M$, aligned pairs of residues occur with a joint probability $p_{ab}$. This value $p_{ab}$ can be thought of as the probability that the residues $a$ and $b$ have been derived from a common ancestor. The probability for the whole alignment is

$$P(x, y \mid M) = \prod_i p_{x_i y_i}.$$

The ratio of these two likelihoods is known as the odds ratio:

$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}.$$

We take the logarithm of this ratio (the log-odds ratio) to obtain the score of aligning the pair of sequences:

$$S = \sum_i s(x_i, y_i), \qquad s(a, b) = \log \frac{p_{ab}}{q_a q_b},$$

where $s(a, b)$ is the log likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair. Arranging the $s(a, b)$ scores in a matrix, we get the substitution matrix, for instance the BLOSUM50 matrix.

The score matrix is obtained from probabilities, which raises the issue of how to estimate those probabilities. A simple and obvious approach is to count the frequencies of aligned residue pairs in confirmed alignments and to set the probabilities $p_{ab}$ and $q_a$ to the normalized frequencies. The BLOSUM matrices are derived by this approach. In detail, they were derived from a set of aligned ungapped regions from protein families, called the BLOCKS database. The sequences from each block were clustered, putting two sequences into the same cluster whenever their percentage of identical residues exceeded some level $L\%$. Then we calculate the frequencies $A_{ab}$ of observing residue $a$ in one cluster aligned against residue $b$ in another cluster, correcting for the size of the clusters by weighting each occurrence by $1/(n_1 n_2)$, where $n_1$ and $n_2$ are the respective cluster sizes. From $A_{ab}$, the probabilities are estimated by

$$q_a = \frac{\sum_b A_{ab}}{\sum_{cd} A_{cd}},$$

i.e. the fraction of pairings that include an $a$, and

$$p_{ab} = \frac{A_{ab}}{\sum_{cd} A_{cd}},$$

i.e. the fraction of pairings between $a$ and $b$ out of all observed pairings. The score matrix entries were then derived using $s(a, b) = \log \big( p_{ab} / (q_a q_b) \big)$. The resulting log-odds score matrices were scaled and rounded to the nearest integer value. For $L = 62$ and $L = 50$, we get the BLOSUM62 and BLOSUM50 substitution matrices, respectively. BLOSUM62 is standard for ungapped matching and BLOSUM50 for alignment with gaps. Note that lower $L$ values correspond to more distantly related sequences, so those matrices suit searches for less similar sequences.

## Hidden Markov models

In the CpG island example, we want to build a single model that incorporates both Markov chains, the '+' model and the '-' model. One motivation is to answer a question like "How do we find the CpG islands in a long unannotated sequence?" We want to have both Markov chains present in the same model, with a small probability of switching from one chain to the other at each transition point. We have to introduce two states corresponding to each nucleotide symbol. We now have $A_+, C_+, G_+, T_+$, which emit A, C, G, T respectively in CpG island regions, and $A_-, C_-, G_-, T_-$ correspondingly in non-island regions. The transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original component model, but there is also a small but finite chance of switching into the other component. Overall there is more chance of switching from '+' to '-' than vice versa, so if left to run free, the model will spend more of its time in the '-' non-island states than in the island states.

The relabelling of the states is the critical step. The essential difference between a Markov chain and a hidden Markov model is that for a hidden Markov model there is not a one-to-one correspondence between the states and the symbols. It is no longer possible to tell what state the model was in when $x_i$ was generated just by looking at $x_i$. In our example there is no way to tell, by looking at a single symbol C in isolation, whether it was emitted by state $C_+$ or state $C_-$.

Let us formalize the notation for hidden Markov models. We have to distinguish the sequence of states from the sequence of symbols. Let us call the state sequence the path $\pi$. The path itself follows a simple Markov chain. The $i$th state in the path is called $\pi_i$. The chain is characterized by the parameters

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k).$$

Another set of parameters is the emission probabilities $e_k(b)$ for emitting symbol $b$ from state $k$:

$$e_k(b) = P(x_i = b \mid \pi_i = k).$$

For the CpG island model the emission probabilities are all 0 or 1. The reason for the name emission probabilities is that hidden Markov models generate, or emit, sequences. The states are hidden, though.

A sequence can be generated from an HMM as follows. First a state $\pi_1$ is chosen according to the starting probabilities, denoted by $a_{0i}$. In that state an observation $x_1$ is emitted according to the distribution $e_{\pi_1}$ for that state. Then a new state $\pi_2$ is chosen according to the transition probabilities $a_{\pi_1 i}$, and so forth. In this way a sequence of random, artificial observations is generated.
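The generation procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two states "+" and "-", every probability in `START`, `TRANS`, and `EMIT`, and the helper names `draw` and `generate` are made up for the example and do not come from the notes. The sketch also stops after a fixed number of symbols instead of modelling an end state.

```python
import random

# Illustrative two-state HMM. All numbers are assumptions for this sketch.
START = {"+": 0.5, "-": 0.5}                       # starting probabilities a_{0k}
TRANS = {"+": {"+": 0.7, "-": 0.3},                # transition probabilities a_{kl}
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},   # emissions e_k(b)
        "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}


def draw(dist, rng):
    """Draw one key of `dist` with probability equal to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng):
    """Return (symbols, path) as two strings of the requested length."""
    symbols, path = [], []
    state = draw(START, rng)                    # choose pi_1 from a_{0k}
    for _ in range(length):
        path.append(state)
        symbols.append(draw(EMIT[state], rng))  # emit x_i from e_{pi_i}
        state = draw(TRANS[state], rng)         # choose pi_{i+1} from a_{pi_i, l}
    return "".join(symbols), "".join(path)


if __name__ == "__main__":
    symbols, path = generate(20, random.Random(0))
    print(symbols)  # the observed sequence
    print(path)     # the hidden path, normally not available to the observer
```

Only the first printed line would be visible in practice; the second line is the hidden path that the algorithms for HMMs try to recover.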
Because an HMM generates sequences in this way, we will sometimes say that $P(x)$ is the probability that $x$ was generated by the model. It is now easy to write down the joint probability of an observed sequence $x$ and a state sequence $\pi$:

$$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad (4)$$

where we require $\pi_{L+1} = 0$. For example, the probability of the sequence CGCG being emitted by the state sequence $(C_+, G_-, C_-, G_+)$ in the model is

$$a_{0, C_+} \times 1 \times a_{C_+, G_-} \times 1 \times a_{G_-, C_-} \times 1 \times a_{C_-, G_+} \times 1 \times a_{G_+, 0}.$$

Equation 4 is the HMM analogue of Equation 3 in the Markov chain model (refer to the lecture notes). However, it is not so useful in practice because in general we do not know the path. In the following sections we will study algorithms to estimate the path and the parameters of an HMM.

## An example of the HMM

Consider the CpG island example again. We will use another state labeling approach. We now define two hidden states: every nucleotide belongs either to a normal region (N) or to a CpG island region (R). We use an HMM to describe a long sequence with both normal and CpG island regions. A symbol sequence may look like

TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA

with a corresponding state sequence

NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN

The positions labeled R (underlined in the original notes) are CpG islands according to the state sequence. The HMM describing these data is represented in the figure below.

[Figure: two-state HMM. Normal state (N): stays in N with probability 0.9, switches to R with probability 0.1; emits A 0.3, T 0.3, G 0.2, C 0.2. CpG island state (R): stays in R with probability 0.8, switches to N with probability 0.2; emits A 0.1, T 0.1, G 0.4, C 0.4.]

The states of the HMM are the two categories, N or R. Transition probabilities govern the assignment of states from one position to the next. In the current example, if the present state is N, the following position will be N with probability 0.9 and R with probability 0.1. The four nucleotides appear in each state in accordance with the corresponding emission probabilities.

Consider a simple sequence, TGCC. One possible way that this sequence could arise is from the set of hidden states NNNN. Given the hidden states, the probability of generating (emitting) the observed sequence is

$$P(\text{TGCC} \mid \text{NNNN}) = 0.3 \times 0.2 \times 0.2 \times 0.2 = 0.0024,$$

whereas the joint probability of the sequence being emitted by the states NNNN is $P(\text{TGCC}, \text{NNNN}) = P(\text{TGCC} \mid \text{NNNN})\, P(\text{NNNN})$. Assuming the first nucleotide is always in a normal region, the probability is calculated as

$$P(\text{TGCC}, \text{NNNN}) = P(\text{TGCC} \mid \text{NNNN})\, P(\text{NNNN}) = 0.0024 \times 0.9 \times 0.9 \times 0.9 \approx 0.00175,$$

where $P(\text{NNNN})$ is computed from the Markov transitions. Summing over all the possible state paths, we obtain the probability $P(\text{TGCC})$ from the HMM:

$$\begin{aligned}
P(\text{TGCC}) ={} & P(\text{TGCC} \mid \text{NNNN}) P(\text{NNNN}) + P(\text{TGCC} \mid \text{NNNR}) P(\text{NNNR}) \\
& + P(\text{TGCC} \mid \text{NNRN}) P(\text{NNRN}) + P(\text{TGCC} \mid \text{NRNN}) P(\text{NRNN}) \\
& + P(\text{TGCC} \mid \text{NNRR}) P(\text{NNRR}) + P(\text{TGCC} \mid \text{NRNR}) P(\text{NRNR}) \\
& + P(\text{TGCC} \mid \text{NRRN}) P(\text{NRRN}) + P(\text{TGCC} \mid \text{NRRR}) P(\text{NRRR}).
\end{aligned}$$

Once we have computed the probability of the sequence for every possible path, we can use the path that contributes the maximum probability as our best estimate of the unknown hidden states. For the sample sequence one finds that the most probable path is in fact NNNN, whose joint probability (about 0.00175) is somewhat higher than that of the path NRRR (about 0.00123 with the parameters in the figure). If the fifth nucleotide in the series were also a G or C, the path NRRRR would be more likely than NNNNN, providing evidence for the existence of a CpG island.
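The path-by-path calculation above is small enough to check by brute force. The following Python sketch enumerates every state path for a short sequence, using the transition and emission values of the two-state figure (N stays in N with probability 0.9, R stays in R with probability 0.8) and the notes' assumption that the first nucleotide is in a normal region. The function names `joint_probability` and `enumerate_paths` are hypothetical, and the approach is plain enumeration, which is only practical for very short sequences.

```python
from itertools import product

# Parameters of the two-state N/R example.
TRANS = {"N": {"N": 0.9, "R": 0.1},
         "R": {"N": 0.2, "R": 0.8}}
EMIT = {"N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
        "R": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}


def joint_probability(seq, path):
    """P(seq, path), treating the first state as given (no start term)."""
    prob = EMIT[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][seq[i]]
    return prob


def enumerate_paths(seq, first_state="N"):
    """Map every path that starts in `first_state` to its joint probability."""
    table = {}
    for rest in product("NR", repeat=len(seq) - 1):
        path = first_state + "".join(rest)
        table[path] = joint_probability(seq, path)
    return table


if __name__ == "__main__":
    for seq in ("TGCC", "TGCCG"):
        table = enumerate_paths(seq)
        best = max(table, key=table.get)
        print(seq, "best path:", best, "joint:", round(table[best], 6),
              "P(seq):", round(sum(table.values()), 6))
```

For TGCC the enumeration gives NNNN as the best path, with a joint probability of about 0.00175 and a total over the eight paths of about 0.0044. For TGCCG the best path becomes NRRRR, in line with the remark about a fifth G or C.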
