### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# BIOINFORMATICS CAP 5510

University of Central Florida

GPA 3.76

### View Full Document

## 10

## 0

## Popular in Course

## Popular in System Engineering

This 26 page Class Notes was uploaded by Khalil Conroy on Thursday October 22, 2015. The Class Notes belongs to CAP 5510 at University of Central Florida taught by Shaojie Zhang in Fall. Since its upload, it has received 10 views. For similar materials see /class/227215/cap-5510-university-of-central-florida in System Engineering at University of Central Florida.

## Similar to CAP 5510 at University of Central Florida

## Reviews for BIOINFORMATICS

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/22/15

Phylogenetic Hidden Markov Models Adam Siepel and David Haussler Center for Biomolecular Science and Engineering University of California Santa Cruz Santa Cruz CA 95064 USA Phylogenetic hidden Markov models or phylo HMMs are probabilistic mod els that consider not only the way substitutions occur through evolutionary history at each site of a genome but also the way this process changes from one site to the next By treating molecular evolution as a combination of two Markov processesione that operates in the dimension of space along a genome and one that operates in the dimension of time along the branches of a phylogenetic treeithese models allow aspects of both sequence structure and sequence evolution to be captured Moreover as we will discuss they per mit key computations to be performed exactly and efficiently Phylo HMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence segmentation such as gene prediction and the iden ti cation of conserved elements Phylo HMMs were rst proposed as a way of improving phylogenetic mod els that allow for variation among sites in the rate of substitution 8 52 Soon afterward they were adapted for the problem of secondary structure prediction 10 47 and some time later for the detection of recombination events 19 Recently there has been a revival of interest in these models 40 42 43 44 31 in connection with an explosion in the availability of comparative sequence data and an accompanying surge of interest in com parative methods for the detection of functional elements 35 3 23 46 41 There has been particular interest in applying phylo HMMs to a multi species version of the ab initio gene prediction problem 40 43 31 In this chapter phylo HMMs are introduced and examples are presented illustrating how they can be used both to identify regions of interest in mul tiply aligned sequences and to improve the goodness of t of ordinary phylo genetic models In addition we discuss how hidden Markov models HMMs phylogenetic models and phylo HMMs all can be considered special cases of general graphical models77 and how the algorithms that are used with these models can be considered special cases of more general algorithms This chapter is written at a tutorial level suitable for readers who are familiar with phylogenetic models but have had limited exposure to other kinds of graphical mo e s 2 Adam Siepel and David Haussler X TAACGGCAGE X TTAGGCAAG gt AAGGCGCCGQ Fi 1 A A 3state singlesequence HMM with a multinomial distribution as sociated with each state boxed tables A new state is visited at each time step according to the indicated transition probabilities numbers on arcs and a new character is emitted according to the probability distribution for that state The shaded boxes indicate the current state and a newly emitted character which is ap pended to the sequence X In this example one state has an AT rich distribution 51 one has a GC rich distribution 52 and one favors purines 53 B An analogous phyloHMM In this case the multinomial distributions are replaced by phylogenetic models and at each time step a new column in a multiple alignment X is emitted The phylogenetic models include parameters describing the overall shape and size of the tree as well as the background distribution for characters and the pattern of substitution For simplicity the tree parameters are represented graphically and only one auxiliary parameter is shown 1 Background A phylo HMM can be thought of as a machine that probabilistically generates a multiple alignment column by column such that each column is de ned by a phylogenetic model As with the singlesequence HMMs ordinarily used in bi ological sequence analysis 6 this machine probabilistically proceeds from one state to anotherl and at each time step it emits an observable object which is drawn from the distribution associated with the current state Figure 1 With phylo HMMs however the distributions associated with states are no longer multinomial distributions over a set of characters eigr ACGT but are more complex distributions de ned by phylogenetic models Phylogenetic models as considered here de ne a stochastic process of sub stitution that operates independently at each site in a genome the question 1 Throughout this chapter it is assumed that the Markov chain for state transitions is discrete rstorder and homogeneous Phylogenetic Hidden Markov Models 3 of independence will be revisited below In the assumed process a character is rst drawn at random from the background distribution and assigned to the root of the tree character substitutions then occur randomly along the trees branches from root to leaves The characters that remain at the leaves when the process has completed de ne an alignment columnl Thus a phylogenetic model induces a distribution over alignment columns having a correlation structure that re ects the phylogeny and substitution process see 10 The different phylogenetic models associated with the states of a phylo HMM may re ect different overall rates of substitution as in conserved and nonconserved regions different patterns of substitution or background distributions as in coding and noncoding regions or even different tree topologies as with re combination 19 l Typically with HMMs a sequence of observations here denoted X is available to be analyzed but the sequence of states called the path by which the observations were generated is hidden hence the name hidden Markov model Efficient algorithms are available to compute the maximum likelihood path the posterior probability that any given state generated any given element of X and the total probability of X considering all possible paths the likelihood of the model The usefulness of HMMs in general and phyloHMMs in particular is in large part a consequence of the fact that these computations can be performed exactly and ef ciently In this chapter three examples of applications of phylo HMMs will be presented that parallel these three types of computationiprediction based on the maximumlikelihood path example 1 prediction based on posterior probabilities example 2 and improved goodness of t as evidenced by model likelihood example 3 Finally it will be shown how these algorithms may be considered special cases of more general algorithms by regarding phylo HMMs as graphical models 2 Formal De nition of a Phylo HMM Formally we de ne phyloHMM H S 11 A b to be a fourtuple consisting of a set of states 5 31 l l l SM a set of associated phylogenetic models 1 1Hl M a matrix of statetransition probabilities A aj l S j k S M and a vector of initialstate probabilities b 121 l l 12M In particular 1 is the phylogenetic model associated with state Sj l S j S M a c l S j k S M is the conditional probability of visiting state k at some site 239 given that state j is visited at site i 7 l and bj l S j S M is the probability that statej is visited rst thus 2k a c l for all j and Ej bj 1 Let X be the given alignment consisting of L columns sites and n rows one for each of n species with the ith column denoted Xi l S i S L Each phylogenetic model 11 in turn consists of several components For our purposes a phylogenetic model 1 Q 77 Tj j is a fourtuple con sisting of a substitution rate matrix Q a vector of background or equilib rium frequencies 77 a binary tree Tj and a set of branch lengths jl The 4 Adam Siepel and David Haussler model is de ned with respect to an alphabet 2 eg 2 ACGT whose size is denoted di Generally Qj has dimension d X d and Tr has dimension d but see example 3 The tree Tj has n leaves corresponding to n presentday taxai The elements of j are associated with the branches edges of the tree It is assumed that all phylogenetic models in 1 are de ned with respect to the same alphabet and number of species The probability that a column Xi is emitted by state Sj is simply the prob ability of Xi under the corresponding phylogenetic model PXil ji This quantity can be computed ef ciently by a recursive dynamic programming algorithm known as Felsensteinls pruning algorithm Felsenstein s algo rithm requires conditional probabilities of substitution for all bases ab E E and branch lengths t E ji The probability of substitution of a base I for a base a along a branch of length t denoted Pblat j is based on a continuoustime Markov model of substitution de ned by the rate matrix jS In particular for any given nonnegative value t the conditional probabilities Pblat j for all a b E E are given by the d X d matrix Pj t expth k where expth 220 27 Qj can be parameterized in various more or less parsimonious ways 50 For most of this chapter we will assume the parameterization corresponding to the HKY model 12 which implies that Qj has the form 7 WCJ Hi We gt1 WTJ Qj WAJ 7 WGJ HjTrTj 1 H1 WAJ WCJ 7 WTJ WAJ H1 WCJ WGJ 7 where Trj WAJ39 TCJ TGJ39 TTJ39 Hj represents the transitiontransversion rate ratio for model 11 and the 7 symbols indicate quantities required to make each row sum to zero A path through the phyloHMM is a sequence of states 1 4151 i i i 45L such that E 1i i M for 1 S i S L The joint probability of a path and an alignment is2 L PW Xlg b 1PX1l 1 H a 71 PXz l 2 i2 The likelihood is given by the sum over all paths PXl9 245 P XlH and the maximumlikelihood path is 2 arg maxqb P Xl0i These quanti ties can be computed ef ciently using two closely related dynamicprogram ming algorithms known as the forward and Viterbi algorithms respectively The posterior probability that observation X was produced by state 8 de noted leH can be computed for all i and j by combining the for ward algorithm with a complementary backward algorithm in a forward backward procedurei Details can be found in 2 For simplicity transitions to an end state are omitted here Phylogenetic Hidden Markov Models 5 B 01 substhite l Q mouse Fig 2 A A 4 state phyloHMM for gene nding States 51 52 and 53 represent the three codon positions and state 54 represents noncoding sites The associated phylogenetic models 139 74 capture characteristic properties of the di erent types of sites g the higher average rate of substitution and the greater transi tiontransversion ratio in noncoding and 3rd codonposition sites than in 1st and 2ndcodonposition sites B The eight mammals and phylogeny assumed for the simulation with branch lengths drawn in the proportions of the noncoding model 1amp4 Subsets of species were selected to maximize the sum of the branch lengths of the induced subtree eg rat and dog for n 2 and rat dog and cow for n 3 Example 1 A toy gene nder This example is meant to demonstrate in principle how a phylo HMM can be used for gene nding Consider a simple 4state phylo HMM with states for the three codon positions and noncoding sites Fig 2A The problem is to identify the genes in a synthetic data set based on this model using nothing but the aligned sequence data and the model this is a multiplesequence version of the ab initio gene prediction problem For simplicity we assume the model parameters 9 are given along with the data set X In practice the parameters have been set to reasonable values for a phylogeny of n 8 mammals Fig 2B3 and the data set has been generated according to these values The state path was recorded during the generation of the data so that it could be used to evaluate the accuracy of predictions The synthetic data set consists of L 100 000 sites and 74 genes he Viterbi algorithm can be used for prediction of genes in this data set in a straightforward way For every site i l S i S L and statej l S j S M the emission probability PXil j is computed using Felsenstein s algorithm These L X M values together with the statetransition probabilities A and initialstate probabilities b are suf cient to de ne the joint probability P XWJ for any path 1 and can be simply plugged into the standard 3 Parameter estimates from 44 were used for the phylogenetic models and the 39 39 39 39 39 were 39 t based on estimates from 43 the probability from 54 to 51 was in ated so that genes would not be too sparse A uniform distribution was assumed for initialstate probabilities 6 Adam Siepel and David Haussler B o 8 m 7 50 o l 6 u a gt ltr u nonrphyloHMM sensltlvlty O onrphyloHMM speclflclty g phyloHMM sensltlvlty m phyloHMM speclflclty Fig 3 Nucleotidelevel sensitivity and speci city for the phle and nonphylo HMMs on the simulated data set of example 1 Results are shown for n 1 species 7 Viterbi algorithm to obtain a maximumlikelihood path 1 This predicted path in turn de nes a set of predicted genes 0 evaluate the effect on prediction accuracy of the number of species in the data set subsets of n 1H8 sequences were selected from the full alignment Fig 2B and a separate set of predictions was produced for eac su set Predictions were also produced with an alternative model in which emission probabilities were based on the assumption that all characters in a column were independently drawn from the background equilibrium distribution of each stateiin other words the correlation structure implied by the phylogeny was ignored This model which will be called the non phylo HMM77 allows the importance of the phylogeny in the phylo HMM to be assesse The nucleotidelevel sensitivity portion correctly predicted of sites actu ally in genes and speci city portion correct of sites predicted to be in genes for both mo e s are shown in Fig 3 as the number of species increases from n l to n 8 The two models are identical for n l where there is no phy logeny to consider but as the number of species increases from n 2 i i 8 the performance of the phylo HMM rapidly improves with about 98 sensi tivity and speci city achieved by n 2 and 99 sensitivity and speci city achieved by n 5 The nonphylo HMM on the other hand appears to improve slightly then decline in both sensitivity and speci city4 The phylo HMM is able to capitalize on differences in branch lengths and substitution 4 It might be expected that the prediction accuracy of the nonphyloHMM would simple fail to improve as rapidly as that of the phyloHMM rather than declin ing The reason for the decline seems to be that the erroneous assumption of independence causes random uctuations in base composition to appear more Phylogenetic Hidden Markov Models 7 patterns while the nonphylo HMM has to rely completely on more subtle differences in base composition This example is obviously a gross simpli cation of the real gene prediction problem here the model used for prediction exactly matches the model used to generate the data while in the real problem the model for prediction tends to t the data in a much more approximate way see Discussion Even if slightly contrived however this example should help to illustrate how the information encoded in substitution rates and patterns can be exploited in problems of segmentation such as gene predictioni D Example 2 Identi cation of highly conserved regions Our second example is concerned with a phylo HMM in which states cor respond to rate categorieslliclasses of sites assumed to differ only in overall rate of substitutionirather than functional categories as in the previous example The problem is to identify highly conserved genomic regions in a set of multiply aligned sequencesi Such regions are likely to be functionally important and hence their identi cation has become a subject of consider able interest in comparative genomics see Margulies et all 30 for a recent review and a comprehensive discussion In this example we will use a phylo HMM to identify conserved regions in a subset of the data set analyzed by Margulies et all It will be shown that a phyloHMM can be used to obtain results comparable to theirs and has certain potential advantages over their methods A phylo HMM is assumed like the one proposed by Felsenstein and Churchill 8 with k states corresponding to k rate categories and state tran sitions de ned by a single autocorrelation parameter A Fig 4 a similar model but with a more complex parameterization of transition probabilities was proposed by Yang 52 Regions of the alignment that are likely to have been generated by the slowest rate categories will be considered putative Multispecies Conserved Sequences MCSs 30 Speci cally we will look at sites i for which the posterior probability llX H is high assuming state 31 has the smallest rate constant Posterior probabilities will be com puted using the forwardbackward algorithmi As in example 1 the L X k table of emission probabilitiesiirei PXil j for every site i l S i S L and state j l S j S kitogether with the statetransition and initialstate probabilities parameters A and b of the phyloHMM can be plugged into the standard forwardbackward algorithm for HMMsr In other words once the emission probabilities are computed the phylogenetic models can be ignored and the phyloHMM can be treated like an ordinary HMMr Note that infer ences about the evolutionary rate at each site could alternatively be based on the Viterbi path We have opted to use posterior probabilities instead partly for illustration and partly because they can be conveniently interpreted as a continuousvalued conservation score that can be plotted along the genome signi cant than they really are These uctuations are explained by changes in state resulting in errors in the inferred path and a decline in accuracy 8 Adam Siepel and David Haussler a17 Hum Fig 4 Statetransition diagram for the autocorrelated ratevariation model of Felsenstein and Churchill 8 with k 3 rate categories and a uniform stationary distribution The autocorrelation parameter A de nes all transition probabilities as shown It takes values between 0 and 1 and describes the degree to which the evolutionary rates at adjacent sites tend to be similar The values r1 T2 and T3 are applied as scaling constants to the branch lengths of a phylogenetic model all parameters other than branch lengths are left unchanged In our case7 these rate constants77 as well as A are estimated approximately from the data see 42 see below With this model the posterior probabilities also tend to be more robust than the Viterbi path which is highly sensitive to A The data set consists of about 18 Mb of human sequence from chromo some 7 and homologous sequence from 8 other eutherian mammals 46 we consider only the 9 mammals of the 12 species analyzed in 30 The species and phylogeny are as shown in Fig 2B except that in this case chimp is also included and appears in the phylogeny as a sister taxon to human Assum ing the HKY substitution model and k 10 states7 we tted a phylo HMM to this alignment obtaining an estimate of x 094 Using these parameter estimates we then computed the posterior probability of each state at each site The posterior probabilities for 31 in a selected region of the alignment are shown in Fig 5 along with the conservation scores developed by Margulies et al The known exons in this region all coincide with regions of high posterior probability7 as do several conserved intronic features identi ed by Margulies et al 30 A detailed comparison of results is not possible here7 but we note that the posterior probabilities based on the phylo HMM are qualitatively very simi lar to the binomial and parsimonybased conservation scores of Margulies et al 30 In addition7 the phylo HMM may have certain advantages as a frame work for addressing this problem For example7 it requires no sliding window of xed size and as a result is capable of identifying both very short highly conserved sequences and long notso conserved sequences In addition7 it can be used with any phylogenetic model including eg ones that allow for non Phylogenetic Hidden Markov Models 9 BasePosmon I Ennnnn I autumn I Enznnn I nannn I ENDED I n nnn I susnnn I Sumnn I snsnnn I Relsw Genes BinomiaLz Conservation Swves 1 H r i I Bmomlal 439 i u l 1 l 1 PIXJr x Mm 39 Waman Pym basM MuslParsF VRl a mall E II n m Paysmny Memes Conservation Scove Pamrlnjjv Iquot E m I H rl mn il nIIJ n 14 nx n A a I w l PhylorHMM Post onb Rate cmgowt nun Mn phyiorHMMyostvob n Fig 5 A screen shot from the UCSC Genome Browser 24 showing a selected region of the data set of example 2 including several exons of the MET gene black boxes at top The binomialbased light gray and parsimonybased medium gray conservation scores of Margulies et al 30 are shown as tracks in the browser as are the posterior probabilities X1000 of state 51 in the phyloHMM dark gray Plots similar to this one showing phyloHMMbased conservation scores across the whole human genome can be viewed online at httpgenomeucscedu homogeneities in the substitution process or contextdependent substitution see example 3 it extends naturally to the case in which different functional categories of sites as well as rate categories are considered 42 and it could be adapted to model properties such as the length distributions of MCSs eigr using techniques from gene nding D 3 Higher Order Markov Models for Emissions It is common with singlesequence gene nding HMMs to condition the emis sion probability of each observation I on the observations that immediately precede it in the sequence eigi 1172 and 1 By taking into consideration the context for each observation emission probabilities become more in formative and the HMM can discriminate more effectively between different classes of observations For example in a 3rd codonposition state the emis sion of a base I A might have a fairly high probability if the previous two bases are 2 G and 1 A GAA Glu but should have 77 zero probability if the previous two bases are 2 T and 1 A Sto Considering the N observations preceding each I corresponds to using an Nth order Markov model for emissions Note that such a model does not imply an Nth order Markov chain for state transitions indeed things are kept simpler and the model remains mathematically valid if state transitions continue to be described by a 1st order Markov chain An Nth order model for emissions is typically parameterized in terms of N ltuples of observations 10 Adam Siepel and David Haussler and conditional probabilities are computed as PIi7N7wIi7171i 2y 1311397N7H 71i71797 with the numerator being the probability of the N 1tuple IFN i i Ii and the sum in the denominator being over all possible observations y that could appear in place of mix An Nth order Markov model for emissions can be incorporated into a phyloHMM in essentially the same way In this case a whole alignment col umn Xi is considered in place of each single base Iii Because we will primar ily be concerned below with tuple size let us also rede ne N and speak of N 7 lst order Markov models and Ntuples of observations instead of Nth order Markov models and N ltuples of observations With these changes equation 3 can be rewritten as 3 Pzil1i7N7 71171 PXiN1i i i Xi1 Xi PltXZlXZiN17HWXZilgt Ev PXi7N17A7Xi717Y 4 Notice that the sum in the denominator is now over all possible alignment columns Y and has d terms where d is the size of the alphabet d and n is the number of rows species in the alignment To compute the quantity in the numerator of equation 4 we replace an ordinary phylogenetic model de ned with respect to an alphabet E with what we will call an Nth order77 phylogenetic model de ned with respect to EN the alphabet of N tuples of characters from 2 5 The new rate matrix and vector of equilibrium frequencies will have dimensions dN gtlt dN and dN respectively The Ntuple of columns in the numerator is reinterpreted as a column of Ntuples and its probability is computed with Felsenstein s pruning algorithm using the Nth order phylogenetic model The sum in the denominator can no longer be evaluated directly but it can be computed ef ciently by dynamic program ming using a slight adaptation of Felsensteinls algorithm 44 42 This new algorithm differs from the original only in its initialization strategyi Thus the conditional probability PXilXiN1 i Xi1 can be computed with an Nth order phylogenetic model and two passes through Felsensteinls algo rithm one for the numerator and one for the denominator of equation 4 This procedure is feasible only for small N so far for N S 3 5 Note that the order of a phylogenetic model is given by the size of the tuples considered and is not equal to the order of the Markov model for emissions Here Nth order phylogenetic models are used to de ne an N 7 1st order Markov model Phylogenetic Hidden Markov Models 11 Once the conditional emission probabilities of equation 4 are available they can be substituted directly into equation 2 For example in the case of N 3 equation 2 can be rewritten as Plt 7Xl0 12451PX1l 1a 1 2PX2lX17 2 L X H a 171 PXilegt27Xiii la 5 i3 The forward Viterbi and forwardbackward algorithms are unaffected by the use of a higherorder Markov model for emissions It is important to note that this strategy for incorporating higher order Markov models into a phylo HMM allows context to be considered in the nucleotide substitution process as well as in the equilibrium frequencies of bases Nth order phylogenetic models describe the joint substitution proba bilities of Ntuples of nucleotides As a result the conditional probabilities of equation 4 may re ect various important context or neighbordependencies in the substitution process such as the tendency for synonymous substitutions to occur at a higher rate than nonsynonymous substitutions in coding regions or the tendency for a high rate of CAT transitions in CpG dinucleotides Equations 4 and 5 as will be shown in example 3 essentially provide a way of stringing together77 contextdependent phylogenetic models so that context dependencies can be considered between every adjacent pair of columns in an alignment Example 5 Modeling contextdependent substitution In this example we will look at how goodness of t is affected by increasing the order N of a phylogenetic model and by allowing for Markov dependence between sites as in equation 5 We will consider the goodness of t of var ious independentsite N l and contextdependent N gt 1 phylogenetic models with respect to about 160000 sites in aligned noncoding DNA from 9 mammalian species The results presented here are taken from 44 The full paper should be consulted for complete details For convenience let us call the class of phylo HMMs described by equa tions 4 and 5 Markovdependent models because they allow for Markov dependence of columns in the alignment As will be seen below these models are actually only approximations of models that properly allow for Markov dependence across sites in the substitution process Regardless these Markov dependent models are valid probability models the probabilities of all align ments of a given size sum to one so it is fair to evaluate goodness of t based on model likelihoods The way in which these models are approximate is discussed in detail in Section 7 and the Appendix In this example there are no functional or rate categories to considerl We assume that the HMM has only a single state so nothing is actually hiddenllionly one path is possible and the model reduces to a Markov chain 12 Adam Siepel and David Haussler As a result Equation 5 becomes L P 7Xl9 PX1l 1PX2lX17 1HPXilXi727Xi71 p1 6 i3 This simpli cation allows us to focus on the impact of higherorder Markov models and to avoid issues related to the HMM structure Keep in mind however that higherorder Markov models can be used with a nontrivial HMM as easily as with this trivial one In 44 various models were tted to the data set of 160000 noncod ing sites and their likelihoods were compared The models differed in the type of phylogenetic model used its order N and the parameterization of its rate matrix and whether Ntuples of columns were assumed independent or Markovdependence was allowed We will focus here on four types of phyloge netic models the HKY and UNR 1st order models the U2S 2nd order model and the USS Srd order model The HKY model introduced in Section 2 is treated as a baseline The UNR or unrestricted model has a separate free parameter for every nondiagonal element of the rate matrix and is the most general model possible for single nucleotide substitution see eg 51 The U2S and USS models are fully general 2nd and Srd order models respectively except that they assume strand symmetry so that eg the rate at which AG changes to AC is the same as the rate at which CT changes to GT and like most codon models 11 they prohibit instantaneous substitutions of more than one nucleotide They have 48 and 288 ratematrix parameters respectively We will consider two cases for each phylogenetic model an in dependent tuples77 case in which the data set was partitioned into Ntuples of columns which were considered independent and a Markovdependent case in which Ntuples were allowed to overlap and likelihoods were computed with equations 4 and 6 Note that with 1st order models the independent tuples and Markovdependent cases are identical Fig 6A shows the log likelihoods of the UNR U2S and USS phylogenetic models with and without Markov dependence relative to the log likelihood of the HKY modeli Even when Ntuples are considered independent context dependent models here U2S and USS produce a striking improvement in likelihoodia far larger increase than is obtained by replacing even a fairly parsimonious 1st order model HKY with a fully general one UNR When Markov dependence between sites is introduced another large improvement occurs This improvement appears to be largely a consequence of the fact that with Markov dependence every boundary between adjacent sites is con sidered while with independent tuples only every other UQS or every third USS such boundary is considered Notice that even with Markov depen dence goodness of t improves signi cantly when a 2nd order model UZS is replaced with a Srd order model USS This is probably partly because of direct context effects that extend beyond the nearest neighbors of each base and partly because the Srd order model does a better job than the Phylogenetic Hidden Markov Models 13 A B o m o Transition 0 u lndep tu les m Transverslon o O Markov e a CpG E o E N 7 a o D O 0 3 v 5 9 7 5 9 X S g 8 E U m 7 lt1 g m E m a N 0 l 39 7 T UNR U23 USS 01 50 100 2 0 5 1 0 2 0 esllmaled rate USS Fig 6 A Log likelihoods of the UNR UQS and U3S phylogenetic models with and without Markov dependence between sites relative to the log likelihood of the HKY model Results are for an alignment of 9 species and approximately 160000 sites of noncoding data as described in 44 B Parameter estimates of substitution rates for the U3S model vs estimates based on counts from aligned human genes and pseudogenes 15 The rates cluster into three groups transversions transitions and G transitions CpG transversions cluster with nonCpG transitions In general the two sets of estimates agree fairly well considering the differences in metho s and data sets see 44 for a detailed discussion 2nd order model of accounting for indirect context effectsiie it provides a better approximation of a proper p 1 mo el 0 t substitution see below The observed improvements remain essentially unchanged when a mea surement is used that considers the different numbers of parameters in the models and the size of the data set the Bayesian information criterion and in crossvalidation experiments 44 Thus the apparent improvement in good ness of t is not an artifact of the number of parameters in the models The UZS and USS models allow contextdependent substitution rates to be estimated with full consideration of the phylogeny and allowance for mul tiple substitutions per site unlike simpler counting 7 methods for estimating contextdependent substitution rates 15 Parameter estimates indicate wide variation in rates spanning a 200fold range and in particular pronounced CpG effects Fig 6B Coding regions can be modeled using a simple 3state phylo HMM with a separate 3rd order phylogenetic model for each codon position Thus the state corresponding to the 3rd codon position considers columns of aligned codons like an ordinary codon model but the other two states consider columns of nucleotide triples that are out of frame and consequently these states can capture context effects that cross codon boundaries Such a model improves 14 Adam Siepel and David Haussler substantially on ordinary codon models indicating that context effects that cross codon boundaries are important 44 see also 39 D 4 Phylogenetic Models HMMs and Phylo HMMs as Graphical Models In recent years probabilistic models originally developed in various research communities have been uni ed under the heading of graphical models77 Graphical models provide an intuitively appealing framework for construct ing and understanding probabilistic models and at the same time allow for rigorous analysis in very general statistical and graphtheoretic terms of al gorithms for inference and learning Many familiar classes of models t natu rally into the graphical models framework including HMMs and phylogenetic models as well as mixture models and hierarchical Bayesian models A phylo HMM can be seen as a graphical model whose structure is a hybrid of the graphical models for HMMs and phylogenetic models Fig 7 Viewing phylo HMMs as graphical models helps to provide insight about why they permit ef cient inference and why this property may be sacri ced when assumptions such as site independence are relaxed Our discussion of graphical models will necessarily be brief other tutorials should be consulted for a more complete introduction to the eld eg 5 13 22 In graphical models random variables are represented by nodes in a graph and dependencies between variables are represented by edges Fig 7 Let X be the set of random variables represented by a graph with nodes vertices V and edges E such that Xv is the variable associated with v 6 Vi In addition let X0 be the subset of variables associated with C E V and let lowercase letters indicate sets of instances of variables ergi ID 10 and 1 Graph ical models can be de ned in terms of directed or undirected graphs and accordingly are called directed or undirected models here we will focus on the directed case which for our purposes is simpler to describe In a directed model the edges of the graph cone pond to local quot39 p quot39 dis tributions and the joint probability of a set of instances 1 is a product of the conditional probabilities of nodes given their parents Pltzgt H PIleP7 lt7 UEV where 7 denotes the set of parents of node v and PIvl1pw is the local conditional probability associated with xvi It should not be too hard to see looking at Fig 7 that equation 7 generalizes thejoint probability ofa sequence and a particular path in the case of an HMM and the joint probability of an 6 The brief introduction to graphical models provided here roughly follows the more detailed tutorial of Jordan and Weiss 22 Phylogenetic Hidden Markov Models 15 X12 1 Xi Xi1 Fig 7 Graphical model representations of A an HMM B a phylogenetic model and C a phyloHMM In each case nodes correspond onetoone with random variables shaded nodes represent observed variables and unshaded nodes repre sent unobserved latent variables These are directed graphical models7 based on directed acyclic graphs sometimes called Bayesian networks The edges between nodes correspond to local conditional probability distributions and can be thought of as implying dependencies between variables More precisely the set of all edges de nes a set of conditional independence assertions about the variables In A7 each Xi represents an observation in the sequence and each represents a state in the path The conditional probability distribution for observation Xi given state is incorporated in the directed edge from to Xi and the conditional probability distribution for state given state mil ie of a transition from WA to is incorporated in the directed edge from W71 t0 In B each set of nodes collec tively labeled Xi represents an alignment column and each set collectively labeled Yi represents a set of ancestral bases The conditional probabilities of nucleotide substitutions based on the continuoustime Markov model are incorporated in the directed edges from each parent node to its two children In C7 conventions from A and B are combined alignment and a particular set of ancestral bases in the case of a phylogenetic o e The general problem of probabilistic inference is to compute marginal probabilities from this joint distributioniprobabilities of the form PzU 1W PzU 1W7 where U7 W is a partitioning of V The likelihood is an example of such a marginal probability7 with my being the observed data and XW being the set of latent variables When the likelihood of an HMM is com 16 Adam Siepel and David Haussler A B C X2 X2 runway 30 X3 X4 X3m34w44 24 4 X5 X1 Fig 8 A A directed graphical model whose nodes form an arbitrary tree The marginal probability of an observed value of X5 is desired B The intermediate values of the elimination algorithm can be seen as messages that are passed from one node to another in the direction of X5 C In the beliefpropagation algorithm all possible messages are generated simultaneously the marginal probability of each node is a product of the incoming messages Based on Figure 1 of Jordan and Weiss 22 puted EU is the observed sequence and XW is the latent path With a phylogenetic model the procedure is applied independently at each site and EU is an observed alignment column and XW is a set of latent ancestral bases Conditional probabilities of interest such as the posterior probabilities of example 2 can be computed as quotients of marginal probabilities For in stance suppose zU is the observed data and Xw w E W is a latent variable then PleIU W Marginal probabilities can always be computed from the complete joint distribution by bruteforce summation The problem is to keep these compu tations tractable as the number of random variables becomes large It turns out that if a directed graphical model is a tree or set of trees as in Fig 7AampB and Fig 8 meaning that every node has at most one parent then exact inference can be accomplished ef ciently by dynamic programming As we will see ef cient exact inference is also possible in certain cases in which the directed graph is not a tree The basic algorithm for computing marginal probabilities is known as elimination and is most easily described by example Consider the graph of Fig 8A with X X1X2 X3X4X5 and edges as depicted The elimina tion algorithm takes advantage of the commutativity of sums and products and reuse of intermediate computations to reduce the computational com plexity of a marginal summation 7 This discussion is restricted to discrete random variables although it extends directly to the continuous case Phylogenetic Hidden Markov Models 17 Algebraically the algorithm proceeds as follows PI5 Z Px1x2xax4x5 wlvaWx39vaA Z ZZZPx1Px2lx4Pmlx4Pmlx1P5lx4 701 a02 702 704 ZPx5lx4ZPmlx4ZPx2lx4ZPx1Pmlx1 ZPx5lx4ZPmlx4ZPx2lx4m14x4 ZPx5lx4ZPI3lI4m244m144 Z PI5lI4ma4I4m244m14I4 704 77145x5 8 where the terms of the form mijzj denote the results of intermediate nested summations each mijzj is the result of a sum over I and is a function of zjr The algorithm can be described in graphtheoretic terms as a procedure that eliminates one node at a time from the graph until only the node corresponding to the desired marginal probability remainsr From t e al gebraic description many readers will recognize the similarity to Felsenstein s pruning algorithm Felsensteinls algorithm it turns out is an instance of the elimination algorithmione of the earliest instances to be discovered The forward algorithm is another instance of the elimination algorithm as is the combined forwardFelsenstein algorithm that we used above to compute the likelihood of a phyloHMMr The Viterbi algorithm is closely related to the elimination algorithm it can be derived by noting that the max operator commutes with products just as the summation operator doesr Note that the elimination algorithm depends on a good elimination ordering An optimal ordering is dif cult to nd for arbitrary graphs but can be determined eas ily for speci c classes of models as with HMMs phylogenetic models and phyloHMMs Often not just one but many marginal probabilities are desired The elimi nation algorithm can be extended to compute the marginal probabilities for all nodes in two passes across the graph with conditional probabilities computed in a forward pass and marginals in a backward pass 29 Typically this proce dure is described as belief propagation 37 with node elimination replaced by a messagepassing metaphor Fig 8BampCr The beliefpropagation also called sumproduct algorithm generalizes the forwardbackward algorithm for HMMs and algorithms for phylogenetic models that compute marginal probabilities of ancestral bases 26 e have focused on directed models but undirected models are similarr Moreover the undirected case turns out to be in a sense the more general one with respect to inference ln undirected models the graph is viewed in 18 Adam Siepel and David Haussler terms of cliques maximal fully connected subgraphs and a potential function essentially an unnormalized probability distribution is associated with each clique The joint probability of all variables equation 7 is now a product over cliques with a normalizing constant to ensure that 21 Pz 1 Di rected graphs can be converted to undirected graphs by a process known as moralization wherein the arrowheads of the edges are removed and new edges are added between all parents of each node the resulting graph is called the moral graph because it requires that all parents are married By explicitly creating a clique that includes each node and all of its parents moralization ensures that all dependencies implied by the local conditional distributions of the directed graph are captured in the undirected graph The moral graph for a directed tree is simply an undirected tree ie no new edges are added and the belief propagation algorithm for this undirected tree is the same as that illustrated in Fig 8 For undirected graphs that con tain cycles a generalization of the belief propagation algorithm called the junctiontree algorithm can be used The junctiontree algorithm oper ates on a tree of cliques rather than of nodes and computes unnormalized marginal probabilities for cliques marginal probabilities of nodes can be ob tained afterwards It requires an additional step called triangulation in which new edges are added to the graph to represent certain implicit depen dencies between nodes A complete introduction to thejunction tree algorithm is not possible here more details can be found in 5 and 22 The key point for our purposes is that the computational complexity of the algorithm is ex ponential in the size of the largest cliquei Thus graphs with cycles can still be handled efficiently if their clique size is constrained to be small It is for this reason that phyloHMMs permit efficient inference their triangulated moral graphs have cycles but the maximum clique size turns out to be threegi When clique size is large exact inference is intractable and approximate methods are required Some of the approximate methods in use include Monte Carlo algorithms and variational methods which include mean eld methods and loopy belief propagation approximate methods are partially surveyed in 22 see also 36 53 48 49 With phyloHMMs the junctiontree algorithm allows computation not only of the posterior probability that each site was emitted by each state as in example 2 but also of marginal posterior probabilities of ancestral bases considering all paths In addition the algorithm can be used to com pute posterior expected values of interest such as the expected number of substitutions per site or the expected numbers of each type of substitution AHC AHG etci along each branch of the tree the sufficient statistics for parameter estimation by expectation maximization 9 44 Using the junc tion tree algorithm in the expectation step of an expectationmaximization 8 In the case of a phylo HMM the parents of each node are already connected Fig 7C s0 moralization amounts simply to removing the arrowheads from all edges in the graph Moreover it turns out that this graph is already triangulated Phylogenetic Hidden Markov Models 19 B Yiil Yi Yi1 X121 Xi Xi1 Fig 9 A The lattice that results when contextdependent substitution is incor porated into a phylogenetic model shown as an undirected graphical model For clarity only a single leaf node is shown for each site with a chain of ancestral nodes leading to the root the phylogeny can be imagined as going into and out of the page Each node depends not only on its parent node in the phylogeny but also on its parent s left and right neighbors in the alignment B A version of the graph in A with intermediate nodes added to the branches of the tree As more and more nodes are added the branch lengths between them approach zero and the model approaches a true processbased model of contextdependent substitution In both A and B the untriangulated graph is shown additional edges appear during triangulation leading to prohibitively large clique sizes algorithm it is possible to train a phylo HMM including its phylogenetic models completely from unlabeled data This technique could be used for example for de novo detection of bindingsite motifs in aligned sequences Once the effect of cycles in graphical models is understood it becomes clear that ef cient exact inference will not be possible with models that ac curately describe the process of contextdependent substitution by allowing for dependencies between adjacent bases on all branches of the phylogenetic tree Fig 9A illustrates what happens to the graphical structure of a phy logenetic model when this kind of proper contextdependence is introduced The additional edges in the graph lead to the formation of a kind of lattice of dependency reminiscent of the classic lsing model from statistical mechanics this case is like a twodimensional lsing model except that the branching structure of the phylogeny creates a branching structure of twodimensional sheets not shown in Fig 9A Unless the size of the lattice is constrained to be small models of this kind are wellknown to require approximate methods for inference Moreover for contextdependent substitution to be modeled properly it should be integrated into the continuoustime Markov model of substitution so that contexteffects can propagate inde nitely across sites as substitutions accumulate along each branch of the phylogeny This behavior can be approx imated by introducing intermediate nodes in the phylogeny while keeping total branch lengths constant as shown in Fig 9B As more and more nodes are introduced the branch lengths between them will approach zero and the model will approach the desired processbased model Exact inference is 20 Adam Siepel and David Haussler intractable for such models even in the case of two sequences and an unrooted tree but Markov chain Monte Carlo MCMC methods have been applied in this special case 20 38 The stationary distribution of a related process has also been studied Extending such processbased models to full phylogenies appears difficult even with MCMCi However a model without intermediate nodes as in Fig 9A has been studied by Jojic et all 21 using variational methods for approximate inference Jojic et all have shown experimentally that this model can produce signi cantly higher likelihoods than the U2S ver sion of the more approximate Markovdependent model described in Section 3 The model of Section 3 essentially works by de ning a simple N 7 lst order Markov chain of alignment columns observed variables while ignoring the dependencies between the ancestral bases latent variables that are asso ciated with overlapping Ntuples of columns As a result this model has no reasonable processbased interpretationi Nevertheless it is a valid probability model that appears to t the data well and it allows for exact inference at modest computational cost 44 The Markovdependent model is compared to the model of Jojic et all in more detail in the Appendix 5 Discussion Phylogenetic hidden Markov models are probabilistic models that describe molecular evolution as a combination of two Markov processesione that op erates in the dimension of space along a genome and one that operates in the dimension of time along the branches of a tree They combine HMMs and phylogenetic models two of the most powerful and widely used classes of probabilistic models in biological sequence analysis Phylo HMMs often t aligned DNA sequence data considerably better than models that treat all sites equally or that fail to allow for correlations between sites In addition they are useful for identifying regions of interest in aligned sequences such as genes or highly conserved regions Three examples have been presented to illustrate some of the ways in which phyloHMMs may be used and each one deserves additional comment Applying phyloHMMs to gene prediction example 1 is a much harder prob lem than implied here for several reasons First while coding and noncoding sites have quite different properties on average both types of sites are hetero geneous mixtures so that correctly classifying particular sequence segments can be difficult For example protein coding sites show higher average levels of evolutionary conservation than noncoding sites but mammalian genomes do appear to have many islands of conservation in noncoding regions 4 30 which can lead to falsepositive predictions of exons 43 Similarly coding sites in mammalian genomes exhibit higher average GC content than do noncod ing sites but base composition varies considerably in both kinds of sites from one genomic region to another which can have the effect of confounding gene prediction softwarei Second the gene finding problem ends up being largely Phylogenetic Hidden Markov Models 21 about identifying the boundaries of exons as determined by splice sites and phyloHMMs are not necessarily the best tools for detecting these so called signals Gene nders are often based on composite models with special ized submodels for signal detection a similar approach may be required for phyloHMMs to be effective in gene predictioni A third problem is that a straightforward phylo HMM like that of example 1 induces a geometric distri bution of exon lengths which is known to be incorrect Some of these problems have been addressed with a generalized phylo HMM that allows for arbi trary length distributions of exons and also uses different sets of parameters for regions of different overall GC content 31 In other recent work it has been shown that the prediction performance of a phylo HMMbased exon pre dictor can be improved signi cantly by using contextdependent phylogenetic models and by explicitly modeling both conserved noncoding regions and nu cleotide insertionsdeletions 43 Additional challenges in multispecies gene prediction are also discussed in 43 stemming from lack of conservation of exon structure across species and errors in the multiple alignment There are many possible ways of identifying conserved regions example 2 and even quite different methods eg ones that do and do not consider the phylogeny tend to be fairly concordant in the regions they identify 45 30i Perhaps more dif cult than proposing a method to identify conserved regions is con rming that it produces biologically useful results Limited kinds of validation can be done computationally but this is ultimately an experimental problem and must be addressed in the laboratory Most likely phylo HMMs of the kind described in example 2 will not produce dramatically different results from other methods but as mentioned above they provide a exible framework in which to address the problemi It should be noted that while the original papers introducing phylo HMMs focused on improving the realism and goodness of t of models allowing for rate variation 8 52 they also showed that phylo HMMs could be used to predict the evolutionary rate at each site Modeling contextdependent substitution is an active area of current re search and the Markovdependent model described here example 3 repre sents only one of several possible approaches to this problemi The approach of Jojic et all 21 discussed at the end of Section 4 is another and we are aware of work in progress on at least two other completely different methods At this stage it remains unclear which models and algorithms for inference will allow for the best compromise between computational ef ciency and goodness of t It is likely that different approaches will turn out to be appropriate for different purposes pace has not allowed for a complete survey of the applications of phylo HMMsi In particular we have not discussed their use in the prediction of secondary structure 10 47 28 or the detection of recombination 19 nor have we touched on their use in a Bayesian setting 32 18i We also have not discussed the models similar in spirit to phylo HMMs that have been ap plied to the problems of RNA secondary structure prediction 25 and multiple 22 Adam Siepel and David Haussler alignment 34 17 16 14 It has been noted 40 that phyloHMMs themselves could be used for multiple alignment in a direct extension of the way pair HMMs are used for pairwise alignment lndeed phyloHMMs provide a natural framework for simultaneously addressing the multiple alignment and gene prediction problems as has been done in the twosequence case with pair HMMs 1 33 Another area in which phyloHMMs may prove useful is ho mology searchingi In principle the pro le HMMs that are commonly applied to this problem 6 could be adapted to use phylogenetic models instead of assuming independence of aligned sequences or relying on ad hoc weighting schemesi Acknowledgments We thank Nick Goldman David Heckerman and Michael Jordan for helpful quot 39 about t t d p d t 39 39 and Brian Lucena Math ieu Blanchette Robert Baertsch and Michael Jordan for comments on the manuscript AS is supported by an ARCS Foundation scholarship and NHGRI grant 1P41HG02371 and DiHi is supported by the Howard Hughes Medical Institute Appendix In this short Appendix we will examine more closely how the Markov dependent model for contextdependent substitution that was presented in Section 3 example 3 compares with the graphical models of Section 4 We will concentrate on the model studied by Jojic et all 21 which we will refer to as the simplelattice model in contrast to the full processbased model of Fig 913 The undirected graph for the simplelattice model is shown in Fig 10A assuming a very small alignment of n 3 sequences and L 3 columns The complete graph is shown here whereas in Fig 9A only a sub graph was shown From Fig 10A it should be clear that the graph contains an L X 2 lattice of nodes for each branch of the phylogenyi The Markovdependent model of Section 3 is a graphical model insofar as it is based on a Markovchain of random variables but it is quite different from the simplelattice model The Markovdependent model actually oper ates at two levels as illustrated in Fig 1013 At one level top of gure a simple Markov chain of alignment columns is assumed with each column being treated as an observed random variable At another level boxes at bottom of gure the conditional probability of each column given the previous column is computed according to a phylogenetic model for pairs of columns Each of these phylogenetic models is a submodel of the model shown in Fig 10A When conditional probabilities are computed according to these separate phy logenetic models multiple versions of the random variables for ancestral bases Phylogenetic Hidden Markov Models 23 Fig 10 A Undirected graph for the simplelattice model of Fig 9A for an alignment of L 3 sites and n 3 species Each node in the phylogeny is repre sented by a sequence of three nodes corresponding to sites 1 2 and 3 and each of these nodes is connected not only to its parent but to its parent s neighbors to the left and right The shaded nodes together represent the three columns of the alignment X1 X2 and X3 and the unshaded nodes re resent the corresponding sets of ancestral bases Y1 Y2 and Y3 B An interpretation of the Markovchain model of Section 3 applied to the same alignment the case of N 2 is illustrated At one level top a Markov chain of alignment columns is assumed At another level bottom inside boxes the conditional probability of each column given the previous column is computed according to a phylogenetic model for pairs of sites are effectively introduced eg Y2 and Y in Fig 10B Moreover these dif ferent versions are not required to be consistent The effect of this modeling choice is to ignore indirect dependencies between latent variables that do not belong to the same slice of N columns but at the same time to permit exact likelihood computations and to capture what are probably the most important context effects By failing to tie together the ancestral nodes of these multiple phyloge netic models the Markovdependent model sacri ces any claim of accurately representing the process of contextdependent substitution Nevertheless it allows the major consequences of this process to be characterized empirically in such a way that valid likelihoods can be extracted as well as reasonable 39 39 of the quot39 39 of key quantities 24 Adam Siepel and David Haussler References 1 M Alexandersson S Cawley7 and L Pachter SLAM Crossspecies gene nding and alignment with a generalized pair hidden Markov model Genome Res7 13496502 2003 2 P F Arndt C B Burge and T Hwa DNA sequence evolution with neighbor dependent mutation In Proc 6th Int l Gonf on Research in Computational Molecular Biology REGUMB UQ pages 32738 2002 D Bo elli J McAuliHe D Ovcharenko K D Lewis7 l Ovcharenko L Pachter7 and E M Rubin Phylogenetic shadowing of primate sequences to nd functional regions of the human genome Science 2992139171394 2003 F Chiaromonte7 R J Weber K M Roskin M Diekhans W J Kent7 and D Haussler The share of human genomic DNA under selection estimated from humanmouse genomic alignments In Cold Spring Harbor Symp Quant Biol7 volume 687 pages 2457254 2003 R Cowell Introduction to inference for Bayesian networks In M 1 Jordan editor Learning in Graphical Models MIT Press Cambridge7 MA 1999 R Durbin7 S Eddy7 A Krogh7 and G Mitchison Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids Cambridge University Press 1998 w 3 01 C H J Felsenstein Eholutionary trees from DNA sequences J Mol Evol 17368r 376 1981 J Felsenstein and G A Churchill A hidden Markov model approach to variation among sites in rate of evolution Mol Biol Evol 1329371047 1996 N Friedman M Ninio 1 Peer and T Pupko A structural EM algorithm for phylogenetic inference J Comp Biol7 923317353 2002 N Goldman J L Thorne and D T Jones Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses J Mol Biol7 26319672087 1996 N Goldman and Z Yan A codonbased model of nucleotide substitution for proteincoding DNA sequences Mol Biol Evol 1127257735 1994 12 M Hasegawa H Kishino7 and T Yano Dating the humanape splitting by a molecular clock of mitochondrial DNA J Mol Evol 2216071747 1985 D Heckerman A tutorial on learning with Bayesian networks In M 1 Jordan editor Learning in Graphical Models MIT Press Cambridge7 MA 1999 J Hein J L Jensen7 and C N S Pedersen Recursions for statistical multiple alignment Proc Natl Acad Sci USA 100214960714967 2003 15 S T Hess J D Blake and R D Blake Wide variations in neighbordependent substitution rates J Mol Biol7 2362102271033 1994 1 Holmes Using guide trees to construct multiplesequence evolutionary HMMs Bioinformatics 19Suppl 1i1477i157 2003 1 Holmes and W J Bruno Eholutionary HMMs a Bayesian approach to mul tiple alignment Bioinformatics 1780820 2001 18 D Husmeier and G McGuire Detecting recombination in 4taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo Mol Biol Evol7 20231573377 2003 19 D Husmeier and F Wright Detection of recombination in DNA multiple align ments with hidden Markov models J Comp Biol7 824014127 2001 00 3 H O H H H w H a H C H H to H to K to 3 O H w w on C gtJgt on K Phylogenetic Hidden Markov Models 25 J L Jensen and AM K Pedersen Probabilistic models of DNA sequence evolution With context dependent rates of substitution Adv Appl Prob 3249 517 2000 V Jojic N Jojic C Meek D Geiger A Siepel D Haussler and D Heckerman Ef cient approximations for learning phylogenetic HMM models from data In Proc 12th Int l Conf on Intelligent Systems for Molecular Biology 2004 In press M 1 Jordan and Y Weiss Graphical models probabilistic inference In M Ar bib editor The Handbook of Brain Theory and Neural Networks MIT Press 2nd edition 2002 M Kellis N Patterson M Endrizzi B Birren and E S Lander Sequenc ing and comparison of yeast species to identify genes and regulatory elements Nature 42322417254 2003 W J Kent C W Sugnet T S Furey K M Roskin T H Pringle A M Zahler and D Haussler The human genome browser at UCSC Genome Res 129961006 2002 B Knudsen and J Hein RNA secondary structure prediction using stochastic contextfree grammars and evolutionary history Bioinformatics 1524467454 1999 J M Koshi and R M Goldstein Probabilistic reconstruction of ancestral pro tein sequences J Mol Evol 4223137320 1996 P Lio and N Goldman Models of molecular evolution and phylogeny Genome Res 82123371244 1998 P Lio N Goldman J L Thorne and D T Jones PASSML Combining evo lutionary inference and protein secondary structure prediction Bioinformatics 14726733 1998 B Lucena Dynamic Programming Tree Width and Computation on Graphical Models PhD thesis Brown University 2002 E H Margulies M Blanchette NlSC Comparative Sequencing Program D Haussler and E D Green Identi cation and characterization of multi species conserved sequences Genome Res 132250772518 2003 J D McAulilfe L Pachter and M 1 Jordan Multiplesequence functional annotation and the generalized hidden Markov phylogeny Bioinformatics 2004 In press G McGuire F Wright and M J Prentice A Bayesian model for detecting past recombination events in DNA multiple alignments J Comp Biol 721597170 2000 l M Meyer and R Durbin Comparative ab initio prediction of gene structures using pair HMMs Bioinformatics 18130971318 2 G J Mitchison A probabilistic treatment of phylogeny and sequence alignment J Mol Evol 49211722 1999 Mouse Genome Sequencing Consortium lnitial sequencing and comparative analysis of the mouse genome Nature 42025207562 2002 K Murphy Y Weiss and M 1 Jordan Loopy beliefpropagation for approx imate inference An empirical study In K B Laskey and H Prade editors Proceedings of the Fifteenth Conference on Uncertainty in Arti cial Intelligence UAI pages 4677476 San Mateo CA 1999 Morgan Kaufmann J Pearl Probabilistic Reasoning in Intelligent Systems Networks of Plausible Inference Morgan Kaufmann San Mateo CA 1988 26 a to a w a a a U U H U to U 00 AM K Pedersen C Wiuf7 and F B Christiansen A M Wainwright T Jaakkola and A Willsky J Yedidia W Freeman and Y Weiss Adam Siepel and David Haussler AM K Pedersen and J L Jensen A dependent rates model and MCMC based methodology for the maximum likelihood analysis of sequences with overlapping reading frames Mol Biol Evol7 1827637776 2001 codon based model designed to describe lentiviral evolution Mol Biol Evol 15106971081 1998 J S Pedersen and J Hein Gene nding with a hidden Markov model of genome structure and evolution Bioinformatics 1922197227 2003 Rat Genome Sequencing Project Consortium Genome sequence of the Brown Norway Rat yields insights into mammalian evolution 2004 Nature7 428249375217 A Siepel and D Haussler Combining phylogenetic and hidden Markov models in biosequence analysis J Comp Biol7 2004 In press A Siepel and D Haussler Computational identi cation of evolutionarily con served exons In Proc 8th Int l Conf on Research in Computational Molecular Biology RECOMB M7 pages 1777186 2004 A Siepel and D Haussler Phylogenetic estimation of contextdependent sub stitution rates by maximum likelihood Mol Biol Evol 21246874887 2004 N Stojanovic7 L Florea C Riemer7 D Gumucio7 J Slightom M Goodman W Miller and R Hardison Comparison of ve methods for nding conserved sequences in multiple alignments of gene regulatory regions Nucleic Acids Res7 2723899739107 1999 J W Thomas J W Touchman R W Blakesley7 et al Comparative analyses of multi species sequences from targeted genomic regions Nature7 424278877937 2003 J L Thorne N Goldman and D T Jones Combining protein evolution and secondary structure Mol Biol Evol7 1326667673 1996 Treebased reparameterization framework for analysis of sumproduct and related algorithms IEEE Transaci tions on Information Theory7 492112071146 2001 M J Wainwright and M 1 Jordan Graphical models7 exponential families7 and variational inference Technical Report 649 Department of Statistics University of California Berkeley 2003 S Whelan P Lio7 and N Goldman Molecular phylogenetics Stateofthe art methods for looking into the past Trends Genet 1722627272 2001 Z Yang Estimating the pattern of nucleotide substitution J Mol Ecol 3921057 111 1994 Z Yang A spacetime process model for the evolution of DNA sequences Ger netics 139299371005 1995 Bethe free energy Kikuchi approxi mations and belief propagation algorithms Technical Report TR2001 16 Mit subishi Electronic Research Laboratories7 2001

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "I signed up to be an Elite Notetaker with 2 of my sorority sisters this semester. We just posted our notes weekly and were each making over $600 per month. I LOVE StudySoup!"

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.