# Bayesian Statistics (22S:138)


This 138-page set of class notes was uploaded by Cullen Conn on Friday, October 23, 2015. The notes are for 22S:138 at the University of Iowa, taught by Mary Cowles in Fall.


## Introduction to Hierarchical Models

22S:138 Bayesian Statistics, Lecture 11, Sept 28, 2005. Kate Cowles, PhD.

### Example: pump failure data

- A hierarchical model is fit to data on failure rates of the pumps at each of 10 power plants.
- The number of failures for the $i$th pump is assumed to follow a Poisson distribution,
  $$x_i \sim \mathrm{Poisson}(\theta_i t_i), \qquad i = 1, \ldots, 10,$$
  where $\theta_i$ is the failure rate for pump $i$ and $t_i$ is the length of operation time of the pump, in 1000s of hours.
- Important point: we do not assume that all the pumps have the same failure rate. In fact, one of the questions of interest is to estimate the rates for the individual pumps.
- We do not consider the $(x_i, t_i)$ pairs exchangeable.

### Hierarchical models

- Bayesian models with more than two levels, or stages.
- They may arise for several reasons:
  - we have insufficient knowledge to specify the parameters of priors;
  - we wish to model data or parameters that cannot be considered exchangeable but that are related.

### The first stage

- Write the likelihood of the data.
- Recall that the definition of exchangeable observations is that their likelihood is invariant to permutations of the indices.
- If we exchanged the subscripts on two $(x_i, t_i)$ pairs and did not change the indices of the corresponding $\theta_i$'s, the evaluation of the likelihood would change.
- The first stage of a hierarchical model is the sampling distribution of the observed data, i.e., the likelihood.

### The second stage

- The second stage gives priors on the parameters that appeared in the first stage.
- In the pump failures example, a conjugate gamma prior distribution is adopted for the failure rates:
  $$\theta_i \sim \mathrm{Gamma}(\alpha, \beta), \qquad i = 1, \ldots, 10.$$
- This says that although the failure rates for the individual pumps are not the same, they are related: they are all drawn from a common distribution.
- We do not know enough about failure rates of pumps in nuclear power plants to specify fixed numbers for the prior parameters $\alpha$ and $\beta$. In fact, we want the data to inform us about these values.
- Consequently, we will make $\alpha$ and $\beta$ additional unknown parameters in the model.
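With $\alpha$ and $\beta$ momentarily held fixed, the gamma prior is conjugate to the Poisson likelihood, so each rate has the closed-form posterior $\theta_i \mid x_i \sim \mathrm{Gamma}(\alpha + x_i,\ \beta + t_i)$ (shape, rate). A minimal Python sketch of the resulting shrinkage, using the first pump's data; the hyperparameter values $\alpha = \beta = 1$ are hypothetical:

```python
# Conjugate Poisson-gamma update for one pump:
#   x_i | theta_i ~ Poisson(theta_i * t_i),  theta_i ~ Gamma(alpha, beta)
#   =>  theta_i | x_i ~ Gamma(alpha + x_i, beta + t_i)   (shape, rate)
alpha, beta = 1.0, 1.0      # hypothetical fixed hyperparameter values
x_i, t_i = 5, 94.3          # pump 1: 5 failures in 94.3 thousand hours

mle = x_i / t_i                           # classical estimate x_i / t_i
post_mean = (alpha + x_i) / (beta + t_i)  # posterior mean of theta_i
prior_mean = alpha / beta                 # prior mean of theta_i

# the posterior mean is pulled away from the MLE, toward the prior mean
print(mle, post_mean, prior_mean)
```

In the full hierarchical model, $\alpha$ and $\beta$ are themselves unknown, so the pumps share information through the common prior rather than through fixed hyperparameter values.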
### WinBUGS program to fit the pump model

```
model {
  for (i in 1:N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1.0)
  beta ~ dgamma(0.1, 1.0)
}
```

### Hyperparameters

- At the third stage of the hierarchical model for pump failures, the following priors are specified for the hyperparameters $\alpha$ and $\beta$:
  $$\alpha \sim \mathrm{Exponential}(1.0), \qquad \beta \sim \mathrm{Gamma}(0.1,\ 1.0).$$

### Data and initial values

```
list(t = c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5),
     x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22),
     N = 10)

list(alpha = 1.0, beta = 1.0,
     theta = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))
```

### Results

(Posterior summary table and trace plots omitted: WinBUGS reported the mean, sd, MC error, and 2.5%, 50%, and 97.5% quantiles for $\alpha$, $\beta$, and each $\theta_i$, based on iterations 1001-10000 of the chain.)

### Compare to maximum likelihood estimates for the individual pumps

| pump | hours (1000s) | failures | MLE $x_i/t_i$ | posterior mean of $\theta_i$ |
|------|------|------|------|------|
| 1 | 94.3 | 5 | 0.0530 | 0.0598 |
| 2 | 15.7 | 1 | 0.0637 | 0.1008 |
| 3 | 62.9 | 5 | 0.0795 | 0.0893 |
| 4 | 126 | 14 | 0.1111 | 0.1160 |
| 5 | 5.24 | 3 | 0.5725 | 0.6056 |
| 6 | 31.4 | 19 | 0.6051 | 0.6105 |
| 7 | 1.05 | 1 | 0.9524 | 0.9025 |
| 8 | 1.05 | 1 | 0.9524 | 0.8964 |
| 9 | 2.1 | 4 | 1.9048 | 1.5900 |
| 10 | 10.5 | 22 | 2.0952 | 1.9309 |

- Individual estimates are shrunk away from the MLE toward a common mean.
- Individual estimates "borrow strength" from the rest of the data.
- Thetas for observations with large sample size (time observed) are shrunk less than thetas for other observations.
- Thetas far from the common mean are shrunk more than those near it.

## Intro to One-Parameter Models: Learning about a Proportion

22S:138 Bayesian Statistics, Lecture 3, Sept 5, 2007. Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu.

### The estimation problem

- You do not have the time or resources to locate and interview all 28,000 students, so you cannot evaluate $p$ exactly.
- You pick a simple random sample of $n = 50$ students from the student directory and ask each of them whether he or she would be likely to quit school if tuition were raised by 10%.
- You wish to use your sample data to estimate the population proportion $p$, and to determine the amount of uncertainty in your estimate.

### Example

- You read in last Monday's newspaper that one of the Iowa regents wants to raise tuition at the three Iowa universities by 10%.
- You want to send the regents some arguments against this idea.
- To support your argument, you would like to tell the regents what proportion of current UI students are likely to quit school if tuition is raised by that much.
- Your research question: what is the unknown population parameter $p$, the proportion in the entire population of UI students who would be likely to quit school?

### The binomial distribution

- For each of the 50 students in your sample, define a Bernoulli random variable indicating whether they say they would quit school (yes or no).
- A Bernoulli (or binary) random variable is a random variable that can assume one of only two values; one value is arbitrarily called a "success" and the other a "failure." We'll call a "yes" answer a success.
- The unknown population proportion $p$ is also the probability that a randomly selected student from this population would answer yes.

### Model assumptions

- Because we know nothing about the people in your sample except that they were in the student directory, it is reasonable to assume that they all have the same probability of saying yes to your question, namely $p$.
  - If we knew more about the students, this assumption would not be reasonable.
- We also will assume, because you drew a simple random sample, that the responses from the individual students are independent.
  - This would not be reasonable if you had chosen 25 pairs of roommates, sets of siblings, etc.

### The likelihood function

- But we don't know $p$!
- Instead, after you interview the 50 students, we know that $y = 7$, and we want to estimate $p$.
- In this case, we may change perspective and regard the sampling distribution as a function of the unknown parameter $p$. When regarded in this way, the sampling distribution is called the likelihood function:
  $$L(p) = p(y = 7 \mid p) \propto p^{7} (1-p)^{43}, \qquad 0 < p < 1.$$
- We could compute this likelihood for different values of $p$. Intuitively, values of $p$ that give larger likelihood evaluations are more consistent with the observed data.
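Those likelihood evaluations are easy to carry out directly; a short sketch for the survey example ($y = 7$, $n = 50$), using only the standard library:

```python
from math import comb

n, y = 50, 7  # survey: 7 "yes" answers out of 50 students

def likelihood(p):
    """Binomial likelihood L(p) = C(n, y) p^y (1 - p)^(n - y)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

# evaluate L(p) on a small grid of candidate values
grid = [0.05, 0.10, 0.14, 0.20, 0.30]
values = {p: likelihood(p) for p in grid}

# the grid point p = 0.14 = y/n gives the largest likelihood of these candidates
best = max(values, key=values.get)
```

Values of $p$ far from the observed proportion, such as 0.30, give likelihood evaluations that are orders of magnitude smaller.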
- Define a random variable $Y$ as the count of the number of successes in your sample.
- $Y$ meets the definition of a binomial random variable: it is the count of the number of successes in $n$ independent Bernoulli trials, all with the same success probability:
  $$Y \sim \mathrm{Binomial}(n, p).$$
- What are the possible values of $Y$?
- If we knew $p$, we could use the binomial probability mass function to compute the probability of obtaining any one of the possible values of $Y$ in our sample:
  $$p(y \mid p) = \binom{n}{y} p^{y} (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n.$$

(Figure: the normalized likelihood function for a binomial sample with 7 successes in 50 trials.)

### Frequentist approach to estimating p: maximum likelihood estimation

- Frequentist approach to estimating $p$: find the value of $p$ for which the likelihood function attains its maximum.
  - This is the value that makes the observed data most likely.
  - This value of $p$ is called the "maximum likelihood estimate," or MLE.
- It is usually preferable to maximize the natural log of the likelihood function:
  - the log transformation is monotonic, so maximizing the log gives the same answer as maximizing the original function;
  - the original likelihood is usually a product, so the log is a sum, which is easier to differentiate;
  - the curvature of the log likelihood is related to the sampling variance of the MLE.
- The log likelihood is
  $$\ell(p) = \log\binom{n}{y} + y \log p + (n - y)\log(1 - p), \qquad 0 < p < 1.$$
- Take the first derivative of the log likelihood with respect to $p$, set it equal to 0, and solve for $p$:
  $$\frac{d\ell(p)}{dp} = \frac{y}{p} - \frac{n-y}{1-p} = 0 \quad\Longrightarrow\quad \hat p = \frac{y}{n}.$$
- In the example, the MLE is $\hat p = 7/50 = 0.14$.

### Sampling distribution of the MLE

- The sampling distribution of an estimator is the distribution of the values of that statistic calculated from all possible samples of size $n$ drawn from the population of interest, or, if the population is infinite or only theoretical, the distribution of values obtained in the limit under repeated sampling.
- There is an asymptotic approximation to the sampling distribution of any MLE when $n$ is large. Generically, if $\hat\theta$ is the MLE for a population parameter $\theta$, then
  $$\hat\theta \ \stackrel{\cdot}{\sim}\ N\big(\theta,\ [\hat I\,]^{-1}\big),$$
  where $\hat I$ is minus the second derivative of the log likelihood evaluated at the MLE, and $N(a, b)$ is the normal distribution with mean $a$ and variance $b$.
- In the binomial setting, the sampling distribution of $\hat p$ is the probability distribution of the values of this estimator if repeated binomial samples are taken with sample size $n$ and fixed success probability $p$. For a binomial proportion,
  $$\hat p \ \stackrel{\cdot}{\sim}\ N\!\left(p,\ \frac{p(1-p)}{n}\right).$$

### Frequentist confidence interval

- The estimated standard deviation of an estimator is called its standard error. The approximate standard error of $\hat p$ is
  $$\widehat{se}(\hat p) = \sqrt{\frac{\hat p (1 - \hat p)}{n}}.$$
- A level-$C$ confidence interval for the population proportion $p$ can be calculated using the asymptotic sampling distribution of $\hat p$:
  $$\big(\hat p - z^* \,\widehat{se},\ \ \hat p + z^* \,\widehat{se}\big),$$
  where $z^*$ is the $1 - (1-C)/2$ quantile of the standard normal distribution.

### Back to the quitting-school example

- $\hat p = 0.14$, $n = 50$, so $\widehat{se} \approx 0.049$.
- For a 90% confidence interval, $z^* = 1.645$.
- The 90% confidence interval for the population proportion $p$ is
  $$\big(0.14 - 1.645 \times 0.049,\ \ 0.14 + 1.645 \times 0.049\big) = (0.059,\ 0.221).$$
- Note: if we had obtained a different random sample of 50 students, we would have gotten not only a different $\hat p$ but also a different confidence interval.
- Interpretation of this frequentist confidence interval: if we took many, many different random samples from this population and applied this procedure for computing a confidence interval to each of the samples, then 90% of the resulting confidence intervals would include the true population proportion $p$.
- We don't know whether the particular confidence interval from the sample we actually have is one of the 90% or one of the 10%.
- The frequentist cannot say that there is 90% probability that the true $p$ is in this interval. The true $p$ is some fixed number, even though we don't know what it is; that number either is or is not in this interval.

### Bayesian inference regarding a proportion: constructing a prior

- The parameter of interest is still the unknown population proportion $p$.
- $p$ could take on any value in the interval $(0, 1)$.
- We need to assess our knowledge or belief about this unknown parameter before we observe the data from the survey.
- Because $p$ can take on any of a continuum of values, we express this knowledge or belief most appropriately by means of a probability density function, unlike our previous problems, in which there was a discrete set of models to which we assigned probabilities.

### Constructing a prior, continued

- A person who has little or no knowledge about likely values of this proportion might consider all values in $(0, 1)$ equally plausible before seeing any data.
- The uniform density on $(0, 1)$ describes this belief mathematically:
  $$p(p) = 1, \qquad 0 < p < 1.$$
- This is called a "vague" or "noninformative" prior.

### Other possible priors

- If a person has knowledge or belief regarding likely values of $p$, his or her prior will be informative.
- Examples: two different possible priors expressing the belief that the most likely values of $p$ are between 0.1 and 0.25.
- A histogram prior.
- A discrete prior, such as one placing probability on the values
  $$p \in \{0.10,\ 0.125,\ 0.15,\ 0.175,\ 0.20,\ 0.225,\ 0.25\}.$$
- An infinite number of other possibilities.

### Updating prior beliefs

- Bayes' theorem for probability density functions:
  $$p(p \mid \text{data}) \propto p(p)\, p(\text{data} \mid p).$$
- Recall the quitting-school example and the binomial likelihood:
  $$L(p) \propto p^{y} (1-p)^{n-y}, \qquad 0 < p < 1.$$
- Combining the prior density and the likelihood gives the posterior density:
  $$p(p \mid \text{data}) \propto p(p)\, p^{y} (1-p)^{n-y}, \qquad 0 < p < 1.$$
- If the uniform prior $p(p) = 1$ had been chosen, and 7 people out of the 50 surveyed said they would quit:
  $$p(p \mid \text{data}) \propto p^{7} (1-p)^{43}, \qquad 0 < p < 1.$$
- With the noninformative (uniform) prior, the posterior density is proportional to the likelihood function!
- But the Bayesian and frequentist interpretations are different:
  - The Bayesian says that the population parameter $p$ can be treated as if it were a random variable, and the posterior distribution is a probability distribution representing beliefs about its value.
  - The frequentist says that the same curve represents the probability of the sample result (7 successes in 50 trials) for different fixed values of the unknown population parameter $p$.
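The posterior $p^{7}(1-p)^{43}$ is the kernel of a $\mathrm{Beta}(8, 44)$ density, so its summaries are available in closed form; the Monte Carlo check below uses only the standard library:

```python
import random

random.seed(1)

# uniform prior + binomial likelihood (y = 7, n = 50)
# => posterior p | y ~ Beta(y + 1, n - y + 1) = Beta(8, 44)
a, b = 8, 44

post_mean = a / (a + b)  # = 8/52, about 0.154: exact posterior mean

# Monte Carlo check of the posterior mean using stdlib Beta draws
draws = [random.betavariate(a, b) for _ in range(100_000)]
mc_mean = sum(draws) / len(draws)
```

The posterior mean $8/52 \approx 0.154$ differs slightly from the MLE 0.14: the uniform prior acts like two extra observations, one success and one failure.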
### The Poisson distribution: one more one-parameter distribution

- The Poisson distribution may be appropriate when the data are counts of rare events:
  - events occurring at random at a constant rate per unit time, distance, volume, or whatever;
  - with the assumption that the number of events that occur in any interval is independent of the number of events occurring in a disjoint interval.
- Examples:
  - the number of cases of a rare form of cancer occurring in Johnson County in each calendar year;
  - the number of flaws occurring in each 100-foot length of yarn produced by a spinning machine;
  - the number of particles of pollen per cubic foot of air in this room.
- Since the values of a random variable following a Poisson distribution are counts, what are the possible values?
- The probability mass function for a Poisson random variable is
  $$p(y \mid \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad y = 0, 1, \ldots$$
- The count of the number of events occurring in $m$ time units also follows a Poisson distribution, but with parameter $m\lambda$.
- The conjugate prior distribution for the Poisson rate parameter is the gamma family.

## Markov Chain Monte Carlo

22S:138 Bayesian Statistics, Lecture 10, Oct 1, 2007. Kate Cowles, 374 SH, 335-0727.

### Markov chain Monte Carlo methods

- Goals:
  - to make inference about model parameters;
  - to make predictions.
- This requires integration over a possibly high-dimensional integrand, and we may not know the integrating constant.

### Markov chains

- A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$
- At each time $t \ge 0$, the next state $X_{t+1}$ is sampled from a distribution $P(X_{t+1} \mid X_t)$ that depends only on the state at time $t$, called the transition kernel.
- Under certain regularity conditions, the iterates from a Markov chain will gradually converge to draws from a unique stationary (or invariant) distribution:
  - the chain will "forget" its initial state;
  - as $t$ increases, the sampled points $X_t$ will look increasingly like correlated samples from the stationary distribution.

### Monte Carlo integration and MCMC

- Monte Carlo integration: draw independent samples from the required distribution, then use sample averages to approximate expectations.
- Markov chain Monte Carlo (MCMC) draws samples by running a Markov chain that is constructed so that its limiting distribution is the joint distribution of interest.
- Suppose the Markov chain is run for $N$ (a large number of) iterations, we throw away the output from the first $m$ iterations, and the regularity conditions are met.
- Then, by the ergodic theorem, we can use averages of the remaining samples to estimate means:
  $$E[f(X)] \approx \frac{1}{N - m} \sum_{t = m+1}^{N} f(X_t).$$
- MCMC breaks a complicated, high-dimensional problem into a large number of simpler, low-dimensional problems.

### Gibbs sampling: one way to construct the transition kernel

- Seminal references:
  - Geman and Geman (IEEE Trans. Pattern Anal. Mach. Intell., 1984)
  - Gelfand and Smith (JASA, 1990)
  - Hastings (Biometrika, 1970)
  - Metropolis, Rosenbluth, et al. (J. Chem. Phys., 1953)
- Subject to regularity conditions, the joint distribution is uniquely determined by the full conditional distributions.
- The full conditional distribution for a model quantity is the distribution of that quantity conditional on assumed known values of all the other quantities in the model.

### Example: inference about a normal mean and variance, both unknown

- Model: $y_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$.
- Priors:
  $$\mu \sim N(\mu_0, \sigma_0^2), \qquad \sigma^2 \sim \mathrm{IG}(a_1, b_1).$$
- We want posterior means, posterior medians, and posterior credible sets for $\mu$ and $\sigma^2$.

### Full conditional distributions for the normal model

- To extract the mathematical form of the full conditional for a parameter:
  - write out the expression to which the joint posterior is proportional;
  - pull out all terms containing the parameter of interest.

### Gibbs sampler algorithm for the normal model

1. Choose initial values $\mu^{(0)}, \sigma^{2(0)}$.
2. At each iteration $t$, generate a new value for each parameter, conditional on the most recent values of all the other parameters.

### WinBUGS

- WinBUGS, for Windows:
  - written in Component Pascal, running in Oberon Microsystems' Blackbox environment;
  - able to fit a wider variety of models than BUGS can handle;
  - undergoing continuing development.
- Excellent documentation, including two volumes of examples.
- Web page: http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
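The two-step Gibbs update for the normal model above can be written out directly. A minimal sketch using the standard conjugate full conditionals; the simulated data, the hyperparameter values, and the burn-in length are all hypothetical choices:

```python
import random

random.seed(2)

# model: y_i | mu, s2 ~ N(mu, s2); priors mu ~ N(mu0, s0sq), s2 ~ IG(a, b)
# full conditionals (standard conjugate results):
#   mu | s2, y ~ N((mu0/s0sq + n*ybar/s2) / prec, 1/prec), prec = 1/s0sq + n/s2
#   s2 | mu, y ~ Inverse-Gamma(a + n/2, b + 0.5 * sum((y_i - mu)^2))
y = [random.gauss(10.0, 2.0) for _ in range(100)]   # hypothetical data
n, ybar = len(y), sum(y) / len(y)
mu0, s0sq, a, b = 0.0, 100.0, 2.0, 2.0              # hypothetical hyperparameters

mu, s2 = 0.0, 1.0    # initial values: part of the computing strategy, not the model
mus, s2s = [], []
for t in range(5000):
    # update mu given the most recent s2
    prec = 1.0 / s0sq + n / s2
    mu = random.gauss((mu0 / s0sq + n * ybar / s2) / prec, (1.0 / prec) ** 0.5)
    # update s2 given the most recent mu: draw Gamma(shape, rate), then invert
    shape = a + n / 2.0
    rate = b + 0.5 * sum((yi - mu) ** 2 for yi in y)
    s2 = 1.0 / random.gammavariate(shape, 1.0 / rate)
    if t >= 1000:    # discard burn-in iterations
        mus.append(mu)
        s2s.append(s2)

post_mu = sum(mus) / len(mus)   # posterior mean of mu, close to ybar here
post_s2 = sum(s2s) / len(s2s)   # posterior mean of s2
```

With a weak prior on $\mu$ (large $\sigma_0^2$), the posterior mean of $\mu$ is pulled only negligibly away from the sample mean.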
- OpenBUGS: the open-source version of WinBUGS:
  - interfaces easily with R;
  - Web page: http://mathstat.helsinki.fi/openbugs

### What are BUGS and WinBUGS?

- Bayesian inference Using Gibbs Sampling.
- A general-purpose program for fitting Bayesian models.
- Developed by David Spiegelhalter, Andrew Thomas, Nicky Best, and Wally Gilks at the Medical Research Council Biostatistics Unit, Institute of Public Health, in Cambridge, UK.
- BUGS, for Unix and DOS platforms:
  - written in Modula-2;
  - distributed in compiled form only.

### What do BUGS and WinBUGS do?

- They enable the user to specify a model in a simple S-plus-like language.
- They construct the transition kernel for a Markov chain with the joint posterior as its stationary distribution, and simulate a sample path of the resulting chain:
  - determine whether or not the full conditional for each unknown quantity (parameter or missing data) in the model is a standard density;
  - generate random variates from standard densities using standard algorithms.
- BUGS uses the adaptive rejection algorithm (Gilks and Wild, Applied Statistics, 1992) to generate from nonstandard full conditionals, and consequently can handle only log-concave or discrete full conditionals.
- WinBUGS uses the Metropolis algorithm to generate from nonstandard full conditionals and is not subject to this limitation.

### Convergence

- In the simple models we have encountered so far, the MCMC sampler will converge quickly even with a poor choice of initial values.
- In more complicated models, choosing initial values in low-posterior-density regions may make the sampler take a huge number of iterations to finally start drawing from a good approximation to the true posterior.
- Assessing whether the sampler has converged:
  - How many initial iterations need to be discarded so that the remaining samples are drawn from a distribution close enough to the true stationary distribution to be usable for estimation and inference?
  - Once we are drawing from the right distribution, how many samples are needed in order to provide the desired precision in estimation and inference?
### The Art and Science of MCMC Use

- Deciding how many chains to run.
- Choosing initial values:
  - Do not confuse initial values with priors! Priors are part of the model; initial values are part of the computing strategy used to fit the model. Priors must not be based on the current data.
  - The best choices of initial values are values that are in a high-posterior-density region of the parameter space. If the prior is not very strong, then maximum likelihood estimates from the current data are excellent choices of initial values, if they can be calculated.
- Choosing model parameterizations and MCMC algorithms that will lead to convergence in a reasonable amount of time.
- Using correlated samples for estimation and inference: adjusting estimates of standard errors.
- Our next lab will consider output-analysis items from this list.

## Inference for Proportions, continued

22S:138 Bayesian Statistics, Lecture 6, Sept 12, 2007. Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu.

### Prediction

- In many situations, interest focuses on predicting values of a future sample from the same population, i.e., on estimating values of potentially observable but not yet observed quantities.
- Example: we are considering interviewing another sample of 50 UI students in the hope of getting more evidence to present to the regents, and we would like to get an idea of how it is likely to turn out before we go to the trouble of doing so.
- So we are considering a new sample of size $n^*$, and we want to estimate the probability of some particular number $y^*$ of successes in this sample.
- If, based on the earlier survey, we actually knew the true value of the population proportion $p$, we'd just use the binomial probability:
  $$p(y^* \mid p) = \binom{n^*}{y^*} p^{y^*} (1-p)^{n^* - y^*}, \qquad y^* = 0, \ldots, n^*.$$
- But of course we still had uncertainty about $p$ even after observing the original sample.
- All of our current knowledge about $p$ is contained in the posterior distribution obtained using the original survey.
- The posterior predictive probability of some particular value of $y^*$ in a future sample of size $n^*$ is
  $$p(y^* \mid y) = \int_0^1 p(y^* \mid p)\, p(p \mid y)\, dp,$$
  where $y$ denotes the data from the original survey and $p(p \mid y)$ is the posterior distribution based on that survey.
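A posterior predictive integral of this form can also be approximated by simulation: draw $p$ from the posterior, then draw $y^*$ given that $p$. A sketch assuming the uniform-prior posterior $\mathrm{Beta}(8, 44)$ from the earlier example and a hypothetical future sample of size $n^* = 50$:

```python
import random

random.seed(3)

a, b = 8, 44   # posterior p | y ~ Beta(8, 44) under the uniform prior
n_star = 50    # size of the contemplated future sample

def draw_y_star():
    """One posterior predictive draw: p from the posterior, then y* | p."""
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n_star))  # Binomial(n*, p)

sims = [draw_y_star() for _ in range(20_000)]
pred_mean = sum(sims) / len(sims)   # estimates E(y*|y) = n* E(p|y) = 50 * 8/52
```

The histogram of `sims` approximates the whole predictive distribution $p(y^* \mid y)$, not just its mean; it is wider than a single $\mathrm{Binomial}(50, 8/52)$ pmf because it also carries the remaining uncertainty about $p$.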
- For example, suppose we had used the Beta(10, 40) prior, so that our posterior distribution is
  $$p(p \mid y) = \mathrm{Beta}(17, 83).$$
  Then the posterior predictive probability of $y^*$ successes in a future sample of size $n^*$ is
  $$p(y^* \mid y) = \int_0^1 p(y^* \mid p)\, \mathrm{Beta}(p \mid 17, 83)\, dp.$$
- This is particularly easy to compute if $n^* = 1$, in which case
  $$\Pr(y^* = 1 \mid y) = \int_0^1 \Pr(y^* = 1 \mid p)\, p(p \mid y)\, dp = \int_0^1 p\ p(p \mid y)\, dp = E(p \mid y) = \frac{17}{17 + 83} = 0.17.$$

### Recognizing kernels: normalizing constants revisited

- When trying to determine whether a function is the kernel of a standard density, consider the support.
- If you do recognize a function as the kernel of a standard density, then you can easily figure out what it integrates to.
- Example: a function of the form $\theta^{a-1}(1-\theta)^{b-1}$ on $0 < \theta < 1$ is the kernel of a $\mathrm{Beta}(a, b)$ density.

### Proper and improper distributions

- A density is valid only if it integrates to one over the support of the random variable.
- Any density that integrates to a positive finite number can be normalized so that it integrates to one.
- A density is improper if its integral is not finite.
- Example: a density proportional to $1/\theta$ on $0 < \theta < \infty$ is improper.

### Noninformative or reference priors

- These are useful when we want inference to be unaffected by information apart from the current data.
- In many scientific contexts, we would not bother to carry out an experiment unless we thought it was going to increase our knowledge significantly; i.e., we expect, and want, the likelihood to dominate the prior.

### The case of the binomial likelihood

- One choice of noninformative prior: $U(0, 1)$.
- A disadvantage: it is not invariant under transformations.
- Suppose we were more interested in the logit transformation of the unknown proportion $p$,
  $$\phi = \operatorname{logit}(p) = \log\frac{p}{1-p},$$
  than in $p$ itself. (When we get to logistic regression later in the semester, this is exactly the quantity we will be interested in.)
- Improper reference priors are sometimes used; if you do this, you must verify that the resulting posterior is proper.
- Note that if the posterior is improper, it doesn't exist, so valid inference cannot be based on it.
- In some multiple-parameter models, it may be possible to make valid inference about a subset of parameters even if the posterior is improper.
- There is often more than one choice of reference prior for the same likelihood.

### Transformation of variables

- Recall the transformation of variables: if $y = g(x)$ is a one-to-one transformation, so that $x = g^{-1}(y)$, and $p_x(x)$ is the density function of $x$, then the density function of $y$ is
  $$p_y(y) = p_x\big(g^{-1}(y)\big)\,\left|\frac{dx}{dy}\right|.$$
- Let's transform the uniform prior on the binomial parameter $p$ into a prior on $\operatorname{logit}(p)$:
  $$p_\phi(\phi) = p_p\big(g^{-1}(\phi)\big)\,\left|\frac{dp}{d\phi}\right| = \frac{e^{\phi}}{(1 + e^{\phi})^2}.$$
- Uh-oh: it's not vague or uniform.

### Jeffreys' prior

First, recall the Fisher information:

- used by frequentists in computing the asymptotic variance of MLEs;
- used by Bayesians in constructing one form of reference prior.

Details:

- Let $p(y \mid \theta)$ denote the probability density function of the data given the unknown parameter $\theta$.
- Fisher defined the information about a parameter provided by an experiment as
  $$I(\theta) = -E\left[\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2}\right].$$
- The expectation is taken over possible values of $y$ for fixed $\theta$. Since the information is an expectation, it depends on the distribution of $y$, not the observed value of $y$.
- Since the log-likelihood $\log L(\theta \mid y)$ differs from $\log p(y \mid \theta)$ only by a constant, all their derivatives are equal. Thus the information can equivalently be defined as
  $$I(\theta) = -E\left[\frac{\partial^2 \log L(\theta \mid y)}{\partial \theta^2}\right].$$
- If there are $n$ independent observations $y = (y_1, y_2, \ldots, y_n)$, then the probability densities multiply and the log-likelihoods add; the Fisher information becomes $I_n(\theta) = n\, I_1(\theta)$.
- Finally, it can be shown (e.g., Hogg and Craig, or Lee) that
  $$I(\theta) = E\left[\left(\frac{\partial \log L(\theta \mid y)}{\partial \theta}\right)^{2}\right].$$
- If we transform the unknown parameter $\theta$ to $\phi = g(\theta)$, then
  $$\frac{\partial \log L(\phi \mid y)}{\partial \phi} = \frac{\partial \log L(\theta \mid y)}{\partial \theta} \cdot \frac{\partial \theta}{\partial \phi}.$$
- Squaring and taking expectations over values of $y$ (note that $\partial\theta/\partial\phi$ does not depend on $y$), we get
  $$I(\phi) = I(\theta) \left(\frac{\partial \theta}{\partial \phi}\right)^{2}.$$
- Clearly,
  $$\sqrt{I(\phi)} = \sqrt{I(\theta)}\,\left|\frac{\partial \theta}{\partial \phi}\right|.$$
- So Jeffreys proposed the following reference prior:
  $$p(\theta) \propto \sqrt{I(\theta)}.$$
- What this means: suppose we place a Jeffreys prior $p_\theta(\theta)$ on $\theta$, and then suppose we place a Jeffreys prior $p_\phi(\phi)$ on $\phi = g(\theta)$. Then the probability under $p_\theta$ that $\theta \in (a, b)$ equals the probability under $p_\phi$ that $\phi \in (g(a), g(b))$.
- Advantages:
  - the invariance property: no matter what scale we choose for measuring the unknown parameter, the same prior results when the parameter is transformed to any other scale;
  - it depends on the form of the likelihood, but not on the current observed data.
- Disadvantages:
  - sometimes the information doesn't exist (e.g., in the Cauchy distribution);
  - it is more controversial in multiparameter settings.

### Candidate noninformative priors for the binomial likelihood

- $\mathrm{Uniform}(0, 1)$.
- $\mathrm{Beta}(\tfrac12, \tfrac12)$.
- One more candidate: $\mathrm{Beta}(0, 0)$:
  - improper;
  - will give a proper posterior unless either $y = 0$ or $y = n$ in the current data;
  - attractive feature: it yields the MLE as the posterior mean.

### Jeffreys' prior for the binomial likelihood

$$\log L(p \mid y) = y \log p + (n - y) \log(1 - p) + \text{constant}$$
$$\frac{\partial^2 \log L(p \mid y)}{\partial p^2} = -\frac{y}{p^2} - \frac{n - y}{(1-p)^2}$$

What is $E(y)$ if $y \sim \mathrm{Binomial}(n, p)$? It is $np$, so

$$I(p) = \frac{np}{p^2} + \frac{n - np}{(1-p)^2} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}.$$

Taking the square root and removing the constant $n$ gives

$$p(p) \propto p^{-1/2} (1-p)^{-1/2}.$$

Do we recognize this density?

## Model Comparison

22S:138, Lecture 18, Nov 7, 2007. Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu.

### Model comparison for nested vs. non-nested models

- Nested models: two regression-type models in which the predictors in the smaller model are a subset of the predictors in a larger model.
  - The larger model will fit better, but will be more difficult to fit and to interpret.
  - Key questions in model comparison:
    - Is the improvement in fit substantial enough to justify the increased difficulty in fitting and interpreting?
    - Are the priors on the additional parameters reasonable?
- Non-nested models:
  - different link functions in GLMs;
  - non-nested sets of predictors.

### Model comparison

- Often there are several plausible candidate models:
  - different candidate predictor variables in regression;
  - different link functions in generalized linear models;
  - different assumptions regarding the form of the likelihood;
  - different priors.
- Statisticians often will compare the fit of several models in order to choose the best one.
- They then assess whether that one is adequate.
- Alternative: Bayesian model-mixing does prediction using a weighted combination of all candidate models.

### Tools for Bayesian model comparison

- Bayes factors, and approximations to them.
- The Deviance Information Criterion.

### Frequentist use of deviance as a measure of model fit in linear and generalized linear models

Example: the dataset is counts of how many beetles were killed ($r_i$, $i = 1, \ldots, 8$) in 8 groups of beetles exposed to different doses of an insecticide. Each group $i$ had $n_i$ beetles in it.

- Consider a saturated model:
  - for a particular dataset, it has a parameter for every observation in the dataset, so its fit is perfect;
  - it is not useful, since it is no simpler than the entire original dataset, but it provides a benchmark against which to compare the fit of other models (the "general alternative");
  - under certain conditions, the deviance has an asymptotic chi-square distribution with degrees of freedom equal to the difference between the number of parameters in the saturated model and the number of parameters in the model being evaluated.
- The saturated model for the beetles data would have 8 parameters $p_i$, $i = 1, \ldots, 8$: the population proportion killed at each of the 8 dose levels. The frequentist point estimate of each $p_i$ would be the observed proportion killed, $r_i / n_i$.
- Now consider a more useful model that lets us quantify the dose-response:
  $$\operatorname{logit}(p_i) = \alpha + \beta x_i.$$
  It has only 2 parameters, and will not fit the data as perfectly as the saturated model.
- Notation: let $\log L(\hat\theta; y)$ denote the maximum of the log likelihood for a particular model.
- The deviance in a GLM is defined as
  $$D = 2\left[\log L(\hat\theta_{\text{saturated}};\, y) - \log L(\hat\theta_{\text{model}};\, y)\right].$$
  This is the likelihood-ratio statistic for testing the null hypothesis that the model holds, against the general alternative.

### Frequentist deviance for models for the beetles data

```
> glm(formula = respmat ~ beetles$V1, family = binomial(link = "logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
 -1.5213   -0.6270    0.8705    1.2575    1.6487

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -59.869      5.100  -11.39   <2e-16 ***
beetles$V1    33.784      2.866   11.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.866  on 7  degrees of freedom
Residual deviance:  11.395  on 6  degrees of freedom
AIC: 41.803
```
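For a binomial GLM, the deviance defined above can be computed directly from the observed and fitted counts as $D = 2\sum_i \big[\, r_i \log(r_i/\hat r_i) + (n_i - r_i)\log\!\big((n_i - r_i)/(n_i - \hat r_i)\big) \big]$. A sketch with hypothetical counts and fitted probabilities (not the actual beetles data):

```python
from math import log

# hypothetical dose-response data: group sizes, observed kills, fitted probabilities
n_obs = [50, 50, 50, 50]
r_obs = [6, 18, 32, 44]
p_fit = [0.15, 0.33, 0.67, 0.85]   # fitted probabilities from some 2-parameter model

def binomial_deviance(n, r, p):
    """Deviance of fitted probabilities p relative to the saturated model."""
    d = 0.0
    for ni, ri, pi in zip(n, r, p):
        rhat = ni * pi  # fitted count in group i
        d += ri * log(ri / rhat) + (ni - ri) * log((ni - ri) / (ni - rhat))
    return 2.0 * d

D = binomial_deviance(n_obs, r_obs, p_fit)  # > 0; smaller means closer to saturated fit
```

At the saturated fit $\hat p_i = r_i/n_i$, every log term vanishes and the deviance is exactly 0.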
```
> glm(formula = respmat ~ beetles$V1, family = binomial(link = "probit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
 -1.4994   -0.6939    0.7942    1.1473    1.3076

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -34.501      2.616  -13.19   <2e-16 ***
beetles$V1    19.478      1.469   13.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.866  on 7  degrees of freedom
Residual deviance:  10.368  on 6  degrees of freedom
AIC: 40.698

> glm(formula = respmat ~ beetles$V1, family = binomial(link = "cloglog"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
 -0.7906   -0.6252    0.0838    0.4158    1.4120

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -39.035      3.182  -12.27   <2e-16 ***
beetles$V1    21.733      1.766   12.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.8664  on 7  degrees of freedom
Residual deviance:   4.0124  on 6  degrees of freedom
AIC: 34.342
```

The complementary log-log link is
$$\operatorname{cloglog}(p) = \log\big(-\log(1 - p)\big).$$

### Deviance Information Criterion

- Spiegelhalter, D.J., Best, N.G., Carlin, B.P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). *J. Roy. Statist. Soc. B* 64, 583-640.
- Used to compare the fit and predictive ability of Bayesian models.
- Includes a penalty for model complexity.
- Also provides an estimate of the number of free parameters in the model:
  - highly correlated parameters, and parameters that are strongly influenced by their priors, count for less than 1 each;
  - this is called the "effective number of parameters."
- Built into WinBUGS.
- Can be used to compare non-nested models.
- But the response variable must have the same form in all models; e.g., you couldn't use it to compare two regression models, one with the $y$'s untransformed and one with the $y$'s log-transformed.
- DIC uses a version of the deviance from which the log likelihood of the saturated model is not subtracted off.
- Let
  $$D(y, \theta) = -2 \log p(y \mid \theta).$$
- We want two quantities, which can be approximated using MCMC sampler output:
  - $\bar D$: $D$ averaged over the posterior distribution of $\theta$;
  - $\hat D$: $D$ evaluated at the posterior mean of $\theta$.
- Then the effective number of parameters is estimated as
  $$p_D = \bar D - \hat D,$$
  and the DIC is
  $$\mathrm{DIC} = \bar D + p_D.$$
- DIC is an approximation to the expected predictive deviance and has been suggested as an indicator of model fit when the goal is to pick a model with the best out-of-sample predictive ability.
- Smaller values of DIC suggest better model fit.

## Bayes Factors

22S:138, Lecture 19, Nov 7, 2005. Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu.

### Bayes' rule applied to the example from lecture 1

You take the blood test and the result is positive. This is the data, or observation.

| MODEL | Prior | Likelihood | Product | Posterior |
|---|---|---|---|---|
| Have disease | .001 | .95 | .00095 | .019 |
| Don't have disease | .999 | .05 | .04995 | .981 |
| | | | .05090 | 1 |

Hypothesis-testing view:
$$H_0\colon p = 0.05 \qquad H_A\colon p = 0.95$$
These are simple hypotheses regarding the probability of a positive test.

### Bayes factors for model comparison and hypothesis testing

- Simplest case: the null and alternative hypotheses are both simple.
- Equivalently, we are comparing two models that differ according to point values of one parameter.

| Model | Description | Prior probability |
|---|---|---|
| $M_0$ | don't have disease | .999 |
| $M_1$ | have disease | .001 |

Equivalently:

| Hypothesis | Description | Prior probability |
|---|---|---|
| $H_0$ | $p = 0.05$ | .999 |
| $H_1$ | $p = 0.95$ | .001 |

- Prior odds in favor of Model 1 vs. Model 0:
  $$\frac{\Pr(M_1)}{\Pr(M_0)} = \frac{.001}{.999} = \frac{1}{999}.$$
- Bayes factor in favor of Model 1 vs. Model 0:
  $$BF_{10} = \frac{\Pr(\text{data} \mid M_1)}{\Pr(\text{data} \mid M_0)} = \frac{.95}{.05} = 19,$$
  where the data is the positive test.

### The Bayes factor in the simple-vs.-simple case

- $BF_{10}$ is the weight of evidence contained in the data in favor of $M_1$ vs. $M_0$.
- It is usually reported on the $\log_{10}$ scale.
- Interpretation (Kass and Raftery, JASA, 1995):

| $\log_{10} B_{10}$ | $B_{10}$ | Evidence against $H_0$ (or $M_0$) |
|---|---|---|
| 0 to 1/2 | 1 to 3.2 | Not worth more than a bare mention |
| 1/2 to 1 | 3.2 to 10 | Substantial |
| 1 to 2 | 10 to 100 | Strong |
| $> 2$ | $> 100$ | Decisive |

### Posterior probabilities and posterior odds

- Posterior odds in the example:
  $$\frac{\Pr(M_1 \mid \text{data})}{\Pr(M_0 \mid \text{data})} = \frac{.019}{.981} = .0194.$$
- Relationship among the Bayes factor, posterior odds, and prior odds in the simple-vs.-simple case:
  $$\frac{\Pr(M_1 \mid \text{data})}{\Pr(M_0 \mid \text{data})} = BF_{10} \times \frac{\Pr(M_1)}{\Pr(M_0)};$$
  i.e., the Bayes factor is the ratio of posterior odds to prior odds.
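The arithmetic of the diagnostic-test example can be checked directly; the prior probabilities and likelihoods below are the ones from the table above:

```python
# blood-test example: M1 = "have disease", M0 = "don't have disease"
prior_m1, prior_m0 = 0.001, 0.999   # prior model probabilities
like_m1, like_m0 = 0.95, 0.05       # Pr(positive test | model)

bf_10 = like_m1 / like_m0                 # Bayes factor = 0.95/0.05 = 19
prior_odds = prior_m1 / prior_m0          # 1/999
post_odds = bf_10 * prior_odds            # posterior odds = BF x prior odds
post_m1 = post_odds / (1.0 + post_odds)   # posterior Pr(M1 | +), about .019
```

Even with strong evidence ($BF_{10} = 19$), the tiny prior probability of disease keeps the posterior probability below 2%.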
Denominator is
- the marginal likelihood of the data
- depends on:
  - data
  - model (form of likelihood and prior)

Posterior probabilities and posterior odds
- posterior odds in example: Pr(M1 | data)/Pr(M0 | data) = .019/.981
- relationship among BF, posterior odds, and prior odds in the simple/simple case:
  Pr(M1 | data)/Pr(M0 | data) = BF10 × Pr(M1)/Pr(M0)
- BF is the ratio of posterior odds to prior odds

More general case: to compare two competing models, M1 and M0:
- compute the marginal likelihood of the data under each model
  - let θ1 = parameters under M1; let θ0 = parameters under M0
  p(Y | M1) = ∫ p(θ1) p(Y | θ1) dθ1
  p(Y | M0) = ∫ p(θ0) p(Y | θ0) dθ0
- BF10 = p(Y | M1) / p(Y | M0)

More general case: H0: θ ∈ Θ0, H1: θ ∈ Θ1. The Bayesian hypothesis test involves calculating the posterior probabilities Pr(θ ∈ Θ0 | Y) and Pr(θ ∈ Θ1 | Y).

Example
- Child is given intelligence test, with resulting score Y.
- Y ~ N(θ, 100), where θ represents the child's own true IQ; 100 is the variance if the same child takes repeated IQ tests of the same kind.
- In the population as a whole, IQ scores are distributed as θ ~ N(100, 225).
- If the child scores y = 115, then the posterior distribution of θ is θ | y ~ N(110.4, 69.2).

Example, continued: H0: θ ≤ 100, H1: θ > 100

Prior probabilities and prior odds:
θ ~ N(100, 225) → Pr(θ ≤ 100) = .5, Pr(θ > 100) = .5; prior odds = 1

Posterior probabilities and odds:
Pr(θ ≤ 100 | y) = .106, Pr(θ > 100 | y) = .894
posterior odds = .106/.894 = .119

Bayes factor in favor of H0 vs H1: BF01 = .119
Bayes factor in favor of H1 vs H0: BF10 = 8.44

Bayesian hypothesis testing and frequentist p-values
- In one-sided testing situations like this, the frequentist p-value will sometimes have a Bayesian justification.
- Example: normal likelihood, variance known: Y ~ N(θ, σ²), noninformative prior p(θ) ∝ 1, posterior θ | y ~ N(y, σ²)

Testing a point null hypothesis
- common in frequentist practice: H0: θ = θ0, H1: θ ≠ θ0
- where θ could have any value on a continuum
- Bayesian answers may differ radically from frequentist answers
- almost never do we seriously consider that θ = θ0 exactly
- more reasonable: H0: θ ∈ (θ0 − b, θ0 + b) for some small b (region of indifference)
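The point-null machinery can be checked against the intelligence-test numbers used in these notes (π0 = 0.5, θ0 = 100, σ² = 100, g1 = N(100, 100)). A minimal sketch, assuming only those stated values:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Point-null test of H0: theta = 100 vs H1: theta != 100, with
# Y | theta ~ N(theta, 100), Pr(theta = 100) = pi0 = 0.5,
# and g1 = N(100, 100) on theta under H1.
pi0, theta0, sigma2, tau2 = 0.5, 100.0, 100.0, 100.0

y = 119.60  # observed score whose frequentist two-sided p-value is 0.05

f0 = normal_pdf(y, theta0, sigma2)          # f(y | theta0)
m1 = normal_pdf(y, theta0, sigma2 + tau2)   # marginal of y under H1: N(100, 200)

post_h0 = pi0 * f0 / (pi0 * f0 + (1 - pi0) * m1)
bf01 = f0 / m1                              # Bayes factor in favor of H0 vs H1

print(round(post_h0, 2))                    # 0.35, matching the table in the notes
```

The same computation at y = 116.45, 125.76, and 132.91 reproduces the other Pr(H0 | y) entries (.42, .21, .086), which is what makes the contrast with the frequentist p-values so striking.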
Hypotheses: H0: θ ≤ θ0, H1: θ > θ0
- posterior probability of H0: Pr(θ ≤ θ0 | y) = Φ((θ0 − y)/σ)
- classical p-value: p-val = Pr(Y ≥ y | θ = θ0) = 1 − Φ((y − θ0)/σ)
- by symmetry of the normal distribution, Pr(θ ≤ θ0 | y) = p-value against H0

Consider a Bayesian test of a point null, so as to compare with the frequentist:
H0: θ = θ0, H1: θ ≠ θ0
- cannot use a continuous prior on θ (why?)
- reasonable approach to constructing a prior:
  - put positive prior probability on θ0: Pr(θ = θ0) = π0 > 0
  - give θ ∈ {θ : θ ≠ θ0} the prior (1 − π0) g1(θ), where g1 is a proper density

Bayesian analysis
- let f(y | θ) denote the sampling density of y
- then the marginal likelihood is
  m(y) = f(y | θ0) π0 + m1(y)(1 − π0),
  where m1(y) = ∫_{θ≠θ0} f(y | θ) g1(θ) dθ

Example: child's intelligence
- sampling distribution of data: Y | θ ~ N(θ, σ² = 100)
- hypotheses to be tested: H0: θ = 100, H1: θ ≠ 100
- priors: Pr(θ = 100) = π0 = 0.5; g1(θ) = N(μ0, τ0²) = N(100, 100)
  - note: prior mean μ0 = θ0, the value from H0
  - prior variance τ0² equals the variance of the sampling distribution
- so the posterior probability of H0 is
  Pr(θ = θ0 | y) = f(y | θ0) π0 / [f(y | θ0) π0 + m1(y)(1 − π0)]
- posterior odds in favor of H0 vs H1: [π0/(1 − π0)] × [f(y | θ0)/m1(y)]
- and the Bayes factor in favor of H0 vs H1 is BF01 = f(y | θ0)/m1(y)

Example, continued
- What statistical test would a frequentist use when the sampling distribution is assumed to be normal with known variance?
- results for the frequentist test with different possible data values:

   y      frequentist z   p-value   Pr(H0 | y)
116.45        1.645         .1         .42
119.60        1.960         .05        .35
125.76        2.576         .01        .21
132.91        3.291         .001       .086

Similar table for different sample sizes:
- Table 4.2, p. 151, from Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer-Verlag
- applies when σ² is assumed known, μ0 = θ0, π0 = 0.5, τ0² = σ²

22S:138 Bayesian Statistics. What is Bayesian Statistics? Lecture 1, Aug 21, 2006, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Where does statistics fit in?
- Central to steps 2, 3, and 5
- May help with step 1
  - can help show that a question is inappropriate
  - may show that answering the question will be difficult or impossible
- Bayesian statistics is
particularly well-suited for steps 2 and 5

The Scientific Method(1) — but it's not just for science!
1. Ask a question or pose a problem.
2. Assemble and evaluate the relevant information.
   - Take stock of what is already known.
3. Based on current information, design an investigation or experiment (or perhaps no experiment) to address the question posed in step 1.
   - Consider costs and benefits of the available experiments, including the value of any information they may contain.
   - Recognize that step 6 is coming.
4. Carry out the investigation or experiment.
5. Use the evidence from step 4 to update the previously available information; draw conclusions, if only tentative ones.
6. Repeat steps 3 through 5 as necessary.
(1) as stated by Don Berry

Who started it all? Thomas Bayes. Born 1702 in London, England; died 17 April 1761 in Tunbridge Wells, Kent, England.
- ordained Nonconformist minister in England
- "Essay towards solving a problem in the doctrine of chances"
  - set out Bayes's theory of probability
  - published in the Philosophical Transactions of the Royal Society of London in 1764
  - The paper was sent to the Royal Society by Richard Price, a friend of Bayes, who wrote: "I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says that his design at first in thinking on the subject of it was to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."
- Bayes's conclusions were accepted by Laplace in a 1781 memoir, rediscovered by Condorcet (as Laplace mentions), and remained unchallenged until Boole questioned them in the Laws of Thought. Since then, Bayes's techniques have been subject to controversy.
- elected a Fellow of the Royal
Society in 1742, despite the fact that at that time he had no published works on mathematics. Indeed, none were published in his lifetime under his own name.

Simple inference using Bayes' rule. Example: Do you have a rare disease?
- Your friend is diagnosed with a rare disease that has no obvious symptoms.
- You wish to determine how likely it is that you, too, have the disease. That is, you are uncertain about your true disease status.
- Your friend's doctor has told her that:
  - The proportion of people in the general population who have the disease is .001.
  - The disease is not contagious.
- A blood test exists for this disease, but it sometimes gives incorrect results.

Some settings in which Bayesian statistics is used today
- economics and econometrics
- marketing
- social science
- education
- health policy
- medical research
  - more common in England than in US
  - but FDA has approved some new medical devices based on Bayesian analysis, and is pushing the use of Bayesian methods in device testing
- weather
- the law
- etc., etc.

Quantifying uncertainty using probabilities
The long-run frequency definition of the probability of an event: The probability of an event is the proportion of the time it would occur in a long sequence of observations, i.e. as the number of trials tends to infinity.
- example: when we say that the probability of getting a head on a toss of a fair coin is .5, we mean that we would expect to get a head half the time if we flipped the coin a huge number of times under exactly the same conditions
- requires a sequence of repeatable experiments
- no frequency interpretation possible for probabilities of many kinds of events, including the event that you have the rare disease

Probability as degree of belief
The subjective definition of probability is: A probability of an event is a number between 0 and 1 that measures a particular person's subjective opinion as to how likely that event is to occur or to have occurred.
- applies whenever the person in question has an opinion about the event
  - if we count ignorance as an opinion, always applies
- Different people may have different subjective probabilities regarding the same event.
- The same person's subjective probability may change as more information comes in.
  - where Bayes' rule comes in

Back to the example
- two possible events or models: (1) you have the disease; (2) you don't have the disease
- before taking any blood test, you think your chance of having the disease is similar to that of a randomly selected person in the population
  - so you assign the following prior probabilities to the two models:

MODEL                 PRIOR
Have disease           .001
Don't have disease     .999

Properties of probabilities
These properties apply to probability whichever definition is being used.
- Probabilities must not be negative: if A is any event, then P(A) ≥ 0.
- All possible outcomes together must have probability 1: if S is the sample space in a probability model, then P(S) = 1.

Data
- You decide to take the blood test.
  - The new information that you obtain to learn about the different models is called data.
  - The different possible data results are called observations or outcomes.
  - The data in this example is the result of the blood test.
- The two possible observations are:
  - a positive blood test (+): suggests you have the disease
  - a negative blood test (−): suggests you don't have the disease

Likelihoods
- The probabilities of the two possible test results are different depending on whether you have the disease or not.
- These probabilities are called likelihoods: the probabilities of the different data outcomes conditional on each possible model.

LIKELIHOODS
MODEL                 PRIOR   P(+ | MODEL)   P(− | MODEL)
Have disease           .001        .95            .05
Don't have disease     .999        .05            .95

Bayes' rule applied to the example: You take the blood test and the result is positive. This is the data, or observation.

MODEL                 Prior   Like   Product   Posterior
Have disease           .001    .95    .00095      .019
Don't have disease     .999    .05    .04995      .981
                                      .05090     1

- Are the entries in the Product column probabilities?
- How do we
convert them into probabilities?

Using Bayes' rule to update probabilities
- Bayes' rule is the formula for updating your probabilities about the models given the data.
- enables you to compute posterior probabilities given the observed data ("posterior" means "after")

Bayes' rule, simplest form:
P(MODEL | DATA) ∝ P(MODEL) × P(DATA | MODEL)
posterior ∝ prior × likelihood

What have you learned from the blood test?
- The probability of your having the disease has increased by a factor of 19.
- But the actual probability is still small (< .02).
- You decide to obtain more information by taking the blood test again.

Updating the probabilities again
- We will assume that, conditional on your true disease status, the results from two blood tests are independent.
- Your current probabilities are the posterior probabilities from after the first test.
- These will become your prior probabilities with respect to the second test.
- The second test result is also positive.

MODEL                 Prior   Like   Product   Posterior
Have disease           .019    .95    .01805      .269
Don't have disease     .981    .05    .04905      .731
                                      .06710     1

What if the second test had been negative? That is, the second observation was −:

MODEL                 Prior   Like   Product   Posterior
Have disease           .019
Don't have disease     .981

22S:138 Bayesian Statistics. Inference for Proportions, continued. Lecture 6, Sept 7, 2005, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

- If, based on the earlier survey, we actually knew the true value of the population proportion p, we'd just use the binomial probability
  C(n*, y*) p^{y*} (1 − p)^{n* − y*},  y* = 0, ..., n*
- But of course we still had uncertainty about p even after observing the original sample.
- All of our current knowledge about p is contained in the posterior distribution obtained using the original survey.

Prediction
- In many situations, interest focuses on predicting values of a future sample from the same population, i.e. on estimating values of potentially observable but not yet observed quantities.
- Example: we are considering interviewing
another sample of 50 UI students, in the hope of getting more evidence to present to the regents, and we would like to get an idea of how it is likely to turn out before we go to the trouble of doing so.
- So we are considering a new sample of size n*, and want to estimate the probability of some particular number y* of successes in this sample.
- The posterior predictive probability of some particular value of y* in a future sample of size n* is
  p(y* | y) = ∫ p(y* | p) p(p | y) dp,  y* = 0, ..., n*,
  where y denotes the data from the original survey and p(p | y) is the posterior distribution based on that survey.
- For example, suppose we had used the Beta(10, 40) prior, so our posterior distribution is p(p | y) = Beta(17, 83). Then the posterior predictive probability of y* successes in a future sample of size n* is
  p(y* | y) = ∫₀¹ p(y* | p) Beta(p | 17, 83) dp,  y* = 0, ..., n*
- This is particularly easy to compute if n* = 1, in which case
  Pr(y* = 1 | y) = ∫₀¹ Pr(y* = 1 | p) p(p | y) dp = ∫₀¹ p p(p | y) dp = E(p | y) = 17/(17 + 83)

Recognizing kernels; normalizing constants revisited
- When trying to determine whether a function is the kernel of a standard density, consider the support.
- If you do recognize a function as the kernel of a standard density, then you can easily figure out what it integrates to.
- example: θ(1 − θ), 0 < θ < 1

Proper and improper distributions
- a density is valid only if it integrates to one over the support of the random variable
- any density that integrates to a positive finite number can be normalized so that it integrates to one
- a density is improper if its integral is not finite
- example: θ^{−1} exp(−θ), 0 < θ < ∞

Noninformative or reference priors
- useful when we want inference to be unaffected by information apart from the current data
- in many scientific contexts, we would not bother to carry out an experiment unless we thought it was going to increase our knowledge significantly, i.e. we expect and want the likelihood to dominate the prior

The case of the binomial likelihood
- one choice of noninformative prior: U(0, 1)
- a disadvantage: it is not
invariant under transformations
- suppose we were more interested in the logit transformation of the unknown proportion p, θ = log(p/(1 − p)), than in p itself
  - when we get to logistic regression later in the semester, this is exactly the quantity we will be interested in
- improper reference priors are sometimes used; if you do this, you must verify that the resulting posterior is proper
  - note that if the posterior is improper, it doesn't exist, so valid inference cannot be based on it
  - in some multiple-parameter models, it may be possible to make valid inference about a subset of parameters even if the posterior is improper
- often more than one choice of reference prior for the same likelihood

- recall transformation of variables:
  - if y = g(x) is a one-to-one transformation, so x = g⁻¹(y), and p_x(x) is the density function of x
  - we want the density function of y: p_y(y) = p_x(g⁻¹(y)) |dx/dy|
- let's transform the uniform prior on the binomial parameter p into a prior on θ = logit(p):
  p(θ) = e^θ / (1 + e^θ)²
- uh-oh: it's not vague or uniform!

Jeffreys' prior. First recall the Fisher information:
- used by frequentists in computing asymptotic variance of MLEs
- used by Bayesians in constructing one form of reference prior
- let p(y | θ) denote the probability density function of the data given the unknown parameter θ
- Fisher defined the information about a parameter provided by an experiment as
  I(θ) = −E[∂² log p(y | θ) / ∂θ²]
- The expectation is taken over possible values of y for fixed θ. Since the information is an expectation, it depends on the distribution of y, not the observed value of y.

Jeffreys' prior
- If we transform the unknown parameter θ to φ = g(θ), then
  ∂ log L(φ | y)/∂φ = [∂ log L(θ | y)/∂θ] (∂θ/∂φ)
- Squaring and taking expectations over values of y (note that ∂θ/∂φ does not depend on y), we get
  I(φ) = I(θ) (∂θ/∂φ)²
- So Jeffreys proposed the following reference prior: p(θ) ∝ √I(θ)
- Since the log-likelihood log L(θ | y) differs from log p(y | θ) only by a constant, all their derivatives are equal. Thus the information can equivalently be defined as
  I(θ) = −E[∂² log L(θ | y) / ∂θ²]
- If there are
n independent observations y = (y1, y2, ..., yn), then the probability densities multiply and the log-likelihoods add. Thus the Fisher information becomes
  I_n(θ) = −E[∂² log L(θ | y) / ∂θ²] = n I_1(θ)
- Finally, it can be shown (e.g., Hogg and Craig, or Lee) that
  I(θ) = E[(∂ log L(θ | y)/∂θ)²]

- advantages:
  - invariance property: no matter what scale we choose for measuring the unknown parameter, the same prior results when the parameter is transformed to any other scale
  - depends on the form of the likelihood, but not on the current observed data
- disadvantages:
  - sometimes the information doesn't exist, e.g. in the Cauchy distribution
  - more controversial in the multiparameter setting

Jeffreys' prior for the binomial likelihood
log L(p | y) = y log p + (n − y) log(1 − p) + constant
∂² log L(p | y)/∂p² = −y/p² − (n − y)/(1 − p)²
What is E(y) if y ~ Binomial(n, p)? So
  I(p) = n/p + n/(1 − p) = n / (p(1 − p))
Taking the square root and removing the constant n gives
  p(p) ∝ p^{−1/2} (1 − p)^{−1/2}
Do we recognize this density?

One more candidate noninformative prior for the binomial likelihood:
- Uniform(0, 1)
- Beta(1/2, 1/2)
- Beta(0, 0)
  - improper; will give a proper posterior unless either y = 0 or y = n in the current data
  - attractive feature: yields the MLE as the posterior mean

22S:138 The Likelihood Principle. Lecture 22, Dec 7, 2007, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Another way to state the likelihood principle
- For a given sample of data, any two probability models p(y | θ) that have the same likelihood function yield the same inference for θ.
- With regard to the information contained in the data about the unknown parameters, only the actual observed data y is relevant.
  - Not other possible outcomes. (Contrast this with the frequentist p-value: the probability, assuming H0 is true, of getting a test statistic as extreme as, or more extreme than, the value that was actually obtained.)
  - Not the researchers' intentions.

The likelihood principle
- Suppose that two different experiments may inform about an unknown parameter θ.
- Suppose the outcomes of the experiments are respectively y* and z*.
- Suppose the likelihoods for θ resulting from
the two experiments are proportional, that is,
  L1(θ | y*) = c L2(θ | z*),
where c is a constant.
- Then the information about θ contained in both experiments is equivalent.

Example
- We are given a coin. We are interested in estimating θ, the probability of obtaining a head on a single flip.
- We want to test the hypotheses H0: θ = 1/2, HA: θ > 1/2.
- The experiment consists of flipping the coin 12 times independently.
- The result is 9 heads and 3 tails.

Example, continued
- There are at least two possible ways the experiment might have been conducted:
  - Design 1: do 12 flips. The random variable Y is the number of heads obtained in n = 12 flips.
  - Design 2: flip the coin until 9 heads are obtained. The random variable Y is the number of tails that are obtained before the ninth head.
- Frequentist inference for θ would be different depending on which design is used.
- Bayesian inference would be the same under both designs, because the likelihoods are proportional!

The negative binomial distribution: Y = the number of failures observed in a sequence of independent Bernoulli trials before the kth success:
  Y ~ NB(k, p),  p(y) = C(k + y − 1, y) p^k (1 − p)^y,  y = 0, 1, 2, ...
  E(Y) = k(1 − p)/p

Implications of the likelihood principle
- the stopping rule principle
- the likelihood principle and reference priors

"Stopping rules" are often used in designing frequentist statistical studies
- instead of a fixed sample size
- to make it possible to stop a study early if the results are in
- particularly common in clinical trials: reducing the size and duration of a clinical trial reduces the number of patients who are exposed to the treatment that will be found to be inferior, and speeds up the dissemination of the results to the medical community
- Frequentist statisticians must choose the stopping rule before the experiment is conducted and adhere to it exactly; deviations can produce serious errors if a frequentist analysis is used.
- There is a large frequentist literature on how to control the overall probability of Type I error while allowing for more than one analysis of the data.

Jeffreys'
priors and the likelihood principle
- recall Jeffreys' prior: a reference prior (noninformative, invariant to transformations of parameters):
  p(θ) ∝ √I(θ),
  where I(θ) is the expected Fisher information for θ
- Jeffreys' prior when the likelihood is Binomial(n, p):
  p(p) ∝ p^{−1/2}(1 − p)^{−1/2}, i.e. Beta(1/2, 1/2)
- Jeffreys' prior when the likelihood is negative binomial:
  p(p) ∝ p^{−1}(1 − p)^{−1/2}, i.e. an (improper) Beta(0, 1/2)

Stopping Rule Principle
- In a sequential experiment, the evidence provided by the experiment about the value of the unknown parameters θ should not depend on the stopping rule.
- follows directly from the likelihood principle
- So use of Jeffreys' prior in some cases can violate the likelihood principle!

22S:138 Bayesian Statistics. Introduction to Multi-Parameter Models. Lecture 9, Sept 24, 2007, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Multiparameter models
- Real problems in statistics nearly always involve more than one unknown quantity.
- However, usually only one, or a few, parameters or predictions are of substantive interest.
- Example: newt healing rates
  - We may be primarily interested in the population mean healing rate μ, but of course we don't really know the value of the population variance σ².
  - So in a realistic model, we must also treat σ² as an unknown parameter.
- "nuisance parameters"

- In cases of this kind, the aim of Bayesian analysis is to obtain the posterior marginal distribution of the parameters of interest, e.g. p(μ | y).
- The general approach is to estimate the joint posterior distribution of all unknown quantities in the model, and then integrate out the ones we aren't interested in.
- Example: in the normal means example, we will find p(μ, σ² | y); then
  p(μ | y) = ∫ p(μ, σ² | y) dσ²

Example: normal data with both μ and σ² unknown
- Need a joint prior on both unknown parameters.
- Consider first the conventional noninformative prior for this problem:
  p(μ, σ²) ∝ 1/σ²
- This arises by considering μ and σ² a priori independent and taking the product of the standard noninformative priors for each.
  - A priori independence may be a reasonable assumption here: it says that if we knew something about one of the unknown parameters, that wouldn't give us information about the distribution of the other one.
  - Recall the standard noninformative priors for μ when σ² is assumed known, and for σ² when μ is assumed known.
- This is not quite a conjugate prior: we will see that the posterior distribution does not factor like this into an inverse gamma times an independent normal.
- Note that this prior is improper, and the joint posterior is improper if there are fewer than two observations in the current data.

Joint posterior distribution with the conventional noninformative prior
- The joint posterior is
  p(μ, σ² | y) ∝ (1/σ²) × (σ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (y_i − μ)²)
               ∝ (σ²)^{−n/2−1} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − μ)²]),
  where s² is the sample variance of the y_i's: s² = (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)²
- ȳ and s² are the sufficient statistics for μ and σ².

Steps to the marginal posterior distribution of μ
- We will use these identities from conditional probability:
  p(μ | y) = ∫ p(μ, σ² | y) dσ² = ∫ p(μ | σ², y) p(σ² | y) dσ²
- It can be shown by direct integration (GCSR p. 67-68) that the marginal posterior distribution of σ² is
  p(σ² | y) ∝ (σ²)^{−(n+1)/2} exp(−(n − 1)s² / (2σ²))
- What parametric density is this?

The conditional posterior distribution of μ given σ²
- Use what we already know about the posterior mean of μ with known variance and a uniform prior on μ:
  μ | σ², y ~ N(ȳ, σ²/n)
- Again, it can be shown by direct integration (GCSR p. 68-69) that the marginal posterior distribution of μ,
  p(μ | y) = ∫ p(μ | σ², y) p(σ² | y) dσ²,
  is a Student's t distribution with mean ȳ, scale parameter s²/n, and degrees of freedom n − 1.

Return of the newts
- We can find posterior means, variances, quantiles, etc. by numerical integration or simulation.
- We will use the WinBUGS program in lab to do this.

An informative semi-conjugate joint prior on μ and σ² for the normal distribution
- An intuitive procedure for specifying a joint prior distribution p(μ, σ²), if we had prior information on both, is:
  - Assume a priori independence.
  - Place an inverse gamma prior on σ².
  - Place a normal prior on μ.
  - Then the joint prior is the product of these two priors.
- This is called a semi-conjugate prior. Why?
- However, it is not a conjugate prior!
- In fact, the marginal posterior distributions p(σ² | y) and p(μ | y) have no simple conjugate forms.

What are Markov chain Monte Carlo methods used for?
- to fit models that are too complex, high-dimensional, or otherwise weird to fit by other methods
- especially frequently used for fitting Bayesian models

BUGS and WinBUGS are general-purpose packages that use Gibbs sampling to fit Bayesian models
- constructs a Markov chain whose stationary distribution is the joint posterior of the unknowns (model parameters and missing data) of the specified model, conditional on the observed data
- exploits the fact that, under certain regularity conditions, this joint posterior distribution is the product of the "full conditional" distributions of each unknown given all the other model quantities
- generates a sample path from the Markov chain
  - at each iteration, generates a realization of each unknown

What does the WinBUGS user have to input?
- model specification in terms of the distributional relationships between observables and parameters:
  - distributions of observables as functions of parameters (likelihood)
  - prior distributions of parameters
- auxiliary files containing data and initial values for unknowns

WinBUGS output is samples
- correlated
- of quantities the user has requested WinBUGS to "monitor": parameters, missing data, functions of either of these

22S:138 Bayesian Statistics. Lecture 8, Sept 21, 2007, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Posterior predictive distribution of a future observation
- p(y_new | y) is Normal
  - mean is the posterior mean
  - variance is the sum of:
    - the data variance σ² (assumed known in this unrealistic case)
    - the variance of the posterior mean

Jeffreys' prior for a normal mean, with data variance assumed known
- p(μ) ∝ 1, −∞ < μ < ∞
- limit of N(μ0, σ0²) as σ0² goes to ∞ (σ0² is the prior variance)
- equivalently, the limit as the prior precision τ0 goes to 0

What is the posterior predictive distribution of the healing rate of a new newt?

Inference about the spread of a normal distribution
- the primary research question may concern variability of the response variable in the population
  - quality control in industry
  - response to medical treatment
- Recall the joint sampling distribution of n observations modelled as conditionally independent draws from a normal:
  p(y1, ..., yn | μ, σ²) = Π_{i=1}^n (2πσ²)^{−1/2} exp(−(y_i − μ)²/(2σ²))
                        ∝ (σ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (y_i − μ)²)
- We will assume (unrealistically) that μ is a known constant.

Inference for the variance of a normal distribution
- Suppose in the newt healing rate example that we knew the population mean μ = 25, but we did not know the population variance σ².
- Equivalently, we don't know the population precision τ = 1/σ².
- We wish to infer about the distributions of these parameters that describe the spread of the normal distribution.

Sufficient statistic
- the sufficient statistic for σ² is v = (1/n) Σ_i (y_i − μ)²
- we can write the likelihood equivalently as
  p(y | σ²) ∝ (σ²)^{−n/2} exp(−n v / (2σ²)),  0 < σ² < ∞
- What is the corresponding conjugate prior? The inverse gamma distribution. What is the posterior p(σ² | y) ∝ ?

Estimating σ² of healing rates in the population of newts
- Suppose μ was known to be 25.
- Suppose we had previously studied 2 newts, and the average squared difference between their healing rate and 25 was 64.
- What is our appropriate prior?
- for the newt data: Σ_i (y_i − 25)² = 1201
- What is the posterior p(σ² | y) ∝ ?
- Recognize this as...

Alternative parameterization of the prior
  σ² ~ IG(ν0/2, ν0 σ0²/2)
- Then we can think of the prior as providing information equivalent to ν0 prior observations with σ0² average squared deviation from the known μ.

Noninformative prior for a normal variance
- What inverse gamma prior would have information equivalent to 0 prior observations?
- How would you write this as a pdf for σ²?
- Is it proper or improper?
- What characteristics would a dataset have to have in order to produce a proper posterior distribution for σ² if this prior were used?

Priors for normal precision
- if σ² ~ IG(a, b) and τ = 1/σ², then τ ~ Gamma(a, b)
- You must be careful of the parameterizations of both the gamma and inverse gamma distributions.

22S:138 Bayesian Statistics. Directed Graphs; More on Full Conditionals; Bayesian One-Way Variance Components Model. Lecture 13, Oct 17, 2008, Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Directed graphs
- a directed graphical model represents all quantities in a statistical model as nodes in a directed graph
- arrows run into nodes from the nodes that directly influence them, i.e. from their "parents"
- developers of BUGS and WinBUGS recommend drawing directed graphs as part of the model development process
- WinBUGS includes DoodleBUGS, which lets you specify models as directed graphs instead of using the WinBUGS language
  - limited in what kinds of models can be specified in DoodleBUGS

Copying WinBUGS graphs into Microsoft Word for printing
- Left click on the plot you wish to copy. A box will appear around
the plot, indicating that it has been selected.
- Choose Edit/Copy from the WinBUGS pull-down menu.
- Click in the destination Word document where you want the graph to be placed.
- Choose Edit/Paste from the pull-down menu in Word.

Types of nodes in directed graphs
- constants
  - fixed by design of study
  - always are founder nodes, i.e., do not have parents
  - denoted as single (or sometimes double-edged) rectangles
- stochastic nodes
  - variables that are given a distribution
  - may be parents or children, or both
  - may be observed data or unobservable parameters
  - generally denoted as circles, although data often denoted as single-edged rectangles
- deterministic nodes: logical functions of other nodes

Types of directed links in a directed graph
- stochastic dependence: indicated by a solid arrow
- logical function: indicated by a dashed or hollow arrow

Example: directed graph for Pumps problem
(figure: directed graph of the Pumps model, not reproduced here)

Assumptions behind directed graphical model
- conditional independence assumption: given its parent nodes, each node in the graph is independent of all other nodes except its own children
  - given the nodes it is connected to by arrows, any node (say v) is conditionally independent of all the other nodes in the graph
  - collapse over dashed arrows in evaluating connections
- simplifies determination of full conditional distributions

Full conditionals for Pumps problem
(slide content lost in transcription)

Analytical procedure for extracting full conditionals
- Recall that the full conditional distribution of an unknown model parameter is the distribution of that parameter given (supposedly known) values of all other quantities in the model.
- Write the mathematical form of the unnormalized joint posterior.
- Pull out every term in the joint posterior that contains the parameter of interest.
- The product of all these terms is proportional to the needed full conditional distribution.
- If possible, identify the parametric family of which the full conditional is a member.
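The full conditionals produced by this procedure are exactly what a Gibbs sampler draws from in turn. A minimal sketch, outside WinBUGS and independent of the Pumps model: a toy bivariate normal target with correlation rho, whose two full conditionals are known normals. All function names here are mine, for illustration only.

```python
import random

def gibbs_bivariate_normal(rho, n_iter=20000, seed=1):
    """Toy Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2).
    Returns the (correlated) sample paths of x and y.
    """
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5
    x = y = 0.0                      # initial values, like a WinBUGS "inits" list
    xs, ys = [], []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)   # draw x from its full conditional
        y = rng.gauss(rho * x, sd)   # draw y from its full conditional
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = gibbs_bivariate_normal(rho=0.8)

# Monte Carlo estimates from the sample path should approximate the
# stationary distribution: means near 0, covariance near rho.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov_xy = sum(a * b for a, b in zip(xs, ys)) / len(xs) - mean_x * mean_y
```

As the notes say, the output draws are correlated, so these Monte Carlo averages converge more slowly than they would for independent samples.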
One-Way Variance-Components Model, Bayesian flavor
- model:

y_ij | alpha_i, sigma_e^2 ~ N(mu + alpha_i, sigma_e^2),  alpha_i | sigma_a^2 ~ N(0, sigma_a^2),  i = 1, ..., K,  j = 1, ..., n_i

- priors:

mu ~ N(mu_0, sigma_0^2),  sigma_e^2 ~ IG(a_1, b_1),  sigma_a^2 ~ IG(a_2, b_2)

Note: These priors give a very straightforward Gibbs sampling algorithm, but a different prior on sigma_a^2 may be preferable. See the new (Sept 2004) WinBUGS 1.4.2 Dyes example.

Semigraphical procedure
- Draw a directed graph of your complete model.
- Identify the parents and children of the parameter of interest.
- Write the product of the conditional distributions of
  - the parameter given its parents
  - the children given their parents

- We want posterior means, posterior medians, and posterior credible sets for alpha_i, mu, sigma_e^2, sigma_a^2.

Full conditional distributions for the variance-components model (writing N = sum_i n_i; the originals are garbled, so these are the standard conjugate forms implied by the model above):

alpha_i | y, mu, sigma_e^2, sigma_a^2 ~ N( sigma_a^2 sum_j (y_ij - mu) / (n_i sigma_a^2 + sigma_e^2),  sigma_a^2 sigma_e^2 / (n_i sigma_a^2 + sigma_e^2) )

mu | y, alpha, sigma_e^2, sigma_a^2 ~ N( (sigma_0^2 sum_ij (y_ij - alpha_i) + sigma_e^2 mu_0) / (N sigma_0^2 + sigma_e^2),  sigma_0^2 sigma_e^2 / (N sigma_0^2 + sigma_e^2) )

sigma_e^2 | y, alpha, mu, sigma_a^2 ~ IG( a_1 + N/2,  b_1 + (1/2) sum_ij (y_ij - mu - alpha_i)^2 )

sigma_a^2 | y, alpha, mu, sigma_e^2 ~ IG( a_2 + K/2,  b_2 + (1/2) sum_i alpha_i^2 )

Gibbs sampler algorithm for the variance-components model
1. Choose initial values alpha_1^(0), ..., alpha_K^(0), mu^(0), sigma_e^2(0), sigma_a^2(0).
2. At each iteration t, generate a new value for each parameter, conditional on the most recent values of all the others: draw each alpha_i^(t) from its full conditional given mu^(t-1), sigma_e^2(t-1), sigma_a^2(t-1); then mu^(t) given the alpha^(t) and sigma_e^2(t-1); then sigma_e^2(t) given the alpha^(t) and mu^(t); then sigma_a^2(t) given the alpha^(t).

Introduction to Hierarchical Models
22S:138 Bayesian Statistics
Lecture 12, Oct 9, 2006
Kate Cowles, PhD

Example: Pump failure data
- A hierarchical model is fit to data on failure rates of the pump at each of 10 power plants. The number of failures for the i-th pump is assumed to follow a Poisson distribution, x_i ~ Poisson(theta_i t_i), i = 1, ..., 10, where theta_i is the failure rate for pump i and t_i is the length of operation time of the pump (in 1000s of hours).
- Important point: we do not assume that all the pumps have the same failure rate. In fact, one of the questions of
interest is to estimate the rates for the individual pumps.

WinBUGS program to fit the Pump model (alpha and beta are additional unknown parameters in the model):

model {
  for (i in 1:N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1.0)
  beta ~ dgamma(0.1, 1.0)
}

Hyperparameters
- At the third stage of the hierarchical model for pump failures, the following priors are specified for the hyperparameters alpha and beta: alpha ~ Exponential(1.0), beta ~ Gamma(0.1, 1.0).

Data and initial values:

list(t = c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5),
     x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22), N = 10)

list(alpha = 1.0, beta = 1.0,
     theta = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))

Results: (WinBUGS posterior summary table: node means, MC errors, and quantiles for alpha, beta, and theta[1] through theta[10], monitored from iteration 1001 on; the table is too garbled to reproduce, but the posterior means of the theta[i] appear in the comparison table below)
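The full conditionals that Gibbs sampling uses for this model follow from the "pull out every term" procedure in the Lecture 13 notes above. A sketch of the derivation for theta_i and beta (alpha's full conditional is not a standard family, so a more general univariate sampling method is needed for it):

```latex
% theta_i: product of its Poisson likelihood term and its Gamma prior term
p(\theta_i \mid x, \alpha, \beta)
  \propto e^{-\theta_i t_i}\,(\theta_i t_i)^{x_i}\;
          \theta_i^{\alpha - 1} e^{-\beta \theta_i}
  \propto \theta_i^{(\alpha + x_i) - 1}\, e^{-(\beta + t_i)\,\theta_i},
\qquad \text{i.e.}\;\; \theta_i \mid \cdot \sim \mathrm{Gamma}(\alpha + x_i,\; \beta + t_i).

% beta: product of its Gamma(0.1, 1.0) prior and the ten Gamma(alpha, beta) terms
p(\beta \mid \theta, \alpha)
  \propto \beta^{0.1 - 1} e^{-\beta} \prod_{i=1}^{10} \beta^{\alpha} e^{-\beta \theta_i}
  \propto \beta^{(0.1 + 10\alpha) - 1}\, e^{-(1 + \sum_i \theta_i)\,\beta},
\qquad \text{i.e.}\;\; \beta \mid \cdot \sim \mathrm{Gamma}\bigl(0.1 + 10\alpha,\; 1 + \textstyle\sum_i \theta_i\bigr).
```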
(WinBUGS trace/history and density plots for the monitored nodes are not reproduced here)

Compare to maximum likelihood estimates for individual pumps:

hours   failures   mle      theta
94.3    5          .0530    .0598
15.7    1          .0637    .1008
62.9    5          .0795    .0893
126     14         .1111    .1160
5.24    3          .5725    .6056
31.4    19         .6051    .6105
1.05    1          .9524    .9025
1.05    1          .9524    .8964
2.1     4          1.9048   1.5900
10.5    22         2.0952   1.9309

- individual estimates are shrunk away from the mle toward a common mean
- individual estimates "borrow strength" from the rest of the data
- thetas for observations with large sample size (time observed) are shrunk less than thetas for other observations
- thetas far from the common mean are shrunk more than those near it

22S:138 Bayesian Statistics
Calibration Experiments and Review of Probability
Lecture 2, Aug 24, 2005
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Calibration experiments
- calibration experiment: a scale used to assess a person's degree of belief that a particular event will occur or has occurred
- All outcomes of the calibration experiment must be equally likely in the opinion of the person whose subjective probability is being assessed.
  - Example: imagine that I promise to pay you $100 if the roll of a 6-sided die comes up the number you call.
  - If you are indifferent as to which number you call, the 6 possible outcomes are equally likely for you.

Assessing subjective probability about events
- We may sometimes need to quantify our subjective probability of an event in order to make a decision or take an action.
- Example:
  - You have been offered a job as a statistician with a marketing firm in Cincinnati.
  - In order to decide whether to accept the job and move to Cincinnati, you wish to quantify your subjective probability of the event that you would like the job and would like Cincinnati.
  - We will talk about Bayesian decision theory later in the semester.
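The shrinkage pattern in the table above can be reproduced directly from the conjugate structure of the model: given the hyperparameters, theta_i | x, alpha, beta ~ Gamma(alpha + x_i, beta + t_i), whose mean (alpha + x_i)/(beta + t_i) always lies between the pump's own MLE x_i/t_i and the prior mean alpha/beta. The alpha and beta values below are illustrative plug-ins, not numbers taken from these notes.

```python
# Pump failure data as given in the notes
t = [94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
x = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]

# Illustrative hyperparameter values (assumption: roughly the posterior
# means commonly reported for this example, used here as fixed plug-ins)
alpha, beta = 0.70, 0.93

mle = [xi / ti for xi, ti in zip(x, t)]
shrunk = [(alpha + xi) / (beta + ti) for xi, ti in zip(x, t)]
prior_mean = alpha / beta

# Each conditional posterior mean is a compromise: it sits between the
# pump's own MLE and the common prior mean, and pumps with little exposure
# time t_i are pulled (shrunk) hardest toward the common mean.
```

This is only the conditional posterior mean given fixed alpha and beta; WinBUGS averages over the posterior uncertainty in the hyperparameters as well.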
- Calibration experiments may be useful if
  - the person is not knowledgeable or comfortable with probability
  - the person is uncertain as to his/her opinion about the event
- principle of using a calibration experiment to assess subjective probability:
  - The person is offered a choice of 2 ways of winning a prize:
    - through a realization of the calibration experiment, with known probability of success
    - through the occurrence of the event of interest
  - The calibration experiment is adjusted at successive steps.

Example: Using a chips-in-a-bowl experiment to assess your subjective probability regarding event A, that the Department of Physics at Florida State University has more than 2 female faculty members.
- The experiment is having a blindfolded person draw one chip at random from a bowl containing chips of the same size and shape.
- Let Ps(A) denote your subjective probability that event A has occurred.
- Step 1: The bowl contains 1 green chip and 1 red chip. Imagine that you may choose 1 of two games:
  1. I will be blindfolded and draw one chip at random. I will pay you $100 if the chip drawn is red; I will pay you nothing if it is green.
  2. I will pay you $100 if the Physics Dept at FSU has more than 2 female faculty; I will pay you nothing if it has 2 or fewer.
  If you choose Game 1, then I conclude that 0 < Ps(A) < 0.5, so I construct Step 2 as follows.
- Step 2: The bowl contains 3 green chips and 1 red chip. You may choose Game 1 or Game 2. If you choose Game 2, then I conclude that 0.25 < Ps(A) < 0.50. Then I may go on to Step 3.
- Step 3: Now the bowl contains 5 green chips and 3 red chips. You may choose Game 1 or Game 2. If you choose Game 1, I conclude that 0.25 < Ps(A) < 0.375.
- We continue until it becomes too difficult for you to choose between the two games.
- How the chips are set up in the bowl at each step is determined by your answer at the preceding step.
- Comments:
  - The payoffs have to be imaginary, because we wish to use this procedure for assessing unverifiable probabilities.
  - Luckily, a high degree of accuracy in assessing subjective probabilities usually is not needed.
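The stepwise chips-in-a-bowl procedure is essentially a bisection search on the probability scale. A minimal sketch, where the "respondent" is a simulated stand-in who always prefers the game with the higher chance of winning; the function names are mine, not from the notes.

```python
def elicit_interval(prefers_event, steps=3):
    """Bisection elicitation of a subjective probability.

    At each step, offer a calibration game whose known win probability is the
    midpoint of the current interval (e.g., a bowl where that fraction of the
    chips are red).  prefers_event(m) returns True if the respondent would
    rather bet on the event of interest than on a game that wins with
    probability m, i.e., the respondent judges Ps(A) > m.
    """
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        m = (lo + hi) / 2
        if prefers_event(m):
            lo = m          # respondent acts as if Ps(A) > m
        else:
            hi = m          # respondent acts as if Ps(A) < m
    return lo, hi

# A simulated respondent whose actual degree of belief is 0.3:
interval = elicit_interval(lambda m: 0.3 > m, steps=3)
```

With this respondent, the three steps reproduce the sequence in the notes: below 0.5, above 0.25, below 0.375.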
Quick review of probability
- event: any outcome or set of outcomes of a random phenomenon
  - the basic element to which probability can be applied
  - usual notation is a capital letter near the beginning of the alphabet
  - example: the random phenomenon is that we are drawing a patient at random from a huge database of patients insured by an HMO
    - event A is the event that the patient we draw is under 6 years of age
    - we will denote the probability of event A as P(A)
- sample space S: the set of all possible outcomes of a random phenomenon; P(S) = 1
- complement of an event A: the event "not A", notated A^c or A-bar; A union A^c = S
- the null event: an event that can never happen, notated with the empty set symbol
- events A and B are disjoint or mutually exclusive if they cannot occur together, i.e., if their intersection is the null event
  - example: event A is that the patient we draw is under 6 years of age, and event B is the event that the patient is 6 to 11 years of age
- intersection of two events A and B: the event "both A and B", notated A int B
  - example: if event B is the event that the patient we draw weighs at least 150 pounds, then A int B is the event that the patient we draw is under 6 years of age and weighs at least 150 pounds
- union of two events: the event "either A or B or both", notated A union B
- A set of events A1, A2, A3, ... are exhaustive if A1 union A2 union A3 union ... = S
- additive rule of probability
  - if two events A and B are mutually exclusive, then P(A union B) = P(A) + P(B)
  - P(A^c) = 1 - P(A)

Conditional Probability
- P(B | A): the probability that event B will occur given that we already know that event A has occurred
- multiplicative rule of probability: P(A int B) = P(A) P(B | A) and P(A int B) = P(B) P(A | B)
- so if P(A) is not 0, P(B | A) = P(A int B) / P(A)

Patients in the database example:

           < 150 pounds   >= 150 pounds   Total
under 6    798            2               800
6 or over  4702           4498            9200
Total      5500           4500            10000

Independence
- Two events are independent if the occurrence or non-occurrence of one of them does not affect the probability that the other one occurs.

Patients in the database example (the green-eyes cell counts were lost in transcription and are reconstructed here from the margins):

            < 150 pounds   >= 150 pounds   Total
green eyes  440            360             800
not green   5060           4140            9200
Total       5500           4500            10000
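The two tables can be checked numerically: age and weight are dependent, while eye color and weight are independent. Note that the green-eyes cell counts (440 and 360) are reconstructed from the table margins, so treat them as illustrative.

```python
# Counts from the patients-in-the-database tables
total = 10000
p_under6 = 800 / total
p_light = 5500 / total             # P(weighs < 150 pounds)
p_under6_and_light = 798 / total
p_green = 800 / total              # reconstructed row total
p_green_and_light = 440 / total    # reconstructed cell (assumption)

# Conditional probability: P(B | A) = P(A and B) / P(A)
p_light_given_under6 = p_under6_and_light / p_under6   # 798/800 = .9975

# Age and weight are NOT independent: knowing the patient is under 6
# drastically changes the probability of weighing < 150 pounds
# (.9975 versus the marginal .55).
dependent = abs(p_light_given_under6 - p_light) > 0.4

# Eye color and weight ARE independent in this table:
# P(green | < 150 lb) = 440/5500 = .08 = P(green).
p_green_given_light = p_green_and_light / p_light
```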
- Events A and B are independent if P(A | B) = P(A) and P(B | A) = P(B).
- multiplicative rule of probability for independent events: P(A int B) = P(A) P(B)

Law of Total Probability
- Applies when you wish to know the marginal (unconditional) probability of some event, but you only know its probability under some conditions.
- Example:
  - I have asked my friend to mail an important letter.
  - I want to calculate P(A), the probability that the letter will reach the addressee within the next 3 days.
  - I believe that P(M), the probability that my friend will remember to mail the letter today or tomorrow, is .60.
  - I believe that if my friend mails the letter today or tomorrow, the probability that the postal service will deliver it to the addressee within the next 3 days is .95: P(A | M) = .95.
  - I believe there's only 1 chance in 10,000 that the letter will get there somehow if my friend forgets to mail it: P(A | M^c) = .0001.

Using the Law of Total Probability to find P(A)
- For any events A and M: A = (A int M) union (A int M^c).
- Events A int M and A int M^c are disjoint, so the addition rule says P(A) = P(A int M) + P(A int M^c).
- And the multiplication rule, applied to both terms on the right-hand side, says P(A) = P(A | M) P(M) + P(A | M^c) P(M^c).
- For the example: P(A) = .95(.60) + .0001(.40) = .57004

General Law of Total Probability
- Suppose there were many different conditions under which the event of interest could occur.
- If M1, M2, M3, ... are mutually exclusive and exhaustive events, then P(A) = P(A | M1) P(M1) + P(A | M2) P(M2) + P(A | M3) P(M3) + ...

Bayes' Rule, discrete case
- My prior probability that my friend would remember to mail the letter was P(M).
- The data is that the letter actually arrived within 3 days.
- Bayes' rule calculates my posterior probability that my friend mailed the letter given the data, that is, P(M | A), when we know P(A | M), P(A | M^c), and P(M).

Generalized Bayes' Rule
- corresponds to the Generalized Law of Total Probability
- assumes that the event A could happen conditional on one of a number of different other events M1, M2, M3, ...
- assumes that we know P(A | M1), P(A | M2), etc., as well as P(M1), P(M2), etc.
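The mailed-letter example runs end-to-end as simple arithmetic: the law of total probability gives P(A), and Bayes' rule then inverts the conditioning.

```python
# Mailed-letter example: M = friend mails the letter,
# A = letter arrives within 3 days.  Numbers are the ones used in the notes.
p_m = 0.60              # prior P(M)
p_a_given_m = 0.95      # P(A | M)
p_a_given_not_m = 0.0001

# Law of total probability: P(A) = P(A|M)P(M) + P(A|not M)P(not M)
p_a = p_a_given_m * p_m + p_a_given_not_m * (1 - p_m)

# Bayes' rule: P(M | A) = P(A|M)P(M) / P(A)
p_m_given_a = p_a_given_m * p_m / p_a
```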
- after the event A has occurred, we want to assess the conditional probability of one of the events Mj: P(Mj | A)
- If M1, M2, M3, ... are mutually exclusive and exhaustive events, then

P(Mj | A) = P(A | Mj) P(Mj) / [ P(A | M1) P(M1) + P(A | M2) P(M2) + P(A | M3) P(M3) + ... ]

Deriving Bayes' rule
- By the definition of conditional probability, P(M | A) = P(M int A) / P(A).
- Using the multiplication rule to expand the numerator and the law of total probability to expand the denominator gives Bayes' rule:

P(M | A) = P(A | M) P(M) / [ P(A | M) P(M) + P(A | M^c) P(M^c) ]

- For the example, this is P(M | A) = (.95)(.60) / .57004 = .57 / .57004 = .9999

22S:138 Bayesian Statistics
Intro to One-Parameter Models: Learning about a Proportion
Lecture 3, Aug 29, 2005
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Example
- You read in last Monday's newspaper that one of the Iowa regents wants to raise tuition at the 3 Iowa universities by 10%.
- You want to send the regents some arguments against this idea.
- To support your argument, you would like to tell the regents what proportion of current UI students are likely to quit school if tuition is raised by that much.
- Your research question is: what is the unknown population parameter p, the proportion in the entire population of UI students who would be likely to quit school?
- You do not have the time or resources to locate and interview all 28,000 students, so you cannot evaluate p exactly.
- You pick a simple random sample of n = 50 students from the student directory and ask each of them whether he/she would be likely to quit school if tuition were raised by 10%.
- You wish to use your sample data to estimate the population proportion p, and to determine the amount of uncertainty in your estimate.

The binomial distribution
- For each of the 50 students in your sample, define a Bernoulli random variable indicating whether they say they would quit school: yes or no.
- Bernoulli or binary random variable: a random variable that can assume one of only two values
  - one value is arbitrarily called a "success," the other a "failure"
  - we'll call a "yes"
answer a "success."
- The unknown population proportion p is also the probability that a randomly-selected student from this population would answer "yes."
- Because we know nothing about the people in your sample except that they were in the student directory, it is reasonable to assume that they all have the same probability of saying "yes" to your question, namely p.
  - If we knew more about the students, this assumption would not be reasonable.
- We also will assume, because you drew a simple random sample, that the responses from the individual students are independent.
  - This would not be reasonable if you had chosen 25 pairs of roommates, sets of siblings, etc.
- Define a random variable Y as the count of the number of successes in your sample.
- Y meets the definition of a binomial random variable: it is the count of the number of successes in n independent Bernoulli trials, all with the same success probability: Y ~ Binomial(n, p).
- What are the possible values of Y?
- If we knew p, we could use the binomial probability mass function to compute the probability of obtaining any one of the possible values of Y in our sample:

p(y | p) = (n choose y) p^y (1 - p)^(n - y),  y = 0, 1, ..., n

The likelihood function
- But we don't know p.
- Instead, after you interview the 50 students, we know that y = 7, and we want to estimate p.
- In this case we may change perspective and regard the sampling distribution as a function of the unknown parameter p. When regarded in this way, the sampling distribution is called the likelihood function:

L(p) = (50 choose 7) p^7 (1 - p)^43,  0 < p < 1

- We could compute this likelihood for different values of p. Intuitively, values of p that give larger likelihood evaluations are more consistent with the observed data.

(figure: the normalized likelihood function for a binomial sample with 7 successes in 50 trials)

Frequentist approach to estimating p: maximum likelihood estimation
- Frequentist approach to estimating p: find the value of p for which the likelihood function attains its maximum. This is the value that makes the observed data most likely.
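The "compute this likelihood for different values of p" idea can be carried out literally: evaluate L(p) over a grid of candidate values and pick the maximizer. A sketch (the grid resolution is an arbitrary choice):

```python
from math import comb

def binom_likelihood(p, y=7, n=50):
    """Binomial likelihood L(p) = C(n, y) p^y (1 - p)^(n - y)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

# Evaluate L(p) on a grid of candidate values of p
grid = [i / 100 for i in range(1, 100)]     # 0.01, 0.02, ..., 0.99
p_best = max(grid, key=binom_likelihood)
```

The grid maximizer agrees with the closed-form MLE y/n = 7/50 = 0.14.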
This value of p is called the maximum likelihood estimate, or MLE.
- Usually preferable to maximize the natural log of the likelihood function:
  - the log transformation is monotonic, so maximizing the log gives the same answer as maximizing the original function
  - the original likelihood is usually a product, so its log is a sum, which is easier to differentiate
  - the curvature of the log likelihood is related to the sampling variance of the MLE

l(p) = log L(p) = y log p + (n - y) log(1 - p) + constant,  0 < p < 1

- Take the first derivative of the log likelihood with respect to p, set it equal to 0, and solve for p:

dl(p)/dp = y/p - (n - y)/(1 - p) = 0,  so  p-hat = y/n

- In the example, the MLE is p-hat = 7/50 = 0.14.

Sampling distribution of the MLE
- The sampling distribution of an estimator is the distribution of the values of that statistic calculated from all possible samples of size n drawn from the population of interest
  - or, if the population is infinite or only theoretical, the distribution of values obtained in the limit under repeated sampling
- binomial setting: the sampling distribution of p-hat is the probability distribution of the values of this estimator if repeated binomial samples are taken with sample size n and fixed success probability p
- asymptotic approximation to the sampling distribution of any MLE when n is large: generically, if theta-hat is the MLE for a population parameter theta, then theta-hat is approximately N(theta, [-l''(theta-hat)]^(-1)), where
  - l''(theta-hat) is the second derivative of the log likelihood evaluated at the MLE
  - N(a, b) is the normal distribution with mean a and variance b
- for the binomial proportion: p-hat is approximately N(p, p-hat(1 - p-hat)/n)

Frequentist confidence interval
- The estimated standard deviation of an estimator is called its standard error.
- The approximate standard error of p-hat is se(p-hat) = sqrt( p-hat (1 - p-hat) / n ).
- A level-C confidence interval for the population proportion p can be calculated using the asymptotic sampling distribution of p-hat:

( p-hat - z* se(p-hat),  p-hat + z* se(p-hat) )

where z* is the 1 - (1 - C)/2 quantile of the standard normal distribution.

- Interpretation of this frequentist confidence interval: if
  - we took many, many different random samples from this population, and
  - applied this procedure for computing a confidence interval to each of the samples,
  then ~90% of the resulting confidence intervals would include the true population proportion p.
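The interval quoted for the survey can be reproduced with the asymptotic formula above:

```python
from math import sqrt

# Quitting-school survey: y = 7 "yes" answers out of n = 50
y, n = 7, 50
p_hat = y / n                          # MLE: 0.14
se = sqrt(p_hat * (1 - p_hat) / n)     # approximate standard error

z = 1.645                              # 0.95 quantile of N(0,1), for 90% confidence
ci = (p_hat - z * se, p_hat + z * se)  # 90% confidence interval
```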
- We don't know whether the particular confidence interval from the sample we actually have is one of the 90% or one of the 10%.
- The frequentist cannot say that there is 90% probability that the true p is in this interval. The true p is some fixed number; even though we don't know what it is, that number either is or is not in this interval.

Back to quitting school example
- p-hat = 7/50 = 0.14
- se(p-hat) = sqrt( (0.14)(0.86)/50 ) = 0.049
- for a 90% confidence interval, z = 1.645
- 90% confidence interval for the population proportion p: ( 0.14 - 1.645 x 0.049, 0.14 + 1.645 x 0.049 ) = (0.059, 0.221)
- Note: if we had obtained a different random sample of 50 students, we would have gotten not only a different p-hat but also a different confidence interval.

Bayesian inference regarding a proportion: constructing a prior
- the parameter of interest is still the unknown population proportion p
- p could take on any value in the interval (0, 1)
- We need to assess our knowledge or belief about this unknown parameter before we observe the data from the survey.
- Because p can take on any of a continuum of values, we express this knowledge or belief most appropriately by means of a probability density function
  - unlike our previous problems, where there was a discrete set of models to which we assigned probabilities

Constructing a prior, continued
- A person who has little or no knowledge about likely values of this proportion might consider all values in (0, 1) equally plausible before seeing any data.
- A uniform density on (0, 1) describes this belief mathematically: p ~ U(0, 1), p(p) = 1, 0 < p < 1.

Other possible priors
- If a person has knowledge or belief regarding likely values of p, his/her prior will be informative.
- examples: two different possible priors expressing the belief that the most likely values of p are between .1 and .25
- Uniform(0, 1): called a "vague" or "noninformative"
prior
- A histogram prior
- A discrete prior, such as p in {.1, .125, .15, .175, .20, .225, .25}
- An infinite number of other possibilities

Updating prior beliefs
- Bayes' theorem for probability density functions: p(p | data) is proportional to p(p) L(p)
- Recall the quitting school example and binomial likelihood: L(p, y) proportional to p^y (1 - p)^(n - y), 0 < p < 1
- Combining the prior density and the likelihood to get the posterior density: p(p | data) proportional to p(p) p^y (1 - p)^(n - y), 0 < p < 1
- If the uniform prior p(p) = 1 had been chosen, and there were 7 people who said they would quit out of 50 surveyed: p(p | data) proportional to p^7 (1 - p)^43, 0 < p < 1
- With the noninformative uniform prior, the posterior density is proportional to the likelihood function.
  - But the Bayesian and frequentist interpretations are different.
  - The Bayesian says that the population parameter p can be treated as if it were a random variable, and the posterior distribution is a probability distribution representing beliefs about its value.
  - The frequentist says that the same curve represents the probability of the sample result (7 successes in 50 trials) for different fixed values of the unknown population parameter p.

22S:138 Bayesian Statistics
Inference for Proportions, continued
Lecture 6, Sept 10, 2003
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Prediction
- In many situations, interest focuses on predicting values of a future sample from the same population, i.e., on estimating values of potentially observable, but not yet observed, quantities.
- If, based on the earlier survey, we actually knew the true value of the population proportion p, we'd just use the binomial probability: p(y-tilde | p) = (n-tilde choose y-tilde) p^y-tilde (1 - p)^(n-tilde - y-tilde), y-tilde = 0, ..., n-tilde.
- But of course we still had uncertainty about p even after observing the original sample.
- All of our current knowledge about p is contained in the posterior distribution obtained using the original survey.
- Example: we are considering interviewing another sample of 50 UI students in the hope of getting more evidence to present to the regents, and we would like to get an idea of how it is likely to turn out before we go to the trouble of doing so.
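Because the Beta prior is conjugate to the binomial likelihood, the updating above is closed-form: a Beta(a, b) prior combined with y successes in n trials gives a Beta(a + y, b + n - y) posterior (the uniform prior is the special case Beta(1, 1)). A sketch with the survey data:

```python
# Conjugate Beta-binomial updating for the quitting-school survey
def beta_binomial_update(a, b, y, n):
    """Beta(a, b) prior + Binomial(n, p) data with y successes
    -> Beta(a + y, b + n - y) posterior."""
    return a + y, b + n - y

y, n = 7, 50

# Uniform prior = Beta(1, 1)  ->  posterior Beta(8, 44)
a_unif, b_unif = beta_binomial_update(1, 1, y, n)
post_mean_unif = a_unif / (a_unif + b_unif)      # 8/52

# Informative Beta(10, 40) prior  ->  posterior Beta(17, 83)
a_inf, b_inf = beta_binomial_update(10, 40, y, n)
post_mean_inf = a_inf / (a_inf + b_inf)          # 17/100
```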
- So we are considering a new sample of sample size n-tilde, and want to estimate the probability of some particular number y-tilde of successes in this sample.
- The posterior predictive probability of some particular value y-tilde in a future sample of size n-tilde is

p(y-tilde | y) = integral from 0 to 1 of p(y-tilde | p) p(p | y) dp,  y-tilde = 0, ..., n-tilde

where y denotes the data from the original survey and p(p | y) is the posterior distribution based on that survey.
- For example, suppose we had used the Beta(10, 40) prior, so our posterior distribution is p(p | y) = Beta(17, 83). Then the posterior predictive probability of y-tilde successes in a future sample of size n-tilde is

p(y-tilde | y) = integral from 0 to 1 of p(y-tilde | p) Beta(p | 17, 83) dp,  y-tilde = 0, ..., n-tilde

- This is particularly easy to compute if n-tilde = 1, in which case

Pr(y-tilde = 1 | y) = integral of Pr(y-tilde = 1 | p) p(p | y) dp = integral of p p(p | y) dp = E(p | y) = 17 / (17 + 83)

Recognizing kernels: normalizing constants revisited
- When trying to determine whether a function is the kernel of a standard density, consider the support.
- If you do recognize a function as the kernel of a standard density, then you can easily figure out what it integrates to.
- example: p(theta) proportional to theta (1 - theta)^6, 0 < theta < 1 (a Beta kernel)

Proper and improper distributions
- a density is valid only if it integrates to one over the support of the random variable
- any density that integrates to a positive finite number can be normalized so that it integrates to one
- a density is improper if its integral is not finite
- example: p(theta) proportional to 1/theta, 0 < theta < infinity

Noninformative or reference priors
- useful when we want inference to be unaffected by information apart from the current data
- in many scientific contexts, we would not bother to carry out an experiment unless we thought it was going to increase our knowledge significantly, i.e., we expect (and want) the likelihood to dominate the prior

The case of the binomial likelihood
- one choice of noninformative prior: U(0, 1)
- a disadvantage: it is not invariant under transformations
- suppose we were more interested in the logit transformation of the unknown proportion p, theta = logit(p) = log( p / (1 - p) ), than in p itself.
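The non-invariance is easy to check numerically: push a U(0, 1) prior on p through the logit, and the induced distribution on theta = logit(p) is far from flat. The sketch below uses only probability arithmetic: P(a < theta < b) = P(logit-inverse(a) < p < logit-inverse(b)), and for p ~ U(0, 1) that is just the length of the p-interval.

```python
from math import exp

def inv_logit(t):
    """Inverse logit: maps theta back to p = e^t / (1 + e^t)."""
    return 1 / (1 + exp(-t))

# If p ~ Uniform(0, 1) and theta = logit(p), then
# P(a < theta < b) = inv_logit(b) - inv_logit(a).
def prob_theta_in(a, b):
    return inv_logit(b) - inv_logit(a)

# Two intervals of equal length on the logit scale carry very different mass,
# so the induced prior on theta is nothing like uniform:
mass_near_zero = prob_theta_in(0, 1)
mass_far_out = prob_theta_in(3, 4)
```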
(When we get to logistic regression later in the semester, this is exactly the quantity we will be interested in.)
- improper reference priors are sometimes used
  - if you do this, you must verify that the resulting posterior is proper
  - note that if the posterior is improper, it doesn't exist, so valid inference cannot be based on it
  - in some multiple-parameter models, it may be possible to make valid inference about a subset of parameters even if the posterior is improper
- often more than one choice of reference prior for the same likelihood

Transformation of variables
- recall transformation of variables: if y = g(x) is a one-to-one transformation, so x = g^(-1)(y), and p_x is the density function of x, then the density function of y is p_y(y) = p_x(g^(-1)(y)) |dx/dy|
- let's transform the uniform prior on the binomial parameter p into a prior on theta = logit(p): p_theta(theta) = e^theta / (1 + e^theta)^2
- uh-oh: it's not vague or uniform!

Jeffreys' prior
First, recall the Fisher information:
- used by frequentists in computing the asymptotic variance of MLEs
- used by Bayesians in constructing one form of reference prior
- let log p(y | theta) denote the log of the probability density function of the data given the unknown parameter theta
- Fisher defined the information about a parameter provided by an experiment as

I(theta) = -E[ d^2 log p(y | theta) / d theta^2 ]

- The expectation is taken over possible values of y for fixed theta. Since the information is an expectation, it depends on the distribution of y, not the observed value of y.
- Since the log-likelihood log L(theta | y) differs from log p(y | theta) only by a constant, all their derivatives are equal. Thus the information can equivalently be defined as

I(theta) = -E[ d^2 log L(theta | y) / d theta^2 ]

- If there are n independent observations y = (y1, y2, ..., yn), then the probability densities multiply and the log-likelihoods add. Thus the Fisher information becomes I_n(theta) = n I_1(theta).
- Finally, it can be shown (e.g., Hogg and Craig, or Lee) that I(theta) = E[ ( d log L(theta | y) / d theta )^2 ].

Jeffreys' prior
- If we transform the unknown parameter theta to phi = g(theta), then d log L(phi | y)/d phi = ( d log L(theta | y)/d theta ) ( d theta/d phi ).
- Squaring and taking expectations over values of y (noting that d theta/d phi does not depend on y), we get I(phi) = I(theta) ( d theta/d phi )^2.
- So Jeffreys proposed the following reference prior: p(theta) proportional to [ I(theta) ]^(1/2)
- advantages
  - invariance property: no matter what scale we choose for measuring the unknown parameter, the same prior results when the parameter is transformed to any other scale
  - depends on the form of the likelihood, but not on the current observed data
- disadvantages
  - sometimes the information doesn't exist, e.g., in the Cauchy distribution
  - more controversial in the multiparameter setting

Jeffreys' prior for the binomial likelihood

log L(p | y) = y log p + (n - y) log(1 - p) + constant
d^2 log L(p | y)/dp^2 = -y/p^2 - (n - y)/(1 - p)^2

What is E(y) if y ~ Binomial(n, p)? E(y) = np, so

I(p) = np/p^2 + (n - np)/(1 - p)^2 = n/p + n/(1 - p) = n / [ p(1 - p) ]

Taking the square root and removing the constant n gives

p(p) proportional to p^(-1/2) (1 - p)^(-1/2)

Do we recognize this density? (It is the kernel of a Beta(1/2, 1/2).)

Candidate reference priors for the binomial likelihood
- Uniform(0, 1)
- Beta(1/2, 1/2): Jeffreys' prior
- Beta(0, 0): improper
  - will give a proper posterior unless either y = 0 or y = n in the current data
  - attractive feature: yields the MLE y/n as the posterior mean

22S:138 Bayesian Statistics
What is Bayesian Statistics?
Lecture 1, Aug 21, 2007
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Where does statistics fit in?
- Central to steps 2, 3, and 5
- May help with step 1
  - can help show that a question is inappropriate
  - may show that answering the question will be difficult or impossible
- Bayesian statistics is particularly well-suited for steps 2 and 5

The Scientific Method(1) (But it's not just for science!)
1. Ask a question or pose a problem.
2. Assemble and evaluate the relevant information. Take stock of what is already known.
3. Based on current information, design an investigation or experiment (or perhaps no experiment) to address the question posed in step 1.
   - Consider costs and benefits of the available experiments, including the value of any information they may contain.
   - Recognize that step 6 is coming.
4. Carry out the investigation or experiment.
5. Use the evidence from step 4 to update the previously
logrlikelihoods add Thus the Fisher information becomes 52l09L9ly my 7 7ET 0 Finally it can be shown eg Hogg and Craig or Lee 2 my E6log 66lygtj Rely E n Rely 16 o advantages invariance property no matter what scale we choose for measuring the unnown pa rameter the same prior results when the parameter is transformed to any other scale 7 depends on the form of the likelihood but not on the current observed data 0 disadvantages sometimes information doesnlt exist eg in Cauchy distribution 7 more controversial in multiparameter set ting 17 Jeffrey s prior for binomial likelihood 109Lpy ylogp n7y10917p constant 62109Lpy i 7 7273 6p p2 What is Ey if y N Binomia mp So n 1 7 M p 1 i p Taking the square root and removing the con stant 72 gives 191 0lt 17 17196 Do we recognize this density 13 One more candidate reference prior for the binomial likelihood o Uniform01 0 Beta 7 0 Beta 07 0 improper Will give a proper posterior unless gtk either 3 0 or y n in current data attractive feature yields the rnle 2 as the posterior rnean 2281138 Bayesian Statistics What is Bayesian Statistics Lecture 1 Aug 217 2007 Kate Cowles 374 SH7 33570727 kcowlesstatuiowaedu Where does statistics t in 0 Central to steps 27 37 and 5 0 May help with step 1 7 can help show that a question is inappropriate 7 may show that answering the question will be dif7 cult or impossib e o Bayesian statististics is particularly well7suited for steps 2 and 5 The Scienti c Method1 But its not just for science 1 Ask a question or pose a problem 2 Assemble and evaluate the relevant information 0 Take stock of what is already known 3 Based on current information7 design an investigation or experiment or perhaps no experiment to address the question posed in step 1 0 Consider costs and bene ts of the available experi7 ments7 including the value of any information they may contain 0 Recognize that step 6 is coming 4 Carry out the investigation or experiment 5 Use the evidence from step 4 to update the previously 
available information draw conclusions7 if only tenta7 tive ones 6 Repeat steps 3 through 5 as necessary 1 as stated by Don Berry Who started it all Thomas Bayes Born 1702 in London7 England Died 17 April 1761 in Tunbridge Wells7 Kent7 England 0 ordained Nonconformist minister in England 0 Essay towards solving a problem in the doctrine of chances 7 set out Bayesls theory of probability 7 published in the Philosophical Transactions of the Royal Society of London in 1764 7 The paper was sent to the Royal Society by Richard Price7 a friend of Bayesl7 who wrote I now send you an essay which I have found among the papers of our deceased friend Mr Bayes7 and which7 in my opinion7 has great merit In an introduction which he has writ to this Essay7 he says7 that his design at rst in thinking on the subject of it was7 to nd out a method by which we might judge concerning the probability that an event has to happen7 in given circumstances7 upon sup position that we know nothing concerning it but that7 under the same circumstances7 it has happened a certain number of times7 and failed a certain other number of times 0 Bayes7s conclusions were accepted by Laplace in a 1781 memoir7 rediscovered by Condorcet as Laplace mentions7 and remained unchallenged until Boole ques7 tioned them in the Laws of Thought Since then Bayes7 techniques have been subject to controversy 0 elected a Fellow of the Royal Society in 1742 despite the fact that at that time he had no published works on mathematics lndeed none were published in his lifetime under his own name Simple inference using Bayes7 rule Example Do you have a rare disease 0 Your friend is diagnosed with a rare disease that has no obvious symptoms 0 You wish to determine how likely it is that you7 too7 have the disease That is7 you are uncertain about your true disease status 0 Your friends doctor has told her that 7 The proportion of people in the general population who have the disease is 001 7 The disease is not contagious o A 
blood test exists for this disease7 but it sometimes gives incorrect results 6 Some settings in which Bayesian statistics is used today 0 economics and econometrics 0 marketing 0 social science 0 education 0 health policy 0 medical research 7 more common in England than in US 7 but FDA has approved some new medical devices based on Bayesian analysis and is pushing the use of Bayesian methods in device testing 0 weather 0 the law 0 etc7 etc 8 Quantifying uncertainty using probabilities The long7run frequency de nition of the probability of an event The probability of an event is the proportion of the time it would occur in a long sequence of observations ie as the number of trials tends to in nity 0 example when we say that the probability of getting a head on a toss of a fair coin is 57 we mean that we would expect to get a head half the time if we ipped the coin a huge number of times under exactly the same conditions 0 requires a sequence of repeatable experiments 0 no frequency interpretation possible for probabilities of many kinds of events 7 including the event that you have the rare disease Probability as degree of belief The subjective de nition of probability is A probability of an event is a number between 0 and 1 that measures a particular person7s sub7 jective opinion as to how likely that event is to occur or to have occurred 0 applies whenever the person in question has an opin7 ion about the event 7 if we count ignorance as an opinion7 always applies 0 Different people may have different subjective proba7 bilities regarding the same event 0 The same person7s subjective probability may change as more information comes in 7 where Bayes7 rule comes in Back to the example 0 two possible events or models 1 you have the disease 2 you don7t have the disease 0 before taking any blood test7 you think your chance of having the disease is similar to that of a randomly selected person in the population 7 so you assign the following prior probabilities to the two 
models MODEL Have disease Don7t have disease 10 Properties of probabilities These properties apply to probability whichever de nition is being used 0 Probabilities must not be negative If A is any event7 t en PA 2 o o All possible outcomes together must have probability If S is the sample space in a probability model then PS 1 12 Data 0 You decide to take the blood test 7 the new information that you obtain to learn about the different models is called data 7 the different possible data results are called obser7 vations or outcomes 7 the data in this example is the result of the blood test 0 The two possible observations are 7 a positive blood test 7 suggests you have the disease 7 a negative blood test 7 7 suggest you don7t have the disease Likelihoods o The probabilities of the two possible test results are different depending on whether you have the disease or not 0 these probabilities are called likelihoodsi the probae bilities of the different data outcomes conditional on each possible model LIKELIHOODS MODEL PRIOR P MODEL P7 MODEL Have disease 001 95 05 Don t have disease 999 05 l 95 Bayes7 rule applied to the example You take the blood test and the result is positive This is the data or observation MODEL Prior Like Product I Posterior for Have disease 001 95 00095 I 019 Donlt have disease 999 05 04995 981 05090 1 o Are the entries in the Product column probabilities o How do we convert them into probabilities 14 Using Bayes7 rule to update probabilities o Bayes7 rule is the formula for updating your probabile ities about the models given the data 0 enables you to compute posterior probabilities given the observed data 7 posterior means after Bayes7 rule simplest form PMODEL DATA x PMODEL gtltPDATA MODEL posterior X prior X likelihood is What have you learned from the blood test 0 The probability of your having the disease has in creased by a factor of 19 0 But the actual probability is still small lt 02 0 You decide to obtain more information by taking the 
blood test again.

Updating the probabilities again
- We will assume that, conditional on your true disease status, the results from the two blood tests are independent.
- Your current probabilities are the posterior probabilities from after the first test.
- These will become your prior probabilities with respect to the second test.
- The second test result is also positive.

  MODEL               Prior   Like   Product   Posterior
  Have disease         .019    .95    .01805     .269
  Don't have disease   .981    .05    .04905     .731
                                      .0671     1

What if the second test had been negative? That is, the second observation was (-):

  MODEL               Prior   Like   Product   Posterior
  Have disease         .019
  Don't have disease   .981

Introduction to Linear Regression: Review of the Frequentist Approach to Linear Regression
22S:138 Bayesian Statistics
Lecture 14, October 16, 2006
Kate Cowles, Ph.D.

Per Capita Health Spending and Per Capita Gross Domestic Product (GDP) in 24 OECD Countries, 1989
(Schieber, Poullier, and Greenwald, Health Affairs, 1991)

  Country            Per Cap Hlth   Per Cap GDP
  1  united states       2051         18.1429
  2  canada              1483         17.2857
  3  iceland             1241         15.5714
  4  sweden              1233         13.8571
  5  switzerland         1225         15.8571
  6  norway              1149         15.5714
  7  france              1105         12.2857
  8  germany             1093         13.4286
  9  luxembourg          1050         14.8571
  10 netherlands         1041         13.0000
  11 austria              982         11.8571
  12 finland              949         12.8571
  13 australia            939         12.2857
  14 japan                915         13.4286
  15 belgium              879         11.8571
  16 italy                841         12.4286
  17 denmark              792         13.5714
  18 united kingdom       758         12.4286
  19 new zealand          733         10.8571
  20 ireland              581          7.8571
  21 spain                521          8.8571
  22 portugal             388          6.5714
  23 greece               337          6.4286
  24 turkey               148          4.4286

In regression analysis we look at the conditional distribution of the response variable at different levels of a predictor variable.

[Scatterplot: per capita health spending (PCH) vs. per capita GDP (PCGDP) for the 24 countries]

- Response variable
  - also called "dependent" or "outcome" variable
  - what we want to explain or predict
  - in simple linear regression, the response variable is continuous
- Predictor variables
  - also called "independent" variables or "covariates"
  - in simple linear regression, the predictor variable usually is also continuous
  - How we define which variable is the response and which is the predictor depends on our research question.

Scatterplots
- response variable on the Y axis
- predictor variable on the X axis
- The relationship in this scatterplot looks roughly linear.
  - It makes sense to try to summarize the relationship between these two variables with a straight line.

Linear regression
- In linear regression analysis, β0 + β1 X represents the mean value of all the Y's for a given value of X:
  E(Y | X) = β0 + β1 X
- There is an entire distribution of Y values for each value of X -- a conditional distribution.
  - Example: for any given value of per capita GDP, there is a distribution of values of per capita health spending among OECD countries.
- We say the relationship between X and Y is linear if the means of the conditional distributions of Y|X lie on a straight line.

Quick review of linear functions

  Y = β0 + β1 X
- Y is a response variable that is a linear function of the predictor variable X.
- β0 (intercept): the value of Y when X = 0
- β1 (slope): how much Y changes when X increases by 1 unit

Error terms
- In regression we represent factors other than X that affect Y with an error term ε.
- population model:
  Y_i = β0 + β1 X_i + ε_i,   where ε_i = Y_i − E(Y_i)
- or, equivalently,
  Y_i = E(Y_i) + ε_i

Determining the best-fitting line

[Scatterplot of PCH vs. PCGDP with the fitted regression line]

Ordinary least squares (OLS) method
- computes the maximum likelihood estimates of the intercept and slope
- chooses the best-fitting line by minimizing the sum of the squared differences between each data point and the fitted line

Notation
- population model: E(Y_i) = β0 + β1 X_i
- sample estimates: Ŷ_i = β̂0 + β̂1 X_i
  - Ŷ_i is the fitted value, or predicted value, of Y for case i
- residuals e_i: sample estimates of the error terms ε_i:
  e_i = Y_i − Ŷ_i
- error sum of squares: SSE = Σ e_i² = Σ (Y_i − Ŷ_i)²
- OLS chooses β̂0 and β̂1 so as to minimize SSE.
- n = number of observations in the dataset; k = number of βs in the model

SLR Parameter Estimates

  Variable    DF   Estimate      Standard Error   t Value   Pr > |t|
  Intercept    1   -387.153134    119.7257212      -3.23     0.0033
  PCGDP        1    107.2532119     9.37953524     11.435    <.0001

Calculating predicted values and residuals

[SAS output, garbled in transcription: for each of the 24 countries, the dependent variable (per capita health expenditures), the predicted value Ŷ_i = -387.15 + 107.25 (PCGDP_i), its standard error, 95% confidence limits for the mean, and the residual; e.g. United States: Ŷ ≈ 1558.9, residual ≈ 492.1. A plot of residuals by country follows.]

Estimating the common variance
- One of the assumptions of linear regression is that the variance of the conditional distributions of Y|X is the same at all values of X.
- The estimate of this common variance is
  s² = SSE / (n − 2)
- analogous to the estimate of the variance in a normal sample
- n − 2 in the denominator is the degrees of freedom: the number of observations minus the number of estimated regression coefficients

Inferences for the slope
- So far we've been describing the relationship between two continuous variables.
- Now we want to perform a hypothesis test to determine whether there is a linear relationship between the two variables.
- Question: Does the value of Y depend linearly on X?
  E(Y_i) = β0 + β1 X_i
- Answer: Yes, unless β1 = 0, in which case E(Y_i) = β0.
- Hypotheses for the test for a linear relationship between Y and X:
  H0: β1 = 0    HA: β1 ≠ 0
- Test statistic:
  t = β̂1 / σ̂_β̂1
  - standard form of a test statistic: estimate divided by its standard error
  - the standard error of β̂1 depends on the variability of the Y's and how closely clustered the X's are
  - follows a t distribution with n − 2 degrees of freedom, because we have to estimate 2 parameters (β0 and β1) to compute s²
  - p-value: the probability of obtaining a t statistic as extreme as, or more extreme than, what we got, if H0 is true

Interpreting the test for zero slope
- Failure to reject H0: β1 = 0
  - Type II error?
  - X and Y related in a nonlinear way?
  - X provides little help in predicting Y
- Rejecting H0: β1 = 0
  - X provides significant information for predicting Y
  - Although the data fit a linear model, some nonlinear model may do even better.
- Important caveat regarding inferences on β1: the best straight line may be terrible.
- Confidence interval for the slope
  - A (1 − α) confidence interval for the true slope β1 is given by
    β̂1 ± t_{1−α/2, df=n−2} σ̂_β̂1
  - If this CI includes the value 0, we cannot reject the null hypothesis at significance level α.

Inferences concerning the regression line
- Estimating the mean of the Y's for a particular value of X, say X0:
  - Example: What is the average per capita health spending for a country with per capita gross domestic product 10 ($1000s, PPP)?
  Ê(Y | X0) = β̂0 + β̂1 X0
- estimated standard error of Ê(Y | X0): σ̂_{Ŷ|X0}
- A (1 − α) confidence interval for E(Y | X0) is given by
  Ŷ ± t_{1−α/2, df=n−2} σ̂_{Ŷ|X0}

[SAS output: 95% confidence limits for the mean response, by country -- first rows: United States (1428, 1690); Canada (1350, 1584); Iceland (1192, 1374); Sweden (1027, 1171);
Switzerland (1219, 1409); ...; Turkey (-78.9, 254.6)]

Prediction interval for a new individual's Y, given that we know their value of X, say X0
- The point estimate is the group mean:
  Ŷ_new = β̂0 + β̂1 X0
- Would you expect an individual's response to be more or less variable than the group's mean response?
- two sources of variability in the prediction of an individual Y:
  - uncertainty about the group mean E(Y | X0)
  - variability of individual responses around the group mean
  - Example: The Netherlands has per capita GDP of 13, but is unlikely to have per capita health spending of exactly the mean of all possible countries with the same GDP.
- estimated standard error of Ŷ_new: σ̂_Ŷnew
- A (1 − α) prediction interval for Y_new is given by
  Ŷ_new ± t_{1−α/2, df=n−2} σ̂_Ŷnew

[SAS output: 95% prediction limits and residuals, by country -- e.g. United States: (1213, 1904), residual 492.0968]

[Plot showing the 95% confidence limits and 95% prediction limits around the fitted line]

The Likelihood Principle
22S:138 Bayesian Statistics
Lecture 22, Nov 28, 2005
Kate Cowles
374 SH, 335-0727
kcowles@stat.uiowa.edu

The likelihood principle
- Suppose that two different experiments may inform about an unknown parameter θ.
- Suppose the outcomes of the experiments are, respectively, y and z.
- Suppose the likelihoods for θ resulting from the two experiments are proportional; that is,
  f1(y | θ) = c f2(z | θ)
  where c is a constant.
- Then the information about θ contained in the two experiments is equivalent.

Another way to state the likelihood principle
- For a given sample of data, any two probability models p(y | θ) that have the same likelihood function yield the same inference for θ.
- With regard to the information contained in the data about the unknown parameter, only the actual observed data y is relevant -- not other possible outcomes.
  - Contrast this with the frequentist p-value: the probability, assuming H0 is true, of getting a test statistic as extreme as, or more extreme than, the value that was actually obtained.
  - Not the researchers' intentions.

Example
- We are given a coin. We are interested in estimating θ, the probability of obtaining a head on a single flip.
- We want to test the hypotheses
  H0: θ = 1/2    HA: θ > 1/2
- The experiment consists of flipping the coin 12 times independently.
- The result is 9 heads and 3 tails.

Example, continued
- There are at least two possible ways the experiment might have been conducted:
  - Design 1: do 12 flips. The random variable Y is the number of heads obtained in n = 12 flips.
  - Design 2: flip the coin until 9 heads are obtained. The random variable Y is the number of tails that are obtained before the ninth head.
- Frequentist inference for θ would be different depending on which design is used.
- Bayesian inference would be the same under both designs, because the likelihoods are proportional.
- The negative binomial distribution: Y is the number of failures observed in a sequence of independent Bernoulli trials

Implications of the likelihood principle
- the
stopping rule principle
- the likelihood principle and reference priors

(Negative binomial, continued:) before the kth success,
  Y ~ NB(k, p),    p(y) = C(k + y − 1, y) p^k (1 − p)^y,    E(Y) = k(1 − p)/p

Stopping rules are often used in designing frequentist statistical studies
- instead of a fixed sample size
- to make it possible to stop a study early if the results are in
- particularly common in clinical trials: reducing the size and duration of a clinical trial reduces the number of patients who are exposed to the treatment that will be found to be inferior, and speeds up the dissemination of the results to the medical community
- Frequentist statisticians must choose the stopping rule before the experiment is conducted and adhere to it exactly; deviations can produce serious errors if a frequentist analysis is used.
- There is a large frequentist literature on how to control the overall probability of Type I error while allowing for more than one analysis of the data.

Stopping Rule Principle
- In a sequential experiment, the evidence provided by the experiment about the value of the unknown parameters θ should not depend on the stopping rule.
- follows directly from the likelihood principle

Jeffreys' priors and the likelihood principle
- recall Jeffreys' prior, a reference prior: noninformative, invariant to transformations of the parameters:
  p(θ) ∝ [I(θ)]^(1/2)
  where I(θ) is the expected Fisher information for θ
- Jeffreys' prior when the likelihood is binomial(n, θ):
  p(θ) ∝ θ^(−1/2) (1 − θ)^(−1/2),  i.e. Beta(1/2, 1/2)
- Jeffreys' prior when the likelihood is negative binomial(k, θ):
  p(θ) ∝ θ^(−1) (1 − θ)^(−1/2),  i.e. "Beta(0, 1/2)"
- So use of Jeffreys' prior in some cases can violate the likelihood principle!

22S:138 Bayesian Statistics: What is Bayesian Statistics?
Lecture 1, Aug 25, 2008
Kate Cowles
374 SH, 335-0727
kcowles@stat.uiowa.edu

The Scientific Method (as stated by Don Berry)
1. Ask a question or pose a problem.
2. Assemble and evaluate the relevant information.
   - Take stock of what is already known.
3. Based on current information, design an investigation or experiment (or perhaps no experiment) to address the question posed in step 1.
   - Consider the costs and benefits of the available experiments, including the value of any information they may contain.
   - Recognize that step 6 is coming.
4. Carry out the investigation or experiment.
5. Use the evidence from step 4 to update the previously available information; draw conclusions, if only tentative ones.
6. Repeat steps 3 through 5 as necessary.

But it's not just for science!

Where does statistics fit in?
- Central to steps 2, 3, and 5
- May help with step 1:
  - can help show that a question is inappropriate
  - may show that answering the question will be difficult or impossible
- Bayesian statistics is particularly well-suited for steps 2 and 5.

Who started it all? Thomas Bayes
Born: 1702 in London, England
Died: 17 April 1761 in Tunbridge Wells, Kent, England
- ordained Nonconformist minister in England
- "Essay towards solving a problem in the doctrine of chances"
  - set out Bayes's theory of probability
  - published in the Philosophical Transactions of the Royal Society of London in 1764
  - The paper was sent to the Royal Society by Richard Price, a friend of Bayes', who wrote: "I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit. In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."
- Bayes's conclusions were accepted by Laplace in a 1781 memoir, rediscovered by Condorcet (as Laplace mentions), and remained unchallenged until Boole questioned them in the "Laws of Thought." Since then, Bayes' techniques have been subject to controversy.
- elected a Fellow of the Royal Society in 1742, despite the fact that at that
time he had no published works on mathematics. Indeed, none were published in his lifetime under his own name.

Simple inference using Bayes' rule

Example: Do you have a rare disease?
- Your friend is diagnosed with a rare disease that has no obvious symptoms.
- You wish to determine how likely it is that you, too, have the disease. That is, you are uncertain about your true disease status.
- Your friend's doctor has told her that:
  - The proportion of people in the general population who have the disease is .001.
  - The disease is not contagious.
- A blood test exists for this disease, but it sometimes gives incorrect results.

How I became interested in the problem: Spatial statistics
22S:138, Lecture 22, Nov 28, 2007
Kate Cowles
374 SH, 335-0727
kcowles@stat.uiowa.edu

Cowles, M.K., Zimmerman, D.L., Christ, A., and McGinnis, D.L. (2002). Combining Snow Water Equivalent Data from Multiple Sources to Estimate Spatio-Temporal Trends and Compare Measurement Systems. Journal of Agricultural, Biological and Environmental Statistics 7: 536-557.

Problem: water supply in the western United States
- Approximately 75% of annual discharge in western rivers begins as snowpack.
- Water management decisions depend on estimation of the water contained in the snowpack.
- Several US government agencies collect data on snow water equivalent (SWE): the amount of water in the snow.
- We considered annual SWE data from the
eleven westernmost of the lower 48 United States (the "western US").

SWE measurement sites
[Map of SWE measurement sites in the western US]
- N = 707,745 observations of SWE
- S = 2,027 sites
- T = 89 years (1910-1998)

Goals of our study
- to estimate the temporal trend in SWE over the entire western US
- to characterize how this trend varies spatially
- to investigate whether there are systematic differences in the accuracy and reliability of the measurement systems
- to account appropriately for the spatiotemporal correlation structure of the data

Geostatistical models
- a natural and interpretable way to model spatial correlation for data measured at irregularly-spaced point sites
- correlation is a function of the distance, and possibly the orientation, between sites

Motivation for studies of parallelizing MCMC for Bayesian spatial and spatiotemporal models
- The preferred model involved a separable correlation structure:
  - a geostatistical model for the spatial component
  - an AR(1) model for the temporal component
- In 2001, we abandoned the idea of fitting such a model to the dataset as a whole, because months of computing time would have been involved.
- In 2002-2003, I began a collaboration with Marc Armstrong and Shaowen Wang.
  - Shaowen heads the Grid Research and Education Group at the University of Iowa.

Parametric correlation functions

  Function      corr(d; φ)
  Exponential   exp(−d/φ)
  Spherical     1 − (3/2)(d/φ) + (1/2)(d/φ)³  for d ≤ φ;   0 for d > φ

- φ is a parameter controlling the rate of decay of correlation with increasing distance
- corr(d) is the correlation between residuals at two sites separated by distance d

Simple geostatistical model with spatial correlation and additive measurement error

  Y ~ N(Xβ, σs² Σ(φ) + σe² I)
- X is a matrix of location-specific covariates
- β is a vector of coefficients to be estimated
- Σ(φ) is the spatial correlation matrix; its entries are calculated from the correlation function
- σs² is the spatial variance
- σe² is the random variance (measurement-error variance)
- I is the identity matrix
- The Bayesian model is completed by specification of prior distributions on φ, σs², σe², and β.

Our earlier work
- We proposed and compared several different parallel MCMC algorithms with respect to run time and mixing.
  - All algorithms were based on Metropolis updates.
- We found speedups up to a factor of 5 with 8 processors.
- slow mixing

Whiley, M. and Wilson, S.P. (2004). Parallel algorithms for Markov chain Monte Carlo methods in latent spatial Gaussian models. Statistics and Computing 14: 171-179.

Whiley and Wilson (2004)
- went beyond the "embarrassingly parallel" implementation of MCMC, in which separate chains are run on different processors
- identified two potential benefits of parallelizing within-iteration MCMC computations for latent spatial Gaussian models:
  - reducing the time taken to generate the required number of samples from an approximation to the target distribution
  - allowing a given MCMC algorithm to be applied to a target distribution of larger dimension, by dividing the storage requirements among processors and over distributed memory

Diggle and Ribeiro and the R package geoR
- one practical solution to computational intensiveness
- reparameterized the covariance structure σs² Σ(φ) + σe² I as σs² [Σ(φ) + taurel I], where taurel = σe²/σs²
- required treatment of taurel: either a fixed, known value, or a discrete uniform prior
- semi-conjugate inverse gamma prior on σs²
- normal or flat prior on β

MCMC algorithm in the geoR R package
- based on factoring the posterior:
  p(φ, taurel, σs², β | y) = p(φ, taurel | y) × p(σs² | φ, taurel, y) × p(β | σs², φ, taurel, y)
- each MCMC iteration m:
  - generate (φ(m), taurel(m)) from the discrete joint marginal p(φ, taurel | y)
  - generate σs²(m) from p(σs² | φ(m), taurel(m), y)  [inverse gamma]
  - generate β(m) from p(β | σs²(m), φ(m), taurel(m), y)  [multivariate normal]

Diggle, P.J. and Ribeiro Jr., P.J. (2002). Bayesian inference in Gaussian model-based geostatistics. Geographical and Environmental Modelling 6(2): 129-146.

Our priors
- continuous uniform prior on φ
  - endpoints chosen to reflect beliefs as to the largest and smallest possible distances at which the spatial correlation could decay to 0
- U(0, 1) ("uniform shrinkage") prior on S
- inverse gamma prior on σtot²
- multivariate normal or flat prior on β

Our alternative reparameterization
- facilitates prior specification
and the computing algorithm.
- reparameterized covariance matrix:
  Σ = σtot² [S Σ(φ) + (1 − S) I]
  where
  σtot² = σs² + σe²,    S = σs² / σtot²

Spatiotemporal model with separable correlation structure

  Y ~ N(Xβ, σtot² [S Σ(φ) ⊗ K Σ(ρ) K' + (1 − S) I])
- Σ(ρ) is an AR(1) matrix representing temporal correlation
- K is a matrix of 1's and 0's that matches each observation with the correct row and column of Σ(ρ); K is not needed if the data are rectangular
- prior on ρ: uniform on (−1, 1), or on (0, 1), slightly bounded away from the endpoints

Our sequential MCMC algorithm
- based on factoring the posterior:
  p(φ, ρ, σtot², S, β | y) = p(φ, ρ, S | y) × p(σtot² | φ, ρ, S, y) × p(β | σtot², φ, ρ, S, y)
- each MCMC iteration m:
  - generate (φ(m), ρ(m), S(m)) from the continuous joint marginal p(φ, ρ, S | y), using slice sampling
  - generate σtot²(m) from p(σtot² | φ(m), ρ(m), S(m), y)  [inverse gamma]
  - generate β(m) from p(β | σtot²(m), φ(m), ρ(m), S(m), y)  [multivariate normal]
- All parameters are blocked; this would be iid sampling if there were a way to obtain independent draws from the joint posterior marginal of φ, ρ, and S.
- In our experience, compared to Metropolis updating, slice sampling:
  - results in lower autocorrelation in the sampler output, since new values are drawn at every iteration, at the cost of requiring more computationally expensive evaluations within each iteration
  - results in more effective samples per second

Drawing from the joint posterior marginal of φ, ρ, and S

  p(φ, ρ, S | y) ∝ (a closed-form expression involving |Σ(φ, ρ, S)|, X' Σ(φ, ρ, S)⁻¹ X, and a quadratic form in y)

- The Cholesky decomposition of Σ(φ, ρ, S) enormously reduces the computation involved in obtaining the required determinants and quadratic form.
- Trivariate slice sampling (Neal, 2003) is attractive due to the finite support of all 3 parameters.

Parallelizing the algorithm
- used PLAPACK, a public-domain parallel linear algebra library, for dense matrix operations
  - www.cs.utexas.edu/~plapack (van de Geijn, 1997)
- The correlation matrix is distributed among multiple processors, and the Cholesky decomposition is done in parallel.
- The master node gathers the results from each iteration and does the computations involving scalar
quantities.

Cluster used for timing studies
- Beowulf Linux cluster, 14 nodes
- dual 1.4 GHz CPUs per node
- 1G memory per node
- no other users

Timing results for the spatiotemporal model (speedups relative to 1 CPU)

  Number of                    Sample size
  CPUs       1000   2000   4000   6000   8000   10000
  1          1.00   1.00   1.00   1.00   1.00    1.00
  4          1.70   2.01   2.33   2.46   2.58    2.64
  9          2.40   3.18   4.16   4.59   4.94    5.40
  16         2.00   3.75   5.09   5.90   6.53    7.21
  25         1.94   3.54   5.68   7.03   8.36    9.17

- Number of time points: 20
- Number of spatial points: 60, 120, 240, 360, 480, 600
- Results are for 10 MCMC iterations (about 60 likelihood evaluations).

Speedup comparisons
[Figure: speedup vs. number of CPUs for each sample size]

Output for a single year of SWE data
[Figure: sampler trace plots]
[Figure: lag autocorrelations of sampler output]

Work in progress
- tuning to improve performance:
  - clever distributed storage
  - multivariate slice-sampling
  - block size for PLAPACK
- porting to the TeraGrid
  - NSF-funded Grid composed of extremely high-performance clusters at 8 partner sites
  - decomposing problems so as to minimize inter-cluster communication
- application to the SWE data and radon data
- extension to nonstationary spatial covariance structures and other more complex models

Intro to Hierarchical Normal Linear Models
22S:138 Bayesian Statistics
Lecture 16, Oct 24, 2005
Kate Cowles
374 SH, 335-0727
kcowles@stat.uiowa.edu

Hierarchical normal linear models
- combine:
  - hierarchical models
  - linear regression

Review of the assumptions of linear regression
- Homoscedasticity
- Linearity
- Independence
- Normality
- Existence

Example: AIDS study ACTG 116B/117
- randomized, controlled, double-blind clinical trial
- patients with at least 16 weeks of prior treatment with zidovudine (ZDV)
- each patient was randomized to one of 3 treatments:
  - continued ZDV
  - 2 different dose levels of ddI, another antiretroviral drug
- primary endpoint: progression to a new AIDS-defining event, or death
- primary results published in Kahn et al. (1992)
- CD4 counts measured on all patients at study entry (week 0) and at weeks 2, 8, 12, 16, 24, 32, 40, 48, 56, and 64.
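The compromise that motivates the hierarchical model in the next section -- separate per-patient least-squares fits are noisy when each patient contributes only a few CD4 measurements -- can be illustrated with a minimal sketch. All numbers below are invented for illustration; they are not data from the ACTG trial.

```python
# Illustration (invented numbers, not ACTG data): per-patient OLS fits
# with only three observations per patient give noisy slope estimates.

def ols(t, y):
    """Ordinary least squares intercept and slope for one patient."""
    n = len(t)
    tbar = sum(t) / n
    ybar = sum(y) / n
    sxy = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
    sxx = sum((ti - tbar) ** 2 for ti in t)
    slope = sxy / sxx
    return ybar - slope * tbar, slope  # (intercept, slope)

# (week, transformed CD4) pairs for three hypothetical patients
patients = {
    "p1": ([0, 2, 8], [25, 24, 21]),
    "p2": ([0, 2, 8], [30, 30, 30]),
    "p3": ([0, 4, 8], [20, 16, 12]),
}

fits = {pid: ols(t, y) for pid, (t, y) in patients.items()}
for pid, (b0, b1) in fits.items():
    print(f"{pid}: intercept={b0:.1f}, slope={b1:.2f}")

# A crude "overall" slope for the group: the unweighted average of the
# per-patient slopes.  The hierarchical model instead shrinks each
# patient's estimate toward the population mean, with data-driven weights.
mean_slope = sum(b1 for _, b1 in fits.values()) / len(fits)
print(f"average slope: {mean_slope:.2f}")
```

With only three points per patient, each fitted slope is driven by tiny perturbations in the data, which is exactly why stage 2 of the hierarchical model treats the patient-specific coefficients as draws from a common distribution rather than as free parameters.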
your homework dataset consists of CD4 counts taken up to week 24 from a subset of the patients in 1 treatment group Another possibility separate linear regressions for each patient 0 would result in poor esimation of individual slopes and intercepts since there are few data values for each patient 0 then question arises of how to compute over all slope for treatment group 7 average 7 weighted average Research question and statistical models 0 two parameters of interest are average change in CD4 count per week in patients on each of the two treatments rm 5137 and 510 0 one approach simple linear regression ap plied separately to patients in each treatment group 2 2 yijl 0g7 lg70 N N 0g lgtij70 where g A7 B7 or C 0 which assumption of linear regression is vior lated 0 what are the likely consequences of the vior lation of this assumption 3 Hierarchical normal linear model 0 a compromise between pooling all the data into one simple linear regression model gtk would violate independence assumption 7 separate linear regressions for each patient gtk would result in poor estimation of indie vidual slopes and intercepts since there are few data values for each patient o notation 327 7transformed CD4 count measured on patient 239 at week 752739 0 stage 1 likelihood for each patient 239 239 1 N 2 2 yile OhO hyTy N NW whimsy where 7393 is the precision of the points around the patientrspeci c regression line 0 stage 2 formulation 1 2 2 O Oil ovTao N 575me 2 2 o lil lvl al N Nlt 177a1 0 stage 2 multivariate formulation 50 50 51 51 050239 051239 zaN2 2111 11 gtk Wishart is multivarate generalization of gamma gtk p is the degrees of freedom A determines the degree of certainty you have about the mean A for a Wishart distribution to be prior p must be 2 dimension of the matrix A p is equivalent to prior sample size gtk Wishart distribution is parameterized sevr eral different ways WinBUGS does not use the same par rameterization as GCSR table gtk in WinBUGS parameterization E1 N 
Wishart(R, ν) implies that E[Σ_β⁻¹] = ν R⁻¹

- third stage, univariate formulation:

  β0 ~ N(m0, v0)
  β1 ~ N(m1, v1)
  τ_y ~ Gamma(a_y, b_y)
  τ_β0 ~ Gamma(a_β0, b_β0)
  τ_β1 ~ Gamma(a_β1, b_β1)

- third stage, multivariate formulation:

  (β0, β1)′ ~ N₂((μ0, μ1)′, Σ₀)
  τ_y ~ Gamma(a_y, b_y)
  Σ_β⁻¹ ~ Wishart(R, ν)

  where ν is the degrees of freedom (a scalar)
  * equivalent prior sample size
  * must be greater than or equal to the dimension of the matrix in order for the Wishart to be proper
  and R is a prior guess at the order of magnitude of the covariance matrix Σ_β

Priors on precision matrices
- WinBUGS requires parameterizing models that include an unknown variance-covariance matrix of a multivariate normal distribution in terms of the precision matrix (the inverse of the variance-covariance matrix).
- The Wishart distribution is the conjugate prior for the precision matrix of a multivariate normal distribution with known mean.
- It is the standard choice of prior for precision matrices in realistic multivariate-normal-based models (with means and possibly many other parameters unknown) because it leads to a Wishart full conditional distribution for the precision matrix, which simplifies MCMC-based model fitting.
- The two parameters of the Wishart distribution are a mean matrix and a scalar parameter called the degrees of freedom.

Multiple parameterizations
Confusingly, several different parameterizations of the Wishart density appear in the literature. If X denotes a p × p symmetric positive definite random matrix, R is a fixed p × p symmetric positive definite matrix, ν is a strictly positive scalar, and the pdf of X is

  p(X | R, ν) ∝ |R|^(ν/2) |X|^((ν−p−1)/2) exp(−(1/2) tr(RX))   (1)

then different references define the two parameters differently: some write X ~ dwish(R, ν), others X ~ dwish(R⁻¹, ν), and others shift the degrees of freedom (e.g., X ~ dwish(R⁻¹, ν + p + 1)).

In what follows we use the WinBUGS parameterization. The Wishart distribution is proper if ν ≥ p. If X ~ dwish(R, ν), then the moments are as follows:

  E[X_ij] = ν (R⁻¹)_ij
  V(X_ij) = ν [ ((R⁻¹)_ij)² + (R⁻¹)_ii (R⁻¹)_jj ]
  Cov(X_ij, X_kl) = ν [ (R⁻¹)_ik (R⁻¹)_jl + (R⁻¹)_il (R⁻¹)_jk ]

Note that the gamma distribution is a special one-dimensional case of
the Wishart: if X and R are scalars and the pdf of X is proportional to x^((ν−2)/2) exp(−Rx/2), then X ~ Gamma(ν/2, R/2). WinBUGS does not allow the use of its Wishart distribution with one-dimensional matrices, however.

If X ~ dwish(R, ν), then X⁻¹ has an inverse Wishart distribution, X⁻¹ ~ IW(R, ν), where

  E[X⁻¹] = R / (ν − p − 1)

The inverse Wishart distribution is always proper; however, it has a degenerate form if ν < p, and obviously the first moment is negative or infinite unless ν > p + 1.

Since statisticians and subject-matter experts tend to be better able to think in terms of variances and correlations than in terms of elements of precision matrices, the following way of specifying a prior on a covariance matrix, say Σ, in WinBUGS is attractive:

1. Let R equal the prior guess for the mean of the p × p variance-covariance matrix Σ.
2. Choose a degrees-of-freedom parameter ν > p + 1 that roughly represents an "equivalent prior sample size": your belief in R as the value of Σ is as strong as if you had seen ν previous vectors with sample covariance matrix R.
3. Define a matrix S = (ν − p − 1) R.
4. In WinBUGS, put the following Wishart prior on the corresponding precision matrix Σ⁻¹:

   Sigma.inv ~ dwish(S, nu)

5. Then:
- E[Σ] = R, since E[Σ] = S / (ν − p − 1)
- the variance of the prior will be decreasing in ν

Advantages of hierarchical normal linear models
- compared to
  * pooling the data (one single linear regression)
  * separate linear regressions for each patient
- HNLMs accommodate correlations within subject
- HNLMs accommodate differences between subjects with respect to
  * intercept
  * slope
- HNLMs provide appropriate posterior credible sets for the population intercept and slope
- HNLMs "borrow strength" from other subjects' data to help in estimating subject-specific intercepts and slopes (useful when there are few data points for some or all individual subjects)

Requirements for the posterior to be proper
- must have proper priors on τ_β0 and τ_β1 (or Σ_β)
- at least 1 subject must have > 2 measurements

Centering
covariates in hierarchical normal linear models
- straightforward if all subjects have the same set of values of the predictor variable
- question: if subjects have different values of the predictor, should we center around the subject-specific average (x_ij − x̄_i) or the overall average (x_ij − x̄)?
- centering affects the priors on β0, τ_β0 (or Σ_β)

22S:138 Model checking and sensitivity analysis
Lecture 22, Nov 26, 2007
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Posterior predictive checking
- goal: assess fit of the model to
  * the data
  * our substantive knowledge
- must check effects of
  * prior
  * likelihood specification
  * hierarchical structure
  * any other application-specific issues (e.g., which predictor variables)
- theoretically it would be possible to set up and fit a super-model including all possibly true models, but this is computationally infeasible and really conceptually impossible
- instead, we fit a feasible number of models and examine the posterior distributions that result
- do the model's deficiencies have a noticeable effect on substantive inferences?

Principles and methods of model checking
- how to judge when assumptions of convenience can be made safely
- cast models as broadly as possible; watch for models that fail to fit reality or that are sensitive to arbitrary specifications

Using the posterior distribution to check a statistical model
- compare the posterior distribution of parameters to substantive knowledge or other data
- compare the posterior predictive distribution of future observations to substantive knowledge (e.g., compare election predictions from a model to substantive knowledge)
- compare the posterior predictive distribution of future observations to the data that have actually occurred

Checking a model by comparing the data that we have to the posterior predictive distribution
- enables checking fit of the model without any more substantive knowledge than is in the existing data and model
- do datasets simulated from the model we fit look like the real data in ways relevant to our inference?
- requires drawing "replicated data"
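Drawing replicated data and comparing it to the observed data can be sketched in a few lines. The following is a minimal illustration, not from the lecture: the data, prior settings, and variable names are all made up. It uses a normal model with known variance and a conjugate normal prior on the mean, with the sample minimum as a test quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "observed" data: 29 well-behaved points plus one low outlier.
y = np.append(rng.normal(25.0, 5.0, size=29), -10.0)
n, sigma = len(y), 5.0                      # sampling sd treated as known

# Conjugate normal prior on the mean: mu ~ N(m0, s0^2), nearly flat here.
m0, s0 = 0.0, 100.0
post_var = 1.0 / (1.0 / s0**2 + n / sigma**2)      # standard conjugate update
post_mean = post_var * (m0 / s0**2 + y.sum() / sigma**2)

T_obs = y.min()                             # test quantity T(y) = min(y_i)
n_rep, count = 2000, 0
for _ in range(n_rep):
    mu = rng.normal(post_mean, np.sqrt(post_var))  # step 1: draw theta from p(theta | y)
    y_rep = rng.normal(mu, sigma, size=n)          # step 2: draw y_rep from p(y_rep | theta)
    count += y_rep.min() <= T_obs           # replicate at least as extreme as data?

p_value = count / n_rep  # proportion of replicates as extreme as the real data
print(p_value)
```

Here the replicated minima essentially never reach the planted outlier, so the posterior predictive p-value is close to 0, flagging the normal likelihood as suspect for these data.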
Using the posterior predictive distribution to check a statistical model
- recall: posterior = conditional on the observed data y; predictive = prediction of an observable but unobserved ỹ

  p(ỹ | y) = ∫ p(ỹ, θ | y) dθ
           = ∫ p(ỹ | θ, y) p(θ | y) dθ
           = ∫ p(ỹ | θ) p(θ | y) dθ

  the last line holds if new data are conditionally independent of old data given the model parameters

Procedure to draw a "replicated dataset" from the posterior predictive distribution
- notation:
  * y: observed data
  * y_rep: a complete simulated dataset
    - same number of observations as in y
    - same values of explanatory variables, if any
    - response variables simulated from the posterior predictive distribution
  * θ: vector of all unknown model parameters, including parameters of upper-stage priors if the model is hierarchical
- Step 1: draw θ from p(θ | y), i.e., from the posterior distribution of θ
- Step 2: draw y_rep from p(y_rep | θ)
- repeat steps 1 and 2 a large number of times

Using the test quantities: posterior predictive p-values
- compute T(y, θ) for the real data y
- compute T(y_rep, θ) for each simulated replicate dataset
- compute the proportion of the replicated datasets for which T(y_rep, θ) ≥ T(y, θ)
- this is an approximation to the Bayes p-value

  ∫∫ I[T(y_rep, θ) ≥ T(y, θ)] p(y_rep | θ) p(θ | y) dy_rep dθ

- that is, the Bayes p-value is Pr(T(y_rep, θ) ≥ T(y, θ)), with the probability taken over the joint posterior distribution of θ and y_rep

Discrepancy measures or test quantities for posterior predictive checks
- intended to measure discrepancy between the model and the real data
- T(y, θ): a scalar summary of the data (and possibly parameters), used as a standard when comparing real data to data simulated from the posterior predictive distribution
- choose one or more test quantities that are meaningful with respect to your research purpose

Evaluating outliers in Newcomb's speed-of-light data
- from the GCSR textbook
- 66 measurements of the speed of light; two low outliers
- what we want to evaluate: is a normal density OK for the likelihood?
- defined T(y) as min(y_i), to check whether data with such extreme outliers could reasonably have come from a normal model
- Fit model to
the 66 observations:

  y_i ~ N(μ, σ²), i = 1, ..., 66
  p(μ, σ²) ∝ 1/σ²

- generated 20 replicate datasets
- found that in all replicate datasets, min(y_rep) was much larger than min(y) in the real data

Interpreting and using posterior predictive p-values
- not Pr(model is true | data)
- the posterior probability that T(y_rep, θ) ≥ T(y, θ)
- the ideal is a posterior predictive p-value somewhere around .5, which would mean that the real data y are typical of data that come from the model
- the model is suspect if the tail-area probability of a meaningful test quantity is close to either 0 or 1
  * would mean that the aspect of the data being measured by the test quantity is inconsistent with the model
  * an extreme ppp-value indicates that the model needs to be changed or expanded
  * in the Newcomb example, use a t or contaminated-normal likelihood

22S:138 Model Comparison
Lecture 18, Nov 6, 2006
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Model comparison
- often there are several plausible candidate models
  * different candidate predictor variables in regression
  * different link functions in generalized linear models
  * different assumptions regarding the form of the likelihood
  * different priors
- statisticians often will compare the fit of several models in order to choose the best one, and then assess whether that one is adequate
- alternative: Bayesian model-mixing does prediction using a weighted combination of all candidate models

Model comparison for nested vs. non-nested models
- nested models: two regression-type models in which the predictors in the smaller model are a subset of the predictors in the larger model
  * the larger model will fit better but will be more difficult to fit and to interpret
  * key questions in model comparison:
    - is the improvement in fit substantial enough to justify the increased difficulty in fitting and interpreting?
    - are the priors on the additional parameters reasonable?
- non-nested models
  * different link functions in GLMs
  * non-nested sets of predictors

Tools for Bayesian model comparison
- Bayes factors and approximations to them
- Deviance Information
Criterion

Frequentist use of deviance as a measure of model fit in linear and generalized linear models

Example: the dataset is counts of how many beetles were killed (y_i, i = 1, ..., 8) in 8 groups of beetles exposed to different doses of an insecticide. Each group i had n_i beetles.

- consider a saturated model for a particular dataset
  * it has a parameter for every observation in the dataset, so its fit is perfect
  * not useful, since it is no simpler than the entire original dataset
  * but it provides a benchmark against which to compare the fit of other models
- a saturated model for the beetles data would have 8 parameters p_i, i = 1, ..., 8: the population proportion killed at each of the 8 dose levels
  * the frequentist point estimate of each p_i would be y_i / n_i
- now consider a more useful model that lets us quantify the dose-response relationship:

  logit(p_i) = α + β x_i

  it has only 2 parameters, so it will not fit the data as perfectly as the saturated model
- notation: let log L(θ̂ | y) denote the maximum of the log likelihood for a particular model
- the deviance in a GLM is defined as

  2 [ log L(θ̂_saturated | y) − log L(θ̂_model of interest | y) ]

  * this is the likelihood-ratio statistic for testing the null hypothesis that the model holds, against the general alternative
  * under certain conditions, the deviance has an asymptotic chi-square distribution with degrees of freedom equal to the difference between the number of parameters in the saturated model and the number of parameters in the model being evaluated

Frequentist deviance for models for beetles data:

```
> glm(formula = resp.mat ~ beetles$V1, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5213  -0.6270   0.8705   1.2575   1.6487

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -59.869      5.100  -11.74   <2e-16 ***
beetles$V1    33.784      2.866   11.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.8664  on 7  degrees of freedom
Residual deviance:  11.3947  on 6  degrees of freedom
AIC: 41.803
```

```
> glm(formula = resp.mat ~ beetles$V1, family = binomial(link = probit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4994  -0.6939   0.7942   1.1473   1.3076

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -34.501      2.616  -13.19   <2e-16 ***
beetles$V1    19.478      1.469   13.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.8664  on 7  degrees of freedom
Residual deviance:  10.368   on 6  degrees of freedom
AIC: 40.698
```

```
> glm(formula = resp.mat ~ beetles$V1, family = binomial(link = cloglog))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7906  -0.6252   0.0838   0.4158   1.4120

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -39.035      3.182  -12.27   <2e-16 ***
beetles$V1    21.733      1.766   12.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 280.8664  on 7  degrees of freedom
Residual deviance:   4.0124  on 6  degrees of freedom
AIC: 34.342
```

Complementary log-log link:

  cloglog(p) = log(−log(1 − p))

Deviance Information Criterion
- Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian measures of model complexity and fit (with discussion)," J. Roy. Statist. Soc. B, 64, 583-640.
- used to compare the fit and predictive ability of Bayesian models
- includes a penalty for model complexity
- also provides an estimate of the number of free parameters in the model
  * highly correlated parameters, and parameters that are strongly influenced by their priors, count for less than 1 each
  * called the "effective number of parameters"
- built into WinBUGS
- can be used to compare non-nested models
- but the response variable must have the same form in all models (e.g., you couldn't use it to compare two regression models, one with the y's untransformed and one with the y's log-transformed)
- uses a version of the deviance from which the log likelihood of the saturated model is not subtracted off:

  D(θ, y) = −2 log L(θ | y)

- we want two quantities, which can be approximated using MCMC sampler output:
  * D̄(y): D(θ, y) averaged over the posterior distribution of θ
  * D(θ̄, y): D
evaluated at the posterior mean of θ
- then the effective number of parameters is estimated as

  p_D = D̄(y) − D(θ̄, y)

- and the DIC is

  DIC = D(θ̄, y) + 2 p_D = 2 D̄(y) − D(θ̄, y)

- DIC is an approximation to the expected predictive deviance and has been suggested as an indicator of model fit when the goal is to pick the model with the best out-of-sample predictive ability
- smaller values of DIC suggest better model fit

22S:138 Bayesian Statistics
Inference for Proportions, continued
Lecture 5, Sept 01, 2005
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Revisiting the uniform distribution as a noninformative prior for a proportion
- The uniform distribution is a special case of the Beta distribution.
- What are its parameters? U(0, 1) = Beta(1, 1)
- What is the equivalent prior sample size for a U(0, 1) prior? There is disagreement as to whether the equivalent prior sample size should be defined as
  * α + β
  * or α + β − 1
  * or α + β − 2
- What is the posterior distribution produced with a U(0, 1) prior and a binomial likelihood?

  p(p | y) = Beta(1 + y, 1 + n − y) ∝ p^y (1 − p)^(n−y)

  proportional to the likelihood, as we said before
- Is the posterior mean equal to the MLE p̂ = y/n?
- Note that the mode of a Beta(α, β) distribution is (α − 1)/(α + β − 2). So the mode of the posterior distribution given above is y/n.

Estimation
- point estimates
- measures of spread
- Bayesian intervals

The posterior distribution contains all the current information about the unknown parameter. All Bayesian inference is based on the posterior distribution.
- estimation: estimating values of unknown parameters that can never be observed or known
- testing
- prediction: estimating the values of potentially observable but currently unobserved quantities (e.g., we might want to predict the number of yesses in a future survey of 50 UI students)

The posterior variance
- The posterior variance is one summary of the spread of the posterior distribution.
- The larger the posterior variance, the more uncertainty we still have about the parameter.
- See the table of
distributions from GCSR for the formula for the variance of a random variable with a beta distribution.
- For a uniform prior and a binomial likelihood, the posterior variance is almost always smaller than the prior variance.

When posterior variance is not smaller than prior variance
- In our school-quitting example:
  * uniform prior: prior variance .083, posterior variance .00246
  * Beta(10, 40) prior: prior variance .00314, posterior variance .00140

Bayesian intervals
- called posterior intervals or credible sets
- two kinds:
  * equal-tail credible sets
  * highest posterior density regions

Equal-tail credible sets
- A 100(1 − α)% equal-tail credible set is the interval from the α/2 quantile to the 1 − α/2 quantile of the posterior distribution.
- e.g., if we want a 95% equal-tail credible set, α is .05 and we need the .025 and the .975 quantiles.
- We can use built-in S-plus functions to get quantiles of standard distributions.
- Example: for our quitting-school problem with the Beta(10, 40) prior, the posterior was Beta(17, 83):

```
> qbeta(c(0.025, 0.975), 17, 83)
[1] 0.1033333 0.2491463
```

[Figure: Beta density, alpha = 17, beta = 83]

Interpretation of Bayesian intervals
- Recall that the posterior distribution represents our updated subjective probability distribution for the unknown parameter.
- Thus, for us, the interpretation of the 95% credible set is that the probability is 95% that the true p is in that interval.
- If the Beta(10, 40) had been a true representation of our prior beliefs or knowledge about the parameter p, then after seeing our survey data we would believe that P(0.103 < p < 0.249) = 95%.
- Contrast this with the interpretation of a frequentist confidence interval.
- If we had instead used a uniform prior, the posterior was Beta(8, 44):

```
> qbeta(c(0.025, 0.975), 8, 44)
[1] 0.07024083 0.26255154
```

[Figure: Beta density, alpha = 8, beta = 44]

Highest posterior density regions
- the density at any point inside the interval is greater than the density at any
point outside
- shortest possible interval trapping the desired probability
- preferable to equal-tail credible sets when the posterior is highly skewed or multimodal
- generally difficult to compute; tables of HDRs for certain densities are available

What would go wrong if the new data were used to formulate the prior?
- Worst case:
  * Suppose we know nothing about the problem; our true prior is uniform ("ignorance").
  * Suppose we looked at our own survey data and used its normalized likelihood as our prior:

    p(p) ∝ p^7 (1 − p)^43, 0 < p < 1

  * the prior would be Beta(8, 44)
  * the posterior would be Beta(15, 87)
  * the posterior mean would be .147
  * the posterior variance would be .0012
  * 95% credible set:

```
> qbeta(c(0.025, 0.975), 15, 87)
[1] 0.08557233 0.22162269
```

Using posterior probabilities to test hypotheses
- Suppose we wanted to test the following hypotheses regarding p:

  H0: p < .1    H1: p ≥ .1

- We simply need the posterior probabilities of these two ranges of values for p.
- Suppose that the Beta(10, 40) had been our true prior, so our posterior distribution is Beta(17, 83). We can use a built-in S-plus function to obtain P(p < .1 | y):

```
> pbeta(0.1, 17, 83)
[1] 0.01879825
```

  With this prior, we would conclude that P(p < .1 | y) = .019.
- If we instead had used the uniform prior, so our posterior was Beta(8, 44):

```
> pbeta(0.1, 8, 44)
[1] 0.1329079
```

  With this prior, we would conclude that P(p < .1 | y) = .133.
- The interpretation here is totally different from that of a frequentist p-value.

Robustness
- An inference is robust if it is not seriously affected by changes in the assumptions on which it is based.
- Assumptions include:
  * form of the likelihood
  * parametric family for the prior
  * parameters of the prior
  * etc.
- Whether an inference is "seriously affected" depends on the purpose of the analysis.
- In this case, if the primary purpose of the analysis was to get a point estimate for p, we might decide that estimation was quite robust to changes in the prior parameters.
- If our primary
purpose was the hypothesis test, we might decide otherwise.

Introduction to Hierarchical Models
22S:138 Bayesian Statistics
Lecture 12, Oct 10, 2007
Kate Cowles, PhD

Example: pump failure data
- A hierarchical model is fit to data on failure rates of the pump at each of 10 power plants. The number of failures for the i-th pump is assumed to follow a Poisson distribution:

  x_i ~ Poisson(θ_i t_i), i = 1, ..., 10

  where θ_i is the failure rate for pump i and t_i is the length of operation time of the pump (in 1000s of hours).
- Important point: we do not assume that all the pumps have the same failure rate. In fact, one of the questions of interest is to estimate the rates for the individual pumps.
- We do not consider the (x_i, t_i) pairs exchangeable.

Hierarchical models
- Bayesian models with more than two levels or stages
- may arise for several reasons:
  * we have insufficient knowledge to specify the parameters of priors
  * we wish to model data or parameters that cannot be considered exchangeable but that are related

Write the likelihood of the data
- Recall that the definition of exchangeable observations is that their likelihood is invariant to permutations of the indices.
- If we exchanged the subscripts on two (x_i, t_i) pairs, and did not change the indices of the corresponding θ_i's, the evaluation of the likelihood would change.
- The first stage of a hierarchical model is the sampling distribution of the observed data, or the likelihood.

The second stage
- The second stage gives priors on the parameters that appeared in the first stage.
- In the pump failures example, a conjugate gamma prior distribution is adopted for the failure rates:

  θ_i ~ Gamma(α, β), i = 1, ..., 10

- This says that, although the failure rates for the individual pumps are not the same, they are related: they are all drawn from a common distribution.
- We do not know enough about failure rates of pumps in nuclear power plants to be able to specify fixed numbers for the prior parameters α and β. In fact, we want the data to inform us about these values.
- Consequently,
we will make α and β additional unknown parameters in the model.

WinBUGS program to fit the pump model:

```
model {
  for (i in 1:N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1.0)
  beta ~ dgamma(0.1, 1.0)
}
```

Hyperparameters
- At the third stage of the hierarchical model for pump failures, the following priors are specified for the hyperparameters α and β:

  α ~ Exponential(1.0)
  β ~ Gamma(0.1, 1.0)

Data and initial values:

```
list(t = c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5),
     x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22), N = 10)
list(alpha = 1.0, beta = 1.0,
     theta = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))
```

Results (start = 1001, sample = 10000):

```
node       mean      sd        MC error
alpha      0.70010   0.26990   0.0047060
beta       0.92900   0.53250   0.0097800
theta[1]   0.05980   0.02542   0.0002680
theta[2]   0.10080   0.07855   0.0008177
theta[3]   0.08927   0.03759   0.0003702
theta[4]   0.11600   0.03048   0.0003170
theta[5]   0.60560   0.31500   0.0030870
theta[6]   0.61050   0.13930   0.0014000
theta[7]   0.90250   0.72520   0.0079370
theta[8]   0.89640   0.72500   0.0082620
theta[9]   1.59000   0.77670   0.0090040
theta[10]  1.99300   0.42510   0.0049150
```

Compare to maximum likelihood estimates for individual pumps:

```
hours   failures   mle      posterior mean of theta
94.3    5          0.0530   0.0598
15.7    1          0.0637   0.1008
62.9    5          0.0795   0.0893
126     14         0.1111   0.1160
5.24    3          0.5725   0.6056
31.4    19         0.6051   0.6105
1.05    1          0.9528   0.9025
1.05    1          0.9524   0.8964
2.1     4          1.9048   1.5900
10.5    22         2.0952   1.9309
```

- individual estimates are shrunk away from the mle toward a common mean
- individual estimates "borrow strength" from the rest of the data
- thetas for observations with large "sample size" (time observed) are shrunk less than thetas for other observations
- thetas far from the common mean are shrunk more than those near it

22S:138 Bayesian Statistics
What is Bayesian Statistics?
Lecture 1, Aug 22, 2005
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Where does statistics fit in?
- Central to steps 2, 3, and 5
- May help with step 1:
  * can help show that a question is inappropriate
  * may show that answering the question will be difficult or impossible
- Bayesian statistics is
particularly well-suited for steps 2 and 5.

The Scientific Method (as stated by Don Berry)
1. Ask a question or pose a problem.
2. Assemble and evaluate the relevant information.
   - Take stock of what is already known.
3. Based on current information, design an investigation or experiment (or perhaps no experiment) to address the question posed in step 1.
   - Consider costs and benefits of the available experiments, including the value of any information they may contain.
   - Recognize that step 6 is coming.
4. Carry out the investigation or experiment.
5. Use the evidence from step 4 to update the previously available information; draw conclusions, if only tentative ones.
6. Repeat steps 3 through 5 as necessary.

But it's not just for science!

Who started it all? Thomas Bayes
Born: 1702 in London, England
Died: 17 April 1761 in Tunbridge Wells, Kent, England
- ordained Nonconformist minister in England
- "Essay towards solving a problem in the doctrine of chances"
  * set out Bayes's theory of probability
  * published in the Philosophical Transactions of the Royal Society of London in 1764
  * The paper was sent to the Royal Society by Richard Price, a friend of Bayes, who wrote: "I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit. In an introduction which he has writ to this Essay, he says that his design at first in thinking on the subject of it was to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."
- Bayes's conclusions were accepted by Laplace in a 1781 memoir, rediscovered by Condorcet (as Laplace mentions), and remained unchallenged until Boole questioned them in the Laws of Thought. Since then, Bayes' techniques have been subject to controversy.
- elected a Fellow of the Royal
Society in 1742, despite the fact that at that time he had no published works on mathematics. Indeed, none were published in his lifetime under his own name.

Simple inference using Bayes' rule. Example: Do you have a rare disease?
- Your friend is diagnosed with a rare disease that has no obvious symptoms.
- You wish to determine how likely it is that you, too, have the disease. That is, you are uncertain about your true disease status.
- Your friend's doctor has told her that:
  * The proportion of people in the general population who have the disease is .001.
  * The disease is not contagious.
- A blood test exists for this disease, but it sometimes gives incorrect results.

Some settings in which Bayesian statistics is used today
- economics and econometrics
- marketing
- social science
- education
- health policy
- medical research
  * more common in England than in the US
  * but the FDA has approved some new medical devices based on Bayesian analysis and is pushing the use of Bayesian methods in device testing
- weather
- the law
- etc., etc.

Quantifying uncertainty using probabilities
The long-run frequency definition of the probability of an event: The probability of an event is the proportion of the time it would occur in a long sequence of observations, i.e., as the number of trials tends to infinity.
- example: when we say that the probability of getting a head on a toss of a fair coin is .5, we mean that we would expect to get a head half the time if we flipped the coin a huge number of times under exactly the same conditions
- requires a sequence of repeatable experiments
- no frequency interpretation is possible for probabilities of many kinds of events, including the event that you have the rare disease

Probability as degree of belief
The subjective definition of probability is: A probability of an event is a number between 0 and 1 that measures a particular person's subjective opinion as to how likely that event is to occur (or to have occurred).
- applies whenever the person in question has an opinion about the event (if we count ignorance as an opinion, it always applies)
- Different people may have different subjective probabilities regarding the same event.
- The same person's subjective probability may change as more information comes in (this is where Bayes' rule comes in).

Back to the example
- two possible events or models:
  1. you have the disease
  2. you don't have the disease
- before taking any blood test, you think your chance of having the disease is similar to that of a randomly selected person in the population, so you assign the following prior probabilities to the two models:

  MODEL                PRIOR
  Have disease         .001
  Don't have disease   .999
                       1.0

Properties of probabilities
These properties apply to probability whichever definition is being used:
- Probabilities must not be negative: if A is any event, then P(A) ≥ 0.
- All possible outcomes together must have probability 1: if S is the sample space in a probability model, then P(S) = 1.

Data
- You decide to take the blood test.
  * the new information that you obtain to learn about the different models is called data
  * the different possible data results are called observations or outcomes
  * the data in this example is the result of the blood test
- The two possible observations are:
  * a positive blood test (+), which suggests you have the disease
  * a negative blood test (−), which suggests you don't have the disease

Likelihoods
- The probabilities of the two possible test results are different depending on whether you have the disease or not.
- These probabilities are called likelihoods: the probabilities of the different data outcomes conditional on each possible model.

                                LIKELIHOODS
  MODEL                PRIOR    P(+ | MODEL)   P(− | MODEL)
  Have disease         .001     .95            .05
  Don't have disease   .999     .05            .95

Bayes' rule applied to the example
You take the blood test, and the result is positive. This is the data or observation.

  MODEL                Prior   Like(+)   Product   Posterior for +
  Have disease         .001    .95       .00095    .019
  Don't have disease   .999    .05       .04995    .981
                                         .05090    1

- Are the entries in the Product column probabilities?
- How do we
convert them into probabilities?

Using Bayes' rule to update probabilities
- Bayes' rule is the formula for updating your probabilities about the models, given the data.
- It enables you to compute posterior probabilities given the observed data ("posterior" means "after").

Bayes' rule, simplest form:

  P(MODEL | DATA) ∝ P(MODEL) × P(DATA | MODEL)
  posterior ∝ prior × likelihood

What have you learned from the blood test?
- The probability of your having the disease has increased by a factor of 19.
- But the actual probability is still small (< .02).
- You decide to obtain more information by taking the blood test again.

Updating the probabilities again
- We will assume that, conditional on your true disease status, the results from two blood tests are independent.
- Your current probabilities are the posterior probabilities from after the first test.
- These will become your prior probabilities with respect to the second test.
- The second test result is also positive.

  MODEL                Prior   Like(+)   Product   Posterior for +
  Have disease         .019    .95       .01805    .269
  Don't have disease   .981    .05       .04905    .731
                                         .06710    1

What if the second test had been negative? That is, the second observation was −:

  MODEL                Prior   Like(−)   Product   Posterior for −
  Have disease         .019
  Don't have disease   .981

22S:138 Bayesian Statistics
Inference for Proportions, concluded; Introduction to Other One-Parameter Models
Lecture 7, Sept 12, 2005
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Example: a discrete prior for a binomial proportion
- In the survey problem, suppose we chose to put all prior probability on a discrete set of values, such as p ∈ {.02, .06, .10, ..., .38}.
- Then, regardless of the likelihood, the posterior probability will be 0 for all values of p except these.
- posterior ∝ prior × likelihood:

  p(p_i | y) ∝ p(p_i) × p_i^y (1 − p_i)^(n−y)

Nonconjugate prior distributions
- conjugate prior distributions are a convenience but may not reflect true prior knowledge
  * they simplify computations
  * results are easy to understand
  * they may be decent approximations to true prior
knowledge
- Bayes' rule for updating from prior to posterior applies with nonconjugate priors as well.

[Figure: prior, likelihood, and posterior with the discrete prior]

Example: a histogram prior for a binomial proportion
- how to construct one for a binomial proportion:
  * divide the interval (0, 1) into predefined subintervals
  * assign probabilities to each interval in accordance with the prior belief that the population proportion lies in that interval (e.g., prior probability .9 on the interval (0, .2), .075 on (.2, .4), ..., .0001 on (.8, 1))
- advantages of histogram priors:
  * easy to specify
  * requires no parametric assumptions
  * provides great flexibility in specifying prior beliefs

[Figure: histogram prior and likelihood; resulting posterior]

Other one-parameter models: inference for the mean of a normal population with known variance
- For what kind of problem might a normal likelihood be appropriate?
  * the random variable is continuous-valued
  * we expect a roughly symmetric distribution of values in the population
  * not too heavy-tailed of a distribution
- form of the normal pdf, y ~ N(μ, σ²):

  p(y | μ, σ²) = (1 / √(2πσ²)) exp( −(y − μ)² / (2σ²) )

Exchangeability, Part I: exchangeable experiments
- First consider experiments with a discrete set of outcomes.
- Two such experiments are exchangeable for a particular person if:
  1. The possible outcomes are the same in both experiments.
  2. The probability of each outcome in one experiment is the same as in the other experiment.
  3. The conditional probabilities for the second experiment, given the results of the first experiment, are the same as the conditional probabilities for the first given the results of the second.

Example
- We wish to estimate the population mean μ of the rate at which skin wounds heal in a particular species of newt.
- Biologists measured the rate at which new cells closed the skin of 18 anesthetized newts. They
measured in micrometers millionths of a meter per hour 0 Scientists usually assume that animal sub jects are simple random samples from their species or genetic type We will do this with these observations 0 We will pretend that we know that the pop ulation standard deviation 0 for healing rate is 8 micrometers per hour 12 Example randomly drawing two cards one at a time from a set of four cards 0 All cards are aces One is from each suit 0 We will consider each of the two draws a sep arate experiment 0 Are the two draws independent 0 Are they exchangeable Consider each criter rion 1 Possible outcomes of rst draw are diar mond heart club spade DHCS Pose sible outcomes of second draw are DHCS 13 2 Probabilities of each outcome on rst draw are What about second draw 7 we need marginal unconditional probe abilities of each possible outcome on sec ond draw PrdraW 2 D 7 More on criterion 3 o Criterion 3 may be restated to say The experiments are symmetric in that the joint probabilities are the same regardless of the order in Which the experiments are ob served 0 In the example7 consider the joint event of getting a D and a C in the two draws We must check Prdraw1 D and draw 2 C Prdraw 1 C and draw 2 3 What about conditional probabilities gor ing in each direction What is Prdraw2 Hldrawl D What is Prdraw1 Hldran D 39 What is Prdraw2 Hldrawl H What is Prdraw1 Hldran H 7 16 o This criterion is usually stated more formally as The joint probabilities are invariant to per mutations of the indices
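The sequential updating in the blood-test slides can be sketched numerically. This is a minimal illustration of "posterior ∝ prior × likelihood," not course code; the probabilities are the ones from the example, and the function name `bayes_update` is invented here for the sketch:

```python
# Discrete Bayes update: multiply prior by likelihood, then normalize.
# Numbers follow the blood-test example: after one positive test the
# probabilities are .019 / .981, and a positive result has likelihood
# .95 under "have disease" and .05 under "don't have disease".

def bayes_update(prior, likelihood):
    """Return the normalized posterior over a discrete set of models."""
    product = [p * l for p, l in zip(prior, likelihood)]
    total = sum(product)
    return [pr / total for pr in product]

prior = [0.019, 0.981]        # posterior after the first positive test
like_positive = [0.95, 0.05]  # P(positive result | model)
post2 = bayes_update(prior, like_positive)
print([round(p, 3) for p in post2])      # → [0.269, 0.731]

# Had the second test been negative instead:
like_negative = [0.05, 0.95]  # P(negative result | model)
post_neg = bayes_update(prior, like_negative)
print([round(p, 3) for p in post_neg])   # → [0.001, 0.999]
```

Note that normalizing at each step is what lets yesterday's posterior serve directly as today's prior, exactly as the slides describe.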

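The discrete prior for a binomial proportion described above can be illustrated the same way. The support {.02, .06, …, .38} is from the notes; the uniform prior weights and the data (3 successes in 10 trials) are assumptions made up for this sketch, not values from the lecture:

```python
# Posterior for a binomial proportion p under a discrete prior.
# Off the grid the posterior is exactly 0; on the grid it is
# prior × likelihood, renormalized.

values = [round(0.02 + 0.04 * k, 2) for k in range(10)]  # .02, .06, ..., .38
prior = [1.0 / len(values)] * len(values)                # uniform over the grid

y, n = 3, 10  # hypothetical survey data: 3 successes in 10 trials

# likelihood ∝ p^y (1 - p)^(n - y); the binomial coefficient cancels
# when we normalize, so it can be omitted
like = [p**y * (1 - p)**(n - y) for p in values]
product = [pr * l for pr, l in zip(prior, like)]
total = sum(product)
posterior = [pd / total for pd in product]

print(round(sum(posterior), 6))                 # → 1.0
print(values[posterior.index(max(posterior))])  # posterior mode → 0.3
```

With a uniform prior the posterior mode lands on the grid point nearest the sample proportion y/n; a non-uniform (e.g. histogram-style) prior would shift the posterior weights accordingly.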