REGRESSION ANALYSIS M 384G
These 6 pages of class notes were uploaded by Reyes Glover on Sunday, September 6, 2015. The notes belong to M 384G at the University of Texas at Austin, taught by Staff in Fall. Since their upload, they have received 7 views.
Date Created: 09/06/15
REVIEW OF BASIC STATISTICAL CONCEPTS
M 384G/374G

I am assuming that you are familiar with confidence intervals and some form of hypothesis testing. However, these topics can be taught from more than one perspective, and there are some common misconceptions regarding them, so it is worthwhile to give a review that will also lay a firm foundation for further work in statistics. I also want to introduce some notation that I will use in the course. So please read these notes carefully. They may contain details, perspectives, cautions, notation, etc. that you have not encountered before. I will, however, leave out some details that I assume you are familiar with, such as the formulas for sample mean and sample standard deviation.

In statistics, we are studying data that we obtain as a sample from some population. For example, we might be studying the population of all UT students and take a sample of 100 of those students. We will assume that our sample is a simple random sample. That means that the sample is chosen by a method that gives every sample of the same size an equal chance of being chosen. For example, we might choose our 100 UT students by assigning a number to each UT student and using a random number generator to pick a random sample of 100 numbers; our sample would then consist of the 100 students with those numbers.

Typically, we are interested in a random variable defined on the population under consideration. For example, we might be interested in the height of UT students. Typically, we are interested in some parameter associated with this random variable. For example, we might be interested in the mean height of UT students. We will illustrate with the example of mean as our parameter of interest.

Notation

Let y refer to the random variable (e.g., height). Then:

- The population mean (also called the expected value or expectation) of y is denoted by either $E(y)$ or $\mu$.
- We use our sample to form the estimate $\bar{y}$ of $E(y)$.

More generally:

- We use the word parameter to refer to constants that have to do with the population. We will often refer to parameters using Greek letters (e.g., $\sigma$).
- We use the word statistic (singular) to refer to something calculated from the sample. So $\bar{y}$ is a statistic. However, not all statistics are estimates of parameters.

I am using lower case letters to refer to random variables, since this is the notation used in the textbook. You might be (as I am) used to using capital letters to denote random variables and lower case to denote values of the random variable in a sample. I will probably revert to that notation occasionally, sometimes to make the distinction clearer; I hope that the meaning will be clear from context.

Models

In order to use statistical inference or form confidence intervals, we need to have a model for our random variable. In the present context, this means we assume that the random variable has a certain type of distribution. Just what model (distribution) we choose depends on what we know about the random variable in question, including both theoretical considerations and available data. The choice of model is also usually influenced by information known about distributions: we can deduce more from a distribution that has a lot known about it. In working with models (which we will do often in this course), always bear in mind the following quote from the statistician G. E. P. Box: "All models are wrong, but some models are useful."

For our example of height, we will use a normal model; that is, we proceed under the assumption that the height of UT students is normally distributed, with mean $\mu$ and standard deviation $\sigma$. The values of $\mu$ and $\sigma$ are unknown; in fact, our aim is to try to use the data to say something about $\mu$. If we are just considering students of one sex, both theory and empirical considerations indicate that a normal model should be a pretty good one. If we are considering both sexes, then data, theory, and common sense tell us that it isn't likely to be as good a choice as if we are just considering one sex. However, other theoretical considerations
suggest that it probably isn't too bad.

Sampling Distributions

Although we only have one sample in hand when we do statistics, our reasoning will depend on thinking about all possible simple random samples of the same size n. Each such sample has a sample mean $\bar{y}$, which is itself a random variable. Note that the new random variable $\bar{y}$ depends on the choice of sample, whereas the original random variable y depended on the choice of student. Mathematics (using our assumption that the distribution of y is normal with mean $\mu$ and standard deviation $\sigma$) tells us that the distribution of $\bar{y}$ is also normal with mean $\mu$, but its standard deviation is $\sigma/\sqrt{n}$. Consequently, $\bar{y}$ varies less than y. See the demo "Distribution of Mean" at http://www.kuleuven.ac.be/ucs/java/index.htm under Basics for an illustration of this. The distribution of $\bar{y}$ is called a sampling distribution.

If we knew $\sigma$, we could use it to get a kind of margin of error for $\bar{y}$ as an estimate of $\mu$. Since we don't know $\sigma$, it is natural to use the sample standard deviation s to estimate $\sigma$. (Note the use of English letters to refer to the statistics, to distinguish them from the parameters, denoted by Greek letters.) However, using s instead of $\sigma$ no longer yields a normal distribution. We can get around this difficulty by instead using the t-statistic

$$t = \frac{\bar{y} - \mu}{\mathrm{se}(\bar{y})}, \quad \text{where } \mathrm{se}(\bar{y}) = \frac{s}{\sqrt{n}} \text{ is the standard error of } \bar{y}.$$

This gives us still another random variable. Mathematical theory (plus our assumption of normality of y) tells us that this random variable t has a t-distribution with n − 1 degrees of freedom.

Confidence Intervals

If we are trying to estimate $E(y)$, we use a confidence interval to give us some sense of how good our estimate $\bar{y}$ might be. (Note the qualifications in this sentence. Qualifications are important in statistics.)

For a 95% confidence interval, we reason as follows: From tables or software, we can find the value $t_0$ of the t-statistic such that 2.5% of the area under the t-distribution with n − 1 degrees of freedom lies to the right of $t_0$. Then, in the language of probability,

$$\Pr\left(-t_0 \le \frac{\bar{y} - \mu}{\mathrm{se}(\bar{y})} \le t_0\right) = 0.95.$$

Caution: In understanding this, it is important to remember that $\bar{y}$ is our random variable, not $\mu$. So this mathematical sentence should be interpreted as saying, "The probability that a simple random sample of size n from the assumed distribution will produce a value of $\bar{y}$ with $-t_0 \le (\bar{y} - \mu)/\mathrm{se}(\bar{y}) \le t_0$ is 95%."

With a little algebraic manipulation, we can see that this says the same thing as

$$\Pr\left(\bar{y} - t_0\,\mathrm{se}(\bar{y}) \le \mu \le \bar{y} + t_0\,\mathrm{se}(\bar{y})\right) = 0.95.$$

Bearing in mind the caution just mentioned, we can express this in words as, "The probability that a simple random sample of size n from the assumed distribution will produce a value of $\bar{y}$ with $\bar{y} - t_0\,\mathrm{se}(\bar{y}) \le \mu \le \bar{y} + t_0\,\mathrm{se}(\bar{y})$ is 95%."

The resulting interval $(\bar{y} - t_0\,\mathrm{se}(\bar{y}),\ \bar{y} + t_0\,\mathrm{se}(\bar{y}))$, formed using the value of $\bar{y}$ obtained from the data on hand, is called a 95% confidence interval for $\mu$. The confidence interval can be described in words in either of the following two ways:

(i) The interval has been produced by a procedure that, for 95% of all simple random samples of size n from the assumed distribution, results in an interval containing $\mu$.
(ii) Either the confidence interval calculated from our sample contains $\mu$, or our sample is one of the 5% of "bad" simple random samples of size n for which the resulting confidence interval doesn't contain $\mu$.

Of course, we also have to bear in mind the possibility that our assumed model is not a good one, or that our sample really is not a simple random sample.

Hypothesis tests

We use a hypothesis test when we have some conjecture ("hypothesis") about the value of the parameter that we think might or might not be true. A hypothesis test is framed in terms of a null hypothesis, usually called $H_0$ (or NH, as in the textbook). For all of the types of hypothesis tests we will do, the null hypothesis will be of the form

Parameter = specific value.

So in our example, where the parameter of interest is the mean, the null hypothesis would be stated as

$H_0$ (or NH): $\mu = \mu_0$.

There are two frameworks for a hypothesis test. The one we will use in this course uses reasoning as
follows: If the null hypothesis is true (and still assuming a normal model), then, as above, we know that the sampling distribution of the statistic

$$t = \frac{\bar{y} - \mu_0}{\mathrm{se}(\bar{y})}$$

(called the test statistic) is the t-distribution with n − 1 degrees of freedom. We calculate this test statistic for our sample of data (call the result of the calculation $t_s$), and then calculate the p-value, defined as the probability that a simple random sample of size n from our population would give a t-statistic at least as extreme as the one ($t_s$) that we have calculated from the data, assuming the null hypothesis is true.

To pin down just what we mean by "at least as extreme," we usually specify an alternate hypothesis, $H_1$ (or AH, as in the textbook). This can be either two-sided or one-sided.

Two-sided: $H_1$ (or AH): $\mu \ne \mu_0$.
One-sided: This can take one of two forms, either $H_1$ (or AH): $\mu < \mu_0$, or $H_1$ (or AH): $\mu > \mu_0$.

If the alternate hypothesis is $\mu < \mu_0$, then "at least as extreme as" means "$\le$", so that the p-value is $p = \Pr(t \le t_s)$. Similarly, if the alternate hypothesis is $\mu > \mu_0$, then the p-value is $p = \Pr(t \ge t_s)$. If the alternate hypothesis is two-sided, then the p-value is $p = \Pr(|t| \ge |t_s|)$.

The p-value is taken as a measure of the weight of evidence against $H_0$. A small p means that it would be very unusual to obtain a test statistic at least as extreme as ours if indeed the null hypothesis is true. Thus, if we obtain a small p, then either we have an unusual sample, or the null hypothesis is false. (Or we don't have a simple random sample, or our model is not a good one.) We (somewhat subjectively, but based on what seems reasonable in the particular situation at hand) decide what value of p is small enough for us to consider that our sample provides reasonable doubt against the null hypothesis. If p is small enough to meet our criterion of reasonable doubt, then we say we reject the null hypothesis in favor of the alternate hypothesis.

Note:

1. A hypothesis test cannot prove a hypothesis. Therefore, it is wrong to say "the null hypothesis is false" or
"the alternate hypothesis is true" or "the null hypothesis is true" or "the alternate hypothesis is false" on the basis of a hypothesis test.

2. Although it is arguably reasonable to say "we reject the null hypothesis" on the basis of a small p-value, there is not as sound an argument for saying "we accept the null hypothesis" on the basis of having a p-value that is not small enough to reject the null hypothesis. To see this, imagine a situation where you are doing two hypothesis tests, with null hypotheses just a little different from each other, using the same sample. It is very plausible that you can get a large (e.g., around 0.5) p-value for both hypothesis tests, so you haven't really got evidence to favor one null hypothesis over the other. So if your p-value is not small enough for rejection, all you can legitimately say is that the data are consistent with the null hypothesis. (This is assuming that by "accept" you mean that the data provide adequate evidence for the truth of the null hypothesis. If by "accept" you mean accept $\mu_0$ as a good enough approximation to the true $\mu$, then that's another matter, but if that's what you are interested in, using a confidence interval would probably be more straightforward than a hypothesis test.)

3. The p-value is, roughly speaking, the probability of getting data at least as extreme as the sample at hand, given that the null hypothesis is true. What many people really would like (and sometimes misinterpret the p-value as saying) is the probability that the null hypothesis is true, given the data we have. Bayesian analysis aims to get at the latter conditional probability, and for that reason is more appealing than classical statistics to many people. However, Bayesian analysis doesn't quite give what we'd like either, and is also often more difficult to carry out than classical statistical tests. Increasingly, people are using both kinds of analysis. I encourage you to take advantage of any opportunity you can to study
some Bayesian analysis.

Many people set a criterion for determining what values of p will be small enough to reject the null hypothesis. The upper bound for p at which they will reject the null hypothesis is usually called $\alpha$. Thus, if you set $\alpha = 0.05$ (a very common choice), then you are saying that you will reject the null hypothesis whenever $p < 0.05$. This means that if you took many, many simple random samples of size n from this population, you would expect to falsely reject the null hypothesis 5% of the time when it is in fact true; that is, you'd be wrong about 5% of the time. For this reason, $\alpha$ is called the type I error rate.

Note:

1. If you set a type I error rate $\alpha$, then to be intellectually honest, you should do this before you calculate your p-value. Otherwise, there is too much temptation to choose $\alpha$ based on what you would like to be true. In fact, it's a good idea to think about what p-values you are willing to accept as good evidence before the fact, but if you are using p-values, you may think in terms of ranges of p-values that indicate "strong evidence," "moderate evidence," and "slight evidence," rather than just a reject/don't reject cutoff.

2. If you do set a type I error rate $\alpha$, then you don't really need to calculate p to do your hypothesis test: you can just reject whenever the calculated test statistic $t_s$ is more extreme ("more extreme" being determined, as above, by your alternate hypothesis) than $t_\alpha$, where $t_\alpha$ is the value of the t-distribution that would give a p-value equal to $\alpha$.

3. If you are going to publish any scientific work, the second option is not a good choice; instead, you should calculate and publish the p-value, so others can decide if it satisfies their own criteria (which might be different from yours) for the weight of evidence desired to reject the null hypothesis.

4. When an $\alpha$ has been chosen for determining when the null hypothesis will be rejected, and when the null hypothesis has indeed been rejected, many people say that
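
The "95% of all simple random samples" reading of a confidence interval, and the matching 5% type I error rate for $\alpha = 0.05$, can be checked by simulation. The following is only a minimal sketch under stated assumptions: the population values MU and SIGMA, the sample size N, and the function names are hypothetical choices made for illustration, not taken from the notes. T0 = 2.262 is the usual table value that leaves 2.5% of the area to its right for a t-distribution with 9 degrees of freedom.

```python
import random
import statistics

# Hypothetical illustration values (assumptions, not from the notes).
MU = 68.0      # assumed population mean height, in inches
SIGMA = 3.0    # assumed population standard deviation
N = 10         # sample size, so the t-distribution has N - 1 = 9 degrees of freedom
T0 = 2.262     # table value: 2.5% of the t(9) area lies to the right of T0


def t_interval(sample, t0=T0):
    """Return (lower, upper) = ybar -/+ t0 * se(ybar), where se(ybar) = s / sqrt(n)."""
    n = len(sample)
    ybar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # stdev uses the n - 1 divisor
    return ybar - t0 * se, ybar + t0 * se


def coverage(reps=20000, seed=1):
    """Fraction of simulated normal samples whose interval contains MU."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        lower, upper = t_interval(sample)
        if lower <= MU <= upper:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    c = coverage()
    print(f"fraction of intervals containing mu: {c:.3f}")
    print(f"fraction of intervals missing mu:    {1 - c:.3f}")
```

An interval misses MU exactly when the t-statistic computed with $\mu_0$ = MU satisfies $|t| > t_0$, so the second fraction printed is also the simulated rate of falsely rejecting a true null hypothesis in a two-sided test at $\alpha = 0.05$. Because the samples are drawn from a normal distribution, the simulation says nothing about how the procedure behaves when the normal model is a poor one.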