Intro Stats and Data Analysis
Intro Stats and Data Analysis ECON 2370
This 24-page set of class notes was uploaded by Dr. Leon Koss on Saturday, September 19, 2015. The class notes belong to ECON 2370 at the University of Houston, taught by Staff in Fall. Since its upload, it has received 37 views. For similar materials see /class/208206/econ-2370-university-of-houston in Economics at University of Houston.
Date Created: 09/19/15
Chapter 7 Sampling Distribution

In this chapter we talk about the techniques of collecting data and drawing samples that represent the distribution of values in a population.

Examples of data sets that use samples
1. Decennial Census
2. Current Population Survey (CPS)
3. Consumer Price Index (CPI)

Decennial Census
1. Began August 1790. Conducted every 10 years.
2. Motivation: estimate the population that can be taxed and assess the country's industrial and military potential.
3. Data are kept confidential for 75 years.

Data items collected
1. 1790: number of persons in the household and counts of persons in the following categories: free white males and females, all other free persons, slaves, ethnicity.
2. Added items:
   1810: agricultural, mining, government activities, religious bodies
   1840, 1850: taxes, education, wages, value of property
   1880: farm and home mortgages
   1980: unemployment
   1940: business, housing, and transportation
   1960: place of work and means to work
   1970: occupation history
3. 1980: use and creation of TIGER files; data can be mapped geographically.

Methodology
1. 1790-1880: Marshals interview households.
2. 1830: Standard survey form used. Prior to this, marshals used whatever paper was available, ruled it, wrote in headings, and bound the sheets together.
3. 1870: Rudimentary tallying device used to help clerks.
4. 1890: Herman Hollerith introduced punchcards and electric tabulating machines.
5. 1910: Census office organized as a permanent agency.
6. 1950: First full use of computer support.
7. 1960: Devices set up to read data on returns; use of the Postal System to distribute surveys.
8. 1990: Counts of the homeless instituted.

Use of samples
1. 1880: Basic counts took almost until the 1890 census to tabulate and publish.
2. 1890: Supplemental survey; some subjects were covered in more detail.
3. 1940: Sampling introduced; 5% of the population were asked an additional set of questions.

Consumer Price Index (CPI)
1. Collected monthly.
2. Initiated during World War I, when rapid increases in prices, particularly in
shipbuilding centers, made such an index essential for calculating cost-of-living adjustments in wages.
3. A hypothetical bundle of goods is defined for the typical family. Changes to the bundle and the family composition occurred over time.
4. Samples of prices collected from establishments are used to estimate the value of bundles.

Current Population Survey (CPS)
1. Monthly survey (12 per year). Topics include employment, temporary workers, job tenure and occupational mobility, school enrollment, race and ethnicity, voting and voter registration, food security, work schedules, computer ownership, fertility, and marital history.
2. Set up in the late 1930s to provide direct measurements of unemployment each month.
3. Probability sampling was used from the beginning. Early samples of 50,000 households were used to estimate employment activities of the general population.
4. Supplements decennial census data.

In this chapter we will concern ourselves with samples and sampling distributions.
1. Previously, when we looked at distributions, we examined data distributions.
2. In this chapter the distributions are made of values computed from samples: the distribution of the sample mean and the distribution of the sample proportion.
3. Before we talk about sampling distributions we need to talk about how a sample is derived.
4. In this chapter certain terms that we used before are given a new name.
   (a) Parameters of interest: $\mu$, $\sigma^2$, and $p$ are the population values we wish to derive from the samples.
   (b) Statistics: with each sample we derive statistics, the sample estimates of the parameters of interest.

Methods of extracting a sample
1. Data on the population are available: extract a subset.
2. Data on the population are not available: determine how many people are needed to obtain a representative sample, then survey n individuals.

Types of samples
1. Random sample
2. Non-random methods: convenience sample, judgement sample, quota sampling. Data sets created by any of these non-random methods cannot be used for making inferences.

Random Samples
1. In the simple version, each element of the population has the same chance of being
selected.
2. Other versions can also provide an unbiased sample:
   (a) Stratified random sample
   (b) Cluster sample
   (c) 1-in-k systematic random sample

Two methods of data collecting
1. Secondary source data collection: sampling from an existing database, e.g. stock market activity, price data from a random sample of grocery stores, or a researcher making use of data collected by the government.
2. Primary source data collection: retrieving data directly from the respondents using a survey designed by the researcher.

Secondary data source
1. Benefits: low cost; data are typically high quality; takes less time to obtain.
2. Cost: variables might not be a close fit to the variables desired by the researcher.

Primary data source
1. Benefits: variables come close to fitting the type of variables desired.
2. Costs: added cost of survey design, data collection, and processing the data.

Data collection problems that can result in a biased sample
1. Distributing surveys to a random sample and accepting a low response rate.
2. Collection technique reaches only a subset of the full sample. Even with a 100% response rate, the method will produce a biased sample.
3. Wording and interviewer bias: the choice of words used in the survey and the choice of interviewers can bias the results.

Sampling Distribution
The sampling distribution of a statistic is the probability distribution for all possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.

Methods of obtaining a sampling distribution
1. Derive the distribution mathematically using the laws of probability. Example 7.3 and Table 7.5 are examples of this method.
2. Approximate the distribution empirically by drawing a large number of samples.
3. Use statistical theorems, such as the Central Limit Theorem, to derive exact or approximate distributions.

Central Limit Theorem
If random samples of n observations are drawn from a non-normal population with finite mean $\mu$ and standard deviation $\sigma$, then, when n is large, the sampling distribution
of the sample mean $\bar{x}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, also known as the standard error of the mean.

Conditions
Under certain conditions the means of random samples drawn from a population tend to approximate a normal distribution.
1. If the population can be represented by a normal distribution, the sampling distribution of $\bar{x}$ will be normal.
2. If the population can be represented by a symmetric distribution, the sampling distribution of $\bar{x}$ becomes normal for small values of n (for samples that are small relative to the population).
3. If the population can be represented by a skewed distribution, the sampling distribution of $\bar{x}$ becomes normal for large values of n (for samples that are large relative to the population).

Tools to assess $\bar{x}$ given $\mu$
1. Compute the mean and standard deviation of the sampling distribution: $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
2. Determine the condition to test, e.g. $P(\bar{x} < c)$ for a chosen cutoff $c$.
3. Convert $\bar{x}$ to a z-score using the following function: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$.
4. Use the table in the back of the book to find the probability for the region of the distribution given by the condition.

Sampling Distribution of the sample proportion
1. Recall from the previous chapter: let X be a binomial random variable with n trials and probability p of success. The probability distribution of X approximates the normal with $\mu = np$ and $\sigma = \sqrt{npq}$.
2. There is a similar outcome in sampling. Let's assume that the sampling distribution has the following characteristics. For a sample, the probability of success is equal to the number of persons with this characteristic over the total number of persons in the sample, or $\hat{p} = x/n$, where $\hat{p}$ is the probability of success derived from the sample.
3. For the sampling distribution, $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{pq/n}$, where $q = 1 - p$. If $np > 5$ and $nq > 5$, the sampling distribution can be approximated by the normal distribution.

Tools to assess $\hat{p}$ given that $\mu_{\hat{p}} = p$
1. Convert $\hat{p}$ into a z-score, $z = \dfrac{\hat{p} - p}{\sqrt{pq/n}}$, and calculate the probability.

Econ 2370, Spring 2000, O'Donnell

Chapter 8 Large Sample Estimation

Whenever we take
a sample we do so with the idea of learning something about the population from which the sample is drawn. Provided that the sample is drawn in an unbiased manner, we believe that it may be taken as representative of the parent population. But representatives are not all equally authoritative. Spokesmen, even official spokesmen, do not always tell a reliable tale, and it is necessary in retelling a story secondhand from such a source that we indicate the degree of confidence which may be placed in what the spokesman has said. Just as the journalist tries to emphasize for his readers the difference between rumours and usually well-informed sources, so too the statistician has to attempt a similar thing. Given large samples, the problem is easily enough disposed of intuitively. But when the samples are small we have to face not only the possibility of bias but also the fact that the average, standard deviation, or proportion found in the sample may differ quite appreciably from the population parameters it is sought to estimate through the sample. It is evident that there can be no possibility of finding a method of estimation which will guarantee us a close estimate under all conditions. All we can hope for is a method which will be the best possible in the sense that it will have a high probability of being correct in the long run.
(M. J. Moroney, Facts from Figures)

1. Points covered in this chapter
   (a) Two approaches to estimate population parameters, e.g. to estimate the mean and variance of the population for a normal distribution and the proportion for a binomial distribution when these values are unknown
   (b) Properties of sound estimators
   (c) Calculation of the margin of error and confidence intervals
   (d) How to choose a sample size

2. Prior material used in this section
   (a) Standard error measurement
   (b) z-scores
   (c) Tchebysheff's Theorem and the Empirical Rule
   (d) Central Limit Theorem: if the sample size n is large, the sampling distribution will be approximately
normal. If the sampling distribution is normal, we have a large set of statistical tools at our disposal.

3. Types of estimators
   (a) Point estimator: what is the best single value that can be used to estimate a population parameter?
   (b) Interval estimator: what is the best interval (referred to as a confidence interval) that contains the population parameter? Tied to the notion of the confidence interval is the confidence coefficient $1 - \alpha$, where $0.01 \le \alpha \le 0.10$.

4. Properties of point estimators
   (a) Unbiased: the average value of the estimated parameter equals the population parameter.
   (b) Consistent: estimates from the sample converge to the true value as the sample size increases.
   (c) Efficient: the estimator with the smallest sampling variance.

5. Univariate Analysis
   (a) Estimating the point estimator
       i. For the population mean: $\bar{x}$
          Margin of error: $\pm 1.96 \times$ (standard error of the estimator), or $\pm 1.96\,\sigma/\sqrt{n}$.
          If $\sigma$ is unknown and $n \ge 30$, one can substitute $s$ for $\sigma$.
       ii. For the population proportion: $\hat{p} = x/n$
          Margin of error: $\pm 1.96\sqrt{pq/n}$, estimated by $\pm 1.96\sqrt{\hat{p}\hat{q}/n}$.
          Recall that $np > 5$ and $nq > 5$ are required.
   (b) Estimating the interval estimator
       i. General function, two-tail test: Point estimator $\pm z_{\alpha/2} \times$ Standard Error
          A. Population mean: $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ (when $n \ge 30$, $s$ may replace $\sigma$)
          B. Population proportion: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$
       ii. General function, left-tail test (one-sided confidence interval): Point estimator $- z_{\alpha} \times$ Standard Error
       iii. General function, right-tail test (one-sided confidence interval): Point estimator $+ z_{\alpha} \times$ Standard Error
       iv. Values of $z_{\alpha/2}$ and $z_{\alpha}$ for given values of $\alpha$:

          alpha    z_{alpha/2} (two tail)    z_{alpha} (one tail)    Confidence (%)
          0.010    2.58                      2.33                    99.0
          0.020    2.33                      2.055                   98.0
          0.025    2.24                      1.96                    97.5
          0.050    1.96                      1.645                   95.0
          0.100    1.645                     1.28                    90.0

6. Bivariate Analysis
This type of analysis works with two samples, each drawn from a different population. For this form of bivariate analysis the research question is: are the populations different? Using the example from the textbook, one would want to test whether the average MCAT scores for biochemistry and biology majors are the same. If there is no difference between these two populations (biochemistry and biology students), the difference in their
population means, $\mu_1 - \mu_2$, would equal 0. This research question will be addressed briefly in this section and in more detail in Chapters 9 and 10. Right now we wish to deal with the point and interval estimates from the samples drawn from two populations.

There are two kinds of data sets used for bivariate analysis:
   (a) Data sets are not paired, i.e. the data sets are independent of one another. There is no relationship between the two parameters differenced, e.g. MCAT scores for biochemistry and biology majors.
   (b) Data sets are paired, i.e. there is a relationship between the two data sets. Examples: comparing the differences in gas mileage when a car is first given one type and then another type of gasoline; test scores of trainees before and after viewing an instructional video.

   (a) Properties of the sampling distribution of $\bar{x}_1 - \bar{x}_2$ (not paired)
       Mean and standard error:
          $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
          $SE = \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
       Margin of error:
          $\pm 1.96\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
       Confidence interval (two-tail):
          $(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
       i. If the sampled populations are normally distributed, then the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is normally distributed regardless of sample size.
       ii. If the sampled populations are not normally distributed, then the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is approximately normally distributed when $n_1$ and $n_2$ are large, due to the Central Limit Theorem.
       iii. If $\sigma_1^2$ and $\sigma_2^2$ are unknown but both $n_1$ and $n_2$ are greater than or equal to 30, you can substitute the sample variances for the population variances.
       iv. Use the z values found in section 5(b)iv.

   (b) Properties of the sampling distribution of $\hat{p}_1 - \hat{p}_2$ (not paired)
       Mean and standard error:
          $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
          $SE = \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$
       Margin of error:
          $\pm 1.96\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$
       Confidence interval (two-tail):
          $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}$
       The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normally distributed when $n_1$ and $n_2$ are large, due to the Central Limit Theorem.
       i. $n_1$ and $n_2$ must be sufficiently large so that the sampling
distribution of $\hat{p}_1 - \hat{p}_2$ can be approximated by a normal distribution: $n_1 p_1 > 5$, $n_1 q_1 > 5$, $n_2 p_2 > 5$, and $n_2 q_2 > 5$.
       ii. Use the z values found in section 5(b)iv.

   (c) Properties of the sampling distribution of $\bar{x}_1 - \bar{x}_2$ (paired)
       Mean and standard error:
          $\mu_{\bar{d}} = \mu_1 - \mu_2$
          $SE = \sigma_d/\sqrt{n}$
       where $\bar{d}$ is the mean of the differenced data, $\sigma_d^2$ is the variance of the differenced data, and $n_1 = n_2 = n$.
       Margin of error: $\pm 1.96\,\sigma_d/\sqrt{n}$
       Confidence interval (two-tail): $\bar{d} \pm z_{\alpha/2}\,\sigma_d/\sqrt{n}$

   (d) Properties of the sampling distribution of $\hat{p}_1 - \hat{p}_2$ (paired)
       Will not be covered in this class.

7. Choosing a sample size
Choosing a sample size is an application of the point and interval estimation techniques. Suppose you want to generate a sample such that the margin of error is equal to some value; let's call it B. You also want a sample such that 95% of repeated samples will give you an error of estimation less than or equal to B.

For univariate and bivariate analyses, each margin of error function is a function of n. Here is the case for the population mean (univariate case):
   $B = 1.96\,\sigma/\sqrt{n}$
Rearranging the function above gives the function for computing the sample size:
   $n \ge \dfrac{(1.96)^2 \sigma^2}{B^2}$
If $\sigma$ is not known, the sample standard deviation can be used, or a value based on the range of the values divided by 4.

To prepare a sample with a different degree of confidence, replace the margin of error function with the confidence interval function (two-tail version), i.e. use $z_{\alpha/2}$ in place of 1.96. Below is a table of the functions one can use to determine the sample size, where B is the acceptable margin of error.

   Analysis                  Estimator                  Minimum sample size
   Univariate                $\bar{x}$                  $n \ge z_{\alpha/2}^2\,\sigma^2 / B^2$
                             $\hat{p}$                  $n \ge z_{\alpha/2}^2\,pq / B^2$
   Bivariate (not paired)    $\bar{x}_1 - \bar{x}_2$    $n \ge z_{\alpha/2}^2\,(\sigma_1^2 + \sigma_2^2) / B^2$
                             $\hat{p}_1 - \hat{p}_2$    $n \ge z_{\alpha/2}^2\,(p_1 q_1 + p_2 q_2) / B^2$

For the bivariate functions, $n_1 = n_2 = n$.

Large Sample Tests of Hypothesis

Main points in this chapter
1. Standard method to test research questions
2. Discussion of the risks involved when the decision based on the test is incorrect
3. Detailed discussion of the standard method
4. Application of the standard method for research questions using large samples

Hypothesis Testing
Recall your chem lab or your biology lab in high school and the Scientific Method.

Observations: A good scientist is observant and notices things in the world around him/herself. She sees, hears, or in some other way notices what's going on in the world, becomes curious about what's happening, and raises a question about it.

Hypothesis: This is a tentative answer to the question, an explanation for what was observed. The scientist tries to explain what caused what was observed (hypo = under, beneath; thesis = an arranging).
1. Hypotheses are possible causes. A hypothesis is not an observation, but rather a tentative explanation for the observation.
2. Hypotheses reflect past experience with similar questions (educated propositions about cause).
3. Multiple hypotheses should be proposed whenever possible. One should think of alternative causes that could explain the observation; the correct one may not even be one that was thought of.
4. Hypotheses should be testable.
5. Hypotheses can be proven wrong (incorrect) but can never be proven or confirmed with absolute certainty. Someone in the future with more knowledge may find a case where the hypothesis is not true.

Statistical method
1. Observe the economy; raise a question or set of questions.
2. Prepare an answer in the form of a hypothesis, $H_0$, also known as the null hypothesis.
3. Prepare counter-responses, $H_a$ (the alternative), if the null is proven wrong or incorrect.
4. Collect data.
5. Specify a statistical test.
6. Determine the critical regions to reject.
7. Obtain the findings; prepare the results.

My modest example using Spam and eggs
Implications from theory: what the theory predicts with respect to the differences in the proportion of income spent on a good,
   $\gamma_i = \dfrac{\text{Income spent on good } i}{\text{Total income}}$
Let's look at the value of $\gamma$ spent by high income, $\gamma_{High}$, and the value of $\gamma$ spent by low income, $\gamma_{Low}$:
   $\gamma_{Low} - \gamma_{High} < 0 \Rightarrow$ Luxury good
   $\gamma_{Low} - \gamma_{High} = 0 \Rightarrow$ Normal good
   $\gamma_{Low} - \gamma_{High} > 0 \Rightarrow$ Inferior good

Test 1: is Spam
a luxury good?
1. Hypothesis to reject: $\gamma_{Low} - \gamma_{High} \ge 0$
I wish to reject the notion that the good could be either a normal or an inferior good. Logically, rejecting this hypothesis implies that I fail to reject that it is a luxury good.
Failing to reject $\ne$ accepting an outcome. Accepting an outcome implies that you have accepted the theory.

Full bank of tests for "Is Spam a luxury good?"
1a. Null hypothesis to reject: $\gamma_{Low} - \gamma_{High} \ge 0$
1b. Alternative hypothesis: $\gamma_{Low} - \gamma_{High} < 0$
2a. Null hypothesis to reject: $\gamma_{Low} - \gamma_{High} = 0$
2b. Alternative hypothesis: $\gamma_{Low} - \gamma_{High} \ne 0$
3a. Null hypothesis to reject: $\gamma_{Low} - \gamma_{High} \le 0$
3b. Alternative hypothesis: $\gamma_{Low} - \gamma_{High} > 0$
The researcher will want to reject the null hypotheses of Tests 1 and 2 and fail to reject the null hypothesis of Test 3.
Further comment regarding the three tests: two are one-tailed tests, one is a two-tailed test.

Specifics: Making Assumptions
1. Type of variable: categorical, interval
2. Type of population: binomial, normal, normal given the Central Limit Theorem
3. Type of analysis: univariate, bivariate
4. Large or small sample
5. Differences in variances have an impact on the differences-in-means test using a small sample
6. Null hypothesis $H_0$ and alternative hypothesis; one- or two-tailed test

Specifics: Sampling Distribution
1. Standard normal (z-scores)
2. Student's t distribution
3. $\chi^2$ distribution
4. F distribution

Specifics: One-tailed or two-tailed
If the hypothesis is an inequality, e.g. $\mu > 0$ or $\mu < 1$, we can use a one-tail test. If we are testing whether $\mu$ is a specific value, the alternative hypothesis is that $\mu$ is not this value and can be any other value in the distribution. For this case we use a two-tail test.

Specifics: Choosing a critical region
Describes the rejection area. Answers the question: what are we willing to risk in being wrong?

Three scenarios (two-tailed test)
Scenario 1: $\alpha = 20\%$, $\alpha/2 = 10\%$, or a per-tail probability of 0.10, means that when we reject the hypothesis we reject it with a confidence level of 80%.
Scenario 2: $\alpha = 10\%$, $\alpha/2 = 5\%$, or a per-tail probability of 0.05, means that when we reject the hypothesis we reject it with a confidence level of 90%.
Scenario 3: $\alpha = 2\%$, $\alpha/2 = 1\%$, or
a per-tail probability of 0.01, means that when we reject the hypothesis we reject it with a confidence level of 98%.

Notion of Significance (table on page 346)
Do researchers only report results that are significant?

Risk
There are two types:
1. Type I error: rejecting a hypothesis when in fact it is true.
2. Type II error: failing to reject a hypothesis when one should reject it.

Probability of making a Type I error
The significance level $\alpha$ tells you the probability of making a Type I error.
Amount of risk for the three scenarios:
   Highest risk of making a Type I error: lowest confidence level taken.
   Lowest risk of making a Type I error: highest confidence level taken.

Probability of making a Type II error
The probability of a Type II error is $\beta$. The power of your statistical test is given as $1 - \beta$.

Computing $\beta$ and $1 - \beta$
Suppose your hypothesis test is $H_0: \mu_0 = A$. You want to compute the power of the test to determine the probability of rejecting $H_0$ when the alternative mean is $\mu_a = C$.
1. Compute the two confidence interval values. The book uses the margin of error values, but that example (Example 9.8) assumes that the significance level is 5%. My instructions apply for all significance levels. These values are the endpoints of the Type II region. The formula for the confidence interval is
      $\mu_0 \pm z_{\alpha/2}\,SE$
   Note: this function uses $\mu_0$. From this you have the left boundary value (LBV) and the right boundary value (RBV).
2. Draw the two graphs. The left and right boundary points are points around $\mu_0$. Determine where $\mu_a$ is relative to these boundaries, and determine the rejection area of the new distribution overlapping the acceptance region of the old distribution.
3. Compute z-scores for the two values using the following functions:
      $z_{\text{left boundary}} = \dfrac{LBV - \mu_a}{SE}$
      $z_{\text{right boundary}} = \dfrac{RBV - \mu_a}{SE}$
   Note: this set of functions uses $\mu_a$.
4. Given the drawing above, determine $\beta = P(\text{accepting } H_0 \text{ when } \mu = \mu_a)$. The power of the test, or the probability of correctly rejecting $H_0$ given that $\mu$ is $\mu_a$, is $1 - \beta$.

Relationships between Type I & II probabilities and the power of the test
1. Increasing the significance level reduces the
confidence interval (the acceptance region around $\mu_0$), thus decreasing the probability of a Type II error and increasing the power of the test.
2. Increasing the sample size decreases the standard error. This decreases the probability of a Type II error and increases the power of the test.
3. If $\mu_a$ is very close to $\mu_0$, the power of the test is weak.
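
The four steps for computing the Type II error probability and the power of a two-tailed large-sample test can be sketched in Python. This is only an illustration under stated assumptions: the function name `power_two_tailed`, its parameter names, and the numbers in the example call are hypothetical and do not come from the notes or the textbook; the population standard deviation is assumed known, and the standard library's `statistics.NormalDist` stands in for the z table.

```python
from math import sqrt
from statistics import NormalDist


def power_two_tailed(mu0, mu_a, sigma, n, alpha=0.05):
    """Return (beta, power) for a two-tailed z test of H0: mu = mu0.

    Hypothetical helper: mu_a is the alternative mean, sigma the
    (assumed known) population standard deviation, n the sample size.
    """
    std_normal = NormalDist()                  # standard normal, replaces the z table
    se = sigma / sqrt(n)                       # standard error of the mean
    z_crit = std_normal.inv_cdf(1 - alpha / 2) # z_{alpha/2}

    # Step 1: boundaries of the acceptance region, centred on mu0.
    lbv = mu0 - z_crit * se
    rbv = mu0 + z_crit * se

    # Step 3: z-scores of the boundaries, now measured from mu_a.
    z_left = (lbv - mu_a) / se
    z_right = (rbv - mu_a) / se

    # Step 4: beta is the area of the mu_a distribution inside the acceptance region.
    beta = std_normal.cdf(z_right) - std_normal.cdf(z_left)
    return beta, 1 - beta


# Illustrative numbers only (not from the notes).
beta, power = power_two_tailed(mu0=100, mu_a=103, sigma=10, n=50, alpha=0.05)
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Calling the function again with a larger `n`, a larger `alpha`, or a `mu_a` closer to `mu0` is a quick way to see how each of these choices moves the power of the test.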