UMD - STAT 100 - Study Guide - Final
STAT100 Lecture 1, Week 2 (January 30th, 2017): Intro to Statistics

Probability
- Study of randomness and uncertainty
- Methods of quantifying the chances, or likelihoods, associated with a desired outcome obtained from a set of possible outcomes of an experiment
- Probability provides the foundation of statistics

Statistics
A science that involves the extraction of information from numerical data obtained during an experiment or from a sample. It involves:
- A question to be answered or a situation to be analyzed
- Designing the experiment or sampling procedure
- Collecting and analyzing the data
- Making inferences (statements) about the population based upon information in a sample

The Question or Situation We Need Analyzed
- Formulate your question or situation
- Determine the best group of people or things to go to that will help answer the question or analyze the situation. This is your population.
- Determine which characteristics from this group will help answer your question or analyze the situation. These are your variable(s).

Individuals and Variables
- Individuals make up a population: the objects described by a set of data; may be people, animals, things, etc.
- Variable: any characteristic of an individual; can take different values for different individuals

Data
- Statistics is the science of reasoning with data
- Data is the collection of values of a set of characteristics for a set of individuals, collected for a particular study

Components of a Data Set
- Individual (object, subject, or experimental unit): entity on which information is collected
- Variable: measurement or characteristic of interest for an individual (name, address, customer number, etc.)
- Observation: set of all measurements (variable values) collected from a single individual

What Does Statistics Involve?
A population:
- Is the group to be studied
- Includes all of the individuals in the group
A sample:
- Is a subset of the population
- Provides information used to infer information about the population
- Is often used in analyses because getting access to the entire population is impractical
Descriptive statistics: statements about the sample via organizing and summarizing the information collected
- Numerical
- Graphical
Inferential statistics: generalization of sample results to statements about the population
- Hypothesis testing
- Confidence intervals

Types of Statistical Studies
- Retrospective studies: analysis of data collected in the past for another purpose
- Observational studies: current passive observation of subjects without an investigator's intervention
- Experiments: current observation of subjects with investigator control of observational conditions

Sampling Techniques
For retrospective or observational studies:
- Population of size N is collected in a frame (list)
- Sample of size n is drawn randomly from the frame
Some examples:
- Simple random sample (SRS): each item has an equal chance of being in the sample; each sample has an equal chance of being drawn
- Stratified random sample: sample is proportionately representative of the population
- Systematic (could produce a biased sample)
- Convenience (probably will produce a biased sample)
For experimental studies:
- Population is all possible outcomes for all possible factors
- Sample is the set of outcomes for a selected combination of factors

Types of Data Used in Statistics
- Measurement data (aka quantitative data) are on the number line
- Categorical data (aka qualitative data) are not on the number line

Measurement Data
- On the number line
- May be infinitely divisible (continuous data). Examples are values of physical quantities: weight, length, temperature, density, or money (anything measured).
- May be counts (discrete data).
Examples would be counts of the number of coin flips before the 3rd head is obtained.

Categorical Data
- Not on the number line
- Their values are simply different: attributes or characteristics
- There are 2 kinds:
  - Nominal data do not have any natural order, e.g., state abbreviations or colors
  - Ordinal data do have a natural order, e.g., course letter grades, a 5-point attitude scale (called a Likert scale), judging scores

Lecture 2, Week 2 (February 1st, 2017)

Main Types of Data
- Categorical (qualitative)
- Quantitative

Qualitative vs. Quantitative Variables
- Categorical or qualitative: a variable whose values are attributes or characteristics. Allows researchers to categorize the individual.
- Quantitative: a variable whose values are numerical measures on which arithmetic operations can meaningfully be performed. Allows researchers to summarize via averages.
  - Discrete variables: variables that have a finite or a countable number of possibilities. Often these variables are counts.
  - Continuous variables: variables that have an infinite, not countable, number of possibilities. Often these variables are measurements.

Descriptive Statistics
After we collect the raw data (from a sample survey or a designed experiment), we can:
- Describe the data using visual methods
- Describe the data using numeric methods
Different methods are appropriate for different types of data.

Descriptive Statistics: Statements about the Sample
- Focus is on interpreting and presenting the data that has been collected via the sample
- No attempt is made to generalize to other (larger) groups such as the population

Lecture 3, Week 3 (February 6th, 2017): Examining Distributions

Distribution of a Variable
- One way to examine a single variable is to graphically display its distribution
- The distribution of a variable tells what values the variable takes and how often it takes these values
- Distributions can be displayed using a table, graph, or a function.
If displaying using a graph, the proper choice of graph depends on the nature of the variable:
- Categorical: pie chart, bar graph
- Quantitative: histogram, stemplot
Frequency: the count out of the total
Relative frequency: the relative amount (percentage) out of the total

Pie Chart
- Title: "Distribution of ______" (name of the category)
- Label your variables

Bar Graph
- Label the x and y axes

Organizing Qualitative/Categorical Data
- Organize qualitative/categorical data in tables
- Construct bar graphs
- Construct pie charts
Raw qualitative data comes as a list of values... each value is one out of a set of categories. This allows researchers to categorize or classify individuals into groups.

Frequency/Relative Frequency
A frequency distribution lists:
- Each of the categories
- The frequency, or the count, of the observations that belong to each category
Frequency <-> Counts
A relative frequency distribution lists:
- Each of the categories
- The relative frequency, or the proportion (or percent), of the observations out of the total that belong to each category
Relative Frequency <-> Proportions (or percents)

Lecture 4, Week 3 (February 8th, 2017): Displaying Quantitative Data

Organizing Quantitative Data: The Popular Displays
- Organize discrete data in tables and histograms
- Organize continuous data in tables and histograms
- Draw stem-and-leaf plots
- Identify the shape of a distribution
- Draw time-series plots
Raw quantitative data comes as a list of numeric values... each value is a count or measurement, either discrete or continuous.
- Comparisons (one value being more than or less than another) can be performed on the data values
- Mathematical operations (addition, subtraction, ...) can be performed on the data values

Discrete Quantitative Data
Discrete quantitative data can be presented in tables and bar graphs in several of the same ways as qualitative data.
- Frequency/relative frequency distribution: values listed in a table; use the discrete values instead of the category names; list frequencies or relative frequencies
- Histogram (bar graph for discrete data): use the discrete values instead of the category names and arrange the values in ascending order. Unlike a bar graph for qualitative data, no space is left between the bars, and the width of the bars has meaning.

Frequency Tables
Good practices for constructing tables with continuous classes. The classes should:
- Not overlap
- Not have any gaps between them
- Have the same width (except for possible open-ended classes at the extreme low or extreme high ends)
- Cover the range of the data
The lower class limits should be "reasonable" numbers, and the class widths should be a "reasonable" number. Typically use 5-20 classes, with a larger number of classes for larger data sets. Select classes to provide a meaningful overall summary of the data:
- Too few classes causes the data to bunch
- Too many spread the data out so far that it is hard to detect patterns

Histograms
A histogram is a "picture" of a frequency/relative frequency table for quantitative data. To construct a histogram:
1. Construct the frequency or relative frequency table desired
2. Place the variable of interest on the horizontal axis
3. Place the lower class limits for each interval on the axis
4. Draw a rectangle above each interval
5. The height of each rectangle is proportional to the frequency or relative frequency for that class
Important points of histogram construction:
- Plot and label only the lower class limits, in between the bars
- Provide a descriptive title. Generic title: "Distribution of [Name of the Variable] for [Describe the Items]"
- Label the horizontal axis. Generic label: "Name of the Variable (in Units)"
- Label the vertical axis as in a bar graph

Stem-and-Leaf Plot
A stem-and-leaf plot is a different way to represent quantitative data that is similar to a histogram. To draw a stem-and-leaf plot, each data value must be broken up into two components.
In the simplest scenario:
- The stem consists of all the digits except for the right-most one
- The leaf consists of the right-most digit
Example: for the number 173, the stem would be "17" and the leaf would be "3": 17|3
To read a stem-and-leaf plot:
- Read the stem first
- Attach the leaf as the last digit of the stem
- The result is the original data value (after placement of the decimal point in some cases)
Stem-and-leaf plots display the same visual patterns as histograms (essentially a histogram turned on its side).
- Stem <-> Classes
- Leaves <-> Bars
Advantages:
- Contain more information than histograms; usually can recover the "raw" data
- "Quick" way to sort data
Disadvantages:
- Best used only with small data sets
- A histogram is more flexible in its choice of "classes"

Modifications to Stem-and-Leaf Plots
- To compare two sets of data, we could draw two stem-and-leaf plots using the same stem, with leaves going left (for one set of data) and right (for the other)
- There are cases where constructing a descending stem-and-leaf plot could also be appropriate (for test scores, for example)

Identifying Shapes of Distributions
A useful way to describe a quantitative variable is by the shape of its distribution. Some common shapes of distributions are:
- Uniform
- Symmetric: bell shaped, other symmetric shapes
- Asymmetric: right skewed, left skewed
- Unimodal, bimodal

Uniform: each of the values tends to occur with the same frequency; the histogram looks flat.
Bell-shaped (or mound-shaped): most of the values fall in the middle; the frequencies tail off to the left and to the right; it is symmetric (i.e., the left half is a mirror image of the right half).
Right-skewed: the distribution is not symmetric; the tail to the right is longer than the tail to the left; the arrow from the middle to the long tail points right.
Left-skewed: the distribution is not symmetric; the tail to the left is longer than the tail to the right; the arrow from the middle to the long tail points left.

Summary: Organizing Quantitative Data
Quantitative data can be organized in several ways:
- Histograms based on data values are good for discrete data
- Histograms based on classes (intervals) are good for continuous data
- The shape of a distribution describes a variable... histograms are useful for identifying the shapes

Time Plots
A time plot shows behavior over time.
- Time is always on the horizontal axis, and the variable being measured is on the vertical axis
- Look for an overall pattern (trend) and deviations from this trend. Connecting the data points by lines may emphasize the trend.
- Look for patterns that repeat at known regular intervals (seasonal variations)
Time-series data: a variable is measured at different points in time.
Time-series plot: time-series data (vertical axis) plotted against time (horizontal axis); lines are then drawn connecting the points.
- Identify long-term trends
- Identify regularly occurring patterns with time ("seasonality")

Outliers
Extreme values that fall outside the overall pattern of a distribution. Always look for outliers and try to explain them.
- May occur naturally
- May occur due to error in recording
- May occur due to error in measuring
- The observational unit may be fundamentally different

Lecture 5, Week 4 (February 13th, 2017): Summarizing Quantitative Data

Measures of Center
- Measures of center: mean, median
- Mean versus median

Overview
Three characteristics of a quantitative variable's distribution:
- Shape: visually, via graphs
- Center (typical value): numeric summary
- Spread (dispersion): numeric summary

Review: Populations vs. Samples
For populations:
- We know all of the data
- Descriptive measures of populations are called parameters
- Parameters are often written using Greek letters (e.g., µ).
For samples:
- We know only part of the entire data
- Descriptive measures of samples are called statistics
- Statistics are often written using Roman letters (e.g., x̄)

Descriptive Measures: Numerical Summaries
Center of the data: numeric values that represent the average or typical value of a quantitative variable. Examples of measures of central tendency:
- Mean
- Median
Variation, or measures of dispersion: numeric values that represent the degree to which the values are spread out:
- Range
- Quartiles (interquartile range)
- Variance
- Standard deviation

Central Tendency: Mean
The mean (arithmetic mean) of a variable is often what people mean by the "average": add up all the values and divide by the number of measurements in the data set.
Example: to compute the arithmetic mean of 6, 1, 5, add up the three numbers and divide by 3: (6+1+5)/3 = 4.0. The arithmetic mean is 4.0, reported to one more decimal place than the data.
One interpretation: the arithmetic mean can be thought of as the center of gravity, where the yardstick balances.
The mean, more accurately known as the arithmetic mean, is an arithmetic average of the elements of the data set.
- The mean of a sample of n measurements is denoted by x̄
- If the data are from a population, the mean is denoted by µ

Central Tendency: Median (M)
A resistant measure of the data's center.
- With ordered data (data arranged in ascending order), at least half of the ordered values are less than or equal to the median value
- With ordered data, at least half of the ordered values are greater than or equal to the median value
- If n is odd, the median is the middle ordered data value
- If n is even, the median is the average of the two middle ordered data values

Comparing the Mean and Median
The mean and median of data from a symmetric distribution should be close together. The actual (true) mean and median of a symmetric distribution are exactly the same.
In a skewed distribution, the mean is farther out in the long tail than is the median (the mean is "pulled" in the direction of the possible outliers).
- Symmetric: mean will usually be close to the median
- Skewed left: mean will usually be smaller than the median
- Skewed right: mean will usually be larger than the median

Resistant Statistic
What if one value is extremely different from the others? Example: what if we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2?
- The mean is now (6000 + 1 + 2)/3 = 2001.0
- The median is still 2.0
Conclusion: the median is "resistant to extreme values" while the mean is not resistant.

Summary: Central Tendency
- Mean: center of gravity; used for roughly symmetric quantitative data with no extreme values
- Median: splits the data into halves; useful for highly skewed quantitative data (or data with extreme values)

Measures of Spread
- Measures of spread: quartiles, standard deviation
- Five-number summary and box plot
- Choosing among summary statistics
- Changing the unit of measurement

Spread, or Variability
If all values are the same, then they all equal the mean and there is no variability. Variability exists when some values are different from (above or below) the mean.
- A measure of center alone can be misleading
- A useful numerical description of a distribution requires both a measure of center and a measure of spread

Measuring Spread or Variability
Measures of spread, or measures of dispersion: numerical values that represent the degree to which the values are spread out.
- The range
- The quartiles: first quartile (25th percentile), second quartile (50th percentile, the median), third quartile (75th percentile)
- The variance
- The standard deviation

Range
The range of a variable is the largest data value (the maximum) minus the smallest data value (the minimum). The range uses only two values in the data set, the largest and smallest, so the range is not resistant.
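The resistance comparison above can be checked with Python's standard statistics module, using the 6, 1, 2 example from the notes:

```python
import statistics

clean = [6, 1, 2]        # original data
corrupt = [6000, 1, 2]   # one value mistakenly recorded as 6000

# The mean shifts dramatically because of the bad value...
print(statistics.mean(clean))      # 3
print(statistics.mean(corrupt))    # 2001

# ...but the median is resistant to the extreme value:
print(statistics.median(clean))    # 2
print(statistics.median(corrupt))  # 2

# The range uses only the max and min, so it is not resistant either:
print(max(corrupt) - min(corrupt))  # 5999
```

Running the same data through the mean and the median side by side makes the idea of a "resistant statistic" concrete: only the measure built from ordering, not summation, ignores the extreme value.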
Quartiles
Three numbers which divide the ordered data into four equal-sized groups:
- Q1 has 25% of the data below it
- Q2 has 50% of the data below it (the median)
- Q3 has 75% of the data below it

Obtaining the Quartiles
- Order the data
- For Q2, just find the median
- For Q1, look at the lower half of the data values (those to the left of the median location) and find the median of this lower half
- For Q3, look at the upper half of the data values (those to the right of the median location) and find the median of this upper half

Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1. The IQR is a resistant measure of dispersion.

Five-Number Summary
The five-number summary gives a concise description of the distribution of a variable:
- Smallest value (Min)
- First quartile (Q1)
- Median (M or Q2)
- Third quartile (Q3)
- Largest value (Max)

Boxplot
A box plot is a graphical representation of the five-number summary.
- The central box spans Q1 to Q3
- A line in the box marks the median M (Q2)
- Lines extend from the box out to the minimum and maximum (these lines are sometimes called whiskers)

Outliers
Outliers are extreme observations in the data.
They are values that are significantly too high or too low, based on the spread of the data. Outliers should be identified and investigated. Outliers could be:
- Chance occurrences
- Measurement errors
- Data entry errors
- Sampling errors
Outliers are not necessarily invalid data.
Fence rule for checking for outliers using the quartiles. Calculate lower and upper fences:
- Lower fence = LF = Q1 - (1.5 x IQR)
- Upper fence = UF = Q3 + (1.5 x IQR)
Values (strictly) less than the lower fence or (strictly) greater than the upper fence could be considered outliers.

Variance
The variance is based on the deviations from the mean: (xᵢ - µ) for populations, (xᵢ - x̄) for samples.

Population Variance
The population variance (denoted σ²) of a variable is the sum of these squared deviations divided by the number in the population (i.e., the population size). In other words, we are calculating the average squared distance of each of the data points from the mean.

Sample Variance
The sample variance (denoted s²) of a variable is the sum of these squared deviations divided by one less than the number in the sample (i.e., sample size minus 1). We say that this statistic has n - 1 degrees of freedom.

Standard Deviation
The standard deviation is the square root of the variance.
- The population standard deviation is the square root of the population variance (σ²) and is represented by σ
- The sample standard deviation is the square root of the sample variance (s²) and is represented by s
The variance and standard deviation are not resistant measures of dispersion.

Lecture 6, Week 4 (February 20th, 2017): Examining Relationships

Response and Explanatory Variables
- Response variable (dependent variable): the outcome variable on which comparisons are made
- Explanatory variable (independent variable): when the explanatory variable is categorical, it defines the groups to be compared with respect to values on the response variable.
When the explanatory variable is quantitative, it defines the change in different numerical values to be compared with respect to the values of the response variable.
Examples (Response / Explanatory):
- Survival status / Smoking status
- Carbon dioxide (CO2) level / Amount of gasoline used by cars
- College GPA / Number of hours a week spent studying
If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the possibilities for "role-type classification":
- Categorical explanatory and categorical response (C -> C)
- Categorical explanatory and quantitative response (C -> Q)
- Quantitative explanatory and categorical response (Q -> C)
- Quantitative explanatory and quantitative response (Q -> Q)

Association Between Two Variables
The main purpose of data analysis with two variables is to investigate whether there is an association and to describe that association. An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Relative Frequency Table
- Relative to the grand total
- Points out the largest and smallest cells
- Points out dominant categories in the margins

Contingency Table
A contingency table displays two categorical variables:
- The rows list the categories of one variable
- The columns list the categories of the other variable
- Entries in the table are frequencies

Categorical Explanatory Variable and Quantitative Response Variable
When exploring the relationship between a categorical explanatory variable and a quantitative response (Case C -> Q), we essentially compare the distributions of the quantitative response for each category of the explanatory variable, using side-by-side boxplots supplemented by descriptive statistics. A two-way table (contingency table) is a great way to summarize the data in a study of this kind.
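A contingency table like the one described above can be tabulated directly from raw (explanatory, response) pairs. A minimal sketch in Python; the smoking/survival observations below are made-up illustrative values, not data from the lecture:

```python
from collections import Counter

# Hypothetical (explanatory, response) observations: smoking status vs. survival status
observations = [
    ("smoker", "survived"), ("smoker", "died"),
    ("nonsmoker", "survived"), ("nonsmoker", "survived"),
    ("smoker", "survived"), ("nonsmoker", "died"),
]

# Each (row category, column category) pair is one cell of the contingency table
cells = Counter(observations)

rows = ["smoker", "nonsmoker"]   # categories of the explanatory variable
cols = ["survived", "died"]      # categories of the response variable

for r in rows:
    counts = [cells[(r, c)] for c in cols]
    row_total = sum(counts)
    # Conditional relative frequencies let us compare the response
    # distribution across the explanatory categories
    rel = [round(n / row_total, 2) for n in counts]
    print(r, counts, rel)
```

Comparing the conditional relative frequencies row by row is exactly the "compare distributions of the response across categories" analysis the notes describe for two categorical variables.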
Exploring the relationship between two categorical variables amounts to comparing the distributions of the response variable across the different categories of the explanatory variable.

Quantitative Explanatory Variable and Quantitative Response Variable
When exploring the relationship between a quantitative explanatory variable and a quantitative response (Case Q -> Q), we essentially want to see if there is a relationship between the explanatory variable and the response variable. This analysis requires a different type of treatment. The first step in exploring the relationship between these variables is to create an appropriate and informative graphical display. The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot. When describing the relationship between two quantitative variables, we describe the overall pattern of the distribution (shape, center, spread) and any deviations from that pattern (outliers).

Roles of the Variables
- Response variable (y): the variable of interest; it is what we want to predict
- Explanatory or predictor variable (x): the variable that we use to provide information or a prediction of the response variable
Choosing the response variable and the explanatory variable depends on how we think about the problem.

Scatterplots
Scatterplots exhibit the relationship between two variables. They are used for detecting patterns, trends, relationships, and extraordinary values.
The Direction of the Association
- Negative direction: as one goes up, the other goes down
- Positive direction: as one goes up, the other goes up also
- No direction

Form of the Relationship
- Linear: the points cluster near a line
- Gently curves in a direction: may be able to straighten with a transformation
- Curves up and down: difficult to straighten

Strength of the Relationship
- Strong linear relationship
- Moderate linear relationship
- No linear relationship

Outliers
An outlier is a point on a scatterplot that stands away from the overall pattern of the scatterplot. Outliers are almost always interesting and always deserve special attention.

Correlation
Scatter diagrams: positive vs. negative, strong vs. weak linear correlations.
Linear correlation: the strength of the linear association between two quantitative variables.
Assumptions:
- Quantitative variables: both variables must be quantitative
- Straight enough condition: their relationship must be reasonably straight (use a scatterplot)
- No outliers

Correlation Coefficient (r)
The numerical measure that assesses the strength of a linear relationship is called the correlation coefficient, and is denoted by r. Correlation, r: a numerical measure of the strength and direction of a linear relationship between two quantitative variables.
Remember the assumptions and conditions for the correlation coefficient:
- To use r, there must be a true underlying linear relationship between the two variables
- The variables must be quantitative
- The pattern of the points on the scatterplot must be reasonably straight
- Outliers can strongly affect the correlation.
Look at the scatterplot to make sure that there are no strong outliers.

Interpretation of the Correlation Coefficient
If all assumptions are met:
- Negative values of r indicate a negative linear relationship between the variables
- Positive values of r indicate a positive linear relationship between the variables
- Values of r that are close to 0 (either negative or positive) indicate a weak linear relationship
- Values of r that are close to 1 indicate a strong positive linear relationship
- Values of r that are close to -1 indicate a strong negative linear relationship

Properties of Correlation
- r > 0 -> positive linear association
- r < 0 -> negative linear association
- -1 ≤ r ≤ 1, with r = -1 only if the points all lie exactly on a negatively sloped line and r = 1 only if the points all lie exactly on a positively sloped line
- Interchanging x and y does not change the correlation
- r has no units
- Changing the units of x or y does not affect r; measuring in dollars, cents, or euros will all produce the same correlation
- Correlation measures the strength of the linear association between the two variables
- Correlation is sensitive to outliers.
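These properties can be verified numerically. A sketch of computing r from its definition (the helper function and the sample points are illustrative, not from the notes), including what one extreme outlier does to it:

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points exactly on a positively sloped line -> r = 1
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(correlation(x, y))        # 1.0

# Replacing the last point with an extreme outlier flips r negative
y_out = [2, 4, 6, 8, -40]
print(correlation(x, y_out))    # about -0.62
```

Note that r is built entirely from deviations about the means, which is why interchanging x and y, or rescaling either variable's units, leaves r unchanged.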
An extreme outlier can cause a dramatic change in r. The adjectives weak, moderate, and strong can describe correlation, but there are no agreed-upon boundaries.

Linear Regression
Regression: the technique that specifies the dependence of the response variable (y) on the explanatory variable (x).
Linear regression: the technique that specifies the linear dependence of the response variable (y) on the explanatory variable (x).
- ŷ is called the predicted value
- For each point (x, y), look at the point (x, ŷ) on the regression line with the same x-coordinate
- The residual is defined by e = y - ŷ; it is the difference between the observed value and the predicted value
Regression equation: Y = a + bX (X is explanatory, Y is dependent)

More on Residuals
- Residual: observed - predicted
- Points above the line have positive residuals
- Points below the line have negative residuals

The Line of Best Fit
The line from algebra: y = mx + b, where m is the slope (the change in y for every increase of 1 unit in x) and b is the intercept (the value of y when x is 0).
Line of best fit (regression model): ŷ = β₁x + β₀
- β₁ is the slope: how rapidly ŷ changes with respect to x
- β₀ is the y-intercept: the value of ŷ when x is 0

Conditions for Using Regression
The line of best fit is also called the least squares line or the regression line. Only use the regression line to make predictions if:
- The variables are quantitative
- The relationship is straight enough
- There are no outliers
- The residuals are independent and normally distributed with a mean of 0 and a constant standard deviation (independent and identically distributed)

STAT100 Exam 2, Week 7, Lecture 7 (March 8th, 2017)

Review: Population and Sample
Researchers often want to answer questions about some large group of individuals. (This group is called the population.)
Often the researchers cannot measure (or survey) all individuals in the population, so they measure a subset of individuals chosen to represent the entire population. (This subset is called a sample.) A sample design describes exactly how to choose a sample from the population. The researchers then use statistical techniques (called statistical inference) to make conclusions about the population based on the sample.

Choosing a Representative Sample
Sample survey: selects a sample from the population of all individuals about which we desire information. How do we decide which individuals to include in the sample? Goals:
- Individuals in the sample are representative of the population (provide accurate information about the population)
- Minimize the cost of obtaining the sample (money, time, personnel, etc.)

Biased Sample
A sample is biased if it is not representative of the population. Biased samples tend to systematically overrepresent certain segments of the population and systematically underrepresent other segments. Important: because a biased sample does not represent the entire population, we cannot accurately infer anything about the population of interest from such a sample.

Sampling Bias
Arises from the manner in which individuals are selected for the sample: individuals are included in the sample without known probabilities of being selected.
Sampling frame: the list of individuals from which the sample is actually drawn; the "population" which is actually sampled. If the sampling frame differs from the population of interest, some individuals in the population are left out of the process of choosing the sample (undercoverage). Example: use of a telephone book to select a sample of households (not everyone has a telephone in their house).

Convenience Sample
The sample is identified primarily by convenience.
The individuals most readily accessible are selected. Example: a professor conducting research might use student volunteers to constitute a sample (since students in his class are more convenient to sample).

Judgment Sample
A person knowledgeable on the subject of the study selects individuals of the population that he or she feels are most representative of the population. The quality of the sample results depends on the judgment and favoritism of the person selecting the sample. Example: a reporter might sample three or four senators, judging them as reflecting the general opinion of the Senate.

Voluntary Response Sample
A sample collected in such a way that members of the population decide for themselves whether or not to be included in the sample. A voluntary sample tends to be biased, as the sample is overrepresented by individuals with strong opinions, which are often negative. Example: call-in polls.

Random Sample
Random sample (probability sample): a sample chosen by impersonal chance. The use of chance to select the sample is the essential principle of random sampling. Chance selection eliminates bias in the choice of a sample by giving each individual in the sampling frame a known chance of being in the sample; each sample also has a known chance of being selected. Must know:
- What samples are possible
- What chance, or probability, each possible sample has of being selected

Types of Random Samples
- Simple random sampling (SRS)
- Stratified sampling
- Systematic sampling
- Cluster sampling

Simple Random Sampling (SRS)
- Each individual in the sampling frame has an equal chance of being selected
- Each possible sample of a given size has an equal chance of being the sample ultimately selected
- Choose an SRS by labeling the individuals in the population and using a table of random digits or software to select a sample
Example: for a simple random sample of size n = 2 from a population of size N = 4, each of the 6 possible samples has an equally likely chance of being selected
(Or names could simply be drawn from a hat.)
Stratified Sampling
Select simple random samples from subpopulations. To obtain a stratified random sample:
1. Divide the sampling frame into subpopulations (groups of individuals), called strata, such that each individual in the sampling frame belongs to one and only one stratum.
2. Choose a separate SRS from each stratum.
3. Combine these SRSs to form the full sample.
Example: Suppose we want to collect data on the student body at a college. Instead of selecting students at random from the whole student body, we could select random samples from the freshmen, sophomores, juniors, and seniors individually (25 from each group).
Strata: Best results are obtained when the individuals within each stratum are as much alike as possible (i.e., a homogeneous group) in some way important to the study. Strata are chosen based on facts known before the sample is taken, such as department, location, age, industry type, etc. Strata in sampling are similar to blocks in experiments.
Systematic Sampling
Order the individuals in the sampling frame and then select every kth member. To obtain a systematic sample of size n from a sampling frame containing N individuals:
1. Choose a starting individual at random from the first k = N/n individuals on the ordered list.
2. Thereafter, select every kth individual from the ordered list.
Example: Selecting every 100th listing in a telephone book after the first randomly selected listing among the first 100 listings.
This method has properties similar to an SRS, especially if the ordering of the individuals on the list is random.
Advantage: The sample usually is easier to identify than it would be with an SRS.
Disadvantage: The sample may be poor if some characteristic is confounded with the ordering of the individuals.
Cluster Sampling
Select groups (clusters) at random, not individuals. To obtain a cluster sample:
Group individuals into subsets, called clusters, such that each individual in the sampling frame belongs to one and only one cluster.
Choose an SRS of clusters. Combine the individuals in each sampled cluster to form the full sample.
Example: Suppose we want to select a sample of 100 hot dogs and test them for their chemical content. Instead of sampling individual hot dogs from 100 packages, we could select 10 packages (each containing 10 hot dogs).
Clusters: Ideally, each cluster is a representative small-scale version of the population (i.e., a heterogeneous group).
Advantage: A primary application is area sampling, where clusters are city blocks or other well-defined areas. The close proximity of individuals can be cost effective (many sample observations can be obtained in a short time).
Disadvantage: This method generally requires a larger total sample size than simple or stratified random sampling to obtain equally precise information.

Week 8 Lecture 8 March 13th, 2017
Probability and Counting Techniques
Sample Spaces
Probability Rules
Assigning Probabilities
Counting Rules:
• Products
• Permutations
• Combinations
We use the 3 counting rules to determine event probabilities when the outcomes of the experiment are equally likely. We can use long-run relative frequency as our estimate of the probability of an event.
Probability
The idea of probability is empirical: it is based on observation rather than theorizing. Probability describes what happens over many trials.
• The long-run proportion of heads after a great many flips is 1/2
• This is called the Law of Large Numbers: as the number of repetitions of an experiment increases, the proportion with which a certain outcome is observed gets closer to the probability of that outcome
Terms and Definitions
An experiment is a repeatable process whose results are uncertain. An outcome is one specific possible result. The set of all possible outcomes is the sample space.
The number of outcomes can be finite or infinite.
Example: Experiment: roll a fair 6-sided die. One of the outcomes: roll a "4". The sample space: the numbers 1-6.
An event is a collection of possible outcomes.
• Events are identified by capital letters (A, B, A1, A2, etc.)
• Simple event: only one outcome
• Compound event: more than one outcome
Simple event example
• E = roll a two
• E consists of the single outcome e2 = "roll a 2"
Compound event example
• E = roll an even number
• E consists of the outcomes e2 = roll a 2, e4 = roll a 4, and e6 = roll a 6, i.e., {2, 4, 6}
Relationships Among Events — 1
Complement: A' is the complement of A
• aka "not A"
• If A = Head, then A' = Tail in a coin-flip experiment
Union: A or B (or both), written A U B (or A U B U C)
• If A = economists, B = professors, C = women
• A U B = all economists and all professors
• A U B U C = all economists, all professors, and all women
Relationships Among Events — 2
Intersection: A and B, written A n B (or A n B n C)
• A n B = all economists who are professors
• A n B n C = all economists who are female professors
Mutually exclusive events are disjoint events
• One event precludes the other
• Are B and C mutually exclusive? Are B and B' mutually exclusive?
Null event: ø; for example, B n B' = ø
Probability Rules: Mathematical Notation
1. Any probability is a number between 0 and 1.
2. All possible outcomes together must have probability 1 (100%).
3. If two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities.
4.
The probability that an event does not occur is 1 minus the probability that the event does occur.
Rule 1: The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1
Rule 2: If S is the sample space in a probability model, then P(S) = 1
Rule 3: If A and B are disjoint, P(A or B) = P(A) + P(B). This is the addition rule for disjoint events.
Rule 4: The complement of any event A is the event that A does not occur, written A'. P(A') = 1 - P(A)
Probability of an Event
• Subjective probability, or degree of belief (different individuals may assign different values): no real mathematical model
• Relative frequency (empirical): P(event) = (number of times the outcome of interest occurs) / (number of times the experiment is run)
Understanding Rules of Probability
Probability models must satisfy these rules. Some special types of events:
• If an event is impossible, then its probability must equal 0 (it can never happen)
• If an event is a certainty, then its probability must equal 1 (it always happens)
• An unusual event is one that has a low probability of occurring. Typically, probabilities of 5% or less are considered low.
Compute and Interpret Probabilities Using the Classical Method
The classical method applies to situations where all possible outcomes have the same probability; this is also called equally likely outcomes. The general formula is
P(E) = (number of ways E can occur) / (number of possible outcomes)
If we have an experiment with n equally likely outcomes (N(S) = n) and the event E consists of m of them (N(E) = m), then P(E) = m/n = N(E)/N(S)

Week 8 Lecture 9 March 15th, 2017
Random Variables and Probability Models
Random Variables
Descriptions of chance behavior contain two parts: a list of possible outcomes and a probability for each outcome.
A probability model describes the possible outcomes of a chance process and the likelihood that those outcomes will occur. A random variable is a variable whose value is a numerical outcome of a random phenomenon.
• Often denoted with capital letters (X, Y, etc.)
• A normal random variable may be denoted X ~ N(µ, σ)
• The probability model for a random variable is its probability distribution
A probability model with a finite sample space is called finite.
Probability Distribution
The probability distribution of a random variable X tells us what values X can take and how to assign probabilities to those values.
Example: Consider tossing a fair coin 3 times. Define X = the number of heads obtained.
X = 0: TTT
X = 1: HTT THT TTH
X = 2: HHT HTH THH
X = 3: HHH
Random Variables
There are two types of random variables: discrete and continuous.
• Random variables that have a finite (countable) list of possible outcomes, with probabilities assigned to each of these outcomes, are called discrete
• Random variables that can take on any value in an interval, with probabilities given as areas under a density curve, are called continuous
Discrete Probability Model
The discrete probability model of a discrete random variable X relates the values of X with their corresponding probabilities.
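The coin-tossing distribution above can be tabulated directly by listing the 8 equally likely outcomes; this is a sketch, not code from the notes.

```python
# Build the distribution of X = number of heads in 3 tosses of a fair coin
# by enumerating all 2^3 = 8 equally likely outcomes.
from itertools import product

outcomes = list(product("HT", repeat=3))      # ('H','H','H'), ..., ('T','T','T')
dist = {}
for o in outcomes:
    x = o.count("H")                          # value of the random variable X
    dist[x] = dist.get(x, 0) + 1 / 8          # each outcome has probability 1/8

print(dist)   # {3: 0.125, 2: 0.375, 1: 0.375, 0: 0.125} up to key order
```

The probabilities 1/8, 3/8, 3/8, 1/8 sum to 1, as every probability distribution must.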
(This is also called the discrete probability distribution.) The distribution/model could be:
• In the form of a table
• In the form of a graph
• In the form of a mathematical formula
If X is a discrete random variable and x is a possible value for X, then we write P(X = x) for the probability that X is equal to x.
Examples:
• In tossing one coin, if X is the number of heads, then P(X = 0) = 0.5 and P(X = 1) = 0.5
• In rolling one die, if X is the number rolled, then P(X = 1) = 1/6
Properties of P(X): The probability distribution of a discrete random variable X lists the values xi and their probabilities pi:
Value: x1, x2, x3, ...
Probability: p1, p2, p3, ...
Since the P(X = x) form a probability distribution, they must satisfy the rules of probability:
1. 0 ≤ P(X = x) ≤ 1
2. ∑ P(X = x) = 1
In the first rule, every probability pi = P(X = xi) is a number between 0 and 1, inclusive. In the second rule, the ∑ sign means to add up the P(X = x)'s for all the possible x's.

Week 9 Lecture 10 March 27th, 2017
Density Curves and Normal Distributions
Exploring a Distribution
Always plot your data: make a graph (histogram, stemplot, normal probability plot, box plot, dot plot, etc.). Look for overall patterns (shape, center, spread) and for striking deviations such as outliers. Calculate a numerical summary to briefly describe center and spread. Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve; many things are "distributed" this way.
For a discrete random variable:
E(X) = ∑ xi P(X = xi) = µ
V(X) = ∑ (xi - µ)^2 P(X = xi) = σ^2
If X is continuous, then there is a pdf (probability density function) f(x) such that:
1. f(x) ≥ 0 for all x in the domain
2. The total area beneath the curve of f(x) is 1
3.
P(a ≤ x ≤ b) = the area beneath the curve between a and b
Properties of the Normal Distribution
In drawing the normal curve, the mean µ and the standard deviation σ have specific roles:
• The mean µ is the center of the curve
• The values (µ - σ) and (µ + σ) are the inflection points of the curve
Empirical Rule — The 68-95-99.7 Rule
The Empirical Rule can be used to determine the proportion (also known as the area, probability, or percentage) of the variable values within a specified number of standard deviations of the mean, provided the variable's distribution is approximately bell shaped.
Empirical Rule: If the distribution (i.e., histogram) is roughly bell shaped, then:
• Approximately 68% of the data will lie within 1 standard deviation of the mean
• Approximately 95% of the data will lie within 2 standard deviations of the mean
• Approximately 99.7% of the data will lie within 3 standard deviations of the mean
The standard deviation is very useful for estimating the percentage of the observations with values within certain intervals about the mean. The area in one tail:
(100 - 99.7) ÷ 2 = 0.15% (beyond 3 standard deviations)
(100 - 95) ÷ 2 = 2.5% (beyond 2 standard deviations)
(100 - 68) ÷ 2 = 16% (beyond 1 standard deviation)
Different values of the mean shift the curve left and right; different values of the standard deviation change its spread (a larger σ gives a flatter, wider curve).
Properties of the Normal Density Curve
1. The curve is symmetric about the mean.
2. The mean = median = mode, so the highest point of the curve is at x = µ.
3. The curve has inflection points at (µ - σ) and (µ + σ).
4. The total area under the curve is equal to 1.
5. The area under the curve to the left of the mean equals the area under the curve to the right of the mean. (So, by symmetry, the area to the left of the mean equals 0.5, and the area to the right of the mean equals 0.5.)
6. As x gets larger and larger (in either the positive or negative direction), the graph approaches but never reaches the horizontal axis.
The Area Under a Normal Curve
Suppose a random variable X is normally distributed with mean µ and standard deviation σ.
The area under the normal curve for any interval of values represents either:
• The proportion of the population with the characteristic described by the interval of values, or
• The probability that a randomly selected individual from the population will have the characteristic described by the interval of values
So, the area under a normal curve is a proportion or a probability. Since there is no area under the normal curve associated with a single value, the probability of observing a specific value for a normal random variable is 0; we only get proportions/probabilities for a range of values.
Finding Normal Proportions (Probabilities): The Standard Normal Distribution
• Find the area (proportion or probability) under the standard normal curve
• Find Z-scores for a given area
• Interpret the area under the standard normal curve as a probability
There are several ways to calculate the area under the standard normal curve. What does not work: some kind of a simple formula. We can use a table (such as Table A), or we can use technology (a calculator or software).
Three different area calculations:
• Find the area to the left of a boundary: P(Z < a) represents the probability a standard normal random variable is less than a
• Find the area to the right of a boundary: P(Z > a) represents the probability a standard normal random variable is greater than a
• Find the area between two boundaries: P(a < Z < b) represents the probability a standard normal random variable is between a and b
"Area to the left of" — Using a Table
Calculate the area to the left of Z = 1.68, i.e., P(Z < 1.68):
• Break up 1.68 as 1.6 + 0.08
• Find the row 1.6
• Find the column 0.08
The probability is 0.9535 = P(Z ≤ 1.68) = P(Z < 1.68)
"Area to the right of" — Using a Table
Calculate the area to the right of Z = 1.68: The area to the left of Z = 1.68 is 0.9535, and the "area to the right of" is the remaining amount.
The two add up to 1, so "area to the right of" = 1 - "area to the left of".
Area to the right of Z = 1.68: 1 - 0.9535 = 0.0465
The proportion/probability is 0.0465 = P(Z > 1.68) = P(Z ≥ 1.68)
"Area in between" — Using a Table
Calculate the area between Z = -0.51 and Z = 1.87. This is not a one-step calculation.
P(-0.51 ≤ Z ≤ 1.87) = P(-0.51 ≤ Z < 1.87) = P(-0.51 < Z ≤ 1.87) = P(-0.51 < Z < 1.87)
These are different ways to write the same proportion/probability: since P(Z = z) = 0, it doesn't matter whether we use < or ≤. Similarly, it doesn't matter whether we use > or ≥.
We want: the area between Z = -0.51 and Z = 1.87. What we know how to calculate: the area to the left of Z = 1.87. That is too much by the area to the left of -0.51, which we also know how to calculate, so we "correct" by subtracting the excess area.
To calculate the area between Z = -0.51 and Z = 1.87:
P(-0.51 ≤ Z ≤ 1.87) = P(Z ≤ 1.87) - P(Z ≤ -0.51)
• Find the area to the left of 1.87, which is 0.9693
• Find the area to the left of -0.51, which is 0.3050
• Take the difference of these two areas: 0.9693 - 0.3050 = 0.6643
The proportion/probability is 0.6643 = P(-0.51 ≤ Z ≤ 1.87)
Standard Normal Distribution
The standard normal distribution is the normal distribution with mean 0 and standard deviation 1: N(0, 1). If a variable x has any normal distribution with mean µ and standard deviation σ, i.e., x ~ N(µ, σ), then we can standardize the variable using the following calculation:
z = (x - µ) ÷ σ
The Standardized Score (Z-Score)
The standardized score tells how many standard deviations a particular data value (x) lies from the mean of the random variable X's distribution.
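The table lookups above can be reproduced in software. One way, shown as a sketch, uses the identity Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) for the standard normal CDF, which needs only Python's math module.

```python
# Standard normal areas without a printed table, via the error function.
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

left = phi(1.68)                   # P(Z < 1.68)  ≈ 0.9535
right = 1 - phi(1.68)              # P(Z > 1.68)  ≈ 0.0465
between = phi(1.87) - phi(-0.51)   # P(-0.51 < Z < 1.87) ≈ 0.664
```

These agree with the table-based answers above to the table's four decimal places of rounding.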
Observed Value for a Standardized Score
We need to "unstandardize" the z-score to find the observed value (x):
z = (x - µ) ÷ σ  —>  x = µ + zσ
Observed value = mean + (standardized score) x (standard deviation)

Week 9 Lecture 11 March 29th, 2017
Binomial Distributions
Binomial Probability Distribution
• A special discrete probability distribution
• Describes probabilities for experiments that have exactly two mutually exclusive outcomes
• One outcome is called a success
- The word "success" does not mean that this is a "good" outcome or that we want this to be the outcome. It is simply the outcome that we are looking for.
- For example, if we are looking at a cobra strike on an animal, a success is that the animal dies.
• The other outcome is called a failure
- In the above example, the failure would be that the animal does not die.
Definition of a Binomial Experiment
Definition: A binomial experiment is an experiment with the following characteristics:
1. The experiment is performed a fixed number of times, n; each repetition is called a trial. So, there is a fixed number of trials.
2. The n trials are independent. (The outcome of one trial does not affect the outcome of any other trial.)
3. Each trial has only two possible outcomes, usually called success and failure.
4. The probability of success is the same for each trial of the experiment.
Notation for a Binomial Experiment
Notation used for binomial experiments:
• The number of trials is represented by n.
• The probability of success (in a single trial) is represented by p.
• The total number of successes in n trials is represented by the random variable X. (X is called a binomial random variable.)
Because there cannot be a negative number of successes, and because there cannot be more than n successes (out of n attempts): 0 ≤ X ≤ n. The probability distribution of a binomial random variable is called a binomial distribution.
Binomial Setting Example 1
In a shipment of 100 televisions, how many are defective? The probability of a television being defective is 0.08.
n = 100
p = 0.08
X = number of defective televisions in the shipment of 100
X is a binomial random variable since the trials (televisions) are independent, there is a fixed number of televisions (n = 100), and the probability of success (being defective) is the same for each television (p = 0.08).
X ~ Binomial(100, 0.08)
Summary for the Binomial Distribution
Binomial experiment:
• Fixed number of trials, n
• Only two outcomes for each trial, success or failure
• The n trials are independent
• The probability of a success, p, is the same for each trial
Let X = the count of successes in a binomial setting. The distribution of X is the binomial distribution with parameters n and p: X ~ Binomial(n, p).
• n is the number of trials/observations
• p is the probability of a success on any one observation (p must be the same for each trial)
• The random variable X takes on whole values between 0 and n
Binomial Probabilities
Find the probability that a binomial random variable X takes any particular value. X is the number of successes out of n trials/observations:
P(x successes out of n observations) = P(X = x)
Mean and Standard Deviation
If X has the binomial distribution with n observations and probability p of success on each observation, then the mean and standard deviation of X are
µ = np
σ = √(np(1 - p))

Week 10 Lecture 12 April 5th, 2017
Sampling Distributions
The Sampling Distribution of the Sample Mean
Population Distribution vs.
Sampling Distribution
The Mean and Standard Deviation of the Sample Mean
Sampling Distribution of a Sample Mean
Central Limit Theorem
Sampling Distribution of the Sample Proportion
Sampling Terminology
Parameter
• A number that describes the population
• In practice, the value is an unknown number
• For example:
- µ, population mean
- σ, population standard deviation
- p, population proportion
Statistic
• A known value calculated from a sample
• A statistic is often used to estimate a parameter
• For example:
- (X-bar), sample mean
- s, sample standard deviation
- (p-hat), sample proportion
A Statistic Estimates a Parameter
• (X-bar), the sample mean, estimates µ, the population mean
• s, the sample standard deviation, estimates σ, the population standard deviation
• (p-hat), the sample proportion, estimates p, the population proportion
Statistics come from samples; parameters come from the population.
Population and Sample
The process of statistical inference involves using information from a sample to draw conclusions about a wider population. Different random samples yield different statistics. We need to be able to describe the sampling distribution of possible statistic values in order to perform statistical inference. We can think of a statistic as a random variable because it takes numerical values that describe the outcomes of the random sampling process. Collect data from a representative sample, then make an inference about the population.
Sampling Terminology (Continued)
Variability
• Different samples from the same population may yield different values of the sample statistic
Sampling Distribution of a Statistic
• The distribution of values taken by the statistic in all possible samples of the same size from the same population; it tells what values the statistic takes and how often it takes those values in repeated sampling
Parameter vs. Statistic
The mean of a population is denoted by µ; this is a parameter. The mean of a sample is denoted by (X-bar); this is a statistic.
(X-bar) is used to estimate µ. The true proportion of a population with a certain trait is denoted by p; this is a parameter. The proportion of a sample with that trait is denoted by (p-hat); this is a statistic. (p-hat) is used to estimate p.
The Law of Large Numbers
Law of Large Numbers: as the sample size increases, the sample mean gets closer to the population mean. That is, the difference between the sample mean and the population mean tends to become smaller (i.e., approaches zero). In symbols: (X-bar) gets closer to µ.
Central Limit Theorem
The Central Limit Theorem states: Regardless of the shape of the population distribution, the sampling distribution of the sample mean becomes approximately normal as the sample size n increases.
• If the random variable X (i.e., the population) is normally distributed, then the sampling distribution of the sample mean is normally distributed for any sample size.
• For all other random variables X (i.e., other populations), the sampling distribution of the sample mean is approximately normally distributed if n is 30 or higher.
Sampling Distribution of the Mean
We can estimate the mean of the sampling distribution, µ(x-bar), by averaging the sample means from n samples: ([x-bar]1 + [x-bar]2 + ... + [x-bar]n) ÷ n. The sampling distribution of the sample mean does not necessarily look like the population distribution.
Behavior of the Sampling Distribution
The distribution of measurements in a sample looks like the distribution in the parent population, NOT necessarily like a normal curve. The sampling distribution of the sample mean looks more and more like a normal curve as the sample size increases, even when the parent population is definitely NOT normal.
As the sample size increases, the sample mean gets closer to the population mean; i.e., the difference between the sample mean and the population mean tends to become smaller, approaching zero (Law of Large Numbers). The spread in the histograms of the sampling distribution of the sample mean gets smaller for larger sample sizes, causing less variation in the measurement.
Mean and Standard Deviation of the Sampling Distribution of the Sample Mean
Mean of the sampling distribution of the sample mean: There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, the mean of the sampling distribution is an unbiased estimate of the population mean µ.
Standard deviation of the sampling distribution of the sample mean: The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. It is smaller than the standard deviation of the population by a factor of √n. Averages are less variable than individual observations.
Summary of Properties of the Sampling Distribution
The sampling distribution of the sample mean has several important properties. If a simple random sample of size n is drawn from any large population, then the sampling distribution of the sample mean has:
• Mean: µ(x-bar) = µ. (The mean of the sampling distribution of the sample mean equals the population mean; the sample mean is an unbiased estimator of µ.)
• Standard deviation, called the standard error of the mean: σ(x-bar) = σ ÷ √n. (As the sample size increases, the standard error of the sample mean gets smaller. Averages are less variable than individual observations.)
In addition, if the population is normally distributed, then the sampling distribution is normally distributed.
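These properties can be checked by simulation. The sketch below (population, sample size, and repetition count are all made up for illustration) draws many samples of size n from a decidedly non-normal Uniform(0, 1) population and verifies that the sample means center on µ with spread close to σ ÷ √n.

```python
# Simulate the sampling distribution of the sample mean from a uniform population.
import math
import random
import statistics

random.seed(1)                              # seed only for a repeatable sketch
mu = 0.5                                    # mean of Uniform(0, 1)
sigma = math.sqrt(1 / 12)                   # sd of Uniform(0, 1), about 0.2887
n = 36                                      # sample size

# Draw 5000 samples of size n and record each sample mean.
xbars = [statistics.mean(random.random() for _ in range(n))
         for _ in range(5000)]

print(statistics.mean(xbars))               # close to mu = 0.5
print(statistics.stdev(xbars))              # close to sigma / sqrt(n), about 0.048
```

A histogram of `xbars` would look roughly bell shaped even though the parent population is flat, illustrating the Central Limit Theorem.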
Standard Error: the standard deviation of a sampling distribution. For the sample proportion the standard error is √(p(1 - p) ÷ n), and approximately (p-hat) ~ N(p, standard error). For the sample mean,
z = ((x-bar) - µ) ÷ (σ ÷ √n)
Conditions for the sample proportion to be approximately normally distributed:
• np ≥ 10 and n(1 - p) ≥ 10
• The individual observations must be independent of one another (one outcome must not affect another)
• Randomization condition: the sample was obtained by an SRS or a randomized experiment
• 10% condition: the sample size should not be more than 10% of the population

Week 11 Lecture 13 April 10th, 2017
Confidence Intervals When Sigma Is Known for an Unknown Population Mean
A confidence interval is: point estimate +/- margin of error
Reasoning of Statistical Estimation about the Mean — Confidence Interval about the Mean
If the sample mean is within +/- 1.96 σ ÷ √n of the population mean 95% of the time, then we can flip this around to say that the population mean is within +/- 1.96 σ ÷ √n of the sample mean 95% of the time:
µ - 1.96 σ ÷ √n < (x-bar) < µ + 1.96 σ ÷ √n
is the same as
(x-bar) - 1.96 σ ÷ √n < µ < (x-bar) + 1.96 σ ÷ √n
Thus a 95% confidence interval for the population mean is:
(x-bar) +/- 1.96 σ ÷ √n
This is in the form (point estimate) +/- (margin of error). The margin of error here is 1.96 σ ÷ √n; 1.96 is the z-score for 95% confidence.
Other Levels of Confidence
If we wanted to compute a 90% confidence interval, or a 99% confidence interval, etc., we would just need to find the right standard normal value. In general, for a (1 - α) x 100% confidence interval, we need to find the critical value z(α/2), i.e., the z-score such that the area to the right of it is α/2:
P(Z ≥ z(α/2)) = α/2 = P(Z ≤ -z(α/2))
Frequently used levels of confidence and their critical values:
• 90% corresponds to z(α/2) = z(0.05) = 1.645
• 95% corresponds to z(α/2) = z(0.025) = 1.960
• 99% corresponds to z(α/2) = z(0.005) = 2.575
How Confidence Intervals Behave
The margin of error is: margin of error = z(α/2) σ ÷ √n
The margin of error gets smaller, resulting in a more precise interval:
- when n gets larger
- when z(α/2) gets smaller (i.e., the confidence level gets smaller)
The greater the confidence level, the wider the confidence interval.
Constructing Confidence Intervals
Estimation is the process of using sample data (known) to estimate the value of a population parameter (unknown). Estimation involves two steps:
Step 1: Obtain the value of a statistic that estimates the value of the parameter; this is called the point estimate. (Relatively easy.)
Step 2: Quantify the accuracy and precision of the point estimate using a confidence interval. (Requires knowledge of the sampling distribution of the statistic.)
A confidence interval estimate consists of an interval of likely values for the population parameter determined from sample information. Associated with the interval is a percentage, called the level of confidence, that measures one's "confidence" that the true parameter value lies within the interval.
Level of Confidence
What does the level of confidence represent?
Example: Consider a process for calculating confidence intervals with a 90% level of confidence from a sample. Assume that we know the population mean. We then obtain a series of 50 random samples and apply our process to the data from each random sample to obtain a confidence interval from each. We would expect that approximately 90% of those 50 confidence intervals (about 45) would contain the true population mean.
The level of confidence represents the expected proportion of (random) intervals that will contain the parameter if a large number of different samples is obtained.
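This repeated-sampling interpretation can be demonstrated with a short simulation. The population values (µ = 10, σ = 2, normal) and the number of trials are made up for illustration; the interval formula is the (x-bar) +/- z(α/2) σ ÷ √n interval from these notes.

```python
# Coverage check: build many 90% confidence intervals for a known mu and
# count how often they capture it. About 90% should.
import random

random.seed(2)                       # seed only for a repeatable sketch
mu, sigma, n = 10.0, 2.0, 25         # assumed known population and sample size
z = 1.645                            # z(alpha/2) for 90% confidence
trials = 2000

hits = 0
for _ in range(trials):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    margin = z * sigma / n ** 0.5    # margin of error with sigma known
    if xbar - margin <= mu <= xbar + margin:
        hits += 1

coverage = hits / trials
print(coverage)                      # roughly 0.90
```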
The level of confidence is always expressed as a percent. Although the choice of the level of confidence is at the discretion of the experimenter, the most commonly used values are 90%, 95%, and 99%. The level of confidence is associated with a number α, the "error rate." For error rate α, the level of confidence is (1 - α) x 100%.
• When α = 0.05, (1 - α) = 0.95, and we have a 95% level of confidence.
• When α = 0.01, (1 - α) = 0.99, and we have a 99% level of confidence.
Confidence Interval
A level C confidence interval (a (1 - α) x 100% confidence interval) has two parts:
1. An interval calculated from the data, usually of the form: estimate +/- margin of error
2. The confidence level C, which is the probability that the interval will capture the true parameter value in repeated samples; that is, C is the success rate for the method.

Week 11 Lecture 14 April 12th, 2017
Hypothesis Testing About the Mean When Sigma Is Known (P-value Method)
Tests of Significance — Hypothesis Testing
Hypothesis testing is the other part of inferential statistics. Hypothesis testing and estimation are two different approaches to two similar problems:
• Estimation is the process of using sample data to estimate the value of a population parameter (confidence interval).
• Hypothesis testing is the process of using sample data to test a claim about the value of a population parameter (test of significance, hypothesis test).
The setup of our problem is that we want to determine whether a particular claim is believable or not. This is one of the most common goals of statistics. The process that we use is called hypothesis testing:
A. A claim is made. (A statement about the nature of some population.)
B. Evidence (sample data from the population) is collected to "test" the validity of the claim.
C. The data are analyzed to assess the plausibility of the claim.
D. A conclusion about the claim is stated.
Types of Hypotheses
A hypothesis test for a parameter is a procedure, based on sample evidence and probability, used to test a specific claim about the value of the parameter.
Population parameters:
µ = population mean
σ = population standard deviation
p = population proportion
Since a claim can either be true or false, hypothesis testing is based on two types of hypotheses: the null hypothesis, H(0), and the alternative hypothesis, H(a).
Null Hypothesis
The null hypothesis is a statement (regarding the value of a parameter) that is assumed to be true:
- It is written as H(0) (read as "H-naught" or "H-sub-oh").
- It is always a statement of equality.
- It is a statement of the status quo or of no difference.
- It is assumed to be plausible until we have evidence to the contrary.
Since we will be given the population standard deviation and all the conditions are met (the same conditions we need for confidence intervals), we will be able to see whether the sample statistic makes sense under the null hypothesis.
Alternative Hypothesis
The alternative hypothesis is a claim (regarding the value of a parameter) to be tested:
- It is written as H(a) (read as "H-a" or "H-sub-a"), sometimes H(1).
- It never contains a statement of equality.
- It represents the claim that we seek evidence for.
- There are different types of alternative hypotheses, depending on the wording of the claim.
Right-Tailed Test
A right-tailed test tests whether the parameter is equal to, versus greater than, some value.
H(0): parameter = some value
H(a): parameter > some value (one-tailed test)
We look for sample means far in the right tail in order to have "evidence" to reject the null hypothesis.
Left-Tailed Test
A left-tailed test tests whether the parameter is equal to, versus less than, some value.
H(0): parameter = some value
H(a): parameter < some value (one-tailed test)
We look for sample means far in the left tail in order to have "evidence" to reject the null hypothesis.
Two-Tailed Test
A two-tailed test tests whether the parameter is equal to, versus not equal to, some value.
H0: parameter = some value (Two-Tailed Test)
Ha: parameter ≠ some value
We look for sample means far in the right or left tail in order to have "evidence" to reject the null hypothesis.

The 6-Step Process
Complete the 6 steps of the hypothesis test:
1. State the null and alternative hypotheses.
2. State the significance level.
3. Calculate the test statistic, z0.
4. Calculate the P-value.
5. Determine whether you reject or fail to reject the null hypothesis.
6. State your conclusion in the context of the problem.

P-value Rule
If the P-value < α (the significance level), reject H0.
If the P-value ≥ α, fail to reject H0.

Setting up the Null and Alternative
In general, to set up the null and the alternative hypotheses:
Step 1: Identify the parameter in the claim. (Typically a mean or proportion)
Step 2: Determine the status quo value of the parameter to determine the null hypothesis. (Value assumed to be true unless there is evidence to the contrary.)
Step 3: Identify the claim that we want evidence for to determine the alternative hypothesis. (Greater than, less than, or not equal to the status quo value.)

Outcomes of a Hypothesis Test
There are two possible results for a hypothesis test:
If we do not have enough evidence to support the alternative hypothesis, then we fail to reject the null hypothesis.
If we have enough evidence to support the alternative hypothesis, then we reject the null hypothesis.
However, because the decision to reject or not reject is based on incomplete sample information, there is always the possibility of making an incorrect decision.

Stating Hypotheses Revisited
Null Hypothesis, H0
The statement being tested in a statistical test is called the null hypothesis. The test is designed to assess the strength of evidence against the null hypothesis.
Usually the null hypothesis is a statement of "no effect" or "no difference," or it is a statement of equality.
When performing a hypothesis test, we assume that the null hypothesis is true until we have sufficient evidence against it.

Stating Hypotheses Revisited
Alternative Hypothesis, Ha
The statement we are trying to find evidence for is called the alternative hypothesis.
Usually the alternative hypothesis is a statement of "there is an effect" or "there is a difference," or it is a statement of inequality.
The alternative hypothesis should express the hopes or suspicions we bring to the data. It is incorrect to first look at the data and then frame Ha to fit what the data show.

Decision Errors: Type I
If we reject H0 when in fact H0 is true, this is a Type I error. If we decide that, based on the sample, the sample seems unlikely under the null hypothesis and so we reject the null hypothesis:
- This was an incorrect decision only if H0 is true and the sample was truly just unusual.
- The probability of this incorrect decision is equal to α. That is, P(Type I error) = α.
- In practice, typical α's are 0.01, 0.05, and 0.1.
- If the null hypothesis is true and α = 0.05: About 5% of all samples from this population will lead us to wrongly reject chance and conclude significance.

Decision Errors: Type II
If we fail to reject H0 when in fact Ha is true, this is a Type II error. If we decide not to reject chance and thus allow for the plausibility of the null hypothesis:
- This is an incorrect decision only if Ha is true.
- The probability of this incorrect decision is computed as 1 minus the power of the test.

One Definition of a P-Value
The P-value is the probability of observing a sample mean that is as extreme as or more extreme than the one observed. The probability is calculated assuming that the null hypothesis is true.
We use the P-value to quantify how unlikely the observed sample mean is.
(It becomes the basis of our reject / fail to reject decision.)
Which sample means are as extreme as or more extreme than the observed one depends on the alternative hypothesis. So, the formulas for calculating the P-value depend on the alternative hypothesis.
The P-value is a statement about the probability that the sample mean takes on specific values. It is not a statement about the probability that the population mean has a certain value. It is not a statement about the probability that the null hypothesis is true.

Another Definition of a P-Value
The P-value can also be defined as the probability of committing a Type I error based on your sample. So if the P-value is large, that indicates that the probability of making a Type I error is large, and you will not feel comfortable rejecting the null hypothesis. But if the P-value is small, that indicates that the probability of making a Type I error is small, and you will feel comfortable rejecting the null in favor of the alternative.

P-Value
How small is small, how large is large?
If the P-value < α, then the P-value is small enough to feel comfortable rejecting the null hypothesis.
If the P-value ≥ α, then the P-value is large and you will not feel comfortable rejecting the null hypothesis.

P-Value Approach
To test a claim using the P-value approach:
1. Calculate the (observed value of the) test statistic:
z0 = (x̄ − µ0) / (σ / √n)
where µ0 is the assumed value under H0. The test statistic measures the number of standard deviations our sample mean is away from the mean of the null hypothesis.
2. Calculate the P-value, the probability that the test statistic would be this extreme, or more extreme, if the null hypothesis is true. (If the null hypothesis is true, the test statistic has a standard normal distribution.)
3. Compare the P-value to a pre-specified level of significance to decide whether to reject or fail to reject the null hypothesis, i.e., to determine whether the result is statistically significant.
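The three steps above can be sketched in Python with the standard library's Normal distribution; the sample numbers below are hypothetical.

```python
from statistics import NormalDist
from math import sqrt

def z_test_p_value(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z test when sigma is known (P-value approach).
    Returns the test statistic z0 and its P-value."""
    z0 = (xbar - mu0) / (sigma / sqrt(n))  # step 1: test statistic
    phi = NormalDist().cdf                 # standard normal CDF, P(Z < z)
    if alternative == "right":             # Ha: mu > mu0
        p = 1 - phi(z0)
    elif alternative == "left":            # Ha: mu < mu0
        p = phi(z0)
    else:                                  # Ha: mu != mu0
        p = 2 * (1 - phi(abs(z0)))
    return z0, p

# Hypothetical numbers: H0: mu = 100 vs Ha: mu > 100,
# a sample of n = 25 with x-bar = 103 and known sigma = 10
z0, p = z_test_p_value(103, 100, 10, 25, alternative="right")
print(round(z0, 2), round(p, 4))
```

With a P-value near 0.067 and α = 0.05, the P-value rule says fail to reject H0 in this made-up case.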
Calculating the P-Value: Right-Tailed
Right-Tailed: H0: µ = µ0; Ha: µ > µ0
For the right-tailed test, the "unlikely" region consists of values that are too high. (Right-Tailed)
P-value = (area to the right of the observed value of the test statistic) = P(Z > z0)
Calculation: P-value = 1 − P(Z < z0) = 1 − (value from standard normal table at z0)

Calculating the P-Value: Left-Tailed
Left-Tailed: H0: µ = µ0; Ha: µ < µ0
For the left-tailed test, the "unlikely" region consists of values that are too low. (Left-Tailed)
P-value = (area to the left of the observed value of the test statistic) = P(Z < z0)
Calculation: P-value = (value from standard normal table at z0)

Calculating the P-Value: Two-Tailed
Two-Tailed: H0: µ = µ0; Ha: µ ≠ µ0
For the two-tailed test, the "unlikely" region consists of values that are too high or too low. (Two-Tailed)
P-value = 2 × (area to the right of the absolute value of the observed test statistic) = 2 × P(Z > |z0|)
Calculation: P-value = 2 × [1 − P(Z < |z0|)] = 2 × [1 − (value from standard normal table at |z0|)]

Interpretation of a P-Value
For all three alternatives (two-tailed, left-tailed, right-tailed), the interpretation is the same:
Larger P-values mean that the observed sample mean is not unusual (i.e., not unlikely if the null hypothesis is true).
- A P-value of 0.30, for example, means that the observed value of the test statistic, or a more extreme value, would happen (by chance) 30% of the time if the null hypothesis is true.
- 30% of the time is not unusual. (Fail to Reject decision)
Smaller P-values mean that the observed sample is unusual (i.e., unlikely if the null hypothesis is true).
- A P-value of 0.01, for example, means that the observed value of the test statistic, or a more extreme value, would happen (by chance) only 1% of the time if the null hypothesis is true.
- 1% of the time is unusual. (Reject decision)

Decision Rule: P-Value Approach
The decision rule is based on the level of significance α (given):
Fail to reject the null hypothesis if the P-value is greater than or equal to α. P-value ≥ α → Fail to Reject H0
If the P-value is low, the null must go! Reject the null hypothesis if the P-value is less than α. P-value < α → Reject H0

Week 12 Lecture 15 April 17th, 2017
Hypothesis Testing About the Mean When Sigma is Known (The Classical Method)

The 6-Step Process Using the Classical Approach
Complete the 6 steps of the hypothesis test:
1. State the null and alternative hypotheses. (From this you can describe the null hypothesis distribution.)
2. State the significance level. (The α for the problem will be given to you; if not, assume α = 0.05.)
3. Calculate the test statistic, z0. (The test statistic z0 is how many standard deviations your sample mean lies from the center of the null hypothesis distribution.)
z0 = (x̄ − µ0) / (σ / √n)
(The test statistic measures the number of standard deviations between the sample mean and the hypothesized mean.)
4. Calculate the critical value, zc: the maximum number of standard deviations the sample mean can be from µ0 before the null hypothesis is rejected.
- For a right-tailed test, zc satisfies P(Z > zc) = α
- For a left-tailed test, zc satisfies P(Z < zc) = α
- For a two-tailed test, there are two critical values, −zc and zc, where P(Z < −zc) = α/2 and P(Z > zc) = α/2
Compare the critical value zc to the test statistic:
- For a right-tailed test, if z0 > zc, reject the null hypothesis
- For a left-tailed test, if z0 < zc, reject the null hypothesis
- For a two-tailed test, if z0 < −zc or z0 > zc, reject the null hypothesis
5. Determine whether you reject or fail to reject the null hypothesis.
6.
State your conclusion in the context of the problem.

Confidence Intervals and Two-Sided Tests
A level α two-sided significance test rejects the null hypothesis H0: µ = µ0 exactly when the value µ0 falls outside a level (1 − α) confidence interval for µ.

Week 12 Lecture 16 April 19th, 2017
Inference: One-Sample Proportion

Standardized Sample Proportion
Inference about a population proportion p is based on the z statistic that results from standardizing p̂:
z = (p̂ − p) / √(p(1 − p) / n)
z has approximately the standard normal distribution as long as the sample is not too small and the sample is not a large part of the entire population.

Distribution of the Sample Proportion
Because the value of the sample proportion varies from sample to sample, it is a random variable. So, we have the same questions for the sample proportion as we had for the sample mean:
What is the mean of the sample proportion? p
What is the standard deviation of the sample proportion? SE = √(p(1 − p) / n)
What is the sampling distribution of the sample proportion? If np and n(1 − p) are each at least 10, or the numbers of successes and failures are each 15 or more:
p̂ ~ N( mean: µ(p̂) = p, standard error: √(p(1 − p) / n) )

Large-Sample Confidence Interval for a Proportion
How do we find the critical value for our confidence interval?
statistic +/- (critical value) × (standard deviation of statistic)
If the Normal condition is met, we can use a Normal curve. To find a level C confidence interval, we need to catch the central area C under the standard Normal curve. For example, to find a 95% confidence interval, we can use a critical value of 2 based on the 68-95-99.7 rule. Using a Standard Normal Table or a calculator, we can get a more accurate critical value. Note that the critical value z* is actually 1.96 for a 95% confidence level.

Confidence Interval
Draw an SRS of size n from a population with unknown proportion p of successes.
An approximate level C confidence interval for p is:
p̂ +/- z(α/2) √(p̂(1 − p̂) / n)
where z(α/2) is the critical value for the standard Normal density curve with area C between −z(α/2) and z(α/2).
Use this interval only when the counts of successes and failures in the sample are both at least 15.
Note: You never use a t(α/2) critical value when constructing a confidence interval for a population proportion.

Choosing the Sample Size
In planning a study, we may want to choose a sample size that allows us to estimate a population proportion within a given margin of error.
m = z(α/2) √(p̂(1 − p̂) / n)
z(α/2) is the standard Normal critical value for the level of confidence we want. Because the margin of error involves the sample proportion p̂, we have to guess the latter value when choosing n. There are two ways to do this:
Use a guess for p̂ based on past experience or a pilot study.
Use p̂ = 0.5 as the guess. The margin of error is largest when p̂ = 0.5.
Sample size for a desired margin of error: The level C confidence interval for a population proportion p will have a margin of error approximately equal to a specified value m when the sample size is:
n = (z(α/2) / m)^2 × p*(1 − p*)
where p* is a guessed value for the sample proportion. The margin of error will be less than or equal to m if you take the guess p* to be 0.5.

The Hypotheses for Proportions
Choose an SRS of size n from a large population that contains an unknown proportion p of successes.
Null: H0: p = p0
One-sided alternatives: Ha: p > p0; Ha: p < p0
Two-sided alternative: Ha: p ≠ p0

Test Statistic for Proportions
Start with the z statistic that results from standardizing p̂:
z = (p̂ − p) / √(p(1 − p) / n)
Assuming that the null hypothesis is true (H0: p = p0), then p = p0, so:
z0 = (p̂ − p0) / √(p0(1 − p0) / n)

P-Value for Testing Proportions
Ha: p > p0. P-value = P(Z ≥ z0), the probability of getting a value as large as or larger than the observed test statistic (z) value.
Ha: p < p0. P-value = P(Z ≤ z0), the probability of getting a value as small as or smaller than the observed test statistic (z) value.
Ha: p ≠ p0. P-value = 2P(Z ≥ |z0|), two times the probability of getting a value as large as or larger than the absolute value of the observed test statistic (z) value.

Week 13 Lecture 17 April 24th, 2017
Inference: Dependent Two-Sample Means

Matched Pairs t Procedures
To compare the responses to the two treatments in a matched-pairs design, find the difference between the responses within each pair. Then apply the one-sample t procedures to these differences.
The parameter µd is the mean difference in the responses to the two treatments within matched pairs of subjects in the entire population.

Robustness of t Procedures
A confidence interval or significance test is called robust if the confidence level or P-value does not change very much when the conditions for use of the procedure are violated.
Except in the case of small samples, the condition that the data are an SRS from the population of interest is more important than the condition that the population distribution is Normal.
Sample size at least 15: The t procedures can be used except in the presence of outliers or strong skewness.
Sample size less than 15: Use t procedures if the data appear close to Normal. If the data are clearly skewed or if outliers are present, do not use t.
Large samples: The t procedures can be used even for clearly skewed distributions when the sample is large, roughly n ≥ 30.

Using the t Procedures
Except in the case of small samples, the assumption that the data are an SRS from the population of interest is more important than the assumption that the population distribution is Normal.
Sample size less than 30: Use t procedures if the data appear close to Normal (symmetric, single peak, no outliers). If the data are skewed or if outliers are present, do not use t.
Large samples: The t procedures can be used even for clearly skewed distributions when the sample is large, roughly n ≥ 30.
The data must be an SRS from the population.

Week 14 Lecture 18 April 26th, 2017
Inference when Sigma is Unknown

The t Distributions
When the sampling distribution of x̄ is close to Normal, we can find probabilities involving x̄ by standardizing:
z = (x̄ − µ) / (σ / √n)
When we don't know σ, we can estimate it using the sample standard deviation sx. What happens when we standardize? This new statistic does not have a Normal distribution!

Standard Error When σ is Unknown
When we do not know the population standard deviation σ (which is usually the case), we must estimate it with the sample standard deviation s. When the standard deviation of a statistic is estimated from data, the result is called the standard error of the statistic. The standard error of the sample mean is: s / √n

The t Distributions
When we standardize based on the sample standard deviation sx, our statistic has a new distribution called a t distribution.
It has a different shape than the standard Normal curve: it is symmetric with a single peak at 0; however, it has much more area in the tails.
Like any standardized statistic, t tells us how far x̄ is from its mean µ in standard deviation units.
However, there is a different t distribution for each sample size, specified by its degrees of freedom (df).
The t density curve is similar in shape to the standard Normal curve. They are both symmetric about 0 and bell-shaped.
The spread of the t distributions is a bit greater than that of the standard Normal curve (i.e., the t curve is slightly "fatter").
As the degrees of freedom increase, the t density curve approaches the N(0,1) curve more closely. This is because s estimates σ more accurately as the sample size increases.
When we perform inference about a population mean µ using a t distribution, the appropriate degrees of freedom are found by subtracting 1 from the sample size n, making df = n − 1.
Draw an SRS of size n from a large population that has a Normal distribution with mean µ and standard deviation σ. The one-sample t statistic:
t = (x̄ − µ) / (sx / √n)
has the t distribution with degrees of freedom df = n − 1.
When comparing the density curves of the standard Normal distribution and t distributions, several facts are apparent:
The density curves of the t distributions are similar in shape to the standard Normal curve.
The spread of the t distributions is a bit greater than that of the standard Normal distribution.
The t distributions have more probability in the tails and less in the center than does the standard Normal.
As the degrees of freedom increase, the t density curve approaches the standard Normal curve ever more closely.
We can use Table D in the back of the book to determine critical values t(α/2) for t distributions with different degrees of freedom.

Properties of Confidence Intervals
As the sample size n gets large, there is less and less of a difference between the critical values for the normal distribution and the critical values for the t-distribution.
It is correct to use the t-distribution when σ is not known:
• Technology should always use the t-distribution.
• When doing a rough assessment by hand, the normal critical values can be used, particularly when n is large, for example if n is 30 or more.
When do the t-distribution and the normal distribution differ by a lot? In either of two situations:
• When the sample size n is small (particularly if n is 10 or less), or
• When the level of confidence needs to be high (particularly if α is 0.005 or lower)
Example: For n = 5 and α = 0.001 (99.9% level of confidence), the t-distribution critical value t(α/2) is 8.610 (4 df), compared to the normal critical value of 3.291.
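A minimal sketch of the one-sample t statistic, t = (x̄ − µ0) / (sx/√n) with df = n − 1, using made-up data. The critical value 2.365 is assumed here to come from a t table (two-tailed α = 0.05, df = 7).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample; testing H0: mu = 10 against Ha: mu != 10
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.0, 10.3, 9.7]
n = len(sample)
se = stdev(sample) / sqrt(n)      # standard error: sx / sqrt(n)
t0 = (mean(sample) - 10) / se     # one-sample t statistic
df = n - 1                        # degrees of freedom

# Classical decision: compare |t0| to the table critical value t(alpha/2)
t_crit = 2.365                    # from a t table: alpha = 0.05, df = 7
print(round(t0, 3), df, abs(t0) > t_crit)
```

Here |t0| is well below the critical value, so this made-up sample would not give evidence against H0.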
Conditions for Inference About a Mean
Data are from an SRS of size n.
The population has a Normal distribution with mean µ and standard deviation σ, or the sample size is 30 or more. This will ensure that the sampling distribution is Normal or approximately Normal.
Both µ and σ are usually unknown. We use inference to estimate µ.
Problem: σ unknown means we cannot use the z procedures previously learned.

One-Sample t Confidence Interval
The one-sample t interval for a population mean is similar in both reasoning and computational detail to the one-sample z interval for a population proportion.
Choose an SRS of size n from a population having unknown mean µ. A level C confidence interval for µ is:
x̄ +/- t(α/2) × sx / √n
where t(α/2) is the critical value for the t(n − 1) distribution. The margin of error is t(α/2) × sx / √n.
This interval is exact when the population distribution is Normal and approximately correct for large n in other cases.

Normality
Here are two very effective ways to check for major deviations from Normality:
1. Boxplots will show if there are outliers, and the general shape of boxplots can indicate major deviations from Normality.
2. Normal Probability Plots will indicate major deviations from Normality.

Week 15 Lecture 19 May 1st, 2017
Inference: Independent Two-Sample Means

Two-Sample Problems
The goal of inference is to compare the responses of two treatments or to compare the characteristics of two populations.
We have a separate sample from each treatment or each population. Individuals in one sample have no influence upon which individuals are selected for the second sample. Each sample is separate. The units are not matched, and the samples can be of differing sizes.

Comparing Two Population Means
One of the most common goals in inference on two populations is to compare the average or typical responses in the two populations, that is, to compare the means of the two populations.
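The one-sample t interval described above, x̄ +/- t(α/2) × sx/√n, can be sketched as follows. The data are hypothetical, and the critical value 2.262 is assumed to be read from a t table (95% confidence, df = 9).

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(sample, t_crit):
    """One-sample t interval: x-bar +/- t(alpha/2) * sx / sqrt(n).
    t_crit is the critical value for the t(n - 1) distribution,
    read from a t table for the chosen confidence level."""
    n = len(sample)
    margin = t_crit * stdev(sample) / sqrt(n)  # margin of error
    xbar = mean(sample)
    return xbar - margin, xbar + margin

# Hypothetical data, 95% confidence; t(0.025) with df = 9 is 2.262 (t table)
data = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8, 4.3, 4.1, 4.0, 4.2]
low, high = t_interval(data, 2.262)
print(round(low, 3), round(high, 3))
```

Because 2.262 exceeds the normal critical value 1.96, the t interval is slightly wider than the corresponding z interval would be, reflecting the extra uncertainty from estimating σ.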
Test claims regarding the difference of two populations' means from independent samples. Two samples are independent if one sample has no influence on the other. (Matched pairs violate independence, since one sample directly affects the other sample.)
Construct and interpret confidence intervals regarding the difference of two population means.

Two-Sample Problems
What if we want to compare the mean of some quantitative variable for the individuals in two populations, Population 1 and Population 2?
Our parameters of interest are the population means µ1 and µ2. The best approach is to take separate random samples from each population and to compare the sample means.
Suppose we want to compare the average effectiveness of two treatments in a completely randomized experiment. In this case, the parameters µ1 and µ2 are the true mean responses for Treatment 1 and Treatment 2, respectively. We use the mean response in the two groups to make the comparison.

Conditions for Comparing Two Means
We have two independent SRSs, from two distinct populations
- That is, one sample has no influence on the other (matching violates independence)
- We measure the same variable for both samples.
Both populations are Normally distributed
- The means and standard deviations of the populations are unknown
- In practice, it is enough that the distributions have similar shapes and that the data have no strong outliers.

The Two-Sample t Statistic
When data come from two random samples or two groups in a randomized experiment, the statistic x̄1 − x̄2 is our best guess for the value of µ1 − µ2.
When the Independent condition is met, the standard deviation of the statistic x̄1 − x̄2 is:
σ(x̄1 − x̄2) = √(σ1^2 / n1 + σ2^2 / n2)
Since we don't know the values of the parameters σ1 and σ2, we replace them in the standard deviation formula with the sample standard deviations.
The result is the standard error of the statistic x̄1 − x̄2:
√(s1^2 / n1 + s2^2 / n2)
If the Normal condition is met, we standardize the observed difference to obtain a t statistic that tells us how far the observed difference is from its mean in standard deviation units:
t0 = (x̄1 − x̄2) / √(s1^2 / n1 + s2^2 / n2)
The two-sample t statistic has approximately a t distribution. We can use technology to determine the degrees of freedom, OR we can use a conservative approach, using the smaller of n1 − 1 and n2 − 1 for the degrees of freedom.

Hypothesis Testing
Now for the overall structure of the test:
1. Set up the hypotheses (right-tailed, left-tailed, or two-tailed alternative hypotheses).
Right-Tailed: H0: µ1 = µ2; Ha: µ1 > µ2
Left-Tailed: H0: µ1 = µ2; Ha: µ1 < µ2
Two-Tailed: H0: µ1 = µ2; Ha: µ1 ≠ µ2
µ1 is the population mean for population 1; µ2 is the population mean for population 2.
2. Select the level of significance α, depending on the seriousness of making a Type I error (often 0.10, 0.05, or 0.01).
3. Use technology to compute the test statistic.
4. Use technology to calculate the P-value.
5. Reach a fail-to-reject or reject decision about the null hypothesis by comparing the P-value to the level of significance.
6. Interpret the decision in terms of the problem (Conclusion).

Confidence Interval for µ1 − µ2
A (1 − α) × 100% confidence interval for the difference of two population means µ1 − µ2 is:
(x̄1 − x̄2) +/- t(α/2) × √(s1^2 / n1 + s2^2 / n2)
where t(α/2) is the critical value of the Student's t-distribution with degrees of freedom equal to the smaller of n1 − 1 and n2 − 1 (if calculated by hand).
The point estimate is (x̄1 − x̄2). We use the denominator of the test statistic (Welch's approximation) as the standard error.
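The two-sample t statistic and the conservative (by-hand) degrees of freedom can be sketched from summary statistics; all numbers below are hypothetical.

```python
from math import sqrt

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    """Two-sample t statistic with the Welch standard error and the
    conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of xbar1 - xbar2
    t0 = (xbar1 - xbar2) / se            # test statistic
    df = min(n1 - 1, n2 - 1)             # conservative df for by-hand work
    return t0, df

# Hypothetical summary statistics for two independent samples
t0, df = two_sample_t(xbar1=78.2, s1=6.0, n1=16, xbar2=74.0, s2=8.0, n2=25)
print(round(t0, 3), df)
```

With the conservative df, the by-hand P-value is never smaller than the one technology reports, which is why the approach is called conservative.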
The confidence interval is of the form: point estimate +/- margin of error.

Assumptions - Independent Samples
For both hypothesis testing and confidence intervals, the data need to satisfy the following conditions:
Both samples are obtained using simple random sampling.
The samples are independent.
Neither sample has outliers.
Both populations are normally distributed, or both sample sizes are large (both n1 and n2 are at least 30).
These are the usual conditions we need to make our Student's t-distribution calculations.

Week 15 Lecture 20 May 3rd, 2017
Chi-Square Goodness of Fit

Goodness-of-Fit Test
A goodness-of-fit test is a procedure used in inferential statistics to test claims regarding entire probability distributions.
Goodness-of-fit tests apply to experiments that consist of a series of n independent trials (e.g., simple random samples of size n). The (qualitative) outcome of each trial can be classified into one of k ≥ 3 categories.
We let p1, p2, …, pk be the "claimed" probabilities for the k categories (one pi for each of the possible outcomes).
Compare the observed frequencies and the expected frequencies. If they differ significantly, then we have evidence against the claimed probabilities.
Goodness-of-fit tests use the X^2 (chi-square) distribution, which has the following properties:
Its values are greater than or equal to zero.
It is not symmetric; it is skewed right.
Its shape depends on the degrees of freedom (just like a t-distribution).
As the number of degrees of freedom increases, the shapes of the X^2 distributions become more and more symmetric.

Performing the Test
To perform a goodness-of-fit test:
1. Set up the hypotheses to be tested.
2. Decide on a level of significance, α, depending on the seriousness of making a Type I error.
3. Find the value of the test statistic:
e. Calculate the number of occurrences to be expected in each category (if the null hypothesis is true).
f.
Calculate the number of occurrences actually observed in each category (from the sample).
g. Compare the observed numbers to the expected numbers using a test statistic.
4. Using technology, find the P-value.
5. Reach a reject / fail-to-reject decision by comparing the P-value to the level of significance.
6. State the decision in terms of the problem (Conclusion).

Hypotheses
The hypotheses to be tested are:
H0: The random variable follows the claimed distribution (i.e., the claimed proportions are true).
H0: p1 = p10, p2 = p20, …, pk = pk0
Ha: The random variable does not follow the claimed distribution.
Ha: At least one of the pi's is different from what is stated in H0.
These hypotheses deal with all the categories at the same time, not just one.

Expected Number
The expected number of occurrences in each category is the number of trials of the experiment times the probability of "success" for that category:
E1 = n·p10, E2 = n·p20, …, Ek = n·pk0
where: n = sample size (number of independent trials); pi0 = proportion in category i (when the null hypothesis is true); k = number of mutually exclusive categories.
We also use the notation µ1 = n·p10, µ2 = n·p20, …, µk = n·pk0 for the expected numbers.

Test Statistic
The experiment is run: O1 is the number observed in category 1, O2 is the number observed in category 2, and so on. We want to compare O1 to E1, O2 to E2, etc. with a single test statistic. The test statistic to use is:
X0^2 = ∑ (Oi − Ei)^2 / Ei
The test statistic is "large" if the observed and expected counts differ "significantly." (In this formula, we compare the observed and expected values via Oi − Ei, square it, divide by Ei, and add over all the different categories.)

Sampling Distribution
When the null hypothesis is true, the sampling distribution of the test statistic is approximately chi-square with k − 1 degrees of freedom, provided that the following conditions hold:
- All expected frequencies (the Ei) should be at least 1, i.e., Ei ≥ 1.
- No more than 20% of the Ei should be less than 5.

Decision
All goodness-of-fit tests are right-tailed tests. If the P-value is less than the level of significance, then reject the null hypothesis that the proportions are as claimed.

Chi-Square Goodness-of-Fit Test
A variation of the chi-square statistic can be used to test a different kind of null hypothesis: that a single categorical variable has a specific distribution.
The null hypothesis specifies the probabilities (pi) of each of the k possible outcomes of the categorical variable.
The chi-square goodness-of-fit test compares the observed counts for each category with the expected counts under the null hypothesis.
A categorical variable has k possible outcomes, with probabilities p1, p2, p3, …, pk. That is, pi is the probability of the i-th outcome. We have n independent observations from this categorical variable.
To test the null hypothesis that the probabilities have specified values:
H0: p1 = p10, p2 = p20, …, pk = pk0
use the chi-square statistic:
X^2 = ∑ (count of outcome i − n·pi0)^2 / (n·pi0)
The P-value is the area to the right of X^2 under the density curve of the chi-square distribution with k − 1 degrees of freedom.
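The goodness-of-fit statistic can be sketched with hypothetical counts. The critical value 5.991 is assumed to come from a chi-square table (α = 0.05, df = 2).

```python
# Hypothetical goodness-of-fit data: H0 claims three categories
# are equally likely (p1 = p2 = p3 = 1/3).

observed = [18, 30, 12]                # observed counts O_i
n = sum(observed)                      # 60 independent trials
claimed = [1/3, 1/3, 1/3]              # probabilities under H0
expected = [n * p for p in claimed]    # E_i = n * p_i0, each 20 here

# Check the conditions: every E_i is at least 1, none below 5
assert all(e >= 1 for e in expected)

# Test statistic: sum of (O_i - E_i)^2 / E_i over all categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right-tailed decision using a chi-square table value:
# for df = k - 1 = 2 and alpha = 0.05 the critical value is 5.991
print(round(chi2, 2), chi2 > 5.991)
```

Since the statistic exceeds the table value here, this made-up sample would lead to rejecting the claimed equal proportions.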
