# STAT 2004, Midterm 1 Study Guide

Virginia Tech

This 8-page Study Guide was uploaded by Mara DePena on Thursday, February 25, 2016. The Study Guide belongs to STAT 2004 at Virginia Polytechnic Institute and State University, taught by Metzger in Spring 2016.

Date Created: 02/25/16

STAT 2004 STUDY GUIDE: MIDTERM ONE

## TABLE OF CONTENTS

- Course Logistics/About the Midterm
- Populations and Samples
- Sampling Methods
- Experimental Design
- Visualizing Numerical Data
- Distribution
- Boxplots
- Robustness
- Probability
- Probability Distribution
- Symbols

## COURSE LOGISTICS/ABOUT THE MIDTERM

- You must bring a calculator to the midterm.
- The midterm will consist of multiple choice and short answer questions.
- Statistics is the study of how best to collect, analyze, and draw conclusions from data.
- Data consists of observations, and these observations form the backbone of a statistical investigation.

## POPULATIONS AND SAMPLES

- Population- Represents all people or things of interest.
- Sample- Observed/measured subset of a population.
- Summary statistic- A single number that summarizes a large amount of data.
- Variables- Measured or observed characteristics of data.
  - Categorical variable- Responses themselves are categories.
    - Nominal- Unordered levels.
    - Ordinal- Ordered levels.
  - Numerical variable- Counts/measures information. Can take a wide range of numerical values. It is sensible to add, subtract, or average these values.
    - Discrete- Countable scale. Can only take numerical values with jumps. (Ex: 1, 2, 3, 4…)
    - Continuous- Continuous scale. (Ex: height, weight, etc.)
  - Associated/dependent/correlated- When two variables show some connection with one another.
  - Independent- When two variables are not associated.
  - Correlation is not causation.
  - Explanatory variable- The variable suspected of affecting the other. In scientific terms, this is the independent variable.
  - Response variable- The variable that may be affected by the explanatory variable. In scientific terms, this is the dependent variable.
  - Confounding variable- Variable that is correlated with both the explanatory and response variables.

## SAMPLING METHODS

- We seek to randomly select samples from a population.
- Bias- When a sample systematically misrepresents the population, for example because it is skewed toward a person's interests.
  - Non-response bias- Can skew results when the people who do not respond differ from those who do.
- Sample frame- List/roster of all potential observations (numbered).
- Simple random sample- Most basic random sample. Equivalent to using a raffle. All observations have an equal chance of being chosen.
- Stratified random sample- Divide-and-conquer sampling strategy. Population is divided into groups called strata by demographics/subgroups, so that similar cases are grouped together.
  - Random samples are drawn from each stratum.
- Cluster sample- Population is divided into clusters, often but not always by location. A random sample of clusters is chosen.
  - All members of the chosen clusters are measured/given treatment.

## EXPERIMENTAL DESIGN

- Observational study- A study where data is collected in a way that does not directly interfere with how the data arises.
- Experiment- Used to investigate the possibility of causation. Has an explanatory and a response variable. (Independent and dependent in scientific terms.)
  - Randomized- When individuals are randomly assigned to a group.
  - Placebo- Fake treatment.
- Prospective study- Identifies individuals and collects information as events unfold.
- Retrospective study- Collects data after events have taken place.
- Principles of Experiment Design:
  1. Randomization- Subjects sampled randomly, treatments/control assigned randomly.
  2. Replication- Large sample size, limited by cost/convenience.
  3. Error control- Eliminate/account for any differences in the sample.
     - Placebo
     - Blinding- Subjects do not know if they are in the treatment or control group.
     - Double-blinding- Researcher also doesn't know who is treatment/control.
     - Blocking- Group subjects into blocks who share some other variable.
- Treatments are applied to experimental units.
- Response is measured on observational units.
  - Ex: If you modify the temperature in several fish tanks and record the heart rate of fish in different temperature tanks, the tanks are the experimental units while the fish are the observational units.

## VISUALIZING NUMERICAL DATA

- Dot plot
- Histogram
  - Sorts values into bins and provides a view of data density.
  - Right-skewed- Data trails off to the right.
  - Left-skewed- Data trails off to the left.
  - Unimodal- One prominent peak.
  - Bimodal- Two prominent peaks.
  - Multimodal- More than two prominent peaks.
- Scatterplot
  - Provides a case-by-case view of data for two numerical variables.
- Stem-and-leaf plot
  - Data set: {1, 2, 4, 7, 7, 7, 12, 15, 18, 22, 24}

| Stem | Leaves |
|---|---|
| 2 | 2, 4 |
| 1 | 2, 5, 8 |
| 0 | 1, 2, 4, 7, 7, 7 |

## DISTRIBUTION

- Describes shape, center, and spread/variation of data.
- [Figure: example distribution curves. Imagine histograms that fit the depicted curves.]
- Sample standard deviation- Tells you how spread out your data is. It is the square root of the variance.

## BOXPLOTS

- A boxplot uses a five number summary consisting of the median ($Q_2$), minimum, maximum, and the 25th ($Q_1$) and 75th ($Q_3$) percentiles. It summarizes a data set while also plotting unusual observations known as outliers.
  - Outliers- Observations that are extreme relative to the rest of the data.
- [Figure: example boxplot of test scores. Boxplots are usually vertical, but this one is drawn horizontally.]
  - The first step is to draw the median. The second step is to draw a rectangle to represent the middle 50% of the data.
- Interquartile range (IQR)- $Q_3 - Q_1$. It is the length of the box in the boxplot.
- IQR Method
  - One of the many methods for identifying outliers.
  - Lower cutoff- $Q_1 - (1.5 \times \text{IQR})$
  - Upper cutoff- $Q_3 + (1.5 \times \text{IQR})$
  - These upper and lower cutoffs limit how far the whiskers attached to the box can reach. Any points outside of the cutoffs are considered outliers and are labeled with a dot.

## ROBUSTNESS

- Robust estimate- Strong/effective in all/most situations and conditions.
  - Outliers do not change it very much.
- The median and IQR are considered robust estimates.

## PROBABILITY

- Probability- The proportion of times an outcome would occur if the random process were repeated infinitely many times.
- Sample space- Represents all possible outcomes.
- Law of Large Numbers- As the number of trials increases, the estimate gets closer to the true probability. In other words, as a sample size increases, a statistic gets closer to the parameter it is estimating.
- Example problem one: Find P(7) (probability of rolling a sum of 7) with two fair independent dice.
  - How can a 7 be rolled? The sample space S has 36 equally likely outcomes, and six of them give a 7: {(1,6), (2,5), (3,4), (6,1), (5,2), (4,3)}
  - Mutually exclusive/disjoint- Cannot both happen together. (Ex: Sanders and Bush cannot both be elected president.) In this example, you cannot get two of these results at the same time.
  - Addition Rule of Disjoint Outcomes- If A1 and A2 represent two disjoint outcomes, then the probability that one of them occurs is given by P(A1 or A2) = P(A1) + P(A2).
  - In this example we multiply the probabilities for each die to get the probability of one pair, (1/6) x (1/6) = 1/36, and then add the probabilities of the six pairs.
  - P(7) = 6 x (1/6) x (1/6) = 6/36 or 1/6
- A summary of the rules…
  - If A and B are disjoint… P(A or B) = P(A) + P(B)
  - If A and B are independent (knowledge of one doesn't affect knowledge of the other)… P(A and B) = P(A) x P(B)
  - If A and B are not disjoint… P(A or B) = P(A) + P(B) - P(A and B)
- Example problem two:
  - We have a 52 card deck. It consists of the numbers 2-10 and jacks, queens, kings, and aces. Suits are diamond and heart (red), and club and spade (black).
  - What is the probability of drawing a red King?
    - There are 52 cards, and two red Kings: the King of Diamonds and the King of Hearts. Therefore P(red King) = 2/52.
  - What is the probability of drawing a card that is red or a King?
    - P(red or King) = P(R) + P(K) - P(R and K) = (26/52) + (4/52) - (2/52) = 28/52
- Venn diagrams can be used as a visual aid when solving probability problems.
  - We can make a Venn diagram for the previous example. One circle represents the red cards, the other the Kings. The overlap represents Kings that are also red cards, and the rectangle represents the remainder of the deck. People who are more visual and less math-minded may prefer drawing a Venn diagram to working out the formulas.
  - [Venn diagram: 24 red cards that are not Kings, 2 red Kings in the overlap, 2 black Kings, and 24 remaining cards outside both circles.]

## PROBABILITY DISTRIBUTION

- A table or graph showing all possible outcomes and their probabilities.
- Ex: Roll two dice. There are 11 possible sums.

| X (sum) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P(X) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |

- Note that not all probability distributions will follow such a curve (here the probabilities rise to a peak at 7 and fall off symmetrically).
- Also note that all probabilities add up to one or to 100%.
- Possible short answer question for the midterm:
  - Describe the distribution of two fair coin flips. S = {HH, HT, TH, TT}

| Outcome | HH | HT | TH | TT |
|---|---|---|---|---|
| Probability | 1/4 | 1/4 | 1/4 | 1/4 |

## SYMBOLS

| Symbol | Meaning |
|---|---|
| $\tilde{x}$ | Median |
| $\bar{x}$ | Mean |
| $\mu$ | Population mean |
| $\hat{\ }$ (hat over a symbol) | Estimation |
| $\sigma^2$ | Variance |
| $\sigma$ | Standard deviation |
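The IQR method from the Boxplots section can be sketched in a few lines of Python. This is only an illustration: the `test_scores` list is made-up data, not the scores from the guide's boxplot, and quartiles are computed with the standard library's "inclusive" method, which is an assumption since textbooks and calculators use slightly different quartile conventions.

```python
import statistics

# Hypothetical test scores, invented for illustration only.
test_scores = [52, 70, 72, 75, 78, 80, 83, 85, 88, 99]

# Quartiles Q1, Q2 (median), Q3. The "inclusive" method is one of several
# conventions; the course may compute quartiles differently.
q1, q2, q3 = statistics.quantiles(test_scores, n=4, method="inclusive")

# Interquartile range: the length of the box in the boxplot.
iqr = q3 - q1

# Cutoffs from the IQR method.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

# Any observation outside the cutoffs is flagged as an outlier.
outliers = [x for x in test_scores if x < lower_cutoff or x > upper_cutoff]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Cutoffs:", lower_cutoff, upper_cutoff)
print("Outliers:", outliers)
```

Changing the largest or smallest score and rerunning shows the robustness idea as well: the median and IQR barely move, while the cutoffs decide whether the changed score is flagged.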

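The two-dice results in the Probability and Probability Distribution sections can be checked by listing all 36 outcomes. The sketch below assumes two fair six-sided dice; the variable names are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum.
sum_counts = Counter(a + b for a, b in outcomes)

# Probability of each sum as an exact fraction (count / 36).
distribution = {
    total: Fraction(count, len(outcomes))
    for total, count in sorted(sum_counts.items())
}

for total, prob in distribution.items():
    print(f"P(X = {total}) = {prob}")

# The probabilities of a distribution must add up to one.
print("Total:", sum(distribution.values()))
```

`Fraction` prints values in lowest terms, so 6/36 appears as 1/6 and 2/36 as 1/18; these are the same numbers as in the table.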

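The card example (red or King) can be verified the same way, by building the deck and counting. This is a minimal sketch; the rank and suit labels and the helper `prob` are names made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck: 13 ranks in each of 4 suits.
ranks = [str(n) for n in range(2, 11)] + ["J", "Q", "K", "A"]
suit_colors = {"diamonds": "red", "hearts": "red", "clubs": "black", "spades": "black"}
deck = list(product(ranks, suit_colors))

# Events as sets of cards.
red = {card for card in deck if suit_colors[card[1]] == "red"}
kings = {card for card in deck if card[0] == "K"}


def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))


# Direct count of cards that are red or a King.
direct = prob(red | kings)

# General addition rule: P(R or K) = P(R) + P(K) - P(R and K).
by_rule = prob(red) + prob(kings) - prob(red & kings)

print("P(red King):", prob(red & kings))
print("P(red or King), direct count:", direct)
print("P(red or King), addition rule:", by_rule)
```

The sets `red - kings`, `red & kings`, and `kings - red` correspond to the three regions inside the Venn diagram circles.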