Section 1.1 The Structure of Data
Statistics: Unlocking the Power of Data Lock5

Outline
● Data
● Cases and variables
● Categorical and quantitative variables
● Explanatory and response variables
● Using data to answer a question

Why Statistics?
● Statistics is all about DATA
○ Collecting DATA
○ Describing DATA: summarizing, visualizing
○ Analyzing DATA
● Data are everywhere! Regardless of your field, interests, lifestyle, etc., you will almost definitely have to make decisions based on data, or evaluate decisions someone else has made based on data

Data
● Data are a set of measurements taken on a set of individual units
● Usually data are stored and presented in a dataset, made up of variables measured on cases

Cases and Variables
A case or unit is a subject/object in the study that we want information on.
A variable is any characteristic that is recorded for each case.
Ex: If I ask 5 students what their favorite flavor of ice cream is:
There are 5 cases, the “5 students”
The variable is “favorite flavor of ice cream”

Intro Statistics Survey Data

Categorical versus Quantitative
● Variables are classified as either categorical or quantitative:
• A categorical variable divides the cases into groups
• A quantitative variable measures a numerical quantity for each case

Categorical and Quantitative
Classify each of the following variables from the StudentSurvey data as either categorical or quantitative:
Year in School
Gender
HigherSAT (which is higher: Math or Verbal?)
SAT score
GPA
# of siblings
Height
Weight
Exercise
Hours of TV per week
Pulse rate
Award preference (Olympic Gold, Academy Award, or Nobel Prize?)
[Table with columns: Categorical, Quantitative]

Categorical variables
● Ordinal variables: measurements have a meaningful order.
Ex: Letter grade: A, B, C, D, F
Patient condition: Good, Fair, Serious, Critical
● Nominal variables: measurements are unordered.
Ex: Gender: male, female
Eye color: blue, brown, green, black

Quantitative variables
● Discrete variables: they take on only finite or countably infinite “isolated” values.
Ex: number of siblings, number of dogs
● Continuous variables: they take on any value in an interval.
Ex: height, weight, etc.

Using Data to Answer a Question
QUESTION: If you are romantically interested in someone, should you be obvious about it, or should you play hard to get?
Let’s Collect Some Data!

Romance
What type of person are you generally more romantically interested in?
(a) Someone who is obviously into you
(b) Someone who plays hard to get

One or Two Variables
● Sometimes we are interested in one variable, as in whether people prefer obvious romantic interest or hard to get
● Other times we are interested in the relationship between two variables, such as
1) prefer obvious interest or hard to get?
2) gender

Explanatory and Response
If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.
The variable that helps us understand is the explanatory variable. The variable we predict is the response variable.
Examples:
● Does meditation help reduce stress?
● Does sugar consumption increase hyperactivity?

Summary
● Data are everywhere, and pertain to a wide variety of topics
● A dataset is usually made up of variables measured on cases
● Variables are either categorical or quantitative
● Data can be used to provide information about essentially anything we are interested in and want to collect data on!

Section 1.2 Sampling from a Population

Outline
● Sample versus Population
● Statistical Inference
● Sampling Bias
● Simple Random Sample
● Other Sources of Bias

Sample versus Population
A population includes all individuals or objects of interest.
A sample is all the cases that we have collected data on (a subset of the population).
Statistical inference is the process of using data from a sample to gain information about the population.

The Big Picture
[Diagram: Population → Sampling → Sample; Sample → Statistical Inference → Population]

Dewey Defeats Truman?
● The newspaper with this headline was published before the conclusion of the 1948 presidential election, and was based on the results of a large telephone poll which showed Dewey sweeping Truman
● However, Harry S. Truman won the election
● What went wrong?

Sampling Bias
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.
● If sampling bias exists, we cannot trust generalizations from the sample to the population

Sampling
[Diagram: Population, Sample, Sample]
GOAL: Select a sample that is similar to the population, only smaller

Random Sampling
● How can we make sure to avoid sampling bias?
Take a RANDOM sample!
● Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample
● More often, we use technology

Random Sampling
● Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama
● In the actual election, 53% voted for Obama
● Random sampling is a very powerful tool!!!

Random vs Non-Random Sampling
● Random samples have averages that are centered around the correct number
● Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number
● Only random samples can truly be trusted when making generalizations to the population!

Bowl of Soup Analogy
Think of tasting a bowl of soup…
● Population = entire bowl of soup
● Sample = whatever is in your tasting bites
● If you take bites non-randomly from the soup (if you stab with a fork, or prefer noodles to vegetables), you may not get a very accurate representation of the soup
● If you take bites at random, only a few bites can give you a very good idea of the overall taste of the soup

Simple Random Sample
In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample
● More complicated random sampling schemes exist, but will not be covered in this course

Realities of Sampling
● While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.
● Sometimes, your population of interest has to be altered to something more feasible to sample from.
Generalization of results is limited to the population that was actually sampled from.
● In practice, think hard about potential sources of sampling bias, and try your best to avoid them

Non-Random Samples
Suppose you want to estimate the average number of hours that students spend studying each week. Which of the following is the best method of sampling?
(a) Go to the library and ask all the students there how much they study
(b) Email all students asking how much they study, and use all the data you get
(c) Give a clicker question in this class and force every student to respond
(d) Stand outside the student center and ask everyone going in how much they study

Bad Methods of Sampling
● Sampling units based on something obviously related to the variable(s) you are studying
○ Sampling only students in the library when asking how much they study, or sampling only students taking a statistics class
○ “Today’s Poll” on fitnessmagazine.com asked “Have you ever hired a personal trainer?” 27% of respondents said “yes”. Can we infer that 27% of all humans have hired a personal trainer?

Bad Methods of Sampling
● Letting your sample consist of whoever chooses to participate (volunteer bias)
● People who choose to participate or respond are probably not representative of the entire population
○ Emailing or mailing the entire population, and then making conclusions about the population based on whoever chooses to respond
○ Example: An airline emails all of its customers asking them to rate their satisfaction with their recent travel

Data Collection and Bias
Population → Sampling bias? →
Sample → Other forms of bias? → DATA

Other Forms of Bias
● Even with a random sample, data can still be biased, especially when collected on humans
● Other forms of bias to watch out for in data collection:
○ Question wording
○ Context
○ Inaccurate responses
○ Many other possibilities: examine the specifics of each study!

Question Wording
● “Do you think the US should allow public speeches against democracy?”
21% said speeches should be allowed
● “Do you think the US should not forbid public speeches against democracy?”
39% said speeches should not be forbidden
Source: Rugg, D. (1941). “Experiments in wording questions,” Public Opinion Quarterly, 5, 91-92.

Question Wording
● A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”
Tax Cut: 60% Programs: 40%
● A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”
Tax Cut: 22% Programs: 78%

Summary
Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences
● This is the easiest way to instantly become a more statistically literate individual!

Section 1.3 Experiments and Observational Studies

Outline
● Association versus Causation
● Confounding Variables
● Observational Studies vs Experiments
● Randomized Experiments

Association and Causation
Two variables are associated if values of one variable tend to be related to values of the other variable
Two variables are causally associated if changing the value of the explanatory variable influences the
value of the response variable

Causal Association?
“Daily Exercise Improves Mental Performance”
The wording of this headline implies…
a) Association (not necessarily causal)
b) Causal Association
This implies that exercising daily will improve (change) your mental performance.

Causal Association?
“Want to lose weight? Eat more fiber!”
The wording of this headline implies…
a) Association (not necessarily causal)
b) Causal Association
This implies that eating fiber will cause you to lose weight.

Causal Association?
“Cat owners tend to be more educated than dog owners”
The wording of this headline implies…
a) Association (not necessarily causal)
b) Causal Association
There is no claim that owning a cat will change your education level.

TVs and Life Expectancy
Should you buy more TVs to live longer?
Association does not imply causation!

Confounding Variable
A third variable that is associated with both the explanatory variable and the response variable is called a confounding variable
• A confounding variable can offer a plausible explanation for an association between the explanatory and response variables
• Whenever confounding variables are present (or may be present), a causal association cannot be determined

TVs and Life Expectancy
[Diagram: Wealth is linked to both Number of TVs per capita and Life Expectancy; the link from Number of TVs per capita to Life Expectancy is marked “?”]

Experiment vs Observational Study
An observational study is a study in which the researcher does not actively control the value of any variable, but simply observes the values as they naturally exist
An experiment is a study in which the researcher actively controls one or more of the explanatory variables

Observational Studies
● There
are almost always confounding variables in observational studies
● Observational studies can almost never be used to establish causation

It’s a Common Mistake!
“The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning.” - Stephen Jay Gould

Randomization
• How can we make sure to avoid confounding variables?
RANDOMLY assign values of the explanatory variable

Randomized Experiment
In a randomized experiment the explanatory variable for each unit is determined randomly, before the response variable is measured

Randomized Experiment
● The different levels of the explanatory variable are known as treatments
● Randomly divide the units into groups, and randomly assign a different treatment to each group
● If the treatments are randomly assigned, the treatment groups should all look similar

Randomized Experiments
● Because the explanatory variable is randomly assigned, it is not associated with any other variables. Confounding variables are eliminated!!!
[Diagram labels: Confounding Variable, RANDOMIZED EXPERIMENT, Explanatory Variable, Response Variable]

Randomized Experiments
● If a randomized experiment yields a significant association between the two variables, we can establish causation from the explanatory to the response variable
Randomized experiments are very powerful! They allow you to infer causality.

How to Randomize?
● Option 1: As with random sampling, we can put all the names/numbers into a hat, and randomly pull out names to go into the different groups
● Option 2: Put names/numbers on cards, shuffle, and deal out the cards into as many piles as there are treatments
● Option 3: Use technology

Knee Surgery for Arthritis
Researchers conducted a study on the effectiveness of a knee surgery to cure arthritis. It was randomly determined whether people got the knee surgery. Everyone who underwent the surgery reported feeling less pain. Is this evidence that the surgery causes a decrease in pain?
(a) Yes
(b) No
Need a control or comparison group. What would happen without surgery?

Control Group
● When determining whether a treatment is effective, it is important to have a comparison group, known as the control group
● It isn’t enough to know that everyone in one group improved; we need to know whether they improved more than they would have improved without the surgery
● All randomized experiments need either a control group, or two different treatments to compare

Knee Surgery for Arthritis
● In the knee surgery study, those in the control group received a fake knee surgery. They were put under and cut open, but the doctor did not actually perform the surgery. All of these patients also reported less pain!
● In fact, the improvement was indistinguishable between those receiving the real surgery and those receiving the fake surgery!
Source: “The Placebo Prescription,” NY Times Magazine, 1/9/00.

Placebo Effect
● Often, people will experience the effect they think they should be experiencing, even if they aren’t actually receiving the treatment.
● This is known as the placebo effect
● One study estimated that 75% of the effectiveness of anti-depressant medication is due to the placebo effect
● For more information on the placebo effect (it’s pretty amazing!) read The Placebo Prescription

Placebo and Blinding
● Control groups should be given a placebo, a fake treatment that resembles the active treatment as much as possible
● Using a placebo is only helpful if participants do not know whether they are getting the placebo or the active treatment
● If possible, randomized experiments should be double-blinded: neither the participants nor the researchers involved should know which treatment the patients are actually getting

Green Tea and Prostate Cancer
● A study was conducted on 60 men with PIN lesions, some of which turn into prostate cancer
● Half of these men were randomized to take 600 mg of green tea extract daily, while the other half were given a placebo pill
● The study was double-blind: neither the participants nor the doctors knew who was actually receiving green tea
● After one year, only 1 person taking green tea had gotten cancer, while 9 taking the placebo had gotten cancer

Green Tea and Prostate Cancer
A difference this large is unlikely to happen just by random chance. Can we conclude that green tea really does help prevent prostate cancer?
(a) Yes
(b) No
Good randomized experiments allow conclusions about causality.

Types of Randomized Experiments
● Randomizing cases into different treatment groups is called a randomized comparative experiment
● We can also give each treatment to each case, and just randomize the order in which treatments are received: matched pairs experiment
● Both are valid randomized experiments!

Why not always randomize?
● Randomized experiments are ideal, but sometimes not ethical or possible
● Often, you have to do the best you can with data from observational studies
● Example: research for the Supreme Court case as to whether preferences for minorities in university admissions help or hurt the minority students

Randomization in Data Collection
Was the sample randomly selected?
Yes: Possible to generalize to the population
No: Should not generalize to the population
Was the explanatory variable randomly assigned?
Yes: Possible to make conclusions about causality
No: Cannot make conclusions about causality

Two Fundamental Questions in Data Collection
[Diagram: Population → Random sample??? → Sample → Randomized experiment??? → DATA]

Randomization
● Doing a randomized experiment on a random sample is ideal, but rarely achievable
● If the focus of the study is using a sample to estimate a quantity for the entire population, you need a random sample, but do not need a randomized experiment (example: election polling)
● If the focus of the study is establishing causality from one variable to another, you need a randomized experiment and can settle for a non-random sample (example: drug testing)

Summary
● Association does not imply causation!
● In observational studies, confounding variables almost always exist, so causation cannot be established
● Randomized experiments involve randomly determining the level of the explanatory variable
● Randomized experiments prevent confounding variables, so causality can be inferred
● A control or comparison group is necessary
● The placebo effect exists, so a placebo and blinding should be used

http://xkcd.com/552/
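
Worked example: random sampling with technology (Section 1.2)
Section 1.2 says that, rather than drawing names from a hat, we more often use technology to take a random sample. The sketch below is one possible way to do that in Python. The population of 1,000 students and their study hours are made-up numbers, and the helper name simple_random_sample is hypothetical, chosen only for this illustration. It also contrasts the random sample with a non-random "library" sample to illustrate sampling bias.

```python
import random
import statistics

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)  # the seed only makes the example reproducible
    return rng.sample(units, n)

# Hypothetical population: weekly study hours for 1,000 students (made-up numbers).
data_rng = random.Random(1)
hours_by_student = {student_id: data_rng.uniform(0, 30) for student_id in range(1000)}

# Random sample: 50 student IDs drawn "out of the hat".
random_ids = simple_random_sample(list(hours_by_student), 50, seed=2)
random_mean = statistics.mean(hours_by_student[i] for i in random_ids)

# Non-random sample: the 50 students who study the most,
# a stand-in for surveying only students found in the library.
library_ids = sorted(hours_by_student, key=hours_by_student.get)[-50:]
library_mean = statistics.mean(hours_by_student[i] for i in library_ids)

population_mean = statistics.mean(hours_by_student.values())
print(f"population mean:     {population_mean:.1f}")
print(f"random sample mean:  {random_mean:.1f}")
print(f"library sample mean: {library_mean:.1f}")
```

In this made-up setting the library sample overstates the average by construction, while the random sample has no built-in tendency to miss in either direction. Real sampling bias is usually less extreme, but it works the same way.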
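
Worked example: randomizing treatments with technology (Section 1.3)
"How to Randomize?" lists shuffling cards and dealing them into piles, or using technology. The sketch below does the shuffle-and-deal step in Python. The participant labels and the function name assign_treatments are hypothetical. The group sizes (60 units, two treatments) mirror the green tea study, but this is only an illustration, not the procedure the researchers actually used.

```python
import random

def assign_treatments(units, treatments, seed=None):
    """Shuffle the units, then deal them into one pile per treatment."""
    rng = random.Random(seed)  # the seed only makes the example reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, unit in enumerate(shuffled):
        # Deal like cards: pile 0, pile 1, ..., then back to pile 0.
        groups[treatments[position % len(treatments)]].append(unit)
    return groups

# Hypothetical participant labels P01..P60.
participants = [f"P{i:02d}" for i in range(1, 61)]
groups = assign_treatments(participants, ["green tea extract", "placebo"], seed=0)

for treatment, members in groups.items():
    print(treatment, len(members), members[:5])
```

Because the shuffle decides who lands in which pile, nothing about a participant (age, diet, health) can systematically influence the treatment they receive, which is the point of randomization.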
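
Worked example: how unlikely is the green tea result by chance? (Section 1.3)
The green tea slides state that a 1 versus 9 split is unlikely to happen just by random chance. One rough way to check this claim is sketched below, under assumptions that are not in the slides: exactly 30 men per group, all 60 finishing the study, and the same 10 men getting cancer no matter which pill they took. Under those assumptions, the number of cancers landing in the green tea group is governed only by the random assignment, and we can count how often 1 or fewer would land there. The function name prob_at_most is hypothetical.

```python
from math import comb

def prob_at_most(k, group_size, total_cases, total_units):
    """P(X <= k), where X counts how many of total_cases fall in a group of
    group_size units chosen at random from total_units (hypergeometric)."""
    all_groups = comb(total_units, group_size)
    favorable = sum(
        comb(total_cases, x) * comb(total_units - total_cases, group_size - x)
        for x in range(k + 1)
    )
    return favorable / all_groups

# Assumed setup: 60 men, 30 per group, 10 cancers in total (1 + 9 from the slide).
p = prob_at_most(1, group_size=30, total_cases=10, total_units=60)
print(f"chance of 1 or fewer cancers in the green tea group: {p:.4f}")
```

Under these assumptions the chance comes out well under 1%, which is consistent with the slide's statement that random assignment alone is an unlikely explanation for the difference.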