Statistical Reasoning & Practice, Week 3 Notes
These 7 pages of class notes were uploaded by Monica Chang on Wednesday, September 21, 2016. The notes belong to 36-201 at Carnegie Mellon University, taught by Gordon Weinberg in Fall 2016.
Date Created: 09/21/16
Week 3

EDA for 2-Variable Data:
- 2-variable data: two measurements are recorded for each individual
- An association (relationship) exists when certain values of one variable are more likely to occur with certain values of another variable
- The explanatory variable is x
- The response variable is y

Explanatory Variable   Response Variable   EDA
Categorical            Quantitative        Display: side-by-side boxplots (MORE DETAILS BELOW); Summary: descriptive statistics
Categorical            Categorical         Display: contingency table (MORE DETAILS BELOW); Summary: conditional percents
Quantitative           Quantitative        Display: scatterplot (MORE DETAILS BELOW)
Quantitative           Categorical         Display: logistic model

Contingency Table:
For a contingency table there are different ways to compute percentages:
- 'row conditional' percentages: each cell in a row divided by that row's total
- 'column conditional' percentages: each cell in a column divided by that column's total
- 'joint' percentages: each cell in the table divided by the grand total
- 'marginal' percentages: each row total or column total divided by the grand total

Conditional percentage of employment example wording: "16% is the conditional percentage of being temp given female" (the quality following "given" is always the explanatory variable). In general, conditional percentages should be computed within each category of the explanatory variable.

Scatterplots:
Describing Scatterplots:
- Direction: positive
- Form: linear
- Strength of association: measured by the correlation coefficient
- Outliers: a few (maybe the two points that have the two largest y-values)

Correlation coefficient:
- Measures direction and strength of linear relationships between two quantitative variables
- Sample correlation coefficient denoted by R, population correlation coefficient denoted by ρ.
- Sample Pearson correlation coefficient:
  $R = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_x S_y}$
- Population Pearson correlation coefficient:
  $\rho = \frac{1}{N} \sum_{i=1}^{N} \frac{(X_i - \mu_x)(Y_i - \mu_y)}{\sigma_x \sigma_y}$
- Properties of correlation coefficient:
  o Since the correlation coefficient is calculated with means and standard deviations, it is not resistant to outliers and skewness
  o Unitless number between -1 and 1
  o The sign of the coefficient tells the direction of the relationship (+ or -)
  o If the correlation coefficient is closer to -1 or 1, the linear relationship is stronger, and if it's closer to 0, the linear relationship is weaker
  o Doesn't change if x and y are switched
  o Not affected by change in units

For 2-variable quantitative data, the correlation coefficient is not enough of a summary; we also need center and spread:
- center and spread for explanatory variable: $\bar{X}$ (sample mean of x), $S_x$ (sample standard deviation of x)
- center and spread for response variable: $\bar{Y}$ (sample mean of y), $S_y$ (sample standard deviation of y)
- measure of linear association between variables: R (sample correlation coefficient)

Predicting y from x: The Least-Squares Regression Line:
- Linear equation in statistics: $\hat{Y} = b_0 + b_1 X$
- Least-Squares Regression line (best-prediction line): $\hat{Y} = b_0 + b_1 X$
- How prediction line is calculated (actually calculated w/ software in practice):
  o Prediction line passes through $(\bar{X}, \bar{Y})$
  o Prediction line minimizes the sum of squared residuals ("least-squares")
  o slope tells you how much the predicted y changes for a one-unit increase in x, calculated as follows: slope $= b_1 = R \cdot \frac{S_y}{S_x}$
  o y-intercept tells you the predicted value of y when x is 0, calculated as follows: intercept $= b_0 = \bar{Y} - b_1 \bar{X}$
- Predictions and error:
  o to see what a model predicts a particular y value to be given a certain x, plug the x value into the best-prediction line equation
  o residual: vertical distance from point to line; it equals the observed minus the predicted: $y - \hat{y}$
  o small residuals indicate that the line is a good model of the data.

Meaning of $R^2$:
1. R measures linear direction and strength of quantitative association in a sample.
2. R is part of the formula for the slope of a regression line.
3. When you square R, it becomes a measure of explanatory power: $R^2$ (as a percentage) is the percent of variation in Y explained by (or accounted for by) the linear relationship with X (simply, "$R^2$ is the percentage of Y explained by X")

Cautions w/ correlation and regression:
- Correlation coefficients only measure linear relationships
- Do NOT extrapolate: the regression line gives reliable predictions only for values of x within the range of x values from the data used to make the model

Overview of EDA:
- 1 variable:
  o Categorical
    - Display: bar graph
    - Summary: percentages
  o Quantitative
    - Display: histogram
    - Summaries: shape + center + spread, five-number summary
- 2 variable:
  o Categorical (explanatory) & quantitative (response)
    - Display: side-by-side boxplots
    - Summaries: descriptive statistics in each group
  o Categorical (explanatory) & categorical (response)
    - Display: contingency table
    - Summaries: conditional percents
  o Quantitative (explanatory) & quantitative (response)
    - Display: scatterplot
    - Summaries: correlation coefficient, R, and the regression equation, $\hat{Y} = b_0 + b_1 X$
  o Quantitative (explanatory) & categorical (response) (we will not cover this in class)

Features of a good study:
- Sensitivity (low random variation)
  o Random error (error within sample) is expected
    - Can be reduced by increasing the sample size
    - Can be measured with mathematical probability
- Validity (reliable estimates and predictions):
  o Do not extrapolate
  o Be cautious of outliers: remove outliers or look at them separately before using summaries or the prediction equation
  o Validating linear regression model (scatterplots):
    - To make sure the relationship in the data is appropriately modeled by a linear model, use a residual plot to see if a straight line is the right fit
    - If a linear model is appropriate, the residuals should seem randomly distributed around the zero-line on a residual plot (there cannot be an obvious non-linear pattern or thickening, i.e. spread that changes along the line)
- Generalizability (no bias/systematic error)
  o A non-random tendency towards certain outcomes means there's bias
  o Systematic error (bias, shift of sample from true mean of population)
    - Cannot be reduced by increasing the sample size
    - Cannot be measured with probability
    - Eliminated with random sampling
  o "Instrument bias":
    - Physical instrument is wrongly set
    - Social instrument bias (surveys)
      o Avoid serial position effect (putting choices in a certain order)
      o Avoid unfair/confusing wording
  o "Sampling bias" (sample isn't representative of population):
    - Examples of sampling bias:
      o Voluntary response bias
      o Undercoverage bias
      o Quota sampling (intentionally manufacturing a sample to have all features of population)
        - impossible to know all features of population
        - subconscious personal biases can have an effect
    - Solution to sampling bias (probability sampling methods):
      o Simple Random Sampling (SRS): 1) everyone in the population has an equal chance of being selected 2) the chance that one person is selected is independent of any prior selection
- Causality (no lurking variables)
  o Conduct the study so that we can be confident the relationship between x and y is causal
  o Using random assignment makes a study an experiment
  o Lurking variables
    - Can confound how we understand the causation between x and y
      o Ex. Doctors ask whether patients take vitamin C supplements and how long their most recent cold was. Potential lurking variables may be health consciousness, region of residency, healthiness, etc.
    - Can reverse the direction of the relationship due to grouping (reversal paradox)
      o Ex1. Lurking variable of economic sector
      o Ex. Ecological fallacy – wrongly assuming that a relationship that exists between groups should also hold true between individuals. The lurking variable here is socioeconomic sector
      o Ex. Simpson's paradox – reversal for contingency tables where a trend exists within each group but disappears or reverses when the groups are combined
  o Avoid lurking variables (which give a false impression of causation) by using randomized assignment for the explanatory variable (randomized assignment equalizes treatment groups on average)
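
Worked example (contingency table percentages): the four kinds of percentages listed in the Contingency Table section can be sketched in a few lines of Python. The counts below are invented for illustration and are not data from the course; they are chosen only so that the conditional percentage of "temp given female" comes out to 16%, matching the example wording. The names `table`, `pct`, and the result dictionaries are hypothetical.

```python
# Hypothetical counts, invented for illustration (not data from the notes).
# Rows are the explanatory variable (sex); columns are the response (job type).
table = {
    "female": {"temp": 16, "permanent": 84},
    "male": {"temp": 10, "permanent": 90},
}

# Totals along each margin and for the whole table.
grand_total = sum(n for cells in table.values() for n in cells.values())
row_totals = {r: sum(cells.values()) for r, cells in table.items()}
col_totals = {}
for cells in table.values():
    for c, n in cells.items():
        col_totals[c] = col_totals.get(c, 0) + n


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


# Row conditional: each cell divided by its row total.
row_conditional = {
    r: {c: pct(n, row_totals[r]) for c, n in cells.items()}
    for r, cells in table.items()
}
# Column conditional: each cell divided by its column total.
column_conditional = {
    r: {c: pct(n, col_totals[c]) for c, n in cells.items()}
    for r, cells in table.items()
}
# Joint: each cell divided by the grand total.
joint = {
    r: {c: pct(n, grand_total) for c, n in cells.items()}
    for r, cells in table.items()
}
# Marginal: each row total (or column total) divided by the grand total.
marginal_rows = {r: pct(t, grand_total) for r, t in row_totals.items()}
marginal_cols = {c: pct(t, grand_total) for c, t in col_totals.items()}

print(f"temp given female: {row_conditional['female']['temp']:.0f}%")
```

Because sex is the explanatory variable here, the row conditional percentages are the ones to compare across groups.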
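
Worked example (correlation and regression): a minimal sketch of the sample correlation coefficient, the least-squares slope and intercept, the residuals, and R squared, computed directly from the formulas in these notes. The data values and the function names `correlation` and `least_squares_line` are assumptions made for illustration; in practice software does this calculation.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample Pearson correlation: sum of products of deviations / ((n-1) Sx Sy)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = R * Sy / Sx and b0 = Ybar - b1 * Xbar."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b0, b1 = least_squares_line(xs, ys)

# Residual = observed y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# R squared as the share of variation in y accounted for by the line.
sse = sum(e ** 2 for e in residuals)
sst = sum((y - mean(ys)) ** 2 for y in ys)
r_squared = 1 - sse / sst

print(f"R = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}, R^2 = {r_squared:.3f}")
```

For this data the value of `1 - sse / sst` agrees with `r ** 2`, which is the sense in which squaring R measures explanatory power. Swapping `xs` and `ys`, or rescaling either one (a change of units), leaves `correlation` unchanged, in line with the listed properties.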
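
Worked example (Simpson's paradox): the reversal described under lurking variables can be reproduced with a small made-up table. All counts, the group labels `g1` and `g2`, and the treatment labels `A` and `B` are hypothetical; the sketch only shows how conditional percentages within groups can point one way while the combined percentages point the other.

```python
# Hypothetical counts: outcomes[group][treatment] = (successes, total).
# The grouping variable plays the role of the lurking variable.
outcomes = {
    "g1": {"A": (9, 10), "B": (80, 100)},
    "g2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, total):
    """Success percentage."""
    return 100 * successes / total


# Conditional success percentages within each group.
within = {
    g: {t: rate(s, n) for t, (s, n) in by_treatment.items()}
    for g, by_treatment in outcomes.items()
}

# Success percentages after combining the groups (ignoring the lurking variable).
combined = {}
for t in ("A", "B"):
    successes = sum(outcomes[g][t][0] for g in outcomes)
    total = sum(outcomes[g][t][1] for g in outcomes)
    combined[t] = rate(successes, total)

for g, rates in within.items():
    print(g, {t: round(p, 1) for t, p in rates.items()})
print("combined", {t: round(p, 1) for t, p in combined.items()})
```

In this made-up table A has the higher success percentage inside each group, yet B has the higher percentage once the groups are pooled, because A was mostly used in the group with low success rates. Random assignment of the treatment would balance the groups on average and remove this effect.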