# Statistical Reasoning & Practice, Week 3 Notes 36-201

CMU

These 7-page class notes were uploaded by Monica Chang on Wednesday, September 21, 2016. They belong to 36-201 at Carnegie Mellon University, taught by Gordon Weinberg in Fall 2016. Since their upload, they have received 14 views.

Date Created: 09/21/16

Week 3

EDA for 2-Variable Data:

- 2-variable data: two measurements are recorded for each individual
- An association (relationship) exists when certain values of one variable are more likely to occur with certain values of another variable
- The explanatory variable is x
- The response variable is y

| Explanatory variable | Response variable | EDA |
| --- | --- | --- |
| Categorical | Quantitative | Display: side-by-side boxplots (more details below). Summary: descriptive statistics |
| Categorical | Categorical | Display: contingency table (more details below). Summary: conditional percents |
| Quantitative | Quantitative | Display: scatterplot (more details below) |
| Quantitative | Categorical | Display: logistic model |

Contingency Table:

For a contingency table there are different ways to compute percentages:

- 'row conditional' percentages: each cell in a row divided by the row total
- 'column conditional' percentages: each cell in a column divided by the column total
- 'joint' percentages: each cell in the table divided by the grand total
- 'marginal' percentages: each total along one margin divided by the grand total

Example wording for a conditional percentage (employment example): "16% is the conditional percentage of being temp given female" (the quality following "given" is always the explanatory variable). In general, conditional percentages should be conditioned on the explanatory variable.

Scatterplots:

Describing Scatterplots:

- Direction: positive
- Form: linear
- Strength of association: measured by the correlation coefficient
- Outliers: a few (maybe the two points that have the two largest y-values)

Correlation coefficient:

- Measures the direction and strength of the linear relationship between two quantitative variables
- The sample correlation coefficient is denoted by R; the population correlation coefficient is denoted by ρ
- Sample Pearson correlation coefficient: $R = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_x S_y}$
- Population Pearson correlation coefficient: $\rho = \frac{1}{N} \sum_{i=1}^{N} \frac{(X_i - \mu_x)(Y_i - \mu_y)}{\sigma_x \sigma_y}$
- Properties of the correlation coefficient:
  - Since the correlation coefficient is calculated with means and standard deviations, it is not resistant to outliers and skewness
  - Unitless number between -1 and 1
  - The sign tells the direction of the relationship (+ or -)
  - The closer the correlation coefficient is to -1 or 1, the stronger the linear relationship; the closer it is to 0, the weaker the linear relationship
  - Doesn't change if x and y are switched
  - Not affected by a change in units

For 2-variable quantitative data, the correlation coefficient alone is not enough of a summary; we also need center and spread:

- center and spread for the explanatory variable: $\bar{X}$ (sample mean of x), $S_x$ (sample standard deviation of x)
- center and spread for the response variable: $\bar{Y}$ (sample mean of y), $S_y$ (sample standard deviation of y)
- measure of linear association between the variables: R (sample correlation coefficient)

Predicting y from x: The Least-Squares Regression Line:

- Linear equation in statistics: $\hat{Y} = b_0 + b_1 X$
- Least-squares regression line (best-prediction line): $\hat{Y} = b_0 + b_1 X$
- How the prediction line is calculated (in practice, with software):
  - The prediction line passes through $(\bar{X}, \bar{Y})$
  - The prediction line minimizes the sum of squared residuals ("least squares")
  - The slope tells you how much y changes for a particular change in x. It is calculated as follows: slope $= b_1 = R \cdot \frac{S_y}{S_x}$
  - The y-intercept tells you the value of y when x is 0. It is calculated as follows: intercept $= b_0 = \bar{Y} - b_1 \bar{X}$
- Predictions and error:
  - To see what the model predicts y to be for a given x, plug the x value into the best-prediction line equation
  - Residual: the vertical distance from a point to the line. It equals the observed value minus the predicted value: $y - \hat{y}$
  - Small residuals indicate that the line is a good model of the data

Meaning of $R^2$:

1. R measures the direction and strength of the linear association between two quantitative variables in a sample.
2. R is part of the formula for the slope of a regression line.
3. When you square R, it becomes a measure of explanatory power. $R^2$ (as a percentage) is the percent of variation in Y explained by (or accounted for by) the linear relationship with X (simply, "$R^2$ is the percentage of Y explained by X").

Cautions w/ correlation and regression:

- Correlation coefficients only measure linear relationships
- Do NOT extrapolate: the regression line gives reliable predictions only for values of x within the range of x values in the data used to make the model

Overview of EDA:

- 1 variable:
  - Categorical. Display: bar graph. Summary: percentages
  - Quantitative. Display: histogram. Summaries: shape + center + spread, five-number summary
- 2 variable:
  - Categorical (explanatory) & quantitative (response). Display: side-by-side boxplots. Summaries: descriptive statistics in each group
  - Categorical (explanatory) & categorical (response). Display: contingency table. Summaries: conditional percents
  - Quantitative (explanatory) & quantitative (response). Display: scatterplot. Summaries: correlation coefficient, R, and the regression equation, $\hat{Y} = b_0 + b_1 X$
  - Quantitative (explanatory) & categorical (response) (we will not cover this in class)

Features of a good study:

- Sensitivity (low random variation)
  - Random error (error within the sample) is expected
    - Can be reduced by increasing the sample size
    - Can be measured with mathematical probability
- Validity (reliable estimates and predictions)
  - Do not extrapolate
  - Be cautious of outliers: remove outliers or look at them separately before using summaries or the prediction equation
  - Validating a linear regression model (scatterplots):
    - To make sure the relationship in the data is appropriately modeled by a linear model, use a residual plot to see if a straight line is the right fit
    - If a linear model is appropriate, the residuals should appear randomly distributed around the zero line on the residual plot (there cannot be an obvious non-linear pattern or thickening)
- Generalizability (no bias/systematic error)
  - A non-random tendency towards certain outcomes means there is bias
  - Systematic error (bias: a shift of the sample away from the true mean of the population)
    - Cannot be reduced by increasing the sample size
    - Cannot be measured with probability
    - Eliminated with random sampling
  - "Instrument bias":
    - Physical instrument is wrongly set
    - Social instrument bias (surveys)
      - Avoid the serial position effect (putting choices in a certain order)
      - Avoid unfair/confusing wording
  - "Sampling bias" (the sample isn't representative of the population):
    - Examples of sampling bias:
      - Voluntary response bias
      - Undercoverage bias
      - Quota sampling (intentionally manufacturing a sample to have all features of the population)
        - It is impossible to know all features of the population
        - Subconscious personal biases can have an effect
    - Solution to sampling bias (probability sampling methods):
      - Simple Random Sampling (SRS): 1) everyone in the population has an equal chance of being selected; 2) the chance that one person is selected is independent of any prior selection
- Causality (no lurking variables)
  - Conduct the study so that we are confident there is a causal relationship between x and y
  - Using random assignment makes a study an experiment
  - Lurking variables
    - Can confound how we understand the causation between x and y
      - Ex. Doctors ask whether patients take vitamin C supplements and how long their most recent cold lasted. Potential lurking variables include health consciousness, region of residency, healthiness, etc.
    - Can reverse the direction of the relationship due to grouping (reversal paradox)
      - Ex. 1: the lurking variable is economic sector
      - Ex. Ecological fallacy: wrongly assuming that a relationship that exists between groups should also hold true between individuals. The lurking variable here is socioeconomic sector
      - Ex. Simpson's paradox: a reversal in contingency tables, where a trend that exists within certain groups disappears or reverses when the groups are combined
  - Avoid lurking variables (which can give a false impression of causation) by using randomized assignment for the explanatory variable (randomized assignment equalizes the treatment groups on average)
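
The correlation and least-squares formulas in the notes above can be checked by hand on a tiny data set. The sketch below is only an illustration: the `hours` and `scores` values are made up, and the function names are hypothetical helpers, not anything from the course software. It computes R from its definition, then the slope and intercept from R, the standard deviations, and the means, and finally the residuals and R squared.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of products of deviations / ((n - 1) * Sx * Sy)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 denominator)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = R * Sy / Sx and b0 = y_bar - b1 * x_bar."""
    b1 = pearson_r(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


# Made-up illustrative data (not from the course).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = pearson_r(hours, scores)
b0, b1 = least_squares_line(hours, scores)

predicted = [b0 + b1 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]  # observed - predicted
r_squared = r ** 2  # share of the variation in y explained by the line

print(f"R = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"prediction line: y-hat = {b0:.2f} + {b1:.2f} * x")
print("residuals:", [round(e, 2) for e in residuals])
```

Two of the listed properties are easy to confirm with the same helper: `pearson_r(scores, hours)` gives the same value as `pearson_r(hours, scores)`, and rescaling `hours` to minutes leaves R unchanged.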
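
The four kinds of contingency-table percentages described in the notes above differ only in what each count is divided by. As a minimal sketch, assume a hypothetical 2x2 table of counts, chosen so that one row reproduces the "16% temp given female" wording; the counts are not the class data.

```python
# Hypothetical counts: rows are the explanatory variable (sex),
# columns are the response variable (job type).
table = {
    "female": {"temp": 16, "permanent": 84},
    "male": {"temp": 10, "permanent": 90},
}

columns = list(next(iter(table.values())))
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Express part/whole as a percentage."""
    return 100 * part / whole


# Row conditional: each cell divided by its row total.
row_conditional = {
    row: {col: pct(n, row_totals[row]) for col, n in cells.items()}
    for row, cells in table.items()
}
# Column conditional: each cell divided by its column total.
column_conditional = {
    row: {col: pct(n, col_totals[col]) for col, n in cells.items()}
    for row, cells in table.items()
}
# Joint: each cell divided by the grand total.
joint = {
    row: {col: pct(n, grand_total) for col, n in cells.items()}
    for row, cells in table.items()
}
# Marginal: each margin total divided by the grand total.
marginal_rows = {row: pct(t, grand_total) for row, t in row_totals.items()}
marginal_cols = {col: pct(t, grand_total) for col, t in col_totals.items()}

print("temp given female:", row_conditional["female"]["temp"], "%")
print("female given temp:", round(column_conditional["female"]["temp"], 1), "%")
print("female and temp (joint):", joint["female"]["temp"], "%")
print("temp (marginal):", marginal_cols["temp"], "%")
```

Because sex is the explanatory variable here, the row conditional percentages are the ones to compare across groups.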
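
The reversal described under Simpson's paradox can be reproduced with a few invented counts. In the sketch below every number is made up for illustration: treatment A has the higher success rate inside each group, yet B looks better once the groups are combined, because the two treatments were applied to the groups in very different proportions (the grouping variable acts as the lurking variable).

```python
# Made-up counts: (successes, total) for each treatment within each group.
counts = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, total):
    """Success rate as a percentage."""
    return 100 * successes / total


# Conditional success rates within each group.
within = {
    group: {arm: rate(*cell) for arm, cell in arms.items()}
    for group, arms in counts.items()
}

# Success rates after collapsing over the grouping variable.
combined = {}
for arm in ("A", "B"):
    successes = sum(counts[group][arm][0] for group in counts)
    total = sum(counts[group][arm][1] for group in counts)
    combined[arm] = rate(successes, total)

for group, rates in within.items():
    print(group, {arm: round(r, 1) for arm, r in rates.items()})
print("combined", {arm: round(r, 1) for arm, r in combined.items()})
```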
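
Random sampling and random assignment, as discussed in the notes above, do different jobs: the first decides who gets into the study, the second decides which treatment each participant receives. The following is a rough sketch using Python's standard `random` module; the population list, the seed, and the group sizes are all hypothetical choices made for the example.

```python
import random

rng = random.Random(36201)  # arbitrary fixed seed so the example is repeatable

# Hypothetical sampling frame of 100 individuals.
population = [f"student_{i}" for i in range(1, 101)]

# Simple random sample: every individual has the same chance of being chosen.
# random.sample draws without replacement, so nobody is selected twice.
sample = rng.sample(population, k=10)

# Random assignment: shuffle the sampled individuals, then split them into
# two treatment groups, so the groups are balanced on average.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:5], shuffled[5:]

print("sample:   ", sample)
print("treatment:", treatment)
print("control:  ", control)
```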
