Exam Review 1
Exam Review 1 STAT 121
This 16 page Study Guide was uploaded by Amanda Berg on Friday September 25, 2015. The Study Guide belongs to STAT 121 at Brigham Young University taught by Dr. Christopher Reese in Fall 2015. Since its upload, it has received 179 views. For similar materials see Principles of Statistics in Statistics at Brigham Young University.
Date Created: 09/25/15
Know these definitions

Confounding: when you can't determine whether the response variable is a result of the explanatory variable or of a lurking variable. The effects of 2 explanatory variables can't be separated because of how the study takes place. Confounding only exists because the study was designed and conducted badly.

Distribution: what the values of the data are and how often they occur.

Interquartile range (IQR): Q3 - Q1 (quartile 3 minus quartile 1). A measure of variability used when considering outliers and skewed data.

Lurking variable: a variable that affects the relationship between the variables being studied, but the study doesn't take it into account.

Median (M): the middle value of the data, meaning that 50% of the data has a lower value and 50% has a higher value. Not affected by outliers; should be used when interpreting skewed data.

Mean (x-bar): the mathematical average of the data. It is highly influenced by outliers: the mean follows the tail. Should be used when analyzing symmetric, mound-shaped data.

First quartile (Q1): 25% of the data has lower values and 75% has higher values. Find it by first finding the median and then finding the median of the lower half of the data.

Third quartile (Q3): 75% of the data has lower values and 25% has higher values. Find it by first finding the median and then finding the median of the upper half of the data.

Standard deviation (s): measures the variability of data in relation to the mean. In symmetric, mound-shaped data, 68% of the data falls within 1 standard deviation of the mean (34% higher and 34% lower), 95% falls within 2 standard deviations, and 99.7% falls within 3. This is called the 68-95-99.7 rule.

Explanatory variable: what happens first in time. It may or may not have an effect on another variable (the response variable). In the scientific method it is called the "independent variable". Note: an observational study can have an explanatory variable, but a valid experiment always has an explanatory variable.

Response variable: the variable that happens second in time. It may or may not be affected by the
explanatory variable, but this effect is what is being studied. It may or may not be a number.

Slope: how ŷ is affected when x increases by 1. In the regression line formula ŷ = a + bx, the slope is b. For example, if b = 4, ŷ increases by 4 every time x increases by 1.

Y-intercept: the value of ŷ when x = 0. In the regression line formula ŷ = a + bx, the y-intercept is a.

r²: the percentage of total variation in the response variable Y that is explained by the explanatory variable X. It is a share of variation, not the chance that y changes when x changes.

Least-squares regression line: the line with the smallest sum of squared residuals on a scatterplot comparing two quantitative variables. That is, the line equation that fits the data best. The formula for the line is ŷ = a + bx, where ŷ is the predicted y value (response variable), a is the y-intercept, b is the slope, and x is the given value of the explanatory variable.

Residual: the difference between the actual y value and the predicted y value ŷ. If y > ŷ the residual is positive, whereas if y < ŷ the residual is negative. You find the residual by finding the actual value on the scatterplot and finding ŷ on the regression line at the same value of x. Subtract y - ŷ and that is the residual.

Simpson's paradox: when including the lurking variable causes us to rethink the direction of an association.

Know how to use in context

Distributions: what values the variable takes and how often those values occur.
[Figure: pie chart of body image, with categories about right, underweight, and overweight]
In this example the variable we are studying is body image: how each individual views his or her body. This variable is categorical. The data values are underweight, overweight, and "about right". The underweight value occurs 110 times, or 9.2% of the time (how often it occurs). This means the distribution is 9.2% underweight.

Mean: Use the mean (x-bar) as a measure of center when the graph is symmetrical and mound-shaped. If data is skewed, the mean follows the tail.
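The definitions of mean, median, Q1, Q3, and IQR given above can be tried out in a few lines of Python. This is only a rough sketch: the data set is made up for illustration, the function name `summarize` is hypothetical, and the quartiles use the "median of each half" method from the definitions (software packages often use slightly different quartile conventions).

```python
from statistics import mean, median

def summarize(values):
    """Return mean, median, Q1, Q3 and IQR using the median-of-halves method."""
    data = sorted(values)              # always order the data first
    n = len(data)
    half = n // 2
    lower = data[:half]                # values below the median position
    # With an odd count, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    q1, q3 = median(lower), median(upper)
    return {
        "mean": mean(data),
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Made-up, right-skewed data: one large value (20) forms the tail.
skewed_right = [1, 2, 2, 3, 3, 3, 4, 20]
stats = summarize(skewed_right)
print(stats)
```

In this made-up data set the mean (4.75) lands well above the median (3.0): the mean follows the tail, while the median and the IQR barely notice the 20.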
[Figure: symmetric, right-skewed, and left-skewed distributions with the mean and median marked]

Median: Use the median (M) as a measure of center when the graph is skewed. The median is the middle value of the data and is not affected by outliers. You find the median by locating the value where 50% of the data is greater and 50% is lesser. ALWAYS MAKE SURE YOU ORDER THE DATA before you start calculating the median. For example, if the data contains the values 5, 14, 24, 4, 7, 56, 32, 11, 12, 9, you would calculate the median by first ordering the data: 4, 5, 7, 9, 11, 12, 14, 24, 32, 56. Then find the middle value of the data. Since there are 10 values, you find the average of the 2 middlemost values: (11 + 12)/2 = 11.5. If the data has an odd number of values, the median is just the middle value. On a histogram you can find the median by counting how high the bars go and finding the approximate value where half of the values are higher and half are lower.

Standard Deviation
[Figure: mound-shaped curve showing the areas within 1, 2, and 3 standard deviations of the mean]
Use the 68-95-99.7 rule. If the mean of a set of data is 5 and the standard deviation is 3, and Professor Reese asks you what percentage of the data is less than 8, you can find that value by adding 50% to 68%/2. The reason is that 8 is one standard deviation higher than the mean: 50% of the data lies below the mean, and another 34% lies between the mean and one standard deviation above it. The answer to the question is 84%.

r²: https://www.youtube.com/watch?v=lng4ZgConCM

Slope: How the value of y changes when x increases by 1 unit. For example, if the graph looks like this:
[Figure: scatterplot with a downward-sloping regression line]
The slope of the line is -1.1. Therefore, if the given x value increases by 1, the predicted y value is EXPECTED to decrease by 1.1. What does this mean in context? The slope of the regression line indicates the direction of the linear relationship. Knowing that the slope is -1.1, we know that the direction of the relationship is negative. This means that when x
increases, y is expected to decrease. Remember the word "expected": this is a regression line, so not all the data will line up perfectly. It isn't for sure that y will decrease as x increases; it is just expected to do so. In the formula ŷ = a + bx, the slope is b.

Y-intercept: The y-intercept is the predicted y value when x = 0. Sometimes this value doesn't make sense, but it is important in the creation of the line. For example, if we use the regression line in the previous example, we can see that when x = 0 the value of y is 14. In the formula ŷ = a + bx, the y-intercept is a.

Be able to identify

Confounding
[Figure: bar chart of SAT math scores by country of origin (Other vs. US)]
In this data one could conclude that the cause of an individual's SAT math score is their country of origin. However, there are other variables that are not being taken into account (lurking variables), such as the educational level of those who took the test. For example, if the sample of international students were elite students planning on coming to the USA to study, but you just used a sample of regular high schoolers from the US without elite education, there is a lurking variable that isn't being accounted for. The varying level of education between samples is something you knew about when choosing the study, but this variable isn't being taken into account, and the study was therefore not performed very well. There may be a relationship between country of origin and SAT score, but because the sample of the population wasn't chosen well, there is an extra variable (education level) whose relationship to the response variable can't be distinguished from the actual explanatory variable's relationship to the response variable. The relationship is confounded.

Potential lurking variable: Often you see graphs that have strong linear relationships between the explanatory and response variables, and you decide that there is a relationship between the two variables. However, often there are more variables that are not being taken into account
that may affect the strength of the relationship between the explanatory and response variables. A child is riding on a plane, and every time the seatbelt sign comes on the ride becomes super bumpy. The explanatory variable is the seatbelt sign and the response variable is bumpiness. Of course, we know that when a seatbelt sign comes on on a plane, it is because there is turbulence. However, this child comes to think that the seatbelt sign is the cause of the bumpiness. There is a lurking variable that the child is unaware of affecting this relationship: turbulence.

Explanatory Variable: The explanatory variable is the one expected to cause a change in the other variable being studied. It happens first in time. For example, say we are studying how someone's gender affects how good they are at math. The explanatory variable is the gender, because it is explaining how good the person is at math. On a linear regression line, the explanatory variable is x.

Response Variable: The response variable is the one that may be affected by the explanatory variable. It occurs second in time. For example, if we are studying how someone's gender affects how good they are at math, the response variable is how good the person is at math. On a linear regression line, the response variable is y (ŷ is the predicted y value).

Outliers on a Scatterplot: Outliers that fall away from the pattern of the data weaken the correlation: they pull the correlation coefficient r toward 0.
[Figure: a scatterplot with an outlier, and the same scatterplot with the outlier removed]
The relationship becomes more linear (r becomes closer to 1 and farther from 0) when the outlier is removed. This works for negative relationships as well: if the relationship is negative and there is an outlier, r is closer to 0. When the outlier is not there, r becomes closer to -1.

Extrapolation: Never ever ever ever EVER predict a ŷ value outside of the range of x values given, because this is called extrapolation. You shouldn't do this because you don't know if the
linear relationship continues or changes. If on the exam the given x values range from 1 to 9 and they ask you to plug the value of 16 into the regression line formula, DON'T DO IT.

Association between 2 categorical variables in 2-way tables: https://community.oli.cmu.edu/jcourse/workbook/activity/page?context=667d64980a1dac30418c721daa27203c

Shape of a Distribution
[Figure: example histograms labeled uniform (symmetrical), skewed right, symmetrical, skewed left, and bimodal (symmetrical)]
The symmetrical graph in this example is also mound-shaped.

Be able to describe

Shape, center, and spread of a histogram or boxplot
Shape: see directly above.
Center: The value of the data where 50% of the values are less and 50% are more. In skewed data, the center is the median. In symmetrical data, the center is the mean. On a boxplot this value is indicated by the bar in the middle of the box.
Spread: From Q1 - 1.5 × IQR to Q3 + 1.5 × IQR. This excludes potential outliers. On a boxplot the spread is the values between the ends of the whiskers outside of the box (minimum to maximum, not including outliers).

Direction, form, and strength of a relationship between two quantitative variables in a scatterplot
[Figure: scatterplots illustrating direction (positive vs. negative association), form, and strength (weak to stronger)]

Be able to check for outliers

On a histogram:
[Figure: histogram with an outlier marked]

A data set, using the IQR rule: The data includes the values 4, 24, 1, 8, 9, 66, 22, 18.
First, order the data: 1, 4, 8, 9, 18, 22, 24, 66.
Second, find Q1 and Q3. Q1 is the median of the lower half (1, 4, 8, 9), so Q1 = (4 + 8)/2 = 6. Q3 is the median of the upper half (18, 22, 24, 66), so Q3 = (22 + 24)/2 = 23.
Third, calculate the IQR: 23 - 6 = 17.
Fourth, calculate Q1 - 1.5 × IQR and Q3 + 1.5 × IQR: 6 - 25.5 = -19.5 and 23 + 25.5 = 48.5.
Anything outside of the range of -19.5 to 48.5 is an outlier. Therefore 66 is an outlier.

A boxplot:
[Figure: boxplot with annotations for 1.5 × IQR from Q1 and for the maximum]

Be able to state

What the mean tells us versus what the median tells us: The mean is the mathematical average of the data, whereas the median is the exact middle value.

When the mean should be used as the center versus when the median should be used as the center: The mean should be used when the graph is symmetrical and mound-shaped with no outliers. The median should be used if the graph is skewed or has outliers.

What the standard deviation tells us: The standard deviation tells us the approximate spread from the mean on a graph that is symmetric and mound-shaped. Pretty much, it tells us how far values typically fall from the mean. This can help us understand how far away a particular value is from the mean in relation to other values.

What the IQR tells us: The IQR is calculated by subtracting Q3 - Q1. It measures the spread of data that is skewed or has outliers. It contains the middle 50% of the data.

When the standard deviation should be used to measure variability (spread) versus when the IQR should be used: Standard deviation should be used when the graph is mound-shaped and symmetrical without outliers, whereas IQR should be used when the graph is skewed and/or has outliers.

Be able to compare

Boxplots: https://community.oli.cmu.edu/jcourse/workbook/activity/page?context=667d643d0a1dac300c07b16ef37ba695

Stemplots: A stemplot is pretty much a sideways histogram, but with every value represented. Compare 2 stemplots just like you would 2 histograms.

Estimate mean, median, Q1, Q3, IQR, and standard deviation from a histogram or boxplot

Estimate the mean from a histogram or boxplot: The mean always follows the tail or outlier on skewed data and is the center of symmetrical data. If the data is skewed right, for example, the mean will likely be slightly to the right as well. The more skewed the graph gets, the farther to the right the mean will go. The same goes for a boxplot. A larger area of a box means a larger spread
aka the data is skewed in that direction. This means the mean will be in that general vicinity.

Estimate the median from a histogram or boxplot: It's very easy to sight the median on a boxplot, because the median is the line down the middle of the box. On a histogram, the median can be found by adding up the heights of the bars and figuring out at approximately which value the number of values is split in half (50% larger and 50% smaller).

Estimate Q1 and Q3 from a histogram or boxplot: On a boxplot, Q1 and Q3 are the outer edges of the box, Q1 being the lower value and Q3 being the higher value. On a histogram, you should find the median first and then find the median of the lower half of the data to come up with Q1; the median of the upper half of the data will be Q3. This is because 50/2 = 25, so you end up with 25% of the data being smaller than Q1 and 25% being larger than Q3.

Estimate the IQR from a histogram or boxplot: The IQR is the difference between Q3 and Q1, so in order to find the IQR you must first find Q1 and Q3 (see above). When you have these values, subtract Q1 from Q3, and the number you come up with is the IQR. For a boxplot, the IQR is the difference between the two edges of the box: take the value of the upper edge of the box (Q3) and subtract the value of the lower edge of the box (Q1).

Estimate the standard deviation from a histogram or boxplot: First of all, you will not use the standard deviation for a boxplot. For a histogram, find the interval around the mean that holds the middle 68% of the data (34% above the mean and 34% below), and then divide the width of that interval by 2. The answer will be the standard deviation.

Compare the median and mean for a boxplot or histogram: The median is always the middle value, and it is not affected by outliers or skewness. The mean follows the tail. If you have a graph that looks like this:
[Figure: histogram titled "Skewed Left Distribution", with frequency on the vertical axis]
The mean will be more to the left than the median. If the graph were reversed (right-skewed), the
mean would be on the opposite side, again following the tail. The median will be closer to the peak of the graph than the mean will be. If the data is symmetrical and mound-shaped, the mean and median will be the same or pretty close to the same.

Compare the standard deviations of two data sets without computing: Standard deviation is a measure of variability, or spread. The larger the spread, the larger the standard deviation. In this example the first graph is more spread out, so it has a larger standard deviation than the second graph.

Determine the effect of outliers on mean, median, standard deviation, and IQR: Outliers affect the mean: the mean changes to become closer to the outlier. Outliers also affect the standard deviation. Because it measures spread, and outliers increase the spread of the data, the standard deviation increases when outliers are present. Outliers do not affect the IQR or median, because the median is the center value of the data and the IQR covers only the middle 50% of the data.

Use the 68-95-99.7 rule for symmetric distributions:
https://community.oli.cmu.edu/jcourse/workbook/activity/page?context=667d645a0a1dac301d1f3f4a616ace4d (what is the standard deviation rule)
https://community.oli.cmu.edu/jcourse/workbook/activity/page?context=667d645c0a1dac304c038f1cd107aff5 (examples of how to use in context)

Determine whether data are categorical or quantitative: Categorical data are qualitative, meaning they are not useful as ordered numbers (gender, opinion, etc.). Quantitative data are ordered, numbered values (price, height, yield, etc.).

Determine which graphs you should use for categorical vs. quantitative data: For categorical data, use a bar graph. A stupid idea would be to use a pie chart, but it's possible. Just don't do it, because they're stupid and hard to read. For quantitative data, use a histogram, dotplot, stemplot, or boxplot.

Estimate the value of "r" from a scatterplot: r is the correlation coefficient, which is a measure of how linear the data is. The more linear and positive, the closer to 1 r will be. The
more linear and negative, the closer to -1 r will be. The less linear in either direction (positive or negative), the closer to 0 r will be.
[Figure: scatterplots illustrating different values of r]

Determine the effects of outliers on the correlation coefficient: Outliers have a very large effect on the correlation coefficient. When outliers are present and not in line with the data, r becomes closer to 0 because the data becomes less linear. However, when outliers are present and in line with the data, the correlation coefficient may stay the same or even become closer to 1 or -1, depending on the direction of the relationship.
[Figure: a scatterplot with an outlier, and the same scatterplot with the outlier removed]

Know how to compute

Mean, median, Q1, Q3, IQR: see above.

Mode: find the value with the highest frequency of occurrences. On a histogram or bar graph, find the highest bar. That is the mode.

r² from r and vice versa: Well, this is easy. Take r, square it, and it's r². Take r², find the square root, and give it the same sign as the slope of the relationship (positive or negative). It's r.

Residual: You find the residual by finding the actual value on the scatterplot and finding ŷ on the regression line at the same value of x. Subtract y - ŷ and that is the residual.

Conditional percentages of categorical variables from two-way tables: https://community.oli.cmu.edu/jcourse/workbook/activity/page?context=667d64980a1dac30418c721daa27203c

Know facts about

Standard deviation, regression and correlation, correlation coefficient r: see above.
Cautions for regression and correlation: see lurking variables, confounding, and extrapolation above.

Know what these symbols represent

r: correlation coefficient; measures how linear a relationship is on a scatterplot.
r²: the percentage of total variation in the response variable Y that is explained by the explanatory variable X. It is a share of variation, not the chance that y changes when x changes.
Q1, Q3: first and third quartiles (see above).
x-bar: mean.
M: median.
SD: standard deviation.

Know these formulas

ŷ = a + bx: the least-squares regression line. ŷ is the predicted y value of the response
variable, a is the y-intercept, b is the slope, and x is the given value of the explanatory variable.
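To see the regression formula, the residual, and the extrapolation warning working together, here is a small Python sketch. Everything in it is an assumption made for illustration: the intercept and slope, the observed x range, the data points, and the function names `predict` and `residual` are all made up and do not come from any exam problem.

```python
# Hypothetical least-squares line y-hat = a + b*x (made-up numbers).
A = 2.0                 # y-intercept: predicted y when x = 0
B = 4.0                 # slope: expected change in y-hat when x goes up by 1
X_MIN, X_MAX = 1, 9     # assumed range of the x values used to fit the line

def predict(x):
    """Return y-hat for x, refusing to extrapolate outside the observed x range."""
    if not X_MIN <= x <= X_MAX:
        raise ValueError(f"x = {x} is outside {X_MIN} to {X_MAX}: that is extrapolation")
    return A + B * x

def residual(x, y):
    """Return actual y minus predicted y (positive when the point is above the line)."""
    return y - predict(x)

print(predict(3))          # predicted value at x = 3
print(residual(3, 15.0))   # point above the line: positive residual
print(residual(5, 20.5))   # point below the line: negative residual
```

Calling `predict(16)` here raises an error instead of returning a number, which mirrors the rule about never predicting outside the range of the given x values.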