Week 2 notes
STAT 201 003
These 7-page class notes were uploaded by Amanda Berg on Saturday, September 19, 2015. The notes belong to STAT 121 at Brigham Young University, taught by Dr. Christopher Reese in Fall 2015. Since upload, they have received 68 views.
Date Created: 09/19/15
Notes Week 2

Lesson 4: Numerical Measures to Summarize the Distribution of Quantitative Variables
1. Measures of center
   a. Mode: value corresponding to a "peak" of the distribution
      i. Most common value (value with the highest frequency)
   b. Median: middle value
      i. If n (the number of data values) is odd, the median is the middle value
      ii. If n is even, the median is the mean of the two middle values
      iii. Denoted by the symbol "M"
      iv. 1/2 of the histogram area (data values) is to the right and 1/2 to the left
      v. Median is always a number even if the data shows [...]
      vi. You can only find the median when the data are ordered
   c. Mean: center of gravity (average) of the histogram
      i. On a skewed graph, the mean follows the tail
      ii. Calculated by summing the values, then dividing the sum by the number of values
      iii. Denoted by x̄ (x-bar)
   d. Mean, mode and median can be the same on a symmetric, mound-shaped graph
2. Should we use the mean or the median?
   a. Use the median when the graph is skewed, because it is "resistant" to long tails and outliers
      i. Examples: home prices and salaries
   b. Use the mean if the graph is roughly symmetric
3. Measures of spread
   a. IQR rather than range
      i. Why not range?
         1. Highly affected by outliers
         2. Only measures overall spread
      ii. What is the IQR?
         1. Range occupied by the middle 50% of the data, in numbers
            a. If you have 100 individuals studied, the IQR would contain the middle 50 individuals
            b. 3rd quartile - 1st quartile
            c. If small relative to the range: highly clustered data set
            d. If large relative to the range: less clustered data set
            e. Resistant to outliers
            [Figure: the bottom 25% of the data, the middle 50% of the data, and the top 25% of the data]
            f. Quartiles
               [Figure: quartiles split the data into four parts, each holding 25% of the values: Q1 = first quartile, Q2 = second quartile, Q3 = third quartile]
               1. Q1 is the median of the smallest half of the observations
               2. Q3 is the median of the largest half of the observations
               3. Q2 is the median of the data
                  1. When n is odd, the median is not included in either the bottom or top half of the data
                  2. When n is even, the data are naturally
                     divided into 2 halves
            g. Outliers
               i. Values that are not consistent with the rest of the distribution
                  1. Sometimes difficult to judge
                     1. That's why we define outliers as values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
                        [Figure: any observation falling in one of these regions will be considered a suspected outlier]
                  2. Reasons to keep/remove outliers
                     i. Keep if the distribution is long-tailed and the value is legitimate
                     ii. Remove if the values were produced under different conditions than the rest of the data
                     iii. Remove (or correct, if possible) if the value is a mistake or typo
   b. Standard deviation: "average distance" from the mean

Vocabulary
Clustering: large amounts of data in one area
Minimum: smallest number in the data
Outlier: a piece of data that is more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1
x-bar: symbol for the mean
Mean: numerical average of the data, found by adding all of the data and dividing by the number of individuals
Maximum: largest number in the data
Range: maximum - minimum
Median: middle value of the data
Summation: all of the values added together
Resistant: unaffected. For example, the median is resistant to outliers, whereas the mean is not
Mode: the value with the most data. The peak of the graph
Q1, Q3, IQR (interquartile range): see notes

Lesson 5: Numerical Measures to Summarize the Distribution of Quantitative Variables, Part 2
1. 5-number summary
   a. Median, range and IQR are determined by 5 numbers
      i. Minimum
      ii. Q1 (1st quartile)
      iii. Median (2nd quartile): the center, find it first
      iv. Q3 (3rd quartile)
      v. Maximum
   b. Complete numerical description from the 5-number summary
      i. Center: median
      ii. Spread
         1. Overall: max - min
         2. Clustering: Q3 - Q1
      iii. Shape
         1. Median - Q1 versus Q3 - median
         2. Median - minimum versus maximum - median
         3. If lower numbers contain more data, the graph is skewed right
         4. If higher numbers contain more data, the graph is skewed left
2. Boxplots
   a. Represent the density of the data by the thickness of boxes and the length of lines
   b. How is it made?
      i. Central box contains the interquartile range
      ii. Line in the box marks the median
      iii. Right whisker extends from the box
           to the largest non-flagged value (no outliers)
      iv. Left whisker extends from the box to the smallest non-flagged value (no outliers)
      v. Flagged values (outliers) are marked by asterisks
      vi. A boxplot can be horizontal or vertical
      [Figure: boxplot of average January temperatures in SLC]
      For this data: Q1 = 25, median = 29, Q3 = 33, max = 45, flagged value (minimum) = 11, line at 13 (1.5 × IQR below Q1: 25 - 1.5 × 8 = 13)
   c. Examples of questions to be asked about boxplots
      i. What is the median January temperature in SLC?
      ii. What is the first quartile of average January temperatures in SLC?
      iii. About what percent of years have average January temperatures above freezing?
   d. A major advantage: easy comparison of several distributions using side-by-side boxplots
      [Figure: side-by-side boxplots for married females, single females, married males and single males; question: which group has the largest spread?]
      The answer is D because it spans the longest out of all the data. The least spread is married males.
3. Standard deviation as a measure of spread
   a. What is standard deviation?
      i. Single measure that responds to both aspects of spread
         1. Overall spread
         2. Clustering
      ii. Some facts about standard deviation that will help you interpret it
         1. Does not only measure clustering
         2. Can be 0
         3. Has the same units as the data
         4. Is not resistant to outliers
         5. Should be paired with the mean
         6. Should be used when the data are symmetric and mound-shaped
            a. Should not be used when the data are skewed or there are outliers
               i. The median and IQR should be used in that case, not the mean and SD
   b. 68-95-99.7 rule
      i. For symmetric, mound-shaped distributions
         1. Approximately 68% of the data falls within 1 SD of the mean
         2. Approximately 95% falls within 2 SD
         3. Approximately 99.7% falls within 3 SD
      [Figure: 68-95-99.7 rule on a mound-shaped curve]

Vocabulary
IQR: interquartile range, Q3 - Q1. The middle 50% of the data lies in this range
Boxplot: graphical representation of the 5-number summary. Used to determine the shape and spread of the distribution
2-number summary: mean and
   standard deviation
5-number summary: min, Q1, median, Q3, max
S: standard deviation
Side-by-side boxplots: used to compare data sets that use the 5-number summary
Standard deviation: average distance of the values from the mean. When the value is small, the data are closer to the mean. When it is large, the data are farther from the mean
Standard deviation rule (68-95-99.7): about 68% of the data falls within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. This only applies when the data are symmetric and mound-shaped
Variance: variation in the data values
Deviation from mean: movement away from the mean. When the data have more deviation from the mean, they are farther away from it
Flagged value: a value that is more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1. Can be considered an outlier

Lesson 6: Examining Relationships for Pairs of Variables
1. When examining 2 variables, we look for a relationship
   a. 2 variables measured on the same individual are called bivariate data
2. Relationships
   a. 2 variables for each individual
   b. We want to investigate the relationship between the variables using visual displays and numerical summaries
3. Goals for relationships
   a. Characterize the relationship
   b. Predict one variable from another
   c. Investigate a cause-effect relationship
      i. We normally want this, but it is often not achievable
4. Explanatory/response variables
   a. Used if prediction or cause-effect analysis is the goal
   b. Explanatory: happens first in time; used to predict or explain changes in the response
      i. In the scientific method, called the independent variable
   c. Response: happens second in time; the outcome of the study
      i. In the scientific method, called the dependent variable
   d. Explanatory = x, response = y
   [Table: role-type grid crossing categorical and quantitative explanatory variables with categorical and quantitative response variables]
5. Role-type classification
   a. In order to figure out how to analyze a relationship, we must determine the role-type classification
6. C -> Q
   a. Categorical explanatory variable
   b. Quantitative response variable
   c. Visual display tool: side-by-side boxplots
   d. Numerical summary tool: 5- or 2-number summary for each category

Vocabulary
2-variable data: studying 2 variables on one individual
Explanatory variable (X): the first variable to occur in time. In cause-effect relationships, the explanatory variable attempts to explain the data findings for the response variable
Response variable (Y): the second variable to occur in time. In cause-effect relationships, the response variable is what happens because the explanatory variable changes
Role-type classification (explanatory -> response): shows what kind of value each variable is, categorical or quantitative
Side-by-side boxplots: used to compare data sets that use the 5-number summary
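
Worked example (not from the lecture): the 5-number summary, the 1.5 × IQR flagging rule and the 2-number summary from Lessons 4 and 5 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. The function names and the `temps` list are made up for illustration; the data are hypothetical values chosen so that the summary matches the boxplot numbers quoted in the notes (Q1 = 25, median = 29, Q3 = 33), and they are not the real SLC temperatures. The quartiles follow the rule in the notes (median of each half, leaving the median out of both halves when n is odd); software packages often use slightly different quartile rules. The notes give no formula for standard deviation, so the sample version (`statistics.stdev`, dividing by n - 1) is assumed here.

```python
from statistics import mean, median, stdev


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Needs at least 2 values so that both halves are non-empty.
    """
    data = sorted(values)          # the median requires ordered data
    n = len(data)
    half = n // 2
    lower = data[:half]            # smallest half of the observations
    # When n is odd, the median itself is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def flag_outliers(values):
    """Return (low fence, high fence, flagged values) for the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = [v for v in values if v < low_fence or v > high_fence]
    return low_fence, high_fence, flagged


# Hypothetical data (an assumption, not the real SLC temperatures).
temps = [11, 24, 25, 27, 28, 29, 30, 32, 33, 36, 45]

print(five_number_summary(temps))   # (11, 25, 29, 33, 45)
print(flag_outliers(temps))         # (13.0, 45.0, [11])

# 2-number summary: mean and sample standard deviation (n - 1 assumed).
print(mean(temps), stdev(temps))
```

With these values the IQR is 33 - 25 = 8, so the fences sit at 25 - 12 = 13 and 33 + 12 = 45. The minimum, 11, falls below 13 and is flagged; the maximum, 45, sits exactly on the upper fence and is not flagged, so the right whisker would reach 45 and the left whisker would stop at 24.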