This 16 page Class Notes was uploaded by Orval Funk on Monday September 28, 2015. The Class Notes belongs to STAT101 at University of Pennsylvania taught by A.Buja in Fall. Since its upload, it has received 18 views. For similar materials see /class/215428/stat101-university-of-pennsylvania in Statistics at University of Pennsylvania.
Date Created: 09/28/15
STAT 101, Module 2: Numerical Summaries of Variables

Questions one wants to quantify

If we are to examine data about Penn students (PennStudents.JMP), one might ask:
o How many students are 19 years old? What fraction of the total are they? Are they fewer or more than the 18-year-old students?
o On average, how tall are male and female students? How spread out are the heights? How strong is the overlap between male and female heights?

Numerical versus Graphical Summaries

o Graphical methods allow us to
   o see the data as a whole,
   o discover unexpected facts.
o Numerical summaries give us
   o simplicity, by condensing a lot of data into few numbers,
   o precision, for example when comparing groups,
   o ways to reason about uncertainty (stay tuned).
Neither replaces the other.

Numerical Summaries according to Variable Type

o Textbook: part of Chap. 3
o Qualitative variables: how many in each group?
   o Counts/Frequencies
   o Proportions
o Quantitative variables:
   o Measures of Location: Where is the data? Mean, Median, Quantiles, Minimum, Maximum
   o Measures of Dispersion: How wide is the data? Standard Deviation, Range, Interquartile Range

Qualitative Variables: Counts/Frequencies and Proportions

o The following example is from the data PennStudents.JMP. Age is used as an ordinal variable.

   Frequencies
   Level    Count    Prob
   18       128      0.32821
   19       139      0.35641
   20        70      0.17949
   21        33      0.08462
   22        14      0.03590
   23         4      0.01026
   24         2      0.00513
   Total    390      1.00000
   N Missing: [not recoverable]
   7 Levels

o The barplot gives a good comparison of the frequencies across the age groups.
o The table gives a list of exact counts and proportions ("Prob" in JMP).

Count = Frequency (synonyms)
Proportion = Count / Total = Fraction
Percentage = Proportion × 100

Algebraic notation:
   \( n_i \) = count of the i-th label
   \( p_i \) = proportion of the i-th label
Example above: \( n_1 = 128,\ n_2 = 139,\ \dots,\ n_7 = 2 \) and \( p_1 = 0.328,\ p_2 = 0.356,\ \dots,\ p_7 = 0.005 \), where label 1 is 18, label 2 is 19, ...

Terminology: Level = label = group name. Example: JMP reports 7 levels for Age.

JMP: To reproduce the above output, you need to convert the quantitative variable Age to qualitative before you do Analyze >
Distribution. The conversion is done as follows: Right-click the label Age above the Age column > Modeling Type > Ordinal.

Quantitative Variables: Measures of Location and Dispersion

o Again, the following example is from PennStudents.JMP.

   HEIGHT

   Quantiles
   100.0%   maximum    80.000
    99.5%              76.180
    97.5%              75.000
    90.0%              73.500
    75.0%   quartile   71.000
    50.0%   median     67.500
    25.0%   quartile   65.000
    10.0%              63.000
     2.5%              60.000
     0.5%              57.478
     0.0%   minimum    57.000

   Moments
   Mean             67.754103
   Std Dev           3.9749694
   Std Err Mean      0.2012804
   upper 95% Mean   68.149836
   lower 95% Mean   67.358369
   N               390

   Ignore Std Err Mean, upper 95% Mean and lower 95% Mean in this table. Everything else will be explained in the next two bullets.

o Measures of Location: Where is the data? (Textbook: "central tendency", Sec. 3.1; ignore "population")

   o Mean = average of the values \( x_1, x_2, \dots, x_N \) in column x:
     \[ \mathrm{mean}(x) = \frac{x_1 + x_2 + \dots + x_N}{N} \]
     In the height data, the mean is reported to be 67.75.

   o Median = the middle value of the sorted values in a column if N is odd, and the average of the two middle values if N is even.
     Examples: If the values in a column are 1, 2, 3, 4, 5, the median is 3. If the values are 1, 2, 3, 4, the median is 2.5.
     In the height data, the median is reported to be 67.5.

   o Quantiles: The idea of quantiles is that they divide the values in a column roughly into, for example, 20% of values less and 80% greater. This would be called the 20% quantile. The same applies to any other percentage. Sometimes one calls, for example, the 90% quantile the "upper 10% quantile". If nothing is said to the contrary, the percentage of a quantile refers to the fraction of values that are less. Don't worry about the fine points of defining quantiles. Trust that JMP has a reasonable general definition.
     Special cases:
      - 50% quantile = median
      - 25% and 75% quantiles = lower and upper quartiles
      - 10%, 20%, ..., 90% quantiles = deciles
      - 0% quantile = minimum
      - 100% quantile = maximum
     In the height data, JMP gives us the lower and upper 0%, 0.5%, 2.5%, 10% and 25% quantiles, and the 50% quantile.

   Abbreviations: mean(Height), med(Height), max(Height), min(Height)

   Note 0: mean ≠ median.

   Move transformation properties after introducing dispersion measures. Then explain that it
is these properties that distinguish them.

   Note 1: Shifting the values of a variable. If you add a constant value to all the values in a column, that value also gets added to the location measures. For the mean this can be expressed as follows:
     \[ \mathrm{mean}(x + c) = \mathrm{mean}(x) + c \]
   Example: If you re-express degrees Celsius in degrees Kelvin, you add 273. Therefore, add 273 to the means and quantiles of degrees Celsius and you obtain the means and quantiles in degrees Kelvin: K = C + 273.

   Note 2: Rescaling the values of a variable. If you multiply all the values in a column with a constant value, the location measures also get multiplied with that value. For the mean this can be expressed as follows:
     \[ \mathrm{mean}(c\,x) = c \cdot \mathrm{mean}(x) \]
   Example: If you convert $ to €, you have to multiply with a factor 0.770831727 (2007/01/21). Therefore, multiply means and quantiles of $s with this factor to obtain the means and quantiles in €s.
   Caution: Quantiles other than the median do not strictly follow this formula when the factor c is negative. Lower quantiles become upper quantiles and vice versa. Ex.: The lower quartile becomes the upper quartile if \( c < 0 \).

   Note 3: Shifting and rescaling the values of a variable. Notes 1 and 2 can be combined.
   Example: For translating means and quantiles from degrees Celsius to degrees Fahrenheit, apply the well-known conversion formula to the means and quantiles in Celsius, and you obtain the means and quantiles in Fahrenheit: F = (9/5) C + 32.

   Problem: Make up a new measure of location.

   Notation: Because the mean is the most important measure of location, we often abbreviate it as italic m. That is, \( m = \mathrm{mean}(x) \). If more than one variable is in play and we need to indicate the variable, we may write \( m_x \) and \( m_y \). For example, we might write \( m_{Height} \) and \( m_{Weight} \).

o Measures of Dispersion: How wide is the data? (Textbook: "variability", Sec. 3.2; ignore "population")

   o Range = maximum - minimum. This is the vertical width from the topmost point to the bottommost point in the boxplot. In the height data, the range is 80 - 57 = 23.

   o Interquartile Range (IQR): IQR = upper quartile - lower quartile. This is the
vertical width of the box in the boxplot. In the height data, IQR = 71 - 65 = 6.

   o Standard Deviation (s = sdev = sd = SD = std dev):
     \[ s = \sqrt{\frac{1}{N-1}\left[(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_N - m)^2\right]} \]
     where \( m = \mathrm{mean}(x) \). In the height data, s is reported to be 3.97.
     This is the most important measure of dispersion. Questions arise, however: Why squared deviations from the mean? Why a square root? Why N - 1? This will require more explanation. Stay tuned.

   Abbreviations: If we have standard deviations of more than one column, x and y say, we have to distinguish the measures of dispersion. We would then use the symbols \( s_x \) or \( s(x) \), and \( s_y \) or \( s(y) \), for the respective standard deviations. Similarly, we might use \( IQR_x \) or \( IQR(x) \), and \( IQR_y \) or \( IQR(y) \). For the height data above we could write \( IQR_{Height} = 6 \) and \( s_{Height} = 3.97 \).

   Terminology: \( s^2 \) = Variance.
   A look ahead: The variance of stock returns is used in finance as a measure of volatility or risk of stock investments. Of course, the standard deviation could serve the same purpose, and so could any other measure of dispersion, but finance math dictates the use of variances.

   Note 1: Shifting the values of a variable. If you add a constant value to all values in a column, measures of dispersion do not change. For the standard deviation this can be expressed as follows:
     \[ s_{x+c} = s_x \quad\text{or}\quad s(x + c) = s(x) \]
   Idea: The width does not depend on where the distribution is.
   Example: If you convert °C to K, the standard deviation does not change. Neither do the range nor the IQR.

   Note 2: Rescaling the values of a variable. If you multiply all values in a column with a constant value, measures of dispersion get multiplied with the absolute value of the constant. For the standard deviation this can be expressed as follows:
     \[ s_{cx} = |c|\, s_x \quad\text{or}\quad s(c\,x) = |c|\, s(x) \]
   Idea: If you double the numbers, you double the width.
   Example: If you convert $ to €, you have to multiply with a factor 0.770831727 (2007/01/21). Therefore, multiply standard deviations, ranges and IQRs of $s with this factor to obtain the standard deviations, ranges and IQRs in €s.

   Note 3: Shifting and rescaling the values of a variable. Notes 1 and 2 can again be
combined.
   Example: For translating standard deviations, ranges and IQRs from degrees Celsius to degrees Fahrenheit, multiply them with a factor 9/5.

   Problem: Make up a new measure of dispersion.

o Appendix on Standard Deviations and Variances

   \[ s^2 = \frac{1}{N-1}\left[(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_N - m)^2\right] \]

   o Q: Why is the variance not a measure of dispersion?
     A: If the values \( x_1, x_2, \dots, x_N \) are multiplied with a constant c, then \( s^2 \) gets multiplied with \( c^2 \) and not \( |c| \). For a measure of dispersion we want that doubling the values entails doubling the measure of dispersion, not quadrupling, as is the case for the variance \( s^2 \). This explains the root in the formula for s.

   o Q: Why do we divide by N - 1 and not N?
     A: The deviations from the mean, \( x_i - m \), are not independent: If we know \( x_1 - m, \dots, x_{N-1} - m \), then we know \( x_N - m \), because these values sum up to zero:
     \[ (x_1 - m) + (x_2 - m) + \dots + (x_{N-1} - m) + (x_N - m) = 0 \]
     which we can solve for \( x_N - m \). The complete answer is more technical, so take this as a hint.
     Proof of the identity:
     \[ (x_1 - m) + (x_2 - m) + \dots + (x_{N-1} - m) + (x_N - m) = (x_1 + x_2 + \dots + x_N) - Nm = Nm - Nm = 0 \]

   o Q: Why squares in the first place? Why not absolute values \( |x_i - m| \)? This would do away with the root.
     A: A simple reason is that we can do algebra with squares but not easily with absolute values. A deeper reason has to do with Pythagoras and probabilities. Stay tuned.

A Few Data Examples

o Counts and Proportions: the Titanic data

   CLASS
   Frequencies
   Level    Count    Prob
   1st       325     0.14766
   2nd       285     0.12949
   3rd       706     0.32076
   crew      885     0.40209
   Total    2201     1.00000
   N Missing 0
   4 Levels

   AGE
   Frequencies
   Level    Count    Prob
   adult    2092     0.95048
   child     109     0.04952
   Total    2201     1.00000
   N Missing 0
   2 Levels

   SEX
   Frequencies
   Level    Count    Prob
   female    470     0.21354
   male     1731     0.78646
   Total    2201     1.00000
   N Missing 0
   2 Levels

   SURVIVED
   Frequencies
   Level    Count    Prob
   no       1490     0.67697
   yes       711     0.32303
   Total    2201     1.00000
   N Missing 0
   2 Levels

   Lesson: For extreme differences in frequencies, numbers are superior to pictures. For example, we see that there are almost no children on the Titanic, but how few really? The table shows that there were 109 children, or about 5% of the
total. This would be difficult to estimate by eyeballing the bar plot.

o Measures of Location and Dispersion: CEO compensation

   Total comp opt exer ($1000)

   Quantiles
   100.0%   maximum    156168
    99.5%               59045
    97.5%               25266
    90.0%               10493
    75.0%   quartile     4412
    50.0%   median       1884
    25.0%   quartile      903
    10.0%                 508
     2.5%                 254
     0.5%                  16
     0.0%   minimum         0

   Moments
   Mean             4563.4621
   Std Dev          9235.1532
   Std Err Mean      238.37119
   upper 95% Mean   5031.0383
   lower 95% Mean   4095.8858
   N                1501

   log(Tot Comp opt exer)

   Quantiles
   100.0%   maximum    8.1936
    99.5%              7.7714
    97.5%              7.4028
    90.0%              7.0224
    75.0%   quartile   6.6461
    50.0%   median     6.2763
    25.0%   quartile   5.9577
    10.0%              5.7128
     2.5%              5.4165
     0.5%              4.4635
     0.0%   minimum    4.4e-16

   Moments
   Mean             6.3104396
   Std Dev          0.5715899
   Std Err Mean     0.0147732
   upper 95% Mean   6.3394179
   lower 95% Mean   6.2814613
   N                1497

   Lessons:

   1. The distribution of raw compensations is extremely skewed upwards. This is the reason why the mean and median are extremely different: mean(Tot Comp) = 4563, med(Tot Comp) = 1884 (both in $1000s). The median is a better measure because the mean gets pulled up by the upper extremes and is no longer a typical value. If we asked, however, how much each CEO would get if the sum of all compensations were equally redistributed among CEOs, we would have to use the mean, never mind the skewed distribution.

   2. The textbook has a measure of skewness (p. 76f), but we will not use it. Instead, we take a discrepancy between mean and median as a sign of skewness. The direction of skewness follows from the order of the two measures:
      o mean > median: skewed upwards
      o mean < median: skewed downwards
      Remember that the mean gets pulled by extreme values; the median doesn't. Therefore, the mean tells you to which side the distribution is skewed.

   3. The sdev is even more problematic than the mean for very skewed distributions: forming squares blows up the extremes even more than the raw values. By comparison, the IQR does not lose its meaning: it always tells how far the upper and lower quartiles are apart. For the raw compensations, the CEOs at the upper and lower quartile make about $4.4m and $0.9m,
respectively, with a spread of about $3.5m (IQR).

   Messages:
   o The mean and sdev are problematic for extremely skewed distributions. They are more meaningful for bell-shaped, nearly symmetric distributions.
   o The median and IQR remain meaningful for skewed distributions.

Another Appendix: Mean versus Median

Below is a physical illustration of the difference between mean and median.

   [Figure: two scales. Left, labeled "Mean": a seesaw balance. Right, labeled "Median": a platform scale.]

o The mean corresponds to the balance point of the data values on a seesaw balance as drawn on the left, assuming all data values have the same weight.
o The median requires a scale that only counts how much is left and how much is right. The scale on the right does this: the distance of the points from the balance point is irrelevant as long as they stay on the same side. The reason is that all their weights get transmitted to equal distances on either side. Old-fashioned scales are constructed like the median scale, so it doesn't matter where on the platforms one places the goods and the metal weights.

XXX To be added next time:
o \( s(x) = 0 \iff x = \text{const} \)
o \( a(x) = \mathrm{mean}(|x_i - \mathrm{med}(x)|) \), before \( s(x) \)
o use of location and dispersion measures for standardization: Z-scores, with the example of equalizing midterm scores; then remove standardization from Module 3, where it is an afterthought; introduce also the empirical rule and the normal distribution, even the normal probability plot, to have an interpretation for the SD
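
The shift and rescale properties of the location and dispersion measures in this module can be checked numerically. The following is a minimal Python sketch, not part of the course's JMP workflow. The temperature values are made up for illustration, and the helper names `sdev` and `value_range` are hypothetical. It applies the conversion F = (9/5) C + 32 to every value and compares the summaries before and after.

```python
import math
import statistics


def sdev(x):
    """Standard deviation s with the N-1 divisor, as in the formula for s."""
    m = statistics.mean(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))


def value_range(x):
    """Range = maximum - minimum."""
    return max(x) - min(x)


# Hypothetical temperatures in degrees Celsius (not from the course data).
celsius = [10.0, 15.0, 20.0, 25.0, 40.0]
# Shift and rescale every value: F = (9/5) C + 32.
fahrenheit = [9 / 5 * t + 32 for t in celsius]

m_c, m_f = statistics.mean(celsius), statistics.mean(fahrenheit)
med_c, med_f = statistics.median(celsius), statistics.median(fahrenheit)
s_c, s_f = sdev(celsius), sdev(fahrenheit)

# Location measures follow the full conversion formula (shift and factor).
print("mean:  ", m_c, "->", m_f)
print("median:", med_c, "->", med_f)

# Dispersion measures ignore the shift and only pick up the factor |9/5|.
print("sdev ratio: ", s_f / s_c)
print("range ratio:", value_range(fahrenheit) / value_range(celsius))

# The variance s^2 picks up the squared factor, which is why it is
# not a measure of dispersion in the sense of these notes.
var_ratio = s_f ** 2 / s_c ** 2
print("variance ratio:", var_ratio)

# The deviations from the mean sum to zero (up to rounding).
dev_sum = sum(t - m_c for t in celsius)
print("sum of deviations:", dev_sum)

# Negative factor (c = -1): the minimum (0% quantile) and the
# maximum (100% quantile) swap roles.
negated = [-t for t in celsius]
print("min, max of -x:", min(negated), max(negated))
```

The mean is noticeably larger than the median here because the single value 40 pulls the mean upwards, the same effect seen in the CEO compensation data. Python's built-in `statistics.stdev` also uses the N - 1 divisor, so it should agree with `sdev`.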