Elementary Statistical Methods
Elementary Statistical Methods STAT 30100
These 32 pages of class notes were uploaded by Bailey Macejkovic on Saturday, September 19, 2015. The notes belong to STAT 30100 at Purdue University, taught by Staff in Fall. Since the upload, they have received 326 views. For similar materials see /class/207948/stat-30100-purdue-university in Statistics at Purdue University.
Date Created: 09/19/15
1 of 13 ID MSTDCPCD040010b

A survey of 96 people at a popular local recreational area asked what their favorite flavor of ice cream was. The results of the survey are shown.

Survey results: [Table of the 96 individual responses. Each response is one of: rocky road, chocolate, vanilla, butter pecan, strawberry, neapolitan, cookies and cream, coffee, other.]

(a) This is an example of ___ data. Answer: categorical
(b) More people answered ___ than answered vanilla. Answer: rocky road
(c) The number of people that answered neapolitan is equal to ___. Answer: 14
(d) The percentage that answered neapolitan is equal to ___. Answer: 14.58
(e) The most frequent favorite flavor is: rocky road / chocolate / vanilla / butter pecan / strawberry / neapolitan / cookies and cream / coffee / other
(f) The least frequent favorite flavor is: rocky road / chocolate / vanilla / butter pecan / strawberry / neapolitan / cookies and cream / coffee / other
(g) The results of the survey suggest that, when buying stock, an ice cream vendor in this area should focus on which two flavors? rocky road / chocolate / vanilla / butter pecan / strawberry / neapolitan / cookies and cream / coffee

7 out of 7

Feedback
(g) You are correct.

Discussion
(a) These data are
known as categorical data because they consist of values for non-numerical 'categories'. That is, the data values are qualitative. A data point in this data set is a flavor.

(b) The first step in working with categorical data is to construct a frequency table.

Value               Frequency
rocky road          23
vanilla             20
neapolitan          14
chocolate           10
butter pecan         9
strawberry           7
cookies and cream    5
coffee               5
other                3

Once this is done, it is apparent that more people answered rocky road than answered vanilla.

(c) Referring to the frequency table, you can see that the number of people that answered neapolitan is equal to 14.

(d) The percentage that answered neapolitan is equal to the number of people that answered neapolitan divided by the total number in the survey. This percentage is 14/96 × 100 = 14.6%.

(e) Referring to the frequency table, the most frequent favorite flavor is rocky road.

(f) The least frequent favorite flavor is other.

(g) The results of the survey suggest that the two most popular favorite flavors of ice cream are rocky road and vanilla. So an ice cream vendor in this area should focus on stocking and selling ice cream of these two flavors.

3 points
2 of 13 ID MSTDCPCTD020010

The following survey was used by the Hyp Test Corporation to obtain data on the productivity of its employees.

PRODUCTIVITY SURVEY: HYP TEST CORPORATION
This survey will be used by the Hyp Test Corporation to evaluate the productivity of its employees for the purpose of developing policies that may improve the working conditions of the firm. Your answers will remain anonymous, so please respond honestly.
1. How much time did you spend at work last week?
2. How much sales did you earn at work last week?
3. How productive do you think you are?
   ( ) very productive  ( ) mildly productive  ( ) not very productive
Thank you for completing this survey.

Based on this survey, complete the statements.
(a) The data obtained from question 1 represent a ___ variable. Answer: quantitative
(b) The data obtained from question 2 represent a ___ variable. Answer: quantitative
(c) The data obtained from question 3 represent a ___ variable. Answer: categorical

3 out of 3

Feedback
(a) You are correct.
(b) You are correct.
(c) You are correct.

Discussion
Data may be quantitative or categorical. Quantitative data take values that are numbers. The answers to questions 1 and 2 in the survey will be numbers, which means that they will represent a quantitative variable. Categorical data take values from a limited number of options (categories). The answer to question 3 can take a value from a limited number of options: very productive, mildly productive and not very productive. Therefore the data obtained in question 3 will represent a categorical variable.

3 points
3 of 13 ID MSTNDMMCT010010

Match each description to its numerical measure.
(a) The measure that describes the most frequently occurring value in a sample is the ___. Answer: mode
(b) The measure that describes the average value of data in a sample is the ___. Answer: mean
(c) The measure that describes the middle value in a sample is the ___. Answer: median

3 out of 3

Feedback
(a) You are correct.
(b) You are correct.
(c) You are correct.

Discussion
Mean: The mean is calculated by summing all of the values in the sample and dividing by the sample size. A consequence of this is that the mean represents all of the data in the sample equally. An advantage of this is that introducing new values or changing existing values in the data set will result in a corresponding change in the value of the mean. In this way the mean is a very faithful measure of the center of a data set. On the other hand, extreme values can skew the mean and cause it to misrepresent a reasonable value for the center of the data set.

Median: When the data set is ordered from lowest to highest, the median is the value of the data point that is ranked in the middle. As such, the median is only representative of this one value. An advantage of this approach to measuring the center is that the median is not sensitive to extreme values, because it only represents the middle value. However, the median does not represent all of the
data points. For example, the addition of two data points to the set will not change the median, no matter what they are, as long as one is to the left of the median and the other is to the right. On the other hand, the addition of one value can change the median dramatically, because a completely different data point now occupies the center of the data.

Mode: The mode is the value that occurs most frequently in the data set. The mode, like the median, does not represent all of the data in a set equally, and it has the same weakness: the mode may change drastically with only minor changes to the data set. The mode does not necessarily represent the center of the data as much as it gives a crude representation of where the data values are concentrated. If two or more values occur with the same highest frequency, then all of these values are the modes. If no value occurs more frequently than any other value, then there is no mode.

1 point
4 of 13 ID MSTNDMMPD010050

The Mean Corporation operates out of two major cities, City A and City B. It has a head office for each city, and each office has thousands of employees. A computer competency exam is administered to all staff in each head office and the results are recorded. The CEO decides that he would like to compare the performance of the two offices. He labels the two groups of staff City A and City B and looks at their distribution of scores.

[Figure: distributions of exam scores for City A and City B]

The CEO is told that both City A and City B have the same mean score. City A is ___ consistent than City B because the standard deviation for City A is ___ for City B.

1 out of 1

Feedback
You are correct.

Discussion
The standard deviation is a numerical measure of the dispersion of data points about the mean of that data set. In order to know the exact value of the standard deviation, the values for the raw data need to be known. However, it is possible to compare the standard deviations of two populations by looking at a graph of their distributions, especially if these distributions are symmetrical
and bell-shaped. A tall and thin bell shape implies that the standard deviation is small, because the bulk of the data points do not stray far from the mean. A short and fat bell shape implies that the standard deviation is large, because the bulk of the data points are more spread out and far away from the mean.

low standard deviation: less variation, more consistency
high standard deviation: more variation, less consistency

3 points
5 of 13 ID MSTNDMDA050060

You have been following the share price of Eastern Mining Corp and have recorded the daily return as a percentage for the last 120 days. The data is presented here.

Daily returns (listing as extracted; decimal points and minus signs are missing):
124 065 182 236 01 142 198 008 034 04 183 211 063 175 18 074 098 024 22 056 109 018 226 123 194 052 099 158 28 134 044 181 198 078 147 187 199 012 029 211 194 004 067 129 015 047 031 022 191 003 09 049 167 017 023 246 29 012 185 164 031 046 016 006 242 017 472 189 15 159 328 26 195 105 081 146 022 02 011 323 01 196 281 289 042 231 076 198 038 165 006 158 064 271 183 033 199 18 212 135 021 312 179 032 234 068 078 055 141 091 026 309 184 181 009 002 197 072 14 071

(a) Find the five-number summary for the data. Answer: -4.72, 0.115, 1.02, 1.94, 3.28
(b) Identify a suspected outlier in the sample data. Answer: -4.72

3 out of 3

Feedback
(a) You are correct.
(b) You are correct.

Discussion
(a) For a set of numerical data, the five-number summary consists of:
- the smallest value
- the first quartile
- the median
- the third quartile
- the greatest value

The first step in finding these values is to rank the data in order. This data set consists of 120 points. The median, which is the middle value, will be the average of the 60th and 61st values. The first quartile will be the middle value of the bottom 60 points, so it will be the average of the 30th and 31st data values. Similarly, the third quartile will be the average of the 90th and 91st values.

Rank         1      2     ...  30    31    ...  60    61    ...  90    91    ...  119   120
Data value   -4.72  -1.64 ...  0.11  0.12  ...  0.99  1.05  ...  1.91  1.94  ...  3.23  3.28

Therefore the five-number summary is:
Minimum = -4.72
First quartile = 0.5 × (0.11 + 0.12) = 0.115
Median = 0.5 × (0.99 + 1.05) = 1.02
Third quartile = 0.5 × (1.91 + 1.94) = 1.925
Maximum = 3.28

(b) An approach to identifying potential outliers within a set of numeric data is to use the 1.5 × IQR approach, where IQR denotes the interquartile range. With this method, a data value is deemed to be a suspected outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile. In order to apply this method, you need to find the interquartile range. The interquartile range within the sample can be found in the following way:

Q1 = first quartile = 0.115
Q3 = third quartile = 1.925
IQR = interquartile range = unknown

IQR = Q3 - Q1 = 1.925 - 0.115 = 1.81

Therefore a value is a suspected outlier in this sample if it is
below Q1 - 1.5 × IQR = 0.115 - 2.715 = -2.6
or above Q3 + 1.5 × IQR = 1.925 + 2.715 = 4.64

In the data set given in the question, the value -4.72 falls below -2.6. Therefore the value -4.72 is a suspected outlier.

3 points
6 of 13 ID MSTNDMMCT020010

You are an analyst for a mining company researching a report concerning the hours that miners work per week. You requested some data for the report from the human resources department and have been emailed the following sample of hours worked per week.

From: Meg Koch
Sent: 31 January 2011 9:00 AM
To: Aiden Carter
Subject: Sample hours worked per week

Hi Aiden,
Here is a sample of hours worked per week by miners:
25.5 13.5 14.5 25 25.5 47.5 38.5 25.5 12 29 11 36 47.5
Regards,
Meg

(a) Calculate the mean hours worked per week. Give your answer rounded to 1 decimal place. Answer: 27
(b) Calculate the median hours worked per week. Give your answer rounded to 1 decimal place. Answer: 25.5
(c) Calculate the mode of the hours worked per week. Give your answer rounded to 1 decimal place. Answer: 25.5

3 out of 3

Feedback
(a) You are correct.
(b) You are correct.
(c) You are correct.

Calculation
(a) The mean hours worked per week can be calculated using the following formula:

x_i = ith value
n = number of values = 13
\bar{x} = mean value = unknown

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{25.5 + 13.5 + 14.5 + 25 + 25.5 + 47.5 + 38.5 + 25.5 + 12 + 29 + 11 + 36 + 47.5}{13} = 27.0 \text{ hours} \]

(b) The median is the middle value when the sample of hours worked is ranked in order (lowest value to highest value). The hours worked, ranked in order, are:
11, 12, 13.5, 14.5, 25, 25.5, 25.5, 25.5, 29, 36, 38.5, 47.5, 47.5

The median can be identified from this ordered list:

n = number of values = 13
(n + 1)/2 = 14/2 = position of middle value = 7
x_i = ith value, with values ranked in order
x_m = median value = unknown

x_m = x_7 = 25.5 hours

(c) The mode is the value that occurs with the greatest frequency. Here the value 25.5 occurs three times and is the most frequent. Therefore the mode is 25.5 hours.

1 point
7 of 13 ID MSTDCPND020030

Jerry has collected some data on the amount of time people spend browsing the internet per week. He surveyed 58 people and wants to begin analysing this data. Unfortunately he doesn't know much about statistics and has asked for your help.

Jerry: "What kind of graph should I use in order to see how much time people spend browsing the internet?"

You recommend that Jerry should use a: normal quantile plot / bar graph / histogram / scatter plot / time plot / pie chart

1 out of 1

Feedback
You are correct.

Discussion
The data that Jerry has collected is numerical data, since he is interested in measuring the amount of time people spend browsing the internet, and time is a numerical variable. His data then consists of 58 numbers, each number being an amount of time that the person spent browsing the internet during some specified period.

[Figure: histogram of frequency against hours]

A histogram is a graphical summary of numerical data that is constructed with reference to the relative frequency distribution of that data. The midpoint of each class is labelled on a horizontal axis and a vertical bar is drawn to represent the frequency of each class. There are no gaps between the bars in a histogram, to help distinguish it from a bar graph, which can be used to graphically represent categorical data. The histogram
that Jerry will construct may look something like the one shown here.

1 point
8 of 13 ID MSTDCPCD020030

The owner of a book store decides to investigate sales for the different genres. The store sells fiction in four genres: crime, romance, thriller and science fiction. After counting the sales over a week, the owner calculates the sales of each genre as percentages of total fiction sales. The results are presented in the following table.

Genre             Percentage
Crime             21
Romance           18
Thriller          11
Science fiction   50

Select the pie chart that represents this data.

[Figure: candidate pie charts with sections labeled Crime, Romance, Thriller and Science fiction]

1 out of 1

Feedback
You are correct.

Discussion
When collecting categorical data, you usually organize the data into a summary table first. If the quantities given for the categories are percentages, you can then draw a pie chart to visually demonstrate these percentages. The information of a summary table can be mapped directly to a pie chart: a disc is split up into different sections, each labeled with the name of a category, with the 'size' of each section determined by the percentage of values in that category. By the size of a section we mean the angle that section makes at the center of the circle. For example, if a category has 35% of the values in a set of data, then the section for that category will make an angle of 35% × 360° = 126° at the center of the pie chart.

In this question, the categories are the different genres of books sold.

Crime: The percentage of fiction books sold that are crime is 21%, so in the correct pie chart the section for 'crime' makes an angle of 21% × 360° = 75.6° at the center. The sections for the other genres can be calculated in a similar fashion. Note that since you are not given the exact angles with the diagrams in this question, the angles calculated should be used to estimate the size of each section in the pie chart.

Romance: The percentage of fiction books sold that are romance is 18%, so in the
correct pie chart the section for 'romance' makes an angle of 18% × 360° = 64.8° at the center.

Thriller: The percentage of fiction books sold that are thriller is 11%, so in the correct pie chart the section for 'thriller' makes an angle of 11% × 360° = 39.6° at the center.

Science fiction: The percentage of fiction books sold that are science fiction is 50%, so in the correct pie chart the section for 'science fiction' makes an angle of 50% × 360° = 180° at the center.

5 points
9 of 13 ID MSTDCPND050010a

You have been following the share price of Western Investment Group and have recorded the daily return as a percentage for the last 120 days. The data is presented here.

Daily returns (listing as extracted; decimal points and minus signs are missing):
0227 0241 0122 036 002 0029 0146 0461 0149 0454 0056 001 0101 0165 0239 0597 0105 0327 0255 0284 0036 0675 0106 0027 0185 0038 0117 0146 0371 0094 0018 0356 0455 0116 0149 0114 0455 0156 0376 003 0327 0088 0201 0288 0022 0107 0291 0123 005 0093 0224 0365 0061 0028 0307 0272 0017 0032 0111 0097 0101 0213 0076 0217 0359 0214 0478 0335 0098 0069 0472 0334 011 03 039 016 0121 0014 0191 0691 0268

[Data table: Daily returns, 120 values]

(a) Find the number of days that the stock made a return greater than or equal to 0.1. Give your answer as a whole number. Answer: 75
(b) Find
the number of days that the stock made a positive return. Give your answer as a whole number. Answer: 96
(c) Calculate the percentage of the total number of days that the stock made a positive return. Give your answer as a percentage. Answer: 80
(d) Consider the statement "More than half of the daily returns were greater than X". From the options presented, the greatest value that makes the above statement true is ___.
(e) The most frequent class of daily returns was: -0.2 to less than -0.1 / -0.1 to less than 0 / 0 to less than 0.1 / 0.1 to less than 0.2 / 0.2 to less than 0.3 / 0.3 to less than 0.4 / 0.4 to less than 0.5

4 out of 5

Feedback
(a) This is not correct. Number of days that the stock made a return greater than or equal to 0.1 = 77
(b) You are correct.
(c) You are correct.
(d) You are correct.
(e) You are correct.

Discussion
(a) The raw data is not very helpful in answering questions about the data. A frequency table provides a summary of the data and can be used to answer such questions more easily. The frequency table for the sample of daily returns is:

Class                      Frequency
< -0.2                      3
-0.2 to less than -0.1      7
-0.1 to less than 0        14
0 to less than 0.1         19
0.1 to less than 0.2       25
0.2 to less than 0.3       20
0.3 to less than 0.4       17
0.4 to less than 0.5       11
> 0.5                       4

From the table, adding up the frequencies from the class 0.1 to less than 0.2 and higher (25 + 20 + 17 + 11 + 4), the stock made a return greater than or equal to 0.1 on 77 days during the period of observation.

(b) Using the table, by summing the frequencies from the class 0 to less than 0.1 and higher (19 + 77), it is apparent that the stock made a positive return (that is, a return greater than 0) on 96 days during the period of observation.

(c) Therefore, since the total period of observation was 120 days, the stock made a positive return for 96/120 × 100 = 80.0% of the period.

(d) Beginning with the highest class and working backwards, you find that the top four classes, covering returns greater than 0.2, account for 52 of the 120
observed values. Including the next highest class, you find that the top five classes, covering returns greater than 0.1, account for 77 of the 120 observed values. Therefore more than half of the daily returns were greater than 0.1.

(e) From the frequency table you can see that the most frequent class of returns was 0.1 to less than 0.2.

3 points
10 of 13 ID MSTNDMDA050050

A survey asked a random sample of 27 people how many hours they had worked in the last week. The following set of sample data was produced:

37.5 24 34 36.5 29.5 48 18 39.5 44.5 35 44.5 49 30.5 34.5 28 89.5 39.5 86 41.5 29 21.5 17 17 29.5 33 21.5 21.5

(a) Identify the two suspected outliers in this sample. Answer: 86 and 89.5
(b) Calculate the sample mean found by including the two suspected outliers. Give your answer to 1 decimal place. Answer: 36.3
(c) Calculate the sample mean found by excluding the two suspected outliers. Give your answer to 1 decimal place. Answer: 32.2

3 out of 3

Feedback
(a) You are correct.
(b) You are correct.
(c) You are correct.

Calculation
(a) A common approach to identifying potential outliers within a collection of numeric sample data is to use the 1.5 × IQR approach. With this method, a sample value is deemed to be a suspected outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile. In order to apply this method, you need to find the interquartile range. The interquartile range within the sample can be found in the following way:

Q1 = first quartile = 24
Q3 = third quartile = 41.5
IQR = interquartile range = unknown

IQR = Q3 - Q1 = 41.5 - 24 = 17.5

Therefore a value is a suspected outlier in this sample if it is
below Q1 - 1.5 × IQR = 24 - 26.25 = -2.25
or above Q3 + 1.5 × IQR = 41.5 + 26.25 = 67.75

There are two values in the sample that are greater than 67.75 and are therefore suspected outliers. These are the values 86.0 and 89.5.

(b) The sample mean including the two outliers
can be found either using software or can be calculated using the following formula:

x_i = ith value
n = sample size = 27
\bar{x} = sample mean = unknown

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{979.5}{27} = 36.27777778 \approx 36.3 \text{ hours (rounded as last step)} \]

(c) The sample mean excluding the two outliers can be found either using software or can be calculated using the following formula:

x_i = ith value
n = sample size = 25
\bar{x} = sample mean = unknown

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{804}{25} = 32.16 \approx 32.2 \text{ hours (rounded as last step)} \]

Note how the two outliers bring the mean up quite significantly.

2 points
11 of 13 ID MSTNDMMVS060010

In a survey, 100 accountants were asked what they charge clients for one hour of consultation. The data are recorded in the following frequency distribution table.

Amount charged ($)                   Frequency
greater than 40 but less than 50     12
greater than 50 but less than 60     24
greater than 60 but less than 70     29
greater than 70 but less than 80     25
greater than 80 but less than 90     10

(a) Approximate the mean amount charged from this summary. Give your answer in dollars and cents to the nearest cent. Answer: 65
(b) Approximate the standard deviation from this summary. Give your answer in dollars and cents to the nearest cent. Answer: 80

0 out of 2

Feedback
(a) This is not correct. Mean ≈ 64.70
(b) This is not correct. Standard deviation ≈ 11.76

Calculation
(a) The mean can be approximated using the following formula:

m_i = midpoint of the ith class
f_i = number of values in the ith class
n = sample size = 100
\bar{x} = mean = unknown

\[ \bar{x} \approx \frac{\sum m_i f_i}{n} = \frac{45 \times 12 + 55 \times 24 + 65 \times 29 + 75 \times 25 + 85 \times 10}{100} = \frac{540 + 1320 + 1885 + 1875 + 850}{100} = \frac{6470}{100} = 64.70 \]

(b) Using either software or a calculator, the sample standard deviation can be found to be 11.76. Alternatively, it can be calculated manually as follows. The standard deviation can be approximated using the following formula:

m_i = midpoint of the ith class
f_i = number of values in the ith class
n = sample size = 100
\bar{x} = mean = 64.7
s = standard deviation = unknown

\[ s^2 \approx \frac{\sum (m_i - \bar{x})^2 f_i}{n - 1} = \frac{(45 - 64.7)^2 \times 12 + (55 - 64.7)^2 \times 24 + (65 - 64.7)^2 \times 29 + (75 - 64.7)^2 \times 25 + (85 - 64.7)^2 \times 10}{99} \]
\[ = \frac{4657.08 + 2258.16 + 2.61 + 2652.25 + 4120.9}{99} = \frac{13691}{99} = 138.29292929 \]

By taking the square root of both sides:

\[ s \approx \sqrt{138.29292929} = 11.75980141 \approx 11.76 \text{ (rounded as last step)} \]

1 point
12 of 13 ID MSTNDMDA050020

A friend of yours is analyzing data collected on the response time to emergency calls. They are unsure about statistics and have asked for your advice.

Friend: "I have got data on the response time to emergency calls for a random sample of 57 emergency responses. I've been asked to report some numbers to describe the center and spread of this data. Apparently there are several ways to do that and I am unsure which method I should use. I've looked at the data and it seems that the response times are fairly evenly spread around a time of 5 minutes. How should I best report the center and spread of this data?"

You respond to your friend, saying that in this case it is best to report the: mean only / median and interquartile range / mean and standard deviation / mean, minimum and maximum values / variance / mode and range

1 out of 1

Feedback
You are correct.

Discussion
There are several ways that you can numerically describe the center and spread of a set of data. Which way is best will depend upon the distribution of that data. The main numerical measures available for describing numerical data are mean, median, standard deviation, quartiles, and highest and lowest values. Typically, it is best to use the mean and standard deviation for data that is reasonably symmetrical and does not have any significant outliers. Otherwise, it is better to use the median and interquartile range to describe the center and spread of the data. The median is not affected by outliers or strong skewness as the mean is, and similarly for the interquartile range versus the standard deviation.

4 points
13 of 13 ID MSTDCPND050020a

[Figure: several graphs labeled by letter, including histograms of frequency on an axis from 0 to 90, a chart of frequency by group (A to E), and a stem-and-leaf plot]

Further answer options: None of the above / No graph could tell you
this information

(a) The graph that shows a distribution that is left skewed is ___.
(b) The graph from which you can easily calculate the mean is ___.
(c) The graph that can be used to identify a most frequent category is ___.
(d) The graph that shows an association between variables is ___.

2 out of 4

Feedback
(a) You are correct.
(b) This is not correct. The graph from which you can easily calculate the mean is D.
(c) This is not correct. The graph that can be used to identify a most frequent category is B.
(d) You are correct.

Discussion
Graphs are a useful way of presenting data. The types of graphs available depend upon the type of data that you are trying to display. Data can be either numerical or categorical, and this distinction determines which types of graphs you can use.

(a) Skew is a property that numerical data can have. A set of numerical data is left skewed if there are relatively few values much lower, while most of the data is concentrated relatively higher. In a histogram, this appears as some smaller bars in the left of the histogram. The graph that shows this is A.

(b) In order to calculate the mean of a set of data, you need to know all the data values. The graph that actually retains all the data values is the stem-and-leaf plot, which is graph D.

(c) A bar chart is used to conveniently display frequencies of categories in a set of categorical data. A bar chart can be used to identify a most frequent category within a set of categorical data. The graph that is a bar chart is graph B.

(d) All the graphs in this question display data for a single variable. Therefore none of the above graphs show an association between variables. There are graphs that can show an association between variables; for example, a scatter plot is one such graph.
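
The numerical answers to questions 10 and 11 can be checked with a short script. What follows is only a minimal sketch in Python, not part of the original notes. It assumes the hours in question 10 read 37.5, 24, 34, ... once the decimal points are restored, and it takes each quartile as the median of the lower or upper half of the ranked data, which is the convention the worked answers appear to use. The function names (quartiles, iqr_fences, grouped_mean_sd) are made up for illustration.

```python
from statistics import mean, median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd sample size the overall median is left out of both halves
    (an assumed convention that reproduces Q1 = 24 and Q3 = 41.5 here).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr_fences(values):
    """Return the lower and upper cut-offs of the 1.5 x IQR rule."""
    q1, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def grouped_mean_sd(midpoints, freqs):
    """Approximate the mean and sample standard deviation from class midpoints."""
    n = sum(freqs)
    m = sum(mi * fi for mi, fi in zip(midpoints, freqs)) / n
    variance = sum(fi * (mi - m) ** 2 for mi, fi in zip(midpoints, freqs)) / (n - 1)
    return m, variance ** 0.5


# Question 10: hours worked, with decimal points restored (assumption).
hours = [37.5, 24, 34, 36.5, 29.5, 48, 18, 39.5, 44.5, 35, 44.5, 49, 30.5,
         34.5, 28, 89.5, 39.5, 86, 41.5, 29, 21.5, 17, 17, 29.5, 33, 21.5, 21.5]
low, high = iqr_fences(hours)
outliers = sorted(h for h in hours if h < low or h > high)
kept = [h for h in hours if low <= h <= high]

# Question 11: class midpoints and frequencies from the accountants table.
grouped_mean, grouped_sd = grouped_mean_sd([45, 55, 65, 75, 85],
                                           [12, 24, 29, 25, 10])

print(low, high)                                     # -2.25 67.75
print(outliers)                                      # [86, 89.5]
print(round(mean(hours), 1), round(mean(kept), 1))   # 36.3 32.2
print(round(grouped_mean, 2), round(grouped_sd, 2))  # 64.7 11.76
```

Other quartile conventions (for example, interpolating between ranks) can give slightly different cut-offs on other data sets, so the choice made in quartiles should be matched to whatever convention the course or software uses.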