Introductory Statistics for Engineers
Introductory Statistics for Engineers STAT 224
These 12 pages of class notes were uploaded by Mrs. Triston Collier on Thursday, September 17, 2015. The notes belong to STAT 224 at University of Wisconsin - Madison, taught by Michael Iltis in Fall.
Date Created: 09/17/15
Week 1 Lecture 1 Introduction to Statistics

We have a population of units of population size $N$, i.e., the actual objects or subjects for which we want to collect observations. The statistical population is the population of measurements (variables of interest) corresponding to the units. Originally, statistics involved describing or summarizing numerical data graphically or in tables, the realm of descriptive statistics. Since we generally don't have the resources, nor do we want to spend the effort, to make a measurement for each unit of a very large population, more modern inferential statistics instead makes inferences, which are generalizations about the population as a whole based on a small sample of size $n$ from the population, i.e., a small subset.

Note 1: More generally we can have an infinite conceptual population, such as the possible outcomes of choosing a random real number from the interval $[0,1]$. There are infinitely many real numbers in this interval.

Note 2: For categorical data (variables having values such as the binary-valued yes/no or success/failure ones, also called Bernoulli random variables, or the ternary-valued colors red, green, or blue), it is often useful to convert them into ordinal or numeric values by ordering or assigning numbers to the categories, e.g., 1, 2, 3 instead of red, green, blue.

There are risks associated with making a false generalization, such as concluding that a hypothesis is false when it is true or vice versa, and statistics examines the likelihoods and consequences of such errors. Statistics examines the sources of variability in the data and tries to determine how much of this variability, as manifested in the trends and relationships in the data, is due to chance (this is known as experimental error) and how much is due to natural laws or actual patterns. It also looks at optimal ways to design experiments so as to minimize experimental error and determine real effects. In an experiment, controllable input variables are known as factors, and the output variables that result are known as
responses. Good experimental design seeks to cope with complexity, allowing the examination of multiple factors simultaneously and of the possible interactions between them. It also seeks to separate correlation from causation, which may or may not be present. Correlation is a measure of a linear relationship in the data. For example, data from the town of Oldenburg in Germany during the years 1930 to 1936 supports the fairy tale that storks bring babies: both the stork population $X$ and the human population $Y$, observed at the end of each of the 7 years, show a distinct linear trend in time, so that the plot of stork population versus human population is nearly a straight line. But clearly the hidden causal variable (factor) behind both of these correlated measurements is time.

If the numeric-valued variable of interest is height, and the population parameter we are trying to determine is the population average height $\mu$ of Madison, Wisconsin citizens, we might average the heights of a sample of 10 Madison residents (sample size $n = 10$) to get the sample mean $\bar{X}$ as a measure of where the data is centered or located, thus obtaining

$$\bar{X} = \frac{1}{10}\sum_{i=1}^{10} X_i = \frac{1}{10}(X_1 + X_2 + \cdots + X_{10})$$

as our estimate of the true population mean $\mu$, also called the expected value of a randomly chosen height $X_1$ and denoted by $E[X_1]$, where the capital letter E stands for expectation. The population mean formula is similar to the sample mean formula, except that we replace the sample size $n$ by (in this example) Madison's population size $N$, so

$$\mu = E[X_1] = \frac{1}{N}\sum_{i=1}^{N} x_i .$$

Notice that for a fixed population this expected value or population mean $\mu$ is not random but is instead a fixed parameter of the population. By contrast, for a randomly chosen small sample, prior to choosing the sample the sample mean is a random variable $\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$, the average of $n$ unknown heights denoted by capital letters $X_i$. Once selected, a particular sample is no longer random, so we use lower case letters $x_i$ for the $i$-th non-random height in a particular sample after
selection. Notice that the sample mean $\bar{X}$ is a kind of summary statistic of the typical or average location or center value of our sample data. Technically, a statistic is a quantity calculated from (that is, a function of) the sample observations, or of the corresponding random variables prior to selecting the random sample.

In our example of height measurements from Madison's population of roughly $N = 250000$, if we arrange the height measurements in a sample in increasing order we get the order statistics, denoted $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. In the above sum defining the population mean, rearranging the observed heights in increasing order will not change the sum. But now suppose all our height measurements have been rounded off to the nearest inch. The shortest baby and the tallest giant will likely lie somewhere between 1 foot (12 inches) and 8 feet (96 inches). Thus in this case there will only be about $K = 84$ different height values $z_k$ that the $N = 250000$ measurements can assume, which we can take arranged in increasing order, so we have

$$\mu = E[X_1] = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{1}{N}\sum_{k=1}^{K} c_k z_k = \sum_{k=1}^{K} z_k \frac{c_k}{N} = \sum_{k=1}^{K} z_k\, p(z_k), \qquad K = 84.$$

Here $c_k$ is the count (or frequency, or number) of the $x_i$'s that assume the value $z_k$ inches, and $p(z_k) = c_k/N$ is the frequency count relative to the total population $N$. Since the counts $c_k$ add up to $N$, the proportions $p(z_k)$ must add up to 1 when summed over all 84 values of $k$. For a fixed finite population as above, these relative frequencies are in fact the probabilities $p(z_k) = P(X_1 = z_k)$: in our example, the probability that a randomly selected person's height (rounded to the nearest inch) will equal the value $z_k$ is the proportion (or fraction, or relative frequency) of $z_k$ values in the population.

In the limit as the sample size $n$ goes to infinity, under suitable assumptions, the law of large numbers says that the sample mean approaches the population mean: $\bar{X} \to \mu$ as $n \to \infty$. From this it follows that the above relative frequencies converge as $n \to \infty$ to probabilities, which we can regard as limiting relative frequencies (proportions between 0 and 1 whose total sum is 1). For a
random variable $X$ that can take continuous values, the probability $P(X \in [a,b])$ that $X$ lies in the interval $[a,b]$ on the real number line, with $a \le b$, is just the area lying under the probability density function $f(x)$ for $x$ between $a$ and $b$. The uniform $U[0,1]$ distribution on the interval $[0,1]$, for example, is the one with density $f(x) = 1$ for $0 \le x \le 1$ and density 0 outside this interval. Then for $0 \le a \le b \le 1$ the probability of picking a random number between $a$ and $b$ according to this distribution is just the area under $f(x) = 1$ on the interval $[a,b]$, i.e., the area of a rectangle of height 1 and width $b - a$, which equals $b - a$.

Remarks (some further comments on probabilities): Consider an infinite conceptual population of heights in which we shrink the distance $\Delta z$ between adjacent $z$ values down to billionths of an inch and finally to zero. Writing $p(z_k) = f(z_k)\,\Delta z$ expresses the probability $p(z_k)$ of obtaining a value between $z_k$ and $z_k + \Delta z$ as the area of a rectangle of vertical height $f(z_k)$, known as the probability density function, and small width $\Delta z$. In this continuous limit, the probability that our random variable (in this example, height) takes a value between $a$ and $b$ is just the integral $\int_a^b f(z)\,dz$ of the density function from $a$ to $b$, which is the area under the density function. The total area (total probability) under the density function must be $\int_{-\infty}^{\infty} f(z)\,dz = 1$. The above sum for the population mean then becomes $\sum_k z_k f(z_k)\,\Delta z$, which turns into the integral

$$\mu = E[X] = \int_{-\infty}^{\infty} z f(z)\,dz$$

in the limit as $\Delta z \to 0$.

Note: the counts, viewed as random variables before the data has been collected, can be rewritten $C_k = \sum_{i=1}^{n} 1_{A_k}(X_i)$, where $A_k = [z_k, z_k + \Delta z)$ and the indicator function $1_{A_k}(X_i)$ equals 1 if the $i$-th random observation $X_i$ lies in the interval $A_k$ and equals 0 otherwise. Then the relative frequency $C_k/n$ is just the sample mean of the indicator random variables $1_{A_k}(X_i)$, whose population mean is the limiting probability $P(X \in [z_k, z_k + \Delta z))$ that the random variable $X$ lies in the interval $A_k$. Thus the law of large numbers justifies the relative frequency interpretation of probability.

Note on
linearity of expectations: When a discrete random variable $Z$ takes values $z_k$ with non-zero discrete probability $p(z_k) = P(Z = z_k)$, we have the definition of expectation $E[Z] = \sum_k z_k\, p(z_k)$, even when there are a countably infinite number of such values, whose probabilities must still sum to 1. For a continuous random variable with probability density $f(z)$, having zero probability of assuming any particular value, we have $\mu = E[Z] = \int z f(z)\,dz$. In either case, by the distributive properties of finite sums and of their limiting infinite sums and integrals, for any constants $a$ and $b$ and random variables $X$ and $Y$ the expectation satisfies the linearity property $E[aX + bY] = aE[X] + bE[Y]$.

Need for randomization (random sampling)

If we happen to live next door to the UW basketball team, it might be convenient for us to collect 10 heights by measuring the heights of the team members next door. But such a self-selected convenience sample is not likely to be representative of the heights of the Madison population as a whole. Rather, it is likely to be biased, i.e., not close to the actual parameter $\mu$ of interest. It is best to avoid such self-selected samples and to choose instead a random sample. In this case the sample mean $\bar{X}$ will be a random variable which is an unbiased estimator of the population mean, i.e., the expected value $E[\bar{X}]$ of the sample mean $\bar{X}$ is equal to the correct population mean: $E[\bar{X}] = \mu$.

We could, for instance, use a computer to randomly select 10 names out of the Madison phone directory and then attempt to measure the heights of these individuals or, better yet, of a randomly selected member of their family, so as to include children not listed in the phone book. Or we could use a random number table to choose random integers corresponding to a random selection of these names, after ordering the names sequentially from 1 to the number of names in the phone book.

A random sample from a finite population (say, picking a sample of size $n = 5$, a poker hand, from a population of $N = 52$ well-shuffled cards in a deck) will mean one for which each
possible sample of size $n$ chosen from $N$ is equally likely, i.e., has equal probability for fixed $n$. A random sample from an infinite population is one which is independent and identically distributed (IID). When flipping a fair coin (fair means probability $P(\text{heads}) = P(\text{tails}) = 1/2$), for instance, to say coin flips are independent means roughly that the outcome of the previous flip should not influence the next flip. We will give a more precise definition of independence later. Time series data are generally dependent (i.e., not independent) and said to be autocorrelated, since, for example, 5 consecutive days in Madison in July are likely to have similar temperatures compared to temperatures on 5 randomly chosen dates from all year round.

In the poker hand example, the individual cards, viewed as samples of size 1 when sampling without replacement, are not a random sample in the IID sense: they are neither independent of one another nor identically distributed unless we put each card back in the deck (sampling with replacement), since the first card has probability 1/52 of being chosen but the second card has probability 1/51, as there are only 51 cards left in the deck unless we put cards back.

Large sample statistics

The law of large numbers tells us that the sample mean $\bar{X}$ approaches the population mean $\mu$ for large sample size, but it does not tell us how fast it converges to it. The Central Limit Theorem (CLT) tells us more. For a sum $S_n = X_1 + X_2 + \cdots + X_n$ of a large number of independent identically distributed (IID) observations (in practice, for sample size $n \ge 30$ under assumptions which usually hold), assuming the variance $\sigma^2 = \mathrm{var}(X_1) = E[(X_1 - \mu)^2]$ exists (is finite), no matter what the probability distribution of the individual observations, the sum $S_n$ and hence the sample mean $\bar{X}$ will be approximately normally distributed. By subtracting off its mean and dividing by its standard deviation (the square root of the variance), any normally distributed random variable becomes standard normal $N(0,1)$. I.e., it is normal with mean 0 and variance 1, with density $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

Note on expectation and variance
of sums: Linearity of expectations, $E[aX + bY] = aE[X] + bE[Y]$, holds whether or not independence of the random variables $X$ and $Y$ holds, and for IID (independent identically distributed) sums gives

$$E[S_n] = E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n] = \mu + \mu + \cdots + \mu = n\mu$$

and hence $E[\bar{X}] = E[S_n/n] = E[S_n]/n = n\mu/n = \mu$, or $E[\bar{X}] = \mu$, i.e., we say the sample mean is an unbiased estimator of the population mean $\mu$.

Provided independence of $X$ and $Y$ holds, the corresponding property for variance says $V[aX + bY] = a^2 V[X] + b^2 V[Y]$, which for $a = b = 1$ gives $V[X + Y] = V[X] + V[Y]$ and for $a = 1$, $b = -1$ gives $V[X - Y] = V[X] + V[Y]$. For independent sums this says

$$V[S_n] = V[X_1 + X_2 + \cdots + X_n] = V[X_1] + V[X_2] + \cdots + V[X_n] = \sigma^2 + \sigma^2 + \cdots + \sigma^2 = n\sigma^2$$

and hence

$$\sigma_{\bar{X}}^2 = V[\bar{X}] = V[S_n/n] = \frac{V[S_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \qquad \text{or} \qquad \sigma_{\bar{X}} = SD[\bar{X}] = \frac{\sigma}{\sqrt{n}}$$

for the standard deviation of $\bar{X}$. This says the larger the sample size $n$, the smaller the variance of the sample mean $\bar{X}$ about $\mu$, i.e., the better the estimate $\bar{X}$ of $\mu$ is. This also says that for any random variable $X$ having mean $\mu$ and variance $\sigma^2$, the standardized variable $Z = (X - \mu)/\sigma$ has mean 0 and variance 1. If $X$ was normal $N(\mu, \sigma^2)$, the standardized $Z$ will again be normal $N(0,1)$. Applying this and the central limit theorem (CLT) to the approximately normal $S_n$, with mean $n\mu$ and standard deviation $\sqrt{n}\,\sigma$, the CLT result can be expressed as

$$Z = \frac{S_n - n\mu}{\sqrt{n}\,\sigma} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is approximately standard normal $N(0,1)$, with its bell-shaped probability density function, for large $n \ge 30$. In reality we usually don't know the population standard deviation $\sigma$ and have to replace $\sigma$ with the sample standard deviation $s$ in the CLT. This requires a somewhat larger sample size: $n \ge 40$ is usually OK.

Small sample statistics (sample size $n$ small) must either make strong assumptions about the distribution of the individual observations (often that each is approximately normal, since one can show that even for few terms any sum of IID normals is normal), or involve tests which are robust (not too much affected by departures from normality), or use statistical methods (so-called nonparametric methods, based on order statistics) which do not depend on the distribution of individual observations; but these methods, while more general
purpose, may be less powerful (less accurate) than other methods. If the observations $X_i$ are normal, the $Z$ in the CLT above is normal even for small $n$. Replacing $\sigma$ by $S$ in the CLT expression for $Z$ gives, for small $n$, what is called a t statistic with $n - 1$ degrees of freedom. The t distribution density is also bell shaped but wider than a normal $Z$ distribution, since $S$, being a random variable, adds variability not present in the fixed parameter $\sigma$.

Statistics as an iterative (cyclical) process of deductive and inductive learning

In statistics we start from assumptions, such as normality, which represent a probability model, and based on the model we make deductions about the world. We compare the deduced consequences of the model to the real world data and then, in the inductive step, revise the model accordingly. We repeat (iterate) the process until satisfied with the results. Said differently, we start with a prior model (i.e., before data is added) and incorporate new data, which may entail updating parameters, in the posterior model (i.e., after data is added). The old posterior then becomes the new prior in the next cycle of the process.

Chapter 1 Examples (material relevant to Chapter 1 of the text)

Example 1 (like problem 1113): To measure the average miles per gallon that a typical car in New York City gets, the variable of interest, miles per gallon, was measured for 254 randomly selected automobiles in New York City over the course of a month. Here the population of units is the set of all cars of New York City. The statistical population is the set of miles per gallon (fuel efficiency) measurements, or hypothetical measurements, associated with these cars. The random sample consists of the 254 mpg readings (sample size $n = 254$) corresponding to the 254 auto units randomly chosen.

Example 2 (like problem 1.2): A WORT FM Madison radio host wants to know who the favorite Presidential candidates are amongst eligible voters in Madison. He asks listeners to download a questionnaire from the station's website and to fill it
out online. What are the potential flaws of this self-selected method of surveying popular sentiment? The desired population of interest consists of Madison voters. The actual population surveyed consists of (1) listeners of WORT radio, whose political views may differ sharply from those of Madison residents as a whole, and (2) only those listeners who were sufficiently enthused about their candidate, or about answering a questionnaire, that they bothered to expend the effort needed to fill it out online. This last consideration may also be a source of bias. A better method might be to randomly select 100 Madison residents from the phone directory, call them, and politely ask them to answer questions about their candidate. Such a survey is not limited to WORT listeners, and hopefully, if the person conducting the poll is polite, most persons called will answer the questions without hanging up. If they do hang up, we can hope that this behavior is not related to their political views.

Example 3 (like problem 1.6), using a random number table: Of 75 restaurants in a downtown district, use Table 1.3 on p. 9 of the text to randomly select a sample of $n = 6$ restaurants for the health inspector to monitor for compliance with city food safety requirements. I'll start with row 7 and columns 15 and 16 and read down the columns. The pairs of digits thus found in the random number table are 88 (we discard this since 88 is greater than 75) and then 45, 61, 52, 75, 23, 68. Had we obtained a repeat of a given number, we would also discard such repeats until 6 different numbers between 1 and 75 are obtained. I.e., we pick 45 since in row 8 we have a 4 in column 15 and a 5 in column 16, 61 since in row 9 we have a 6 in column 15 and a 1 in column 16, etc. Thus the health inspector should visit the 6 restaurants numbered as above, with the number 45 being the first on his list.

Example 4 (like problem 1.7): Extending the 16 slot depth data in Table 1.1 on p. 4 and the x-bar chart in Figure 1.1 on p. 5, suppose two new samples of 3
ceramic parts each were measured after the machine was repaired, yielding sample 17 (215, 217, 216) having $\bar{x}_{17} = 216$, and a sample 18 having $\bar{x}_{18} = 217$. To get the new value of the mean of the sample means $\bar{\bar{x}}_{18}$, based on the two new sample mean values for a total of 18 sample mean values, we don't want to have to recompute the sums used previously. Rather, use

$$\bar{\bar{x}}_{18} = \frac{1}{18}\left(16\,\bar{\bar{x}}_{16} + \bar{x}_{17} + \bar{x}_{18}\right),$$

which recursively updates the mean of the sample means based on the old computed value and the 2 new observations only. This simple idea has a higher dimensional generalization, useful for prediction of time series, called a Kalman filter, which itself has a generalization called the extended Kalman filter.

Lecture 2 Descriptive Statistics

Measures of center or location

Given a collection of $n$ data values assigned to variables $X_1, X_2, \ldots, X_n$, the sample mean $\bar{X} = \frac{1}{n} S_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$ is commonly used as a single number description summarizing the center (or location, or central tendency) of the data, and as an estimate of the actual population parameter, namely the population mean $\mu$. Here $S_n = X_1 + X_2 + \cdots + X_n$ is the $n$-th partial sum of the variables. But other choices are possible. Another commonly used measure of location or center is the sample median $\tilde{x}$. If we rearrange the sample data in increasing order, we can assign these increasing values to the order statistics variables, denoted $\min = x_{(1)} \le x_{(2)} \le x_{(3)} \le \cdots \le x_{(n)} = \max$. Then when the number of data values $n$ is odd, the sample median is defined as the value in the middle of the ordered data, while when $n$ is even the sample median is the average of the two middle values. Equivalently, $\tilde{x} = Q_2$: the sample median is the same as the second quartile, i.e., the value 50% of the way through the ordered data.

To be precise, we use Richard Johnson's definition of the sample median (for $p = 1/2$) and of quartiles and percentiles (for other $p$). If $np = k$ is an integer (for the median, this happens when $n$ is even), then we take the average of the two values in the middle, $\tilde{x} = \text{median} = Q_2 = \frac{x_{(k)} + x_{(k+1)}}{2}$, while if $np$ is not an integer we round up to $k$ and define the second quartile as median
$\tilde{x} = Q_2 = x_{(k)}$, the $k$-th ordered value, which is the single data value in the middle of the ordered data. The other quartiles, and more generally percentiles, are defined by exactly the same procedure, except that for the first quartile $Q_1$ (the 25% mark) we take $p = 1/4 = 0.25$, while for the third quartile $Q_3$ (the 75% mark) we take $p = 3/4$, etc. Note: Other authors may use a slightly different definition of the quartiles than Johnson's. When the sample size is large it won't make much difference.

One other measure of location sometimes encountered is the mode, which is the most frequently occurring data value. In the case of a frequency distribution which is bell shaped, this value is the value on the x axis associated with the unique peak of frequency (the y coordinate), but it is possible that the mode is not unique, such as in a bimodal distribution, in which case one has two data values where the peaks of identical frequencies (i.e., heights) occur.

Visualizing data (sometimes part of what is called exploratory data analysis)

Dotplots and Pareto diagrams

A Pareto chart is a bar chart showing the largest counts of the categories that the data falls under, in decreasing order from left to right, with a possibly larger category of everything else left over placed on the far right.

Problem 2.1: Accidents at a potato chip plant are characterized by the area of the human body injured. The accident body location counts break down into fingers 17, eyes 5, arm 2, leg 1.

a) Draw a Pareto chart. I have yet to get R graphics images into pdf files, so apologies for poor bar charts.

[Bar chart: count on the vertical axis, with bars of height 17 (fingers), 5 (eyes), 2 (arm), 1 (leg), in that order from left to right.]

What percentage of accidents occur for:

b) Fingers? Note that the total accident count is $17 + 5 + 2 + 1 = 25$. Thus the fraction for fingers is $17/25 = 68\%$.

c) Fingers and eyes? $(17 + 5)/25 = 22/25 = 88\%$.

The dotplot of the 7 deviations 3, 6, -2, 4, 7, 4, 3 (observed speed minus target speed) of the cutting speed of a lathe, given in Figure 2.2 on p. 14, looks something like:

[Dotplot: one dot at -2, two dots at 3, two dots at 4, one dot at 6, one dot at 7, on an axis marked -2, 0, 2, 4, 6, 8.]

If the lathe were behaving exactly at the target
speed set by the controller, the deviations would all be zero. Ideally the deviations would be centered about zero, with half negative and half positive, but here almost all are positive, with the sample mean of the deviations being $(3 + 6 - 2 + 4 + 7 + 4 + 3)/7 = 25/7$, which is slightly more than 3 and 1/2. Thus we conclude that the lathe is running fast.

Histogram example (somewhat like problem 2.9)

As another example of a bar chart, we look at the histogram obtained from the 58 data values given in increasing order on p. 22 of the text and summarized in Figure 2.8. The picture in Figure 2.8 of the text was computer generated, but we were not told what identical width was used for each of the seven class intervals pictured. We take a guess that comes close to reproducing the picture. Namely, the intervals represented by the 7 bars of the histogram in the graph extend from slightly before the smallest data value, 664, to slightly beyond the largest data value, 753, so we will divide the interval from 663 to 754 into 7 equal class intervals, each of width $13 = (754 - 663)/7$. So that the intervals do not overlap, we follow the left-endpoint convention of including the left but not the right endpoint within each class interval. Note that the difference of the endpoint values is the width 13. Since the data are already ordered, it is easy to compute the frequency counts (i.e., the number of the 58 values listed which lie in that class interval) for each of the seven intervals so obtained.

Interval      Frequency   Cumulative   Relative frequency     Percent   Cumulative
              count       count        (fraction of total)              percent
[663, 676)        1           1             1/58                 1.7        1.7
[676, 689)        7           8             7/58                12.1       13.8
[689, 702)       16          24            16/58                27.6       41.4
[702, 715)       14          38            14/58                24.1       65.5
[715, 728)       13          51            13/58                22.4       87.9
[728, 741)        4          55             4/58                 6.9       94.8
[741, 754)        3          58             3/58                 5.2      100.0
Total            58                         1                  100

These values are not exactly as shown in the graph in Figure 2.8, but they are close. Note that the continuous bell shaped curve fitted over the discontinuous histogram bar graph
pictured in Figure 2.8 reflects a common situation: as the number of data values $n$ gets large, a bar graph histogram typically approaches some continuous curve (distribution) more and more closely.

An example of cumulative frequency is plotted in Figure 2.9 on the next page of the book, for the different set of data on sulfur oxide emissions given on p. 16 of the text. The dots are placed at the beginning of each class interval, with the height of the dot representing the cumulative frequency up to that point. The dots are connected by line segments to give the ogive, or cumulative frequency graph. Note that the cumulative counts given in the Pareto diagram in Figure 2.1 on p. 14 were drawn differently, with the dots falling in the middle of each bar, which would correspond to the so-called class mark (the value in the middle of each class interval) if we were dealing with class intervals rather than categories as in the Pareto diagram.

Stem and leaf display

A stem and leaf diagram is kind of like a histogram but provides more detailed information. If we were trying to summarize the ordered data set of 16 values

13 13 15 16 17 17 19 23 25 26 26 28 28 28 30 32

we could break off the tens column, which represents the class intervals 10 to 19, 20 to 29, 30 to 39, and describe the above data as

1 | 3 3 5 6 7 7 9     (a count of 7 values between 10 and 19)
2 | 3 5 6 6 8 8 8     (a count of 7 values between 20 and 29)
3 | 0 2               (a count of 2 values between 30 and 39)

Note that although a stem and leaf display is like a histogram turned on its side, a histogram would only plot the counts in each class interval (in this case the counts 7, 7, and 2) and would lose track of the individual data values contained in the stem and leaf display. Stem and leaf displays can also record more complicated data sets, such as

14 | 51 68 74     (the class interval here is 14 to 15)
15 | 23 34 89     (class interval 15 to 16)

for representing the data values 14.51, 14.68, 14.74, 15.23, 15.34, 15.89, or the stem and leaf display

23 | 1 3 6

for the values 231, 233, 236, etc., with class
interval 230 to 239.

Scatter plot diagram

For a final example of a graphical display, we consider the scatter plot diagram for the Space Shuttle Challenger disaster data. A scatter plot plots one variable against another and is useful for viewing correlations between data sets, which we will study in chapter 11. You can view the graph of this data on page 2 of the postscript file with link httpwwwstatdukeeduSpring03sta113Noteslec11ps and see httpwwwmathyorkucaSCSGallerymissedhtml and httpweb grinnell J 39 J39 39J labs tnnics html lt under topic categorical data.

This disaster resulted in the deaths of the seven astronauts, cost billions of dollars, and set the space program back years. The black curve is an extrapolation, from the data represented by dots, of the likelihood of O-ring failure, which at 30 degrees Fahrenheit is around 80%. It is surrounded by two other curves, which tell us that with, say, 95% probability the correct failure probability of the extrapolated curve lies between the upper and lower curves. The engineers plotted 7 data points; the y coordinate of the points represented the number of distressed O-rings, for flights with failed O-ring data. Thinking it was not informative, they ignored 17 data points (the larger part of the relevant data): the dots representing zero stressed O-rings, i.e., no O-ring failures (dots along the x axis, where y = 0 failures occurred). These dots appear underneath the U-shaped sequence of dots which represent flights that experienced 1, 2, or 3 stressed O-rings, and you will note that the x axis represents temperature. The engineers had no existing data for temperatures below 55 degrees Fahrenheit, yet the Challenger took off when the temperature was 30 degrees Fahrenheit. The temperatures at which no O-ring failures occurred were all in the range from 65 degrees to around 82 degrees Fahrenheit. To be fair, engineers were aware of and complained about the O-ring problem, but managers didn't listen.

Mark Twain, author of Huckleberry Finn and The Adventures of Tom Sawyer, once said
"Lies, Damn Lies, and Statistics." You might want to read the statistics book by this name. In the case of the Challenger, the engineers weren't exactly lying; rather, they were using a lawyer's definition of truth: they ignored more than half of the relevant data. Applied statisticians like R. Snee have told us time and time again, "In God We Trust; Others Must Have Data." The Challenger disaster emphasizes this point. Boxplots and their relatives are discussed below.

Measures of spread about the center (variation, dispersion, width)

If you told a visitor from another galaxy that the average height of an adult human on earth is 71, they might wonder if the smallest is one millimeter tall and the largest is a kilometer in height, since the mean value does not include the variability (variance) of the data.

When the center is given by the median (which, to be able to compute, requires the ordered data from smallest to largest), such variability measures include the range, $\text{range} = \max - \min = x_{(n)} - x_{(1)}$, which measures the distance between extremes. Another, also in terms of the ordered data, is the interquartile range, also known as the fourth spread $f_s$: $f_s = \text{interquartile range} = Q_3 - Q_1$, which is the distance between the 3rd quartile and the 1st quartile, i.e., between the data value 75% of the way through the ordered data and the value 25% of the way through.

Outliers: All of the quartiles, including the median, and hence also the interquartile range measure of spread, are not sensitive to outliers, i.e., to extreme values of the data not representative of the typical behavior of the data, since, for example, the latter disregards the largest and smallest 25% of the ordered data values. Of course the range is sensitive, since it uses the most extreme values. This insensitivity does not hold for the sample mean, sample variance, and sample standard deviation discussed below, which are all sensitive to outliers since they use all the data, including the outliers, in their calculation. One can however speak of a trimmed sample mean or
trimmed sample variance; for example, a 5% trimmed mean disregards (trims) the largest and smallest 5% of the ordered data in its calculation. As in the discussion of modified boxplots in the text, we can speak of a mild outlier as being more than 1.5 times but less than 3 times the interquartile range to the left of the 1st quartile or to the right of the 3rd quartile, and of an extreme outlier if it is more than 3 times this interquartile distance $f_s$ away.

Sample variance and sample standard deviation

When the center is given by the mean, spread is usually measured in terms of the sum of squares of the deviations $x_i - \bar{x}$ of the data from the sample mean. Namely, the sample variance $s^2$ with $n - 1$ degrees of freedom is defined by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{S_{xx}}{n-1}, \qquad \text{where } S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

is the sum of squared deviations from the mean of the $x$ data values. Dividing by $n - 1$, not $n$, above is needed to ensure $E[s^2] = \sigma^2$, i.e., that the sample variance $s^2$ is an unbiased estimator of the population variance $\sigma^2$. You were asked to show in problem 2.51 the alternate but equivalent computational formula

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$

Note that we could not have used the sum of the deviations themselves to define a measure of spread, since problem 2.50 of the homework shows that these deviations (also called residuals) add to zero, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, and zero is not a terribly informative measure of spread. We say this constraint eliminates one degree of freedom since, if we know the sample mean, the $n$ data values are no longer free to vary independently. We could have used the absolute value of the deviations, however, or some power of that, and this is sometimes done, but it is not as easy nor as customary to work with mathematically.

Finally, the sample standard deviation is another measure of spread, defined as the square root of the sample variance, $s = \sqrt{s^2}$. It has the advantage over the variance that if we are, say, measuring deviations of height in meters from the mean height of some height data values, the variance will be in square meters (the wrong units), whereas the standard
deviation will be in the correct units (meters) for height.

Boxplots

We plotted the neutrino data given on p. 15 and p. 36 of the text on a line and discussed the boxplot and modified boxplot given on page 36 of the text. We verified the quartile computation procedure of Johnson on this data, which illustrates the case where $np$ is not an integer. With the $n = 11$ ordered values of neutrino interarrival times

0.021  0.107  0.179  0.190  0.196  0.283  0.580  0.854  1.18  2.00  7.30

for the 1st quartile we take $p = 1/4$, so $np = 11(1/4) = 2.75$ gets rounded up to 3 and $Q_1 = x_{(3)} = 0.179$. Similarly, for the median (2nd quartile) we take $p = 1/2$, so $np = 11(1/2) = 5.5$ gets rounded up to 6 and $Q_2 = x_{(6)} = 0.283$. Finally, for the 3rd quartile, marking 75% of the ordered data, we take $p = 3/4$, so $np = 11(3/4) = 8.25$ gets rounded up to 9 and $Q_3 = x_{(9)} = 1.18$.

The rectangle (box) portion of the boxplot extends from the first to the third quartile, with a vertical line dividing the rectangle at the median (2nd quartile). A line segment extends from the minimum value 0.021 to the left side of the rectangle (1st quartile), and another line segment extends from the right side of the rectangle (3rd quartile) to the maximum value 7.30. In the modified boxplot this second line segment only extends to the second-to-last data value 2.00, since the value 7.30 is an outlier: it lies more than 1.5 times the interquartile range (fourth spread) $Q_3 - Q_1 = 1.001$ from the 3rd quartile, i.e. from the right side of the box.

Example. The data of problem 2.31 of the text illustrate the quartile calculation procedure when $np$ is an integer. Here the data set consists of the $n = 4$ deviations (observation minus specification) of critical crank-bore diameter, in ten-thousandths of an inch: -6, 1, -4, -3. Thus the ordered data are $x_{(1)} = -6$, $x_{(2)} = -4$, $x_{(3)} = -3$ and $x_{(4)} = 1$. Since for all quartiles, including the median, $p$ is a multiple of 1/4 and $n = 4$, $np$ is an integer in each case, so the procedure says to average the ordered value in position $np$ and the next one. The quartiles are

$$Q_1 = \frac{x_{(1)}+x_{(2)}}{2} = -5, \qquad Q_2 = \frac{x_{(2)}+x_{(3)}}{2} = -3.5, \qquad Q_3 = \frac{x_{(3)}+x_{(4)}}{2} = -1.$$

We will now do problem 2.31 of the text.
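The quartile rule used in the two examples above can be sketched in R, the language these notes use for computation. This is only a minimal sketch, not the textbook's own code: the function name notes_quantile and the other variable names are made up for illustration, and the data values are typed in as they read in these notes. It applies the round-up rule when np is not an integer, the averaging rule when np is an integer, and then the 1.5 and 3 times interquartile-range fences for mild and extreme outliers.

```r
# Hypothetical helper (not from the text): the quartile rule described above.
# For a proportion p strictly between 0 and 1 and n ordered values:
#   - if n*p is not an integer, round it up and take that ordered value;
#   - if n*p is an integer, average that ordered value and the next one.
notes_quantile <- function(x, p) {
  xs <- sort(x)
  np <- length(xs) * p
  if (abs(np - round(np)) < 1e-9) {
    k <- round(np)
    (xs[k] + xs[k + 1]) / 2
  } else {
    xs[ceiling(np)]
  }
}

p <- c(0.25, 0.50, 0.75)

# Neutrino interarrival times (n = 11, so n*p is never an integer here).
neutrino <- c(0.021, 0.107, 0.179, 0.190, 0.196, 0.283,
              0.580, 0.854, 1.18, 2.00, 7.30)
q_neutrino <- sapply(p, function(pp) notes_quantile(neutrino, pp))
q_neutrino                      # 0.179, 0.283, 1.18

# Crank-bore deviations (n = 4, so n*p is an integer for every quartile).
crank <- c(-6, 1, -4, -3)
q_crank <- sapply(p, function(pp) notes_quantile(crank, pp))
q_crank                         # -5, -3.5, -1

# Outlier fences for the neutrino data, measured from the box edges.
iqr <- q_neutrino[3] - q_neutrino[1]          # fourth spread, 1.001
beyond_mild <- neutrino[neutrino < q_neutrino[1] - 1.5 * iqr |
                        neutrino > q_neutrino[3] + 1.5 * iqr]
beyond_extreme <- neutrino[neutrino < q_neutrino[1] - 3 * iqr |
                           neutrino > q_neutrino[3] + 3 * iqr]
beyond_mild                     # 7.3
beyond_extreme                  # 7.3
```

With these values 7.30 lies beyond both fences, so by the definition given earlier it would count as an extreme outlier. R's built-in quantile(x, p, type = 2) is documented as averaging at discontinuities in the same way, whereas the default type = 7 interpolates and will generally give slightly different quartiles.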
(a) The sample mean is $\bar{x} = \frac{-6 + 1 - 4 - 3}{4} = \frac{-12}{4} = -3$.

(b) We compute the sample standard deviation in two ways, using the two formulas for the sample variance. First, with $n = 4$, the sample variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{3}\left[(-6-(-3))^2 + (1-(-3))^2 + (-4-(-3))^2 + (-3-(-3))^2\right] = \frac{9+16+1+0}{3} = \frac{26}{3}.$$

Then the sample standard deviation is the square root of this, $s = \sqrt{s^2} = \sqrt{26/3} \approx 2.94$. We can also use the alternate formula for the sample variance,

$$s^2 = \frac{1}{n-1}\left[\sum_{k} x_k^2 - \frac{1}{n}\Big(\sum_{k} x_k\Big)^2\right],$$

where for both sums the index $k$ ranges from $k = 1$ to $k = n$. With $n = 4$, our example gives

$$s^2 = \frac{1}{3}\left[62 - \frac{(-12)^2}{4}\right] = \frac{62-36}{3} = \frac{26}{3}, \quad \text{as before.}$$

(c) The sample mean given in part (a) above says that the average deviation is 3 ten-thousandths of an inch below the specified bore diameter, so the answer is that the bore hole is on average too small by 3/10000 of an inch.

Finally, we look at another variance computation example, which illustrates that the variance does not change when the same constant is subtracted from every data value. The yearly values for the 9 years 1985 to 1993 are as follows:

1985  1986  1987  1988  1989  1990  1991  1992  1993
  22    22    26    28    27    25    30    29    24

We transform this data set of $x$ values into a set of $y$ values having the same variance, obtained by subtracting 26 from each of the previous values, i.e. $y_i = x_i - 26$, given by

-4  -4  0  2  1  -1  4  3  -2

Using the second formula for the variance with $n = 9$ then gives

$$s_x^2 = s_y^2 = \frac{1}{8}\left[\sum_{i=1}^{9} y_i^2 - \frac{1}{9}\Big(\sum_{i=1}^{9} y_i\Big)^2\right] = \frac{1}{8}\left[67 - \frac{(-1)^2}{9}\right] \approx 8.36.$$

In R, the height data from the earlier computation could have been input by

> x <- c(61, 62, 63, 64, 64, 65, 66, 66, 67, 67, 67, 67, 67, 67, 67,
+        68, 68, 68, 68, 68, 68, 68, 68, 69, 69,
+        70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
+        71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71,
+        72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72,
+        73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
+        74, 74, 74, 74, 75, 75, 75, 75, 75, 75, 76, 77, 78, 79)

We find the mean and standard deviation of the data by the commands mean(x) and sd(x). Note that c(61, ..., 79) tells R to concatenate the data into a vector.

> mean(x)
[1] 70.70115
> sd(x)
[1] 3.389844
> hist(x, freq=F)
> curve(dnorm(x, 70.70115, 3.389844), add=T)

Setting add=T superimposes the normal density curve on the histogram; otherwise R would do the wrong thing (exactly what, I am not sure). I used the
Linux GIMP (a Photoshop equivalent) for the figure. To save the plot as a PDF I did

> dev.copy(device=pdf, file="heighthist2.pdf", width=3, height=3, pointsize=8)
> dev.off()

For pdf the units are in inches. I followed an example in the "Statistics and Data in R" book.

[Figure: Histogram of x, with Density on the vertical axis.]
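The variance calculations in problem 2.31 and in the 1985 to 1993 example can also be checked in R. The following is a minimal sketch under stated assumptions: the variable names are invented for illustration, the crank-bore deviations are entered with the signs implied by the mean of -3, and the 1993 value is taken to be 24, which is the reading consistent with the variance of about 8.36 quoted in the notes. It compares the defining formula, the computational formula and R's built-in var(), and then confirms that subtracting 26 from every value leaves the variance unchanged.

```r
# Problem 2.31 deviations in ten-thousandths of an inch (signs assumed as above).
bore <- c(-6, 1, -4, -3)
n <- length(bore)

# Defining formula: squared deviations from the mean, divided by n - 1.
s2_def <- sum((bore - mean(bore))^2) / (n - 1)

# Computational formula: (sum of squares - (sum)^2 / n) / (n - 1).
s2_alt <- (sum(bore^2) - sum(bore)^2 / n) / (n - 1)

c(s2_def, s2_alt, var(bore))    # all three equal 26/3
sqrt(s2_def)                    # sample standard deviation, about 2.94
sum(bore - mean(bore))          # the deviations add to zero

# Yearly values for 1985 to 1993; the last entry (24) is an assumed reading.
yearly  <- c(22, 22, 26, 28, 27, 25, 30, 29, 24)
shifted <- yearly - 26          # y = x - 26

var(yearly)                     # about 8.36
var(shifted)                    # same value: shifting does not change variance
```

Working with the shifted values keeps the hand arithmetic small (the sums are 67 and -1 instead of three- and four-digit numbers), which is the practical reason for the subtraction in the example.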