# APPLIED STAT FOR ENGINR & SCI STAT 541

Virginia Commonwealth University

This 158 page Class Notes was uploaded by Verdie Hauck PhD on Wednesday October 28, 2015. The Class Notes belongs to STAT 541 at Virginia Commonwealth University taught by James Davenport in Fall. Since its upload, it has received 10 views. For similar materials see /class/230675/stat-541-virginia-commonwealth-university in Statistics at Virginia Commonwealth University.

Date Created: 10/28/15

## Measurement Scales

When we measure something, we are actually using a function that maps a property of the item being studied to a real number. These numbers are the results of conducting our well-designed study. The design of our data collection process states what measurements are to be made on the units. After we have amassed our data, we begin to think in terms of the distribution of those numbers: where they are located and how frequently the various values occur. Our next step is to describe that distribution using graphical and numerical techniques.

Consider the Bureau of Labor Statistics' regular Employment Survey. The individual in the household selected reports his/her employment status. You might record a value of 1 if this person is currently unemployed and a 2 if they are currently employed. Now, if we consider only the numbers themselves, 2 is twice as much as 1. But an employment status of 2 is not twice an employment status of 1. The issue here is the scale, or meaning, of the numbers themselves. The numbers used to code employment status are just indicators, or category labels, that happen to be numerals.

We must be careful, because the same thing happens for measurements that we might think of as being more numerical than employment status indicators. Consider the scores on the Stanford-Binet IQ test. Can we say that a child with an IQ score of 120 is twice as smart as a child that has an IQ score of 60? Of course not. This property of "twice as much" that we naturally like to associate with numbers is not appropriate for these measurements.

The general problem here is that not all numbers carry the same information. The number, along with the context in which the numbers were generated, is very important. Furthermore, what we can do with the numbers depends on how much information a measurement carries. The way we deal with this is to speak of the kind of information carried by the data by referring to the kind of scale that the measurement has. We generally consider four
kinds of scales. A measurement of a property of a unit has

- a nominal scale if the measurements are data in name only, that is, the data tell only the category to which a unit belongs;
- an ordinal scale if the measurement tells when one unit has more of the property being measured than does another unit. It is similar to the nominal scale in that the measurement tells to which category the unit belongs, but there is also an underlying ordering principle;
- an interval scale if the measurement tells us that one unit differs by a certain amount of the property being measured from another unit;
- a ratio scale if the measurement tells us that one unit has so many times as much of the property being measured as does another unit.

Measurements in the nominal scale simply place units into categories. Such properties as sex, race, ethnicity, eye color, hair color, employment status, etc. are measured in the nominal scale. For example, we could code licensed professional engineering status as follows:

- 0 = individual is not a licensed professional engineer
- 1 = individual is a licensed professional engineer

But we could just as easily, and correctly, use the following:

- 0 = individual is a licensed professional engineer
- 1 = individual is not a licensed professional engineer

Which numbers we assign makes no difference. The value of this variable simply indicates to which category the engineer belongs, and nothing more.

If a measurement of a property on a unit has an ordinal scale, then the order of the numbers is meaningful. Consider a commonly used measure of socioeconomic status, where 1 indicates the lowest level of income, little or no education, low status in the community, very little net worth, etc. On the other end of the scale is the value 4, which indicates the upper level of income, usually with some college education, good status in the community, a high level of net worth, etc. The values 2 and 3 would be between 1 and 4, where 2 indicates a status that is above 1, 3 would indicate a status above 2, and 4 would
indicate a status above 3. However, a socioeconomic status of 4 is not twice as much as a status of 2. Likewise, the difference in quality and quantity between a 4 and a 3 is not necessarily the same as the difference between a 3 and a 2. Only the order of the numbers is meaningful.

Ordinal scale measurements are important in the educational, social, and health sciences. Many tests used to measure learning [...], measurements such as ethnocentric thinking/beliefs, and certain health status measurements are all ordinal scale measurements. Consider a test such as the Profile of Mood States, which is often given to individuals who are experiencing great difficulties in their lives that alter their mood status. The measurements are presented as percentages, 0 to 100. A person who scores 80 on this scale cannot be said to be twice as moody as a person who scores 40. But there is an ordering principle here: the 80 score certainly indicates an individual who is more emotionally distressed than an individual who scores 40.

The interval scale and the ratio scale present us with the usual type of measurements with which we are familiar. These measurements are made on a scale of equal units, such as centimeters, feet, pounds, kilograms, MHz, degrees Celsius, etc. Arithmetic such as finding differences is meaningful here. For example, a 500 MHz processor is 100 MHz faster than a processor that is rated at 400 MHz.

There is a very fine distinction between interval and ratio scales. If we measure the lengths of two pieces of pipe, and one measures 4 feet and the other measures 2 feet, then we can say that the pipe that is 4 feet long is twice as long as the pipe that is 2 feet long. If we had measured the lengths in centimeters (121.92 cm and 60.96 cm), then the longer pipe is still twice as long as the shorter one. In general, the ratios of the lengths remain constant under changes in the units.

Virginia Commonwealth University, STAT 541: Applied Statistics for Engineers & Scientists. Instructor: Dr. James M.
Davenport

## Lecture 22

Lecture 22, James M. Davenport, VCU STAT 541, Copyright 2003.

### Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes:

- 13.4 More on Multiple Regression Analysis: Subset Selection

### Sum of Squares Decomposition

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

is the same as in simple linear regression.

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Note that the computation of $\hat{y}_i$ is now much more complicated.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2, \qquad SSTO = SSR + SSE$$

SSE and SSR divided by $\sigma^2$ are distributed as chi-square variables with their respective degrees of freedom.

### Mean Squares

The respective mean squares are given by

$$MSR = \frac{SSR}{p}, \qquad MSE = \frac{SSE}{n - p - 1} = \hat{\sigma}^2$$

### ANOVA Table

| Source | Sum of Squares | df | Mean Square |
|---|---|---|---|
| Regression | SSR | $p$ | MSR |
| Error | SSE | $n - p - 1$ | MSE |
| Total | SSTO | $n - 1$ | |

### ANOVA Table and the F Test

It can be shown that if all the assumptions are met, then

$$F = \frac{MSR}{MSE}$$

has an F distribution with $p$ degrees of freedom in the numerator and $n - p - 1$ degrees of freedom in the denominator.

The ratio $F$ has a central F distribution if and only if the following hypothesis is true:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$

Otherwise $F$ has a non-central F distribution with a non-centrality parameter that is a function of the values of the $\beta_i$.

Thus we will reject the hypothesis if and only if the observed value of the F statistic is too large, that is, if $f_{obs} \ge F_{\alpha;\, p,\, n-p-1}$, or equivalently if the p-value $P(F_{p,\, n-p-1} \ge f_{obs}) \le \alpha$.

[Figure: density of Snedecor's F distribution.]

### Coefficient of Determination

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}, \qquad 0 \le R^2 \le 1$$

$R^2$ is the regression sum of squares as a proportion of the total sum of squares.
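The sum-of-squares decomposition, mean squares, F ratio, and $R^2$ described above can be sketched in a few lines of Python. This is a minimal illustration, not the NCSS output used in class: the five data points and the helper names `simple_ols` and `anova_table` are hypothetical, and a single regressor ($p = 1$) is used only to keep the least-squares fit short.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single regressor (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def anova_table(y, y_hat, p):
    """ANOVA quantities for a least-squares fit with p regressors."""
    n = len(y)
    y_bar = sum(y) / n
    ssto = sum((yi - y_bar) ** 2 for yi in y)              # total
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error
    msr = ssr / p
    mse = sse / (n - p - 1)
    return {"SSTO": ssto, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse, "R2": ssr / ssto}


# Hypothetical data, chosen only so the arithmetic is easy to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = simple_ols(x, y)
table = anova_table(y, [b0 + b1 * xi for xi in x], p=1)
for name, value in table.items():
    print(f"{name:5s} {value:8.4f}")
```

For these made-up numbers the fit is $\hat{y} = 2.2 + 0.6x$, so SSTO = 6, SSR = 3.6, SSE = 2.4, F = 4.5 and $R^2$ = 0.6. The observed F would then be compared with a tabled $F_{\alpha;\,1,\,3}$ value.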
A high $R^2$ indicates that the regression model with the $p$ variables explains a large portion of the total variability in the $y$'s, but it does not by itself tell us which regressors are the important ones.

### Adjusted Coefficient of Determination

If we wish to compare similar models with different numbers of variables, we must use the adjusted coefficient of determination:

$$R_a^2 = 1 - \frac{MSE}{SSTO/(n-1)} = 1 - \frac{SSE/(n-p-1)}{SSTO/(n-1)}$$

The adjusted coefficient of determination can be computed from knowledge of the regular coefficient of determination $R^2$:

$$R_a^2 = \frac{(n-1)R^2 - p}{n - p - 1}$$

### Facts about $R^2$, $R_a^2$, and $F$

- It can be shown that $R_a^2 \le R^2$.
- The F statistic and $R^2$ are related as follows:

$$F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$$

### Example: Multiple Regression, NCSS, and the Hald Cement Data

| Obs | x1 | x2 | x3 | x4 | Y |
|---|---|---|---|---|---|
| 1 | 7.0 | (missing) | 6.0 | 60.0 | 78.5 |
| 2 | 1.0 | 29.0 | 15.0 | 52.0 | 74.3 |
| 3 | 11.0 | 56.0 | 8.0 | 20.0 | 104.3 |
| 4 | 11.0 | 31.0 | 8.0 | 47.0 | 87.6 |
| 5 | 7.0 | 52.0 | 6.0 | 33.0 | 95.9 |
| 6 | 11.0 | 55.0 | 9.0 | 22.0 | 109.2 |
| 7 | 3.0 | 71.0 | 17.0 | 6.0 | 102.7 |
| 8 | 1.0 | 31.0 | 22.0 | 44.0 | 72.5 |
| 9 | 2.0 | 54.0 | 18.0 | 22.0 | 93.1 |
| 10 | 21.0 | 47.0 | 4.0 | 26.0 | 115.9 |
| 11 | 1.0 | 40.0 | 23.0 | 34.0 | 83.8 |
| 12 | 11.0 | 66.0 | 9.0 | 12.0 | 113.3 |
| 13 | 10.0 | 68.0 | 8.0 | 12.0 | 109.4 |

where

- x1 = amount of tricalcium aluminate
- x2 = amount of tricalcium silicate
- x3 = amount of calcium aluminum ferrate
- x4 = amount of dicalcium silicate
- Y = heat evolved, in calories per gram of cement

x1, x2, x3, and x4 are measured as percent of the weight of the clinkers from which the cement was made.

The regression function is as follows:
$$E[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$

Let's see how NCSS handles this.

### Individual Beta Coefficients

Prior to sampling, $\hat{\beta}_j$ is a random variable and has a sampling distribution. We need to know what this distribution is in order to compute confidence intervals and test hypotheses.

It can be shown that each $\hat{\beta}_j$ is a linear combination of the observed $Y_i$'s:

$$\hat{\beta}_j = \sum_{i=1}^{n} d_{ij} Y_i$$

The $d_{ij}$'s are functions of the $x_{ij}$'s.

Furthermore, it can be shown that

$$E[\hat{\beta}_j] = \beta_j \quad \text{for } j = 0, 1, 2, \ldots, p \quad \text{(unbiased)}$$

$$Var[\hat{\beta}_j] = c_{jj}\,\sigma^2 \quad \text{for } j = 1, 2, \ldots, p$$

and each $\hat{\beta}_j$ has a normal distribution. The $c_{jj}$ depend only upon the x variables.

Furthermore, the following statistic has a Student's t distribution with $n - p - 1$ degrees of freedom:

$$T = \frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}}$$

The estimated standard error of $\hat{\beta}_j$ is given by

$$s_{\hat{\beta}_j} = \sqrt{c_{jj}\, MSE}$$

Most computer programs will give these estimates of the standard errors along with the estimates of the parameters.

Thus we can test hypotheses and find confidence intervals on all of the $\beta_j$'s. The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is given by

$$\hat{\beta}_j \pm t_{\alpha/2;\, n-p-1}\; s_{\hat{\beta}_j}$$

We can test the following hypothesis:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_a: \beta_j \ne 0$$

Test statistic:

$$t = \frac{\hat{\beta}_j}{s_{\hat{\beta}_j}}$$

We reject $H_0$ if and only if $|t_{obs}| \ge t_{\alpha/2;\, n-p-1}$. Or we can compute the p-value and reject if and only if p-value $\le \alpha$, where

$$\text{p-value} = 2 \times P(T_{n-p-1} \ge |t_{obs}|)$$

[Figure: density of Student's t distribution with $\nu = n - p - 1$ degrees of freedom; the two tail areas beyond $\pm|t_{obs}|$ give the p-value.]
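As a rough sketch of how a program produces these estimates, standard errors, t statistics, and confidence intervals, the following pure-Python code fits the four-variable Hald model by solving the normal equations. The function names `invert` and `partial_t_tests` are illustrative, not part of NCSS. Two assumptions should be kept in mind: the x2 value of the first observation (26) does not survive in the extracted data table and is taken here from the standard published Hald data set, and the critical value 2.306 is the tabled $t_{0.025;\,8}$, which applies only because $n - p - 1 = 8$ for this model.

```python
import math

# Hald cement data as (x1, x2, x3, x4, y). Assumption: the x2 value of the
# first observation (26) comes from the standard published data set.
HALD = [
    (7, 26, 6, 60, 78.5), (1, 29, 15, 52, 74.3), (11, 56, 8, 20, 104.3),
    (11, 31, 8, 47, 87.6), (7, 52, 6, 33, 95.9), (11, 55, 9, 22, 109.2),
    (3, 71, 17, 6, 102.7), (1, 31, 22, 44, 72.5), (2, 54, 18, 22, 93.1),
    (21, 47, 4, 26, 115.9), (1, 40, 23, 34, 83.8), (11, 66, 9, 12, 113.3),
    (10, 68, 8, 12, 109.4),
]


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def partial_t_tests(rows, cols):
    """Regress y on an intercept plus columns `cols`; return (beta, se, t, df)."""
    X = [[1.0] + [float(r[c]) for c in cols] for r in rows]
    y = [r[-1] for r in rows]
    n, k = len(X), len(cols) + 1                  # k = p + 1 coefficients
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    c = invert(xtx)                               # c[j][j] plays the role of c_jj
    beta = [sum(c[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    df = n - k                                    # n - p - 1
    mse = sse / df
    se = [math.sqrt(c[j][j] * mse) for j in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t, df


beta, se, t, df = partial_t_tests(HALD, [0, 1, 2, 3])
T_CRIT = 2.306  # tabled t_{0.025; 8}; valid only for df = 8
for j in range(1, 5):
    low, high = beta[j] - T_CRIT * se[j], beta[j] + T_CRIT * se[j]
    print(f"x{j}: b = {beta[j]:7.4f}  se = {se[j]:6.4f}  t = {t[j]:6.3f}  "
          f"95% CI = ({low:7.4f}, {high:7.4f})")
```

Solving the normal equations directly is fine for a small teaching example. Statistical packages typically use more numerically stable decompositions, which matters when the regressors are strongly correlated.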
Most computer programs give these p-values. If the p-value is less than or equal to $\alpha$, we conclude that the regressor variable $x_j$ has a significant impact on the response variable $Y$.

These are called the partial t tests, since they assess the partial, or additional, significance of the variable $x_j$ over and above the impact of all other variables that are in the model.

### Computer Software

Let us use NCSS and the Hald data to illustrate the use of these partial t tests. Also, while we are there, let us examine the correlations among the variables.

### Multicollinearity Among the Explanatory Variables

If the x variables are not orthogonal, then we say that they are correlated. Explanatory variables that are correlated most likely convey the same or similar information about the response variable $Y$.

In the Hald data, x1 and x3 are highly correlated. Also, x2 and x4 are highly correlated. Does this matter? Is it a help or a hindrance? When multicollinearity is present, we must proceed carefully.

### Example

Let us reconsider the fuel consumption example. In a quest to improve the model, an engineer suggested including horsepower as an explanatory variable. Of course, horsepower and weight are correlated.

[Figure: scatterplot of the weight of the vehicle (in 1000 lbs) versus the horsepower of the vehicle.]

[Figure: fuel consumption versus weight and horsepower.]

### Multicollinearity Among the Explanatory Variables (continued)

The presence of multicollinearity among the regressor variables produces instability
in the solutions to the normal equations. This in turn causes unstable, non-robust estimates of the coefficients.

We would like to avoid unstable estimates. So we cannot build multiple linear regression models by simply throwing in every regressor that we think might be even remotely related to the response. If we try this, it will lead to problems with multicollinearity. We would like to avoid this. Is there something that we can do to deal with this problem? Yes.

We could start with a large set of explanatory variables and then attempt to reduce the size of this set by eliminating the troublesome variables that are causing the multicollinearity.

### Subset Selection

This suggests the following question: Is there a smaller subset of the explanatory variables that we can use as our predictive (fitted) model without significant loss in, say, the coefficient of determination $R^2$?

Recall that the general multiple linear regression model is given by

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

Methods of subset selection:

- Forward selection
- Backward selection (directed t search)
- All possible subsets
- Stepwise regression

### Subset Selection: Forward Selection

Start by including the regressor that has the highest correlation with the response variable $Y$. Then choose as the second variable to include the variable that gives the maximum increase in SSR, provided that the partial t test for that added coefficient is significant. Continue until you cannot
include any more variables. Let's look at NCSS and the Hald data.

### Subset Selection: Hald Data

| Pairs of x's | SSR | $\lvert t_{\hat{\beta}_j} \rvert$, j = 1, 2, 3 | p-value |
|---|---|---|---|
| x1 & x4 | 2641.00 | 10.40 | 0.0000 |
| x2 & x4 | 1846.83 | 0.42 | 0.6867 |
| x3 & x4 | 2540.025 | 6.35 | 0.0001 |

So we include x4 first, and the second variable is x1. Let us return to NCSS and see about including a third variable.

| x's included in model | SSR | $\lvert t_{\hat{\beta}_j} \rvert$, j = 2, 3 | p-value |
|---|---|---|---|
| x1, x4 & x2 | 2667.79 | 2.24 | 0.0517 |
| x1, x4 & x3 | 2664.927 | 2.05 | 0.0697 |

### Subset Selection: Hald Data, Forward Selection Procedure

If we use $\alpha = 0.05$, then we cannot include any additional variables, and the final regression model is given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_4 x_4 + \varepsilon$$

If we use $\alpha = 0.10$, we would include x2 and try x3. However, the partial t test for x3, given that x1, x2, and x4 are already in the model, is not significant, so it would not be included. The final model using $\alpha = 0.10$ is given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \varepsilon$$

### Subset Selection: Backward Elimination

Eliminate the variable whose coefficient t test has the largest p-value. Then re-compute the estimates and individual t tests with the reduced model. Check to see if another variable can be eliminated, and act accordingly. Continue until all remaining t tests are significant.

Let us use NCSS to show how to do this backward elimination procedure using the Hald data. We will start with the full model, including all four explanatory variables.
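The forward selection and backward elimination rules described above can be sketched as follows. This is one reading of the lecture's rules, not a reproduction of what NCSS does internally. The function names are illustrative, the x2 value of the first observation (26) is an assumption taken from the standard published Hald data set, and the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule rather than from a statistics library.

```python
import math

# Hald cement data as (x1, x2, x3, x4, y). Assumption: the x2 value of the
# first observation (26) comes from the standard published data set.
HALD = [
    (7, 26, 6, 60, 78.5), (1, 29, 15, 52, 74.3), (11, 56, 8, 20, 104.3),
    (11, 31, 8, 47, 87.6), (7, 52, 6, 33, 95.9), (11, 55, 9, 22, 109.2),
    (3, 71, 17, 6, 102.7), (1, 31, 22, 44, 72.5), (2, 54, 18, 22, 93.1),
    (21, 47, 4, 26, 115.9), (1, 40, 23, 34, 83.8), (11, 66, 9, 12, 113.3),
    (10, 68, 8, 12, 109.4),
]


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit(rows, cols):
    """OLS of y on an intercept plus columns `cols`; returns (SSE, t stats, error df)."""
    X = [[1.0] + [float(r[c]) for c in cols] for r in rows]
    y = [r[-1] for r in rows]
    k = len(cols) + 1
    c = invert([[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)])
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    beta = [sum(c[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    df = len(rows) - k
    t = [beta[j] / math.sqrt(c[j][j] * sse / df) for j in range(k)]
    return sse, t, df


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value 2*P(T_df >= |t|), by Simpson's rule on the t density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    dens = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = a / steps
    s = dens(0) + dens(a) + sum((4 if i % 2 else 2) * dens(i * h) for i in range(1, steps))
    return max(0.0, 1 - 2 * s * h / 3)


def forward_selection(rows, alpha):
    """Add the candidate with the largest SSR while its partial t test is significant."""
    p, chosen = len(rows[0]) - 1, []
    while len(chosen) < p:
        # SSTO is fixed, so the largest SSR is the smallest SSE. In the first
        # pass this picks the regressor most highly correlated with y.
        best = min((j for j in range(p) if j not in chosen),
                   key=lambda j: fit(rows, chosen + [j])[0])
        _, t, df = fit(rows, chosen + [best])
        if t_two_sided_p(t[-1], df) > alpha:
            break
        chosen.append(best)
    return chosen


def backward_elimination(rows, alpha):
    """Drop the variable with the largest p-value until all remaining are significant."""
    chosen = list(range(len(rows[0]) - 1))
    while chosen:
        _, t, df = fit(rows, chosen)
        pvals = [t_two_sided_p(tj, df) for tj in t[1:]]   # skip the intercept
        worst = max(range(len(chosen)), key=lambda i: pvals[i])
        if pvals[worst] <= alpha:
            break
        del chosen[worst]
    return chosen


name = lambda cols: [f"x{j + 1}" for j in cols]
print("forward,  alpha = 0.05:", name(forward_selection(HALD, 0.05)))
print("forward,  alpha = 0.10:", name(forward_selection(HALD, 0.10)))
print("backward, alpha = 0.05:", name(backward_elimination(HALD, 0.05)))
```

Note that the two procedures need not agree: under these assumptions forward selection at $\alpha = 0.05$ keeps x4 and x1, while backward elimination starts from all four regressors and removes variables one at a time, so it can end at a different pair.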
### Subset Selection: Hald Data, Backward Elimination Procedure

The first explanatory variable eliminated is x3. So the reduced model is now given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \varepsilon$$

Let us return to NCSS and see if one of these three variables can be eliminated. We see that x4 can be eliminated. The new reduced model is given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

No other variables can be eliminated, so this is the final model.

### Example

Let us consider the fuel consumption data from earlier and include the horsepower as an additional variable. Also, let us begin with a complete second-order model in these two variables.

- Y = gallons per 100 miles
- X1 = weight
- X2 = horsepower

The initial model is given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon$$

[Figures: scatterplots of gallons consumed per 100 miles driven versus horsepower of vehicle, versus weight of vehicle, and versus the product of weight and horsepower; scatterplot of weight of vehicle (in 1000 lbs) versus horsepower; fuel consumption versus weight and horsepower.]

Let us use NCSS to examine the model and see which variable can be eliminated first. The first variable to be eliminated is weight squared. Then we will eliminate horsepower.
Next we eliminate horsepower squared, followed by the elimination of weight. And finally we are left with the product of weight and horsepower as the final model.

The final selected model contains the explanatory variable PROD. Using the backward elimination procedure, the final model is as follows:

$$\text{Gal's}/100 = 1.94647 + 0.008052 \times \text{PROD}$$

### Subset Selection: All Possible Subsets

Compute the regression function for every possible subset; there are $2^p - 1$ of these. Pick the model that is best by some appropriate criterion. If $p$ is large, then this can produce a very large amount of computer printout, but if done correctly it can be very effective.

### Subset Selection: Stepwise Regression

This is essentially a combination of the Forward Selection and Backward Elimination procedures. It is very computationally intensive and certainly requires the use of a computer. Most packages include a form of this type of selection procedure as an option.

## Lecture 24

Virginia Commonwealth University, STAT 541: Applied Statistics for Engineers & Scientists. Instructor: Dr. James M. Davenport.

Lecture 24, James M. Davenport, VCU STAT 541, Copyright 2003.

### Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes:

- 1032 & 165 Goodness of Fit
- Quality Control Methods: Acceptance Sampling

### Goodness of Fit Tests

Suppose we think that the distribution of a random variable $X$ is of some specified form. For example, we conjecture that $X$ may have a normal distribution with unknown parameters $\mu$ and $\sigma^2$.

[Figure: a normal density curve over the measurement variable.]

We can divide the support of the distribution of the random variable $X$ into $k$
mutually exclusive and exhaustive cells (groups or intervals). Take a random sample of size $n$ and let $Y_1, Y_2, \ldots, Y_k$ denote the frequencies of those cells. Let's look at NCSS.

Let $p_1, p_2, \ldots, p_k$ denote the probabilities that are associated with the $k$ cells (computed from the pdf), and let $E[Y_i] = n p_i$ denote the expected number of observations associated with the $i$-th cell, so that

$$\sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} n p_i = n$$

If the null hypothesis completely specifies the values of the various $p_i$'s, then

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(Y_i - n p_i)^2}{n p_i}$$

has an approximate $\chi^2_{k-1}$ distribution.

If the null hypothesis does NOT completely specify the various values of the $p_i$'s, then we have to estimate the unknown parameters using the data. We lose one degree of freedom for each parameter estimated. Thus the statistic

$$Q = \sum_{i=1}^{k} \frac{(Y_i - n \hat{p}_i)^2}{n \hat{p}_i}$$

has an approximate $\chi^2_{k-1-m}$ distribution, where $m$ = the number of parameters that were estimated using the data.

Recent research indicates that the decision rules need to be modified to account for the changing degrees of freedom. To avoid confusion in the following decision rules, let $q_{obs}$ = the calculated value of the chi-square statistic. The decision rules are modified as follows:

- If $q_{obs} \ge \chi^2_{\alpha}(k-1)$, then reject $H_0$.
- If $q_{obs} \le \chi^2_{\alpha}(k-1-m)$, then fail to reject $H_0$.
- If $\chi^2_{\alpha}(k-1-m) < q_{obs} < \chi^2_{\alpha}(k-1)$, then we withhold judgment.

[Figure: the chi-square curve, a plot of a chi-square pdf with $r$ degrees of freedom, showing the reserve judgment region and the rejection region.]

### Example: Poisson $\lambda$

We observe $n = 200$ values on a random variable $X$ that we believe has a Poisson distribution with parameter $\lambda$. The data are as follows:

| x | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Frequency | 89 | 70 | 30 | 8 | 2 | 1 |
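The goodness-of-fit procedure described above can be sketched in Python for a Poisson null hypothesis, using the observed frequencies of this example. The function names are illustrative. The pooling loop is one simple way to apply a minimum-expected-frequency rule (no expected count below 1, and at most 20% of cells with expected count below 5) by merging cells from the right tail; it is an assumption about how to automate that rule, not the method NCSS uses. The critical values 9.488 and 7.815 are the tabled $\chi^2_{0.05}$ values for 4 and 3 degrees of freedom.

```python
import math


def poisson_cells(freq):
    """Estimate lambda and build pooled observed/expected cell counts.

    freq[x] is the number of observations equal to x, for x = 0, 1, 2, ...
    """
    n = sum(freq)
    lam = sum(x * f for x, f in enumerate(freq)) / n          # sample mean
    probs = [math.exp(-lam) * lam ** x / math.factorial(x)
             for x in range(len(freq) - 1)]
    probs.append(1.0 - sum(probs))                            # last cell: X >= max x
    obs = list(freq)
    exp = [n * p for p in probs]
    # Merge the two right-most cells while the expected counts are too small.
    while len(exp) > 2 and (min(exp) < 1
                            or 5 * sum(e < 5 for e in exp) > len(exp)):
        obs[-2:] = [obs[-2] + obs[-1]]
        exp[-2:] = [exp[-2] + exp[-1]]
    return lam, obs, exp


def chi_square_stat(obs, exp):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def decide(q_obs, crit_k1, crit_k1m):
    """Three-way rule: crit_k1 has k-1 df, crit_k1m has k-1-m df."""
    if q_obs >= crit_k1:
        return "reject H0"
    if q_obs <= crit_k1m:
        return "fail to reject H0"
    return "withhold judgment"


freq = [89, 70, 30, 8, 2, 1]           # counts of x = 0, 1, ..., 5
lam, obs, exp = poisson_cells(freq)
q_obs = chi_square_stat(obs, exp)
k, m = len(obs), 1                     # m = 1 because lambda was estimated
print(f"lambda-hat = {lam:.3f}, k = {k}, df = {k - 1 - m}")
print("observed:", obs)
print("expected:", [round(e, 2) for e in exp])
print(f"q_obs = {q_obs:.3f} ->", decide(q_obs, crit_k1=9.488, crit_k1m=7.815))
```

Because the sketch uses unrounded Poisson probabilities, its statistic can differ in the third decimal place from a hand calculation based on probabilities rounded to four places.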
We first must estimate the value of $\lambda$:

$$\hat{\lambda} = \bar{x} = \frac{89(0) + 70(1) + \cdots + 1(5)}{200} = 0.835$$

- $H_0$: $X$ has a Poisson distribution ($\lambda$ is unknown)
- $H_a$: $X$ does not have a Poisson distribution

And we have estimated $\lambda$ with $\hat{\lambda} = 0.835$.

Using NCSS's probability calculator, we get the following probabilities:

| x | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| P(X = x) | 0.4339 | 0.3623 | 0.1513 | 0.0421 | 0.0088 | 0.0015 |

Expected frequencies:

- $E_1 = 200(0.4339) = 86.78$
- $E_2 = 200(0.3623) = 72.46$
- $E_3 = 200(0.1513) = 30.26$
- $E_4 = 200(0.0421) = 8.42$
- $E_5 = 200(0.0088) = 1.76$ (too small)
- $E_6 = 200(0.0015) = 0.30$ (too small)

### Example: Expected Frequencies Too Low

We have a problem. These last two expected frequencies are too low for the chi-square approximation to be adequate. So we must combine the last two categories into one category, which we can label $X \ge 4$.

The general rule for combining the cells or categories is as follows:

- None of the expected frequencies can be less than 1, and
- No more than 20% of the cells can have expected frequencies less than 5.

Expected and observed frequencies after combining:

| Cell | x = 0 | x = 1 | x = 2 | x = 3 | x ≥ 4 |
|---|---|---|---|---|---|
| Expected $E_i$ | 200(0.4339) = 86.78 | 200(0.3623) = 72.46 | 200(0.1513) = 30.26 | 200(0.0421) = 8.42 | 200(0.0105) = 2.10 |
| Observed $O_i$ | 89 | 70 | 30 | 8 | 3 |

Here $m = 1$, since we estimated $\lambda$, so $df = k - 1 - m = 5 - 1 - 1 = 3$. The critical values are $\chi^2_{0.05}(4) = 9.488$ and $\chi^2_{0.05}(3) = 7.815$, and

$$q_{obs} = \sum_{i=1}^{5} \frac{(O_i - E_i)^2}{E_i} = \frac{(89 - 86.78)^2}{86.78} + \cdots = 0.549$$

Since $0.549 < 7.815$, we fail to reject the null hypothesis. The data are consistent with those we would expect to see from a Poisson distribution.

[Figure: the chi-square curve, a plot of a chi-square pdf with 3 or 4 degrees of freedom.]

### Acceptance Sampling

This is also referred to as Attribute Sampling in
Statistical Process Control (SPC) terminology: each item is classified as defective or non-defective.

### Notation

Let $N$ denote the number of items in a lot (usually finite, but it can be small or large in size). Let $p$ denote the UNKNOWN proportion of defective items in the lot; thus there are $N \times p$ defective items in the lot.

### Acceptance Sampling (continued)

100% inspection is usually not done:

- It can be too expensive.
- It may be impossible (destructive testing).

So we resort to selecting a sample from a lot and either accepting the entire lot or rejecting the entire lot based upon the outcome of the sample. Note that this is sampling without replacement. Thus these events are NOT independent.

### Two Types of Errors

- We may reject the lot as unacceptable when it is okay.
- We may accept a lot when it is indeed unacceptable.

We orient our thinking in terms of "Accepting the Lot" as the event of interest, and we desire to determine whether an acceptance sampling plan is a desirable one or an undesirable one.

To assess the operating characteristics of an acceptance sampling plan, we will use the probabilities of making these two types of errors as part of the criteria that distinguish a good plan from a bad plan.

### Producer's Risk

Definition: The probability of rejecting a satisfactory lot is called the producer's risk. It is also referred to as the $\alpha$ risk.

### Consumer's Risk

Definition: The probability of accepting an unsatisfactory lot is called the consumer's risk. It is also referred to as the $\beta$ risk.

### Operating Characteristic Curve (the OC Curve)

This suggests that we
define a function of the variable $p$, the percent defective in the lot, called the Operating Characteristic Curve (OC curve for short), which is defined to be the probability of accepting the lot for various values of $p$.

Definition: The Operating Characteristic Curve (OC curve) for any acceptance sampling plan is defined to be

$$OC(p) = P(\text{Accepting Lot} \mid p), \qquad 0 \le p \le 1$$

### Example

- $N = 1000$
- $p$ = percent defective
- $Np$ = number of defective items in the lot
- $n = 10$ items are randomly sampled from the lot without replacement

Evaluate and plot the OC curve.

Decision rule: accept the lot if we observe no failures in the sample of $n = 10$ items. That is, if $Y$ = the number of defective items in the sample, then we accept if and only if $Y = 0$.

The probabilities that we need to calculate can be found with the hypergeometric distribution. But we will first use a simplifying approximation based on the multiplicative rule of probability.

[Figure: a lot of size $N$, split into defectives and non-defectives.]

$$OC(p) = P(\text{Accepting Lot} \mid p) = P(Y = 0 \mid p) = \frac{N - Np}{N} \times \frac{N - Np - 1}{N - 1} \times \cdots \times \frac{N - Np - n + 1}{N - n + 1}, \qquad 0 \le p \le 1$$

Consider the first term:

$$\frac{N - Np}{N} = \frac{N}{N} - \frac{Np}{N} = 1 - p$$

Now consider the second term:

$$\frac{N - Np - 1}{N - 1} = \frac{1 - p - \frac{1}{N}}{1 - \frac{1}{N}}$$

If $N$ is very large, then $\frac{1}{N} \approx 0$ and $1 - \frac{1}{N} \approx 1$. Thus the second term is approximated by

$$\frac{N - Np - 1}{N - 1} \approx 1 - p$$

In general, we can show that

$$OC(p) \approx (1 - p)^n \quad \text{provided } n \ll N$$

For $n = 10$, $OC(p) \approx (1 - p)^{10}$. So

- $OC(0.05) = (1 - 0.05)^{10} = 0.95^{10} = 0.5987$
- $OC(0.10) = (1 - 0.10)^{10} = 0.90^{10} = 0.3487$
- $OC(0.025) = (1 - 0.025)^{10} = 0.975^{10} = 0.7763$

[Figure: operating characteristic curve, probability of accepting the lot versus $p$, the probability of a defective; example of $OC(p) = (1 - p)^{10}$.]
### True OC Curve Is Given by the Hypergeometric Distribution

The true probability of accepting the lot is found by using the hypergeometric distribution. It can be found using NCSS, so we don't need the approximations.

### Hypergeometric PDF (Note: It Is Discrete)

To evaluate this pdf we need four pieces of information, namely:

- $N$ = the finite population size
- $p$ = the proportion of defective items
- $n$ = the sample size
- $y$ = the number of defective items in the sample

[Figure: a lot of size $N$ and a sample of size $n$.]

$$P(Y = y) = \frac{\binom{Np}{y}\binom{N - Np}{n - y}}{\binom{N}{n}}$$

### Example

Recall that for this example $p = 0.05$, $n = 10$, $N = 1000$, and we accept only if $y = 0$. Thus the true probability of accepting the lot under these conditions is given by

$$P(Y = 0) = \frac{\binom{50}{0}\binom{1000 - 50}{10}}{\binom{1000}{10}} = 0.5973$$

### Calculating Probabilities for the Hypergeometric

Calculating these probabilities can be a very challenging problem. But the NCSS Probability Calculator has been programmed to get around these problems and is very accurate. Let's try it now. The annotated window for this is attached to these notes and is available on the web page.

### Example in Summary

Using the approximation we developed for the OC curve, we got $OC(p) \approx (1 - p)^{10}$. So for $p = 0.05$ we get

$$OC(0.05) \approx (1 - 0.05)^{10} = 0.5987$$

The exact probability is given by $OC(0.05) = 0.5973$.

### Using Acceptance Sampling Plans

There are two values of $p$, the percent defective, that are very important.

The percent defective at which the lot is acceptable is called the acceptable quality level (AQL). This value of $p$ is denoted by $p_0$.

The largest value of the percent defective that the consumer is willing to accept is called the Lot Tolerance Percent Defective (LTPD). We denote this value by $p_1$.
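The comparison between the $(1 - p)^n$ approximation and the exact hypergeometric probability for the $N = 1000$, $n = 10$ example can be reproduced with Python's standard library. This is a minimal sketch: the function names `oc_approx` and `oc_exact` are illustrative, and rounding $Np$ to a whole number of defective items is an assumption made so the binomial coefficients are defined.

```python
from math import comb


def oc_approx(p, n):
    """Approximate P(accept) when the lot is accepted only if Y = 0 and n << N."""
    return (1 - p) ** n


def oc_exact(N, p, n, ac=0):
    """Exact hypergeometric P(Y <= ac) for a lot of N items with round(N*p) defectives."""
    d = round(N * p)                       # number of defective items in the lot
    favorable = sum(comb(d, y) * comb(N - d, n - y) for y in range(ac + 1))
    return favorable / comb(N, n)


for p in (0.025, 0.05, 0.10):
    print(f"p = {p:5.3f}: approx = {oc_approx(p, 10):.4f}, "
          f"exact = {oc_exact(1000, p, 10):.4f}")
```

Python integers have arbitrary precision, so the binomial coefficients are computed exactly and only the final division is done in floating point. That sidesteps the overflow problems that make these probabilities awkward to compute by hand or with fixed-width arithmetic.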
Using Acceptance Sampling Plans
The probability of rejecting a good lot, with AQL = $p_0$ actually realized, is denoted by $\alpha$ and is called the Producer's Risk.
The probability of accepting a bad lot, with LTPD = $p_1$, is denoted by $\beta$ and is called the Consumer's Risk.

Figure: Operating characteristic curve. Probability of accepting the lot versus p (probability of a defective), for the example $OC(p) = (1 - p)^{10}$.

Evaluating Acceptance Sampling Plans
Recall that we judge the goodness of any acceptance sampling plan by the shape of the Operating Characteristic Curve, given by
$OC(p) = P(\text{Accepting Lot}), \quad 0 \le p \le 1$
We define "Accepting the Lot" as the event $Y \le Ac$, where Ac = the acceptance number and Y = the number of defective items in the sample. Then the Operating Characteristic Curve is defined by
$OC(p) = P(Y \le Ac), \quad 0 \le p \le 1$

Example
N = 5000
$p_0$ = AQL = 0.02
n = 200
$p_1$ = LTPD = 0.06
We desire the error probabilities to be $\alpha = 0.05$ and $\beta = 0.10$. That is, $OC(0.02) = 0.95$ and $OC(0.06) = 0.10$.

Finally, let Ac = 7; that is, we will accept the lot if there are 7 or fewer defectives in the sample and reject the lot otherwise.
$OC(p) = P(\text{Acc. Lot} \mid p) = P(Y \le 7 \mid p)$
$OC(p) = P(Y \le Ac \mid p) = \sum_{y=0}^{Ac} \frac{\binom{Np}{y}\binom{N - Np}{n - y}}{\binom{N}{n}}$
For example,
$OC(0.02) = P(Y \le 7) = \sum_{y=0}^{7} \frac{\binom{100}{y}\binom{4900}{200 - y}}{\binom{5000}{200}}$

Value of p    Hypergeometric
0.02          0.9542
0.03          0.7491
0.04          0.4473
0.05          0.2078
0.06          0.0786

Figure: Operating characteristic curve. Probability of accepting the lot versus p (defective). Curve computed with the
hypergeometric distribution.

Average Outgoing Quality
$AOQ = p \cdot OC(p) + 0 \cdot [1 - OC(p)] = p \cdot OC(p)$
where p is the actual proportion defective of all the combined lots that go into the production process.

The maximum of the AOQ curve is called the Average Outgoing Quality Limit (AOQL). It tells us about the worst possible average of outgoing quality. It is usually determined by calculus or empirical methods.

Figure: AOQ curve for the MIL-STD-105D example, AOQ versus p, with the AOQL marked.

Practice Problems 14

1. In a genetics experiment, investigators looked at 300 chromosomes of a particular type and counted the number of sister-chromatid exchanges on each. A Poisson model was hypothesized for the distribution of the number of exchanges. Test the fit of a Poisson distribution to the data by first estimating $\lambda$ and then combining the counts for x = 8 and x = 9 into one cell that corresponds to $x \ge 8$. Use $\alpha = 0.10$. The data follow:

Number of exchanges:  0   1   2   3   4   5   6   7   8   9
Observed counts:      6  24  42  59  62  44  41  14   6   2

2. An article published in the Annals of Mathematical Statistics reports the following data on the number of borers in each of 120 groups of borers. Does the Poisson pdf provide a plausible model for the distribution of the number of borers per group? (Hint: add the frequencies for 7, 8, ..., 12 to establish a single category "$\ge 7$".) Use $\alpha = 0.01$.

Number of borers:  0  1  2  3  4  5  6  7  8  9  10  11  12
Frequency:         96534301

3. In a recent article, anthropologists reported the following data that address the association of relative foot size and sex among right-handed individuals. A sample of right-handed men and a sample of right-handed women were chosen, and each individual was classified as having feet that were the same size, having a bigger left than right foot (a difference of half a shoe size or more),
or having a bigger right than left foot.

         L>R   L=R   L<R   Sample size
Men        2    10    28    40
Women     55    18    14    87

Does the data indicate that gender has a significant relationship with the development of foot asymmetry? State the appropriate null and alternative hypotheses, compute the value of the $\chi^2$ statistic, and obtain the p-value. Use $\alpha = 0.005$.

4. A sample of 50 items is to be selected from a batch consisting of 5000 items. The batch will be accepted if the sample contains at most one defective item. Calculate the probability of lot acceptance for p = 0.01, 0.02, 0.03, ..., 0.10 and sketch the OC curve. Be sure to use the hypergeometric distribution.

5. Refer to problem 4 and consider the plan with n = 100 and Ac = 2. Calculate OC(p) for p = 0.01, 0.02, ..., 0.05 and sketch the two OC curves on the same graph. Which of the two sampling plans is preferable (leaving aside the cost of sampling, etc.) and why?

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 6

Today's Lecture
Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 5.1 & 6.1.
Normal Distribution (continued)
Checking Normality
The Nature of Statistical Inference
Populations & Samples
Random Sampling

Finding Probabilities for non-standard normals
You can prove that if X has $N(\mu, \sigma^2)$, then the distribution of $Z = \frac{X - \mu}{\sigma}$ is $N(0, 1)$.

Example: percent production in spec
Certain shafts are manufactured to meet engineering specs, but there is variation in the produced items. X = diameter of shaft, and X has $N(\mu, \sigma^2)$ where $\mu = 0.251$ and $\sigma = 0.001$.
Engineering specs call for $0.25 \pm 0.002$. Given that X has a normal distribution with mean and standard deviation as given above, what proportion of this population of manufactured rotor shafts is in spec?
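The in-spec proportion asked for in the shaft example can be checked numerically. The sketch below is an illustration only: it builds the normal cdf from `math.erf` in the Python standard library rather than from a normal table, and the helper name `normal_cdf` is made up here.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X distributed N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Shaft diameters: mean 0.251, standard deviation 0.001; spec is 0.25 +/- 0.002.
mu, sigma = 0.251, 0.001
lower, upper = 0.250 - 0.002, 0.250 + 0.002

# Standardize the spec limits, then take the difference of the two cdf values.
z_lower = (lower - mu) / sigma
z_upper = (upper - mu) / sigma
in_spec = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)

print(f"z limits: {z_lower:.2f} to {z_upper:.2f}")
print(f"proportion in spec: {in_spec:.4f}")
```

Because the process mean sits above the nominal 0.25, the two spec limits are not symmetric about the mean, which is why the z limits come out as -3 and +1 rather than plus or minus the same value.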
Example: percent production in spec
X has $N(0.251, \sigma^2 = 0.001^2)$.
$P(0.25 - 0.002 < X < 0.25 + 0.002) = P(0.248 < X < 0.252)$
$\quad = P\left(\frac{0.248 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{0.252 - \mu}{\sigma}\right)$
$\quad = P\left(\frac{0.248 - 0.251}{0.001} < Z < \frac{0.252 - 0.251}{0.001}\right)$
$\quad = P(-3.00 < Z < 1.00)$
$\quad = \Phi(1) - \Phi(-3)$
$\quad = 0.8413 - (1 - 0.9987)$
$\quad = 0.8413 - 0.0013 = 0.8400$

Figure: The standard normal curve (density versus variable).

Checking for Normality
Much of what we will be doing for the remainder of the semester is based upon the normal distribution. How can we check data to see if it has arisen from an underlying normal distribution?

There are several ways to check for an underlying normal distribution. The first is a QQ-type plot of the data against the theoretical counterparts. This is called a normal probability plot and is available in almost all statistical packages.

The second are numerical (analytical) tests that most statistics packages perform, such as the Shapiro-Wilk test. We will not discuss these in this course, but will examine the output of NCSS on occasion.

Example of a QQ-type plot: Lilliefors Plots
This is a graphical method that many statistical packages will provide, in various versions, when you ask for a normal probability plot. These are best illustrated by an example.

Checking for Normality: Example
The Environmental Protection Agency (EPA) has the responsibility of establishing estimates of miles per gallon (mpg) for both city and highway driving. These estimates are printed on the window stickers of every new car sold in this country.

In the past, ads for new cars have used these figures
usually with an asterisk referring to smaller print that gives a somewhat soft warning that "your mileage may vary." The truth is, your mileage will vary downward.

The reason for this is that the EPA tests are performed in a laboratory setting and not on the road. Many other factors such as wind, hills, driving style, etc. will cause the mileage to decrease.

n = 20 observations on mpg for a specific car:
24.21  24.35  23.82  24.21  24.14
24.60  23.75  25.01  24.66  24.72
24.47  24.38  23.08  23.88  23.09
24.57  25.16  24.62  24.62  25.14

The sample mean for the 20 observations of mpg was 24.324 and the sample standard deviation was 0.578. The following Z-scores (standardized sample values) were found by
$z_i = \frac{x_i - 24.324}{0.578}$
For example,
$z_1 = \frac{24.21 - 24.324}{0.578} = -0.20$

Z-scores:
-0.20   0.04  -0.87  -0.20  -0.32
 0.48  -0.99   1.19   0.58   0.69
 0.25   0.10  -2.15  -0.77  -2.13
 0.43   1.45   0.51   0.51   1.41

We now plot the cdf of these values on special graph paper and create what is called a Lilliefors plot. The Z-scores are on the horizontal axis, and we plot a jump of 1/n = 1/20 at each Z-score.

Figure: Lilliefors plot. Distribution function (probability) versus Z-scores.

Checking for Normality
Here is an example of a normal probability plot: the expected normal percentiles on the horizontal axis versus the actual data on the vertical axis. It should be a straight line if the data are normal.

The expected standard normal quantiles Z are computed
using the following definition for $p_i$:
$p_i = \frac{i - a}{n - 2a + 1}$
Common choices are a = 1/2 and a = 1/3.

Figure: Normal probability plot. Mileage versus expected normal percentile, using expected normal quantiles for a = 1/2.
Figure: Normal probability plot. Mileage versus expected normal percentile, using expected normal quantiles for a = 1/3.

Normal Prob. Plots
The next slide provides what NCSS gives as a normal probability plot when you request this as output. Let's see how to get this using NCSS.

Figure: Normal probability plot of mileage (mileage versus expected normals).

What is Statistics?
Statistics: Nature, Philosophy, Thinking, Inference
The Statistical Method

Some have defined it as the science of drawing conclusions from data in a reliable manner when faced with uncertainty.

H. G. Wells (circa 1940): "... statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

Walter Shewhart (1939): "The long-range contribution of statistics depends not so much on getting a lot of highly trained statisticians into industry as it does in creating a statistically minded generation of physicists, chemists, engineers and others who will in any way have a hand in developing and directing the production processes of tomorrow."

Statistical Thinking
Variability is omnipresent whenever measurement values are observed on a variable. Here are three sources of variability (there are others):
Process variability
Measurement variability
Sampling variability

Figure: Histogram for Newcomb's speed of light (1882). Frequency count versus translated and scaled
time measurement. The numbers are actually the time in seconds, multiplied by 10 to the 9, minus 24800.

Statistical Thinking
What makes the discipline of statistics useful and unique is that it is concerned with the process of getting data and understanding these problems in the presence of variable measurements.

Distribution
So whenever we observe a measurement repeatedly, we almost never get the same value. Instead, we obtain what is called a distribution of values.

Distribution (definition)
The pattern of variation of a variable is called its distribution. The distribution records and provides the numerical values of the variable and how often each value occurs.

Figure: Histogram for Newcomb's speed of light data (repeated).

Distributions
Our study of statistics will involve many aspects of dealing with distributions, both data distributions and theoretical distributions:
Location (mean, etc.)
Variation (variance & standard deviation)
Comparisons (between and among)

We need data. How is this data generated? Where does it come from, and does that matter? Yes! What are the properties of the data, and does it matter? Yes!

The Statistical Method (Inferential Procedure)
Two very, very important elements are:
1. The POPULATION
2. The SAMPLE
For good scientific experimentation, it is absolutely necessary for these to be well-defined and for the experimenter to keep these two notions clearly separated.

Population: a set of items to
be investigated (also called sample spaces, universes, aggregates, kollectivs, etc.).

Populations
Existent or real
Conceptual
Target population
Sampled population
All of the above may be well-defined or ill-defined.

Sample: a subset of the population
Representative: don't exist.
Random or simple random sample. C. S. Peirce: "... provided it is drawn by such machinery, artificial or ... any one individual of the whole lot would get taken as often as any other" (circa 1896).
Stratified random sample (blocking)
Systematic sample
Cluster
Purposive (for convenience)
Biased
Unbiased

Sample
So how do we take a "good" sample?

How do we obtain random samples?
Sample frame construction: a numerical listing of all the elements of the sampled population.
Use of a random mechanism to select numbers: a computer program, a table of random digits, or other means.

Figure: cartoon, "Tour of Accounting": "Over here we have our random number generator." "Nine nine nine nine nine nine." (Copyright United Feature Syndicate.)

Using tables of random digits
Population size N = 87
Sample size n = 10
Pick at random a place to start in the table. I chose the 5th column and the 14th row.

Two-digit numbers read from the table, with the order of selection in parentheses: 90 (discard), 13 (1), 40 (2), 51 (3), 50 (4), 94 (discard), 12 (5), 68 (6), 51 (duplicate), 55 (7), 05 (8), 69 (9), 28 (10).

Numbers selected, in ascending order, with the order of selection in parentheses: 5 (8), 12 (5), 13 (1), 28 (10), 40 (2), 50
(4), 51 (3), 55 (7), 68 (6), 69 (9).

Create your own table of random digits
Use Excel, or any other spreadsheet that has a function to generate uniform numbers in the interval (0, 1): RAND().

Excel: Microsoft seems to have heard the complaints and is being a little more open on what it is using. See http://support.microsoft.com/kb/828795

I like John von Neumann's take on random numbers: "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

Use random generators online:
http://www.fourmilab.ch/hotbits
http://www.random.org
http://www.lavarand.org

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 12

Today's Lecture
Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 5.4, 8.1 & 10.1.
Large Sample Confidence Intervals for Proportions
Confidence Intervals on Means with Unknown Variance
Intro to the Chi-Square Distribution & Student's T Distribution
Confidence Intervals for the Mean Using Small Samples

Central Limit Theorem Application
We can now use the Central Limit Theorem to construct a large sample confidence interval for an unknown proportion.

Confidence Interval for a Proportion p
$Z = \frac{Y - np}{\sqrt{np(1 - p)}} = \frac{\hat p - p}{\sqrt{p(1 - p)/n}}$
is approximately distributed as $N(0, 1)$.

The resulting confidence interval for p is
$\hat p \pm z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$
Note that this is a function of the
unknown proportion p.

If we solve this inequality for the unknown proportion p, we obtain the following for the confidence limits (writing z for $z_{\alpha/2}$ and $\hat q = 1 - \hat p$):
$LCL = \frac{\hat p + \frac{z^2}{2n} - z\sqrt{\frac{\hat p \hat q}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$
The upper confidence limit is as follows:
$UCL = \frac{\hat p + \frac{z^2}{2n} + z\sqrt{\frac{\hat p \hat q}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$
These agree with those given in several texts and should be used for small n (sometimes called the Wilson estimators).

Sample Size for a Confidence Interval for a Proportion p
The sample size formula is given by
$n = \frac{2z^2 \hat p \hat q - z^2 L^2 \pm \sqrt{4z^4 \hat p \hat q(\hat p \hat q - L^2) + L^2 z^4}}{L^2}$
where L is the width of the interval.

Confidence Interval for a Proportion p
If the sample size n is large, these limits reduce to the following:
$\hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p \hat q}{n}}$, where $\hat q = 1 - \hat p$.
This is an approximate $100(1 - \alpha)\%$ confidence interval for p.

Sample Size for a Confidence Interval for a Proportion p
And the sample size formula reduces to
$n = \frac{4 z^2 \hat p \hat q}{L^2}$
where L is the width of the interval.

Confidence Intervals Using Estimates of $\sigma^2$ (Large Sample Results)
What we discussed in the previous lecture requires that $\sigma^2$ be known. What if $\sigma^2$ is unknown?

At this point, the only thing we can do is simply substitute an estimate of $\sigma^2$, namely $s^2$, and hence use
$\bar x \pm z_{\alpha/2}\frac{s}{\sqrt n}$
This is an approximate $100(1 - \alpha)\%$ confidence interval for the mean $\mu$. How good is it? It's not too bad, provided the sample size is sufficiently large, even though this involves an additional approximation.

However, substituting s for $\sigma$ does produce
confidence intervals that on average will be wider (longer) than those produced by using a known value of $\sigma$, which can be a problem. This difference in length will not be noticeable whenever the sample size is sufficiently large.

However, the length of the confidence interval is not the only consideration that is important. For example, what if $\bar X$ is biased for some reason that is unknown to us? How do we guard against this error?

Part of the answer is to increase the sample size to increase our precision of estimation (this increases power). We will discuss this later.

Also note that if we use s in place of $\sigma$, then the length of the confidence interval will also be a random quantity. Hence, if we determine the sample size from a known or historical value for $\sigma$ and compute the interval using the sample standard deviation s, then the confidence interval will most likely not be of length L.

Confidence Intervals for $\mu$ using s
What we really must investigate is the sampling distribution of this statistic:
$T = \frac{\bar X - \mu}{S/\sqrt n}$ as opposed to $Z = \frac{\bar X - \mu}{\sigma/\sqrt n}$
This was first done by W. S. Gosset.

The Sampling Distribution of $S^2$
But before we can adequately do that, we need to introduce the sampling distribution of $S^2$, the sample variance computed from random samples $X_1, X_2, \ldots, X_n$ that arise from NORMAL DISTRIBUTIONS, $N(\mu, \sigma^2)$. That is, the sampling distribution of the random variable defined by
$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar X)^2$
The Sampling Distribution of $S^2$
We can prove that W has a very special sampling distribution, called the chi-square distribution with n - 1 degrees of freedom, where
$W = \frac{(n - 1)S^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(X_i - \bar X)^2}{\sigma^2}$

The density function of W is given by
$f(w) = \frac{1}{\Gamma(r/2)\,2^{r/2}}\, w^{r/2 - 1} e^{-w/2}, \quad 0 < w < \infty$
and $f(w) = 0$ elsewhere, where r = the degrees of freedom.

The Chi-square Distribution
The parameter r denotes the degrees of freedom. For brevity we write "W has $\chi^2_r$".

Figure: The chi-square curve. Plots of chi-square pdfs (density versus w) for several values of r, the degrees of freedom.

Percentage Points of the Chi-square Distribution
The percentage points $\chi^2_{\alpha, r}$ for selected probabilities $\alpha$, such that
$P(W \ge \chi^2_{\alpha, r}) = \alpha$ for $0 < \alpha < 1$,
are given in Table II in your textbook.

These percentage points can be found using NCSS's Probability Calculator, along with the cumulative probabilities of the chi-square distribution.

Properties of the Chi-square Distribution
Suppose W has $\chi^2_r$. The mean and variance of the chi-square distribution are given as follows:
$E(W) = r$
$Var(W) = 2r$

$S^2$ is Unbiased for $\sigma^2$
We can now prove that $S^2$ is unbiased for the unknown population variance $\sigma^2$.
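The moment formulas $E(W) = r$ and $Var(W) = 2r$, and the claim that $S^2$ is unbiased for $\sigma^2$, can be illustrated with a small simulation. This is a sketch under assumed settings, not part of the original notes: the sample size n = 5, the normal population with mean 10 and $\sigma = 3$, the number of replications, and the seed are all arbitrary choices made here.

```python
import random


def sample_variance(xs):
    """Sample variance with the n - 1 divisor, as in the definition of S^2."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def simulate(n=5, mu=10.0, sigma=3.0, reps=20000, seed=541):
    """Draw `reps` normal samples of size n; return the S^2 and W values."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    s2_values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        s2_values.append(sample_variance(sample))
    # W = (n - 1) S^2 / sigma^2 should behave like chi-square with n - 1 df.
    w_values = [(n - 1) * s2 / sigma ** 2 for s2 in s2_values]
    return s2_values, w_values


s2_values, w_values = simulate()
mean_s2 = sum(s2_values) / len(s2_values)
mean_w = sum(w_values) / len(w_values)
var_w = sample_variance(w_values)

print(f"average S^2 = {mean_s2:.3f}  (sigma^2 = 9)")
print(f"average W   = {mean_w:.3f}  (r = n - 1 = 4)")
print(f"variance W  = {var_w:.3f}  (2r = 8)")
```

The simulated averages only approximate the theoretical values; increasing `reps` tightens the agreement.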
Suppose $W = \frac{(n - 1)S^2}{\sigma^2}$ follows a chi-square with n - 1 degrees of freedom. Then
$E(W) = n - 1 = E\left[\frac{(n - 1)S^2}{\sigma^2}\right] = \frac{n - 1}{\sigma^2}E(S^2)$
or
$E(S^2) = \sigma^2$

Is S Unbiased for $\sigma$?
So $S^2$ is unbiased for $\sigma^2$. Is S unbiased for $\sigma$? The answer is NO! S is a biased estimator of $\sigma$. But the bias is not great, and we use S as an ESTIMATOR of $\sigma$.

Confidence Interval for $\mu$ with small sample sizes
If the sample size is small to moderate and the variance $\sigma^2$ is unknown, what do we do? Simply substituting S for $\sigma$ is fine provided the sample size is large enough for S to be a reasonably good estimator of $\sigma$.

Small Sample Conf. Int.
$P\left(-z_{\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt n} \le z_{\alpha/2}\right) = 1 - \alpha$
This was the starting place for deriving the confidence interval for $\mu$, assuming $\sigma$ is known.

But if we substitute s for $\sigma$ and the sample size is small to moderate, then the percentiles $z_{\alpha/2}$ are not correct.

These percentiles are not correct since, for small sample sizes, the sampling distribution of $\bar X$ standardized with s in place of $\sigma$ is no longer adequately described by the normal distribution.

If we substitute s for $\sigma$ and we wish to maintain a valid probability statement with probability $1 - \alpha$, then we must use a different percentage point, not $z_{\alpha/2}$.

The T Statistic
What we really must investigate is the sampling distribution of the following statistic:
$T = \frac{\bar X - \mu}{S/\sqrt n}$

Figure: first page of the paper "The Probable Error of a Mean," Biometrika, Volume VI, No. 1.
T Ratio Can Be Written As Follows
$T = \frac{\bar X - \mu}{S/\sqrt n} = \frac{(\bar X - \mu)/(\sigma/\sqrt n)}{S/\sigma} = \frac{Z}{S/\sigma}$, where Z is $N(0, 1)$.

Student's T Distribution
The random variable T has a very special sampling distribution, which is called the Student's T distribution with r = n - 1 degrees of freedom.

The pdf of the Student's T is as follows:
$f(t) = \frac{\Gamma\left(\frac{r + 1}{2}\right)}{\sqrt{\pi r}\,\Gamma\left(\frac{r}{2}\right)}\left(1 + \frac{t^2}{r}\right)^{-(r + 1)/2}$, where $-\infty < t < \infty$.

Figure: The Student's t curve (density versus variable). Student's t with 1 and 4 degrees of freedom.
Figure: The Student's t curve. Student's t with 4 degrees of freedom.

Percentage Points of the Student's T
Find $t_{\alpha, r}$ such that $P(T \ge t_{\alpha, r}) = \alpha$, where $\alpha$ is the upper tail probability. These are recorded in Table III. The t distribution is symmetric.

Figure: The Student's t curve. Plot of Student's t pdf (density versus t variable).

Percentage Points of the Student's T
n = 2, r = n - 1 = 1, $\alpha = 0.025$
Find $t_{0.025, 1}$ such that $P(T \ge t_{0.025, 1}) = 0.025$.
From Table IV, $t_{0.025, 1} = 12.706$. By contrast, $z_{0.025} = 1.960$.

Figure: The Student's t curve. Plot of Student's t pdf with $t_{0.025, 1} = 12.706$ marked.

And of course, these percentage points can be found for any value of the cumulative or upper tail probability and for any degrees of freedom (including non-integer degrees of freedom) using NCSS's Probability Calculator.

Figure: The Student's t curve. Plot of
Student's t pdf with t = 2.776 marked.

Moments of Student's T
$E(T) = 0$
$Var(T) = \frac{r}{r - 2}$ (for r > 2)

Exercise
Find $d_1$ such that $P(T > d_1) = 0.05$, or equivalently $P(T \le d_1) = 0.95$, for degrees of freedom r = 11. From NCSS's Probability Calculator we find $d_1 = 1.796$.

Figure: The Student's t curve. Plot of Student's t pdf (density versus t variable).

Exercise
Let r = 11. Find $d_2$ such that $P(-d_2 \le T \le d_2) = 0.95$.
$1 - \alpha = 0.95$, $\alpha = 0.05$, $\alpha/2 = 0.025$
This implies that $d_2 = 2.201$.

Figure: Student's t curve with $-d_2 = -t_{0.025, 11} = -2.201$ and $d_2 = t_{0.025, 11} = 2.201$ marked, and area 0.025 in each tail.

Confidence Interval for the Mean $\mu$ using S
$\bar X \pm t_{\alpha/2, r}\frac{S}{\sqrt n}$
is a $100(1 - \alpha)\%$ confidence interval for the mean $\mu$. This of course assumes that we are sampling from a $N(\mu, \sigma^2)$ population.

Example
The mean tearing strength of a certain brand of paper is under investigation by a manufacturer of laser printers. We assume that this measurement is normally distributed. A random sample of n = 22 sheets of paper was tested, and the sample mean tearing strength was 2.4 pounds.

Example (known $\sigma^2$)
$1 - \alpha = 0.95$, $\alpha = 0.05$, $\alpha/2 = 0.025$
X = tearing strength of paper (pounds)
X has $N(\mu, \sigma^2 = 0.04)$, n = 22, $\bar x = 2.4$
$\bar x \pm z_{\alpha/2}\frac{\sigma}{\sqrt n} = 2.4 \pm 1.96\frac{0.2}{\sqrt{22}} = (2.32, 2.48)$
This is a 95% confidence interval for the unknown mean $\mu$, and these values can be interpreted as plausible values of $\mu$.

Example (unknown $\sigma^2$)
$1 - \alpha = 0.95$, $\alpha = 0.05$, $\alpha/2 = 0.025$
X has $N(\mu, \sigma^2)$, n = 22, $\sigma^2$ is unknown, and furthermore suppose s = 0.24.
r = n - 1 = 22 - 1 = 21, $t_{0.025, 21} = 2.080$
$\bar x \pm t_{\alpha/2, r}\frac{s}{\sqrt n} = 2.4 \pm 2.080\frac{0.24}{\sqrt{22}} = (2.29, 2.51)$
Note that this is a wider interval.

When do you use the Student's T percentage points? (See Velleman's
comments.)
Any time you are sampling from a normally distributed population and the unknown variance $\sigma^2$ is estimated by $s^2$, then you MUST use the Student's t percentage points for sample sizes

If the original population is decidedly non-normally distributed and you have a sample size that is large enough to appeal to the Central Limit Theorem, then you can use the percentage points from the standard normal distribution.

That is, as the sample size gets larger and larger, the assumption that the original population is normally distributed is of less and less importance, due to the Central Limit Theorem.

Hence, if that population is relatively mound-shaped and somewhat symmetric, then use the percentage points from the Student's T distribution.

If you find yourself in situations where the original population is decidedly non-normally distributed and the sample size is small, then you must use other methods that we will not discuss here (exact sampling distributions and non-parametric methods).

Using NCSS's Probability Calculator to find a cumulative probability and p-value for a chi-square distribution with 2 degrees of freedom
Annotations on the calculator window:
Enter the degrees of freedom of the chi-square distribution.
This is the non-centrality parameter.
Enter the value, say x, of the chi-square distribution with the degrees of freedom given above.
This output window provides the cumulative probability associated with the value x entered, $P(X \le x)$. For a lower-tailed test, this would be the p-value for a test statistic computed from the data.
This is the complement of that probability, $1 - P = P(X > x)$. For an upper-tailed test, this would be the p-value.
For
a two-tailed test, you must double the appropriate value given above to get a p-value.

Using NCSS's Probability Calculator to find the 95th percentile of a chi-square distribution with 2 degrees of freedom
Annotations on the calculator window:
Enter the degrees of freedom of the chi-square distribution.
This is the non-centrality parameter.
Enter the cumulative probability corresponding to the percentage point desired.

Using NCSS's Probability Calculator to find a cumulative probability and p-value for a Student's t distribution with 12 degrees of freedom
Annotations on the calculator window:
Enter the degrees of freedom of the Student's t distribution.
This is the non-centrality parameter.
Enter the value, say T, of the Student's t distribution.
This is the cumulative probability associated with the value T, $P(t \le T)$. For a lower-tailed test, this would be the p-value.
This is the complement of the probability given above, $1 - P = P(t > T)$. For an upper-tailed test, this would be the p-value.
Note that the probabilities sum to one.
For a two-tailed test, you must double the appropriate value given above to get a p-value.

Using NCSS's Probability Calculator to find the 95th percentile of a Student's t distribution with 12 degrees of freedom
Annotations on the calculator window:
Enter the degrees of freedom for the Student's t distribution.
This window is for the non-centrality parameter.
Enter the cumulative probability corresponding to the percentage point desired.

IS n = 30 BIG ENOUGH?
A while back there was a discussion on the rule of thumb that a sample of size n = 30 is sufficient in practice for assuming normality of the sample mean. Does anyone know the source of this 'rule of thumb'?

Professor Paul Velleman, Cornell University, responded:

I recently responded to a High School student's question on the same subject. Here is what I
wrote.

Bill Willis asks: The textbook we are using says use the z distribution whenever the sample size exceeds 30 and the distribution is normal. Let sample s approximate population sigma. How come the t table in the AP booklet has degrees of freedom entries way past 30? What am I missing?

Bill, you have caught statistics teachers in one of our more egregious lies. Here is the truth.

Background (which you probably know): The t-distributions are a family of distributions, one for each value of the degrees of freedom parameter. t with infinite df is Normal, so the Normal distribution is a member of the family. Moreover, as df grows, the difference between t and Normal becomes very slight. If we know the true sigma, then we should use Normal in testing and estimation. Since we almost never know sigma, but rather estimate it from the data with s, we should instead use t, provided we have a random sample and are willing to assume underlying Normality of the data.

The rest of the story: In the olden days (years and years ago, before there were computers on every desktop and graphing calculators in every student's hands) we had to make tables of t and z values because there was no other practical way to get these values. This forced us to make several compromises. First, we had to work with a standard normal distribution (gotta make tables for only one of the Normal distributions, so let's make it have mean zero and sd 1). Second, we had to settle on a few special significance levels for t-tests (10%, 5%, 1%) because we don't want to have an entire book of t-tables (an entire table like the normal table for each df). Third, we had to choose an arbitrary number of df to tabulate for the t-distribution and declare that we should move to the Normal tables thereafter. The t-family goes on and on for any number of df, and is only really Normal at infinite df, so really complete t-tables should be infinitely long. The obvious choice was to stop at the bottom of the first page. Most book pages have between 40 and 50 lines, so most
t-tables (after allowing for headings and a final row labeled infinite df) go up to 30 or 40 df. All of these choices were arbitrary and had nothing whatever to do with statistics, only with practical compromises. To cover our tracks, statistics texts made a "rule" that after 30 or 40 df (depending on page size and type size) we should use the Normal distribution. This "rule" is not statistics; it is a rationalization of outdated practical compromises, but we all agreed to go along with the charade.

The modern truth is: Use t whenever you have estimated the standard deviation from the data and (a) you have a random sample or data from a properly randomized experiment, and (b) you are willing to assume that the data are Normally distributed. (a) is much more important than (b). You are OK if the underlying distribution is unimodal and reasonably ...

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 6

Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 5.1 & 6.1. Topics: Normal Distribution (continued); Checking Normality; The Nature of Statistical Inference; Populations & Samples; Random Sampling.

Finding Probabilities for Non-Standard Normals

You can prove that if X has an $N(\mu, \sigma^2)$ distribution, then the distribution of $Z = (X - \mu)/\sigma$ is $N(0, 1)$.

Example: percent production in spec

Certain shafts are manufactured to meet engineering specs, but there is variation in the produced items. Let X = diameter of shaft, where X has an $N(\mu, \sigma^2)$ distribution with $\mu = 0.251$ and $\sigma = 0.001$.

Given that X has a normal distribution with mean and standard deviation as given above, what proportion of this population of manufactured rotor shafts is in spec? Engineering specs call for $0.25 \pm 0.002$.

With X distributed as $N(\mu = 0.251, \sigma^2 = 0.001^2)$:

$$P(0.25 - 0.002 < X < 0.25 + 0.002) = P(0.248 < X < 0.252) = P\left(\frac{0.248 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{0.252 - \mu}{\sigma}\right)$$

$$= P\left(\frac{0.248 - 0.251}{0.001} < Z < \frac{0.252 - 0.251}{0.001}\right) = P(-3.00 < Z < 1.00) = \Phi(1) - \Phi(-3) = 0.8413 - (1 - 0.9987) = 0.8413 - 0.0013 = 0.8400$$

[Figure: the standard normal curve; density versus variable.]

Checking for Normality

Much of what we will be doing for the remainder of the semester is based upon the normal distribution. How can we check data to see if it has arisen from an underlying normal distribution?

There are several ways to check for an underlying normal distribution. The first is a QQ-type plot of the data against the theoretical counterparts. This is called a normal probability plot and is available in almost all statistical packages.

The second is a numerical (analytical) test of the kind that most statistics packages perform, such as the Shapiro-Wilk test. We will not discuss these in this course, but will examine the output of NCSS on occasion.

Example of a QQ-type plot: Lilliefors plots. This is a graphical method that many statistical packages provide, in various versions, when you ask for a normal probability plot. These are best illustrated by an example.

Checking for Normality: Example

The Environmental Protection Agency (EPA) has the responsibility of establishing estimates of miles per gallon (mpg) for both city and highway driving. These estimates are printed on the window stickers of every new car sold in this country.

In the past, ads for new cars have used these figures, usually with an asterisk referring to smaller print that gives a somewhat soft
warning that "your mileage may vary." The truth is, your mileage will vary: downward.

The reason for this is that the EPA tests are performed in a laboratory setting and not on the road. Many other factors, such as wind, hills, driving style, etc., will cause the mileage to decrease.

n = 20 observations on mpg for a specific car:

24.21  24.35  23.82  24.21  24.14  24.60  23.75  25.01  24.66  24.72
24.47  24.38  23.08  23.88  23.09  24.57  25.16  24.62  24.62  25.14

The sample mean for the 20 observations of mpg was 24.324 and the sample standard deviation is 0.578. The following Z-scores (standardized sample values) were found by

$$z_i = \frac{x_i - 24.324}{0.578}$$

For example, $z_1 = (24.21 - 24.324)/0.578 = -0.20$.

[Table of the 20 Z-scores; only the last three entries are legible: 0.51, 0.51, 1.41.]

We now plot the cdf of these values on special graph paper and create what is called a Lilliefors plot. The Z-scores are on the horizontal axis and we plot a jump of 1/n = 1/20 at each Z-score.

[Figure: Empirical Distribution Function; probability P versus Z-scores.]

Checking for Normality

Here is an example of a normal probability plot: the expected normal percentiles on the horizontal axis versus the actual data on the vertical axis. It should be a straight line if the data are normal.

The expected standard normal quantiles $Z_i$ are computed using the following definition for $p_i$:

$$p_i = \frac{i - a}{n - 2a + 1}$$

Common choices are a = 1/2 and a = 1/3.

Normal Probability Plot

[Figure: Normal Probability Plot; miles per gallon versus expected normal percentile, using expected normal quantiles for a = 1/2.]

Normal Prob Plots: The next slide provides what NCSS gives as a Normal Probability Plot when you request this as output. Let's see how to get this using NCSS.

[Figure: Normal Probability Plot; miles per gallon versus expected normal percentile, using expected normal quantiles for a = 1/3.]

[Figure: NCSS Normal Probability Plot of mileage; mileage versus Expected Normals.]

What is Statistics?

Statistics: Nature, Philosophy, Thinking, Inference, The Statistical Method.

Some have defined it as the science of drawing conclusions from data in a reliable manner when faced with uncertainty.

H. G. Wells (circa 1940): "statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

Walter Shewhart (1939): "The long-range contribution of statistics depends not so much on getting a lot of highly trained statisticians into industry as it does in creating a statistically minded generation of physicists, chemists, engineers and others who will in any way have a hand in developing and directing the production processes of tomorrow."

Statistical Thinking

Variability is omnipresent whenever measurement values are observed on a variable. Here are three sources of variability (there are others): process variability, measurement variability, sampling variability.

[Figure: Histogram for Newcomb's Speed of Light (1882); frequency count versus translated and scaled time measurement. Numbers are actually the TIME in seconds, multiplied by 10 to the 9, minus 24800.]

Statistical Thinking
What makes the discipline of statistics useful and unique is that it is concerned with the process of getting data and understanding these problems in the presence of variable measurements.

Distribution

So whenever we observe a measurement repeatedly, we almost never get the same value. Instead we obtain what is called a distribution of values.

Distribution (definition): The pattern of variation of a variable is called its distribution. The distribution records the numerical values of the variable and how often each value occurs.

[Figure: histogram for Newcomb's speed-of-light data, repeated.]

Distributions

Our study of statistics will involve many aspects of dealing with distributions, both data distributions and theoretical distributions: location (mean, etc.); variation (variance & standard deviation); comparisons between and among.

We need data. How is this data generated? Where does it come from, and does that matter? Yes! What are the properties of the data, and does it matter? Yes!

The Statistical Method (Inferential Procedure)

Two very, very important elements are:
1. The POPULATION
2. The SAMPLE

For good scientific experimentation it is absolutely necessary for these to be well-defined and for the experimenter to keep these two notions clearly separated.

Population: a set of items to be investigated (also called sample spaces, universes, aggregates, ...).

Populations: existent or real; conceptual; target population; sampled population.
All of the above may be well-defined or ill-defined.

Sample: a subset of the population

Representative (don't exist).

Random, or simple random, sample: "... provided it is drawn ... artificial or physiological, that in the long run any one individual of the whole lot would get taken as often as any other" (circa 1896).

Other kinds of sample: stratified random sample (blocking); systematic sample; cluster; purposive (for convenience); biased; unbiased.

So how do we take a good sample?

How do we obtain random samples?

Sample frame construction: a numerical listing of all the elements of the sampled population. Then use a random mechanism to select numbers: a computer program, a table of random digits, or other means.

[Cartoon with the text "NINE NINE NINE NINE NINE NINE"; copyright United Feature Syndicate.]

Using tables of random digits

Population size N = 87; sample size n = 10. Pick at random a place to start in the table. I chose the 5th column & the 14th row.

Reading two-digit numbers from that point, the values 90, 00 and 94 are discarded because they do not correspond to an item numbered 01 to 87, and a second occurrence of 51 is discarded as a duplicate. The ten numbers kept, with their order of selection, are:

No. selected (in ascending order)    Order of selection
05                                   8
12                                   5
13                                   1
28                                   10
40                                   2
50                                   4
51                                   3
55                                   7
68                                   6
69                                   9

Create your own table of random digits

Use Excel, or any other spreadsheet that has a function to generate uniform numbers in the interval (0, 1): RAND().

Excel: Microsoft seems to have heard the complaints and is being a little more open on what it is using. See http://support.microsoft.com/kb/828795

I like John von Neumann's take on random numbers: "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

Use random generators online:
http://www.fourmilab.ch/hotbits
http://www.random.org
http://www.lavarand.org

[Figure 33: map; remainder of caption illegible.]

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 23

Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 10.3 & 10.4, Chi-Square Tests for Goodness of Fit and Contingency Tables.

Bernoulli Trials

They result in only two outcomes: success/failure, zero/one, yes/no, favor/oppose, true/false, where p = P(success) and 1 − p = P(failure).

Consider Situations With Three Or More Outcomes

For example: $p_1$ = P(item is accepted), $p_2$ = P(item is scrapped), $p_3$ = P(item is reworked).

How can we test the following hypothesis?

H0: $p_1 = 0.83$, $p_2 = 0.03$, $p_3 = 0.14$
Ha: not true

We can do this with the basic CHI-SQUARE TEST, which was first published by Karl Pearson in 1900 (see Ian Hacking's article in Science 84).

Karl Pearson's Chi-square Test

Hacking, Ian (1984), "Trial by Number," Science 84, Vol. 5, No. 9, November, says this result is among the 20 greatest discoveries that have shaped our
lives in the 20th century.

Karl Pearson's Chi-square Test

Pearson's chi-square test measures the fit between theory and reality (expectations and observations), ushering in a new kind of decision making.

Ian Hacking (1984): "For better or worse, statistical inference has provided an entirely new style of reasoning. The quiet statisticians have changed our world, not by discovering new facts or technical developments, but by changing the ways we reason, experiment & form our opinions about it."

Chi-square Test

While we cannot develop all of the theory behind this test, we can give an intuitive argument as to why it works. Let $Y_1$ be the number of successes and $Y_2$ be the number of failures in n Bernoulli trials.

$Y_1$ has a binomial$(n, p_1)$ distribution and $Y_2$ has a binomial$(n, p_2)$ distribution, where $p_1 + p_2 = 1$. Then

$$Z = \frac{Y_1 - n p_1}{\sqrt{n p_1 (1 - p_1)}}$$

has an approximate $N(0, 1)$ distribution by the CLT.

How large does the sample size have to be? Previously we stated the following: $n p_1 \ge 10$ and $n(1 - p_1) \ge 10$. We will revise this criterion in a moment.

Furthermore, if Z is an $N(0, 1)$ variable, then it can be mathematically proved that $Q_1 = Z^2$ has a chi-square distribution with one degree of freedom.

And with a little bit of algebra you can show that $Q_1$ can be written as follows:

$$Q_1 = Z^2 = \frac{(Y_1 - n p_1)^2}{n p_1} + \frac{(Y_2 - n p_2)^2}{n p_2}, \quad \text{where } Y_2 = n - Y_1 \text{ and } p_2 = 1 - p_1$$

Note that $Y_2$ has a binomial$(n, p_2)$ distribution, and we have the following:

$$Q_1 = \sum_{i=1}^{2} \frac{(Y_i - n p_i)^2}{n p_i}$$

Since we have used the CLT, we say that $Q_1$ has an approximate $\chi^2_1$ distribution.

To generalize this, consider an experiment that results in k mutually exclusive and exhaustive events, where $Y_1, Y_2, \ldots, Y_k$ denote the
observed frequency counts for each of the k categories.

Likewise, let $p_1, p_2, \ldots, p_k$ denote the respective probabilities associated with each of these k categories.

There are constraints. The sample observations must be n independent, identically distributed Bernoulli trials (a random sample), with

$$\sum_{i=1}^{k} Y_i = n \quad \text{and} \quad \sum_{i=1}^{k} p_i = 1$$

What Pearson proved was that the following statistic has an approximate chi-square distribution with k − 1 degrees of freedom:

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(Y_i - n p_i)^2}{n p_i}$$

Some write this formula as follows:

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ = the observed frequency of the i-th category and $E_i$ = the expected frequency under H0 being true.

Now suppose that you wish to test the following hypothesis:

H0: $p_i = p_{i0}$, i = 1, 2, ..., k
H1: at least one inequality

where the $p_{i0}$ are specified (known) proportions subject to the constraint $\sum_{i=1}^{k} p_{i0} = 1$.

The test statistic is then given by

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(Y_i - n p_{i0})^2}{n p_{i0}}$$

and if the null hypothesis tends to be false, then we would expect to see large values of $Q_{k-1}$.

Therefore, we would reject the null hypothesis stated above if and only if the observed value of $Q_{k-1}$ is too large:

$$Q_{k-1} \ge \chi^2_{\alpha, k-1}$$

[Figure: the chi-square curve; plot of a chi-square pdf (density versus Q variable with ν degrees of freedom), with the rejection region marked.]

Example

Mendelian theory of genetics on the odds of two characteristics of peas:

            Green   Yellow
Round         9       3
Wrinkled      3       1

[Table: Independence; columns Green and Yellow, rows Round and Wrinkled; cell entries not legible.]
Example

This can be stated as a null hypothesis on four proportions:

H0: $p_1 = 9/16$, $p_2 = 3/16$, $p_3 = 3/16$, $p_4 = 1/16$
Ha: not true

Data: n = 80, with $Y_1 = 42$, $Y_2 = 17$, $Y_3 = 13$, $Y_4 = 8$.

Test statistic:

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(Y_i - n p_{i0})^2}{n p_{i0}} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

and $Q_{k-1}$ has an approximate $\chi^2_{k-1}$ distribution. Here $n = \sum_{i=1}^{k} Y_i = 42 + 17 + 13 + 8 = 80$.

Expected frequencies (theory, model, etc.): $E_i = n p_{i0}$ = expected frequency of the i-th cell. Note that $\sum_{i=1}^{k} E_i = n \sum_{i=1}^{k} p_{i0} = n$.

Cell     Expected frequency
1        80 (9/16) = 45
2        80 (3/16) = 15
3        80 (3/16) = 15
4        80 (1/16) = 5
Total    80

$$q_3 = \frac{(42 - 45)^2}{45} + \frac{(17 - 15)^2}{15} + \frac{(13 - 15)^2}{15} + \frac{(8 - 5)^2}{5} = 2.533$$

This is a measure of the distance between our expectations and our observations.

How big is too big? That is, how large does this have to be for us to conclude that what we have observed is too deviant from what we would expect by chance variation?

Choose α = 0.05; df = k − 1 = 4 − 1 = 3; $\chi^2_{0.05, 3} = 7.815$. Since $q_3 = 2.533 < 7.815$, we do not reject H0.

[Figure: the chi-square curve; plot of a chi-square pdf (density versus Q variable with ν degrees of freedom).]

Goodness of Fit & NCSS

Let's use NCSS to help us do the arithmetic & provide a p-value.

Contingency Tables

[Table layout: levels of Factor A as rows, levels of Factor B as columns.]

This leads to mutually exclusive and exhaustive outcomes $A_i \cap B_j$ for i = 1, 2, ..., a and j = 1, 2, ..., b. There are k = ab cells or groups. Repeat the experiment n times and let $Y_{ij}$ be the frequency of occurrence of the ij-th cell.

We wish to test hypotheses such as

H0: $P(A_i \cap B_j) = p_{ij}$
Ha: not true

where the $p_{ij}$ are known proportions.

When H0 is true, the statistic

$$Q_{ab-1} = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(Y_{ij} - n p_{ij})^2}{n p_{ij}}$$

has an approximate chi-square distribution with k − 1 = ab − 1 degrees of freedom.

There is a restriction on the application of these types of chi-square tests. Recall that we are essentially appealing to the Central Limit Theorem, so n must be large. In addition, we also need each expected frequency $n p_{ij}$ to be reasonably large.

So what are the criteria to determine if the sample size is large enough?
1. None of the expected frequencies can be smaller than 1, i.e., no $n p_{ij} < 1$.
2. No more than 20% of the expected frequencies are less than 5.

We can now begin to model some structure on this contingency table. For example, suppose that we suspect that Factor A and Factor B are statistically independent. What does this imply?

Let's form the row and column marginal totals in the table above. That is,

$$p_{i\cdot} = P(A_i) = \sum_{j=1}^{b} p_{ij}, \qquad p_{\cdot j} = P(B_j) = \sum_{i=1}^{a} p_{ij}$$

[Table layout: contingency table with Factor B as columns and the marginal totals added.]

When the hypothesis of independence is true, the structure on the probabilities in this table is as follows:

H0: $P(A_i \cap B_j) = P(A_i) P(B_j)$, or H0: $p_{ij} = p_{i\cdot}\, p_{\cdot j}$

When the null hypothesis is true, the test statistic becomes

$$Q = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(Y_{ij} - n p_{i\cdot} p_{\cdot j})^2}{n p_{i\cdot} p_{\cdot j}}$$

But this is not a true statistic, since it is a function of the quantities $p_{i\cdot}$ and $p_{\cdot j}$, which are unknown. So we substitute the best estimates for these parameters and then compute the statistic.

The best estimates are as follows:

$$\hat p_{i\cdot} = \frac{Y_{i\cdot}}{n}, \text{ where } Y_{i\cdot} = \sum_{j=1}^{b} Y_{ij} \text{ for } i = 1, 2, \ldots, a$$

$$\hat p_{\cdot j} = \frac{Y_{\cdot j}}{n}, \text{ where } Y_{\cdot j} = \sum_{i=1}^{a} Y_{ij} \text{ for } j = 1, 2, \ldots, b$$

Adjusting the degrees of freedom: If we replace parameters with their respective estimates from the data
then there is a corresponding reduction in the degrees of freedom (df) of the chi-square test: one for each parameter estimated.

There are a levels of Factor A, but we need to estimate only a − 1 of the row proportions, since $\sum_{i=1}^{a} p_{i\cdot} = 1$. The same is true for the marginal proportions for columns.

Hence the degrees of freedom can be computed as follows:

df = (ab − 1) − (a − 1) − (b − 1) = (a − 1)(b − 1)

Thus the test statistic becomes

$$Q = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(Y_{ij} - n \hat p_{i\cdot} \hat p_{\cdot j})^2}{n \hat p_{i\cdot} \hat p_{\cdot j}}$$

and has an approximate $\chi^2_{(a-1)(b-1)}$ distribution.

Example

Ninety graduating engineers were interviewed by the university to determine their initial starting salary (either high or low). Their respective grade point averages (low, average, and high) were also obtained. Are starting salaries and GPAs related?

Factor A = starting salary (rows); Factor B = grade point average (columns).

         Low    Avg    High   Total
Low       15     18      7      40
High       5     22     23      50
Total     20     40     30      90

Marginal proportions used under the hypothesis of independence:

         Low     Avg     High    Total
Low                              40/90
High                             50/90
Total    20/90   40/90   30/90

$$E_{11} = n \hat p_{1\cdot} \hat p_{\cdot 1} = 90 \left(\frac{40}{90}\right)\left(\frac{20}{90}\right) = 8.889, \qquad E_{12} = n \hat p_{1\cdot} \hat p_{\cdot 2} = 90 \left(\frac{40}{90}\right)\left(\frac{40}{90}\right) = 17.778$$

Table of expected frequencies under the hypothesis of independence:

         Low      Avg      High     Total
Low      8.889    17.778   13.333    40
High    11.111    22.222   16.667    50
Total   20        40       30        90

$$q_2 = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(Y_{ij} - n \hat p_{i\cdot} \hat p_{\cdot j})^2}{n \hat p_{i\cdot} \hat p_{\cdot j}} = \frac{(15 - 8.889)^2}{8.889} + \frac{(18 - 17.778)^2}{17.778} + \cdots + \frac{(23 - 16.667)^2}{16.667} = 12.97$$

df = (a − 1)(b − 1) = (2 − 1)(3 − 1) = (1)(2) = 2. Set α = 0.05: $\chi^2_{0.05, 2} = 5.9915$. Since $q_2$ exceeds 5.9915, we reject the hypothesis of independence.

[Figure: the chi-square curve; plot of a chi-square pdf (density versus Q variable with ν = 2 degrees of freedom), with the rejection region marked.]

Example

Is there a way
to display this lack of independence using a graphical method? Yes: it is called a segmented bar graph, or a stacked percentage bar graph, or a mosaic chart.

First we need to compute what are called the conditional distributions of Salary for the various levels of GPA (or of GPA given starting salary).

Conditional distributions of Salary given GPA:

         Low            Avg            High
Low      15/20 = 75%    18/40 = 45%     7/30 = 23%
High      5/20 = 25%    22/40 = 55%    23/30 = 77%
         100%           100%           100%

[Figure: Segmented Bar Graph, Starting Salary vs GPA; percentage on the vertical axis, grade point average (Low, Average, High) on the horizontal axis.]

What would such a chart look like if the two factors were independent?

         Low      Avg      High     Total
Low      0.0987   0.1975   0.1481   40/90
High     0.1235   0.2469   0.1852   50/90
Total    20/90    40/90    30/90

The conditional distributions for Salary given GPA are as follows under independence:

         Low     Avg     High
Low      0.44    0.44    0.44
High     0.56    0.56    0.56
         1.00    1.00    1.00

[Figure: Segmented Bar Graph, Starting Salary vs GPA, under independence; percentage on the vertical axis, grade point average (Low, Average, High) on the horizontal axis.]

[Scanned page: contents of Science 84, Fifth Anniversary Issue, Volume 5, No. 9, November, "20 Discoveries That Shaped Our Lives". Entry 5, "Trial by Number" by Ian Hacking: Karl Pearson's chi-square test measured the fit between theory and reality, ushering in a new sort of decision making.]

NCSS for doing Contingency Table Analysis and Segmented or Stacked Bar Charts

How to set up the data: [Screenshot of the NCSS data sheet with columns for GPA, SALARY and the cell frequency FREQ; entries not legible.]

For cell tabulation, click on Analysis > Proportions > Cross Tabs / Chi-Square Tests. The Cross Tabulation procedure window will appear. Under "Table Columns", select the column variable GPA in the "Discrete Variables" input window and be sure the "Numeric Variables (Width)" input window is blank. Under "Table Rows", select the row variable SALARY and make sure the "Numeric Variables (Width)" input window is blank. In the "Frequency Variable" input window, select the variable that specifies the cell frequencies, FREQ.

Under the Reports tab there are many options that control the amount of output that you can get. If you check the "List Report" option, then the output will list the input data set.

[Screenshot: NCSS Cross Tabulation dialog.]
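As a cross-check on the two worked chi-square examples in this lecture (the pea counts and the salary-by-GPA table), the arithmetic can be sketched in a few lines of Python. This is only an illustration and is not part of the original slides or of NCSS: the function names `pearson_q`, `expected_under_independence` and `chi2_sf` are made up here, and the p-values rely on closed-form expressions for the chi-square upper tail that hold only for 2 and 3 degrees of freedom.

```python
import math

def pearson_q(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_under_independence(table):
    """Expected counts n * p_i. * p_.j = (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_sf(q, df):
    """Upper-tail probability P(chi-square > q); closed forms for df = 2 or 3 only."""
    if df == 2:
        return math.exp(-q / 2)
    if df == 3:
        return math.erfc(math.sqrt(q / 2)) + math.sqrt(2 * q / math.pi) * math.exp(-q / 2)
    raise ValueError("this sketch only handles df = 2 or df = 3")

# Goodness of fit: pea counts against the 9:3:3:1 ratios, n = 80.
pea_counts = [42, 17, 13, 8]
pea_expected = [80 * p for p in (9 / 16, 3 / 16, 3 / 16, 1 / 16)]
q_peas = pearson_q(pea_counts, pea_expected)
p_peas = chi2_sf(q_peas, df=3)

# Independence: starting salary (rows) by GPA (columns), n = 90.
salary_by_gpa = [[15, 18, 7], [5, 22, 23]]
exp_table = expected_under_independence(salary_by_gpa)
q_salary = pearson_q(
    [y for row in salary_by_gpa for y in row],
    [e for row in exp_table for e in row],
)
p_salary = chi2_sf(q_salary, df=2)

# For df = 2 the upper 5% point has the closed form -2 ln(0.05).
crit_df2 = -2 * math.log(0.05)

print(f"peas:   q = {q_peas:.3f}, p-value = {p_peas:.3f}")
print(f"salary: q = {q_salary:.2f}, p-value = {p_salary:.4f}, critical value = {crit_df2:.4f}")
```

Run as a script, this gives q of about 2.533 for the peas, which is well below 7.815, and q of about 12.98 for the salary table, which is above the critical value of about 5.9915. The slide's 12.97 differs only in the last digit, presumably because of rounded intermediate values.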
[Screenshot: NCSS Cross Tabulation dialog, Variables and Reports tabs. Legible settings: Discrete Variables SALARY, Frequency Variable FREQ; report options include Counts, Row Percents, Column Percents, Table Percents, Expected Values, Chi-Square, Fisher's Exact Test and Armitage Proportion Trend Test.]

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 18

Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 1114. Topics: Multiple Comparisons; Methods of Displaying the Results of Multiple Comparisons; use of NCSS to do One-Way ANOVA; Problems with Multiple Comparisons.

One-Way ANOVA

We use the One-Way ANOVA to test

H0: $\mu_1 = \mu_2 = \cdots = \mu_k$
Ha: at least one inequality

Suppose we reject H0 in favor of Ha. How are the means not equal to one another?

Multiple Comparisons

Multiple comparisons is an example of the difficulties with simultaneous statistical inference. We cannot simply make one decision at a time, followed by another, and expect the probabilities of errors to remain valid.

To compensate for the fact that there are k(k − 1)/2 possible such pairwise comparisons, and for the lack of independence among the intervals, we must modify these confidence intervals by making them wider.

For the remainder of this discussion, we will assume that the treatment population means have been ordered and we will always form differences between the
sample means that are positive. Thus we will assume, without loss of generality, that $\bar Y_i \ge \bar Y_j$.

The new comparison interval is given by

$$(\bar Y_i - \bar Y_j) \pm Q_{\alpha, m, \nu} \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where $Q_{\alpha, m, \nu}$ is a percentage point from the Studentized Range Distribution.

These percentage points $Q_{\alpha, m, \nu}$ can be chosen in several ways. The particular choice that you make determines the type of multiple comparison procedure that is used.

Notation:
m = a parameter to control for the number of means spanned in a comparison
ν = the degrees of freedom associated with MSE from the ANOVA table
α = the probability level to control the probability of error (both pairwise comparison errors and experimentwise errors)

We are going to construct three types of multiple comparisons: Fisher's LSD, Tukey's HSD, and Student-Newman-Keuls (SNK).

We will also consider a fourth & fifth type of multiple comparisons, called Duncan's Multiple Range Test and the Scheffé Method. We will discuss them but not develop any of the background or computation, since they are defined quite differently from the previous three.

Multiple Comparison Procedures

The multiple comparison intervals for pairwise comparisons are given by

$$(\bar Y_i - \bar Y_j) \pm Q_{\alpha, m, \nu} \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Recall that we assume, w.o.g., $\bar Y_i \ge \bar Y_j$, and that $Q_{\alpha, m, \nu}$ is a percentage point from the Studentized Range Distribution.

Multiple Comparison Procedures: Fisher's

Fisher's LSD Procedure: Let $Q_{\alpha, m, \nu} = \sqrt{2}\, t_{\alpha/2, \nu}$. If you choose the percentage points with this definition, then you get Fisher's LSD (Least Significant Difference) procedure. While it is not apparent, this corresponds to m = 2.

For example, if k = 2, ν = 17 and α = 0.05, we have the following results using NCSS's Probability Calculator:

$$Q_{\alpha, m, \nu} = Q_{0.05, 2, 17} = 2.98375, \qquad \sqrt{2}\, t_{0.025, 17} = (1.414214)(2.109816) = 2.98373$$

Multiple Comparison Procedures: Tukey's

Tukey's HSD Procedure (Honestly Significantly Different): Let m = k, i.e., $Q_{\alpha, m, \nu} = Q_{\alpha, k, \nu}$, where the percentage points are chosen from the Studentized Range Distribution.

Note the choice of m = k = number of populations. Table V in your textbook provides the percentage points of the Studentized Range Distribution. They can also be found using NCSS's Probability Calculator.

Note: for k > 2 we can show that $Q_{\alpha, k, \nu}/\sqrt{2} > t_{\alpha/2, \nu}$; thus these intervals will ALWAYS be wider than the Fisher's LSD intervals based on the Student's t distribution.

For example, if k = 3, ν = 17 and α = 0.05, we have the following:

$$\frac{Q_{\alpha, k, \nu}}{\sqrt{2}} = \frac{Q_{0.05, 3, 17}}{\sqrt{2}} = \frac{3.62798}{\sqrt{2}} = 2.56537, \qquad t_{\alpha/2, \nu} = t_{0.025, 17} = 2.10982$$

Multiple Comparison Procedures: SNK

Student-Newman-Keuls (SNK) Procedure: $Q_{\alpha, m, \nu} = Q_{\alpha, L, \nu}$. Note the choice of m = L = 2, 3, ..., k.

The parameter L is defined to be the number of means being spanned by the difference $\bar Y_i - \bar Y_j$, inclusively. Suppose we compare k = 5 means that are arranged in ascending order.

Definition: number of means being spanned

$$\bar Y_E < \bar Y_A < \bar Y_D < \bar Y_C < \bar Y_B$$

Comparing $\bar Y_E$ to $\bar Y_A$: m = L = 2.
Comparing $\bar Y_E$ to $\bar Y_C$: m = L = 4.
Comparing $\bar Y_E$ to $\bar Y_B$: m = L = 5.

Multiple Comparison Procedures: Duncan's

Duncan's Multiple Range Test: Let $Q_{\alpha, m, \nu} = Q_{\alpha_L, L, \nu}$, where L is chosen in the same manner as in
the SNK procedure an L l 71 1 a m J M Multiple Comparison Procedures Duncan s Note that this makes the probability level 1 a function of L and thus special tables are needed however NCSS also offers Duncan s Procedure as one of the multiple comparison alternatives but the distribution is not in the Prob Calculator umquot 11 Jans M Damprt 25 MW 5m 41 Elpyrigm 2qu Multiple Comparison Procedures Scheffe s Scheffe s method applied to these pair wise confidence intervals is really an application of testing contrasts and as such it involves the F test from the ANOVA procedure and is an exact procedure It is the only one considered here that is not an approximation umquot 11 Jans M Damprt 25 MW 5m 41 Elpyrighlellll Multiple Comparison Procedures Each of these five procedures will have to a lesser or greater extent some probability as to the number of pairwise comparisons declared significant on average given the method the assumptions etc etc umquot 11 Jans M Damprt 27 MW 5m 41 Elpyrigm 2qu Multiple Comparison Procedures Investigations have shown that these procedures have the followin relationships as to the average number of pairwise comparisons declared significant They are in descending order Jans M umquot 11 Damprt 2 vcus 5m 41 Elpyrighlellll Average No of Pairwise Comparisons Actually Found Fisher s LSD the most Duncan s MR Test SNK s Tukey s HSD Scheffe s the fewest umquot 11 Jans M Damprt 21 MW 5m 41 Elpyrigm 2qu Multiple Comparison Procedures anomalies It may rarely happen that the ANOVA F test will produce an insignificant result and yet one of the pairwise comparisons procedures will show one or more pairs of means as being significantly different umquot 11 Jans M Damprt an vcus 5m 41 Elpyrighlellll Multiple Comparison Procedures anomalies You must remember that these pairwise multiple comparisons procedures are approximate procedures If you fail to reject the null hypothesis in the ANOVA table then you should M do the multiple comparisons procedures umquot 11 
Multiple Comparison Procedures: anomalies

You most likely will notice this when you are using a statistical package to analyze your data. Often the multiple comparison procedures are provided as a default, hence you will see that one or more pairwise differences are flagged but the F test in the ANOVA is nonsignificant.

Sometimes, but rarely, you will observe a significant F test in the ANOVA, indicating differences in the means, but the multiple comparison procedures fail to pick up a significant pairwise difference in the group means.

This is where Scheffé's method comes into play. If you observe a significant F test in the ANOVA table, then you can prove that there exists at least one significant comparison (contrast) among the population means.

That contrast may not be a simple linear contrast of two means, $\mu_i - \mu_j$, but most likely will be more complicated. The general definition of a contrast among the $k$ means is given by

$$\sum_{i=1}^{k} a_i \mu_i = 0, \quad \text{where} \quad \sum_{i=1}^{k} a_i = 0$$

Contrasts that are not pairwise, for example:

$$\mu_1 + \mu_2 - 2\mu_3 = 0 \quad \text{or} \quad \frac{\mu_1 + \mu_2}{2} = \mu_3$$

Scheffé's procedure is more complicated than the pairwise procedures. Hence it will find these more complicated contrasts if properly implemented. Consequently, it is conservative in its ability to find pairwise differences.

Scheffé's procedure is an exact procedure. That is, it controls all of the errors at the levels advertised, provided all of the assumptions are satisfied. It just may never find a pairwise comparison that is significant, only the more complicated contrasts.
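The pairwise comparison interval defined in this lecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the summary values are those of the beam deflection example in these notes (with MSE read as 6.0 and $\nu = 17$), and the percentage point $Q_{0.05,3,17} = 3.62798$ is the NCSS value quoted in the notes rather than something computed here.

```python
import math


def comparison_interval(mean_hi, mean_lo, n_hi, n_lo, mse, q):
    """Interval for (mean_hi - mean_lo) using the studentized-range point q."""
    diff = mean_hi - mean_lo
    half_width = (q / math.sqrt(2)) * math.sqrt(mse * (1 / n_hi + 1 / n_lo))
    return diff - half_width, diff + half_width


def tukey_intervals(means, sizes, mse, q):
    """All pairwise intervals, larger sample mean first, with one common q."""
    labels = sorted(means, key=means.get, reverse=True)
    results = {}
    for pos, hi in enumerate(labels):
        for lo in labels[pos + 1:]:
            lower, upper = comparison_interval(
                means[hi], means[lo], sizes[hi], sizes[lo], mse, q
            )
            # The difference is non-negative, so the interval excludes 0
            # exactly when its lower endpoint is positive.
            results[(hi, lo)] = (lower, upper, lower > 0)
    return results


# Summary values as read from the beam deflection slides (assumed decimals).
means = {"A": 84.0, "B": 77.0, "C": 79.0}
sizes = {"A": 8, "B": 6, "C": 6}
MSE = 6.0
Q_TUKEY = 3.62798  # Q(0.05; k=3, v=17), quoted from NCSS in the notes

intervals = tukey_intervals(means, sizes, MSE, Q_TUKEY)
for (hi, lo), (lower, upper, different) in intervals.items():
    verdict = "different" if different else "not different"
    print(f"mu_{hi} - mu_{lo}: ({lower:.2f}, {upper:.2f}) -> {verdict}")
```

Because the sketch uses the unrounded percentage point, its endpoints can differ in the last digit from values rounded by hand on the slides.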
Example: Beam Deflection

Let's construct the Tukey Multiple Comparison Confidence Intervals for the pairwise comparisons of these means:

$$\bar{y}_A = 84, \quad \bar{y}_C = 79, \quad \bar{y}_B = 77$$

For $\mu_A - \mu_B$ the interval is given by

$$\bar{y}_A - \bar{y}_B \pm \frac{Q_{\alpha,k,\nu}}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

[Figure: Dot Plot of Three Treatment Averages for the Steel & Alloy Beams Data; horizontal axis from 75 to 85]

Tukey's HSD Procedure, $m = k = 3$:

- $\bar{y}_A = 84$, $n_A = 8$
- $\bar{y}_B = 77$, $n_B = 6$
- Set $\alpha = 0.05$; $MSE = 6.0$, $\nu = 17$
- From NCSS's Probability Calculator, $Q_{\alpha,k,\nu} = Q_{0.05,3,17} = 3.62798$

$$84 - 77 \pm 3.63\sqrt{\frac{6.0}{2}\left(\frac{1}{8} + \frac{1}{6}\right)} = 7 \pm 3.63\sqrt{0.875} = 7 \pm 3.396 \;\Rightarrow\; (3.61,\ 10.39)$$

The three intervals are:

$$3.61 \le \mu_A - \mu_B \le 10.39$$
$$1.60 \le \mu_A - \mu_C \le 8.40$$
$$-1.63 \le \mu_C - \mu_B \le 5.63$$

We conclude $\mu_A \ne \mu_B$, $\mu_A \ne \mu_C$ and $\mu_C = \mu_B$.

Multiple Comparisons: Methods of Displaying

How do we display these results, either graphically, in text, or in table form? The first method is to list the means in ascending order and underline homogeneous subsets.

Underlining Homogeneous Subsets

For the beam deflection example the ordered means are 77, 79, 84 ($\mu_B$, $\mu_C$, $\mu_A$), with a single underline joining $\mu_B$ and $\mu_C$.

As another example, suppose we have six populations, A through F, and that the sample means in ascending order are as follows:

$$\bar{Y}_E \le \bar{Y}_F \le \bar{Y}_C \le \bar{Y}_D \le \bar{Y}_B \le \bar{Y}_A$$

Furthermore, let's suppose that the Tukey comparison intervals led us to conclude:

- $\mu_E$ is different from $\mu_D$, $\mu_B$ & $\mu_A$
- $\mu_F$ is different from $\mu_A$
- $\mu_C$ is different from $\mu_A$
- $\mu_D$ is different from $\mu_E$ & $\mu_A$
- $\mu_B$ is different from $\mu_E$
- $\mu_A$ is different from $\mu_E$, $\mu_F$, $\mu_C$ & $\mu_D$

The final underlining of homogeneous subsets lists the means in the order E, F, C, D, B, A. From the differences listed above, the underlines join E, F and C; then F, C, D and B; then B and A.

Multiple Comparisons: Methods of Displaying

The second method is to construct a matrix of differences of the means. Put the columns and rows in ascending order according to the values of their sample means. Enter the values of the differences of the sample means along with the critical values.

Example: Beam Deflection

Enter the value of $\bar{Y}_i - \bar{Y}_j$. For A - B or A - C, compare to 3.40. For B - C, compare to 3.63.

Variations, continued: subtract the critical distances from the differences in the sample means and note which are positive.

[Table: matrix of differences minus critical distances for A, B and C; only the B versus C entry, 2 - 3.63 = -1.63, survives in this copy]

Multiple Comparisons: Methods of Displaying

A third method of displaying the multiple comparison results is to use what is called an Engineering Plot. However, we must require that the sample sizes be equal from population to population, and we only use Tukey's procedure.

Multiple Comparisons: Engineering Plot

You construct a bar graph to represent the value of the treatment means. At the top of each bar you add a whisker of length equal to the HSD critical distance, half above the top of the bar and half below.

The HSD critical distance is given by

$$HSD = \frac{Q_{\alpha,k,\nu}}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

Multiple Comparisons: Engineering Plot Example

An experiment was conducted to determine the compressive strength of concrete cylinders made by three different drying methods, A, B & C. Five observations were obtained for each of the three treatments.
Engineering Plot Example: The data yielded the following:

$$\bar{y}_A = 47.2, \quad \bar{y}_B = 51.8, \quad \bar{y}_C = 46.2, \quad MSE = 5.85, \quad n_A = n_B = n_C = 5, \quad \nu = 12, \quad Q_{0.05,3,12} = 3.773$$

The HSD critical distance is given by

$$HSD = \frac{3.773}{\sqrt{2}} \sqrt{5.85\left(\frac{1}{5} + \frac{1}{5}\right)} = (2.6679)(1.5297) = 4.081$$

[Figure: Engineering Plot to Show Multiple Comparisons; bar graph of the treatment means with HSD whiskers; horizontal axis: Drying Method; vertical axis from 40 to 60]

The underlining of homogeneous subsets would produce the following result: the ordered means are 46.2, 47.2, 51.8 ($\mu_C$, $\mu_A$, $\mu_B$). With HSD = 4.081, an underline joins $\mu_C$ and $\mu_A$, while $\mu_B$ stands apart.

Multiple Comparisons Procedures: SNK

Let us turn our attention to the SNK intervals to see how they differ. Recall the intervals are given by

$$\bar{Y}_i - \bar{Y}_j \pm \frac{Q_{\alpha,L,\nu}}{\sqrt{2}} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where $L$ = the number of means spanned in the comparison of the two means. Let's compute the intervals for the beam deflection example to see how they differ.

Example: Beam Deflection

The ordered means are 77, 79, 84 (B, C, A). In every case $\alpha = 0.05$, $MSE = 6.0$ and $\nu = 17$, and the percentage points come from NCSS's Probability Calculator.

- For $\mu_A - \mu_B$, $L = 3$: $\bar{y}_A = 84$, $n_A = 8$; $\bar{y}_B = 77$, $n_B = 6$; $Q_{\alpha,L,\nu} = Q_{0.05,3,17} = 3.62798$ (same as Tukey's)
- For $\mu_A - \mu_C$, $L = 2$: $\bar{y}_A = 84$, $n_A = 8$; $\bar{y}_C = 79$, $n_C = 6$; $Q_{\alpha,L,\nu} = Q_{0.05,2,17} = 2.98375$
- For $\mu_C - \mu_B$, $L = 2$: $\bar{y}_C = 79$, $n_C = 6$; $\bar{y}_B = 77$, $n_B = 6$; $Q_{\alpha,L,\nu} = Q_{0.05,2,17} = 2.98375$

$$84 - 77 \pm 3.63\sqrt{\frac{6.0}{2}\left(\frac{1}{8} + \frac{1}{6}\right)}, \qquad 84 - 79 \pm 2.98\sqrt{\frac{6.0}{2}\left(\frac{1}{8} + \frac{1}{6}\right)}, \qquad 79 - 77 \pm 2.98\sqrt{\frac{6.0}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}$$

The resulting intervals are:

- A - B: $7 \pm 3.39$, so $3.61 \le \mu_A - \mu_B \le 10.39$
- A - C: $5 \pm 2.79$, so $2.21 \le \mu_A - \mu_C \le 7.79$
- C - B: $2 \pm 2.98$, so $-0.98 \le \mu_C - \mu_B \le 4.98$
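The SNK intervals differ from Tukey's only in that the percentage point depends on the span $L$. A minimal Python sketch of that bookkeeping follows. The function and variable names are hypothetical, the two percentage points are the NCSS values quoted in these notes for $\nu = 17$, and the sketch only reproduces the intervals as the notes present them; it does not implement the step-down testing order used in a full SNK analysis.

```python
import math

# Q(0.05; L, v=17) for each span L, as quoted from NCSS in the notes.
Q_BY_SPAN = {2: 2.98375, 3: 3.62798}


def snk_intervals(means, sizes, mse, q_by_span):
    """Pairwise intervals where q depends on the number of means spanned."""
    order = sorted(means, key=means.get)  # ascending sample means
    rank = {label: pos for pos, label in enumerate(order)}
    results = {}
    for pos, lo in enumerate(order):
        for hi in order[pos + 1:]:
            span = rank[hi] - rank[lo] + 1  # means spanned, inclusive
            q = q_by_span[span]
            half_width = (q / math.sqrt(2)) * math.sqrt(
                mse * (1 / sizes[hi] + 1 / sizes[lo])
            )
            diff = means[hi] - means[lo]
            results[(hi, lo)] = (span, diff - half_width, diff + half_width)
    return results


# Beam deflection summary values as read from the slides (assumed decimals).
means = {"A": 84.0, "B": 77.0, "C": 79.0}
sizes = {"A": 8, "B": 6, "C": 6}
MSE = 6.0

snk = snk_intervals(means, sizes, MSE, Q_BY_SPAN)
for (hi, lo), (span, lower, upper) in snk.items():
    print(f"mu_{hi} - mu_{lo} (L={span}): ({lower:.2f}, {upper:.2f})")
```

Only the comparison that spans all three means uses the Tukey percentage point; the two adjacent comparisons use the smaller $L = 2$ value, which is why their intervals are shorter.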
Comparing Tukey's HSD & SNK Intervals

| Comparison | Tukey's HSD Intervals | SNK Intervals |
|---|---|---|
| A - B | (3.61, 10.39) | (3.61, 10.39) |
| A - C | (1.60, 8.40) | (2.21, 7.79) |
| C - B | (-1.63, 5.63) | (-0.98, 4.98) |

Note that the Tukey intervals are at least as long as the SNK intervals. Hence Tukey's procedure would, on average, declare fewer significant pairwise differences than does the Student-Newman-Keuls procedure. This agrees with the ordering given earlier: Fisher's LSD (the most), Duncan's MR Test, SNK, Tukey's HSD, Scheffé's (the fewest).

Linear Models

Of course there exist more complicated models than the One-Way ANOVA or Completely Randomized Design. If you are interested in learning more about these statistical models, you should take a course in Design and Analysis of Experiments (STAT 642).

The following output from NCSS provides the One-Way Analysis of Variance and Multiple Comparison Report for a set of data involving the presence of mercury pollution in a certain river. Six measurements were made at each of six different sites downstream from the source of the contamination. I have highlighted the pairwise comparison results for the five methods that we discussed in class. Note the differing number of pairwise comparisons declared significantly different for each method.

Analysis of Variance Report

Page/Date/Time: 1, 4/17/2003 10:49:50 AM
Database: DMY CURRENT WORK FILESCOU DATASETSGENERALMERCURYSO
Response: CONC

Analysis of Variance Table

| Term | DF | Sum of Squares | Mean Square | F-Ratio | Prob Level | Power (Alpha = 0.05) |
|---|---|---|---|---|---|---|
| A: STATION | 5 | 230.1271 | 46.02542 | 29.21 | 0.000000* | 1.000000 |
| S(A) | 30 | 47.27778 | 1.575926 | | | |
| Total (Adjusted) | 35 | 277.4049 | | | | |
| Total | 36 | | | | | |

\* Term significant at alpha = 0.05

Means and Effects Section

| Term | Count | Mean | Standard Error | Effect |
|---|---|---|---|---|
| All | 36 | 3.125833 | | 0.5209723 |
| A: STATION | | | | |
| CA | 6 | 0.2016667 | 0.5124981 | -0.3193056 |
| CB | 6 | 0.7 | 0.5124981 | 0.1790278 |
| E1 | 6 | 1.893333 | 0.5124981 | 1.372361 |
| E2 | 6 | 3.258333 | 0.5124981 | 2.737361 |
| E3 | 6 | 7.24 | 0.5124981 | 6.719028 |
| E4 | 6 | 5.461667 | 0.5124981 | 4.940694 |

Plots of Means Section

[Figure: Means of CONC; vertical axis: CONC; horizontal axis: STATION]

Fisher's LSD Multiple-Comparison Test

Response: CONC. Term A: STATION. Alpha = 0.050, Error Term = S(A), DF = 30, MSE = 1.575926, Critical Value = 2.0423.

Fisher's LSD procedure declared 12 pairwise differences significantly different.

| Group | Count | Mean | Different From Groups |
|---|---|---|---|
| CA | 6 | 0.2016667 | E1, E2, E4, E3 |
| CB | 6 | 0.7 | E2, E4, E3 |
| E1 | 6 | 1.893333 | CA, E4, E3 |
| E2 | 6 | 3.258333 | CA, CB, E4, E3 |
| E4 | 6 | 5.461667 | CA, CB, E1, E2, E3 |
| E3 | 6 | 7.24 | CA, CB, E1, E2, E4 |

Notes: This report provides multiple comparison tests for all pairwise differences between the means. When this procedure is used only after the F-test associated with this term is significant at the same error rate, these tests are approximately accurate. When the F-test associated with this term is ignored, this procedure does not account for the multiplicity of tests. In either case, the Tukey-Kramer test is better.

Duncan's Multiple-Comparison Test

Response: CONC. Term A: STATION. Alpha = 0.050, Error Term = S(A), DF = 30, MSE = 1.575926.

Duncan's procedure declared 11 pairwise differences significantly different.

| Group | Count | Mean | Different From Groups |
|---|---|---|---|
| CA | 6 | 0.2016667 | E2, E4, E3 |
| CB | 6 | 0.7 | E2, E4, E3 |
| E1 | 6 | 1.893333 | E4, E3 |
| E2 | 6 | 3.258333 | CA, CB, E4, E3 |
| E4 | 6 | 5.461667 | CA, CB, E1, E2, E3 |
| E3 | 6 | 7.24 | CA, CB, E1, E2, E4 |

Notes: This report provides multiple comparison tests for all pairwise differences between the means. According to Hsu (1996, page 130), the specified family-wise error rate (alpha) is overstated and the Tukey-Kramer method is recommended instead.

Newman-Keuls Multiple-Comparison Test

Response: CONC. Term A: STATION. Alpha = 0.050, Error Term = S(A), DF = 30, MSE = 1.575926.

SNK's procedure declared 11 pairwise differences significantly different (same as Duncan's).

| Group | Count | Mean | Different From Groups |
|---|---|---|---|
| CA | 6 | 0.2016667 | E2, E4, E3 |
| CB | 6 | 0.7 | E2, E4, E3 |
| E1 | 6 | 1.893333 | E4, E3 |
| E2 | 6 | 3.258333 | CA, CB, E4, E3 |
| E4 | 6 | 5.461667 | CA, CB, E1, E2, E3 |
| E3 | 6 | 7.24 | CA, CB, E1, E2, E4 |

Notes: This report provides multiple comparison tests for all pairwise differences between the means. According to Hsu (1996, page 127), the specified family-wise error rate (alpha) is overstated and the Tukey-Kramer method is recommended instead.

Tukey-Kramer Multiple-Comparison Test

Response: CONC. Term A: STATION. Alpha = 0.050, Error Term = S(A), DF = 30, MSE = 1.575926, Critical Value = 4.3015.

Tukey's HSD procedure declared 9 pairwise differences significantly different.

| Group | Count | Mean | Different From Groups |
|---|---|---|---|
| CA | 6 | 0.2016667 | E2, E4, E3 |
| CB | 6 | 0.7 | E2, E4, E3 |
| E1 | 6 | 1.893333 | E4, E3 |
| E2 | 6 | 3.258333 | CA, CB, E3 |
| E4 | 6 | 5.461667 | CA, CB, E1 |
| E3 | 6 | 7.24 | CA, CB, E1, E2 |

Notes: This report provides multiple comparison tests for all pairwise differences between the means.

Scheffe's Multiple-Comparison Test

Response: CONC. Term A: STATION. Alpha = 0.050, Error Term = S(A), DF = 30, MSE = 1.575926, Critical Value = 3.5592.

Scheffé's procedure declared 8 pairwise differences significantly different.

| Group | Count | Mean | Different From Groups |
|---|---|---|---|
| CA | 6 | 0.2016667 | E2, E4, E3 |
| CB | 6 | 0.7 | E4, E3 |
| E1 | 6 | 1.893333 | E4, E3 |
| E2 | 6 | 3.258333 | CA, E3 |
| E4 | 6 | 5.461667 | CA, CB, E1 |
| E3 | 6 | 7.24 | CA, CB, E1, E2 |

Notes: This report provides multiple comparison tests for all possible contrasts among the means. These contrasts may involve more groups than just each pair, so the method is much stricter than need be. The Tukey-Kramer method provides more accurate results when only pairwise comparisons are needed.

IS n = 30 BIG ENOUGH?

A while back there was a discussion on the rule of thumb that a sample of size n = 30 is sufficient in practice for assuming normality of the sample mean. Does anyone know the source of this "rule of thumb"?

Professor Paul Velleman, Cornell University, responded:

I recently responded to a high school student's question on the same subject. Here is what I wrote.

Bill Willis asks: "The text book we are using says use the z distribution whenever the sample size exceeds 30 and the distribution is normal. Let sample s approximate population sigma. How come the t table in the AP booklet has degrees of freedom entries way past 30? What am I missing?"

Bill, you have caught statistics teachers in one of our more egregious lies. Here is the truth.

Background (which you probably know): The t-distributions are a family of distributions, one for each value of the degrees of freedom parameter. t with infinite df is Normal, so the Normal distribution is a member of the family. Moreover, as df grows, the difference between t and Normal becomes very slight. If we know the true sigma, then we should use Normal in testing and estimation. Since we almost never know sigma, but rather estimate it from the data with s, we should instead use t, provided we have a random sample and are willing to assume underlying Normality of the data.

The rest of the story: In the olden days (years and years ago, before there were computers on every desktop and graphing calculators in every student's hands) we had to make tables of t and z values because there was no other practical way to get these values. This forced us to make several compromises.

First, we had to work with a standard normal distribution (gotta make tables for only one of the Normal distributions, so let's make it have mean zero and sd 1).

Second, we had to settle on a few special significance levels for t-tests (10%, 5%, 1%) because we don't want to have an entire book of t-tables (an entire table like the normal table for each df).

Third, we had to choose an arbitrary number of df to tabulate for the t distribution and declare that we should move to the Normal tables thereafter. The t-family goes on and on for any number of df and is only really Normal at infinite df, so really complete t-tables should be infinitely long. The obvious choice was to stop at the bottom of the first page. Most book pages have between 40 and 50 lines, so most t-tables (after allowing for headings and a final row labeled "infinite df") go up to 30 or 40 df.

All of these choices were arbitrary and had nothing whatever to do with statistics, only with practical compromises. To cover our tracks, statistics texts made a "rule" that after 30 or 40 df (depending on page size and type size) we should use the Normal distribution. This "rule" is not statistics; it is a rationalization of outdated practical compromises, but we all agreed to go along with the charade.

The modern truth is: Use t whenever you have estimated the standard deviation from the data and (a) you have a random sample or data from a properly randomized experiment, and (b) you are willing to assume that the data are Normally distributed.

Virginia Commonwealth University
STAT 541: APPLIED STATISTICS FOR ENGINEERS & SCIENTISTS
Instructor: Dr. James M. Davenport
Lecture 19

Today's Lecture

Information in today's lecture corresponds to the following sections in your textbook, in addition to my notes: 12.1 & 12.2.

- Introduction to Simple Linear Regression
- Estimation of the Parameters: Least Squares
- Importance of Scatter Plots and Residual Plots

Regression Analysis

Relationships among variables:

- to understand them (cause & effect)
- to explain
- and use them for prediction

Some such relationships are known exactly and are called DETERMINISTIC. But most relationships between variables are NOT known, at least in the exact sense.

The variables are subject to chance or random variation. The models that we develop to approximate or characterize their main features are STATISTICAL in nature.

Both variables may be subject to random variation, but for the purposes of our discussion we will assume that only one of the variables is random, namely the response variable, denoted $Y_i$.
Regression Analysis

The explanatory variable will be assumed to be fixed. The explanatory variables are controllable; even so, if the experiment is repeated at the same value of $x_i$, the response $Y_i$ will not produce the same value on every trial.

The process of developing these approximations is called Regression Analysis.

If we assume this relationship between the response variable and the explanatory variable is linear, we call it LINEAR REGRESSION ANALYSIS (as opposed to nonlinear regression analysis).

If there is only one explanatory variable, $X$, then we call it SIMPLE LINEAR REGRESSION.

In his Memories, Galton [1] describes a problem that puzzled him:

"How is it possible for a population to remain alike in its features, as a whole, during many successive generations, if the average produce of each couple resemble their parents? Their children are not alike, but vary: therefore some would be taller, some shorter than their average height; so among the issue of a gigantic couple there would be usually some children more gigantic still. Conversely as to very small couples. But from what I could thus far find, parents had issue less exceptional than themselves."

He says that "Reversion (regression) is the tendency of the ideal filial type to depart from the parental type, reverting (regressing) to what may roughly and perhaps fairly be described as the average ancestral type."

[1] Galton, Francis (1877), Memories of My Life, New York: E. P. Dutton.

Example

Galton, Francis (1877), "Typical Laws of Heredity," Proceedings of the Royal Institute, Vol. 8, 282-301.

Example: Diameters of the Seeds

| Parent | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|
| Daughter | 15.4 | 15.7 | 16.0 | 16.3 | 16.6 | 17.0 | 17.3 |

Example

Pearson, Karl and Lee, Alice (1903), "On the Laws of Natural Inheritance in Man," Biometrika, Vol. 2, 357-462.

[Figure: scatter plot of Son's Height versus Father's Height]

Simple Linear Regression Model

The model:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$

Assumptions

1. $x_i$ is the $i$th value of the explanatory variable, a nonrandom constant that corresponds to particular settings that are chosen by the investigator.
2. $Y_i$ is the variable response that corresponds to the setting $x_i$, for $i = 1, 2, \ldots, n$.
3. $\beta_0$ and $\beta_1$ are the coefficients (parameters) in the simple linear relationship: $\beta_0$ is the intercept and $\beta_1$ is the slope. A change of one unit in the explanatory variable will result in a change of $\beta_1$ units in the mean of $Y$.
4. The random variables $\varepsilon_1, \varepsilon_2, \varepsilon_3, \ldots, \varepsilon_n$ are the errors that create the scatter around the linear relationship $E[Y_i] = \beta_0 + \beta_1 x_i$, $i = 1, 2, 3, \ldots, n$, and the $\varepsilon_i$ are iid $N(0, \sigma^2)$.

Regression Analysis

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is a sum of two components, one random and the other nonrandom.

$$E[Y_i] = \beta_0 + \beta_1 x_i$$
$$Var[Y_i] = Var[\beta_0 + \beta_1 x_i + \varepsilon_i] = Var[\varepsilon_i] = \sigma^2$$

Thus the $Y_i$ are independently distributed as $N(\beta_0 + \beta_1 x_i, \sigma^2)$ for $i = 1, 2, 3, \ldots, n$. Note that they are not identically distributed: they have differing means.

Regression Function

$E[Y] = \beta_0 + \beta_1 x$ is called the REGRESSION FUNCTION, and it is this function that we must estimate to understand the relationship.

Example: Fuel Consumption

- $Y$ = gallons of fuel consumed per 100 miles driven
- $x$ = weight of the vehicle

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, 10$$

[Figure: scatter plot of Gallons per 100 Miles versus Weight of Vehicle in 1000 lbs]

Estimation of Parameters

There are really several approaches to estimating $\beta_0$, $\beta_1$ and $\sigma^2$. Two common methods:

- Least Squares
- Maximum Likelihood

In the linear regression setting with normally distributed errors, these two methods are equivalent.

Least Squares

Our objective is to find the Regression Function $E[Y] = \beta_0 + \beta_1 x$ that is close to the observation points $(x_i, y_i)$ for $i = 1, 2, 3, \ldots, n$.

[Figure: Fuel Consumption; Gallons per 100 Miles versus Weight of Vehicle in 1000 lbs]

[Figures: LEAST SQUARES; Dependent Variable versus Independent Variable, three panels]

We want to minimize the following:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

This is the sum of squared deviations of the observed values $y_i$ from the estimated regression function $\hat{\beta}_0 + \hat{\beta}_1 x$.

It is also called the Sum of Squares due to Error, or the Error Sum of Squares. We denote this by SSE. We want to minimize the value of SSE with respect to the choices of $\beta_0$ & $\beta_1$.

Partial Derivatives

$$\frac{\partial SSE}{\partial \beta_0} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1) = 0$$
$$\frac{\partial SSE}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0$$
Normal Equations

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

Least Squares

The solutions to this system of equations produce the following estimates, which are called the LEAST SQUARES ESTIMATES.

Least Squares Estimates

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Example

| $i$ | $y_i$ | $y_i^2$ | $x_i$ | $x_i^2$ | $x_i y_i$ |
|---|---|---|---|---|---|
| 1 | 5.5 | $5.5^2$ | 3.4 | $3.4^2$ | 18.7 |
| 2 | 5.9 | $5.9^2$ | 3.8 | $3.8^2$ | 22.42 |
| ... | | | | | |
| 10 | 4.9 | $4.9^2$ | 3.4 | $3.4^2$ | 16.66 |
| Sum | 43.9 | 207.3 | 29.0 | 89.28 | 135.8 |

Example: Fuel Consumption

$$\hat{\beta}_1 = \frac{135.80 - \frac{(29.0)(43.9)}{10}}{89.28 - \frac{(29.0)^2}{10}} = 1.639$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{43.9}{10} - 1.639\left(\frac{29.0}{10}\right) = -0.363$$

The fitted simple linear regression model is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = -0.363 + 1.639x$$

At $x = 3.4$, $\hat{y} = -0.363 + 1.639(3.4) = 5.21$. This is the predicted value of $y$ at $x = 3.4$.

Regression Analysis

The fitted regression model is mathematically defined for all real values of $x$, i.e. $-\infty < x < \infty$. However, for practical reasons, you cannot EXTRAPOLATE outside the range of the existing data without risk.

Fitted Regression Line

So realistically, the range of the explanatory variable $x$ will be something like $x_{min} \le x \le x_{max}$. Using the estimates of the parameters in the Regression Function leads to the ESTIMATED REGRESSION FUNCTION or the FITTED REGRESSION LINE.

The fitted regression line is given by

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Note that this equation is not tied to any one, or any particular set of, $x$ values.

Fitted Values

If we use the set of values for the explanatory variable that are in the original data used to estimate the parameters, namely $x_1, x_2, x_3, \ldots, x_n$, then we obtain a set of values called the FITTED VALUES:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

Residuals

We use the fitted values to compute the set of RESIDUALS:

$$e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

Residual Plots

We will use the residuals as diagnostic tools by forming various plots:

- $e_i$ vs $x_i$ & $e_i$ vs $\hat{y}_i$ (the fitted values)
- $e_i$ vs $y_i$ (the observed values)
- $e_i$ vs the time sequence
- $e_i$ vs $e_{i-k}$ (lagged residual plots)

Importance of Scatter Plots and Residual Plots

Anscombe, F. J. (1973), "Graphs in Statistical Analysis," The American Statistician, Vol. 27, 17-21.

Anscombe Regression Example

| x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|---|---|---|---|---|---|---|---|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |

Anscombe Data Sets

The following summary statistics are the same for all four data sets:

$$n = 11, \quad \sum_{i=1}^{11} x_i = 99, \quad \bar{x} = 9, \quad \sum_{i=1}^{11} y_i = 82.51, \quad \bar{y} = 7.500$$
$$\hat{\beta}_0 = 3.00, \quad \hat{\beta}_1 = 0.5, \quad \hat{\sigma}^2 = 1.53$$

The fitted equation for ALL four data sets is given by

$$\hat{y} = 3.00 + 0.50x$$

One might be led to believe that all four data sets are samples from the same population.

[Figures: Anscombe Data Sets for Regression; for each data set, a scatter plot of the dependent variable Y against the independent (explanatory) variable X, followed by a residual plot for that data set (Data Set Numbers 1, 2 and 3 survive in this copy)]
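The least squares formulas and the Anscombe comparison above can be checked with a short Python sketch. The function names are hypothetical, the fuel consumption sums are read from the example table with decimals restored ($n = 10$, $\sum x = 29.0$, $\sum y = 43.9$, $\sum x^2 = 89.28$, $\sum xy = 135.80$), and the four data sets are the standard published Anscombe (1973) values.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Solve the normal equations for (intercept, slope) from summary sums."""
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


def fit(xs, ys):
    """Least squares fit from raw data, via the same summary-sum formulas."""
    return fit_from_sums(
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def residuals(xs, ys, intercept, slope):
    """e_i = y_i - (intercept + slope * x_i)."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Fuel consumption example: only the column sums are needed.
fuel_intercept, fuel_slope = fit_from_sums(10, 29.0, 43.9, 89.28, 135.80)
print(f"fuel: y-hat = {fuel_intercept:.3f} + {fuel_slope:.3f} x")

# Anscombe's four data sets (x1 = x2 = x3).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for number, (xs, ys) in ANSCOMBE.items():
    b0, b1 = fit(xs, ys)
    fits[number] = (b0, b1)
    largest = max(abs(e) for e in residuals(xs, ys, b0, b1))
    print(f"set {number}: y-hat = {b0:.2f} + {b1:.2f} x, max |residual| = {largest:.2f}")
```

The four fitted lines agree to two decimals, so only the residuals (or the scatter plots) reveal how differently the four data sets behave, which is the point of the example.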


Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.