BUSINESS STATISTICS II MGMT 302
Virginia Commonwealth University
This 20 page Study Guide was uploaded by Timmy Eichmann IV on Wednesday October 28, 2015. The Study Guide belongs to MGMT 302 at Virginia Commonwealth University taught by Charles Correia in Fall. Since its upload, it has received 28 views. For similar materials see /class/230686/mgmt-302-virginia-commonwealth-university in Business, management at Virginia Commonwealth University.
GOODNESS OF FIT & CONTINGENCY TABLES

We want to reach conclusions about the proportion of items falling into k categories. Statistical inferences on categorical data are usually based on a sampling distribution called the CHI-SQUARE distribution. We use a method known as a GOODNESS-OF-FIT procedure to compare the proportions of items which fall into k categories; that is, it compares the distribution of observed outcomes of a random sample with the distribution one would expect to observe if the claim of the null hypothesis H0 were correct.

Note: Marketing Strategy Example on page 528.

           Manufacturing  Banking  Insurance  Government  Medical  Total
Observed        13           7         8          10         12      50
Expected        10          10        10          10         10      50

If all customers are evenly distributed, then one would expect the proportion for each of the k categories to be the same, that is 0.20. There are five categories, hence each expected count is 1/5 of the total.

H0: pi1 = pi2 = pi3 = pi4 = pi5 = 0.20
Ha: At least one pi_i differs from the rest.

The test statistic is Pearson's chi-square goodness-of-fit statistic:

chi2 = Sum (Oi - Ei)^2 / Ei

If each expected number of occurrences is at least 5, the sampling distribution of the test statistic is closely approximated by a chi-square distribution with k - 1 degrees of freedom. If an expected number is less than 5, do not use this approximation.

For our example the test statistic is

chi2 = (13-10)^2/10 + (7-10)^2/10 + (8-10)^2/10 + (10-10)^2/10 + (12-10)^2/10 = 2.6
p-value = P(chi2 with 4 df >= 2.6) = 0.6268

This test is used on a multinomial population. The multinomial distribution is an extension of the binomial distribution to the case of three or more categories of outcomes. On each trial of a multinomial experiment, one and only one of the outcomes occurs. Each trial of the experiment is assumed to be independent, and the probabilities of the outcomes remain the same for each trial.

Note: Example 11.1, page 533 in your text. A market-share study is done on three car manufacturers. The market shares have been 40% for A, 35% for B, and 25% for C.

H0: piA = 0.40, piB = 0.35, piC = 0.25
Ha: Not this arrangement.

A random sample of 200 recent sales of new cars observed the following breakdown:

      A    B    C   Total
     65   95   40    200

If the hypothesis is correct, the expected numbers should be 0.40(200) = 80 for A, 0.35(200) = 70 for B, and 0.25(200) = 50 for C.

chi2 = (65-80)^2/80 + (95-70)^2/70 + (40-50)^2/50 = 13.7411
p-value = P(chi2 with 2 df >= 13.7411) = 1 - 0.998962 ≈ 0.001

The p-value provides convincing evidence against H0; hence the market share has changed.

ON MINITAB:
Step 1: Put the observed values in C1.
Step 2: Put the expected values in C2.
Step 3: From the menu choose CALC. In the Store Result In Variable box enter C3; in the Expression box type (C1-C2)**2/C2.
Step 4: From the menu choose CALC > Column Statistics. In the dialogue box click Sum, then enter C3 as the input variable. Click OK and the chi-square value appears.
Step 5: To find the p-value, go to CALC > Probability Distributions > Chi-Square, click Cumulative probability, enter the degrees of freedom, and input the chi-square value as the constant.

ANALYSIS OF TWO-WAY CONTINGENCY TABLES: THE CHI-SQUARE PROCEDURE FOR INDEPENDENCE

Another important application of the chi-square distribution involves using sample data to test for the independence of two variables. Consider the example in the following table:

          School of Business   School of Arts & Science   Total
Male             130                    130                260
Female            70                    170                240
Total            200                    300                500

This table is known as a two-way contingency table. The table consists of the observed frequencies of all combinations of categories of the two variables. The analysis of the table addresses the question of whether the two variables are unrelated, and thus independent of each other; that is, whether type of school enrollment is unrelated to gender. The statistic developed measures how greatly the observed frequencies differ from the frequencies expected if the variables are independent. If these differences are sufficiently large, the null hypothesis is contradicted. Therefore the analysis of a contingency table is based on the goodness-of-fit concept.

SIMPLE LINEAR REGRESSION ANALYSIS (CH 9)

Yhat = b0 + b1X

Regression output: Enter the response variable into C1 and the predictor variable into C2, then choose STAT > REGRESSION > REGRESSION.
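Both chi-square procedures above can be sketched in Python with scipy, using the numbers from the two examples in the text (the market-share test and the school-by-gender table). This is a sketch for checking the hand calculations, not part of the original notes:

```python
from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: market-share example (observed sales vs. 40/35/25% shares)
observed = [65, 95, 40]
expected = [0.40 * 200, 0.35 * 200, 0.25 * 200]  # 80, 70, 50
stat, p = chisquare(observed, f_exp=expected)    # df = k - 1 = 2
print(round(stat, 4), round(p, 4))               # 13.7411 and a p-value near 0.001

# Independence: 2x2 school-by-gender contingency table from the text
table = [[130, 130],   # male:   Business, Arts & Science
         [70, 170]]    # female: Business, Arts & Science
stat2, p2, df, exp = chi2_contingency(table, correction=False)
print(df, round(stat2, 2))  # 1 df; a large chi-square, so independence is rejected
```

`chi2_contingency` also returns the expected-count table (`exp`), which is worth inspecting to confirm every expected count is at least 5 before trusting the approximation.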
In the dialogue box, specify which column holds the response and which holds the predictor.

Residual plots: In the STAT > REGRESSION > REGRESSION box, input the response and predictor data. In the dialogue box click GRAPHS, enter the predictor column in the "Residuals versus the variables" box, and click OK.

Estimation and prediction: Enter the response variable into C1 and the predictor into C2. STAT > REGRESSION > REGRESSION. Complete the dialogue box with C1 and C2 in the response and predictor spots, then click OPTIONS. Enter the value of X for which an estimate of the mean of Y and/or a prediction of Y is desired in the "Prediction intervals for new observations" box (only one value at a time). Click the buttons for FITS (estimated mean value and predicted value), SDs of FITS (standard error of the estimated mean value), CONFIDENCE LIMITS, and PREDICTION LIMITS. Click OK.

Correlation coefficient: Choose STAT > BASIC STATISTICS > CORRELATION. In the dialogue box enter C1 C2 and click OK.

SELLING PRICE PROBLEM

Predict the selling price of one house in a population of houses.
Y = selling price (response variable); X = area of the house in square feet (predictor variable).
If knowledge of X is useful in predicting Y, there is an association between X and Y; there is causation if a change in X results in a change in Y.

Step 1: Graph the points on a scatter plot. STAT > Regression > Regression; the response variable is Y and the predictor is X; under GRAPHS check Normal plot; click OK.

Step 2: Use the equations to find b1 and b0 by hand (this information is given in the Minitab output). If b1 = 0, there is no linear association between Y and X.

SP_XY = Sum (Xi - Xbar)(Yi - Ybar)
SS_X = Sum (Xi - Xbar)^2
b1 = SP_XY / SS_X,   b0 = Ybar - b1 Xbar

Step 3: Estimate the error variance sigma^2. Use the estimator Se^2:

Se^2 = SSE / (n - 2), where SSE = Sum (Yi - Yhat_i)^2

In the Minitab output the residual variance is under the MS column in the Error row. Residual variance = MSE = mean square for error. Se^2 measures how well the regression line fits the sample Y-values. If Se^2 is close to 0, the fit is close to perfect; the larger the value of Se^2, the worse the fit of the line. In the Minitab output, s (e.g. s = 25.03, whatever the value is) is the residual standard deviation. Residual standard deviation = RMSE = root mean square error.

Step 4: Coefficient of determination; partitioning the total variation. The objective of regression is to find the best fit, making the value of Se^2 as small as possible. To judge that fit we use the coefficient of determination:

r^2 = SSR/SST = (SST - SSE)/SST = 1 - SSE/SST

In the Minitab output it is listed next to s as R-Sq = 82.72% (i.e. 0.8272). The greater the variation among sample Y-values, the more difficult they are to predict. The coefficient of determination always lies between 0 and 1: a value close to 0 indicates the estimated regression equation explains little of the variation in the sample, while a value close to 1 indicates it explains most of the variation in the sample Y-values. The greater the coefficient of determination, the better the fit and the more effective the estimated regression equation is at predicting Y.

ASSUMPTIONS OF THE MODEL
1. The simple linear model correctly depicts the association between the response and predictor variables.
2. The error variance is constant.
3. The random errors are independent.
4. The random errors are normally distributed.

MEAN AND STANDARD ERROR OF b1 AND b0
If b1 > 0 or b1 < 0, a linear relationship is present; if b1 = 0, no linear relationship is present.

E(b1) = beta1,   SE(b1) = sqrt(Se^2 / SS_X)

Sampling distribution of b1: T = (b1 - beta1) / SE(b1) has a T distribution with n - 2 degrees of freedom.

Hypothesis testing for beta1: H0: beta1 = 0, Ha: beta1 ≠ 0. STAT > BASIC STATISTICS > 1-Sample t, etc.

ANOVA approach to beta1: H0: beta1 = 0, Ha: beta1 ≠ 0. STAT > ANOVA > One-way (Unstacked); enter the response and predictor; click OK.

MULTIPLE LINEAR REGRESSIONS (CH 10)

Fitting regression models: Enter the data: sales into C1, price into C2, temperature into C3, type-of-day data into C4 (0 for weekdays and 1 for weekends), and humidity into C5. Choose STAT > REGRESSION > REGRESSION. In the dialogue box specify C1 as the response variable and C2 and C3 as predictor variables, then click the RESULTS button. In the Results dialogue box click the button for "In addition, the full table of fits and residuals"; click OK twice.
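The hand formulas in Steps 2 through 4 of the Selling Price problem above can be sketched with numpy. The five houses here are hypothetical numbers invented for illustration, not data from the text:

```python
import numpy as np

# Hypothetical house data: area (sq ft) and selling price ($000)
x = np.array([1500.0, 1800.0, 2100.0, 2400.0, 3000.0])
y = np.array([120.0, 150.0, 170.0, 200.0, 240.0])

sp_xy = np.sum((x - x.mean()) * (y - y.mean()))  # SP_XY
ss_x = np.sum((x - x.mean()) ** 2)               # SS_X
b1 = sp_xy / ss_x                                # slope
b0 = y.mean() - b1 * x.mean()                    # intercept

y_hat = b0 + b1 * x                              # fitted values
sse = np.sum((y - y_hat) ** 2)                   # error sum of squares
se2 = sse / (len(x) - 2)                         # residual variance Se^2 (MSE)
sst = np.sum((y - y.mean()) ** 2)                # total sum of squares
r2 = 1 - sse / sst                               # coefficient of determination

print(round(b1, 4), round(b0, 2), round(r2, 4))  # slope, intercept, r-squared
```

With these made-up values the fit is nearly perfect (r^2 close to 1), which mirrors the note above: the closer Se^2 is to 0 and r^2 is to 1, the better the line fits the sample.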
To compute the predicted values and the residuals, in the REGRESSION dialogue box click the STORAGE button; in the Storage dialogue box click the boxes for RESIDUALS and FITS, then click OK. This will place the predicted values and residuals in C6 (FITS1) and C7 (RESI1), respectively.

Estimation and Prediction

METHODS OF STATISTICAL INFERENCE: INDEPENDENT SAMPLES DESIGN

Sample 1 (Property Appraiser 1): properties 1-5, observations X11 ... X15, mean X1bar
Sample 2 (Property Appraiser 2): properties 6-10, observations X21 ... X25, mean X2bar
Sample 3 (Property Appraiser 3): properties 11-15, observations X31 ... X35, mean X3bar

H0: mu1 = mu2 = mu3
Ha: At least one mean is different.

Assumptions:
1. Independent samples.
2. Normal distributions.
3. Equal variances, i.e. sigma1^2 = sigma2^2 = sigma3^2.
4. Stable process: processes are free from assignable causes of variation.

Total Variation = Variation Due to Differences among Treatment Means + Variation Due to Random Error

SST = SSTR + SSE
SST = SumSum (Xij - Xbar)^2,  SSTR = Sum ni (Xibar - Xbar)^2,  SSE = SumSum (Xij - Xibar)^2

Total variation within a data set is measured by the Total Sum of Squares. The Treatment Sum of Squares measures the variation among the sample means caused by treatment differences and random error. Variation due to random error is determined by computing the squared deviation of individual observations in a given sample from the mean of that sample and summing over all observations in the entire data set; this is called the Error Sum of Squares.

Mean Square for Treatments (average variation among samples): MSTR = SSTR / (k - 1)
Mean Square Error (average variation within samples): MSE = SSE / (n - k)

The best statistic for the ANOVA procedure is the ratio MSTR/MSE. The larger this number, the greater the variation among the samples in comparison to the variation within the samples, and therefore the stronger the evidence against the null hypothesis of no differences among the population means. The sampling distribution of the ratio MSTR/MSE is the F distribution with k - 1 and n - k degrees of freedom.

F = [SSTR/(k - 1)] / [SSE/(n - k)] = MSTR/MSE
SST = SSTR + SSE, with degrees of freedom n - 1 = (k - 1) + (n - k)

GENERAL ANOVA TABLE

Source of variation    df     SS     MS     F-value     p-value
Treatments            k-1    SSTR   MSTR   MSTR/MSE    P(F > MSTR/MSE)
Error                 n-k    SSE    MSE
Total variation       n-1    SST

ANOVA WITH BLOCKED DATA

Total Variation = Variation Due to Differences among Blocks + Variation Due to Differences among Treatment Means + Variation Due to Random Error

SST = SSBL + SSTR + SSE

Property   Appraiser 1   Appraiser 2   Appraiser 3   Average for block
   1           X11           X21           X31            B1bar
   2           X12           X22           X32            B2bar
   3           X13           X23           X33            B3bar
   4           X14           X24           X34            B4bar
   5           X15           X25           X35            B5bar
Means         X1bar         X2bar         X3bar

MULTIPLE LINEAR REGRESSION MODEL

Y = beta0 + beta1 X1 + beta2 X2 + ... + betak Xk + epsilon
(deterministic component plus random component)

Note: the model is linear with respect to the parameters beta0, beta1, ..., betak, and not the predictor variables.

Assumptions:
1. The specified regression model has the correct form. Therefore, for given values of X1, X2, ..., Xk, E(Y) = beta0 + beta1 X1 + ... + betak Xk and E(epsilon) = 0. This implies that when the least squares estimates are determined, the least squares equation Yhat = b0 + b1 X1 + ... + bk Xk estimates the average value of Y given a set of values for the predictor variables, and the model correctly represents the form of the association between the response variable and the predictor variables.
2. The error variance is constant: sigma_epsilon^2 is constant over all values of the predictor variables. Thus the range of deviations of the Y-values from the regression model is the same regardless of the values of X1, X2, ..., Xk.
3. The random errors epsilon are independent and normally distributed. Hence the random errors associated with the Y-values are statistically independent of one another and normally distributed.

The residual variance s_epsilon^2 is defined by s_epsilon^2 = SSE / (n - k - 1), where SSE = Sum (Yi - Yhat_i)^2. It remains an absolute measure of how well the least squares equation fits the sample Y-values: if the fit were perfect, all residuals would equal 0 and s_epsilon^2 = 0.

PARTITIONING THE TOTAL VARIATION

SST = SSR + SSE, where SST = Sum (Yi - Ybar)^2, SSR = Sum (Yhat_i - Ybar)^2, SSE = Sum (Yi - Yhat_i)^2

The coefficient of determination has the same interpretation as in simple linear regression, that is, the fraction of the total variation in the sample Y-values that has been explained by the predictor variables in the least squares equation:

R^2 = SSR/SST = (SST - SSE)/SST = 1 - SSE/SST
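Returning to the one-way ANOVA partition above (SST = SSTR + SSE, F = MSTR/MSE), here is a minimal numeric sketch with three hypothetical appraiser samples; scipy's f_oneway should reproduce the hand-computed F ratio:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical appraisals (in $000) by three appraisers of independent properties
samples = [np.array([78.0, 82.0, 80.0, 79.0, 81.0]),
           np.array([85.0, 88.0, 86.0, 87.0, 84.0]),
           np.array([80.0, 83.0, 81.0, 82.0, 84.0])]

all_obs = np.concatenate(samples)
n, k = all_obs.size, len(samples)
grand = all_obs.mean()

sstr = sum(s.size * (s.mean() - grand) ** 2 for s in samples)  # treatment SS
sse = sum(((s - s.mean()) ** 2).sum() for s in samples)        # error SS
sst = ((all_obs - grand) ** 2).sum()                           # total SS

f_manual = (sstr / (k - 1)) / (sse / (n - k))  # MSTR / MSE
f_scipy, p = f_oneway(*samples)

print(round(sst, 4) == round(sstr + sse, 4))   # True: the partition holds
print(round(f_manual, 4), round(f_scipy, 4))   # identical F ratios
```

With these numbers the between-sample spread dominates the within-sample spread, so F is large and the p-value is small, contradicting H0: mu1 = mu2 = mu3.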
Note: R^2 does not decrease when a new predictor variable is added to the regression model, even if the new term adds no useful information for predicting Y. If R^2 is used to determine whether a new term should be added to the model, then the question is not whether R^2 increases when the term is added, but by how much R^2 increases.

An alternative relative measure of goodness of fit that takes into consideration the number of terms in the model is called the adjusted coefficient of determination, defined by

R^2_a = 1 - [(n - 1)/(n - p)] (SSE/SST)

where p denotes the number of beta parameters in the model, including the intercept. R^2_a is always less than R^2, and it is possible for R^2_a to decrease by an appreciable amount when an irrelevant predictor variable is added to a regression model. Therefore R^2_a is preferred to R^2 as a descriptive statistic for comparing competing regression models.

ANOVA APPROACH TO REGRESSION

The test of hypothesis for the multiple regression model is

H0: beta1 = beta2 = ... = betak = 0
Ha: At least one of the parameters beta1, beta2, ..., betak is not 0.

The best statistic for the analysis-of-variance procedure is the ratio of the mean squares for regression and error, MSR/MSE, where MSR = SSR/k and MSE = SSE/(n - k - 1). The F statistic F = MSR/MSE is used. The larger the F-value, the smaller the p-value and the stronger the evidence against the null hypothesis; we then conclude that an association exists between the response variable and at least one of the predictor variables.

EVALUATING THE CONTRIBUTION OF AN INDIVIDUAL PREDICTOR VARIABLE: THE T STATISTIC

When the F-value is large enough to reject H0, at least one of the beta_i's is different from zero, so the next step is to determine which one. We test

H0: beta_i = 0
Ha: beta_i ≠ 0

with T = (b_i - 0) / SE(b_i). A small p-value (<= 0.05) contradicts H0 and suggests that Xi provides a discernible contribution to explaining the variation in the sample Y-values.

The confidence interval for beta_i is b_i ± t(1 - alpha/2, n - k - 1) SE(b_i); note that the degrees of freedom for the t are dictated by the residual variance.

SUMMARY OF REGRESSION ANALYSIS
1. Analysis of the overall model.
2. Analysis of marginal contributions of individual predictors.
3. Analysis of precision of estimated regression coefficients.
4. Analysis of the coefficient of determination.

INCORPORATING QUALITATIVE VARIABLES IN MULTIPLE LINEAR REGRESSION

We include qualitative predictor variables in a regression equation by introducing artificial variables called dummy variables. These variables are always assigned a value of 0 or 1; the value 1 indicates the presence of a characteristic (weekend day) and 0 indicates the absence of that characteristic (weekday):

X3 = 1 for a weekend day, 0 for a weekday

Suppose we introduce the qualitative variable sky conditions: sunny, overcast, and rainy. The three possible conditions are defined by two dummy variables as follows:

X4 = 1 if sunny, 0 otherwise
X5 = 1 if overcast, 0 otherwise

Conditions         X4   X5
a sunny day         1    0
an overcast day     0    1
a rainy day         0    0

In general, the number of dummy variables needed is one less than the number of possible conditions of a qualitative variable, where one of the conditions is a default condition.

Yhat = b0 + b1 X1 + b2 X2 + b3 X3 + b4   (sunny)
Yhat = b0 + b1 X1 + b2 X2 + b3 X3 + b5   (overcast)
Yhat = b0 + b1 X1 + b2 X2 + b3 X3        (rainy)

CURVILINEAR REGRESSION MODELS

Y = beta0 + beta1 X + beta2 X^2 + epsilon

Note: this relationship is still a multiple linear regression model, since it is linear with regard to the parameters beta0, beta1, beta2. We treat X and X^2 as two predictor variables.

ANALYSIS OF RESIDUALS

We use residual analysis to check for violations of the assumptions:
1. The relationship between the response and one or more of the predictor variables may not be linear. Graph the residuals against the corresponding values of each predictor variable in the least squares equation to detect curvature.
2. The error variance may not be constant. Graph the residuals against the fitted values Yhat to determine whether the error variance is constant.
3. An important predictor variable may have been omitted. Graph the residuals against the corresponding values of that variable.

Note the residual plots in Figure 10.3, page 481 of your text.
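The dummy-variable coding and the adjusted R^2 formula above can be sketched directly. The encoder mirrors the sky-conditions example (rainy as the default condition); the R^2 values passed to adjusted_r2 are made up for illustration:

```python
def sky_dummies(condition):
    """Two dummy variables encode three sky conditions; rainy is the default."""
    x4 = 1 if condition == "sunny" else 0     # X4: sunny indicator
    x5 = 1 if condition == "overcast" else 0  # X5: overcast indicator
    return x4, x5

def adjusted_r2(r2, n, p):
    """R2_a = 1 - ((n-1)/(n-p)) * (SSE/SST), using SSE/SST = 1 - R2;
    p counts the beta parameters including the intercept."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

print(sky_dummies("sunny"), sky_dummies("overcast"), sky_dummies("rainy"))
# (1, 0) (0, 1) (0, 0)

# Adding an irrelevant term (p goes 3 -> 4) with almost no R2 gain lowers R2_a:
print(round(adjusted_r2(0.85, 20, 3), 4), round(adjusted_r2(0.851, 20, 4), 4))
```

This illustrates the note above: R^2 can only go up when a term is added, but R^2_a penalizes the extra parameter and can go down, which is why R^2_a is preferred for comparing competing models.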
STATISTICAL INFERENCE FOR TWO POPULATIONS (CH 7)

Chapter 7 deals with the methods of statistical inference for comparing parameters of two populations or processes with respect to their means or their proportions. We must determine the best statistic for the desired comparison and the sampling distribution of that statistic. Two basic plans available for this purpose are Independent Samples and Paired Samples.

INDEPENDENT SAMPLES DESIGN

The properties are randomly divided between the two appraisers; for example, properties 4, 7, 10, 2, and 1 go to Property Appraiser 1 and properties 9, 8, 6, 5, and 3 to Property Appraiser 2, yielding sample means X1bar and X2bar.

We are interested in X1bar - X2bar, whose sampling distribution has

E(X1bar - X2bar) = E(X1bar) - E(X2bar) = mu1 - mu2
Var(X1bar - X2bar) = Var(X1bar) + Var(X2bar) = sigma1^2/n1 + sigma2^2/n2
SE(X1bar - X2bar) = sqrt(sigma1^2/n1 + sigma2^2/n2)

PAIRED SAMPLES DESIGN

Property   Appraiser 1   Appraiser 2   Difference
   1           X11           X21           D1
   2           X12           X22           D2
   3           X13           X23           D3
   4           X14           X24           D4
   5           X15           X25           D5

The paired-comparisons analysis reduces to a one-sample analysis of the mean of the differences between appraisals. So we are interested in Dbar = Sum Di / n, where E(Dbar) = mu_d and Var(Dbar) = sigma_d^2 / n. Since we would not know the value of sigma_d^2, we estimate it with Sd^2, and hence the standard error of Dbar is estimated by SE(Dbar) ≈ Sd / sqrt(n).

COMPARISON OF INDEPENDENT VS PAIRED SAMPLES

Independent:
1. Appraiser differences
2. Random sampling variation:
   - Property differences
   - Appraiser inconsistency

Paired:
1. Appraiser differences
2. Random sampling variation:
   - No property effect
   - Appraiser inconsistency

In the paired samples design the differences between the appraisals are recorded property by property. This eliminates any blurring of the analysis due to distinct differences among the properties. When the same property is appraised to minimize property differences, we call the property a blocking variable.

Examples when paired samples are appropriate:
1. A manufacturing company has two methods by which employees can perform a production task. To maximize output, the company wants to identify the method with the shortest mean completion time.
2. Independent samples: a random sample of workers is selected and each worker in the sample uses method A; a second random sample of workers is selected and each worker in the sample uses method B. A source of sampling error is the variation between workers. If the two production methods are instead tested under similar conditions with the same workers using both methods, this variation between workers is eliminated. Hence the paired-t approach is better.
3. Testing new tires: if two independent samples are used, possible sources of variation are the driving habits of each driver and the different types of autos used in the test. To minimize these sources, the same driver and auto should be used as blocks. Hence a paired-t approach is better.
4. Weight-loss program: weigh each participant before the program is instituted, then after, to see whether the program has worked. The participants are the blocks. The paired-t approach is better in all before-and-after cases.

CONFIDENCE INTERVALS

A confidence interval for mu1 - mu2 when sigma1 and sigma2 are known and the samples are collected independently:

(X1bar - X2bar) ± z(1 - alpha/2) sqrt(sigma1^2/n1 + sigma2^2/n2)

If we assume sigma1 = sigma2 = sigma, then:

(X1bar - X2bar) ± z(1 - alpha/2) sqrt(sigma^2 (1/n1 + 1/n2))

When sigma1 and sigma2 are unknown but equal:

(X1bar - X2bar) ± t(1 - alpha/2, n1 + n2 - 2) sqrt(Sp^2 (1/n1 + 1/n2))

where Sp^2 = [(n1 - 1)S1^2 + (n2 - 1)S2^2] / (n1 + n2 - 2) is called the pooled sample variance.

Assumptions: (1) the populations are approximately normal; (2) the variances are equal; (3) the samples are independent and large (n >= 30).

When sigma1 and sigma2 are unknown and not equal:

(X1bar - X2bar) ± t(1 - alpha/2, nu) sqrt(S1^2/n1 + S2^2/n2)

where the degrees of freedom nu are calculated by a special formula.

A confidence interval for mu_d = mu1 - mu2 when the samples are paired and the population of differences is normally distributed:

Dbar ± t(1 - alpha/2, n - 1) Sd / sqrt(n)

TESTS OF HYPOTHESES

H0: mu1 - mu2 = d0
Ha: mu1 - mu2 ≠ d0

If sigma1 and sigma2 are equal but unknown, we use the test statistic for the pooled t-test

T = (X1bar - X2bar - d0) / sqrt(Sp^2 (1/n1 + 1/n2))

which follows a T distribution with n1 + n2 - 2 degrees of freedom.

If sigma1 and sigma2 are unknown and not equal, and the sample sizes are large enough (n1 >= 30 and n2 >= 30), we use

T = (X1bar - X2bar - d0) / sqrt(S1^2/n1 + S2^2/n2)

which is approximately a T distribution with degrees of freedom

nu = (S1^2/n1 + S2^2/n2)^2 / [ (S1^2/n1)^2/(n1 - 1) + (S2^2/n2)^2/(n2 - 1) ]

When the samples are paired: H0: mu_d = 0; Ha: mu_d < 0 or mu_d > 0 (or mu_d ≠ 0).
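The independent-versus-paired contrast above can be sketched with scipy's two t-tests. The appraisal numbers are hypothetical, chosen so the property-to-property spread is large relative to the appraiser difference:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical appraisals ($000) of the same five properties by two appraisers
a1 = np.array([210.0, 185.0, 250.0, 195.0, 230.0])
a2 = np.array([215.0, 190.0, 252.0, 201.0, 233.0])

# Treating the samples as independent: property differences inflate Sp^2,
# so the pooled t-test sees no discernible appraiser difference
t_ind, p_ind = ttest_ind(a1, a2, equal_var=True)

# Pairing by property: only the within-property differences D_i remain
t_rel, p_rel = ttest_rel(a1, a2)

print(round(p_ind, 3), round(p_rel, 3))  # large p vs. small p
```

The paired test detects the systematic appraiser difference that the pooled test misses, which is exactly the blocking argument made above: recording differences property by property removes the property effect from the random sampling variation.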