EXPERIMENTAL CHEMISTRY II CH 464
MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL

by Michael L. Orlov
Chemistry Department, Oregon State University (1996)

INTRODUCTION

In modern science, regression analysis is a necessary part of virtually any data reduction process. Popular spreadsheet programs, such as Quattro Pro, Microsoft Excel, and Lotus 1-2-3, provide comprehensive statistical packages, which include a regression tool among many others. Usually the regression module is explained clearly enough in on-line help and in the spreadsheet documentation (i.e., the items in the regression input dialog box). However, the description of the output is minimal and is often a mystery for a user who is unfamiliar with certain statistical concepts.

The objective of this short handout is to give a more detailed description of the regression tool and to touch upon related statistical topics in a hopefully readable manner. It is designed for science undergraduate and graduate students inexperienced in statistical matters. The regression output in Microsoft Excel is fairly standard and is chosen as a basis for illustrations and examples (Quattro Pro and Lotus 1-2-3 use an almost identical format).

CLASSIFICATION OF REGRESSION MODELS

In a regression analysis we study the relationship, called the regression function, between one variable y, called the dependent variable, and several others xi, called the independent variables. The regression function also involves a set of unknown parameters bi. If a regression function is linear in the parameters (but not necessarily in the independent variables!), we term it a linear regression model; otherwise the model is called non-linear. Linear regression models with more than one independent variable are referred to as multiple linear models, as opposed to simple linear models with one independent variable.

The following notation is used in this work:

  y      - dependent variable predicted by the regression model
  y*     - dependent variable (experimental value)
  p      - number of independent variables (number of coefficients)
  xi (i = 1, 2, ..., p) -
the ith independent variable from the total set of p variables
  bi (i = 1, 2, ..., p) - the ith coefficient, corresponding to xi
  b0     - intercept (or constant)
  k = p + 1 - total number of parameters, including the intercept (constant)
  n      - number of observations (experimental data points)
  i = 1, 2, ..., p - independent variables index
  j = 1, 2, ..., n - data points index

Now let us illustrate the classification of regression models with mathematical expressions.

Multiple linear model, general formula:

  y = b0 + b1x1 + b2x2 + ... + bpxp                              (1)

or

  y = b0 + SUM(i) bi*xi,  i = 1, 2, ..., p                       (1a)

Polynomial model (linear in the parameters, but not in the independent variables):

  y = b0 + b1x + b2x^2 + b3x^3 + ... + bpx^p

which is just a specific case of (1), with x1 = x, x2 = x^2, x3 = x^3, ..., xp = x^p.

Simple linear model:

  y = b0 + b1x1

It is obvious that the simple linear model is just a specific case of the multiple one, with k = 2 (p = 1).

Non-linear model (for example):

  y = A(1 - e^(-Bx)),

where A and B are parameters. In the further discussion we restrict ourselves to multiple linear regression analysis.

MAIN OBJECTIVES OF MULTIPLE LINEAR REGRESSION ANALYSIS

Our primary goal is to determine the best set of parameters bi, such that the model predicts the experimental values of the dependent variable as accurately as possible (i.e., the calculated values yj should be close to the experimental values yj*). We also wish to judge whether our model itself is adequate to fit the observed experimental data (i.e., whether we chose the correct mathematical form of it). Finally, we need to check whether all terms in our model are significant (i.e., whether the improvement in the goodness of fit due to the addition of a certain term to the model is bigger than the noise in the experimental data).

DESCRIPTION OF REGRESSION INPUT AND OUTPUT

The standard regression output of spreadsheet programs provides the information needed to reach the objectives raised in the previous section. Now we explain how to do that, and touch upon the related statistical terms and definitions. The following numerical example will be used throughout the handout to illustrate the discussion.

Table 1. Original experimental data

  Data point j       y*          z
  1                  20.6947      2.5
  2                  28.5623      3.1
  3                 157.0020      8.1
  4                 334.6340     12.2
  5                 406.5697     13.5
  6                 696.0331     17.9
  7                 945.1385     21.0
We choose y to be the dependent (experimental) observable and z to be the independent one. Suppose we have (say, theoretical) reasons to believe that the relationship between the two is

  y = b0 + b1z + b2z^2 + b3z^3

We can rewrite this expression in form (1):

  y = b0 + b1x1 + b2x2 + b3x3,                                   (1b)

where x1 = z, x2 = z^2, and x3 = z^3. In the next step we prepare the spreadsheet input table for the regression analysis.

Table 2. Regression input

  Data point j   Dependent var.   Independent variables
                 y*               x1 = z    x2 = z^2   x3 = z^3
  1               20.6947          2.5        6.25       15.63
  2               28.5623          3.1        9.61       29.79
  3              157.0020          8.1       65.61      531.44
  4              334.6340         12.2      148.84     1815.85
  5              406.5697         13.5      182.25     2460.38
  6              696.0331         17.9      320.41     5735.34
  7              945.1385         21.0      441.00     9261.00

In order to perform a regression analysis we choose from the Microsoft Excel menu: Tools -> Data Analysis -> Regression. (Note that the data analysis tool should have been previously added to Microsoft Excel during the program setup: Tools -> Add-Ins -> Analysis ToolPak.) The pop-up input dialog box is shown in Fig. 1. The elements of this box are described in on-line help; most of them become clear in the course of our discussion as well. The Input Y Range refers to the spreadsheet cells containing the dependent variable y, and the Input X Range to those containing the independent variables x (in our example x = x1, x2, x3; see Table 2). If we do not want to force our model through the origin, we leave the Constant is Zero box unchecked. The meaning of the Confidence Level entry will become clear later. The block Output options allows one to choose the content and locations of the regression output. The minimal output has two parts: Regression Statistics and ANOVA (ANalysis Of VAriance). Checking the appropriate boxes in the sub-blocks Residuals and Normal Probability will expand the default output information. (We omit from our discussion the description of the Normal Probability output.)

In Quattro Pro the menu sequence is: Tools -> Numeric Tools -> Analysis Tools -> Advanced Regression. In the last step, instead of Advanced Regression, one can choose Regression from the menu; in this case a simplified regression output is obtained.
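The same fit can also be reproduced outside of a spreadsheet. The sketch below is not part of the original handout (which works entirely in Excel); it assumes Python with NumPy is available, builds the design matrix of Table 2, and fits the cubic model by least squares:

```python
import numpy as np

# Experimental data from Table 1
z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])

# Regression input (Table 2): a column of ones for the intercept b0,
# then x1 = z, x2 = z^2, x3 = z^3
X = np.column_stack([np.ones_like(z), z, z**2, z**3])

# Least-squares fit: minimizes the error sum of squares SSE
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approx [0.5229, 2.9144, 2.0238, -0.0010]
```

The four numbers returned are the coefficients b0, b1, b2, b3 that the spreadsheet regression tool reports in its ANOVA output (Table 4a below).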
Fig. 1. Regression input dialog box. (The original figure, showing the Input Y Range and Input X Range fields, the Labels, Constant is Zero, and Confidence Level options, and the Output options block with the Residuals and Normal Probability check boxes, is not reproduced here.)

Residual output

In Microsoft Excel the residual output has the following format:

Table 3. Residual output

  Observation j   Predicted Y (yj)   Residuals (rj)
  1                20.4424             0.2523
  2                28.9772            -0.4149
  3               156.3982             0.6038
  4               335.5517            -0.9178
  5               406.3355             0.2342
  6               695.6173             0.4159
  7               945.3121            -0.1736

The corresponding notation used in this handout is given in parentheses.

The error, or deviation, is the difference between the observed value yj* of the dependent variable for the jth experimental data point (x1j, x2j, ..., xpj) and the corresponding value yj given by the regression function yj = b0 + b1x1j + b2x2j + ... + bpxpj (yj = b0 + b1x1j + b2x2j + b3x3j in our example):

  rj = yj* - yj                                                  (2)

The parameters b = (b0, b1, b2, ..., bp) are part of the ANOVA output discussed later. If there is an obvious correlation between the residuals and the independent variable x (say, the residuals systematically increase with increasing x), it means that the chosen model is not adequate to fit the experiment; e.g., we may need to add an extra term x4 = z^4 to our model (1b). A plot of the residuals is very helpful in detecting such a correlation. This plot will be included in the regression output if the box Residual Plots was checked in the regression input dialog window (Fig. 1). (An example plot, "X Variable 1 Residual Plot", showing the residuals r against the variable x1, is not reproduced here.) However, the fact that the residuals look random and show no obvious correlation with the variable x does not by itself necessarily mean that the model is adequate; more tests are needed.

The standard (or standardized) residual is a residual scaled with respect to the standard error (deviation) sy in the dependent variable:

  rj(std) = rj / sy                                              (2a)

The quantity sy is part of the Regression Statistics output discussed later. Standardized residuals are used for some statistical tests, which are not usually needed for models in the physical sciences.
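Continuing the Python/NumPy sketch from above (again an assumption — the handout itself produces this table with Excel), the predicted values, residuals (2), and standardized residuals (2a) of Table 3 follow directly from the fitted coefficients:

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

y_pred = X @ b            # predicted values y_j ("Predicted Y" in Table 3)
r = y - y_pred            # residuals r_j = y_j* - y_j, eq. (2)

n, k = X.shape            # n = 7 observations, k = p + 1 = 4 parameters
s_y = np.sqrt(np.sum(r**2) / (n - k))   # standard error s_y, eq. (33) below
r_std = r / s_y           # standardized residuals, eq. (2a)
```

Note that with an intercept in the model the residuals sum to zero, a useful sanity check. (Excel's own "Standard Residuals" column may use a slightly different scaling than (2a).)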
ANOVA output

There are two tables in the ANOVA (Analysis Of Variance) output.

Example:

Table 4. ANOVA output, part I

               df        SS                MS                F                Significance F
  Regression   3 (dfR)   723630.06 (SSR)   241210.02 (MSR)   425507.02 (FR)   6.12E-09 (PR)
  Residual     3 (dfE)        1.70 (SSE)        0.57 (MSE)
  (error)
  Total        6 (dfT)   723631.76 (SST)   N/A (MST)

Table 4a. ANOVA output, part II

                     Coefficients   Standard Error   t Stat        P-value   Lower 95%     Upper 95%
                     (bi)           (se(bi))         (ti)          (Pi)      (bL,1-Pi)     (bU,1-Pi)
  Intercept (b0)      0.52292226     1.77984111       0.293802778   0.7881   -5.1413318     6.1871763
  X Variable 1 (x1)   2.91437225     0.73039587       3.990126957   0.0282    0.58992443    5.2388201
  X Variable 2 (x2)   2.02376459     0.07318737      27.65182747    0.0001    1.79084949    2.2566797
  X Variable 3 (x3)  -0.0009602      0.00206174      -0.46574477    0.6731   -0.0075216     0.0056011

The corresponding notation used in this handout is given in parentheses; N/A means "not available in the Microsoft Excel regression output".

Coefficients. The regression program determines the best set of parameters b = (b0, b1, b2, ..., bp) in the model yj = b0 + b1x1j + b2x2j + ... + bpxpj by minimizing the error sum of squares SSE (discussed later). The coefficients are listed in the second ANOVA table (Table 4a). These coefficients allow the program to calculate the predicted values of the dependent variable (y1, y2, ..., yn), which were used above in formula (2) and are part of the Residual output (Table 3).

Sum of squares. In general, the sum of squares of some arbitrary variable q is determined as

  SSq = SUM(j) (qj - qavg)^2,   j = 1, 2, ..., n                 (3)

where qj is the jth observation out of n total observations of the quantity q, and qavg is the average value of q over the n observations: qavg = SUM(j) qj / n.

In the ANOVA regression output one will find three types of sums of squares (see Table 4).

1. Total sum of squares, SST:

  SST = SUM(j) (yj* - yavg)^2,                                   (3a)

where yavg = SUM(j) yj* / n. It is obvious that SST is the sum of squared deviations of the experimental values of the dependent variable y* from its average value. SST can be interpreted as the sum of squared deviations of y* from the simplest possible model (y is constant and does not depend on any variable x):

  y = b0,  with b0 = yavg                                        (4)

SST has two contributors: the residual (error) sum of squares, SSE, and the regression sum of squares, SSR:

  SST = SSE + SSR                                                (5)

2. Residual (or error) sum of squares, SSE:

  SSE = SUM(j) (rj - ravg)^2                                     (6)

Since in
the underlying theory the expected value of the residuals, ravg, is assumed to be zero, expression (6) simplifies to

  SSE = SUM(j) rj^2                                              (6a)

The significance of this quantity is that the spreadsheet regression tool determines the best set of parameters b = (b0, b1, b2, ..., bp) for a given regression model by minimizing SSE. SSE can also be viewed as the "due to random scattering of y* about the predicted line" contributor to the total sum of squares SST. This is the reason for calling this quantity the "due to error" (residual) sum of squares.

3. Regression sum of squares, SSR:

  SSR = SUM(j) (yj - yavg)^2                                     (7)

SSR is the sum of squared deviations of the values of the dependent variable predicted by the regression model from its average experimental value, yavg. It accounts for the addition of the p variables x1, x2, ..., xp to the simplest possible model (4) (the variable y is just a constant and does not depend on the variables x), i.e., y = b0 vs. y = b0 + b1x1 + b2x2 + ... + bpxp. Since this is a transformation from the non-regression model (4) to the true regression model (1), SSR is called the "due to regression" sum of squares.

The definition of SSR in the form (7) is not always given in the literature; one can find different expressions in books on statistics [1, 2]:

  SSR = SUM(i) bi SUM(j) (xij - xavg,i) yj*,                     (7a)

where xavg,i = SUM(j) xij / n, or

  SSR = SUM(i) bi SUM(j) xij yj* - (SUM(j) yj*)^2 / n            (7b)

Relationships (7a) and (7b) give the same numerical result; however, it is difficult to see the physical meaning of SSR from them.

Mean square (variance) and degrees of freedom. The general expression for the mean square of an arbitrary quantity q is

  MSq = SSq / df                                                 (8)

where SSq is defined by (3) and df is the number of degrees of freedom associated with the quantity SSq. MS is also often referred to as the variance. The number of degrees of freedom can be viewed as the difference between the number of observations n and the number of constraints (fixed parameters) associated with the corresponding sum of squares SSq.

1. Total mean square, MST (total variance):

  MST = SST / (n - 1)                                            (9)

SST is associated with the model (4), which has only one constraint (the parameter b0); therefore the number of degrees of freedom in this case is

  dfT = n - 1                                                    (10)
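The three sums of squares and the partition (5) can be checked numerically. This continues the earlier Python/NumPy sketch (an assumption; the handout performs these steps in Excel):

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ b
n, k = X.shape

SST = np.sum((y - y.mean())**2)        # total sum of squares, eq. (3a)
SSE = np.sum((y - y_pred)**2)          # residual (error) sum of squares, eq. (6a)
SSR = np.sum((y_pred - y.mean())**2)   # regression sum of squares, eq. (7)

# Partition (5): SST = SSE + SSR (up to floating-point round-off)
assert abs(SST - (SSE + SSR)) < 1e-6 * SST

dfT, dfE, dfR = n - 1, n - k, k - 1                # eqs. (10), (12), (14)
MST, MSE, MSR = SST / dfT, SSE / dfE, SSR / dfR    # mean squares (9), (11), (13)
```

With the Table 1 data this reproduces the first ANOVA table: SST ≈ 723631.76, SSE ≈ 1.70, and MSE ≈ 0.57.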
2. Residual (error) mean square, MSE (error variance):

  MSE = SSE / (n - k)                                            (11)

SSE is associated with the random error around the regression model (1), which has k = p + 1 parameters (one per each of the p variables, plus the intercept). This means there are k constraints, and the number of degrees of freedom is

  dfE = n - k                                                    (12)

3. Regression mean square, MSR (regression variance):

  MSR = SSR / (k - 1)                                            (13)

The number of degrees of freedom in this case can be viewed as the difference between the total number of degrees of freedom, dfT (10), and the number of degrees of freedom for the residuals, dfE (12):

  dfR = dfT - dfE = (n - 1) - (n - k) = k - 1 = p                (14)

Tests of significance and F-numbers. The F-number is a quantity which can be used to test for the statistical difference between two variances. For example, if we have two random variables q and v, the corresponding F-number is

  Fqv = MSq / MSv                                                (15)

The variances MSq and MSv are defined by an expression of type (8). In order to tell whether the two variances are statistically different, we determine the corresponding probability P from the F-distribution function:

  P = P(Fqv, dfq, dfv)                                           (16)

The quantities dfq and dfv (the degrees of freedom of the numerator and the denominator, respectively) are parameters of this function. Tabulated numerical values of P for the F-distribution can be found in various texts on statistics, or simply determined in a spreadsheet directly by using the corresponding statistical function; in Microsoft Excel, FDIST(Fqv, dfq, dfv) returns the numerical value of P. An interested reader can find the analytical form of P(Fqv, dfq, dfv) in the literature (e.g., [1], p. 383).

The probability P given by (16) is the probability that the variances MSq and MSv are statistically indistinguishable. On the other hand, 1 - P is the probability that they are different; it is often called the confidence level. Conventionally, a reasonable confidence level is 0.95 or higher. If it turns out that 1 - P < 0.95, we say that MSq and MSv are statistically the same. If 1 - P >= 0.95, we say that, at least with 0.95 (95%) confidence, MSq and MSv are different. The
higher the confidence level, the more reliable our conclusion. The procedure just described is called the F-test.

There are several F-tests related to regression analysis. We will discuss the three most common ones; they deal with the significance of the parameters in the regression model. The first and the last of them are performed by the spreadsheet regression tool automatically, whereas the second one is not.

1. Significance test of all coefficients in the regression model. In this case we ask ourselves: "With what level of confidence can we state that AT LEAST ONE of the coefficients b1, b2, ..., bp in the regression model is significantly different from zero?"

The first step is to calculate the F-number for the whole regression (part of the regression output; see Table 4):

  FR = MSR / MSE                                                 (17)

The second step is to determine the numerical value of the corresponding probability PR (also part of the regression output; see Table 4):

  PR = FDIST(FR, dfR, dfE)                                       (18)

Taking into account expressions (12) and (14), we obtain

  PR = FDIST(FR, k - 1, n - k)                                   (18a)

Finally, we can determine the confidence level, 1 - PR. At this level of confidence the variance due to regression, MSR, is statistically different from the variance due to error, MSE. In its turn, this means that the addition of the p variables x1, x2, ..., xp to the simplest model (4) (the dependent variable y is just a constant) is a statistically significant improvement of the fit. Thus, at a confidence level not less than 1 - PR, we can say: "At least ONE of the coefficients in the model is significant." FR can also be used to compare two models describing the same experimental data: the higher FR, the more adequate the corresponding model.

Example: In our illustrative exercise we have PR = 6.12E-09 (Table 4); the corresponding level of confidence is 1 - PR = 0.9999... Therefore, with confidence close to 100%, we can say that at least one of the coefficients b1, b2, and b3 is significant for the model y = b0 + b1x1 + b2x2 + b3x3, where x1 = z, x2 = z^2, and x3 = z^3.

NOTE: From this test, however, we cannot be sure that ALL of the coefficients b1, b2, and b3 are non-zero. If 1 - PR is not big enough (usually less than 0.95), we conclude that ALL the coefficients in the regression model are zero; in other words, the hypothesis that the variable y is just a constant is better than the hypothesis that it is a function of the variables x (x1, x2, ..., xp).
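The overall F-test can be sketched numerically as well. The Python standard library has no FDIST equivalent, so the sketch below (Python/NumPy, both assumptions, and not part of the handout's Excel workflow) simply compares FR with the tabulated 95% critical value F(0.95; 3, 3) ≈ 9.28 instead of computing PR:

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ b
n, k = X.shape

SSE = np.sum((y - y_pred)**2)
SSR = np.sum((y_pred - y.mean())**2)
MSE = SSE / (n - k)       # error variance, eq. (11)
MSR = SSR / (k - 1)       # regression variance, eq. (13)

FR = MSR / MSE            # eq. (17); roughly 4e5 for this data set
F_crit = 9.28             # tabulated F(0.95; dfR = 3, dfE = 3)
print(FR > F_crit)        # True: at least one of b1, b2, b3 is significant
```

Since FR vastly exceeds the critical value, the 95% conclusion agrees with the near-100% confidence obtained from PR in the worked example.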
2. Significance test of a subset of coefficients in the regression model. Now we want to decide: "With what level of confidence can we be sure that at least ONE of the coefficients in a selected subset of all the coefficients is significant?" Let us test a subset consisting of the last m coefficients in a model with a total of p coefficients b1, b2, ..., bp. Here we need to consider two models:

  y = b0 + b1x1 + b2x2 + ... + bpxp                  (unrestricted)   (19)

and

  y = b'0 + b'1x1 + b'2x2 + ... + b'(p-m)x(p-m)      (restricted)     (20)

These models are called unrestricted (19) and restricted (20), respectively. We need to perform two separate least-squares regression analyses, one for each model. From the regression output (see Table 4) for each model we obtain the corresponding error sums of squares, SSE and SS'E, as well as the variance MSE for the unrestricted model. The next step is to calculate, by hand, the F-number for testing the subset of m variables (it is not part of the Microsoft Excel ANOVA for an obvious reason: you must decide how many variables to include in the subset):

  Fm = [(SS'E - SSE) / m] / MSE                                  (21)

Fm can be viewed as an indicator of whether the reduction in the error variance due to the addition of the subset of m variables to the restricted model (20), (SS'E - SSE)/m, is statistically significant with respect to the overall error variance, MSE, of the unrestricted model (19). This is equivalent to testing the hypothesis that at least one of the coefficients in the subset is not zero. In the final step we determine the probability Pm (also by hand):

  Pm = FDIST(Fm, m, n - k)                                       (22)

At the confidence level 1 - Pm, at least ONE of the coefficients in the subset of m is significant. If 1 - Pm is not big enough (less than 0.95), we state that ALL m coefficients in the subset are insignificant.

Example: The regression output for the unrestricted model y = b0 + b1x1 + b2x2 + b3x3, where x1 = z, x2 = z^2, and x3 = z^3, is presented in Table 4.
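The subset test just described can be sketched in code before walking through the spreadsheet version (Python/NumPy, with the tabulated critical value F(0.95; 2, 3) ≈ 9.55 assumed in place of FDIST; none of this is in the original handout):

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])

# Unrestricted model (19): y = b0 + b1*z + b2*z^2 + b3*z^3
Xu = np.column_stack([np.ones_like(z), z, z**2, z**3])
bu, _, _, _ = np.linalg.lstsq(Xu, y, rcond=None)
SSE_u = np.sum((y - Xu @ bu)**2)
MSE_u = SSE_u / (len(y) - Xu.shape[1])      # SSE / (n - k)

# Restricted model (20): drop the last m = 2 terms -> y = b'0 + b'1*z
Xr = np.column_stack([np.ones_like(z), z])
br, _, _, _ = np.linalg.lstsq(Xr, y, rcond=None)
SSE_r = np.sum((y - Xr @ br)**2)

m = 2
Fm = (SSE_r - SSE_u) / m / MSE_u            # eq. (21); roughly 3e4 here
print(Fm > 9.55)  # True -> at least one of b2, b3 is significant at 95%
```

The two error sums of squares computed here are the SSE and SS'E values used in the worked example that follows.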
Say we want to test whether the quadratic and the cubic terms are significant. In this case the restricted model is

  y = b'0 + b'1x1        (restricted model)                      (23)

where x1 = z. The subset of parameters consists of two parameters, so m = 2. By analogy with the input table for the unrestricted model (Table 2), we prepare one for the restricted model:

Table 5. Regression input for the restricted model

  Data point j   Dependent var. y*   Independent var. x1 = z
  1               20.6947             2.5
  2               28.5623             3.1
  3              157.0020             8.1
  4              334.6340            12.2
  5              406.5697            13.5
  6              696.0331            17.9
  7              945.1385            21.0

We perform an additional regression using this input table and, as part of the ANOVA, obtain:

Table 6. Regression ANOVA output for the restricted model

                    df    SS          MS          F         Significance F
  Regression        1     689216      689216      100.13    1.70E-04
  Residual (error)  5     34415.70    6883
  Total             6     723632

From Table 4 and Table 6 we have:

  SSE  = 1.70        (error sum of squares, unrestricted model)
  MSE  = 0.57        (error mean square, unrestricted model)
  dfE  = n - k = 3   (degrees of freedom, unrestricted model)
  SS'E = 34415.70    (error sum of squares, restricted model)

Now we are able to calculate Fm=2:

  Fm=2 = [(34415.70 - 1.70) / 2] / 0.57 = 30187.72

Using the Microsoft Excel function for the F-distribution, we determine the probability Pm=2:

  Pm=2 = FDIST(30187.72, 2, 3) = 3.50E-07

Finally, we calculate the level of confidence:

  1 - Pm=2 = 1 - 3.50E-07 = 0.99999...

The confidence level is high (more than 99.99%). We conclude that at least one of the parameters b2 or b3 in the subset is non-zero. However, we cannot be sure that both the quadratic and the cubic terms are significant.

3. Significance test of an individual coefficient in the regression model. Here the question to answer is: "With what confidence level can we state that the ith coefficient bi in the model is significant?" The corresponding F-number is

  Fi = bi^2 / se(bi)^2                                           (24)

where se(bi) is the standard error of the individual coefficient bi; it is part of the ANOVA output (see Table 4a). The corresponding probability,

  Pi = FDIST(Fi, 1, n - k)                                       (25)

leads us to the confidence level 1 - Pi, at which we can state that the coefficient bi is significant. If this level is lower than the desired one, we say that the coefficient bi is insignificant.
Fi is not part of the spreadsheet regression output, but it might be calculated by hand if needed. However, there is another statistic for testing individual parameters which is part of the ANOVA output (see Table 4a):

  ti = bi / se(bi)                                               (26)

The ti number is the square root of Fi (expression (24)). It has a Student's t-distribution (see [1], p. 381, for the analytical form of the distribution), and the corresponding probability is numerically the same as that given by (25). There is a statistical function in Microsoft Excel which allows one to determine Pi (part of the ANOVA output; see Table 4a):

  Pi = TDIST(ti, n - k, 2)                                       (27)

The parameters of the function (27) are the number of degrees of freedom, df = dfE = n - k, and the form of the test, TL = 2. If TL = 1, a result for a one-tailed distribution is returned; if TL = 2, a two-tailed distribution result is returned. An interested reader can find more information about this issue in ref. [1].

Example: In our illustration, P0 = 0.7881 and P3 = 0.6731 (see Table 4a) correspond to fairly low confidence levels: 1 - P0 = 0.2119 and 1 - P3 = 0.3269. This suggests that the parameters b0 and b3 are not significant. The confidence levels for b1 and b2 are high: 1 - P1 = 1 - 0.0282 = 0.9718 and 1 - P2 = 1 - 0.0001 = 0.9999, which means that they are significant.

In conclusion of this F-test discussion, it should be noted that if we remove even one insignificant variable from the model, we need to test the model once again, since coefficients which were significant may in certain cases become insignificant after the removal, and vice versa. It is good practice to use a reasonable combination of all three tests in order to achieve the most reliable conclusions.

Confidence interval

In the previous section we were obtaining confidence levels from given F-numbers or t-numbers. We can also go in the opposite direction: given a desired minimal confidence level 1 - P (e.g., 0.95), calculate the related F-number or t-number. Microsoft Excel provides two statistical functions for that purpose:

  F(1-P) = FINV(P, dfq, dfv)                                     (28)

  t(1-P) = TINV(P, df)                                           (29)

where dfq and dfv are the degrees of freedom of the numerator and the denominator, respectively (see (15)), and df is the number of degrees of freedom associated with
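The standard errors se(bi) and t statistics (26) of Table 4a can be reproduced from the covariance matrix of the fitted parameters, MSE·(XᵀX)⁻¹ — a standard least-squares result, though the formula itself is not spelled out in the handout. A Python/NumPy sketch (an assumption, as before):

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
n, k = X.shape
MSE = np.sum((y - X @ b)**2) / (n - k)     # error variance, eq. (11)

# Standard errors of the coefficients: sqrt of the diagonal of MSE*(X^T X)^-1
cov = MSE * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

t = b / se    # t statistics, eq. (26); approx [0.29, 3.99, 27.65, -0.47]
```

The values of se and t match the "Standard Error" and "t Stat" columns of Table 4a.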
a given t-test (it varies from test to test).

NOTE: In expression (29), P is the probability associated with the so-called two-tailed Student's distribution. A one-tailed distribution has a different probability, a; the relationship between the two is

  a = P / 2                                                      (30)

Values of F-numbers and t-numbers for various probabilities and degrees of freedom are tabulated and can be found in any text on statistics [1-4]. Usually the one-tailed Student's distribution is presented.

Knowing the t-number for a coefficient bi, we can calculate the numerical interval which contains the coefficient bi with the desired probability 1 - Pi:

  bL,1-Pi = bi - se(bi) * t(1-Pi)     (lower limit)              (31)

  bU,1-Pi = bi + se(bi) * t(1-Pi)     (upper limit)              (31a)

with

  t(1-Pi) = TINV(Pi, n - k)                                      (32)

The standard errors of the individual parameters, se(bi), are part of the ANOVA output (Table 4a). The interval [bL,1-Pi, bU,1-Pi] is called the confidence interval of the parameter bi at the 1 - Pi confidence level. The upper and lower limits of this interval at 95% confidence are listed in the ANOVA output by default (Table 4a, columns "Lower 95%" and "Upper 95%"). If, in addition to this default, the confidence interval at a confidence other than 95% is desired, the box Confidence Level should be checked and the value of the alternative confidence entered in the corresponding window of the regression input dialog box (see Fig. 1).

Example: For the unrestricted model (1b), the lower and upper 95% limits for the intercept are -5.1413 and 6.1872, respectively (see Table 4a). The fact that with 95% probability zero falls in this interval is consistent with our conclusion about the insignificance of b0, made in the course of the F-testing of the individual parameters (see the Example at the end of the previous section). The confidence intervals at the 95% level for b1 and b2 do not include zero; this also agrees with the F-test of the individual parameters. In fact, checking whether zero falls within a confidence interval can be viewed simply as a different way to perform the F-test (t-test) of individual parameters, and it must not be used as an additional, independent proof of the conclusions made in such a test.
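The 95% confidence intervals (31), (31a) then follow with the tabulated two-tailed t-value t(0.975; 3 df) ≈ 3.182, an assumed tabulated constant standing in for TINV(0.05, 3) (Python/NumPy sketch, not part of the handout):

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
n, k = X.shape
MSE = np.sum((y - X @ b)**2) / (n - k)
se = np.sqrt(np.diag(MSE * np.linalg.inv(X.T @ X)))

t95 = 3.182                 # tabulated TINV(0.05, n - k = 3)
lower = b - se * t95        # eq. (31),  "Lower 95%" column of Table 4a
upper = b + se * t95        # eq. (31a), "Upper 95%" column of Table 4a

# Zero inside the interval -> the coefficient is insignificant at 95%
for name, lo, hi in zip(["b0", "b1", "b2", "b3"], lower, upper):
    print(name, "insignificant" if lo < 0 < hi else "significant")
```

For this data set the loop reports b0 and b3 as insignificant and b1 and b2 as significant, in agreement with the individual-parameter tests.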
Regression statistics output

The information contained in the Regression Statistics output characterizes the goodness of the model as a whole. Note that the quantities listed in this output can be expressed in terms of the regression F-number FR (Table 4), which we have already used for the significance test of all coefficients.

Example: For our unrestricted model (1b) the output is:

Table 7. Regression statistics output

  Multiple R (R)               0.9999988
  R Square (R^2)               0.9999977
  Adjusted R Square (R^2adj)   0.9999953
  Standard Error (sy)          0.753
  Observations (n)             7

(The corresponding notation used in this handout is given in parentheses; the numerical values here follow from the sums of squares in Table 4.)

Standard error, sy:

  sy = sqrt(MSE)                                                 (33)

where MSE is the error variance discussed before (see expression (11)). The quantity sy is an estimate of the standard error (deviation) of the experimental values of the dependent variable y with respect to those predicted by the regression model. It is used in statistics for different purposes; we saw one of its applications in the discussion of the Residual output (standardized residuals; see expression (2a)).

Coefficient of determination, R^2 (R Square):

  R^2 = SSR / SST = 1 - SSE / SST                                (34)

SSR, SSE, and SST are the regression, residual (error), and total sums of squares, defined by (7), (6a), and (3a), respectively. The coefficient of determination is a measure of the regression model as a whole: the closer R^2 is to one, the better the model (1) describes the data. In the case of a perfect fit, R^2 = 1.

Adjusted coefficient of determination, R^2adj (Adjusted R Square):

  R^2adj = 1 - [SSE / (n - k)] / [SST / (n - 1)]                 (35)

SSE and SST are the residual (error) and total sums of squares (see expressions (6a) and (3a)). The significance of R^2adj is basically the same as that of R^2: the closer to one, the better. Strictly speaking, R^2adj should be used as the indicator of the adequacy of the model, since it takes into account not only the deviations but also the numbers of degrees of freedom.

Multiple correlation coefficient, R:

  R = sqrt(SSR / SST)                                            (36)

This quantity is just the square root of the coefficient of determination.

Example: The fact that R^2adj = 0.9999953 in our illustration is fairly close to 1 (see Table 7) suggests that the overall model (1b) is adequate to fit the experimental data presented in Table 1. However, it does not mean that there are no insignificant parameters in it.
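Finally, the Regression Statistics quantities (33)-(36) can be checked from the same fit (Python/NumPy sketch, again an assumption — the handout's workflow is entirely in Excel):

```python
import numpy as np

z = np.array([2.5, 3.1, 8.1, 12.2, 13.5, 17.9, 21.0])
y = np.array([20.6947, 28.5623, 157.0020, 334.6340,
              406.5697, 696.0331, 945.1385])
X = np.column_stack([np.ones_like(z), z, z**2, z**3])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ b
n, k = X.shape

SST = np.sum((y - y.mean())**2)
SSE = np.sum((y - y_pred)**2)
SSR = np.sum((y_pred - y.mean())**2)

s_y = np.sqrt(SSE / (n - k))                      # Standard Error, eq. (33)
R2 = SSR / SST                                    # R Square, eq. (34)
R2_adj = 1 - (SSE / (n - k)) / (SST / (n - 1))    # Adjusted R Square, eq. (35)
R = np.sqrt(R2)                                   # Multiple R, eq. (36)
print(round(R2_adj, 7))  # 0.9999953
```

Note that R^2adj < R^2 always holds for k > 1, since the adjustment penalizes the extra parameters.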
REGRESSION OUTPUT FORMULA MAP

For reference, the following tables present a summary of the formula numbers for the individual items in the Microsoft Excel Regression Output. Variables in parentheses (introduced and used in this handout) do not appear in the output.

Table 8. Formula map of the Regression Statistics output

  Multiple R (R)               (36)
  R Square (R^2)               (34)
  Adjusted R Square (R^2adj)   (35)
  Standard Error (sy)          (33)
  Observations (n)

Table 9. Formula map of the Residual output

  Observation (j)   Predicted Y (yj)   Residuals (rj)   Standard Residuals (rj(std))
  1                 (1)                (2)              (2a)
  2                 (1)                (2)              (2a)
  ...               ...                ...              ...

Table 10. Formula map of the ANOVA output, part I

                    df         SS          MS         F          Significance F
  Regression        dfR (14)   SSR (7)     MSR (13)   FR (17)    PR (18)
  Residual (error)  dfE (12)   SSE (6a)    MSE (11)
  Total             dfT (10)   SST (3a)    MST* (9)

  * not reported in the Microsoft Excel Regression output

Table 10a. Formula map of the ANOVA output, part II

                     Coefficients   Standard Error   t Stat    P Value         Lower 95%      Upper 95%
                     (bi)           (se(bi))         ti (26)   Pi (25), (27)   bL,1-Pi (31)   bU,1-Pi (31a)
  Intercept (b0)
  X Variable 1 (x1)
  X Variable 2 (x2)
  ...

LITERATURE

1. Neter, J.; Wasserman, W. Applied Linear Statistical Models. R.D. Irwin, Inc.: Homewood, Illinois, 1974.
2. Afifi, A.A.; Azen, S.P. Statistical Analysis: A Computer Oriented Approach. Academic Press: New York, 1979.
3. Natrella, M.G. Experimental Statistics. National Bureau of Standards: Washington, DC, 1963.
4. Gunst, R.F.; Mason, R.L. Regression Analysis and Its Application. Marcel Dekker, Inc.: New York, 1980.
5. Shoemaker, D.P.; Garland, C.W.; Nibler, J.W. Experiments in Physical Chemistry, 6th ed. The McGraw-Hill Companies, Inc., 1996.