# ENGINEER STATISTICS STA 3032

This 87-page set of class notes was uploaded by Golden Bernhard on Friday, September 18, 2015. The notes belong to STA 3032 at the University of Florida, taught by Ramon Littell in Fall. Since its upload, it has received 24 views. For similar materials see /class/206552/sta-3032-university-of-florida in Statistics at University of Florida.

Date Created: 09/18/15

## Random Variables

Random variables are functions that describe the outcomes of an experiment. They are usually denoted with a capital letter such as $X$ or $Y$. RVs can be either discrete or continuous. A discrete random variable has a countable number of values, meaning that they could be listed. A continuous random variable has an uncountable number of numeric values, meaning they cannot be listed. Here are some examples.

Discrete:

- spots on top face of die: 1, 2, 3, 4, 5, 6
- suit of drawn card: C, D, H, S
- aphids on leaf: 0, 1, 2, 3, ...
- defects in box of 1000 nails: 0, 1, 2, ..., 1000
- germinating seeds out of 50: 0, 1, 2, ..., 50

Continuous:

- heights of people: (0, 7)
- pH of soil: (0, 10)
- voltage in circuit: (0, ∞)

Discrete random variables have probability mass functions, which we denote $p(x)$. The probability mass function gives the probability of each individual value of the RV, e.g., $P(X = x) = p(x)$.

Continuous random variables have probability density functions, which we denote $f(x)$. The probability density function can be used to give probabilities of ranges of values of the continuous RV. For example,

$$P(x_1 < X < x_2) = \int_{x_1}^{x_2} f(x)\,dx.$$

The Binomial and Normal are very important random variables.

### Binomial Random Variable

Let $X$ = number of successes out of $n$ trials in which the probability of success on each trial is a number $\pi$. Then $X$ has a binomial distribution with parameters $n$ and $\pi$. This is abbreviated $X \sim B(n, \pi)$.

Example: Consider flipping a coin and declare a success if a head (H) appears. Then the probability of success is 0.5. That is, $\pi = P(S) = 0.5$. Suppose the coin is flipped $n = 3$ times and $X$ = number of heads. Then $X \sim B(3, 0.5)$. The possible values of $X$ are 0, 1, 2, 3. The probability of any single event is $(1/2)^3 = 1/8$.

| Event | HHH | HHT | HTH | THH | HTT | THT | TTH | TTT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Heads | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 0 |
| P(Event) | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 |

P(3 H) = 1/8, P(2 H) = 3/8, P(1 H) = 3/8, P(0 H) = 1/8. In general,

$$P(k\ \text{H}) = \frac{3!}{k!\,(3-k)!}\left(\frac{1}{2}\right)^{k}\left(\frac{1}{2}\right)^{3-k} = \frac{3!}{k!\,(3-k)!}\cdot\frac{1}{8}.$$

Example: $X$ = number of 1's in 3 rolls of a die, with possible values 0, 1, 2, 3. In the table, X marks a roll that is not a 1.

| Event | 111 | 11X | 1X1 | X11 | 1XX | X1X | XX1 | XXX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1's | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 0 |
| P(Event) | $1^3/6^3$ | $1^2 5^1/6^3$ | $1^2 5^1/6^3$ | $1^2 5^1/6^3$ | $1^1 5^2/6^3$ | $1^1 5^2/6^3$ | $1^1 5^2/6^3$ | $5^3/6^3$ |

$$P(3\ \text{1's}) = \frac{1}{6}\cdot\frac{1}{6}\cdot\frac{1}{6} = \frac{1}{6^3} = \frac{1}{216} = 0.0046$$

$$P(k\ \text{1's}) = \frac{3!}{k!\,(3-k)!}\left(\frac{1}{6}\right)^{k}\left(\frac{5}{6}\right)^{3-k}$$

The probability mass function is given by the so-called Binomial Formula. With $\pi$ = probability of success on a single trial,

$$P(x\ \text{successes in}\ n\ \text{trials}) = \frac{n!}{x!\,(n-x)!}\,\pi^{x}(1-\pi)^{n-x}.$$

Mean of the binomial distribution: $n\pi$

Variance of the binomial distribution: $n\pi(1-\pi)$

### Normal Distribution and Normal Random Variable

The notation $Y \sim N(\mu, \sigma^2)$ means "The random variable $Y$ is distributed normally with mean $\mu$ and variance $\sigma^2$." The standard normal distribution has mean $\mu = 0$ and variance $\sigma^2 = 1$. The letter $Z$ is reserved to represent the standard normal random variable. Computer programs and tables are available to obtain probabilities from the normal distribution. For example, you can discover that:

- $P(-1 < Z < 1) = 0.68$
- $P(Z > 1) = 0.16$
- $P(-1.96 < Z < 1.96) = 0.95$
- $P(Z > 1.96) = 0.025$
- $P(Z > 1.42) = 0.0778$

The probability density function of the normal random variable with mean $\mu$ and variance $\sigma^2$ is given by the formula

$$f(y) = (2\pi\sigma^2)^{-1/2}\exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right).$$

Mean of Normal Distribution: $\mu$

Variance of Normal Distribution: $\sigma^2$

### Using the Normal Distribution

Standardizing a Normal Distribution: If $Y \sim N(\mu, \sigma^2)$, then $Z = (Y - \mu)/\sigma \sim N(0, 1)$.

This result allows us to compute probabilities from any normal distribution using tables or a computer program for the standard normal distribution. If you wanted to calculate the probability that a random variable $Y$ is greater than 1.42 standard deviations above its mean, you would compute

$$P(Y > \mu + 1.42\sigma) = P\!\left(\frac{Y-\mu}{\sigma} > 1.42\right) = P(Z > 1.42) = 0.0778.$$

As a more specific application, suppose you believe egg weights to be normally distributed with mean 65.4 and standard deviation 5.17 (rounded to 5.2 below). You would calculate the probability that a randomly drawn egg weighs more than 72 as

$$P(Y > 72) = P\!\left(\frac{Y - 65.4}{5.2} > \frac{72 - 65.4}{5.2}\right) = P(Z > 1.27) = 0.10.$$

### Using the Normal Distribution: An Application

Egg weights are normally distributed with mean $\mu = 65$ g and standard deviation $\sigma = 5.0$. Let $Y$ = egg weight.

1. What is the probability one randomly drawn egg will exceed (a) 65, (b) 66, (c) 70, (d) 75?

   (a) $P(Y > 65) = P\!\left(\frac{Y-65}{5} > \frac{65-65}{5}\right) = P(Z > 0) = 1/2 = 0.5$

   (b) $P(Y > 66) = P\!\left(\frac{Y-65}{5} > \frac{66-65}{5}\right) = P(Z > 0.2) = 0.4207$

   (c) $P(Y > 70) = P\!\left(\frac{Y-65}{5} > \frac{70-65}{5}\right) = P(Z > 1) = 0.1587$

   (d) $P(Y > 75) = P\!\left(\frac{Y-65}{5} > \frac{75-65}{5}\right) = P(Z > 2) = 0.0228$

2. What is
the probability that one egg is between 66 and 70 g?

$$P(66 < Y < 70) = P(Y > 66) - P(Y > 70) = 0.4207 - 0.1587 = 0.262$$

Probabilities of this type can be expressed in terms of the cumulative distribution function

$$P(Z < z) = F(z) = \int_{-\infty}^{z} (2\pi)^{-1/2}\exp\!\left(-\frac{t^2}{2}\right)dt.$$

The integral for the normal distribution is difficult to evaluate, so tables or computer programs are used to obtain actual values.

### Normal Approximation to the Binomial

You can use the normal distribution to approximate binomial probabilities. This often simplifies a computation. For example, suppose you are shooting free throws in basketball. You know that you make 75% of your shots; that is, the probability of making any one shot is 0.75. You have entered a contest that awards a prize if you make at least 18 out of 20 shots. What is the probability that you will win a prize? You need to calculate $P(Y \ge 18)$, where $Y$ is the number of shots you make out of 20.

The exact probability is given by the binomial formula with $\pi = 0.75$ and $n = 20$:

$$
\begin{aligned}
P(Y \ge 18) &= P(Y = 18) + P(Y = 19) + P(Y = 20) \\
&= \binom{20}{18}(0.75)^{18}(0.25)^{2} + \binom{20}{19}(0.75)^{19}(0.25)^{1} + \binom{20}{20}(0.75)^{20}(0.25)^{0} \\
&= 0.0669 + 0.0211 + 0.0032 = 0.0912.
\end{aligned}
$$

The calculation above would be tedious by hand, but many computer programs are available that can readily do it. However, even good computer programs may fail for computations involving extremely large $n$.

The normal approximation sets $\mu = n\pi$ and $\sigma^2 = n\pi(1-\pi)$ and assumes $Y \sim N(\mu, \sigma^2)$ to evaluate the probability. The approximation is improved by using a continuity correction, which means you compute $P(Y \ge 17.5)$ in place of $P(Y \ge 18)$.

The normal approximation is then computed as

$$\mu = n\pi = 20(0.75) = 15, \qquad \sigma^2 = n\pi(1-\pi) = 20(0.75)(0.25) = 3.75, \qquad \sigma = \sqrt{3.75} = 1.94,$$

$$P(Y \ge 17.5) = P\!\left(\frac{Y - 15}{1.94} \ge \frac{17.5 - 15}{1.94}\right) = P(Z \ge 1.29) = 1 - 0.901 = 0.099.$$

This is a reasonable approximation to the exact binomial probability $P(Y \ge 18) = 0.091$.

### Means and Variances of Random Variables

Means of random variables are called expected values, denoted $E(X)$.

If $X$ is a continuous RV, then

$$\mu_X = E(X) = \int x f(x)\,dx.$$

If $X$ is a discrete RV, then

$$\mu_X = E(X) = \sum_i x_i\,p(x_i).$$

Variances of random variables are also expected values.

If $X$ is a continuous RV, then

$$\sigma^2 = E(X - \mu)^2 = \int (x - \mu)^2 f(x)\,dx.$$

If $X$ is a discrete
RV, then

$$\sigma^2 = E(X - \mu)^2 = \sum_i (x_i - \mu)^2\,p(x_i).$$

### Means and Variances of Linear Functions of Random Variables

If $X$ is an RV with mean $\mu_X$ and variance $\sigma_X^2$, and $Y = a + bX$ where $a$ and $b$ are constants, then

$$\mu_Y = E(a + bX) = a + bE(X) = a + b\mu_X$$

and

$$\sigma_Y^2 = E(Y - \mu_Y)^2 = E\big(a + bX - (a + b\mu_X)\big)^2 = E\big(b^2 (X - \mu_X)^2\big) = b^2\sigma_X^2.$$

If $X_1, \ldots, X_k$ are RVs and $Y = X_1 + \cdots + X_k$, then

$$\mu_Y = E(X_1 + \cdots + X_k) = E(X_1) + \cdots + E(X_k) = \mu_1 + \cdots + \mu_k.$$

If $X_1, \ldots, X_k$ are independent RVs with means $\mu_1, \ldots, \mu_k$ and variances $\sigma_1^2, \ldots, \sigma_k^2$, and if $Y = X_1 + \cdots + X_k$, then

$$\sigma_Y^2 = E\big(X_1 + \cdots + X_k - (\mu_1 + \cdots + \mu_k)\big)^2 = E(X_1 - \mu_1)^2 + \cdots + E(X_k - \mu_k)^2 = \sigma_1^2 + \cdots + \sigma_k^2.$$

It follows from the above results that the sample mean $\bar{X} = (X_1 + \cdots + X_n)/n$ of $n$ independent observations, each with mean $\mu$ and variance $\sigma^2$, is a random variable with mean $\mu_{\bar{X}} = \mu$ and variance $\sigma_{\bar{X}}^2 = \sigma^2/n$. The distribution of $\bar{X}$ is called the sampling distribution of $\bar{X}$.

### Other Discrete Random Variables

The Binomial random variable is the most commonly used discrete random variable. It represents the number of successes out of $n$ independent Bernoulli trials. Bernoulli means that a trial has only two possible outcomes, e.g., S/F, 0/1, Y/N, M/F, T/F, etc. There are other discrete random variables that are used to represent outcomes from other situations.

#### Poisson Random Variable

The Poisson RV is used to represent the number of times that randomly occurring items are detected in a given interval of time or space. An example is the number of cars crossing a point on a road in a specified interval of time, assuming the cars come at random times. Another example is the number of particles suspended in a specified volume of air or liquid medium, assuming the particles are randomly distributed. The distinction between the Poisson and the Binomial is that Poisson counts are out of a given time or space, whereas Binomial counts are out of a given number of trials. There is no upper bound on the Poisson counts; the possible values are 0, 1, 2, 3, ...

The probability mass function for the Poisson RV is

$$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \quad \text{for } x = 0, 1, 2, 3, \ldots$$

The values of $p(x)$ form an infinite series that converges to 1:

$$\sum_{x=0}^{\infty} \frac{\lambda^{x} e^{-\lambda}}{x!} = 1.$$

The mean and variance of the Poisson distribution are both equal to $\lambda$; that is, $\mu = \sigma^2 = \lambda$. Therefore the Poisson count $x$ is an estimate of $\lambda$, and $\sqrt{x}$ is an estimate of the uncertainty expressed by the standard deviation.
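The Poisson facts above can be checked numerically. The following Python sketch sums the probability mass function over a truncated range and confirms that the total is close to 1 and that the mean and variance are both close to $\lambda$. The rate `lam = 4.0`, the truncation point `max_x`, and the function names are illustrative assumptions, not part of the notes.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson random variable with rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def poisson_summary(lam: float, max_x: int = 100):
    """Approximate the total probability, mean, and variance by
    truncating the infinite series at max_x (an assumed cutoff;
    keep it modest so factorial(x) still converts to a float)."""
    probs = [poisson_pmf(x, lam) for x in range(max_x + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var


# lam = 4.0 is a hypothetical rate chosen only for illustration.
total, mean, var = poisson_summary(lam=4.0)
print(f"total={total:.6f} mean={mean:.6f} variance={var:.6f}")
```

For a rate this small the truncated tail is negligible, so the three printed values should agree with 1, $\lambda$, and $\lambda$ to many decimal places.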
The Poisson distribution provides a good approximation to the Binomial for large $n$ and small $\pi$, with $\lambda = n\pi$.

#### Hypergeometric Random Variable

This RV is used in a situation similar to that of the Binomial, except that the population of S's and F's is finite. Thus the trials are not independent, because the probability of S or F changes each time a value is drawn from the population.

#### Geometric and Negative Binomial Random Variable

This RV represents the number of independent Bernoulli trials required to obtain a specified number $r$ of successes. If $r = 1$, then the Negative Binomial RV is called a Geometric RV.

### Other Continuous Random Variables

If data are skewed and thus do not fit the normal distribution, then other RVs can be used.

#### Lognormal Random Variable

In many types of applications the logarithms of data can be assumed normally distributed. If this assumption is true, then the data follow a lognormal distribution. Then if $X$ has a lognormal distribution, $Y = \ln(X)$ will have a normal distribution. Usually data are transformed and then analyzed in the log scale.

#### Exponential Random Variable

The exponential distribution is used as a waiting time distribution. If items are arriving at random, as with the Poisson distribution, the times between arrivals will follow the exponential distribution. The probability density function is

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x > 0.$$

### Probability Plots

Statistical graphs are useful to visually assess certain attributes of probability distributions. The figure below shows several such plots based on a data set containing measured values of bone zinc in 48 sheep. The quantiles and moments at the right of the plots give a computational summary of the data. At the bottom is a histogram with the best-fitting normal probability density curve plotted through it. The curve passes through the histogram bars fairly well, but there are a few departures. There are no data in the 375-400 interval and a few too many values in the 350-375 interval. The 375-400 interval is empty, and there is only one value in the 400-425
interval. The familiar boxplot does not detect these departures.

[Figure: distribution output for the bone zinc data, showing a normal quantile (Q-Q) plot, a boxplot, and a histogram with fitted normal curve, with tables of quantiles and moments at the right.]

The departure from normality is depicted more clearly with a so-called quantile-quantile (Q-Q) plot, which is shown in the upper portion of the plot. The Q-Q plot is basically a plot of the quantiles of the fitted normal distribution versus the quantiles of the data distribution, i.e., the ordered observations $y_{(1)}, \ldots, y_{(n)}$. If the data represented a sample from a normal distribution, the points would be distributed about a straight line. Clusters of points above the line indicate more data than would be expected in that area, and points below the line indicate fewer. You can see the cluster from the 225-250 interval and the 350-375 interval. You can also see the gap due to the empty 375-400 interval.

Even though the Q-Q plot detects possible departures from normality, the results are not definitive because of the relatively small sample of 48 values. Q-Q plots are more effective with data sets larger than $n = 48$.

### Central Limit Theorem

The sampling distribution is one of the most difficult concepts in all of statistics for students to comprehend. In order to grasp the concept of the sampling distribution, imagine obtaining a large number of samples, with each sample consisting of $n$ observations. Then you compute the mean of each sample to generate a population of sample means. These means constitute the sampling distribution of $\bar{X}$.

Previously it was shown that the sampling distribution of the sample mean has mean $\mu$ and variance $\sigma^2/n$. The Central Limit Theorem states that the sampling distribution of $\bar{X}$ is also approximately normally distributed, even though the distribution from which the samples were obtained is not normal. The closeness to the normal distribution increases as the sample size $n$
increases.

To illustrate, consider a Bernoulli distribution that has only 0 and 1 as distinct values, and suppose there is a proportion $\pi$ of 1's and a proportion $1 - \pi$ of 0's. Imagine taking numerous samples from this population, each of size $n$, and computing the means for each of the samples. Now imagine doing this for several different values of $n$, say $n$ = 1, 3, 5, 10, 30, and 100. Following are histograms of the means for each of the sample sizes, with $\pi = 0.3$.

[Figures: distribution output (histogram, quantiles, and moments) of the sample means for each sample size, $n$ = 1, 3, 5, 10, 30, and 100.]

## Introduction to Statistics

Ramon C Littell
littellu edu

### What is Statistics?

The purpose of statistics: to make inference about unknown quantities from samples of data.

Basically, you have questions about a set of objects that is too numerous to observe in its entirety. The large set of objects is called a population. But you can observe a subset of the objects, called a sample. You obtain a sample from the population, observe each object in the sample, record data, and use the information in the sample to make inference about the population.

For example: You want to know something about the age distribution of undergraduate students at the University of Florida, that is, how many ages are <18, <19, <20, <21, <22, etc. Or you might want to know the average age, or the age range. In either case, you want information about the set of ages of all UF undergraduate students. These ages would be the population of interest. (Note: the ages are the population, not the people.)

It is infeasible to get the ages of all UF students. You cannot observe the entire population. Instead, you get ages of a subset of the population. The subset is called a sample. Then you use the data in the sample to estimate what you want to know about the population.

### Getting a Sample of Data from a Population

There are several ways to get a sample of data from a population. In the case of the population of ages of UF undergraduate students, here are some examples:

1. Draw 100 names of undergraduate students at random from the UF Student Directory. Contact them and ask their ages.
2. Get the ages of the students in STA 3032 during a particular semester.
3. Go to a bar during finals week and ask the ages of all the patrons.

Each of these approaches has its own drawbacks. Probably the first approach is best and the third is worst. The second approach might be acceptable to the extent that students who take STA 3032 represent all UF undergraduate students.

### Types of Sample

Simple Random Sample: Each subset of a given size has the same chance of being drawn.

Convenience Sample: Using data that is
immediately available.

### Types of Populations

Tangible Populations: Populations whose members physically exist, e.g., ages of UF undergraduate students.

Conceptual Populations: Populations whose members exist only in our imaginations, e.g., breaking strengths of all pencils of a certain type that you could possibly use this semester.

### Types of Research Studies and Sources of Data

1. Designed experiments: Treatments are applied to experimental units according to a prescribed plan.
2. Surveys: Data are collected on existing units selected from a population according to a prescribed plan.
3. Observational studies: Data are gathered on units that are available.

Questions about the earlier sampling examples:

- What would you call the first example of getting a sample?
- What would you call the second example of getting a sample?

### Other Examples of Populations and Samples

Populations:

1. Numbers of cars passing an intersection in an hour.
2. Serum zinc levels in dogs in the Gainesville area.
3. Strengths of concrete from a given mix of sand, cement, and gravel.

Samples:

1. Counts of cars passing the intersection in a specified set of hours.
2. Serum zinc levels in dogs entering the UF College of Veterinary Medicine Small Animal Clinic.
3. Measurements from samples of concrete with known ingredients in the concrete mix.

### Data Summarization

It is usually difficult to learn much about a set of measurements from a list. If you wanted to report information about the ages of UF undergraduate students, you would probably employ some method of data summarization. Here are some possible ways to summarize the data:

1. Report the mean or the range of the data.
2. Report how many values are in various age categories.
3. Construct a graph to display the data.

### Example of Data Summarization

Sixty-three pregnant women participated in a nutritional intake study. As a baseline indicator, their body weights in kg were recorded at the end of the first trimester. Here are the data, arranged in columns by 10-kg interval:

| 40-50 | 50-60 | 60-70 | 70-80 | 80-90 | 100 and above |
| --- | --- | --- | --- | --- | --- |
| 42.3 | 51.8 | 61.4 | 70.2 | 80.5 | 104.5 |
| 44.8 | 52.7 | 61.8 | 70.5 | 81.8 | 112.0 |
| 47.3 | 53.6 | 62.3 | 70.5 | 84.8 | 131.8 |
| 48.9 | 53.9 | 62.3 | 70.7 | 84.8 | |
| 49.5 | 55.0 | 63.0 | 71.4 | 86.4 | |
| | 55.5 | 63.2 | 72.0 | 86.4 | |
| | 55.9 | 63.4 | 72.7 | 88.2 | |
| | 56.4 | 64.1 | 73.9 | 89.8 | |
| | 57.0 | 64.3 | 74.5 | | |
| | 57.0 | 64.8 | 74.8 | | |
| | 57.0 | 66.6 | 75.0 | | |
| | 57.5 | 66.8 | 75.5 | | |
| | 57.5 | 67.3 | 75.7 | | |
| | 59.1 | 68.2 | 75.9 | | |
| | 59.3 | 68.2 | 75.9 | | |
| | | 68.9 | | | |
| | | 69.8 | | | |

Counts by interval (40-50, 50-60, 60-70, 70-80, 80-90, 90-100, 100 and above): 5, 15, 17, 15, 8, 0, 3.

Summary statistics:

| Statistic | Value |
| --- | --- |
| Min | 42.3 |
| Max | 131.8 |
| Mean | 68.4 |
| Range | 89.5 |
| Standard deviation | 15.6 |

[Figure: Frequency Histogram of Bodyweight Data. Title: Body Weights of Pregnant Women in First Trimester. Horizontal axis: body weights in kg (category midpoints 45 to 135). Vertical axis: frequency.]

[Figure: Relative Frequency Histogram of Bodyweights. Title: Body Weights of Pregnant Women in First Trimester. Horizontal axis: body weights in kg (category midpoints 45 to 135). Vertical axis: relative frequency.]

### Guideline for Histogram Construction

- Divide the range of data into 5 to 20 intervals.
- Count the number of data values in each interval.
- Draw bars whose heights reflect the counts.

### Another Example of Data Description

Egg weights on a particular date from 54 hens: 53.4, 55.2, ..., 80.8, 83.1

Range = 83.1 - 53.4 = 29.7

| Interval | 50-55 | 55-60 | 60-65 | 65-70 | 70-75 | 75-80 | 80-85 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Frequency | 1 | 3 | 22 | 22 | 4 | 0 | 2 |
| Relative frequency | 0.0185 | 0.0556 | 0.4074 | 0.4074 | 0.0741 | 0 | 0.0370 |

[Figure: Histogram of Egg Weight Data. Horizontal axis: egg weights (50 to 85). Vertical axis: frequency.]

Question: How do the body weights of pregnant women characteristically differ from the egg weights? Does this surprise you?

### Sample Descriptive Statistics

Data $y_i$: $y_1 = 67.0,\ y_2 = 71.2,\ \ldots,\ y_{53} = 83.1,\ y_{54} = 69.7$

Sample size: $n = 54$

Sum: $\sum_{i=1}^{54} y_i = y_1 + y_2 + \cdots + y_{53} + y_{54} = 3531.3$

Mean: $\bar{y} = \sum y_i / n = 3531.3/54 = 65.39$

Ordered data: $y_{(1)} = 53.4,\ y_{(2)} = 55.2,\ \ldots,\ y_{(53)} = 80.8,\ y_{(54)} = 83.1$

Median: $(y_{(27)} + y_{(28)})/2 = 65.25$

75th percentile: $0.75(54) = 40.5$, so $y_{(41)} = 68.1$

25th percentile: $0.25(54) = 13.5$, so $y_{(14)} = 62.1$

Interpretation:

- No more than 25% of the data are below and no more than 75% are above 62.1.
- No more than 75% of the data are below and no more than 25% are above 68.1.

### Measures of Central Tendency

- Mean = 65.39 ($\bar{y}$)
- Median = 65.25 (50th percentile, middle value)
- Mode = 62.7 (most frequently occurring observation)

### Measures of Dispersion

- Range: $y_{\max} - y_{\min} = 83.1 - 53.4 = 29.7$
- Interquartile range: $q_3 - q_1 = 68.1 - 62.1 = 6.0$
- Variance: $s^2 = \dfrac{\sum (y_i - \bar{y})^2}{n - 1} = 26.747$
- Standard deviation: $s = \sqrt{s^2} = \sqrt{26.747} = 5.172$

### Empirical Rule

The Empirical Rule provides a practical use
of the standard deviation. If the distribution is mound-shaped, then:

- Approximately 68% of the data are between $\bar{y} - s$ and $\bar{y} + s$.
- Approximately 95% of the data are between $\bar{y} - 2s$ and $\bar{y} + 2s$.
- Approximately 99% of the data are between $\bar{y} - 3s$ and $\bar{y} + 3s$.

### Empirical Rule for Egg Weight Data

| Count | Egg weights |
| --- | --- |
| 1 | 53.4 |
| 1 | 55.2 |
| 2 | 58.3 59.2 |
| 7 | 60.2 60.3 61.0 61.4 61.5 61.5 61.8 |
| 12 | 62.0 62.0 62.1 62.2 62.6 62.7 62.7 62.7 63.0 63.0 63.5 63.6 |
| 8 | 64.3 64.5 64.7 65.2 65.3 65.4 65.4 65.9 |
| 9 | 66.0 66.0 66.0 66.3 67.0 67.0 67.4 67.5 67.6 |
| 8 | 68.1 68.2 68.8 69.0 69.1 69.2 69.7 69.8 |
| 2 | 71.2 71.8 |
| 2 | 72.0 73.1 |
| 1 | 80.8 |
| 1 | 83.1 |

$\bar{y} = 65.39$, $s = 5.17$

| Interval | Lower | Upper | Count | Percent |
| --- | --- | --- | --- | --- |
| $\bar{y} \pm s$ | 60.22 | 70.56 | 43 | 79.6% |
| $\bar{y} \pm 2s$ | 55.05 | 75.73 | 51 | 94.4% |
| $\bar{y} \pm 3s$ | 49.88 | 80.90 | 53 | 98.1% |

### Populations and Samples; Parameters and Statistics

There are means, standard deviations, etc., for both samples and populations, but by convention they use different notation.

Sample quantities are called statistics. You can compute statistics because the sample values are available to you.

Population quantities are called parameters. You cannot compute parameters because not all of the population values are available to you.

### Notation for Populations and Samples

| Quantity | Sample statistic | Population parameter |
| --- | --- | --- |
| Mean | $\bar{y}$ | $\mu$ |
| Variance | $s^2$ | $\sigma^2$ |
| Standard deviation | $s$ | $\sigma$ |

Sample statistics are estimates of the corresponding population parameters.

### Empirical Rule for Normally Distributed Population

- 68% of measurements are between $\mu - \sigma$ and $\mu + \sigma$.
- 95% of measurements are between $\mu - 2\sigma$ and $\mu + 2\sigma$.
- More than 99% of measurements are between $\mu - 3\sigma$ and $\mu + 3\sigma$.

### Normal Distribution and the Empirical Rule

### Other Graphical Procedures

- Box plot
- Stem and leaf
- Distribution function
- Normal probability plot

### SAS PROC UNIVARIATE Output

The UNIVARIATE Procedure, Variable: EW

Moments:

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| N | 54 | Sum Weights | 54 |
| Mean | 65.3944444 | Sum Observations | 3531.3 |
| Std Deviation | 5.17178181 | Variance | 26.747327 |
| Skewness | 0.93676395 | Kurtosis | 2.83048891 |
| Uncorrected SS | 232345.01 | Corrected SS | 1417.60833 |
| Coeff Variation | 7.90859506 | Std Error Mean | 0.70379036 |

Basic Statistical Measures:

| Location | Value | Variability | Value |
| --- | --- | --- | --- |
| Mean | 65.39444 | Std Deviation | 5.17178 |
| Median | 65.25000 | Variance | 26.74733 |
| Mode | 62.70000 | Range | 29.70000 |
| | | Interquartile Range | 6.00000 |

NOTE: The mode displayed is the smallest of 2 modes with a count of 3.

Tests for Location: Mu0 = 0

| Test | Statistic | Value | p Value |
| --- | --- | --- | --- |
| Student's t | t | 92.91751 | Pr > \|t\| <.0001 |
| Sign | M | 27 | Pr >= \|M\| <.0001 |
| Signed Rank | S | 742.5 | Pr >= \|S\| <.0001 |

Quantiles (Definition 5):

| Quantile | Estimate |
| --- | --- |
| 100% Max | 83.10 |
| 99% | 83.10 |
| 95% | 73.10 |
| 90% | 71.20 |
| 75% Q3 | 68.10 |
| 50% Median | 65.25 |
| 25% Q1 | 62.10 |
| 10% | 60.30 |
| 5% | 58.30 |
| 1% | 53.40 |
| 0% Min | 53.40 |

Extreme Observations:

| Lowest value | Obs | Highest value | Obs |
| --- | --- | --- | --- |
| 53.4 | 1 | 71.8 | 50 |
| 55.2 | 2 | 72.0 | 51 |
| 58.3 | 3 | 73.1 | 52 |
| 59.2 | 4 | 80.8 | 53 |
| 60.2 | 5 | 83.1 | 54 |

[Figure: stem-and-leaf plot and boxplot of the egg-weight variable EW.]

## Multiple Linear Regression Model

Multiple linear regression refers to regression applications in which there is more than one independent variable, $x_1, x_2, \ldots, x_k$. A multiple linear regression model with $k$ independent variables has the equation

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon. \tag{1}$$

The $\varepsilon$ is a random variable with mean 0 and variance $\sigma^2$. The parameter $\beta_1$ represents the expected change in $y$ resulting from a single unit change in $x_1$, holding all other independent variables fixed.

A prediction equation for this model fitted to data is

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k, \tag{2}$$

where $\hat{y}$ denotes the predicted value computed from the equation and $\hat\beta_i$ denotes an estimate of $\beta_i$. These estimates are usually obtained by the method of least squares. This means finding, among the set of all possible values for the parameter estimates, the ones that minimize the sum of squared residuals $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. The least squares estimates yield the best-fitting equation in terms of minimizing the sum of squared distances of the fitted plane to the data points.

The interpretation of the parameter estimates is the same as the interpretation of the model parameters, except with respect to the fitted model. The parameter estimate $\hat\beta_1$ represents the change in $\hat{y}$ resulting from a single
unit change in $x_1$, holding all other independent variables fixed.

### Example of Multiple Linear Regression

An example of a multiple linear regression with two independent variables is given by the KWH data, but now with $x_1$ = AC and $x_2$ = DRYER. Figure 1 shows a plot of KWH versus DRYER.

[Figure 1. Plot of KWH versus DRYER.]

The plot in Figure 1 clearly shows that KWH increases with increasing runs of the dryer, but the plot does not take into account the variable AC. Visualizing the simultaneous effects of AC and DRYER on KWH would require a plot in three dimensions, which is difficult to construct.

The model equation would be

$$\text{KWH} = \beta_0 + \beta_1\,\text{AC} + \beta_2\,\text{DRYER} + \varepsilon.$$

Least squares parameter estimates are $\hat\beta_0 = 8.11$, $\hat\beta_1 = 5.47$, $\hat\beta_2 = 13.22$. Computation of the estimates by hand is tedious, and infeasible for more than two independent variables. Estimates are ordinarily obtained using a regression computer program. Standard errors are also usually part of the output from a regression program.

The prediction equation for the KWH data is

$$\widehat{\text{KWH}} = 8.11 + 5.47\,\text{AC} + 13.22\,\text{DRYER}. \tag{3}$$

This model ascribes 5.47 KWHs to hourly use of the AC, 13.22 KWHs to each use of the DRYER, and 8.11 to all other electrical devices combined. Remember that $\hat\beta_1 = 5.47$ is an estimate of the amount of change in KWH due to a one-unit increase in AC, holding DRYER constant.

Compare this prediction equation with the one including only AC in the model:

$$\widehat{\text{KWH}} = 27.85 + 5.34\,\text{AC}. \tag{4}$$

The intercept estimate has changed substantially, from 27.85 to 8.11. This change occurs because KWH consumption due to DRYER usage is not accounted for in equation (4). The KWH consumption due to average DRYER usage is combined into the intercept estimate in the model that does not contain DRYER. But the change in KWH due to a one-unit increase in DRYER usage is not explicitly shown in equation (4).

The estimate of the coefficient on AC has changed very little, from 5.34 to 5.47. This is related to the fact that AC and DRYER usage are relatively uncorrelated. In other words, use
of one is not related to use of the other. See Figure 2. Generally speaking, if AC and DRYER were positively (negatively) correlated, then the regression coefficient on AC would be reduced (increased) when DRYER was added to the model.

[Figure 2. Plot of AC versus DRYER.]

Compare the values of predicted KWH from the two models. Previously, AC = 10 was inserted in the simple linear prediction equation to get

$$\widehat{\text{KWH}} = 27.85 + 5.34(10) = 81.25. \tag{5}$$

A value of DRYER must also be inserted into the multiple regression equation to get a predicted KWH value. Trying DRYER = 0, 1, and 2 and holding AC = 10 gives

$$
\begin{aligned}
\widehat{\text{KWH}} &= 8.11 + 5.47(10) + 13.22(0) = 62.81 \\
\widehat{\text{KWH}} &= 8.11 + 5.47(10) + 13.22(1) = 76.03 \\
\widehat{\text{KWH}} &= 8.11 + 5.47(10) + 13.22(2) = 89.25
\end{aligned}
\tag{6}
$$

KWH consumption increases by 13.22 as DRYER goes from 0 to 1, and again from 1 to 2, holding AC fixed at 10.

### Analysis of Variance for Multiple Regression Model

An analysis of variance for a multiple linear regression model with $k$ independent variables fitted to a data set with $n$ observations is

| Source of Variation | DF | SS | MS |
| --- | --- | --- | --- |
| Regression | $k$ | SSR | MSR |
| Error | $n-k-1$ | SSE | MSE |
| Total | $n-1$ | SSTot | |

(7)

The sums of squares SSR, SSE, and SSTot have the same definitions in relation to the model as in simple linear regression:

$$\text{SSR} = \sum (\hat{y}_i - \bar{y})^2, \qquad \text{SSE} = \sum (y_i - \hat{y}_i)^2, \qquad \text{SSTot} = \sum (y_i - \bar{y})^2. \tag{8}$$

Also, SSTot = SSR + SSE. The value of SSTot does not change with the model. It depends only on the values of the dependent variable $y$. But SSE decreases as variables are added to a model, and SSR increases by the same amount. This amount of increase in SSR is the amount of variation due to variables in the larger model that was not accounted for by variables in the smaller model. This increase in regression sum of squares is sometimes denoted

$$\text{SSR}(\text{added variables} \mid \text{original variables}), \tag{9}$$

where "original variables" represents the list of independent variables that were in the model prior to adding new variables, and "added variables" represents the list of variables that were added to obtain the new model. The overall SSR for the new model can be partitioned into the variation
attributable to the original variables plus the variation due to the added variables that is not due to the original variables:

$$\text{SSR}(\text{all variables}) = \text{SSR}(\text{original variables}) + \text{SSR}(\text{added variables} \mid \text{original variables}). \tag{10}$$

Generally speaking, larger values of the coefficient of determination $R^2 = \text{SSR}/\text{SSTot}$ indicate a better-fitting model. The value of $R^2$ cannot decrease as variables are added to the model. However, an increase does not necessarily mean that the model has actually been improved. The amount of increase in $R^2$ can be a mathematical artifact rather than a meaningful indication of an improved model. Sometimes an adjusted $R^2$ is used to overcome this shortcoming of the usual $R^2$. Most regression computer programs include both versions of $R^2$.

The analysis of variance for the two-variable model fitted to the KWH data is

| Source of Variation | DF | SS | MS |
| --- | --- | --- | --- |
| Regression | 2 | 929.98 | 464.99 |
| Error | 18 | 27.88 | 1.55 |
| Total | 20 | 957.86 | |

Adding DRYER to the model effected a dramatic change in the value of SSR, which increased from 560.97 to 929.98. The value of SSE dropped accordingly, from 396.89 to 27.88. The coefficient of determination is now $R^2 = 929.98/957.86 = 0.97$. The two variables AC and DRYER account for 97% of the variability in KWH consumption in the house. This is up from $R^2 = 560.97/957.86 = 0.59$ for the variable AC alone.

The regression sum of squares, partitioned into the amount due to AC alone plus the amount due to DRYER that was not attributable to AC, is

$$
\begin{aligned}
\text{SSR}(\text{AC and DRYER}) &= \text{SSR}(\text{AC}) + \text{SSR}(\text{DRYER} \mid \text{AC}) \\
929.98 &= 560.97 + 369.01
\end{aligned}
\tag{11}
$$

Thus 369.01 is the amount of variation due to DRYER that was not accounted for by AC. We can expand the ANOVA table to show the breakdown of the regression SS given in equation (11).

| Source of Variation | DF | SS | MS | F | P |
| --- | --- | --- | --- | --- | --- |
| Regression | 2 | 929.98 | 464.99 | | |
| AC | 1 | 560.97 | 560.97 | | |
| DRYER \| AC | 1 | 369.01 | 369.01 | 238.1 | 0.0001 |
| Error | 18 | 27.88 | 1.55 | | |
| Total | 20 | 957.86 | | | |

The value of $R^2$ increases as variables are added to the model, as shown in the table.

| Variables in Model | $R^2$ |
| --- | --- |
| AC | 0.5856 = 560.97/957.86 |
| AC and DRYER | 0.9709 = (560.97 + 369.01)/957.86 |

One of the detracting features of $R^2$ is that it can be
driven closer and closer to 1.0 by adding variables, even though the variables may have no relationship to KWH. To overcome this problem, an adjusted version is available in most computer programs.

### Statistical Inference for Regression Parameters

Statistical inference about the parameters requires standard errors of the estimates. A 95% confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{df,\,.025}\,\hat\sigma_{\hat\beta_1}, \tag{12}$$

where $t_{df,\,.025}$ is the critical value from a $t$ distribution with $df = n-k-1$, the degrees of freedom for error, and $\hat\sigma_{\hat\beta_1}$ is the standard error of $\hat\beta_1$. Standard errors for parameters in the two-variable model are

$$\hat\sigma_{\hat\beta_0} = 2.48, \qquad \hat\sigma_{\hat\beta_1} = 0.28, \qquad \hat\sigma_{\hat\beta_2} = 0.86. \tag{13}$$

The critical value from a $t$ distribution with $df = 18$ is $t_{18,\,.025} = 2.1$. Thus a 95% confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{18,\,.025}\,\hat\sigma_{\hat\beta_1} = 5.47 \pm 2.1(0.28) = 5.47 \pm 0.59.$$

We are 95% confident that the true hourly KWH consumption of the AC is between 4.88 and 6.06. This is a considerably shorter interval than the interval $5.34 \pm 2.16$ that was obtained from the simple linear regression model, because the variance estimate MSE has been reduced from 20.89 to 1.55.

It seems apparent that the model including both AC and DRYER is superior to the model containing AC alone. The value of $R^2$ is much higher (0.9709 compared to 0.5856) and MSError is much smaller. You can conduct a statistical test of significance to compare the two models using the ANOVA table with the partitioned SSReg. The test statistic is

$$F = \frac{\text{MS}(\text{DRYER} \mid \text{AC})}{\text{MSError}} = \frac{369.01}{1.55} = 238.1,$$

with numerator $df = 1$ and denominator $df = 18$. This is a huge value of $F$ with these degrees of freedom and is significant at any reasonable level.

[Figure 3. Plot of Residuals versus AC (bivariate fit of RESIDUAL by AC).]

[Figure 4. Plot of Residuals versus DRYER (bivariate fit of RESIDUAL by DRYER).]

Plots of the residuals from regressing KWH on AC and DRYER in Figures 3 and 4 reveal essentially the same pattern as when KWH is regressed on AC or DRYER individually.
That is because AC and DRYER are essentially uncorrelated. The curvature of the points in the plot of residuals versus DRYER persists.

Figure 5. Plot of Residuals versus Predicted Values (RESIDUAL plotted against PREDICTED).

Residuals from the regression of KWH on AC and DRYER, plotted versus predicted values in Figure 5, show a pattern distinctly different from the plots versus AC or DRYER.

Regression with Collinear Variables

The example on household KWH consumption utilized two independent variables that are almost uncorrelated. Thus, when DRYER was added to the model in addition to AC, the AC regression coefficient changed very little. Also, the amount of variation attributable to DRYER is almost the same when it is the only variable in the model as when it is added to a model that already includes AC.

The following example illustrates a situation in which two highly correlated variables are in a regression model. Students in a graduate statistics course recorded the spans of their left and right hands and their heights, all in inches. The objective was to develop a regression model to predict height from hand span. The variable names are HT, LSPAN, and RSPAN. Of course, LSPAN and RSPAN are highly correlated. This example illustrates the consequence of using two highly correlated variables in a multiple regression equation.

The two simple linear regression models are

$HT = \beta_0 + \beta_1\,LSPAN + \varepsilon$ and $HT = \beta_0 + \beta_2\,RSPAN + \varepsilon$

The prediction equations are

$HT = 43.62 + 2.88\,LSPAN$ and $HT = 41.34 + 3.17\,RSPAN$

Not surprisingly, the two equations are quite similar. Figures 6a and 6b show HT versus LSPAN and RSPAN.

Figure 6a. Regression of HT on LSPAN.

Figure 6b. Regression of HT on RSPAN.

Figure 7 shows a plot of RSPAN versus LSPAN, showing the high degree of collinearity.

Figure 7. Plot showing collinearity
between LSPAN and RSPAN (rspan plotted against lspan).

The multiple linear regression model is

$HT = \beta_0 + \beta_1\,LSPAN + \beta_2\,RSPAN + \varepsilon$   (14)

The prediction equation is

$HT = 41.13 - 4.31\,LSPAN + 7.53\,RSPAN$

At first look this equation seems to make no sense at all. The regression coefficient on LSPAN is negative, and the coefficient on RSPAN is more than twice as large as the coefficient when RSPAN was the only variable in the model. These are consequences of collinearity between LSPAN and RSPAN. Figure 8 shows a plot of HT versus LSPAN and RSPAN.

Figure 8. HT plotted versus LSPAN and RSPAN.

When you observe this plot you can imagine the instability of a plane fitted to the data, due to the lack of data points for large LSPAN, small RSPAN combinations and small LSPAN, large RSPAN combinations. Such points, of course, would not be typical values of LSPAN and RSPAN.

Here is a table showing the summary statistics for three regression models using LSPAN and RSPAN. Model 1 contains LSPAN alone, model 2 contains RSPAN alone, and model 3 contains both LSPAN and RSPAN.

Model  Variables in Model  Intercept  lspan             rspan            R2
1      lspan               43.62       2.88 (se .62)                     .36
2      rspan               41.34                         3.17 (se .61)   .41
3      lspan & rspan       41.13      -4.31 (se 3.24)    7.53 (se 3.34)  .43

The table shows these phenomena:

- Huge standard errors when both variables are included in the model.
- Similar intercept values for all models.
- Only very small increases in $R^2$ when going from model 1 to model 3 or from model 2 to model 3.

Figures 9a and 9b show the data and fitted plane on similar axes.

Figure 9a. HT plotted versus LSPAN and RSPAN.

Figure 9b. Plane of predicted values plotted versus LSPAN and RSPAN.

The plane increases in slope from the smallest values of LSPAN and RSPAN up to the largest values. This illustrates the following facts of regression with highly collinear variables:

1. Individual regression coefficients are practically meaningless.
2. [text partly illegible in source] ... variables are relatively stable.

SAS Program for Multiple Linear Regression Analysis of KWH Data

    options nonumber nodate;
    title1 'Household Electricity Consumption Data';
    data kilowatt;
      input
      kwh ac dryer;
      cards;
    35  1.5 1
    63  4.5 2
    66  5.0 2
    17  2.0 0
    94  8.5 3
    79  6.0 3
    93 13.5 1
    66  8.0 1
    94 12.5 1
    82  7.5 2
    78  6.5 3
    65  8.0 1
    77  7.5 2
    75  8.0 2
    62  7.5 1
    85 12.0 1
    43  6.0 0
    57  2.5 3
    33  5.0 0
    65  7.5 1
    33  6.0 0
    ;
    proc print data=kilowatt; run;
    proc corr sscp data=kilowatt; run;

    proc sort data=kilowatt; by ac; run;
    /* extra rows with missing kwh, used only to obtain predicted values */
    data acplot;
      do ac=0 to 15 by 5; kwh=.; output; end;
    run;
    proc print data=acplot; run;
    title2 'Regression of KWH vs AC and DRYER'; run;
    data acplot; merge acplot kilowatt; by ac; run;
    proc reg data=acplot; id ac;
      model kwh=ac / p;
      plot kwh*ac;
      output out=acplot1 p=acpred r=acresid1;
    run;
    proc gplot data=acplot1;
      plot acresid1*ac;
      plot acresid1*acpred;
    run;
    proc reg data=acplot; id ac;
      model kwh=ac / p clm cli;
      plot kwh*ac;
      output out=acplot2 p=acpred lclm=lcl uclm=ucl lcl=lpl ucl=upl;
    run;
    proc print data=acplot2; run;
    proc gplot data=acplot2;
      plot kwh*ac acpred*ac lcl*ac ucl*ac / overlay;
      plot kwh*ac acpred*ac lpl*ac upl*ac / overlay;
    run;

    title2 'Regression of KWH vs DRYER'; run;
    proc sort data=kilowatt; by dryer; run;
    data dryplot;
      do dryer=0 to 3 by 1; kwh=.; output; end;
    run;
    proc print data=dryplot; run;
    data dryplot; merge dryplot kilowatt; by dryer; run;
    proc reg data=dryplot; id ac;
      model kwh=dryer / p;
      plot kwh*dryer;
      output out=dryplot1 p=drypred r=dryresid1;
    run;
    proc gplot data=dryplot1;
      plot dryresid1*dryer;
      plot dryresid1*drypred;
    run;
    proc reg data=dryplot; id ac;
      model kwh=dryer / p clm cli;
      plot kwh*dryer;
      output out=dryplot2 p=drypred lclm=lcl uclm=ucl lcl=lpl ucl=upl;
    run;
    proc print data=dryplot2; run;
    proc gplot data=dryplot2;
      plot kwh*dryer drypred*dryer lcl*dryer ucl*dryer / overlay;
      plot kwh*dryer drypred*dryer lpl*dryer upl*dryer / overlay;
    run;

Introduction to Probability

Probabilities are expressed in terms of events.

Examples of Events

- Six shows on a roll of a die
- Jack of Spades is drawn from a deck of cards
- Rains at 3:00 pm today in front of the Reitz Union
- Have an automobile accident this year
- Florida beat Florida State
- Florida beat Tennessee
- Florida beat Florida State and Tennessee
- Yield of a randomly drawn citrus tree is greater than 8 boxes
- Mean yield of 25 randomly drawn citrus trees is
greater than 8 boxes

Events are denoted by capital letters: A, B, etc. Probabilities of events are denoted P(A) or P(event occurs).

Terminology

Terms in probability relate to terms in mathematical set theory.

Set theory             Probability
Universe               Universe
Subset of Universe     Event
Element in Universe    Outcome

Think of an outcome as the result of an experiment. The universe is the set of possible outcomes.

Example: Experiment: roll a die => Universe = {1, 2, 3, 4, 5, 6}
Event A = outcome is even = {2, 4, 6}
P(A) = 3/6 = .5

More on Terminology of Probability

Think about the events "Florida beat FSU" and "Florida beat Tennessee." These are events pertaining to winning games in the football season. Let's try to determine a set of outcomes that would allow us to describe all possible such events. There may be several ways to do this. Perhaps the most basic is a list of all possible win/loss records for the regular season of 12 games. There are $2^{12} = 4096$ of these outcomes; that is the number of elements in the universe set. We could list them as follows (only a few of the 4096 outcomes are shown):

Game     Outcomes
 1       W   W   W   W   W   W   L
 2       W   W   W   W   W   L   L
 3       W   W   W   W   W   L   L
 4       W   W   W   W   W   L   L
 5       W   W   W   W   L   L   L
 6       W   W   W   W   W   L   L
 7       W   W   W   W   L   L   L
 8       W   W   W   W   W   L   L
 9       W   W   W   W   W   L   L
10       W   W   W   L   W   L   L
11       W   W   L   W   W   L   L
12       W   L   W   W   W   L   L
Wins     12  11  11  11  10  1   0
Losses   0   1   1   1   2   11  12

Venn Diagrams

(Diagram: the Universe, containing Event A and Event B.)

Determination of Probabilities

Probabilities are determined from

1. Relative frequency computation
2. Subjective assessment

P(six on roll of die) = 1/6: a relative frequency computation
P(Florida beat Florida State) = .8: a subjective assessment
P(Rain today at 3:00 at JWRU) = .4: a combination of relative frequency and subjective assessment

Compound Events and Probabilities of Compound Events

Compound events are formed from combinations of events.

The union of events A and B occurs if either A or B occurs.
(Florida beat FSU) $\cup$ (Florida beat Tennessee) = (Florida beat either FSU or Tennessee)

The intersection of events A and B occurs if both A and B occur.
(Florida beat FSU) $\cap$ (Florida beat Tennessee) = (Florida beat both FSU and Tennessee)
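The set-theory vocabulary above maps directly onto set operations. The following Python sketch illustrates this under the assumption of a fair die with equally likely outcomes. Event A is the "outcome is even" event from the text; the second event B (a roll of at least five) and the helper name `prob` are hypothetical, introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

# Universe for one roll of a die; outcomes are assumed equally likely.
universe = {1, 2, 3, 4, 5, 6}


def prob(event, universe=universe):
    """Relative-frequency probability: favorable outcomes / all outcomes."""
    return Fraction(len(event & universe), len(universe))


even = {2, 4, 6}        # event A from the text: outcome is even
at_least_five = {5, 6}  # hypothetical event B, for illustration only

p_even = prob(even)                          # P(A)
p_union = prob(even | at_least_five)         # P(A or B), from the set {2, 4, 5, 6}
p_intersection = prob(even & at_least_five)  # P(A and B), from the set {6}

# Universe of win/loss records for a 12-game season: one W or L per game.
season_records = list(product("WL", repeat=12))
n_records = len(season_records)

print(p_even, p_union, p_intersection, n_records)
```

Counting elements only gives probabilities when every outcome is equally likely, which is reasonable for the die but not for the football records, where a subjective assessment is needed.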
Calculating Probabilities of Compound Events

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

P(Florida beat either FSU or Tennessee) = P(Florida beat FSU) + P(Florida beat Tennessee) - P(Florida beat both FSU and Tennessee)

Two events A and B are disjoint if they are mutually exclusive, i.e., if $A \cap B = \emptyset$, where $\emptyset$ is the empty set. If A and B are disjoint, then $P(A \cup B) = P(A) + P(B)$.

Two events A and B are independent if $P(A \cap B) = P(A)P(B)$.

- Are the events "Florida beat FSU" and "Florida beat Tennessee" independent?
- Are the events "Drives 4WD truck" and "Voted Republican" independent?
- Are the events "Drives Prius" and "Voted Republican" independent?
- Are the events "Lives in Florida" and "Voted Republican" independent?

Conditional Probability

The conditional probability of A given B is the probability that A occurs when it is known that B occurs. It is calculated by the formula

$P(A \mid B) = P(A \cap B)/P(B)$

If A and B are independent, then $P(A) = P(A \mid B)$.

Examples

P(J) = P(draw J from deck of cards) = 4/52 = 1/13
P(C) = P(draw club from deck of cards) = 13/52 = 1/4

Intersection: $P(J \cap C)$ = P(J and C) = 1/52

Union: $P(J \cup C)$ = P(J or C) = $P(J) + P(C) - P(J \cap C)$ = 4/52 + 13/52 - 1/52 = 16/52 = 4/13
$P(J \cup Q)$ = 8/52 = 2/13, since $P(J \cap Q) = 0$

Dependence: $P(J \cap C) = P(J)P(C)$, so "J" and "C" are independent.
$P(J \cap Q) \ne P(J)P(Q)$, so "J" and "Q" are not independent.

Application: Prostate Cancer

In a population of 10000, the counts by cancer status (PC) and PSA test result are:

            PC      no PC    Total
PSA+        48      350      398
PSA-        2       9600     9602
Total       50      9950     10000

P(PC) = 50/10000 = .005
P(PSA+) = 398/10000 = .0398
$P(PC \cap PSA+)$ = 48/10000 = .0048

Conditional probability: P(A given B) = $P(A \mid B) = P(A \cap B)/P(B)$

Sensitivity: $P(PSA+ \mid PC)$ = (48/10000)/(50/10000) = .96
Specificity: $P(PSA- \mid \text{no } PC)$ = (9600/10000)/(9950/10000) = .9648
Predictive value: $P(PC \mid PSA+) = P(PC \cap PSA+)/P(PSA+) = P(PSA+ \mid PC)P(PC)/P(PSA+)$ = (.96)(.005)/.0398 = .12

This is an example of the application of Bayes' Rule. It allowed us to calculate the conditional probability $P(PC \mid PSA+)$ using the other one, $P(PSA+ \mid PC)$.

More generally, suppose $A_1, A_2, \ldots, A_n$ is a partitioning of the universe, i.e., U is the union of $A_1, A_2, \ldots, A_n$ and the intersection of any two of the $A_i$ sets is empty. You want to calculate $P(A_k \mid B)$, where you know $P(B \mid A_1), \ldots, P(B \mid A_n)$ and you also know $P(A_1), \ldots, P(A_n)$. Bayes' Rule says that

$P(A_k \mid B) = \dfrac{P(B \mid A_k)P(A_k)}{P(B \mid A_1)P(A_1) + \cdots + P(B \mid A_n)P(A_n)}$

In the prostate cancer application, $A_1, A_2, \ldots, A_n$ could represent the various stages of
prostate cancer, and B is the event of a positive diagnosis. From epidemiological records you might know $P(B \mid A_i)$, the probability of positive diagnosis for each stage, and $P(A_i)$, the probabilities of the various stages.

Random Variables

Random variables are functions that describe the outcomes of an experiment. They are usually denoted with a capital letter such as X or Y. RVs can be either discrete or continuous. A discrete random variable has a countable number of values, meaning that they could be listed. A continuous random variable has an uncountable number of numeric values, meaning they cannot be listed. Here are some examples.

Discrete
- spots on top face of die: 1, 2, 3, 4, 5, 6
- suit of drawn card: C, D, H, S
- aphids on leaf: 0, 1, 2, 3, ...
- defects in box of 1000 nails: 0, 1, 2, ..., 1000
- germinating seeds out of 50: 0, 1, 2, ..., 50

Continuous
- heights of people: (0, 7)
- pH of soil: (0, 10)
- voltage in circuit: $(0, \infty)$

Discrete random variables have probability mass functions, which we denote p(x). The probability function gives probabilities of each individual value of the RV, e.g., $P(X = x) = p(x)$.

Continuous random variables have probability density functions, which we denote f(x). The probability density function can be used to give probabilities of ranges of values of the continuous RV. For example,

$P(x_1 < X < x_2) = \int_{x_1}^{x_2} f(x)\,dx$

There are two random variables that are especially important in statistics: the binomial and the normal.

Binomial Random Variable

Let X = number of successes out of n trials in which the probability of success on each trial is a number $\pi$. Then X has a binomial distribution with parameters n and $\pi$. This is abbreviated $X \sim B(n, \pi)$.

Example: Consider flipping a coin, and declare a success if a head (H) appears. Then the probability of success is .5. That is, $\pi = P(S) = .5$. Suppose the coin is flipped n = 3 times and X = number of heads. Then $X \sim B(3, .5)$. The possible values of X are 0, 1, 2, 3. The probability of each of the eight events below is $(1/2)^3 = 1/8$.

Events     HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
Heads      3    2    2    2    1    1    1    0
P(Event)   1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8

P(3 H) = 1/8, P(2 H) = 3/8, P(1 H) = 3/8, P(0 H) = 1/8
In general, $P(k \text{ H}) = \dfrac{3!}{k!(3-k)!}\left(\dfrac{1}{2}\right)^{k}\left(\dfrac{1}{2}\right)^{3-k} = \dfrac{3!}{k!(3-k)!}\cdot\dfrac{1}{8}$

Binomial Random Variable (con't)

Example: X = number of 1's in 3 rolls of a die; the possible values are 0, 1, 2, 3. In the table, X in an event denotes a roll other than 1.

Events     111        11X           1X1           X11           1XX           X1X           XX1           XXX
1's        3          2             2             2             1             1             1             0
P(Event)   $1^3/6^3$  $1^2 5^1/6^3$ $1^2 5^1/6^3$ $1^2 5^1/6^3$ $1^1 5^2/6^3$ $1^1 5^2/6^3$ $1^1 5^2/6^3$ $5^3/6^3$

P(3 1's) = (1/6)(1/6)(1/6) = $1/6^3$ = 1/216 = .0046

$P(k \text{ 1's}) = \dfrac{3!}{k!(3-k)!}\left(\dfrac{1}{6}\right)^{k}\left(\dfrac{5}{6}\right)^{3-k}$

The probability mass function is given by the so-called Binomial Formula. With $\pi$ = probability of success on a single trial,

$P(x \text{ successes in } n \text{ trials}) = \dfrac{n!}{x!(n-x)!}\,\pi^{x}(1-\pi)^{n-x}$

Mean of the binomial distribution: $n\pi$
Variance of the binomial distribution: $n\pi(1-\pi)$

Normal Distribution and normal random variable

The notation $Y \sim N(\mu, \sigma^2)$ means "The random variable Y is distributed normally with mean $\mu$ and variance $\sigma^2$."

The standard normal distribution has mean $\mu = 0$ and variance $\sigma^2 = 1$. The letter Z is reserved to represent the standard normal random variable.

Computer programs and tables are available to obtain probabilities from the normal distribution. For example, you can discover that

- P(-1 < Z < 1) = .68
- P(Z > 1) = .16
- P(-1.96 < Z < 1.96) = .95
- P(Z > 1.96) = .025
- P(Z > 1.42) = .0778

The probability density function of the normal random variable with mean $\mu$ and variance $\sigma^2$ is given by the formula

$f(y) = (2\pi\sigma^2)^{-1/2}\exp\left(-(y-\mu)^2/(2\sigma^2)\right)$

Using the Normal Distribution

Standardizing a Normal Distribution: If $Y \sim N(\mu, \sigma^2)$, then $Z = (Y-\mu)/\sigma \sim N(0, 1)$.

This result allows us to compute probabilities from any normal distribution using tables or a computer program for the standard normal distribution. If you wanted to calculate the probability that a random variable Y is greater than 1.42 standard deviations above its mean, you would compute

$P(Y > \mu + 1.42\sigma) = P\left(\dfrac{Y-\mu}{\sigma} > 1.42\right) = P(Z > 1.42) = .0778$

As a more specific application, suppose you believe the egg weights to be normally distributed with mean 65.4 and standard deviation 5.17. You would calculate the probability that a randomly drawn egg is greater than 72 as (rounding the standard deviation to 5.2)

$P(Y > 72) = P\left(\dfrac{Y-65.4}{5.2} > \dfrac{72-65.4}{5.2}\right) = P(Z > 1.27) = .1$

Using the normal distribution: an application

Egg weights are normally distributed with mean $\mu = 65$ g and standard deviation $\sigma = 5.0$.

1. What is the probability one
randomly drawn egg will exceed (a) 65, (b) 66, (c) 70, (d) 75 g?

Let Y = egg weight. Then

a) $P(Y > 65) = P\left(\dfrac{Y-65}{5} > \dfrac{65-65}{5}\right) = P(Z > 0) = .5$
b) $P(Y > 66) = P\left(\dfrac{Y-65}{5} > \dfrac{66-65}{5}\right) = P(Z > .2) = .4207$
c) $P(Y > 70) = P\left(\dfrac{Y-65}{5} > \dfrac{70-65}{5}\right) = P(Z > 1) = .1587$
d) $P(Y > 75) = P\left(\dfrac{Y-65}{5} > \dfrac{75-65}{5}\right) = P(Z > 2) = .0228$

2. What is the probability one egg is between 66 and 70 g?

P(66 < Y < 70) = P(Y > 66) - P(Y > 70) = .4207 - .1587 = .262

Probabilities of this type can be expressed in terms of the cumulative distribution function

$P(Z < z) = F(z) = \int_{-\infty}^{z} (2\pi)^{-1/2}\exp(-t^2/2)\,dt$

The integral for the normal distribution is difficult to evaluate, so tables or computer programs are used to obtain actual values.

Normal Approximation to the Binomial

You can use the normal distribution to approximate binomial probabilities. This often simplifies a computation. For example, suppose you are shooting free throws in basketball. You know that you make 75% of your shots; that is, the probability of making any one shot is .75. You have entered a contest that awards a prize if you make at least 18 out of 20 shots. What is the probability that you will win a prize? You need to calculate $P(Y \ge 18)$, where Y is the number of shots you make out of 20. The exact probability is given by the binomial formula with $\pi = .75$ and n = 20:

$P(Y \ge 18) = P(Y = 18) + P(Y = 19) + P(Y = 20)$
$= \dfrac{20!}{18!\,2!}(.75)^{18}(.25)^{2} + \dfrac{20!}{19!\,1!}(.75)^{19}(.25)^{1} + \dfrac{20!}{20!\,0!}(.75)^{20}(.25)^{0}$
$= .0669 + .0211 + .0032 = .0912$

Normal Approximation to the Binomial (continued)

The calculation above would be tedious by hand, but many computer programs are available that can readily do it. However, even good computer programs may fail for computations involving extremely large n. The normal approximation sets $\mu = n\pi$ and $\sigma^2 = n\pi(1-\pi)$, and assumes $Y \sim N(\mu, \sigma^2)$ to evaluate the probability. The approximation is improved by using a continuity correction, which means you compute $P(Y \ge 17.5)$ in place of $P(Y \ge 18)$. The normal approximation is then computed as

$\mu = n\pi = 20(.75) = 15$
$\sigma^2 = n\pi(1-\pi) = 20(.75)(.25) = 3.75$
$\sigma = (3.75)^{1/2} = 1.94$

$P(Y \ge 17.5) = P\left(\dfrac{Y-15}{1.94} \ge \dfrac{17.5-15}{1.94}\right) = P(Z \ge 1.29) = 1 - .901 = .099$

This is a reasonable approximation to the exact binomial probability of $P(Y \ge 18) = .091$.

Means and Variances of Random Variables

Means of random variables are called expected values, denoted E(X). If X is a
continuous RV, then $\mu_X = E(X) = \int x f(x)\,dx$.

If X is a discrete RV, then $\mu_X = E(X) = \sum_i x_i\,p(x_i)$.

Variances of random variables are also expected values. If X is a continuous RV, then

$\sigma_X^2 = E\left((X-\mu_X)^2\right) = \int (x-\mu_X)^2 f(x)\,dx$

If X is a discrete RV, then

$\sigma_X^2 = E\left((X-\mu_X)^2\right) = \sum_i (x_i-\mu_X)^2\,p(x_i)$

Means and Variances of Linear Functions of Random Variables

If X is an RV with mean $\mu_X$ and variance $\sigma_X^2$, and $Y = a + bX$, where a and b are constants, then

$\mu_Y = E(a + bX) = a + bE(X) = a + b\mu_X$

and

$\sigma_Y^2 = E\left((Y-\mu_Y)^2\right) = E\left(((a+bX) - (a+b\mu_X))^2\right) = E\left(b^2(X-\mu_X)^2\right) = b^2\sigma_X^2$

If $X_1, \ldots, X_k$ are RVs and $Y = X_1 + \cdots + X_k$, then

$\mu_Y = E(X_1 + \cdots + X_k) = E(X_1) + \cdots + E(X_k) = \mu_1 + \cdots + \mu_k$

If $X_1, \ldots, X_k$ are independent RVs and $Y = X_1 + \cdots + X_k$, then

$\sigma_Y^2 = E\left((X_1 + \cdots + X_k - \mu_1 - \cdots - \mu_k)^2\right) = E\left((X_1-\mu_1)^2\right) + \cdots + E\left((X_k-\mu_k)^2\right) = \sigma_1^2 + \cdots + \sigma_k^2$

Linear Regression Analysis

Simple Linear Regression

A homeowner recorded the amount of electricity in kilowatt-hours (KWH) consumed in his house on each of 21 days. He also recorded the number of hours his air conditioner (AC) was turned on and the number of times his electric clothes dryer (DRYER) was operated. His objective was to relate the KWH consumption to the AC and DRYER usage. In addition, he wanted to know how many KWH the AC used per hour and the number of KWH used in each run of the DRYER. Statistical regression analysis can serve this purpose. Following are the data in tabular form.

kwh   ac (hours)   dryer (runs)
35     1.5         1
63     4.5         2
66     5.0         2
17     2.0         0
94     8.5         3
79     6.0         3
93    13.5         1
66     8.0         1
94    12.5         1
82     7.5         2
78     6.5         3
65     8.0         1
77     7.5         2
75     8.0         2
62     7.5         1
85    12.0         1
43     6.0         0
57     2.5         3
33     5.0         0
65     7.5         1
33     6.0         0

In regression terminology, KWH is called the dependent variable, and AC and DRYER are called the independent variables. The names dependent and independent come from the notion that the amount of KWH consumption depends on the amount of AC hours and DRYER usage. Usually, dependent variables are denoted y-variables and independent variables are denoted x-variables.

The prime objective of linear regression analysis is to obtain an equation of the form

$KWH = b_0 + b_1\,AC + b_2\,DRYER$

that quantifies the dependency of KWH on AC and DRYER. To get started, we shall investigate the dependency of KWH on AC alone. Later we shall explore the dependency of KWH on AC
and DRYER simultaneously. Figure 1 shows a plot of KWH versus AC.

Figure 1. Kilowatt Hours versus AC Hours (plot title: Household Electricity Consumption Data).

Figure 1 shows that KWH increases as AC increases, as you would expect. In this application the rate of increase is more important than the simple fact that there is an increase. We already knew that the air conditioner consumes electricity, and therefore that KWH will increase with AC. What we want to know is how much electricity the AC is using per hour; that is, the rate of consumption.

We shall obtain an equation

$KWH = b_0 + b_1\,AC$

that will be used to quantify the rate of increase in KWH as a function of AC. This is the equation of a straight line. The coefficient $b_1$ is the slope of the straight line, and it represents the rate of increase of KWH with AC. The coefficient $b_0$ is the intercept of the line. It represents the amount of KWH consumption when AC = 0. The equation turns out to be

$KWH = 27.85 + 5.34\,AC$

This equation is plotted through the data in Figure 2.

Figure 2. Regression of KWH versus AC (plot title: Household Electricity Consumption Data).

The number 5.34 is an estimate of the amount of electricity, in KWH, consumed for each hour the air conditioner is turned on. The number 27.85 is an estimate of the amount of electricity consumed per day by all other electrical devices in the house.

Some things to think about:

1. How precise is 5.34 as an estimate of the true rate of KWH consumption by the air conditioner? Can we use 5.34 to construct a confidence interval about the true rate?
2. How accurate is 5.34 as an estimate of the true rate of KWH consumption by the air conditioner? If we did the experiment over and over again, would the estimates we obtain be clustered about the true rate? Is the expected value of the estimates equal to the true rate?
3. What other uses can be made of the
regression equation? Can we predict the amount of KWH consumption on a certain day if we know the air conditioner usage was, say, AC = 8 hours? What can we say about the accuracy and precision of the prediction?
4. How well does the equation fit the data? Is a linear equation appropriate for this application?

The process of obtaining the equation of the line and making inference about the coefficients is called Linear Regression Analysis. At the heart of linear regression analysis is a statistical model.

Simple Linear Regression Model

The expression Simple Linear Regression refers to regression applications in which there is only one independent and one dependent variable. A simple linear regression model is given by the equation

$y = \beta_0 + \beta_1 x + \varepsilon$   (1)

where $\beta_0$ and $\beta_1$ are unknown parameters and $\varepsilon$ is a random variable, usually considered normally distributed. The model equation (1) states that a value of y is equal to a linear function of x plus a random quantity $\varepsilon$. The parameters $\beta_0$ and $\beta_1$ are the intercept and slope of the regression line.

In the electricity consumption example, y = KWH and x = AC would yield a simple linear regression model for relating KWH to AC. The parameter $\beta_1$ is the expected KWH consumed per hour of use of the AC, and the parameter $\beta_0$ is the expected combined KWH used by all other electrical devices in the house per day. These parameters are population quantities and cannot be known exactly, but we can estimate them from the data. The quantity $\varepsilon$ is a random quantity that accounts for random deviation from expected KWH consumption.

For example, suppose the AC is turned on for x = eight hours on a particular day. What is the value of y = KWH consumption? The expected consumption (that is, the mean consumption in the conceptual subpopulation of all similar days that the AC would be turned on for eight hours) is $E(y) = \beta_0 + \beta_1 x$. The actual KWH consumption for the particular day in question is the mean for that population plus the random quantity $\varepsilon$ to account for deviation from the mean for that particular day.
That is to say,

$y = E(y) + \varepsilon = \beta_0 + \beta_1 x + \varepsilon$   (2)

The mean (i.e., expected) KWH consumption for a day with known AC usage cannot be calculated because it involves the unknown parameters $\beta_0$ and $\beta_1$. If the random $\varepsilon$ values are distributed with mean 0 and variance $\sigma^2$, then the subpopulation of KWH values also has variance $\sigma^2$.

When it is necessary to write the model equation in reference to a particular observation, a subscript can be inserted on x and y. Generically, the subscript j could be used to indicate the j-th observation, and the model equation would be

$y_j = \beta_0 + \beta_1 x_j + \varepsilon_j$   (3)

Fitting the Simple Linear Regression Model

In applications, the model is fitted to data using the method of least squares, giving the prediction equation

$\hat y = \hat\beta_0 + \hat\beta_1 x$   (4)

where $\hat\beta_0$ and $\hat\beta_1$ are estimates of $\beta_0$ and $\beta_1$, and $\hat y$ is a predicted value of y obtained by inserting a value of x into the prediction equation. The prediction equation is also useful to estimate the mean $E(y) = \beta_0 + \beta_1 x$ of the subpopulation of y values corresponding to a given value of x. The caret above the parameters is called "hat" and is used to distinguish the actual parameters from their estimates. Thus $\hat\beta_1$ is called "beta-one hat." Likewise, the estimate $\hat\sigma^2$ of $\sigma^2$ is called "sigma-squared hat."

All these parameter estimates can be computed from five summary statistics:

mean of x's: $\bar x = \sum x / n$
mean of y's: $\bar y = \sum y / n$
sum of squares of x's: $S_{xx} = \sum (x-\bar x)^2 = \sum x^2 - (\sum x)^2/n$   (5)
sum of squares of y's: $S_{yy} = \sum (y-\bar y)^2 = \sum y^2 - (\sum y)^2/n$
sum of products of x's and y's: $S_{xy} = \sum (x-\bar x)(y-\bar y) = \sum xy - (\sum x)(\sum y)/n$

where n is the total number of data points. In the electricity consumption example, n = 21 and the summary statistics are

$\bar x = 145.5/21 = 6.93$
$\bar y = 1362/21 = 64.86$
$S_{xx} = 1204.75 - (145.5)^2/21 = 1204.75 - 1008.11 = 196.64$
$S_{yy} = 97914 - (1362)^2/21 = 97914 - 88335.4 = 9578.6$
$S_{xy} = 10487 - (145.5)(1362)/21 = 10487 - 9436.7 = 1050.3$

The regression parameter estimates are

$\hat\beta_1 = S_{xy}/S_{xx} = 1050.3/196.64 = 5.34$   (6)
$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 64.86 - 5.34(6.93) = 27.85$

This gives the prediction equation KWH = 27.85 + 5.34 AC.

Using the Prediction Equation

The prediction equation is useful for two purposes. It can be used to estimate the mean of the subpopulation of amounts of electricity
consumed on all days when the AC is turned on for a specified number of hours. It is also useful to predict the amount of electricity used on a particular day when the AC is turned on for a specified number of hours.

For example, consider the conceivable days when the AC could be turned on for 10 hours. The value from the prediction equation corresponding to AC = 10 is 27.85 + 5.34(10) = 81.25 kilowatt-hours. This number is an estimate of the mean KWH consumption on all days when the AC is turned on for 10 hours.

Also, suppose the AC was turned on for 10 hours on a particular day of interest. The homeowner could use the prediction equation to predict the amount of electricity used on that day to be 81.25 KWH.

Accounting for Variation in Simple Linear Regression

Another aspect of regression analysis is accounting for the variation in the dependent variable as it relates to variation in the independent variable. A fundamental equation is

$(y - \bar y) = (y - \hat y) + (\hat y - \bar y)$   (7)

Equation (7) states that a deviation of a value of the dependent variable from the overall mean, $(y - \bar y)$, is equal to the sum of a deviation of the dependent variable from the predicted value, $(y - \hat y)$, plus the deviation of the predicted value from the overall mean, $(\hat y - \bar y)$. It can be shown that $(\hat y - \bar y) = 0$ when $x = \bar x$. Also, $(\hat y - \bar y)$ changes by an amount $\hat\beta_1$ for each unit of change in x. Thus the deviation $(\hat y - \bar y)$ depends directly on the independent variable x. But the deviation $(y - \hat y)$ can be large or small, and positive or negative, for any value of x. So the deviation $(y - \hat y)$ does not depend directly on x.

It turns out that the sums of squares of the deviations in equation (7) obey a similar equation:

$\sum (y - \bar y)^2 = \sum (y - \hat y)^2 + \sum (\hat y - \bar y)^2$   (8)

The sums of squared deviations have names. The left side of the equation is called the total sum of squares and is denoted $SSTotal = \sum (y - \bar y)^2$. The terms on the right side are the error and regression sums of squares, $SSError = \sum (y - \hat y)^2$ and $SSRegression = \sum (\hat y - \bar y)^2$. For brevity we write SSR = SSRegression, SSE = SSError, and SST = SSTotal, and equation (8) takes the form

SST = SSR + SSE   (9)

The total sum of squares SST is
a measure of the total variation in the values of the dependent variable y. The regression sum of squares SSR is a measure of the variation in y that is attributable to variation in the independent variable x. Finally, the error sum of squares measures the variation in y that is not attributable to changes in x. Thus equation (9) shows the fundamental partitioning of the total variation into the portion attributable to x and the portion not attributable to x.

The coefficient of determination, usually denoted $R^2$, is SSR divided by SST, and thus measures the proportion of total variation in y that is attributable to variation in x:

$R^2 = SSR/SST$   (10)

Analysis of Variance for Simple Linear Regression

An analysis of variance associated with the regression is

Source of Variation   DF     SS    MS
Regression            1      SSR   MSR       (11)
Error                 n-2    SSE   MSE
Total                 n-1    SST

The column headed DF contains degrees of freedom for the sums of squares. The column headed MS contains mean squares, which are the corresponding sums of squares divided by the degrees of freedom. The error mean square MSE is an estimate of $\sigma^2$, the variance of the errors: $\hat\sigma^2 = MSE$.

Analysis of Variance Computations for Simple Linear Regression

The sums of squares in the ANOVA table can be calculated from the summary statistics as

$SSR = S_{xy}^2/S_{xx}$, $SST = S_{yy}$, and $SSE = S_{yy} - S_{xy}^2/S_{xx}$

These computations for the KWH data are

$SSR = (1050.3)^2/196.64 = 1103130/196.64 = 5609.7$, $SST = 9578.6$, and $SSE = 9578.6 - 5609.7 = 3968.9$

For the KWH data the analysis of variance (ANOVA) is

Source of Variation   DF    SS       MS
Regression            1     5609.7   5609.7
Error                 19    3968.9   208.9
Total                 20    9578.6

The estimate of $\sigma^2$ is $\hat\sigma^2 = MSE = 208.9$, and the estimate of the error standard deviation is $\hat\sigma = (208.9)^{.5} = 14.45$.

Statistical Inference in Simple Linear Regression

Standard errors of the regression parameter estimates $\hat\beta_1$ and $\hat\beta_0$ are needed in order to make statistical inference about the parameters $\beta_1$ and $\beta_0$. The variances of the sampling distributions of $\hat\beta_1$ and $\hat\beta_0$ are

$V(\hat\beta_1) = \sigma^2/S_{xx}$   (12)
$V(\hat\beta_0) = \sigma^2\left(1/n + \bar x^2/S_{xx}\right)$   (13)

The standard errors of the parameter estimates are
obtained by inserting $\hat\sigma^2 = MSE$ in place of $\sigma^2$ and then taking the square root. Thus the standard error of $\hat\beta_1$ is $se(\hat\beta_1) = (MSE/S_{xx})^{.5}$. For the KWH data,

$se(\hat\beta_1) = (208.9/196.64)^{.5} = 1.03$   (14)

Tests of hypotheses and confidence intervals can be constructed using the parameter estimates and standard errors. The hypothesis $H_0: \beta_1 = \beta_{10}$, where $\beta_{10}$ is a known constant, can be tested using the test statistic

$t = (\hat\beta_1 - \beta_{10})/se(\hat\beta_1)$   (15)

This statistic has a t distribution with n-2 degrees of freedom when $H_0$ is true. A 95% confidence interval for $\beta_1$ is

$\hat\beta_1 \pm t_{.025}\,se(\hat\beta_1)$   (16)

In some applications it is useful to test the hypothesis $H_0: \beta_1 = 0$. For the KWH example this hypothesis would not be of interest, because there is no question that the AC uses electricity. The interesting inference is about the amount of electricity consumed. As already seen, the estimate is $\hat\beta_1 = 5.34$ KWH per hour. A confidence interval would give more meaningful inference about $\beta_1$ than a test of the hypothesis $H_0: \beta_1 = 0$. With 19 degrees of freedom, $t_{.025} = 2.1$. The confidence interval is $5.34 \pm 2.1(1.03)$, or $5.34 \pm 2.16$.

Prediction and Estimation of Subpopulation Means

It is often useful to make inference about the mean of a subpopulation corresponding to a given value of x, say $x_0$. The estimate of the subpopulation mean is obtained by inserting $x_0$ into the prediction equation. That is, the estimate of $E(y) = \beta_0 + \beta_1 x_0$ is $\hat y = \hat\beta_0 + \hat\beta_1 x_0$. The standard error of $\hat y = \hat\beta_0 + \hat\beta_1 x_0$ is needed to make statistical inference about $E(y) = \beta_0 + \beta_1 x_0$. The variance of the sampling distribution is

$V(\hat y) = \sigma^2\left(1/n + (x_0 - \bar x)^2/S_{xx}\right)$   (17)

Therefore, the standard error is

$se(\hat y) = \left[MSE\left(1/n + (x_0 - \bar x)^2/S_{xx}\right)\right]^{.5}$   (18)

A 95% confidence interval for the mean of the subpopulation of y values corresponding to $x = x_0$ is

$\hat\beta_0 + \hat\beta_1 x_0 \pm t_{.025}\left[MSE\left(1/n + (x_0 - \bar x)^2/S_{xx}\right)\right]^{.5}$   (19)

When $\hat y$ is used to predict a value of y, the relevant variance is the prediction variance

$V(\hat y - y) = \sigma^2\left(1 + 1/n + (x_0 - \bar x)^2/S_{xx}\right)$   (20)

The relevant standard error is

$se(\hat y - y) = \left[MSE\left(1 + 1/n + (x_0 - \bar x)^2/S_{xx}\right)\right]^{.5}$   (21)

A 95% prediction interval for the y value corresponding to $x = x_0$ is

$\hat\beta_0 + \hat\beta_1 x_0 \pm t_{.025}\left[MSE\left(1 + 1/n + (x_0 - \bar x)^2/S_{xx}\right)\right]^{.5}$   (22)

Consider the conceivable days when the AC could be turned on for 10 hours. A 95% confidence interval for the
mean KWH consumption on those days is

$81.25 \pm 2.1\left[208.9\left(1/21 + (10 - 6.93)^2/196.64\right)\right]^{.5}$, or $81.25 \pm 9.38$ = (71.87, 90.63)

Now consider the particular day when the homeowner had the AC turned on for 10 hours. A 95% prediction interval is

$81.25 \pm 2.1\left[208.9\left(1 + 1/21 + (10 - 6.93)^2/196.64\right)\right]^{.5}$, or $81.25 \pm 31.77$ = (49.48, 113.02)

SAS Program for Simple Linear Regression Analysis of KWH Data

    options nonumber nodate;
    title1 'Household Electricity Consumption Data';
    title2 'Simple Linear Regression Analysis';
    data kilowatt;
      input kwh ac dryer;
      cards;
    35  1.5 1
    63  4.5 2
    66  5.0 2
    17  2.0 0
    94  8.5 3
    79  6.0 3
    93 13.5 1
    66  8.0 1
    94 12.5 1
    82  7.5 2
    78  6.5 3
    65  8.0 1
    77  7.5 2
    75  8.0 2
    62  7.5 1
    85 12.0 1
    43  6.0 0
    57  2.5 3
    33  5.0 0
    65  7.5 1
    33  6.0 0
    ;
    proc print; run;
    proc reg data=kilowatt;
      model kwh=ac;
      plot kwh*ac;
    run;

Multiple Linear Regression Model

Multiple Linear Regression refers to regression applications in which there is more than one independent variable, $x_1, x_2, \ldots, x_k$. A multiple linear regression model with k independent variables has the equation

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$   (23)

The $\varepsilon$ is a random variable with mean 0 and variance $\sigma^2$. A prediction equation for this model fitted to data is

$\hat y = b_0 + b_1 x_1 + \cdots + b_k x_k$   (24)

where $\hat y$ denotes the predicted value computed from the equation, and $b_i$ denotes an estimate of $\beta_i$. These estimates are usually obtained by the method of least squares. This means finding, among the set of all possible values for the parameter estimates, the ones which minimize the sum of squared residuals $\sum (y - \hat y)^2$. This yields the best-fitting equation in terms of minimizing the sum of squared distances of the fitted plane to the data points.

An example of a multiple linear regression with two independent variables is given by the KWH data, but now with $x_1$ = AC and $x_2$ = DRYER. The model equation would be

$KWH = \beta_0 + \beta_1\,AC + \beta_2\,DRYER + \varepsilon$

Least squares parameter estimates are $b_0 = 8.11$, $b_1 = 5.47$, and $b_2 = 13.22$. Computation of the estimates by hand is tedious. They are ordinarily obtained using a regression computer program. Standard errors also are usually part of the output from a regression program. The prediction equation is KWH = 8.11
+ 5.47 AC + 13.22 DRYER.

This model ascribes 5.47 KWH to hourly use of the AC, 13.22 KWH to each use of the DRYER, and 8.11 KWH to all other electrical devices.

Compare this prediction equation with the one including only AC in the model, KWH = 27.85 + 5.34 AC. The intercept estimate has changed substantially, from 27.85 to 8.11. This change occurs because KWH consumption due to DRYER usage is combined into the intercept estimate in the model that does not contain DRYER. The estimate of the coefficient on AC has changed very little, from 5.34 to 5.47. This is related to the fact that AC and DRYER usage are relatively uncorrelated. In other words, use of one is not related to use of the other. Generally speaking, if AC and DRYER were positively (negatively) correlated, then the regression coefficient on AC would be reduced (increased) when DRYER was added to the model.

Compare the values of predicted KWH from the two models. Previously, AC = 10 was inserted in the simple linear prediction equation to get KWH = 27.85 + 5.34(10) = 81.25. A value of DRYER must also be inserted into the multiple regression equation to get a predicted KWH value. Trying DRYER = 0, 1, and 2 gives

KWH = 8.11 + 5.47(10) + 13.22(0) = 62.81
KWH = 8.11 + 5.47(10) + 13.22(1) = 76.03
KWH = 8.11 + 5.47(10) + 13.22(2) = 89.25

An analysis of variance for a multiple linear regression model with k independent variables fitted to a data set with n observations is

Source of Variation   DF       SS    MS
Regression            k        SSR   MSR       (25)
Error                 n-k-1    SSE   MSE
Total                 n-1      SST

The sums of squares SSR, SSE, and SST have the same definitions in relation to the model as in simple linear regression:

$SSR = \sum (\hat y - \bar y)^2$
$SSE = \sum (y - \hat y)^2$   (26)
$SST = \sum (y - \bar y)^2$

Also, SST = SSR + SSE. The value of SST does not change with the model. It depends only on the values of the dependent variable y. But SSE decreases as variables are added to a model, and SSR increases by the same amount. This amount of increase in SSR is the amount of variation due to variables in the larger model that was not accounted for by variables in the smaller model. This increase in regression sum of squares is sometimes denoted
SSRadded variables l original variables 27 where original variables represents the list of independent variables that were in the model prior to adding new variables and added variables represents the list of variables that were added to obtain the new model The overall SSR for the new model can be partitioned into the variation attributable to the original variables plus the variation due to the added variables that is not due to the original variables SSRall variables SSRoriginal variables 28 SSRadded variables l original variables Generally speaking larger values of the coefficient of determination R2SSIUSST indicate a better tting model The value of R2 must necessarily increase as variables are added to the model However this does not necessarily mean that the model has actually been improved The amount of increase in R2 can be a mathematical artifact rather than a meaningful indication of an improved model Sometimes an adjusted R2 is used to overcome this shortcoming of the usual R2 Most regression computer programs include both versions of R2 The analysis of variance for the twovariable model fitted to the KWH data is Source of Variation DF SS MS Regression 2 92998 46499 Error 18 2788 155 Total 20 95786 Adding DRYER to the model affected a dramatic change in the value of SSR which increased from 56097 to 92998 The value of SSE dropped accordingly from 39689 to 2788 The coefficient of determination is now R29299895786097 The two variables AC and DRYER account for 97 of the variability in KWH consumption in the house This is up from R25609795786058 for the variable AC alone The regression sum of squares partitioned into the amount due to AC alone plus the amount due to DRYER that was not attributable to AC is SSRAC and DRYER SSRAC SSRDRYERlAC 92998 56097 36901 Thus 36901 is the amount of variation due to DRYER that was not accounted for by AC Statistical inference about the parameters requires standard errors of the estimates A 95 confidence interval for A is 
bi i tde 025sebi 29 where mum is the critical value from a t distribution with dfmkl the degrees of freedom for error Standard errors for parameters in the twovariable model are seb0 248 seb1 028 sebz 086 The critical value from a t distribution with dfl8 is I180252 1 Thus a 95 con dence interval for l is 1 i 1 18025Seb1 547 i 21028 547 i 059 Thus we are 95 con dent that the true hourly KWH consumption of the AC is between 488 and 606 This is a considerably shorter interval than the interval 534i2 16 that was obtained from the simple linear regression model because the variance estimate MSE has been reduced from 2089 to 155 SAS Program for Simple Linear Regression Analysis of KWH Data options nonumber nodate Titlel Household Electricity Consumption Data Title2 Multiple Linear Regression Analysis data kilowatt input kWh ac dryer cards 35 15 l 63 5 2 66 5 O 2 l7 2 O O 94 8 5 3 79 6 O 3 93 13 5 l 66 8 O l 94 12 5 l 82 7 5 2 78 65 3 65 80 l 77 7 5 2 75 8 O 2 62 7 5 l 85 12 O l 43 6 O O 57 2 5 3 33 5 O O 65 7 5 l 33 6 O O proc print run proc reg datakilowatt model kWhac dryer run
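The least squares estimates, sums of squares, and interval estimates quoted for the KWH data can be cross-checked without SAS. The following is a minimal Python sketch, not part of the original notes. It assumes the data rows are the 21 records from the `cards` section with decimal points restored (for example `8 5` read as 8.5), and it takes the critical value 2.1 from the text rather than computing it. The helper names `invert` and `fit` are hypothetical, introduced only for this illustration.

```python
import math

# (kWh, ac, dryer) rows transcribed from the SAS data step; decimal points
# in the ac column are restored by assumption (e.g. "8 5" read as 8.5).
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]
T_CRIT = 2.1  # t critical value quoted in the notes (not computed here)


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [row[n:] for row in m]


def fit(y, columns):
    """Least squares fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    inv = invert(xtx)
    b = [sum(inv[a][k] * xty[k] for k in range(p)) for a in range(p)]
    fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in X]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse = sse / (n - p)  # error df = n - k - 1
    se = [math.sqrt(mse * inv[a][a]) for a in range(p)]
    return {"b": b, "se": se, "ssr": sst - sse, "sse": sse,
            "sst": sst, "mse": mse}


kwh = [row[0] for row in DATA]
ac = [row[1] for row in DATA]
dryer = [row[2] for row in DATA]

simple = fit(kwh, [ac])        # KWH on AC only
full = fit(kwh, [ac, dryer])   # KWH on AC and DRYER

# Extra sum of squares SSR(DRYER | AC) and the two R^2 values
ssr_dryer_given_ac = full["ssr"] - simple["ssr"]
r2_simple = simple["ssr"] / simple["sst"]
r2_full = full["ssr"] / full["sst"]

# 95% confidence interval for the AC coefficient in the two-variable model
half = T_CRIT * full["se"][1]
ci_b1 = (full["b"][1] - half, full["b"][1] + half)

# Interval estimates at AC = 10 from the simple model
n = len(kwh)
xbar = sum(ac) / n
sxx = sum((x - xbar) ** 2 for x in ac)
yhat10 = simple["b"][0] + simple["b"][1] * 10
lever = 1 / n + (10 - xbar) ** 2 / sxx
ci_half = T_CRIT * math.sqrt(simple["mse"] * lever)        # mean response
pi_half = T_CRIT * math.sqrt(simple["mse"] * (1 + lever))  # single new day

print("simple b:", [round(v, 2) for v in simple["b"]])
print("full b:  ", [round(v, 2) for v in full["b"]])
print("full se: ", [round(v, 2) for v in full["se"]])
print("SSR(DRYER | AC):", round(ssr_dryer_given_ac, 1))
print("R^2 simple, full:", round(r2_simple, 2), round(r2_full, 2))
print("95% CI for AC coefficient:", tuple(round(v, 2) for v in ci_b1))
print("AC=10: prediction, CI half-width, PI half-width:",
      round(yhat10, 2), round(ci_half, 2), round(pi_half, 2))
```

The sketch solves the normal equations directly, which is adequate for a data set this small; a regression program would normally use a more numerically stable decomposition. Small differences from the figures in the notes (for example in the predicted value at AC = 10) come from the notes using coefficients rounded to two decimals.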
