# PRIN OF STATISTICS STAT 101

ISU

This 187 page Class Notes was uploaded by Giovani Ullrich PhD on Saturday September 26, 2015. The Class Notes belongs to STAT 101 at Iowa State University taught by Staff in Fall. Since its upload, it has received 11 views. For similar materials see /class/214413/stat-101-iowa-state-university in Statistics at Iowa State University.

Date Created: 09/26/15

## Stat 101L Lecture 26: Sampling Distribution Models

[Diagram: a population with parameter $p$; sample selection produces a sample with statistic $\hat{p}$; inference runs from the sample back to the population.]

### Sampling Distribution of $\hat{p}$

- Shape: approximately Normal.
- Center: the mean is $p$.
- Spread: the standard deviation is $\sqrt{\dfrac{p(1-p)}{n}}$.

### Sampling Distribution of $\hat{p}$: Conditions

- 10% Condition: the size of the sample should be less than 10% of the size of the population.
- Success/Failure Condition: $np$ and $n(1-p)$ should both be greater than 10.

### 68-95-99.7 Rule

[Figure: Normal model illustrating the 68-95-99.7 rule.]

### Probability

If the population proportion $p$ is known, we can find the probability (or chance) that $\hat{p}$ takes on certain values using a normal model.

### Inference

- In practice the population parameter $p$ is not known, and we would like to use a sample to tell us something about $p$.
- Use the sample proportion $\hat{p}$ to make inferences about the population proportion $p$.

### Example

- Population: all adults in the US.
- Parameter: proportion of all adults in the US who feel that abortion should be legal. Unknown!
- Sample: 1130 randomly selected adults nationwide (ABC News/Washington Post Poll, Jan 9-12, 2008).
- Statistic: 644 of the 1130 adults in the sample (57%) answered that abortion should be legal.

### 68-95-99.7 Rule

95% of the time the sample proportion $\hat{p}$ will be between

$$p - 2\sqrt{\frac{p(1-p)}{n}} \quad\text{and}\quad p + 2\sqrt{\frac{p(1-p)}{n}}.$$

Equivalently, 95% of the time the sample proportion $\hat{p}$ will be within $2\sqrt{\dfrac{p(1-p)}{n}}$ (two standard deviations) of $p$.

### Standard Deviation

Because $p$, the population proportion, is not known, the standard deviation $SD(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}}$ is also unknown.

### Standard Error

Substitute $\hat{p}$ as our estimate (best guess) of $p$. The standard error of $\hat{p}$ is

$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

- About 95% of the time the sample proportion $\hat{p}$ will be within $2\,SE(\hat{p}) = 2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ (two standard errors) of $p$.
- Equivalently, about 95% of the time $p$ will be within two standard errors of $\hat{p}$.

### Confidence Interval for $p$

We are 95% confident that $p$ will fall between

$$\hat{p} - 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad\text{and}\quad \hat{p} + 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

## Stat 101L Lecture 5

### Measure of Center

- Sample mean: $\bar{y} = \dfrac{\text{Total}}{n} = \dfrac{\sum y}{n}$
### Sample Mean

Total $= 8669$, $n = 24$, so $\bar{y} = \dfrac{\text{Total}}{n} = \dfrac{8669}{24} = 361.2$ grams.

### Mean or Median?

- The sample mean is the balance point of the distribution.
- The sample median divides the distribution into a lower and an upper half.
- For skewed data, the mean is pulled in the direction of the skew.

### Numerical Summaries

- How much variation is there in the data?
- Look for the spread of the distribution.
- What do we mean by spread?

### Measures of Spread

- Sample Range
  - The distance from the minimum to the maximum: Range $= 378 - 349 = 29$ grams.
  - The length of the interval that contains 100% of the data.
  - Greatly affected by outliers.

### Quartiles

- Medians of the lower and upper halves of the data.
- Trying to split the data into fourths (quarters).
- Lower quartile: $(354 + 354)/2 = 354$ grams.
- Upper quartile: $(368 + 369)/2 = 368.5$ grams.

[Stem-and-leaf plot of the can weights, stems 34 to 37; the leaves are not recoverable.]

### Measure of Spread

- InterQuartile Range (IQR)
  - The distance between the quartiles: IQR $= 368.5 - 354 = 14.5$ grams.
  - The length of the interval that contains the central 50% of the data.

### Five Number Summary

- Minimum: 349 grams
- Lower Quartile: 354 grams
- Median: 359.5 grams
- Upper Quartile: 368.5 grams
- Maximum: 378 grams

### Box Plots

- Establish an axis with a scale.
- Draw a box that extends from the lower to the upper quartile.
- Draw a line from the lower quartile to the minimum and another line from the upper quartile to the maximum.

### Outlier Box Plots

- Establishes boundaries on what are usual values based on the width of the box.
- Values outside the boundaries are flagged as potential outliers.

[Figure: outlier box plot, "Contents of Cans of Cola"; horizontal axis: Weight (grams).]

### Measures of Spread

- Based on the deviation from the sample mean.
- Deviation: $y - \bar{y}$

9-hole golf scores: 46, 44, 50, 43, 47, 52, with $\bar{y} = 47$ strokes.

Deviations: $-1, -3, 3, -4, 0, 5$.

[Figure: dot plots of the scores and of the deviations, axis from 40 to 55.]

### Sample Variance

Almost the average squared deviation:

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

$$s^2 = \frac{1 + 9 + 9 + 16 + 0 + 25}{5} = \frac{60}{5} = 12 \text{ strokes}^2$$

### Sample Standard Deviation

$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$

$s = \sqrt{12}$
$= 3.46$ strokes

### Which summary is better?

- For symmetric distributions, use the sample mean $\bar{y}$ and sample standard deviation $s$.
- For skewed distributions, use the five number summary.

### Why?

- For symmetric distributions, the sample mean and sample median should be approximately equal, so either would work.
- We will see in Chapter 6 why the sample standard deviation is best for symmetric distributions.
- For skewed distributions, the sample mean and standard deviation will be affected by the skew and/or potential outliers. The five number summary displays the skew and is not affected by outliers.

## Handout 2: Descriptive Time Series Statistics and Introduction to Autoregression

Class notes for Statistics 451: Applied Time Series, Iowa State University. Copyright 2004 W. Q. Meeker. January 7, 2007, 17h 8min.

### Populations

- A population is a collection of identifiable units (or a specified characteristic of these units).
- A frame is a listing of the units in the population.
- We take a sample from the frame and use the resulting data to make inferences about the population.
- Simple random sampling with replacement from a population implies independent and identically distributed (iid) observations.
- Standard statistical methods use the iid assumption for observations or residuals (in regression).

### Processes

- A process is a system that transforms inputs into outputs, operating over time.

  [Diagram: inputs entering a process, outputs leaving it.]

- A process generates a sequence of observations (data) over time.
- We can use a realization from the process to make inferences about the process.
- The iid model is rarely appropriate for processes (observations close together in time are typically correlated, and the process often changes with time).

### Process, Realization, Model, Inference, and Applications

[Diagram: a process generates a realization (data $Z_1, Z_2, \ldots, Z_n$); a statistical model for the process is fitted, giving estimates for the model and inferences about the process.]

### Stationary Stochastic Processes

$Z_t$ is a stochastic process. Some properties of $Z_t$ include mean $\mu_t$, variance $\sigma_t^2$, and
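autocorrelation $\rho_{Z_t, Z_{t+k}}$. In general, these can change over time.

- Strictly stationary (also strongly or completely stationary):
  $F(z_1, \ldots, z_n) = F(z_{1+k}, \ldots, z_{n+k})$. Difficult to check.
- 2nd order weakly stationary (or covariance stationary) requires only that $\mu = \mu_t$ and $\sigma^2 = \sigma_t^2$ be constant and that $\rho_k = \rho_{Z_t, Z_{t+k}}$ depend only on $k$. Easy to check with sample statistics.
- Generally, "stationary" is understood to be covariance stationary.

[Figure: Change in Business Inventories 1955-1969; vertical axis: billions of dollars; horizontal axis: year.]

### Estimation of Stationary Process Parameters

| Process parameter | Notation | Estimate | Formula number in Wei |
|---|---|---|---|
| Mean of $Z$ | $\mu_Z = E(Z)$ | $\hat{\mu}_Z = \bar{z} = \dfrac{\sum_{t=1}^{n} z_t}{n}$ | 2.5.1 |
| Variance of $Z$ | $\gamma_0 = \sigma_Z^2 = \mathrm{Var}(Z)$ | $\hat{\gamma}_0 = \dfrac{\sum_{t=1}^{n} (z_t - \bar{z})^2}{n}$ | 2.5.8 |
| Standard deviation | $\sigma_Z$ | $\hat{\sigma}_Z = \sqrt{\hat{\gamma}_0}$ | |
| Autocovariance | $\gamma_k$ | $\hat{\gamma}_k = \dfrac{\sum_{t=1}^{n-k} (z_t - \bar{z})(z_{t+k} - \bar{z})}{n}$ | 2.5.8 |
| Autocorrelation | $\rho_k = \dfrac{\gamma_k}{\gamma_0}$ | $\hat{\rho}_k = \dfrac{\hat{\gamma}_k}{\hat{\gamma}_0}$ | 2.5.18 |

### Data Analysis Strategy

[Flowchart: tentative identification (time series plot, range-mean plot, ACF and PACF); estimation (least squares or maximum likelihood); diagnostic checking (residual analysis and forecasts); "Model OK?"; if yes, use the model. The remaining boxes are not recoverable.]

[Figure: Wolfer Sunspot Numbers 1770-1869.]

### Correlation [From "Statistics 101"]

Consider random data $x$ and $y$ (e.g., sales and advertising):

| $x$ | $y$ |
|---|---|
| $x_1$ | $y_1$ |
| $x_2$ | $y_2$ |
| $\vdots$ | $\vdots$ |
| $x_n$ | $y_n$ |

$\rho_{x,y}$ denotes the population correlation between all values of $x$ and $y$ in the population. To estimate $\rho_{x,y}$, we use the sample correlation

$$\hat{\rho}_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

### Sample Autocorrelation

Consider the random time series realization $Z_1, Z_2, \ldots, Z_n$.

Lagged variables:

| $t$ | $z_t$ | $z_{t+1}$ | $z_{t+2}$ | $z_{t+3}$ | $z_{t+4}$ |
|---|---|---|---|---|---|
| 1 | $Z_1$ | $Z_2$ | $Z_3$ | $Z_4$ | $Z_5$ |
| 2 | $Z_2$ | $Z_3$ | $Z_4$ | $Z_5$ | $Z_6$ |
| 3 | $Z_3$ | $Z_4$ | $Z_5$ | $Z_6$ | $Z_7$ |
| $\vdots$ | | | | | |
| $n-2$ | $Z_{n-2}$ | $Z_{n-1}$ | $Z_n$ | - | - |
| $n-1$ | $Z_{n-1}$ | $Z_n$ | - | - | - |
| $n$ | $Z_n$ | - | - | - | - |

Assuming covariance stationarity, let $\rho_k$ denote the process correlation between observations separated by $k$ time periods. To compute the order $k$ sample autocorrelation (i.e., the correlation between $z_t$ and $z_{t+k}$):

$$\hat{\rho}_k = \frac{\sum_{t=1}^{n-k} (z_t - \bar{z})(z_{t+k} - \bar{z})}{\sum_{t=1}^{n} (z_t - \bar{z})^2}, \qquad k = 0, 1, 2, \ldots \qquad (2.5.18)$$

### Sample Autocorrelation (alternative formula)

Consider the random time series realization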
$z_1, z_2, \ldots, z_n$.

Lagged variables:

| $t$ | $z_t$ | $z_{t-1}$ | $z_{t-2}$ | $z_{t-3}$ | $z_{t-4}$ |
|---|---|---|---|---|---|
| 1 | $z_1$ | - | - | - | - |
| 2 | $z_2$ | $z_1$ | - | - | - |
| 3 | $z_3$ | $z_2$ | $z_1$ | - | - |
| $\vdots$ | | | | | |
| $n$ | $z_n$ | $z_{n-1}$ | $z_{n-2}$ | $z_{n-3}$ | $z_{n-4}$ |

Assuming covariance stationarity, let $\rho_k$ denote the process correlation between observations separated by $k$ time periods. Compute the order $k$ sample autocorrelation (i.e., the correlation between $z_t$ and $z_{t-k}$):

$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{n} (z_t - \bar{z})(z_{t-k} - \bar{z})}{\sum_{t=1}^{n} (z_t - \bar{z})^2}, \qquad k = 0, 1, 2, \ldots$$

Note: $-1 \le \hat{\rho}_k \le 1$.

### Lagged Sunspot Data

Lagged variables:

| $t$ | Spot | Spot1 | Spot2 | Spot3 | Spot4 |
|---|---|---|---|---|---|
| 1 | 101 | - | - | - | - |
| 2 | 82 | 101 | - | - | - |
| 3 | 66 | 82 | 101 | - | - |
| 4 | 35 | 66 | 82 | 101 | - |
| 5 | 31 | 35 | 66 | 82 | 101 |
| $\vdots$ | | | | | |
| 99 | 37 | 7 | 16 | 30 | 47 |
| 100 | 74 | 37 | 7 | 16 | 30 |
| 101 | - | 74 | 37 | 7 | 16 |
| 102 | - | - | 74 | 37 | 7 |
| 103 | - | - | - | 74 | 37 |
| 104 | - | - | - | - | 74 |

### Wolfer Sunspot Numbers: Correlation Between Observations Separated by One Time Period

[Figure: scatter plot of the series against its lag-1 values. Autocorrelation = 0.806244; correlation = 0.81708.]

### Wolfer Sunspot Numbers: Correlation Between Observations Separated by $k$ Time Periods

[Figure: scatter plots of the series against its lagged values, one panel per lag, lag 1 through lag 9.]

### Wolfer Sunspot Numbers: Sample ACF Function

[Figure: sample ACF of the sunspot series.]

### Autoregressive Models

- AR(0): $z_t = \mu + a_t$ (white noise or "trivial" model)
- AR(1): $z_t = \theta_0 + \phi_1 z_{t-1} + a_t$
- AR(2): $z_t = \theta_0 + \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$
- AR(3): $z_t = \theta_0 + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \phi_3 z_{t-3} + a_t$
- AR($p$): $z_t = \theta_0 + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$

$a_t \sim \mathrm{nid}(0, \sigma_a^2)$

### AR(1) and AR(2) Models for the Wolfer Sunspot Data

[Figures, one per model: data and 1-step-ahead predictions; sunspot residuals versus year number; ACF of the residuals; normal Q-Q plot of the residuals.]

### AR(3) and AR(4) Models for the Wolfer Sunspot Data

[Figures, one per model: data and 1-step-ahead predictions; sunspot residuals versus year number; ACF of the residuals; normal Q-Q plot of the residuals.]
### Sample Partial Autocorrelation

The true partial autocorrelation function, denoted by $\phi_{kk}$ for $k = 1, 2, \ldots$, is the process correlation between observations separated by $k$ time periods (i.e., between $z_t$ and $z_{t+k}$) with the effect of the intermediate $z_{t+1}, \ldots, z_{t+k-1}$ removed.

We can estimate $\phi_{kk}$, $k = 1, 2, \ldots$ with $\hat{\phi}_{kk}$, $k = 1, 2, \ldots$:

- $\hat{\phi}_{11} = \hat{\phi}_1$ from the AR(1) model
- $\hat{\phi}_{22} = \hat{\phi}_2$ from the AR(2) model
- $\hat{\phi}_{kk} = \hat{\phi}_k$ from the AR($k$) model

General formula (2.5.25) gives somewhat different answers due to the basis of the estimator (ordinary least squares versus solution of the Yule-Walker equations).

### Summary of Sunspot Autoregressions

| Model | Order $p$ | $R^2$ | $S$ | $\hat{\phi}_p$ | $t_{\hat{\phi}_p}$ | PACF $\hat{\phi}_{pp}$ |
|---|---|---|---|---|---|---|
| AR(1) | 1 | .6676 | 21.53 | .810 | 13.96 | .8062 |
| AR(2) | 2 | .8336 | 15.32 | -.711 | -9.81 | -.6341 |
| AR(3) | 3 | .8407 | 15.12 | .208 | 2.04 | .0805 |
| AR(4) | 4 | .8463 | 15.01 | -.147 | -1.41 | -.0611 |

- $\hat{\phi}_p$ is from ordinary least squares (OLS).
- $\hat{\phi}_{pp}$ is from formula (2.5.25), giving the solution to the Yule-Walker equations.

### Autoregressive-Moving Average (Box-Jenkins) Models

- AR(0): $z_t = \mu + a_t$ (white noise or "trivial" model)
- AR(1): $z_t = \theta_0 + \phi_1 z_{t-1} + a_t$
- AR(2): $z_t = \theta_0 + \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$
- AR($p$): $z_t = \theta_0 + \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t$
- MA(1): $z_t = \theta_0 - \theta_1 a_{t-1} + a_t$
- MA(2): $z_t = \theta_0 - \theta_1 a_{t-1} - \theta_2 a_{t-2} + a_t$
- MA($q$): $z_t = \theta_0 - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} + a_t$
- ARMA(1,1): $z_t = \theta_0 + \phi_1 z_{t-1} - \theta_1 a_{t-1} + a_t$
- ARMA($p$,$q$): $z_t = \theta_0 + \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} + a_t$

$a_t \sim \mathrm{nid}(0, \sigma_a^2)$

### Graphical Output from Splus ts Function iden(spot.d)

[Figure: identification plots for the Wolfer sunspot series.]

### Tabular Output from Splus ts Function iden(spot.d)

Identification Output for Wolfer Sunspots, w = Number of Spots

    [1] "Standard deviation of the working series= 37.35504"

ACF:

| Lag | ACF | se | t-ratio |
|---|---|---|---|
| 1 | 0.806243955 | 0.1000000 | 8.06243955 |
| 2 | 0.428105325 | 0.1516594 | 2.82280593 |
| 3 | 0.069611110 | 0.1632975 | 0.42628402 |

Partial ACF:

| Lag | Partial ACF | se | t-ratio |
|---|---|---|---|
| 1 | 0.806243896 | 0.1 | 8.06243896 |
| 2 | -0.634121358 | 0.1 | -6.34121358 |

### Standard Errors for Sample Autocorrelations and Sample Partial Autocorrelations

- Sample ACF standard error:
  $S_{\hat{\rho}_k} = \sqrt{\dfrac{1 + 2\sum_{j=1}^{k-1} \hat{\rho}_j^2}{n}}$ (3.17).
  Also can compute the $t$-like statistics $t = \hat{\rho}_k / S_{\hat{\rho}_k}$.
- Sample PACF standard
error: $S_{\hat{\phi}_{kk}} = \dfrac{1}{\sqrt{n}}$ (3.18).
  Also can compute the $t$-like statistics $t = \hat{\phi}_{kk} / S_{\hat{\phi}_{kk}}$.
- In long realizations from a stationary process, the $t$-like statistics can be approximated by $N(0, 1)$. Values of $\hat{\rho}_k$ and $\hat{\phi}_{kk}$ may be judged to be different from zero if the $t$-like statistics are outside specified limits ($\pm 2$ is often suggested; might use $\pm 1.5$ as a warning).

### Drawing Conclusions from Sample ACFs and PACFs

The $t$-like statistics should only be used as guidelines in model building because:

- An ARMA model is only an approximation to some true process.
- Sampling distributions are complicated; the $t$-like statistics do not really follow a standard normal distribution (only approximately, in large samples, with a stationary process).
- Correlations among the $\hat{\rho}_k$ values for different $k$ (e.g., $\hat{\rho}_1$ and $\hat{\rho}_2$ may be correlated).
- Problems relating to simultaneous inference (when looking at many different statistics simultaneously, confidence levels have little meaning).

## Chapters 8 and 9: Linear Regression and Regression Wisdom

At the end of this chapter you should be able to:

- Calculate a regression line given summary statistics
- Interpret the slope of the regression line in context
- Know when the interpretation of the intercept of the regression line is reasonable
- Find predictions and residuals for points
- Interpret a residual plot
- Interpret the $R^2$ value for a regression in context
- Describe and apply the limitations of regression

### Correlation and Regression

- Correlation: linear relationship between two variables
- Summarize relationship with a line
- Called the regression line
  - Explanatory variable $x$
  - Response variable $y$

### Regression line

- Explains how response variable $y$ changes in relation to explanatory variable $x$
- Use line to predict value of $y$ for a given value of $x$
- Need a mathematical formula (model)
  - Different lines by sight
- Predict $y$ from $x$
  - The ____ values are called ____
  - The ____ values are called ____

### Regression Line

- Predicted values
  - Model: summary of relationship between $x$ and $y$
- Observed values = Model + Error
  - Model: summary of relationship between $x$ and $y$
  - Error:
amount left over that the model doesn't explain

### Regression line

- Look at vertical distance
  - Error in regression line
- Place line to make these errors as small as possible

### Least squares regression

- Most commonly used regression line
- Puts line where the sum of the squared errors is as small as possible
  - Minimizes ____
- Based on statistics

### Regression line equation

- ____ where ____
- Slope $b_1$
  - Interpretation: ____
  - Very important for interpreting data
- Intercept
  - Interpretation: ____
  - Usually not important for interpreting data
  - Values of $x$ are usually not close to 0

[Figure: Degree Days vs Gas Usage (per month).]

### Calculating the regression line

- Degree Days vs Gas Usage: $\bar{x} = 22.31$, $\bar{y} = 5.314$, $s_x = 17.74$, $s_y = 3.37$, $r = 0.9953$
- Don't forget to write the equation

### Properties of regression line

- Regression line always goes through ____
- $r$ is connected to the value of $b_1$

### Interpretations

- Interpretation: Slope
- Interpretation: Intercept

### Predicted Values and Residuals

- Predicted value and residual for each point
- Predicted values (Model): calculate from the regression line
- Residual (Error): calculate from data and predicted value

### Predicted Values

- From regression line (Model)
  - Ex: Predicted gas consumption when degree days = 43
  - Ex: Predicted gas consumption when degree days = 24

### Residuals

- From predicted value and observed value
  - Ex: Residual when degree days = 43
  - Ex: Residual when degree days = 24

### Observed Values

- = Predicted Value + Residual
- Mean
- Standard Deviation

### Predicted Values

- Regression line (Model)
- Mean
- Standard Deviation

### Residuals

- Observed Value - Predicted Value
- Mean
- Standard Deviation

### Variability

- In observed values
- In predicted values
- In residuals

### Variability in $y$ explained by Regression Line

- $R^2$
- Interpretation of $R^2$

### Properties of $R^2$

- Any value from ____
- Higher values of $R^2$: ____
- Lower values of $R^2$: ____

### Residuals

- Error: relationship between $x$ and $y$ not modeled by the regression line
- Plot $x$ and Error: Residual Plot

### Residual Plot

- Special scatterplot. Explanatory
variable $x$ on the horizontal axis, residuals $e$ on the vertical axis, and a horizontal line at residual 0.
- Good residual plot: ____

### Bad Residual Plots

- Contain a pattern or outliers
- Indicates that the model with the regression line is not explaining everything about the relationship between $x$ and $y$

[Figures: three slides titled "Example of Other Residual Plots".]

### Regression in JMP

[Figure: JMP regression output.]

### Outliers

- Observations outside the overall pattern
  - Some examples
- Some outliers are ____
  - Removing the outlier would markedly change the regression line
  - Outliers in ____ are often ____
  - Usually have ____

### Age at first word vs Gesell score

[Figures: scatterplot, regression, and residual plot for Age at First Word vs Gesell Score.]

- Interpretations: Slope, Intercept, $R^2$
- Residual Plot
- Remove (42, 57) from data

### What should you do?

- Make sure data points are recorded correctly
- Collect more data
- Remove outlier

### Cautions about regression

- Linear relationship only
- Extrapolation
- ____
  - Makes relationship appear stronger
  - Removes variation
- ____
  - Important effect on variables but not included in study
  - Example
- Strong association between explanatory and response variables does not mean that the explanatory variable causes the response variable
  - Ex: High positive correlation between number of TV sets per person and average life expectancy

### Proving Causation

- Experiment
  - Change values of $x$ and control for lurking variables
- Not all problems can be solved by experiments
  - Smoking causes lung cancer
  - Living near power lines causes leukemia
- Proving smoking causes lung cancer

## Module 6: Some one-sample normal examples

Prof. Stephen B. Vardeman, Statistics and IMSE, Iowa State University. March 5, 2008.

### Example 4

We continue with our illustration of what is possible using WinBUGS to do Bayes analyses, in this module treating some problems involving normal observations with both $\mu$ and $\sigma$ unknown. A real example on
page 771 of Vardeman and Jobe's *Basic Engineering Data Collection and Analysis* concerns measurements on a critical dimension of $n = 5$ consecutive parts produced on a CNC lathe. The measurements, in units of .0001 in over nominal, were

$$4, 3, 3, 2, 3$$

We note that the usual sample mean for these values is $\bar{x} = 3.0$ and the usual sample standard deviation is $s = .7071$ (both in units of .0001 in). To begin, we're going to first do what most textbooks would do here, and consider analysis of these data under the model that says that the data are realized values of $X_1, X_2, \ldots, X_5$, independent $N(\mu, \sigma^2)$ random variables.

### Example 4 (cont.)

Standard "Stat 101" confidence limits for the process mean $\mu$ are

$$\bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}$$

while confidence limits for the process standard deviation $\sigma$ are

$$\left( s\sqrt{\frac{n-1}{\chi^2_{\text{upper},\,n-1}}},\ \ s\sqrt{\frac{n-1}{\chi^2_{\text{lower},\,n-1}}} \right)$$

and prediction limits for an additional dimension generated by this process, $x_{\text{new}}$, are

$$\bar{x} \pm t_{n-1}\, s \sqrt{1 + \frac{1}{n}}$$

In the present problem these yield 95% limits as in Table 1.

Table 1: Standard (non-Bayesian) 95% Confidence and Prediction Limits Based on a Normal Assumption for the Part Measurements

| Quantity | lower limit | upper limit |
|---|---|---|
| $\mu$ | $3.0 - 2.776 \dfrac{.7071}{\sqrt{5}} = 2.12$ | $3.0 + 2.776 \dfrac{.7071}{\sqrt{5}} = 3.88$ |
| $\sigma$ | $.7071 \sqrt{\dfrac{4}{11.143}} = .42$ | $.7071 \sqrt{\dfrac{4}{.484}} = 2.03$ |
| $x_{\text{new}}$ | $3.0 - 2.776(.7071)\sqrt{1 + \dfrac{1}{5}} = .85$ | $3.0 + 2.776(.7071)\sqrt{1 + \dfrac{1}{5}} = 5.15$ |

Bayesians often try to find relatively "non-informative" priors for which inferences based on their posteriors are in substantial agreement with non-Bayes methods of inference for problems where the latter are available. This gives hope that when such priors are used as parts of more complicated models (some for which no non-Bayesian methods of inference are even known) they will give inferences not too heavily affected by the choice of prior. We now consider what are fairly standard choices of priors in one-sample normal problems.

### Example 4 (cont.)

A standard "noninformative" choice of prior distribution for a one-sample normal problem is

$$\mu \sim \text{"Uniform}(-\infty, \infty)\text{"} \quad\text{independent of}\quad \ln(\sigma) \sim \text{"Uniform}(-\infty, \infty)\text{"}$$

Strictly speaking this is nonsense, as there is no such thing as a uniform probability distribution on $(-\infty, \infty)$. Precisely what is
meant is that with $\delta = \ln(\sigma)$ one is using as a "prior" for $(\mu, \delta)$ a function

$$g(\mu, \delta) = 1$$

This turns out to be equivalent to using as a "prior" for $(\mu, \sigma^2)$ a function

$$g(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$$

### Example 4 (cont.)

These don't actually specify a joint probability distribution for $(\mu, \sigma^2)$, but can be thought of as approximately equivalent to the "honest" priors

- $\mu \sim \text{Uniform}(-\text{large}, \text{large})$ independent of $\ln(\sigma) \sim \text{Uniform}(-\text{large}, \text{large})$
- $\mu \sim N(0, \text{large})$ independent of $\dfrac{1}{\sigma^2} \sim \text{Gamma}(\text{small}, \text{small})$

WinBUGS allows the use of the "improper" "Uniform$(-\infty, \infty)$" prior through the notation `dflat()`.

### Example 4 (cont.)

The file BayesASQEx4A.odc contains the WinBUGS code below for implementing a Bayes analysis for these data based on independent "flat" priors for $\mu$ and $\ln(\sigma)$.

    model {
    mu~dflat()
    logsigma~dflat()
    sigma<-exp(logsigma)
    tau<-exp(-2*logsigma)
    for (i in 1:5) {
    X[i]~dnorm(mu,tau)
    }
    Xnew~dnorm(mu,tau)
    }
    list(X=c(4,3,3,2,3))
    #here are 4 starts
    list(mu=7,logsigma=2,Xnew=7)
    list(mu=7,logsigma=2,Xnew=7)
    list(mu=2,logsigma=2,Xnew=2)
    list(mu=2,logsigma=2,Xnew=2)

### Example 4 (cont.)

I checked that "burn-in" had occurred by 1000 iterations. What is pictured are then results based on iterations 1001 through 11000.

[Figure: WinBUGS posterior density plots for Xnew, mu, and sigma (chains 1:4, sample 40000), with the node statistics below.]

| node | mean | sd | MC error | 2.5% | median | 97.5% | start | sample |
|---|---|---|---|---|---|---|---|---|
| Xnew | 3.002 | 1.09 | 0.005574 | (illegible) | 3.0 | 5.135 | 1001 | 40000 |
| mu | (illegible) | 0.4433 | (illegible) | 2.118 | (illegible) | (illegible) | 1001 | 40000 |
| sigma | (illegible) | 0.4579 | (illegible) | 0.4219 | 0.7655 | 2.052 | 1001 | 40000 |

### Example 4 (cont.)

Bayes results for this "improper prior" analysis are in almost perfect agreement with the ordinary Stat 101 results listed in Table 1. This analysis says one has the same kind of precision of information about $\mu$, $\sigma$, and $x_{\text{new}}$ that would be indicated by the elementary formulas.

The files BayesASQEx4B.odc and BayesASQEx4C.odc contain code that the reader can verify produces results not much different from those above. The priors used in these are respectively

- $\mu \sim \text{Uniform}(-10000, 10000)$ independent of $\ln(\sigma) \sim \text{Uniform}(-100, 100)$
- $\mu \sim N(0, 10^6)$ independent of $\tau = \dfrac{1}{\sigma^2} \sim \text{Gamma}(.01, .01)$

Example
4 (cont.)

The code for the first is

    model {
    mu~dunif(-10000,10000)
    logsigma~dunif(-100,100)
    sigma<-exp(logsigma)
    tau<-exp(-2*logsigma)
    for (i in 1:5) {
    X[i]~dnorm(mu,tau)
    }
    Xnew~dnorm(mu,tau)
    }
    list(X=c(4,3,3,2,3))
    #here are 4 starts
    list(mu=7,logsigma=2,Xnew=7)
    list(mu=7,logsigma=2,Xnew=7)
    list(mu=2,logsigma=2,Xnew=2)
    list(mu=2,logsigma=2,Xnew=2)

and the code for the second is

    model {
    mu~dnorm(0,.000001)
    tau~dgamma(.01,.01)
    sigma<-sqrt(1/tau)
    for (i in 1:5) {
    X[i]~dnorm(mu,tau)
    }
    Xnew~dnorm(mu,tau)
    }
    list(X=c(4,3,3,2,3))
    #here are 4 starts
    list(mu=7,tau=1,Xnew=7)
    list(mu=7,tau=.0001,Xnew=7)
    list(mu=2,tau=1,Xnew=2)
    list(mu=2,tau=.0001,Xnew=2)

### Digitalization/Quantization/Rounding

A real potential weakness of standard analyses of data like those in Example 4 is that they treat what are obviously "to the nearest something" values as if they were exact (infinite number of decimal places) numbers. The "4" in the data set is treated as if it were 4.00000000000.... Sometimes this interpretation is adequate, but other times ignoring the "digitalization"/"quantization" of measurement produces inappropriate inferences. This is true when the quantization is severe enough compared to the real variation in what is being measured that only a very few values appear in the data set. This is potentially the case in the real situation of Example 4. We consider how one might take account of quantization evident in data.

### Digitalization/Quantization/Rounding (cont.)

Consider a model for recorded part dimensions that says they are (infinite number of decimal places) normal random variables rounded to the nearest integer (unit of .0001 in above nominal). A cartoon of a possible distribution of the real part dimension and the corresponding distribution of the digital measurement is below.

[Figure: distribution of the real part dimension and the corresponding distribution of the digital measurement.]

### Digitalization/Quantization/Rounding (cont.)

To model this, we might suppose that real part dimensions are realized values of $X_1, X_2, \ldots, X_n$, independent $N(\mu, \sigma^2)$ random variables, but that the data in hand are $Y_1, Y_2, \ldots, Y_n$, where

$$Y_i = \text{the value of } X_i \text{ rounded to the nearest integer}$$

In this understanding, the value $Y_1 = 4.0$ in a data set doesn't mean that the first
measurement was exactly 4.000000000..., but rather only that it was somewhere between 3.5000000... and 4.500000.... What is especially pleasant is that WinBUGS makes incorporating this recognition of the digital nature of measurements into a Bayes analysis absolutely painless.

### Example 5

Let's do an analysis for the situation of Example 4 with priors

$$\mu \sim \text{"Uniform}(-\infty, \infty)\text{"} \quad\text{independent of}\quad \ln(\sigma) \sim \text{"Uniform}(-\infty, \infty)\text{"}$$

and suppose that of interest are not only the parameters $\mu$ and $\sigma$, but also an $X_{\text{new}}$ from the process and its rounded value $Y_{\text{new}}$, and in addition the fraction of the $X$ distribution below 1.000, that is

$$h(\mu, \sigma) = P\left(Z < \frac{1.0 - \mu}{\sigma}\right) = \Phi\left(\frac{1.0 - \mu}{\sigma}\right)$$

There are no widely circulated non-Bayes methods of inference for any of these quantities. (Lee and Vardeman have written on non-Bayesian interval estimation for $\mu$ and $\sigma$, but these papers are not well-known.) The file BayesASQEx5.odc contains some simple WinBUGS code for implementing a Bayes analysis.

### Example 5 (cont.)

The code is

    model {
    mu~dflat()
    logsigma~dflat()
    sigma<-exp(logsigma)
    tau<-exp(-2*logsigma)
    X1~dnorm(mu,tau) I(3.5,4.5)
    X2~dnorm(mu,tau) I(2.5,3.5)
    X3~dnorm(mu,tau) I(2.5,3.5)
    X4~dnorm(mu,tau) I(1.5,2.5)
    X5~dnorm(mu,tau) I(2.5,3.5)
    Xnew~dnorm(mu,tau)
    Ynew<-round(Xnew)
    prob<-phi((1.0-mu)/sigma)
    }
    #here are 4 possible initializations
    list(mu=7,logsigma=2,Xnew=7,X1=4,X2=4,X3=4,X4=2,X5=3)
    list(mu=7,logsigma=2,Xnew=7,X1=4,X2=4,X3=4,X4=2,X5=3)
    list(mu=2,logsigma=2,Xnew=2,X1=4,X2=4,X3=4,X4=2,X5=3)
    list(mu=2,logsigma=2,Xnew=2,X1=4,X2=4,X3=4,X4=2,X5=3)

The figure summarizes an analysis that results from using this code.

### Example 5 (cont.)

[Figure: WinBUGS posterior density plots for Ynew, prob, and sigma (chains 1:4, sample 120000), with the node statistics below.]

| node | mean | sd | MC error | 2.5% | median | 97.5% | start | sample |
|---|---|---|---|---|---|---|---|---|
| Xnew | 3.002 | 1.017 | 0.003005 | (illegible) | 3.001 | (illegible) | 11001 | 120000 |
| Ynew | 3.002 | 1.055 | 0.00322 | 1.0 | 3.0 | 5.0 | 11001 | 120000 |
| mu | (illegible) | (illegible) | (illegible) | (illegible) | (illegible) | (illegible) | 11001 | 120000 |
| prob | 0.02532 | 0.0597 | (illegible) | 1.295E-10 | 0.001773 | 0.2099 | 11001 | 120000 |
| sigma | 0.7976 | 0.4450 | 0.001077 | 0.325 | (illegible) | (illegible) | 11001 | 120000 |
Figure: Results of a Bayes analysis for $n = 5$ part dimensions, treating what is observed as integer-rounded normal variables.

### Example 5 (cont.)

Perhaps the most impressive part of all this is the ease with which something as complicated as $\Phi\left(\dfrac{1.0 - \mu}{\sigma}\right)$ is estimated and some sense of the precision with which it is known is provided, and this taking account of rounding in the data. Note that of the parts of the analysis that correspond to things that were done earlier, it is the estimation of $\sigma$ that is most affected by recognizing the quantized nature of the data. The posterior distribution here is shifted somewhat left of the one in Example 4.

This is probably a good point to remark on the fact that for a Bayesian, learning is always sequential. Today's posterior is tomorrow's prior. As long as one is willing to say that the CNC turning process that produced the parts is physically stable, the posterior from one of the analyses in this module would be appropriate as a prior for analysis of data at a subsequent point. So, for example, instead of starting with a flat prior for the process mean $\mu$, a normal distribution with mean near 3.00 and standard deviation near .44 could be considered.

## Chapter 18: Sampling Distribution for the Sample Mean

### $p$ and $\mu$

- $p$ is the population proportion
  - summarizes a categorical variable
  - Ex: What proportion of ISU students smoke?
  - take a sample and get $\hat{p}$
  - take many samples and get many different $\hat{p}$'s
  - the distribution of these $\hat{p}$'s is $N\left(p, \sqrt{\dfrac{p(1-p)}{n}}\right)$
- $\mu$ is the population mean
  - summarizes a quantitative variable
  - Ex: What is the mean age of all STAT 101 students?
  - take a sample and get $\bar{y}$, the sample mean

### $\bar{y}$

- Value of $\bar{y}$ is random
  - Changes from sample to sample
  - Different from population mean $\mu$
- Take many samples of size $n$ and get many different $\bar{y}$'s
  - these $\bar{y}$'s are data
  - summarize data: Shape, Center and Spread
  - sampling distribution for $\bar{y}$

### Sampling Distribution for $\bar{y}$

- Mean (Center): $E(\bar{y}) = \mu$
  - Expect to get $\mu$ on average
  - $\bar{y}$ is unbiased for $\mu$
- Standard Deviation (Spread): $SD(\bar{y}) = \sigma_{\bar{y}} = \dfrac{\sigma}{\sqrt{n}}$
  - As $n$ gets
larger, $\sigma_{\bar{y}}$ gets smaller
  - Larger samples are more accurate than smaller samples

### Name that distribution

The sampling distribution for $\bar{y}$ is NORMAL only if these conditions hold:

1. Sample must be a random sample
2. Sample must be independent values
3. Sample must be less than 10% of population
4. $n$ is large enough

Then $\bar{y}$ is $N\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right)$.

### Central Limit Theorem

As the sample size $n$ increases, the mean of $n$ independent values has a sampling distribution that tends toward a normal distribution.

### How large does $n$ need to be?

Depends on the shape of the population distribution:

- Symmetric: $n$ between 5 and 15
- Skewed: $n$ at least 30

### Example 1

Ithaca, New York gets an average of 35.4 inches of rainfall per year with a standard deviation of 4.2 inches. Assume yearly rainfall follows a normal distribution.

What is the probability a single year will have more than 40 inches of rain?

- $Y$ = annual rainfall
- $Y$ is $N(35.4, 4.2)$

$$P(Y > 40) = P\left(Z > \frac{40 - 35.4}{4.2}\right) = P(Z > 1.10) = 1 - 0.8643 = 0.1357$$

What is the probability that over a four year period the mean rainfall will be less than 30 inches?

- $\bar{y}$ = mean rainfall over a four year period
- Since $Y$ = annual rainfall is $N(35.4, 4.2)$, $\bar{y}$ is $N\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right) = N\left(35.4, \dfrac{4.2}{\sqrt{4}}\right) = N(35.4, 2.1)$

$$P(\bar{y} < 30) = P\left(Z < \frac{30 - 35.4}{2.1}\right) = P(Z < -2.57) = 0.0051$$

### Example 2

Carbon monoxide emissions for a certain kind of car vary with mean 2.9 gm/mi and standard deviation 0.4 gm/mi. A company has 80 cars in its fleet. Estimate the probability that the mean emissions for the fleet is between 2.95 and 3.0 gm/mi.

- $Y$ = emissions from one car
- $\bar{y}$ = mean emission from 80 cars
- $n = 80$ is large, so $\bar{y}$ is $N\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right) = N\left(2.9, \dfrac{0.4}{\sqrt{80}}\right) = N(2.9, 0.045)$

$$P(2.95 < \bar{y} < 3.0) = P\left(\frac{2.95 - 2.9}{0.045} < Z < \frac{3.0 - 2.9}{0.045}\right) = P(1.11 < Z < 2.22) = 0.9868 - 0.8665 = 0.1203$$

### Example 3

Grocery store receipts show that customer purchases are skewed to the right with a mean of \$32 and a standard deviation of \$20.

Can you determine the probability the next customer will spend at least \$40?

- $Y$ = amount a single customer will spend
- $Y$ has a skewed distribution

Given this information, we cannot determine this
probability.

### Example 3 (cont.)

What is the probability the next 50 customers will spend an average of at least \$40?

- $Y$ = amount one customer will spend
- $\bar{y}$ = mean amount 50 customers will spend
- $n = 50$ is large, so $\bar{y}$ is $N\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right) = N\left(32, \dfrac{20}{\sqrt{50}}\right) = N(32, 2.83)$

$$P(\bar{y} \ge 40) = P\left(Z > \frac{40 - 32}{2.83}\right) = P(Z > 2.83) = 1 - 0.9977 = 0.0023$$

### Example 4

Suppose there were 312 customers at the grocery store in one day. What is the probability the store's revenues were at least \$10,000?

- Total revenues = total amount spent by all 312 customers
- All 312 customers together must spend over \$10,000
- On average each customer must spend $10000/312 = 32.05$ dollars

What is the probability that 312 customers will spend at least an average of \$32.05?

- $Y$ = amount one customer will spend
- $\bar{y}$ = mean amount 312 customers will spend
- $n = 312$ is large, so $\bar{y}$ is $N\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right) = N\left(32, \dfrac{20}{\sqrt{312}}\right) = N(32, 1.13)$

$$P(\bar{y} \ge 32.05) = P\left(Z > \frac{32.05 - 32}{1.13}\right) = P(Z > 0.04) = 1 - 0.5160 = 0.4840$$

## Stat 101L Lecture 15: Re-expressing Data

- Chapter 6, Normal Model: What if data do not follow a Normal model?
- Chapters 8 & 9, Linear Model: What if a relationship between two variables is not linear?

### Re-expressing Data

- Re-expression is another name for changing the scale of (transforming) the data.
- Usually we re-express the response variable, $Y$.

### Goals of Re-expression

- Goal 1: Make the distribution of the re-expressed data more symmetric.
- Goal 2: Make the spread of the re-expressed data more similar across groups.
- Goal 3: Make the form of a scatter plot more linear.
- Goal 4: Make the scatter in the scatter plot more even across all values of the explanatory variable.

### Ladder of Powers

- Power: 2
  - Re-expression: $y^2$
  - Comment: Use on left skewed data.
- Power: 1
  - Re-expression: $y$
  - Comment: No re-expression. Do not re-express the data if they are already well behaved.
- Power: 1/2
  - Re-expression: $\sqrt{y}$
  - Comment: Use on count data or when scatter in a scatter plot tends to increase as the explanatory variable increases.
- Power: "0"
  - Re-expression: $\log(y)$
  - Comments: Not really the
0 power. Use on right-skewed data. Measurements cannot be negative or zero.

Ladder of Powers
Power: -1/2, -1
Re-expression: \(-1/\sqrt{y}\), \(-1/y\)
Comments: Use on right-skewed data. Measurements cannot be negative or zero. Use on ratios.

Goal 1: Symmetry
Data are obtained on the time between nerve pulses along a nerve fiber.
Time is rounded to the nearest half unit, where a unit is 1/50 of a second.
30.5 represents 30.5/50 = 0.61 sec.

Time, Nerve Pulses
Distribution is skewed right.
Sample mean (12.305) is much larger than the sample median (7.5).
Many potential outliers.
Data not from a Normal model.

Summary
Time: Highly skewed to the right.
Sqrt(Time): Still skewed right.
Log(Time): Fairly symmetric and mounded in the middle. Could have come from a Normal model.

Stat 101L Lecture 17

Randomness
It's not easy being random.
Pick a number, either 1, 2 or 3, at random. Write this number down as your first digit.
Pick a number, either 1, 2 or 3, at random. Write this number down as your second digit.

Trying to be Random
[Table to be filled in: First Digit (1, 2, 3) by Second Digit (1, 2, 3), with totals]

Random using JMP
                 Second Digit
First Digit      1     2     3   Total
     1          12    14    14     40
     2           9    13     8     30
     3          11    10     9     30
   Total        32    37    31    100

Randomness
Why do we need randomness?
We use randomness in our data collection to give a fair and accurate picture of the world.
Drawing conclusions from data relies on using randomness in data collection.

Practical Randomness
We can also use randomness to simulate outcomes that model a random situation.
The Pick 3 Lottery
- Pay $1.
- Pick 3 numbers between 0 and 9, say 1 2 3.
- If the 3 numbers drawn match your numbers in any order, you win $100.

Simulation
Component (repeated): a random three-digit number.
Explain model: if the number generated is 123, 132, 213, 231, 312 or 321, you win $100.
Use JMP: Random Integer(1000).
Response variable: number of wins in 1000 plays.
Run several trials.
Analyze the response variable.
State your conclusion.

Simulation
Trial   Wins   Gain/Loss
  1       7    $300 loss
  2       5    $500 loss
  3       5    $500 loss
  4       2    $800 loss
  5       5    $500 loss
  6       7    $300 loss

Simulation
Conclusion: Play the Pick 3 Lottery and you will end up losing money.

Stat 101L Lecture 12

Algebra Review
The equation of a straight line: \(y = mx + b\)
m is the slope: the change in y over the change in x, or rise over run.
b is the y-intercept: the value where the line cuts the y axis.
[Figure: graph of the line y = 3x + 2]

Review
\(y = 3x + 2\)
x = 0 gives y = 2 (the y-intercept).
x = 3 gives y = 11.
Change in y (9) divided by the change in x (3) gives the slope, 3.

Linear Regression
Example: Tar (mg) and nicotine (mg) in cigarettes.
y, Response: Nicotine (mg).
x, Explanatory: Tar (mg).
Cases: 25 brands of cigarettes.

Correlation Coefficient
Tar and nicotine:
\[ r = \frac{\sum z_x z_y}{n - 1} = \frac{22.9437}{24} = 0.956 \]

Linear Regression
There is a strong positive linear association between tar and nicotine.
What is the equation of the line that models the relationship between tar and nicotine?

Linear Model
The linear model is the equation of a straight line through the data.
A point on the straight line through the data gives a predicted value of y, denoted \(\hat{y}\).

Residual
The difference between the observed value of y and the predicted value of y, \(\hat{y}\), is called the residual.
Residual = \(y - \hat{y}\)

[Figure: scatterplot, Nicotine Content vs. Tar Content; axes: Tar (mg), Nicotine (mg)]

Line of Best Fit
There are lots of straight lines that go through the data.
The line of best fit is the line for which the sum of squared residuals is the smallest: the least squares line.

Line of Best Fit
\[ \hat{y} = b_0 + b_1 x \]
Least squares slope: \(b_1 = r \dfrac{s_y}{s_x}\)
Least squares intercept: \(b_0 = \bar{y} - b_1 \bar{x}\)

Summary of the Data
Tar, x: \(\bar{x} = 11.92\) mg, \(s_x = 4.636\) mg
Nicotine, y: \(\bar{y} = 0.908\) mg, \(s_y = 0.2812\) mg
r = 0.956

Least Squares Estimates
\[ b_1 = 0.956 \cdot \frac{0.2812}{4.636} = 0.058 \]
\[ b_0 = 0.908 - 0.058(11.92) = 0.217 \]
\[ \hat{y} = 0.217 + 0.058x \]

Interpretations
Slope: for every 1 mg increase in tar, the nicotine content increases, on average, 0.058 mg.
Intercept: there is not a reasonable interpretation of the intercept in this context because one wouldn't see a cigarette with 0 mg of tar.
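The least squares estimates for the tar and nicotine data can be reproduced from the summary statistics alone. The following is a minimal Python sketch using the values from the Summary of the Data slide; the function name `least_squares_from_summary` is an illustrative choice, not something from the lecture.

```python
def least_squares_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    b1 = r * s_y / s_x        # slope: correlation times the ratio of standard deviations
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x-bar, y-bar)
    return b0, b1


# Summary statistics for the 25 cigarette brands (x = tar in mg, y = nicotine in mg).
b0, b1 = least_squares_from_summary(
    r=0.956, x_bar=11.92, s_x=4.636, y_bar=0.908, s_y=0.2812
)
print(f"slope b1 = {b1:.3f}")
print(f"intercept b0 = {b0:.3f}")
```

Rounded to three decimals, the slope and intercept agree with the 0.058 and 0.217 worked out by hand.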
[Figure: Nicotine Content vs. Tar Content with the fitted line, Predicted Nicotine = 0.217 + 0.058 Tar; vertical axis: Nicotine (mg)]

Stat 101L Lecture 33

Inference for \(\mu\)
Who: Young adults.
What: Heart rate (beats per minute).
When:
Where: In a physiology lab.
How: Take pulse at wrist for one minute.
Why: Part of an evaluation of general health.

Inference for \(\mu\)
What is the mean heart rate for all young adults?
Use the sample mean heart rate, \(\bar{y}\), to make inferences about the population mean heart rate, \(\mu\).

Inference for \(\mu\)
Distribution of \(\bar{y}\)
Shape: Approximately normal.
Center: Mean, \(\mu\).
Spread: Standard deviation, \(SD(\bar{y}) = \dfrac{\sigma}{\sqrt{n}}\)

Problem
The population standard deviation, \(\sigma\), is unknown. Therefore \(SD(\bar{y}) = \dfrac{\sigma}{\sqrt{n}}\) is unknown as well.

Solution
Use the sample standard deviation, s, to get the standard error of \(\bar{y}\):
\[ SE(\bar{y}) = \frac{s}{\sqrt{n}} \]

Problem
The distribution of the standardized sample mean,
\[ \frac{\bar{y} - \mu}{SE(\bar{y})} \]
does not follow a normal model.

Solution
The distribution of the standardized sample mean,
\[ \frac{\bar{y} - \mu}{SE(\bar{y})} \]
does follow a Student's t-model with n - 1 degrees of freedom.

Inference for \(\mu\)
Do NOT use Table Z! Use Table T instead.
Conditions:
- Randomization condition
- 10% condition
- Nearly normal condition

Randomization Condition
Data arise from a random sample from some population.
Data arise from a randomized experiment.

10% Condition
The sample is no more than 10% of the population.
Not as critical for means as it is for proportions.

Nearly Normal Condition
The data come from a population whose shape is unimodal and symmetric.
Look at the distribution of the sample. Could the sample have come from a normal model?

Confidence Interval for \(\mu\)
\[ \bar{y} - t^* SE(\bar{y}) \ \text{ to } \ \bar{y} + t^* SE(\bar{y}) \]
\(t^*\) is from Table T.
\(SE(\bar{y}) = \dfrac{s}{\sqrt{n}}\)
Table T: df = n - 1; confidence levels 80%, 90%, 95%, 98%, 99%.

Sample Data
Random sample of n = 25 young adults.
Heart rate (beats per minute):
70, 74, 75, 78, 74, 64, 70, 78, 81, 73, 82, 75, 71, 79, 73, 79, 85, 79, 71, 65, 70, 69, 76, 77, 66

Summary of Data
n = 25
\(\bar{y}\) = 74.16 beats
s = 5.375 beats
\(SE(\bar{y}) = \dfrac{s}{\sqrt{n}}\) = 1.075 beats

Conditions
Randomization condition: Met because we have a random sample of 25 young adults.
10% condition: Met because 25 is less than 10% of all young adults that could have been sampled.

Conditions
Nearly Normal Condition: Could the sample have come from a population described by a normal model?
[Figure: normal quantile plot, box plot, and histogram of Heartrate]

Nearly Normal Condition
Normal quantile plot: data follow a straight line for a normal model.
Box plot: symmetric.
Histogram: unimodal and symmetric.

Confidence Interval for \(\mu\)
\[ \bar{y} - t^* SE(\bar{y}) \ \text{ to } \ \bar{y} + t^* SE(\bar{y}) \]
\(t^*\) is from Table T.
\(SE(\bar{y}) = \dfrac{s}{\sqrt{n}}\)
Table T: df = 24, 95% confidence level, so \(t^* = 2.064\).

Confidence Interval for \(\mu\)
\[ 74.16 \pm 2.064(1.075) \]
74.16 - 2.22 to 74.16 + 2.22
71.94 beats to 76.38 beats

Interpretation
We are 95% confident that the population mean heart rate of young adults is between 71.94 beats/min and 76.38 beats/min.

Interpretation
Plausible values for the population mean.
95% of intervals produced using random samples will contain the population mean.

JMP: Analyze, Distribution
Mean             74.16
Std Dev          5.375
Std Err Mean     1.075
Upper 95% Mean   76.38
Lower 95% Mean   71.94
N                25

Stat 101L Lecture 25

Sampling Distribution of \(\hat{p}\)
Shape: Approximately Normal.
Center: The mean is p.
Spread: The standard deviation is \(\sqrt{\dfrac{p(1-p)}{n}}\)

Reese's Pieces
Sampling distribution of \(\hat{p}\), n = 25:
- Shape: Approximately Normal.
- Center: The mean is 0.45.
- Spread: The standard deviation is
\[ \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.45(0.55)}{25}} = 0.0995 \]

Conditions
The sampled values must be independent of each other.
The sample size, n, must be large enough.

Conditions
10% Condition: When sampling without replacement, the sample size should be less than 10% of the population size.
Reese's Pieces: the number of pieces in the machine is much greater than 250.

Conditions
Success/Failure Condition: The sample size must be large enough
so that \(np\) and \(n(1-p)\) are both bigger than 10.
Reese's Pieces: \(np = 11.25\) and \(n(1-p) = 13.75\), which are both greater than 10.

Comment
To be able to use these results you need to know what the value of the population parameter, p, is.
This is no problem in the Reese's Pieces simulation because we can choose the proportion of orange pieces.

Inference
For most populations we don't know p, the population proportion.
We can use the sampling distribution of \(\hat{p}\) to help us make inferences about the reasonable, or plausible, value of p.

Stat 101L Lecture 14

Regression Wisdom

Sifting Residuals for Groups
Display residuals versus the explanatory variable.
Look at the distribution of residuals.

Example
y: Life Expectancy (years)
x: Wealth Index
r = 0.874
Fitted equation: y 241 771x
[Figure: histogram of Life Expectancy residuals; vertical axis: Count]

Interpretation
There appear to be several groups of residuals:
- two large negative residuals
- a large group between -5 and 0
- another group between 5 and 10
[Figures: Life Expectancy vs. Wealth Index; Residual vs. Wealth Index]

Getting the Bends
A fundamental assumption is that the relationship is a straight line.
What looks straight on a scatter plot may show a curve when one looks at the plot of residuals versus the explanatory variable.

Example
y: Stopping distance (feet)
x: Speed (miles per hour)
\(R^2 = 0.984\)
\(\hat{y} = -62.8 + 3.48x\)
[Figures: Distance vs. Speed; Residual vs. Speed]

Interpretation
There is a curved pattern in the residuals: under predicts, over predicts, under predicts.

Interpretation
Although the straight line does a very good job explaining the variation in stopping distance, a curved relationship (model) would do even better.

Dangers of Extrapolation
Suppose we use the least squares equation relating speed to stopping distance for a vehicle traveling at 5 mph.
The predicted stopping distance is -62.8 + 3.48(5) = -45.4 feet.

Special Points
Outlier: In regression this is a point with a large residual.
Leverage: In regression a point has high leverage
if it is an extreme value for the explanatory variable.
Influence: Outliers and high leverage points can be influential points; that is, they can greatly influence what the intercept and the slope of the least squares line will be.

Stat 101 Lecture 20

Probability
Subjective (Personal): based on feeling or opinion.
Empirical: based on experience.
Theoretical (Formal): based on assumptions.

The Deal
Bag o' chips (poker chips):
- some are red
- some are white
- some are blue
Draw a chip from the bag.

The Deal
Draw a red chip: win 3 bonus points.
Draw a blue chip: win 1 bonus point.
Draw a white chip: lose 1 bonus point.

Is this a good deal?
Subjective (personal) probability
- Based on your beliefs and opinion.
Empirical probability
- Based on experience.
- Conduct a series of trials.
- Each trial has an outcome (R, W, B).

Empirical Probability
Look at the long run relative frequency of each of the outcomes.
- Blue

Theoretical Probability
Look in the bag and see how many:
- Blue chips?
- Red chips?
- White chips?
Assumption: Each chip has the same probability of being chosen (equally likely).

Law of Large Numbers
For repeated independent trials, the long run relative frequency of an outcome gets closer and closer to the true probability of the outcome.

Probability Rules
A probability is a number between 0 and 1.
Something has to happen rule: The probability of the set of all possible outcomes of a trial must be 1.

Probability Rules
Event: a collection of outcomes.
- Win bonus points: Blue or Red chip.
Complement rule: The probability an event occurs is 1 minus the probability that it doesn't occur.
- \(P(A) = 1 - P(A^C)\)

Probability Rules
Disjoint events: no outcomes in common.
Addition Rule for disjoint events:
- \(P(A \text{ or } B) = P(A) + P(B)\)
- P(Blue or Red) = P(Blue) + P(Red)

Probability Rules
Independent trials.
Multiplication rule for independent trials:
P(outcome 1st and outcome 2nd) = P(outcome 1st) × P(outcome 2nd)

Example
What is the chance that two people in a row win bonus points?
P(win 1st and win 2nd) = P(win 1st) × P(win 2nd)
P(win 1st) = P(Blue or Red) = P(Blue) + P(Red)
P(win 2nd) = P(Blue or Red) = P(Blue) + P(Red)

Chapter 24 Comparing Means

Comparing Two Means
What do you see?

Comparing Two Means
An educator believes that new reading activities for elementary school children will improve reading comprehension scores. She randomly assigns her third-grade students to one of two [...] comprehension exam. Are the scores for the new reading activities group higher than for the traditional group?

Comparing Two Means
Does the new reading program produce better average scores? For this particular class?

Comparing Two Means
Look at a boxplot of each group's scores.

Comparing Two Means
Interested in the quantity \(\mu_1 - \mu_2\).
\(\mu_1\) and \(\mu_2\) are parameters (unknown).
Estimate \(\mu_1 - \mu_2\) with \(\bar{y}_1 - \bar{y}_2\).

Sampling Distribution
\(\sigma_1\) and \(\sigma_2\) are parameters (unknown).

Sampling Distribution
Assumptions:
- Random samples
- Samples are independent
- Nearly Normal population distributions

Sampling Distribution
If the assumptions hold, the sampling distribution for [...] is [...]

Degrees of Freedom
t doesn't really have a t distribution.
The true distribution of t is [...] when you use this formula for the degrees of freedom: [...]

Example 1
A statistics student designed an experiment to test the battery life of two brands of batteries. For the sample of 6 generic batteries, the mean amount of time the batteries lasted was 206.0 minutes with a standard deviation of 10.3 minutes. For the sample of 6 name brand batteries, the mean amount of time the batteries lasted was 187.4 minutes with a standard deviation of 14.6 minutes. Calculate a 90% confidence interval for the difference in battery life between the generic and name brand batteries.

Example 1 (cont.)
Assumptions:
- Random samples: OK.
- Independent samples: different batteries for each sample.
- Nearly Normal: data show no real outliers.

Inference for \(\mu_1 - \mu_2\)
Confidence interval for \(\mu_1 - \mu_2\):
\[ (\bar{y}_1 - \bar{y}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
\(t^*\) is the critical value from the t distribution table.

Example 1 (cont.)
\(n_1 = 6,\ \bar{y}_1 = 206.0,\ s_1 = 10.3\)
\(n_2 = 6,\ \bar{y}_2 = 187.4,\ s_2 = 14.6\)
df = [...]
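The 90% interval for the battery example can be checked numerically. The sketch below is one way to do it in Python, not the lecture's own worked solution: it assumes the conservative choice df = smaller of n1 - 1 and n2 - 1 (here 5), for which a t table gives t* = 2.015 at 90% confidence, and it also reports the degrees of freedom from the Welch-Satterthwaite approximation for comparison. The function names are illustrative.

```python
import math


def two_sample_ci(y1, s1, n1, y2, s2, n2, t_star):
    """Confidence interval for mu1 - mu2 from summary statistics."""
    diff = y1 - y2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of y1-bar minus y2-bar
    margin = t_star * se
    return diff - margin, diff + margin, se


def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom (Welch-Satterthwaite formula)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Generic batteries (group 1) vs. name brand batteries (group 2).
# Assumption: t* = 2.015 is the 90% critical value for df = min(6 - 1, 6 - 1) = 5.
lower, upper, se = two_sample_ci(206.0, 10.3, 6, 187.4, 14.6, 6, t_star=2.015)
df_welch = welch_df(10.3, 6, 14.6, 6)

print(f"SE = {se:.3f} minutes")
print(f"90% CI: {lower:.2f} to {upper:.2f} minutes")
print(f"Welch df = {df_welch:.2f}")
```

Using the larger Welch degrees of freedom would give a smaller critical value and therefore a somewhat narrower interval; the df = 5 choice errs on the side of a wider one.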
Example 2
The Core Plus Mathematics Project (CPMP) was designed to help students improve their mathematical reasoning skills. At the end of 3 years, students in both the CPMP program and students in a traditional math program took an algebra test without calculators. The 312 CPMP students had a mean score of 29.0 and a standard deviation of 18.8, while the 265 traditional students had a mean score of 38.4 with a standard deviation of 16.2. Calculate a 95% confidence interval for the mean difference in scores between the two groups.

Example 2 (cont.)
Assumptions:
- Random samples: no reason to think nonrandom.
- Independent samples: different students in each group.
- Nearly Normal: n1 and n2 are large, so not important.

Example 2 (cont.)
\(n_1 = 312,\ \bar{y}_1 = 29.0,\ s_1 = 18.8\)
\(n_2 = 265,\ \bar{y}_2 = 38.4,\ s_2 = 16.2\)
df = [...]

Hypothesis Test for \(\mu_1 - \mu_2\)
Assumptions:
- Random samples
- Independent samples
- Nearly Normal population distributions
Test statistic:
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]
df = smaller of \(n_1 - 1\) and \(n_2 - 1\)

Hypothesis Test for \(\mu_1 - \mu_2\)
P-value for \(H_A: \mu_1 - \mu_2 > 0\): P-value = \(P(t_{df} > t)\)
P-value for \(H_A: \mu_1 - \mu_2 < 0\): P-value = \(P(t_{df} < t)\)
P-value for \(H_A: \mu_1 - \mu_2 \ne 0\): P-value = \(2P(t_{df} > |t|)\)

Example 1
Back to the reading example. The educator takes a random sample of all third graders in a large school district and divides them into the two groups. The mean score of the 18 students in the new activities group [...] have a higher mean reading score? Use \(\alpha = 0.1\).

Hypothesis Test for \(\mu_1 - \mu_2\)
Decision: If P-value < \(\alpha\), [...]
Conclusion: Statement about the value of \(\mu_1 - \mu_2\). Always stated in terms of the problem.

Example 1 (cont.)
m1875172x1171 n2 202 418e2 1745

Example 1 (cont.)
\(H_0\):
\(H_A\):
Assumptions:
- Random samples: OK.
- Independent samples: different set of students in each group.
- Nearly Normal: boxplots look symmetric.

Example 1 (cont.)
df, P-value, Decision, Conclusion

Example 2
In June 2002, the Journal of Applied Psychology reported on a study that examined whether the content of TV shows influenced the ability of viewers to recall brand names of items featured in commercials.
Researchers randomly assigned volunteers to watch either a program with violent content or a program with neutral content. Both programs featured the same 9 commercials. After the shows ended, subjects were asked to recall the brands in the commercials. Is there evidence that viewer memory for ads differs between programs? Use \(\alpha = 0.05\).

Example 2 (cont.)
\(n_1 = 108,\ \bar{y}_1 = 2.08,\ s_1 = 1.87\)
\(n_2 = 108,\ \bar{y}_2 = 3.17,\ s_2 = 1.77\)

Example 2 (cont.)
\(H_0\):
\(H_A\):
Assumptions:
- Random samples: no reason to think not random.
- Independent samples: different people in each group.
- Nearly Normal: n1 and n2 are large, so not important.

Example 2 (cont.)
df, P-value, Decision, Conclusion

Chapter 5 Understanding and Comparing Distributions

At the end of this chapter you should be able to:
- Compare and contrast the distribution of a quantitative variable between two or more groups.

Height of Stat 101 Students
Who
What
Groups: Gender
[Figure: histogram, Height of Stat 101 Students]
[Figure: Histograms of Female and Male Heights]

Number of HRs per Season
Two players:
- Barry Bonds
- Hank Aaron
Hank Aaron (Year: HR), first rows of the table: 55: 27, 63: 44, 71: 47
[Figure: Stem and Leaf Display Comparison]

Comparison of Five Number Summaries
[Table: five number summaries of Heights, Female and Male]

Graph of Five Number Summary
Boxplot
- Box
- Line in the box marks the median.

Boxplot
Whiskers and Outliers
- Calculate Q3 + 1.5 IQR.
- If Max > Q3 + 1.5 IQR:
  - Whisker out to Q3 + 1.5 IQR.
  - Points greater than Q3 + 1.5 IQR denoted by dots.
- If Max < Q3 + 1.5 IQR:
  - Whisker out to Max.

Boxplot
Whiskers and Outliers
- Calculate Q1 - 1.5 IQR.
- If Min < Q1 - 1.5 IQR:
  - Whisker out to Q1 - 1.5 IQR.
  - Points less than Q1 - 1.5 IQR denoted by dots.
- If Min > Q1 - 1.5 IQR:
  - Whisker out to Min.

Boxplot Example
Average Driving Distance for 202 golfers on the PGA Tour.
[Figure: boxplot of average driving distance]

Comparing Heights with Boxplots
[Figure: side-by-side boxplots of heights]

Questions for Further Study
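The whisker and outlier rules from the boxplot slides can be written out as a short Python sketch. It follows the rule exactly as stated on the slides (whisker out to the fence when a point lies beyond it, otherwise out to the Min or Max). The data values, the quartiles passed in, and the function name `boxplot_whiskers` are hypothetical and chosen only for illustration.

```python
def boxplot_whiskers(values, q1, q3):
    """Apply the 1.5 * IQR rule to find whisker ends and outlying points."""
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    lower_fence = q1 - 1.5 * iqr

    # Upper whisker: out to the fence if Max lies beyond it, otherwise out to Max.
    upper_whisker = upper_fence if max(values) > upper_fence else max(values)
    # Lower whisker: out to the fence if Min lies beyond it, otherwise out to Min.
    lower_whisker = lower_fence if min(values) < lower_fence else min(values)

    # Points beyond the fences are drawn as dots.
    outliers = [v for v in values if v > upper_fence or v < lower_fence]
    return {"lower_whisker": lower_whisker,
            "upper_whisker": upper_whisker,
            "outliers": outliers}


# Hypothetical heights (inches); quartiles are supplied rather than computed,
# since different quartile conventions give slightly different values.
heights = [55, 61, 62, 64, 65, 66, 67, 68, 70, 84]
result = boxplot_whiskers(heights, q1=62, q3=68)
print(result)
```

With Q1 = 62 and Q3 = 68 the IQR is 6, so the fences sit at 53 and 77: the value 84 is flagged as an outlier and the lower whisker stops at the minimum, 55.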
Stat 101L Lecture 29

Interpretation
Getting a value of the sample proportion of 0.12 is consistent with random sampling from a population with population proportion p = 0.10.
This sample result does not contradict the null hypothesis. The P-value is not small, therefore fail to reject \(H_0\).

Interpretation
Even though the sample proportion (0.12) is larger than the hypothesized population proportion (0.10), it is not large enough for us to believe that the population proportion is greater than 0.10.
There is not convincing evidence.

Conclusion
Based on this sample, the law firm should not pursue the class action lawsuit because the population proportion of defective cars could be only 10%.

Test of Hypothesis
Step 1: State your null and alternative hypotheses.
\(H_0: p = p_0\)
\(H_A: p > p_0\)

Test of Hypothesis
Step 2: Check conditions.
- Independence
- Random sampling condition
- 10% condition
- Success/Failure condition

Test of Hypothesis
Step 3: Calculate the test statistic value and convert it into a P-value.
\[ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \]
Use Table Z.

Test of Hypothesis
Step 4: Use the P-value to reach a decision.
If the P-value is small, then reject \(H_0\).
If the P-value is not small, then fail to reject \(H_0\).

Test of Hypothesis
Step 5: State your conclusion in the context of the problem.
What does rejecting, or failing to reject, \(H_0\) mean in the context of the problem?

Alternatives
\(H_0: p = p_0\)
\(H_A: p < p_0\): P-value = Pr < z
\(H_A: p > p_0\): P-value = Pr > z
\(H_A: p \ne p_0\): P-value = Pr > |z|

Another Example
According to the US census, Story County has 9.7% of its population classified as nonwhite.
Of 120 people called for jury duty in Story County, only 3 are nonwhite.
Is this convincing evidence of under-representation of nonwhites?

Another Example
Step 1: State your null and alternative hypotheses.
\(H_0: p = 0.097\)
\(H_A: p < 0.097\)
p is the proportion of nonwhites among all people in the jury pool for Story County.

Test of Hypothesis
Step 2: Check conditions.
- Independence
- Random sampling condition
- 10% condition
- Success/Failure
condition

Test of Hypothesis
Step 3: Calculate the test statistic value and convert it into a P-value.
\[ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{0.025 - 0.097}{\sqrt{\dfrac{0.097(1 - 0.097)}{120}}} = \frac{-0.072}{0.027} = -2.67 \]
P-value = 0.0038

Test of Hypothesis
Step 4: Use the P-value to reach a decision.
Because the P-value is small, we should reject \(H_0\).

Test of Hypothesis
Step 5: State your conclusion in the context of the problem.
This is convincing evidence that nonwhites are under-represented in the jury pool.

Stat 101L Lecture 28

Inference
Confidence Interval for p:
\[ \hat{p} - z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \ \text{ to } \ \hat{p} + z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Confidence Interval
Plausible values for the unknown population proportion, p.
We have confidence in the process that produced this interval.

Inference
Using CI: The population proportion, p, could be any of the values in the interval. Values outside the interval are not plausible values for p.

Inference
Hypothesis Test: Propose a value for the population proportion, p. Does the sample data support this value?

Example
A law firm will represent people in a class action lawsuit against a car manufacturer only if it is sure that more than 10% of the cars have a particular defect.

Example
Population: Cars of a particular make and model.
Parameter: Proportion of this make and model of car that have a particular defect.

Example
Null Hypothesis, \(H_0: p = 0.10\)
Alternative Hypothesis, \(H_A: p > 0.10\)

Example
The law firm contacts 100 car owners at random and finds out that 12 of them have cars that have the defect.
Is this sufficient evidence for the law firm to proceed with the class action lawsuit?

Example
How likely is it to get a sample proportion as extreme as the one we observe when taking a random sample of 100 from a population with p = 0.10?

Example
Sampling distribution of \(\hat{p}\):
Shape: approximately normal because the 10% condition and the success/failure condition are satisfied.
Mean: p = 0.10
Standard deviation: \(\sqrt{\dfrac{0.10(0.90)}{100}} = 0.03\)

Draw a Picture
[Figure: normal curve for the sampling distribution of \(\hat{p}\), centered at 0.10]

Standardize
\[ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{0.12 - 0.10}{\sqrt{\dfrac{0.10(0.90)}{100}}} = \frac{0.02}{0.03} = 0.67 \]

Use Table Z
z = 0.67 gives 0.7486.

Interpretation
Getting a sample proportion of 0.12 or more will happen about 25% (P-value = 0.25) of the time when taking a random sample of 100 from a population whose population proportion is p = 0.10.

Stat 101L Lecture 36

Inference for \(\mu_1 - \mu_2\)
Do males and females at ISU spend the same amount of time, on average, at the Lied Recreation Athletic Center?
Could the difference between the population mean times be zero?

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 1: Set up the null and alternative hypotheses.
\(H_0: \mu_1 = \mu_2\) or \(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 \ne \mu_2\) or \(H_A: \mu_1 - \mu_2 \ne 0\)

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 2: Check conditions.
- Randomization condition: two independent random samples.
- 10% condition
- Nearly Normal condition

[Figures: Females and Males, each with a normal quantile plot and a histogram of Time (min)]

Nearly Normal Condition
The female sample data could have come from a population with a normal model.
The male sample data could have come from a population with a normal model.

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 3: Compute the value of the test statistic and find the P-value.
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{SE(\bar{y}_1 - \bar{y}_2)}, \qquad SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Time (minutes)
\[ SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{13.527^2}{15} + \frac{13.792^2}{15}} = \sqrt{24.88} = 4.988 \]

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 3: Compute the value of the test statistic and find the P-value.
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{SE(\bar{y}_1 - \bar{y}_2)} = \frac{55.87 - 69.20}{4.988} = -2.672 \]

Table T
Two-tail probability   0.20    0.10    0.05    0.02   P-value   0.01
df = 28                1.313   1.701   2.048   2.467   2.672    2.763

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 4: Use the P-value to make a decision.
Because the P-value is small (it is between 0.01 and 0.02), we should reject the null hypothesis.

Test of Hypothesis for \(\mu_1 - \mu_2\)
Step 5: State a conclusion within the context of the problem.
The difference in mean times is not zero. Therefore, on average, females and males at ISU spend different amounts of time at the Lied Recreation Athletic Center.

Comment
This conclusion agrees with the
results of the confidence interval.
Zero is not contained in the 95% confidence interval (-23.55 mins to -3.11 mins), therefore the difference in population mean times is not zero.

Alternatives
\(H_0: \mu_1 = \mu_2\)
\(H_A: \mu_1 < \mu_2\): One-tail prob, Pr < t
\(H_A: \mu_1 > \mu_2\): One-tail prob, Pr > t
\(H_A: \mu_1 \ne \mu_2\): Two-tail prob, Pr > |t|

JMP
Data in two columns:
- Response variable: Numeric, Continuous
- Explanatory variable: Character, Nominal

JMP Starter
Basic: Two Sample t-Test
- Y, Response: Time
- X, Grouping: Sex
[JMP output: Oneway Analysis of Time By Sex; Means and Std Deviations; t Test assuming unequal variances]

Stat 101L Lecture 19

Observational Studies
Simply observing what happens.
A sample survey is an observational study.
There are other observational studies that are not surveys.

Tanning and Skin Cancer
The observational study involved 1500 people.
Checked medical records to see if a person had experienced skin cancer or had never had skin cancer.
Asked all participants whether they had used tanning beds.

Sodium and Blood Pressure
Enroll 100 individuals in the study.
Give them diet diaries where they record everything they eat each day for a month. From this the amount of sodium in the diet is found.
Measure their blood pressure.

Differences
Retrospective: look at past records and historical data (Tanning and Skin Cancer).
Prospective: identify subjects and collect data as events unfold (Sodium and Blood Pressure).

Experiments
Explanatory variable: Factor
- Treatments
Subjects, Participants: Experimental Units
Response variable

Experiments
The experimenter must actively and deliberately manipulate the factors to establish the method of treatment.
Experimental units are assigned at random to the treatments.

Sodium and
Blood Pressure
20 subjects.
Factor: amount of dietary sodium.
Treatments: low sodium diet and high sodium diet.
10 subjects randomly assigned to each treatment.
Response: systolic blood pressure.

Experimental Principles
Control
Randomize
Replicate
- within an experiment
- repeating the entire experiment
Block

Control
Control outside variables that may affect the response.
Have subjects of the same age, gender, general health, ethnic group.
By controlling outside variables you prevent those variables from causing variation in the response.

Examining Relationships
So far we have primarily focused on a single variable.
It is often more interesting to examine the relationship between variables.
For now, we will focus on relationships between two quantitative variables.

Roles for Variables
Response Variable: Often the variable we are most interested in.
Explanatory Variable (Predictor): Used to explain the variation in the response variable.

Scatterplots
Scatterplots are a common and effective way to visualize the relationship between two quantitative variables.
When describing the association or relationship between two variables, always look at the overall pattern and deviations from the pattern.

Teenage Drug Use
Is there a relationship between the use of marijuana and the use of other drugs?
A survey was conducted in the US and in 10 countries of Western Europe to determine the percentage of teenagers who had used marijuana and other drugs.
A scatterplot of the data follows.

Teenage Drug Use
[Figure: scatterplot; axes: Marijuana, Other Drugs]

Positive Association
Above average values of marijuana usage are associated with above average values of the usage of other illegal drugs.
Below average values of marijuana usage are associated with below average values of the usage of other illegal drugs.

Price of Used Cars
Is there a relationship between the age of a car and its price?
A scatterplot of the ages and prices of 17 used Toyota Corollas follows.

Price of Used Cars
[Figure: scatterplot of price vs. age]

Negative Association
Above average values of car age are associated with below average values of car price.
Below average values of car age are associated with above average values of car price.

Stat 101L Lecture 13

Prediction
Least squares line: \(\hat{y} = 0.217 + 0.058x\)
For x = 13: \(\hat{y} = 0.217 + 0.058(13) = 0.97\) mg

Residual
Tar, x = 13 mg
Nicotine, y = 0.8 mg
Predicted, \(\hat{y}\) = 0.97 mg
Residual, \(y - \hat{y}\) = 0.8 - 0.97 = -0.17 mg

Residuals
Residuals help us see if the linear model makes sense.
Plot residuals versus the explanatory variable.
If the plot is a random scatter of points, then the linear model is the best we can do.

Plot of Residuals vs Tar Content
[Figure: residual plot; axes: Tar (mg), Residual]

Interpretation of the Plot
The residuals are scattered randomly. This indicates that the linear model is an appropriate model for the relationship between tar and nicotine content of cigarettes.

\(r^2\) or \(R^2\)
The square of the correlation coefficient gives the amount of variation in y that is accounted for, or explained by, the linear relationship with x.

Tar and Nicotine
r = 0.956
\(r^2 = (0.956)^2 = 0.914\) or 91.4%
91.4% of the variation in nicotine content can be explained by the linear relationship with tar content.

Regression Conditions
Quantitative variables: both variables should be quantitative.
Linear model: does the scatter plot show a reasonably straight line?
Outliers: watch out for outliers as they can be very influential.

Regression Cautions
Beware of extraordinary points.
Don't extrapolate beyond the data.
Don't infer x causes y just because there is a good linear model relating the two variables.
Don't choose a model based on \(R^2\) alone.

Stat 101L Lecture 30

More on Testing
500 randomly selected US adults were asked the question: "Would you be willing to pay much higher taxes in order to protect the environment?"
216 answered yes.

More on Testing
Is this convincing evidence that the proportion of all US adults who are willing to pay higher taxes is different from 50%?

Test of Hypothesis
Step 1: State your null and alternative hypotheses.
\(H_0: p = 0.50\)
\(H_A:\)
\(p \ne 0.50\)

Test of Hypothesis
Step 2: Check conditions.
- Independence
- Random sampling condition
- 10% condition
- Success/Failure condition

Test of Hypothesis
Step 3: Calculate the test statistic value and convert it into a P-value.
\[ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{0.432 - 0.5}{\sqrt{\dfrac{0.5(0.5)}{500}}} = -3.04 \]
P-value = 2(0.0012) = 0.0024

Test of Hypothesis
Step 4: Use the P-value to reach a decision.
The P-value is small, therefore we should reject \(H_0\).

Test of Hypothesis
Step 5: State your conclusion in the context of the problem.
There is convincing evidence that the proportion of the population willing to pay more taxes is different from 50%.

Thinking about the P-value
The P-value is the probability of getting a value of the sample proportion as extreme as the one we actually observed, given that the null hypothesis is true.

Thinking about the P-value
The P-value is NOT the probability that the null hypothesis is true.
The smaller the P-value, the more comfortable we feel about rejecting the null hypothesis.

What is a small P-value?
A small P-value is one that corresponds to a rare occurrence.
What is a rare occurrence?
We can define what is a rare occurrence by setting a level of significance, or alpha level.

Rare Occurrence
Many use the rule of thumb that a rare occurrence is anything with a probability less than 0.05.
A small P-value would then be any P-value less than 0.05.

Rare Occurrence
We can define a rare occurrence as anything with a probability less than, say, 0.01.
If the P-value is less than 0.01, then the result is statistically significant at the 0.01 level.

Significance
Statistical significance indicates a result that cannot be explained by the variation due to random sampling.
Statistical significance is not necessarily practical significance.

CIs and Tests
A 95% confidence interval for the proportion of the US population willing to pay more taxes to protect the environment is 38.8% to 47.6%.

CIs and Tests
50% is not in the confidence interval,
therefore 50% is not a plausible value for the population proportion.
This agrees with the decision to reject the null hypothesis that p = 0.50.

Stat 101L Lecture 22

Probability Rules
In Chapter 14 we were introduced to some simple rules of probability; in Chapter 15 we will look at more general rules.

Surviving the Titanic
[Table: Titanic passengers and crew by class (Crew, First, ...) and survival, with totals]

Probability of Surviving
If we select an individual at random from the Titanic, what is the chance that individual survived?
Each individual has an equal chance of being selected.

Formal Probability
\[ P(\text{Event}) = \frac{\text{\# of outcomes in Event}}{\text{Total \# of outcomes}} \]

Probability of survival
Probability individual selected was saved = 706/2223 = 0.318 or 31.8%
Probability individual selected was lost = 1517/2223 = 0.682 or 68.2%

General Addition Rule
\(P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\)
P(Saved or First Class) = P(Saved) + P(First Class) - P(Saved and First Class)
= 706/2223 + 329/2223 - 199/2223 = 0.376 or 37.6%

Special Addition Rule
Disjoint (mutually exclusive) events: no outcomes in common, \(P(A \text{ and } B) = 0\).
Addition Rule for disjoint events: \(P(A \text{ or } B) = P(A) + P(B)\)
P(First or Second) = P(First) + P(Second) = 329/2223 + 285/2223 = 614/2223 = 0.276 or 27.6%

Conditional Probability
Probability relative to a pre-existing condition.
\(P(A \mid B)\): The probability of A occurring given B has already occurred.

Conditional Probability
P(Saved | First Class) = number of First Class who were saved, relative to the total number of First Class passengers.
P(Saved | First Class) = 199/329 = 0.605 or 60.5%

General Multiplication Rule
\(P(A \text{ and } B) = P(A)\,P(B \mid A)\)
P(First Class and Saved) = P(First Class) P(Saved | First Class) = (329/2223)(0.605) = 0.09 or 9%
P(First Class and Saved) = 199/2223 = 0.09 or 9%

Independent Events
Two events are independent if the probability of the occurrence of one event does not affect, nor is it affected by, the occurrence of the other event.
\(P(A) = P(A \mid B) = P(A \mid B^C)\)

Not independent events
Saved and First Class are not independent events because
P(Saved) = 0.318
P(Saved | First Class) = 0.605
The two probabilities are not equal.

Independent Trials
Sampling with
replacement creates independent trials.

$P(\text{Red on 1st and Red on 2nd}) = P(\text{Red on 1st}) \, P(\text{Red on 2nd})$

Dependent Trials: Sampling without replacement creates dependent trials.

$P(\text{Red on 1st and Red on 2nd}) = P(\text{Red on 1st}) \, P(\text{Red on 2nd} \mid \text{Red on 1st})$

[Tree diagram: 1st trial branches (Red, White, Blue), each followed by 2nd trial branches (Red, White, Blue)]

#### Stat 101L Lecture 37: Experiments

We wish to conduct an experiment to see the effect of alcohol (explanatory variable) on reaction time (response variable).

Factors and Treatments

- The manipulated factor will be the amount of alcohol consumed.
- There will be two treatments:
  - No alcohol: the control group drinks grape punch.
  - Alcohol: the treatment group drinks grape punch spiked with grain alcohol.

Experimental Design

- The twelve participants will be split at random into two groups of 6.
- Each participant will drink two 8-ounce glasses of grape punch in half an hour.
- Reaction time of each participant will be measured after drinking the punch.
- Control of outside variables: each participant drinks grape punch, and each participant has reaction time measured in the same way.
- Randomization: participants are randomly assigned to treatment groups.
- Replication: there are 6 participants in each treatment group.

Natural Variation: Participants will vary in terms of their natural reaction time. Randomization spreads this variation evenly across the treatment groups.

Data: Group 1 is the control group and group 2 is the treatment group, with $n_1 = 6$, $n_2 = 6$, sample means $\bar{y}_1$, $\bar{y}_2$ and sample standard deviations $s_1$, $s_2$.

Analysis of Results: The data gathered from this experiment can be analyzed using the methods presented in Chapter 24 (Lectures 33 and 34), two independent samples.

Natural Variation: We cannot control the natural variation in reaction time, i.e. make each participant have the same reaction time to begin with. We can account for this natural variation by introducing a blocking variable.

Block Design: Have each participant serve as a block. Each participant will experience both treatments, no alcohol and
alcohol, in a random order.

Block Design: There is no variation in the natural reaction time within a block; it is the same person within a block. Therefore we can better assess the effect of alcohol on each person's reaction time.

Data: With this block design we will get a pair of observations (reaction time after grape punch and reaction time after grape punch with alcohol) for each participant.

Two Independent Samples

- Two separate sets of individuals.
- One value of the response variable for each individual.

Paired Samples

- One set of individuals.
- Two values of the response variable (a pair of values) for each individual.

Know the Difference: It is important to know the difference between data arising from two independent samples and data arising from paired samples.

#### Stat 101L Lecture 10: Scatter Plots & Association

- Statistics is about variation.
- Recognize, quantify and try to explain variation.
- Variation in contents of cola cans can be explained, in part, by the type of cola in the cans.
- Response variable: variable of primary interest.
- Explanatory variable: variable used to try to explain variation in the response.
- When both the response and the explanatory variables are quantitative, display them both in a scatter plot.
- Look for a general pattern of association.

Example: Tar (mg) and nicotine (mg) in cigarettes.

- y, Response: Nicotine (mg)
- x, Explanatory: Tar (mg)
- Cases: 25 brands of cigarettes

[Scatter plot: Nicotine Content vs. Tar Content]

Positive Association

- Above-average values of Nicotine are associated with above-average values of Tar.
- Below-average values of Nicotine are associated with below-average values of Tar.

Negative Association

Example: Outside temperature and amount of natural gas used.

- Response: Natural gas (1000 ft³)
- Explanatory: Outside temperature (°C)
- Cases: 26 days

[Scatter plot: natural gas used vs. outside temperature]

Negative Association

- Above
average values of gas are associated with below-average temperatures.
- Below-average values of gas are associated with above-average temperatures.

#### Stat 101L Lecture 11: Correlation

Linear Association

- How closely do the points on the scatter plot represent a straight line?
- The correlation coefficient gives the direction of the linear association and quantifies the strength of the linear association between two quantitative variables.

Correlation

- Standardize y: $z_y = \dfrac{y - \bar{y}}{s_y}$
- Standardize x: $z_x = \dfrac{x - \bar{x}}{s_x}$

[Scatter plot: Standardized Nicotine vs. standardized Tar]

Correlation Coefficient

$$r = \frac{\sum z_x z_y}{n - 1} = \frac{\sum (x - \bar{x})(y - \bar{y})}{s_x s_y (n - 1)}$$

Correlation Conditions

- Correlation applies only to quantitative variables.
- Correlation measures the strength of linear association.
- Outliers can distort the value of the correlation coefficient.

Correlation Coefficient: Tar and nicotine

$$r = \frac{\sum z_x z_y}{n - 1} = \frac{22.9437}{25 - 1} = 0.956$$

There is a strong positive correlation (linear association) between the tar content and nicotine content of the various cigarette brands.

JMP: Analyze → Multivariate Methods → Multivariate; Y, Columns: Tar, Nicotine.

Correlation Properties

- The sign of r indicates the direction of the association.
- The value of r is always between $-1$ and $+1$.
- Correlation has no units.
- Correlation is not affected by changes of center or scale.

Correlation Cautions

- Correlation and association are different. Correlation is specific (linear); association is vague (trend).
- Don't correlate categorical variables.
- Don't confuse correlation with causation. There is a strong positive correlation between the number of crimes committed in communities and the number of 2nd graders in those communities.
- Beware of lurking variables.

#### Stat 101L Lecture 39: Categorical Data

- In Chapter 3 we introduced the idea of categorical data.
- In Chapter 15 we explored probability rules and when events are independent.
- In Chapter 26 we put these two ideas together to compare counts.
Categorical Data

National Opinion Research Center's General Social Survey: In 1996 a sample of 1895 adults in the US were asked the question "When is premarital sex wrong?" The participants were also asked with what religion they were affiliated.

Who, What

- Who: a sample of 1895 adults.
- What: attitude towards premarital sex; religious affiliation.
- When is premarital sex wrong? Categorical: Always, Almost Always, Sometimes, Never.
- What is your religious affiliation? Categorical: Catholic, Jewish, Protestant, None, Other.

[Table: When is Premarital Sex Wrong? Counts by religion for each response from Always to Never, with totals]

[Mosaic plot: attitude by religion (Catholic, Jewish, Other, Protestant, ...)]

Comparing Counts

- People who have no religion or are Jewish are more likely to say premarital sex is never wrong.
- Protestants are much more likely to say premarital sex is always wrong.
- Are these differences statistically significant?
- Or are these differences due to chance variation; that is, religion and attitude towards premarital sex are independent?

If religion and attitude towards premarital sex are independent, then $\Pr(A \text{ and } B) = \Pr(A)\Pr(B)$, where A is a religion category and B is an attitude category.

Expected Count

If religion and attitude toward premarital sex are independent, we would expect to see $n \Pr(A)\Pr(B)$ people in the religion category A and the attitude category B.

Catholic and Always:

$$E = 1895 \left(\frac{445}{1895}\right)\left(\frac{452}{1895}\right) = \frac{(445)(452)}{1895} = 106.1$$

[Table: observed and expected counts by attitude (Always, Almost Always, ..., Never), with totals]

- Take the difference between the observed and expected counts in a cell.
- Square the difference.
- Divide by the expected count.
- Sum up over all the cells.

Chi-square Test Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}, \qquad df = (r - 1)(c - 1)$$

Cell contribution to $\chi^2$, Catholic and Always:

$$\frac{(62 - 106.1)^2}{106.1} = \frac{(-44.1)^2}{106.1} = 18.33$$

Test of Independence

- $H_0$: Religion and attitude towards premarital sex are independent.
- $\chi^2 = 157.017$, $df = (5 - 1)(4 - 1) = 12$
- P-value < 0.0001

Because the P-value is so small,
we reject the null hypothesis. Religion and attitude towards premarital sex are not independent.

#### Stat 101L Lecture 6: Quantitative Data

- Who: Cans of cola
- What: Weight (g) of contents

368, 351, 355, 367, 352, 369, 370, 369, 370, 355, 354, 357, 366, 353, 373, 365, 355, 356, 362, 354, 353, 378, 368, 349

[Histograms: Weight of Contents of Cans of Cola; Frequency vs. Weight (grams), shown with two different choices of bins]

Weight of Contents

- Who: Cans of cola
- What: Weight of contents (g); Type of cola (Regular or Diet)

| Regular Cola (stem: leaves) | Diet Cola (stem: leaves) |
|---|---|
| 36: 2 | 34: |
| 36: 5678899 | 34: 9 |
| 37: 003 | 35: 123344 |
| 37: 8 | 35: 55567 |

Comparing Distributions: How do the distributions compare in terms of shape, center and spread?

Comparing Groups

| | Regular | Diet |
|---|---|---|
| Min | 362 g | 349 g |
| QL | 366.5 g | 352.5 g |
| Med | 368.5 g | 354 g |
| QU | 370 g | 355 g |
| Max | 378 g | 357 g |

[Box plots: Weight (grams) by Type of Cola]

| | Regular | Diet |
|---|---|---|
| Med | 368.5 g | 354 g |
| Mean | 368.8 g | 353.7 g |
| Range | 16 g | 8 g |
| IQR | 3.5 g | 2.5 g |
| Std dev | 4.03 g | 2.23 g |

JMP

- The data table is arranged so that rows are cases (Who) and columns are variables (What).
- Before entering data into JMP, answer the questions Who? and What?

[Screenshot: JMP data table]

JMP Analyze: Analyze → Distribution; Y, Columns: Weight.

JMP Output: Distribution → Stack. Weight → Display Options → Horizontal Layout; Histogram Options → Count Axis.

JMP will automatically select the bins. You can change these by right-clicking on the Weight axis and choosing Axis Settings: Minimum 340, Maximum 380, Increment 10.

[JMP output: Distributions of Weight, with histogram, Quantiles and Moments panels; mean 361.2 g]

JMP Analyze: Analyze → Distribution; Y, Columns: Weight; By: Type of Cola.

JMP
Output: Distribution → Uniform Scaling, Stack. Weight → Display Options → Horizontal Layout; Histogram Options → Count Axis.

[JMP output: Distributions of Weight by Type of Cola, with Quantiles and Moments panels for each type]

JMP Analyze: Analyze → Fit Y by X; Y, Response: Weight; X, Factor: Type of Cola. Note: Y is numerical (continuous), X is character (nominal).

JMP Output: One-way analysis of Weight by Type of Cola.

- Display Options: Box Plots, Mean Lines, Grand Mean.
- Highlight (click on; hold down shift if more than one) potential outliers.
- Means and Std Dev.

[Table: Means and Std Dev by type of cola. Diet: n = 12, mean 353.67, std dev 2.23. Regular: n = 12, mean 368.75, std dev 4.03]

#### Stat 101L Lecture 23: Random Variables

Discrete variables

- Numerical values associated with elements in a sample space.
- Only distinct (discrete) points on the number line.

The Deal, Continued

- Bag o' chips (poker chips). Some are red, some are white, some are blue.
- Draw a chip from the bag.
- Draw a blue chip: get 3 Extra Credit points.
- Draw a red chip: get 1 Extra Credit point.
- Draw a white chip: lose 1 Extra Credit point.

Discrete Random Variable: X = number of bonus points

| x | −1 | 1 | 3 |
|---|---|---|---|
| P(x) | 0.60 | 0.10 | 0.30 |

[Bar chart: probability distribution of X]

Discrete RV

- Property 1: $0 \le P(x) \le 1$
- Property 2: $\sum P(x) = 1$

Mean of a Discrete RV: The center of the distribution of values, found as a weighted average of the values.

$$\mu = \sum x P(x)$$

| x | −1 | 1 | 3 |
|---|---|---|---|
| P(x) | 0.60 | 0.10 | 0.30 |
| xP(x) | −0.60 | 0.10 | 0.90 |

$\mu = 0.4$

Variance of a Discrete RV: The spread of the distribution of values, found as a weighted average of the squared deviations from the mean.

$$\sigma^2 = \sum (x - \mu)^2 P(x)$$

| x | −1 | 1 | 3 |
|---|---|---|---|
| $(x - \mu)^2$ | $(-1.4)^2$ | $(0.6)^2$ | $(2.6)^2$ |
| P(x) | 0.60 | 0.10 | 0.30 |
| $(x - \mu)^2 P(x)$ | 1.176 | 0.036 | 2.028 |

$\sigma^2 = 3.24$
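The mean and variance calculations for the bonus-point random variable X can be checked with a few lines of Python. This is a minimal sketch and not part of the original notes: the variable names are illustrative, and the values and probabilities are the ones given in the table for X.

```python
import math

# Values of X (bonus points) and their probabilities, from the lecture table.
values = [-1, 1, 3]
probs = [0.60, 0.10, 0.30]

# Property 2: the probabilities must sum to 1.
assert math.isclose(sum(probs), 1.0)

# Mean: weighted average of the values.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: weighted average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```

The printed mean and variance should agree with the hand calculation (0.4 and 3.24). The standard deviation is not worked out on the slides; it is simply the square root of the variance.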
Rake It In

- If all 1,440,000 tickets are sold and if all prizes are claimed, the Iowa Lottery will pay out \$824,400.
- Mean payout: μ = \$824,400 / 1,440,000 = \$0.5725
- This means the Iowa Lottery pays out, on average, under 60 cents for every \$1 ticket sold.

#### Stat 101L Lecture 4: Quantitative Data

For a Statistics project, students weighed the contents of cans of cola. In 2000, 24 cans of cola were weighed full and empty. The difference (full − empty) is the weight of the contents. The units are grams.

- Who: Cans of cola
- What: Weight (g) of contents

368, 351, 355, 367, 352, 369, 370, 369, 370, 355, 354, 357, 366, 353, 373, 365, 355, 356, 362, 354, 353, 378, 368, 349

Weight of Contents: What can we say about the weight of contents of a can of cola?

- Variation
- Smallest value
- Largest value
- Middle value

Display of Data

- Stem and Leaf Display (or Stem Plot): orders the data and creates a display of the distribution of values.
- Histogram: a picture of the distribution of the data. Collects values into bins. Bins should be of equal width. Different bin choices can yield different pictures.

[Histogram: Frequency vs. Measurement]

Constructing a Histogram

- Order data from smallest to largest, using a stem and leaf display.
- Determine bins: equal width; more data means more bins.

[Histogram: Weight of Contents of Cans of Cola; Frequency vs. Weight (grams)]

Shape

- Symmetry: mounded, flat
- Skew: right, left
- Other: multiple peaks, outliers

Symmetric, mounded in middle: [Histogram of Octane Rating]

Skew right: [Histogram of pH of Pork Loins]

Skew left: [Histogram: index values for adult men]

Multiple peaks: [Histogram of Size of Diamonds (carats)]

Center

- A typical value.
- Summary of the whole batch of numbers.
- For symmetric distributions, easy.

[Histogram of Octane Rating]

Spread

- Variation matters.
- Tightly clustered or spread out.
- Low and high values.

Numerical Summaries
Weights of contents of cans of cola:

| Stem | Leaves |
|---|---|
| 34 | 9 |
| 35 | 12334455567 |
| 36 | 25678899 |
| 37 | 0038 |

Numerical Summaries

- What is a typical value?
- Look for the center of the distribution.
- What do we mean by center?

Measures of Center

- Sample Midrange: average of the minimum and the maximum, $(349 + 378)/2 = 363.5$ grams. Greatly affected by outliers.
- Sample Median: a value that divides the data into a lower half and an upper half. About half the data values are greater than the median, about half are less than the median.

[Stem-and-leaf display of the weights, split into lower and upper halves]

Median $= (357 + 362)/2 = 359.5$ grams

#### Stat 101L Lecture 34: Test of Hypothesis for μ

Could the population mean heart rate of young adults be 70 beats per minute, or is it something higher?

Step 1: State your null and alternative hypotheses. $H_0: \mu = 70$, $H_A: \mu > 70$

Step 2: Check conditions. Randomization condition: met. 10% condition: met. Nearly normal condition: met.

Step 3: Calculate the test statistic and convert to a P-value.

$$t = \frac{\bar{y} - \mu_0}{SE(\bar{y})}, \qquad SE(\bar{y}) = \frac{s}{\sqrt{n}}$$

Summary of Data

- $n = 25$
- $\bar{y} = 74.16$ beats
- $s = 5.375$ beats
- $SE(\bar{y}) = 1.075$ beats

Value of Test Statistic

$$t = \frac{\bar{y} - \mu_0}{SE(\bar{y})} = \frac{74.16 - 70}{1.075} = 3.87$$

Use Table T to find the P-value.

Table T

| One-tail probability | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 |
|---|---|---|---|---|---|
| df = 24 | 1.318 | 1.711 | 2.064 | 2.492 | 2.797 |

The test statistic 3.87 is beyond 2.797, so the P-value is less than 0.005.

Step 4: Use the P-value to reach a decision. The P-value is very small, therefore we should reject the null hypothesis.

Step 5: State your conclusion within the context of the problem. The mean heart rate of all young adults is more than 70 beats per minute.

Alternatives

- $H_0: \mu = \mu_0$
- $H_A: \mu < \mu_0$: one-tail probability, Pr < t
- $H_A: \mu > \mu_0$: one-tail probability, Pr > t
- $H_A: \mu \neq \mu_0$: two-tail probability, Pr > |t|

JMP: Analyze → Distribution → Test Mean.

Example: What is the mean alcohol content of beer? A random sample of 10 beers is taken and the alcohol content is measured.

Data: Alcohol: 5.19,
4.76, 4.32, 4.53, 4.79

Test of Hypothesis for μ

Step 1: State your null and alternative hypotheses. $H_0: \mu = 5$, $H_A: \mu \neq 5$

Step 2: Check conditions. Randomization condition: met. 10% condition: met. Nearly normal condition: met.

Step 3: Calculate the test statistic and convert to a P-value.

$$t = \frac{\bar{y} - \mu_0}{SE(\bar{y})}, \qquad SE(\bar{y}) = \frac{s}{\sqrt{n}}$$

Test statistic:

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{4.762 - 5}{0.314/\sqrt{10}} = \frac{-0.238}{0.0993} = -2.397$$

Table T

| Two-tail probability | 0.20 | 0.10 | 0.05 | 0.02 |
|---|---|---|---|---|
| df = 9 | 1.383 | 1.833 | 2.262 | 2.821 |

$|t| = 2.397$ falls between 2.262 and 2.821, so the P-value is between 0.02 and 0.05.

Step 4: Use the P-value to reach a decision. The P-value is smaller than 0.05, therefore we should reject the null hypothesis.

Step 5: State your conclusion within the context of the problem. The population mean alcohol content of beer is not 5.

[JMP output: Moments (mean 4.762, std dev 0.3142) with histogram and Test Mean results]

Confidence Interval for μ

$$\bar{y} - t^* SE(\bar{y}) \ \text{ to } \ \bar{y} + t^* SE(\bar{y})$$

$$4.762 - 2.262\left(\frac{0.314}{\sqrt{10}}\right) \ \text{ to } \ 4.762 + 2.262\left(\frac{0.314}{\sqrt{10}}\right)$$

$4.762 - 0.225$ to $4.762 + 0.225$, that is, 4.537 to 4.987.

Interpretation

- We are 95% confident that the population mean alcohol content of beer is between 4.537 and 4.987.
- The population mean alcohol content of beer could be any value between 4.537 and 4.987.
- If we repeat the procedure that produces a confidence interval, 95% of intervals produced will capture the population mean.

#### Stat 101L Lecture 8: Normal Models

- Our conceptualization of what the distribution of an entire population of values would look like.
- Characterized by population parameters $\mu$ and $\sigma$.

[Histogram: sample of heights, Percent vs. Height (inches)]

Describe the sample

- Shape is symmetric and mounded in the middle.
- Centered at 60 inches.
- Spread between 45 and 75 inches.
- 30% of the sample is between 60 and 65 inches.

Normal Models

- Our conceptualization of what the distribution of an entire population of values would look like.
- Characterized by a bell-shaped curve with population parameters:
- Population mean $\mu$
- Population standard
deviation $\sigma$

[Figures: Sample Data histogram and Normal Model curve; Density vs. Height (inches)]

Normal Model

- Population: all items of interest. Example: all children. Variable: Height.
- Sample: a few items from the population. Example: 550 children.

Normal Model: Height

- Center: population mean $\mu = 60$ in.
- Spread: population standard deviation $\sigma = 6$ in.

68-95-99.7 Rule

For Normal Models:

- 68% of the values fall within 1 standard deviation of the mean.
- 95% of the values fall within 2 standard deviations of the mean.
- 99.7% of the values fall within 3 standard deviations of the mean.

Normal Model: Height

- 68% of the values fall between $60 - 6 = 54$ and $60 + 6 = 66$.
- 95% of the values fall between $60 - 12 = 48$ and $60 + 12 = 72$.
- 99.7% of the values fall between $60 - 18 = 42$ and $60 + 18 = 78$.

From Heights to Percentages

- What percentage of heights fall above 70 inches?
- Draw a picture.
- How far away from the mean is 70 in terms of number of standard deviations?

[Figure: Normal Model density curve for Height (inches), with the area above 70 inches shaded]

#### Stat 101L Lecture 21: Probability

- Subjective (personal): based on feeling or opinion.
- Empirical: based on experience.
- Theoretical (formal): based on assumptions.

The Deal

- Bag o' chips (poker chips). Some are red, some are white, some are blue.
- Draw a chip from the bag.
- Draw a blue chip: give everyone 3 Extra Credit Points.
- Draw a red chip: give everyone 1 Extra Credit Point.
- Draw a white chip: take 1 Extra Credit Point away from everyone.

Is this a good deal?

- Subjective (personal) probability: based on your beliefs and opinion.
- Empirical probability: based on experience. Conduct a series of trials. Each trial has an outcome (R, W, B).

Empirical Probability: Look at the long-run relative frequency of each of the outcomes (Blue, Red, White).

Theoretical Probability

- Look in the bag and see how many blue chips, red chips and white chips there are.
- Assumption: each chip has the same probability of being chosen (equally likely).

Law of Large Numbers: For repeated
independent trials, the long-run relative frequency of an outcome gets closer and closer to the true probability of the outcome.

Formal Probability

- A probability is a number between 0 and 1.
- "Something has to happen" rule: the probability of the set of all possible outcomes of a trial must be 1.
- Event: a collection of outcomes. Example: win extra credit points (Blue or Red chip).
- Complement rule: the probability an event occurs is 1 minus the probability that it doesn't occur, $P(A) = 1 - P(A^c)$.
- Disjoint events: no outcomes in common.
- Addition Rule for disjoint events: $P(A \text{ or } B) = P(A) + P(B)$. For example, $P(\text{Blue or Red}) = P(\text{Blue}) + P(\text{Red})$.
- Independent trials: multiplication rule for independent trials, $P(\text{outcome 1st and outcome 2nd}) = P(\text{outcome 1st}) \, P(\text{outcome 2nd})$.

Example: What is the chance that two draws in a row will result in everyone getting extra credit points?

- $P(\text{Win 1st and Win 2nd}) = P(\text{Win 1st}) \, P(\text{Win 2nd})$
- $P(\text{Win 1st}) = P(\text{Blue or Red}) = P(\text{Blue}) + P(\text{Red})$
- $P(\text{Win 2nd}) = P(\text{Blue or Red}) = P(\text{Blue}) + P(\text{Red})$

#### Stat 101L Lecture 16: Goal 3, Straighten Up

What is the relationship between the temperature of coffee and the time since it was poured?

- Y: temperature (°F)
- X: time (minutes)

[Scatter plot: Bivariate Fit of Temp By Time (min)]

Cooling Coffee: There is a general negative association. As time since the coffee was poured increases, the temperature of the coffee decreases.

Linear Model

[Plot: Temp vs. Time (min) with Linear Fit]

Summary

- Predicted Temp = 176.7 − 1.56 × Time
- On average, temperature decreases 1.56 °F per minute.
- $R^2 = 0.997$: 99.7% of the variation in temperature is explained by the linear relationship with time.

Plot of Residuals

[Plot: Residual vs. Time (min)]

Curved Pattern

- There is a clear pattern in the plot of residuals versus time: under-predict, over-predict, under-predict.
- The linear fit is very good, but we can do better.

[Scatter plot: Bivariate Fit of LogTemp By Time (min), with Linear Fit]

LogTemp by Time

Summary: Predicted
LogTemp = 5.1946 − 0.0114 × Time. On average, log temperature decreases 0.0114 log °F per minute.

Plot of Residuals

[Plot: Residual vs. Time (min) for the LogTemp fit]

Interpretation

- There is a random scatter of points around the zero line.
- The linear model relating LogTemp to Time is the best we can do.

Original Scale

- Predicted LogTemp = 5.1946 − 0.0114 × Time
- Predicted Temp $= 180.36 \, e^{-0.0114 \, \text{Time}}$
- Predicted temp at time = 0: 180.3 °F
- The predicted temp in one more minute is the predicted temp now multiplied by $e^{-0.0114} = 0.98866$.

JMP Method 1

- Create a new column in JMP, LogTemp: Cols → Formula → Transcendental → Log.
- Fit Y by X: Y = LogTemp, X = Time; Fit Linear.

JMP Method 2

- Fit Y by X: Y = Temp, X = Time; Fit Special → Transform Y → Log.
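The back-transformation from the log scale to the original scale can be sketched in Python. This is an illustrative sketch under stated assumptions, not the JMP procedure from the notes: the function name `predicted_temp` and the constant names are hypothetical, "Log" is taken to be the natural logarithm (consistent with the factor $e^{-0.0114}$ in the notes), and the coefficients are the rounded values reported for the LogTemp fit.

```python
import math

# Rounded coefficients of the fitted line for LogTemp, as reported in the notes.
INTERCEPT = 5.1946   # predicted log temperature at time 0
SLOPE = -0.0114      # change in log temperature per minute

def predicted_temp(minutes):
    """Back-transform the linear prediction for log(Temp) to degrees F."""
    return math.exp(INTERCEPT + SLOPE * minutes)

# Each additional minute multiplies the predicted temperature by exp(SLOPE).
per_minute_multiplier = math.exp(SLOPE)

for t in (0, 1, 10, 30):
    print(f"time = {t:2d} min, predicted temp = {predicted_temp(t):.1f} F")
print(f"multiplier per minute = {per_minute_multiplier:.5f}")
```

Because the coefficients are rounded, the predicted temperature at time 0 comes out near 180.3 °F, slightly different from the 180.36 multiplier quoted in the notes. The ratio of predictions one minute apart equals the per-minute multiplier, which is why the fitted curve on the original scale is exponential decay rather than a straight line.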
