Introduction to Applied Statistics

by: Blair Williamson

Introduction to Applied Statistics STAT 5005


Ofer Harel

About this Document

Ofer Harel
Class Notes

These 100-page class notes were uploaded by Blair Williamson on Thursday, September 17, 2015. They belong to STAT 5005 at the University of Connecticut, taught by Ofer Harel in Fall. Since their upload, they have received 81 views. For similar materials see /class/205899/stat-5005-university-of-connecticut in Statistics at University of Connecticut.



Date Created: 09/17/15
Introduction to Applied Statistics, Stat 5005
Chapter 8
Ofer Harel, Department of Statistics, University of Connecticut

Inferences about several population central values

- Very often the comparison of two populations is a simplification.
- When we use covariates of interest we can consider many different populations.
- If we have a race variable in our data, we will compare African-American, Anglo-American and Hispanic groups.
- How do we deal with this type of setup? Are the means different?
- We are going to answer that via analysis of variance (ANOVA).

When we wanted to compare 2 means we used

$$t = \frac{\bar y_1 - \bar y_2}{s_p\sqrt{1/n_1 + 1/n_2}}, \qquad \text{where} \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}.$$

We want to extend this notion to more than 2 means. Consider five populations such that

Population         I            II           III          IV           V
Sample mean        $\bar y_1$   $\bar y_2$   $\bar y_3$   $\bar y_4$   $\bar y_5$
Sample variance    $s_1^2$      $s_2^2$      $s_3^2$      $s_4^2$      $s_5^2$
Sample size        $n_1$        $n_2$        $n_3$        $n_4$        $n_5$

If we want to test the equality of the means we need to test $\mu_1 = \mu_2 = \dots = \mu_5$.

Can we do it by pairwise comparisons? For that we would have to run 10 t-tests, with null hypotheses

$\mu_1 = \mu_2$, $\mu_1 = \mu_3$, $\mu_1 = \mu_4$, $\mu_1 = \mu_5$, $\mu_2 = \mu_3$,
$\mu_2 = \mu_4$, $\mu_2 = \mu_5$, $\mu_3 = \mu_4$, $\mu_3 = \mu_5$, $\mu_4 = \mu_5$.

Ten tests are a lot of work, and they also inflate the Type I error rate. The overall Type I error can reach 0.40 in this case (if the 10 tests were independent, $1 - 0.95^{10} \approx 0.40$), which is much more than the intended 0.05. We are interested in ONE test for the previous hypothesis, with a set error level $\alpha$.

The ANOVA test is developed under the following assumptions:
- All the populations have normal distributions.
- All variances are considered equal: $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_5^2 = \sigma^2$.
- The five sets of measurements are independent samples from their respective populations.

The within-sample variance can be estimated by

$$s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \dots + (n_5-1)s_5^2}{n_1 + n_2 + \dots + n_5 - 5},$$

which is an extension of the pooled variance estimate.

Next we need to consider the variation between the sample means. If all the means are the same, drawing 5 samples is equivalent to drawing a single sample from the grand population. We need to find the variance between the means. Since all populations
are normal, the sampling distributions for the means are normal, and we would estimate $\sigma^2_{\bar y}$ by $\sum_{i=1}^{5} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 / (5-1)$. With a common sample size $n$ we have $\sigma^2_{\bar y} = \sigma^2/n$, so it follows that

$$s_B^2 = \frac{n \sum_{i=1}^{5} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}{5-1}$$

estimates $\sigma^2$.

We now have two different estimates of the variance, $s_W^2$ and $s_B^2$, and we can compare them. If all means are equal we should expect $s_B^2 / s_W^2 \approx 1$. It follows that the test statistic $F = s_B^2 / s_W^2$ has an F distribution with $t-1$ and $n_T - t$ degrees of freedom.

In order to show how to solve similar problems we need to represent the problem mathematically. We first assume a completely randomized design, in which each sample from the $t$ populations is a random sample. We introduce the following notation:

- $y_{ij}$: the $j$th observation from the $i$th population
- $n_i$: number of observations from population $i$
- $n_T$: total number of observations, $n_T = \sum_i n_i$
- $\bar y_{i\cdot}$: the average of the observations from sample $i$, $\bar y_{i\cdot} = \sum_j y_{ij} / n_i$
- $\bar y_{\cdot\cdot}$: the average of all observations
- $s_T^2$: grand sample variance estimate

Let the total sum of squares be

$$TSS = \sum_{i=1}^{t}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{\cdot\cdot})^2 = (n_T - 1)\, s_T^2 .$$

This represents the variation around the grand mean. We can decompose it as follows:

$$\sum_{i}\sum_{j} (y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i}\sum_{j} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i} n_i (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 .$$

The within-sample sum of squares (SSW) is defined by

$$SSW = \sum_{i=1}^{t}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2 = (n_1-1)s_1^2 + \dots + (n_t-1)s_t^2 .$$

The second part of the total SS is

$$SSB = \sum_{i} n_i (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 .$$

This is the between-samples sum of squares; the grand mean $\bar y_{\cdot\cdot}$ is used to compute it.

Using SSW and SSB we can find $s_B^2$ and $s_W^2$:

$$s_B^2 = \frac{SSB}{t-1}, \qquad s_W^2 = \frac{SSW}{n_T - t},$$

which we call the mean squares. Under the model assumptions $SSW/\sigma^2$ has a chi-square distribution with $n_T - t$ degrees of freedom, and under the null hypothesis $SSB/\sigma^2$ has a chi-square distribution with $t-1$ degrees of freedom.

For the null hypothesis that all the means are equal, against the alternative that at least one mean is different, we use the F test. The F statistic is

$$F = \frac{s_B^2}{s_W^2} = \frac{SSB/(t-1)}{SSW/(n_T - t)} \sim F_{t-1,\; n_T - t} .$$

The results can be summarized in an ANOVA table:

Source            DF         Sum of Squares   Mean Square             F test
Between samples   t - 1      SSB              s_B^2 = SSB/(t - 1)     s_B^2 / s_W^2
Within samples    n_T - t    SSW              s_W^2 = SSW/(n_T - t)
Total             n_T - 1    TSS

Example. The question is presented on page 390. The data:

Trt   Scores                         Mean     SD     Size
1     96 79 91 85 83 91 82 87        86.75    5.63   8
2     77 76 74 73 78 71 80           75.57    3.10   7
3     66 73 69 66 77 73 71 70 74     71.00    3.67   9

Hypotheses: $H_0: \mu_1 = \mu_2 = \mu_3$ versus $H_a$: at least one mean is different.

The ANOVA table is

Source            DF    SS         MS        F test   P-value
Between samples   2     1090.63    545.32    29.58    < 0.001
Within samples    21    387.17     18.44
Total             23    1477.80

The critical F value for $\alpha = 0.05$ is $F_{2,21} = 3.47$, so we reject the null hypothesis. We can only put an upper bound on the p-value, since the tabulated value 9.77 corresponds to probability 0.001 and our statistic exceeds it.

Which of the means is different? By looking at the means we can say that group 1 is the different one. The issue of multiple comparisons will not be covered here, but it is important to know the danger of using multiple comparisons without adjusting the error rates.

What happens when we reject H0?

- When we reject $H_0$, it implies that not all population/treatment group means can be considered equal. As a result, at least two of the population/treatment group means differ.
- The statistical tools that address this important question are called multiple comparison procedures, and there are a number of such procedures in the statistical literature.
- The naive approach can be adjusted to provide a test procedure with overall Type I error $\alpha$. This is known as the Bonferroni adjustment. In this procedure one uses $\alpha/k$ as the Type I error for each of the $k$ pairwise comparisons.
- Other approaches (linear contrasts, Fisher's LSD, Tukey HSD, the Student-Newman-Keuls, Dunnett and Scheffé procedures) are discussed in chapter 9 of your textbook.

In a completely randomized design, or one-way ANOVA, we assume the following:
- The samples are independent random samples.
- Each sample is selected from a normal distribution.
- The mean and variance of population $i$ are $\mu_i$ and $\sigma^2$, $i = 1, 2, \dots, t$.

The model will be

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij} .$$

Each observation $y_{ij}$ is the sum of
- $\mu$, the overall mean,
- $\alpha_i$, a population effect,
- $\varepsilon_{ij}$, the error term, or the deviation from the population mean.

Assume $\varepsilon_{ij} \sim N(0, \sigma^2)$. Then

$$\mu_i = E(y_{ij}) = E(\mu + \alpha_i + \varepsilon_{ij}) = \mu + \alpha_i + E(\varepsilon_{ij}) = \mu + \alpha_i .$$

The null hypothesis is $H_0: \mu_1 = \mu_2 = \dots = \mu_t$. Using our model, this is equivalent to testing $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_t = 0$. The alternative is $H_a$: at least one $\alpha_i$ differs from 0.

It is important to remember the assumptions (conditions) for this type of analysis. One can
use a test for equal variances (Hartley, Levene). The Hartley test depends on the normality assumption. When the sample sizes are nearly equal, departures from homogeneity of variance do not affect the estimation much. In more extreme situations, consider transformations.

The normality assumption can be evaluated via normality plots or boxplots. When the sample size is not large we can consider residual analysis. Let $e_{ij} = y_{ij} - \bar y_{i\cdot}$ be the estimated error, and use a normality plot of the $e_{ij}$'s. When the normality assumption does not hold we can consider a transformation or a nonparametric procedure.

- A transformation of sample data is a process where the original scale is converted to a new scale.
- Consider $y$ with increasing variance over treatments; transformations such as $\sqrt{y}$ might be appropriate.
- The selection of the transformation is not an easy task.
- There are many different types of transformations.
- For heterogeneous variance the following transformations may be considered: $\sqrt{y}$ or $\sqrt{y + 0.375}$; $\log y$ or $\log(y + 1)$; $\arcsin\sqrt{y}$.

Introduction to Applied Statistics, Stat 5005
Chapter 3

We can divide the statistics field into descriptive statistics and inferential statistics. In the first branch we want to describe the data. In the second we need to make some inference. In most cases we use both branches. In this chapter we will deal with descriptive statistics. The rest of the semester will focus on inference.

There are two major ways to describe data:
- graphical representation
- numerical techniques

Both can be done by hand and using computers. You will need to know BOTH ways.

Graphical representations: dotplots, bar and pie charts, histograms or frequency plots, frequency tables, stem-and-leaf plots, time plots, boxplots, scatterplots.

Dotplot. Consider $n$ numerical observations $x_1, x_2, \dots, x_n$.
- Draw a numbered line that is large enough to include all data points (you need to choose the appropriate scale).
- Plot a point (a dot, a cross) above the numbered line corresponding to the position assumed by each
point.
- When there are ties, stack the dots on top of one another.

Example: octane ratings for various gasoline blends (Copyright 2001 Alan M. Polansky)

88.5 94.7 86.7 90.3 90.9 88.7 92.7 90.4 92.2 87.7 91.1 88.2 93.4
89.0 89.9 91.8 89.3 91.2 83.4 91.0 90.8 88.5 96.1 89.8 91.6 89.7
91.0 87.5 94.2 88.3 90.1 93.3 89.6 90.4 90.3 92.2 91.5 87.8 98.8
89.2 91.8 87.4 91.1 91.6 90.0 88.6 89.9 94.2 88.3 92.3 88.9 92.6
90.5 90.7 100.3 95.6 88.3 92.7 85.3 90.4 91.2 89.9 93.7 87.6 93.2
87.9 90.1 89.3 90.6 92.7 93.3 84.3 91.0 88.6 93.0 94.4 91.1 92.2

[Figure: dotplot of the octane data, axis from 84.0 to 101.5]

Qualitative and quantitative data sets often contain a large or moderate number of observations, and we want to extract some useful information from such a data set. One way to summarize such a data set is to group the observations into categories or classes, if necessary, and then report the number of observations in each class.

There are many ways to define the class intervals. We will:
- divide the range by the number of classes needed and round this number;
- use the result as the width of the class interval;
- make the interval endpoints more precise than the measurements, so that no observation falls on a class boundary.

Organizing data sets

Example 1. The increasing emphasis on exercise has resulted in a near-avalanche of sports-related injuries. Consider the accompanying data set, in which the type of injury for each of 82 incidents was recorded, where the following coding is used: Sp = sprain, Co = contusion, St = strain, Di = dislocation, L = laceration, Cn = concussion, F = fracture, Ch = chronic, De = dental.

Sp Co Sp F Co Sp Co Co Cn Sp Sp F L Cn Sp F Ch F Co St
St Co Sp F Co F Sp F De Sp Sp Co L Co Co Sp Ch Sp Co St
Sp Sp Ch Sp F Sp Co Sp St Sp L Di F Sp Co Co Sp Co St St
Co Sp F Sp Di F F F Di St F Ch St L

Example 2. The trace element zinc is an important dietary constituent, partly because it aids in the maintenance of proper immune response. The following data represent the zinc intake (mg/MJ) of 40 patients with rheumatoid arthritis.

8.0 10.4 11.8 10.8 12.9 15.7 13.0 10.7 13.0 13.6
9.5 11.5 8.9 19.3 8.1 16.1 10.1 7.3 9.9 8.5
6.9 11.5 9.9 8.8 11.1 11.1 11.2 10.7 10.9 10.7
13.6 6.2 6.8 4.9 8.1 7.4 18.8 8.8 4.8 15.7

Organizing data sets with frequency tables

Example 1: relative frequency for the types of sports-related injury

     Category       Frequency   Relative frequency
1    Sprain         22          0.268 (22/82)
2    Contusion      18          0.220 (18/82)
3    Fracture       17          0.207 (17/82)
4    Strain         9           0.110 (9/82)
5    Laceration     6           0.073 (6/82)
6    Chronic        4           0.049 (4/82)
7    Dislocation    3           0.037 (3/82)
8    Concussion     2           0.024 (2/82)
9    Dental         1           0.012 (1/82)
     Total          82          1.000

Example 2: relative frequency of zinc intake (mg/MJ)

Class interval   Frequency   Relative frequency
3 to < 6         2           0.050
6 to < 9         12          0.300
9 to < 12        16          0.400
12 to < 15       5           0.125
15 to < 18       3           0.075
18 to < 21       2           0.050
Total            40          1.000

The smallest observation is 4.8 and the largest is 19.3, so the range is 19.3 - 4.8 = 14.5. Assume we want 6 intervals: 14.5/6 is approximately 2.4, which we round to 3.0. It seems reasonable to start the first class interval at 3.0 and give each interval a width of 3.0. This gives the class intervals 3 to < 6, 6 to < 9, 9 to < 12, 12 to < 15, 15 to < 18 and 18 to < 21.

Histogram. A histogram is a barplot of the frequencies (or relative frequencies) of each class in the data set.
- Bar width corresponds to the class width.
- Bar height corresponds to the class frequency or relative frequency.
- Each observation must belong to one and only one class.
- Histogram axes and class centers should be clearly labeled, particularly if the class sizes are different.

[Figure: barplot of sports-related injuries; categories sprain, contusion, fracture, strain, laceration, chronic, dislocation, concussion, dental; vertical axis: frequency]

[Figure: histogram of zinc intake; horizontal axis: zinc intake; vertical axis: frequency]

Stem-and-leaf
- Separate each observation into its leading digit (stem) and trailing digit (leaf).
- List the stems vertically in increasing order, draw a vertical line to the right of the stems, and add the leaves to the right of the line.
- Arrange the leaves in increasing order.

Read pages 54-57 of your book for guidelines for constructing stem-and-leaf plots.

Good for:
- identifying peaks and typical values of the variable measured;
- identifying the center of the observations;
- giving an idea about the spread in the data;
- telling if there are gaps in the data;
- providing information about the extent of symmetry and asymmetry.

Not good for large data sets.

Stem-and-leaf and histogram
- Good for displaying key shape aspects of the distribution of the data: symmetry, skewness, clustering, and the mound-shaped aspect of relative frequency plots.
- Good also for displaying unimodality (one peak) and bimodality or multimodality (two or more peaks).
- Not good for making comparisons among data sets.
- Too many classes can hide information about where the data are most concentrated and about variability. Too few classes hide information about central location.
- You should be able to draw and, more importantly, to interpret both plot types.

Modes: a mode is a peak on a histogram. A distribution can be unimodal, bimodal or multimodal.
Skewness vs. symmetry: a distribution can be symmetric, skewed to the right or skewed to the left.
Other types of distributions: bell shaped, triangular, rectangular (uniform).

Scatterplot and time plot. A scatterplot displays associations between two variables. It is a plot of paired data points on a rectangular coordinate system. A time plot is no different from a scatterplot, except that the x-axis is a time variable. This graph shows trends in time.

We can summarize the data using numerical descriptives. There are two main types of measures:
- central tendency measures
- variability measures

Consider a population of size $N$ with outcomes $X_1, X_2, \dots, X_N$. Order these values from the smallest to the largest to get the sorted population outcomes $X_{(1)} \le X_{(2)} \le \dots \le X_{(N)}$.

Population mean, denoted $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Population median, denoted $\theta_{0.5}$:

$$\theta_{0.5} = \begin{cases} X_{((N+1)/2)} & N \text{ odd} \\ \dfrac{X_{(N/2)} + X_{(N/2+1)}}{2} & N \text{ even} \end{cases}$$

Central tendency measures
- Population mode: the most frequently observed outcome in the population.
- Sample mode: the observation in the data set that occurs the most. The mode may not be unique or may not exist.
- The sample mean can be regarded as the
balancing point of a data set. The sample mean can be affected by outliers.
- Trimmed mean: a fixed percentage of the highest and lowest values is discarded before averaging.
- The sample median can be regarded as the balancing point of the corresponding ordered data set.

Consider $n$ data points $x_1, x_2, \dots, x_n$. Order these data points from the smallest to the largest to get the sorted observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$.

Sample mean, denoted $\bar x$:

$$\bar x = \frac{\sum_{i=1}^{n} x_i}{n}$$

Sample median, denoted $m$ (the middle point):

$$m = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & n \text{ even} \end{cases}$$

Measures of variability

By variability we mean how different and distant the data points are from one another. The range is the distance between the largest and smallest observation. In general, more variability implies a larger range. However, variability is more than the distance between two extreme points. Variability is a characteristic of the entire data set, therefore each observation contributes to the variability.

The $p$th percentile is the value that has at most $p\%$ of the measurements below it and at most $(100 - p)\%$ above it.

Population variance, denoted $\sigma^2$:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Population standard deviation, denoted $\sigma$. Sample variance, denoted $s^2$:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n-1}$$

Variation reflects the spread of the data around the central location measure.
- Deviation: the distance of the $i$th observation $x_i$ from the center of the data set $\bar x$, that is $d_i = x_i - \bar x$.
- $\sum_{i=1}^{n} d_i^2$ is the sum of squared deviations from the sample mean.
- $s^2$ is a better measure of variability than the range.

Why do we use $n-1$ rather than $n$ in calculating the variance? Because the sum of the deviations is always zero, the last deviation can be found once we know the other $n-1$ deviations. So we are really averaging $n-1$ unrelated numbers instead of $n$. For large $n$ it does not make much difference.

- Sample range: $R = x_{(n)} - x_{(1)}$. Not used much in practice. Sometimes we approximate $s$ by $s \approx \text{range}/4$.
- Interquartile range (IQR): the difference between the 75th percentile and the 25th percentile.
- Sample mean absolute deviation: $\sum_{i=1}^{n} |x_i - \bar x| / n$.
- Sample standard deviation: $s$.

Sample statistics   Population parameters
$\bar x$            $\mu$
$s^2$               $\sigma^2$
$s$                 $\sigma$
$Q_L$               $\theta_{0.25}$
$Q_M$               $\theta_{0.50}$
$Q_U$               $\theta_{0.75}$

The empirical rule is a rule of thumb that applies to samples with frequency distributions that are mound-shaped and nearly symmetric.
- Approximately 68% of the measurements will fall within 1 standard deviation of the mean, i.e. $\bar x \pm s$.
- Approximately 95% of the measurements will fall within 2 standard deviations of the mean, i.e. $\bar x \pm 2s$.
- Essentially all of the measurements will fall within 3 standard deviations of the mean, i.e. $\bar x \pm 3s$.

Simple boxplot or box-and-whiskers plot

Consider $n$ data points $x_1, x_2, \dots, x_n$, and order them from the smallest to the largest to get the sorted observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$.
- Sample median, denoted $m$: the middle point, also called the second quartile.
- Lower quartile, $Q_L$: the median of the lower half of the sample, also called the first quartile.
- Upper quartile, $Q_U$: the median of the upper half of the sample, also called the third quartile.
- Interquartile range: $IQR = Q_U - Q_L$.

To draw the plot:
- Draw a horizontal (or vertical) line to represent the scale of the measurements.
- Construct a rectangular box of length IQR whose left (or lower) edge is at the lower quartile and whose right (or upper) edge is at the upper quartile.
- Draw a vertical (or horizontal) line segment inside the box at the location of the median.
- Extend horizontal (or vertical) line segments from each end of the box to the smallest and largest observations in the data set. These lines are called whiskers.

Example. The following data are diameters (in mm) of holes in a group of 12 wing leading edge ribs for a commercial transport airplane:

120 120.4 120.7 120.9 120.2 121.1 120.3 120.1 120.9 121.3 120.5 120.8

[Figure: boxplot of the hole diameters, axis from 120.0 to 121.2]

Interpretation of a boxplot

Modified boxplots are five-number summary plots: the median $Q_M$, the lower quartile $Q_L$, the upper quartile $Q_U$, and the outlier thresholds $Q_L - 1.5 \times IQR$ and $Q_U + 1.5 \times IQR$.

Examine the length of the box. The IQR is a measure of variability. If the length is short there is little
variability, and if the box is long there is more variability.
- Visually compare the lengths of the whiskers. If one is clearly longer, the distribution of the data is probably skewed in the direction of the longer whisker.
- Any observation falling more than $1.5 \times IQR$ above the upper quartile or below the lower quartile is called a suspected outlier (or outlier) and needs to be plotted individually.
- Any observation falling more than $3 \times IQR$ above the upper quartile or below the lower quartile is called an extreme outlier and needs to be plotted individually.

[Figure: boxplot of the zinc intake data]

- Analyze any outlier points carefully. Fewer than 5% of the data should qualify as outliers, even for very skewed data.
- An outlier can mean that the measurement is incorrect, that the measurement belongs to a population different from the target population, or that it is a correct measurement which happens to be in the extreme tail of the possible population measurements. It needs to be investigated.
- In general, the boxplot and its five-number summary are more meaningful when used to compare two or more populations.
- A boxplot is good for providing information about center, spread (variability), symmetry and skewness (a measure of the extent of departure from symmetry), and for detecting outliers.
- The five-number summary values are robust, in the sense that they are resistant to the presence of outliers.

Introduction to Applied Statistics, Stat 5005
Introduction

Time and place. Classes will be held:
- M 2:00-4:00, CLAS 344
- W 2:00-3:00, CLAS 344

Instructor: Ofer Harel
- Office: CLAS 320
- Phone: 486-6989
- Office hours: M and W 9:00-10:00, or by appointment
- Email:
oharel@stat.uconn.edu
- Web: www.stat.uconn.edu/oharel

Teaching assistant: Valerie Pare, CLAS 316, valeriepare@uconn.edu

Grading, as follows:
- Homework: 25%
- Class participation: 5%
- Quizzes: 10%
- Exam 1: 15% (October 19, tentative)
- Exam 2: 15% (November 9, tentative)
- Final exam: 30% (according to schedule)

There will be no makeup exams.

Homework. Assignments will be announced in class and placed on the class website. Assignments are due in class on the assigned date.

Statistical techniques are being used in many aspects of our life:
- surveys for elections, consumer reports, product satisfaction, etc.
- the effects of drugs
- product quality
- econometrics

Statistics (Dictionary.com): the mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.

There are many different definitions of statistics. Certain concepts appear in most definitions: variation, uncertainty, inference, science.

In our daily life there are many examples of the use of statistics:
- Parents of a child with a genetic defect consider having another child. They will base their decision on the chance that the next child will have the same defect.
- To choose the best therapy, a doctor must choose between several possibilities.
- In an experiment to investigate whether a food additive is carcinogenic (enhances the chance for cancer), the USDA has animals treated with and without the additive.
- Does smoking cause cancer?
- In designing and planning medical care facilities, there is a need to take into account the changing needs of medical care.

Course goals:
- Introduce the world of statistics.
- Use statistics methodology.
- Apply our knowledge to real data problems.
- Make inference.

Basic concepts and terminology

- Population: the entire collection of subjects/items under investigation. The number of subjects/items in the population is called the size of the population.
- A population parameter is a numerical quantity that describes a characteristic of a population. The true value of
a population parameter can be known if and only if the outcome for every subject/item in the population is recorded. The population parameter is considered to be an unknown constant.
- The goal is to estimate the value of a population parameter.

A variable is any information or measurement (any characteristic of interest) about each member of a population: blood type, eye color, height, weight, age, ethnicity, blood pressure, cholesterol level, packs of cigarettes smoked per day. As you can see, a variable can take different values for different individuals/items, or different values for the same individual (e.g. blood pressure is constantly changing). A variable is a measurable attribute that typically varies over time or between individuals of a population.

Measurements or variables come in 4 types: nominal, ordinal, interval and ratio.
- Nominal data: measurements that classify the sample units into categories, e.g. gender, ethnicity, eye color.
- Ordinal data: there is a natural ordering in the data, e.g. grades, severity of pain. They are often recorded as numerical values.
- Interval data: the recorded measurement lies on a numerical scale on which differences (intervals) between values are meaningful.
- Ratio data: the scale has a true zero, so ratios are meaningful, e.g. he/she is 2 times stronger than me.

Measurements can also be classified as:
- Quantitative (numerical)
  - Discrete: number of patients, number of salmon in the ocean
  - Continuous: height, weight, time, temperature
- Qualitative (categorical): eye color, gender, political affiliation

A sample is a subset of the population selected for study in order to gain more information about the entire population. The number of subjects/items in the sample is called the sample size. Information collected on a sample is used to draw conclusions about the population parameter. It is crucial that the sample resembles the population.

A sample statistic (or a statistic) is a
numerical measure calculated from the observed outcomes in the sample.
- Parameter is a term that refers to a population quantity, and a statistic refers to a sample.
- The value of a statistic depends on the particular sample selected from the population. In other words, a statistic changes value each time a new sample is selected.

There is always a potential risk that the sample results will be different from the population parameter under investigation. Hence it is important to quantify how likely it is that a sample result will be far from the population parameter. This is where probability comes into play.

Probability gives a statement about how confident we can be in the claim provided by the sample data analysis, that is, how confident we can be that the answer is correct.
- The process that generalizes the result obtained from a sample to the entire population under study is called inferential statistics.
- An observed effect so large that it would rarely occur by chance is said to be statistically significant.

For any type of research:
- First specify the objective of the study.
- Define the population in mind and the parameters of interest.
- What are the variables that will be collected?
- What is the study design?
- Data collection
- Data analysis
- Inference

Study types: definitions

- An observational study collects data from an existing situation. The data collection does not intentionally interfere with the running of the system. Notice that the act of observation might affect the system.
- An experiment is a study in which an investigator deliberately sets one or more factors to a specific level.
- In general, experiments lead to stronger scientific inferences than do observational studies.
- The experiment (study) unit is the smallest unit on which an experiment or study is performed.
- An experiment is a crossover experiment if the same unit receives more than one treatment or is investigated under more than one condition. The
different treatments are given during nonoverlapping time periods. The risks with this type of experiment are:
  - crossover effects;
  - a change of the experimental unit over time;
  - permanent physiological change in humans and animals;
  - the longer the study, the higher the risk of dropout.
- A clinical study is one that takes place in a setting of clinical medicine.

Definitions

- A cohort is a group of people whose membership is clearly defined. For example: the students enrolled in Stat 5005 for Fall 2009.
- An endpoint is a clearly defined outcome or event associated with an experimental or study unit. For example: the final grade of the students above.
- A prospective study is one in which a cohort of people is followed for the occurrence or nonoccurrence of specified endpoints, events or measurements.
- In the analysis of a prospective study, the occurrence of the endpoint is often related to cohort measurements taken at the beginning of the study.
- Baseline characteristics are values collected at the time of entry to the study.
- A retrospective study is one in which people having a particular outcome or endpoint are identified and studied. For example: a cancer registry.
- A case-control study selects all cases that meet a specific criterion. A group, called the control group, that serves as a comparison group is also selected. The two groups are then compared.
- A matched case-control study matches the cases and controls according to some characteristic.
- A longitudinal study collects information on study units over a specific period of time, while a cross-sectional study collects information on study units at a fixed time.
- A placebo treatment is designed to appear exactly like the active treatment but is devoid of the active part of the treatment.
- The placebo effect results from the belief that one has been treated, rather than from having experienced actual changes.
- A study is single blind if the subjects are unaware of which treatment they are receiving. It is double blind if
in addition those who evaluate the study do not know the group assignment.

Introduction to Applied Statistics, Stat 5005
Chapter 6

Two-sample inference

In many problems more than one population is involved. Sometimes we are concerned with estimating the difference $\mu_1 - \mu_2$ between two population means $\mu_1$ and $\mu_2$. For example:
- We might want to estimate the difference in height between freshman males and females at UConn.
- We might want to compare the performance of two brands of snow tires.
- We might be interested in knowing if male students are more easily bored than their female counterparts.
- We might want to compare the scores obtained by UConn and Yale students on a statistical test.

We can use two samples, of sizes $n_1$ and $n_2$, to represent the two populations. We can use descriptive statistics, such as summary statistics and plots, to review the data. This will give us some understanding of the data.

Theorem. Let $y_1 \sim N(\mu_1, \sigma_1^2)$ and $y_2 \sim N(\mu_2, \sigma_2^2)$ be independent, where the second argument is the variance. Then
- $y_1 + y_2 \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$
- $y_1 - y_2 \sim N(\mu_1 - \mu_2,\ \sigma_1^2 + \sigma_2^2)$

Using the CLT we can find the sampling distributions of $\bar y_1$ and $\bar y_2$ to be $\bar y_1 \sim N(\mu_1, \sigma_1^2/n_1)$ and $\bar y_2 \sim N(\mu_2, \sigma_2^2/n_2)$ respectively. Therefore the distribution of $\bar y_1 - \bar y_2$ is normal with

$$\mu_{\bar y_1 - \bar y_2} = \mu_1 - \mu_2, \qquad \sigma^2_{\bar y_1 - \bar y_2} = \sigma^2_{\bar y_1} + \sigma^2_{\bar y_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},$$

and standard error $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.

Two-sample CI when variances are known

- Let $X_{11}, \dots, X_{1n_1}$ be a random sample of size $n_1$ from a normal distribution with unknown mean $\mu_1$ and known variance $\sigma_1^2$. This sample represents population 1. Let $\bar X_1$ be the sample mean of the sample from population 1.
- Let $X_{21}, \dots, X_{2n_2}$ be a random sample of size $n_2$ from a normal distribution with unknown mean $\mu_2$ and known variance $\sigma_2^2$. This sample represents population 2. Let $\bar X_2$ be the sample mean of the sample from population 2.
- We assume that the two samples are independent.

We have $E(\bar X_1 - \bar X_2) = \mu_1 - \mu_2$. Hence $\bar X_1 - \bar X_2$ is an unbiased estimator of $\mu_1 - \mu_2$. We also have

$$Var(\bar X_1 - \bar X_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Therefore the sampling distribution of the statistic $\bar X_1 - \bar X_2$ is

$$\bar X_1 - \bar X_2 \sim N\!\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$

Hence the confidence interval

$$\bar x_1 - \bar x_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

forms a realization of a CI for $\mu_1 - \mu_2$ with confidence level $100(1-\alpha)\%$.

Two-sample CI when variances are known (large samples)

Here we no longer assume that population 1 and population 2 are normally distributed or approximately normally distributed. We rely heavily on large-sample theory (the Central Limit Theorem). Again, the confidence interval

$$\bar x_1 - \bar x_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

forms a realization of a large-sample CI for $\mu_1 - \mu_2$ with approximate confidence level $100(1-\alpha)\%$.

Two-sample CI when $\sigma_1, \sigma_2$ are unknown

Here we assume that population 1 and population 2 are normally distributed, or approximately normally distributed, with a common unknown variance. Under the assumption that $\sigma_1 = \sigma_2 = \sigma$ we have

$$Var(\bar X_1 - \bar X_2) = \sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$

Under the normality assumption it follows that

$$\frac{(n_1-1)S_1^2}{\sigma^2} \sim \chi^2_{n_1-1} \qquad \text{and} \qquad \frac{(n_2-1)S_2^2}{\sigma^2} \sim \chi^2_{n_2-1}.$$

As a consequence,

$$E(S_1^2) = \sigma^2,\quad Var(S_1^2) = \frac{2\sigma^4}{n_1-1} \qquad \text{and} \qquad E(S_2^2) = \sigma^2,\quad Var(S_2^2) = \frac{2\sigma^4}{n_2-1}.$$

Therefore $S_1^2$ and $S_2^2$ are two unbiased estimators of $\sigma^2$. But we can do better. Indeed, the statistic

$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$$

is an unbiased estimator of $\sigma^2$ and has smaller variance than both sample variances, and is therefore better. $S_p^2$ is called the pooled sample variance:

$$E(S_p^2) = \frac{(n_1-1)E(S_1^2) + (n_2-1)E(S_2^2)}{n_1+n_2-2} = \sigma^2,$$

$$Var(S_p^2) = \frac{(n_1-1)^2\, Var(S_1^2) + (n_2-1)^2\, Var(S_2^2)}{(n_1+n_2-2)^2} = \frac{2\sigma^4}{n_1+n_2-2}.$$

It can be shown that

$$\frac{(n_1+n_2-2)\,S_p^2}{\sigma^2} \sim \chi^2_{n_1+n_2-2}.$$

Hence the sampling distribution of the statistic

$$\frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}.$$

The interval

$$\bar x_1 - \bar x_2 \pm t_{\alpha/2,\; n_1+n_2-2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

forms a realization of a CI for $\mu_1 - \mu_2$ with confidence level $100(1-\alpha)\%$.

Example 1. A farm-equipment manufacturer wants to compare the average daily downtime for two sheet-metal stamping machines located in two different factories. Investigation of company records for 10 randomly selected days on each of the two machines gave the following results:

$n_1 = 10$, $\bar x_1 = 12$ min, $s_1^2 = 6$
$n_2 = 10$, $\bar x_2 = 9$ min, $s_2^2 = 4$

Assume that the common variance assumption holds. Estimate the difference between
the average daily downtime for the two sheetmetal stamping machines with confidence coefficient 095 What additional assumptions are necessary 39i I I an 39 V quot39 ap r6 p12E TWOSample CI 01 02 are 3 Example 2 A civil engineer wishes to measure the compressive strength of two different types of concrete A random sample of 10 specimens of each type yielded the following data in psi Type 1 3250 3268 4302 3184 3266 3297 3332 3502 3064 3116 Type 2 3094 3106 3004 3066 2984 3124 3316 3212 3380 3018 If we assume that the samples are normal with a common variance construct a 95 percent twosided confidence interval for M1 2 the difference in means ap r6 p1E Hypotheses Testing Two Sample Case We can also test a hypothesis about the difference of two means 9 Here the setup begins with two independent populations and one would like to compare these two populations 3 The goal is to test if the difference between means or proportions from two independent populations are significantly different dhapfgr 6 p 1m Testing mean Difference Known Variances 1 Let X11 o Xl m be m random sample from a normal distribution with unknown mean 1 and known variance of This sample constitutes population 1 1 Let X21 o 7X2n2 be 712 random sample from a normal distribution with unknown mean ng and known variance 0 This sample constitutes population 2 Notation for Sample 1 and Sample 2 Sample Sample Sample Size Mean Variance Sample from Population 1 m 531 5 Sample from Population 2 712 51 32 53 Chapter 6 p 15
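As a hedged illustration of the pooled-variance interval, the concrete-strength data of Example 2 can be worked through numerically. The sketch below uses only the Python standard library; the function name `pooled_t_interval` is hypothetical, and the critical value $t_{0.025,\,18} = 2.101$ is taken from a standard t table rather than computed. It assumes, as the example does, that both samples are normal with a common variance.

```python
import math
from statistics import mean, variance

# Compressive strength data (psi) from Example 2.
type1 = [3250, 3268, 4302, 3184, 3266, 3297, 3332, 3502, 3064, 3116]
type2 = [3094, 3106, 3004, 3066, 2984, 3124, 3316, 3212, 3380, 3018]


def pooled_t_interval(x1, x2, t_crit):
    """Two-sample CI for mu1 - mu2 under a common unknown variance.

    t_crit is the upper alpha/2 quantile of the t distribution with
    n1 + n2 - 2 degrees of freedom, supplied by the caller.
    """
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = variance(x1), variance(x2)  # sample variances (divisor n - 1)
    # Pooled sample variance: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    # Estimated standard error of the difference in sample means.
    se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    diff = mean(x1) - mean(x2)
    return diff, sp_sq, (diff - t_crit * se, diff + t_crit * se)


# Assumption: t_{0.025, 18} = 2.101, read from a standard t table.
T_CRIT_18 = 2.101

diff, sp_sq, (lower, upper) = pooled_t_interval(type1, type2, T_CRIT_18)
print(f"difference in sample means: {diff:.1f} psi")
print(f"pooled variance: {sp_sq:.1f}")
print(f"95% CI for mu1 - mu2: ({lower:.1f}, {upper:.1f})")
```

Because the two samples have equal size, the pooled variance here reduces to the simple average of the two sample variances. Whether the resulting interval contains 0 indicates whether the data are consistent with equal mean strengths at the 5% level.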

