These 13 pages of class notes were uploaded by Bethanee Smith on Monday, December 21, 2015. The notes belong to STAT 3610 at Auburn University, taught by Saeed Maghsoodloo in Spring 2013.
Date Created: 12/21/15
STAT 3600. Reference: Chapter 1 of Devore's 8th Ed. Maghsoodloo

Definition. A population is a collection (or an aggregate) of objects or elements that, generally, have at least one characteristic in common. If all the elements can be well defined and placed (or listed) onto a frame from which the sample can be drawn, then the population is said to be concrete and existing; otherwise, it is a hypothetical, conceptual, or virtual population.

Example 1.
(a) All Auburn University students ($N \approx 29000$ members on 2 campuses). Here the frame may be the AU Telephone Directories.
(b) All households in the city of Auburn. Again the frame can be the Auburn-Opelika Tel. Directory.
(c) All AU COE (College of Engineering) students, where the frame can be found on the web at http://www.eng.auburn.edu/info/listing-all.html

Examples 1.1 and 1.2 on pp. 4-6, 1.5 on p. 11, 1.11 on pp. 20-21, and 1.14 on p. 29 of Devore's 8th edition provide sampling from conceptual (or virtual) populations.

A variable, X, is any characteristic whose value changes from one element of a population to the next; it can be categorical or quantitative.

Example 2.
(a) Categorical or Qualitative variable X: Examples are Grade performance in a college course; Success/Failure; Freshman, Sophomore, Junior, and Senior on a campus; Pass/Fail, Defective/Conforming, Male/Female, etc.
(b) Quantitative variable X: Flexural Strength in MPa (Example 1.2 of Devore, p. 5), Diameter of a Cylindrical Rod, Length of steel pipes, Bond Strength of Concrete (Example 1.11 on pp. 20-21, sample size n = 48), Specific Gravity of Exercise 12 on p. 24, Shear Strength (lb) of Exercise 24 on p. 26, etc.

Note that W. Edwards Deming (perhaps the most prominent of Quality gurus in the 20th century) generally refers to studies made on concrete populations as enumerative and those made on conceptual populations as analytic.
Branches of Statistics: (1) Descriptive, (2) Inductive or Inferential

Descriptive Statistics comprises all methods that summarize collected data and is subdivided into 2 categories:
(i) Pictorial and Tabular: Stem-and-leaf plot, Histogram, and Boxplot.
(ii) Numerical (or quantitative) Measures: of Location (i.e., the mean, the median), of Variability, of Skewness, and of Kurtosis.

(i) Stem-and-leaf Plot for Exercise 12 on page 24 of Devore's (8e)

The data are already in order-statistics format with $x_{(1)} = 0.31$ and $x_{(n)} = x_{(36)} = 0.75$. The sample size, universally denoted by n, is equal to 36. X = Specific Gravity (a quantitative measure). Stem = 0.10 (the same as Minitab's increment = 0.10) and Minitab's Leaf unit = 0.01.

  (Cumf_i)   Stem   Leaf                   (n/2 = 18)
      6        3    156678
    (19)       4    0001122222345667888
     11        5    14458
      6        6    26678
      1        7    5

I will name the increment "0.4" as the median stem for the above data because its sample median lies in the interval $0.40 \le \tilde{x} = \hat{x}_{0.50} = \text{Sample Median} = 0.4450 < 0.50$.

Histograms. (See Example 1.10 on p. 19 of Devore's 8th edition.)

The 1st-order statistic is $x_{(1)} = 2.97$, the nth-order statistic is $x_{(n)} = 18.26$, and the sample size is n = 90. The sample range is $R = x_{(90)} - x_{(1)} = 15.29$. Let C = No. of subgroups (or classes, or bins), for which there are 3 guidelines:
$C_1 \approx 1 + 3.3\log_{10}(n)$ [Sturges' practical guideline], which gives $C_1 \approx 1 + 3.3\log_{10}(90) = 7.45$;
$C_2 \approx \sqrt{n}$, which gives $C_2 = 9.49$;
or Shapiro's recommendation $C_3 = 4[0.75(n-1)^2]^{0.20} = 4[0.75(89)^2]^{0.2} = 22.7420$.
This last guideline is generally too large and should be used only when n > 500. Thus, it is best to select between 7 and 10 subgroups. So, we choose C = 9 subgroups. As a result, the subgroup width is $\Delta_j = \Delta = R/C = 15.29/9 = 1.6989$, which is rounded up to $\Delta = 1.70$. (Always round up, to the same number of decimals as the original data.) Note that class limits must have the same number of decimals as the original data, but boundaries must carry one more decimal.
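The three class-count guidelines and the round-up rule for the class width can be checked numerically. The following is only a minimal Python sketch; the function names `class_count_guidelines` and `class_width` are hypothetical helpers introduced here for illustration, not part of Devore's text or of Minitab.

```python
import math

def class_count_guidelines(n):
    """Return the Sturges, square-root, and Shapiro guideline values for n observations."""
    sturges = 1 + 3.3 * math.log10(n)
    square_root = math.sqrt(n)
    shapiro = 4 * (0.75 * (n - 1) ** 2) ** 0.20
    return sturges, square_root, shapiro

def class_width(sample_range, classes, decimals):
    """Return R/C rounded UP to the given number of decimals (same as the data)."""
    scale = 10 ** decimals
    # round(..., 9) guards against floating-point noise before taking the ceiling
    return math.ceil(round(sample_range / classes * scale, 9)) / scale

# Example 1.10 of Devore: n = 90, R = 15.29, C = 9 subgroups, data given to 2 decimals
c1, c2, c3 = class_count_guidelines(90)
width = class_width(15.29, 9, 2)
print(round(c1, 2), round(c2, 2), round(c3, 2), width)
```

With n = 90 the three guideline values should agree with the 7.45, 9.49, and 22.74 quoted above, and the width should come out as 1.70.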
In Table 1, the upper class limit of the 1st subgroup is 4.66 while the upper boundary of the 1st class is $Ub_1 = 4.665$. The lower class limit of the 4th subgroup is 8.07 while the lower boundary of the 4th subgroup is $Lb_4 = 8.065$, etc. Further, the class width is $\Delta_j = Ub_j - Lb_j$ for all j. The frequency distribution for Example 1.10 of Devore is given in Table 1 below; the frequencies $f_j$ must always add to n, i.e., $\sum_{j=1}^{C} f_j = n$ (= 90 in this case). This is why the subgroup intervals must be non-overlapping.

Table 1. The frequency distribution of Example 1.10, on p. 19

  Subgroup (class)     f_j
  2.97 - 4.66            2
  4.67 - 6.36            5
  6.37 - 8.06           17
  8.07 - 9.76           18
  9.77 - 11.46          22
  11.47 - 13.16         13
  13.17 - 14.86          8
  14.87 - 16.56          3
  16.57 - 18.26          2

The histogram from Minitab is provided in Figure 1. In Figure 1, the area inside each rectangle (or bar) represents the Relative Frequency ($\mathrm{Relf}_j = f_j/n$), and the ordinate represents the height or density $h_j = d_j$ of each rectangle. Every histogram in the universe must have the "Total Area Under the Histogram" equal to 1:

$\sum_{j=1}^{C} \mathrm{Relf}_j = 1.0000 = \sum_{j=1}^{C} h_j \Delta_j = \sum_{j=1}^{C} d_j \Delta_j$,

and because both $\mathrm{Relf}_j$ and $d_j \Delta_j$ represent the same jth rectangular area of the histogram, it follows that $\mathrm{Relf}_j = d_j \Delta_j$, and hence $d_j = \mathrm{Relf}_j / \Delta_j$ for all j = 1, 2, 3, ..., C. For the histogram of Figure 1, $d_1 = h_1 = (2/90)/1.70 = 0.02222/1.70 = 0.013072$ per BTU, $d_2 = h_2 = (5/90)/1.70 = 0.05556/1.70 = 0.03268$ per BTU, etc. Note that $\Delta$ must have the same number of decimals as the original data. It is paramount to understand that the densities $d_j$ have very little (if any) statistical or geometrical meaning; it is their product with the corresponding class width, $\Delta_j$, that gives the corresponding jth rectangular area $a_j = \mathrm{Relf}_j = d_j \Delta_j$. Further, the midpoint of each subgroup (or bin) is simply

$m_j = (Ub_j + Lb_j)/2 = (UclassL_j + LclassL_j)/2$,

where $Ub_j$ is the upper boundary, $Lb_j$ the lower boundary, $UclassL_j$ the upper class limit, and $LclassL_j$ the lower class limit of the jth subgroup.

[Figure 1. Minitab histogram for Example 1.10 on page 19 of Devore's 8th edition, based on the midpoints $m_j$ (3.815, 5.515, 7.215, 8.915, 10.615, 12.315, 14.015, 15.715, 17.415) and 9 subgroups of length 1.70. The values inside each bar represent the $\mathrm{Relf}_j$. Vertical axis: $h_j = d_j$; horizontal axis: BTUs.]

The 3rd pictorial summary, the Boxplot, will be discussed on pp. 11-12 of these notes. Finally, the $\mathrm{Relf}_j$'s are unit-less while the $d_j$'s always have units.

1(ii) (Quantitative) Measures in Descriptive Statistics.

(a) Measures of Location: Mean (or arithmetic average), median, geometric mean, harmonic mean, trimmed mean, mode, and percentiles.

A bar is universally used to denote averages such as the arithmetic mean $\bar{x}$ (or $\bar{y}$). The arithmetic mean is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. For Example 1.11 of Devore's 8th edition, p. 20, n = 48 and $\sum_{i=1}^{n} x_i = 387.80$, so that $\bar{x} = 8.0792$. Note that the sample mean ($\bar{x}$, $\bar{y}$, etc.) represents the center of gravity of a data set; see Figure 1.15 on p. 29 of Devore's 8th edition.

The median, $\tilde{x} = \hat{x}_{0.50}$, is another measure of central location of data such that exactly (or at most) half of the data are below $\hat{x}_{0.50}$ and at most half of the data exceed $\hat{x}_{0.50}$. To obtain $\hat{x}_{0.50}$ for any data (whether n is odd or even), multiply n by 0.50. If this result is an exact integer, say r, then $\hat{x}_{0.50} = [x_{(r)} + x_{(r+1)}]/2$. Only in this case will exactly half the data lie below and the other half above the sample median $\tilde{x}$. If 0.50n is not an integer, then always round it up to the next higher integer, say m. Then, for this last case, $\hat{x}_{0.50} = x_{(m)}$ = the mth order statistic.

For Example 1.11 on p. 20 of Devore, 0.50n = 24, which is an exact integer. Thus, $\tilde{x} = \hat{x}_{0.50} = [x_{(24)} + x_{(25)}]/2 = (5.7 + 6.2)/2 = 5.95$. Note that exactly 24 data points lie below 5.950, and 24 points lie above 5.950. For the data of Example 1.14 on p. 29 of Devore's 8th edition, the sample size n = 21 gives 0.50n = 10.5, which is not an integer.
Thus, round 10.5 up to the next higher integer m = 11, and as a result $\hat{x}_{0.50} = \tilde{x} = x_{(11)} = 21.20$, while $\bar{x} = 21.181$. Note that in this case only 47.61905% of the data are below 21.20, and 47.61905% of the data are above $\hat{x}_{0.50} = \tilde{x} = 21.20$.

The geometric mean is defined as $\bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n}$, i.e., $\bar{x}_g$ is the nth root of $\prod_{i=1}^{n} x_i$; it is defined only if $x_i > 0$ for all i, and in general $\bar{x}_g \le \bar{x}$. For the data of Example 1.14 of Devore on p. 29 of his 8th edition, $\bar{x}_g = 19.379764 < 21.180952 = \bar{x}$. The geometric mean has limited applications in DOX (Design of Experiments) where at least 2 responses from each experimental unit are observed.

The harmonic mean is defined as

$\bar{x}_h = \left[\frac{1}{n}\sum_{i=1}^{n} (1/x_i)\right]^{-1} = \dfrac{n}{\sum_{i=1}^{n} (1/x_i)}$,

i.e., $\bar{x}_h$ is the inverse of the average of the reciprocals of the $x_i$'s. For the data of Example 1.14 on page 29 of Devore, n = 21, and $\frac{1}{21}\sum_{i=1}^{21} (1/x_i) = [(1/16.1) + (1/9.6) + \dots + (1/28)]/21 = 1.1902504/21 = 0.0566786$, which yields $\bar{x}_h = 1/0.0566786 = 17.643346 < \bar{x}_g < \bar{x}$. The harmonic mean has applications in ANOVA (Analysis of Variance) when the design is unbalanced. It gives the average sample size over all levels of a factor, and always $\bar{x}_h \le \bar{x}_g \le \bar{x}$. In general, the geometric and harmonic means are not as important measures of central tendency as $\bar{x}$ and $\tilde{x}$. The sample mean, $\bar{x}$, is the most common measure of central tendency and is always the center of gravity of the data.

TRIMMED MEANS

A 10% trimmed mean, $\bar{x}_{tr(10)}$, is computed by deleting the smallest and largest 0.10n of the order-statistics from the two tails of the data and computing the arithmetic average of the remaining 80% of the data. It seems that such a mean should be called the 20% trimmed mean because 20% of the data is actually removed from the original n observations $x_1, x_2, \dots, x_n$.
However, I am certain that our author, Devore, is notationally consistent with other statistical literature, and therefore, we will use Devore's notation of $\bar{x}_{tr(10)}$. To illustrate, consider the Bond-Strength data of Example 1.11 on page 20 of Devore, for which 0.10n = 4.8.

Step 1. Trim or remove the order-statistics $x_{(1)}, x_{(2)}, x_{(3)}, x_{(4)}, x_{(48)}, x_{(47)}, x_{(46)}$, and $x_{(45)}$. Next, compute $\bar{x}_{tr4} = \sum_{i=5}^{44} x_{(i)} / 40 = 7.380$.

Step 2. Trim further by removing $x_{(5)}$ and $x_{(44)}$ in order to obtain $\bar{x}_{tr5} = \sum_{i=6}^{43} x_{(i)} / 38 = 7.300$.

Step 3. Interpolate between $\bar{x}_{tr4}$ and $\bar{x}_{tr5}$ to obtain $\bar{x}_{tr(10)}$. Note that most statistical packages, such as Minitab, only give the $\bar{x}_{tr(5)}$, and they round the value of 0.05n to the nearest integer in order to obtain the 5% trimmed mean $\bar{x}_{tr(5)}$. However, the exact trimmed mean for Example 1.11 should be computed from the following convex combination:

$\bar{x}_{tr(10)} = 0.2\,\bar{x}_{tr4} + 0.8\,\bar{x}_{tr5} = 7.3160$

Had n been equal to 44, then 0.10n = 4.4 and the above formula would change to $\bar{x}_{tr(10)} = 0.6\,\bar{x}_{tr4} + 0.4\,\bar{x}_{tr5}$.

The trimmed mean, $\bar{x}_{tr}$, has applications when data contain outliers (or when the data originate from an underlying distribution with heavy tail probabilities), and $\bar{x}_{tr}$ is always as close or closer to $\hat{x}_{0.50}$ (= 5.950 for Example 1.11) than is $\bar{x}$ (= 8.0792).

The MODE

The mode is the observation with the highest frequency. For the data of Example 1.11 on p. 20 of Devore, MO = 3.6 and the modal frequency is f = 4 (this is the highest frequency). Most populations have a single mode; however, if a population has two or more modes, then it should be stratified for the purpose of sampling. In calculus, the mode is referred to as the point on the abscissa at which the maximum of the ordinate, y, occurs.

Computing Sample Percentiles (or Quantiles)

The 100p-th sample percentile, $\hat{x}_p$, is obtained using the following steps.
(1) First rearrange the data in ascending order $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, where $x_{(1)}$ is called the 1st-order statistic, $x_{(2)}$ the 2nd-order statistic, ..., and $x_{(n)}$ is called the nth-order statistic.
(2) Multiply p by n+1: if (n+1)p is an exact integer, say I, then $\hat{x}_p = x_{(I)}$.
(3) If (n+1)p is not an exact integer, such that I < (n+1)p < I+1, then the sample p-th quantile is given by the convex combination $\hat{x}_p = a\,x_{(I)} + (1-a)\,x_{(I+1)}$, where $0 < a = (I+1) - (n+1)p < 1$.

For the data of Example 1.11 on p. 20 of Devore's 8(e), where X represents Bond Strength, the 10th, 25th, 50th, 75th, 80th, and 90th sample percentiles are computed below. (Note that only for convenience hats have been removed from sample percentiles, and the sample size is n = 48.)

  $x_{0.10}$: 0.10 × 49 = 4.9,      so $x_{0.10} = 0.10\,x_{(4)} + 0.90\,x_{(5)} = 3.60$
  $x_{0.25}$: 0.25 × 49 = 12.25,    so $x_{0.25} = 0.75\,x_{(12)} + 0.25\,x_{(13)} = 4.35$
  $x_{0.50}$: 0.50 × 49 = 24.5,     so $x_{0.50} = 0.5\,x_{(24)} + 0.5\,x_{(25)} = 5.950$
  $x_{0.75}$: 0.75(n+1) = 36.75,    so $x_{0.75} = 0.25\,x_{(36)} + 0.75\,x_{(37)} = 10.70$
  $x_{0.80}$: 0.80 × 49 = 39.2,     so $x_{0.80} = 0.80\,x_{(39)} + 0.20\,x_{(40)} = 12.20$
  $x_{0.90}$: 0.90(n+1) = 44.1,     so $x_{0.90} = 0.9\,x_{(44)} + 0.10\,x_{(45)} = 14.30$

The above sample percentiles are also called the 0.10, 0.25, 0.50, 0.75, 0.80, and 0.90 sample quantiles, respectively. The 0.10 quantile is also called the 1st decile, and the 0.90 quantile is called the 9th decile. Note that Minitab's quantile estimates given below do not always match those of SAS.

Minitab's Descriptive Statistics: BNDS

  Variable   Mean    SE Mean   TrMean   StDev   Variance   CoefVar   Sum
  BNDS       8.079   0.703     7.607    4.868   23.702     60.26     387.800

  Sum of Squares   Minimum   Q1      Median   Q3       Maximum   Range    IQR
  4247.080         3.400     4.350   5.950    10.700   25.500    22.100   6.350

  Mode   N for Mode   Skewness   Kurtosis
  3.6    4            1.54       2.64

The IQR (interquartile range) is defined as IQR = $x_{0.75} - x_{0.25}$ = Q3 - Q1 = the 4th spread = $f_s$, while the interdecile range is defined as $x_{0.90} - x_{0.10}$. The 4th spread, $f_s$, is Devore's uncommon notation and terminology, which is explained near the bottom of his p. 39.
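The (n+1)p rule in steps (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `percentile_weights` and `sample_percentile` are hypothetical, the five-point sample is made up for the demonstration (it is not Devore's data), and raising an error when (n+1)p falls outside 1..n is a choice made here, since the notes do not say what to do in that case.

```python
import math

def percentile_weights(n, p):
    """Return (I, a) such that x_p = a*x_(I) + (1 - a)*x_(I+1) under the (n+1)p rule."""
    pos = round((n + 1) * p, 9)          # guard against floating-point noise
    if not 1 <= pos <= n:
        # Assumption of this sketch: the rule is not applied outside 1..n
        raise ValueError("(n+1)p falls outside 1..n; the rule does not apply")
    i = math.floor(pos)                  # I, the order statistic just at or below (n+1)p
    a = (i + 1) - pos                    # a = (I+1) - (n+1)p, equal to 1 when (n+1)p is an integer
    return i, a

def sample_percentile(data, p):
    """Return the 100p-th sample percentile of data using the (n+1)p rule."""
    x = sorted(data)                     # step (1): order statistics
    i, a = percentile_weights(len(x), p)
    if a == 1.0:                         # step (2): (n+1)p is an exact integer
        return x[i - 1]
    return a * x[i - 1] + (1 - a) * x[i] # step (3): convex combination

# Weights for n = 48 reproduce the ones used above, e.g. p = 0.25 gives I = 12, a = 0.75
print(percentile_weights(48, 0.25))

# Hypothetical sample of size 5, for illustration only
sample = [3.2, 6.6, 2.1, 4.5, 3.6]
q1 = sample_percentile(sample, 0.25)
median = sample_percentile(sample, 0.50)
q3 = sample_percentile(sample, 0.75)
print(q1, median, q3, q3 - q1)
```

Other packages use different interpolation rules, which is one reason quartiles from Minitab, SAS, and Excel need not agree on the same data.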
For Example 1.11 of Devore, the IQR is equal to 10.70 - 4.35 = 6.350. If $Q1 - 3\,IQR < x_{(i)} < Q1 - 1.5\,IQR$, or $Q3 + 1.5\,IQR < x_{(i)} < Q3 + 3\,IQR$, then the ith order-statistic, $x_{(i)}$, is a mild outlier. If $x_{(i)} < Q1 - 3\,IQR$, or $x_{(i)} > Q3 + 3\,IQR$, then $x_{(i)}$ is an extreme outlier. For Example 1.11 of Devore, since $Q1 - 1.5 \times 6.35 = 4.35 - 9.525 < 0$ and $Q3 + 1.5\,IQR = 20.225$, the data contain 2 outliers on the RHS (or the upper tail). Further, because $Q3 + 3\,IQR = 29.75$, the data have no extreme outliers.

(b) Measures of Variability (Three Quantitative Measures)

(1) Standard deviation (Stdev) = S, (2) Range/$d_2$ = $R/d_2$, and (3) the IQR = $x_{0.75} - x_{0.25}$, where the range $R = x_{(n)} - x_{(1)}$ and the IQR (or $f_s$) have already been defined. The parameter $d_2$ is a Quality Control constant that will be defined in INSY 4330, and for the most common sample size n = 5 the value of $d_2$ is approximately equal to 2.326. The most common measure of variability is the standard deviation, followed by $R/d_2$. In order to compute S, we must always compute the variance first; there are no other alternatives.

Definition. The sample variance, v, is the average of the squared deviations of the n observations from their own mean. (USS = Uncorrected Sum of Squares)

Data Set 1: 2.7, 3.5, 3.8, 4.6, 5.4. Here n = 5, $\bar{x} = 4.0$, sample range R = 2.7, $USS = \sum_{i=1}^{n} x_i^2 = 84.30$, and CF = Correction Factor $= (\sum_{i=1}^{n} x_i)^2/n = 20^2/5 = 80$.

  $x_i - \bar{x} = x_i - 4$: -1.30, -0.50, -0.20, 0.60, 1.40, so that $\sum_{i=1}^{5} (x_i - 4) = 0$
  $(x_i - 4)^2$: 1.69, 0.25, 0.04, 0.36, 1.96, so that $\sum_{i=1}^{5} (x_i - \bar{x})^2 = 4.30$

Sample variance $v_1 = 4.3/5 = 0.86$. Note that $\sum_{i=1}^{n} (x_i - \bar{x}) \equiv 0$ for all data sets in the universe.
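The Data Set 1 arithmetic above can be reproduced with a short Python sketch. The helper name `sums_of_squares` and the dictionary keys are hypothetical names chosen for this illustration; the divisor n follows the definition of v given in these notes.

```python
def sums_of_squares(data):
    """Return the mean, deviations, USS, CF, sum of squared deviations, and v = (that sum)/n."""
    n = len(data)
    total = sum(data)
    mean = total / n
    deviations = [x - mean for x in data]      # x_i - x-bar
    uss = sum(x * x for x in data)             # uncorrected sum of squares
    cf = total ** 2 / n                        # correction factor (sum x_i)^2 / n
    ss_dev = sum(d * d for d in deviations)    # sum of squared deviations from the mean
    return {
        "mean": mean,
        "deviations": deviations,
        "USS": uss,
        "CF": cf,
        "SS_dev": ss_dev,
        "v": ss_dev / n,                       # sample variance with divisor n, as defined in the notes
    }

# Data Set 1 from the notes
result = sums_of_squares([2.7, 3.5, 3.8, 4.6, 5.4])
print(round(result["mean"], 4), round(result["USS"], 4), round(result["CF"], 4))
print(round(sum(result["deviations"]), 10))    # the deviations sum to zero (up to rounding error)
print(round(result["SS_dev"], 4), round(result["USS"] - result["CF"], 4), round(result["v"], 4))
```

The last line also shows numerically that USS - CF equals the sum of squared deviations for this data set (84.30 - 80 = 4.30).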
Data Set 2: 2.1, 3.2, 3.6, 4.5, 6.6 ($\bar{x} = 4.0$, R = 4.5, USS = 91.42, CF = 80).

  $x_i - \bar{x}$: -1.9, -0.8, -0.40, 0.50, 2.6, so that $\sum_{i=1}^{5} (x_i - 4) = 0$
  $(x_i - \bar{x})^2$: 3.61, 0.64, 0.16, 0.25, 6.76

CSS = Corrected Sum of Squares = $S_{xx}$ = 3.61 + 0.64 + 0.16 + 0.25 + 6.76, and in general

$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i^2 - 2\bar{x}x_i + \bar{x}^2) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = USS - 2\bar{x}(n\bar{x}) + n\bar{x}^2 = USS - n\bar{x}^2 = USS - n\left(\sum_{i=1}^{n} x_i / n\right)^2 = USS - \left(\sum_{i=1}^{n} x_i\right)^2 / n = USS - CF$,

which for Data Set 2 gives 91.42 - 80 = 11.42, so that $v_2 = 11.42/5 = 2.284$.

Data Set 3: 1.9, 2.9, 4.0, 4.5, 6.7 ($\bar{x} = 4.0$, R = 4.8, USS = 93.16, CF = 80).

  $x_i - \bar{x}$: -2.1, -1.1, 0, 0.50, 2.7, so that CSS = $S_{xx}$ = 13.16 and $v_3 = 13.16/5 = 2.632$.

Note that in general as the overall spread of the data increases, so does the variance, i.e., variance is a measure of variability. Further, the divisor of v is n, i.e., $v = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

Note that the n deviations from the mean $(x_1 - \bar{x}), (x_2 - \bar{x}), (x_3 - \bar{x}), \dots, (x_n - \bar{x})$ are not independent because of the constraint $\sum_{i=1}^{n} (x_i - \bar{x}) \equiv 0$ for all data sets in the universe. For data set number 3 above, if we are given $x_1 - \bar{x} = -2.1$, $x_2 - \bar{x} = -1.1$, $x_3 - \bar{x} = 0$, and $x_5 - \bar{x} = 2.7$, then the value of $x_4 - \bar{x}$ is automatically constrained to $x_4 - \bar{x} = -(-2.1) - (-1.1) - (0) - 2.7 = 3.2 - 2.7 = 0.50$. That is, the variables $x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$ have (n - 1) degrees of freedom (df), not n: before the sample is taken, we have freedom to specify any (n - 1) of them, and the nth deviation from the mean is determined from $\sum_{i=1}^{n} (x_i - \bar{x}) \equiv 0$. Therefore, we define the most common measure of variability with the divisor (n - 1) for $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, given by

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = nv/(n-1) = S_{xx}/(n-1) = CSS/(n-1)$     (1)

For data sets 1, 2, and 3 above, the values are $S_1^2 = 4.30/4 = 1.075$, $S_2^2 = 11.42/4 = 2.855$, and $S_3^2 = 13.16/4 = 3.29$ because df = 4 (not 5). Further, as stated in equation (1), $S_1^2 = 5v_1/4 = 5 \times 0.86/4 = 1.075$, and so forth. The reader should deduce from the above examples that the USS plays a more important role in determining the value of $S^2$ than does the CF.

The exact name for $S^2$ is not the sample variance as defined by Devore on his p. 36. In actuality the sample variance is $v = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$ as defined herein, but v generally underestimates the population variance $\sigma^2$ because $\sum_{i=1}^{n} (x_i - c)^2$, where c is any real constant, attains its minimum value iff $c = \bar{x} = \sum_{i=1}^{n} x_i / n$. To compensate for this underestimation, we divide the CSS = $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ by a smaller number than n, namely its df (degrees of freedom) = n - 1, in order to obtain an "unbiased estimate" of $\sigma^2$. The positive square root of $S^2$ provides the standard deviation, S, and dividing S by $\sqrt{n}$ gives the standard error of the mean, i.e., $se(\bar{x}) = S/\sqrt{n}$. Further, the ratio $S/\bar{x}$ is called the coefficient of variation (or variation coefficient), and generally the sample cv = $S/\bar{x}$ is expressed in % with at least, but most commonly, 2 decimals.

Graphical Measure of Variability (The Boxplot)

Step 1. Draw a vertical line thru the median $\tilde{x} = \hat{x}_{0.50}$.

Step 2. Draw vertical lines thru Q1 = $\hat{x}_{0.25}$ and Q3 = $\hat{x}_{0.75}$ and connect them at the bottom and the top to make a rectangular box. For the data of Example 1.11 on p. 20 of Devore, the box is shown in the box-plot figure below, where hats are removed from sample percentiles only for convenience.

Step 3. Compute both 1.5 IQR and 3 IQR. For Example 1.11 on p. 20, 1.5 IQR = 9.525, which yields the mild interval (Q1 - 1.5 IQR = -5.175, Q3 + 1.5 IQR = 20.225). If the entire data set lies in this last interval, then the data have no outliers. Thus, the data of Example 1.11 contain 2 outliers on the RHS, namely 20.60 and 25.50. Because Q3 + 3 IQR = 10.70 + 19.050 = 29.750, both outliers are mild. Note that the dots on the RHS of the Boxplot represent the mild outliers $x_{(47)} = 20.6$ and $x_{(48)} = 25.50$.

Step 4. Draw whiskers from Q1 and Q3 to the smallest and largest order statistics that are not outliers.

[Figure: The Box-Plot for Example 1.11 on page 20 of Devore's 8th Edition. Labeled values: $x_{0.25} = 4.35$, $x_{0.50} = 5.95$, $x_{0.75} = 10.70$, and 17.10.]

Bonus Homework 1. (Worth 5 Points, either none or all, i.e., no partial bonus points).
Please note that the solution to all Bonus problems must be only yours and no one else's. It will be considered cheating to discuss any aspects of bonus problem solutions with anyone else.

(a) Prove that $\sum_{i=1}^{n} (x_i - c) = 0$ iff the real constant $c = \bar{x}$.

(b) Prove that the SS = $\sum_{i=1}^{n} (x_i - c)^2$ attains its minimum value only if the real constant $c = \bar{x}$.

(c) Prove that for any data set of size n the Corrected Sum of Squares = CSS = $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = USS - CF$, where the Uncorrected Sum of Squares is $USS = \sum_{i=1}^{n} x_i^2$, and the correction factor is $CF = \left(\sum_{i=1}^{n} x_i\right)^2 / n$.

(d) By definition, the mean and variance of grouped (gr) data (or an empirical distribution) are given by

$\bar{x}_{gr} = \frac{1}{n}\sum_{j=1}^{C} m_j f_j$, and $S_{gr}^2 = \frac{1}{n-1}\sum_{j=1}^{C} (m_j - \bar{x}_{gr})^2 f_j$.

Prove that for a histogram (or a frequency distribution) $\sum_{j=1}^{C} (m_j - \bar{x}_{gr}) f_j \equiv 0$, and that the computing formula for $S_{gr}^2$ is given by

$S_{gr}^2 = \frac{1}{n-1}\left[\sum_{j=1}^{C} m_j^2 f_j - \frac{\left(\sum_{j=1}^{C} m_j f_j\right)^2}{n}\right]$,

where $\sum_{j=1}^{C} m_j^2 f_j = USS_{gr}$ and $\left(\sum_{j=1}^{C} m_j f_j\right)^2 / n = CF_{gr}$ = grouped Correction Factor.

Bonus Homework 2 (7 Points). Compute the 0.25 and 0.75 quantiles and the resulting IQRs of Example 1.10 on pages 18-19 of Devore's 8th edition from 3 different methods: (1) SAS's method, (2) Minitab's, verifying your answers by submitting the Minitab output, and (3) Excel's, using its percentile function. Further, fully explain how MS Excel computes the two quartiles, i.e., what formulation Excel uses to obtain sample percentiles. (4) Provide two data sets, each of size n = 6, such that data set 1 has a larger range (R) than data set 2, but data set 1 has a smaller standard deviation, S, than data set 2. Note that it will be impossible to create such 2 samples iff both sizes are n = 2.