# Data Analysis I STAT 528

This 47 page Class Notes was uploaded by Alison Vandervort on Monday September 21, 2015. The Class Notes belongs to STAT 528 at Ohio State University taught by Staff in Fall. Since its upload, it has received 45 views. For similar materials see /class/210001/stat-528-ohio-state-university in Statistics at Ohio State University.

Date Created: 09/21/15

## Stat 528, Autumn 2008: Comparing two proportions

Reading: Section 8.2

- Comparing two proportions: a motivating example
- Mean and variance of the difference in sample proportions
- Sampling distribution for the difference in sample proportions
- The significance test for a difference in proportions
  - The variance of the difference under $H_0$
- Approximations for the standard error of the difference
- A confidence interval for the difference in proportions

### Comparing two proportions: a motivating example

A study was designed to find reasons why patients leave a health maintenance organization (HMO). Patients were classified as to whether or not they had filed a complaint with the HMO. We want to compare the proportion of complainers who leave the HMO with the proportion of those who do not file complaints.

In the year of the study, 639 patients filed complaints, and 54 of these patients left the HMO voluntarily. For comparison, the HMO chose an SRS of 743 patients who did not file complaints. Twenty-two of these patients left voluntarily.

- Is there a difference in the two proportions?
- Provide a 95% confidence interval for the difference in the two proportions.

### Comparing two proportions

- Let $p_1$ denote the proportion of successes for population 1 and let $p_2$ be the success proportion for population 2.
- Suppose we have an SRS of size $n_1$ from population 1 and an independent SRS of size $n_2$ from population 2.
  - Let $X_1$ denote the number of successes in the sample from population 1, and let $\hat{p}_1 = X_1 / n_1$ be the associated sample proportion.
  - Let $X_2$ denote the number of successes in the sample from population 2, with sample proportion $\hat{p}_2 = X_2 / n_2$.
- Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions.

### Mean and variance for the difference in sample proportions

- By the rules for means,

$$\mu_D = \mu_{\hat{p}_1 - \hat{p}_2} = \mu_{\hat{p}_1} - \mu_{\hat{p}_2} = p_1 - p_2.$$

- Thus $D$ is an unbiased estimator for the difference in the population proportions.
- By independence of the samples and the rules for variances,

$$\sigma_D^2 = \sigma^2_{\hat{p}_1 - \hat{p}_2} = \sigma^2_{\hat{p}_1} + (-1)^2 \sigma^2_{\hat{p}_2} = \sigma^2_{\hat{p}_1} + \sigma^2_{\hat{p}_2} = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}.$$
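As a rough check on the mean and variance formulas for $D$ above, the following sketch compares them with a small simulation. The sample sizes 639 and 743 come from the HMO example; the "true" proportions 0.08 and 0.03, the seed, the number of replications, and the function names are illustrative assumptions, not values from the notes.

```python
import random


def diff_moments(p1, n1, p2, n2):
    """Theoretical mean and variance of D = p1_hat - p2_hat."""
    mean = p1 - p2
    var = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    return mean, var


def simulate_diff(p1, n1, p2, n2, reps=2000, seed=528):
    """Simulate D repeatedly from two independent binomial samples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Each count is a sum of independent success/failure trials.
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    mean = sum(diffs) / reps
    var = sum((d - mean) ** 2 for d in diffs) / (reps - 1)
    return mean, var


# Hypothetical true proportions (assumed for illustration only).
theory_mean, theory_var = diff_moments(0.08, 639, 0.03, 743)
sim_mean, sim_var = simulate_diff(0.08, 639, 0.03, 743)

print("theory:    ", theory_mean, theory_var)
print("simulation:", sim_mean, sim_var)
```

The simulated mean and variance should land close to the theoretical values, which is what unbiasedness of $D$ and the variance rule for independent samples predict.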
### Sampling distribution for the difference in sample proportions

- For large $n_1$ and $n_2$, $D$ has an approximate

$$N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right)$$

distribution.
- If $SE_D$ did not depend on $p_1$ and $p_2$, a test would be based on

$$z = \frac{\hat{p}_1 - \hat{p}_2 - \delta_0}{SE_D},$$

where $\delta_0$ is the hypothesized difference.
- A $100(1 - \alpha)\%$ CI for the difference in the population proportions, $p_1 - p_2$, would be

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\, SE_D.$$

- Since $p_1$ and $p_2$ are unknown in practice, we need to approximate $SE_D$.

### The significance test for a difference in proportions

- Hypotheses: $H_0: p_1 - p_2 = 0$ (equivalently, $p_1 = p_2$) versus $H_a: p_1 - p_2 < 0$ OR $p_1 - p_2 \ne 0$ OR $p_1 - p_2 > 0$.
- The test statistic is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{D_p}},$$

where $SE_{D_p}$ is the standard error based on the pooled estimate of the common value of $p_1$ and $p_2$, as we now explain.

### The variance under $H_0$

- Under $H_0: p_1 = p_2$, let $p = p_1 = p_2$ be the common population parameter.
- Then $X_1$ is a $Bin(n_1, p)$ RV and $X_2$ is a $Bin(n_2, p)$ RV.
- These two RVs are independent, and so $X_1 + X_2$ is a $Bin(n_1 + n_2, p)$ RV.
- An estimate of $p$ is

$$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2},$$

and so

$$SE_{D_p} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$

### The test for a difference in proportions (cont.)

- Check:
  1. $n_1 \hat{p}_1 \ge 10$ and $n_1(1 - \hat{p}_1) \ge 10$
  2. $n_2 \hat{p}_2 \ge 10$ and $n_2(1 - \hat{p}_2) \ge 10$
- Then, under $H_0$, the test statistic approximately follows a $N(0, 1)$ distribution.
- P-value: Using Table A, calculate the area under the $Z$ distribution curve.
  - For $H_a: p_1 - p_2 < 0$, the P-value is $P(Z \le z)$.
  - For $H_a: p_1 - p_2 > 0$, the P-value is $P(Z \ge z)$.
  - For $H_a: p_1 - p_2 \ne 0$, the P-value is $2P(Z \ge |z|)$.
- The P-value is approximate, since the distribution of the test statistic is approximately normal.
- For a test of significance at the level $\alpha$:
  - If the P-value $\le \alpha$, we reject $H_0$.
  - If the P-value $> \alpha$, we fail to reject $H_0$.

### Testing in the HMO example

- We expect a higher proportion of complainers to leave. Do the data support this belief?

### From tests to intervals

- Hypothesis tests: The hypothesis test for "no difference" was simplified by the presumption of a common proportion under the null. Other hypothesized differences are more difficult. How would you estimate $p_1$ and $p_2$ under the restriction that $p_2 = p_1 + \delta$ for some specified $\delta$?
- Confidence intervals: Without the
base of a solid family of hypothesis tests, the construction of a confidence interval becomes fuzzy.
  - Idea: Since the plug-in method worked for a single proportion, we could try the same for two proportions. This idea motivates the most commonly used confidence interval for the difference between two proportions.
  - Alternatively, we could patch up the interval a little with a parallel to the "plus four" method.

### Approximations for $SE_D$

1. Plug-in method:

$$SE_D = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

2. Wilson's "plus four" estimate: We add one success and one failure to each proportion. Let

$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2} \quad \text{and} \quad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}.$$

The estimated standard error is

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}}.$$

### The confidence interval

- An approximate $100(1 - \alpha)\%$ confidence interval for $p_1 - p_2$ is given by

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\, SE_D,$$

where $SE_D$ is given by the plug-in method.
- Alternatively (rarely used), an approximate $100(1 - \alpha)\%$ confidence interval for $p_1 - p_2$ is given by

$$\tilde{p}_1 - \tilde{p}_2 \pm z_{\alpha/2}\, SE_{\tilde{D}},$$

where $SE_{\tilde{D}}$ is given by the plus four method.

### A confidence interval in the HMO example

- Form a 95% confidence interval for the difference in the two proportions.

### Other issues

- Power calculations: The calculations follow the same methodology as the earlier power calculations. With Minitab, use the command sequence Stat > Basic Statistics > 2 Proportions.
- Many studies investigate small proportions. Does a particular prescription increase the risk of death due to cardiovascular disease? In these settings, inference is often about the ratio of the two proportions, $p_2 / p_1$. In this setting, no difference corresponds to a ratio of 1.

## Stat 528, Autumn 2008 (Elly Kaizar): Numerical summaries for data

Describing distributions with numbers. Reading: Section 1.2

- Measures of the center: the mean, median, mode, trimmed mean. What do we use: the mean or the median?
- Quartiles
  - A five-number summary
  - Boxplots
  - Outliers
- Measures of spread (variability): standard deviation, variance, IQR, and range
- Numerical summaries in MINITAB
- Changing the units of measurement

### The sheep weights revisited

- Here is the
dataset of the weights (in pounds) of 23 sheep:

  180 160 157 185 159 165 168 165 175 186 155 169 168 170 173 181 189 179 182 177 157 169 166

- Here is the stemplot (Stem-and-leaf of sheep, N = 23, Leaf Unit = 1.0):

| Depth | Stem | Leaves |
|---|---|---|
| 4 | 15 | 5779 |
| 5 | 16 | 0 |
| (7) | 16 | 5568899 |
| 11 | 17 | 03 |
| 9 | 17 | 579 |
| 6 | 18 | 012 |
| 3 | 18 | 569 |

- Can you summarize the features of this dataset?

### Measures of the center: the mean

- Let $x_1, x_2, \ldots, x_n$ be our set of $n$ observations.
- The mean of the observations is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

- $\sum_{i=a}^{b}$ is shorthand notation for "take the sum from $i = a$ to $b$".
- For the sheep weights, $\sum_{i=1}^{n} x_i = 3935$ and $n = $ ___. The mean of the data is ___.

### The median

- The median $M$ is the "exact midpoint" of the data.
  1. Sort the data in increasing order.
  2. (a) If the number of data points $n$ is odd, the median $M$ is the center value of the sorted data, i.e. the $\frac{n+1}{2}$th largest value, OR
     (b) if $n$ is even, the median $M$ is the average of the two center values, i.e. the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th largest values.
- For the sheep dataset, the sorted weights are:

  155 157 157 159 160 165 165 166 168 168 169 169 170 173 175 177 179 180 181 182 185 186 189

### The order statistics

- Suppose we have $n$ data values $x_1, x_2, \ldots, x_n$.
- Define the order statistics $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ as the data sorted from smallest to largest:
  - $x_{(1)}$ = smallest value
  - $x_{(2)}$ = 2nd smallest value
  - $x_{(n)}$ = largest value
- Ex: What are $x_{(1)}$ and $x_{(3)}$ for the sheep weight data?

### Other measures of the center

- The mode(s) of a set of observations is the value (or values) which occurs most frequently. The mode(s) of the sheep weights is ___.
- Suppose we remove the bottom $\alpha\%$ and top $\alpha\%$ of the values from a set of observations. The $\alpha\%$ trimmed mean is the mean of the remaining values. As $\alpha$ increases, the mean is less affected by outliers.
- Ex: What are the 10% and 20% trimmed means of the sheep weights?

### What do we use: the mean or the median?

- Symmetric distributions: mean = median
- Left-skewed distributions: mean < median
- Right-skewed distributions: mean > median
- Outliers (values that lie outside the main body of the data) can affect the mean too.
- The median is more robust, or resistant, for
measuring the center of a distribution.
  - If the distribution is symmetric without outliers, use the mean.
  - If the distribution is skewed or contains outliers, use the median instead.
- Example applet: Mean vs. Median

### Example

[Figure: density of adult American weight. x-axis: weight in pounds; y-axis: density. Source: NHANES 2005-2006.]

### Quartiles

- The sample median $M$ denotes the halfway point of the sorted observations (the 50% point). Half of the data is below the median and half is above.
- The first quartile, $Q_1$, is the 25% point: 25% of the data lies below $Q_1$ and 75% lies above.
- The third quartile, $Q_3$, is the 75% point.
- The interquartile range (IQR) is given by $IQR = Q_3 - Q_1$.
- To calculate quartiles (NOT standardized):
  1. Calculate the sample median. Split the sorted data in half. (If $n$ is odd, drop the median.)
  2. $Q_1$ is the median of the lower half of the data.
  3. $Q_3$ is the median of the upper half of the data.

### A five-number summary

- The five-number summary is a simple description of the data: Minimum, $Q_1$, $M$, $Q_3$, Maximum.
- For the sheep weights, the five-number summary is: Minimum ___, $Q_1$ ___, $M$ ___, $Q_3$ ___, Maximum ___.
- The boxplot illustrates this summary graphically.

### Some example boxplots

[Figure: example boxplots. Panel 1: sheep weights in pounds. Panel 2: length of cuckoo egg by the species of nest in which it was found (e.g. robin, sparrow).]

### Outliers

- An observation further than $1.5 \times IQR$ from the closest quartile is called an outlier. Sometimes an observation further than $3 \times IQR$ from the closest quartile is called an extreme outlier.
- MINITAB denotes outliers on a boxplot with an asterisk, separate from the main boxplot.
- Sometimes outliers are not included in the numeric summaries of data, such as in calculating the mean, median, etc.
- Important question: in practice, should we really leave out data?

### Outliers: example

The amount of aluminum contamination (ppm) in plastic of a certain type was determined for a sample of 23 plastic specimens. The data (sorted) are:

30 60 63 70 79 87 90 101 115 118 119 119 120 125 140 145 182 183 191 222 244 291 511

Construct a boxplot that shows outliers and
comment on the features.

[Figure: boxplot of aluminium contamination (ppm).]

### Measures of spread

- The range of the observations is: range = largest value − smallest value.
- The interquartile range is $IQR = Q_3 - Q_1$, the range of the middle 50% of the sorted data.
- Variance
- Standard deviation

### Measures of spread: the variance

- The variance $s^2$ of a set of observations is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$

- $s^2$ measures the average of the squared deviations of the observations from the mean. Why squared? Why $n - 1$?
- Applet: http://hspm.sph.sc.edu/COURSES/J716/demos/LeastSquares/LeastSquaresDemo.html
- When $s^2 = 0$, we have no spread.
- As $s^2$ increases above 0, observations spread out further about the mean.
- BUT variance has squared units compared to the original data.

### The standard deviation

- The standard deviation $s$ of a set of observations is the square root of the variance, i.e.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.$$

- It has the same units as the observations.
- $s = 0$ corresponds to no spread.
- As $s$ increases above 0, observations spread out further about the mean.
- $s$ and $s^2$ are sensitive to outliers and skewness.

### Calculating numerical summaries in MINITAB

To summarize our sheep dataset:

- Select Stat > Basic Statistics > Display Descriptive Statistics.
- A dialog box now appears.
- In Variables, you select the variable you want to summarize. Either type the variable number (e.g. C1) into the box, OR in the right-hand panel click on "C1 sheep" and choose Select.
- To produce the summaries, click OK in the dialog. You can also produce plots of the data by clicking on Graphs.

### Numerical summaries in MINITAB (cont.)

- The summaries are presented in the Session Window (Descriptive Statistics: sheep):

| Variable | N | N* | Mean | SE Mean | StDev | Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|---|---|---|---|---|---|
| sheep | 23 | 0 | 171.09 | 2.09 | 10.01 | 155.00 | 165.00 | 169.00 | 180.00 | 189.00 |

- Some headings you might not know:
  - N: number of observations
  - N*: number of missing observations
  - StDev: standard deviation
  - SE Mean: standard error for the mean (ignore for just now)

### What summaries do we use?

- The mean is a good measure of the center of symmetric
distributions. For skewed distributions or distributions with outliers, it is better to use the median.
- The standard deviation is a good measure of the spread, or variability, of symmetric distributions. For skewed distributions or distributions with outliers, it is better to use the range or IQR.

### Changing units of measurement (transformations)

- When we collect data, what is the effect of changing the units of measurement? Is the distribution affected? Do the measures of center and spread change?
- Example: Suppose we measure the weight of Americans. How are the summaries of the data altered by our choice of the units of measurement (e.g. pounds vs. kilograms)?

### Linear transformations

- Let $x_1, x_2, \ldots, x_n$ be our data and let our transformed data be $y_1, y_2, \ldots, y_n$.
- A linear transform is defined by

$$y_i = a + b x_i, \quad i = 1, \ldots, n,$$

where $a$ and $b$ are constants. Some examples of linear transformations include:

| $x_i$ | $y_i$ | formula | $a$ | $b$ |
|---|---|---|---|---|
| cents | dollars | $y_i = x_i / 100$ | | |
| pounds | kilograms | $y_i = 0.45\, x_i$ | | |
| Fahrenheit | Celsius | $y_i = \frac{5}{9}(x_i - 32)$ | | |

- A linear transformation shifts and scales the x-axis.
- Thus the shape of a distribution is not affected by a linear transform.

### Linear transformations: example

[Figure: two density plots of adult American weight. Panel 1 x-axis: weight in pounds. Panel 2 x-axis: weight in kg.]

### How do measures of center and spread change under a linear transform?

Remember: $x_i \to y_i = a + b x_i$ for each $i$.

- For measures of center:
  - $\bar{y} = a + b \bar{x}$
  - $M_y = a + b M_x$
  - $\text{Mode}_y = a + b\, \text{Mode}_x$
- For measures of spread:
  - $s_y = |b|\, s_x$
  - $s_y^2 = b^2 s_x^2$
  - $\text{Range}_y = |b|\, \text{Range}_x$
  - $IQR_y = |b|\, IQR_x$

  (here $|b|$ means the absolute value of $b$).

### Nonlinear transformations

- Statistical methods often perform better when a histogram or distribution is a particular shape. For example, we often consider transformations of the data which make a histogram look more symmetric.
- Linear transformations preserve the shape of distributions, so we need to consider nonlinear transformations, e.g. logarithms, square roots, reciprocals.

[Figure: two density plots of adult American weight. Panel 1 x-axis: weight in pounds. Panel 2 x-axis: log(weight in pounds).]

## Stat 528, Autumn 2008 (Elly Kaizar): Inference for a single proportion

Reading: Section 8.1

- Review of counts and proportions
  - Means and stdevs of counts/proportions
  - Approximate sampling distributions
- The significance test for a single proportion
- Coin flipping example
- Confidence intervals for a single proportion
  1. Inverting a family of hypothesis tests
  2. The plug-in method (in textbook)
  3. The $p = 1/2$ "make the standard error big" method
  4. An exact method using the binomial distribution

### A motivating example

An entomologist samples a field for egg masses of a harmful insect by placing a yard-square frame at random locations and carefully inspecting the ground within the frame. An SRS of 75 locations selected from a county's pastureland found egg masses in 13 locations.

- Compute the proportion of locations which contain egg masses.
- Is the proportion of locations with egg masses at least 0.10?
- Form a 90% confidence interval for the proportion of all possible locations that are infested.

### Review: Counts and proportions

- Suppose we ask a random sample of size $n$ a yes (1) / no (0) question. Let the RV $X$ denote the number of "yes" answers.
- The sample count is $X$.
- The sample proportion is

$$\hat{p} = \frac{X}{n}.$$

- Suppose the following are true:
  1. There is a fixed number, $n$, of observations in the sample.
  2. The $n$ observations are all independent.
  3. There are only two outcomes for each observation.
  4. The success probability $p$ is constant for each observation.
- Then $X$ has a Binomial, $Bin(n, p)$, distribution, where $p$ is the true population proportion.

### Review: Means and stdevs of counts/proportions

- For the count $X$, we have

$$\mu_X = np \quad \text{and} \quad \sigma_X = \sqrt{np(1 - p)}.$$

- For the sample proportion $\hat{p}$,

$$\mu_{\hat{p}} = p$$

(thus $\hat{p}$ is an unbiased estimator of $p$), and

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}.$$

- The standard deviations $\sigma_X$ and $\sigma_{\hat{p}}$ both depend on the population proportion $p$.

### Review: Approximate sampling distributions

- IF $np \ge 10$ (or 5) and
$n(1 - p) \ge 10$ (or 5):
  1. The number (count) of successes in the sample, $X$, has approximately a $N\left(np, \sqrt{np(1 - p)}\right)$ distribution.
  2. Also, the sample proportion of successes, $\hat{p}$, is approximately a $N\left(p, \sqrt{p(1 - p)/n}\right)$ RV.
- Problem: $\sigma_X$ and $\sigma_{\hat{p}}$ both depend on the population proportion $p$.
  - We can handle this trivially when conducting a hypothesis test.
  - We need to think carefully about this when forming confidence intervals.

### The significance test for a population proportion

- Hypotheses: $H_0: p = p_0$ for some constant $p_0$, versus $H_a: p \ne p_0$ OR $p > p_0$ OR $p < p_0$.
- A test statistic (sometimes called the score test statistic):

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

- Check that the population size is at least ten times the size of the sample and that $n p_0 \ge 10$ and $n(1 - p_0) \ge 10$.
- If so, then under $H_0$ the test statistic approximately follows a $N(0, 1)$ distribution.
- Another test statistic (sometimes called the Wald test statistic):

$$z_W = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}}$$

### Test for a population proportion (cont.)

- P-value: Find the appropriate area under the standard normal ($Z$) density curve. As usual:
  - For $H_a: p < p_0$, the P-value is $P(Z \le z)$.
  - For $H_a: p > p_0$, the P-value is $P(Z \ge z)$.
  - For $H_a: p \ne p_0$, the P-value is $2P(Z \ge |z|)$.

  This P-value is approximate, since the distribution of the test statistic is approximately normal.
- For a test of significance at the level $\alpha$:
  - If the P-value $\le \alpha$, we reject $H_0$.
  - If the P-value $> \alpha$, we fail to reject $H_0$.
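The score and Wald statistics above can be sketched for the egg-mass example (13 infested locations out of 75, $p_0 = 0.10$). This is a minimal illustration, not the notes' own worked solution: the one-sided alternative $H_a: p > 0.10$ is one assumed reading of the question, the function names are hypothetical, and the normal tail area comes from Python's `statistics.NormalDist` rather than Table A. Note that $n p_0 = 7.5$ here, which meets the "or 5" threshold but not the threshold of 10.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_tests(x, n, p0):
    """Return (p_hat, score z, Wald z) for testing H0: p = p0."""
    p_hat = x / n
    # Score statistic: standard error evaluated at the null value p0.
    z_score = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Wald statistic: standard error evaluated at the estimate p_hat.
    z_wald = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, z_score, z_wald


def upper_tail_p_value(z):
    """P(Z >= z) for a standard normal Z (assumes Ha: p > p0)."""
    return 1 - NormalDist().cdf(z)


# Egg-mass example from the notes: 13 of 75 locations, p0 = 0.10.
p_hat, z_score, z_wald = one_proportion_tests(x=13, n=75, p0=0.10)
p_value = upper_tail_p_value(z_score)

print("p_hat   =", round(p_hat, 4))
print("z score =", round(z_score, 3))
print("z Wald  =", round(z_wald, 3))
print("P-value =", round(p_value, 4))
```

The two statistics differ only in which proportion is plugged into the standard error, so they can lead to different P-values when $\hat{p}$ is far from $p_0$.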
