# Stats 10 Labs/HW/Practice Tests Stats 10

UCLA

GPA 3.16

This 57 page Bundle was uploaded by Erica Roberts on Saturday July 30, 2016. The Bundle belongs to Stats 10 at University of California - Los Angeles taught by in Winter 2015. Since its upload, it has received 20 views. For similar materials see Intro to Statistics in Statistics at University of California - Los Angeles.

Date Created: 07/30/16

STATS10(2)
Roberts, Erica 404400827
1/15/15

Q1. There is a strong, positive correlation between weight and desired weight.

Q2.

Q3. wdiff is a discrete variable. If wdiff = 0, the person’s weight and desired weight are the same. If wdiff is positive, the person’s desired weight is more than their current weight. The opposite is true for a negative wdiff: the person’s weight is more than their desired weight.

Q4. The histogram of wdiff is unimodal and skewed left. It is centered around the median, which is -10. The spread is measured by the IQR, which is 21. The negative center tells us that more people want to lose weight, because their desired weight is lower than their current weight.

Q5. The medians for both boxplots are around the same number (0), so in general, both genders view their weight in a similar way.

Q6. The mean is 169.68 and the standard deviation is 40.08. The proportion of weights within one standard deviation of the mean is 0.7076.

Q7. In this lab, we use concepts that help describe boxplots and dot plots, as well as interpretations of the mean, median, IQR, and standard deviation. Finding the proportion of weights within one standard deviation of the mean is a new concept.

Appendix Coding

    ############################################
    # Stats10 Roberts,Erica
    # 4040400827
    # This code does the On your Own section of lab 1
    ############################################
    # Read data
    source("http://www.openintro.org/stat/data/cdc.R")

    # Q1: scatterplot of desired weight against current weight
    plot(x = cdc$weight, y = cdc$wtdesire)

    # Q2: difference between desired and current weight
    wdiff <- cdc$wtdesire - cdc$weight

    # Q3: no code

    # Q4: histogram restricted to differences between -100 and 100
    hist(wdiff[wdiff < 100 & wdiff > -100], breaks = 100)
    median(wdiff)
    IQR(wdiff)

    # Q5: weight difference by gender
    boxplot(wdiff ~ cdc$gender)

    # Q6: proportion of weights within one standard deviation of the mean
    m <- mean(cdc$weight)
    s <- sd(cdc$weight)
    b2 <- m + s
    b1 <- m - s
    within_1sd <- cdc$weight < b2 & cdc$weight > b1  # renamed from c, which masks base c()
    table(within_1sd) / 20000
    mean(cdc$weight)
    sd(cdc$weight)

Stat 10 – Sanchez
Lecture Homework 1
Last Name: Roberts
First Name: Erica
ID: 404400827

Instructions:
• Type your answers in this document.
For multiple choice questions, just highlight or indicate which is the right answer in the __________ .
• When done, save your file as .pdf, give it your name and ID and upload it in the “Upload homework 1 here” link of “Lecture Homework” folder before the due date.

1. A 2008 survey asked its respondents to report their political party affiliation. The graphs show the results for 615 men. Which political affiliation has the fewest men? ____C____
A. Dem (not strong)
B. Rep (not strong)
C. Strong Rep
D. Other
E. Strong Dem

2. The table below shows the average lifespan for some mammals in years.
(a) Which histogram correctly represents the data? ____A____
A. B. C. (histograms not reproduced)
(b) If you were to include humans in your chosen graph, where would the data point be in your selected graph? Comment on the nature of this point. (Note: humans average 75 years)
If humans were included, the data point for their lifespan would be placed at 75 years in graph A. This would be an outlier because all of the other data points are closer to the lower left end of the graph (from 0 to 40 years).

3. A right-skewed distribution has ____A____
A. a tail that goes to the right.
B. a tail that goes to the left.
C. one mode.
D. a bell shape.

4. True or False: Distributions that have larger standard deviations have more observations that are farther from the mean. ____A____
A. True
B. False

5. The length of a sample of songs in minutes is given below. Compute the sample mean and the sample standard deviation.
10, 7, 5, 4
Sample mean: 6.5
Sample standard deviation s: 2.65 (the squared deviations sum to 21, so s = sqrt(21/3) = sqrt(7))

6. A city planner says, “The typical commute to work for someone living in the city limits is less than the commute to work for someone living in the suburbs.” What does this statement mean? ____B____
A. If you live in the city limits you will have a longer commute time.
B. The center of the distribution of commute times for a city-dweller is less than the center of the distribution for those living in the suburbs.
C. All city dwellers spend less time commuting to work than those living in the suburbs.
D. There is less variation in the commute time of those living in the suburbs.

7. The following nine values represent race finish times in hours for a randomly selected group of participants in an extreme 10k race (a 10k race with obstacles). Which of the following is closest to the mean of the following data set? ____B____
1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 1.4, 1.4, 1.5
A. The mean is about 1.1 hours
B. The mean is about 1.3 hours
C. The mean is about 1.5 hours
D. The mean is about 1.6 hours

8. The boxplots summarize the number of sentenced prisoners by state in the Midwest and West. Based on the boxplot for the Midwest, which of the following is true? ____B____
A. 25% of the states sentenced less than 1,435 prisoners.
B. 25% of the states sentenced more than 29,928 prisoners.
C. 50% of the states sentenced less than 4,322 prisoners.
D. 50% of the states sentenced more than 29,928 prisoners.

9. Each boy in a sample of 200 Mexican American males, age 10-18, was classified according to smoking status and response to a question asking whether he likes to do risky things. The following table is based on data given in the article, “The Association between Smoking and Unhealthy Behavior among Mexican American Adolescents” (Journal of School Health [1998- 376-379]).

| Habit | Nonrisky behavior | Risky behavior | Row summary |
|---|---|---|---|
| Nonsmoker | 75 | 45 | 120 |
| Smoker | 35 | 45 | 80 |
| Column summary | 110 | 90 | 200 |

a) What proportion of these subjects are smokers?
80/200 = .4
b) If we select one smoker at random, what are the chances that he likes risky things?
45/80 = about .56
c) Are “smoking” and “likes risky things” independent attributes? Justify your answer with specific calculations.
P(“smoking”) = .4 and P(“smoking” given “likes risky things”) = 45/90 = .5. These probabilities are not equal, so “smoking” and “likes risky things” are not independent.

10. As gasoline prices have increased in recent years, many drivers have expressed concern about the taxes they pay on gasoline for their vehicles. In the United States, gasoline taxes are imposed by both the federal government and by individual states. The graph below summarizes the distribution of state gasoline taxes, in cents per gallon, for all 50 states on January 1.

[Figure: State Gas Tax box plot; horizontal axis State_Tax from 0 to 35 cents per gallon, with Q1, median, and Q3 marked]

a) Based on the boxplot, what are the approximate values for the median and the interquartile range of the distribution of state gasoline taxes, in cents per gallon? Mark and label the boxplot to indicate how you found these values.
Median: 21
IQR = Q3 - Q1 = 25 - 17.5 = 7.5

b) The federal tax imposed on gasoline is 19.4 cents per gallon at the same time the state taxes were in effect. The federal gasoline tax is added to the state gasoline tax for each state to create a new distribution of combined taxes. What is the value, in cents per gallon, of the median and interquartile range for the new distribution? Justify your answer.
Median: 21 + 19.4 = 40.4 cents/gallon
IQR: the IQR does not change, because adding a constant shifts both quartiles by the same amount. The new Q3 = 25 + 19.4 = 44.4 and the new Q1 = 17.5 + 19.4 = 36.9, so IQR = 44.4 - 36.9 = 7.5.

STATS10
Roberts, Erica 404400827
1/26/14

On your own
Comparing Kobe Bryant to the Independent Shooter

Using calc_streak, compute the streak lengths of sim_basket.

1. Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player’s longest streak of baskets in 133 shots?
The distribution of streak lengths for the simulation of the independent shooter is skewed right and unimodal. The typical streak length in this case is the median, which is a streak of 1 basket.
The longest streak of baskets was 6 baskets. The spread, measured by the IQR, is 2.

2. If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.
I would expect the shape, the typical streak length, and the maximum streak length to be very similar, but the height of each bar in the plot to be a little different. The bar plot below represents another simulated streak length distribution, which has a similar shape and the same center (1), but a different maximum (4).

3. How does Kobe Bryant’s distribution of streak lengths from page 2 compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe’s shooting patterns? Explain.
The bar plot of Kobe’s shooting streaks is also skewed right and unimodal. It has a center (median) of 0 and a maximum of 4. Kobe’s streak distribution also has an IQR of 2. Since Kobe’s streak distribution looks a lot like the independent shooter’s distribution, which represents someone without a hot hand, it appears that Kobe’s shooting patterns don’t fit the hot hand model.

4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.
The concept of describing distributions using the center, shape, and spread is covered in the textbook and in lecture. The idea of independent probabilities is newer, but it came up in lecture and homework.
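The independent-shooter comparison described above can be tried without the lab's data file. What follows is only a minimal sketch: `count_streaks` is a hypothetical stand-in for the lab's `calc_streak`, the outcomes are assumed to be coded "H" (hit) and "M" (miss), and the seed is arbitrary.

```r
# Hypothetical helper (not the lab's calc_streak): a streak is the number of
# consecutive hits between two misses, so a miss right after a miss gives 0.
count_streaks <- function(shots) {
  # positions of the misses, with a sentinel before the first and after the last shot
  miss_pos <- c(0, which(shots == "M"), length(shots) + 1)
  diff(miss_pos) - 1
}

set.seed(10)  # arbitrary seed, only so the run is repeatable
outcomes <- c("H", "M")  # assumed coding
# prob follows the order of outcomes: 45% hits, 55% misses
sim_basket <- sample(outcomes, size = 133, replace = TRUE, prob = c(0.45, 0.55))
sim_streak <- count_streaks(sim_basket)

table(sim_streak)   # counts behind the bar plot
median(sim_streak)  # typical streak length
max(sim_streak)     # longest streak
IQR(sim_streak)     # spread
```

Changing the seed and rerunning gives a feel for how much the bar heights move between simulations while the right-skewed shape stays put.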
Appendix Coding

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the On your Own section of lab 2
    ############################################
    # Read data
    download.file("http://www.openintro.org/stat/data/kobe.RData", destfile = "kobe.RData")
    load("kobe.RData")

    # Q1
    # outcomes and sim_basket came from the lab session; they are recreated here,
    # assuming the coding "H" (hit) and "M" (miss) used in the lab handout.
    outcomes <- c("H", "M")
    sim_basket <- sample(outcomes, size = 133, replace = TRUE, prob = c(0.45, 0.55))
    sim_streak <- calc_streak(sim_basket)
    barplot(table(sim_streak))
    median(sim_streak)
    max(sim_streak)
    IQR(sim_streak)

    # Q2: second simulation; prob follows the order of outcomes, so 0.45 is the hit probability
    sim_basket <- sample(outcomes, size = 133, replace = TRUE, prob = c(0.45, 0.55))
    sim_streak <- calc_streak(sim_basket)
    barplot(table(sim_streak))
    median(sim_streak)

    # Q3: kobe_streak was computed earlier in the lab session
    barplot(table(kobe_streak))
    median(kobe_streak)
    IQR(kobe_streak)

STATS10
Roberts, Erica 404400827
1/27/14

Lab Exam Practice

Use the handout Lab 1. Introduction to Data. You may find it in CCLE under Lab 1 Week 2, or by going to this link: https://www.openintro.org/download.php?file=os2_lab_01A&referrer=/stat/labs.php

For this practice exam you will repeat what the TAs did in session. So answer the questions asked in Exercise 1 to Exercise 5 on your own and Exercise 6 (to be announced by the TA). While you do it, write the script file in RStudio, save it, and copy and paste it at the end. NOTE: these are not the “on your own” questions. Rather, they are the exercises in the handout as discussed by the TA in Week 2.

1. Exercise 1: There are 20,000 cases in this data set and 9 variables. The categorical variables are gender, smoke100, hlthplan, genhlth, and exerany. The numerical variables are age, wtdesire, weight, and height.

2. Exercise 2: The IQR for height is 6 and the IQR for age is 26. There are 9,569 males in this sample, and the proportion of people in the sample in excellent health is 0.233.

3. Exercise 3: The mosaic plot reveals that fewer women than men have smoked 100 cigarettes in their life.

4. Exercise 4: See commands.

5. Exercise 5: The boxplot shows that people in all types of health conditions tend to have the same median BMI, around 4. A median of about 4 is not a plausible BMI; it results from dividing weight by weight squared. BMI should be computed as weight / height^2 * 703, as the appendix code now does, and the boxplots should be read again with that formula. I also made a boxplot of BMI by age because I figured that, as people get older, their weight increases. In general, the median BMI increases with age.

6. Exercise 6 (surprise question given by the TA at the beginning of the lab session): find the median for male weight, and calculate the z-score.
The median for male weight is 185. The z-score for the median of male weight is -0.118.

Appendix Coding

    ############################################
    # Stat 10 Roberts,Erica
    # 404400827
    # This code does the Lab 1. Introduction to Data
    ############################################
    # Read data
    source("http://www.openintro.org/stat/data/cdc.R")

    # Exercise 1: variable names and dimensions
    names(cdc)
    dim(cdc)

    # Exercise 2
    summary(cdc$height)
    IQR(cdc$height)
    summary(cdc$age)
    IQR(cdc$age)
    table(cdc$gender, cdc$exerany)
    2149 + 7420               # number of males
    table(cdc$gender, cdc$genhlth)
    (2298 + 2359) / 20000     # proportion in excellent health

    # Exercise 3
    mosaicplot(table(cdc$gender, cdc$smoke100))

    # Exercise 4
    under23_and_smoke <- subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)

    # Exercise 5: BMI uses height squared in the denominator (the original divided by weight squared)
    bmi <- (cdc$weight / cdc$height^2) * 703
    boxplot(bmi ~ cdc$genhlth)
    boxplot(bmi ~ cdc$age)

    # Exercise 6: z-score of the median male weight
    mdata <- subset(cdc, cdc$gender == "m")
    median(mdata$weight)
    mean(mdata$weight)
    sd(mdata$weight)
    (185 - 189.32) / 36.55

STATS10
Roberts, Erica 404400827
2/5/14

The Normal Distribution (Answers for discussion in the lab)

1. Exercise 1: Both distributions are bell shaped and unimodal. Men have a higher mean height at 177.7, whereas women have a mean height of 164.9. Men also have a larger spread, with a standard deviation of 7.18 compared to the standard deviation of women’s heights at 6.54.

2. Exercise 2: Yes, the plot of the female height distribution seems to follow a nearly normal distribution. Because the male height distribution looked similar, we can infer that it is also nearly normal.
3. Exercise 3: Not all of the points fall perfectly on the line, which is similar to the probability plot for the real data.

4. Exercise 4: Just like in the last exercise, the normal probability plot for female heights looks similar to the plots for simulated data, which suggests that the female heights are nearly normal.

5. Exercise 5: Female weight does not seem to follow the normal distribution. From the histogram, we can tell that the distribution is more right skewed than bell shaped. Also, in the Q-Q plot, the points don’t fall on the line, even though in the simulated plot (second one) they are on the line.

6. Exercise 6:
Q1: What is the probability that someone is shorter than 160 cm?
Probability using theoretical: 0.228
Probability using empirical: 0.192
Q2: What is the probability that someone is heavier than 65 kg?
Probability using theoretical: 0.324 (that is, 1 - 0.676; pnorm returns the probability below 65 kg, so the upper tail is its complement)
Probability using empirical: 0.261
The difference between the probabilities is smaller for female heights than it is for weights, so there is closer agreement between the two methods for height.

The Normal Distribution (On your own)

Question 1.
a) Female biiliac diameter belongs to plot B
b) Female elbow diameter belongs to plot C
c) General age belongs to plot D
d) Chest depth belongs to plot B

Question 2: The data have been rounded or are discrete, and that’s why the probability plots have a stepwise pattern.

Question 3: Knee diameter appears to be right skewed in the probability plot, and the histogram confirms that it is right skewed.
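The theoretical versus empirical comparison in Exercise 6 depends on which tail `pnorm` returns, so a small check may help. This is a sketch on simulated data: the vector `wgt` and its mean and standard deviation are made up for illustration and are not the `bdims` measurements.

```r
set.seed(1)  # arbitrary seed
# Hypothetical weights in kg; NOT the lab's data
wgt <- rnorm(260, mean = 60, sd = 10)

# pnorm() returns P(X < q) by default, so "heavier than 65 kg" needs the upper tail
p_below   <- pnorm(q = 65, mean = mean(wgt), sd = sd(wgt))
p_above   <- pnorm(q = 65, mean = mean(wgt), sd = sd(wgt), lower.tail = FALSE)
p_above_2 <- 1 - p_below  # same quantity, written as a complement

# Empirical counterpart: share of observations above 65
p_emp <- mean(wgt > 65)

c(theoretical = p_above, empirical = p_emp)
```

For data that really are close to normal, the two numbers should land near each other; a large gap is one more hint that the normal model fits poorly.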
Appendix Coding

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab The Normal Distribution
    # As discussed in lab with the TA
    ############################################
    # Read data
    # The original left this step empty; the URL below is assumed to follow the
    # same pattern as the other labs' data files.
    download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
    load("bdims.RData")

    # Exercise 1
    mdims <- subset(bdims, sex == 1)
    fdims <- subset(bdims, sex == 0)
    par(mfrow = c(1, 2))
    hist(mdims$hgt, main = "Histogram of Men's height")
    hist(fdims$hgt, main = "Histogram of Women's height")  # was mdims$hgt in the original
    summary(mdims$hgt)
    sd(mdims$hgt)
    summary(fdims$hgt)
    sd(fdims$hgt)
    par(mfrow = c(1, 1))

    # Exercise 2: histogram with the fitted normal curve
    fhgtmean <- mean(fdims$hgt)
    fhgtsd <- sd(fdims$hgt)
    hist(fdims$hgt, probability = TRUE, ylim = c(0, .06))
    x <- 140:190
    y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
    lines(x = x, y = y, col = "blue")

    # Exercise 3
    qqnorm(fdims$hgt)
    qqline(fdims$hgt)
    sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

    # Exercise 4
    qqnormsim(fdims$hgt)

    # Exercise 5
    hist(fdims$wgt)
    summary(fdims$wgt)
    sd(fdims$wgt)
    fwgtmean <- mean(fdims$wgt)
    fwgtsd <- sd(fdims$wgt)
    hist(fdims$wgt, probability = TRUE, ylim = c(0, .06))
    x <- 40:110
    y <- dnorm(x = x, mean = fwgtmean, sd = fwgtsd)
    lines(x = x, y = y, col = "blue")
    qqnorm(fdims$wgt)
    qqline(fdims$wgt)
    sim_norm_wgt <- rnorm(n = length(fdims$wgt), mean = fwgtmean, sd = fwgtsd)
    qqnorm(sim_norm_wgt)
    qqline(sim_norm_wgt)
    qqnormsim(fdims$wgt)

    # Exercise 6
    pnorm(q = 160, mean = fhgtmean, sd = fhgtsd)
    sum(fdims$hgt < 160) / length(fdims$hgt)
    # "heavier than 65 kg" is an upper-tail probability
    pnorm(q = 65, mean = fwgtmean, sd = fwgtsd, lower.tail = FALSE)
    sum(fdims$wgt > 65) / length(fdims$wgt)

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab The Normal Distribution
    # On your Own part
    ############################################
    # Question 1
    qqnorm(fdims$bii.di)
    qqline(fdims$bii.di)
    qqnorm(fdims$elb.di)
    qqline(fdims$elb.di)
    qqnorm(fdims$age)
    qqline(fdims$age)

    # Question 2: no code

    # Question 3
    qqnorm(fdims$kne.di)
    qqline(fdims$kne.di)
    hist(fdims$kne.di)

STATS10
Erica Roberts 404400827
2/10/15

Foundations for statistical inference - sampling distributions (Answers for discussion in the lab)

Exercise 1: The population distribution is right skewed, with a median of 1442, an IQR of 617, a mean of 1500, and a standard deviation of 505.5.

Exercise 2: The distribution of this sample is also right skewed, like the population distribution. The median of the sample, 1482, is higher than the median of the population, but the standard deviation is smaller (493.84). The IQR of the sample is almost the same as the IQR of the population, only a little smaller (607 vs. 617). The mean of the sample is higher than the mean of the population.

Exercise 3: The mean of sample two, 1495, is lower than the mean of sample one, 1539.66. The largest sample will be the best representation of the population distribution. The mean of sample 4, with 1000 subjects, is closest to the mean of the population (1503.84 vs. 1500).

Exercise 4: There are 5,000 elements in sample_means50. The sampling distribution of sample_means50 is almost normal in shape, with a center of 1498.64. With 50,000 samples, the distribution looks even more normal, with a mean of 1499.83.

Exercise 5: There are 100 observations in this sample; each observation represents the mean of a sample of size 50.
      [1] 1486.12 1509.82 1464.76 1507.88 1496.26 1551.30 1595.54 1609.06 1465.08 1455.06
     [11] 1420.54 1489.38 1535.74 1592.62 1492.10 1512.54 1559.30 1473.76 1490.76 1470.44
     [21] 1605.24 1450.46 1436.58 1532.60 1504.66 1504.12 1435.56 1446.66 1509.64 1521.30
     [31] 1428.18 1454.30 1405.40 1546.60 1494.76 1430.48 1420.74 1410.50 1462.56 1459.86
     [41] 1408.06 1387.54 1415.42 1591.54 1433.36 1437.98 1509.08 1516.56 1480.58 1466.58
     [51] 1496.58 1685.24 1554.48 1552.72 1486.50 1434.02 1566.86 1528.20 1623.26 1407.00
     [61] 1552.10 1395.52 1532.06 1550.60 1546.56 1559.84 1527.22 1386.48 1568.38 1459.32
     [71] 1511.44 1427.74 1615.22 1557.66 1509.18 1471.50 1499.84 1413.18 1543.52 1514.66
     [81] 1510.54 1260.08 1620.80 1375.56 1580.68 1467.80 1374.42 1421.60 1537.46 1423.20
     [91] 1534.42 1477.38 1465.56 1578.38 1513.52 1406.10 1556.80 1436.74 1530.36 1549.88

Exercise 6: As the sample size gets bigger, the sample means cluster more closely around the population mean. In other words, the standard deviation of the sampling distribution gets smaller as the sample size gets bigger.

On your own

1. The mean price of this sample of 50 is 191,550.4, which would be my estimate for the population mean.

2. The sampling distribution of the sample means looks normal, with a center around 180,665.5. Because this distribution is built from many samples (5,000), its mean is a good estimate of the population mean. The actual population mean of price is 180,796.1.

3. The sampling distribution in this plot also looks normal. Because each sample is larger (150 houses instead of 50), the overall mean of 180,776.8 is even closer to the population mean than in the previous question.

4. The standard deviation of sample_means50 is 11,189.81, whereas the standard deviation of sample_means150 is 6,412.49. sample_means150 has a smaller spread, so an estimate of the mean from a sample of 150 would tend to be closer to the true value. For estimation, we want sampling distributions with smaller spreads.
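The two spreads reported in question 4 can be checked against the usual standard error formula. This is a sketch under the assumption of independent draws; $\sigma$ stands for the population standard deviation of price, which the write-up does not report, so only the ratio of the two spreads is compared.

```latex
\[
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{\mathrm{SE}_{n=50}}{\mathrm{SE}_{n=150}} = \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73,
\qquad
\frac{11{,}189.81}{6{,}412.49} \approx 1.75 .
\]
```

The simulated ratio sits close to the predicted one, which is what we would expect if tripling the sample size shrinks the spread of the sample means by a factor of about $\sqrt{3}$.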
Appendix Coding

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab Sampling Distributions
    # As discussed in lab with the TA
    ############################################
    # Read data
    download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
    load("ames.RData")
    area <- ames$Gr.Liv.Area
    price <- ames$SalePrice

    # Exercise 1: population distribution of area
    summary(area)
    sd(area)
    summary(area)[5] - summary(area)[2]  # IQR = Q3 - Q1
    par(mfrow = c(1, 2))
    hist(area, breaks = 50)
    qqnorm(area)
    qqline(area)

    # Exercise 2: one sample of 50
    samp1 <- sample(area, 50)
    summary(samp1)[3]                      # median
    mean(samp1)
    summary(samp1)[5] - summary(samp1)[2]  # IQR
    hist(samp1, breaks = 10)
    sd(samp1)
    qqnorm(samp1)
    qqline(samp1)

    # Exercise 3: samples of increasing size
    samp2 <- sample(area, 50)
    mean(samp2)
    samp3 <- sample(area, 100)
    mean(samp3)
    samp4 <- sample(area, 1000)
    mean(samp4)

    # Exercise 4: sampling distribution of the mean, 5,000 and then 50,000 samples of size 50
    sample_means50 <- rep(NA, 5000)
    for (i in 1:5000) {
      samp <- sample(area, 50)
      sample_means50[i] <- mean(samp)
    }
    mean(sample_means50)
    sd(sample_means50)
    hist(sample_means50, main = "5000 means")
    qqnorm(sample_means50)
    qqline(sample_means50)
    sample_means50 <- rep(NA, 50000)
    for (i in 1:50000) {
      samp <- sample(area, 50)
      sample_means50[i] <- mean(samp)
    }
    hist(sample_means50, main = "50000 means")

    # Exercise 5
    sample_means_small <- rep(0, 100)
    for (i in 1:100) {
      samp <- sample(area, 50)
      sample_means_small[i] <- mean(samp)
    }
    sample_means_small
    length(sample_means_small)

    # Exercise 6: compare sample sizes 10, 50, and 100 on a common x-axis
    sample_means10 <- rep(NA, 5000)
    sample_means100 <- rep(NA, 5000)
    for (i in 1:5000) {
      samp <- sample(area, 10)
      sample_means10[i] <- mean(samp)
      samp <- sample(area, 100)
      sample_means100[i] <- mean(samp)
    }
    par(mfrow = c(3, 1))
    xlimits <- range(sample_means10)
    hist(sample_means10, breaks = 20, xlim = xlimits)
    hist(sample_means50, breaks = 20, xlim = xlimits)
    hist(sample_means100, breaks = 20, xlim = xlimits)

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab Sampling Distributions
    # On your Own part
    ############################################
    download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
    load("ames.RData")
    price <- ames$SalePrice

    # Q1
    price1 <- sample(price, 50)
    mean(price1)

    # Q2: 5,000 samples of size 50
    sample_means50 <- rep(NA, 5000)
    for (i in 1:5000) {
      samp <- sample(price, 50)
      sample_means50[i] <- mean(samp)
    }
    mean(sample_means50)
    sd(sample_means50)
    hist(sample_means50, main = "5000 samples")
    qqnorm(sample_means50)
    qqline(sample_means50)
    mean(price)

    # Q3: 5,000 samples of size 150
    sample_means150 <- rep(NA, 5000)
    for (i in 1:5000) {
      samp <- sample(price, 150)
      sample_means150[i] <- mean(samp)
    }
    mean(sample_means150)
    sd(sample_means150)
    hist(sample_means150, main = "5000 samples")
    qqnorm(sample_means150)
    qqline(sample_means150)

    # Q4
    sd(sample_means150)
    sd(sample_means50)

STATS10
Roberts, Erica 404400827
2/17/15

Foundations for statistical inference - Confidence Intervals (Answers for discussion in the lab)

Exercise 1: The shape of this distribution is right skewed and unimodal. The median of the sample is the best estimate of a typical value, which in this case is 1638. The spread of the sample is represented by the IQR, which is 411.

Exercise 2: I would expect another student’s distribution to be similar in shape, spread, and center, but these values would vary a little because they are random samples.

Exercise 3: The sample size must be at least 30. The observations are independent. The data distribution isn’t strongly skewed.

Exercise 4: 95% of all possible samples of the same size will result in an interval that captures the true average size of houses in Ames.

Exercise 5: I am 95% confident that the true average size of houses in Ames is between 1507.7 and 1730.9.
This is the confidence interval I got; however, the true mean value for this population is 1499.69, which is not included in this range. The student next to me found a confidence interval that did include the population mean.

Exercise 6: I would expect 95% of the confidence intervals to include the true mean for this population because we used a confidence level of 95%.

On your own

1. 96% of the confidence intervals created include the true population mean. This is very close to the confidence level of 95%, but it is not exact because of random sampling variation.

2. I chose a confidence level of 99%. The critical value for 99% is about 2.58. The appendix code used 2.98, which gives slightly wider intervals (roughly 99.7% confidence).

3. With those intervals, all of the confidence intervals found include the true population mean.

Appendix Coding

Exercises

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab Confidence Interval
    # As discussed in lab with the TA
    ############################################
    # Read data
    download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
    load("ames.RData")

    # Exercise 1: one sample of 60 houses
    population <- ames$Gr.Liv.Area
    samp <- sample(population, 60)
    par(mfrow = c(1, 2))
    hist(samp, breaks = 25)
    qqnorm(samp)
    qqline(samp)
    summary(samp)
    sd(samp)

    # Exercises 2 to 4: no code

    # Exercise 5: 95% confidence interval for the mean
    sample_mean <- mean(samp)
    se <- sd(samp) / sqrt(60)
    lower <- sample_mean - 1.96 * se
    upper <- sample_mean + 1.96 * se
    c(lower, upper)
    mean(population)

    # Exercise 6: no code

On your own.
    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # This code does the Lab Confidence Interval
    # On your Own part
    ############################################
    download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
    load("ames.RData")
    population <- ames$Gr.Liv.Area

    # Q1: 50 samples of size 60, each with its own 95% interval
    samp_mean <- rep(NA, 50)
    samp_sd <- rep(NA, 50)
    n <- 60
    for (i in 1:50) {
      samp <- sample(population, n)
      samp_mean[i] <- mean(samp)
      samp_sd[i] <- sd(samp)
    }
    lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
    upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
    c(lower_vector[1], upper_vector[1])
    plot_ci(lower_vector, upper_vector, mean(population))
    48 / 50  # share of intervals that captured the population mean

    # Q2: no code

    # Q3
    # Note: 2.98 is the multiplier used in the original run; the 99% critical
    # value is qnorm(0.995), about 2.58, so these intervals are slightly wider than 99%.
    lower_vector <- samp_mean - 2.98 * samp_sd / sqrt(n)
    upper_vector <- samp_mean + 2.98 * samp_sd / sqrt(n)
    c(lower_vector[1], upper_vector[1])
    plot_ci(lower_vector, upper_vector, mean(population))

STATS10
Roberts, Erica 404400827
2/24/15

Inference for Numerical Data

Exercise 1: There are 1000 cases in our sample. Each case is a birth recorded in North Carolina.

Exercise 2: Babies of mothers who don’t smoke had a greater birth weight than babies of smokers. Also, nonsmoking mothers’ babies have more outliers at the low end of weight.

Exercise 3: The sample is random. The number of nonsmokers in the sample is 873 and the number of smokers is 126, which are both greater than 10. The histograms for both habits show no strong skewness. Also, the observations are independent because the population is greater than 10 times the sample size.

Exercise 4:
H0: μ weight of smoking mothers = μ weight of nonsmoking mothers
Ha: μ weight of smoking mothers ≠ μ weight of nonsmoking mothers

Exercise 5: We are 95% confident that the interval 0.0534 to 0.5777 captures the true average difference between mean weights of babies born to nonsmoking mothers and smoking mothers.
On your own

1. We are 95% confident that the interval 38.15 to 38.51 weeks captures the true average length of pregnancies for all mothers.

2. We are 90% confident that the interval from 38.18 to 38.49 weeks captures the true average length of pregnancies for all mothers.

3.
H0: μ weight of younger mothers = μ weight of mature mothers
Ha: μ weight of younger mothers ≠ μ weight of mature mothers
We are 95% confident that the interval -0.27 to 0.33 captures the true average difference between the weights of babies born to younger mothers and babies born to mature mothers. Because 0 is captured in this interval, at the .05 level, we fail to reject the null hypothesis that μ weight of younger mothers = μ weight of mature mothers.

4. Based on the summary data, the age cutoff for younger and mature mothers is 35. Looking at the summary data, the maximum age for a younger mother is 34, whereas the minimum age for a mature mother is 35.

5. We are interested in researching the relationship between the low birth weight categorical variable and the weight gained by the mother during pregnancy. The null hypothesis is that there is no relationship.
H0: μ weight gained of mothers with a low birth weight baby = μ weight gained of mothers with a non-low birth weight baby
Ha: μ weight gained of mothers with a low birth weight baby ≠ μ weight gained of mothers with a non-low birth weight baby
We are 95% confident that the true average difference between weight gained of mothers with a low birth weight baby and weight gained of mothers with a non-low birth weight baby is between -7.6815 and -1.8332 lbs. Because 0 is not captured in this interval, at the .05 significance level, we reject the null hypothesis.
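The "does the interval capture 0?" reasoning used in questions 3 and 5 can be sketched with base R alone. The data below are simulated and every number in them is made up; the lab's `inference()` function is replaced here by `t.test()`, which builds a Welch interval and may not match the lab function's output exactly.

```r
set.seed(1)  # arbitrary seed
# Hypothetical data: 100 nonsmokers and 40 smokers with made-up birth weights (lbs)
habit  <- factor(rep(c("nonsmoker", "smoker"), times = c(100, 40)))
weight <- c(rnorm(100, mean = 7.1, sd = 1.5), rnorm(40, mean = 6.8, sd = 1.4))

# Welch two-sample t interval for mean(nonsmoker) - mean(smoker)
fit <- t.test(weight ~ habit, conf.level = 0.95)
ci  <- fit$conf.int

# A two-sided test at the .05 level rejects H0 exactly when 0 lies outside the 95% interval
excludes_zero <- ci[1] > 0 || ci[2] < 0

ci
excludes_zero
```

Tying the decision to the interval keeps the conclusion and the confidence statement consistent with each other, which is the link the answers above rely on.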
Appendix Coding

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # Exercises code
    ############################################
    # Read data
    download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
    load("nc.RData")

    # Exercise 1
    dim(nc)

    # Exercise 2
    boxplot(weight ~ habit, data = nc)

    # Exercise 3: group sizes and shape of each group's distribution
    by(nc$weight, nc$habit, length)
    par(mfrow = c(1, 2))
    hist(subset(nc, habit == "smoker")$weight, main = "smoker", xlab = "weight")
    hist(subset(nc, habit == "nonsmoker")$weight, main = "nonsmoker", xlab = "weight")

    # Exercise 4: no code

    # Exercise 5: confidence interval for the difference in mean weight by habit
    inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0,
              alternative = "twosided", method = "theoretical")

    ############################################
    # Stat 10 Roberts, Erica
    # 404400827
    # On your Own code
    ############################################
    # Q1
    inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical")

    # Q2
    inference(y = nc$weeks, est = "mean", type = "ci", method = "theoretical",
              conflevel = 0.90)

    # Q3
    inference(y = nc$weight, x = nc$mature, est = "mean", type = "ci", null = 0,
              alternative = "twosided", method = "theoretical")

    # Q4
    by(nc$mage, nc$mature, summary)

    # Q5
    inference(y = nc$gained, x = nc$lowbirthweight, est = "mean", type = "ci", null = 0,
              alternative = "twosided", method = "theoretical")

STATS10
Erica Roberts 404400827
3/2/15

Introduction to linear regression

Exercise 1: I would use a scatterplot to show the relationship between runs and another numerical variable. From the scatterplot of at bats and runs, the relationship between these variables does not look very linear, so I would not feel comfortable predicting runs from a linear model for at bats.
Exercise 2: The relationship between at bats and runs is a positive, weak, linear association with some outliers. The correlation is 0.61, which means that the association is positive but only weakly linear.
Exercise 3: The smallest sum of squares that I got from my trials was 139,380.8. My neighbors got sums of around 140,000, so this is a reasonable value.
Exercise 4: runs = 415.24 + 1.83 (homeruns)
The slope is 1.83, which means that for every additional homerun, the predicted number of runs increases by 1.83. This means that the more homeruns a team hits, the more runs it is predicted to score.
Exercise 5: runs = -2789.24 + 0.631 (at bats)
With 5,578 at bats, the predicted number of runs would be 730.478. The closest real data point is 5,579 at bats with 713 runs. The residual for this point is 713 - 730.5 = -17.5, so the model overestimates this team's runs by 17.5.
Exercise 6: The residual plot shows no obvious pattern, so the relationship appears to be linear.
Exercise 7: The histogram of residuals looks roughly normal and so does the Normal Q-Q plot, so the nearly normal residuals condition appears to be met.
Exercise 8: The constant variability condition calls for the variability of points around the least squares line to remain roughly constant. Based on the previous plots, this condition has been met.
On your own
1. This plot shows the relationship between batting average and runs, which seems to be positive, strong, and linear.
2. 65% of the variation in runs is explained by the least squares regression on batting average. This is better than the value of R^2 for homeruns vs. runs, which was 63%.
3. After investigating different variables, it turns out that batting average is the best predictor of runs.
4. The new variables, the ones used in Moneyball, seem to be better predictors of runs. 93% of the variation in runs is explained by the least squares regression on the new variables.
5. These are the plots for the new variables. The residuals seem to follow the normal approximation very well.
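The prediction and residual in Exercise 5 can be reproduced from the rounded coefficients quoted there. This is a minimal sketch: `predict_runs` is a hypothetical helper name, and the only numbers used are the ones stated in the exercise.

```r
# Coefficients as rounded in Exercise 5: runs = -2789.24 + 0.631 * at_bats
b0 <- -2789.24
b1 <- 0.631

# Hypothetical helper: predicted runs for a given number of at bats.
predict_runs <- function(at_bats) b0 + b1 * at_bats

y_hat <- predict_runs(5578)   # predicted runs at 5,578 at bats
observed <- 713               # observed runs for the nearest team (5,579 at bats)
residual <- observed - y_hat  # residual = observed - predicted

print(round(y_hat, 3))
print(round(residual, 1))
```

A negative residual means the fitted line predicts more runs than the team actually scored. The appendix code uses the less rounded coefficients -2789.2429 and 0.6305, which give a prediction of about 727.7 instead, so the exact residual depends on how far the slope is rounded.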
Appendix Coding
############################################
# Stat 10 Roberts, Erica #
# 4044008 #
# Exercises code #
#############################################
# Read data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
##############
# Exercise 1 #
##############
plot(mlb11$at_bats, mlb11$runs, main = "Relationship")
##############
# Exercise 2 #
##############
cor(mlb11$runs, mlb11$at_bats)
##############
# Exercise 3 #
##############
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
##############
# Exercise 4 #
##############
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##############
# Exercise 5 #
##############
# Coefficients copied by hand from summary(m1), which is fitted in Exercise 6
b0 <- -2789.2429
b1 <- 0.6305
x <- 5578
Yhat <- b0 + b1 * x
Yhat
mlb11[order(mlb11$at_bats, mlb11$runs), ]
##############
# Exercise 6 #
##############
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
##############
# Exercise 7 #
##############
par(mfrow = c(1, 2))
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
##############
# Exercise 8 #
##############
############################################
# Stat 10 Roberts, Erica #
# 404400827 #
# On your Own code
#############################################
# Read data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
##############
# Q1 #
##############
m3 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship3")
abline(m3)
##############
# Q2 #
##############
summary(m3)
##############
# Q3 #
##############
m3 <- lm(runs ~ bat_avg, data = mlb11)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals) # adds diagonal line to the normal prob plot
summary(m3)
##############
# Q4 #
##############
names(mlb11)
mnew_obs <- lm(runs ~ new_obs, data = mlb11)
mnew_slug <- lm(runs ~ new_slug, data = mlb11)
mnew_onbase <- lm(runs ~ new_onbase, data = mlb11)
summary(mnew_obs)
summary(mnew_slug)
summary(mnew_onbase)
mnew_obs <- lm(runs ~ new_obs, data = mlb11)
##############
# Q5 #
##############
hist(mnew_obs$residuals)
qqnorm(mnew_obs$residuals)
qqline(mnew_obs$residuals) # adds diagonal line to the normal prob plot

Practice Midterm 2 - 2/24/15, 2:42 PM
Started on: Tuesday, February 24, 2015, 2:17 PM
State: Finished
Completed on: Tuesday, February 24, 2015, 2:42 PM
Time taken: 24 mins 13 secs
Points: 16.00/20.00
Grade: 8.00 out of 10.00 (80%)

Question 1
An economist believes that Americans will spend less money on holiday shopping this year than last year. Last year, 85% of all Americans bought a gift for the Winter holiday season. The economist publishes results of a survey of Americans that, he says, shows that a lower proportion of Americans are spending money on gifts this year. He shows this with a hypothesis test, and says that he used "a significance level of 1%." What does this mean?
Status: Correct, 1.00 points out of 1.00
Select one:
a. The probability that he will conclude that a lower proportion of Americans are shopping than last year is 1%.
b. The probability that he will get a proportion more extreme than the one he saw, assuming that in fact 85% of Americans are shopping this year, is 1%.
c. The probability that he will make the correct decision is 99%.
d. The probability that he will conclude that a lower proportion are shopping when, in fact, the proportion is 85%, is 1%.
The correct answer is: The probability that he will conclude that a lower proportion are shopping when, in fact, the proportion is 85%, is 1%.

Question 2
A researcher was interested in comparing the salaries of female and male employees of a particular company.
Independent random samples of female employees (sample 1) and male employees (sample 2) were taken to calculate the mean salary, in dollars per week, for each group. A 90% confidence interval for the difference (μ1 - μ2) between the mean weekly salary of all female employees and the mean weekly salary of all male employees was determined to be (-$110; $10).
Status: Correct, 1.00 points out of 1.00
Select one:
a. We are 90% confident that a randomly selected female employee at this company makes between $110 less and $10 more per week than a randomly selected male employee.
b. We know that 90% of female employees at this company make between $110 less and $10 more than the male employees.
c. Based on these data, with 90% confidence, male employees at this company average between $110 less and $10 more per week than the female employees.
d. Based on these data, with 90% confidence, female employees at this company average between $110 less and $10 more per week than the male employees.
The correct answer is: Based on these data, with 90% confidence, female employees at this company average between $110 less and $10 more per week than the male employees.

https://ccle.ucla.edu/mod/quiz/review.php?attempt=552447

Question 3
The dean in the college of education collected book costs of college students. An analysis was performed to determine if the average book cost of the graduate and undergraduate students differed. The results of the study are shown below:
Graduate students: Sample size 42; Mean $370; Standard Deviation $80.
Undergraduate students: Sample size 100; Mean $400; Standard Deviation $95.
Give the rejection region for the test, using α = 0.05.
Status: Incorrect, 0.00 points out of 1.00
Select one:
a. Reject Ho if t ≥ 1.656 or t ≤ -1.656
b. Reject Ho if z ≥ 1.96 or z ≤ -1.96
c. Reject Ho if t ≥ 1.977 or t ≤ -1.977
d.
Reject Ho if z ≥ 1.645 or z ≤ -1.645
The correct answer is: Reject Ho if t ≥ 1.977 or t ≤ -1.977

Question 4
Which of the following would make the sampling distribution of the sample mean narrower? Check all answers that apply.
Status: Correct, 1.00 points out of 1.00
Select one or more:
a. A larger sample size
b. A smaller population standard deviation
c. A larger population standard deviation
d. A larger standard error
e. A smaller sample size
The correct answer is: A larger sample size, A smaller population standard deviation

Question 5
Suppose that you want to estimate the percentage of UCLA undergraduates who belong to a gym. You ask a random sample of 200 students and 120 say that they belong to a gym. What is the 90% CI? z* = 1.645
Status: Correct, 1.00 points out of 1.00
Select one:
A. 58.04% to 61.96%
B. 54.30% to 65.70%
C. 56.5% to 63.5%
D. 51.23% to 68.76%
E. 55.52% to 64.47%
The correct answer is: 54.30% to 65.70%

Question 6
A survey was conducted to determine what percent of college seniors would choose to attend a different college if they knew what they know now. In a random sample of 100 seniors, 34% indicated that they would attend a different college. Determine the 90 percent confidence interval for the proportion of seniors who would attend a different college.
Status: Correct, 1.00 points out of 1.00
Select one:
a. 25.8% to 42.2%
b. 31.25% to 36.8%
c. 24.7% to 43.3%
d. 30.6% to 37.4%
e. 26.2% to 41.8%
The correct answer is: 26.2% to 41.8%

Question 7
A researcher is examining the impact of a new method of teaching Farsi. She concludes that the new method is significantly better than the old method, based on a test with alpha = 0.05.
Status: Correct, 1.00 points out of 1.00
Select one:
a. We need the standard error and the sample mean to decide if the new method is better.
b.
The probability of rejecting the true null is between 0.05 and 0.10.
c. She could have made the same decision at alpha = 0.01.
d. She could have made the same decision at alpha = 0.10.
The correct answer is: She could have made the same decision at alpha = 0.10.

Question 8
Which of the following statements best describes the effect on the Binomial Probability Model if the number of trials is held constant and p (the probability of "success") increases?
Status: Incorrect, 0.00 points out of 1.00
Select one:
a. The mean and standard deviation both decrease.
b. None of these statements are true.
c. The mean increases and the standard deviation decreases.
d. The mean decreases and the standard deviation increases.
e. The mean and the standard deviation both increase.
The correct answer is: None of these statements are true.

Question 9
Market researchers claim that 90% of all college students use an ATM at least once a week. If we take a random sample of 10 college students, what are the chances that 9 or more students use their ATM this week?
Status: Correct, 1.00 points out of 1.00
Select one:
a. Normal Probability Model can be used to approximate the Binomial Model with N(9, root(.9), X) and P(X >= 9).
b. Either the exact Binomial Model or the Normal Model to approximate the Binomial can be used.
c. Binomial Probability Model is appropriate with B(10, 0.9, X) and P(X = 9 or 10).
The correct answer is: Binomial Probability Model is appropriate with B(10, 0.9, X) and P(X = 9 or 10).
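The arithmetic behind Questions 5, 6, 8, and 9 can be checked in base R. This is only a sketch: `prop_ci` and `binom_sd` are hypothetical helper names, the interval uses the usual large-sample formula for a proportion with the z* = 1.645 given in Question 5, and the three values of p tried for Question 8 are arbitrary illustrations.

```r
# Hypothetical helper: large-sample confidence interval for a proportion,
# p_hat +/- z_star * sqrt(p_hat * (1 - p_hat) / n).
prop_ci <- function(p_hat, n, z_star) {
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
}

ci_q5 <- prop_ci(120 / 200, 200, 1.645)  # Question 5: 120 of 200 students
ci_q6 <- prop_ci(0.34, 100, 1.645)       # Question 6: 34% of 100 seniors

# Question 9: P(X = 9 or 10) for X ~ Binomial(n = 10, p = 0.9).
p_q9 <- dbinom(9, size = 10, prob = 0.9) + dbinom(10, size = 10, prob = 0.9)

# Question 8: with n fixed, the mean n * p rises with p, but the standard
# deviation sqrt(n * p * (1 - p)) rises only up to p = 0.5 and then falls.
binom_sd <- function(n, p) sqrt(n * p * (1 - p))
sd_by_p <- sapply(c(0.2, 0.5, 0.8), function(p) binom_sd(10, p))

print(round(100 * ci_q5, 2))
print(round(100 * ci_q6, 1))
print(round(p_q9, 4))
print(round(sd_by_p, 3))
```

Because the standard deviation peaks at p = 0.5, none of the blanket statements in Question 8 holds for every increase in p, which is consistent with the listed answer.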
Question 10
A certain school district in California has the following ethnic makeup:
30% Hispanic
20% African American
40% Anglo
10% Others
If you pick four students at random with replacement, what is the probability that one of them will be Hispanic and the other three will not be Hispanic (not Hispanic includes the other three backgrounds)?
Status: Correct, 1.00 points out of 1.00
Select one:
a. 2.4
b. 0.4116
c. 0.1029
d. 0.5884
The correct answer is: 0.4116

Question 11
Is smoking on the decline in the United States? A researcher believes that a lower proportion of Americans smoke cigarettes now than in the past. Historically, roughly 25% of Americans smoked a cigarette in any given week. A random sample of 1000 Americans found that 21% of the sample smoked. Which of the following would be the correct standard error for a z-test of whether smoking rates have changed?
Status: Incorrect, 0.00 points out of 1.00
Select one:
a. sqrt(.25*.75*1000)
b. sqrt(.21*.79*1000)
c. sqrt(.21*.79/1000)
d. sqrt(.25*.75/1000)
The correct answer is: sqrt(.25*.75/1000)

Question 12
An arts appreciation program is instituted in a large school district. 200 third-grade students are randomly assigned to attend the experimental art appreciation classes, while another 200 are assigned to receive the traditional art instruction. After the course is completed, a 95% confidence interval is computed for the difference in the mean scores on a Creativity Exam. The difference is computed as follows: mean of treatment group minus mean of control group. So a positive difference indicates that the treatment students scored higher, on average, than the control students. The test is designed so that high scores indicate high creativity. The 95% confidence interval was 1.3 to 5.7 points. Which of the following is the best conclusion for these results?
Status: Correct, 1.00 points out of 1.00
Select one:
a. On average, students in the arts instruction class had higher scores than the other students, and this difference is too large to plausibly be due to chance.
b. If we were to repeat this study, there is a 95% chance that the mean difference would be between 1.3 and 5.7 points.
c. On average, students in the arts instruction class had higher scores than the other students, but this difference is due to chance variation.
d. We cannot reject the hypothesis that the mean scores of both groups of students are the same.
The correct answer is: On average, students in the arts instruction class had higher scores than the other students, and this difference is too large to plausibly be due to chance.

Question 13
In a controlled laboratory environment, random samples of 10 adults and 10 children were tested by a psychologist to determine the room temperature that each person finds most comfortable. The data are summarized below.
Adults (sample 1): Mean 74.5 °F; Standard Deviation 2.1 °F
Children (sample 2): Mean 77.5 °F; Standard Deviation 1.6 °F
If the psychologist wished to test the hypothesis that children prefer a warmer room temperature than adults, which set of hypotheses would he use?
Status: Correct, 1.00 points out of 1.00
Select one:
a. Ho: (μ1 - μ2) = 0 vs. Ha: (μ1 - μ2) ≠ 0
b. Ho: (μ1 - μ2) = 0 vs. Ha: (μ1 - μ2) < 0
c. Ho: (μ1 - μ2) = 0 vs. Ha: (μ1 - μ2) > 0
d. Ho: (μ1 - μ2) = 3 vs. Ha: (μ1 - μ2) ≠ 0
The correct answer is: Ho: (μ1 - μ2) = 0 vs. Ha: (μ1 - μ2) < 0

Question 14
Cell Phones
Suppose that you read that 88% of all college students have working cell phones. If you take a random sample of 20 college students, what are the chances that at least 18 of the 20 will have working cell phones?
Status: Incorrect, 0.00 points out of 1.00
Select one:
a. 1 - 190*(0.88)^18 (0.12)^2 + 20*(0.88)^19(0.12) + (0.88)^20
b.
190*(0.88)^18 (0.12)^2 + 20*(0.88)^19(0.12) + (0.88)^20
c. 0.88*20
d. 20!/(18!*2!) (0.88)^18 (0.12)^2
e. None of these values.
The correct answer is: 190*(0.88)^18 (0.12)^2 + 20*(0.88)^19(0.12) + (0.88)^20

Question 15
The principal of Berkeley High School has read that, based on the national norm, a high school student will miss 15 days of class per school year. He is curious whether the number of days missed by the students in his school is different from the national norm. He takes a random sample of 25 students' attendance records, and finds that for this sample the average number of school days missed per year is 17 and the standard deviation is 8. What is the approximate p-value for a two-tailed test?
Status: Correct, 1.00 points out of 1.00
Select one:
a. The p-value is between 0.10 to 0.15.
b. The p-value is less than 0.15.
c. The p-value is between 0.20 to 0.30.
d. The p-value is less than 0.10
The correct answer is: The p-value is between 0.20 to 0.30.

Question 16
We read that the average weight for babies born in the United States is 7.5 pounds with a standard deviation of 0.25 pounds. We can assume that the distribution of birth weights is nearly normal. If we select one baby at random, what are the chances that the baby weighs less than 8 pounds?
Status: Correct, 1.00 points out of 1.00
Select one:
a. 0.9772
b. 0.0456
c. 0.0228
d. 0.9544
The correct answer is: 0.9772

Question 17
When a test of significance is performed for the null hypothesis Ho against the alternative Ha, the p-value is:
Status: Correct, 1.00 points out of 1.00
Select one:
a. The probability of o
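The listed answers to Questions 14 to 16 above can be checked with base R distribution functions. This is a sketch that uses only the numbers given in those questions; the variable names are illustrative.

```r
# Question 14: P(X >= 18) for X ~ Binomial(20, 0.88), written term by term
# and again with the built-in upper tail, P(X > 17).
p_q14_terms <- choose(20, 18) * 0.88^18 * 0.12^2 +
  choose(20, 19) * 0.88^19 * 0.12 +
  0.88^20
p_q14_pbinom <- pbinom(17, size = 20, prob = 0.88, lower.tail = FALSE)

# Question 15: one-sample t statistic and two-tailed p-value, df = n - 1 = 24.
t_q15 <- (17 - 15) / (8 / sqrt(25))
p_q15 <- 2 * pt(abs(t_q15), df = 24, lower.tail = FALSE)

# Question 16: P(weight < 8) under a Normal(mean = 7.5, sd = 0.25) model.
p_q16 <- pnorm(8, mean = 7.5, sd = 0.25)

print(c(q14_terms = p_q14_terms, q14_pbinom = p_q14_pbinom))
print(c(t_q15 = t_q15, p_q15 = p_q15))
print(round(p_q16, 4))
```

The coefficient 190 in Question 14 is `choose(20, 18)`, and 8 pounds in Question 16 sits two standard deviations above the mean.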
