# Comm106, Week 4 notes

Stanford

This 5 page Class Notes was uploaded by Erica Evans on Saturday January 30, 2016. The Class Notes belongs to Comm106 at Stanford University taught by Jennifer Pan in Fall 2016.

Date Created: 01/30/16

## Comm106 Class 8 (1/29/2016)

### Announcements

- Midterm next Friday
- Midterm review sections next Thursday and Friday

### Getting Representative Data

- **Population:** the universe of cases we want to describe
  - Ex: What is the average income of adults in the US? The population is every single adult in the US.
  - Ex: How will America vote in the 2016 presidential election? The population is all eligible voters in 2016, or all those intending to vote. Could be either.
- **Population parameter:** the characteristic of the population we care about
  - Ex: What is the average income of adults in the US? The population parameter is average income.
- **Census:** when we have every single case documented. But this is really expensive.
- So we take a **sample** instead. It is a subset of the population of all the cases.
- **Sample statistic:** the word used to denote the population parameter in a sample case. It is what we use to estimate the population parameter, because we do not know it for certain since we have no census.
- **Inference:** saying something about the population from the sample
  - Inference is part of what makes research scientific: we infer something about the world beyond what we observe.
  - If you're doing a study of a certain class and what you find only applies to that class, then that is not scientific. It is only scientific if we can infer something from it.
  - Scientific research is never certain because we can never measure the entire population, and so we must estimate from a sample.

### How do we get the sample right?

How uncertain will our estimate of the population parameter be? Depends on:

- How the sample was chosen → must be random
- How large the sample is → the larger, the more accurate
- The population itself and its variation

**Random sampling:** every member of the population must have an equal chance of being chosen.

- If you are selecting 100 students from Stanford undergrads, everyone must have a 100/6,999 chance of being selected.
- This sample is representative of the population.
- If you repeat this over and over, you will not get the exact same sample, but on average it will be the same.
- Random sampling tries to eliminate the biased differences. You can't let people self-select.
- Example: Literary Digest predicted that Landon would beat FDR, but this wasn't true → they got phone numbers from phone books, auto clubs, and magazine subscribers. People who are semiliterate, rural, or poor would not be in this sample, so the results were skewed.

Boxplot options in R:

- Splitting by a grouping variable divides a boxplot into two or more categories
- `names = c(...)` → names the sets of your boxplot
- `log(population)` → makes it easier to look at when there are a lot of outliers
- `ylim` → sets the y-axis range. Ex: `ylim = c(0, 2000)`

Two ways a sample can differ from the population:

- Bias: if you have the wrong sampling frame. We can test for this in R by comparing sampled and non-sampled data.
- **Random sampling error:** this is not your fault. This is just when your sample data differs from the actual data by a random error.
  - population parameter = sample statistic + random sampling error
  - The more variation in the population, the more random sampling error there will be.

## Comm106 Class 7 (1/27/2016)

### Announcements

- Midterm next Friday will be multiple-choice
- The test will cover content up until this Friday
- There will be questions about R
- Section next week: sections Thursday and Friday will be review sections
- No section Monday, Feb 8

### Descriptive statistics

- This week we will be doing some R exercises related to descriptive statistics in section.
- Data sets are really unwieldy and hard to make sense of. We need to summarize the information → descriptive statistics is the name for ways to summarize the information.
- We make sure specific examples or anecdotes are not outliers in a data set.
- We talked about different types of variables: the type of descriptive statistic depends on the type of variable (nominal, ordinal, or interval).

**Frequency table:** used with nominal variables, like categories that we cannot rank.

- This tells us how many among the data are in each category.
- Like a table that shows careers after graduation and the number of students that went into journalism, law, business, politics, etc., in no specific order.
- You can also look at percentages with frequency tables. Percentages can be computed so that they take out the people who didn't respond.
- You can also put the numbers into a bar chart or pie chart. But pie charts can be misleading because it is hard to make comparisons.

### Measures of central tendency

What is a typical or average value?

**Mode:** most frequent value.

- In the careers example, the mode is law because the most people went into law.
- The mode is the only valuable measure for nominal variables because you can't rank them.
- Two categories can tie for most frequent, for example with political party.

**Median:** can be used for ordinal variables.

- Value in the middle when we rank all values of the variable: half are above, half are below.
- When we have an odd number of observations, it is the single observation in the middle: ranked observation number (n + 1)/2.
- When we have an even number of observations, it is the midpoint between the two middle observations: ranked observations n/2 and (n + 2)/2.

**Quantiles** divide observations into groups. The median is a quantile that divides observations in half.

- A **quartile** divides observations into four groups. The first quartile is the median of the lower half.
- A **percentile** divides observations into 100 groups: p% of observations are below, (100 − p)% are above.
- That means the 25th percentile is the first quartile, the 50th percentile is the median, and the 75th percentile is the third quartile.
- Ex: 1, 2, 3, 4, 5, 6, 7, 7, 9, 9
  - Median: 5.5
  - First quartile is 3
  - Second quartile is 5.5
  - 75th percentile is 7

**Mean:** can be used for continuous variables.

- Mean is used interchangeably with average.
- The mean is the sum of all observations divided by the total number of observations.
- The median is robust (it is insensitive to outliers), but the mean is affected by outliers and so can be misrepresentative.

**Histogram:** each bar represents the values in a certain interval, or bin. The taller the bar, the more values in that bin.

### Measuring Spread

Range, interquartile range, outlier, standard deviation, variance.

- **Range:** the difference between the largest and smallest numbers in the sample.
- **Interquartile range (IQR):** the difference between the third and first quartiles, so basically the middle section. Helps remove outliers.
- **Outlier:** we say an observation is an outlier if it falls more than 1.5 × IQR above the 3rd quartile or more than 1.5 × IQR below the 1st quartile.
- **Boxplots**, or box and whisker plots: show the median and the 1st and 3rd quartiles within the box. Everything beyond the whiskers is an outlier.

**Standard deviation:** on average, how far away are data points from the sample mean? (LOOK UP THE FORMULA)

- Data: 1, 3, 5, 6, 10. Mean is 5.
- (1 − 5)² + (3 − 5)² + (5 − 5)² + (6 − 5)² + (10 − 5)² = 46
- Divide by sample size − 1 → 46/(5 − 1) = 11.5 (variance)
- √11.5 = 3.39. Standard deviation = 3.39.
- A high standard deviation implies there is more variation around the mean.
- The sample variance is in squared units.

### R functions (look these up)

- Frequency table → `table()`
- Bar chart → `barplot()`
- Histogram → `hist()`
- Median → `median()`
- Quantiles → `summary()`
- Range → `range()`
- Interquartile range → `IQR()`
- Boxplot → `boxplot()`
- Standard deviation → `sd()`
- Variance → `var()`
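The standard deviation arithmetic from the Class 7 notes can be checked in R. This is a minimal sketch that uses the five data points from the notes; the variable names (`x_bar`, `sum_sq`, and so on) are made up for illustration.

```r
# Data from the standard deviation example in the notes
x <- c(1, 3, 5, 6, 10)

x_bar <- mean(x)                              # sample mean: 5
sq_dev <- (x - x_bar)^2                       # squared deviations: 16 4 0 1 25
sum_sq <- sum(sq_dev)                         # 46
variance_by_hand <- sum_sq / (length(x) - 1)  # 46 / 4 = 11.5
sd_by_hand <- sqrt(variance_by_hand)          # about 3.39

# The built-in functions use the same n - 1 denominator
var(x)
sd(x)
```

Because `var()` and `sd()` divide by n − 1, they should agree with the hand calculation.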
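The median, quartile, and outlier rules from the Class 7 notes can also be sketched in R on the ten-value example (1, 2, 3, 4, 5, 6, 7, 7, 9, 9). The sketch assumes the "median of each half" method described in the notes; the helper names (`lower_half`, `iqr_notes`, `lower_fence`, and so on) are hypothetical.

```r
# Ten-value example from the notes (already sorted)
x <- c(1, 2, 3, 4, 5, 6, 7, 7, 9, 9)
n <- length(x)

med <- median(x)                 # midpoint of the 5th and 6th ranked values
lower_half <- x[1:(n / 2)]       # 1 2 3 4 5
upper_half <- x[(n / 2 + 1):n]   # 6 7 7 9 9
q1 <- median(lower_half)         # first quartile by the notes' method
q3 <- median(upper_half)         # third quartile by the notes' method
iqr_notes <- q3 - q1             # interquartile range

# Outlier rule: more than 1.5 x IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr_notes
upper_fence <- q3 + 1.5 * iqr_notes
outliers <- x[x < lower_fence | x > upper_fence]

fivenum(x)   # min, lower hinge, median, upper hinge, max
quantile(x)  # interpolates by default, so quartiles can differ from the hand method
```

One thing to watch: `quantile()`, `summary()`, and `IQR()` interpolate between observations by default, so their quartiles will not always match the values worked out by hand, while `fivenum()` uses the half-based hinges.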
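The Class 8 ideas about random sampling and random sampling error can be illustrated with a small simulation. This is only a sketch: the population is made-up data (6,999 simulated "income" values, not real Stanford or US figures), and the seed, distribution, and variable names are assumptions chosen for illustration.

```r
set.seed(106)  # arbitrary seed so the run is repeatable

# Hypothetical population: 6,999 simulated "income" values (not real data)
population <- rlnorm(6999, meanlog = 10, sdlog = 1)
parameter <- mean(population)  # population parameter, known only because we simulated it

# One simple random sample of 100: each member has a 100/6999 chance of selection
one_sample <- sample(population, size = 100)
statistic <- mean(one_sample)

# population parameter = sample statistic + random sampling error
sampling_error <- parameter - statistic

# Repeat the sampling many times for two sample sizes
means_n100  <- replicate(1000, mean(sample(population, size = 100)))
means_n1000 <- replicate(1000, mean(sample(population, size = 1000)))

sd(means_n100)                # spread of the estimates with n = 100
sd(means_n1000)               # smaller spread with n = 1000
mean(means_n100) - parameter  # individual samples differ, but on average they land near the parameter
```

Each repeated sample gives a different estimate, but the estimates cluster around the population parameter, and the cluster tightens as the sample size grows.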

