S301 Exam 1 Study Guide
This 14-page study guide was uploaded by Lauren Detweiler on Thursday, February 12, 2015. It belongs to STAT-S301 (Business Statistics) at Indiana University, taught by Hannah Bolte in Spring 2015.
Date Created: 02/12/15
Prof. Hannah Bolte
Chapters 1-9, 12

Week 1 (Chapters 1, 2 & 3.1-3.2)

I. What Is Statistics?
   a. The art of learning from data
II. What Are Statistics?
   a. Numbers based on the collected data, from which we infer things about a population
III. Acceptable Variability in Model Predictions
   a. Models are NOT meant to make perfect predictions
   b. Point estimation vs. confidence intervals
      i. i.e., guessing a number vs. a range of numbers
   c. We often report a specific number predicted by a model, but models really provide us a range of expectations
IV. Data Table: a rectangular arrangement of data, as in a spreadsheet. Rows and columns carry specific meaning.
   a. The bad (see lecture slides for examples)
   b. The good (see lecture slides for examples)
   c. Best practices
V. Vocabulary
   a. Observation: name given to the rows in a data table
   b. Case: synonymous with observation; also a name given to the rows in a data table
   c. Variable: a column in a data table that holds a common attribute of the cases
   d. Aggregate: to reduce the number of rows in a data table by counting or summing values within categories
VI. Categorical vs. Numerical Variables
   a. Categorical variable: column of values in a data table that identifies cases with a common attribute
   b. Numerical variable: column of values in a data table that records numerical properties of cases (also called continuous variable)
VII. Cross-Sectional vs. Time Series Data
   a. Cross-sectional: data that measure attributes of different objects observed at the same time
      i. Example: retail sales at Walmart stores around the United States in March 2011
   b. Time series: a sequence of data recorded over time
      i. Examples: stock prices, dollars in sales, employment, other economic and social variables
      ii. Can be categorical or numerical
VIII. Frequency Table
   a. A tabular summary that shows the distribution of a variable
   b. Each row lists a category along with the number of cases in this category
   c. Example from textbook, pg. 26

      TABLE 3.2 Frequency table of hosts

      Host                   Frequency   Proportion
      typed "amazon.com"        89,919      0.47577
      [illegible]                7,258      0.03840
      yahoo.com                  6,078      0.03216
      google.com                 4,381      0.02318
      recipesource.com           4,283      0.02266
      [illegible]                1,639      0.00867
      iwon.com                   1,573      0.00832
      atwola.com                 1,289      0.00682
      bmezine.com                1,285      0.00680
      dailyblessings.com         1,166      0.00617
      imdb.com                     886      0.00469
      couponmountain.com           813      0.00430
      earthlink.net                790      0.00418
      [illegible]                  589      0.00312
      overture.com                 586      0.00310
      dotcomscoop.com              577      0.00305
      netscape.com                 544      0.00288
      dealtime.com                 543      0.00287
      [illegible]                  533      0.00282
      postcards.org                532      0.00281
      24hour-mall.com              503      0.00266
      Other                     63,229      0.33455
      Total                    188,996      1.00

IX. Charts
   a. Know how to tell a good chart from a bad one
   b. Common types
      i. Bar chart: a display using horizontal or vertical bars to show the distribution of a categorical variable
         1. Pareto chart: a bar chart with categories sorted by frequency
      ii. Pie chart: a display that uses wedges of a circle to show the distribution of a categorical variable

Week 2 (Chapters 3.3-3.4 & 4)

I. Area Principle: the area of a plot that shows data should be proportional to the amount of data
   a. Presenting relative sizes accurately
   b. Violations of the area principle include
      i. Decorations that sacrifice accuracy
      ii. Baseline of the chart is not at zero
II. Best Practices
   a. Use a bar chart to show the frequencies of a categorical variable
   b. Use a pie chart to show the proportions of a categorical variable
   c. Keep the baseline of a bar chart at zero
   d. Preserve the ordering of an ordinal variable
   e. Respect the area principle
   f. Show the best plots to answer the motivating question
   g. Label your chart to show the categories and indicate whether some have been combined or omitted
III. Measures of Central Tendency
   a. Mean
      i. The average, found by dividing the sum of the values by the number of values. Shown as a symbol with a line over it, as in $\bar{y}$
      ii. Calculated by adding up the data and dividing by n, the number of values
   b. Median
      i. The median of an ordinal variable is the label of the category of the middle observation when you sort the values
      ii. Is not available unless the data can be put into order
      iii. The 50th percentile
      iv. If there is an even number of cases (n is even), the median is the average of the two values in the middle
   c. Mode
      i. The mode of a categorical variable is the most common category
IV. Measures of Dispersion
   a. Variance
      i. Sample variance
         1. Standard deviation (sample) is the square root of the sample variance
      ii. Population variance
         1. Standard deviation (population) is the square root of the population variance
      iii. To find the variance: find the mean, find the deviations, square each one, sum them, then divide by n for the population variance or by n - 1 for the sample variance
      iv. Identified by the symbol $s^2$
   b. Standard Deviation
      i. A measure of variability found by taking the square root of the variance
      ii. Abbreviated as SD in text and identified by the symbol $s$ in formulas
   c. Skewness
      i. Mean higher than median for right skew
      ii. Mean lower than median for left skew
      iii. Right/left is based on which side the tail is on
V. IQR
   a. Distance between the 25th and 75th percentiles (calculate the difference between the 75th and 25th)
   b. A natural summary of the amount of variation to accompany the median
VI. Boxplots
   a. A graphic consisting of a box, whiskers, and points that summarize the distribution of a numerical variable using the median and quartiles
   b. Shows the five-number summary (the minimum, lower quartile, median, upper quartile, and maximum of a variable) in a graph. Also lets us determine outliers.
   c. Steps
      i. Sort data from low to high
      ii. Calculate the first quartile (Q1), the median (Q2), and the third quartile (Q3)
      iii. Draw a box with its left edge at Q1 and its right edge at Q3
      iv. Draw a vertical line through the box at the median
      v. Compute the following limits
         1. Lower limit: $Q_1 - 1.5(Q_3 - Q_1)$
         2. Upper limit: $Q_3 + 1.5(Q_3 - Q_1)$
      vi. Draw a line from Q1 to the lowest data value above the lower limit. Draw a line from Q3 to the highest data value below the upper limit.
   d. Any points outside the limits are considered outliers and should be plotted as individual points
VII. Distribution Shapes
   a. Normal
      i. Bell-shaped
      ii. Symmetrical and unimodal
   b. Uniform
      i. A flat histogram with bars of roughly equal height
      [Figure: example of a flat (uniform) histogram]
   c. Unimodal vs. Bimodal
      i. One distinct peak vs. two distinct peaks
VIII. Standardization
   a. Standard normal curve
      i. Represents the normal curve with a mean of zero and a standard deviation of one
      [Figure: the standard normal distribution; vertical axis: probability density]
   b. z-score
      i. The distance from the mean, counted as a number of standard deviations
      ii. Converting data to z-scores is known as standardizing the data
      iii. For a population, the z-scores are $z = \frac{x - \mu}{\sigma}$
      iv. For a sample, the z-scores are $z = \frac{x - \bar{x}}{s}$
   c. Empirical Rule
      i. 68% of data within 1 SD of the mean, 95% within 2 SDs, and almost all within 3 SDs

         Interval                               Percentage of Data
         $\bar{y} - s$ to $\bar{y} + s$         68%
         $\bar{y} - 2s$ to $\bar{y} + 2s$       95%
         $\bar{y} - 3s$ to $\bar{y} + 3s$       99.7%

Week 3 (Chapters 5.1 & 6)

I. Contingency Tables
   a. A table that shows counts of the cases of one categorical variable contingent on the value of another
   b. Cells are mutually exclusive
   c. Example: Total Televisions Sold by Region and Store [table not reproduced in source]
   d. Marginal distribution
      i. The frequency distribution of a variable in a contingency table, given by counts of the total number of cases in rows or columns
   e. Conditional distribution
      i. The distribution of a variable restricted to cases that satisfy a condition, such as cases in a row or column of a table
   f. Best practices
      i. Use contingency tables to find and summarize association between categorical variables
      ii. Be on the lookout for lurking variables
      iii. Use plots to show association
      iv. Exploit the absence of association
   g. Pitfalls
      i. Don't interpret association as causation
      ii. Don't display too many numbers in a table
II. Vocabulary
   a. Scatterplot
      i. A graph that displays pairs of values as points on a 2D grid
   b. Response variable
      i. Placed on the y-axis in scatterplots
      ii. The variable that has the variation we want to understand, explain, or predict
   c. Explanatory variable
      i. Placed on the x-axis in scatterplots
      ii. The variable we use to explain variation in the response
III. Association in Scatterplots
   a. Association: the value of the x-coordinate tells us about the value of the y-coordinate (i.e., they are related)
   b. Visual test for association
      i. A method for identifying a pattern in a plot of numerical variables. Compare the original scatterplot to artificial plots in which the variables are unrelated.
   c. Describing the association
      i. Once you decide the scatterplot shows association, you need to describe the association
         1. Direction: does the pattern trend up, down, or both?
            a. Positive: points concentrate in the lower left and upper right. As the explanatory variable increases, so does the response.
            b. Negative: pattern running the other way. As x increases, y tends to decrease.
         2. Curvature: does the pattern appear to be linear, or does it curve?
            a. Linear: patterns have consistent direction
            b. Curved: direction changes
         3. Variation: are the points tightly clustered along the pattern?
            a. Strong association means little variation around the trend
         4. Outliers and surprises: did you find something unexpected?
            a. An outlying point is almost always interesting and deserves special attention
IV. Measuring Association
   a. Covariance
      i. Quantifies the strength of association between numeric variables
      ii. Spreadsheet formula: COVARIANCE.S(DataColumn1, DataColumn2)
   b. Correlation
      i. Equal to the covariance divided by the product of the standard deviations; often denoted as r
      ii. Normalized: no matter what, $-1 \le r \le 1$
      iii. Spreadsheet formula: CORREL(DataColumn1, DataColumn2)
      iv. $r = \mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$
V. Summarizing Associations With a Line
   a. Slope-intercept form
   b. In high school algebra, we use $y = mx + b$
   c. In statistics, we use $\hat{y} = a + bx$, with $a = \bar{y} - b\bar{x}$ and $b = r \frac{s_y}{s_x}$
   d. a is the y-intercept
   e. b is the slope
   f. NOTE that [symbols illegible in source]
VI. Know Your Symbols
   a. We have several versions of most variables
      i. Memorize what the symbols indicate and S301 will be a lot less confusing
   b. $\bar{x}$ or $\bar{y}$
      i. The variable's AVERAGE is denoted by drawing a line, or bar, over it
      ii. We say "x-bar" and "y-bar"
   c. $\hat{y}$
      i. The variable's EXPECTATION is denoted by giving it a hat
      ii. We say "y-hat"
   d. $x$ or $y$ or $x_i$ or $y_i$
      i. The variable's actual observed value is denoted by letting it go topperless
      ii. Sometimes it has a subscript i to indicate it's an INDIVIDUAL observation (but so can expectations wearing hats)
      iii. We say "y" or "y sub i"
VII. Standardizing the Regression Equation
   a. Starts as $y = mx + b$
   b. Rewritten as $\hat{z}_y = r z_x$
VIII. Spurious Correlation
   a. Correlation between variables due to the effects of a lurking variable
   b. Example
      i. A scatterplot of the damage ($) caused to homes by fire would show a strong correlation with the number of firefighters who tried to put out the blaze
         1. Does this mean firefighters cause damage? No. A lurking variable (the size of the blaze) explains the superficial association.
IX. Correlation Matrix
   a. A table showing all of the correlations among a set of numerical variables
   b. Note: diagonal values are equal to 1
      i. Correlation of a variable to itself
   c. Example from textbook, page 123

      TABLE 6.2 Correlation matrix of the characteristics of large companies

                      Assets   Sales   Market Value   Profits   Cash Flow   Employees
      Assets           1.000   0.746          0.682     0.602       0.641       0.594
      Sales            0.746   1.000          0.879     0.814       0.855       0.924
      Market Value     0.682   0.879          1.000     0.968       0.970       0.818
      Profits          0.602   0.814          0.968     1.000       0.989       0.762
      Cash Flow        0.641   0.855          0.970     0.989       1.000       0.787
      Employees        0.594   0.924          0.818     0.762       0.787       1.000

X. Best Practices (Correlation Matrix)
   a. To understand the relationship between two numerical variables, start with a scatterplot
   b. Look at the plot, look at the plot, look at the plot
   c. Use clear labels for the scatterplot
   d. Describe a relationship completely
   e. Consider the possibility of lurking variables
   f. Use a correlation to quantify the association between two numerical variables that are linearly related
XI. Pitfalls (Correlation Matrix)
   a. Don't use the correlation if the data are categorical
   b. Don't treat association and correlation as causation
   c. Don't assume that a correlation of zero means that the variables are not associated
   d. Don't assume that a correlation near -1 or 1 means near-perfect
      association

Week 4 (Chapters 7 & 8.1-8.2)

I. Law of Large Numbers (LLN)
   a. The relative frequency of an outcome converges to a number, the probability of the outcome, as the number of observed outcomes increases
   b. In other words, as you increase the number of observations, the observed proportion gets closer to the predicted probability
      i. i.e., if you keep tossing a coin, the proportion of tosses that are heads will eventually be close to 1/2
      ii. Often misunderstood because we forget that it only applies in the long run
II. Essential Rules
   a. Rule 1: The probability of an outcome in the sample space is 1
      i. $P(S) = 1$
      ii. Remember: the sample space is the collection of all possible outcomes
   b. Rule 2: For any event A, the probability of A is between 0 and 1
      i. $0 \le P(A) \le 1$
      ii. Makes sense because the probability of an event can never be negative and cannot exceed the probability of the sample space (1)
   c. Rule 3: The probability of a union of disjoint events is the sum of the probabilities. If A and B are disjoint events, then $P(A \text{ or } B) = P(A) + P(B)$
   d. Additional rules
      i. Rule 4 (Complement rule): The probability of an event is one minus the probability of its complement. $P(A) = 1 - P(A^c)$
      ii. Rule 5 (Addition rule): For two events A and B, the probability that one or the other occurs is the sum of the probabilities minus the probability of their intersection. $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
III. Vocabulary
   a. Disjoint events: events that have no outcomes in common (mutually exclusive events)
   b. Mutually exclusive events: events are mutually exclusive if no element of one is in the other
   c. Independent events: events that do not influence each other; the probabilities of independent events multiply
      i. Multiplication Rule for Independent Events: $P(A \text{ and } B) = P(A) \times P(B)$
IV. Boole's Inequality
   a. $P(A_1 \text{ or } A_2 \text{ or } \dots \text{ or } A_k) \le p_1 + p_2 + \dots + p_k$
   b. Probability of a union $\le$ sum of the probabilities of the events
   c. Boole's Inequality is most useful when the events have small probabilities
   d. Best practices
      i. Make sure your sample space includes all the possible outcomes
      ii. Include all of the pieces when describing an event
      iii. Check that the probabilities assigned to all possible outcomes add up to 1
      iv. Only add probabilities of disjoint events
      v. Be clear about independence when you use probabilities
      vi. Only multiply probabilities of independent events
V. Table 8.1 From the Textbook, page 176
   a. Focuses on household income and education in the US
      [TABLE 8.1 Household income and education in the United States. Cells show percentages. Rows are education levels (including HS diploma, some college, associate's degree, bachelor's degree, master's degree, and doctoral degree) plus a column total; columns are household income brackets in 2009 plus a row total. Cell values are not recoverable from the source.]
   b. Spend some time thinking about how to interpret this
VI. Probability
   a. Joint probability: probability of an outcome with 2 or more attributes, as found in the cells of a table; probability of an intersection
   b. Marginal probability: probability that takes account of one attribute of the event; found in the margins of a table
   c. Conditional probability: the conditional probability of A given B is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
VII. Multiplication Rule for Dependent Events
   a. The joint probability of two events A and B is the product of the marginal probability of one times the conditional probability of the other: $P(A \cap B) = P(A \mid B) \times P(B)$
   b. Best practices
      i. Think conditionally
      ii. Presume events are dependent and use the multiplication rule
      iii. Use tables to organize probabilities
      iv. Use probability trees for sequences of conditional probabilities
      v. Check that you have included all the events
      vi. Use Bayes' Rule to reverse the order of conditioning

Week 5 (Chapters 9 & 12)

I. Notation
   a. $X$: random variable
   b. $\sigma$ (sigma): standard deviation of a random variable; $\sigma^2$ is the variance
   c. $\mu$ (mu): the mean
      of a random variable. The weighted sum of possible values, with the probabilities as weights.
   d. $E(X)$: expected value of X. Equal to $\mu$. A weighted average that uses probabilities to weight the possible outcomes.
   e. $P(X = x)$: the probability distribution of a random variable
II. Random Variable (RV)
   a. The uncertain outcome of a random process
   b. Discrete random variable
      i. X is a discrete random variable when we can list all the outcomes
      ii. Takes on one of a list of possible values (typically counts)
   c. Continuous random variable
      i. X is a continuous random variable when it can take on any value within an interval
      ii. Shows how probability is spread over an interval rather than assigned to specific values
III. Calculating Parameters for Discrete Random Variables
   a. Calculating $\mu$:
      $\mu = \sum_j x_j P(X = x_j) = x_1 P(X = x_1) + x_2 P(X = x_2) + x_3 P(X = x_3) + x_4 P(X = x_4) + \dots$
   b. Calculating $\sigma^2$:
      $\sigma^2 = \sum_j (x_j - \mu)^2 P(X = x_j) = (x_1 - \mu)^2 P(X = x_1) + (x_2 - \mu)^2 P(X = x_2) + (x_3 - \mu)^2 P(X = x_3) + (x_4 - \mu)^2 P(X = x_4) + \dots$
IV. Graphing Discrete RVs
   a. Example [figure not reproduced in source]
V. Expected Value
   a. In this class, the expected value is always the mean: $E(X) = \mu$
   b. Adding/subtracting a constant from an RV
      i. Shifts every possible value of the RV, changing the expected value by the constant
      ii. $E(X \pm c) = E(X) \pm E(c) = E(X) \pm c$
      iii. A shift has no effect on the variance or standard deviation of a random variable
   c. Multiplying a constant with an RV
      i. Changes the mean by a factor of c and the standard deviation by a factor of |c|: $E(cX) = cE(X)$, $SD(cX) = |c| \, SD(X)$
      ii. Changes the variance by a factor of $c^2$: $Var(cX) = c^2 Var(X)$
   d. Rules for expected values
      i. If a and b are constants and X is a random variable, then
         $E(a + bX) = a + bE(X)$
         $SD(a + bX) = |b| \, SD(X)$
         $Var(a + bX) = b^2 Var(X)$
VI. Best Practices
   a. Use random variables to represent uncertain outcomes
   b. Draw the random variable
   c. Recognize that random variables represent models
   d. Keep track of the units of the random variables
VII. Normal Continuous Random Variables
   a. A random variable whose probability distribution defines a standard bell-shaped curve
   b. Continuous RV: an RV that can conceptually assume any value in an interval (i.e., on a bell-shaped curve)
VIII. Central Limit Theorem (CLT)
   a. The probability distribution of a sum of independent random variables of comparable variance tends to a normal distribution as the number of summed random variables increases
   b. Explains why bell-shaped distributions are so common
   c. Assures us that our assumption of normality when considering a sample mean is well founded, even if we cannot be entirely sure of the underlying distribution
   d. Shows that extreme observations (outliers) have less effect as they are averaged in with more typical observations
IX. Identifying $P(Z \le z)$ by Shading the Bell Curve
   a. The first step toward determining the probability of observing a certain data point or sample mean is drawing a bell curve, marking your mean and $\pm 3$ SDs, and SHADING the area of interest
   b. Example from lecture slides: What is $P(0.5 \le Z \le 1)$? [the sign of the lower bound is unclear in the source]
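
The shaded area under the standard normal curve between two z-values can be checked numerically. The following is a minimal Python sketch, not part of the course materials: the function names `standard_normal_cdf` and `prob_between` are made up for illustration, and the intervals are illustrative values. It uses the identity $\Phi(z) = \frac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$ for the standard normal CDF.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_between(lower, upper):
    """P(lower <= Z <= upper): the shaded area between two z-values."""
    return standard_normal_cdf(upper) - standard_normal_cdf(lower)


if __name__ == "__main__":
    # Illustrative intervals (hypothetical values, chosen for this sketch).
    print(round(prob_between(0.5, 1.0), 4))   # about 0.1499
    print(round(prob_between(-0.5, 1.0), 4))  # about 0.5328
    # Empirical Rule check: area within 1 SD of the mean.
    print(round(prob_between(-1.0, 1.0), 4))  # about 0.6827
```

Subtracting two CDF values mirrors the shading exercise: shade everything to the left of the upper bound, then remove everything to the left of the lower bound.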