Statistics 401, Midterm I
This 5 page Study Guide was uploaded by Wendy Liu on Friday September 30, 2016. The Study Guide belongs to 01:960:401 at Rutgers University taught by Hei-ki Dong in Fall 2016. Since its upload, it has received 412 views. For similar materials see Basic Statistics for Research in Statistics at Rutgers University.
Date Created: 09/30/16
Midterm I: Study Guide
4 October 2016
Basic Statistics for Research
Professor HK Dong
Wendy Liu

Common Notations
• ∑ – summation: addition of a set of values
• x – variable representing individual data values
• $\bar{x}$ (x bar) – mean of all x values
• Standard notation for sample vs. population:

  Variable                   Sample       Population
  Mean                       $\bar{x}$    $\mu$
  Standard deviation         $s$          $\sigma$
  Variance                   $s^2$        $\sigma^2$
  Data set size              $n$          $N$
  Correlation coefficient    $r$          $\rho$

• Unit/subject – single entity from which you collect data
  o Ex: a person
• Population of units/subjects – complete collection of units from which you collect data
  o Ex: American citizens
• Population – large set of all potential measurements corresponding to the population of units
  o Ex: ages of all American citizens
• Sample – subset of measurements that are actually collected during the investigation
  o Ex: ages of 40 American citizens
• Statement of purpose – the reason for collecting data; must be specific and unambiguous
• Statistics – collecting, summarizing, and interpreting data, then drawing conclusions

Objectives of Statistics
1. Make inferences about a population by analyzing info from a sample
   a. Includes assessing the extent of uncertainty involved in these inferences
2. Design the process and extent of sampling so that the sample is representative of the population, and thus inferences are valid

• Inferential statistics – evaluate info present in data, assess the new learning gained from this info
• Descriptive statistics – summarize and describe prominent features of data
• 3 S's of descriptive statistics: shape, center, spread

Shape of the distribution
• Normal – bell shaped, symmetric about the center
  o Area under entire curve = 1 = 100%
  o Mean = median = mode (or very close)
• Skewed left/negatively skewed – long tail on left
  o Few data points to the left (more negative) of the majority
• Skewed right/positively skewed – long tail on right
  o Few data points to the right (more positive) of the majority
• Uniform – nearly equal frequency of all values
  o Flat-topped
• Bimodal – distribution with two peaks, often arising when two different groups are mixed
  o Looks like two normal distributions merged together

Spread (variation) – how far apart the data values are from each other
  o Range – difference between largest and smallest observations
    ▪ Range = max - min
  o Deviation from the mean: $x_i - \bar{x}$
    ▪ Measure of variation for one data point (not the entire data set)
    ▪ Total deviation for any data set: $\sum (x_i - \bar{x}) = 0$
    ▪ Positive and negative deviations of different data points always cancel out
    ▪ Average deviation from the mean: $\frac{1}{n}\sum (x_i - \bar{x}) = 0$
  o Interquartile range (IQR) – spread of the middle 50% of data
    ▪ IQR = Q3 - Q1
  o Standard deviation – typical distance of the values in a distribution from their mean
    ▪ Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
    ▪ Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
  o Variance – standard deviation squared
    ▪ Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
    ▪ Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
  o Bessel's correction for standard deviation and variance: use n-1 for samples, for the n-1 degrees of freedom
    ▪ Samples generally won't have as many outliers as the population (if any at all)
    ▪ Dividing by a smaller value (n-1) results in a larger st. dev., which tends to be closer to the true population st. dev.

Central tendency – where the data is clustered
  o Mean – average of the set
  o Median – Q2, middle value of the ordered measurements
    ▪ Positioning point: $\frac{n + 1}{2}$ = position of the value corresponding to the median
    ▪ Positions ending in .5 occur for even data sets (sets with even n)
    ▪ Take the average of the data values at positions x.5 ± 0.5
  o Mode – most frequent data measurement

Types of Data
1. Qualitative – classified in categories, not numerically measured
2. Quantitative/numerical/measurement – variables measured with numbers
   • Discrete – gaps between neighboring distinct values
   • Continuous – no gaps between neighboring values

Organizing quantitative data
• Ordered array: list all data smallest to largest or largest to smallest
• Visual representations:
  o Frequency distributions + cumulative distributions
    ▪ Histogram – like a bar graph, but for quantitative data (number line on x-axis)
    ▪ Polygon
    ▪ Ogive – like a polygon, but cumulative
  o Stem + leaf plot
    ▪ For smaller sets of data
  o Dot plot – dots on top of a number line representing each data point

Frequency distributions for continuous variables
• Class intervals – cover ranges of equal length without overlapping
• Class boundaries – endpoints of intervals
• Class frequency – number of observations belonging to each class interval
• Relative frequency – proportion (or percentage) of the total observations that fall in each class

5 number summary – forms box & whisker plot
• Min – smallest value in data set
• Q1 – 25th percentile; 25% of data is below it
  o aka the median of the lower half (min to Q2)
• Q2 – aka the median; splits the data in half at the 50% mark
• Q3 – 75th percentile; 75% of data is below it
  o aka the median of the upper half (Q2 to max)
• Max – largest value in data set
• Outlier data – marked as an asterisk (*) on box & whisker plot
  o Q3 + 1.5·IQR = upper limit: data points above the upper limit are outliers
  o Q1 - 1.5·IQR = lower limit: data points below the lower limit are outliers

z-score – a data point's distance away from the mean, measured in units of standard deviation
• allows for comparison across data sets
• z-score of the mean: z = 0
• sample: $z = \frac{x - \bar{x}}{s}$
• population: $z = \frac{x - \mu}{\sigma}$
• Empirical Rule – for normal distributions
  o 68% of the data will be within 1 standard deviation of the mean
  o 95% of the data will be within 2 standard deviations of the mean
  o 99.7% of the data will be within 3 standard deviations of the mean
  o Data points more than 2 st. devs. away from the mean are often treated as outliers
• Chebyshev's Rule – any type of distribution
  o No useful info for z = ±1
  o At least 75% of data will be within z = ±2
  o At least 89% of data will be within z = ±3

Bivariate/multivariate data – observations on two or more variables
• Marginal totals – total frequency of any row or column of a data table, given in the right-hand margin or bottom margin
• Simpson's paradox – reversal of the conclusions drawn from several data tables once they are combined into one, caused by an unreported (lurking) variable

Experimentation
• Predictor/input/independent variable – denoted by x
• Response/output/dependent variable – denoted by y
• Random assignment – subjects placed randomly into control/experimental groups
• Placebo effect – subject's expectation that a treatment will work causes positive results, even though the treatment itself has no therapeutic value
  o Placebo – treatment that has no physiological effect; usually a sugar pill, for drug testing
• Double-blind procedure – experimenters don't know which subjects are in which group, and subjects themselves don't know what group they are in
  o controls for the placebo effect and experimenter bias

Scatter diagram/plot – pairs of observations plotted as dots on a graph, with each pair plotted as one point (x, y)
• Positive correlation – x and y increase/decrease together
• Negative correlation – x and y move in opposite directions (one increases as the other decreases)
• Correlation coefficient r – measures strength and direction of the linear relationship between x and y
  o Ranges from -1 ≤ r ≤ 1
  o Magnitude of r indicates strength
    ▪ |r| = 1, perfect linear relationship
  o Sign of r indicates direction
    ▪ r > 0, positive correlation
    ▪ r < 0, negative correlation
  o r = 1, perfect positive correlation
  o r = -1, perfect negative correlation
  o r = 0, no linear correlation

Calculating r: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
• Definitional formulas:
  o sum of squared deviations of x: $S_{xx} = \sum (x_i - \bar{x})^2$
  o sum of squared deviations of y: $S_{yy} = \sum (y_i - \bar{y})^2$
  o sum of cross products of x & y deviations: $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$
• Alternative formulas:
  o sum of squared deviations of x: $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
  o sum of squared deviations of y: $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$
  o sum of cross products of x & y deviations: $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$
• Spurious correlation – observed correlation between two variables that does not reflect a real relationship between them, due to the influence of a third variable (the lurking variable)

Method of Least Squares – for the line of best fit – minimizes the sum of squared residuals
• residual – vertical error between a data point and the line of best fit: $y_i - \hat{y}_i$
  o sum of squared errors (SSE): $\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$
• regression equation for the line of best fit: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
  o slope: $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$
  o intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
• Coefficient of determination $r^2$ – proportion (%) of the variation in y explained by the linear relationship with x
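Worked example (center, spread, outliers, z-scores). This is a minimal sketch of the descriptive measures in this guide, not part of the original notes. The data set is made up for illustration, and the quartiles use the "median of each half" convention described in the 5 number summary; other textbooks and software may use slightly different quartile rules.

```python
import math
import statistics

# Hypothetical sample, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11, 30]
n = len(data)

mean = sum(data) / n                        # x-bar
deviations = [x - mean for x in data]       # x_i - x-bar
total_deviation = sum(deviations)           # cancels out to 0

# Sample variance and st. dev. divide by n - 1 (Bessel's correction).
sample_var = sum(d ** 2 for d in deviations) / (n - 1)
sample_sd = math.sqrt(sample_var)
# Cross-check against the standard library, which also uses n - 1.
assert math.isclose(sample_sd, statistics.stdev(data))


def median(values):
    """Median via the positioning point (n + 1) / 2 on ordered data."""
    ordered = sorted(values)
    pos = (len(ordered) + 1) / 2            # 1-based position
    if pos.is_integer():
        return float(ordered[int(pos) - 1])
    lower = ordered[int(pos - 0.5) - 1]     # value at position x.5 - 0.5
    upper = ordered[int(pos + 0.5) - 1]     # value at position x.5 + 0.5
    return (lower + upper) / 2


# Quartiles as medians of the lower and upper halves (one common convention).
ordered = sorted(data)
half = n // 2
q1 = median(ordered[:half])
q2 = median(ordered)
q3 = median(ordered[-half:])
iqr = q3 - q1

# 1.5 * IQR limits for flagging outliers on a box & whisker plot.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# z-scores: distance from the mean in units of the sample st. dev.
z_scores = [(x - mean) / sample_sd for x in data]

print("mean:", mean, "median:", q2)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("limits:", lower_limit, upper_limit, "outliers:", outliers)
print("sample variance:", round(sample_var, 3), "sample st. dev.:", round(sample_sd, 3))
```

With this data the single large value pulls the mean above the median, which is the pattern expected for a right-skewed set.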
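Worked example (correlation and least squares). The sketch below computes r, the least squares line, SSE, and the coefficient of determination from the sums of squares named in this guide. The paired data are hypothetical, and the variable names (s_xx, s_yy, s_xy, slope, intercept) are chosen here for illustration rather than taken from the course.

```python
import math

# Hypothetical paired observations, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Definitional formulas: sums of squared deviations and cross products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Alternative (computational) formulas give the same values.
s_xx_alt = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_yy_alt = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
s_xy_alt = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least squares line: y-hat = intercept + slope * x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals are the vertical errors; SSE is the sum of their squares.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# Coefficient of determination: share of the variation in y explained by x.
r_squared = r ** 2

print("S_xx, S_yy, S_xy:", s_xx, s_yy, s_xy)
print("r:", round(r, 4), "r^2:", round(r_squared, 4))
print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("SSE:", round(sse, 4))
```

For a simple linear fit like this one, r squared also equals 1 - SSE / S_yy, which is a handy check on hand calculations.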
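Worked example (Chebyshev's Rule vs. the Empirical Rule). The 75% and 89% figures quoted for Chebyshev's Rule come from the general bound 1 - 1/k² for k standard deviations. The sketch below evaluates that bound and checks it against a small made-up, skewed data set. The helper names are hypothetical, and the population standard deviation (dividing by N) is used because the data set is treated here as the whole population.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k st. devs. of the mean (any distribution)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k population st. devs. of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mu) <= k * sigma]
    return len(inside) / len(values)


# Empirical Rule percentages (normal distributions only), for comparison.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

# Hypothetical right-skewed data, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11, 30]

for k in (1, 2, 3):
    print(
        "k =", k,
        "| Chebyshev at least:", round(chebyshev_lower_bound(k), 3),
        "| Empirical Rule (normal):", empirical_rule[k],
        "| observed:", fraction_within(data, k),
    )
```

At k = 1 the bound is 0, which is why Chebyshev's Rule gives no useful info there.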