ST 260 Final exam review
This 5 page Study Guide was uploaded by Jia Liu on Sunday May 1, 2016. The Study Guide belongs to ST 260 at University of Alabama - Tuscaloosa taught by in Spring 2016. Since its upload, it has received 12 views. For similar materials see Statistical Data Analysis in Statistics at University of Alabama - Tuscaloosa.
Date Created: 05/01/16
Final Exam Review

1. Categorical (Qualitative) Variables
   • Examples: male/female, registered to vote/not, ethnicity, eye color
2. Quantitative Variables
   – Discrete: usually take on integer values, but can take on fractions when the variable allows. Counts ("how many").
   – Continuous: can take on any value along an interval. Measurements ("how much"). Ex: weight, height, income
3. Cross-sectional data is collected at around the same time, but across different sections (or groups).
4. Time series data is collected over several time periods.
   – There is only one observation per time period.
5. Categorical Data
   1. Tabular Displays: Frequency Distribution, Relative Freq. Dist., Percent Freq. Dist., Crosstabulation
   2. Graphical Displays: Bar Chart, Side-by-Side Bar Chart, Stacked Bar Chart, Pie Chart
6. Quantitative Data
   1. Tabular Displays: Frequency Distribution, Relative Freq. Dist., Percent Freq. Dist., Cumulative Freq. Dist., Cum. Rel. Freq. Dist., Cum. Percent Freq. Dist., Crosstabulation
   2. Graphical Displays: Dot Plot, Histogram (binning), Stem-and-Leaf Display, Scatter Diagram
7. A histogram is similar to a bar chart, with the bin counts used as the heights of the bars. Note: there are no gaps between bars unless there are actual gaps in the data.
   o Before making a histogram, the Quantitative Data Condition must be satisfied: the data must be values of a quantitative variable, with units, that can be binned.
   o Caution: categorical data cannot be displayed in a histogram, and quantitative data cannot be displayed in a bar chart or a pie chart.
8. Because of the possibility of Simpson’s Paradox:
   • Examine both aggregated and unaggregated crosstabulation data
   • Check for a “hidden variable” that will give different results
   • Use the tabulation (aggregated or unaggregated) that gives better insight
     – Better yet, show both
9. Descriptive Statistics for Quantitative Data. Which measures of center and spread should be used for a distribution?
   • If the shape is skewed or has outliers, the median and IQR should be reported.
   • If the shape is unimodal and symmetric, with no outliers, the mean and standard deviation (and possibly the median and IQR) should be reported.
   • Always pair the median with the IQR and the mean with the standard deviation.
   • If you cannot find a reason for an outlier or a reason to remove it, it is better to use the median and IQR to summarize center and spread.
   • Most variable: largest IQR and range. $\mathrm{IQR} = Q_3 - Q_1$, $\mathrm{Range} = \mathrm{Max} - \mathrm{Min}$
10. Five Number Summary: Min, Q1 (25%), median (Q2 = 50%), Q3 (75%), Max.
11. Standardizing: $z = (x - \bar{x})/s$. If $z > 0$, the value is $z$ SDs above the mean; if $z < 0$, it is $|z|$ SDs below the mean. z-scores have no units, so they can be compared to z-scores of other variables.
12. Use boxplots to compare distributions.
   • Boxplots facilitate comparisons of several groups. It is easy to compare centers (medians) and spreads (IQRs).
   • Because boxplots show possible outliers separately, outliers don’t affect the comparisons.

1. Types of Probability:
   • Model-Based (Theoretical or Classical)
     – The exact probabilities can be derived
     – Ex: flipping a coin, drawing cards
   • Empirical (Relative Frequency)
     – Ex: batting average, taking surveys
   • Subjective (Personal)
     – Use your judgment/experience to estimate the likelihood that something will occur
   Note that an empirical probability is an estimate of the actual population probability.
2. 5 Rules of Probability
3. Distribution of the Sample Mean: $\bar{x} \sim N(\mu_{\bar{x}} = \mu,\ \sigma_{\bar{x}} = \sigma/\sqrt{n})$
   • The distribution of $\bar{x}$ can be approximated by a Normal distribution with these parameters if $n \ge 30$ (or $n \ge 50$ for highly skewed populations or outliers), OR if the population is known to have a normal distribution.
   • The distribution of the sample mean approaches a normal distribution as the sample size grows large, for any population.
4. Distribution of the Sample Proportion: the point estimate of a population proportion $p$ is the sample proportion $\hat{p}$. $\hat{p} \sim N(\mu_{\hat{p}} = p,\ \sigma_{\hat{p}} = \sqrt{p(1-p)/n})$, provided that:
   1. Samples are independent.
   2. The sample size is large enough that at least 10 “successes” and 10 “failures” are expected in the sample.
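The two sampling distributions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names and the numbers (p = 0.5, n = 100, mu = 50, sigma = 12, n = 36) are hypothetical, chosen only to show the standard error formulas and the 10 successes / 10 failures check.

```python
import math

def proportion_sampling_distribution(p, n):
    """Mean, standard error, and success/failure check for the sample proportion.

    Hypothetical helper: returns (p, sqrt(p(1-p)/n), condition_ok).
    """
    mean = p
    se = math.sqrt(p * (1 - p) / n)
    # At least 10 expected successes and 10 expected failures
    condition_ok = n * p >= 10 and n * (1 - p) >= 10
    return mean, se, condition_ok

def mean_sampling_distribution(mu, sigma, n):
    """Mean and standard error of the sample mean: (mu, sigma / sqrt(n))."""
    return mu, sigma / math.sqrt(n)

# Made-up example values, for illustration only
mean_p, se_p, ok = proportion_sampling_distribution(p=0.5, n=100)
mean_x, se_x = mean_sampling_distribution(mu=50.0, sigma=12.0, n=36)

print("sample proportion:", mean_p, se_p, ok)
print("sample mean:", mean_x, se_x)
```

Note that both standard errors shrink as n grows, which is why larger samples give more precise estimates.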
   In symbols, the success/failure condition is $np \ge 10$ and $n(1-p) \ge 10$.
   The standard deviation of a sample statistic is also called the standard error (sampling variability).
5. Use an interval estimate (confidence interval) to estimate a population parameter (the same parameter a point estimator targets).
   CI = point estimate ± margin of error
   • Mean: $CI(\bar{x}) = \bar{x} \pm MOE(\bar{x}) = \bar{x} \pm z^* \times \sigma/\sqrt{n}$, where $MOE(\bar{x}) = z^* \times SE(\bar{x})$ and $SE(\bar{x}) = \sigma/\sqrt{n}$
   • Proportion: $CI(\hat{p}) = \hat{p} \pm MOE(\hat{p}) = \hat{p} \pm z^* \times \sqrt{\hat{p}(1-\hat{p})/n}$, where $MOE(\hat{p}) = z^* \times SE(\hat{p})$ and $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$
   • Interpretation example: “We are 95% confident that between 12.8% and 17.2% of US adults thought the economy was improving.”
6. The critical value $z^*$ corresponds to the confidence level of the interval. When $z^*$ is large, the margin of error (MOE) is larger: high accuracy, low precision. To find $z^*$ for confidence level $C$, find the closest value in the Z-table to $(1+C)/2$.
7. Sample Size Calculation: $n = (z^*)^2\, \hat{p}(1-\hat{p}) / MOE^2$
   ** A 90% CI is wider than an 80% CI.
   ** The MOE and the width of the CI increase as the sample size decreases.
   ** A confidence interval provides more information than a point estimator.
   ** If we take 100 samples (each of size n) from a population and construct a 99% CI for the population proportion from each sample, what is the expected number of intervals that will contain the true population parameter? 99
   ** If we repeatedly take samples (of size n = 10) from a population and construct a 90% CI for the population proportion 100 times, what is the expected number of intervals that will not contain the true population parameter? 10
   ** For small finite populations, will the confidence interval be wider or narrower than in infinite populations? Narrower (the finite population correction reduces the standard error).
   ** What equation is the point estimator for the population mean? $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
8. $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ has a t-distribution with df = n − 1.
   Confidence interval for averages/means: CI = point estimate ± margin of error
   $MOE(\bar{x}) = t^* \times SE(\bar{x})$, $SE(\bar{x}) = s/\sqrt{n}$, so $CI(\bar{x}) = \bar{x} \pm t^* \times s/\sqrt{n}$
   Sample size for a mean: $n = (t^*)^2 s^2 / MOE^2$ (or $\sigma^2$ in place of $s^2$)
   Sample size for a proportion: $n = (z^*)^2\, \hat{p}(1-\hat{p}) / MOE^2$

1. Scatterplots: direction, form, strength, outliers
   x-axis: explanatory, predictor, or independent variable.
   y-axis: response, or dependent, variable.
2. Correlation: measures the strength of the linear association between x and y (quantitative variables). Strength = how tightly the points follow a straight line. $-1 \le r \le 1$
   • Correlation coefficient: $r = \dfrac{\sum z_x z_y}{n-1}$, where $z_x = \dfrac{x - \bar{x}}{s_x}$ and $z_y = \dfrac{y - \bar{y}}{s_y}$
   • Equivalent form: $r = \dfrac{s_{xy}}{s_x s_y}$, where $s_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$ is the sample covariance
   • r = 1 or r = −1 is a perfect linear relationship; r = 0 is a lack of linear relationship (no linear pattern).
   • The sign of the correlation gives the direction of the relationship.
   • Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value.
   • A large correlation is not a sign of a causal relationship. Correlation ≠ causation (lurking variable).
   ** Don’t correlate categorical variables.
3. Linear model: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the intercept, $b_1$ is the slope, and $\hat{y}$ represents an approximate or predicted value.
   True regression equation: $y = \beta_0 + \beta_1 x + \text{error}$
   Estimated regression equation: $\hat{y} = b_0 + b_1 x$
   The regression equation rewritten in standardized form: $\hat{z}_y = r\, z_x$
4. Residuals: $e = y - \hat{y}$, where $y$ is the observed value and $\hat{y}$ is the predicted value.
5. Least squares line: the regression (best fit) line doesn’t pass through all the points, but it is the best compromise in the sense that the sum of squares of the residuals is the smallest possible. The slope tells us the change in y per unit change in x. When plotted against the predicted values, the residuals should show no pattern, no direction, and no change in spread.
6. $b_1 = r\,\dfrac{s_y}{s_x}$, so the slope gets its sign from the correlation. $b_0 = \bar{y} - b_1 \bar{x}$
7. Variance in the model: $0 \le r^2 \le 1$. $r^2$ is the coefficient of determination: the fraction (percentage) of the data’s variance accounted for by the model, a measure of how well the model fits. If the correlation were 1.0, the model would predict y perfectly; the residuals would all be zero and have no variance. If the correlation were 0, the model would predict the mean of y for every x-value.
   The residuals would then have the same variability as the original data. $r = (\text{sign of } b_1)\sqrt{r^2}$
8. Assumptions: models are useful only when specific assumptions are reasonable. We check conditions that provide information about the assumptions:
   1) Quantitative Data Condition: linear models only make sense for quantitative data, so don’t be fooled by categorical data recorded as numbers.
   2) Linearity Condition (checks the Linearity Assumption): the two variables must have a linear association, or a linear model won’t mean a thing.
   3) Outlier Condition: outliers can dramatically change a regression model.
   4) Equal Spread Condition: check a residual plot for equal scatter at all x-values.
9. Confidence Interval: the regression equation $\hat{y} = b_0 + b_1 x$ gives a point estimate for the mean of y at a particular value of x; the confidence interval is an interval estimate for that mean.
   Prediction Interval: gives an interval estimate of the value of a new y at a given x. It will be wider than the confidence interval because we are estimating the value of a single observation, not the average.
   ** Slope: a slope of k means that for every extra unit of x, the y of the … is predicted to increase (or decrease) by k units of y. OR: The slope is k. Based on this model, each additional unit of x tends to require an additional k units of y.
   ** Intercept: an intercept of m is the value of the regression line when x = 0. If that value is not reasonable (meaningful), the usual reason is that x = 0 is not a realistic value for x.
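The correlation, slope, intercept, residual, and $r^2$ formulas from the regression items above can be tied together in a short Python sketch. This is only an illustration under stated assumptions: the data values and the helper name `fit_line` are made up, and `statistics.stdev` is used because it computes the sample standard deviation (dividing by n − 1), matching the formulas in these notes.

```python
import statistics as st

def fit_line(xs, ys):
    """Return (r, b0, b1) using r = sum(z_x * z_y) / (n - 1),
    b1 = r * s_y / s_x, and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = st.mean(xs), st.mean(ys)
    s_x, s_y = st.stdev(xs), st.stdev(ys)  # sample SDs (n - 1 denominator)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return r, b0, b1

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r, b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]                        # y-hat values
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # e = y - y-hat
r_squared = r ** 2                                           # fraction of variance explained

print("r =", r, "b0 =", b0, "b1 =", b1)
print("r^2 =", r_squared)
print("residuals:", residuals)
```

The residuals of a least squares line sum to zero (up to rounding), which is a quick sanity check on a fitted line; plotting them against the predicted values is how the Equal Spread Condition would be checked.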