Applied Regression Analysis
Applied Regression Analysis STAT 462
This 0 page Class Notes was uploaded by Hilbert Denesik on Sunday November 1, 2015. The Class Notes belongs to STAT 462 at Pennsylvania State University taught by Staff in Fall. Since its upload, it has received 26 views. For similar materials see /class/233128/stat-462-pennsylvania-state-university in Statistics at Pennsylvania State University.
Date Created: 11/01/15
REVIEW OF STATISTICAL CONCEPTS
F. Chiaromonte

Problem: differences in genome sizes among vertebrates. Overall, the chicken genome is 40% of the size of the human genome. Does the size ratio vary across orthologous regions? Can this be explained by insertion of repeats?
Source: Int'l Chicken Genome Seq. Consortium, Nature 12/04.

Population: 6727 non-overlapping windows (length between 100 and 150 Kb) which cover all alignments of chicken to human genome.

Variables considered:
$y = \log(hL/cL)$: log ratio between human and chicken length (including repeat bases)
$x = \log(hMS/cMS)$: log ratio between the fraction of each window occupied by repeat bases in human and in chicken (proxy for insertion ratio)
$y$ and $x$ are quantitative variables with continuous values.

Sample: $n = 100$ randomly selected windows, and correspondingly $(y_i, x_i)$, $i = 1, \dots, n$.

    i      y         x
    1      0.31387   0.48023
    2      0.52634   1.65562
    ...
    99     0.35218   1.45637
    100    0.22531   0.44261

CENTER of a quantitative variable (e.g. log of the length ratio) on the sample data

Mean (or arithmetic average): $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = 0.38$

A robust alternative evaluation of the center is the median (value splitting the data in half): the $\frac{n+1}{2}$-th sorted data value if $n$ is odd, or the midpoint between the $\frac{n}{2}$-th and $(\frac{n}{2}+1)$-th sorted data values if $n$ is even. Here the median is 0.358885.

The median is lower than the mean: there is a slight excess of high values (the data is slightly asymmetric about its center; see graphs below).

However we measure the center, it seems quite clear that it is $> 0$ on the log scale (i.e. $> 1$ for length ratios). Based on the sample data, human length exceeds orthologous chicken length on average.

VARIABILITY of a quantitative variable on the sample data

About the mean: variance and standard deviation (same scale as the variable)

$s^2(y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0.039$

that is, averaging squared deviations from the mean. One can use $n$ or $n-1$ at the denominator ($n-1$: degrees of freedom of the sum of squares).

Min-max range: $(\min, \max)(y) = (0.07572, 1.07482)$
Interquartile range: $(q_{25}, q_{75})(y) = (0.2699500, 0.481065)$

Based on the sample data, the ratio between human and chicken length is highly variable across orthologous loci.
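The measures of center and variability described above can be sketched in a few lines of Python. This is only an illustration: it uses the four $y$ values listed in the notes as toy data (not the full sample of 100 windows), so its results will not match the sample statistics quoted here. Quantile conventions also differ between software packages, so the quartiles from `statistics.quantiles` need not agree with those of other tools.

```python
import statistics

# Toy data: the four y values (log length ratios) listed in the notes.
# This is NOT the full sample of n = 100 windows.
y = [0.31387, 0.52634, 0.35218, 0.22531]
n = len(y)

# Center
mean_y = statistics.mean(y)
median_y = statistics.median(y)   # n is even: midpoint of the two middle values

# Variability about the mean
var_n = statistics.pvariance(y)   # sum of squared deviations divided by n
var_n1 = statistics.variance(y)   # sum of squared deviations divided by n - 1
sd_n1 = statistics.stdev(y)       # square root of var_n1, same scale as y

# Min-max range and quartiles (default "exclusive" method of the statistics module)
value_range = (min(y), max(y))
q25, q50, q75 = statistics.quantiles(y, n=4)

print(f"mean={mean_y:.6f} median={median_y:.6f}")
print(f"variance (n)={var_n:.6f} variance (n-1)={var_n1:.6f} sd={sd_n1:.6f}")
print(f"range={value_range} quartiles=({q25:.5f}, {q75:.5f})")
```

The two variance versions differ only by the factor $n/(n-1)$, which becomes negligible for samples as large as the one in these notes.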
Summary statistics for the two variables:

               y         x
    Min        0.07572   0.3447
    1st Qu.    0.26995   0.6104
    Median     0.35889   0.8193
    Mean       0.38422   0.8822
    3rd Qu.    0.48107   1.1213
    Max        1.07482   2.2389

Useful summary statistics (functions computed on the sample data). A statistic, such as the sample mean or sd, can be meant as a descriptor of the data, but also as an estimate of the corresponding feature of the population (parameter) from which the sample was drawn.

$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

Sample mean as a point estimate of the population mean $\mu_y$. Its value will change depending on the sample. On average, over all possible samples of size $n$:

$E(\bar{y}) = \mu_y$ (unbiasedness)
$MSE(\bar{y}) = \frac{1}{n} \sigma^2_y$ ($\sigma^2_y$: population variance of $y$)

Thus we evaluate the standard error of the sample mean in estimating the population mean as

$se(\bar{y}) = \frac{1}{\sqrt{n}} \, sd(y)$ ($sd(y)$: sample sd of $y$)

Inference: CONFIDENCE INTERVAL (CI) for a population parameter

Use the se to create a CI with the statistic as pivot. For the population mean:

$CI_\alpha = \bar{y} \pm a_\alpha \, se(\bar{y})$

The multiplier $a_\alpha$ guarantees (or approximates) a certain coverage or confidence level, i.e. the probability that the random interval, over all possible $n$-samples, contains the population mean:

$\Pr(CI_\alpha \ni \mu_y) = 1 - 2\alpha$

The multiplier is selected using a reference distribution.

If the sample size is large, asymptotic arguments (Central Limit Theorem) tell us that

$\frac{\bar{y} - \mu_y}{se(\bar{y})} \sim N(0,1)$ approx.

Roughly, this means $\bar{y} \sim N(\mu_y, se(\bar{y}))$ approx. We can use multipliers from a $N(0,1)$ (quantiles).

If the sample size is small, but we can assume the population $y$ to be normal, we have that

$\frac{\bar{y} - \mu_y}{se(\bar{y})} \sim T_{n-1}$ ($n-1$: degrees of freedom of a T)

and we can use Student's T multipliers instead. This works approximately also when the population $y$ does not depart too extremely from a Gaussian shape.

A T distribution is symmetric about 0 and bell shaped, as a $N(0,1)$, but has heavier tails, so that the multiplier is larger, and the CI broader, for the same coverage. When the dof increase, it converges to a $N(0,1)$. The approximation is already good for $n > 20$. So in practice standard normal and T multipliers are indistinguishable for large samples.
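As a rough illustration of the interval construction described above, the sketch below computes a normal-multiplier CI from summary statistics. The function name `normal_ci` and its arguments are hypothetical, introduced only for this example. The inputs are the rounded summary values quoted in these notes for the log length ratio (n = 100, sample mean 0.384, sample sd 0.196). Only the $N(0,1)$ multiplier is shown, because the Python standard library has no Student's T quantile function; a statistics package would be needed for T multipliers.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean, sd, n, coverage=0.95):
    """CI for a population mean using a N(0,1) multiplier (large-sample case).

    Hypothetical helper for illustration: mean +/- a * sd / sqrt(n).
    """
    alpha = (1 - coverage) / 2            # tail probability left out on each side
    a = NormalDist().inv_cdf(1 - alpha)   # N(0,1) multiplier, about 1.96 for 95%
    se = sd / sqrt(n)                     # standard error of the sample mean
    return mean - a * se, mean + a * se


# Rounded summary values quoted in the notes for the log length ratio
lo, hi = normal_ci(mean=0.384, sd=0.196, n=100)
print(f"95% CI for the population mean: ({lo:.3f}, {hi:.3f})")
```

Because the inputs are rounded to three decimals, the last digit of the result can differ slightly from an interval computed on the unrounded sample statistics.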
CI with coverage 95% for the population mean of the log length ratio

Recall: $n = 100$, sample mean = 0.384, sample sd = 0.196.
2.5% upper-tail quantile of the $N(0,1)$ distribution: 1.9599
2.5% upper-tail quantile of the Student's T distribution with $n - 1 = 99$ dof: 1.9842
(These can be obtained from tables or statistics software packages. Note that the quantiles are very close: $n$ is large.)

$CI_\alpha = \bar{y} \pm a_\alpha \, se(\bar{y}) = 0.384 \pm 1.96 \cdot \frac{0.196}{\sqrt{100}} = (0.346, 0.423)$

Inference: TEST OF HYPOTHESIS for a population parameter

Null hypothesis on a feature of the population, for instance

$H_0: \mu_y = 0$

This is what we would like to refute. Using our sample data, we investigate the null in comparison to an alternative, for instance

$H_a: \mu_y \neq 0$ (two-sided)
$H_a: \mu_y > 0$ (one-sided, right; could be left)

This is what we would like to show is supported by evidence in the data. Note: in this case the null specifies one value, while the alternative specifies a range; these are the most common specifications.

Right-sided: assess if we have enough evidence to conclude, based on our sample of $n = 100$ observations, that the log length ratio for human vs chicken has a positive mean in the population.

We need to use a test statistic, i.e. a function of the data whose distribution under $H_0$ (null distribution) is known, and can thus be used as reference. We know that

$\frac{\bar{y} - \mu_y}{se(\bar{y})} \sim N(0,1)$ approx., if $n$ is large, regardless of the distribution of $y$ in the population
$\frac{\bar{y} - \mu_y}{se(\bar{y})} \sim T_{n-1}$, if $y$ in the population is approximately normal, for any $n$

Thus, under $H_0$ (say we use the first result; $n = 100$ is large):

$u = \frac{\bar{y}}{se(\bar{y})} \sim N(0,1)$ approx.

The p-value (or achieved significance level) associated with the observed $u$ is the probability that, under the null, the statistic would take the observed value or a value even more extreme in the direction (here right) defined by the alternative:

$p(u_{obs}) = \Pr(u \geq u_{obs} \mid \mu_y = 0)$

For the two-sided alternative, the p-value is

$p(u_{obs}) = \Pr(u \leq -|u_{obs}| \text{ or } u \geq |u_{obs}| \mid \mu_y = 0) = 2 \Pr(u \geq |u_{obs}| \mid \mu_y = 0)$

because of the symmetry of the null distribution.

Basic idea: we can reject $H_0$ in favour of $H_a$ if the observed value of the test statistic is very extreme with respect to what one
would expect under the null distribution, that is, if the corresponding p-value is small. The smaller the p-value, the stronger the evidence against $H_0$ provided by the data.

Testing whether the population mean of the log length ratio is 0 vs positive

Recall: $n = 100$, sample mean = 0.384, sample sd = 0.196.

$u_{obs} = \frac{\bar{y}}{se(\bar{y})} = \frac{0.384}{0.196 / \sqrt{100}} = 19.575$

The area on the right of 19.575 under a standard normal is

$p(u_{obs}) = 1.260955 \times 10^{-85} \approx 0$ by all practical means.

Very strong evidence that the population mean log length ratio is positive.

Rejection rule: reject $H_0$ if the p-value is $\leq$ a threshold $\alpha$, say $\alpha = 0.05$ (5%). This is called the level, or target significance. With this rule we ensure that

$\Pr(\text{rejecting } H_0 \mid H_0) \leq \alpha$

i.e. we control the probability of a false positive, or so-called type I error.

The other error we can make is to fail to reject $H_0$ when $H_a$ holds:

$\Pr(\text{not rejecting } H_0 \mid H_a)$

This is the probability of a false negative, or so-called type II error. One minus this probability is called the power of the test, and the function expressing it for each point in the alternative (in our instance $H_a$ is a range) is called the power function.

Type I and II error probabilities are in trade-off; test statistics are evaluated based on their power function, once the level is fixed.

GRAPHICAL REPRESENTATIONS of a quantitative variable on the sample data

[Figure: histogram and box plot of $y$ (log length ratio)]

Box: median, 1st (25%) and 3rd (75%) quartile.
Whiskers: to the first point beyond quartile $\pm$ 1.5 $\times$ the interquartile difference.
Outliers: marks beyond the whiskers.

These plots show that for almost all windows in the sample the log length ratio is positive (human length larger than chicken), and that it varies substantially across windows, with a slight excess of high values.

ASSOCIATION between two quantitative variables on the sample data

Measure of linear association: Pearson's correlation coefficient

$cor(y, x) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2}} = 0.396 \in [-1, 1]$

$\pm 1$: maximal direct or
inverse linear association ($y$ and $x$ values lie on a line).
$0$: lack of linear association, which does NOT necessarily mean lack of association in general.

On our data, length ratio and insertion ratio (both on the log scale) present a sizeable positive correlation.

Statistical, rather than exact (functional), association unless $cor = \pm 1$.

Also cor can be meant as a descriptor of the sample data, but also as an estimate of the corresponding population feature (parameter).

GRAPHICAL REPRESENTATION of bivariate sample data

Scatter plot: showing a positive statistical association between length ratio and insertion ratio (both on log scale).

[Figure: scatter plot of $y$ (log length ratio) against $x$ (log insertion ratio)]

This course is about REGRESSION ANALYSIS

- Constructing quantitative descriptions of the statistical association between $y$ (response variable) and $x$ (predictor or explanatory variable) on the sample data.
- Introducing models to interpret estimates and inferences on the parameters of these descriptions in relation to the underlying population.

[Diagram labels: log length ratio; log large inst. ratio]

MULTIPLE regression: when we consider more than one predictor variable.
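Pearson's correlation coefficient, the descriptive starting point for the regression analysis outlined above, can be sketched directly from its definition. This is a minimal illustration: the function name `pearson_cor` is hypothetical, and the data are only the four $(y, x)$ rows listed in the notes, so the result is not expected to reproduce the correlation of the full sample of 100 windows.

```python
from math import sqrt


def pearson_cor(y, x):
    """Pearson's correlation: sum of cross deviations over the root of the
    product of the two sums of squared deviations (hypothetical helper)."""
    if len(y) != len(x) or len(y) < 2:
        raise ValueError("y and x must have the same length, at least 2")
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_yx / sqrt(s_yy * s_xx)


# Toy data: the four (y, x) rows listed in the notes, NOT the full sample.
y = [0.31387, 0.52634, 0.35218, 0.22531]   # log length ratio
x = [0.48023, 1.65562, 1.45637, 0.44261]   # log insertion ratio

r = pearson_cor(y, x)
print(f"cor(y, x) on the four toy rows: {r:.3f}")
```

Two quick sanity checks follow from the definition: a variable correlated with itself gives 1, and flipping the sign of one variable flips the sign of the coefficient.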