# Introduction to Statistics and Data Analysis (SI 544)

These 45 pages of class notes were uploaded by Yvette Hegmann on Thursday, October 29, 2015. They belong to SI 544 at the University of Michigan, taught by Lada Adamic in Fall.

## ANOVA (analysis of variance)

Lada Adamic, March 11, 2008, SI 544

### 1 One-way ANOVA

First we will work with Prof. Karen Markey's data on people's understanding of library subject headings (for complete info, see the study's project page on the SI website). I'll be passing out some of the surveys for you to look at. The surveys were conducted at 3 Michigan public libraries. Each person taking the survey was asked to fill out demographic info (age, sex, occupation, etc.) as well as to try and write a description for 8 library subject headings, e.g. "Cattle -- United States -- Marketing". All subjects at each library were asked to interpret the same 8 headings, for a total of 24 headings across all three libraries. There were three ways the subject headings were presented: just the heading itself, the heading along with the headings directly preceding and following it in alphabetical order, and the heading in a particular book's bibliographic record. Sometimes the headings were kept in their original order, and sometimes they were rearranged to a recommended standardized order.

```r
# load the demographic information for each person taking the survey
demog <- read.table("oclcdemographics.txt", head=T, sep="\t")
# load the results of the survey
surveyresults <- read.table("dataoclcpub.txt", head=T, sep="\t", strip.white=T)
# calculate the percent of subject headings interpreted correctly
scorebysurvnum <- tapply(surveyresults$correct == "c",
                         surveyresults$survynum, mean)
# combine them into one data frame and attach it
attach(demogandscore)
```

Look at what we have:

```
> summary(demogandscore)
    survynum        sex           age            libuse  
 Min.   :  1.00   f   :205   Min.   : 9.00   a      : 19  
 1st Qu.: 77.75   m   :101   1st Qu.:14.00   b      :117  
 Median :154.50   NA's:  2   Median :18.00   c      :119  
 Mean   :154.50              Mean   :28.31   d      : 41  
 3rd Qu.:231.25              3rd Qu.:42.00   e      : 12  
 Max.   :308.00              Max.   :74.00   NA's   :  2  
  eductn        profssn       whichlibrary scorebysurvnum  
 a   : 52   student  :128   lib1:106   Min.   :0.0000  
 b   : 80   retired  : 21   lib2:103   1st Qu.:0.1250  
 c   : 30   homemaker: 15   lib3: 99   Median :0.3750  
 d   : 55   teacher  : 10              Mean   :0.3449  
 e   : 79   clerk    :  6              3rd Qu.:0.5000  
 NA's: 12   Other    : 77              Max.   :0.8750  
            NA's     : 51              NA's   :51      
```

In essence, we have 308 people who took the survey, 205 of whom were female and 101 of whom were male. Their ages ranged from 9 to 74. They had varying levels of library use (a = daily, b = weekly, c = monthly, d = 2 to 3 times/yr, e = 1-2 times/yr), with most of them coming to the library on a weekly or monthly basis. There were a lot of students (128), followed by retirees and homemakers, and then a variety of other professions. The score, in terms of the proportion of subject headings that were correctly identified, ranges from 0 (a person who got none of the headings correct) to 0.875 (a person who got 7 out of 8 of them correct). No one got all the headings right.

The first thing we'll do is create boxplots for the scores grouped by the education level of the person, ranging from having completed elementary school to having a college degree.

[Figure: boxplots of proportion correct by education level (elementary, JHS, HS, college, BS).]

From the boxplot we can see that high-school kids (those who have completed JHS, so the "JHS" box-and-whisker plot) do about as well as those who have had some college education (the boxplot labeled "college"). Now we'd like to test whether any of these means is significantly different from any of the others, which means that we will be doing an F-test for the null hypothesis that all the means are equal.

```
> anova(lm(scorebysurvnum ~ eductn))
Analysis of Variance Table

Response: scorebysurvnum
           Df  Sum Sq Mean Sq F value  Pr(>F)  
eductn      4  0.5786  0.1447  2.5262 0.04149 *
Residuals 238 13.6290  0.0573                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We see that we can reject the hypothesis that all the means are equal at the 0.05 level. Great! But which of all the pairs of means are actually different? As we've discussed in class, we can't go and t-test all pairs against each other, because our probability of committing a type I error (rejecting the null hypothesis when it is true) goes up with each additional test we make. So we need to use a correction that will multiply the p-value by a factor corresponding to the number of tests made.
The Bonferroni adjustment multiplies all p-values by the number of tests, while the Holm method (applied in R by default) corrects the smallest p by the full number of tests, the second smallest by n - 1, etc. We will use the pairwise.t.test method for this, which will conveniently do all the pairwise t-tests for us and put them in a nice little table.

#### 1.1 Pairwise t-tests

```
> pairwise.t.test(scorebysurvnum, eductn, p.adj="bonferroni")

	Pairwise comparisons using t tests with pooled SD 

data:  scorebysurvnum and eductn 

  a     b     c     d    
b 1.000 -     -     -    
c 1.000 1.000 -     -    
d 1.000 1.000 1.000 -    
e 0.031 0.209 0.068 1.000

P value adjustment method: bonferroni 

> pairwise.t.test(scorebysurvnum, eductn)

	Pairwise comparisons using t tests with pooled SD 

data:  scorebysurvnum and eductn 

  a     b     c     d    
b 1.000 -     -     -    
c 1.000 1.000 -     -    
d 1.000 1.000 1.000 -    
e 0.031 0.167 0.062 1.000

P value adjustment method: holm 
```

What we are getting in the table are the p-values for the t-tests between, e.g., a and c, multiplied (in the case of Bonferroni) by the number of tests. If this value exceeds 1, then R returns 1.000 for that t-test. According to the results, at the 0.05 level we can only be sure that library patrons with a college degree are doing better than kids who have only completed elementary school. For both the Bonferroni method and the Holm method (which penalizes the different p-values non-uniformly), we have at the 0.1 level that people with a college degree do better than people with just a high school degree and no years of college.
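The behavior of the two adjustment methods is easy to see on a handful of made-up p-values with `p.adjust()`, the function `pairwise.t.test` uses internally (the six values below are invented for illustration):

```r
# six hypothetical unadjusted p-values, already sorted for readability
p <- c(0.004, 0.02, 0.03, 0.15, 0.40, 0.90)
# Bonferroni: every p is multiplied by 6 (the number of tests), capped at 1
round(p.adjust(p, method = "bonferroni"), 3)
# -> 0.024 0.120 0.180 0.900 1.000 1.000
# Holm: smallest p times 6, next times 5, ..., then a running maximum
round(p.adjust(p, method = "holm"), 3)
# -> 0.024 0.100 0.120 0.450 0.800 0.900
```

Note that Holm's adjusted values are never larger than Bonferroni's, which is why it rejects at least as often.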
#### 1.2 Paired vs. pairwise

One of you asked the very relevant question of what the difference is between a paired and a pairwise t-test. A pairwise t-test means that you have more than 2 groups and you are comparing them pairwise, meaning that you have n(n - 1)/2 comparisons if you have n groups. Paired t-tests, if you remember, refer to just two samples, but each observation in one sample is paired with an observation in the other. For example, the two samples could be "before" and "after" a treatment was administered to the patient.

Let's roll back to a t-test and see how we can apply it to this data. One of the questions posed in the study was whether the order of the subdivisions within the subject headings mattered: "If the cataloger chooses to apply subdivisions, the subdivisions should always appear in the following order: topical, geographic, chronological, form" (Conway 1992:6). For example, one of the original subject headings was "Education--United States--Finance", and the recommended reordering would be "Education--Finance--United States". The ordering changes the likely interpreted meaning. So let's do the following: for each subject heading, we will take the proportion of people who got it right when it was in the original order, and we will pair it with the proportion of people who got it right when it was in the recommended order, to see if there's a difference. Looking at a summary of the survey results (the columns we'll need here):

```
> summary(surveyresults)
    survynum      sex          shnum            type         order   
 Min.   :  1.00   f:1632   Min.   : 1.00   Min.   :1.000   o:1232  
 1st Qu.: 77.75   m: 816   1st Qu.: 6.00   1st Qu.:1.000   r:1232  
 Median :154.50            Median :12.00   Median :2.000           
 Mean   :154.50            Mean   :12.32   Mean   :1.977           
 3rd Qu.:231.25            3rd Qu.:18.00   3rd Qu.:3.000           
 Max.   :308.00            Max.   :24.00   Max.   :3.000           
```

Looking at the survey results, we basically want to count the proportion of correct answers separately for each subject heading (denoted by shnum) for the two different kinds of orderings (order: o = original, r = recommended).

```r
# let's keep only the attempted SH descriptions
justattempted <- surveyresults[!is.na(surveyresults$correct), ]
# and look at a summary of what is left
summary(justattempted)
# now we will average over all people who described the same
# subject heading in the same order
bothorderings <- aggregate(justattempted$correct == "c",
                           list(shnum = justattempted$shnum,
                                order = justattempted$order),
                           FUN = mean)
```

```
> # paired t-test
> t.test(x ~ order, paired=T, data=bothorderings)

	Paired t-test

data:  x by order
t = 0.7796, df = 23, p-value = 0.4436
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03249104  0.07178732
sample estimates:
mean of the differences 
             0.01964814 

> # unpaired t-test -- WRONG
> t.test(x ~ order, data=bothorderings)

	Welch Two Sample t-test

data:  x by order
t = 0.3432, df = 45.982, p-value = 0.733
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0955977  0.1348940
sample estimates:
mean in group o mean in group r 
      0.3677402       0.3480920 
```

Notice that neither the paired t-test nor the regular one (which is incorrectly applied here) gives us significant differences in people's average ability to interpret the subject headings correctly. But the paired t-test does give a lower p-value and a narrower confidence interval, which shows that it is superior. Why is this? Well, there is a lot of variation in the interpretation difficulty of each subject heading. The paired test keeps the variation due to question difficulty separate from the variation due to the two different ways of presenting the subject headings.

I've used the aggregate command to calculate the proportion of people who answered the question correctly in each order. I did this by creating a TRUE/FALSE vector with justattempted$correct == "c", grouping by order and subject heading, and then taking the mean. mean treats boolean TRUE/FALSE vectors as 0/1 vectors, which means that we can average them.

In any case, since standardizing headings to be in a prescribed recommended order did not seem to impact people's ability to interpret subject headings, one of the recommendations resulting from this study was to standardize.
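The variance-separation argument can be seen on synthetic data. Everything below is made up (the number of "questions", the seed, and the effect sizes are chosen purely for illustration): a large per-question difficulty component swamps a small condition effect in the unpaired test, while pairing subtracts the difficulty out.

```r
set.seed(42)
# 24 made-up questions with widely varying difficulty
difficulty <- rnorm(24, mean = 0.4, sd = 0.15)
# two presentation conditions: same difficulty, tiny condition effect,
# small measurement noise
cond1 <- difficulty + rnorm(24, 0, 0.02)
cond2 <- difficulty + 0.01 + rnorm(24, 0, 0.02)
# the paired test works with the differences, so the shared
# difficulty component cancels out
t.test(cond1, cond2, paired = TRUE)$p.value
# the unpaired test must treat difficulty as noise
t.test(cond1, cond2)$p.value
```

The paired p-value comes out far smaller than the unpaired one, for exactly the reason given above: the standard error of the paired differences excludes the between-question variance.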
### 2 Two-way ANOVA

We may wish to consider two variables (factors) simultaneously, and for this we would do a two-way ANOVA. The example I am using here is made up and should be familiar from lecture. Recently there was a NYT article in which it was mentioned that boys and girls may do better when taught in separate classrooms, in part because the conditions could be set to fit each gender separately. An argument was made that boys are more comfortable with the thermostat set to a lower temperature in the classroom. We are going to test this assumption, but with some fake data of course.

```r
# we have 90 different observations, 30 at each
# of the following temperatures: 68, 72, 78
temperature <- c(rep(68,30), rep(72,30), rep(78,30))
# we have two different genders, boys and girls
gender <- rep(c(rep("boy",15), rep("girl",15)), 3)
# we translate this to a numerical value
# just to generate some data
gendernum <- rep(c(rep(1,15), rep(-1,15)), 3)
# now we're ready to create some fake data: the kids are going to average
# a score of 80, and the distribution is normal with a standard deviation
# of 5. In addition, the boys are going to score better when it's colder;
# for the girls it's the opposite
testscore <- 80 + rnorm(90,0,5) - (temperature - 72)*gendernum
# first, we should see no difference in average scores for boys and
# girls when the temperature is not taken into account
boxplot(testscore ~ gender, ylab="test score")
# similarly, the t-test should not be able to pick up anything
t.test(testscore ~ gender)
```

[Figure: boxplots of test score for boys and girls.]

```
> t.test(testscore ~ gender)

	Welch Two Sample t-test

data:  testscore by gender
t = -1.4815, df = 87.997, p-value = 0.1421
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.8887375  0.7128885
sample estimates:
 mean in group boy mean in group girl 
          78.95415           81.04208 
```

```r
# you can do the same for temperature, but now let's ask for an
# interaction plot; this should look familiar from the lecture
interaction.plot(temperature, gender, testscore)
```

## Normal distribution and samples

July 26, 2007, Lada Adamic, SI 544

### 1 First, how to do a few things more efficiently in R

Plots need not take up that much space if they are placed in a table. We can use the par() function for this.
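As a minimal, self-contained sketch of the layout trick just mentioned (the data plotted here is made up, not from the course datasets):

```r
# put two plots side by side in a 1-by-2 table
par(mfrow = c(1, 2))
x <- 1:20
plot(x, x^2, main = "quadratic")       # left panel
plot(x, sqrt(x), main = "square root") # right panel
# restore the default single-plot layout so later plots are not squeezed
par(mfrow = c(1, 1))
```

Forgetting the final `par(mfrow = c(1, 1))` is a common source of mysteriously tiny plots later in a session, which is why the notes reset it immediately.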
Furthermore, it is sometimes a bit tiresome to be always typing libraries$POPU, libraries$TOTINCM, etc. One option is to attach the data after reading it in or loading a library, e.g.

```r
> libraries <- read.table(
+   "http://www-personal.umich.edu/~ladamic/si544f06/data/libraries.dat",
+   sep="\t", quote="\"", head=T)
> attach(libraries)
> POPU
```

Now R will understand what POPU is without me having to type libraries$POPU every time. When we don't need libraries any more, and we don't want to get confused with similarly named columns of other data frames, we simply detach it like so:

```r
> detach(libraries)
```

Bye-bye, libraries.

OK, so let's plot some stuff. We make our 1-by-2 table:

```r
> par(mfrow=c(1,2))
> plot(ATTEND ~ POPU)
> plot(log(ATTEND) ~ log(POPU))
> par(mfrow=c(1,1))
```

Notice that I've reset the plot to go back to a single axis, just so I don't forget to do it later. I've plotted the same data twice. Notice that I can just say y ~ x to plot y vs. x. If I hadn't attached the libraries data set, I could have told plot what I meant by typing

```r
> plot(ATTEND ~ POPU, data=libraries)
```

What do we see? We see a fairly linear trend. That's good: the more people a library serves, the higher the attendance. There is some scatter, however. It is best observed once the variables are logarithmically transformed, or plotted on a log-log scale, because there is so much variability in the size of the population served.

[Figure 1: Attendance vs. population served for each library.]

Sometimes some data seems a bit shy. There might be outliers. Frankly, in this plot I don't see any outliers that are too suspect, but I do see a couple of points that have relatively low attendance given the populations served. How can I figure out which points those are? I can use the identify() function.

```
> biglibraryindices <- identify(POPU, ATTEND, n=10)
warning: nearest point already identified
> libraries[biglibraryindices,]
          LIBID                               LIBNAME                 ADDRESS
424        M658  County of Los Angeles Public Library 7400 E Imperial Highway
425        M659            Los Angeles Public Library          630 W Fifth St
879    FSCS5010  Broward County Division of Libraries       100 S Andrews Ave
890    FSCS5023        Miami-Dade Public Library System       101 West Flagler
1752       0960                            Chicago PL      400 S State Street
5474 0400300000               BROOKLYN PUBLIC LIBRARY        GRAND ARMY PLAZA
6007 6800310000         QUEENS BOROUGH PUBLIC LIBRARY 89 11 MERRICK BOULEVARD
6206 4801290000           THE NEW YORK PUBLIC LIBRARY        455 FIFTH AVENUE
7053  526510006          FREE LIBRARY OF PHILADELPHIA            1901 VINE ST
7594        189                Houston Public Library     500 McKinney Street
                CITY  ZIP1 ZIP2      PHONE    POPU CENTLIB BRANLIB BKMOB MASTER LIBRARIAN
424           Downey 90241 7011 3109408462 3324500       0      86     3 251.00    251.00
425      Los Angeles 90071 2097 2132287515 3638100       1      66     4 347.00    347.00
879  Fort Lauderdale 33301 1826 9543577376 1392252       1      32     3 141.00    171.00
890            Miami 33130 1523 3053752665 1709909       1      31     0 155.00    155.00
1752         Chicago 60605 1203 3127474090 2783726       1      81     2 363.63    364.56
5474        BROOKLYN 11238 5698 7187807803 2300664       1      59     1 323.84    323.84
6007         JAMAICA 11432 5242 7189900790 1951598       1      62     0 330.75    330.75
6206        NEW YORK 10016 0122 2123400941 3070302       6      79     0 730.37    730.37
7053    PHILADELPHIA 19103 1189 2156865300 1585577       1      51     0 272.81    272.81
7594         Houston 77002 2534 7132472232 1702086       1      36     0 165.00    165.00
```

What happened (and you'll get to try this in class) is that I called the identify() function with the same x and y vectors as the plot. I told it I wanted to quit after identifying n=10 points. If I don't specify n, then I would just keep clicking until I was done; then, on Windows, I would click the right mouse button and tell it I wanted to "stop". I then went back and asked for the rows corresponding to those indices. I found some big libraries in there, corresponding to major metropolitan areas.

```r
> plot(log(ATTEND) ~ log(POPU), data=libraries)
```

[Figure 2: Attendance vs. population served for each library, with the indices of some points identified.]

```
> lowattendanceindices <- identify(log(POPU), log(ATTEND))
> libraries[lowattendanceindices, c(...)]  # columns: LIBNAME, CITY, STABR, POPU, ATTEND, LIBRARIAN, TOTINCM
                                  LIBNAME         CITY STABR   POPU ATTEND LIBRARIAN TOTINCM
3670             Athens Community Library       Athens    MI   2515     19      0.50    8797
7424 HIGHLAND RIM REGIONAL LIBRARY CENTER MURFREESBORO    TN 378334   2823      6.14  502388
490       WATAUGA REGIONAL LIBRARY CENTER JOHNSON CITY    TN 387852   2906      3.50  451251
```

So we see a couple of libraries that have poor attendance given their population. But with only 0.5 librarians to serve library patrons in Athens, MI, maybe the patrons are not to blame. The picture looks not so great in TN.

On the other hand, maybe we want to locate our favorite library on the plot. First let's get the index of the Ann Arbor public library:

```
> libraries[CITY == "Ann Arbor",]
      LIBID                    LIBNAME               ADDRESS      CITY  ZIP1 ZIP2      PHONE
3668 MI0012 Ann Arbor District Library    343 S Fifth Avenue Ann Arbor 48104 2293 3139942339
4006 MI0357   Washtenaw County Library 4135 Washtenaw Avenue Ann Arbor 48107 8645 3139716749
       POPU CENTLIB BRANLIB BKMOB MASTER LIBRARIAN OTHPAID TOTPEMP  LOCGVT  STGVT FEDGVT
3668 136894       1       3     1  27.25     27.25   70.00   97.25 4871677 124603  16854
4006    931       1       0     0   1.88      1.88    3.63    5.51  328939  25874  17769
     OTHINCM TOTINCM SALARIES BENEFIT  TOTEXP TOTEXPCOL OTHOPEXP TOTOPEXP1 CAPITAL  BKVOL
3668  592797 5605931  3098296  993087 4091383    702029  1314961   6108373  734309 405351
4006   51604  424186   212347   42022  254369     17690   142216    414275     593       
     AUDIO VIDEO SUBSCRIPT DUPLI ATTEND REFERENCE  TOTCIR LOANTO LOANFM KIDCIRCL KIDATTEND
3668 24808  3792      1197 14782 548276    164060 1370113     40   1305   450765     11574
4006 12876   800         3  2103  13669      7475   54328      0    243    17814       576
     CRELATN CLEGBASE STABR
3668      ME       SD    MI
4006      ME       CO    MI
```

Aha, so 3668 is the Ann Arbor District Library. I don't know how they measure attendance; I think I've been there once or twice (in fact, I think I have an overdue book or two). So I pick the row that corresponds to the AADL, I add a point on the plot using a big fat red circle to mark the AADL, and then label it in a big red font for good measure.
But in order to know where to place the text, I need to know the coordinates. For this I use the locator() function.

```
> aadl <- libraries[3668,]
> points(log(aadl$POPU), log(aadl$ATTEND), col="red", cex=2, lwd=5)
> locator(n=1)
$x
[1] 11.81990
$y
[1] 13.32358
> annarborcoords <- locator(n=1)
> text(annarborcoords$x, annarborcoords$y, "Ann Arbor", col="red", pos=3, cex=2)
```

What exactly did I tell the text() function? I told it where to put the text, what the text is, to make it red, to make it big (cex=2), and to position it above the point (pos=3). Et voilà!

[Figure 3: Attendance vs. population served for each library, with Ann Arbor identified and labeled.]

### 2 Samples

As we learned previously, even if the random variable X is not normally distributed, the sample average tends to be. For small samples, however, we should actually use the Student t distribution. Why use the Student t? We should use it in the case where we are estimating the standard error of the mean (SEM) from the sample itself. The t distribution allows more extreme values for small samples than the normal distribution, and is therefore more likely to accept the null hypothesis. A parameter of the t statistic is the number of degrees of freedom: after we compute the mean of a sample of size n, we have n - 1 degrees of freedom left.

Let's see what we have for the normal and the t distribution when the sample size (and, correspondingly, the degrees of freedom) is small.

```
> # normal distribution
> # P(Z <= 1), where Z is the distance from the mean measured in SEMs
> pnorm(1)
[1] 0.8413447
> # P(Z <= -1)
> pnorm(-1)
[1] 0.1586553
> # between 1 SEM below & 1 SEM above
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> # if we have a sample of just 9, the student distribution
> # will differ from the normal
> pt(1,8)
[1] 0.8267032
> pt(-1,8)
[1] 0.1732968
> pt(1,8) - pt(-1,8)
[1] 0.6534065
```

So for the normal distribution we have 68% of the data points lying within 1 standard deviation of the mean, but for the Student t distribution it is a bit less (65%).
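The same degrees-of-freedom effect shows up in the two-sided 95% critical values, which is what matters for confidence intervals: the t cutoff is noticeably wider than the normal 1.96 at small df and shrinks toward it as df grows.

```r
# 97.5th percentile cutoffs (two-sided 95% critical values)
round(qnorm(0.975), 3)        # normal:        1.960
round(qt(0.975, df = 8), 3)   # t with 8 df:   2.306
round(qt(0.975, df = 99), 3)  # t with 99 df:  1.984
```

So with only 9 observations, a 95% confidence interval built from the t distribution is about 18% wider than the normal-based one.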
If we increase our sample size to 100, the normal and t distributions are quite close:

```
> pt(1,99) - pt(-1,99)
[1] 0.6802515
```

We can actually also compare the normal and t distributions graphically, like so:

```r
> x <- seq(-4, 4, by=0.05)
> plot(x, dt(x,8), type="l", xlab="No. of SEMs from the mean",
+      ylab="probability density")
> lines(x, dnorm(x), type="l", lty=2)
> legend(3, 0.35, c("t, 8 df", "normal"), lty=c(1,2))
```

[Figure 4: probability density of the normal and t distributions, by number of SEMs from the mean.]

What about constructing a confidence interval? 95% of the distribution will lie within 1.96 standard deviations. Typically we don't know the standard deviation of the mean (SEM), so we estimate it from the standard deviation of the sample. We can simulate drawing many samples as follows:

```r
# number of samples: usually you do one or two in real life;
# here we are simulating what all can happen
samples <- 1000
# the size of each sample
samplesize <- 35
# the vector we'll be looking at: the number of
# librarians per 1000 population
libperpop <- libraries$LIBRARIAN / libraries$POPU * 1000
# we calculate the mean and variance of the entire underlying population
mean(libperpop)
sd(libperpop)
# initialize the vector of sample means
samplemeans <- c()
# loop through all our samples
for (i in 1:samples) {
  # calculate the sample mean
  samplemeans[i] <- mean(sample(libperpop, samplesize, replace=FALSE))
}
# the mean of the means, xbar
xbarmean <- mean(samplemeans)
# the standard deviation of xbar: normally we would only estimate this
# from our one sample, but here we have the luxury of calculating it
xbarsd <- sd(samplemeans)
```

```
> # calculating the 95% confidence interval
> xbarmean + xbarsd*1.96
[1] 0.4251261
> xbarmean - xbarsd*1.96
[1] 0.1955790
> # compare with the estimate from the original population parameters
> mean(libperpop) + 1.96*sd(libperpop)/sqrt(samplesize)
[1] 0.4156066
> mean(libperpop) - 1.96*sd(libperpop)/sqrt(samplesize)
```
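The simulation loop above can also be written more compactly with replicate(). This sketch substitutes a synthetic skewed population for libperpop so that it runs without the libraries data; the seed, population, and sizes are arbitrary choices for illustration:

```r
set.seed(7)
# a skewed stand-in population: exponential with mean 1 and sd 1
pop <- rexp(100000, rate = 1)
# draw 1000 samples of size 35 and keep each sample's mean
samplemeans <- replicate(1000, mean(sample(pop, 35)))
# 95% interval from the simulated sampling distribution...
mean(samplemeans) + c(-1.96, 1.96) * sd(samplemeans)
# ...vs. the textbook sigma/sqrt(n) formula for this population
1 + c(-1.96, 1.96) * 1 / sqrt(35)
```

Even though the population is far from normal, the two intervals come out close, which is the central limit theorem at work, just as in the libperpop example.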
## Simple linear regression and correlation

Lada Adamic, SI 544, March 5th, 2008

### 1 Dealing with missing values

Now that we are processing data to make inferences and predictions, our R tools may start to complain about the missing values, the NA's that are hiding out in our data. They will want to know what to do with them: should they exclude all rows with NA's, or just keep complaining about them? What do we want them to do? Well, if we go about our business and try to ignore the problem, the following happens. Let's load our libraries data yet again; we know that there are a lot of missing values in there.

```r
libraries <- read.table(
  "http://www-personal.umich.edu/~ladamic/courses/si544w08/data/libraries.dat",
  sep="\t", quote="\"", head=T)
attach(libraries)
```

By the way, if you get errors like

```
> attach(libraries)
The following object(s) are masked from libraries (position 5):
  ADDRESS ATTEND AUDIO BENEFIT BKMOB BKVOL BRANLIB CLEGBASE CRELATN
```

possibly because you have attached libraries previously, just keep typing detach(libraries) until it says it has never heard of libraries, and then you can attach it again.

Suppose I just want to find out the correlation between the total expenditures on the library collection and other operating expenditures (excluding salaries and the collection):

```
> cor(TOTEXPCOL, OTHOPEXP)
Error in cor(TOTEXPCOL, OTHOPEXP) : missing observations in cov/cor
```

Well, that is because OTHOPEXP is chock full of missing values:

```
> OTHOPEXP[1:20]
 [1]      NA 2609487      NA   13389      NA      NA      NA    8590      NA      NA
[11]      NA      NA      NA      NA  275126      NA      19      NA      NA      NA
```

I have a few choices. I could compare only those rows where there are no missing values:

```
> nonmissingothopexp <- !is.na(OTHOPEXP)
> cor(TOTEXPCOL[nonmissingothopexp], OTHOPEXP[nonmissingothopexp])
[1] 0.8726715
```

I've used the is.na() function, which returns TRUE if the value is undefined and FALSE otherwise. In this case I wanted to keep just the defined values, hence the !is.na, meaning NOT missing. This can get a bit tiresome, though, if I want to exclude all rows where any column entry is undefined, e.g. !is.na(TOTEXPCOL) & !is.na(POPU) & !is.na(BKMOB). If I simply want to exclude all rows that contain any missing values, I can use the complete.cases() function to get an index of all complete rows:

```
> cc <- complete.cases(libraries)
> librariescc <- libraries[cc,]
> sum(cc)
[1] 5080
```

In this case we've unfortunately lost about 3,000 data points, but in general this is a quick way to omit observations with missing data. More simply, we can tell R to ignore the missing values as it is computing the correlation coefficient, using the use option:

```r
cor(TOTEXPCOL, OTHOPEXP, use="complete.obs")
```

In general, if you are unsure of how a function deals with missing values, bring up its help page. For example, if you are taking the mean of a column, you can set na.rm=T. Finally, we can set an option that will tell R to ignore missing values for functions like linear regression (lm), using options():

```r
> options(na.action = na.exclude)
```

### 2 A preview of simple linear regression

```
> lm(OTHOPEXP ~ TOTEXPCOL)

Call:
lm(formula = OTHOPEXP ~ TOTEXPCOL)

Coefficients:
(Intercept)    TOTEXPCOL  
  10946.548        1.284  

> summary(lm(OTHOPEXP ~ TOTEXPCOL))

Call:
lm(formula = OTHOPEXP ~ TOTEXPCOL)

Residuals:
     Min       1Q   Median       3Q      Max 
-7401788   -33716   -13330     7266 11564481 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.095e+04  5.747e+03   1.905   0.0569 .  
TOTEXPCOL   1.284e+00  1.008e-02 127.364   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 392900 on 5079 degrees of freedom
Multiple R-Squared: 0.7616,     Adjusted R-squared: 0.7615 
F-statistic: 1.622e+04 on 1 and 5079 DF,  p-value: < 2.2e-16 
```

```r
# plot on a log-log scale
plot(TOTEXPCOL, OTHOPEXP, log="xy")
yfitted <- fitted(lm(OTHOPEXP[nonmissingothopexp] ~ TOTEXPCOL[nonmissingothopexp]))
# add the linear fit
points(TOTEXPCOL[nonmissingothopexp], yfitted, col="green", cex=0.5)
```

So we've gotten a nifty little formula: for each dollar that a library spends on its collection, it spends around $1.28 on other expenses (excluding salary).

[Figure: log-log plot of OTHOPEXP vs. TOTEXPCOL with the fitted values overlaid.]

Above is a log-log plot showing the fit and the approximately linear relationship between the two expense variables.
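A self-contained version of the same regression idea on made-up data (the true slope is set to 1.28 just to echo the fit above; the seed, ranges, and noise level are all invented):

```r
set.seed(9)
# synthetic "collection expenditure" and a linear "other expenditure"
x <- runif(50, 1, 100)
y <- 1.28 * x + rnorm(50, 0, 5)
fit <- lm(y ~ x)
coef(fit)      # intercept near 0, slope near the true 1.28
head(fitted(fit))  # the fitted values, as used to draw the green points above
```

Because the noise is small relative to the spread of x, the estimated slope lands close to the value used to generate the data, which is exactly what the tight confidence interval in the real summary is telling us.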
Let's now return, though, to the correlations. By default, cor() will compute Pearson's correlation coefficient, which, as we learned last time, corresponds to a signed square root of the coefficient of determination. It assumes that the two variables are normally distributed and that the observations within each variable are independent. If |r| = 1, then the relationship between the two variables is perfectly linear.

We can get just the correlation coefficient using the cor() function, but we should also usually check the significance of that coefficient, especially if we have a small sample. Significance simply means that the probability is small that you drew a sample with the given correlation when the correlation of the underlying population is actually 0. For example, we can use the cor.test() function which, in addition to giving us the t statistic and the corresponding p-value, actually doesn't mind missing entries in the data, as long as we have the proper global na.action option set.

```
> cor.test(TOTEXPCOL, OTHOPEXP)

	Pearson's product-moment correlation 

data:  TOTEXPCOL and OTHOPEXP 
t = 127.3639, df = 5079, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.8659536 0.8790744 
sample estimates:
      cor 
0.8726715 
```

So we know that not only is our correlation high, but we also have a narrow confidence interval for its value. Now let's have some fun: can you find the two most correlated columns in the library data?

### 3 Nonparametric correlation tests

Sometimes the assumption of normality in the data just does not hold. In this case one may want to use nonparametric methods.

#### 3.1 Spearman's ρ

Spearman's ρ works basically like Pearson's r, except that you take each observation's rank, rather than its actual value, and plug that into the equation above. Sometimes this will give worse results (that is, a lower correlation) if your data is actually close to normally distributed, because you are losing accuracy with the substitution. However, if your data is not normally distributed and has fairly high variance, it can actually give you a higher correlation.
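A deterministic toy illustration of that point (the data is invented): for a perfectly monotone but wildly nonlinear relationship, the rank correlation is exactly 1, while Pearson's r is well below it.

```r
x <- 1:10
y <- x^10                        # perfectly monotone, extremely heavy right tail
cor(x, y)                        # Pearson: about 0.70, dragged down by nonlinearity
cor(x, y, method = "spearman")   # Spearman: exactly 1, since the ranks agree
```

Pearson's r only reaches 1 for a straight-line relationship, whereas Spearman's ρ reaches 1 for any monotone one, which is why it copes better with heavy-tailed variables.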
A case in point are variables that we've looked at before, such as the total income of libraries per population served:

```
> cor.test(TOTINCM/POPU, ATTEND/POPU)

	Pearson's product-moment correlation 

data:  TOTINCM/POPU and ATTEND/POPU 
t = 61.4503, df = 8926, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.5304939 0.5596512 
sample estimates:
      cor 
0.5452374 

> cor.test(TOTINCM/POPU, ATTEND/POPU, method="spearman")

	Spearman's rank correlation rho 

data:  TOTINCM/POPU and ATTEND/POPU 
S = 43762711641, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.6310284 

Warning message:
Cannot compute exact p-values with ties in:
cor.test.default(TOTINCM/POPU, ATTEND/POPU, method = "spearman")
```

If we compare the total income per population with the attendance per population, the Pearson correlation coefficient gives us a substantially lower correlation than Spearman's rank correlation. Other cases where this may be useful are, for example, comparing the in-degree and PageRank of a webpage: these distributions are heavy-tailed, meaning not normally distributed.

#### 3.2 Kendall's τ

Kendall's τ counts the number of concordant and discordant pairs. Let's say we are looking at attendance and population. If both attendance and population are higher for library A than for library B, then this is a concordant pair. If, however, library A has higher attendance but serves a lower population than B, then this is a discordant pair. If the two variables are uncorrelated (not a likely outcome in this scenario), then you would expect about the same number of concordant and discordant pairs. Kendall's τ does this comparison for all pairs of libraries, which would be pretty computationally intensive and also unnecessary: it is a measure better suited to small data sets with a limited range of discrete outcomes. Do you remember how Kendall's τ was used in the handedness study?
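The pair-counting definition is easy to check by hand on a tiny made-up example:

```r
# two rankings of five items that differ by one swapped neighbor pair
a <- c(1, 2, 3, 4, 5)
b <- c(1, 3, 2, 4, 5)
# of the 5*4/2 = 10 pairs, 9 are concordant and 1 is discordant,
# so tau = (9 - 1) / 10
cor(a, b, method = "kendall")   # -> 0.8
```

Each additional swap knocks another 2/10 off τ, which is what makes it easy to interpret on small, discrete rating scales.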
Where it is often used in SI-type applications is with inter-rater agreement. You have "experts" (usually students looking to earn a buck) rate something: questions, documents, etc. You want to know that these ratings are accurate, and this is the case if the raters can independently agree. Here we will be looking at a data set collected by (now graduated) PhD student Jun Zhang. He had two students rate the expertise of members of a Java programming forum, based on the posts of those members. The scale was 1-5, ranging from most to least expert.

```
> javair <- read.table("javaforuminterrater.txt", head=T)
> cor.test(javair$rater1, javair$rater2, method="kendall")

	Kendall's rank correlation tau 

data:  javair$rater1 and javair$rater2 
z = 9.0708, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0 
sample estimates:
      tau 
0.6428882 

> cor.test(javair$rater1, javair$rater2, method="pearson")

	Pearson's product-moment correlation 

data:  javair$rater1 and javair$rater2 
t = 11.9907, df = 132, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.6295492 0.7943661 
sample estimates:
      cor 
0.7220486 

> cor.test(javair$rater1, javair$rater2, method="spearman")

	Spearman's rank correlation rho 

data:  javair$rater1 and javair$rater2 
S = 105341.6, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.7372995 
```

In this case the Spearman correlation coefficient does better than either Kendall's τ or Pearson's r. The correlation is pretty high, but not incredibly so. We'll revisit inter-rater agreement when we discuss tabular data.

### 4 Simple linear regression, again, this time in greater detail

Remember that simple linear regression means that we're assuming a linear relationship, we're drawing a line through it, and then we can predict values for yet-unobserved data points. We'll use the data file poliblog.txt. For 40 political blogs, it has the number of citations they received from the posts of a large set of blogs in fall of 2004 ("citations"), the number of times they appeared on blogrolls of left-leaning blogs ("leftlinks"), and the number of times they appeared on blogrolls of right-leaning blogs ("rightlinks").
blogrolls of rightleaning blogs 77rightlinks77i You are trying to determine how the number of citations depends on the number of blogro si We7ll sum the blogroll links from left and right leaning blogs to get the total number of blogrolls the blog appeared on Plot the scatterplot of citations vsi 77total links77i Welll add a regression line7 as well as prediction and con dence intervals for the plot We also want to know Whether the slope of the line is signi cantly different from 0 it POLITICAL BLOGS it first read in the political blog data set poliblog readtablequotpoliblogtxtquotheadT it get the total number of blogroll links it both from the left and right totlinks polihloa lpftlinks polihloa viahtlink citations poliblogcitations it regress lmpoliblogs lmcitations totlinks summarylmpoliblogs V V V V V V V V V V V V V Call lmformula citations totlinks Residuals Min 1Q Median 3Q Max 24772 5609 2149 5542 31390 Coefficients Estimate Std Error t value Prgtt Intercept 245617 663214 0370 0715 totlinks 8 4656 5388 404e 05 Signif codes 0 0001 001 005 01 1 Residual standard error 1275 on 18 degrees of freedom Multiple RSquared 06173 Adjusted Rsquared 0596 Fstatistic 2903 on 1 and 18 DF pvalue 4041e05 The slope is positive and signi cant It looks like on average each additional blogroll link from one of the 1500 political blogs corresponds to 25 additional citations from the larger blogospherei it create a new data frame just for plotting out the it prediction intervals it it seems we do this just to have an ordered set it of data points in this case from 30 to 300 predframe dataframetotlinks 30300 pp predictlmpoliblogs intquotpquot newdata predframe it get the upper and lower confidence intervals gt gt gt gt gt gt gt it get the upper and lower prediction intervals gt gt gt gt pc predictlmpoliblogs intquotcquot newdata predframe gt gt it make the scatter plot gt plottot1inkscitationsylimrangecitations pp x1abquotnumber of incoming blogroll linksquot ylabquotnumber of 
incoming post citationsquot cex 15 cexaxis15cex1ab15 gt predtot1inks pred frametot1inks matlinespredtot1inks pc 1tyc122 colquotb1ackquot gt gt gt it add the prediction and confidence intervals using matlinesO gt gt matlinespredtot1inks pp 1tyc133 colquotb1ackquot Suppose I somehow left out a blog that had 250 blogroll links pointing to it Lets give an estimate and the prediction interval for the number of citations that blog would have received gt predframe dataframetot1inks c250 gt predict1mpoliblogs intquotpquot newdatapredframe fit u r 1 6026578 3036027 901713 The prediction interval is fairly broad7 with 95 certainty we can expect anything between 3036 and 9017 10000 i 6000 number of incoming post citations 2000 2000 i i i i 50 100 150 200 250 number of incoming blogroll links Just one last thing7 welll check that the residuals are normal gt qqnormresid1mpoliblogs Yup7 doesnlt look too far from normal Power amp multiple regression Lada Adamic November 30 2006 S1 544 1 Power 11 ttests Remember that power gives the probability that we will reject the null hypothesis H0 given that an alter native hypothesis H1 is true For example in class we considered H0 the mean height of midwestern men is equal to the mean height on men in the entire US 68 inches Suppose H1 is true H1 says that the mean height is actually 69 inches We are given the fact that the standard deviation in the height of a US male is 31 inches If we draw a sample of 100 men we would expect the standard error of the sample mean to be approximately a a 7 7 031 1 xN V100 We can draw two normal distributions centered around 68 and 69 corresponding to the distribution of sample means of H0 and H1 respectivelyi Where the vertical line is drawn represents the boundary of the twosided 95 con dence interval for H0 The probability that a sample mean exceeds 686 given H1 is 089 which is just the power of rejecting the null hypothesis given that H1 is true x seq6670by005 plot xdnormxmean68 sd3 1sqrt 100 type l 
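The 0.89 power figure can be double-checked directly from the normal approximation just described; this is a back-of-the-envelope sketch (the z-based answer comes out slightly higher than the t-based one):

```r
# Back-of-the-envelope power for H0: mu = 68 vs H1: mu = 69,
# sd = 3.1, n = 100, two-sided alpha = 0.05 (normal approximation).
sem <- 3.1 / sqrt(100)                 # 0.31
cut_hi <- 68 + qnorm(0.975) * sem      # upper edge of H0's 95% interval (68.6)
cut_lo <- 68 - qnorm(0.975) * sem
# Power = chance that a sample mean drawn under H1 lands outside (cut_lo, cut_hi).
power_z <- (1 - pnorm(cut_hi, mean = 69, sd = sem)) +
  pnorm(cut_lo, mean = 69, sd = sem)
round(power_z, 3)   # about 0.90
```

The lower tail contributes essentially nothing here; nearly all the power comes from sample means above the upper cutoff.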
> lines(x, dnorm(x, mean=69, sd=3.1/sqrt(100)), col="red")
> legend(locator(1), legend=c("population mean=68", "population mean=69"),
    lty=c(1,1), col=c("black","red"))
# the cutoff 68.6 is 68 + 1.96*3.1/sqrt(100)
> abline(v = 68 + 1.96*3.1/sqrt(100))

[Figure: the two sampling distributions ("population mean=68" and "population mean=69"), with the probability density of the sample mean on the y-axis and the rejection cutoff marked by a vertical line.]

Visuals are well and good, but most of the time we just want the numbers. So let's try it out. We have a sample of 100; H1 is that the underlying midwestern men population has μ = 69; so what is our power? For this we use R's power.t.test function.

> power.t.test(delta=1, sd=3.1, n=100, sig.level=0.05, type="one.sample")

     One-sample t test power calculation

              n = 100
          delta = 1
             sd = 3.1
      sig.level = 0.05
          power = 0.8914722
    alternative = two.sided

Right, just what we had surmised from the plot. On the other hand, we may yet have to decide how large a sample we need to gather, knowing that we would like an 80% chance of rejecting the null if it is actually false.

> power.t.test(delta=1, sd=3.1, power=0.8, sig.level=0.05, type="one.sample")

     One-sample t test power calculation

              n = 77.37044
          delta = 1
             sd = 3.1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

So we'll need to measure the heights of about 77-78 men before we reach an 80% chance of rejecting the null (assuming, of course, that H1 is true). OK, finally: assuming that we have a sample of 100 men and we want a power of 80% to detect a difference between the midwest and the rest of the US, how different do the mean heights in the US and midwestern populations actually have to be?

> power.t.test(power=0.8, sd=3.1, n=100, sig.level=0.05, type="one.sample")

     One-sample t test power calculation

              n = 100
          delta = 0.8770312
             sd = 3.1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

So the midwestern men would need to be at least 0.877 inches taller or shorter on average in order for our t-test to be able to pick out that there's a difference with 0.8 probability, for a sample of 100.

1.2 proportions

It may seem like just last week (oh wait, it was last week) that we were doing tests of proportion. R's power.prop.test will tell us how many trials we
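Before reaching for power.prop.test, it helps to see why a small difference in proportions is so hard to detect; a quick sketch with made-up numbers p1 = 0.20, p2 = 0.22 and 100 users per group:

```r
# The standard error of the difference between two sample proportions
# dwarfs a 0.02 difference when there are only 100 users per group.
p1 <- 0.20; p2 <- 0.22; n <- 100
se_diff <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
round(se_diff, 3)        # about 0.058
(p2 - p1) / se_diff      # about 0.35 standard errors, nowhere near 1.96
```

An observed difference a third of a standard error wide will almost never clear the 1.96-standard-error bar, which is why the power comes out so low below.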
need to do in each sample in order to tell if the proportion of successes is different between the samples. Suppose we are testing two different UIs on our website. For each UI, we check whether the user returns for another visit. If 100 users are exposed to each UI, and the proportion of returning users is 0.20 for the first and 0.22 for the other, what is the power?

> power.prop.test(n=100, p1=0.2, p2=0.22, sig.level=0.05)

     Two-sample comparison of proportions power calculation

              n = 100
             p1 = 0.2
             p2 = 0.22
      sig.level = 0.05
          power = 0.05334612
    alternative = two.sided

 NOTE: n is number in *each* group

Gosh, 0.05 is not that much power. In order to have an 80% chance of rejecting the null, how many users would need to see each UI?

> power.prop.test(p1=0.2, p2=0.22, sig.level=0.05, power=0.8)

     Two-sample comparison of proportions power calculation

              n = 6509.467
             p1 = 0.2
             p2 = 0.22
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

 NOTE: n is number in *each* group

Wow, 6500 users! I guess this is something you could do if you are a really large retailer and getting 10% more repeat visitors is a big deal. So how big of an actual difference in proportion would we have needed in order to be able to detect a significant improvement after trying things out on two sets of 100 users?

> power.prop.test(p1=0.2, n=100, sig.level=0.05, power=0.8)

     Two-sample comparison of proportions power calculation

              n = 100
             p1 = 0.2
             p2 = 0.3785940
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

 NOTE: n is number in *each* group

So the return rate would have to be nearly twice as high in order for us to be able to reject the null at the 0.05 significance level, with a power of 0.8, with only 100 users observing the new UI.

1.3 power and anova

For the anova, we can calculate the power given the within-group variance, the between-group variance, the number of groups, the number of observations per group, and the desired confidence level. That is a lot to know and to put in. But say you run your anova: it is going to give you your within- and between-group variance, and so you can run the power test to see how much power you had in rejecting the null hypothesis in the first place. Let's start with a fake data set. We'll have the independent variable be the number of years with the firm, ranging from 1..5. We'll draw 100 samples with replacement from this vector, getting approximately (but not exactly) 20 in each group.

> x <- sample(seq(1,5), 100, replace=T)
> summary(factor(x))
 1  2  3  4  5
21 18 17 17 27

We then set the salary to be 100 + x + normally distributed noise with a standard deviation of 5.

> y <- x + rnorm(100, 100, 5)
> boxplot(y ~ x)

[Figure: boxplots of y for each value of x.]

From the boxplots, we can see that we've successfully added quite a bit of noise. Can an ANOVA still tell the difference? Let's see.

> anova(lm(y ~ factor(x)))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value  Pr(>F)
factor(x)  4  227.77   56.94  2.4424 0.05193 .
Residuals 95 2214.82   23.31
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shoot, we just missed it. What would our power be?

> power.anova.test(groups=5, n=20, between.var=227.77, within.var=2214.82, sig.level=0.05)

     Balanced one-way analysis of variance power calculation

         groups = 5
              n = 20
    between.var = 227.77
     within.var = 2214.82
      sig.level = 0.05
          power = 0.5940449

 NOTE: n is number in each group

So our power is only about 60%. Could we have derived the between and within variance just from knowing our fake data setup? The between variance should be something like (2² + 1² + 0² + 1² + 2²)/5 × 100 = 200. The within-group variance should be 5² × 100 = 2500. This is pretty close to what we actually observed. So how large a sample would we need in order to have 80% power?

> power.anova.test(groups=5, n=40, between.var=betvar, within.var=withinvar, sig.level=0.05)

     Balanced one-way analysis of variance power calculation

         groups = 5
              n = 40
    between.var = 400
     within.var = 5000
      sig.level = 0.05
          power = 0.8201283

 NOTE: n is number in each group

We need about 40 observations in each group. Let's go back and do that.

> # REDO with larger sample
> x <- sample(seq(1,5), 200, replace=T)
> summary(factor(x))
 1  2  3  4  5
44 31 43 39 43
>
> # create a noisy salary variable that depends on x
> y <- x + rnorm(200, 100, 5)
> boxplot(y ~ x)
>
> # an anova, remembering to treat x as a factor
> anova(lm(y ~
factor(x)))
Analysis of Variance Table

Response: y
           Df Sum Sq Mean Sq F value    Pr(>F)
factor(x)   4  562.8  140.70  5.6425 0.0002561 ***
Residuals 195 4862.1    24.9
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So it worked out with the larger samples: we can tell that different years correspond to different (fake) salaries. But 20% of the time we still would not have been able to reject the null with this sample size. Enough of power; let's move on to multiple regression.

2 Multiple regression

This is just a repeat of the powerpoint slides. First let's set up and get our data and models in place. We are loading the libraries data, but keeping just Maine, in order to not have so much data that it would give us significant results even for very small effects. We will also normalize all the variables (circulation, attendance) by the population served by the library. We are also log-transforming the variables because of the large variance in library size.

> maine <- libraries[libraries$STABR=="ME",]
> attach(maine)
> kid.data <- data.frame(log((KIDATTEND+0.001)/POPU),
    log((TOTCIR-KIDCIRCL+0.001)/POPU),
    log((KIDCIRCL+0.001)/POPU),
    log((BKVOL+0.001)/POPU))
> detach(maine)
> colnames(kid.data) <- c("kidattendance","othercirculation","kidcirculation","bookvolume")
> pairs(kid.data)
> m1 <- lm(kidcirculation ~ kidattendance+othercirculation+bookvolume, data=kid.data)
> m1b <- lm(kidcirculation ~ othercirculation+kidattendance+bookvolume, data=kid.data)
> m2 <- lm(kidcirculation ~ othercirculation, data=kid.data)
> m3 <- lm(kidcirculation ~ othercirculation+kidattendance, data=kid.data)

The first thing we can do is visualize. Already we see that other circulation (total minus kid) is quite highly correlated with kid circulation. The other variables are visibly less correlated. But until we do our anovas and multiple regressions, we won't know whether they still might come in handy when modeling the amount of circulation of kids' materials.

> summary(m1)

Call:
lm(formula = kidcirculation ~ kidattendance + othercirculation +
    bookvolume, data = kid.data)

Residuals:
     Min       1Q   Median       3Q      Max
-4.13904 -0.30004  0.01340  0.33643  2.06613

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.33925    0.13697   2.477   0.0139 *
kidattendance     0.06019    0.02964   2.031   0.0433 *
othercirculation  0.79608    0.04069  19.566   <2e-16 ***
bookvolume        0.09728    0.05911   1.646   0.1010
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6031 on 261 degrees of freedom
Multiple R-Squared: 0.7386, Adjusted R-squared: 0.7356
F-statistic: 245.9 on 3 and 261 DF, p-value: < 2.2e-16

Note that kidattendance and othercirculation are both significant, as shown by the t-statistics, as is the entire model according to the F statistic. The order matters for the anova:

> anova(m1)
Analysis of Variance Table

Response: kidcirculation
                  Df  Sum Sq Mean Sq  F value Pr(>F)
kidattendance      1  82.273  82.273 226.1626 <2e-16 ***
othercirculation   1 185.049 185.049 508.6864 <2e-16 ***
bookvolume         1   0.985   0.985   2.7085 0.1010
Residuals        261  94.946   0.364

> anova(m1b)
Analysis of Variance Table

Response: kidcirculation
                  Df  Sum Sq Mean Sq  F value  Pr(>F)
othercirculation   1 265.740 265.740 730.5004 < 2e-16 ***
kidattendance      1   1.582   1.582   4.3485 0.03801 *
bookvolume         1   0.985   0.985   2.7085 0.10102
Residuals        261  94.946   0.364

Note that the only difference between models m1 and m1b is the order of the variables. It does not make a difference to the multiple regression, since it considers the contribution of each variable keeping the others fixed. But the anova only considers the amount of variance explained by each variable that was not already explained by the variables above it. So we can see that although kidattendance at first seems highly significant (F = 226 and p < 10^-15), once othercirculation is taken into account, it is only marginally significant (F = 4.34, p = 0.038). This brings us to the idea of trying to drop some variables. We can do this manually, using the anova() function to compare a submodel with the full model: it will give us the difference in the sum of squared errors and the F statistic. Another option is to use the function step().

> anova(m1, m2)
Analysis of Variance Table

Model 1: kidcirculation ~ kidattendance + othercirculation + bookvolume
Model 2: kidcirculation ~ othercirculation
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    261 94.946
2    263 97.513 -2    -2.567 3.5285 0.03076 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> anova(m1, m3)
Analysis of Variance Table

Model 1: kidcirculation ~ kidattendance + othercirculation + bookvolume
Model 2: kidcirculation ~ othercirculation + kidattendance
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    261 94.946
2    262 95.931 -1    -0.985 2.7085 0.1010

What we can see from the above is that we shouldn't reduce our model to just othercirculation: there is a significant benefit to keeping more variables than that. The second anova tells us that it is sufficient to also keep just kidattendance; having bookvolume as well does not explain significantly more of the variation. We can also automate this process with the step() function, which will order the variables by their ability to explain the variance in the dependent variable.

> step(m1)
Start:  AIC= -264
 kidcirculation ~ kidattendance + othercirculation + bookvolume

                   Df Sum of Sq     RSS      AIC
<none>                           94.946 -264.002
- bookvolume        1     0.985  95.931 -263.266
- kidattendance     1     1.500  96.446 -261.847
- othercirculation  1   139.265 234.211  -26.729

Call:
lm(formula = kidcirculation ~ kidattendance + othercirculation +
    bookvolume, data = kid.data)

Coefficients:
     (Intercept)     kidattendance  othercirculation        bookvolume
         0.33925           0.06019           0.79608           0.09728

step() confirms the same order of importance for the variables. Finally, let's check that the residuals are normally distributed, so that we know we are justified in doing a linear regression in the first place.

Lada Adamic
SI 544
Oct 12, 2006

One and two sample t-tests

1 Confidence interval cheat sheet

Let's say you are taking a large sample of n observations from a population with mean μ and standard deviation σ. Your sample mean x̄ is going to be your estimate of the population mean; s is the standard deviation of your sample. The standard error of the mean (SEM) is related to s as follows:

    SEM = s/√n    (1)

You're asked to give a confidence interval for the mean of your sample. Say you want a 95% confidence interval.
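As a quick sketch of the large-sample recipe (using made-up data, not the course data), the z-based 95% interval is just the sample mean plus or minus qnorm(0.975) standard errors:

```r
# z-based 95% confidence interval for the mean of a large, made-up sample
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
xbar <- mean(x)
sem <- sd(x) / sqrt(length(x))       # SEM = s / sqrt(n), eq. (1)
ci <- xbar + c(-1, 1) * qnorm(0.975) * sem
ci   # should bracket the true mean of 50 for most random samples
```

The half-width of the interval is exactly qnorm(0.975) ≈ 1.96 standard errors on each side.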
This means that α = 0.05 and α/2 = 0.025. Since the sample is large, you can just use z-scores (which implies that a normal distribution of means is assumed), as opposed to t-scores, which are a bit larger than z-scores for small samples. The confidence interval is given by

    (x̄ − z_{α/2} · SEM, x̄ + z_{α/2} · SEM)    (2)

If, on the other hand, your sample was fairly small (e.g. n = 10 or n = 20), you would want to get the t value instead of the z-score:

    (x̄ − t_{α/2,df} · SEM, x̄ + t_{α/2,df} · SEM)    (3)

Remember that df = n − 1. This is how it would work in practice. Let's simulate sampling 87 normally distributed variables with mean 10 and standard deviation 2:

> sample87 <- rnorm(87, 10, 2)
> sample87
 [1] 10.702794  7.803232 11.567596 12.197608 12.145459 12.110301  9.795155  8.587220
 [9]  8.490917  8.008483 11.452050 10.217464  8.278613  8.654241 11.953143  8.113491
[17]  8.723327 10.605430  9.139390 10.921014 10.426906  7.760826 11.515485  8.649291
[25]  8.446337 10.526330  9.023458  9.901821  7.889055  7.976825  8.926175 12.974170
[33] 10.451839  6.921623  9.222301 11.136571  7.434310  9.912896 14.221466 11.428891
[41]  8.137289  9.167073  5.068394 10.147317 10.527114  9.170393 10.390491  6.154173
[49]  6.974125 11.029793  9.988622  7.975544 10.547826  6.106702  7.677219  7.722798
[57]  6.001083  8.697960 10.176290 10.951141  6.619053  8.250223 12.113802 11.133227
[65] 10.094023 10.622176   .460355   .029852  8.543819 10.391742  9.414384  8.193490
[73]  8.921530 11.007032  5.903301  4.779386 10.207699  9.771383 13.044497  8.482020
[81]  7.699894 12.289890  5.353009  6.722348  6.334610 12.364507  6.601207

We calculate the mean and standard deviation, as well as the standard error of the mean:

> xbar <- mean(sample87)
> s <- sd(sample87)
> xbar
[1] 9.743203
> s
[1] 1.790519
> SEM <- s/sqrt(87)
> SEM
[1] 0.1919638

Finally, we construct the 95% confidence interval:

> xbar - qnorm(0.975)*SEM
[1] 9.36696
> xbar + qnorm(0.975)*SEM
[1] 10.11944

So the mean of the distribution from which we have drawn (10) is actually contained within our 95% confidence interval. Good for us!

2 One sample t-test: test for the mean of a single sample

When we are constructing the confidence interval, we are saying, e.g., that with 95% certainty the mean should be within that interval. This allows us to test whether the sample could have been drawn from a distribution with a certain mean. The t-test will return the confidence interval at the desired level, the number of degrees of freedom (n − 1, as always), the value of t, and the probability p that the population mean could have been μ. Let's try this with the age-guessing data. Remember the woman whom everyone guessed to be younger than 33 from her photo? Well, what is the likelihood that if you quizzed a large number of people, their average guess would be 33, but the groups in SI 544 just happened by chance to have all guessed as they did, that is, below the actual age? We simply run the t-test on the data. But first let's load it in:

> ages <- read.table("http://www-personal.umich.edu/~ladamic/si544f06/data/ageguessing.dat", head=T)
> ages
   Group X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
1      A 35 60 42 38 64 32 25 55 32  23  40  48
2      B 33 62 44 40 54 23 28 64 26  30  47  48
3      C 37 64 36 39 57 31 27 59 23  25  42  42
4      D 37 68 42 40 65 28 34 60 24  27  45  47
5      E  .  .  .  . 60 28 28 56 26  30  51  40
6      F 37 60 35 42 57 25 27 60 28  29  50  45
7      G 35 63 45 43 65 31 35 60 36  28  55  42
8      H 31 62 40 37 57 25 30 60 25  30  45  45
9      I 32 63 36 46 61 37 29 59 26  25  49  49
10 truth 27 54 51 55 69 24 37 62 34  33  47  40

> guesses <- ages[1:9,]
> truth <- ages[10,]
> guesses$X10
[1] 23 30 25 27 30 29 28 30 25
> truth$X10
[1] 33
> t.test(guesses$X10, mu=truth$X10)

        One Sample t-test

data:  guesses$X10
t = -6.4018, df = 8, p-value = 0.0002087
alternative hypothesis: true mean is not equal to 33
95 percent confidence interval:
 25.44323 29.44881
sample estimates:
mean of x
 27.44444

So we had our 9 group guesses, ranging from 23 to 30. Then we had the self-reported age of 33. We passed the guesses to t.test as the sample, and we asked whether μ = 33 could have been the actual mean. The probability that it was, and that our groups' guesses all came out this low just by chance, is p ≈ 2 × 10^-4. So we can more than comfortably reject the hypothesis that the mean is 33. The t-test also gives us a little bit of other useful info: it gives us the 95% confidence interval. (I know, don't you hate that I had you do those by hand for the last assignment?) Now let's try another one:

> t.test(guesses$X8, mu=truth$X8)

        One Sample t-test

data:  guesses$X8
t = -3.2208, df = 8, p-value = 0.01222
alternative hypothesis: true mean is not equal to 62

This fellow was also guessed to be younger than his self-reported age (62). But even for him, we can only reject the hypothesis that the true mean guess is 62 at the 5% and not at the 1% level, since p = 0.012.

Let's switch to a different data set, one where we have the 2005 graduates (Winter, Spring/Summer, or Fall) who did not start taking classes before Fall 2003. The first column, numcourses, has the number of courses they took (including things like SI 690); the second column has their specialization.

> classenrollment <- read.table("http://www-personal.umich.edu/~ladamic/si544f06/data/numcoursesforstudents.txt", head=T)
> summary(classenrollment)
   numcourses      specialization
 Min.   :15.00   ARM     :13
 1st Qu.:16.00   HCI     :42
 Median :17.00   IEMP    : 8
 Mean   :17.37   LIS     :25
 3rd Qu.:18.00   tailored:13
 Max.   :32.00

The summary function has given us the numerical summary of the number of classes taken, and the number of students in each specialization. Suppose we could only interview a few students, rather than having this nice, more or less complete, data set. Let's sample 10 students and test whether the mean of the population could be 17.

> sisample <- sample(classenrollment$numcourses, 10, replace=F)
> sisample
 [1] 15 18 16 16 18 17 15 16 17 16
> t.test(sisample, mu=17)

        One Sample t-test

data:  sisample
t = -1.765, df = 9, p-value = 0.1114
alternative hypothesis: true mean is not equal to 17
95 percent confidence interval:
 15.63101 17.16899
sample estimates:
mean of x
     16.4

Our sample had a mean of 16.4, but we still can't reject the hypothesis that the mean of the whole population could be 17, which is a good thing, because the mean is actually 17.37.

3 t-test for comparing the means of two samples

More interestingly, if we have two samples, we may want to figure out if they are drawn from
distributions with the same mean. For example, looking at the class enrollment data, we may want to figure out if students in one specialization take more classes than students in another.

> tapply(classenrollment$numcourses, classenrollment$specialization, mean)
     ARM      HCI     IEMP      LIS tailored
16.61538 17.59524 19.50000 16.88000 17.00000
> tapply(classenrollment$numcourses, classenrollment$specialization, sd)
      ARM       HCI      IEMP       LIS  tailored
0.7679476 2.8802060 3.9279220 1.1298968 1.7320508

tapply() is a super handy function. It says to apply the function mean() to the number of courses, but group the data by specialization first. Do you remember when we were doing those loops to figure out averages by state for the library data? Well, we could have saved ourselves some trouble by just using tapply():

> tapply(libraries$LIBRARIAN/(libraries$POPU/1000), libraries$STABR, mean)
        AK         AL         AR         AZ         CA         CO         CT         DC
0.67199991 0.34353128 0.08338738 0.29672054 0.15043927 0.33314519 0.23951910 0.29097606
0.18197145 0.16449086 0.07969554 0.13432197 0.41013489 0.45896533 0.38638645 0.36022637
0.52989149 0.24299701 0.11846032 0.33807690 0.24109363 0.21917676 0.23081505 0.34201587
0.24776455 0.18437426 0.35308872 0.07551917 0.31643625 0.52267907 0.43155079 0.16754934
0.47924908 0.45921956 0.34703159 0.31953871 0.43025236 0.30437731 0.15021074 0.19752814
0.12093127 0.42058878 0.08314324 0.18973665 0.22884649 0.13233361 0.25347344 0.34842758
0.31433425 0.20893042 0.41440443

Now wasn't that much shorter? (As an aside, I had to clean the data a bit, because the state abbreviations for IL had numeric codes after them, e.g. IL0001; but this approach will work if all your data is nicely categorized in one column.)

OK, but let's stick to the class enrollment data for the time being. You'll notice that IEMP students take 19.5 courses on average, but ARM students only 16.6. We want to know whether this difference is significant. One clue is the standard deviation of the samples: for IEMP it is almost 4 (3.93), higher than for the other specializations. So IEMP students take more courses on average, but they are also more variable, and there are actually not that many of them. t-test to the rescue!

> IEMPcourses <- classenrollment[classenrollment$specialization=="IEMP",1]
> IEMPcourses
[1] 19 19 17 27 24 17 17 16
> ARMcourses <- classenrollment[classenrollment$specialization=="ARM",1]
> ARMcourses
 [1] 17 16 17 17 16 17 15 17 16 18 17 16 17
> t.test(IEMPcourses, ARMcourses)

        Welch Two Sample t-test

data:  IEMPcourses and ARMcourses
t = 2.0532, df = 7.331, p-value = 0.07735
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.407455  6.176686
sample estimates:
mean of x mean of y
 19.50000  16.61538

So the t-test says that we can't reject the null hypothesis that IEMP and ARM students basically take the same average number of classes. Isn't it nice to save the reputation of an entire specialization on the same page that you try to tarnish it? This is why it's cool to use statistics, and even cooler to properly report the results. Enough playing around, though. Let's see how a t-test was used to do some HCI research on sensemaking by then-PhD student Yan Qu and our very own Prof. George Furnas. Their 2005 CHI conference paper, "Sources of Structure in Sensemaking", can be downloaded from http://www.si.umich.edu/cosen/.

To summarize the experiment: they asked 30 grad students to gather information about a topic by browsing the web and to write an outline for a talk they were pretending to need to give at a local library. The first topic was "tea", and the second was "everyday drinks for old people", referred to here as "elderly drinks". The subjects were using a tool (CoSen) for bookmarking useful web resources. This allowed the researchers to keep track of how many URLs the subjects were bookmarking, and also how many folders they were organizing them into. Let's load the data:

> sense <- read.table("http://www-personal.umich.edu/~ladamic/si544f06/data/sensemaking.txt", head=T)
> sense
   SubjectID Group NumFolders NumBookmarks
1         T1     T          5           17
2         T2     T          5           34
3         T3     T          0            6
4         T4     T          2           14
5         T5     T          4           17
6         T6     T          7           23
7         T7     T          6           30
8         T8     T          5           20
9         T9     T         11           38
10       T10     T         15           31
11       T11     T          3           13
12       T12     T          9           27
13       T13     T          8           18
14       T14     T          7           18
15       T15     T          6           45
16        E1     E          2            4
17        E2     E          4           12
18        E3     E          0            1
19        E4     E          4           14
20        E5     E          6            5
21        E6     E          0            3
22        E7     E          6           10
23        E8     E          2            7
24        E9     E          8           14
25       E10     E          7           29
26       E11     E          4           13
27       E12     E          0            9
28       E13     E          1            6
29       E14     E          4            6
30       E15     E          4            8

Summarize by task:

> attach(sense)
> tapply(NumFolders, Group, mean)
       E        T
3.466667 6.200000
> tapply(NumBookmarks, Group, mean)
   E    T
 9.4 23.4

We can immediately see that both the number of bookmarks and the number of folders is greater for the task of gathering information on a broad and popular topic. But we should do a proper t-test, just to make sure.

> t.test(NumBookmarks[Group=="T"], NumBookmarks[Group=="E"])

        Welch Two Sample t-test

data:  NumBookmarks[Group == "T"] and NumBookmarks[Group == "E"]
t = 4.3301, df = 23.817, p-value = 0.0002315
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.324365 20.675635
sample estimates:
mean of x mean of y
     23.4       9.4

Yup, definitely different. Same for folders:

> t.test(NumFolders[Group=="T"], NumFolders[Group=="E"])

        Welch Two Sample t-test

data:  NumFolders[Group == "T"] and NumFolders[Group == "E"]
t = 2.3582, df = 25.167, p-value = 0.02644
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3469416 5.1197251
sample estimates:
mean of x mean of y
 6.200000  3.466667

So what is the t-test doing? It is testing whether the difference in the means of the two populations is significantly different from 0. It is computing

    t = (x̄2 − x̄1) / SEDM    (4)

where the standard error of the difference of means is

    SEDM = √(SEM1² + SEM2²)    (5)

By default, the R t.test function does not assume that the variance of the two populations being sampled from is the same (this is the Welch procedure). Otherwise, you can have R assume that the two populations have the same variance: t.test(x, y, var.equal=T). This will then estimate a SEM by pooling all the data points into one group and
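Equations (4) and (5) are easy to verify by hand on toy data (the two small vectors a and b are made up); the result matches R's Welch t.test statistic exactly:

```r
# Hand-compute the Welch t statistic from equations (4) and (5)
a <- c(5, 7, 6, 9, 8)
b <- c(4, 5, 5, 6, 4)
sedm <- sqrt(sd(a)^2 / length(a) + sd(b)^2 / length(b))  # eq. (5)
t_by_hand <- (mean(a) - mean(b)) / sedm                  # eq. (4)
t_by_hand                        # 2.75 for these numbers
unname(t.test(a, b)$statistic)   # same value
```

For these vectors, sd(a)² = 2.5 and sd(b)² = 0.7, so SEDM = √0.64 = 0.8 and t = 2.2/0.8 = 2.75.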
taking the standard deviation. The t statistic in this case has n1 + n2 − 2 degrees of freedom. Both options should give you similar results, but when in doubt, go with R's default and don't assume that the variance in the two populations is the same.

4 The paired t-test

Paired tests are used when there are two measurements on the same experimental unit: a "before and after", or "the same subject under condition 1 and condition 2". In this case, we will consider just Winter 2005 graduates (before, we were taking anyone graduating in any semester in 2005) and look at the number of classes they took in the fall 2003/winter 2004 semesters and the fall 2004/winter 2005 semesters.

> attach(byyear)
> t.test(firstyear, secondyear, paired=T)

        Paired t-test

data:  firstyear and secondyear
t = -3.0048, df = 81, p-value = 0.003536
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0135163 -0.2059959
sample estimates:
mean of the differences
             -0.6097561

We have told R that the observations are paired: we have the number of courses taken by the student in the first year in one column, and the number of courses taken by the very same student in the second year in the second column. From the t-test we can tell that the students took about 0.6 classes more in the second year compared to the first. Normally, if you were to ignore that the data are paired, you would get a wider confidence interval and a larger p-value, and this is undesirable. In this case we actually get a lower p-value from the unpaired test, but in general it is always more correct to do the paired test if the data are in fact paired. I'm going to speculate that there is a slight (although not significant) anti-correlation between the number of classes taken in the first year and those taken in the second: if you took a bunch your first year, you need to take fewer your second, and vice versa. Pairing then may reduce the significance in a test that is looking for a one-directional difference.

> t.test(firstyear, secondyear)   # WRONG

        Welch Two Sample t-test

data:  firstyear and secondyear
t = -3.1801, df = 161.894, p-value = 0.001765
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9883955 -0.2311167
sample estimates:
mean of x mean of y
 8.292683  8.902439

If, however, we use the dataset from the Dalgaard book, we see the benefits of pairing data more clearly. The data is paired by woman, and has her pre- and post-menstrual energy intake.

> data(intake)
> attach(intake)
> intake
    pre post
1  5260 3910
2  5470 4220
3  5640 3885
4  6180 5160
5  6390 5645
6  6515 4680
7  6805 5265
8  7515 5975
9  7515 6790
10 8230 6900
11 8770 7335

> t.test(pre, post, paired=T)

ANOVA (analysis of variance)
Lada Adamic
November 2, 2006
SI 544

1 one-way ANOVA

First we will work with Prof. Karen Markey's data on people's understanding of library subject headings (for complete info, see http://www.si.umich.edu/~ylime/NewFiles/morekmd.html#AnchorSubject54980). I'll be passing out some of the surveys for you to look at. The surveys were conducted at 3 Michigan public libraries. Each person taking the survey was asked to fill out demographic info such as age, sex, occupation, etc., as well as to try to write a description for 8 library subject headings, e.g. "Cattle -- United States -- Marketing". All subjects at each library were asked to interpret the same 8 headings, for a total of 24 headings across all three libraries. There were three ways the subject headings were presented: just the heading itself; the heading along with the headings directly preceding and following it in alphabetical order; and the heading in a particular book's bibliographic record. Sometimes the headings were kept in their original order, and sometimes they were rearranged to a recommended standardized order.

# load the demographic information for each person taking the survey
demog <- read.table("oclcdemographics.txt", head=T, sep="\t")
# load the results of the survey
surveyresults <- read.table("data/oclcpub.txt", head=T, sep="\t", strip.white=T)
# calculate the percent of subject headings interpreted correctly
score.by.survnum <- tapply(surveyresults$correct ==
quotcquotsurveyresultssurvynummean it combine them into one data frame a 5 m it get the size of the dataset gt dimdemogandscore 1 308 7 it attach it attachdemogandscore look at what we have gt summarydemogandscore survynum sex age libuse eductn profssn scorebysurvnum Min 100 f 2205 Min 2 900 a2 19 a 252 student 2128 Min 200000 1st Qu2 7775 m 2101 1st Qu21400 b2117 b 280 retired 2 21 1st Qu 01250 Median 215450 NA s 2 Median 21800 c2119 c 30 homemaker 15 Median 203750 Mean 215450 Mean 22831 d2 41 d 255 teacher 2 10 Mean 203449 3rd Qu223125 3rd Qu24200 e2 12 e 279 clerk 2 6 3rd Qu 05000 Max 230800 Max 7400 NA s12 Other 77 Max 208750 NA S 500 NA S 51 NA S 10000 In essence we have 308 people who took the survey7 205 of whom were female7 101 of whom were male Their ages ranged from 9 to 747 they had varying levels of library use azdaily7 bzweekly7 cmonthly7 d 2 to 3 timesyr7 e i 2 timesyr7 with most of them coming to the library on a weekly or monthly basis There were a lot of students 1287 followed by retirees and homemakers7 and then a variety of other professions The score7 in terms of the proportion of subject headings which were correctly identi ed ranges from 0 a person who got none of the headings correct to 01875 a person who got 7 out of 8 of them correcti No one had gotten all the headings righti m proportion correct l l 1 elementary JHS HS college BS The rst thing well do is create boxplots for the scores grouped by the education level of the person7 ranging from having completed elementary school to having a college degree From the boxplot7 we can see that high school kids those who have completed JHS7 so the JHS box and whisker plot do about as well as those who have had some college education the boxplot labeled 77college i Now weld like to test if any of these means is signi cantly different from any of the others Which means that we will be doing an Ftest for the null hypothesis that all the means are equali gt anovalmscorebysurvnum eductn Analysis of 
```
Analysis of Variance Table

Response: scorebysurvnum
           Df  Sum Sq Mean Sq F value  Pr(>F)
eductn      4  0.5786  0.1447  2.5262 0.04149 *
Residuals 238 13.6290  0.0573
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We see that we can reject the hypothesis that all the means are equal at the 0.05 level. Great! But which of all the pairs of means are actually different? As we've discussed in class, we can't go and t-test all pairs against each other, because our probability of committing a type-I error (rejecting the null hypothesis when it is true) goes up with each additional test we make. So we need to use a correction that will multiply the p-value by a factor corresponding to the number of tests made. The Bonferroni adjustment multiplies all p-values by the number of tests, while the Holm method (applied in R by default) corrects the smallest p by the full number of tests, the second smallest by n - 1, etc. We will use the pairwise.t.test method for this, which will conveniently do all the pairwise t-tests for us and put them in a nice little table.

### 1.1 Pairwise t-tests

```
> pairwise.t.test(scorebysurvnum, eductn, p.adj="bonferroni")

        Pairwise comparisons using t tests with pooled SD

data:  scorebysurvnum and eductn

  a     b     c     d
b 1.000 -     -     -
c 1.000 1.000 -     -
d 1.000 1.000 1.000 -
e 0.031 0.209 0.068 1.000

P value adjustment method: bonferroni

> pairwise.t.test(scorebysurvnum, eductn)

        Pairwise comparisons using t tests with pooled SD

data:  scorebysurvnum and eductn

P value adjustment method: holm
```

What we are getting in the table are the p-values for the t-tests between, e.g., a and c, multiplied by the number of tests (in the case of Bonferroni). If this value exceeds 1, then R returns 1.0000 for that t-test. According to the results, at the 0.05 level we can only be sure that library patrons with a college degree are doing better than kids who have only completed elementary school. For both the Bonferroni and the Holm corrections (the latter penalizing the different p-values non-uniformly), we have at the 0.1 level that people with a college degree do better than people with just a high school degree and no years of college.

### 1.2 Paired vs.
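To make the two corrections concrete, here is a small pure-Python sketch (my own illustration, not from the notes; the function names are mine) of how the Bonferroni and Holm adjustments transform a set of p-values, mirroring what R's `p.adjust` does:

```python
def p_adjust_bonferroni(pvals):
    """Multiply every p-value by the number of tests, capping at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

def p_adjust_holm(pvals):
    """Holm step-down: the smallest p is multiplied by n, the next by n-1, ...
    A running maximum keeps the adjusted values monotone, capped at 1."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, pvals[i] * (n - rank))
        adjusted[i] = min(1.0, running_max)
    return adjusted

print([round(p, 4) for p in p_adjust_bonferroni([0.01, 0.04, 0.03])])  # [0.03, 0.12, 0.09]
print([round(p, 4) for p in p_adjust_holm([0.01, 0.04, 0.03])])        # [0.03, 0.06, 0.06]
```

Note how Holm leaves smaller adjusted values than Bonferroni for all but the smallest p-value, which is why it is the less conservative default.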
pairwise

One of you asked the very relevant question of what the difference is between a paired and a pairwise t-test. A pairwise t-test means that you have more than 2 groups and you are comparing them pairwise, meaning that you have n(n - 1)/2 comparisons if you have n groups. Paired t-tests, if you remember, refer to just two samples, but each observation in one sample is paired with an observation in the other. For example, the two samples could be before and after a treatment was administered to the patient.

Let's roll back to a t-test and see how we can apply it to this data. One of the questions posed in the study was whether the order of the subdivisions within the subject headings mattered: "If the cataloger chooses to apply subdivisions, the subdivisions should always appear in the following order: topical, geographic, chronological, form" (Conway 1992, 6). For example, one of the original subject headings was "Education--United States--Finance", and the recommended reordering would be "Education--Finance--United States". The ordering changes the likely interpreted meaning. So let's do the following: for each subject heading, we will take the proportion of people who got it right when it was in the original order, and we will pair it with the proportion of people who got it right when it was in the recommended order, to see if there's a difference.

```
> summary(surveyresults)
```

[Summary output omitted; the variables include survynum, subjtype, sex, libcode, shnum (the subject heading number, 1-24), ordpattrn, form, order (o = original, r = recommended, 1232 responses each), question, correct (c = correct, i = incorrect), code1, code2, assess, and assest.]
Looking at the survey results, we basically want to count the proportion of correct ("c") answers separately for each subject heading (denoted by shnum), for the two different kinds of orderings.

```
# let's keep only the attempted SH descriptions
justattempted = surveyresults[!is.na(surveyresults$correct), ]

# and look at a summary of what is left
summary(justattempted)

# now we will average over all people who described the same
# subject heading in the same order
bothorderings = aggregate(justattempted$correct == "c",
    list(shnum=justattempted$shnum, order=justattempted$order), FUN=mean)

# and then separate out the original orderings from the recommended ones
original = bothorderings[bothorderings$order == "o", ]
recommended = bothorderings[bothorderings$order == "r", ]

> t.test(original$x, recommended$x, paired=T)

        Paired t-test

data:  original$x and recommended$x
t = 0.7796, df = 23, p-value = 0.4436
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03249104  0.07178732
sample estimates:
mean of the differences
             0.01964814

> t.test(original$x, recommended$x)

        Welch Two Sample t-test

data:  original$x and recommended$x
t = 0.3432, df = 45.982, p-value = 0.733
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0955977  0.1348940
sample estimates:
mean of x mean of y
0.3677402 0.3480920
```

Notice that neither the paired t-test nor the regular one (which is incorrectly applied here) gives us significant differences in people's average ability to interpret the subject heading correctly. But the paired t-test does give a lower p-value and a narrower confidence interval, which shows that it is superior. Why is this? Well, there is a lot of variation in the interpretation difficulty of each subject heading. The paired test keeps the variation due to question difficulty separate from the variation due to the two different ways of presenting the subject headings. I've used the aggregate command to calculate the proportion of people who answered each question correctly in each order.
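For intuition, the paired t-statistic can be computed by hand: take the within-pair differences d, then t = mean(d) / (sd(d)/sqrt(n)) with n - 1 degrees of freedom. A small pure-Python sketch (my own illustration with made-up numbers, not the survey data):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t-statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1

# hypothetical matched "before"/"after" scores
t_stat, df = paired_t([10, 12, 14, 16], [8, 11, 13, 12])
print(round(t_stat, 4), df)  # 2.8284 3
```

Because only the differences enter the statistic, any variation shared by both members of a pair (here, question difficulty) cancels out, which is exactly why the paired test above gave a narrower confidence interval.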
I did this by creating a TRUE/FALSE vector with justattempted$correct == "c" and grouping by order and subject heading, and then taking the mean. mean appears to treat boolean TRUE/FALSE vectors as 0/1 vectors, which means that we can average them. In any case, since standardizing headings to be in a prescribed recommended order did not seem to impact people's ability to interpret subject headings, one of the recommendations resulting from this study was to standardize.

## 2 Two-way ANOVA

We may wish to consider two variables (factors) simultaneously, and for this we would do a two-way ANOVA. The example I am using here is made up. Let's pretend that there's an exam question that asks students to describe in a paragraph what ANOVA is. The students fall into two categories: those who essentially know the correct answer, and those who don't. Students also decide to write answers of different lengths, represented in the data by the variable verbosity. In grading, the instructor grades on both how good the explanation is (a good explanation may need to be lengthy) and whether it is correct. In fact, writing too much while not really having a clue may not necessarily improve the score.

```
> anova(lm(scores ~ verbosity*correctness))
Analysis of Variance Table

Response: scores
                      Df Sum Sq Mean Sq  F value    Pr(>F)
verbosity              1    244     244   1.8319  0.187553
correctness            1  41877   41877 314.1236 4.894e-16 ***
verbosity:correctness  1   1753    1753  13.1481  0.001230 **
Residuals             26   3466     133
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We've told R that the formula we are considering is verbosity*correctness, because we would like to consider not just the two factors individually, but also how the two factors interact. The interaction term is significant at the 0.01 level. But if we were to consider verbosity and correctness separately:

```
> anova(lm(scores ~ verbosity + correctness))
Analysis of Variance Table
```

Leftovers
Lada Adamic
SI 544

## 1 Trellis graphics
The trellis (lattice) package allows one to do some visual exploratory data analysis fairly easily, especially when one has several explanatory variables. Install and load the package:

```
library(lattice)
```

See a quick demo of the many different kinds of plots you can produce:

```
example(xyplot)
```

The commands for these are listed one by one in your R script. Let's just list one here; it plots the total yield in bushels per acre for 10 varieties at 6 sites in each of two years:

```
dotplot(variety ~ yield | site*year, data=barley)
```

## 2 Logistic regression

Sometimes the outcome you want to predict is not continuous but rather binary: whether someone will purchase a product, whether someone will win a game, etc. What we are modeling is the log of the odds of a given event occurring:

log(odds) = log(p/(1 - p)) = log(p) - log(1 - p) = α + β₁X₁ + β₂X₂

Here we will be trying to figure out whether an answer will be selected as the best answer, based on the following variables:

- the length of the answer
- number of other answers supplied by the user
- number of best answers supplied by the user
- number of other people who have answered the same question

```
> ya = data.frame(ya)
> pd.glm = glm(isbestans ~ anslength + numreplies + numuserans + numuserbest,
               family=binomial, data=ya)
> summary(pd.glm)

Call:
glm(formula = isbestans ~ anslength + numreplies + numuserans +
    numuserbest, family = binomial, data = ya)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1835  -0.9936   0.1726   1.0530   2.2002

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4981808  0.1939247  -2.569   0.0102 *
anslength    0.0010072  0.0001825   5.519 3.40e-08 ***
numreplies  -0.2700685  0.0399614  -6.758 1.40e-11 ***
numuserans  -0.0231300  0.0049542  -4.669 3.03e-06 ***
numuserbest  0.0582897  0.0085216   6.840 7.91e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1340.5  on 966  degrees of freedom
Residual deviance: 1116.5  on 962  degrees of freedom
AIC: 1126.5

Number of Fisher Scoring iterations: 5

> cv.binary(pd.glm)
Fold:  6 2 9 1 7 3 4 10 8 5
Internal estimate of accuracy = 0.733
Cross-validation estimate of accuracy = 0.736
```
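The log-odds formula above inverts to p = 1/(1 + e^(-x)), which is how a fitted linear predictor is turned back into a predicted probability. A quick pure-Python sketch of the relationship (my own illustration of the math, not tied to the fitted model's coefficients):

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of the logit: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))        # 0.0  (a 50/50 event has even odds)
print(inv_logit(0.0))    # 0.5
print(round(inv_logit(logit(0.8)), 6))  # 0.8  (round trip)
```

Classifying an answer as "best" whenever the predicted probability exceeds 0.5 is equivalent to checking whether the linear predictor α + β₁X₁ + ... is positive.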
So we can get roughly 73% accuracy, where our odds of just guessing randomly whether an answer will be selected as a best answer or not were only 50%.

## 3 Time series analysis

Let's return to the water levels in Lake Huron. First, we would like to know if water levels have in fact been declining since 1970, as the recent newspaper article claimed. We already have a tool for this, which is just simple linear regression.

[Scatterplot: Lake Huron water levels (meters, roughly 176-177 m) vs. year, 1920-2000]

```
> huron = read.table(file=file.choose(), head=T)
> attach(huron)
> summary(lm(Levelmeters ~ Year, data=huron[Year > 1970,]))

Call:
lm(formula = Levelmeters ~ Year, data = huron[Year > 1970, ])

Residuals:
     Min       1Q   Median       3Q      Max
-0.42784 -0.20256 -0.02534  0.13243  0.65938

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 222.332798   8.355645  26.609  < 2e-16 ***
Year         -0.023056   0.004203  -5.486 3.68e-06 ***

Residual standard error: 0.273 on 35 degrees of freedom
Multiple R-Squared: 0.4623,     Adjusted R-squared: 0.4469
F-statistic: 30.09 on 1 and 35 DF,  p-value: 3.682e-06
```

Indeed, it appears that water levels have been dropping at a rate of 2.3 cm (approximately 1 inch) a year since 1970. But how much of this is attributable to a carefully selected starting year? What if we consider the entire period, 1918-2006?

```
> summary(lm(Levelmeters ~ Year, data=huron))

Call:
lm(formula = Levelmeters ~ Year, data = huron)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73285 -0.25993 -0.01501  0.28574  0.77932

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.703e+02  3.028e+00  56.228   <2e-16 ***
Year        3.083e-03  1.543e-03   1.998   0.0489 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.374 on 87 degrees of freedom
Multiple R-Squared: 0.04387,    Adjusted R-squared: 0.03288
F-statistic: 3.992 on 1 and 87 DF,  p-value: 0.04885
```

Clearly, if we take the entire period into account, the downward trend disappears (the slope is in fact slightly positive, and only marginally significant). So there is no support for the hypothesis that water levels have been dropping throughout. But what we can see is that the water level in any given year appears to be correlated with the water level in the previous year.
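lm is fitting ordinary least squares, where the slope is the covariance of x and y divided by the variance of x, and the intercept is ybar - slope * xbar. A small pure-Python sketch of that computation (my own illustration with a toy series, not the Lake Huron data):

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# a perfectly linear toy series: y = 1 + 2x
a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Running the same fit on two different year ranges, as above, changes xbar and the covariance and hence the estimated slope, which is why the choice of starting year matters so much.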
This is what is called autocorrelation, and we can run an autoregression to get at whether this is in fact the case, or whether year-to-year levels are uncorrelated. First let's look at some lag plots. When the lag is 1, we are plotting the level in one year against the level in the year directly preceding it. When the lag is 2, we are looking at water levels two years apart, etc.

```
lag.plot(Levelmeters, lags=4, do.lines=FALSE)
```

[Lag plots of the water levels at lags 1 through 4]

We can see that year-to-year the water levels do appear to be correlated, and even levels 2 years apart. Beyond 2 years of lag, the correlation weakens further. We can use the ACF (autocorrelation function) to compute the autocorrelation coefficient for all possible lags. And finally, we can perform the autoregression, which models the observation at time t against the observations at times t - 1 and t - 2:

X_t = μ + a₁(X_{t-1} - μ) + a₂(X_{t-2} - μ) + ε_t    (1)

where ε_t is a random variable (the noise term) with zero mean.

```
> ar(Levelmeters, method="mle")

Call:
ar(x = Levelmeters, method = "mle")

Coefficients:
      1        2
 0.9867  -0.2264

Order selected 2  sigma^2 estimated as 0.04812
```

What R is telling us is that the best estimate is

X_t = μ + 0.9867 (X_{t-1} - μ) - 0.2264 (X_{t-2} - μ) + ε_t    (2)

Sometimes people who model microeconomic time series want to know whether the series has a unit root, that is, whether our a₁ = 0.987 is indistinguishable from 1. If that were the case, the first-order model we would have would be

X_t = μ + X_{t-1} + ε_t    (3)

Such a process is called a random walk with drift, where μ is the drift.
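To make the lag-correlation idea concrete, here is a pure-Python sketch of the lag-k autocorrelation coefficient and a one-step forecast of the AR(2) form in equation (1) (my own illustration; the toy series and the coefficients passed in are made up, not the Huron estimates):

```python
from statistics import mean

def acf(x, k):
    """Lag-k autocorrelation: covariance of the series with itself
    shifted by k, normalized by the overall variance."""
    xbar = mean(x)
    num = sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(len(x) - k))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def ar2_forecast(x, mu, a1, a2):
    """One-step AR(2) forecast: mu + a1*(x[-1]-mu) + a2*(x[-2]-mu)."""
    return mu + a1 * (x[-1] - mu) + a2 * (x[-2] - mu)

series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(acf(series, 1))  # 0.4
print(ar2_forecast(series, mu=3.0, a1=0.5, a2=-0.25))  # 3.75
```

The forecast works on deviations from the mean, matching equation (1): if both coefficients were zero the best prediction would simply be μ, and a₁ near 1 (the unit-root case) makes next year's prediction track this year's level almost exactly.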
