Methods in Experimental Ecology PCB 6466
University of Central Florida
These class notes (10 pages) were uploaded by Mrs. Carol Pagac on Thursday, October 22, 2015. They belong to PCB 6466 at the University of Central Florida, taught by Quintana-Ascencio in Fall.
Date Created: 10/22/15
PCB 6466: Methods in Experimental Ecology, Fall 2008 Semester
Dr. Pedro F. Quintana-Ascencio and James Angelo
10/15/2008

R Demonstration: Multiple Regression

Objective: The purpose of this week's session is to demonstrate how to perform multiple linear regressions (i.e., linear regression models with two or more predictor variables) in R. In the first part, we will examine the issue of collinearity in multiple regression models. In the second part, we will investigate various issues related to multiple regression model selection.

Part I: Multiple Regression and the Issue of Collinearity

NOTE: This part of the exercise assumes that you have downloaded the dataset from Paruelo & Lauenroth (1996) and saved it in your PCB6466 folder as a tab-delimited text file named paruelo.txt. You also need to download the multiRegression.R script and save it in your PCB6466 folder.

After starting R, change the directory to your PCB6466 folder and open the multiRegression.R script. The first two lines of the script read and attach the Paruelo & Lauenroth (1996) dataset:

    ## load the Paruelo & Lauenroth (1996) dataset from file
    paruelo_data <- read.table("paruelo.txt", header=T)
    attach(paruelo_data)

Next, we use the par function with the mfrow argument to create a plotting area made up of two rows. After plotting the histograms of the C3 and LOG10_C3 response variables, we use the par function to restore the plotting area to a single frame:

    ## plot histograms of the C3 and LOG10_C3 variables
    par(mfrow=c(2,1))
    hist(C3)
    hist(LOG10_C3)
    par(mfrow=c(1,1))

From the histograms, it appears that the C3 variable is lognormally distributed and that the log10 transformation somewhat improves the normality of the LOG10_C3 variable.

Now we will check for possible collinearity issues by creating a new data frame object that contains only our potential predictor variables:

    ## to assess the potential for collinearity, we create a subset of
    ## the data containing only the potential predictor variables
    subset_data <- data.frame(LAT, LONG, MAP, MAT, JJAMAP, DJFMAP)

We can now check for collinearity both graphically and numerically. First, we will use the pairs function to create a matrix of scatterplots, and then we will use the cor function to display the correlation matrix for all of the predictor variables:

    ## use the pairs function to plot all the variables against each other
    pairs(subset_data, panel=panel.smooth)
    ## generate a correlation matrix using the cor function
    cor(subset_data)

From the scatterplots and the correlation matrix, we see that LAT and MAT have a strong negative correlation (-0.839), and so do LONG and MAP (-0.734). As discussed in the lecture, based on our scientific knowledge we expect the temperature to decrease as latitude increases, and we also expect precipitation to decrease as you move from east to west across the United States. Thus, these variables will exhibit collinearity, and we will want to choose either the geographic pair (LAT and LONG) or the climatic pair (MAP and MAT), but not both, in our regression model.

To see how collinearity inflates the variance of a multiple regression, we first create a full model with the lm function based on all of our predictor variables:

    ## create a full regression model and summarize the results
    model <- lm(LOG10_C3 ~ MAP+MAT+LONG+LAT+JJAMAP+DJFMAP)
    summary(model)

Notice that while the overall model is highly significant (F = 11.62, p < 0.001), only the LAT and the intercept coefficients are significant at the 0.05 level. Furthermore, we can use the vif function to compute the variance inflation factor (VIF) values for all of the coefficients in our estimated model.

Because the vif function is not built into the base version of R (it is included in an add-on library known as a "package"), you will first have to go to the R website. From there, click the Search link on the menu bar on the left-hand side. Next, enter "car package" in the search textbox ("car" stands for "Companion to Applied Regression", and the car package contains many regression-related utility functions). The first result from your search should be labeled "CRAN Package car", and you should click on this link. Next, download the appropriate file (e.g., car_1.2-8.zip for Windows) and extract it to the library subfolder of your R installation folder. Once you have done this, you can load the package into your R session using the library function, and then you can invoke the vif function on your model:

    ## use the vif function to calculate the variance inflation factors
    library(car)
    vif(model)

Notice that the VIF values for all of the predictor variables are well over one (some are over 5), indicating that collinearity has inflated the variance of the model's coefficient estimates.

Part II: Selecting the "Best" Multiple Regression Model

Based on the analysis in the previous section, assume that we have decided to avoid potential problems with collinearity by using only LAT and LONG as our predictor variables. Now we will create a series of linear models based on these predictors and determine which of the models has the best fit:

    model1 <- lm(LOG10_C3 ~ 1)
    model2 <- lm(LOG10_C3 ~ LONG)
    model3 <- lm(LOG10_C3 ~ LAT)
    model4 <- lm(LOG10_C3 ~ LONG+LAT)
    model5 <- lm(LOG10_C3 ~ LONG+LAT+LONG:LAT)
    ## NOTE: the LONG*LAT term is equivalent to LONG+LAT+LONG:LAT
    model6 <- lm(LOG10_C3 ~ LONG*LAT)

The first model has no predictor variable; model2 and model3 are simple linear regressions based on a single predictor (LONG and LAT, respectively); and model4 contains both of the predictor variables. The next model, model5, adds an interaction term between the LONG and LAT predictors, which is indicated in R by the LONG:LAT notation. You can also use the LONG*LAT notation to indicate both predictor variables and their interaction term (i.e., LONG+LAT+LONG:LAT), as was done for model6 above. Thus, model5 and model6 are identical models.

Now that we have defined all of our regression models, we will use the AIC function to determine which model fits the data best:

    AIC(model1)
    AIC(model2)
    AIC(model3)
    AIC(model4)
    AIC(model5)
    AIC(model6)

As discussed on page 285 of the Gotelli & Ellison text, AIC (Akaike information criterion) takes into account the number of predictor variables (i.e., parameters) when calculating the model fit. Also, somewhat counterintuitively, a lower AIC value indicates a better-fitting model. Thus, the results for our models indicate that model 5, which has the lowest AIC (0.437), has the best fit. Note that model 6, which is identical to model 5, has the exact same AIC value.

Now that we have identified model 5 as our best-fitting model, we can use the vif function to again test for variance inflation in the predictor variables:

    ## since model5 appears "best" so far, check its VIF values
    vif(model5)

Yikes! Now the VIF values for all of our predictors are much greater than one, with the smallest VIF being close to 67. As mentioned in the lecture, we can center the predictor variables by subtracting their means in order to try to reduce the variance inflation:

    ## to reduce the VIFs, center the predictor variables
    cLAT <- LAT-mean(LAT)
    cLONG <- LONG-mean(LONG)

Next, we will create 4 new linear models based on the centered predictor values and then compute the AIC values for all of these models:

    ## run the models again, this time using the centered values
    modelc1 <- lm(LOG10_C3 ~ cLONG)
    modelc2 <- lm(LOG10_C3 ~ cLAT)
    modelc3 <- lm(LOG10_C3 ~ cLONG+cLAT)
    modelc4 <- lm(LOG10_C3 ~ cLONG+cLAT+cLONG:cLAT)
    ## re-compute the AIC values for the centered models
    AIC(modelc1)
    AIC(modelc2)
    AIC(modelc3)
    AIC(modelc4)

As before, our model with both predictor variables and the interaction term (modelc4 here) has the lowest AIC. Now when we compute the VIFs for this model, however, we get values that are all close to one:

    ## since the full model still appears best, recalculate the VIFs
    vif(modelc4)

Thus, we conclude that this model (LOG10_C3 ~ cLONG+cLAT+cLONG:cLAT) has the best fit for our data. Finally, we can use the summary function on this model:

    summary(modelc4)

Notice that the overall regression is highly significant (F = 24.37, p < 0.001) and that all of the model parameters except cLONG are significant at the 0.05 level. Finally, with its Adjusted R2 value of 0.4934, we can conclude that this model explains almost 50% of the variation in the response variable.

PCB 6466: Methods in Experimental Ecology, Fall 2008 Semester
Dr. Pedro F. Quintana-Ascencio and James Angelo
09/12/2008

R Demonstration: Summary Statistics and the Law of Large Numbers

Objective: The purpose of this session is to use some of the R functionality you have recently learned to demonstrate the Law of Large Numbers. Then, you will be introduced to your first R script, which contains some more advanced programming logic.

Part I: Demonstrating the Law of Large Numbers in R

As explained in Chapter 3 of the Gotelli & Ellison (2004) text, the Law of Large Numbers states that as the size of a sample drawn from a random variable increases, the mean of the sample gets closer and closer to the true population mean, μ. This fundamental theorem of probability is fairly straightforward to demonstrate in R, though as you will soon discover, the process can be quite tedious.

First, start the R software. Next, let us assume that we somehow know that the lengths of the tibial spines of the entire population of linyphiid spiders are normally distributed with a mean of 0.253 mm and a standard deviation of 0.04 mm. We assign these values to the variables mu and sigma as follows (remember, we use Greek letters for population parameters):

    > mu <- 0.253
    > sigma <- 0.04

Next, we will create a vector named mean_vector to hold our sample means:

    > mean_vector <- rep(0, 30)

The rep function, as used above, creates a vector of length 30 that initially contains all zero values. Every time we calculate a sample mean, we will store it in this vector.

Now you will proceed to create 30 random samples from our normal distribution, and these samples will increase in size from n=1 to n=30.
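To make the preallocation step concrete, here is a minimal sketch of how a vector created with rep behaves when values are stored in it by index. The name demo_vector and the stored values are hypothetical, chosen only for illustration; they are not part of the course materials.

```r
## create a vector of five zeros (demo_vector is a hypothetical name)
demo_vector <- rep(0, 5)

## store a value in the first and third positions; R indexes from 1
demo_vector[1] <- 0.25
demo_vector[3] <- 0.26

## positions that have not been assigned still hold their initial zeros
print(demo_vector)
print(length(demo_vector))
```

The same pattern, with a vector of length 30, is what the steps that follow rely on when they store each sample mean at its own index.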
For the first sample, type:

    > sample <- rnorm(1, mean=mu, sd=sigma)
    > mean_vector[1] <- mean(sample)

The first line uses the rnorm function to draw a sample of size 1 from our normal distribution with the mu and sigma parameters we specified earlier. The second line uses the mean function to calculate the arithmetic mean of the sample and then stores it in the first index of the mean_vector variable.

To fill in the rest of the vector, we will increase the sample size by one each time. To save some typing, however, we will combine the process of drawing the random sample, calculating the sample mean, and storing the value in mean_vector into a single step. Thus, for a sample size of 2, type the following:

    > mean_vector[2] <- mean(rnorm(2, mean=mu, sd=sigma))

The rnorm function will draw a sample of size 2 from our hypothetical distribution, the mean function will then calculate the sample mean, and then the mean will get stored in the second index of the mean_vector variable.

Now for the tedious part: you will need to repeat this process for all of the remaining sample sizes. Use the Up Arrow key and then change the index position and the sample size, as illustrated below:

    > mean_vector[3] <- mean(rnorm(3, mean=mu, sd=sigma))
    ...
    > mean_vector[30] <- mean(rnorm(30, mean=mu, sd=sigma))

Finally, we will create a scatterplot of the results using the plot function:

    > plot(seq(1, 30), mean_vector)
    > abline(h=mu, col="red")

The seq(1, 30) argument to the plot function generates a sequence of integers from 1 to 30 on the X axis for all of the sample sizes we computed, and the abline function on the next line draws a red horizontal line showing the value of our population mean (i.e., μ = 0.253). While your specific results will be different, your scatterplot should look similar to the following:

[Figure: scatterplot of mean_vector (y-axis) against sample sizes 1 to 30 (x-axis), with a red horizontal line at the population mean]

As predicted by the Law of Large Numbers, when our sample size increases from 1 to 30, the estimated sample mean gets closer and closer to the population mean. To really see the Law of Large Numbers in action, however, we need to dramatically increase the sample size. And since we don't want to have to compute all of those samples by hand, we will see how to write a simple R script to automatically do this for us.

Part II: Your First R Script

For this part of the lesson, you will need to download a couple of files from the course website to the PCB6466 folder on your Desktop: 1) the Excel spreadsheet containing the tibial spine length data for 2000 spiders (gotelli_ch3.xls), and 2) the file containing the R script (Law_of_Large_Nums.R).

Next, open the Excel spreadsheet and navigate to the data tab. Using the procedure you learned in the lesson on 09/08/2008, save the data as a tab-delimited text file named Tibial_data.txt. (NOTE: you should open the text file and make sure that the column heading for the data is named spine_length; otherwise, the R script won't run properly.)

If you closed R, start it again. Choose File > Change dir from the menu bar and then select your PCB6466 folder. Next, choose File > Open script from the menu bar and then choose Law_of_Large_Nums.R from the list of files. This will open the script in a new window inside of R. Finally, choose Edit > Run all from the menu bar. When the script finishes running, you should see the following output:

[Figure: plot of sample mean (y-axis) versus sample size (x-axis), with a red horizontal line at the population mean]

As promised earlier, this script provides an even more dramatic illustration of the Law of Large Numbers. You can clearly see the sample means (the dots) converge on the population mean (denoted by the red line) as the sample size increases from 1 to 1000.

To see how the Law_of_Large_Nums.R script works, let's take a look at the code:

    ## read in the data from the text file, then "attach" it
    tibial_data <- read.table("Tibial_data.txt", header=T)
    attach(tibial_data)

    ## compute the population mean of the spine lengths
    mu <- mean(spine_length)

    ## the following variable holds the maximum sample size
    max_sample_size <- 1000

    ## this vector will hold the mean calculated for each sample size
    mean_vec <- rep(0, max_sample_size)

    ## Illustrate the law of large numbers by calculating means for all of
    ## the different samples from size n=1 to n=max_sample_size
    for (n in 1:max_sample_size) {
        ## draw a random sample of length n from the spine length data
        values <- sample(spine_length, n)
        ## calculate the sample mean and store it in the mean_vec
        mean_vec[n] <- mean(values)
    }

    ## finally, plot the sample mean vs. the sample size
    plot(seq(1, max_sample_size), mean_vec, xlab="Sample size", ylab="Sample mean")
    ## draw a red horizontal line with a Y-intercept equal to mu
    abline(h=mu, col="red")

As shown above, a script is just a collection of R commands stored in a text file. To execute the commands, you can: 1) choose Edit > Run all from the menu bar to run the entire script in a line-by-line fashion; or 2) select one or more lines of text in the script and then choose Edit > Run line or selection from the menu bar; or 3) copy one or more lines of code from the script and paste directly into the R Console.

An important thing to note is that any line that begins with the # character will not be executed. These lines are known as comments in programming lingo, and they are used to explain to other readers what the various parts of our script do. While only a single # character is required to denote a comment line, I use double ## characters, as in the script above, because I think they stand out more.

Now let's take a line-by-line look at the script to see what each part does:

    ## read in the data from the text file, then "attach" it
    tibial_data <- read.table("Tibial_data.txt", header=T)
    attach(tibial_data)

The first line is a comment, and the following 2 lines read the tibial spine length data from the text file into a data frame named tibial_data and then execute the attach function to make the variable spine_length visible in the R session. Remember, you can see what variables are contained in the data frame by typing names(tibial_data) at the R Console after you have executed attach.

    ## compute the population mean of the spine lengths
    mu <- mean(spine_length)
    ## the following variable holds the maximum sample size
    max_sample_size <- 1000
    ## this vector will hold the mean calculated for each sample size
    mean_vec <- rep(0, max_sample_size)

The lines above create 3 new variables. The first is mu, which holds the mean computed from our population of 2000 tibial spine lengths. The second is max_sample_size, which indicates that our maximum sample size will be 1000. And the third is mean_vec, which is a vector that will hold the sample means we will calculate. We use the rep function to initialize this vector with a sequence of zero values.

Now we get to the portion of the script that may seem unfamiliar to you:

    ## Illustrate the law of large numbers by calculating means for all of
    ## the different samples from size n=1 to n=max_sample_size
    for (n in 1:max_sample_size) {
        ## draw a random sample of length n from the spine length data
        values <- sample(spine_length, n)
        ## calculate the sample mean and store it in the mean_vec
        mean_vec[n] <- mean(values)
    }

After the 2 commented lines, you see a line that begins with the R command for. In programming lingo, the block of code from this line until the closing } below is known as a for loop. The way it works is this: when R gets to the for command, it evaluates the expression in parentheses, (n in 1:max_sample_size). This expression assigns the value 1 to the variable n, which is known as the counter variable for the for loop. Then, R executes the block of code between the braces { }. The first line of code uses the sample function to draw a random sample of size n from our spine length data, and the second line computes the sample mean and stores it in the first position of our mean_vec vector (because n is 1 on this first pass).

The important concept to grasp here is that, unlike the previous commands we've looked at, R will not simply move on to the next lines after executing these two lines. Rather, it will go back to the (n in 1:max_sample_size) expression and increase the value of n by one. It will then execute the lines of code between the braces again, only this time with the new value of n. R will keep doing this until n exceeds the value of the variable max_sample_size, at which time execution will continue below the for loop.

In a nutshell, then, the for loop shown above counts from a sample size of n=1 to n=max_sample_size. Each time, it draws a sample of size n from the tibial length data, computes the sample mean, and stores it in the mean_vec vector. (NOTE: if you are confused about how the for command works, consult the Crawley (2005) reference or the R help system.)

Once we've used our for loop to fill in all the sample mean values in mean_vec, we simply use the plot function to create a graph of sample size versus sample mean:

    ## finally, plot the sample mean vs. the sample size
    plot(seq(1, max_sample_size), mean_vec, xlab="Sample size", ylab="Sample mean")
    ## draw a red horizontal line with a Y-intercept equal to mu
    abline(h=mu, col="red")

Congratulations, you've just become an R programmer!

Part III: Practice Exercises

1. Download the smaller sample of tibial spine data described in Ch. 3 of Gotelli & Ellison (2004) from the following link: [URL illegible]. Read this data into R and then calculate the mean, sum of squares, variance, and standard deviation for the sample. DO NOT use the shortcut functions (e.g., mean, sd, etc.) that R provides, but instead use the formulas in Ch. 3 of the text.

2. Another fundamental theorem of probability is the Central Limit Theorem. According to page 65 of Gotelli & Ellison (2004), "the Central Limit Theorem (Chapter 2) shows that arithmetic means of large samples of random variables conform to a normal or Gaussian distribution, even if the underlying random variable does not" (emphasis added). Download and run the R script Central_Limit_Theorem.R from the course website. Try changing some of the variables in the script, such as the number of samples or the sample size. Also, try to use a different probability distribution than the Poisson to demonstrate the Central Limit Theorem. Did you get the same result?
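As a possible starting point for the second exercise, the following is a minimal sketch of one way to demonstrate the Central Limit Theorem with a distribution other than the Poisson. It is not the script from the course website. The choice of the exponential distribution, the seed value, and the names num_samples, sample_size, and sample_means are all assumptions made for illustration.

```r
## fix the random seed so the run is repeatable (the seed value is arbitrary)
set.seed(6466)

## number of samples to draw and the size of each sample (hypothetical values)
num_samples <- 1000
sample_size <- 50

## this vector will hold the mean calculated for each sample
sample_means <- rep(0, num_samples)

## draw each sample from a strongly skewed exponential distribution
## (rate = 1, so the true mean is 1) and store its mean
for (i in 1:num_samples) {
    sample_means[i] <- mean(rexp(sample_size, rate=1))
}

## compare the skewed raw values (top) with the sample means (bottom)
par(mfrow=c(2,1))
hist(rexp(num_samples, rate=1), main="Raw exponential values", xlab="Value")
hist(sample_means, main="Means of exponential samples", xlab="Sample mean")
par(mfrow=c(1,1))
```

With sample_size set to 50, the lower histogram should look roughly bell-shaped and centered near 1, even though the raw values in the upper histogram are strongly skewed. Reducing sample_size to a small value such as 2 typically leaves the histogram of means visibly skewed, which is one way to explore how large a "large sample" needs to be.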