These 35 pages of class notes were uploaded by Dorris Borer on Monday, September 28, 2015. The notes belong to ESE502 at the University of Pennsylvania, taught by T. Smith in Fall.
Date Created: 09/28/15
NOTEBOOK FOR SPATIAL DATA ANALYSIS
Part I. Spatial Point Pattern Analysis
ESE 502, Tony E. Smith

3. Testing Spatial Randomness

There are at least three approaches to testing the CSR Hypothesis: the quadrat method, the nearest-neighbor method, and the method of K-functions. We shall consider each of these in turn.

3.1 Quadrat Method

This simple method is essentially a direct test of the CSR Hypothesis as stated in expression (2.1.6) [1]. Given a realized point pattern from a point process in a rectangular region, $R$, one begins by partitioning $R$ into congruent rectangular subcells (quadrats) $C_1,\ldots,C_m$, as in Figure 1 below (where $m = 16$).

[1] This refers to expression (6) in section 2.1 of Part I. All other references will follow this convention.

[Fig. 1. Quadrat Partition of R]

Then, regardless of whether the given pattern represents trees in a forest or beetles in a field, the CSR Hypothesis asserts that the cell-count distribution for each $C_i$ must be the same, as given by (2.1.6). But rather than use this Binomial distribution, it is typically assumed that $R$ is large enough to use the Poisson approximation in (2.1.9). In the present case, if there are $n$ points in $R$, and if we let $a = a(C_i)$ and estimate expected point density, $\lambda$, by

$$\hat{\lambda} = \frac{n}{a(R)} \tag{1}$$

then this common Poisson cell-count distribution has the form

$$\Pr(N_i = k \mid \hat{\lambda}) = \frac{(\hat{\lambda} a)^k}{k!}\, e^{-\hat{\lambda} a}, \qquad k = 0, 1, 2, \ldots \tag{2}$$

Moreover, since the CSR Hypothesis also implies that the cell counts, $N_i = N(C_i)$, $i = 1,\ldots,m$, are independent, it follows that $(N_i : i = 1,\ldots,m)$ must be an independent random sample from this Poisson distribution. Hence the simplest test of this hypothesis is to use the Pearson $\chi^2$ goodness-of-fit test. Here the expected number of points in each cell is given by the mean of the Poisson above, which (recalling that $a = a(R)/m$ by construction) is

$$E(N_i \mid \hat{\lambda}) = a\,\hat{\lambda} = \frac{a(R)}{m} \cdot \frac{n}{a(R)} = \frac{n}{m} \tag{3}$$

Hence, if the observed value of $N_i$ is denoted by $n_i$, then the chi-square statistic

$$\chi^2 = \sum_{i=1}^{m} \frac{(n_i - n/m)^2}{n/m} \tag{4}$$

is known to be asymptotically chi-square distributed with $m - 1$ degrees of freedom under the CSR Hypothesis. Thus
one can test this hypothesis directly in these terms. Note that $n/m$ is simply the sample mean, i.e., $n/m = (1/m)\sum_{i=1}^{m} n_i = \bar{n}$, and hence that this statistic can also be written as

$$\chi^2 = \sum_{i=1}^{m} \frac{(n_i - \bar{n})^2}{\bar{n}} = (m-1)\,\frac{s^2}{\bar{n}} \tag{5}$$

where $s^2 = \frac{1}{m-1}\sum_{i=1}^{m}(n_i - \bar{n})^2$ is the sample variance. In this form, observe that since the variance of the Poisson distribution is exactly equal to its mean, i.e., $\mathrm{var}(N)/E(N) = 1$, and since $s^2/\bar{n}$ is the sample estimate of this ratio, it is often convenient to use this index of dispersion, $s^2/\bar{n}$, as the test statistic. In particular, if $s^2/\bar{n}$ is significantly less than one, then it can be inferred that there is too little variation among quadrat counts, suggesting possible uniformity rather than randomness. Similarly, if $s^2/\bar{n}$ is significantly greater than one, then there is too much variation among counts, suggesting possible clustering rather than randomness.

But this testing procedure is very restrictive in that it requires a rectangular region [2]. More importantly, it depends critically on the size of the partition chosen. As with all applications of Pearson's goodness-of-fit test, if there is no natural choice of partition size, then the results can be very sensitive to the partition chosen.

[2] More general random quadrat methods are discussed in Cressie (1995, section 8.2.3).

3.2 Nearest-Neighbor Methods

In view of these shortcomings, the quadrat method above has for the most part been replaced by other methods. The simplest of these is based on the observation that if one simply looks at distances between points and their nearest neighbors in $R$, then this provides a natural test statistic that requires no artificial partitioning scheme. More precisely, for any given points, $s = (s_1, s_2)$ and $v = (v_1, v_2)$ in $R$, we denote the Euclidean distance between $s$ and $v$ by [3]

$$d(s,v) = \sqrt{(s_1 - v_1)^2 + (s_2 - v_2)^2} \tag{6}$$

and denote each point pattern of size $n$ in $R$ by $S_n = (s_i : i = 1,\ldots,n)$. Then for any point, $s_i \in S_n$ [4], the nearest-neighbor distance (nn-distance) from $s_i$ to all other points in $S_n$ is given by [5]
$$d_i \equiv d_i(S_n) = \min\{\, d(s_i, s_j) : s_j \in S_n,\ j \neq i \,\} \tag{7}$$

In a manner similar to the index of dispersion above, the average magnitudes of these nn-distances (relative to those expected under CSR) provide a direct measure of uniformity or clustering in point patterns. This is seen clearly by comparing the two figures below, each showing a pattern of 14 points.

[Fig. 2. Uniform Pattern]    [Fig. 3. Clustered Pattern]

In Figure 2 these points are seen to be very uniformly spaced, so that nn-distances tend to be larger than what one would expect under CSR. In Figure 3, on the other hand, the points are quite clustered, so that nn-distances tend to be smaller than under CSR.

[3] Throughout these notes we shall always take $d(s,v)$ to be Euclidean distance. However, there are many other possibilities. At large scales it may be more appropriate to use great-circle distance on the globe. Alternatively, one may take $d(s,v)$ to be travel distance on some underlying transportation network. In any case, most of the basic concepts developed here (such as nearest-neighbor distances) are equally meaningful for these definitions of distance.

[4] The vector notation, $S_n = (s_i : i = 1,\ldots,n)$, means that each point $s_i$ is treated as a distinct component of $S_n$. Hence (with a slight abuse of notation) we take $s_i \in S_n$ to mean that $s_i$ is a component of pattern $S_n$.

[5] This is called the event-event distance in [BG, p.98]. One may also consider the nn-distance from any random point, $x \in R$, to the given pattern, as defined by $d_x(S_n) = \min\{ d(x, s_i) : i = 1,\ldots,n \}$. However, we shall not make use of these point-event distances here. For a more detailed discussion see Cressie (1995, section 8.2.6).

3.2.1 Nearest-Neighbor Distribution under CSR

To make these ideas precise, we must determine the probability distribution of nn-distance under CSR, and compare the observed nn-distance with this distribution. To begin, suppose that the implicit reference region $R$ is large, so that for any given point density, $\lambda$, we may
assume that cell counts are Poisson distributed under CSR. Now suppose that $s$ is any randomly selected point in a pattern realization of this CSR process, and let the random variable, $D$, denote the nn-distance from $s$ to the rest of the pattern. To determine the distribution of $D$, we next consider a circular region, $C_d$, of radius $d$ around $s$, as shown in Figure 4 below.

[Fig. 4. Cell of radius d]

Then by definition, the probability that $D$ is greater than $d$ is precisely the probability that there are no other points in $C_d$. Hence if we now let $C_d(s) = C_d - \{s\}$, then this probability is given by

$$\Pr(D > d) = \Pr[\,N(C_d(s)) = 0\,] \tag{8}$$

But since the right-hand side is simply a cell-count probability, it follows from expression (2.3.9) that

$$\Pr(D > d) = e^{-\lambda\, a(C_d(s))} = e^{-\lambda \pi d^2} \tag{9}$$

where the last equality uses the fact that $a(C_d(s)) = a(C_d) = \pi d^2$. Hence the cumulative distribution function (cdf), $F_D(d)$, for $D$ is given by

$$F_D(d) = \Pr(D \le d) = 1 - \Pr(D > d) = 1 - e^{-\lambda \pi d^2} \tag{10}$$

In Section 2 of the Appendix to Part I it is shown that this is an instance of the Rayleigh distribution, and in Section 3 of the Appendix that for a random sample of $m$ nearest-neighbor distances, $(D_1,\ldots,D_m)$, from this distribution, the scaled sum (known as Skellam's statistic),

$$S_m = 2\lambda\pi \sum_{i=1}^{m} D_i^2 \tag{11}$$

is chi-square distributed with $2m$ degrees of freedom (as on p.99 in [BG]). Hence this statistic provides a test of the CSR Hypothesis based on nearest neighbors.

3.2.2 Clark-Evans Test

While Skellam's statistic can be used to construct tests, it follows from the Central Limit Theorem that sums of independent, identically distributed random variables are approximately normally distributed. Hence the most common test of the CSR Hypothesis based on nearest neighbors involves a normal approximation to the sample mean of $D$, as defined by

$$\bar{D}_m = \frac{1}{m}\sum_{i=1}^{m} D_i \tag{12}$$

To construct this normal approximation, it is shown in Section 2 of the Appendix to Part I that the mean and variance of the distribution in (10) are given respectively by

$$E(D) = \frac{1}{2\sqrt{\lambda}} \tag{13}$$

$$\mathrm{var}(D) = \frac{4 - \pi}{4\pi\lambda} \tag{14}$$

To gain
some feeling for these quantities, observe that under the CSR Hypothesis, as the point density, $\lambda$, increases, both the expected value and variance of nn-distances decrease. This makes intuitive sense when one considers denser scatterings of random points in $R$. Next we observe from the properties of independently and identically distributed (iid) random samples that for the sample mean, $\bar{D}_m$, in (12) we must have

$$E(\bar{D}_m) = \frac{1}{m}\sum_{i=1}^{m} E(D_i) = \frac{1}{m}\,[\,m\,E(D_1)\,] = E(D_1) = \frac{1}{2\sqrt{\lambda}} \tag{15}$$

and similarly must have

$$\mathrm{var}(\bar{D}_m) = \frac{1}{m^2}\sum_{i=1}^{m} \mathrm{var}(D_i) = \frac{1}{m^2}\,[\,m\,\mathrm{var}(D_1)\,] = \frac{4 - \pi}{m\,4\pi\lambda} \tag{16}$$

But from the Central Limit Theorem it then follows that for large sample sizes [6], $\bar{D}_m$ must be approximately normally distributed under the CSR Hypothesis, with mean and variance given by (15) and (16), i.e., that

$$\bar{D}_m \sim N\!\left(\frac{1}{2\sqrt{\lambda}},\ \frac{4 - \pi}{m\,4\pi\lambda}\right) \tag{17}$$

[6] Here "large" is usually taken to mean $m \ge 30$, as long as the distribution in (10) is not too skewed. Later we shall investigate this by using simulations.

Hence this distribution provides a new test of the CSR Hypothesis, known as the Clark-Evans Test (see for example [BG, p.100]). If the standard error of $\bar{D}_m$ is denoted by

$$\sigma(\bar{D}_m) = \sqrt{\mathrm{var}(\bar{D}_m)} = \sqrt{\frac{4 - \pi}{m\,4\pi\lambda}} \tag{18}$$

then to construct this test, one begins by standardizing the sample mean, $\bar{D}_m$, in order to use the standard normal tables. Hence, if we now denote the standardized sample mean under the CSR Hypothesis by

$$Z_m = \frac{\bar{D}_m - E(\bar{D}_m)}{\sigma(\bar{D}_m)} = \frac{\bar{D}_m - 1/(2\sqrt{\lambda})}{\sqrt{(4 - \pi)/(m\,4\pi\lambda)}} \tag{19}$$

then it follows at once from (17) that under CSR [7],

$$Z_m \sim N(0,1) \tag{20}$$

To construct a test of the CSR Hypothesis based on this distribution, suppose that one starts with a sample pattern, $S_n = (s_i : i = 1,\ldots,n)$, and constructs the nn-distance, $d_i$, for each point, $s_i \in S_n$. Then it would seem most natural to use all these distances, $(d_1,\ldots,d_n)$, to construct the sample-mean statistic in (12) above. However, this violates the assumed independence of nn-distances on which this distribution theory is based. To see this, it is enough to observe that if $s_i$ and $s_j$ are mutual nearest neighbors, so that $d_i \equiv d_j$, then these are obviously not independent. More
generally, if $s_i$ and $s_j$ share a common nearest neighbor, $s_k$, then $d_i$ and $d_j$ must be dependent. However, if one were to select a subset of nn-distance values that contained no common points, such as those shown in Figure 5, then such dependencies could be avoided. The question is how to choose such a subset. We shall return to this problem later, but for the moment we simply assume that some "reasonably independent" subset, $(d_1,\ldots,d_m)$, of these distance values has been selected, with $m < n$. (This is why the notation $m$ rather than $n$ has been used in the formulation above.)

[Fig. 5. Independent Subset]

[7] For any random variable, $X$, with $E(X) = \mu$ and $\mathrm{var}(X) = \sigma^2$, if $Z = (X - \mu)/\sigma = X/\sigma - \mu/\sigma$, then $E(Z) = E(X)/\sigma - \mu/\sigma = 0$ and $\mathrm{var}(Z) = \mathrm{var}(X)/\sigma^2 = 1$.

Given such a subsample, $(d_1,\ldots,d_m)$, one can construct the sample-mean value,

$$\bar{d}_m = \frac{1}{m}\sum_{i=1}^{m} d_i \tag{21}$$

and use this to construct tests of CSR.

Two-Tailed Test of CSR

The standard test of CSR in most software is a two-tailed test, in which both the possibility of significantly small values of $\bar{d}_m$ (clustering) and significantly large values of $\bar{d}_m$ (uniformity) are considered. Hence it is appropriate to review the details of such a testing procedure. First recall the notion of upper-tail points, $z_\alpha$, for the standard normal distribution, as defined by $\Pr(Z \ge z_\alpha) = \alpha$ for $Z \sim N(0,1)$. In these terms, it follows that for the standardized mean in (19),

$$\Pr(|Z_m| \ge z_{\alpha/2}) = \Pr[\,(Z_m \le -z_{\alpha/2}) \ \text{or}\ (z_{\alpha/2} \le Z_m)\,] = \alpha \tag{22}$$

under the CSR Hypothesis, as in Figure 6 below. Hence if one estimates expected point density as in (1), and constructs corresponding estimates of the mean (15) and standard deviation (18) under CSR by

$$\hat{\mu} = \frac{1}{2\sqrt{\hat{\lambda}}}\,, \qquad \hat{\sigma}_m = \sqrt{\frac{4 - \pi}{m\,4\pi\hat{\lambda}}} \tag{23}$$

then the CSR Hypothesis can be tested by constructing the following standardized sample mean:

$$z_m = \frac{\bar{d}_m - \hat{\mu}}{\hat{\sigma}_m} \tag{24}$$

If the CSR Hypothesis is true, then by (19) and (20), $z_m$ should be a sample from $N(0,1)$ [8]. Hence a test of CSR at the $\alpha$-level of significance [9] is given by the rule:

    Two-Tailed CSR Test: Reject the CSR Hypothesis if and only if $|z_m| > z_{\alpha/2}$.

[8] Formally, this assumes that $\hat{\lambda}$
is a sufficiently accurate estimate of $\lambda$ to allow any probabilistic variation in $\hat{\lambda}$ to be ignored. Since $\hat{\lambda}$ is based on all $n$ pattern points, this assumption is usually reasonable.

[9] By definition, the level of significance of a test is the probability, $\alpha$, that the null hypothesis (in this case the CSR Hypothesis) is rejected when it is actually true. This is discussed further below.

The significance level, $\alpha$, is also called the size of the test. Example results of this testing procedure for a test of size $\alpha$ are illustrated in Figure 6 below. Here the two samples, $z_m$, in the tails of the distribution are seen to yield strong evidence against the CSR Hypothesis, while the sample in between does not.

[Fig. 6. Two-Tailed Test of CSR (rejection regions in both tails; CSR is not rejected in between)]

One-Tailed Tests of Clustering and Uniformity

As already noted, values of $\bar{d}_m$ (and hence $z_m$) that are too low to be plausible under CSR are indicative of patterns more clustered than random. Similarly, values too large are indicative of patterns more uniform than random. In many cases, one of these alternatives is more relevant than the other. In the Redwood Seedling example of Figure 1.1.1 [10] it is clear that trees appear to be clustered. Hence the only question is whether or not this apparent clustering could simply have happened by chance, i.e., whether or not this pattern is significantly more clustered than random. Similarly, one can ask whether the pattern of Cell Centers in Figure 1.1.2 is significantly more uniform than random. Such questions lead naturally to one-tailed versions of the test above. First, a test of clustering versus the CSR Hypothesis at the $\alpha$-level of significance is given by the rule:

    Clustering versus CSR Test: Conclude significant clustering if and only if $z_m < -z_\alpha$.

Example results of this testing procedure for a test of size $\alpha$ are illustrated in Figure 7 below. Here the standardized sample mean $z_m$ to the left is sufficiently
low to conclude the presence of clustering at the $\alpha$-level of significance, and the sample toward the middle is not.

[10] This refers to Figure 1 in Section 1.1 above. All references to figures in previous sections will be denoted similarly.

[Fig. 7. One-Tailed Test of Clustering (significant clustering in the lower tail; no significant clustering elsewhere)]

In a similar manner, one can construct a test of uniformity versus the CSR Hypothesis at the $\alpha$-level of significance using the rule:

    Uniformity versus CSR Test: Conclude significant uniformity if and only if $z_m > z_\alpha$.

Example results for a test of size $\alpha$ are illustrated in Figure 8 below, where the sample $z_m$ to the right is sufficiently high to conclude the presence of uniformity at the $\alpha$-level of significance, and the sample toward the middle is not.

[Fig. 8. One-Tailed Test of Uniformity]

While such tests are standard in the literature, it is important to emphasize that there is no "best" choice of $\alpha$. The typical values given by most statistical texts are listed in Tables 1 and 2 below.

[Table 1. Two-Tailed Significance]    [Table 2. One-Tailed Significance]

So in the case of a two-tailed test, for example, the non-randomness of a given pattern is considered strongly (weakly) significant if the CSR Hypothesis can be rejected at the $\alpha = .01$ ($\alpha = .10$) level of significance [11]. The same is true of one-tailed tests, where the cut-off value, $z_{\alpha/2}$, is now replaced by $z_\alpha$. In all cases, the value $\alpha = .05$ is regarded as a standard (default) value indicating significance.

However, since these distinctions are admittedly arbitrary, another approach is often adopted in evaluating test results. The main idea is quite intuitive. In the one-tailed test of clustering versus CSR above, suppose that for the observed standardized mean value, $z_m$, one simply asks how likely it would be to obtain a value this low if the CSR Hypothesis were true. This
question is easily answered by simply calculating the probability of a sample value as low as $z_m$ for the standard normal distribution, $N(0,1)$. If the cumulative distribution function for the normal distribution is denoted by

$$\Phi(z) = \Pr(Z \le z) \tag{25}$$

then this probability, called the P-value of the test, is given by

$$\Pr(Z \le z_m) = \Phi(z_m) \tag{26}$$

as shown graphically below.

[Fig. 9. P-Value for Clustering Test]

Notice that unlike the significance level, $\alpha$, above, the P-value for a test depends on the realized sample value, $z_m$, and hence is itself a random variable that changes from sample to sample. However, it can be related to $\alpha$ by observing that if $P(Z \le z_m) \le \alpha$, then for a test of size $\alpha$ one would conclude that there is significant clustering. More generally, the P-value, $P(Z \le z_m)$, can be defined as the largest level of significance (smallest value of $\alpha$) at which CSR would be rejected in favor of clustering based on the given sample value, $z_m$.

[11] Note that lower values of $\alpha$ denote higher levels of significance.

Similarly, one can define the P-value for a test of uniformity in the same way, except that now for a given observed standardized mean value, $z_m$, one asks how likely it would be to obtain a value this high if the CSR Hypothesis were true. Hence the P-value in this case is given simply by

$$\Pr(Z \ge z_m) = \Pr(Z > z_m) = 1 - \Pr(Z \le z_m) = 1 - \Phi(z_m) \tag{27}$$

where the first equality follows from the fact that $\Pr(Z = z_m) = 0$ for continuous distributions [12]. This P-value is illustrated graphically below.

[Fig. 10. P-Value for Uniformity Test]

Finally, the corresponding P-value for the general two-tailed test is given as the answer to the following question: How likely would it be to obtain a value as far from zero as $z_m$ if the CSR Hypothesis were true? More formally, this P-value is given by

$$P(|Z| \ge |z_m|) = 2\,\Phi(-|z_m|) \tag{28}$$

as shown below. Here the absolute value is used to ensure that $-|z_m|$ is negative regardless of the sign of $z_m$. Also, the factor "2" reflects the fact that values in both tails
are further from zero than $z_m$.

[Fig. 11. P-Value for Two-Tailed Test]

[12] By the symmetry of the normal distribution, this P-value is also given by $\Phi(-z_m) = 1 - \Phi(z_m)$.

3.3 Redwood Seedling Example

We now illustrate the Clark-Evans testing procedure in terms of the Redwood Seedling example in Figure 1.1.1. This image is repeated in Figure 12a below, where it is now compared with a randomly generated point pattern of the same size in Figure 12b. Here it is evident that the redwood seedlings are more clustered than the random point pattern. However, it is important to notice that there are in fact some apparent clusters in the random pattern. Indeed, if there were none, then this pattern would be "too uniform". So the key task is to distinguish between degrees of clustering that could easily occur by chance and those that could not. This is the essence of statistical pattern analysis.

[Fig. 12a. Redwood Seedlings]    [Fig. 12b. Random Point Pattern]

To do so, we shall start by assuming that most of the necessary statistics have already been calculated. (We shall return to the details of these calculations later.) Here the area, $a(R) = 44108$ sq. meters, of this region, $R$, is given by ARCMAP. It appears in the Attribute Table of the boundary file, Redwbnd.shp, in the map document Redwoods.mxd. The number of points, $n = 62$, in this pattern is given in the Attribute Table of the data file, Redwpts.shp, in Redwoods.mxd. (The bottom of the Table shows "Records (0 out of 62 Selected)". Note that there only appear to be 61 rows, because the row numbering always starts with zero in ARCMAP.) Hence the estimated point density in (1) above is given by

$$\hat{\lambda} = \frac{n}{a(R)} = \frac{62}{44108} = .00141 \tag{29}$$

For purposes of this illustration we set $m = n = 62$, so that the corresponding estimates of the mean and standard deviation of nn-distances under CSR are given respectively by

$$\hat{\mu} = \frac{1}{2\sqrt{\hat{\lambda}}} = \frac{1}{2\sqrt{.00141}} \approx 13.336 \ \text{meters} \tag{30}$$

$$\hat{\sigma}_n = \sqrt{\frac{4 - \pi}{n\,4\pi\hat{\lambda}}} = \sqrt{\frac{4 - 3.14}{(62)\,4\,(3.14)(.00141)}} \approx .8853 \ \text{meters} \tag{31}$$

For the redwood seedling pattern, the mean nn-distance, $\bar{d}_n$, turns out to be

$$\bar{d}_n = 9.037 \ \text{meters} \tag{32}$$

At this point, notice already that this average distance is much smaller than the theoretical value calculated in (30) under the hypothesis of CSR. So this already suggests that for the given density of trees in this area, individual trees are much too close to their nearest neighbors to be random. To verify this statistically, let us compute the standardized mean

$$z_n = \frac{\bar{d}_n - \hat{\mu}}{\hat{\sigma}_n} = \frac{9.037 - 13.336}{.8853} = -4.855 \tag{33}$$

Now recalling from Table 2 above that there is "strongly significant" clustering if $z_n \le -z_{.01} = -2.33$, one can see from (33) that clustering in the present case is even more significant. In fact the P-value in this case is given by

$$\text{P-value} = P(Z \le z_n) = \Phi(z_n) = \Phi(-4.855) \approx .0000006 \tag{34}$$

(Methods for obtaining $\Phi$-values are discussed later.) So the chance of obtaining a mean nearest-neighbor distance this low under the CSR Hypothesis is only about 6 in ten million. This is very strong evidence in favor of clustering versus CSR.

However, one major difficulty with this conclusion is that we have used the entire point pattern ($m = n$), and have thus ignored the obvious dependencies between nn-distances discussed above. Cressie (1993, pp. 609-10) calls this "intensive" sampling, and shows with simulation analysis that this procedure tends to overestimate the significance of clustering (or uniformity). The basic reason for this is that positive correlation among nn-distances results in a larger variance of the test statistic, $Z_n$, than would be expected under independence (for a proof of this see Section 4 of the Appendix to Part I, and see also p.99 in [BG]). This will tend to inflate the absolute value of the standardized mean, thus exaggerating the significance of clustering or uniformity. With this in mind, we now consider two procedures for taking random subsamples of pattern points that tend to minimize this dependence problem. These two approaches utilize JMPIN and MATLAB, respectively, and thus provide convenient
introductions to using these two software packages.

3.3.1 Analysis of Redwood Seedlings using JMPIN

One should begin here by reading the notes on opening JMPIN in section 2.1 of Part IV in this NOTEBOOK [13]. In the class subdirectory jmpin, now open the file Redwooddata.jmp in JMPIN. (The columns nn-dist and area contain data exported from MATLAB and ARCMAP, respectively, and are discussed later.) The column Rand-Relabel is a random reordering of labels, with associated nn-distance values in the column Sample. (These can be constructed using the procedure outlined in section 2.2.2 of Part IV in this NOTEBOOK.)

[13] This refers to section 2.1 in the Software portion (Part IV) of this NOTEBOOK. All other references to software procedures will be done similarly.

Now open a second file, labeled CETests.jmp, which is a spreadsheet constructed for this class that automates Clark-Evans tests. Here we shall use a random 50% subsample of points from the Redwood Seedlings data set to carry out a test of clustering [14]. To do so, click Rows > Add Rows and add 31 rows (= 62/2). Next, copy-and-paste the first 31 rows of Redwooddata.jmp into these positions.

In Redwooddata.jmp:
  (i) Select rows 1 to 31 (click Row 1, hold down shift, and click Row 31).
  (ii) Select column heading Sample (this entire column is now selected).
  (iii) Click Edit > Copy.

Now in CETests.jmp:
  (i) Select column heading nn-dist.
  (ii) Click Edit > Paste.

Finally, to activate this spreadsheet you must fill in the two parameters (area, n). Start with area as follows:
  (i) Right click on the column heading area.
  (ii) Right click on the small red box (which may say "no formula").
  (iii) Type 44108, hit return, and click Apply and OK.

The entire column should now contain the value 44108 in each row. The procedure for filling in the value n (= 62) is the same. Once these values are registered, the spreadsheet does all remaining calculations. Open the formula windows for lam, mu, sig, s-mean and Z as above, and examine the formulas used.
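The spreadsheet's own formulas are not reproduced in these notes, but the quantities it reports follow directly from expressions (1), (23), (24) and (26)-(28). The following is a minimal sketch in Python (standard library only) under that assumption. The function names `normal_cdf` and `clark_evans`, the argument names, and the dictionary keys are hypothetical choices made for this illustration; the subsample mean 8.2826 is the s-mean value reported for the 31-point subsample, and the area and point count are those of the redwood pattern.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, Phi(z), computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def clark_evans(mean_nn_dist, m, n, area):
    """Clark-Evans statistics for a subsample of m nn-distances (illustrative sketch).

    mean_nn_dist : sample mean of the m selected nn-distances
    m            : subsample size
    n            : number of points in the full pattern
    area         : area of the region R
    """
    lam = n / area                                   # density estimate, as in (1)
    mu = 1.0 / (2.0 * math.sqrt(lam))                # mean nn-distance under CSR, as in (23)
    sig = math.sqrt((4.0 - math.pi) / (m * 4.0 * math.pi * lam))  # standard error, as in (23)
    z = (mean_nn_dist - mu) / sig                    # standardized sample mean, as in (24)
    return {
        "lam": lam,
        "mu": mu,
        "sig": sig,
        "z": z,
        "p_csr": 2.0 * normal_cdf(-abs(z)),          # two-tailed P-value, as in (28)
        "p_clust": normal_cdf(z),                    # one-tailed clustering P-value, as in (26)
        "p_unif": 1.0 - normal_cdf(z),               # one-tailed uniformity P-value, as in (27)
    }


# Redwood seedlings: n = 62 points, area 44108 sq. meters, 50% subsample (m = 31).
result = clark_evans(mean_nn_dist=8.2826, m=31, n=62, area=44108.0)
for name, value in result.items():
    print(f"{name:8s} {value:.7f}")
```

Only the subsample mean and the three parameters enter the calculation, so the same function can be reused for any other subsample by changing `mean_nn_dist` and `m`.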
The results are shown below, where only the first row is displayed:

| lam    | mu      | sig    | s-mean | Z       | PVal CSR  | PVal Clust | PVal Unif |
| 0.0014 | 13.3362 | 1.2521 | 8.2826 | -4.0363 | 0.0000546 | 0.0000273  | 0.9999727 |

[14] In [BG, p.99] it is reported that a common rule-of-thumb to ensure approximate independence is to take a random subsample of no more than 10%, i.e., $m \le n/10$. But even for large sample sizes, $n$, this rule discards most of the information in the data. So 50% subsamples are much more common. An alternative approach will be developed in the MATLAB application of Section 3.3.2 below.

Notice first that all values other than lam differ from the full-sample case ($m = n$) calculated above, since we have only $m = 31$ samples. Next observe that the P-value for clustering, .0000273, is more than an order of magnitude larger than for the full-sample case. So while clustering is still extremely significant (as it should be), this significance level has been deflated by removing some of the positive dependencies between nn-distances. Notice also that the P-value for CSR is by definition exactly twice that for Clustering (since $Z$ is negative here), and similarly that the P-value for Uniformity is exactly one minus that for Clustering. This latter P-value shows that there is no statistical evidence for Uniformity, in the sense that values at least as high as $Z = -4.0363$ are almost bound to occur under CSR.

3.3.2 Analysis of Redwood Seedlings using MATLAB

While the procedure in JMPIN above does allow one to take random subsamples, and thereby reduce the effect of positive dependencies among nn-distances, it only allows a single sample to be taken. So the results obtained depend to some degree on the sample selected. What one would like to do here is to take many subsamples of the same size (say with $m = 31$) and look at the range of Z-values obtained. If almost all samples indicate significant clustering, then this would yield a much stronger result that is independent of the particular sample
chosen. In addition, one might for example want to use the P-value obtained for the sample mean of Z as a more representative estimate of actual significance. But to do so in JMPIN would require many repetitions of the same procedure, and would clearly be very tedious. Hence an advantage of programming languages like MATLAB is that one can easily write a program to carry out such repetitive tasks. With this in mind, we now consider an alternative approach to Clark-Evans tests using MATLAB.

One should begin here by reading the notes on opening MATLAB in section 3.1 of Part IV in this NOTEBOOK. Now open MATLAB, set the Current Directory (at the top of the MATLAB window) to the class subdirectory F:\sys502\matlab, and open the data file Redwoods.mat [15]. The Workspace window on the left will now display the data matrices contained in this file. For example, area is seen to have Size 1 x 1, and hence constitutes a single number. If you type

    >> area

at the command prompt (>>) you will see the number 44108, which corresponds to the area value used in JMPIN above. This number was imported from ARCMAP, and can be obtained by following the ARCMAP procedure outlined in Section 1.2.8 of Part IV. Next consider the data matrix Redwoods, which is seen to be a 62 x 2 matrix, with each row denoting the (x,y) coordinates of one of the 62 redwood seedlings. You can display this matrix by typing

    >> Redwoods

at the prompt.

[15] The extension .mat is used for data files in MATLAB.

I have written a program, cluster.m [16], in MATLAB to carry out Clark-Evans tests. You can display this program by clicking Edit > Open and selecting the file cluster.m [17]. The first few lines of this program are displayed below:

    function OUT = cluster(pts,a,m,test)

    % CLUSTER.M performs the Clark-Evans tests for Clustering.
    %
    % NOTE: These tests use a random subsample (size = m) of the
    %       full sample of n nearest-neighbor distances, and
    %       ignore edge effects.
    %
    % Written by: TONY E. SMITH, 12/28/99
    %
    % INPUTS:
    %    (i)   pts  = file of point locations (xi,yi), i = 1..n
    %    (ii)  a    = area of region
    %    (iii) m    = sample size (m <= n)
    %    (iv)  test = indicator of test to be used
    %                 0 = two-sided test for randomness
    %                 1 = one-sided test for clustering
    %                 2 = one-sided test for uniformity
    %
    % OUTPUTS: OUT = vector of all nearest-neighbor distances
    %
    % SCREEN OUTPUT: critical z-value and p-value for test

The first line defines this program to be a function called cluster, with four inputs (pts,a,m,test) and one output called OUT. The percent signs on subsequent lines indicate comments, intended for the reader only. The first few comment lines describe what the program does. In this case cluster takes a subsample of size $m \le n$ and performs a Clark-Evans test, as in JMPIN. The next set of comment lines describes the four inputs in detail. The first, pts, contains the (x,y) coordinates of the given point pattern, and corresponds in our present case to Redwoods. The parameter a corresponds to area, and m corresponds to the size of the subsample to be taken (in this case $m = 31$). Finally, test is an indicator denoting the type of test to be done, so that for a one-tailed test of clustering we would give test the value 1.

[16] The extension .m is used for all executable programs and scripts in MATLAB.

[17] To view this program you can also type the command >> edit cluster.

During the execution of this program, the nearest-neighbor distance for each pattern point is calculated. Since this vector of nn-distances is useful for other applications (such as the JMPIN spreadsheet above), it is useful to save this vector. Hence the single output, OUT, is in this case the n x 1 matrix of nn-distances. The last comment line describes the screen output of this program, which in the present case is simply a display of the Z-value obtained and its corresponding P-value.

To run this program, suppose that you want to save the nn-distance output as a vector called D (the names of inputs and outputs can be anything you choose). Then at
the command prompt you would type:

    >> D = cluster(Redwoods,area,31,1);

Here it is important to end this command statement with a semicolon, for otherwise all output will be displayed on the screen (in this case the contents of D). Hence by hitting return after typing the above command, the program will execute and give a screen display such as the following:

    RESULTS OF TEST FOR CLUSTERING
    Z-Value = -3.3282
    P-Value = 0.00043697

Here the results are different from those of JMPIN above, because a different random subsample of size $m = 31$ was chosen. To display the first four rows of the output vector, D, type [18]:

    >> D(1:4,:)

Here the absence of a semicolon at the end will cause the result of this command to be displayed. If you would like to save this output to your home directory (say E:) as a text file (say nndist.txt), then use the command sequence

    >> save E:\nndist.txt D -ascii

[18] Since D is a vector, there is only a single column. So one could simply type D(1:4) in this case.

As was pointed out above, the results of this Clark-Evans test depend on the particular sample chosen. Hence each time the program is run there will be a slightly different result (try it!). But in MATLAB it is a simple matter to embed cluster in a slightly larger program that will run cluster many times and produce whatever summary outputs are desired. I have constructed a program to do this, called clustdistr.m. If you open this program you will see that it has a similar format:

    function OUT = clustdistr(pts,a,m,test,N)

    % CLUSTDISTR.M samples cluster.m a total of N times
    %
    % Written by: TONY E. SMITH, 12/28/99
    %
    % INPUTS:
    %    (i)   pts  = file of point locations (xi,yi), i = 1..n
    %    (ii)  a    = area of region
    %    (iii) m    = sample size (m <= n)
    %    (iv)  test = indicator of test to be used
    %                 0 = two-sided test for randomness
    %                 1 = one-sided test for clustering
    %                 2 = one-sided test for uniformity
    %    (v)   N    = number of sample tests
    %
    % OUTPUTS: OUT = vector of Z-values for tests
    %
    % SCREEN OUTPUT: (1) Normal fit of Histogram for OUT
    %                (2) Mean of OUT
    %                (3) P-value of mean (if normcdf present)

The key difference here is the new parameter, N, that specifies the number of point pattern samples of size m to be taken (i.e., the number of times that cluster is to be run). The output chosen for this program is the vector of Z-values obtained. So if N = 1000, then OUT will be a vector of length 1000. The screen outputs now include summary measures of this vector of Z-values, namely the histogram of Z-values in OUT, along with the mean of these Z-values and the P-value for this mean. If this program is run using the command

    >> Z = clustdistr(Redwoods,area,31,1,1000);

then 1000 samples will be drawn, and the resulting Z-values will be saved in a vector, Z. In addition, a histogram of these Z-values will be displayed, as illustrated in Figure 13 below. Notice that the results of this simulated sampling scheme yield a distribution of Z-values that is approximately normal. While this normality property is again a consequence of the Central Limit Theorem, it should not be confused with the normal distribution in (17) upon which the Clark-Evans test is based. However, this simulation result does show that a 50% sample ($m = n/2$) in this case yields sufficient independence among nn-distances to yield a sampling distribution that is approximately normal [19].

[19] Notice that this provides some evidence that the 10% rule of thumb in footnote 14 above is overly conservative.

[Fig. 13. Sampling Distribution of Z-Values]

In particular, the mean of this distribution is now about -3.46, as shown by the program output below:

    RESULTS OF TEST FOR CLUSTERING
    Mean Z-Value = -3.4571
    P-Value of Mean = 0.00027298

Here the P-value, .000273, is of the same order of magnitude as that of the single sample above, indicating that this single sample was fairly representative [20]. Notice, however, that the single sample in the JMPIN application above has a P-value (.0000546) that is an order of magnitude smaller. Hence this illustrates an unlucky choice of
samples that yields a much higher degree of significance than is warranted. But for this Redwood Seedling example such differences have little relevance, since even a P-value of .000273 indicates very significant clustering, as is obvious for this point pattern.

[20] Again it should be emphasized that this P-value has nothing to do with the sampling distribution in Figure 13. Rather, it is the P-value for the mean Z-value under the normal distribution in (17).

3.4 Bodmin Tors Example

The redwood seedling example above is something of a straw man, in that statistical analysis is hardly required to demonstrate the presence of strong clustering. Rather, it serves as an illustrative case where we know what the answer should be. [21] However, the presence of significant clustering (or uniformity) is usually not so obvious. Our second example, again taken from BG (Figure 3.2), provides a good case in point. It also serves to illustrate some additional limitations of the above analysis.

Here the point pattern consists of granite outcroppings (tors) in the Bodmin Moor, located at the very southern tip of England in Cornwall county, as shown to the right. Since the granite in these tors was used for tomb stones during the Bronze age, they are something of a tourist attraction in England. The map in Figure 14a below shows a portion of the Moor containing n = 35 tors. A randomly generated pattern of 35 tors is also shown for comparison in Figure 14b.

[Map: BODMIN MOOR]

Fig. 14a. Bodmin Tors        Fig. 14b. Random Tors

Here there does appear to be some clustering of tors relative to the random pattern on the right. But it is certainly not as strong as in the Redwood Seedling example above. So it is of interest to see what the Clark-Evans test says about clustering in this case (see also exercise 3.5 on pp. 114-15 in BG). The maps in Figures 14a and 14b appear in the AP project bodmin.mxd in the directory arviewprojectlBodmin. The area, a(R) = 206.62, of the region R in
Figure 14a is given in the Attribute Table of the shape file bodbdy. [22] This point pattern data was imported to MATLAB and appears in the matrix Bodmin of the data file bodmin.mat in the matlab directory. For our present purposes it is of interest to run the following full-sample version of the Clark-Evans test for clustering:

    >> D = cluster(Bodmin,area,35,1);

    RESULTS OF TEST FOR CLUSTERING
    Z-Value = -1.0346
    P-Value = 0.15043

[21] [...] methods for detecting clustering.

Hence even with the full sample of data points, the Clark-Evans test yields no significant clustering. Moreover, since subsampling will only act to reduce the level of significance, this tells us that there is no reason to proceed further. But for completeness we include the following results for a subsample of size m = 18 (approximately 50%): [23]

    >> clustdistr(Bodmin,area,18,1,1000);

    RESULTS OF TEST FOR CLUSTERING
    Mean Z-Value = -0.71318
    P-Value of Mean = 0.23787

So even though there appears to be some degree of clustering, this is not detected by Clark-Evans. It turns out that there are two key theoretical difficulties here that have yet to be addressed. The first is that for point pattern samples as small as the Bodmin Tors example, the assumption of asymptotic normality may be questionable. The second is that nn-distances for points near the boundary of region R are not distributed the same as those away from the boundary. We shall consider each of these difficulties in turn.

First, with respect to normality, the usual rule of thumb associated with the Central Limit Theorem is that sample means should be approximately normally distributed for independent random samples of size at least 30 from distributions that are not too skewed. Both of these conditions are violated in the present case. To achieve sufficient independence in the present case, subsample sizes m surely cannot be much larger than 20. Moreover, the sampling distribution of nn-distances in Figure 15
shows a definite skewness with long right tail. This type of skewness is typical of nn-distances, even under the CSR Hypothesis. Under CSR, the theoretical distribution of nn-distances is given by the Rayleigh density in expression (2) of Section 2 in the Appendix to Part I, which is seen to have the same skewness properties.

Fig. 15. Bodmin nn-Distances

[23] Here we are not interested in saving the Z-values, so we have specified no outputs for clustdistr.

The second theoretical difficulty concerns the special nature of nn-distances near the boundary of region R. The theoretical development of the CSR Hypothesis explicitly assumed that the region R is of infinite extent, so that such edge effects do not arise. But in practice, many point patterns of interest occur in regions R where a significant portion of the points are near the boundary of R. Recall from the discussion in Section 2.5 that if region R is viewed as a window through which part of a larger stationary point process is being observed, then points near the boundary will tend to have fewer observed neighbors than points away from the boundary. So in cases where the nearest neighbor of a point in the larger process is outside R, the observed nn-distance for that point will be greater than it should be (such as the example shown in Figure 16 below). Thus the distribution of nn-distances for such points will clearly have higher expected values than for interior points. For samples from CSR processes, this will tend to inflate mean nn-distances relative to their theoretical values under the CSR Hypothesis. This edge effect will be demonstrated more explicitly in the next section.

Fig. 16. Example of Edge Effect

3.5 A Direct Monte Carlo Test of CSR

Given these shortcomings, we now develop a testing procedure that simulates the true distribution of $\bar{D}_n$ in region R for a given pattern size n. [24] While this procedure is computationally more intensive, it will
not only avoid the need for normal approximations, but will also avoid the need for subsampling altogether. The key to this procedure lies in the fact that the actual distribution of a randomly located point in R can easily be simulated on a computer. This procedure, known as rejection sampling, starts by sampling random points from rectangles. Since each rectangle is the Cartesian product of two intervals, $[a_1,b_1] \times [a_2,b_2]$, and since drawing a random number s from an interval $[a,b]$ is a standard operation in any computer language, one can easily draw a random point $s = (s_1, s_2)$ from $[a_1,b_1] \times [a_2,b_2]$. Hence for any given planar region R, the basic idea is to sample points from the smallest rectangle, rec(R), containing R, and then to reject any points which are not in R. To obtain n points in R, one continues to reject points until n are found in R. (Thus the choice of rec(R) is designed to minimize the expected number of rejected samples.) An example for the case of Bodmin is illustrated in Figure 17, where for simplicity we have sampled only n = 10 points. Here there are seen to be four sample points that were rejected.

[24] Procedures for simulating distributions by random sampling are known as Monte Carlo procedures.

Fig. 17. Rejection Sampling

The resulting sample points in R then constitute an independent random sample of size n that by construction must satisfy the CSR Hypothesis. To see this, note simply that since the larger sample in rec(R) automatically satisfies this hypothesis, it follows that for any subset $C \subseteq R$ the probability that a point lies in C, given that it is in R, must have the form

(35)  $\Pr(C \mid R) = \dfrac{\Pr(C \cap R)}{\Pr(R)} = \dfrac{\Pr(C)}{\Pr(R)} = \dfrac{a(C)/a[\mathrm{rec}(R)]}{a(R)/a[\mathrm{rec}(R)]} = \dfrac{a(C)}{a(R)}$

Hence expression (2.1.2) holds, and the CSR Hypothesis is satisfied. More generally, for any pattern of size n one can easily simulate as many samples of size n from R as desired, and use these to estimate the sampling distribution of $\bar{D}_n$ under the CSR Hypothesis. This
procedure has been operationalized in the MATLAB program clustsim.m. Here the only additional input information required is the file of boundary points defining the Bodmin region R. The coordinates of these boundary points are stored in the 156 x 2 matrix Bodpoly in the data file bodmin.mat. To display the first three rows and last three rows of this file, at the prompt type Bodpoly(1:3,:), hit return, and then type Bodpoly(154:156,:). You will then see that this matrix has the form shown to the right.

[Display of Bodpoly: first row 1 155, followed by rows of xy-coordinates]

Here the first row gives information about the boundary, namely that there is one polygon and that this polygon consists of 155 points. Each subsequent row contains the xy-coordinates for one of these points. Notice also that the second row and the last row are identical, indicating that the polygon is closed (and thus that there are only 154 distinct points in the polygon). This boundary information for R is necessary in order to define the rectangle rec(R). It is also needed to determine whether a given point in rec(R) is also in R or not. While this latter determination seems visually evident in the present case, it turns out to be relatively complex from a programming viewpoint. A brief description of this procedure is given in section 5 of the Appendix to Part I.

The program clustsim is designed to estimate the sampling distribution of $\bar{D}_n$ under the CSR Hypothesis by simulating a large number, N, of random patterns of size n in R. This distribution is then used to determine whether a given pattern in R exhibits significant clustering. To do so, observe that if the mean nn-distance, $\bar{d}_n$, for the given pattern were in fact a sample from this distribution, then the probability $\Pr(\bar{D}_n \le \bar{d}_n)$ of obtaining a value as low as $\bar{d}_n$ could be estimated by the fraction of simulated mean nn-distance values that do not exceed $\bar{d}_n$. More precisely, if $N_0$ denotes the number of simulated patterns
with mean nn-distances not exceeding $\bar{d}_n$, then this probability could be estimated as follows:

(36)  $\widehat{\Pr}(\bar{D}_n \le \bar{d}_n) = \dfrac{N_0}{N+1}$

Here the denominator N + 1 includes the observed sample along with the simulated samples. This estimate then constitutes the relevant P-value for a test of clustering relative to the CSR Hypothesis. Hence the testing procedure in clustsim consists of the following two steps:

(i) Simulate N patterns of size n, and for each pattern i = 1,..,N compute the mean nn-distance, $\bar{d}_n^{(i)}$.

(ii) Determine the number of patterns, $N_0$, with $\bar{d}_n^{(i)} \le \bar{d}_n$, and calculate the P-value for $\bar{d}_n$ using (36) above.

To run this program we require one additional bit of information, namely the observed value of $\bar{d}_n$. Given the output vector D of nn-distances for Bodmin tors obtained above from the program cluster, this mean value (say mdist) can be calculated by using the built-in function mean in MATLAB as follows:

    >> mdist = mean(D);

In the present case, mdist = 1.1038. To input this value into clustsim, we shall use a MATLAB data array known as a structure. Among their many uses, structures offer a convenient way to input optional arguments into MATLAB programs. In the present case, we shall input the value mdist together with the number of bins to be used in constructing a histogram display for the simulated mean nn-distance values. The default value in MATLAB is bin = 10. This value is useful for moderate sample sizes, say N = 100. But for simulations with N ≥ 1000 it is better to use bin = 20 or 25. If you open the program clustsim, you will see that the last input of this function is a structure, namely opts (for options), that is described in more detail under INPUTS:

    function OUT = clustsim(poly,a,m,N,opts)
    % CLUSTSIM.M simulates the sampling distribution of average
    % nearest-neighbor distances in a fixed polygon. It can also
    % determine the P-value for a given mean nearest-neighbor
    % distance, if supplied.
    % Written by: TONY E. SMITH, 12/31/00
    % INPUTS:
    %   (i)   poly = boundary file of polygon
    %   (ii)  a    = area of polygon
    %   (iii) m    = number of points in polygon
    %   (iv)  N    = number of simulations
    %   (v)   opts = an (optional) structure with variable inputs:
    %         opts.bins  = number of bins in histogram (default = 10)
    %         opts.mdist = mean nearest neighbor distance for testing

To define this structure in the present case, we shall use the value of mdist just calculated, and shall set bins = 20. This is accomplished by the two commands:

    >> opts.mdist = mdist; opts.bins = 20;

Notice that opts is automatically defined by simply specifying its components. [25] The key point is that only the structure name, opts, needs to be specified in the command line. The program clustsim will look to see whether either of these components of opts has been specified. So if you want to use the default value of bins, just leave out this command. Moreover, if you just want to look at the histogram of simulated values (and not run a test at all), simply leave opts out of the command line. This is what is meant in the description above when opts is referred to as an optional structure.

[25] Note also that we have put both commands on the same line to save room. Just remember to separate each command by a semicolon.

Given these preliminaries, we are now ready to run the program clustsim for Bodmin. To do so, enter the command line:

    >> clustsim(Bodpoly,area,35,1000,opts);

Here we have specified n = 35 for the Bodmin case, and have specified that N = 1000 simulated patterns be constructed. The screen output will start with successive displays

    percentdone = 10
    percentdone = 20
      ...
    percentdone = 100

that indicate how the simulations are proceeding. The final screen output will then include both a histogram of mean nn-distance values and some numerical outputs, as described in the SCREEN OUTPUT section of the comments in clustsim. The histogram will be something like that shown in Figure 18 below (the red vertical bar will be discussed below).

Fig. 18. Histogram of Mean nn-Distances

Note first that in spite of the
relatively skewed distribution of observed nn-distance values for Bodmin, this simulated distribution continues to be approximately normal. Hence given the sample size, n = 35, it appears that the dependencies between nn-distance values in this Bodmin region are not sufficient to rule out the assumption of normality here. But in spite of its normality, this distribution is noticeably different from that predicted by the CSR Hypothesis. To see this, recall first that for the given area of Bodmin, a(R) = 206.6, the point density estimate is given by $\hat{\lambda} = 35/206.6 = .1694$. Hence the theoretical mean nn-distance value predicted by the CSR Hypothesis is

(37)  $\hat{\mu} = \dfrac{1}{2\sqrt{\hat{\lambda}}} \approx 1.215$

However, if we now look at the numerical screen output for this simulation, we have

    CLUSTSIM RESULTS
    SIMMEANDIST = 1.3087
    MDIST = 1.1038
    P-VALUE FOR MDIST = 0.044955

Here the first line reports the mean value of the 1000 simulated mean nn-distances. But since, by the Law of Large Numbers, a sample this large should give a fairly accurate estimate of the true mean, $E(\bar{D}_n)$, we see that this true mean is considerably larger than that predicted by the CSR Hypothesis above. [26] The key point to note here is that the edge effects depicted in Figure 16 above are quite significant for pattern sizes as small as n = 35 relative to the size of the Bodmin region R. [27] So this simulation procedure does indeed give a more accurate distribution of nn-distances in the Bodmin region under the CSR Hypothesis.

Observe next that the second line of screen output above gives the value of opts.mdist, as noted above (assuming this component of opts was included). The final line is the critical one, and gives the P-value for opts.mdist as estimated by (36) above. Hence, unlike the Clark-Evans test, where no significant clustering was observed even under full sampling, the present procedure does reveal significant clustering. [28] This is shown by the position of the red vertical bar in Figure 18 above at
approximately a value of mdist = 1.1038. Here there are seen to be only a few simulated values lower than mdist. Moreover, the discussion above now shows why this result differs from Clark-Evans. In particular, by accounting for edge effects, this procedure reveals that under the CSR Hypothesis, mean nn-distance values for Bodmin should be higher than those predicted by the Clark-Evans model. Hence the observed value of mdist is actually quite low once this effect is taken into account.

[26] You can convince yourself of this by running clustsim a few times and observing that the variation in this estimated mean value is quite small.

[27] Note that as the sample size n becomes larger, the expected nn-distance, $E(\bar{D}_n)$, for a given region R becomes smaller. Hence the fraction of points sufficiently close to the boundary of R to be subject to edge effects eventually becomes small, and this edge effect disappears.

[28] Note again that this P-value will change each time clustsim is run. However, by trying a few runs you will see that all values are close to .05.

NOTEBOOK FOR SPATIAL DATA ANALYSIS
Part II. Continuous Spatial Data Analysis

3. Spatial Stationarity

Observe that all regressions done up to this point, including the regression model in expression (2.2.1) above, are based on the classical ordinary least squares assumption of independently and identically distributed residuals, $\varepsilon_i$, i = 1,..,n. We shall consider this classical model in more detail in a later section below. But for the present it is important to note that when the underlying index set, {1,..,n}, has structure, such as proximity relations among points in time or space, then this structure itself can induce statistical dependencies. In particular, unobserved influences on the dependent variable of interest, such as lnPCBI in (2.2.1), often vary smoothly in space and/or time, so that values $\varepsilon_i$ and $\varepsilon_j$ with i close to j tend to be more similar than would be expected under statistical independence (as in the example of Section 3.1 below). Moreover, such statistical
dependencies often have little substantive relation to the main phenomena of interest. In terms of our basic modeling framework, $Y(s) = \mu(s) + \varepsilon(s)$ in (1.2.1) above, we are usually more interested in the global structure of the spatial process, as represented by $\mu(s)$, than in the specific relations among unobserved residuals, $\varepsilon(s_i)$, i = 1,..,n, at any given set of sample locations, $s_i$, i = 1,..,n. These relations are sometimes referred to as second-order effects, in contrast to the first-order effects represented by $\mu(s)$. Hence it is often desirable to model such second-order effects in a manner that will allow the analysis to focus on the first-order effects, while at the same time taking these unobserved dependencies into account. This general strategy can be illustrated by the following example.

3.1 Example: Measuring Ocean Depths

Suppose that one is interested in mapping the depth of the sea floor over a given region. Typically this is done by taking echo soundings (sonar measurements) at regular intervals from a vessel traversing a system of paths over the ocean surface. This yields a set of depth readings, $D_i = D(s_i)$, i = 1,..,n, such as the set of measurements shown in Figure 3.1 below.

Figure 3.1. Pattern of Depth Measurements

However, the ocean is not a homogeneous medium. In particular, it is well known that such echo soundings can be influenced by the local concentration of zooplankton in the region of each sounding. These clouds of zooplankton (illustrated in Figure 3.2 below) create interference called ocean volume reverberation.

Figure 3.2. Zooplankton Interference

This tends to vary from location to location, and even from day to day (much like the way in which sunlight is affected by cloud patterns). [1] So actual readings are random variables of the form

(3.1.1)  $D(s_i) = d(s_i) + \varepsilon(s_i)$,  i = 1,..,n

where in this case the actual depth at location $s_i$ is represented by $d(s_i) = E[D(s_i)]$, and $\varepsilon(s_i)$ represents measurement error due to interference. [2]
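As a rough numerical illustration of the decomposition in (3.1.1), the following MATLAB sketch generates hypothetical depth readings along a single path. Everything in it is assumed for illustration and is not taken from these notes: the location vector s, the true-depth profile d, and the moving-average construction of the errors e, which is simply one convenient way to make errors at neighboring locations similar.

```matlab
% Sketch of (3.1.1): D(s_i) = d(s_i) + e(s_i), with hypothetical values.
n = 200;                              % number of soundings (assumed)
s = linspace(0, 20, n)';              % locations along one path, in km (assumed)
d = 100 + 5*sin(s/3);                 % "true" depths d(s_i), in meters (assumed)

% Errors: a 5-point moving average of white noise, so that errors at
% neighboring locations share most of their noise terms (assumption).
k = 5;
w = randn(n + k - 1, 1);              % independent N(0,1) draws
e = conv(w, ones(k,1)/k, 'valid');    % n-by-1 vector of smoothed errors

D = d + e;                            % observed readings D(s_i)

% Sample correlation of errors at lag 1 (adjacent) and lag 20 (far apart)
c1  = corrcoef(e(1:end-1),  e(2:end));
c20 = corrcoef(e(1:end-20), e(21:end));
fprintf('lag-1 correlation:  %6.3f\n', c1(1,2));
fprintf('lag-20 correlation: %6.3f\n', c20(1,2));
```

With this particular construction the lag-1 correlation should come out near 0.8 (the theoretical value for an equally weighted 5-point moving average), while the lag-20 correlation should be near zero, up to sampling noise. Each reading D(i) still has expected value d(i), since the errors have mean zero.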
Moreover, these errors are statistically dependent, since plankton concentrations at nearby locations will tend to be more similar than at locations widely separated in space. Hence to obtain confidence bounds on the true depth at location $s_i$, it is necessary to postulate a statistical model of these joint interference levels, $\varepsilon(s_i)$, i = 1,..,n. Now one could in principle develop a detailed model of zooplankton behavior, including their patterns of individual movement and clustering behavior. However, such models are not only highly complex in nature, they are very far removed from the present target of interest, which is to obtain accurate depth measurements. [3]

[1] Actual variations in the distribution of zooplankton are more diffuse than the clouds depicted in Figure 3.2. Vertical movement of zooplankton in the water column is governed mainly by changes in sunlight, and horizontal movement by ocean currents.

[2] In actuality, such measurement errors include many different sources, such as the reflective properties of the sea floor. Moreover, depth measurements are actually made indirectly in terms of the transmission loss, L = L(s), between the signal sent and the echo received. The corresponding depth, D, is obtained from L by a functional relation, $D = D(L, \theta)$, where $\theta$ is a vector of parameters that have been calibrated under idealized conditions. For further details see Urick, R.J. (1983) Principles of Underwater Sound, 3rd ed., McGraw-Hill: New York, and in particular the discussion around p.413.

[3] Here it is important to note that such detailed models can be of great interest in other contexts. For example, acoustic signals are also used to estimate the volume of zooplankton available as a food source for sea creatures higher in the food chain. To do so, it is essential to relate acoustic signals to the detailed behavior

So what is needed here is a statistical model of spatial residuals that allows for local spatial
dependencies, but is simple enough to be estimated explicitly. To do so, we will adopt the following basic assumptions of spatial stationarity:

(3.1.2) [Homogeneity] Residuals, $\varepsilon(s_i)$, are identically distributed at all locations $s_i$.

(3.1.3) [Isotropy] The joint distribution of distinct residuals, $\varepsilon(s_i)$ and $\varepsilon(s_j)$, depends only on the distance between locations $s_i$ and $s_j$.

These assumptions are loosely related to the notion of isotropic stationarity for point processes discussed in Section 2.5 of Part I. But here we focus on the joint distribution of random variables at selected locations in space, rather than point counts in selected regions of space. To motivate the present assumptions in the context of our example, observe first that while zooplankton concentrations at any point of time may differ between locations, it can be expected that the range of possible concentration levels over time will be quite similar at each location. More generally, the Homogeneity assumption asserts that the marginal distributions of these concentration levels are the same at each location.

To appreciate the need for such an assumption, observe first that while it is in principle possible to take many depth measurements at each location and employ these samples to estimate location-specific distributions of each random variable, this is generally very costly (or not even feasible). Moreover, the same is true of most spatial data sets, such as the set of total rainfall levels or peak daily temperatures reported by regional weather stations on a given day. So in terms of the present example, one typically has a single set of depth measurements, $D(s_i)$, i = 1,..,n, and hence only a single joint realization of the set of unobserved residuals, $\varepsilon(s_i)$, i = 1,..,n. Thus, without further distributional assumptions, it is impossible to say anything statistically about these residuals. From this viewpoint, the fundamental role of the Homogeneity assumption is to allow each joint realization, $\varepsilon(s_i)$, i = 1,..,n, to be treated as multiple samples from a common population that can
be used to estimate parameters of this population.

The Isotropy assumption is very similar in spirit. But here the focus is on statistical dependencies between distinct random variables, $\varepsilon(s_i)$ and $\varepsilon(s_j)$. For even if their marginal distributions are known, one cannot hope to say anything further about their joint distribution on the basis of a single sample. But in the present example it is reasonable to assume that if a given cloud of zooplankton in Figure 3.2 covers location $s_i$, then it is very likely to cover locations $s_j$ which are sufficiently close to $s_i$. Similarly, for locations that are very far apart, it is reasonable to suppose that clouds covering $s_i$ have little to do with those covering $s_j$. Hence the Isotropy assumption asserts more generally that similarities between concentration levels at different locations depend only on the distance between them. The practical implication of this assumption is that all pairs of residuals, $\varepsilon(s_i)$ and $\varepsilon(s_j)$, separated by the same distance, $h = \|s_i - s_j\|$, must exhibit the same degree of dependency. Thus a collection of such pairs can in principle provide multiple samples to estimate the degree of statistical dependency at any given distance, h. A second advantage of this Isotropy assumption is that it allows simple models of local spatial dependency to be formulated directly in terms of this single distance parameter. So it should be clear that these two assumptions of spatial stationarity do indeed provide a natural starting point for the desired statistical model of residuals.

[3, continued] of such microscopic creatures. See for example Stanton, T.K. and D. Chu (2000) "Review and recommendations for the modelling of acoustic scattering by fluid-like elongated zooplankton: euphausiids and copepods", ICES Journal of Marine Science, 57: 793-807.

But before proceeding, it should also be emphasized that while these assumptions are conceptually appealing and analytically useful, they may of
course be wrong. For example, it can be argued in the present illustration that locations in shallow depths (Figure 3.1) will tend to experience lower concentration levels than locations in deeper waters. If so, then the Homogeneity assumption will fail to hold. Hence more complex models involving nonhomogeneous residuals may be required in some cases. [4] As a second example, suppose that the spatial movement of zooplankton is known to be largely governed by prevailing ocean currents, so that clouds of zooplankton tend to be more elongated in the direction of the current. If so, then spatial dependencies will depend on direction as well as distance, and the Isotropy assumption will fail to hold. Such cases may require more complex anisotropic models of spatial dependencies. [5]

3.2 Covariance Stationarity

In many cases the assumptions above are stronger than necessary. In particular, most of our subsequent analyses will assume that residuals are multinormally distributed, as discussed further in a later section below. Since these joint distributions are determined entirely by their means and covariance structures, it suffices to assume stationarity of first and second moments. More formally, a spatial stochastic process, $\{Y(s): s \in R\}$, is said to be covariance stationary if and only if the following two conditions hold for all $s_1, s_2, v_1, v_2 \in R$:

(3.2.1)  $E[Y(s_1)] = E[Y(s_2)]$

(3.2.2)  $\|s_1 - s_2\| = \|v_1 - v_2\| \;\Rightarrow\; \mathrm{cov}[Y(s_1),Y(s_2)] = \mathrm{cov}[Y(v_1),Y(v_2)]$

[4] For example, it might be postulated that the variance of $\varepsilon(s)$ depends on the unknown true depth, d(s), at each location s. Such nonstationary formulations are complex, and beyond the scope of these notes.

[5] This possibility will be touched on in the discussion of anisotropic variograms in Section 7 below.

These conditions can be stated more compactly by observing that (3.2.1) implies the existence of a common mean value, $\mu$, for all random variables. Moreover, (3.2.2) implies that covariance depends only on distance, so that for each
distance, h, and pair of locations $s, v \in R$ with $\|s - v\| = h$ there exists a common covariance value, C(h), such that $\mathrm{cov}[Y(s),Y(v)] = C(h)$. Hence process $\{Y(s): s \in R\}$ is covariance stationary if and only if (iff) the following two conditions hold for all $s, v \in R$:

(3.2.3)  $E[Y(s)] = \mu$

(3.2.4)  $\|s - v\| = h \;\Rightarrow\; \mathrm{cov}[Y(s),Y(v)] = C(h)$

Note in particular from (3.2.4) that since $\mathrm{var}[Y(s)] = \mathrm{cov}[Y(s),Y(s)]$ by definition, and since $\|s - s\| = 0$, it follows that these random variables must also have a common variance, $\sigma^2$, given by

(3.2.5)  $\mathrm{var}[Y(s)] = C(0) = \sigma^2$,  $s \in R$

At this point it should be noted that while the most important application of covariance stationarity for our purposes will be to model residual distributions, the above definition is given in terms of an arbitrary spatial stochastic process, $\{Y(s): s \in R\}$. But (3.2.3) together with (1.2.1) imply that every covariance stationary process can be written as

(3.2.6)  $Y(s) = \mu + \varepsilon(s)$

so that each such process is associated with a unique residual process, $\{\varepsilon(s): s \in R\}$. Moreover, since $\mathrm{cov}[Y(s),Y(v)] = \mathrm{cov}[\varepsilon(s),\varepsilon(v)] = E[\varepsilon(s)\varepsilon(v)] - E[\varepsilon(s)]E[\varepsilon(v)]$, and since $E[\varepsilon(s)] = E[Y(s)] - \mu = 0$, we see that $\{\varepsilon(s): s \in R\}$ must satisfy the following more specialized set of conditions for all $s, v \in R$:

(3.2.7)  $E[\varepsilon(s)] = 0$

(3.2.8)  $\|s - v\| = h \;\Rightarrow\; E[\varepsilon(s)\varepsilon(v)] = C(h)$

These are the appropriate covariance stationarity conditions for residuals that correspond to the stronger Homogeneity (3.1.2) and Isotropy (3.1.3) conditions in Section 3.1 above. [6]

Note finally that even these assumptions are too strong in many contexts. For example, as mentioned above, it is often convenient to relax the isotropy condition implicit in (3.2.4) and (3.2.8). See for example BG, p.162 and Cressie, Sections 2.2.1 and 2.3. However, we shall treat only the isotropic case, (3.2.3)-(3.2.4), and shall use these assumptions throughout.

[6] At this point it should be noted that many standard references focus on a weaker form of stationarity in which the Isotropy assumption is relaxed by requiring that covariances depend only on the difference between locations, i.e., that for all $h = (h_1, h_2)$, $s - v = h \Rightarrow \mathrm{cov}[Y(s),Y(v)] = C(h)$. So the covariogram

3.3 Covariograms and Correlograms

Note
that since the above covariance values, C(h), are unique for each distance value, h, in region R, they define a function, C, of these distances which is designated as the covariogram for the given covariance stationary process. [7] But as with all random variables, the values of this covariogram are only meaningful with respect to the particular units in which the variables are measured. Moreover, unlike mean values, the values of the covariogram are actually in squared units, which are difficult to interpret in any case. Hence it is often more convenient to analyze dependencies between random variables in terms of (dimensionless) correlation coefficients. For any stationary process, $\{Y(s): s \in R\}$, the (product moment) correlation between any Y(s) and Y(v) with $\|s - v\| = h$ is given by the ratio:

(3.3.1)  $\rho[Y(s),Y(v)] = \dfrac{\mathrm{cov}[Y(s),Y(v)]}{\sqrt{\mathrm{var}[Y(s)]}\,\sqrt{\mathrm{var}[Y(v)]}} = \dfrac{C(h)}{\sqrt{C(0)}\,\sqrt{C(0)}} = \dfrac{C(h)}{C(0)}$

which is simply a normalized version of the covariogram. Hence the correlations at every distance, h, for a covariance stationary process are summarized by a function called the correlogram for the process:

(3.3.2)  $\rho(h) = \dfrac{C(h)}{C(0)}$

[6, continued] in this more general setting is two-dimensional, and allows for directional differences in covariances. See for example Cressie (1993, p.40) and Waller and Gotway (2004, p.273).

[7] To be more precise, if the set of all distances associated with pairs of locations in region R is denoted by $h(R) = \{h : \|s - v\| = h \text{ for some } s, v \in R\}$, then the covariogram, C, is a numerical function on h(R).

ESE 502, Tony E. Smith

SIMPLE EXAMPLE OF THE EFFECTS OF SPATIAL AUTOCORRELATION

Given the simple spatial model,

(1)  $Y(s) = \mu + \varepsilon(s)$,  $s \in \{s_1,..,s_n\}$

suppose it is assumed that $\varepsilon(s_i) \sim_{iid} N(0, \sigma^2)$, so that in matrix form we have

(2)  $Y = \mu 1_n + \varepsilon$,  $\varepsilon \sim N(0, \sigma^2 I_n)$

with $1_n = (1,..,1)'$. Then it is well known that the sample mean estimator,

(3)  $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$

is the minimum variance linear unbiased estimator for $\mu$, and in particular that this minimum variance is given by

(4)  $\mathrm{var}(\bar{Y}_n) = \sigma^2/n$

But suppose that in reality there is positive spatial autocorrelation among the residuals in (2), so that in fact

(5)  $\mathrm{cov}(\varepsilon) = \begin{pmatrix} \mathrm{cov}(\varepsilon_1,\varepsilon_1) & \cdots & \mathrm{cov}(\varepsilon_1,\varepsilon_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(\varepsilon_n,\varepsilon_1) & \cdots & \mathrm{cov}(\varepsilon_n,\varepsilon_n) \end{pmatrix} = \begin{pmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{pmatrix}$
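with $\sigma_{ij} > 0$ for many distinct ij pairs. Then, as in expression (4.10.3) of the NOTEBOOK, it follows that since $\mathrm{cov}(Y) = \mathrm{cov}(\varepsilon)$, the true variance of $\bar{Y}_n$ is given by

(6)  $\mathrm{var}(\bar{Y}_n) = \mathrm{var}\left(\frac{1}{n}\sum_i Y_i\right) = \frac{1}{n^2}\left[\sum_i \mathrm{var}(Y_i) + \sum_i \sum_{j \ne i} \mathrm{cov}(Y_i,Y_j)\right] = \frac{1}{n^2}\left[n\sigma^2 + \sum_i \sum_{j \ne i} \sigma_{ij}\right] = \frac{\sigma^2}{n} + \frac{1}{n^2}\sum_i \sum_{j \ne i} \sigma_{ij}$

which together with the positive spatial dependencies shows that

(7)  $\mathrm{var}(\bar{Y}_n) \gg \dfrac{\sigma^2}{n} \;\Rightarrow\; \sigma_{\bar{Y}_n} \gg \dfrac{\sigma}{\sqrt{n}}$

Hence if we consider, say, a 95% confidence interval for the true mean, $\mu$, then the actual interval is given by

(8)  $CI_{actual} = \bar{Y}_n \pm (1.96)\,\sigma_{\bar{Y}_n}$

rather than the assumed interval,

(9)  $CI_{assumed} = \bar{Y}_n \pm (1.96)\,\sigma/\sqrt{n}$

So for any given estimate, $\bar{y}_n$, this implies from (7) that:

[Figure: Assumed CI versus Actual CI]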
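To get a feel for the size of the effect in (6) and (7), the following MATLAB sketch compares the assumed and actual variances of the sample mean for one hypothetical covariance matrix. The equally spaced locations on a line, the exponential form $\sigma_{ij} = \sigma^2 e^{-h_{ij}/r}$, and the values of n, sigma2 and r are all assumptions made for illustration; the example above does not specify a covariance model.

```matlab
% Variance of the sample mean with and without spatial autocorrelation.
n      = 25;                      % number of sample locations (assumed)
sigma2 = 1;                       % common variance sigma^2 (assumed)
r      = 3;                       % range of the exponential covariance (assumed)

s = (1:n)';                       % equally spaced locations on a line (assumed)
H = abs(bsxfun(@minus, s, s'));   % n-by-n matrix of distances |s_i - s_j|
Sigma = sigma2 * exp(-H / r);     % cov(e): sigma^2 on the diagonal, sigma_ij > 0 off it

var_assumed = sigma2 / n;             % expression (4): the iid case
var_actual  = sum(Sigma(:)) / n^2;    % expression (6): (1/n^2) * sum of all cov(Y_i,Y_j)

% Half-widths of the 95% confidence intervals in (9) and (8)
hw_assumed = 1.96 * sqrt(var_assumed);
hw_actual  = 1.96 * sqrt(var_actual);

fprintf('assumed var = %.4f, actual var = %.4f\n', var_assumed, var_actual);
fprintf('assumed CI half-width = %.4f, actual = %.4f\n', hw_assumed, hw_actual);
```

Because every off-diagonal entry of Sigma is positive under this assumed model, the actual variance necessarily exceeds sigma2/n, so the assumed interval is too narrow. Increasing r (stronger dependence) widens the gap, while letting r approach zero recovers the independent case.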