Popular in Course
Popular in Electrical Engineering
This 98 page Class Notes was uploaded by Dorris Borer on Monday September 28, 2015. The Class Notes belongs to ESE502 at University of Pennsylvania taught by T.Smith in Fall. Since its upload, it has received 28 views. For similar materials see /class/215445/ese502-university-of-pennsylvania in Electrical Engineering at University of Pennsylvania.
Reviews for INTRTOSPATIALANALYSIS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/28/15
ESE 502 Tony E Smith BIAS OF REGRESSION ESTIMATOR FOR RHO To establish that the regression estimator 3 of 3 is biased suppose that the actual residual values it in the relation 1 u qu8 are observable Then we can estimate 3 by a nointercept regression to obtain u39W39u u39W39Wu 2 But from 1 it then follows that u39W39que pu39W39Wu u39W39e 3 p u39W Wu u39W Wu u39W Wu u39W39e Wu39e Wu39g quotaquot J P P p WW lqullz lqullllgll lqull 3 corrWue Now if 3 gt 0 then it is clear from 1 that corrue gt 0 and corruWu gt 0 so that one must almost surely have corrWu e gt 0 which can in fact be shown Hence by the positivity ofthe ratio Wu I it follows that that gt 0 so that A 8 J 4 Ep 3 corrWueE gt 3 lqull Thus 3 is biased upwards Moreover this bias does not disappear even for large sample sizes so that this result also shows that 3 is not even a consistent estimator of p Finally since 3 tends to make 3 look too large for 3 gt 0 it tends to in ate the signi cance of spatial autocorrelation the worst of all possible cases ESE 502 Tony E Smith But Moran s I tends to mitigate this effect To see this note rst that u39W39u u39W39u IllVullZ A J 5 I p llullz lqullz Hull Hull Next recall that row normalized matrices W act to average the components of u and thus tend to reduce their variance so that 6 E lqullz lt E llullz This in turn can be shown to imply that E II2 lt l and hence that A qu2 A 7 EU E p M ltEp Hence Moran s I tends to reduce the in ated values of 3 and thus provides a more stable indicator of spatial autocorrelation ESE 502 Tony E Smith VARIANCE OF MLESTIMATES Given a random sample x x1xn from a normal distribution N daz consider the simplest case of maximumlikelihood estimation of u with 02 known 1 quot ll 72 2 Lltrlxazgt1n 2 WW ill ll U nln axg zlxi M2 So by solving the rstorder condition for u we obtain dL n n 0 E ZHW taxi l l A1 l l gt Z xi nd O gt u gz xi xn Hence the variance of this estimate is given by vari 02 n But by taking the second derivative of L with respect to a notice that dZL n A dZL 1 2 gt varu dJ2 More generally it is always true that The variance of any maximumlikelihood estimator is given by the inverse of the expected value of minus the second derivative of L evaluated at its maximum NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis 4 K Function Analysis of Point Patterns In the Bodmin Tors example above notice from Figure 3l4a p20 that the clustering structure is actually quite different from that of the Redwood Seedling example in Figure 312a p12 Rather than small isolated clumps there appear to be two large groups of points in the northwest and southwest separated by a large empty region Moreover the points within each group are actually quite evenly spaced locally dispersed These observations suggest that the pattern of tors exhibits different structures at di erent scales Hence the objective of the present section is to introduce a method of point pattern analysis that takes such scale effects into account and in fact allows scale to become a fundamental variable in the analysis 41 Wolf Pack Example To motivate the main ideas we begin with a new example involving wolf packs A map is shown in Figure 41a below representing the relative locations of wolf packs in a portion of the Central Arctic Region in 19981 The enlarged portion in Figure 4lb is a schematic map depicting individual wolves in four of these packs o o 0 39 Wolf acks o o P o o o o o O o 0 50 km 0 Fig41a Map of Wolf Packs Fig41b Enlarged Portion At the level of individual wolf locations in Figure 4lb there is a pattern of isolated clumps that bears a strong resemblance to that of the Redwood seedlings above2 Needless to say this pattern would qualify as strongly clustered But if one considers the larger map in Figure 4la a different picture emerges Here the dominant feature is the remarkable dispersion of wolf packs Each pack establishes a hunting territory large enough for its survival roughly 15 to 20 km in diameter and actively discourages other 1 This map is based on a more detailed map published in the Northwest Territories Wolf Notes Winter 199899 See also httpwwu W m ilr ife m er m We 1 7 1 139 39 wulmuteswolf32htm 2 The spacing of individual wolves is of course exaggerated to allow a representation at this scale ESE 502 l4 l Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis packs from invading its territory3 Hence this pattern of wolf locations is very clustered at small scales and yet very dispersed at large scales But if one were to analyze this wolflocation pattern using any of the nearestneighbor techniques above it is clear that only the small scale clustering would be detected Since each wolf is necessarily close to other wolves in the same dens the spacing between dens would never be observed In this simple example one could of course rede ne wolf dens to be aggregate points and analyze the spacing between these aggregates at a larger scale But there is no way to analyze multiple scales using nearest neighbors without some form of reaggregation4 42 K Function Representations To capture a range of scales in a more systematic way we now consider what amounts to an extension of the quadrat or cellcount method discussed in section 1 above In particular recall that the quadrat method was criticized for being too dependent on the scale of individual cells Hence the key idea of Kfunctions is to turn this dependency into a virtue by explicitly incorporating scale as a variable in the analysis Thus rather than fixing the scale and locations of cell grids we now consider randomly sampled cells of varying sizes While many sampling schemes of this type can be defined we shall focus on the single most basic scheme which is designed to answer the following question for a given point process with density 1 What is the expected number of point events within distance h from any randomly sampled point event Note that this expected number is not very meaningful without specifying the point density 1 since it will of course increase with 1 Hence if we divide by l in order to eliminate this obvious density effect then the quantities of interest take the form 42 1 K h E number of additional events within distance h of an arbitrary event If we allow the distance or scale h to vary then expression 421 is seen to define a function of h designated as a K function5 As with nndistances these values K h yield information about clustering and dispersion In the wolfpack example above if one were to define K h with respect to small distances h around each wolf in Figure 41b then given the close proximity to other wolves in the same pack these values would surely be too high to be consistent with CSR for the given density of wolves in this area Similarly if one were to define K h with respect to much larger distances h around each wolf in Figure 41a then given the wide spacing between wolf packs and the relative uniformity of wolfpack sizes6 these values would surely be too low to be 3 Since wolves are constantly on the move throughout their hunting territories the actual locations shown in Figure la are roughly at the centers of these territories 4 One could also incorporate larger scales by using higherorder nearest neighbors as discussed for example in Ripley 1996 sec62 But these are not only more complex analytically they are difficult to associate with specific scales of analysis 5 This concept was popularized by the work of Ripley 19761977 Note also that following standard convention we now denote distance by h to distinguish it from nndistance d 6 Wolf packs typically consist of six to eight wolves see the references in footnote 1 above ESE 502 I42 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis consistent with CSR for the given density of wolves Hence if one can identify appropriate benchmark values for K h under CSR then these Kfunctions can be used to test for clustering and dispersion at various scales of analysis We shall consider these questions in more detail in Section 44 below But for the moment there are several features of de nition 421 that warrant further discussion First while the distance metric in 421 is not speci ed we shall always refer to Euclidean distance dsv between pairs of points as de ned expression 321 above Hence with respect to any given point event s the expected number of point events within distance h of s is simply the expected number of such events a circle of radius h about s as shown in Figure 42 below 1K0 Expected Number of Points in here Fig42 Interpretation of K h This graphical image helps to clarify several additional assumptions implicit in the de nition of K h First since this value is taken to depend only on the size of the circle ie the radius h and not its position ie the coordinates of s there is an implicit assumption of spatial stationarity as in expression 251 above In other words it is assumed that the expected number of additional points in this circle is the same regardless of where s is located This assumption will later be relaxed in our Monte Carlo applications of Kfunctions Observe next that the circularity of this region implicitly assumes that direction is not important and hence that the underlying point process is isotropic as in Figure 22 above On the other hand if the point process of interest were to exhibit some clear directionality such as the vertical directionality in shown in Figure 23 above then it might be more appropriate to use directional ellipses as de ned by weighted Euclidean distances of the form 422 dsv lwls1 v12w2s2 vz2 where the weights w1 and w2 re ect relative sensitivities of point counts to movements in the horizontal or vertical direction respectively7 More generally if the relevant point 7 One can also use appropriate quadratic forms to define anisotropic distances with any desired directional orientations We shall consider such distances in more detail in the analysis of spatial variograms in Part 11 of this NOTEBOOK ESE 502 143 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis events occur in specific environments such as the patterns of Philadelphia housing abandonments in Figures 14 and 15 then the relevant distances might be determined by these environments such as travel distance on the Philadelphia street system Finally it is important to emphasize that the expected value in 421 is a conditional expected value In particular given that there is a point event 3 at the center of the circle in Figure 42 above this value gives the expected number of additional points in this circle This can be clari ed by rewriting K h in terms of conditional expectations In particular if as in Section 321 above we now denote the circle in Figure 42 minus its center by 423 Ch sveR0ltdvsSh then K n can be written more precisely as follows 424 Kh ENCh slNs1 To see the importance of this conditioning recall from expression 234 that for any stationary process not just CSR processes it must be true that the expected number of points in Ch s is simply proportional to its area ie that 425 ECh s 10Ci S But this is not true of the conditional expectation above Recall from the wolfpack case for example that for small circles around any given wolf the expected number of additional wolves is much larger than what would be expected based on area alone ieis larger than laCh s These ideas will be developed in more detail in Section 44 where it is shown that such deviations from simple area proportionality form the basis for all Kfunction tests of the CSR Hypothesis 43 Estimation of K Functions Given this general de nition of Kfunctions as conditional expected values we now consider the important practical question of estimating these values To do so we introduce the following notation analyzing for point counts For any given realized point pattern Sn SI i1n and pair of points 313 6 Sn we now denote the Euclidean distance between them by 431 dU dslsj and for any distance h define the indicator function 1h for point pairs in Sn by 8 Here it should be noted that tools are available in the spatial analyst extension of ARCMAP for 39 t i an b 4 t p 4 distances However we shall not do so in this NOTEBOOK ESE 502 144 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis 1 dysh 432 Ihdy1hdswsj o dUgth From this de nition it follows at once that for any given point 31 6 Sn the total number of IhdU Hence ifz39 now refers to a randomly selected point generated by a point process on R and if both the number and locations of points in R are treated as random variables then in terms of 432 the Kfunction in 421 above can now be given the following equivalent definition additional points s within distance h of s is given by the sum EM 433 Kh Emma Observe also that for stationary point processes the value of K h must be independent of the particular point event 139 chosen So multiplying through by A in 433 and summing over all point events 139 l n in region R it follows that 434 EZj lhdljlKhiln 2 ZIEZJ II1alUanh 3 K01 Z1EZ 11hdy This pooled version of K h motivates the following pooled estimate of K h designated as the sample K functz39on 435 160 2 medj where again A n aR 9 The advantage of this estimator is that uses all points of the given realized point pattern Sn in R To interpret 1301 note that if we rewrite 435 as 436 m 3212 WM then the expression in brackets is seen to be simply an average of the relevant point counts for each of the pattern points 31 e Squot Hence if the underlying process were truly stationary and edge effects were small then this sample Kfunction would be 9 At this point it should be noted that our notation differs from BG where regions are denoted by a script R with area R Here we use R for region and make the area function aR explicit In these terms 435 is seen to be identical to the estimate on the top of p 93 in BG where liln aRn2 ESE 502 145 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis approximately unbiased and reasonably efficient as an estimator of the common expected point count ESE11 dU in 43310 However since this idealization can never hold exactly in bounded regions R it is necessary to take into account the edge e ects created by the boundary of R Unlike the case of nndistances where the expected values of nndistances are increased for points near the boundary as in Figure 316 the expected value of point counts are reduced for these points as shown in Figure 43a below o o R R O O O O O Fig43a Edge Effects for K h Fig43b Ripley s Correction To counter this downward bias Ripley 1976 proposed a corrected version of 435 that is quite effective in practice His correction consists of weighting each point s in the count EM1 dU in a manner that in ates counts for points near the boundary If one considers the circle about s1 passing through s as shown in Figure 43b and defines WU to be the fraction of its circumference that lies inside R then the appropriate reweighting of s in the count for s is simply to divide 1h d by w j producing a new estimate known as Ripley s correction Mdy WU 437 Kh ZLZM One can gain some intuition here by observing in Figure 43b that weights will be unity unless circle about 31 passing through 3 actually leaves R So only those point pairs will be involved that are close to the boundary of R relative to distance h Moreover the closer that s is to the edge of R the more of this circumference is outside R and a hence the smaller wU becomes This means that values 1h dU wy are largest for points closest 10 For further discussion of this approximate unbiasedness see Ripley 1977 Section 6 ESE 502 146 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis to the edge thus in ating If h to correct the bias An explicit derivation of Ripley s correction in given in Section 6 of the Appendix to Part 1 It should be emphasized that while Ripley s correction is very useful for estimating the true Kfunction for a given stationary processes this is usually not the question of most interest As we have seen above the key questions relate to whether this process exhibits structure other than what would be expected under CSR and how this structure may vary as the spatial scale of analysis is increased Here it turns out that in most cases Ripley s correction is not actually needed Hence this correction will not be used in the analysis to follow 44 Testing the CSR Hypothesis To apply Kfunctions in testing the CSR Hypothesis it is convenient to begin by ignoring edge effects and considering the nature of Kfunctions under this hypothesis for points 3 ER and distances h that are not in uenced by edge effects Hence in contrast to Figure 43a above we now assume that the set oflocations Ch within distance h of s is entirely contained in R ie that 441 Ch veRdsvShgR Next recall from the basic 39 J J J quot about 39 quot quot 39 point locations in CSR processes Section 22 above that for such processes the expected number ofpoints in C h s does not dependent on whether or not there is a point event at s so that 442 ENCh S l NS 1 ENCh Sl Hence from expression 423 together with the area formula for circles and the fact that aCh s aCh it follows that 443 ENCh s l Ns l haCh s haCh tzz39h2 which together with expression 424 yields the following simple Kfunction values 444 Kh Thus by standardizing with respect to density 1 and ignoring edge effects as in 441 we see that the Kfunction reduces simply to area under the CSR Hypothesis Note also that when K h gt 7239h2 this implies a mean point count higher than would be expected under CSR and hence indicates some degree of clustering at scale has illustrated in 11 Readers interested in estimating the true Kfunction for a given process are referred to Section 843 in Cressie 1993 and to the additional references found therein ESE 502 147 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis Section 42 above Similarly a value K h lt 39l lz implies a mean point count lower than would be expected under CSR and hence indicates some degree of dispersion at scale n Thus for any given h gt 0 Kh gt 39l lz 3 clustering at scale h 445 Kh lt 39l lz 3 dispersion at scale h While these relations are adequate for testing purposes area values are difficult to interpret directly Hence it usually convenient to further standardize Kfunctions in a manner that eliminates the need for considering these values If for each h we let 446 Lh h 7239 then under CSR this L function has the property that 39l lz 447 Lh hh h0 7239 for all h 2 0 In other words this associated Lfunction is identically zero under CSR Moreover since Lh is an increasing function of K h it follows that Lh is positive exactly when K h gt 39l lz and is negative exactly when K h lt 39l lz Hence the relations in 445 can be given the following simpler form in terms of Lfunctions Lh gt 0 3 clustering at scale h 448 Lh lt 0 3 dispersion at scale h Given the estimate 1amph in 437 above one can estimate Lh by 449 h1l h 7239 and can in principle use 448 to test for clustering or dispersion 45 Bodmin Tors Example We can apply these testing ideas to Bodmin by using the MATLAB program kfuncti0nm The first few lines of this program are shown below ESE 502 148 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis function C kfunctionlocareabextent KFUNCTION computes the k Function for a point pattern and plots the normalized L Function without edge corrections Written by TONY E SMITH 112601 INPUTS i loc le of locations xiyi i1m ii area area of region iii b number of bins to use in CDF and plot iv extent 1 if max h half of max pairwise distance typical case 2 if max h max pairwise distance to be considered DATA OUTPUTS C 1b vector containing raw Point Count SCREEN OUTPUTS Plot of L Function over the speci ed extent To apply this program again open the data file Bodminmat and recall that the tor locations are given in the matrix Bodmin As seen above the program first computes 1amph for a range of distance values h and then coverts this to h and plots these values against the reference value of zero The maximum value of h for this illustration is chosen to be the maximum pairwise distance between pattern points tors listed as option 2 in input iv above The number of intermediate distance values bins to be used is speci ed by input iii Here we set b 20 Hence to run this program type gtgt kfunctionBodminarea202 The resulting plot is shown in Figure 44 u to the right Here the horizontal line H indicates the theoretical values of Lh 392 g under the CSR Hypothesis So it would appear that there is some degree of clustering at small scales h However 5 recall that the above analysis was predicated on the assumption of no edge effects Since there are clearly strong edge effects in the Bodmin case the real h question here is how to incorporate these effects in a manner that will allow a meaningful test of CSR Fig44 Bodmin L function ESE 502 149 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis One approach is suggested by recalling that random point pattern for Bodmin was also generated in Figure 314b above Hence if the Lfunction for such a random pattern is plotted then this can serve as a natural benchmark against which to compare the L function for tors This random pattern is contained in the matrix B0dm2 of data le B0dminmat and is also shown again in Figure 47 below Hence the corresponding command kfuncti0nB0drn2area202 now yields a comparable plot of this benchmark Lfunction as shown in Figure 45 below u u 2 2 L 3 3 L 5 5 Relative Clustering 5 m m 2 L 5 39a m 2 1 1 5 5 2 a s m 2 a s a u h h Fig45 Random L function Fig46 L function Overlay Here it is clear that the Lfunction for this random pattern is not at but rather is everywhere negative and decreases at an increasing rate Hence relative to zero this pattern appears to exhibit more and more dispersion as the scale increases The reason for this of course is that the theory above and expression 441 in particular ignores those points near the boundary of the Bodmin region such as the point shown in Figure 47 Here it is clear that for sufficiently small scales h there is little effect on h so that values are close to zero for small h But as this radius increases it is also clear that most of the circle is eventually outside of R and hence is mostly empty Thus given the estimated point density 1 for Bodmin tors inside R point counts for large h start to look very small relative to the area 39l lz This is precisely the effect that Ripley s correction expression 437 attempts to eliminate12 Fig47 BOdmill Edge Effe 12 A nice comparison of Ripley s correction with uncorrected Lfunctions such as in Figure 4 above is given in Figure 815 of Cressie 1993 p617 ESE 502 14 10 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis But if we now ignore the zero reference line and use this random Lfunction as a benchmark then a perfectly meaningful comparison can be made by overlaying these two Lfunctions as in Figure 46 above Here one can see that the region of relative clustering is now considerably larger than in Figure 44 and occurs up to a scale of about h 8 see the scale shown in Figure 314 But observe even these benchmark comparisons have little meaning at scales so large that circles of radius h around all pattern points lie mostly outside the relevant region R For this reason the commonly accepted ruleof thumb is that for any given point pattern Sn one should not consider hvalues larger that half the maximum pairwise distance between pattern points Hence if we now denote the maximum pairwise distance for Sn by hmax maxalslsj s s 6 Sn and use h to indicate the largest value of h to be considered in a given case then the standard ruleof thumb is to set 451 hhm2 This corresponds to option 1 for input iv of kfuncti0n above and option 2 correspond to h hmax We shall have occasion to use 18 in many of our subsequent analyses and in fact this will usually denote the default value of h A more important limitation of this benchmark comparison is that like the JMPIN version of the ClarkEvans test in Section 331 above the results necessarily depend on the random point pattern that is chosen for a benchmark Hence we now consider a much more powerful testing procedure using Monte Carlo methods 46 Monte Carlo Testing Procedures As we saw in Section 35 above it is possible to use Monte Carlo methods to estimate the sampling distribution of nndistances for any pattern size in a given region of interest This same idea extends to the sampling distribution of any statistics derived from such patterns and is of sufficient importance to be stated as a general principle SIMULATION PRINCIPLE To test the CSR Hypothesis for any point pattern Squot of size n in a given region R one can simulate a large number of random point patterns S i 1N of the same size anal compare Sn with this statistical population Essentially this simulation procedure gives us a clear statistical picture of what realized patterns from a CSR process on R should look like In the case of Kfunction tests of CSR we first consider the standard application of these ideas in terms of simulation envelopes This method is then refined in terms of a more explicit Pvalue representation ESE 502 I4ll Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis 461 Simulation Envelopes The essential idea here is to simulate N random patterns as above and to compare observed estimate h with the range of estimates LAlh 139 lN obtained from this simulation More formally if one de nes the lower envelope and upper envelope functions respectively by 461 LNh minLAlhi 1N 462 UNUI maXLAlhi 1N then h is compared with LNh and UNh for each h So for a given observed pattern Sn in region R the steps of this Monte Carlo testing procedure can be outlined as follows i Generate a number of random patterns Sf z39 lN of size n in region R say N 99 ii Choose a selection of hvalues H hlh2 and compute LAlh for each heH and ilN iii Form the lower and upper envelope functions and LNh and UNh in 461 and 462 iv Plot the Lvalues h for the observed pattern Sn along with the upper and lower values UNh and LNh for each h e H The result of this procedure is to yield a plot similar that shown in Figure 48 to the right Here the blue region indicates the area in which the observed Lfunction is outside the range defined by the upper and lowerenvelope functions In the case shown this area is above the envelope indicating that there is significant clustering relative to the simulated population under CSR F ig48 Simulation Envelope The key difference between this gure and Figure 46 above is that rather than a single benchmark pattern we now have a statistical population of patterns for gauging the ESE 502 14 12 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis signi cance of This plot in fact summarizes a series of statistical tests at each scale of analysis heH In the case illustrated if we consider any h under the blue area in Figure 48 then by de nition hgtUNh But if pattern Sn were just another sample from this population of random patterns then every sample value LAhlA hLANh would have the same chance of being the biggest So the chance that h is the biggest is only lN l More formally if pattern Sn is consistent with the CSR Hypothesis then A 1 463 PrLh gt UNhl m a h E H A 1 464 PrLh lt LNhl m a h E H These probabilities are thus seen to be precisely the P values for onetailed tests of the CSR Hypothesis against clustering and dispersion respectively For example if N 99 as in step i above then the chance that h gt UNh is only l99l 01 Hence at scale h one can infer the presence of signi cant clustering at the Ol level Similarly if there were any h e H with h lt LNh in Figure 48 then at this scale one could infer the presence of signi cant dispersion at the Ollevel Moreover higher levels of signi cance could easily be explored by simulating larger numbers of random patterns say N 999 This Monte Carlo test can be applied to the Bodmin example by using the MATLAB program kfuncti0nsimm shown below function kfuncti0nsimlocareabextentsimspoly KFUNCTIONSIM computes the k Function for a point pattern plus N random point patterns for a single polygon and plots the normalized L Function plus Upper and Lower envelopes INPUTS i 10c le of locations xiyi i1n ii area area of region iii b number of bins to use in CDF and plot iv extent 2 if max h max pairwise distance to be considered 1 if max b half of max pairwise distance typical case v sims number of simulated random patterns Vi poly polygon boundary le ESE 502 1413 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis Note that the two key additional inputs are the numbers of simulations here denoted by sims rather that N and the boundary le poly for the region R As with the program clustsim in Section 35 above poly is needed in order to generate random points in R To apply this program to Bodmin with sims 99 be sure the data le B0dminmat in open in the Workspace and write gtgt kfuncti0nsimB0dminarea20199B0dpoly The results of this program are shown in Figure 49 to the right Notice first that 5 there is again some clustering and that 7 now it can be inferred that this clustering is signi cant at the Ollevel N99 VHS Notice also that the range of signi cant clustering is considerably smaller that that depicted in Figure 46 above This will almost always be the case since 72 here the h values must be bigger that 25 99 other random values rather than just one benchmark value Notice also that this scale roughly 15 S h S 45 appears F3949Bd39E l Tt to be more consistent with Figure 314a lg 0 mm nve Ope es However this approach is still rather limited in the sense that it provides information only about the relation of h to the maximum and minimum simulated values UNh and LNh for each h e H Hence the following re nement of this approach is designed to make fuller use of the information obtained from the above Monte Carlo procedure 462 Full P Value Approach By focusing on the maximum and minimum values UNh and LNh for each h EH the only Pvalues that can be obtained are those in 463 and 464 above But it is clear for example that values of h that are just below UNh are probably still very signi cant Hence a natural extension of the above procedure is to focus directly on Pvalues for clustering and dispersion and attempt to estimate these values on the basis of the given samples Turning rst to clustering the appropriate Pvalue is given by the answer to the following question If the observed pattern were coming from a CSR process in region R then how likely it would be to obtain a value as large as h To answer this question let the observed L value be denoted by l0 h and h denote the Lvalue at scale h obtained from a randomly sampled CSR pattern of size n on R Then the answer to the above question let the random variable LCSR ESE 502 14 14 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis is given formally by the probability that LCSR h is at least as large as lo which we designate as the clustering P value P h at scale h for the observed pattern Sn lustered 465 Emirz h PrlLCSRUl 2 10 To estimate this probability observe that our simulation has by construction produced a sample of N realized values A LAlh i lN of this random variable LCSRh Moreover under the CSR Hypothesis the observed value lo is just another sample which for convenience we designate as sample i0 Hence the task is to estimate 465 on the basis of a random sample lolllN of size Nl The standard approach to estimating event probabilities is simply to count the number of times the event occurs and then to estimate its probability by the relative frequency of these occurrences In the present case the relevant event is LCSR h 2 l0 Hence if we now de ne the indicator variables for this event by l 21 1 466 51 3901N 0 0 all z then the relative equency estimator 15mm 01 h of the desired Pvalue is given by13 A N 467 Pclusterea h PrLCSRh 2 lo ZPO 60 ll To simplify this expression observe that if m lo denotes the number of simulated samples ilN that are at least as large as 10 ie with 50Al then this estimated Pvalue reduces to14 A m t 1 468 P h L NH Observe that expression 463 above is now the special case of 468 in which h happens to be bigger than all of the N simulated values But 468 conveys a great deal more information For example suppose that N 99 and that h is only the fifth highest among these N l values Then in Figure 49 this value of h would be inside the envelope probably much closer to UNh than to LNh But no further information could be gained from this envelope analysis However in 468 the estimated the chance of observing a value as large as h is 599l 05 so that 13 This is also the maximumlikelihood estimator of Pmm h Such estimators will be considered in more detail in Part III of this NOTEBOOK 14 An alternative derivation of this Pvalue is given in Section 7 of the Appendix to Part I ESE 502 1415 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis this Lvalue is still sufficiently large to imply some signi cant degree of clustering Such examples show that the Pvalues in 468 are considerably more informative than the simple envelopes above Turning next to dispersion the appropriate Pvalue is now given by the answer to the following question If the observed pattern were coming from a CSR process in region R then how likely it would be to obtain a value as small as h The answer to this question is given by the dispersion P value 13mde h at scale h for the observed pattern Sn 469 Pmspm h M CSR 00310 Here if we let m l0 denote the number of simulated Lvalues that are no larger than l0 then exactly the same argument above with respect to the event LCSR h S l0 now shows that the appropriate relative frequency estimate of Pd h is given by 15mm m7l0l 4610 Pimpmwp NH To apply these concepts observe first that unless many l values are the same as l0 15 it must be true that 15mm hm l ISEZWW h So there is generally no need to compute both Hence we now focus on clustering Pvalues 155mm h for a given point pattern Sn in region R Observe next that to determine 155mm h there is no need to use L values at all One can equally well order the Kvalues In fact there is no need to normalize by 1 since this value is the same for both the observed and simulated patterns Hence we need only compute raw Kfunction values as given by the bracketed part of expression 436 Finally to specify an appropriate range of scales to be considered we take the appropriate maximum value of h to be the default value h hum2 in 451 and specify a number b of equal divisions of h The values of Adm h are then computed for each of these h values and the result is plotted This procedure is operationalized in the MATLAB program kc0untplotm This program will be discussed in more detail in the next section So for the present we simply apply this program to Bodmin with B0dminmat in the Workspace by setting N 99 b 20 and writing gtgt kc0untplotB0dmin 99201B0dp01y 15 The question of how to handle such ties is treated more explicitly in Section 7 of the Appendix to Part I ESE 502 1416 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis Simply ignore the fourth input 1 for the present The screen output of kcountplot gives the value of Z computed by the program which in this case is DmaxZ 86859 The minimum pairwise distance between all pairs of points Dmin 05203 is also shown This value is useful for interpreting Pvalues at small scales since all values of h less that this minimum must have 1511 0 and hence must be maximally dispersed by de nition since no simulated pattern can have smaller values of 1301 The cluster Pvalue plot for Bodmin is shown in Figure 410 With respect to as signi cant clustering there is seen to m be general agreement with the results m of the envelope approach above Here we see signi cant clustering at the 05 level denoted by the lower dashed red line for scale values in the range m 13 S h S 61 remember that one will obtain slightly different values for each simulation But this gure clearly 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 shows more In particular clustering at scales in the range 17 S h S 57 is now seen to be signi cant at the 01 level Fig410 Bodmin Cluster P Values which by de nition the highest level of signi cance possible for N 99 P clustered P Values 3 3 Here it is also worth noticing that the clustering Pvalue at scale h 5 is so large in fact 93 in the above simulation that it shows weakly signi cant dispersion where the upper dashed red line indicates signi cant dispersion at the 05 level The statistical reason for this can be seen from the screen output that shows the minimum distance between any two tors to be 52 Hence at scale h 5 it must be true that no circle of radius 5 about any tor can contain other tors so that we must have K 5 0 But since random point patterns such as in Figure 3 14b often have at least one pair of points this close together it becomes clear that there is indeed some genuine local dispersion here Further re ection suggests that is probably due to the nature of rock outcroppings which are often only the exposed portion of larger rock formations and thus cannot be too close together So again we see that the Pvalue map adds information about this pattern that may well be missed by simply visual inspection 47 Nonhomogeneous CSR Hypotheses As mentioned in Section 24 above it is possible to employ the Generalized Spatial Laplace Principle to extend CSR to the case of nonhomogeneous reference measures 16 Simulations with N 999 yield about the same results as Figure 410 so this appears to be a more accurate range than given by the envelope in Figure 49 ESE 502 1417 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis While no explicit applications are given in BG we can illustrate the main ideas with the following housing abandonment example 471 Housing Abandonment Example As in the Philadelphia example of Section 12 above suppose that we are given the locations of n currently abandoned houses in a given city R such as in Figure 4lla below City Boundary 9 Fig411a Abandoned Houses Fig411b Census Tract Data In addition suppose that data on the number of housing units HI pC1 in each census tract C1 i lm within city R is also available as in Figure 4llb If the number of total housing units in the city is denoted by 471 H mm ELMO 21H then the probability that a randomly sampled housing unit will be located in tract i is given by 472 13 5 pC ilm H mm Thus if these n housing abandonments were completely random events ie with no housing unit more likely to be abandoned than any other then one would expect the distribution of abandoned houses across census tracts to be given by n independent random samples from the distribution in 472 More formally this is an example of a nonhomogeneous CSR hypothesis with respect to a given reference measure p 17 In particular this would yield a marginal distribution of abandonments in each tract Q given by the binomial distribution in expression 243 above with C C ESE 502 1418 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis 472 Monte Carlo Tests of Hypotheses To test such hypotheses we proceed exactly the same way as in the homogeneous case The only real difference here is that the probability distributions corresponding to nonhomogeneous spatial hypotheses are somewhat more complex Using the above example as an illustration we can simulate samples of n random abandonments from the appropriate distribution by the following two stage sampling procedure i Randomly sample a census tract C11 from the distribution in 472 ii Randomly locate apoint s1 in C11 iii Repeat i and ii n times to obtain a point pattern S5 sf j l n The resulting pattern Si corresponds to the above hypothesis in the sense that individual abandonment locations are independent and the expected number of abandonments in each tract C j is proportional to the reference measure H j pC However this reference measure p is only an approximation to the theoretical measure since the actual locations of individual housing units are not known This is typical of situations where certain key spatial data is available only at some aggregate level18 Hence in step ii the location of a housing units in C1 is taken to be uniformly homogeneously distributed throughout this subregion The consequences of this local uniformity approximation to the ideal reference measure p will be noted in the numerical examples below Given a point pattern Sn s j j ln such as the locations of n abandonments above together with N simulated patterns S i lN from the Monte Carlo procedure above we are now ready to test the corresponding nonhomogeneous CSR hypothesis based on this reference measure p To do so we can proceed exactly as before by constructing Kcounts 1amph for the observed pattern Sn over a selected range of scales h and then constructing the corresponding Kcounts KWUI for each simulated pattern i lN This procedure is operationalized in the same MATLAB program kcountplot which is more general than the Bodmin application above Here the only new elements involve a partition of region R into subregions C i lm together with a specification of the appropriate reference measure p defined on this set of subregions 18 Such aggregate data sets will be treated in more detail in Part III of this NOTEBOOK ESE 502 1419 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part1 Spatial Paint Pattern Analysis 473 Lung Cancer Example To illustrate this testing procedure the following example has been constructed from the LarynX and Lung Cancer example of Section 12 above Here we focus only on Lung Cancer and for simplicity consider only a random subsample of n 100 lung cases as shown in Figures 412 below Fig412 Subsample of Lung Cases Fig413 Random Sample of Same Size Note from Figures 17 and 18 that this is fairly representative of the full data set 917 lung cancers To analyze this data set we begin by observing that in terms of area alone the point pattern in Figure 412 is obviously quite clustered One can see this by comparison with a typical random pattern of the same size in Figure 413 This can be verified statistically by using the program kfuncti0nplot as in the Bodmin case to conduct a Monte Carlo test for the homogenous case developed above The results are shown in Figure 414 to the right Here it is evident that there is extreme clustering In fact note from the scale in Figure 412 above that there is highly significant clustering up to a radius of h20km which is large enough to encompass the entire region Notice also that the significance levels here are as high as possible for the given number of simu P Value I snun mnuu 15mm zunun h Fig414 Test of Homogeneous Clustering lations which in this case was N 999 This appears to be due to the fact that the overall pattern of points in Figure 412 is not only more clustered but is also more compact So for the given common point density in these figures cell counts centered at pattern points in Figure 412 tend to be uniformly higher than in Figure 413 ESE 502 1420 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis But the single most important factor contributing to this clustering as observed in Section 24 above is the conspicuous absence of an appropriate reference measure 7 namely population In Figure 415 below the given subsample of lung cases in Figure 412 above is now depicted on the appropriate population backcloth of Figure 18 Fig415 Subsample of Lung Cases Fig416 Random Sample from Population Here it is clear that much of the clustering in Figure 412 can be explained by variations in population density Notice also that the relative sparseness of points in the west and east are also explained by the lower population densities in these areas especially in the east For comparison a random pattern generated using the twostage sampling procedure above is shown in Figure 416 Here there still appears to be somewhat less clustering than in Figure 415 but the difference is now far less dramatic than above Using these parish population densities as the reference measure p a Monte Carlo test was run with N999 n7 simulated patterns including the one shown in Figure 416 The results of this test are plotted in Figure 417 to the 5 right Notice that the dramatic results of Figure 414 above have all but disappeared There is now only significant clustering at the local scale m with hSka Moreover even this local clustering appears to be an artifact of the spatial aggregation inherent in the 1mm m m atbu Adm tbu Brim mbu Esau gu39m 1 an parish population density measure p h meters P Value As pointed out above this aggregation leads to simulated point patterns under the nonhomogeneous CSR hypothesis that tend to be much too homogeneous at the parish level This is particularly evident in the densely populated area of the south central portion of the region shown Here the tighter clustering of lung cancer cases seen in Figure 415 more accurately re ects local variations in population density than does the relatively uniform scattering of points in Figure 416 So in fact a more Fig417 Test of Nonhomogeneous Clustering ESE 502 1421 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis disaggregated representation of population density would probably show that there is no significant clustering of lung cancer cases whatsoever 48 Local K Function Analysis Up to this point we have only considered global properties of point patterns namely the overall clustering or dispersion of patterns at various scales However in many cases interest focuses on more local questions of where signi cant clustering or dispersion is occurring Here we begin by constructing local versions of Kfunctions and then apply them to several examples 481 Construction of Local K Functions Recall from expression 433 that Kfunctions were de ned in terms of expected point counts for a randomly selected point in a pattern But exactly the same de nitions can be applied to each individual point in the pattern by simply modifying the interpretation of 433 to be a given point i rather than a randomly sampled point and rewriting this expression as a local K function for each point i 2 1101 Moreover if we now relax the stationarity assumption used in 434 above then these expected values may differ for each point i In this context the pooled estimator 435 for the stationary case now reduces to the corresponding local estimator 211505 481 K101 1E 482 16101 Hence to determine whether there is signi cant clustering about point i at scale h one can develop local Monte Carlo testing procedures using these statistics 482 Local Tests of Homogeneous CSR Hypotheses In the case of homogenous CSR hypotheses one can simply hold point 1 xed in region R and generate N random patterns of size n l in R corresponding to the locations of all other points in the pattern Note that in the present case 482 is simply a count of the number of points with distance h of point i scaled by 1 But since this scaling has no effect on Monte Carlo tests of signi cance one can focus solely on point counts which may be thought of as a raw Kfunction For each random pattern one can then simply count the number of points within distance h of point i Finally by comparing these counts with the observed point count one can then generate p values for each point i l n and distance h paralleling 468 above m1 hl 483 15h NH 1 ESE 502 1422 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis where mlh now denotes the number of simulated patterns with counts at distance h from 139 at least as large as the observed count This testing procedure is operationalized in the MATLAB program kcountlocm shown below function PValC0 kcountloclocsimsDMpoly KCOUNTLOC computes the raw K function at each point in the pattern loc for a range of distances D and allows tests of non homogeneous CSR hypotheses by including a set of polygons poly with reference measure M INPUTS i loc population location le lociXi Yii1N ii sims number of simulations iii D set of distance values in ASCENDING order iv M k vector of measure values for each of k polygons v poly matrix describing boundaries of k polygons Here the main output Pval is a matrix of P values at each reference point and each distance value under the CSR Hypothesis The point counts for each pointdistance pair are also in the output matrix C0 Notice that since homogeneity is simply a special case of heterogeneity this program is designed to apply both to homogeneous and nonhomogeneous CSR hypotheses Application to Bodmin Tors The homogeneous case can be illustrated by the following application to Bodmin tors Recall that the location pattern of tors is given by the matrix Bodmin in the workspace Bodminmat Here there is a single boundary polygon Bodpoly Hence the reference measure can be set to a constant value say M 1 So the appropriate command for 999 simulations in this case is given by gtgt PvalC0 kcountlocBodmin999D1Bodpoly In view of Figure 410 above one expects that the most meaningful distance range for significant clustering will be somewhere between h 1 and h 5 kilometers Hence the selected range of distances was chosen to be D 12345 One key advantage of this type of local analysis is that since a pvalue is now associated with each individual point is now possible to map the results In the present case the results of this Monte Carlo analysis were imported to ARCMAP and are displayed in Bodminmxd In Figure 418 below the pvalue maps for selected radii of h 235 km are shown As ESE 502 1423 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis seen in the legend lower right corner of the gure the darker red values correspond to lower pvalues and hence denote regions of more signi cant clustering As expected there are basically two regions of signi cant clustering corresponding to the two large groupings of tors in the Bodmin eld h2km h3km h5km PVALU ES 0001 0005 0005 0010 0010 0050 0050 0100 0100 0999 Figure 418 Cluster P Values for Bodmin Tors 00000 Notice here that clustering is much more pronounced at a radius of 3 km than at smaller or larger radii The red circle in the gure shows the actual scale of a 3 km radius This gure well illustrates the ability of local Kfunction analyses to pick up sharper variations in scale than global analyses such as Figure 410 above where there appeared to be equally signi cant clustering at all three scales h 23 5 km Hence it should be clear from this example that local analyses are often much more informative than their global counterparts Local Analyses with Reference Grids The ability to map pvalues in local analyses suggests one additional extension that is often more appropriate than direct testing of clustering at each individual point By way of motivation suppose that one is studying a type of tree disease by mapping the locations of infected trees in a given forest Here it may be of more interest to distinguish diseased regions from healthy regions in the forest rather than to focus on individual trees A simple way to do so is to establish a reference grid of locations in the forest and then to estimate clustering pvalues at each grid location rather than at each tree The construction of reference grids is detailed in Section 483 below Such a uniform grid of pvalues can then be easily interpolated to produce a smoother visual ESE 502 1424 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis summary of disease clustering An illustration of this referencegrid procedure is shown in Figure 419 below where the red dots denote diseased trees in the section of forest shown and where the white dots are part of a larger grid of reference points In this illustration the diseasedtree count within distance h of the grid point shown is thus equal to 4 Figure 419 Reference Grid for Local Clustering Assuming that the forest itself is reasonably uniform with respect to the spatial distribution of trees the homogeneous CSR hypothesis would again provide a natural benchmark for identifying significant clustering of diseased trees In this case one would simulate random patterns of diseased trees and compare disease counts with those observed within various distances h of each grid point Hence those grid points with low pvalues at distance h would denote locations where there is significant disease clustering at scale h To develop the details of this procedure it is convenient to construct a reference grid representation for Bodmin so that the two approaches can more easily be compared To do so we start by constructing a reference grid for Bodmin By inspecting the boundary of Bodmin in ARCMAP one can easily determine a box of coordinate values just large enough to contain all of Bodmin In the present case appropriate bounding Xvalues and Yvalues are given by Xmin 52 Xmax 95 Ymin 115 and Ymax 83 Next one needs to choose a cell size for the grid as exemplified by the spacing between white dots in Figure 419 One should try to make the grid f1ne enough to obtain a good interpolation of the pvalues at grid points Here the value of 5 km was chosen for spacing in each direction yielding square cells with dimensions Xcell 5 Ycell ESE 502 1425 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis The construction of the corresponding reference grid is operationalized in the program gridf0rmm with the command gtgt ref gridf0rmXminXmaxXcellYminYmaxYcell This produces a 2column matrix ref of grid point coordinates The upper left comer of the grid is displayed on the screen for a consistency check A plot of the full grid Full Grid Masked Grid Figure 420 Reference Grid for Bodmin ref is shown on the left in Figure 42019 In Section 8 of the Appendix to Part I a procedure is developed for obtaining this full grid representation directly in MATLAB While all of these grid points are used in the calculation those outside of the Bodmin boundary are only relevant for maintaining some degree of smoothness in the interpolation constructed below On the right these grid points have been masked out in order to display only those points inside the Bodmin boundary The construction of such visual masks is quite useful for many displays and is discussed in detail in Section 124 of Part IV in this NOTEBOOK Given this reference grid ref the extension of kc0untlocm that utilizes ref is operationalized in the MATLAB program kc0untlocrefm This program is essentially identical to kc0untlocm except that ref is a new input Here one obtains pvalues for Bodmin at each reference point in ref with the command 19 Notice that the right side and top of the grid extend slightly further than the left and bottom This is because the Xmax and Ymax values in the program are adjusted upward to yield an integral number of cells of the same size ESE 502 1426 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis gtgt PvalC0 kc0untlocrefB0dminref999D1B0dpoly where the matrix Pval now contains one pvalue for each grid point in ref and distance radius in D The results of this Monte Carlo simulation were exported to ARCMAP and the pvalues at each grid point inside Bodmin are displayed for h 3 km on the left in Figure 421 below again with a mask By comparing this with the associated point PVA LU ES 0001 0002 0002 0005 0005 001 001 002 002 005 005 010 010 020 020 100 Figure 421 Interpolated P Values for Bodmin plot in the center of Figure 418 one can see that this is essentially a smoother version of the results depicted there However this representation can be considerably improved upon by interpolating these values using any of number of standard smoothers discussed further in Part II The interpolation shown on the right was obtained by the method known as ordinary kriging This method of stochastic interpolation will be developed in detail in Section of Part II in this NOTEBOOK 483 Local Tests of Nonhomogeneous CSR Hypotheses Next we extend these methods to the more general case of nonhomogeneous CSR hypotheses As with all spatial Monte Carlo testing procedures the key difference between the homogeneous and nonhomogeneous cases is the way in which random points are generated As discussed in Section 472 above this generation process for the nonhomogeneous case amounts to a two stage sampling procedure in which a polygon is rst sampled in a manner proportional to the given reference measure M and then a random location in this polygon is selected Since this procedure is already incorporated into both the programs kc0untlocm and kc0untlocrefm above there is little need for further discussion at this point ESE 502 1427 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis By way of illustration we now apply kc0untlocrefm to a Philadelphia data set which includes 500 incidents involving inter group con ict IGC situations such as housing discrimination that were reported to the Community Service Division of the Philadelphia Commission on Human Relations from 19951996 This data set is discussed in more detail in the project by Amy Hillier on the ESE 502 class web page The locations of these 500 incidents are shown on the left in Figure 422 below and are also displayed in the map document Philigcmxd in ARCMAP Here the natural null hypothesis would be that every individual has the same chance of reporting an inciden But as with the housing abandonment example in Figure 411 above individual location data is not available Hence census tract population levels ACTUAL IGC INCIDENTS RANDOM IGC INCIDENTS Figure 422 Comparison with IGC Random Incidents will be used as an approximation to individual locations so that the relevant reference measure is here taken to be population by census tract with corresponding population densities shown in green in Figure 422 The relevant nonhomogeneous CSR hypothesis for this case is thus simply that the chance of any incident occurring in a given census tract is proportional to the population of that census tract Under this hypothesis a typical realization of 500 random IGC incidents is shown on the right Here it is clear that incidents are more clustered in areas of high population density such as in West Philadelphia and South Philadelphia So clusters of actual data on the ESE 502 1428 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis left are only signi cant if they are more concentrated than would be expected under this hypothesis Hence even though there is clearly a cluster of cases in South Philadelphia it is not clear that this is a signi cant cluster Notice however that the Kensington area just Northeast of Center City does appear to be more concentrated than would be expected under the given hypothesis But no conclusion can be reached on the basis of this visual comparison Rather we must simulate many realizations of random patterns and determine statistical signi cance on this basis To do so a reference grid for Philadelphia was constructed and is shown with masking on the left in Figure 423 below in a manner similar to Figure 420 above Here a range of distances was tried and clustering was most apparent at a radius of 500 meters in a manner similar to the radius of 3 km in Figure 418 above for the Bodmin example The pvalue results for this case are contained in the MATLAB workspace PVALU ES 0000 0 001 0001 0 005 0005 0100 0100 0 200 0200 1 000 PVALUE CONTOURS REFERENCE GRID Figure 423 P Value Map for ICG Clustering philigcmat and were obtained using kc0untlocrefm with the command gtgt PvalC0 kc0untlocreflocref999Dp0pbnd Here 10c contains the locations of the 500 IGC incidents ref is the reference grid shown above D contains a range of distances including the 500meter case 0 and pop 20 The actual coordinates for this map were in decimal degrees so that the value 005 corresponds roughly to 500 meters ESE 502 1429 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis contains the populations of each census tract with boundaries given by bnd These results were imported to ARCMAP as a point le and are displayed as P valshp in the data frame PValues for Dist 005 of Philigcmxd Finally these pvalues were interpolated using a different smoothing procedure than that of Figure 421 above Here the spline interpolator in Spatial Analyst was used together with the contour option The details of this procedure are described in Section 8 of the Appendix to Part I21 Here the red contours denote the most signi cant areas of clustering which might be interpreted as IGC hotspots Notice in particular that the dominant hotspot is precisely the Kensington area mentioned above Notice also that the clustering in West Philadelphia for example is now seen to be explained by population density alone and hence is not statistically signi cant It is also worth noticing that there is a small hotspot just to the west of Kensington toward the Delaware River that appears hard to explain in terms of the actual IGC incidents in Figure 422 The presence of this hotspot is due to the fact that while there are only four incidents in this area the population density is less than a quarter of that in the nearby Kensington area So this incidence number is usually high given the low density This raises the practical question of how many incidents are required to constitute a meaningful cluster While there can be no de nitive answer to this question is important to emphasize that statistical analyses such as the present one should be viewed as providing only one type of useful information for cluster identi cation 21 Notice also that this contour map of Pvalues is an updated version of that in the graphic header for the class web page That version was based on only 99 simulations run on a slower machine 22 This same issue arises in regression where there is a need to distinguish between the statistical significance of coefficients relative to zero and the practical significance of their observed magnitudes in any given context ESE 502 1430 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis 2 Models of Spatial Randomness As with most statistical analyses cluster analysis of point patterns begins by asking What would point patterns look like if points were randomly distributed This requires a statistical model of randomly located points 21 Spatial Laplace Principle To develop such a model we begin by considering a square region S on the plane and divide it in half as shown on the left in Figure 21 below l4 l4 l2 12 gt gt o o o gt l4 l4 S Fig 21 Spatial Laplace Principle The Laplace Principle of probability theory asserts that if there is no information to indicate that either of two events is more likely then they should be treated as equally likely ie as having the same probability of occuring1 Hence by applying this principle to the case of a randomly located point in square S there is no reason to believe that this point is more likely to appear in either left half or the identical right half So these two mutually exclusive and collectively exhaustive events should have the same probability 12 as shown in the figure But if these halves are in turn divided into equal quarters then the same argument shows that each of these four occupancy events should have probability l4 If we continue in this way then the square can be divided into a large number of n grid cells each with the same probability 1 n of containing the point Now for any subregion or cell C CS the probability that C will contain this point is at least as large as the sum of probabilities of all grid cells inside C and similarly is no greater that the sum of probabilities of all cells that intersect C Hence by allowing n to become arbitrarily large it is evident that these two sums will converge to the same limit 7 namely the fractional area of S inside C Hence the probability PrC S that a random point in S lies in any cell C CS is proportional to the area of C 2 aC 211 PrCt S aS Finally since this must hold for any pair of nested regions C C R C S it follows that3 1 This is also known as Laplace s Principle of Insufficient Reason 2 This argument in fact simply repeats the construction of area itself in terms of Riemann sums as for example in Bartle 1975 section 24 3 Expression 212 refers to equation 2 in section 21 This convention will be followed throughout ESE 502 121 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis PrCl S aCaS PrR S aRaS 2 PrC R and hence that the square in Figure 21 can be replaced by any bounded region R in the plane This fundamental proportionality result which we designate as the Spatial Laplace Principle forms the basis for almost all models of spatial randomness 212 PrCl S PrC RPrRl S 2 PrC R In probability terms this principle induces a uniform probability distribution on R describing the location of a single random point With respect to any given cell C C R it convenient to characterize this event as a Bernoulli binary random variable X C where XC 1 if the point is located in C and XC 0 otherwise In these terms it follows from 212 that the conditional probability of this event given that the point is located in R must be 213 PrXC1lR aCaR so that PrXC OlR1 PrXC1lR1 aCaR 22 Complete Spatial Randomness In this context suppose now that n points are each located randomly in region R Then the second key assumption of spatial randomness is that the locations of these points have no in uence on one another Hence if for each i ln the Bernoulli variable X C now denotes the event that point i is located in region C then under spatial randomness the random variables X C i ln are assumed to be statistically independent for each region C This together with the Spatial Laplace Principle above de nes the fundamental hypothesis of complete spatial randomness CSR which we shall usually refer to as the CSR Hypothesis Observe next that in terms of the individual variables X C the total number of points appearing in C designated as the cell count N C for C must be given by the random sum 221 NC ZLXI C It is this additive representation of cell counts that in fact motivates the Bernoulli 01 characterization of location events above Note in particular that since the expected ESE 502 122 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis value of a Bernoulli random variable X is simply PX 1 4 it follows from the linearity of expectations that the expected number of points in C must be 222 ENCl nR ZIEXIClR 21PrXIC1lR n aC aC n EMU quot aR aRaC Finally it follows from expression 213 that the under the CSR Hypothesis the sum of independent Bernoulli variables in 221 is by de nition a Binomial random variable with distribution given by 223 PrNC k nR 1 7 k 01n For most practical purposes this conditional cell count distribution for the number of points in cell C CR given that n points are randomly located in R constitutes the basic probability model for the CSR Hypothesis 23 Poisson Approximation However when the reference region R is large the exact specification of this region and the total number of points n it contains will often be of little interest In such cases it is convenient to remove these conditioning effects by applying the wellknown Poisson approximation to the Binomial distribution To motivate this fundamental approximation in the present setting imagine that you are standing in a large tiled plaza when it starts to rain Now consider the number of rain drops landing on the tile in front of you during the first ten seconds of rainfall Here it is evident that this number should not depend on either the size of the plaza itself or the total number of raindrops hitting the plaza Rather it should depend on the intensity of the rainfall 7 which should be the same everywhere This can be modeled in a natural way by allowing both the reference region plaza R and the total number of points raindrops landing in the plaza n to become large in such a way that the expected density of points intensity of rainfall in each unit area remains the same In our present case this expected density is given by 212 as 231 Mm Hence to formalize the above idea now imagine an increasing sequence of regions R1 CR2 C CRm C and corresponding point totals n1 lt n2 lt lt nm lt that expand such a way that the limiting density 4 By definition EX Zxx px 1 pl 0 p0 pl ESE 502 123 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis n 232 7tlim 7 n R lim quot mam m m mam 61mm exists and is positive Under this assumption it is shown in the Appendix Section 1 that the Binomial probabilities in 223 converge to simple Poisson probabilities 233 PrNCkl A Wig C39er W k012 Morover by 222 and 232 the expected number of points in any given cell plaza tile C is now given by 234 ENC MC where density 7 becomes the relevant constant of proportionality Finally if the set of random variables N C describing cellcounts for every cell of nite area in the plane is designated as a spatial point process on the plane then any process governed by the Poisson probabilities in 233 is designated as a spatial Poisson process on the plane Hence when extended to the entire plane the basic model of complete spatial randomness CSR above corresponds precisely to a spatial Poisson process 24 Generalized Spatial Randomness The basic notion of spatial randomness above was derived from the principle that regions of equal area should have the same chance of containing any given randomly located point More formally this Spatial Laplace Principle asserts that for any two subregions cells C1 and C2 in R 241 aC1 aC2 2 PrXC1 1 R PrXC2 1 l R However as was noted in the Housing Abandonment example above simple area may not always be the most relevant reference measure backcloth In particular while one can imagine a randomly located abandonned house such houses are very unlikely to appear in the middle of a public park let alone the middle of a street So here it makes much more sense to look at the existing housing distribution and to treat a randomly located abandonned house as a random sample from this distribution Here the Laplace principle is still at work but now with respect to houses For if housing abandonments are spatially random then each house should have that same chance of being abandonned Similarly in the Larynx cancer example if such cancers are spatially random then each individual should have the same chance of contracting this disease So here the existing population distribution becomes the relevant reference measure ESE 502 124 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis To generalize the above notion of spatial randomness we need only replace area with the relevant reference measure say pC which may be the number of houses in C or the total population of C As an extension of 24 1 above we then have the following Generalized Spatial Laplace Principle For any two subregions cells C1 and C2 in R 242 pC pC 2 PrXC1R PrXC1R If 24 l is now replaced by 242 then one can essentially reproduce all of the results above There is only one techicality that needs to be mentioned The basic Laplace argument in Figure 21 above required that we be able to divide the square S into any number of equalarea cells Hence the simplest way to extend this argument is to assume that the relevant reference measure p is absolutely continuous in the area measure a Here it suffices to assume that the relevant reference measure can be modeled in terms of a density function with respect to area 5 So if housing or population is the relevant reference measure then we model this in terms of a housing density population density with respect to area Given this assumption exactly the same arguments leading to 223 now show that 243 PrNC kl nR ilc 1 7 k 01n Similarly if we now let L7lR n pR and again assume the existence of limiting positive density 7 244 A 1imW MnmRm 1imW m pRm as the reference region becomes larger then the same argument for 9 in Section 1 of the Appendix now shows that 245 PrNCkm xii C39We mc k012 Spatial point processes governed by Poisson probabilities of this type ie with non uniform reference measures are often referred to as nonhomogeneous spatial Poisson processes Hence we shall often refer to this as the nonhomogeneous CSR Hypothesis 5 More formally it is assumed that there is some density function f on R such that p is the integral of f ie such that for any cell C C R pC Cfxdx ESE 502 125 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANAL Y SIS Part 1 Spatial Point Pattern Analysis 25 Spatial Stationarity Finally we consider a number of weaker versions of the spatial randomness model that will also prove to be useful First observe that some processes may in fact be Laplace like in the sense that they look the same everywhere but may not be completely random A simple example is provided by the cell centers in Figure 11 of Section 1 above Here one can imagine that if the microscope view were shifted to the left or right on the given cell slide the basic pattern of cell centers would look very similar Such point processes are said to be stationary To make this notion more precise it is convenient to think of each subregion C C R as a window through which one can see only part of larger point process on all of region R In these terms the most important notion of stationarity for our purposes is one in which the processs seen in C remains the same no matter how we move this window Consider for example the pattern of trees in a large rainforest R part of which is shown in Figure 22 below Here again this pattern is much too dispersed to be completely random but nonetheless appears to be the same everywhere Suppose that the relevant subregion C under study corresponds to the small square in the lower left In these terms the appropriate notion of stationarity for our purposes amounts to the assumption that the cellcount distribution in C will remain the Fig22 Isotropic Stationarity Fig23 Anisotropic Stationarity same no matter where this subregion is located For example the tilted square shown in the gure is one possible relocation or copy of C in R More generally if cell C2 is simply a translation andor rotation of cell C1 then these cells are said to be geometrically congruent written C1 2C2 Hence our formal de nition of stationarity asserts that the cellcount distributions for congruent cells are the same ie that for any C1C2 CR 251 C1 EC 3 PrNC1kPrNC2k k Ol Since the directional orientation of cells make no difference this is also called isotropic stationarity There is a weaker form of stationarity in which directional variations are ESE 502 126 Tony E Smith NOTEBOOK FOR SPA TIAL DA TA ANAL Y SIS Part 1 Spatial Point Pattern Analysis allowed ie in which 251 is only required to hold for cells that are translations of one another This type of anisotropic stationarity is illustrated by the tree pattern in Figure 23 where the underlying point process tends to produce vertical alignments of trees more like an orchard than a forest Here the variation in cell counts can be expected to differ depending on cell orientation For example the vertical cell in Figure 23 is more likely to contain extreme point counts than its horizontal counterpart We shall see a similar distinction made for continuous stationary processes in Part II of this NOTEBOOK One basic consequence of both forms of stationarity is that mean point counts continue to be proportional to area as in the case of complete randomness ie that 252 ENC x aC where 9 is again the expected point density ie expected number of points per unit area To see this note simply that the basic Laplace argument in Figure 11 of Section 1 depends only on similarities among individual cells in uniform grids of cells But since such cells are all translations of one another it now follows from 251 that they all have the same cellcount distributions and hence have the same means So by the same argument above with cell occupancy probabilities now replaced by mean point counts it follows that such mean counts must gain be proportional to area Thus while there can be many types of statistical dependencies between counts in congruent cells as in the dispersed tree patterns above the expected numbers of points must be the same in each One final point should be made about stationarity This concept implicitly assumes that the reference region R is sufficiently large to ensure that the relevant cells C never intersect the boundary of R Since this rarely happens in practice the present notion of stationarity is best regarded as a convenient f1ction For example suppose that in the rain forest illustrated in Figure 22 above there is actually a lake as shown in Figure 24 below In this case any copies of the given vertical cell that lie in the lake will of course contain no trees More generally those cells that intersect that lake are likely to have fewer trees such as the tilted cell in the figure Here it is clear that condition 251 cannot possibly hold Such violations of 251 are often referred to as edge e ects Fig24 Actual Landscape Fig25 Stationary Version ESE 502 127 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Point Pattern Analysis Here there are two approaches that one can adopt The rst is to disallow any cells that intersect the lake and thus to create a buf er zone around the lake While this is no doubt effective it has the disadvantage of excluding some points near the lake If the forest R is large this will probably make little difference But if R is small say not much bigger than the section shown then this amounts to throwing away valuable data An alternative approach is to ignore the lake altogether and to imagine a stationary version of this landscape such as that shown in Figure 25 Here there are seen to be more points than were actually counted in this cell So the question is then how to estimate these missing points A method for doing so known as Ripley s correction will be discussed further in Section 43 below ESE 502 128 Tony E Smith ESE 502 Tony E Smith GEODA DIAGNOSTICS FOR SPATIAL REGRESSION 1 R Square Let the residuals of the spatial regression be denoted by 1 5 81539 y 7 y1y39 J31539 and let the error sum of squares ESS and total sum of squares TSS be given by 2 ESS 5395 3 TSS y rmy r1 then in GEODA 4 R square l i781 07386 for Eire example But if the model sum of squares MSS is denoted by 5 MSS rim 51 then the fraction of TSS that is explained by the model is 6 R2 Ang 05733 for Eire example So which do we choose 2 Akaike Information Criterion If the log likelihood of the model estimates is denoted by 7 L L l yX 480591 for Eire example where 191 19 is the vector of parameter estimates then the Akaike Information Criterion AIC is given by 8 AIC 2 Lk 24809513 102118 for Eire example in GEODA ESE 502 Tony E Smith Note In the present case d AO 1012 2 so this should be k 4 Intuitively AIC is an adjusted log likelihood that is penalized for the number of parameter tted in a manner analogous to adjusted R Square 3 Breusch Pagan Test of Heteroscedasticity The BreuschPagan Test considers heteroscedastic variance models of the form OZ 9 0392 02a0 a399q 3 0 a0 a399q where x l 9cm xlk is the vector of explanatory variables plus intercept for observation i and 0392 vara The appropriate null hypothesis is then H 0 a 0 To test this hypothesis observe that since varel E 812 the residual 12 constitutes a one sample estimate of 03912 If the mean square error is denoted by 9 s2 39 ESS then under H 0 the sample vector 52 2 10 r 1 lt gt is a natural estimate of 03912 0392039n2 0392 Hence if one regresses r on the set of explanatory variables X 1 x1 xk then signi cantly large values for the model sum of squares MSS of this regression under the null hypothesis H 0 indicate that model 9 ts better than would be expected under H 0 The appropriate Breusch Pagan is thus taken to be twice this MSS statistic SBP 11 SE MSS 7r39Xerle39r n 00743 for Eire example which can be shown to be asymptotically distributed 111 under H 0 In the Eire case this value is not suf ciently high to suggest the presence of signi cant heteroscedasticity P value 785 ESE 502 Tony E Smith KRIGING WITH ONE PREDICTOR Suppose we want to predict a at location 3 based only on one other location 3 so that 1 1 Then to nd the scalar minimum mean squared error predictor L we minimize 2 MSEO Eggs Mir E8 27v85817 2812 But this is a simple quadratic function in 7 so that 3 MSEO 2 2xc11 and dd y MSEOL 2c11gt 0 MSEO Hence the unique value of 5 is given by the first order condition 4 0 WMSEOL 2051 xcn 2 c51 ion 2 NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 12 TIPS FOR USING ARCMAP 121 Importing Text Files to ArcMap 1 You can access information about importing text les using the HELP Index by writing delimited in the window and clicking delimited text files Accessing delimited text le data 2 Among the most important information is the following ArcCatalog and ArcMap allow you to directly access data in delimited text files and work with them as tables ArcCatalog and the add data browser in ArcMap lists files with a txt asc csv or tab extension and assigns them a le type of Text File a Files with a txt asc or csv extension are interpreted as comma delimited while files with a tab extension are interpreted as tab delimited Fquot If your text le is tab delimited you should rename the extension as tab NOTE If the txt le extension is hidden as it is by default in the Lab then you must reveal this extension by the following procedure 0 Open the directory containing the text file in My Computer and in the main menu click Tools Folder Options View Now uncheck the box next to Hide le extensions for known le types You will then be able to see the txt le extension and to rename this file with a tab extension c If it is comma delimited you should make sure that all eld names are in quotes name NOTE You must NOT start a field name with a number Otherwise the le will not be properly read in either ArcMap or ArcCatalog This important detail is missing from the HELP files ESE 502 IV l2l Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 122 Changing Path Directories in Map Documents When a red appears where a map should be displayed it usually means that your maps are in the wrong directory To change all such maps with a single operation 1 Left click on the red next to the rst file in the map document that has such a mark 2 In the Data Source window that opens select the appropriate data source le in the new directory 3 Click Add and the source directories of all les in the same directory as the rst le will automatically be changed to the corresponding source le in the new directory assuming that they exist 4 If the map document contains source les in more than one directory some les will still have a red So left click on any one ofthese and repeat the above procedure 123 Making a Column of Row Numbers in an Attribute Table Here there are two procedures that one can use 1 Modify FID Numbers If the FID column of string identi ers is present and in the correct order then you can simply construct a numerical eld that adds 1 to each FID number as follows 1 Open the Attribute Table and select Options Add Field 2 Set Name ROW and leave all else as is Click OK 3 In the new column that appears with zero values right click on the ROW label and select Calculate Values Ignore any warnings 4 In the Field Calculator that opens ll in the window labeled ROW with the phrase FID 1 no quotes and click OK The column should now be lled with the desired row numbers 2 Construct an Independent Numbering An alternative procedure that is more exible is to use the utility recNumbercal in the class directory Sys502ArcviewProjectsUtiities as follows ESE 502 IV 122 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 1 Open the Attribute Table and select Options Add Field 2 Set Name ROW and leave all else as is Click OK 3 In the new column that appears with zero values right click on the ROW label and select Calculate Values Ignore any warnings 4 In the Field Calculator that opens check the Advanced box and the window label will switch to PreLogic VBA Script Code 5 Now click Load and retrieve the program Sy5502ArcviewProjectsUtiitiesrecNumber cal You will now see a small VBA program in the window written by Ianko Tchoukanski 6 V If you scroll down you will nd the following section 39adjust start value and interval below lStart 1 llnterval 1 This allows you to set the starting value and interval size as you like The default will give you consecutive integers starting at 1 But if for example you want to start at 0 as with FID numbering then you can set lStart O 7 If you now click OK then the desired values should now appear in the ROW eld 124 Masking in Arcmap The following procedure is designed to construct a masked area around a map boundary which hides unwanted contour and raster edges when displaying the results of kriging or spline interpolations This procedure will be developed in terms of boundary shape file boundaryshp It is assumed that this file is a layer called Boundary in the current data frame activated in ARCMAP It is also assumed that this data frame is part of a map document located in your home directory so that it can be edited 1 First use the Fixed Zoom Out tool to reduce the size of the map display of Boundary so that a large mask can be constructed at least twice the size of Boundary This ensures that when the map is enlarged to fill the window the edges of the rectangular mask will not be seen ESE 502 IV 123 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 2 Next be sure that the Editor toolbar is displayed View Toolbars Editor and click Editor Start Editing Then click Create New Feature on the toolbar pencil button and be sure that the layer Boundary is set as the Target 3 To create the mask use the mouse to make a large Box around the map boundary Just click at each successive comer and then right click anywhere inside the Box and select Finish Sketch on the popup menu The map should now be hidden under the colored Box Click Editor Stop Editing Save Edits 4 If you open the Attribute Table for Boundary you will see that a new row feature has been added at the bottom representing the polygonal Box just created You will now create a new shapeflle in which the complement of Boundary in the Box is also a feature a First open Arc Toolbox and click Analysis Tools Overlay Union b In the Union window that opens set Input Features Boundary c In the Output Feature Class window the default file name should be b0undaryUni0nshp Rename this to maskshp and click OK The layer Mask should now appear in the Table of Contents d Now with only Mask being displayed you should see both Boundary and the Box surrounding it Use the Select feature on the Main Menu to select the complement of the Boundary right click anywhere in the Box outside of Boundary e Open the attribute table for Mask and observe the row selected which should be the top row Click on Options at the bottom of the window and choose Switch Selection so that all other rows are now selected f Now reopen the Editor click Start Editing and in the Main Menu select Remove All that should be left is the complement of Boundary in the Box This constitutes the desired Mask g To clean up the original file open Boundary again select the row corresponding to the Box feature and click Remove Finally click Editor Stop Editing Save Edits ESE 502 IV 124 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 5 To display the Mask in a proper form double click on its colored box in the Table of Contents set Fill Color White and set the Outline Width and Outline Color to whatever you like 6 Once again display the Boundary layer and enlarge it with the Fixed Zoom In tool to ll the window Be sure that the Mask layer is positioned above it in the Table of Contents 3 Any raster or contour output from subsequent analyses of point data inside the map will now automatically be clipped at the boundary as long as those displays are positioned below the Mask layer in the Table of Contents 125 Making Spline Contours in Spatial Analyst 1 Select a pointdata layer with values X to be contoured 2 On the Spatial Analyst tool bar Click Spatial Analyst Interpolate t0 Raster Spline 3 Then in the Spline Window set 0 Z value eld X o Spline Type Regularized 0 Weight 01 close t to data 5 smoothed t All other settings are OK 4 If a raster grid of colors is OK then you are done 0 Right click on the layer which is only a temporary le and click Save as a Layer le 0 To edit the displayed number of contours or their colors right click on saved layer click Properties Symbology and edit Classes or colors 5 If you want contour curves rather than a raster grid you can now use the Surface Analysis tool as follows Select the rastergrid layer and click Spatial Analyst Surface Analysis Contour ESE 502 IV 125 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 6 In the Contour Window choose the contour interval size you want and check the Total number of contour values below to see if the number is adequate but not too dense 0 When you have made a contour map that is acceptable right click on the layer which is again a temporary le and click Save as a Layer le 126 Excluding Values from Map Displays To exclude a range of values from the map display of a given variable layer 1 Right click on the appropriate layer and click Propelties Quantities Classify Exclusion 2 In the Data Exclusion Properties dialogue 0 Double click on the variable so that it appears in the text box 0 Specify the range to be excluded such as Variable lt 100 0 Click OK 3 On the map those features points choropleth areas with value in this range will no longer be shown 127 Importing ArcMap Images to the Web Arcmap images can be imported to the web by converting them to gif les One simple procedure for doing so is the following 1 First export image from ARCMAP as an Enhanced Meta File emf 2 Open the image in POWER POINT an edit in any way desired you can even combine several images into one 3 Save in POWER POINT to the Desktop as say image using File Save as Web Page 4 You will now see a le on the Desktop imagehtm together with a folder image les ESE 502 IV 126 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 5 If you open the folder you will see one gif le with a label something like slide0001 This is the le containing the desired image open it to check 6 Now copy this le to the same local directory where you keep the copy of your web page and rename it as say imagegif 7 You can then insert this into your web page by editing the page in Netscape using File Edit Page and then using Insert Image 8 In the Image Properties window that opens just type in the name of the le imagegif and it will be inserted 9 It is often easier to control the placement of the image by rst creating a table in the web page using Table Insert Table and then inserting imagegif into an appropriate cell of the table 10 Finally you can publish the web page in the usual way by FTPing the image le to same directory on the server where your web page is kept 128 Adding Areas to Map Polygons The following instructions are adapted from the help le in ARCMAP 1 Open the attribute table of the layer containing the polygons 2 Ifthere is no field for area values add a new eld for area by clicking the Options button and selecting the new eld option Name the eld Area and set Type Double LA Click Calculate Values 4 Check Advanced V39 Type the following VBA statement in the rst text box Dim dblArea as double Dim pArea as IArea Set pArea shape dblArea pAreaarea ON Type the variable dblArea in the text box directly under the area eld name gt1 Click OK ESE 502 IV 127 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis NOTE On some old boundary les such as lymphbndshp in the Uganda example of Assignment 1 will yield a negative area because the ordering of points is reversed Here just change variable name in 7 above to Abs dblArea This will replace dblArea by its absolute value 129 Adding Centroids to Map Polygons The following instructions are adapted from the help le in ARCMAP 1 Open the attribute table of the layer containing the polygons You will make the X and Y coordinates in succession N Add a new eld by clicking the Options button and selecting the new eld option Name the eld X and set Type Double E Click Calculate Values 4 Check Advanced 9 Type or cutandpaste the following VBA statement in the rst text box Dim dle as double Dim pArea as IArea Set pArea shape dle pAreaCentroidX ON Type the variable Clle in the text box directly under the area eld name gt1 Click OK W Finally to make the Y coordinate repeat steps 2 through 7 by replacing the VBA script above with Dim dblY as double Dim pArea as IArea Set pArea shape dblY pAreaCentroidY and using the variable name dblY 1210 Adding Coordinate Fields t0 Attributes of Point Shapefiles The following instructions are adapted from the help le in ARCMAP 1 Open the attribute table of the layer containing the points You will make ESE 502 IV 128 Tony E Smith NOTEBOOKFORSPAHALDAZ4AAMLKHS Part IV Sa ware for Analysis N E 4 M gt1 W gt0 the X and Y coordinates in succession Add a new eld by clicking the Options button and selecting the new eld option Name the eld X and set Type Double Click Calculate Values Check Advanced Type or cutandpaste the following VBA statement in the rst text box Dim dle As Double Dim pPoint As IPoint set pPoint Shape dle pPointX Type the variable Clle in the text box directly under the area eld name Click OK Finally to make the Y coordinate repeat steps 2 through 7 by replacing the VBA script above with Dim dblY As Double Dim pPoint As IPoint set pPoint Shape dblY pPointY and using the variable name dblY 1211 Convelting Strings t0 Numbers in ARCMAP To convert a column of numerical stringtype data in an attribute table of ARCMAP to Double precision number 1quot N M First open the attribute table and select the column to be changed say TRACT Now add a eld to the table Options gtAdd Field Give a new name to the eld say TRACTiNUM and specify the eld as type Double Leave all other options at their default settings and click OK Now right click on the header of the new eld and select Calculate Values EE5M IV 129 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 4 To make the transformation you will use the Visual Basic function CDbl that converts Strings of numbers to Double valued data In the Field Calculator box labeled TRACTiNUM type CDblTRACT 5 Now click OK and the new set of Doublevalued data should appear in the column TRACTiNUM of the attribute table Procedure for Long Numbers The above procedure will work for integers as long as you have no more than seven digits If you have a long integer like an STFID identi cation number for a Philadelphia city block say 420171001021001 then you will not be able to reproduce this reliably as a number The format Double only allow 7 places before the decimal and Long Integer only allows 9 places So it will be recorded in scientific notation like 4 20171e014 These are not reliable for matching identifiers in join operations So if the number displayed is in scientific notation then the following additional steps can be used to correct this 6 In the Table of Contents rightclick on the layer with the Attribute Table of interest and select Propelties Fields 7 Suppose you are converting STFID to a numerical field called STFIDiNUM Then in the Layer Properties window that opens scroll down to the STFIDiNUM row and in the Number Format column click on the gray button with 3 dots 8 In the Number Format window that opens select Category Custom and select String Format 0 Click OK and in the Layer Properties window click Apply and then OK 9 If you check the Attribute Table again then the full digit number for STFIDiNUM should now be displayed NOTE 1 The same procedure can be used to convert integer strings to numerical integers by designate the data type of the column as Short Integer and using the Visual Basic function CInt in place of CDbl NOTE 2 There is also a VBA function CStr which will convert numbers to strings This may sometimes also be useful ESE 502 IV 1210 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 1212 Displaying Proper Distance Units When boundary shapeflles are in unknown units there is usually problem getting the distance measurement tool to show distance values in the same units as the unknown coordinates To accomplish this 1 Open the Properties of the given Data Frame and select Coordinate System 2 Under Select a coordinate system choose Layers and select either the boundary layer for the data frame ie with polygon features or any layer with point features Click Apply and OK 3 Now reopen Properties and select General In the Units window you should now see that both Map and Display are active 4 First set Display to Unknown Units top of the list and then set Map to Unknown Units Click Apply and OK 5 In the map display the Distance Tool should now be displaying the same units as the unknown coordinates 1213 Editing Point Styles in ARCMAP For maximum exibility in choosing colored points with or without borders the following procedure is useful For sake of concreteness suppose you wish to enlarge a portion of a point pattern for an export display and want to increase the size of dots and put a heavy black border around each for emphasis so that they will print well in WORD 1 Enlarge the desired portion in the window and click Properties Symbology 2 Now in the Layer Properties window right click on any symbol and select Properties for All Symbols 3 Select Circle 2 the lled circle and choose the size you want 4 Now click on Properties In the window that opens you will see a display of Layers to the left If you click on the border layer hollow circle you can set its properties i In the Properties section set Type Simple Marker Symbol ii Set Color No Color check Outline and Outline Color Black iii You will then see the border outline in the Preview window and can reset both the Size and Outline Size to obtain a border with the desired thickness ESE 502 IV 1211 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis around the point 5 Now click OK in each window to get back to the Layer Properties window Here you will see that points and borders have the right size but not ramp colors i To apply ramp colors if desired right click on any symbol and choose Ramp Colors The ramp color window should now display the same solid color as the points ii Click on the ramp color arrow and select the desired color ramp These colors should now appear on the point symbols 6 Click Apply and OK You should then see these point styles in the map window 7 Export this map and close the map document without saving it unless you want to keep these enlarged points as permanent choices 1214 Exporting Maps from ARCMAP t0 WORD To export a map that is currently displayed in ARCMAP use the command N E File Export Map The single best format to use for exporting is Enhanced Meta File emf This will give you good reproduction quality and will allow you maximum exibility in editing the image in WORD These images can be imported to WORD using Insert Picture From File Be aware that some printers will not have map marker fonts If your map marker symbols are not coming out or are completely different in appearance the best approach is to reexport the map from ARCMAP using the Format Options In the Export Map window click Options at the bottom and open the Format tag Now check the option Convert Marker Symbols t0 Polygons This will convert your symbols to graphic objects that can be printed without font sets 1215 Making Legends for Exported Maps 1 The procedure in ARCMAP for composing maps with legends north arrows etc is their Layout View However a more exible procedure in my View is simply to export each component as a map and arrange them using the Picture and Drawing toolbars in WORD ESE 502 IV 12 12 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 2 Equot 5 Activate the data frame in ARCMAP containing the desired map and click View Layout View You may still see many different images in the map frame shown But that is no problem Simply click Insert Legend In the Legend Wizard that opens select each unwanted item in the Legend Items window and click on the lt button that is activated This will remove that item from the list When only the desired map layer remains click Next and you will encounter many editing options Just start by clicking all the way through and check to see whether the displayed legend is acceptable 1 have never used any of these editing options Assuming that the desired legend is what you want simply copyandpaste it to the desired WORD document it is the only item selected so nothing else will be copied When it is imported to WORD you can resize it crop it and relocate it next to your map 1216 Making Voronoi Tessellations in ARCMAP as Shapefiles 1 N Equot In the point file to be tessellated you must first change the Primary display field to a unique integer or string field The Object ID field FID or OID will usually work Select the point file and Click Properties Fields and reset the Primary display field Next you must be sure that this file is a shape file If not right click on layer file and click Data ExpOIt Data and save as a shape file say P0intsshp in the same directory From now on we use this shape f11e Before constructing the Voronoi tessellation observe that this tessellation will extend only to the smallest rectangle containing all points If the points are inside a boundary say B0undaryshp and you want to extend the tessellation to the boundary the best way is to edit the point f11e adding four points at the corners of a box containing the desired boundary More fictitious point may be used as desired To do so 0 Create a new copy of P0intshp say Tempshp which can be edited Now display Tempshp instead of P0intshp ESE 502 IV 1213 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis o In the Editor toolbar click Editor Start Editing 0 Then set Task Create New Feature and set Target Tempshp In the map window the mouse arrow should now appear as a colored circle 0 Click on the positions of the desired points such as the comers of a large box containing the boundary in B0undaryshp and these new point features will appear as new rows in the Attribute Table of Tempshp 0 To stop click Editor Stop Editing Save Edits 4 Next to construct the Voronoi tessellation Click Spatial Analyst Distance Allocation Be sure the point eld is selected and leave the Maximum distance blank The resulting temp le will be a raster display of the Voronoi cells 5 To convert this to vector form select the raster layer and click Spatial Analyst Convelt Raster to Features Save the resulting le as shape le say V0r0n0ishp in the same directory 6 To use V0r0n0ishp as an overlay double click on the icon below the layer name in the Table of Contents and set Fill Color No Color You can also set the outline width and color in the same box 7 When V0r0n0ishp is overlaid onto P0intshp it will usually stretch beyond the original boundary for P0intshp and Tempshp You can clean up this discrepancy by using the Clip feature in ARCMAP To do so open ArcToolbox and click Analysis Extract Clip The Input Features layer to clip should be V0r0n0ishp and the Clip Features polygon clip layer should be B0undaryshp 9 Save this clipped form as a new shape le say V0r0n0icellsshp and you are done The temporary les Tempshp and V0r0n0ishp can now be deleted ESE 502 IV 1214 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Software for Analysis 1217 Running Local G Tests of Concentration in ARCMAP There is an option in ARCMAP for testing spatial concentration using local G statistics denoted here by Giquot The following notes illustrate the application of this test to the Eire Blood Group Data and compare it with the corresponding MATLAB results The testing tool in ARCMAP offers a number of weightmatrix options but most importantly allows one to import any weight matrix in the right format So a second objective of these notes is to show how to construct matrices in the right format using MATLAB 1 Equot The first task is to open Fsys502arcviewprojectseireEiremxd in ARCMAP and check that the Attribute Table of the shapefrle Eireshp contains a field that identifies counties with the corresponding row numbers of these counties in MATLAB data This Attribute Table already has a field called ID that contains these numbers For other applications you will need to construct such an identifier eld say ID To do so use the Add Field option in the Attribute Table Set Name ID and Type Short Integer Finally use the Calculate Values option and set ID FID 1 The next task is to convert the exponential weight matrix used in Assignment 6 to ARCMAP format in order to construct a comparable test to that of Assignment 6 The following instructions for forming this matrix are essentially the same as in Assignment 6 Open MATLAB and import the file Fsys502matlabeiretxt You should now have a 26x 7 matrix eire in the workspace Define the centroid locations L and blood group percentages x by L eire12 x eire3 2 Now make an exponential weight matrix by writing OUT distwtsL4101 W OUT1 Here the argument 4 denotes the quotexponentialweightmatrix option 10 denotes the exponent value and 1 denotes the includediagonalterms option We now convert this weight matrix W to ARCMAP format using the MATLAB program arcmapwtmatrixm Here the appropriate command is gtgt Wlist arcmapwtmatrixW The matrix Wist is actually a list form of W To see this display the first four rows of Wlist by writing ESE 502 IV 1215 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis 5 UI gtgt Wlist14 ans 1 030159 2 00036059 3 00051854 4 00056317 HHHH For example the third line says that the rownormalized exponential weight of County 1 with County 3 is 00051854 This can also be veri ed by displaying the 13 element ofW as gtgt W13 ans 00051854 We next save this le and convert it to a format that can be read by ARCMAP as follows First save Wist as a text le using gtgt save Wisttxt Wist ascii Next we need to clean this le in EXCEL Open Wisttxt in EXCEL and use Format Cells Number to format the rst two columns as integers Decimal places 0 and the third column as say a 4digit decimal Decimal places 4 Save as le as a tab delimited text le Wisttxt in your home directory which we shall assume to be S H0meeire Finally recall from 1 above that ARCMAP needs to match the ID numbers in the rst two columns of Wisttxt with a eld numbering the Counties in Eireshp Since there is such a eld ID we shall use this eld Open Wisttxt as a text le and add ID as a top line to this le so that the rst few lines now read ID 1 1 03016 1 2 00036 1 3 00052 6 With this text le completed we are now ready to ca1ry out Giquot tests in ARCMAP To do again open the map document le Fsys502arcviewpr0jectseireeiremxd in ARCMAP and also open ArcToolBox Then navigate to ESE 502 IV 1216 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Software for Analysis gt 9 9 p t Spatial Statistics Tools Mapping Clusters Hot Spot Analysis Getis Ord Gi In the window that opens set Input Feature Class Eireshp Input Field BLOODiGRP Output Feature Class EireiGiArcmapshp To use the custom weight matrix above set Conceptualization of Spatial Relationships Get Spatial Weights from File Distance Band 0 Weights Matrix File SHomeeireWilisttxt where it is assumed that the last line contains the proper path to Wlisttxt Leave all other options as is and click OK A new layer EireiGiArcmap will now appear in the Table of Contents If you open the Attribute Table you will see that the last eld is labeled GiWilist The values in this column are standardized Gi values ie zvalues of the form Zi Gi EGistdGi where exact expressions for the mean EGi and standard deviation stdGi can be found in the GetisOrd paper listed in the reserve section of the class web page To see the relevance of these values recall that under the null hypothesis of the random permutation test the underlying xvalues in this case Blood Group A percentages for each county are assumed to be typical random permutations of these values among counties Ord has shown that under this hypothesis the Zivalues above are asymptotically normally distributed for large sample sizes 11 Here there are only 11 26 counties so this quot J quot 39 J quot 39 39 at best But nonetheless it will be seen below that this test provides very reasonable results in the present example To display the results of this test in a manner that is comparable to the random permutation test gpermlocm in MATLAB recall that this is a one sided test focusing only on signi cant concentration levels The appropriate z values corresponding to onesided signi cance levels in this case are given by 001 233 010 128 005 165 I P Value I I z Value I Hence to display these zvalues GiWilist use a classi cation with Classes 4 Method Manual and values 128 165 233 and max where in this ESE 502 IV 1217 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis case max 3295240 Finally rightclick on the values and select Reverse Sorting so that the most signi cant values will be on the top You can then set the display labels to be the corresponding pvalues listed from top to bottom as 000 001 001 005 005 010 010 100 You should also use darker colors for the more signi cant categories The results of this procedure are plotted below with the legend we have just constructed shown to the right EireGArcmap GiWlist 000 001 001 005 005 010 CI 010 100 Finally when these results are compared with the pvalue display obtained from an application of gpermloc using the same weight matrix W and 999 simulations it turns out that the representations are identical Hence with respect to these pvalue classes the results are in perfect agreement As a rule however the actual pvalues for the asymptotic approximation in ARCMAP will tend to be somewhat smaller For example the smallest possible pvalue for a random permutation test with sims 999 is of course 0001 However the smallest pvalue for the ARCMAP results above corresponds to the zvalue max 3295240 which is approximately 00005 1218 Joining Point Data to Polygon Shape Files in ARCMAP When adding point data in a DBF or ASCII le to a polygon shape le there are some key pointers to keep in mind 1 The les must both be projected or an error message will occur Also the numerical coordinates in the point le must be in the right projection ssum1n t att e 0 on 1 e sa 1s a rea 1n t e ma ocument a 2 A 39gh hplyg fl yPOLYGON39 l dy39 h pd good way to proceed is to project the point data say POINTS at the same time it is ESE 502 IV 1218 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis being entered To do so a b c V d First be sure that the POINTS le to be added either a DBF le or a Tab Delimited TEXT le has Column Labels with proper format for ARCMAP single strings of not more that 8 symbols with no punctuation symbols other than 7 Also if you label the coordinates as X Y or as Xicoord Ycoord then these will be recognized by ARCMAP as the coordinate columns Add the POINTS le to the map document If it won t come in check the label formats again This is usually the problem Right click on the POINTS le in the Table of Contents and select Display XY Data If you have chosen coordinate labels as above then they should show already be selected in the Display XY Data window that opens Also you should see that a new le POINTS Events has been added to the Table of Contents In the Description box you should see Unknown Coordinate System This is what needs to be changed To do so click the Edit button You need to de ne a geographic coordinate system and a projection It will be assumed here that you have Philadelphia data So rst click Select and go to Geographic Coordinate Systems gt North America North American Datum 1983prj Now click Select again and this time go to Projected Coordinate Systems State Plane NAD 1983 Feet NAD 1983 StatePlane Penn South FIPS 3702 Feetprj g Click Add then click Apply and OK In the Description box you should now see both the Projected Coordinate system and the Geographic Coordinate system listed Click OK again 11 Finally you should save the POINTS Events le as a shape le by right clicking on this le and selecting Data Export Data ESE 502 IV 1219 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Software for Analysis j Save this le as say POINTSshp You can then join this to the polygon le by right clicking on the POLYGON le and selecting Joins and Relates Join k Here the best procedure for joining points to polygons is to change the What do you want to joint to this layer option to read Join data from another layer based on spatial location 1219 Saving Map Documents with Relative Paths When saving map documents it is useful to save only relative paths so that projects can be moved to new directories without changing all paths to shape les etc This option can be set in ARCMAP by clicking File Document Properties Then at the bottom of the Properties window click Data Source Options Store Relative Path Names Finally if you check Make relative paths the default then this option will be used for all future map documents as well 1220 Increasing Unique Values for editing Raster Outputs in Version 93 When you use the command Spatial Analyst Interpolate t0 Raster Spline to interpolate a set of point measurements like temperatures at weather station locations in South America you will obtain a default class representation of the interpolation as with older versions But when you right click on properties and try to edit this representation change the number of classes or colors etc it says quotUnique histogram does not exist Do you want to compute valuesquot When you try to do this it will tell you that there are too many unique values This can be xed by increasing the bound on unique values as follows 1 In the main menu click Tools Options Raster 2 In the raster window you will see an additional set of tabs about half way down Click Raster Attribute Table 3 Here change the upper bound on unique values to say 500000 This should be enough ESE 502 IV 1220 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part IV Sa ware for Analysis Now when you are asked if you want to compute unique values say Yes The computation takes a few seconds and then the desired editing window opens up ESE 502 IV 1221 Tony E Smith ESE 502 Tony E Smith TESTS BASED ON MAXIMUM LIKELIHOOD 1 The Basic Example To illustrate the properties of maximum likelihood estimates and tests we consider the simplest possible case of estimating the mean of the normal distribution with known variance 02 Given a random sample x x1xn from a normal distribution N 0402 we rst derive the maximumlikelihood estimate of u with 02 known 1 quot7 2 11 Lnyx02lnH71e U j Zyl1nt j x 4111405 Zx m2 So by solving the rstorder condition for u we obtain dL VI 1 n ZHOCI39 u Zi xi u 12 0 d l l A 1 VI 2 2 ESE 502 Tony E Smith 2 Covariance of Maximum Likelihood Estimators For this simple case it is well known that the variance of the sample mean in in is given by varin 02 n But by taking the second derivative of Ln with respect to a and evaluating at the true mean value say uowe see that 2 2 71 21 dLW ft02lz varcan d Ln39tO 071 dJ q Since in this case a ZLn a a2 is independent of the data x1xn it follows that Ea 2Ln a0a a2 E n02 so that in fact 22 varrz ED Elmer i where In no is designated as Fisher Information about no and is seen to increase as the variance of the maximumlikelihood estimator in decreasesFinally since the Law of Large Numbers shows in this case that in X n z no for n sufficiently large we may conclude om 22 that 23 var n z 40704 More generally for any parameter vector 6 61 k 39 de ning a well behaved distribution with likelihood function Ln 6 it can be shown that if 60 denotes the true value of 6 then the covariance matrix of the A maximumlikelihood estimator 6quot is well approximated for large n by ESE 502 Tony E Smith 23 covlt ngt z EVLnlt6o 1 a new Moreover as with the simple case of the normal mean above it can also be shown that 6 is always a consistent estimator of 60 so that 6quot z 60 for large n Thus as an extension of 23 we obtain the sample approximation 24 cov n z I any1 3 Wald Tests of Parameters For the simple case of the normal mean above it follows at once from the fact that linear combinations of independent normals are normal that 31 2 N 3972 This in turn implies from 12 and 22 that 32 In NInlt zn More generally it can be shown that for large n the distribution of the A general maximumlikelihood estimator 6n de ned above is well approximated by A 33 6 N60In n 1 where 60 again denotes the true value of 6 This forms the basis for all standard tests of hypotheses about the components of 6 ESE 502 Tony E Smith 4 LikelihoodRatio Tests of Parameters Another way to test the hypothesis that say 0 is the true value of y is simply to compare the likelihood of go with that of the most likely value itquot If the resulting ratio of likelihood values is close to one then this implies that go is a good candidate for the true value of y In terms of log likelihoods this is in turn equivalent to a difference Figure 1 LikelihOOd RatiOs 14402quot ano close to zero If we observe from 11 that 41 Ln10x02 n1naJE Zx a2 then this difference can be evaluated as 42 Wquot woo Zltx 4oz Zjltx M 211xl2 2x1 if 211xl2 inJO 3 T2 2fnzl x1 1123 102in nag 2an 3 J 5 war ESE 502 Tony E Smith Hence it follows that A fquot 1u0 2 43 2mm LAW a But under the null hypothesis that H 0 y 2 go it follows that f 44 quot 0 N N 01 a J and hence by de nition that 31 610 2N 2 45 05 261 Thus we see that under H 0 46 2Ln nLnol N 2612 More generally for any partition of parameters 6 6162 the null hypothesis H 0 61 63 can be tested in the same way by letting 602 be de ned by the condition 47 14463602 maxgz 146362 and comparing 14163602 with the unconstrained maximum Ln 6quot If 63 has k components then under H 0 we now have 48 2Lnlt n Lnlt6s gti 25 ESE 502 Tony E Smith 5 Lagrange Multiplier Score Tests of Parameters A nal way to test the hypothesis that go is the true value of y is to I go yquot simply examine the slope of the likelihood function at 0 If this slope is close to zero then it follows by continuity that yo must be close to t and thus must again be a good candidate for v Ln y the true value Hence one may Simply teSt Whether the 510106 Figure 2 Score Function V Ln 0 is signi cantly different n3 from zero But by 11 it follows that the score function sumo V Ln 0 is given by 51 we V 4211405 232 2102 402 lFM Z1xiJo lxi nJO n n xn Il t Ziixiu0 02 Hence it follows that 52 aJZsnlty0gt fquot Nlt01 an As in 44 and 45 above this in tum implies that ESE 502 TonyE Smith 53 0211Snu02 glz Finally recalling that varin 0211 we obtain the following score test statistic 54 varosno2 Z12 where vary0 is formally the variance of in under the null hypothesis which is completely independent of go in this simple case It is important to note that this score test is also called a Lagrange multiplier test To see the reason for this observe that nding the maximumlikelihood estimate of it under the null hypothesis y 0 is formally equivalent to the Lagrangian maximization problem 55 maxy 0 LN 1 with solution given by 56 0V gpV Lni l 57 0Vilt0 0 By combining these two conditions we see that the optimal value xi of the Lagrange multiplier is given by 58 i Vilma and thus that the score function sny0 V Ln 0 can also be interpreted as a Lagrange multiplier However the above slope interpretation seems to be simpler and more intuitive ESE 502 Tony E Smith More generally for any partition of parameters 6 6162 as in section 4 above the null hypothesis H 0 61 63 can be tested in the same way by again letting 612 be de ned by 47 and setting 60 63502 If 61 is of dimension k then the score vector s 60 also has dimension k and the variance term vara0 in 54 is now replaced by the covariance sub matrix cov1160 where 59 Kn60 COV1160 COV1260 COV2160 COV22 60 If the Fisher information matrix is given by 510 IMO PM 21460 InZl I 22 n then it can easily be shown by partitioned inverse identities that 511 COV1160 z In607111 Inll60 Ian 60In2260zn216071 so that the following score test statistic is obtained as a direct generalization of 54 512 snlt60gt39znlt6o s 60 i iiquot The single most important feature of this score test is that it only requires maximumlikelihood estimation under the null hypothesis ie ESE 502 Tony E Smith conditional maximumlikelihood estimation of 62 given 61 65 This is generally much easier to calculate For example if 61 0 in the SAR model or 61 2 xi in the SL model then the remaining parameters 62 8O392 are directly obtainable from OLS 6 Moran s I as a Score Test for SAR One key feature of score tests for our present purposes is that the score test statistic for p in the SAR model turns out to be precisely Moran s I statistic up to a scale factor To see this observe rst from section 83 in the BULKPACK that 61 VPLWCDHBJZ Vp consHlnprl ZL 02 y X 39BBpy X l 202 VPZZ11nltlpwigt y X3gt39BBplty X gtpp where BP 2 In pW and where a i 1n are the eigenvalues of W so that 62 Wvl on 139 1 113 n To analyze 61 we rst assume for simplicity that the eigenvectors in 62 are linearly independent so that the matrix V v i 1n is nonsingular Hence if A denotes the diagonal matrix of eigenvalues in 62 then by de nition 63 Wv1vnv1vnA gt WVzVA gt WziAV1 PP0 ESE 502 Tony E Smith Next we observe that if the trace of a matrix A 2 at z39 j 1n is de ned by trA a then it follows that for any matrices 11 A a1an39 and B b1bn IrAB 2 21017 2 IrBA Moreover since IrW 0 for all weight matrices it follows from 62 and 63 that 64 0 IrW IrVA V IrA V IV trA 210 Hence for the single most important null hypothesis p0 0 we have lill pa 65 VpZfllnl paip02quot L Zwo and it follows that 61 reduces to 66 Van0 02 lzvpky Xm39Bngo Xm 20 VP y X 391n pW I pWyX 18p0 y X gt vp In pltW39 W pZW39Wm y X l 202 1 y X 39 W39W20W39W1 yX 202 Jp0 2170 X 39W39 Wy X y X 39W39y X y X 39Wy X OX YW vX ESE 502 Tony E Smith Hence if we let 63063 denote the OLS estimates of 862 ie the maximumlikelihood estimates under the hypothesis 0 0 so that in this case 60 0 and then the score function is given by 67 snwo a vanm o Lgo X oywo 4050 But if we now let 0 y X 30 denote the OLS residuals so that 68 OA i 2y X 0ry X 0 or 0 then 67 takes the form 69 we n o39 oio X oywo X 0 2 we AVA 80 80 where the expression in brackets is precisely Moran s I To complete the argument recall from expression 77 in the BULKPACK that the offdiagonal 0 02 term in the Fisher information matrix In 60 is given by 610 map IrWIn per IrWIn 21pka trW pZ1pk 1Wk Hence under the null hypothesis p 0 it follows that ESE 502 Tony E Smith 611 trGpp0 trW 0 But this implies that In 60 is black diagonal so that in this simple case 57 reduces to 612 comma z We Inner which is given from 77 in the BULKPACK by 71 613 Inuwoyl trGpGp 6F0 trWW W39 1 Hence the nal score statistic is 2 W 614 63916 1 6 80 02 snlt0 quot0 Lame IrltWWWW O O a Note also that an equivalent test statistic which is proportional to Moran s I is obtained by taking the square root of 614 615 AVA 80 80 J go39ng 1MWW WW39 1 N NW ESE 502 Tony E Smith DISCRETE SPATIAL AUTOREGRESSIVE MODELS The standard logistic binomial and Poisson regression models of discrete counting data have natural spatial generalizations in a manner similar to the conditional autoregressive CAR model We consider each in turn 1 Autologistic Model Suppose one considers the voting outcomes for Pennsylvania counties during a given Presidential election Let V 1 if county 1 voted Democratic and v1 0 otherwise To model this voting behavior one may consider a number of county covariates q Jq1qk say with x11 per capita income in county 1 x12 average years of education for voters in county 1 and so on A standard logistic model of voting behavior would then take the form 1 Pro 1 M 1ul where 2 1 expazfl pc However it may well be that there are voting similarities among adjacent counties Hence 1f we let WU 1 1f county j 1s a cont1guous neighbor of county 1 w1th w 0 and 1f we let v7 v1 VH VHlVn denote the voting outcomes of all counties other than i then a natural spatial generalization of this model would be to hypothesize that the conditional distribution of v given the voting outcomes v71 of all other counties depends only on the voting behavior of its neighbors say 1m 3 Prvl llv71 1MVil with 4 W expa 214 p2 Wm To estimate this model we start with the standard logistic model that has a likelihood function of the form 5 La l v HLPrm 1 1 Prv 11 ESE 502 Tony E Smith 2 10gLa l v 212 logPrv 11 vlog1 Pro 1 where v v1 v and where Prvl l is given by l and 2 This function is quite easy to maximize and yields well behaved maximumlikelihood estimates of a and But for the spatial generalization in 3 and 4 this is not nearly so simple First it is not even clear whether this family of conditional probability distributions is actually consistent with a common joint probability distribution Here it turns out that they are indeed consistent and that this joint distribution has the form see for example in Cressie 1993 section 65 1 expQv 6 PrVZegtltpQs where n k 7 QltvgtZwaZ gtep Z w ISISSH Hence by de nition 8 La p l v M Zsemeon with Q given by 7 But notice that the denominator is summed over all possible realizations of v v1 vquot So for example if there were only 20 counties then the denominator would contain 220 l048576 terms Hence even for small numbers of areal units this function is completely intractable Several alternative procedures have been suggested The simplest is to ignore interdependencies and consider maximizing the pseudo likelihood function 9 Lquota p l v H7Prw 1v 1 Prv 1 wt in which the marginal probabilities in 5 are simply replaced by the conditional probabilities in 3 If spatial dependencies are not too great then it can be shown that this procedure still yields consistent estimates However the procedure turns out to be extremely inef cient in that the variances of these estimates especially 3 can extremely large ESE 502 Tony E Smith An alternative procedure is to simulate values of the joint distribution as proposed by Geyer and Thompson 1992 The idea is very simple Ifwe let 9 a p and write 6 as 10 Pr v mm M exPQV mem M then for any given set of parameter values say 60 a0 o p0 it follows that 11 121amva 2 C9ZV9XPQVl cw 2V eman 0 expanv C90ZV6XPQEUV C60 2609 E m 0 0 exPlQ9nV But since 60 is known it follows that if we can simulate this distribution and obtain sample values v S s lN then we can obtain corresponding sample estimates on WFZ N expgtwn mgmm of the mean value in l 1 This in turn implies that mgtamwHMm and thus that we can approximate the function 06 up to an unknown constant of proportionality C60 Hence this approximation can be used to obtain approximations the likelihood function in 10 Notice that while this simulation procedure is time consuming it only need be done once since the same samples VG s lN can be used in every round of the iterative maximization procedure to obtain estimates 0 3 2 Auto Binomial Model Next suppose that one is looking at housing abandonments in Philadelphia and has data h h 139 l n on the number of abandoned houses in each block group 139 ln Ifthe abandonment probabilities depend on a number of housing attributes x1 x11xlk as well ESE 502 Tony E Smith as the frequency of abandonments in neighboring block groups and if the total number of housing units in i is denoted by M then one might consider conditional binomial probabilities of the form N 14 Prat l ha hljmhx 1 P1hllN h where h 15 p1h1 1 M ha with 16 HULK expa2fl jxiwin1 Here these conditional binomials can again be shown to be consistent with a joint distribution of the form expQh 17 PrhZseXpQs where n n M 18 QlthgtZx1haZfl jxxp Z wyhhjZ logm ISISSn 1 Hence the GeyerThompson procedure can again be used for estimation 3 Auto Poisson Model Finally if the N s are quite large then it is reasonable to approximate the auto binomial model by an auto Poisson model of the form 19 PM l 111 Mil RVeXMMM with 20 110171 expazl jxy pZM wljhj ESE 502 Tony E Smith For the Poisson approximation to the auto binomial 10171 replaces the binomial mean values Nlplh71 In all cases the appropriate joint distribution takes the form 17 with n k n 21 QM m aZ x p Z W 210m ISISSn However it turns out that this model is only well de ned if p lt 0 The problem can be seen from 20 where the in uence of neighboring counts always increases the expected number of counts at i when p gt 0 This mutual in ation procedure can easily be shown to drive counts to in nity since there is no upper bound on counts in the auto Poisson model Some efforts have been made to rectify this see Augustin NH et al 2004 by considering truncated auto Poissons of the form M a 1 Hi exp 2 a 1 22 Prov l h 2k0amph kexp zh n 01N Notices that if N is replaced by N for each i then this starts to look very much like the auto binomial model above Hence this truncated model is most useful in cases where there are no reference populations as for example when looking at the number of traffic accidents in each areal unit Finally if we let lt23 Qlthgtzieaz xp Z wink2quotWM with n M N k 24 Do 210 h J Zk0ih1 kexp 2h then the joint distribution consistent with these truncated conditional Poissons is again given by 6 So Geyer and Thompson 1992 can again be used for maximum likelihood estimation of the parameters References Cressie N 1993 Statisticsfor Spatial Data New York Wiley Geyer CJ and EA Thompson 1992 Constrained Monte Carlo Maximum Likelihood for Dependent Data Journal of the Royal Statistical Society B 54 657699 ESE 502 Tony E Smith Augustin NH et a1 2004 Using the truncated autoPoisson model for spatially cor related counts httpwwwstatsglaaculdResearchTechRepZOO40414pdf EXAMPLES OF SMOOTHERS FOR SPATIAL INTERPOLATION IN GEOSTATISTICAL ANALYST ESE 502 Tony E Smith Local Polynomial Fit Radial Basis Fit 0 a Lquot I ee r 4 hi 4 Spline Function Fit Ordinary Krige Fit LOCAL LINEAR POLYNOMIAL FIT RADIAL BASIS FUNCTIONS Given data spyi i SSi2 Choose normal dens1t1es i s exp 2 z 1n Let yyl 11n CI isjij1n and Find coef cients a a1 i1n such that gt azclfly 25 I I I I I I I I I I I I 2 RBF with Inverse Quadratic Basis Functions 20 J I I l 0 I I x x x x x x 1 2 3 4 5 6 7 8 RADIAL BASIS INTERPOLATION Normal Density Family 9 10 CUBIC SPLINE MODELS o For simplicity we focus on the onedimensional case and consider a set of data points xiyl i ln such as the n 5 points in Figure 1 below 0 P39s age 5 119 l I use use I I or or I 116 7 use I I l I u57 u5 39 1 9 39 M E M I E E I use Q 3 ll 39 E a I u27 u2 a a E i i M i 05 1 15 2 25 a a5 4 45 5 55 05 1 15 2 25 a a5 4 45 5 55 Figure 1 Data Points Figure 2 Smooth Interpolation o The objective is to find a smooth curve y Cx that passes through these data points such as the curve shown in Figure 2 If the set of twice continuously differentiable functions on the interval x1xn is denoted by C x1xn and if smoothness is taken to be a lack of sharp curvature as measured by the second derivative cquotx of cx then this problem can be formalized in terms of the following constrained minimization problem x 2 1 minceqwn Ll cquotx dx subject to cxl yi i1n o In spite of its apparent complexity this problem has the following remarkable solution THEOREM The unique solution ci e C x1xn to l is described on each segment x l xm i ln l by a cubic polynomial 2 3 2 cxalblxclx dix xinSx 11 for appropriate choices of ablcldl i ln 1 This function ct is 1 called a cubic spline function o The actual solution to the problem in Figure 1 above is shown in Figure 3 below 09 e i 08 e i 07 e i 06 e i 04 r 7 02 e Figure 3 Cubic Spline Interpolation Example ESE 502 Tony E Smith STANDARD PROCEDURE FOR UNIVERSAL KRIGING Given a universal kriging model Ys xs39 es seR with i unknown ii Ees0 seR iii covesev known for all sv e R together with observed data y y i ln39 and X x1xn 39 at points s i 1n C R the task is to predict the unknown value YSo at a given location s0 6 R with attributes x0 xs0 or possibly many such locations This is accomplished by variations of the following standard estimation procedure in software such as Geostatistical Analyst 1 Construct an OLS estimate AOLS X 39X 391 X y of and corresponding residuals 50113 yX OLS39 2 Use these residuals to estimate an empirical variogram 7h at some set of selected distance values h i 1m 3 V Use this data 2 h i 1 m to t by nonlinear least squares a spherical variogram yh and 4 V Then use the identity C h 02 yh to construct the corresponding spherical covariogram h s yh and for all distances h 5 V If the distance between each pair of data points s1 and s j is denoted by h then the covariance 0391 covelej between the residuals at s and s j is estimated by A quot 2 A A 0391 C h where by de nition 039 039 3 039 s and the resulting covariance matrix between residuals at all data points i 1 n is given by ESE 502 6 7 8 V 9 V 10 11 Tony E Smith Using this covariance matrix one then obtains the nal estimate of by GLS as I ilXile ily I Note that all data points are used to estimate the covariogram 63901 However only a subset of these data points will be used to interpolate krige the value Y so at s0 In particular if one selects an appropriate kriging bandwidth ho about point s0 then the simplest approach is to choose the set of all data points within distance no of s0 denoted say by S0 s i 1 no If the attribute vector is of dimension k 1 including the intercept term then no is always chosen large enough to ensure that n0 2 k1 The corresponding n0 square submatrix of all covariances between residuals at points s1 s 6 S0 is then given by n01s In a similar manner if the covariance between 8s0 and each 8039 i l denoted by 03901 coves0 831l and ifthe distance from s0 to s is denoted by hm S no then the vector Cso 03901 i 1 n0 ofthese covariances is estimated by 5s0 63901 i 1n0 39 where 63901 0101 Next if the estimate of the residual vector 8 81 i 1n for all data points is de ned in terms of the GLS estimate of by 5 y X and if the corresponding n0 dimensional subvector of estimated residuals in S0 is denoted by 50 then the predicted residual s0 is given by the simple kriging expression 00 c006 Finally the desired prediction EGO at SO is given by too mm also SPATIAL POINT PATTERN ANALYSIS 1 Examples of Point Patterns We begin by considering a range of point pattern examples that highlight the types of statistical analyses to be developed These examples can be found in ARCNIAP map documents that will be discussed later 11 Clustering versus Dispersion Consider the following two point patterns below The first represents the locations of redwood seedlings in a section of forest1 This pattern of points obviously looks too o o 000 0o 0 o Do 000 O O 00000 o 000 O O O 0 0 0 00 o 0 000 no a Fig11 Redwood Seedlings clustered to have occurred by chance The second pattern represents the locations of cell centers on a microscope slide2 While this pattern may look more random than the redwood seedlings it is actually much too dispersed to have occurred by chance3 This can be seen a bit more clearly by including the cell walls shown schematically in Figure 13 to the right This additional information shows that indeed there is a natural s acing between these cells much like the individual cells of a beehive The cell walls were actually constructed Fig13 Cell Walls 1 This data rst appeared in Stmuss 1975 and is the lower le hand comer of his Figure l which contains 199 redwood seedlings Z This data rst appeared in Ripley 1977 Where it relates to an interesting biological problem regarding insect cells posed by Dr Francis Crick of Crick and Watson fame 3 The term dispersion is sometimes called uniformity in the literature Here We choose the former ESE 502 I 1 1 Tony E Smith NOTEBOOK FOR SPATIAL DA TA ANALYSIS Part I Spatial Paint Pattern Analysis schematically in ARCMAP by using the Voronoi Map option in the Geostatistz39cal Analyst extension of ARCMAP But this process is a reasonable depiction of the actual cellpacking process So the key question to be addressed here is how we can distinguish these patterns statistically in a manner that will allow us to conclude that the rst is clustered and the second is dispersed 7 without knowing anything else about these patterns The approach adopted here is to begin by developing a statistical model of purely random point patterns and then attempt to test each of these patterns against that statistical model In this way we will be able to conclude that the rst is signi cantly more clustered than random and the second is signi cantly more dispersed than random 12 Comparisons between Point Patterns Figures 14 and 15 below show the locations of abandoned houses in central Philadelphia for the year 20004 The rst shows those abandonments for which the owner s residence is off site and the Fig14 Off Site Owners Fig15 On Site Owners second shows properties for which the owner s residence is on site If offsite ownership tends to re ect abandoned rental properties while onsite ownership re ects abandoned residences then one might hypothesize that different types of decisions were involved abandoning a rental property might be more directly an economic decision than abandoning one s home However these patterns look strikingly similar So one may ask whether there are any statistically signi cant differences between them 4 This data was obtained from the Neighborhaadlnfarmat ian System data base maintained by the Cartographic Modeling Lab here on campus httpwwwcmlupennedu For further discussion of this data see Hillier Culhane Smith and Tomlin 2003 ESE 502 112 Tony E Smith NOTEBOOK FOR SPATIAL DATAAAZALYSIS Part L spmm Pumt PmtzrnAnalym Notice that there appears to be signi th clustering in each pattern But here it is important to emphasize that one can only this judgment by comparing these patterns with the pattern of all homing in this area For example there are surely very few houses in Fairmont Park while there are many important to treat the pattern of overall housi houses in other areas as the relevant reference pattern or So here it is backcloth against which to evaluate the signi cance of any apparent clusters of abandoned houses arynx cancer cases in Lancashire county England during the period 197419835 The speci c data set is from the southcentral area of Lancashire county shown by the red area in Figure 16 An enlargement of this region is shown in Figure 17 below where the population of blue dots are lung cancerx during that period and the smaller the smaller areal subdivisions shown are parixhex also called civil parishes cp and correspond roughly in scale to our census tracts 939 quot 3 mega Fig16 Lancashire County Here again it should be clear that clustering of such cancer cases is only meaningful relative to the distribution of population in this area The population densities in each parish are shown in Figure 18 below Fig 1 7 Larynx and Lung Cases Fig18 Population Backcloth 5 This data first appeared in the paper by Diggle Gatreii and Lovett 1978 which is included as Paper 12 arynx Cancer in the Reference Materials on the class web page ESE 502 113 Tuny E Swath NOTEBOOK FOR SPATIAL DATA ANALYSIS Part 1 Spatial Pam Pattern Andysxs An examination of these population densities reveals that the clustering of cases in some of the lower central parishes is now much less surprising But certain other clusters do appears to be in an area ofrelatively sparse population This cluster was in fact the center of interest in this particular study An enlar ement of this southern portion in Figure 19 below indicates that a large incinerator is located just upwind ofthis cluster of cases7 Hg19 Incinerator Location Moreover an examination of the composition of this cluster suggests that there are significantly more larynx cases present than one would expect given the total distribution of cases shown in Figures 17 and 18 above This appears to be consistent with the fact factor contributing to the presence of this particular clustering of cases To analyze this question statistically one may ask how likely it is that this could simply L 394 u IIoL 11 m A 4 m It paumh 5 i i r mm mm rm i ml h during the period from 19721980 7 i i i i a n m Fi m 1 sabove ESE 502 114 Tony E Smxm