Statistical Graphics and Data Exploration I (STAT 663)
These 229-page class notes were uploaded by Mr. Jaycee Schmidt on Monday, September 28, 2015. The notes belong to STAT 663 at George Mason University, taught by Staff in Fall. Since their upload, they have received 58 views. For similar materials see /class/215209/stat-663-george-mason-university in Statistics at George Mason University.
Date Created: 09/28/15
STAT 663 / CSI 773: Statistical Graphics and Data Exploration
Edward J. Wegman
Lecture 2: Data Graphics Software

Software covered: CrystalVision, Mason Hypergraphics, S-PLUS

[Screenshot: CrystalVision main window]

CrystalVision
Three navigation rows: text, icon, sliders.

CrystalVision: File menu
Open (puts history), Exit. Others not currently implemented.

CrystalVision: Edit menu
Copy to Clipboard (Ctrl-C), then open as a new image in Microsoft Photo Editor. Best way to preserve images. Also can use MS PowerPoint.

CrystalVision: View menu
Toolbar, Status Bar. Group: not implemented.

CrystalVision: Popups menu
Can view additional plot types in separate windows simultaneously.

CrystalVision: Extras menu
Gives rotation matrix information for the grand tour.

CrystalVision: first icon group
1 New, 2 Open, 3 Save, 4 Print, 5 Information, 6 Help (not implemented), 7 Undo, 8 Redo

CrystalVision: second icon group
1 Scatterplot Matrix, 2 Parallel Coordinate Plot, 3 3D Scatterplot, 4 Enlarge Scatterplot, 5 Data Label, 6 Stereoscopic 3D Plot, 7 4D Plot, 8 3D Plot with Joins, 9 No Alpha Blending, 10 White Background

CrystalVision: grand tour controls and sliders
1 Return to original, 3 One step back, 5 Pause, 6 Forward, 7 One step forward; 2, 4, 8 not implemented.
A Speed of Grand Tour, B Saturation Control, C Pixel Image Size

CrystalVision: tools
1 General Pointer, 2 Cutting Tool, 3 Cropping Tool, 4 Brushing Tool, 5 Magnify/Compress, 6 Reposition Center, 7 ID a Point, 8 Add More Groups (colors), 9 Customize Colors, 10 Adjust Variables

[Screenshot: CrystalVision Variables dialog]
[Screenshot: CrystalVision rotation matrix display, with scale and shift values for the variable XPrice]
[Screenshots: CrystalVision plots; legible variable labels include Gear, Displacement, TurningCircle]

Mason Hypergraphics
Worked like old Lotus 1-2-3. The forward slash "/" initiates the menu. You can load an old file or type in a new one. No import facility. S shows the menu on the next page.

[Screenshots: Mason Hypergraphics displays; legible variable labels include Gear, Displace, TurningCircle, Length, Height, RearSeat, HeadRoom]

STAT 663 / CSI 773, Lecture 8: Multidimensional Data Visualization Techniques

Topics: Scatterplot Matrix, Parallel Coordinates, Grand Tour, Glyphs, Chernoff Faces, Tails

Parallel Coordinate Theory

The parallel coordinate plot device is based on the observation that problems associated with Cartesian plotting arise because of the orthogonality constraint. Because this is the case, in parallel coordinates we simply give up orthogonality and draw the axes as parallel. Any number of parallel axes can be drawn in a plane. If the data are $d$-dimensional, simply draw $d$ parallel axes.

A data vector $x = (x_1, x_2, \ldots, x_d)$ is drawn by locating $x_i$ on the $i$th coordinate axis and simply joining $x_i$ to $x_{i+1}$ by a line segment for $i = 1, \ldots, d-1$. There is in principle no upper bound on the dimension of the data that can be represented, although there are practical limits related to the resolution available on a computer screen and to the human eye.

The power of and motivation for using parallel coordinate displays derives from the underlying connection with projective geometry. Axiomatic synthetic projective
geometry is motivated by the asymmetry in Euclidean geometry induced by the parallel lines axiom. That is, most pairs of lines in a two-plane meet in a point, except if the lines are parallel. However, all pairs of points determine a line.

In synthetic projective geometry the parallel lines axiom is replaced with the axiom: every pair of lines meets in a point. Together with the other axioms of projective geometry, this axiom has the effect that any statement that is true about lines and points is also true when the words "line" and "point" are interchanged. This notion of duality between points and lines induces all types of additional dualities in a projective plane.

Nondegenerate mappings between projective planes have the property of preserving certain geometric structures. In the case of transformations from Cartesian coordinate geometry to parallel coordinate geometry, this implies that structures in Cartesian coordinates have dual structures in parallel coordinates. The implication is that not only does a parallel coordinate display have the ability to uniquely map high-dimensional points into a planar diagram, but the parallel coordinate display can also be interpreted geometrically.

Projective Geometry

The intersection of two parallel lines in projective geometry is a point at infinity. These points are called "ideal points" and the set of ideal points is called the ideal line.

[Figures: projective geometry; cross-cap with antipodal points identified]

Consider a line in the Cartesian coordinate plane given by $y = mx + b$ and consider two points lying on that line, say $(a, ma+b)$ and $(c, mc+b)$. For simplicity of computation we consider the $xy$ Cartesian axes mapped into the $xy$ parallel axes. We superimpose Cartesian coordinate axes $(t, u)$ on the $xy$ parallel axes so that the $y$ parallel axis has the equation $u = 1$.

The point $(a, ma+b)$ in the $xy$ Cartesian system maps into
the line joining $(a, 0)$ to $(ma+b, 1)$ in the $tu$ coordinate axes. Similarly, $(c, mc+b)$ maps into the line joining $(c, 0)$ to $(mc+b, 1)$. It is a straightforward computation to show that these two lines intersect at a point in the $tu$ plane given by $\bar{\ell}: \left(b(1-m)^{-1}, (1-m)^{-1}\right)$. Notice that this point in the parallel coordinate plot depends only on $m$ and $b$, the parameters of the original line in the Cartesian plot.

[Figure: Cartesian plot of the points $(a, ma+b)$ and $(c, mc+b)$ beside the parallel coordinate plot, in which each point becomes a line from the $x$ axis ($u = 0$) to the $y$ axis ($u = 1$)]

Thus $\bar{\ell}$ is the dual of $\ell$ and we have the interesting duality result that points in Cartesian coordinates map into lines in parallel coordinates, while lines in Cartesian coordinates map into points in parallel coordinates.

For $0 < (1-m)^{-1} < 1$, $m$ is negative and the intersection occurs between the parallel coordinate axes. For $m = -1$, the intersection is exactly midway. A ready statistical interpretation can be given. For highly negatively correlated pairs, the dual line segments in parallel coordinates will tend to cross near a single point between the two parallel coordinate axes.

In the case that $(1-m)^{-1} < 0$ or $(1-m)^{-1} > 1$, $m$ is positive and the intersection occurs external to the region between the two parallel axes. In the special case $m = 1$, this formulation breaks down. However, it is clear that the point pairs are $(a, a+b)$ and $(c, c+b)$. The dual lines to these points are the lines in parallel coordinate space with slope $b^{-1}$ and intercepts $-ab^{-1}$ and $-cb^{-1}$ respectively. Thus the duals of these points in parallel coordinate space are parallel lines with slope $b^{-1}$. We thus append the ideal points to the parallel coordinate plane to obtain a projective plane. These parallel lines intersect at the ideal point in direction $b^{-1}$.

We represent points in the projective plane by triples $(x, y, z)$. As motivation for this representation, consider two distinct parallel lines having equations in the projective plane
$ax + by + cz = 0$ and $ax + by + c'z = 0$. Simultaneous solution yields $(c - c')z = 0$ so that $z = 0$. Thus when $z = 0$, the triple $(x, y, z)$, i.e. $(x, y, 0)$, describes ideal points.

If $z = 1$, the resulting equation is $ax + by + c = 0$ and so $(x, y, 1)$ is the natural representation of a point $(x, y)$ in Cartesian coordinates lying on $ax + by + c = 0$. Notice that if $(\gamma x, \gamma y, \gamma)$ is any multiple of $(x, y, 1)$ on $ax + by + c = 0$, we have $a\gamma x + b\gamma y + c\gamma = \gamma(ax + by + c) = \gamma \cdot 0 = 0$. Thus the triple $(\gamma x, \gamma y, \gamma)$ equally well represents the Cartesian point $(x, y)$ lying on $ax + by + c = 0$, so that the representation of a point in natural homogeneous coordinates is not unique.

The line $\ell: y = mx + b$ mapped into the point $\bar{\ell}: \left(b(1-m)^{-1}, (1-m)^{-1}\right)$ in parallel coordinates. In natural homogeneous coordinates, $\ell$ is represented by the triple $(m, -1, b)$ and the point $\bar{\ell}$ by the triple $\left(b(1-m)^{-1}, (1-m)^{-1}, 1\right)$ or equivalently by $(b, 1, 1-m)$. The latter yields the appropriate ideal point when $m = 1$. A straightforward computation shows for

$$A = \begin{pmatrix} 0 & 0 & -1 \\ 0 & -1 & -1 \\ 1 & 0 & 0 \end{pmatrix}$$

that $t = \ell A$, or $(b, 1, 1-m) = (m, -1, b)A$.

Similarly, a point $(x_1, x_2, 1)$ expressed in natural homogeneous coordinates maps into the line represented by $(1, x_1 - x_2, -x_1)$ in natural homogeneous coordinates. Another straightforward computation shows that the linear transformation is given by $t = xB$, or $(1, x_1 - x_2, -x_1) = (x_1, x_2, 1)B$, where

$$B = \begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$

Both transformations are linear in homogeneous coordinates, so conics map into conics. This is straightforward to see since an elementary quadratic form in the original space, say $xCx' = 0$ where $x'$ denotes $x$ transpose, represents the general conic. Clearly then, since $t = xB$ with $B$ nonsingular, we have $x = tB^{-1}$ so that $tB^{-1}C(B^{-1})'t' = 0$ is a quadratic form in the image space.

An instructive computation involves computing the image of an ellipse $ax^2 + by^2 - cz^2 = 0$ with $a, b, c > 0$. Writing the image triple as $(t, u, v) = (x, y, z)B$ gives $x = -v$, $y = -(u+v)$, $z = t$, so the image in the parallel coordinate space is $ct^2 - b(u+v)^2 = av^2$, a general hyperbolic form.

A common question about parallel coordinates involves the adjacency issue. Axes that are adjacent allow for easier comparison than axes that are not adjacent. How
many parallel coordinate displays do you need so that all pairwise adjacencies are present?

If the parallel coordinate axes are ordered from 1 through $d$, then there is an easy pairwise comparison of 1 with 2, 2 with 3, and so on. However, the pairwise comparison of 1 with 3, 2 with 5, and so on is not easily done because these axes are not adjacent. One simple mathematical question then is: what is the minimal number of permutations of the axes needed to guarantee all possible pairwise adjacencies? Although there are $d!$ permutations, many of these duplicate adjacencies. Actually far fewer permutations are required.

[Figure: graph on 7 vertices with the zigzag path 1 2 7 3 6 4 5]

A graph is drawn with vertices representing coordinate axes, labeled clockwise 1 to $d$. Edges represent adjacencies, so that vertex one connected to vertex two by an edge means axis one is placed adjacent to axis two. To construct a minimal set of permutations that completes the graph is equivalent to finding a minimal set of orderings of the axes so that every possible adjacency is present.

A basic zigzag pattern is used in the construction. This creates an ordering which in the example is 1 2 7 3 6 4 5. For $d$ even this general sequence can be written as $1, 2, d, 3, d-1, 4, d-2, \ldots, (d+2)/2$ and for $d$ odd as $1, 2, d, 3, d-1, 4, d-2, \ldots, (d+3)/2$. An even simpler formulation is

$$d_{k+1} = \left[d_k + (-1)^{k+1} k\right] \bmod d, \quad k = 1, 2, \ldots, d-1,$$

with $d_1 = 1$. Here $0 \bmod d = d \bmod d = d$.

This zigzag pattern can be recursively applied to complete the graph. That is to say, if we let $d_k^{(1)} = d_k$, we may define $d_k^{(j+1)} = \left(d_k^{(j)} + 1\right) \bmod d$, $j = 1, 2, \ldots, [(d-1)/2]$, where $[\cdot]$ is the greatest integer function. For $d$ even it follows that this construction generates each edge in one and only one permutation. Thus $d/2$ is the minimal number of permutations needed to assure that every edge appears in the graph, or equivalently that every adjacency occurs in the parallel coordinate representation.

For $d$ odd the result is not exactly the same. We will not have any
duplication of adjacencies for $j < [(d+1)/2]$. However, $j < [(d+1)/2]$ will not provide a complete graph. The case $j = [(d+1)/2]$ will complete the graph but also create some redundancies. Nevertheless it is clear that $[(d+1)/2]$ permutations are the minimal number needed to complete the graph and thus provide every adjacency in the parallel coordinate representation. Thus we have that the minimal number of permutations of the $d$ parallel coordinate axes needed to ensure adjacency of every pair of axes is $[(d+1)/2]$.

It is worthwhile to point out that all possible pairs may be found in only $[(d+1)/2]$ distinct parallel coordinate plots, but for a scatterplot matrix $d^2 - d$ plots are required. One practical consequence is that for a fixed computer screen size, elements in the scatterplot matrix become difficult to see much more rapidly than the parallel coordinate plots.

Grand Tour Theory

Methods: Winding Algorithm, Torus Method, Random Curve Algorithm, Fractal Curve Algorithm

The Winding Algorithm in $d$-space. Let $e_j = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ be the canonical basis vector of length $d$. The 1 is in the $j$th position. The $e_j$ are the unit vectors for each of the coordinate axes in the initial position. We want to do a general rigid rotation of these axes into a new position with basis vectors $a_j(t) = (a_{j1}(t), a_{j2}(t), \ldots, a_{jd}(t))$, where $t$ is the time index.

The strategy then is to take the inner product of each data point, say $x_i$, $i = 1, \ldots, n$, with the basis vectors $a_j(t)$. This operation projects the data into the rotated coordinate system. By convention $d$ will refer to the dimension of the data and $n$ will refer to the sample size of the data set. Of course, the $j$ subscript on $a_j(t)$ means that $a_j(t)$ is the image under the generalized rotation of the canonical basis vector $e_j$.

Thus the data vector $x_i$ is $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, so that the representation of $x_i$ in the $a_j$ coordinate system is $y_i(t) = (y_{i1}(t), y_{i2}(t), \ldots, y_{id}(t))$, $i = 1, 2, \ldots, n$, where

$$y_{ij}(t) = \sum_{k=1}^{d} x_{ik}\, a_{jk}(t), \quad j = 1, \ldots, d \text{ and } i = 1, \ldots, n.$$

The vector $y_i(t)$ is a linear combination of the basis vectors representing the $i$th data point in the rotated coordinate
system at time $t$. It is also worth pointing out that $y_{ij}(t)$ is also a linear combination of the data. If one component of the vector is held out from the grand tour, i.e. a partial grand tour, then the partial grand tour lends itself to an interpretation in terms of multiple linear regression.

The general goal then is to find a generalized rotation $Q$ such that $Q(e_j) = a_j$. We can conceive of $Q$ as either a function on the space of basis vectors or as a $d \times d$ matrix $Q$ where $e_j \times Q = a_j$. We implement this by choosing $Q$ as an element of the special orthogonal group, denoted $SO(d)$, of orthogonal $d \times d$ matrices having determinant $+1$. Thus we must find a continuous space-filling curve through $SO(d)$.

We shall do this by a composite mapping from the real line $\mathbb{R}$ to the $p$-dimensional hypercube $[0, 2\pi]^p$, i.e. $\alpha: \mathbb{R} \to [0, 2\pi]^p$, where $p = d(d-1)/2$. The components of $\alpha(t)$ are taken to be angles. The mapping $\beta$ from $[0, 2\pi]^p$ onto $SO(d)$ is given by

$$\beta(\theta_{12}, \theta_{13}, \ldots, \theta_{d-1,d}) = R_{12}(\theta_{12}) \times R_{13}(\theta_{13}) \times \cdots \times R_{d-1,d}(\theta_{d-1,d}).$$

There are $p = (d^2 - d)/2$ factors in the expression. These correspond to the $(d^2 - d)/2$ distinct two-flats formed by the canonical basis vectors.

We let $R_{ij}(\theta)$ be the element of $SO(d)$ which rotates the $e_i e_j$ plane through an angle of $\theta$. Thus

$$R_{ij}(\theta) = \begin{pmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & \cos\theta & \cdots & -\sin\theta & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & \sin\theta & \cdots & \cos\theta & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix}$$

where the cosines and sines are respectively in the $i$th and $j$th columns and rows. The restrictions on $\theta_{ij}$ are $0 \le \theta_{ij} \le 2\pi$, $1 \le i < j \le d$. The angles $\theta_{ij}$ are called the Euler angles.

Finally we construct $\alpha(t) = (\lambda_1 t, \lambda_2 t, \ldots, \lambda_p t)$ as the mapping from $\mathbb{R}$ onto $[0, 2\pi]^p$, where of course $\lambda_j t$ is taken modulo $2\pi$. $\lambda_1, \ldots, \lambda_p$ are taken to be linearly independent real numbers over the rational numbers. Thus we define $Q(t) = \beta(\alpha(t))$.

STAT 663 / CSI 773, Lecture 7: Trivariate Data

Trivariate Data, From Last Time: Coplots or Conditioned Plots, Abrasion Loss Data, Indexing of Panel Plots, Ethanol Data, Brushing, Fitting Surfaces (Linear, Quadratic, Loess Surfaces)

Trivariate Data, From Last Time: Coplots
(continued), Residual Dependence Plots vs. covariates, NGC 7531 Spiral Galaxy Data, Level Plots, Improvisation, Contour Plots (Algorithms), Wireframes

Wireframes

3D wireframes: perspective versus orthogonal views; occlusion; grid density (too small: surface features lost; too large: too black).

[Figures: wireframe examples]

Wireframes are effective for simple surfaces without much curvature. I personally prefer using computer graphics methods.

Computer Graphics: Ambient

4.1 Ambient Light. As indicated above, the simplest form of a lighting model is ambient light. We express a lighting model in terms of an illumination equation. A nondirectional diffuse source of ambient light, as would result from general reflection of light from all objects in the environment, yields the illumination equation

(4.1) $I = k_a I_a$

where $I_a$ is the intensity of ambient light, which is assumed to be constant for all objects, and $k_a$ is the ambient-reflection coefficient. The ambient-reflection coefficient is the fraction of ambient light reflected by a surface under consideration and ranges between 0 and 1. Distinct surfaces will in general have distinct ambient-reflection coefficients. [Remainder of slide illegible]

Computer Graphics: Diffuse

4.2 Diffuse Reflection. Objects illuminated by ambient light are uniformly lit in proportion to the intensity of the ambient light. For objects lit by a point source, the illumination will vary from one part of the object to another, with obviously no illumination on the backside of the object relative to the point source. Dull or matte surfaces exhibit diffuse reflection. These surfaces appear equally bright from all viewing angles because they reflect light equally in all directions. For a given surface, the brightness depends on the angle between the surface normal $N$ and the direction $L$ to the light source:

(4.2) $I = I_p k_d (N \cdot L)$

Computer Graphics: Attenuation

Here $k_d$ is the material's diffuse reflection coefficient. As with $k_a$, $k_d$ ranges between 0 and 1 and is the fraction of light reflected. In many circumstances $k_a = k_d$. $I_p$ is the illumination intensity of the point light source, and $N$
and $L$ are the unit vectors, respectively, for the surface normal and the direction to the light source. Adding the ambient term and an attenuation factor for the point source gives

(4.3) $I = k_a I_a + f_{att} I_p k_d (N \cdot L)$

where $f_{att}$ is the attenuation factor for the light source. That is, light intensity from a point source falls off as the inverse square of the distance $d_L$ between the light source and the surface. The factor is given a bounded form such as $f_{att} = \min\left(1, \dfrac{1}{c_1 + c_2 d_L + c_3 d_L^2}\right)$.

Computer Graphics: Color

Equation 4.3 can be generalized in the obvious way to account for variable surface reflectivity for different colors by introducing an additional multiplicative attenuation factor for each color (RGB), or more generally for each frequency $\lambda$ of light. Thus the equation becomes

(4.4) $I_\lambda = k_a I_{a\lambda} O_{d\lambda} + f_{att} I_{p\lambda} k_d O_{d\lambda} (N \cdot L)$

Here $O_{d\lambda}$ is the diffuse attenuation associated with a color whose frequency is $\lambda$.

Computer Graphics: Specular

Specular reflection occurs with objects with a shiny surface such as a mirror or a metallic surface. For a perfect specular reflection, the angle of incidence is equal to the angle of reflection. If $R$ is the unit vector in the direction of reflection, then the angle between $N$ and $R$ is $\theta$. If $V$ is the unit vector in the direction of the viewpoint, the viewer would only see a perfect specular reflection if $V$ and $R$ coincide. However, for less perfect specular reflection there is some level of visibility of the highlight, which depends on the angle $\alpha$ between $V$ and $R$. Maximum specular reflectance obviously should occur when $\alpha = 0$ and should drop to 0 when $\alpha = 90°$. The Phong lighting model models this dropoff as proportional to $\cos^n \alpha$. For $n = 1$ there is a gentle dropoff, and as $n \to \infty$ we approximate perfect specular reflection. The Phong model is then written

(4.5) $I_\lambda = k_a I_{a\lambda} O_{d\lambda} + f_{att} I_{p\lambda}\left[k_d O_{d\lambda} \cos\theta + k_s O_{s\lambda} \cos^n \alpha\right]$

where $\cos\theta = N \cdot L$, $\cos\alpha = R \cdot V$, and $k_s$ is the material's specular reflection coefficient, which as before ranges between 0 and 1.

Computer Graphics: Final Model

Here we have added one other subtlety, $O_{s\lambda}$, which is the attenuation factor for specular reflection at a color of frequency $\lambda$. Since it is known that the angle of reflection and the angle
of incidence are the same, $R$ can be computed in terms of $L$ and $N$ using simple geometry by the formula $R = 2N(N \cdot L) - L$. Thus $R \cdot V$ can be computed by $2(N \cdot L)(N \cdot V) - L \cdot V$. Thus it is not necessary to compute $R$ explicitly. If there are multiple point light sources, say $m$, the illumination equation becomes

(4.7) $I_\lambda = k_a I_{a\lambda} O_{d\lambda} + \sum_{i=1}^{m} f_{att_i} I_{p\lambda_i}\left[k_d O_{d\lambda} (N \cdot L_i) + k_s O_{s\lambda} (R_i \cdot V)^n\right]$

[Figures: Some Examples (Density); Some Examples (Stereo Pairs)]

Computer Graphics: Transparency

Transparency can be implemented in either nonrefractive or refractive form. Suppose polygon 1 is positioned in front of polygon 2. Interpolated transparency determines the shade of a pixel on the intersection of the two polygons' projections through linear interpolation

(4.8) $I_\lambda = (1 - k_{t_1}) I_{\lambda_1} + k_{t_1} I_{\lambda_2}$

where $k_{t_1}$ is the transmission coefficient and measures the transparency of polygon 1. Here of course the $t$ index reminds that this coefficient is associated with the transparency factor. Obviously if $k_{t_1} = 0$, polygon 1 is opaque and transmits no light from polygon 2, while if $k_{t_1} = 1$, polygon 1 is completely transparent. This form of transparency is called interpolated transparency.

Computer Graphics: Stereo

Our method uses standard projective geometry with separate centers of projection for the left and right eyes, denoted LCOP and RCOP respectively. The LCOP and RCOP coordinates relative to the center of the workstation screen are $(-e/2, 0, -d)$ and $(e/2, 0, -d)$, where $e$ is the eye separation and $d$ is the distance from the viewscreen. [Remainder of slide illegible]

[Figures: stereo projection geometry; Some Examples]

Computer Graphics: Contours

Contouring is a subtle problem, particularly in three dimensions. The grid upon which the function is computed must be sufficiently fine so that the function surface can be well approximated and that the fine structure is reliably reproduced. For two dimensions, assuming a square grid, a useful algorithm is to examine the sign of the value of $f(x) - \alpha f_{max}$ on each of the four corners of the grid elements. As
discussed in Lecture 6, if all the vertices have the same sign, the contour does not pass through that element. If one or two vertices have different signs from the remainder, then the contour passes through this square grid element and, using linear interpolation, an approximating line segment can be calculated. If the grid is sufficiently fine, a close approximation to the contour can be made. The analogue in 3 dimensions is the so-called Marching Cubes Algorithm.

[Figures: Computer Graphics: Contours; Some Examples]

Plotting Functions

The function may either be represented by a two-dimensional surface in three-dimensional space or by a series of contours in the two-dimensional space where the data live. An implicit representation of the surface is $g(x, y, z) = 0$, i.e. $z = f(x, y)$. Contours at level $\alpha$ are in this case essentially slices of the function's surface, defined as $S_\alpha = \{x \in \mathbb{R}^2 : f(x) = \alpha f_{max}\}$ where $f_{max} = \sup_{x \in \mathbb{R}^2} f(x)$. These are essentially intersections of the $g(x, y, z) = 0$ surface with the plane given by $z = \alpha f_{max}$. These contours are also known as isopleths.

In the discussion that follows we will abuse the language a little bit. Actually the isopleths and the gradient are embedded in the support space, in this case $\mathbb{R}^2$. However, in order to fuel the imagination, we shall temporarily consider them as being embedded on the surface of the function itself. The isopleths are contours where the function is constant. The gradient, denoted $\mathrm{Grad}(g)$, is a vector whose direction is the direction of maximum increase of $g$ and whose length $L_g = \|\mathrm{Grad}(g)\|$ is the magnitude of that maximum rate of increase.

Consider now a point $(x_0, y_0, z_0)$ where $z_0 = f(x_0, y_0)$. Since the density is constant on the isopleth whose level is $z_0$, the rate of increase of $g$ in the direction of the tangent vector, say $\mathrm{Tang}(g)$, to the isopleth at the point $(x_0, y_0)$ is zero. $\mathrm{Grad}(g)$ is a vector in the direction of maximum increase and $\mathrm{Tang}(g)(x_0, y_0) \perp \mathrm{Grad}(g)(x_0, y_0)$. If we consider the $x$ and $y$ components of $\mathrm{Grad}(g)(x_0, y_0)$ to be the $x$ and $y$ components
in $\mathbb{R}^2$ and the $z$-component of $\mathrm{Grad}(g)(x_0, y_0)$ to be the rate of increase of $z$ in the direction of $\mathrm{Grad}(g)(x_0, y_0)$, then these two vectors form a plane which is a tangent plane to the surface $g(x, y, z) = 0$ at $(x_0, y_0, z_0)$.

The normal, say $\mathrm{Norm}(g)$, to the tangent plane at $(x_0, y_0, z_0)$ is the surface normal. For visualization purposes we are interested in the unit normal, which is given by $N = \mathrm{Norm}(g) / \|\mathrm{Norm}(g)\|$. The three vectors just described, the tangent to the isopleth, the three-dimensional gradient and the surface normal, form a triplet of orthogonal vectors which is known as a trihedron in differential geometry.

Our three-dimensional gradient vector is orthogonal to the tangent vector to the isopleth, and together they form the tangent plane. Actually any two linearly independent vectors that span the tangent plane can be used. Let $e_x$, $e_y$ and $e_z$ be the orthogonal unit vectors which form our right-handed coordinate system. Then let

$$V_x = e_x + \frac{\partial f}{\partial x} e_z \quad \text{and} \quad V_y = e_y + \frac{\partial f}{\partial y} e_z.$$

These two vectors determine the tangent plane. The surface normal is the unit vector given by

$$N = \frac{V_x \times V_y}{\|V_x \times V_y\|}.$$

The three-dimensional setting is actually considerably simpler. In the three-dimensional setting the function is a surface in four dimensions. We are thus unable to render these surfaces directly. Rather we are interested in rendering the function's contours, i.e. the isopleths. In $\mathbb{R}^3$ the isopleths are given by $S_\alpha = \{x \in \mathbb{R}^3 : f(x) = \alpha f_{max}\}$ where $f_{max} = \sup_{x \in \mathbb{R}^3} f(x)$ and where $x = (x, y, z)$. As in the two-dimensional case, the isopleths are orthogonal to the gradient. Since the gradient is already orthogonal to the isopleth, the gradient vector and the surface normal vector are coincident, except that they point in opposite directions, i.e. they are 180° out. By simply reversing the sign of the gradient and normalizing it to unit length we have the surface normal of the isopleth. In this case

$$N = -\frac{\mathrm{Grad}(f)}{\|\mathrm{Grad}(f)\|}.$$

Numerical Methods for Exploratory Data Analysis

[Handwritten slides; only partly legible in this copy]

Dual aims of EDA:
1. Verify model
assumptions
2. Search for unanticipated structure

Topics discussed: normality assumptions, outliers.
Methods: skewness and kurtosis estimates, density estimation, cross-validation, runs test, projection pursuit.

Fact: a normal distribution is completely characterized by its first two moments, the mean and the variance. Higher order moments can indicate nonnormality.

Skewness is based on the 3rd moment. For a normal density the theoretical skewness $E(X - \mu)^3 / \sigma^3$ is 0. Hence if the sample skewness differs significantly from 0, one would suspect nonnormality.

Kurtosis is based on the 4th moment. The theoretical kurtosis $E(X - \mu)^4 / \sigma^4$ is 3 for the normal density. Thus if the sample kurtosis differs significantly from 3, one would suspect nonnormality. Heavy-tailedness can screw up an analysis; light-tailedness is less serious than heavy-tailedness.

Robust methods: M-estimates, rank estimates, adaptive approaches. M-estimates are maximum-likelihood-like estimators. Rank methods basically replace an observation by its rank. Adaptive methods assess tail behavior and then choose an estimator accordingly, for example the midmean, the median, or a trimmed mean.

Bootstrap and resampling: to estimate the variability of an estimate $\hat{\theta}$, sample with replacement from the data and recompute the estimate on each resample.

Nonparametric density estimation: kernel estimators, where $K$ is called the kernel. Kernels named on the slide include the Gaussian and the Epanechnikov.

Cross-validation: in general, cross-validation divides the data into $k$ parts, bases the estimation on part of the data and compares on the rest. Usually $k = n$, the sample size.

Runs test: given a sample with median $m$, mark each observation as above (+) or below (-) the median. This gives a sequence of +'s and -'s, and the test is based on the number of runs.

Projection Pursuit and Grand Tour: consider a multivariate sample in $d$ dimensions and a projection matrix $A$ with $AA' = I$ that takes the $d$-dimensional data into a $k$-dimensional space. Projection pursuit finds interesting projections by maximizing or minimizing a projection index. Indexes named on the slide include the variance (principal components, PCA), negative entropy, and Fisher information.
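The moment checks and the kernel estimator listed in these notes lend themselves to a short numerical sketch. What follows is only an illustration under stated assumptions: the function names `sample_moments` and `gaussian_kde`, the 1/n moment estimates, the bandwidth `h = 0.3`, and the random seed are choices made here and are not taken from the lecture. The density estimate uses the standard kernel form $\hat{f}(x) = \frac{1}{nh}\sum_i K\left(\frac{x - x_i}{h}\right)$ with a Gaussian $K$.

```python
import math
import random


def sample_moments(xs):
    """Return (mean, variance, skewness, kurtosis) from 1/n moment estimates."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5  # theoretical value 0 for a normal density
    kurtosis = m4 / m2 ** 2    # theoretical value 3 for a normal density
    return mean, m2, skewness, kurtosis


def gaussian_kde(xs, h):
    """Return a function x -> kernel density estimate (Gaussian kernel, bandwidth h)."""
    n = len(xs)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density


if __name__ == "__main__":
    rng = random.Random(663)  # arbitrary fixed seed (assumption)
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    mean, var, skew, kurt = sample_moments(data)
    print(f"mean={mean:.3f} var={var:.3f} skew={skew:.3f} kurt={kurt:.3f}")
    f_hat = gaussian_kde(data, h=0.3)  # bandwidth picked by eye (assumption)
    print(f"f_hat(0)={f_hat(0.0):.3f}")
```

For a sample that really is normal, the printed skewness should sit near 0 and the kurtosis near 3. How far from those values counts as "significantly" different is a separate testing question that this sketch does not address.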
STAT 663 / CSI 773, Lecture 9: Visual Data Mining

Outline of Lecture
Visual Complexity
Description of Basic Techniques: Parallel Coordinates, Grand Tour, Saturation Brushing
Illustrations of Basic Techniques:
  Rapid Data Editing and Density Estimation: Pollen Data
  Inverse Regression and Tree Structured Decision Rules: Bank Data
  Classification and Clustering: SALAD Data and Artificial Nose
  Structural Inference: PRIM 7 Data
  Data Mining: BLS Cereal Scanner Data
  Cluster Trees: Oronsay Sand Particle Size Data

Visual Complexity

The Huber-Wegman Taxonomy of Data Set Sizes

Descriptor      Data Set Size (bytes)   Storage Mode
Tiny            $10^2$                  Piece of paper
Small           $10^4$                  A few pieces of paper
Medium          $10^6$                  A floppy disk
Large           $10^8$                  Hard disk
Huge            $10^{10}$               Multiple hard disks, e.g. RAID storage
Massive         $10^{12}$               Robotic magnetic tape storage silos
Super Massive   $10^{15}$               Distributed archives

Visual Complexity: Scenarios
Typical high resolution workstations: 1280 x 1024 = $1.31 \times 10^6$ pixels
Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333 x 1866 = $4.35 \times 10^6$ pixels
Very optimistic, using 1 minute arc, immersion, 4:5 aspect ratio: 8400 x 6720 = $5.64 \times 10^7$ pixels
Wildly optimistic, using Maar (2), immersion, 4:5 aspect ratio: 17284 x 13828 = $2.39 \times 10^8$ pixels

Visual Complexity

Visualization for data mining can realistically hope to deal with somewhere on the order of $10^6$ to $10^7$ observations. This coincides with the approximate limits for interactive computing of $O(n^2)$ algorithms and for data transfer. This also roughly corresponds to the number of foveal cones in the eye.

Methodologies for Visual Data Mining
Parallel Coordinates: effective method for high dimensional data; high dimensions = multiple attributes
Grand Tour: generalized rotation in high dimensions; in-depth study of high dimensional data
Saturation Brushing: effective method for large data sets

Visual Data Mining Techniques
Multidimensional data visualization: scatterplot matrix, parallel coordinate plots, 3D
  stereoscopic scatterplots
- Grand tour on all plot devices
- Density plots
- Linked views
- Saturation brushing
- Pruning and cropping

[Figures: CrystalVision screenshots]

CrystalVision: Data Editing and Density Estimation
- Pollen Data: 3848 points, 5 dimensions

[Figures: CrystalVision displays of the pollen data; legible axis labels include RIDGE, NUB, CRACK, WEIGHT, and DENSITY]

Inverse Regression and Tree Structured Decision Rules with Financial Data
- Bank Demographic Data in 8 dimensions with 12000 points

[Figures: CrystalVision displays of the bank data]

Classification and Clustering Using SALAD Data
- Chemical Agent Detection Data in 13 dimensions with 10000 points

[Figures: CrystalVision displays of the SALAD data]

Artificial Dog Nose
- 19 dimensional time series in 2 spectral bands
- 60 time steps for 300 chemical species

Figure captions:
- Time series in two spectral bands for the same chemical species
- Phase loop
- Orthogonal components
- After grand tour, orthogonal
  variables X29 X992 X151 X16 X18 separate the two spectral bands
- Four chemical species, target highlighted in red
- Target species separated by x1, x3, x5, x6, x11, x15

PRIM 7
- 7 dimensional high energy physics data
- 500 data points
- pi-meson proton interaction

Structural Inference Using PRIM 7 Data

[Figures: CrystalVision displays of the PRIM 7 data]

Scanner Data for Breakfast Cereals
- 55 gigabytes of scanner data in relational database
- Price, sales volume, promotion, store, chain, PSU, UPC
- Work done at BLS
- Phase 1: Basic Data Analysis, Single Month
- Phase 2: Price Relative Effects, 1 Year
- Phase 3: Churning Effects, 5 Years

Figure captions:
- Promotion has huge impact on sales volume
- Stores not randomized

Scanner Data for Breakfast Cereals: Phase 2
Figure captions:
- Outliers belong to same chain
- Run of items with no promotion
- Chain ceased promotions

Scanner Data for Breakfast Cereals: Phase 3
Figure captions:
- Churning comes from both new items and new stores
- Churning effects: red, PR = 0; blue, PR > 0; green, PR = infinity
- New items tend to have higher prices
- Many discontinued items have high expenditures
- Effect of item churning
- Outlier due to price coding error
- Price effects of cereal types
- Quantity effects

Sands of Time: Data
- 300 samples of sand data from Oronsay Island in the Scotch Hebrides

Sands of Time: Objective
The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface. (Flenley and Olbricht, 1993)

Sands of Time: Objective
- Cluster samples of modern sand into beach-like or dune-like sand.
- Classify archeological sand samples as to whether they are beach sand or dune sand.

Sands of Time: Parametric Analysis
- Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
- Weibull (1933), lognormal, breakage models, log-hyperbolic, log-skew-Laplace (1937), Barndorff-Nielsen (1977)
- Models have 2 to 4 parameters; theory developed, practice problematic.

Sands of Time: Graphical Analysis
- Multidimensional parallel coordinate display combined with grand tour
- BRUSH-TOUR strategy
  - Clusters are recognized by gaps in any horizontal axis.
  - Brush existing clusters with colors.
  - Execute grand tour until new clusters appear; brush again.
  - Continue until clusters are exhausted.

Mining the Sands of Time

[Figure: map of Oronsay showing the sampling sites, including Caisteal nan Gillean; scale bar 300 metres]
[Figures: CrystalVision displays of the Oronsay sand data; axes are particle-size classes, the largest being > 2.0 mm; one panel is labeled ORONSAY CC DATA]

Sands of Time: Conclusions
- Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
- Data at small and at large particle dimensions are too quantized to be used effectively.
- The visual-based BRUSH-TOUR strategy is extremely effective at clustering.

Edward J. Wegman
Lecture 4: Data Visualization

Data Types
- Univariate
  - Histograms
  - Quantile Plots
  - QQ Plots
  - Normal QQ Plots
  - Box Plots
  - Fitting Methods
- Bivariate
- Trivariate
- Hypervariate
- Multiway

Data Visualization
- Visualization: graphing, fitting
- Visualization: replace probabilistic inference, check model assumptions

Histograms
Figure 1.2. Histograms graph the singer heights by voice part (Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1). The interval width is one inch.

Univariate Data: Quantiles
The $f$ quantile, $q(f)$, is the value along the measurement scale of the data such that approximately a fraction $f$ of the data is below or equal to $q(f)$.

Univariate Data: Example
Consider a set of observations: 1, 4, -2, 7, 6, 0 (the extraction preserves only six of the seven values; the quantile list below shows the seventh to be 3). There are seven data points; the smallest is -2 and the largest is 7. The quantiles are as follows:

f = 1/7   q(f) = -2
f = 2/7   q(f) = 0
f = 3/7   q(f) = 1
f = 4/7   q(f) = 3
f = 5/7   q(f) = 4
f = 6/7   q(f) = 6
f = 7/7   q(f) = 7

Notice also that the largest observation has $f = 1$. This is a potential problem for distributions with upper bo[text truncated in source]

Quantiles
- The .25 quantile is the lower quartile (1st).
- The .5 quantile is the median.
- The .75 quantile is the upper quartile (3rd).
If $x_1, x_2, \ldots, x_n$ is a set of observations, then the observations arranged in increasing order, $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, are the order statistics, which are closely related to the idea of rank.

Quantile Plot
[Figure: quantile plot; vertical axis is the measurement scale, $q(f)$]
- Graphical ordering for multiway plots: values increase left to right, bottom to top.
- Visual reference grids are used to enhance comparison of patterns.

Quantile Plot
Rather than use $f_i = i/n$, we let $f_i = (i - 0.5)/n$. Then the minimum is $1/(2n)$ and the maximum is $1 - 1/(2n)$. This works better with distributions with support $(-\infty, \infty)$. Take $q(f_i) = x_{(i)}$ and plot $q(f_i)$ versus $f_i$. This is the quantile plot.

New York
Choral Society
Figure 2.1. Quantile plots display univariate data: the heights of singers in the New York Choral Society. Axes: height (inches) versus f-value.

QQ Plots
Let $x_1, \ldots, x_n$ be the first data set and let $y_1, \ldots, y_m$ be a second data set, with $m \le n$. A QQ plot graphs quantiles of one data set against the corresponding quantiles of the other.
- If $m = n$, then $x_{(i)}$ and $y_{(i)}$ are both $(i - 0.5)/n$ quantiles, and to obtain a QQ plot we merely plot $y_{(i)}$ against $x_{(i)}$.
- If $m < n$, then $y_{(i)}$ is the $(i - 0.5)/m$ quantile, so we must plot $y_{(i)}$ against the $(i - 0.5)/m$ quantile of the x's. This must be interpolated.

Interpolation Example
x: 1, 2, 5, 7, 9, 11
y: 2, 6, 10, 12
Then for the y's $f_i = (i - 0.5)/4$, and for the x's $f_i = (i - 0.5)/6$.
[The worked interpolation that follows on the slide is not recoverable from the extraction.]

QQ Plot
Figure 2.3. The first tenor and second bass height distributions are compared by a q-q plot. Axes: Bass 2 height (inches) versus Tenor 1 height (inches).

Tukey Mean-Difference Plot
Figure 2.4. The first tenor and second bass height distributions are compared by a Tukey m-d plot. Axes: difference (inches) versus mean (inches).

Pairwise QQ Plots
[Figure: pairwise q-q plots of height (inches) for the voice parts]

Box Plots
- Filled circle: the median, a robust measure of location.
- Upper and lower ends of the box are the upper and lower quartiles.
- The IQR is a measure of spread, and the box can give a hint of skewness.
- Adjacent values: if $r$ = IQR, then the upper adjacent value is the largest observation at or below the upper quartile $+\,1.5r$, and the lower adjacent value is the smallest observation at or above the lower quartile $-\,1.5r$. Adjacent values provide information about the tails.
- Outside values, those beyond the adjacent values, are plotted individually and are the outliers.

Box Plots
Figure 2.6. The diagram defines the box plot display method. Labels, top to bottom: outside values, upper adjacent value, upper quartile, median, lower quartile, lower adjacent value.

Box Plots
Figure 2.8. The eight singer distributions are compared by box plots. Panels: Soprano 1, Soprano 2, Alto 1, Alto 2, Tenor 1, Tenor 2, Bass 1, Bass 2; axis: height (inches).

Normal Quantile Plots
In sample quantile plots we plot $q(f)$ versus $f$, where $f$ is the fraction of sample mass smaller than or
equal to $q(f)$. In theoretical quantile plots, $f$ is replaced with the fraction of the probability "mass" below $q(f)$, i.e. just the probability itself. If $X$ is an observation from a normal distribution with mean $\mu$ and standard deviation $\sigma$, then $\mathrm{Prob}(X \le q_{\mu,\sigma}(f)) = f$.

Normal Quantile Plots
Figure 2.10. The curve displays the quantiles of a normal distribution for f-values from 1/70 to 69/70, which are the f-values of the minimum and maximum heights of the first altos.

Normal QQ Plots
Let $x_1, \ldots, x_n$ be a set of observations, so that $x_{(i)}$ has $f_i = (i - 0.5)/n$; that is, $x_{(i)}$ is the $f_i$ quantile of the data. Let $q_{\mu,\sigma}(f_i)$ be the corresponding normal $f_i$ quantile. Then plot $x_{(i)}$ versus $q_{\mu,\sigma}(f_i)$. This is the normal QQ plot. Note that $q_{\mu,\sigma}(f_i) = \mu + \sigma\, q_{0,1}(f_i)$, so plotting against the unit normal quantile $q_{0,1}(f_i)$ changes only the intercept and slope of the pattern.

Normal QQ Plot
Figure 2.11. Normal q-q plots compare the eight height distributions with the normal distribution. Axes: height (inches) versus unit normal quantile.

Big Idea 1: Normal Approximations
- Standard deviations affect slope.
- Means affect vertical shift if slopes are fixed.
- In the vocal range example the slopes vary. I would say a fair amount; Cleveland says not much. See Figure 2.11.
- Use the normal QQ plot to verify normality. If the fit is good, the mean and standard deviation can be estimated from the vertical shift and the slope.

Fitting and Examining Residuals
In the choral voice parts example, Cleveland suggests there is a common variance of singer heights and that height differences are modeled only by voice part. Let $h_{pi}$ be the height of the $i$th person in the $p$th voice part, and let $\bar{h}_p$ be the mean height for voice part $p$. The fit is $\hat{h}_{pi} = \bar{h}_p$. Then $\hat{\varepsilon}_{pi} = h_{pi} - \hat{h}_{pi} = h_{pi} - \bar{h}_p$. The $\hat{\varepsilon}_{pi}$ are the residuals.

Box Plot of Residuals
Figure 2.13. Box plots compare the distributions of the height residuals for the fit to the data by voice-part means.

Big Idea 2: Pooling
- If the residuals are homogeneous, then the additive shift hypothesis is confirmed.
- Then form a panel of QQ plots of the residuals of individual voice parts against all pooled residuals, to confirm that the residuals of individual voice parts have the same distribution as the pooled residuals. See Figure 2.14.

Figure 2.14. Each panel is a q-q plot that compares the
distribution of the residuals for one voice part with the distribution of the pooled residuals. Axis: residual height (inches).

Big Idea 3: Fit to Normal
Use a normal QQ plot of the pooled residuals against normal quantiles to verify that the residuals have an approximately normal distribution. See Figure 2.16.

Fit to Normal
Figure 2.16. A normal q-q plot compares the distribution of the pooled residuals with a normal distribution.

Big Idea 4: Skewness, Transform by Logarithm
- The logarithm reduces large values and opens up small values near zero.
- It changes skewness to near symmetry.

Stereo Fusion Times
- For data skewed toward large values, the result is a strongly convex pattern.
- Fusion times are badly skewed.
Figure 2.19. Quantile plots display the two distributions of the fusion-time data.

Symmetric vs. Skewed Quantile Plots
Figure 2.20. The graph shows the quantile function of data with a symmetric distribution.
Figure 2.21. The graph shows the quantile function of data that are skewed toward large values.

Skewness and Tail Behavior

Normal QQ for Fusion Data
Figure 2.22. Normal q-q plots compare the distributions of the fusion-time data with the normal distribution. Axes: time (seconds) versus unit normal quantile.

Box Plot for Fusion Data
Figure 2.23. The distributions of the fusion times are compared by box plots.

Log Transform of Fusion Times
Figure 2.24. Normal q-q plots compare the distributions of the log fusion times with the normal distribution. Axes: log time (log2 seconds) versus unit normal quantile.
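
The normal QQ construction and the log transform can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses f-values of the form (i - 0.5)/n and unit normal quantiles, the function names are hypothetical, and the `times` values are made up for the example; they are not the fusion-time data from the slides.

```python
import math
from statistics import NormalDist, mean


def f_values(n):
    """f-values (i - 0.5)/n for i = 1..n, one per order statistic."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def normal_qq(data):
    """Return (unit normal quantile, order statistic) pairs for a normal QQ plot."""
    xs = sorted(data)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    return [(unit_normal.inv_cdf(f), x) for f, x in zip(f_values(len(xs)), xs)]


def skewness(data):
    """Moment-based sample skewness: third central moment over (variance ** 1.5)."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed "times" in seconds (NOT the fusion-time data).
times = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 9.0, 16.0, 32.0]
log_times = [math.log2(t) for t in times]  # log base 2, as on the slide axis

raw_pairs = normal_qq(times)      # convex pattern expected when plotted
log_pairs = normal_qq(log_times)  # closer to a straight line when plotted

print("skewness of raw times:", round(skewness(times), 2))
print("skewness of log2 times:", round(skewness(log_times), 2))
```

Plotting the second element of each pair against the first gives the normal QQ plot; a roughly straight pattern suggests approximate normality, with the slope reflecting the standard deviation and the vertical shift reflecting the mean. The moment-based skewness is used here only as a numeric stand-in for the visual judgment the slides make from the plots.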