DATA BASE MANAGEMENT SYSTEMS (CS 236)
This 24-page set of class notes was uploaded by Adele Schaden MD on Thursday, October 29, 2015. The notes belong to CS 236 at the University of California, Riverside, taught by Staff in Fall. Since its upload, it has received 12 views. For similar materials see /class/231748/cs-236-university-of-california-riverside in Computer Science at the University of California, Riverside.
Dimensionality Reduction Techniques
Dimitrios Gunopulos, UCR

Retrieval techniques for high-dimensional datasets
The retrieval problem:
- Given a set of objects S and a query object Q, find the objects in S that are most similar to Q.
Applications:
- financial, voice, marketing, medicine, video

Examples
- Find companies with similar stock prices over a time interval.
- Find products with similar sell cycles.
- Cluster users with similar credit card utilization.
- Cluster products.

Indexing when the triangle inequality holds
Typical distance metric: the Lp norm. We use L2 as an example throughout:
- D(S,T) = (sum_{i=1..n} (S_i - T_i)^2)^(1/2)

Indexing: the naive way
Each object is an n-dimensional tuple. Use a high-dimensional index structure to index the tuples. Such index structures include:
- R-trees
- kd-trees
- vp-trees
- grid files

Indexing techniques
We will look at:
- R-trees and variants
- kd-trees
- vp-trees and variants
- sequential scan
R-trees and kd-trees partition the space; vp-trees and their variants partition the dataset. There are also hybrid techniques.

R-trees and variants (Guttman 1984; Sellis et al. 1987; Beckmann et al. 1990)
- A k-dimensional extension of B-trees; a balanced tree.
- Intermediate nodes are rectangles that cover the lower levels.
- Rectangles may be overlapping or not, depending on the variant (R-trees, R+-trees, R*-trees).
- Can index rectangles as well as points.

kd-trees
- Based on binary trees; a different attribute is used for partitioning at different levels.
- Efficient for indexing points.
- External-memory extensions: the hB-tree.

Grid files
- Use a regular grid to partition the space; the points in each cell go to one disk page.
- Can only handle points.

Vp-trees and pyramid trees (Uhlmann; Berchtold et al. 1998; Bozkaya et al. 1997)
- Basic idea: partition the dataset rather than the space.
- Vp-trees: at each level, partition the points based on their distance from a center. (Figure: the root level of a vp-tree with 3 children.)
- Others: mvp-trees, pyramid-trees.

Sequential scan
The simplest technique:
- Scan the dataset once, computing the distances.
- Optimizations: give lower bounds on the distance quickly.
- Competitive when the dimensionality is large.

High-dimensional indexing methods: summary
- For low dimensionality (<10), space-partitioning techniques work best.
- For high dimensionality, sequential scan will probably be competitive with any technique.
- In between, dataset-partitioning techniques work best.

High-dimensional index structures
- All require the triangle inequality to hold.
- All partition either the space or the dataset into regions.
- The objective is to search only those regions that could potentially contain good matches, and to avoid everything else.

The naive approach: problems
- High dimensionality decreases index structure performance (the curse of dimensionality).
- High dimensionality slows down the distance computation.
- The result is inefficiency.

Dimensionality reduction
The main idea: reduce the dimensionality of the space. Project the n-dimensional tuples that represent the time series into a k-dimensional space so that:
- k << n
- distances are preserved as well as possible

Dimensionality reduction
Use an indexing technique on the new space. GEMINI (Faloutsos et al.):
- Map the query S to the new space.
- Find the nearest neighbors of S in the new space.
- Compute the actual distances and keep the closest.

Dimensionality reduction
A time series is represented as a k-dimensional point. The query is also transformed to the k-dimensional space.

Dimensionality reduction
Let F be the dimensionality reduction technique. Optimally we want:
- D(F(S), F(T)) = D(S,T)
Clearly this is not always possible. If D(F(S), F(T)) != D(S,T):
- false dismissals occur when D(S,T) << D(F(S), F(T))
- false positives occur when D(S,T) >> D(F(S), F(T))

Dimensionality reduction
To guarantee no false dismissals, we must be able to prove that:
- D(F(S), F(T)) <= a * D(S,T)
- for some constant a
A small rate of false positives is desirable, but not essential.

What we achieve
- Indexing structures work much better in lower-dimensionality spaces.
- The distance computations run faster.
- The size of the dataset is reduced, improving performance.

Dimensionality techniques
We will review a number of dimensionality
techniques that can be applied in this context:
- SVD decomposition
- Discrete Fourier transform (and Discrete Cosine transform)
- Wavelets
- Partitioning in the time domain
- Random projections
- Multidimensional scaling
- FastMap and its variants

SVD decomposition: the Karhunen-Loeve transform
Intuition: find the axis that shows the greatest variation, and project all points onto this axis. (Faloutsos, 1996)

SVD: the mathematical formulation
Find the eigenvectors of the covariance matrix. These define the new space. The eigenvalues sort them in "goodness" order.

SVD: the mathematical formulation (cont'd)
Let A be the M x n matrix of M time series of length n. The SVD decomposition of A is U x L x V^T, where:
- U, V are orthogonal
- L is diagonal; L contains the singular values of A (the square roots of the eigenvalues of A^T A)

SVD (cont'd)
To approximate the time series, we use only the k largest eigenvectors:
- A' = U_k x L_k
- A' is an M x k matrix

SVD (cont'd)
Advantages:
- Optimal dimensionality reduction for linear projections.
Disadvantages:
- Computationally hard, especially if the time series are very long.
- Does not work for subsequence indexing.

SVD extensions
- Online approximation algorithm (Ravi Kanth et al. 1998).
- Local dimensionality reduction: cluster the time series and solve for each cluster (Chakrabarti and Mehrotra 2000; Thomasian et al.).

Discrete Fourier Transform
Analyze the frequency spectrum of a one-dimensional signal. For S = (S_0, ..., S_{n-1}), the DFT is:
- S_f = (1/sqrt(n)) * sum_{i=0..n-1} S_i e^{-j*2*pi*f*i/n},  f = 0, 1, ..., n-1,  j^2 = -1
An efficient O(n log n) algorithm (the FFT) makes the DFT a practical method. (Agrawal et al. 1993; Rafiei and Mendelzon 1998)

Discrete Fourier Transform
To approximate the time series, keep only the k largest Fourier coefficients. Parseval's theorem:
- sum_{i=0..n-1} S_i^2 = sum_{f=0..n-1} |S_f|^2
The DFT is a linear transform, so:
- sum_{i=0..n-1} (S_i - T_i)^2 = sum_{f=0..n-1} |S_f - T_f|^2

Discrete Fourier Transform
Keeping k DFT coefficients lower-bounds the distance:
- sum_{i=0..n-1} (S_i - T_i)^2 >= sum_{f=0..k-1} |S_f - T_f|^2
Which coefficients to keep?
- The first k (F-index: Agrawal et al. 1993; Rafiei and Mendelzon 1998).
- Find the optimal set, for the non-dynamic case (R. Kanth et al. 1998).

Discrete Fourier Transform
Advantages:
- Efficient; concentrates the energy.
Disadvantages:
- To project the n-dimensional time series into a k-dimensional space, the same k Fourier coefficients must be stored for all series. This is not optimal for all series.
- To find the k optimal coefficients for M time series, compute the average energy for each coefficient.

Wavelets
Represent the time series as a sum of prototype functions, as with the DFT. Typical basis used: Haar wavelets.
- Difference from the DFT: localization in time.
- Can be extended to 2 dimensions (Chan and Fu 1999).
- Has been very useful in graphics approximation techniques.

Wavelets
An example using the Haar wavelet basis: given the wavelet coefficients S' = (S'_0, S'_1, S'_2, S'_3) of a length-4 series, the series is reconstructed as:
- S_0 = S'_0 + S'_1/2 + S'_2/2
- S_1 = S'_0 + S'_1/2 - S'_2/2
- S_2 = S'_0 - S'_1/2 + S'_3/2
- S_3 = S'_0 - S'_1/2 - S'_3/2
An efficient O(n) algorithm finds the coefficients.

Using wavelets for approximation
Keep only k coefficients and approximate the rest with 0.
- Keeping the largest k coefficients gives a more accurate representation, but is not useful for indexing.
- Keeping the first k coefficients is equivalent to low-pass filtering.

Wavelets
Advantages:
- The transformed time series remains in the same temporal domain.
- Efficient O(n) algorithm to compute the transformation.
Disadvantages:
- Same as the DFT.

Line segment approximations
Piecewise Aggregate Approximation:
- Partition each time series into k subsequences (the same for all series).
- Approximate each subsequence by its mean and/or variance (Keogh and Pazzani 1999; Yi and Faloutsos 2000), or by a line segment (Keogh and Pazzani 1998).

Temporal partitioning
- A very efficient technique: an O(n)-time algorithm.
- Can be extended to address the subsequence matching problem.
- Equivalent to wavelets when k = 2^i and the mean is used.

Random projection
Based on the Johnson-Lindenstrauss lemma. For:
- 0 < e < 1/2,
- any sufficiently large set S of M points in R^n, and
- k = O(e^-2 ln M),
there exists a linear map f: S -> R^k such that:
- (1-e) D(S,T) <= D(f(S), f(T)) <= (1+e) D(S,T) for S, T in S.
A random projection is good with constant probability. (Indyk
2000)

Random projection: application
- Set k = O(e^-2 ln M).
- Select k random n-dimensional vectors.
- Project the time series onto the k vectors.
The resulting k-dimensional space approximately preserves the distances with high probability. This is a Monte Carlo algorithm: we do not know whether the result is correct.

Random projection
A very useful technique, especially when used in conjunction with another technique, for example SVD: use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce the dimensionality further.

Multidimensional scaling
Used to discover the underlying structure of a set of items from the distances between them. Finds an embedding in a k-dimensional Euclidean space that minimizes the difference in distances. Has been applied to clustering, visualization, and information retrieval.

Algorithms for MDS
Input: M time series, their pairwise distances, and the desired dimensionality k.
Optimization criterion:
- stress = ( sum_{i,j} (D(S_i, S_j) - D(S'_i, S'_j))^2 / sum_{i,j} D(S_i, S_j)^2 )^(1/2)
- where D(S_i, S_j) is the distance between time series S_i and S_j, and D(S'_i, S'_j) is the Euclidean distance between their k-dimensional representations.
Steepest descent algorithm:
- Start with an assignment of each time series to a k-dimensional point.
- Minimize the stress by moving the points.

Multidimensional scaling
Advantages:
- Good dimensionality reduction results, though with no guarantees of optimality.
Disadvantages:
- How to map the query? The obvious solution takes O(M) time.
- Slow conversion algorithm.

FastMap (Faloutsos and Lin 1995)
Maps objects to k-dimensional points so that distances are preserved well. It is an approximation of multidimensional scaling.
- Works even when only the distances are known.
- Is efficient, and allows efficient query transformation.

How FastMap works
- Find two objects that are far away from each other.
- Project all points onto the line the two objects define, to get the first coordinate.
- Project all objects onto a hyperplane perpendicular to that line.
- Repeat k-1 times.

MetricMap (Wang et al. 1999)
- Embeds objects into a k-dimensional pseudo-metric space.
- Takes a random sample of points and finds the eigenvectors of their covariance matrix.
- Uses the largest eigenvalues to define the new k-dimensional space.
- Gives similar results to FastMap.

Dimensionality techniques: summary
- SVD: optimal for linear projections; the slowest.
- DFT: efficient; works well in certain domains.
- Temporal partitioning: the most efficient; works well.
- Random projection: very useful when applied with another technique.
- FastMap: particularly useful when only the distances are known.

Local embeddings
(Figure: the S-curve dataset, its SVD projection, and its LLE and ISOMAP mappings.)
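The DFT lower-bounding argument above (Parseval's theorem plus linearity) can be illustrated with a small pure-Python sketch. The function names and the sample series values are mine, chosen only for illustration:

```python
import cmath
import math

def dft(s):
    """Unitary DFT: S_f = (1/sqrt(n)) * sum_i s_i * e^(-j*2*pi*f*i/n)."""
    n = len(s)
    return [sum(s[i] * cmath.exp(-2j * cmath.pi * f * i / n) for i in range(n))
            / math.sqrt(n) for f in range(n)]

def euclidean(s, t):
    """L2 distance in the time domain."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def lower_bound(s, t, k):
    """Distance between the first k DFT coefficients; lower-bounds D(s, t)."""
    S, T = dft(s), dft(t)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(S[:k], T[:k])))

# illustrative series (not from the notes)
s = [5.0, 6.0, 2.0, 2.0, 1.0, 0.0, 3.0, 4.0]
t = [4.0, 4.0, 3.0, 1.0, 2.0, 2.0, 2.0, 5.0]

d = euclidean(s, t)
# Parseval: the unitary DFT preserves the L2 norm, so using all n
# coefficients gives exactly the time-domain distance...
full = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(dft(s), dft(t))))
assert abs(full - d) < 1e-9
# ...and dropping coefficients can only shrink it, so no false dismissals:
for k in range(1, len(s) + 1):
    assert lower_bound(s, t, k) <= d + 1e-9
```

Because the truncated distance never exceeds the true distance, a range query in the reduced space returns a superset of the true answers (false positives only), which is exactly the GEMINI guarantee.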
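The O(n) Haar decomposition mentioned above can be sketched as follows; this is a minimal pure-Python version (unnormalized averages-and-differences form, with the sample series chosen by me), consistent with the reconstruction formulas given for length-4 series:

```python
def haar(s):
    """Unnormalized Haar wavelet decomposition, O(n) for n a power of two.
    Returns [overall mean, coarsest difference, ..., finest differences]."""
    coeffs = []
    while len(s) > 1:
        averages = [(s[2 * i] + s[2 * i + 1]) / 2 for i in range(len(s) // 2)]
        details = [s[2 * i] - s[2 * i + 1] for i in range(len(s) // 2)]
        coeffs = details + coeffs  # coarser levels end up in front
        s = averages
    return s + coeffs

s = [6.0, 2.0, 5.0, 5.0]
c = haar(s)  # -> [4.5, -1.0, 4.0, 0.0]

# The reconstruction formulas for n = 4:
assert s[0] == c[0] + c[1] / 2 + c[2] / 2
assert s[1] == c[0] + c[1] / 2 - c[2] / 2
assert s[2] == c[0] - c[1] / 2 + c[3] / 2
assert s[3] == c[0] - c[1] / 2 - c[3] / 2
```

Note that c[0] is the series mean, so keeping only the first few coefficients amounts to low-pass filtering, as the notes observe.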
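The Piecewise Aggregate Approximation described above is simple enough to state directly in code. This sketch keeps only each segment's mean and assumes, for brevity, that k divides the series length; the function name and data are mine:

```python
def paa(s, k):
    """Piecewise Aggregate Approximation: split s into k equal-length
    segments (assumes k divides len(s)) and keep each segment's mean."""
    w = len(s) // k
    return [sum(s[i * w:(i + 1) * w]) / w for i in range(k)]

s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(paa(s, 2))  # -> [2.5, 6.5]
print(paa(s, 4))  # -> [1.5, 3.5, 5.5, 7.5]
```

With k a power of two, these segment means coincide with what a truncated Haar decomposition retains, which is the equivalence the temporal-partitioning slide points out.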
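The random-projection application above (pick k random n-dimensional vectors and project onto them) can be sketched in pure Python. The 1/sqrt(k) scaling, the seeding, and the sample data are my choices for the sketch; the Johnson-Lindenstrauss lemma only guarantees that some such map preserves distances up to (1 +/- e) when k = O(e^-2 ln M):

```python
import math
import random

def random_projection(series_list, k, seed=0):
    """Project each n-dimensional series onto k random Gaussian vectors.
    A Monte Carlo method: distances are preserved only with high probability."""
    rng = random.Random(seed)
    n = len(series_list[0])
    # one random n-dimensional vector per output coordinate
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    # scale by 1/sqrt(k) so squared distances are preserved in expectation
    return [[sum(v[i] * s[i] for i in range(n)) / math.sqrt(k) for v in vectors]
            for s in series_list]

data = [[1.0] * 16, [0.0] * 16]
proj = random_projection(data, 4)
assert len(proj) == 2 and all(len(p) == 4 for p in proj)
assert all(abs(x) < 1e-12 for x in proj[1])  # a linear map sends 0 to 0
```

In the two-stage pipeline the notes suggest, this map would be applied first (thousands of dimensions down to hundreds), with SVD run afterwards on the projected data.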