Internet Arch & Protocols (CS 7260)

by: Alayna Veum (GPA 3.81)

Instructor: Jun Xu
About this Document

Instructor: Jun Xu
Type: Class Notes




This set of class notes was uploaded by Alayna Veum on Monday, November 2, 2015. It belongs to CS 7260 at Georgia Institute of Technology - Main Campus, taught by Jun Xu in Fall. Since its upload, it has received 11 views. For similar materials see /class/234069/cs-7260-georgia-institute-of-technology-main-campus in Computer Science at Georgia Institute of Technology - Main Campus.

Date Created: 11/02/15
A Data Streaming Algorithm for Estimating Entropies of OD Flows
Chuck Zhao (1), Ashwin Lall (2), Jim Xu (1), Mitsu Ogihara (2), Oliver Spatscheck (3), Jia Wang (3)
(1) Georgia Tech, (2) University of Rochester, (3) AT&T Research. May 14th, 2008.

Origin-destination pairs
- Traffic is broken down per origin-destination (OD) pair. (The slide shows an example network with 230 origins.)

Entropy
- Entropy: H = -Σ_i p_i log p_i, where p_i is the fraction of traffic from the i-th flow.
- Entropy norm: S = Σ_i m_i log m_i, where m_i is the frequency of the i-th flow.

Motivation
- Anomaly detection (Lakhina et al., SIGCOMM 2005).
- Traffic clustering (Xu et al., SIGCOMM 2005).
- Traffic may be redistributed across many links, and DDoS attacks may not be detectable as simple volume changes.
- Traffic engineering.

Outline: introduction, algorithm, experimental results, conclusion.

Stable distributions
- A distribution D is called p-stable if, for any constants a_1, ..., a_n and random variables X, X_1, ..., X_n drawn from the distribution,
  a_1 X_1 + ... + a_n X_n  ~  (|a_1|^p + ... + |a_n|^p)^{1/p} X.

Properties of stable distributions
- Stable distributions exist for p ∈ (0, 2].
- Examples: the Gaussian distribution is 2-stable and the Cauchy distribution is 1-stable.
- Closed forms are known only for p = 0.5, 1, 2.
- There are known formulas (Chambers et al.) for generating samples from each p-stable distribution.

The sketch
- For each flow i, draw a p-stably distributed X_i; set V = 0.
- For each item of flow i encountered in the stream, set V = V + X_i.
- Now V is distributed as (m_1^p + ... + m_n^p)^{1/p} X, where X has the p-stable distribution.
- To extract the quantity (m_1^p + ... + m_n^p)^{1/p}, repeat this independently O((1/ε²) log(1/δ)) times and take the median.

Approximating x ln x
- For any N > 1 and ε > 0, there exist α and c such that f(x) = c(x^{1+α} - x^{1-α}) approximates the entropy function x ln x for x ∈ [1, N] within relative error bound ε.
- (Figure: approximating x ln x on [1, 1000].)

Approximating entropy
- m_i ln m_i ≈ c(m_i^{1+α} - m_i^{1-α}), so Σ_i m_i ln m_i ≈ c(Σ_i m_i^{1+α} - Σ_i m_i^{1-α}).
- In parallel, an elephant-detection module handles, with high probability, all the flows of size greater than N.

Extracting OD entropies
- The sketches kept at the origin and at the destination are combined component-wise; the L_{1+α} and L_{1-α} estimates extracted from the combined sketches give the entropy of the OD (cross) traffic. (The slide's worked diagram with flows f_1..f_k, g_1.., h_1..h_m is not reproduced here.)

Estimating the traffic matrix
- Fix α, ε, and N as in the approximation of x ln x. Then (x^{1+α} + x^{1-α})/2 estimates f(x) = x with relative error at most 3ε in the range [1, N].

Data collection
- Data collected at an AT&T data center with 1 Gbps access links.
- Several 5-minute traces were collected on April 25, 2007.
- The traces had 400 million packets belonging to 18 million flows.
- All measurements were done at a single ingress router; egress traces were generated using the routing table.

Experimental results (figures; only the captions survive)
- Varying sampling rates for elephant detection: fraction of experiments vs. relative error.
- Varying fraction of traffic from ingress: relative error vs. ratio of OD traffic to ingress traffic, and the corresponding error distribution.
- Varying fraction of traffic from egress: relative error vs. ratio of OD traffic to egress traffic.
- Error distribution for the actual entropy: fraction of experiments vs. relative error.
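The x^{1±α} approximation above is easy to sanity-check numerically. The sketch below is not from the paper: it uses c = 1/(2α), the leading constant suggested by the Taylor expansion x^{±α} ≈ 1 ± α ln x, and an arbitrary α = 0.05.

```python
import math

def approx_xlnx(x, alpha):
    """Approximate x*ln(x) by c*(x**(1+alpha) - x**(1-alpha)) with c = 1/(2*alpha),
    the constant suggested by the Taylor expansion x**(+-alpha) ~ 1 +- alpha*ln(x)."""
    return (x ** (1 + alpha) - x ** (1 - alpha)) / (2 * alpha)

if __name__ == "__main__":
    alpha = 0.05                      # illustrative value, not the paper's choice
    for x in (2.0, 10.0, 100.0, 1000.0):
        exact = x * math.log(x)
        approx = approx_xlnx(x, alpha)
        print(f"x={x:7.1f}  x*ln(x)={exact:9.2f}  approx={approx:9.2f}  "
              f"rel.err={(approx - exact) / exact:+.3%}")
```

The relative error grows with α ln x, which is why the approximation is only used on the bounded range [1, N] while an elephant-detection module handles flows larger than N.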
A Tutorial on Network Data Streaming
Jun (Jim) Xu, Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology

Motivation for new network monitoring algorithms
- Problem: we often need to monitor network links for quantities such as:
  - elephant flows (traffic engineering, billing);
  - number of distinct flows, average flow size (queue management);
  - flow size distribution (anomaly detection);
  - per-flow traffic volume (anomaly detection);
  - entropy of the traffic (anomaly detection);
  - other "unlikely" applications: traffic matrix estimation, P2P routing, IP traceback.

The challenge of high-speed network monitoring
- Network monitoring at high speed is challenging:
  - packets arrive every 25 ns on a 40 Gbps (OC-768) link;
  - per-packet processing has to use SRAM;
  - per-flow state is too large to fit into SRAM;
  - the traditional solution of sampling is not accurate, due to the low sampling rate dictated by resource constraints (e.g., DRAM speed).

Network data streaming: a smarter solution
- Computational model: process a long stream of data (packets) in one pass using a small yet fast memory.
- Problem to solve: answer some queries about the stream, at the end or continuously.
- Trick: remember the most important information about the stream pertinent to the queries; learn to forget unimportant things.
- Comparison with sampling: streaming peruses every piece of data for the most important information, while sampling digests a small percentage of the data and absorbs all information therein.

The "hello world" data streaming problem
- Given a long stream of data (say packets) d_1, d_2, ..., count the number of distinct elements F_0 in it.
- Say in (a, b, c, a, c, b, d, a) this number is 4.
- Think about trillions of packets belonging to billions of flows.
- A simple algorithm: choose a hash function h with range (0, 1) and keep X = min(h(d_1), h(d_2), ...). We can prove E[X] = 1/(F_0 + 1) and then estimate F_0 using the method of moments.
- Then average hundreds of such estimates of F_0 to get an accurate result.

Another solution to the same problem (Whang et al., 1990)
- Initialize a bit array A of size m to all 0s, and fix a hash function h that maps data items into {1, 2, ..., m}.
- For each incoming data item x_t, set A[h(x_t)] to 1.
- Let m_0 be the number of 0s in A. Then F̂_0 = m × ln(m/m_0).
- Derivation: given an arbitrary index i, let Y_i be the number of elements mapped to it and let X_i be 1 when Y_i = 0. Then E[X_i] = Pr[Y_i = 0] = (1 - 1/m)^{F_0} ≈ e^{-F_0/m}, so E[m_0] ≈ m e^{-F_0/m}. By the method of moments (replace E[m_0] by m_0 in this equation) we obtain the above estimator, which is also shown to be the MLE.

Cash register and turnstile models (Muthukrishnan)
- The implicit state vector (varying with time t) has the form a = <a_1, a_2, ..., a_n>.
- Each incoming data item x_t has the form <i(t), c(t)>, in which case a_{i(t)} is incremented by c(t).
- Data streaming algorithms help us approximate functions of a, such as L_0(a) = Σ_i |a_i|^0 (the number of distinct elements).
- Cash register model: c(t) has to be positive, and is often 1 in networking applications. Turnstile model: c(t) can be both positive and negative.

Estimating the sample entropy of a stream (Lall et al., 2006)
- Note that Σ_{i=1}^{n} a_i = N.
- The sample entropy of a stream is defined as H ≡ -Σ_{i=1}^{n} (a_i/N) log(a_i/N).
- All logarithms are base 2, and by convention we define 0 log 0 ≡ 0.
- We extend the previous algorithm (Alon et al., 1999) to estimate the entropy.
- Another team obtained similar results simultaneously.

The concept of entropy norm
- We will focus on computing the entropy norm S ≡ Σ_{i=1}^{n} a_i log a_i, and note that
  H = -Σ_i (a_i/N) log(a_i/N) = log N - (1/N) Σ_i a_i log a_i = log N - S/N,
  so we can compute H from S if we know the value of N.
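A minimal, self-contained sketch of the bitmap (linear counting) estimator just described; the bitmap size m = 4096 and the SHA-1-based hash are illustrative choices, not prescribed by Whang et al.

```python
import hashlib
import math
import random

def linear_counting_f0(stream, m=4096):
    """Whang et al. (1990) distinct-count estimator: set bit h(x) for every
    item, then return F0_hat = m * ln(m / m0), where m0 = number of zero bits."""
    bits = bytearray(m)
    for x in stream:
        h = int(hashlib.sha1(repr(x).encode()).hexdigest(), 16) % m
        bits[h] = 1
    m0 = m - sum(bits)
    if m0 == 0:                       # bitmap saturated: m was chosen too small
        raise ValueError("bitmap full; increase m")
    return m * math.log(m / m0)

if __name__ == "__main__":
    random.seed(42)
    stream = [random.randrange(10_000) for _ in range(100_000)]
    print("true F0 :", len(set(stream)))
    print("estimate:", round(linear_counting_f0(stream)))
```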
(ε, δ)-approximation
- An (ε, δ)-approximation algorithm for X is one that returns an estimate X̂ with relative error more than ε with probability at most δ; that is, Pr[|X̂ - X| ≥ Xε] ≤ δ.
- For example, the user may specify ε = 0.05 and δ = 0.01, i.e., at least 99% of the time the estimate is accurate to within 5% error.
- These parameters affect the space usage of the algorithm, so there is a tradeoff of accuracy versus space.

The algorithm
- The strategy is to sample as follows: pick a position r uniformly at random from {1, ..., N}; let c be the number of times the item at position r appears in positions r through N; then compute the estimating variable
  X = N(c log c - (c - 1) log(c - 1)),
  which can be viewed as N(f(c) - f(c - 1)) with f(x) = x log x.

Algorithm analysis
- This estimator X = N(c log c - (c - 1) log(c - 1)) is an unbiased estimator of S, since
  E[X] = (N/N) Σ_{i=1}^{n} Σ_{j=1}^{a_i} (j log j - (j - 1) log(j - 1)) = Σ_{i=1}^{n} a_i log a_i = S.

Algorithm analysis (contd.)
- Next we bound the variance of X:
  Var(X) = E[X²] - E[X]² ≤ E[X²]
         = (N²/N) Σ_i Σ_{j=2}^{a_i} (j log j - (j - 1) log(j - 1))²
         ≤ N Σ_i Σ_{j=2}^{a_i} (2 log j)²
         ≤ 4N Σ_i a_i log² a_i
         ≤ 4N log N Σ_i a_i log a_i
         ≤ 4 (Σ_i a_i log a_i) log N (Σ_i a_i log a_i) = 4 S² log N,
  assuming that on average each item appears in the stream at least twice (so that N ≤ S).

Algorithm (contd.)
- If we compute s_1 = (32 log N)/ε² such estimators and take their average Y, then by Chebyshev's inequality
  Pr[|Y - S| > εS] ≤ Var(Y)/(ε²S²) ≤ (4S² log N)/(s_1 ε² S²) = (4 log N)/(s_1 ε²) ≤ 1/8.
- If we repeat this with s_2 = 2 log(1/δ) groups and take their median, then by a Chernoff bound we get more than εS error with probability at most δ.
- Hence the median of averages is an (ε, δ)-approximation.

The sieving algorithm
- KEY IDEA: separating out the elephants decreases the variance, and hence the space usage, of the previous algorithm.
- Each packet is now sampled with some fixed probability p.
- If a particular item is sampled two or more times, it is considered an elephant and its exact count is estimated.
- For all items that are not elephants we use the previous algorithm.
- The entropy is estimated by adding the contribution from the elephants (from their estimated counts) and the mice (from the earlier algorithm).
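A compact illustration of the basic estimator and its median-of-means aggregation (without the sieving stage); the constants s1 and s2 below are small illustrative values rather than the (32 log N)/ε² and 2 log(1/δ) dictated by the analysis.

```python
import math
import random
from statistics import median

def f(x):
    """f(x) = x * log2(x), with the convention f(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def estimate_entropy_norm(stream, s1=200, s2=5, rng=random):
    """Median-of-means estimator for the entropy norm S = sum_i a_i*log2(a_i)
    (Lall et al. 2006): sample a random position, count how often that item
    recurs to the end of the stream, and form N*(f(c) - f(c-1))."""
    n = len(stream)
    group_means = []
    for _ in range(s2):
        samples = []
        for _ in range(s1):
            pos = rng.randrange(n)
            item = stream[pos]
            # c = occurrences of the sampled item from position pos to the end
            c = sum(1 for t in range(pos, n) if stream[t] == item)
            samples.append(n * (f(c) - f(c - 1)))
        group_means.append(sum(samples) / s1)
    return median(group_means)

if __name__ == "__main__":
    random.seed(0)
    stream = [random.choice("aaaaabbbcdefghij") for _ in range(2000)]
    counts = {x: stream.count(x) for x in set(stream)}
    true_s = sum(f(a) for a in counts.values())
    print(f"true S  : {true_s:.1f}")
    print(f"estimate: {estimate_entropy_norm(stream):.1f}")
```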
Estimating the k-th moments (Alon et al., 1999)
- Problem statement: cash register model with increments of size 1; approximate F_k = Σ_{i=1}^{n} a_i^k.
- Given a stream of data x_1, x_2, ..., x_N, the algorithm samples an item uniformly at random at s_1 × s_2 locations, like before.
- If the sampled item is already in the hash table, increment the corresponding counter; otherwise add a new entry <i, 1>.
- After the measurement period, for each record <i, c_i>, obtain an estimate as N(c_i^k - (c_i - 1)^k), i.e., N(f(c_i) - f(c_i - 1)) with f(x) = x^k.
- Take the median of the means of these s_1 × s_2 estimates, like before.
- Our entropy algorithm is inspired by this one.

Tug-of-war sketch for estimating the 2nd moment (Alon et al., 1999)
- Fix an explicit set V = {v_1, v_2, ..., v_h} of h = O(n²) vectors of length n with +1 and -1 entries.
- These vectors are 4-wise independent: for every four distinct indices i_1, ..., i_4 and every choice of ε_1, ..., ε_4 ∈ {-1, +1}, exactly 1/16 of the vectors in V take these values. They can be generated from a small seed using BCH codes.
- Randomly choose v = <e_1, e_2, ..., e_n> from V, and let X be the square of the dot product of v and the implicit state vector, i.e., X = (Σ_{i=1}^{n} e_i a_i)². This can be calculated in one pass.
- Then take the median of a bunch of such X's.

Elephant detection algorithms
- Problem: find all the elements whose frequency is over θN.
- There are three types of solutions:
  - those based on intelligent sampling;
  - those based on a sketch that provides a reading on the approximate size of the flow an incoming packet belongs to, in combination with a heap to keep the largest ones;
  - hybrids of the two.
- We will not talk about change detection, as it can be viewed as a variant of the elephant detection problem.

Karp-Shenker-Papadimitriou algorithm
- A deterministic algorithm that guarantees that all items whose frequency count is over θN are reported:
  1. maintain a set of <e, f> tuples;
  2. for each incoming data item x_t:
  3.   search for / increment / create its entry in the set;
  4.   if the set has more than 1/θ items, then
  5.     decrement the count of each item in the set by 1 and
  6.     remove all zero-count items from the set;
  7. output all the survivors at the end.
- Not suitable for networking applications.

Count-Min (Cormode-Muthukrishnan) sketch
- Maintain d arrays of counters, each of width w, with hash functions h_1, ..., h_d mapping items to {1, ..., w}; an update <i_t, c_t> increments the counter h_j(i_t) in every row j by c_t.
- The point estimate for an item is simply the minimum of all the counters it hashes to.
- One can answer several different kinds of queries from the sketch, e.g., point estimation, range queries, heavy hitters, etc.
- It is a randomized algorithm based on hash functions.

Elephant detection algorithm with the CM sketch
- Maintain a heap H of small size.
  1. for each incoming data item x_t:
  2.   get its approximate count f̂ from the CM sketch;
  3.   if f̂ ≥ θt, then
  4.     increment and/or add x_t to H;
  5.   delete items from H when they fall under θt;
  6. output all above-threshold items from H.
- Suitable for networking applications.

Charikar-Chen-Farach-Colton sketch
- A randomized algorithm based on hash functions.
- Setting: an m × b counter array C; hash functions h_1, ..., h_m that map data items to {1, ..., b}, and s_1, ..., s_m that map data items to {-1, +1}.
- Add(x_t): compute i_j = h_j(x_t) for j = 1, ..., m, and then increment C[j][i_j] by s_j(x_t).
- Estimate(x_t): return median_{1≤j≤m} C[j][h_j(x_t)] × s_j(x_t).
- Suitable for networking applications.

Sticky sampling algorithm (Manku and Motwani, 2002)
- Sample and hold with probability 1 for the first 2t elements.
- Sample with probability 1/2 for the next 2t elements, and resample the first 2t elements.
- Sample with probability 1/4 for the next 4t elements, resample again, and so on.
- It is a little unjust to describe it this way, as it predates Estan and Varghese (2002).
- Not suitable for networking applications, due to the need to resample.

Lossy counting algorithm (Manku and Motwani, 2002)
- Divide the stream of length N into buckets of size w = 1/θ each.
- Maintain a set D of entries of the form <e, f, Δ>.
  1. for each incoming data item x_t:
  2.   let b be the current bucket number;
  3.   if x_t is in D, increment its f accordingly;
  4.   else add the entry <x_t, 1, b - 1> to D;
  5.   if t is divisible by w, then
  6.     delete all items e whose f + Δ ≤ b;
  7. return all items whose f ≥ (θ - ε)N.
- Not suitable for networking applications.

Sample and hold (Estan and Varghese, 2002)
- Maintain a set D of entries of the form <e, f>.
  1. for each incoming data item x_t:
  2.   if it is in D, increment its f;
  3.   else insert a new entry into D with a small fixed sampling probability;
  4. return all items in D with high frequencies.

Multistage filter (Estan and Varghese, 2002)
- Maintain multiple arrays of counters C_1, C_2, ..., C_m of size b, a set D of entries <e, f>, and hash functions h_1, ..., h_m that map data items to {1, 2, ..., b}.
  1. for each incoming data item x_t:
  2.   increment C_j[h_j(x_t)], j = 1, ..., m, by 1 if possible;
  3.   if these counters reach the value MAX,
  4.   then insert/increment x_t in D;
  5. output all items with count at least N × θ - MAX.
- Conservative update: only increment the minimums.
- The serial version is more memory-efficient but increases delay.
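A minimal sketch of the CM-sketch-plus-heap approach to elephant detection described above. The row/width parameters and the salted Python hash() are illustrative stand-ins for properly chosen pairwise-independent hash functions, and a plain dict stands in for the small heap.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters; an item's point estimate is the
    minimum of the d counters it hashes to."""
    def __init__(self, d=4, w=2048, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(d)]
        self.w = w
        self.rows = [[0] * w for _ in range(d)]

    def add(self, x, count=1):
        for salt, row in zip(self.salts, self.rows):
            row[hash((salt, x)) % self.w] += count

    def estimate(self, x):
        return min(row[hash((salt, x)) % self.w]
                   for salt, row in zip(self.salts, self.rows))

def find_elephants(stream, theta=0.05):
    """Keep a small candidate set of items whose estimated count ever exceeds
    theta * t; re-check candidates against theta * N at the end."""
    cms = CountMinSketch()
    candidates = {}
    for t, x in enumerate(stream, start=1):
        cms.add(x)
        if cms.estimate(x) >= theta * t:
            candidates[x] = cms.estimate(x)
    n = len(stream)
    return {x: c for x, c in candidates.items() if cms.estimate(x) >= theta * n}

if __name__ == "__main__":
    random.seed(1)
    stream = ["heavy-hitter"] * 600 + [f"mouse-{i}" for i in range(6000)]
    random.shuffle(stream)
    print(find_elephants(stream, theta=0.05))
```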
Estimating the L1 norm (Indyk, 2006)
- Recall the turnstile model: increments can be both positive and negative.
- The L1 norm is exactly ||a||_1 = Σ_{i=1}^{n} |a_i| and is more general than the frequency moments under the turnstile model.
- Algorithm to estimate the L1 norm:
  1. prescribe independent hash functions h_1, ..., h_m that map a data item into a Cauchy random variable distributed as f(x) = 1/(π(1 + x²)), and initialize real-valued registers r_1, ..., r_m to 0.0;
  2. for each incoming data item x_t = <i(t), c(t)>:
  3.   obtain v_1 = h_1(i(t)), ..., v_m = h_m(i(t));
  4.   increment r_1 by c(t)·v_1, ..., and r_m by c(t)·v_m;
  5. return median(|r_1|, |r_2|, ..., |r_m|).

Why this algorithm works (Indyk, 2006)
- Property of the Cauchy distribution: if X_1, X_2, X are standard Cauchy RVs and X_1 and X_2 are independent, then aX_1 + bX_2 has the same distribution as (|a| + |b|)X.
- Given the actual state vector <a_1, a_2, ..., a_n>, after the execution of the above algorithm each register r_j holds a random variable of the form a_1 X_1 + a_2 X_2 + ... + a_n X_n, which has the same distribution as (Σ_{i=1}^{n} |a_i|) X.
- Since median(|X|) = 1 (i.e., F^{-1}(0.75) = 1), the estimator simply uses the sample median to approximate the distribution median.
- Why not the method of moments? (The Cauchy distribution has no finite moments.)

The theory of stable distributions
- The existence of p-stable distributions S(p), 0 < p ≤ 2, was discovered by Paul Lévy about 100 years ago; p is replaced with α in most of the mathematical literature.
- Property of the p-stable distribution: let X_1, ..., X_n denote mutually independent random variables with distribution S(p); then a_1 X_1 + a_2 X_2 + ... + a_n X_n and (|a_1|^p + |a_2|^p + ... + |a_n|^p)^{1/p} X are identically distributed.
- Cauchy is 1-stable, as shown above, and the Gaussian f(x) = (1/√(2π)) e^{-x²/2} is 2-stable.

The theory of stable distributions (contd.)
- Although analytical expressions for the probability density function of stable distributions do not exist except for p = 0.5, 1, 2, random variables with such distributions can be generated through the following formula (Chambers et al., 1976):
  X = (sin(pθ) / cos^{1/p}(θ)) × (cos(θ(1 - p)) / (-ln r))^{(1-p)/p},
  where θ is chosen uniformly in [-π/2, π/2] and r is chosen uniformly in [0, 1].

Fourier transforms of stable distributions
- Each S(p) (correspondingly, f_p(x)) can be uniquely characterized by its characteristic function
  E[e^{itX}] = ∫ f_p(x)(cos(tx) + i sin(tx)) dx = e^{-|t|^p}.
- It is not hard to verify that the Fourier inverse transform of the above is a distribution function (per Polya's criteria).
- Verifying the stableness property of S(p):
  E[e^{it(a_1X_1 + a_2X_2 + ... + a_nX_n)}] = E[e^{ita_1X_1}] E[e^{ita_2X_2}] ... E[e^{ita_nX_n}]
  = e^{-|a_1 t|^p} e^{-|a_2 t|^p} ... e^{-|a_n t|^p}
  = e^{-(|a_1|^p + |a_2|^p + ... + |a_n|^p) |t|^p}
  = E[e^{it (|a_1|^p + ... + |a_n|^p)^{1/p} X}].

Estimating Lp norms for 0 < p ≤ 2
- The Lp norm is defined as ||a||_p = (Σ_{i=1}^{n} |a_i|^p)^{1/p}, which is equivalent to the p-th root of F_p (the p-th moment) under the cash register model, but not under the turnstile model.
- Simply modify the L1 algorithm by changing the output of the hash functions h_1, ..., h_m from Cauchy (i.e., S(1)) to S(p), and divide by the distribution median of |S(p)|.
- Moments of S(p) may not exist, but the median estimator will work when m is reasonably large (say ≥ 5).
- Indyk's algorithms focus on reducing space complexity, and some of these tricks may not be relevant to networking applications.

Estimating entropy of OD flows (Zhao et al., 2007)
- We can approximate x ln x by linear combinations of x^p on a fixed interval [1, N] within relative error ε: x ln x ≈ c(x^{1+α} - x^{1-α}) for suitable constants c and α depending on ε and N.
- Proof: by Taylor expansion of x^{±α}.
- Therefore we can use the L_{1+α} and L_{1-α} norms to estimate the entropy norm S:
  S = Σ_i a_i ln a_i ≈ c(Σ_i a_i^{1+α} - Σ_i a_i^{1-α}).
- In parallel, an elephant detection module handles, with high probability, all the flows of size greater than N.

Estimating the L1 norm
- To estimate the entropy we need both the entropy norm and the L1 norm.
- We can reuse the L_{1+α} and L_{1-α} norm estimations to avoid the overhead of a separate L1 norm estimation: as noted earlier, (x^{1+α} + x^{1-α})/2 approximates x, so the same two sketches also yield the total traffic.

Estimating the Lp norm of OD flows
- Indyk's algorithm has the Intersection Measurable Property (IMP).
- If we denote the Lp sketch at the origin as O, the one at the destination as D, and the median estimator as Λ(·), then the Lp norm of the cross traffic between origin and destination can be estimated from Λ(O), Λ(D), and Λ(O - D), where O + D and O - D are component-wise additions and subtractions of the sketches.
- The L_{1+α} and L_{1-α} norm estimations of OD flows then give us the entropy estimation of OD flows.
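To make the stable-distribution machinery concrete, here is a small, self-contained sketch (not the paper's implementation): stable_sample follows the Chambers-Mallows-Stuck formula quoted above, and l1_sketch_estimate applies the Indyk-style construction with a seeded RNG standing in for the hash functions h_1..h_m; m = 101 registers is an arbitrary choice.

```python
import math
import random
from statistics import median

def stable_sample(p, rng):
    """One draw from a p-stable distribution (0 < p <= 2) via the
    Chambers-Mallows-Stuck formula; p = 1 reduces to tan(theta), a standard Cauchy."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    r = rng.uniform(1e-12, 1.0)          # avoid log(0)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)) * \
           (math.cos(theta * (1.0 - p)) / -math.log(r)) ** ((1.0 - p) / p)

def l1_sketch_estimate(updates, m=101, seed="demo"):
    """Indyk-style L1 estimate under the turnstile model.  A per-key seeded RNG
    stands in for the hash functions h_1..h_m; register r_j accumulates
    c(t) * X_{i,j}, and median(|r_j|) estimates ||a||_1 since median|Cauchy| = 1."""
    registers = [0.0] * m
    for key, c in updates:
        rng = random.Random(f"{seed}:{key}")     # same key -> same Cauchy values
        for j in range(m):
            registers[j] += c * stable_sample(1.0, rng)
    return median(abs(r) for r in registers)

if __name__ == "__main__":
    # turnstile updates <i(t), c(t)> with positive and negative increments
    updates = [("a", 5), ("b", -3), ("a", 2), ("c", 10), ("b", -1)]
    true_l1 = abs(5 + 2) + abs(-3 - 1) + abs(10)
    print("true L1 :", true_l1)
    print("estimate: %.1f" % l1_sketch_estimate(updates))
```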
Estimating the Lp norm of OD flows (worked slide)
- Writing the flows seen at the origin as the OD flows f_1, ..., f_k plus origin-only flows h_1, ..., and those seen at the destination as f_1, ..., f_k plus destination-only flows g_1, ..., the f's appear in both sketches while the h's and g's appear in only one. Hence Λ(O)^p + Λ(D)^p - Λ(O - D)^p ≈ 2(f_1^p + ... + f_k^p), which isolates the cross traffic.

Modifications to Indyk's sketch
- Note that for every packet we would have to perform hundreds or thousands of register updates, which is infeasible at line speeds.
- Solution: hash packets into many thousands of buckets. For the packets mapped to each bucket, apply Indyk's sketch with only a small number (tens) of registers and estimate the Lp norm of those packets; then add those results together to get the Lp norm of all the packets.
- The overall relative error is much smaller than the relative error of each individual bucket.
- We also use large lookup tables for the stable-distribution random variables.

Data streaming algorithm for estimating flow size distribution (Kumar et al., 2004)
- Problem: estimate the probability distribution of flow sizes; in other words, for each positive integer i, estimate n_i, the number of flows of size i.
- Applications: traffic characterization and engineering, network billing/accounting, anomaly detection, etc.
- Importance: the mother of many other flow statistics, such as the average flow size (first moment) and the flow entropy.
- Definition of a flow: all packets with the same flow label. The flow label can be defined as any combination of fields from the IP header, e.g., <Source IP, Source Port, Dest IP, Dest Port, Protocol>.

Architecture of our solution: a lossy data structure
- Maintain an array of counters in fast memory (SRAM).
- For each packet, a counter is chosen via hashing (of the flow label) and incremented.
- No attempt is made to detect or resolve collisions.
- Each 64-bit counter only uses 4 bits of SRAM (due to Zhao et al., 2006b).
- Data collection is lossy (erroneous) but very fast.
- (The slide sequence animating the counting sketch — packet arrival, hashing the flow label, incrementing a counter — is not reproduced here.)

The shape of the counter value distribution
- (Figure: the distribution of flow sizes and of raw counter values, both axes in log scale, for several values of m, the number of counters.)

Estimating n and n_1
- Let the total number of counters be m, and let the number of value-0 counters be m_0. Then n̂ = m × ln(m/m_0), as discussed before.
- Let the number of value-1 counters be y_1. Then n̂_1 = y_1 e^{n̂/m}.
- Generalizing this process to estimate n_2, n_3, and the whole flow size distribution will not work.
- Solution: joint estimation using Expectation-Maximization.

Estimating the entire distribution φ using EM
- Begin with a guess of the flow distribution φ^{init}.
- Based on this φ, compute the various possible ways of splitting a particular counter value and the respective probabilities of such events.
- This allows us to compute a refined estimate of the flow distribution φ^{new}.
- Repeating this multiple times allows the estimate to converge to a local maximum.
- This is an instance of Expectation-Maximization; a worked example follows below.
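Before the worked EM example, a small simulation of the counter array and of the n̂ and n̂_1 estimators above; the flow mix and the number of counters are made-up illustration values, and a random index assignment stands in for the hash of the flow label.

```python
import math
import random

def fill_counter_array(flow_sizes, m, rng):
    """Simulate the lossy SRAM counter array: each flow's packets all land in
    one of m counters; collisions are left unresolved."""
    counters = [0] * m
    for size in flow_sizes:
        counters[rng.randrange(m)] += size
    return counters

def estimate_n_and_n1(counters):
    """n_hat = m*ln(m/m0) (linear counting over the counter array) and
    n1_hat = y1*exp(n_hat/m), where m0 = #zero counters and y1 = #value-1 counters."""
    m = len(counters)
    m0 = counters.count(0)
    y1 = counters.count(1)
    n_hat = m * math.log(m / m0)
    return n_hat, y1 * math.exp(n_hat / m)

if __name__ == "__main__":
    rng = random.Random(3)
    # synthetic mix: 8000 flows of size 1 plus 2000 larger flows (made-up numbers)
    flows = [1] * 8000 + [rng.randint(2, 50) for _ in range(2000)]
    counters = fill_counter_array(flows, m=40_000, rng=rng)
    n_hat, n1_hat = estimate_n_and_n1(counters)
    print(f"true n  = {len(flows):5d}   n_hat  = {n_hat:7.0f}")
    print(f"true n1 =  8000   n1_hat = {n1_hat:7.0f}")
```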
Estimating the entire flow distribution: an example
- For example, a counter value of 3 could be caused by three events: 3 = 3 (no hash collision); 3 = 1 + 2 (a flow of size 1 colliding with a flow of size 2); or 3 = 1 + 1 + 1 (three flows of size 1 hashed to the same location).
- Suppose the respective probabilities of these three events are 0.5, 0.3, and 0.2, and there are 1000 counters with value 3.
- Then we estimate that 500, 300, and 200 counters split in the three above ways, respectively.
- So we credit 300 × 1 + 200 × 3 = 900 to n_1 (the count of size-1 flows), and credit 300 and 500 to n_2 and n_3, respectively.

How to compute these probabilities
- Fix an arbitrary index ind. Let β be the event that f_1 flows of size s_1, f_2 flows of size s_2, ..., f_q flows of size s_q collide into slot ind, where 1 ≤ s_1 < s_2 < ... < s_q ≤ z.
- Under the current estimate (φ, n), flows of each size land in a given counter approximately as independent Poisson arrivals, so the a priori (i.e., before observing the value v at ind) probability that event β happens is a product of Poisson terms, one per flow size involved, with mean λ_s proportional to n φ_s / m.
- Let Ω_v be the set of all collision patterns that add up to v. Then, by Bayes' rule, p(β | φ, n, v) = p(β | φ, n) / Σ_{β' ∈ Ω_v} p(β' | φ, n), where each p(· | φ, n) is computed as above.
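As a concrete companion to the counter-value-3 example above, a short script that performs that single reallocation step; the split probabilities are the assumed values from the example, not ones computed from a fitted model.

```python
def em_split_step(counter_histogram, split_probs):
    """One redistribution step: for each counter value v, split_probs[v] maps a
    collision pattern (a tuple of flow sizes) to its probability; counters are
    apportioned accordingly and each pattern credits its flow sizes to the new
    flow-size-distribution estimate."""
    new_fsd = {}
    for value, count in counter_histogram.items():
        for pattern, prob in split_probs[value].items():
            share = count * prob
            for flow_size in pattern:
                new_fsd[flow_size] = new_fsd.get(flow_size, 0) + share
    return new_fsd

if __name__ == "__main__":
    # 1000 counters hold the value 3; the three possible collision patterns
    # and their (assumed) probabilities are taken from the example above.
    histogram = {3: 1000}
    split_probs = {3: {(3,): 0.5, (1, 2): 0.3, (1, 1, 1): 0.2}}
    print(em_split_step(histogram, split_probs))
    # -> {3: 500.0, 1: 900.0, 2: 300.0}, matching the credits in the example
```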
Evaluation
- (Figure: before and after running the estimation algorithm — the actual flow distribution, the raw counter values, and the estimate produced by our algorithm.)
- (Figure: sampling vs. array of counters, Web traffic — actual flow distribution, inferred from sampling N = 10, inferred from sampling N = 100, and estimation using our algorithm.)
- (Figure: sampling vs. array of counters, DNS traffic — the same comparison.)

Extending the work to estimating subpopulation FSD (Kumar et al., 2005a)
- Motivation: there is often a need to estimate the FSD of a subpopulation, e.g., "what is the FSD of all the DNS traffic?"
- Definitions of subpopulations are not known in advance, and there can be a large number of potential subpopulations.
- Our scheme can estimate the FSD of any subpopulation defined after data collection.
- Main idea: perform both data streaming and sampling, and then correlate these two outputs using EM.

Streaming-guided sampling (Kumar and Xu, 2006)
- (Figure: the packet stream feeds both a counting sketch and a sampling process; the sketch's estimated flow size guides per-packet sampling; flow records in the flow table then feed applications such as usage accounting, elephant detection, and subpopulation-FSD estimation.)

Estimating the flow size distribution: results
- (Figure 1: estimates of the FSD of https flows using various data sources — actual distribution, uniform sampling, sketch + uniform sampling, sketch + SGS; (a) the complete distribution, (b) a zoom-in showing the impact on small flows.)

A hardware primitive for counter management (Zhao et al., 2006b)
- Problem statement: maintain a large array (say millions) of counters that need to be incremented by 1 in an arbitrary fashion (A[i_1]++, A[i_2]++, ...).
- Increments may happen at very high speed (say one increment every 10 ns), so high-speed memory (SRAM) has to be used.
- Values of some counters can be very large.
- Fitting everything into an array of long (say 64-bit) SRAM counters can be expensive.
- There is possibly a lack of locality in the index sequence (i_1, i_2, ...), so forget about caching.

Motivations
- A key operation in many network data streaming algorithms is to hash and increment.
- Routers may need to keep track of many different counts (say, for different source/destination IP prefix pairs).
- To implement millions of token/leaky buckets on a router.
- Extensible to other non-CS applications, such as sewage management.
- Our work is able to make 16 SRAM bits out of 1 — alchemy of the 21st century.

Main idea in previous approaches (Shah et al., 2002; Ramabhadran and Varghese, 2003)
- Counter increments go to small SRAM counters; overflowing counters are flushed to large DRAM counters under the control of a counter management algorithm (CMA). (Figure 2: the hybrid SRAM/DRAM counter architecture.)

CMA used in Shah et al. (2002)
- Implemented as a priority queue ("fullest counter first").
- Needs 28 = 8 + 20 bits per counter when the SRAM/DRAM speed ratio is 12; the theoretical minimum is 4.
- Needs a pipelined hardware implementation of a heap.

CMA used in Ramabhadran and Varghese (2003)
- SRAM counters are tagged when they are at least half full (implemented as a bitmap).
- Scan the bitmap clockwise for the next 1 to flush half-full SRAM counters, with a pipelined hierarchical data structure to jump to the next 1 in O(1) time.
- Maintain a small priority queue to preemptively flush the SRAM counters that rapidly become completely full.
- 8 SRAM bits per counter for storage and 2 bits per counter for the bitmap control logic, when the SRAM/DRAM speed ratio is 12.

Our scheme
- Our scheme only needs 4 SRAM bits when the SRAM/DRAM speed ratio is 12.
- Flush only when an SRAM counter is completely full (e.g., when the SRAM counter value changes from 15 to 16, assuming 4-bit SRAM counters).
- Use a small (say hundreds of entries) SRAM FIFO buffer to hold the indices of counters to be flushed to DRAM.
- Key innovation: a simple randomized algorithm to ensure that, with overwhelming probability, counters do not overflow in a burst large enough to overflow the FIFO buffer.
- Our scheme is provably space-optimal.

The randomized algorithm
- Set the initial values of the SRAM counters to independent random variables uniformly distributed in {0, 1, 2, ..., 15} (i.e., uniform{0..15}).
- Set the initial value of the corresponding DRAM counter to the negative of the initial SRAM counter value, i.e., B[i] = -A[i].
- Adversaries know our randomization scheme, but not the initial values of the SRAM counters.
- We prove rigorously that a small FIFO queue suffices: the queue overflows only with very small probability.

A numeric example
- One million 4-bit SRAM counters (512 KB) and 64-bit DRAM counters, with an SRAM/DRAM speed difference of 12.
- 300 slots (~1 KB) in the FIFO queue for storing indices to be flushed.
- After 10^12 counter increments in an arbitrary fashion (like 8 hours of monitoring a 40M packets-per-second link), the probability of overflowing the FIFO queue is less than 10^-14 in the worst case (MTBF about 100 billion years), proven using minimax analysis and large deviation theory, including a new tail bound theorem.
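A toy software model of the hybrid counter scheme just described, to show how the pieces fit together; the FIFO size, the 1-service-per-12-increments DRAM model, and the read() semantics are simplifying assumptions, not the hardware design itself.

```python
import random
from collections import deque

class HybridCounters:
    """Toy model of the hybrid SRAM/DRAM counter scheme: 4-bit SRAM counters
    with randomized initial values, DRAM counters pre-set to the negative of
    those values, and a small FIFO of indices waiting to be flushed to DRAM."""
    def __init__(self, n, fifo_slots=300, seed=0):
        rng = random.Random(seed)
        self.sram = [rng.randrange(16) for _ in range(n)]   # uniform{0..15}
        self.dram = [-v for v in self.sram]                  # B[i] = -A[i]
        self.fifo = deque()
        self.fifo_slots = fifo_slots
        self.overflowed = False

    def increment(self, i):
        self.sram[i] += 1
        if self.sram[i] == 16:             # counter completely full -> flush
            self.sram[i] = 0
            if len(self.fifo) >= self.fifo_slots:
                self.overflowed = True     # should be astronomically rare
            else:
                self.fifo.append(i)

    def dram_service(self):
        """One DRAM write: drain one pending flush (DRAM being ~12x slower is
        modeled by calling this once every 12 increments)."""
        if self.fifo:
            self.dram[self.fifo.popleft()] += 16

    def read(self, i):
        # pending flushes still in the FIFO are counted as 16 each
        return self.dram[i] + self.sram[i] + 16 * self.fifo.count(i)

if __name__ == "__main__":
    hc = HybridCounters(n=1000)
    for t in range(1, 200_001):
        hc.increment(random.randrange(1000))
        if t % 12 == 0:
            hc.dram_service()
    print("FIFO ever overflowed:", hc.overflowed)
    print("counter 0 value     :", hc.read(0))
```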
Distributed coordinated data streaming: a new paradigm
- A network of streaming nodes.
- Every node is both a producer and a consumer of data streams.
- Every node exchanges data with its neighbors, streams the data received, and passes it on further.
- We applied this kind of data streaming to P2P (Kumar et al., 2005b) and to sensor network query routing, and the RPI team has applied it to ad hoc network routing.

Finding global icebergs over distributed data sets (Zhao et al., 2006a)
- An iceberg: an item whose frequency count is greater than a certain threshold.
- A number of algorithms have been proposed to find icebergs at a single node (i.e., local icebergs).
- In many real-life applications, data sets are physically distributed over a large number of nodes, and it is often useful to find the icebergs over the aggregate data across all the nodes (i.e., global icebergs).
- Global iceberg ≠ local iceberg.
- We study the problem of finding global icebergs over distributed nodes and propose two novel solutions.

Motivations: some example applications
- Detection of distributed DoS attacks in a large-scale network: the IP address of the victim appears at many ingress points. It may not be a local iceberg at any ingress point, since the attacking packets may come from a large number of hosts and Internet paths.
- Finding globally frequently accessed objects/URLs in CDNs (e.g., Akamai) to keep tabs on current hot spots.
- Detection of system events that happen frequently across the network during a time interval. Such events are often an indication of some anomaly; for example, finding DLLs that have been modified on a large number of hosts may help detect the spread of an unknown worm or spyware.

Problem statement
- A system or network consists of N distributed nodes.
- The data set S_i at node i contains a set of <x, c_{x,i}> pairs. Assume each node has enough capacity to process its incoming data stream, so each node generates a list of the arriving items and their exact frequency counts.
- A flat communication infrastructure, in which each node only needs to communicate with a central server.
- Objective: find {x : Σ_i c_{x,i} ≥ T}, where c_{x,i} is the frequency count of item x in the set S_i, with minimal communication cost.

Our solutions and their impact
- Existing solutions can be viewed as "hard-decision codes": finding and merging local icebergs.
- We are the first to take a "soft-decision coding" approach to this problem: encoding the potential of an object to become a global iceberg, which can be decoded with overwhelming probability if it is indeed a global iceberg.
- Equivalent to the minimax problem of the "corrupted politician".
- We offer two solution approaches (sampling-based and Bloom-filter-based) and uncover the beautiful mathematical structure underneath, discovering a new tail bound theory on the way.
- Sprint, Thomson, and IBM are all very interested in it.

Direct measurement of traffic matrices (Zhao et al., 2005a)
- Quantify the aggregate traffic volume for every origin-destination (OD) pair, or ingress and egress point, in a network.
- The traffic matrix has a number of applications in network management and monitoring, such as:
  - capacity planning: forecasting future network capacity requirements;
  - traffic engineering: optimizing OSPF weights to minimize congestion;
  - reliability analysis: predicting traffic volumes on network links under planned or unexpected router/link failures.

Previous approaches
- Direct measurement (Feldmann et al., 2000): record the traffic flowing through all ingress points and combine it with routing data; since storage space and processing power are limited, sampling is used.
- Indirect inference (e.g., Vardi 1996; Zhang et al. 2003): use the following information to construct a highly underconstrained linear inverse problem B = AX:
  - SNMP link counts B (the traffic volume on each link in the network);
  - the routing matrix A, where A_{ij} = 1 if the traffic of OD flow j traverses link i, and 0 otherwise.

Data streaming at each ingress/egress node
- Maintain a bitmap, initialized to all 0s, in fast memory (SRAM).
- Upon each packet arrival, feed the invariant packet content to a hash function; choose the bit by the hashing result and set it to 1. Variant fields (e.g., TTL, CHECKSUM) are masked to 0s. All nodes adopt equal-sized bitmaps and the same hash function.
- No attempt is made to detect or resolve collisions caused by hashing.
- Ship the bitmap to a central server at the end of a measurement epoch.

How to obtain the traffic matrix element TM_{i,j}
- Only the bitmap B_i at node i and the bitmap B_j at node j are needed for TM_{i,j}.
- Let T_i denote the set of packets hashed into B_i. Then TM_{i,j} = |T_i ∩ T_j| = |T_i| + |T_j| - |T_i ∪ T_j|.
- The linear counting algorithm (Whang et al., 1990) estimates |T_i| directly from B_i as b ln(b/U_{B_i}), where b is the size of B_i and U_{B_i} is the number of 0s in B_i; |T_i ∪ T_j| is inferred the same way from the bitwise OR of B_i and B_j.
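A small end-to-end sketch of the bitmap-based traffic matrix estimator just described; the bitmap size, the packet labels, and the use of Python's hash() in place of a hardwired hardware hash function are illustrative assumptions.

```python
import math

def make_bitmap(packets, b, salt=0):
    """Per-node structure from the slides: hash each packet's invariant content
    into a b-bit bitmap and set that bit; all nodes use the same hash (same salt)."""
    bits = [0] * b
    for pkt in packets:
        bits[hash((salt, pkt)) % b] = 1
    return bits

def linear_count(bits):
    """|T| estimate from a bitmap (Whang et al. 1990): b * ln(b / #zero-bits)."""
    b, zeros = len(bits), bits.count(0)
    return b * math.log(b / zeros)

def traffic_matrix_element(bitmap_i, bitmap_j):
    """TM_ij = |T_i ∩ T_j| = |T_i| + |T_j| - |T_i ∪ T_j|; the union is counted
    from the bitwise OR of the two equal-sized bitmaps."""
    union = [x | y for x, y in zip(bitmap_i, bitmap_j)]
    return linear_count(bitmap_i) + linear_count(bitmap_j) - linear_count(union)

if __name__ == "__main__":
    od_flow = [f"pkt{i}" for i in range(20_000)]   # packets seen at both i and j
    only_i  = [f"in{i}"  for i in range(30_000)]   # packets seen only at node i
    only_j  = [f"out{i}" for i in range(25_000)]   # packets seen only at node j
    b = 1 << 18                                    # 256 Kbit bitmaps (illustrative)
    B_i = make_bitmap(od_flow + only_i, b)
    B_j = make_bitmap(od_flow + only_j, b)
    print("true TM_ij:", len(od_flow))
    print("estimate  :", round(traffic_matrix_element(B_i, B_j)))
```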
Some theoretical results
- Our estimator is almost unbiased, and we derive its approximate variance; it is a closed-form function of the bitmap size b and the set sizes |T_i|, |T_j|, and |T_i ∪ T_j|.
- Sampling is integrated into our streaming algorithm to reduce SRAM usage; the estimator and its variance then also depend on the sampling probability p.
- The general forms of the estimator and variance for the intersection of k ≥ 2 sets from the corresponding bitmaps are derived in Zhao et al. (2005b).

Pros and cons
- Pros:
  - multiple times better than the sampling scheme, given the same amount of data generated;
  - for estimating TM_{i,j}, only the bitmaps from nodes i and j are needed;
  - supports submatrix estimation using a minimal amount of information;
  - allows for incremental deployment.
- Cons:
  - needs some extra hardware (a hardwired hash function and SRAM);
  - only supports estimation in packets, not in bytes.

References
- Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147.
- Chambers, J. M., Mallows, C. L., and Stuck, B. W. (1976). A method for simulating stable random variables. Journal of the American Statistical Association, 71(354).
- Estan, C. and Varghese, G. (2002). New directions in traffic measurement and accounting. In Proc. ACM SIGCOMM.
- Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., and True, F. (2000). Deriving traffic demands for operational IP networks: methodology and experience. In Proc. ACM SIGCOMM.
- Indyk, P. (2006). Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307-323.
- Kumar, A., Sung, M., Xu, J., and Wang, J. (2004). Data streaming algorithms for efficient and accurate estimation of flow size distribution. In Proc. ACM SIGMETRICS.
- Kumar, A., Sung, M., Xu, J., and Zegura, E. (2005a). A data streaming algorithm for estimating subpopulation flow size distribution. In Proc. ACM SIGMETRICS.
- Kumar, A. and Xu, J. (2006). Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In Proc. IEEE INFOCOM.
- Kumar, A., Xu, J., and Zegura, E. W. (2005b). Efficient and scalable query routing for unstructured peer-to-peer networks. In Proc. IEEE INFOCOM, Miami, Florida, USA.
- Lall, A., Sekar, V., Ogihara, M., Xu, J., and Zhang, H. (2006). Data streaming algorithms for estimating entropy of network traffic. In Proc. ACM SIGMETRICS.
- Manku, G. and Motwani, R. (2002). Approximate frequency counts over data streams. In Proc. 28th International Conference on Very Large Data Bases (VLDB).

