MONTE CARLO METHODS
MONTE CARLO METHODS CIS 5930
Popular in Course
Popular in Comm Sciences and Disorders
This 123 page Class Notes was uploaded by Mrs. Rahul Wuckert on Thursday September 17, 2015. The Class Notes belongs to CIS 5930 at Florida State University taught by Staff in Fall. Since its upload, it has received 12 views. For similar materials see /class/205699/cis-5930-florida-state-university in Comm Sciences and Disorders at Florida State University.
Reviews for MONTE CARLO METHODS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/17/15
Distinct Counting Problem COUNT Sketches Problem Estimate the number of distinct item IDs in a data set with only one pass Constraints Small space relative to stream size Small per item processing overhead Union operator on sketch results Exact COUNT is impossible without linear space First approximate COUNT sketch in FM 85 Olog N space Ol processing time per item Counting Paintballs Imagine the following M scenario i A bag of n paintballs is emptied at the top of a L long staircase At each step each paintball either bursts and marks the step or bounces to the next step 5050 chance either way Looking only at the pattern of marked steps what was n Counting Paintballs cont What does the Lam2 distribution of paintball 1st BnJ4 bursts look like 27 The number of bursts at each step follows a 30772 5 binomial distribution The expected number of bursts drops geometrically Few bursts after log2 n steps 5th 307 12 5 Counting Paintballs cont Many different estimator ideas FM3985AMS3996GGR39O3DF39O3 Example Let p05 denote the position of the highest unmarked stair Ep05 2 logzm 775351 n 026005 z 112127 Standard variance reduction methods apply Either Oog n or Oog log n space Back to COUNT Sketches The COUNT sketches of FM3985 are equivalent to th 39 tb esi39rr wh 3torofau X I 1 I 0 l 0 l 0 l 0 l zeros Four Sigtsltlegqalndahash 0 I 0 I 1 I O I O I function for coin flips Pick a bitvector entry Setthatb ittoone Xy I 1 I 0 I 1 I 0 I 0 I I These sketches are quot duplicateinsensitive VAB SketchA SketchB SketchA U B SUM Sketches Problem Estimate the sum of values of distinct ltkey valuegt pairs in a data stream with repetitions value 2 O integral Obvious start Emulate valueinsertions into a COUNT sketch and use the same estimators For ltkgt imagine inserting ltk I4 Jgt ltk I4 2gt ltk i4 Vgt SUM Sketches cont But what if the value is 1000000 Main Idea details on next slide Recall that all of the loworder bits will be set to 1 whp inserting such a value Just set these bits to one immediately Then set the highorder bits carefully Simulating a set of insertions Set all the loworder bits in the safe region First 5 log V 2 log log V bits are set to 1 whp Statistically estimate number of trials going beyond safe region Probability of a trial doing so is simply 2395 Number of trials BI 25 Mean Oog2 W For trials and bits outside safe region set those bits manually Running time is 01 for each outlying trial Expected running time Oog v time to draw from Bv 25 Oog2 v Sampling for Sensor Networks We need to generate samples from Bn p With a slow CPU very little RAM no floating point hardware General problem sampling from a discrete pdf Assume can draw uniformly at random from 01 I With an event space of size IVE Olog N lookups are immediate Represent the cdf in an array of size N Draw from 0 1 and do binary search Cleverer methods for Olog log N Olog N time Amazingly this can be done in constant time Walker s Alias Method Theorem Walker 77 For any discrete pdf D over a sample space of size n a table of size 0n can be constructed in 0n time that enables random variables to be drawn from D using at most two table lookups B010 D 005 E015 39 D 020 E 020 A010 3 015 c o05 lt n gt Binomial Sampling for Sensors Recall we want to sample from BV25 for various values of vand 5 First reduce to sampling from GJ 25 Truncate distribution to make range finite recursion to handle large values Construct tables of size 25 for each 5 of interest Can sample 50425 in 0I 3925 expected time Fact Suppose we have a method to repeatedly draw at random from Glp Let d be the random variable that records the number of draws from Glp until the sum of the draws exceeds n The value d1 is then equivalent to a random draw from Bn p The Bottom Line SUM inserts in 0092V time with 0V092V space 00gi time with 0Vlogi space 0V time with na39ive method Using 0092V method 16 bit values S S 8 and 64 bit probabilities Resulting lookup tables are 45KB Recursive nature of GJ 25 lets us tune size further Can achieve 0og V time at the cost of bigger tables Temporal Indexing MVBT Temporal Indexing Transaction time databases update the last version query all versions Queries Find all employees that worked in the company on 091298quot Find the employees with salary between 35K and 50K on 042199quot Multiversion Btree answers efficiently the queries above Basics 5 A data structure is called Ephemeral updates create a new version and the old version cannot be queried Persistent updates can be applied at any version and any version can be queried Partially Persistent updates applied to the last version and any version can be queried Transaction time fits with partially persistence MVBT The idea Store all versions of the state of a Btree which evolved over time ie multiple snapshots of the tree Inserts updates and deletes are applied to the present version of the tree and increase the version number of the whole tree Queries know which version of the tree they require the results from i General method Transform a single version external access structure with high utilization of disk blocks at the cost of a constant factor in time and space requirements compared to the original single version structure Such increase is asymptotically optimal worstcase bounds cannot decrease by adding multiversion capability to existing structure Proposition Specifics Extend Btree to have multiversion capability Support operations Insertkey info Insert into current version increase tree version Deetekey Delete from current version increase tree version Exact match amenkey version Range quey owkey nighey version Overview The multiversion Btree is a directed acyclic graph of Btree nodes that results from incremental changes to original Btree It has a number of Btree root nodes which partition the versions from the first to the present one so that each Btree root stands for an interval of versions Time Blocks pages Contain bdata items Live if it has not been copied o eao otherwise Weak version condition for every version i and each block A except the roots of versions we require that the number of entries in A is either 0 or at least a where bk39o Data Items Leaf node of the tree ltkey inversion delversion infogt Inner node of the tree ltrouter inversion delversion infogt Said to be of version i if its lifespan contains i In live block deletion version denotes that this entry has not been deleted at present in a dead block it means that the entry has not yet been deleted before the block died Updates Each update creates new version If no structural changes Insert lifespan is i Delete changes de version from toi A structural change is required if Block overflow can only fit b entries in a block Weak version underflow if deletion in a block with exactly d current version entries Structural Modification Copy the block and remove all but the present version entries from the copy If block consists of primarily present version entries the copy will produce an almost full block resulting in a split again after a few subsequent insertions To avoid this request that at least d1 insert or delete operations are necessary for the next block overflow or version underflow in that block a will be defined later Strong Version Constraints Strong version condition the number of present version entries after a version split must be in range from 1ed to k d Strong version uno erfovm result of version split leading to less than 1ed entries Attempt to merge with a sibling block containing only its present version entries if necessary followed by a version independent split by key values Strong version overflow if a version split leads to more than k d entries in the block Also perform a split by key values Simple Example Original Tree Version 1 R 391LL1A39gt A i4511B B Insert40 13 V V rquot quot 39 Gquot J 1 l C r 3quot LrJ U1 3 U1 U1 3 A J 4 J b i I l i 4 J 1 1 Ln r l J J Simple Example Version 2 7 R O 1 quot 1 N 2 1 1 3 U 1 4 3939 a 1 1 1 lt4u2gt 1 U 1 1 sz 1 1 quot39quot2 lm U1 J s EVJ I U 4 1 1 r 15 11 2711 1k 11 1k Delete65 Simple Example Version 3 R U 1 1 5 1 12 1 1 3953 U 1 2 lt4u2kgt 1 U 1 1 3 1812 B Ji L J J 4c U1 r C U1 5 g quot JJJJJ aerL Version Split Example R Insert5 creates a block U 13 1A3 ove rfl OW lt4 5 1 All currently live entries copied to the new live block A old A B block A marked dead 131J 1 7l j45 1 lt1 5515 51 3 lt2517gt 6113 Also the root block is updated to show that entity 10 was alive in the dead block A until Version 7 version 8 Version Split Example R lt10l8Agt 4 5 l Btr lt 58 7 I J gt A l aiui a1115gt i2ii b 3Ql gt 3114 414 0 2w 1 3 iui lt4u2 gt lt53 gt 4ilf 4511 aoilhe 7amp1 71L SQL 5 2 y b I I I i v39 v v v Version 8 Resulting tree Weak V Underflow Example A 0 5 l B gt 1 0 1 lt1 Sl5gt lt25l7gt lt30l6gt lt35l4gt l 6215 5 l 46513 1170 1 lt17 5 l 5278 0 l Version 7 Suppose b6 d2 805 cl the minimum of current version entries in the block Delete40 results in block A only having 1 current entry 1ltd weak underflow split A Strong V Underflow Example R 39r11101 8A1 r 113 1 08 KN 3 A B at l U l l 163 435 lt4028gt 1 U 1 45 1 1 5 1 392215 13 70 1 1 lt80 1 The version split of A has led to less than 1 d1523 entries in the new node estrong version under70W Seek a sibling of A in our case B Version split it to create 3 Merge B and A to produce block AB Processing Version 798 Strong V Underflow Condition Violation l R But now the node AB violates 191M the strong version over ow 39 14DLSBI cono ltlon and must be split by key B into nodes C and D VJ C Cb a U 3 Iquot 2quotquot anquot Jr C U rJ1 s my J V x C J T quotF quot 9quot C 7 a A fa 5quot Processing Version 798 Example Query 25 5 Resulting Tree R ltil 018Agt lt45l8Bgt lt1 quot985951 lt70 8quot Dgt 701 l k39j39 quotVb 480 1 3 39339 J J Um U1 Um J J1 U c H 1 lt4028gt 30 Version 8 Roots Can Split Too R1 1U1 3 Agt OVEI fIOW Split l 1881 1 Mil LC 708 15 D IUJIIKE 7amp111ampF 5 R2 J a 4101 1 70JbC Rl 7UJSSKG iHLLampAP J b auxachgt 7Uii15J3 JulL b 7ulilampFgt Roots Can Split Too Strong version overflow R2 with key split and allocation of the new block quotquotquot 4U2 quotquot R4 quotquot R1 11018 Aigt 425 13142132 iii3012lquotquot Cl39 402 1 t 5 14 25 E39jx R3 R4 70 143K173 l 01 83 2A 40 213 1D3 1 2SI 5525 11 lt5525 30gt 430213lac p 170 14F1gt Weak Version Underflow iof the Root Node R3 lt101829Agt lt251829Bgt lt302132Cgt lt102932llgt lt1032quot lgt R4 4140213392D3 70 143392173 R3 has shrunk weak underflow Block copies of R3 and R4 are created and merged into R5 R5 39lt391032 h13939 6402 i 393955 25quot jrgt 3939239701 1 4 This causes weak version underflow of R2 so R5 becomes new root block Algorithms Insertion Find the leaf node for the new key e then call blockInsert say A blockInsert enter e in A If block overflow of A then Version split block insert If strong version underflow then Merge Else if strong version overflow 9 key split Algorithms Delete blockDelete A Check weak version underflow on A If true then merge with sibling Note that Deletion is easier than the insertion in the MVBT What about the Btree Constraints on MVBT Parameters What are the restrictions on choices of k and a k d1 2 1o1ed Cl is the fraction of the entries in a block guaranteed to be in a new node after a key split 05 for Btrees Before key split A contains at least k cd1 entries After a key split both blocks must contain at least 1ed entries 2d1 2 1ed Before merge is performed together there are at least 2d 1 present version entries in the blocks to be merged Efficiency Analysis The big result is that the MVBT is asymptotically optimal to the BTree in the worst case in time and space for all considered operations Search time is in r 0ltllogb mil g Space is Onb and update Sllogbmil Additional Issues Can store the roots of version interval trees in a Btree of their own authors demonstrated that for most practical cases the depth of the root Btree will never exceed two aanay Store old versions on a Write Once Read Many WORM drive optical disks similar to Kolovson and Stonebraker 1989 5 Other Approach Overlapping Btrees Idea for each update store only the path from the root node to the leaf node use the rest of the tree for the new version since it is the same Better approach store only the new path Easier to implement but space is higher Onb logb nb Temporal Hashing Hashing is used to answer exact match queries Exactmatch with time find if employee with id23 was working in the company on 090898 One approach Use MVBT 45 IOs Temporal Hashing can achieve 2 105 Temporal Hashing 5 Treat objects that fall into a given bucket Bi during the whole evolution as an evolving set Bit To find if key k is in St reconstruct bucket Bi and search for key k Bi is the bucket that the key k should have been placed at t Simply observe how an ephemeral hashing scheme would map the keys of St in buckets and keep the history of each bucket Temporal Hashing 5 For each bucket use a Snapshot Index Using the SI we can reconstruct the bucket Note that when we split a bucket we treat the keys that moved to the new bucket as deleted for the old bucket newly inserted for the new Another approach is to keep an evolving list Bitemporal Data Indexing Bitemporal Index R tree for valid time Partial persistence for transaction time gt Partially persistent Rtree Kumar at al 97 What about the case that valid time has an open endtime now A special R tree to handle that Saltenis et al 99 Network Intrusion Detection Systems Beyond packet filtering Goal of NIDS Detect attacks as they happen Realtime monitoring of networks Provide information about attacks that have succeeded Forensic analysis Passive systems monitoring and reporting Active systems corrective measures adopted Good place to establish a NIDS The perimeter network or DMZ Strategies Often NIDS are described as being composed of several parts 7 Event generator boxes Analysis boxes Storage boxes Countermeasure boxes Analysis is the most complex element and can use protocol analysis as well as anomaly detection graph analysis etc Elements of a NIDS A general NIDS system can be seen to consist of 4 modules boxes o Event E m Analysis A n Conter measure C o Storage S E may activate A C or S A may activate C or S 39 39 Outpt tins 39 39 Events C Counter measure box Output High Level Interpreted Events Output Storage of Events locally or otherwise A Analysis box S Storage box E Event box Output Raw or Low level Events Host based vs Network based Host based Operating system log analyses Semantically rich Contain information about the state of the system Network based Direct analysis of network traffic Complete Sees all the network events not only those conveyed up to the higher levels of the operating system Unobtrusive Does not degrade network or host performance Common analysis techniques Attempts patternmatching against certain known attack types I For instance substring matching Passive protocol analysis Emulate the sequence of protocol events to detect attacks Dif culties inherent in NIDS What defines an attack is not a packet but its induced behavior on the receiving host NIDS must determine this behavior NIDS runs in a different machine even a different part of the network Proper function of the NIDS may require of each host being protected Knowledge of its place in the network topology v Knowledge of its TCPUDP implementation OS based behavior variance In uence of Network Topology If several internal routers exist between the network component where the NIDS resides and where the receiver host resides TTL may result in some packets reaching the NIDS but not the receiver Some packets being dropped by filtering routers In uence of implementation UDP packets with incorrect checksum will be dropped or accepted will be filtered Packets with incorrect header fields Fragmentation overlap and reordering issues Insertion attacks Means Lead the NIDS into thinking a particular packet will be accepted by the receiving host when it in fact will not Goal To prevent the NIDS from recognizing patterns either for protocol analysis or signature recognition by reconstructing an incorrect series of events Example of insertion attack Endsystem Interprets data as quotATTACKquot Rejected by E Attacker39s data stream Evasion attacks Means Lead the NIDS into believing that a particular packet will be rejected by the host when it will not Goal To prevent the NIDS from detecting an attack via protocol analysis or signature analysis by preventing the NIDS from reconstructing the correct sequence of packets processed by the receiving host iEvasknlexanq e Endsystem Interprets data as quotATTACKquot Rejected 39 by NIDS f Attacker39s data stream Confusing the NIDS Some implementations of NIDS may allow evasion insertion attacks simply because the NIDS does not correctly implement all the steps of protocol verification Ii An attacker specifically targets this In what follows we consider difficulties which are inherent with the design of NIDS systems namely intrinsic ambiguity on what types of decisions the NIDS should take Ambiguity Network topology information lacking lP TTL not be enough to reach destination NIDS may erroneously keep packet Donotfragment flag ignored by NIDS packet discarded at router before destination Destination configuration info lacking NIDS keeps sourcerouted packets discarded by destination Destination 08 knowledge lacking Timeout values for holding to packet fragments of incompletely received packets differs in destination from NIDS Re assembly strategy for overlapping packet fragments differs TCP header options may lead to packets being dropped differently Destination may or not silently drop packets with old timestamps Destination may interpret differently conflicting TCP segments Destination maymay not discard RST packets w wrong sequence numbers Some evasion insertion attacks Bad IP headers Differences in NIDS network and host network with respect to TTL and don t fragment DF bit Bad IP options Sourcebased packets filtered and variations in timestamp decisions Direct frame addressing u Attacker in the same physical network as NIDS directs packet to NIDS or to nonexisting MAC address but IP address of host IP packet fragmentation Large IP packets larger than the size of the dataframes in the link layer must be broken up into smaller packets The IDS must be able to handle lP packet re assembly correctly u outoforder fragments must be reordered u fragments must be stored until all fragments for the packet are known DoS attack Send partial IP packets Packet fragmentation After some time packet fragments must be discarded based on their arrival times or the system will run out of memory e If NIDS drops them faster than end system there is opportunity for successful evasion attacks If NIDS keeps them longer than end system there is opportunity for successful insertion attacks Coordinated attacks using many source destination pairs can disable NIDS Overlapping fragments Two TCP fragments may contain overlapping data FonNard overlap a fragment of data at a later section of the packet arrives early Later a fragment partially overlaps with the data in the early packet Some OSes override the old data others keep it Reverse overlap l A packet contains both new data and data received at earlier packets a All OSes keep old data discarding new overlapped data TCP layer problems For forensic reasons it is important to keep analyze higher level protocol information One such approach is called TCP connection monitoring TCP packets can be assigned to connections or at least requests to open connections A TCP session can be in a set of states Established Closed The NIDS and the end system should be state synchronized for monitoring to succeed TCB TCP Control Block A TCP monitoring NIDS must keep a TCB for every existing connection with state packet numbers window etc TCBs must be created for new connections and should be discarded for closed connections TCB creation TCB reassembly TCBteardown TCB creation and reassembly Missed handshake sequences TCP has a 3way handshake When to consider the connection has been established Existing connections at boot time Insertion attacks replay an old packet sequence number on an existing connection send a packet on a closed connection send a packet on a nonexisting connection TCP stream synchronization TCB reassembly I TCP data overlap TCP timewindow and acknowledgment strategies NIDS does not validate TCP packets in accordance with end system a TCP header options checksum Wrapped sequence numbers TCP I Emphasizing on the followings Second phase of the project 7 Connection management 7 Flow control and congestion control 7 Reliable transmission packet retransmission Please check some networking textbooks for details on TCP and OSPF mznm lSLVCIS Sgl lmzthmmcals mznm lSLVCIS Sgl lmzthmmcals connecuon management Flow control and congestion I Connection establishment contIOI baking I Three parameters SYNACK 7 Window size buffer that you have I Connection release 7 FlNACK 7 Receiver Window size rwnd 7 congestion Window size cwnd 7 slowstart threshold ssthresh 7 Preparing to receive data before ACKed At t I any me I Preparing for packet loss Timer for SYN packet 7 Sendingdata lt minrwnd cwnd 7 You can close your connection without receiving ACK bu a certain amount ofu39me m 2on4 lSLVCIS Sgl lmzthmmcals z mznm lSLVCIS Sgl lmzthmmcals Slow start and congestion avoidance I At beginning 7 cwnd one segment 7 ssthresh 64K While and lt ssthresh 7 Double cwnd a er each window round 7 increasing cwnd by one packet every time packet ACKed A er and gt ssthresh 7 Linear increasing 7 Increasing cwnd by one a er each window round 7 cwnd 1cwnd mznm isucis SQZDIntzthmmcals Reliable transmission 0 conceptually a timer for each unAcked packet 0 When timer goes off retransmit packet 0 At receiver side 7 A modi ed GobackN 7 Buffering outoforder packets 7 Acking maximum continually received packet mznm isucis SQZDIntzthmmcals Reliable transmission 0 How to set timeout interval 0 Estimating RTT RTT alpha RTT 1 alpha newisample Timeout beta RTT mznm isucis SQZDIntzthmmcals Open Shortest Path First OSPF 0 We use a simplified version 0 OSPF has two phases 7 Learning phase When an OSPF router rst starts 7 Maintenance phase therea er mznm isucis SQZDIntzthmmcals Learning phase 5 steps Learning neighbors 7 Neighbors are ones on the same subnet Exchanging connectivity info with neighbors Flooding learned info Computing shortest path to all routers Computing shortest path to all subnets OSPF message types 0 SPFiT YPEiHELL O 7 To probe other routers OSPFiTYPEiHELLOACK 7 To respond to hello messages 0 SPFiT YPEiL SAiUPDATE 7 To advertise updates connectivity info mznm rvacxs SQZUIntzthmmcals mznm rvacxs SQZUIntzthmmcals m When an OSPF router starts lt probes others After a certain amount of time say 10 seconds it assumes it has all topology info Computing shortest paths by applying Dijkstra algorithm 7 You get shortest paths to all other routers Computing shortest paths to all subnem Maintenance phase It probes others With a larger interval 7 Say 1 probeper 10 seconds If it does not hear from a neighbor for several probe intervals it assumes the neighbor is dead 7 Say 3 probes I e 30 seconds Propagating this loss of neighbor Recomputing shortestpaths mznm rvacxs SQZUIntzthmmcals n mznm rvacxs SQZUIntzthmmcals 12 Grading policy 100 points 1 Emulation of packet loss 5 points 2 TCP connection management 3 TCP congestion management 1 Slow start congestion avoidance TCP Tahoe or Reno Second phase Cont d I OSPF learning neighbors 5 poinw I OSPF exchanging connectivity I OSPF ooding I OSPF computing shortest path to all routers I OSPF computing shortest path to all subnets 4 TCP retransmission go back N or selectwe repeat I Code readability 5 points 5 FTP applrcatron Repo 5 POIms mm Bucls 593m lntzthmtncals 13 mm Bucls 593m lntzthmtncals 14 A new command Project UDP and TCP I ip I To configure a local host 7 ip loss 7 APPL UDPlTCP portinmnber type 7 ip ospfd 7 Type 7 ip Sh route I MSG for short messages I Showing the content of the routing table 7 ip rrn route I Deleting all content of the routing table I FTP for le transfer mum rsucxs syznrmmwmmrs 15 mum rsucxs syznrmmwmmrs m UDP and TCP 0 For two hosts to communicate 7 MSG UDP TCP sourcejort dst dSthIT message 7 FTP source jolt dst dSthIT leiname 0 For file transfer 7 A simple le transfer protocol Client speci es the name ofthe le he wants to download from the server mznm Bucls ngnlmmwmmcals Database Security Privacy and statistics Records privacy Database access control is an important aspect of practical security One of the crucial assets of many organizations is the data recorded in its databases Protection of data records is a complex issue often involving different security levels assigned to parts of records eg database table rows Rolebased access control is also more common in this setting than with regard to regular file systems In addition to the many important theoretical and practical aspects of database access control databases often have another privacy dimension Statistical access to data can reveal information about individual records We explore this issue in this class Statistical databases Valuable data records have been compiled to support statistics based research Medical databases genetic component of diseases effectiveness of drugs and treatments health policy issues Scientific databases Marketing research databases Legal privacy protections or intellectual property interests indicate that in many cases it is important to allow for queries to reveal generic patterns on the data statistical information while protecting information contained in individual records We consider here the inherent risks to disclosure of individual record information to an attacker that access the database through statistical queries Statistical databases simplified model n records each belonging to an individual Confidential category and data fields Category elds are used to identify and select records Data elds contain other information Identi er elds or indexes not used by statistical quenes Attacks against static databases not updated during the time of the attack or against dynamic ones Statistical queries Assumption Only perfect match is considered no relevancetype queries and the expressions are combinations of operators AND OR NOT The collection of records that satisfy a query C is called the query setXC Consider those statistical functions of the query set than can be expressed as qCj m 2 va where i is in XC v is the value in data fieldjof record i and m is an integer Special cases m 0 qCj 0 2 v00 2 1 foriin XC XC m 1 qC j 1 5 v01 fori in XC SUMC j m 2 qCj 2 2 v02 foriin XC Statistical significance MeanCJ 0Cj1QCj0 M VarianceCj 1q CU 0 q CU 2 2MqCj 1 M2 J Examples Denning et al Data Categories No Identifier Sex Dept Position Salary Pol contr 1 Adams M CS Prof 20K 50 2 Baker M MATH Prof 15K 100 3 Cook F MATH Prof 25K 200 4 Dodd F CS Prof 15K 50 5 Engel M STAT Prof 18K 0 6 Flynn F STAT Prof 22K 150 7 Grady M CS Adm 10K 20 8 Hayes M MATH Prof 18K 500 9 Irons F CS Stu 3K 10 10 Jones M STAT Adm 20K 15 11 Knapp F MATH Prof 25K 100 12 Lord M CS Stu 3K 0 Privacy compromising queries Simplified notation AND OR NOT x Ix The following two queries reveal the salary of Prof Dodd CountF CS Prof 1 CountF CS Prof 15KSal 1 Complementary approach By querying on the complement it is possible to violate the privacy of users while generating queries with corresponding large query sets CountProf Prof 12 Total database Count F CS Prof 11 From this it can be derived that CountF CS Prof 12 11 1 After this the following queries reveal Prof Dodd s salary SUMProf Prof Sal 194K SUM a F cs Prof 179K First observations Here we discuss considerations about privacy of individual data records Statistical queries that reveal statistical information about small query sets may leak information about specific records Statistical queries that reveal statistical information about query sets with small complements may leak information about specific records A necessary restriction for protection of record specific data is to only answer queries that satisfy k 3 Count C s nk where n total of records and k is a parameter k privacy However is the above restriction sufficient Privacy threats Since k and n k play similar roles it is sufficient to study the dependency of privacy on k for k in the range 2 n2 Intuitively the close k is to n2 the more privacy is afforded by the system at the cost of denying responses to more queries Individual tracker Schlorer showed the following Assume that an attacker to privacy knows that a single record in the database satisfies a query C To check if this record has characteristic b it is sufficient to test if CountC39b is 0 or 1 However such small query sets are assumed disallowed by the k privacy requirement k gt 1 How could an attacker overcome this Using intersections Assume that the query formula C with singleton query set XC I can be decomposed as C A B where both Aw B and A are answerable kSXA Snk and ngMB Snk ampnnamp Formula T AwB is called a tracker for I because it can be used to find other characteristics of I Compromise 1 C A 0 B identifies l 2 T A vB is the tracker for l 3 To test for any characteristic b first note that 3 XT g XTAb XAB Ab gXA b Since both Tand A are answerable so is T A39 b 4 Note that 3 XCb XABb XAb 39 XA B XAb A39IB 39 XA B X T A b 39 XT 5 Finally note that a CountC39 b CountTA39 b Count T Example Suppose we wish to find out Dodd s salary when the database implements 2 privacy so no query sets of cardinality 1 or fewer are allowed CF CS ProfF CS Prof T F vCS 0 ProD Suppose b 25KSal is the characteristic to test for then CountF CS 0 Prof 25KSaI CountFCS Pro F 25KSaI CountF wCS Prof 4 4 0 So the salary of Dodd is NOT 25K General trackers Denning Denning and Schwartz have generalized the trackers in several ways Single tracker suffices for the whole database No external knowledge of characteristics that identify specific records is required General idea 1 Assume that k s n4 2 Attacker chooses a formula Tsuch that a 2k S XT CountT s n 2k b Let Q CountT Count T 3 If C is unanswerable query with XC lt k a CountC CountCT CountC vT Q 4 If C is unanswerable query with XC gt n k a CountC 2Q CountwC 77Count vC vT Example For the example in Denning eta the formula T M selection of male sex is a generic tracker for k 2 since CountM 7 Any statistical query can be answered and as a result data privacy cannot be provided if statistical queries are to be answered exactly Intermediate situations between individual and generic trackers exist Clustering One strategy is to partition the entire database in blocks of ksets and query results are about match on blocks with ranges of possible results Must decide when a block matches if only some elements of the block satisfy the query Guarantees that at most a single kblock is identified individual records unreachable Answer precision in direct conflict with privacy requirement Perturbation Another strategy is to perturb data and quenes Queries as randomized sampling Heuristically it can be argued that better precision vs privacy could be achieved Privacy guarantee not provable or attacks exist Conclusion Privacy of records in statistical databases is hard to achieve Functionality and privacy in statistical databases are at odds Privacyguaranteeing solutions come at the expense of precision in query answering Singular Value Decomposition SVD SVD of a Matrix VT U and V are orthogonal matrices and S is a diagonal matrix consisting of singular values Singular Value Decomposition SVD SVD of a Matrix observations M U S VT Multiply both sides by MT Multiplying on the left Multiplying on the right a i z e u WWW 5 ll UT Singular Value Decomposition SVD SVD of a Matrix observations where A is a square symmetric matrix Columns of are eigenvectors ofA A is a diagonal matrix containing the eigenvalues What is PCA Principal Component Analysis PCA Steps in PCA 1 Calculate Adjusted Data Set Adjusted Data Set A Data Set D Mean values M Mi is calculated by taking the mean of the values in dimension i data samples Principal Component Analysis PCA Steps in PCA 2 Calculate Co variance matrix C from Adjusted Data Set A Co variance Matrix C 1Xi Y covXY 1 11 1 Principal Component Analysis PCA Steps in PCA 3 Calculate eigenvectors and eigenvalues of C Eigenvalues Eigenvalues Hill gt 111 Hill TH Eigenvectors Eigenvectors If some eigenvalues are 0 or very small we can essentially discard those eigenvalues and the corresponding eigenvectors hence reducing the dimensionality of the new basis x 3933HquotH7MEHquot Sno A practical NIDS What is SNORT CISnort is a packet loggeranalyzer which can be used to implement a NIDS Cllt can based be used in 4 modes 3 Sniffer mode t Packet Logger mode otoNetwork Intrusion Detection System NIDS mode otolnline Mode Breno de Medeiros Florida State University Fall 2005 Example logging packets CI The command 1 snort dev l log h 192 168 l 024 El results in packets being logged 00 Flag d tells to log the data as well as header portion of packets to Flag v is visual causing Snort to display information in the screen 0 Flag e tells to log extended header information eg data link layer headers 0 Flag indicates a location directory to use for logging purposes subdirectory log of current directory in the above example to Flag h indicates how to create subdirectories Each packet will be stored in a log file with a name that matches either source or destination addresses in a datagram By specifying the prefix 1921681x it indicates you want the packets logged under the local host in the communication 0 Q 0 0 0 0 Breno de Medeiros Florida State University Fall 2005 Using binary mode storage El Alternatively you can use the compact binary storage form to store packets 2 snort l log b El This causes Snort to log all packets in binary form tcpdump storage No flags are needed because all the packet is stored El You can then read them back in playback modeuseful to experiment with new rules 2 snort dv r packetlog El You can also playback only packets of a particular type 2 snort dvr packetlog icmp Breno de Medeiros Florida State University Fall 2005 NIDS mode CI NIDS mode enables modification of Snort basic behavior ie log everything and have it first apply a set of rules taking the appropriate action when a packet matches the rule 1 snort dev 1 log h l92l68l024 c snortconf CI Results in Snort logging only packets that matches the rules specified in snortconf CI Don t use v or e when using as NIDS for the sake of speed othenNise Snort may loose packets Breno de Medeiros Florida State University Fall 2005 Alerts in NIDS mode EIUsing the flag A will add alerting behavior to Snon stsA can be followed by the keywords full default fast unsock none console and cmg EITo use syslog for remote logging of alerts use the flag s EIExample 391 snort b A fast c snortconf Breno de Medeiros Florida State University Fall 2005 Inline Snort EIObtain packets from IPTables instead of libpcap and uses Snort rules to instruct IPtables whether to drop or pass packets El In order for snortinline to work properly you must download and compile the iptables code to include quotmake install devel quot This will install the libipq library that allows snortinline to interface with iptables Also you must build and install LibNet 2 http www iptables org 2 http www packetfactory net Breno de Medeiros Florida State University Fall 2005 Running Snort nine CI The QUEUE target should be specified in lPtables for interfacing with Snort 2 iptables A OUTPUT p tcp dport 80 j QUEUE El Then run Snort inine 1 snortinline QDc etcdropconf l varlogsnort CI The flags mean 9 N Q Obtain input from iptables QUEUE target N D Run in daemon mode ie continuously in the background Use the configuration file Use the log file 9 99 o 00 I O o 00 I I l Breno de Medeiros Florida State University Fall 2005 Snort configuration CI Snort configuration is highly customizable in order to achieve high performance and full flexibility of use 2 config checksummode none noip notcp noicmp noudp ip tcp udp icmp all CI An important feature of Snort is the use of pre processors oto For instance the defragmentation preprocessor frag3 allows you to use different policies to reproduced the de fragmentation policies of various operating systems Or you can define your own policy 00 Similarly the stream4reassemble preprocessor enables you to choose your policies with overlapping packets Breno de Medeiros Florida State University Fall 2005 Detecting port scans CIsfPortscan processor CIDetects NMAPstyle port scans as well as decoy and distributed port scans CICan detect port sweeps as well as port scans CICan be tuned for sensitivity accuracy Breno de Medeiros Florida State University Fall 2005 Application layer pre processors CITelnetdecode CIRPCdecode CIHTTPinspect ot Apache profile ozoIIS profile otomany customizable options Breno de Medeiros Florida State University Fall 2005 Cluster Analysis Cluster Analysis i What is Cluster Analysis El Types of Data in Cluster Analysis n A Categorization of Major Clustering Methods n Partitioning Methods a Hierarchical Methods n Density Based Methods n Grid Based Methods El Model Based Clustering Methods El Outlier Analysis El Summary Hierarchical Clustering El Use distance matrix as clustering criteria This method does not require the number of clusters k as an input but needs a termination condition IStepO Step1 IStep2 Step3 IStep4 ggglomerative 39 AGNES I I I I I divisive Step 4 Step 3 Step 2 Step 1 Step 0 DIANA AGNES Agglomerative Nesting Implemented in statistical analysis packages eg Splus Use the SingleLink method and the dissimilarity matrix Merge objects that have the least dissimilarity Go on in a nondescending fashion EIEIEIEIEI Eventually all objects belong to the same cluster n SingleLink each time merge the clusters C1C2 which are connected by the shortest single link of objects ie mianC1qEC2distplq ADendrogram Shows How the Clusters are Merged Hierarchically Decompose data objects into d a several levels of nested O O partitioning m of clusters b 0 6 called a dendrogram a g c A clustering of the data level 4 objects is obtained by cutting the dendrogram at the desired level then each connected component forms a cluster level 2 level 3 Eg level 1 gives 4 clusters level I ab0de level 2 gives 3 clusters aabla lcla ldae level 3 gives 2 clusters O aablflcadae etc a b C d e