# Temporal Data Mining INFS 795


These 50-page class notes were uploaded by Hazle Turcotte on Monday, September 28, 2015. They belong to INFS 795 at George Mason University, taught by Carlotta Domeniconi in Fall.


Date Created: 09/28/15

## Co-Training Ensemble Models for Text Categorization

Dave DeBarr, May 8th 2006

### Agenda

- Semi-Supervised Learning
- Text Categorization Experiments
- Modeling Techniques
- Self-Training Results
- Co-Training Results

### Semi-Supervised Learning

- Self-Training: a classifier labels "unlabeled" data; the most confident examples are added to the training set.
- Co-Training: each classifier labels "unlabeled" data; the most confident examples are added to the training set.
- Related areas:
  - Active Learning (adaptive sampling)
  - Transduction ("clustering assumption" applied to "unlabeled" data)

### Text Categorization Experiments

- Reuters Corpus Volume 1 (RCV1)
  - Partitioned into training/test sets to support comparison (though "cheating" is still possible)
  - Collected Aug 20, 1996 to Aug 19, 1997
  - 103-category hierarchy
  - http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
- TREC 2005 Spam Corpus
  - Enron data, iterative labeling
  - http://plg.uwaterloo.ca/~gvcormac/treccorpus

### RCV1 Example

The slide shows one Reuters `newsitem` XML document. The markup did not survive extraction; the recoverable fields are:

- Title: "USA: Tylan stock jumps; weighs sale of company"
- Headline: "Tylan stock jumps; weighs sale of company"
- Dateline: SAN DIEGO
- Text: "The stock of Tylan General Inc. jumped Tuesday after the maker of [...] said it is exploring the sale of the [...] it has already received some inquiries from potential buyers. Tylan was up 2.50 to 12.75 in early trading on the Nasdaq market. The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs & Co. has been retained as its financial adviser."
- Codes: country (USA), industry, and topic codes (including CCAT)
- Metadata: publisher "Reuters Holdings Plc", creator location "SAN DIEGO", creator country "USA", source "Reuters"
- Copyright: Reuters Limited

### RCV1 Notes

- Vector version: already in TF-IDF form,

  $$w_{t,d} = (1 + \log n_{t,d}) \times \log\frac{|D|}{n_t}$$

  where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, $|D|$ is the number of documents, and $n_t$ is the number of documents containing $t$.
  - 47219 terms
- Token version: preprocessed
  - Divided into tokens
  - Converted to lower case
  - Stop words removed
  - Stemmed

### RCV1 Notes (Continued)

- Training/validation set
  - Initial feature selection criterion: found in more than 30 training documents
  - 5129 features; only around 1.34% of values > 0
  - 23149 documents, of which 10786 are Corporate articles
  - 232 documents per training segment (holdout validation and test)
- Testing set
  - 781265 documents
  - 7813 documents per testing segment

### TREC Spam Example

The slide shows one raw e-mail message from the corpus, dated Wed, 4 Jul 2001. The text did not survive extraction; the recoverable structure is a chain of `Received` headers passing through Enron mail servers, followed by `Message-ID`, `From`, `Reply-To`, `To`, `Date`, `X-Mailer` (QUALCOMM Windows Eudora Version 5.1), `MIME-Version: 1.0`, `Content-Type` (multipart/alternative), `X-Priority`, `X-MSMail-Priority: Normal`, a `text/html` part with `Content-Transfer-Encoding: quoted-printable`, and the start of an HTML body.

### TREC Spam Notes

Preprocessing:

- Convert "control" characters and "extended" characters
- Divide lines into tokens based on white space
- Parse off symbols (non-alphanumeric)
- Convert to lower case
- Remove "SMART" project stop words
- Stem alphabetic tokens
- Prefix each token with its "location" (the name of the header, e.g. "from", or "body")

### TREC Spam Notes (Continued)

- Initial feature selection
  - Criterion: found in more than 30 training documents
  - Weighted sample based on the absolute value of the log10 odds ratio estimate (a 0 in the numerator or denominator is replaced by 0.00001)
  - 15674 tokens reduced to 4000 tokens
- Sparse: less than 1% of values > 0
- TF-IDF representation:

  $$w_{t,d} = (1 + \log n_{t,d}) \times \log\frac{|D|}{n_t}$$

### TREC Spam Notes (Continued)

- Training/validation set
  - First 30422 documents (messages): 17057 spam, 13365 ham
  - 305 documents per training segment (holdout validation and test)
- Testing set
  - 61767 documents
  - 618 documents per testing segment

### Basic Modeling Techniques

- Decision Trees
- Support Vector Machines
- Ensembles of Decision Trees
- Ensembles of Support Vector Machines

### Decision Trees

- Classification and Regression Trees (CART), splitting on entropy reduction
- Hyper-parameter optimization
  - Complexity penalty (cp): 0.400, 0.200, 0.100, 0.050, 0.025
  - Minimum bucket size (minbucket): 80, 40, 20, 10, 5
- Selected values
  - RCV1: cp = 0.400, minbucket = 5
  - TREC Spam: cp = 0.200, minbucket = 5

### Decision Tree: RCV1 (rank 13)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2188 | 1513 |
| Actual negative | 435 | 3677 |

Precision 83.42% (95% CI 81.96 to 84.80); recall 59.12% (95% CI 57.53 to 60.70); F score 69.20% (95% CI 68.05 to 70.33).

Rule: if `compan` > 0.0378065 ("company") then assign category "Corp" (38/44), else assign category "not Corp" (126/188).

### Decision Tree: TREC Spam (rank 2)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 245 | 0 |
| Actual negative | 28 | 345 |

Precision 89.74% (95% CI 85.73 to 92.92); recall 100.00% (95% CI 98.98 to 100.00); F score 94.59% (95% CI 92.39 to 96.30).

Rule: if `returnpath` < 0.0050965 (no return path) then assign category "spam" (156/160), else assign category "ham" (130/145).

### Ensembles of Decision Trees

- Random Forests: bagging plus random selection of the features examined at each node
- Hyper-parameter optimization
  - Number of trees in forest: 50, 75, 100, 125, 150
  - Number of features to examine for each node, for P features: sqrt(P)/4, sqrt(P)/2, sqrt(P), sqrt(P)×2, sqrt(P)×4

### Random Forest: RCV1 (rank 10)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2416 | 1285 |
| Actual negative | 162 | 3950 |

Precision 93.72% (95% CI 92.73 to 94.60); recall 65.28% (95% CI 63.73 to 66.80); F score 76.95% (95% CI 75.90 to 77.98).

Parameters: ntree = 150, mtry = 71.

### Random Forest: TREC Spam (rank 12)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 192 | 53 |
| Actual negative | 4 | 369 |

Precision 97.96% (95% CI 95.22 to 99.31); recall 78.37% (95% CI 72.90 to 83.17); F score 87.07% (95% CI 83.70 to 89.96).

Parameters: ntree = 100, mtry = 59.

### Support Vector Machine Kernels

- Linear kernel: dot product; the kernel is defined for the input space
- Radial Basis Function (RBF) kernel: based on the "standard" Gaussian distribution
- Strangeness kernel: a modified form of representation

### Linear SVM

- Kernel: $k(x, y) = \langle x, y \rangle$
- nu-svc: support vector classification; nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors
- Hyper-parameter optimization: nu = 0.16, 0.08, 0.04, 0.02, 0.01

### Linear SVM: RCV1 (rank 4)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2651 | 1050 |
| Actual negative | 200 | 3912 |

Precision 92.98% (95% CI 92.00 to 93.88); recall 71.63% (95% CI 70.16 to 73.06); F score 80.92% (95% CI 79.96 to 81.86).

Parameters: nu = 0.16. Number of support vectors: 210.

### Linear SVM: TREC Spam (rank 7)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 216 | 29 |
| Actual negative | 12 | 361 |

Precision 94.74% (95% CI 91.26 to 97.09); recall 88.16% (95% CI 83.68 to 91.76); F score 91.33% (95% CI 88.54 to 93.62).

Parameters: nu = 0.16. Number of support vectors: 216.

### RBF SVM

- Kernel: $k(x, y) = \exp(-\sigma \lVert x - y \rVert^2)$
- nu-svc: support vector classification; nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors
- Hyper-parameter optimization
  - nu = 0.16, 0.08, 0.04, 0.02, 0.01
  - sigma: 5 points based on squared distance (1 / quantile, for the 90th and 10th percentiles)

### RBF SVM: RCV1 (rank 8)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2521 | 1180 |
| Actual negative | 126 | 3986 |

Precision 95.24% (95% CI 94.38 to 96.00); recall 68.12% (95% CI 66.60 to 69.60); F score 79.43% (95% CI 78.42 to 80.41).

Parameters: nu = 0.16, sigma = 0.5010. Number of support vectors: 228.

### RBF SVM: TREC Spam (rank 8)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 209 | 36 |
| Actual negative | 9 | 364 |

Precision 95.87% (95% CI 92.60 to 97.94); recall 85.31% (95% CI 80.47 to 89.32); F score 90.28% (95% CI 87.33 to 92.73).

Parameters: nu = 0.16, sigma = 0.5000. Number of support vectors: 254.

### "Strangeness"

- "Strangeness" is a measure of dissimilarity.
- Defined by Saunders, Gammerman, and Vovk (based on a Lagrange multiplier) in "Transduction with Confidence and Credibility".
- Redefined by Proedrou, Nouretdinov, Gammerman, and Vovk in "Transductive Confidence Machines for Pattern Recognition".

### Kernel Based on "Strangeness"

- Strangeness is defined with respect to each class: the ratio of the sum of distances to the k nearest "class y" neighbors to the sum of distances to the k nearest "not class y" neighbors:

  $$\alpha_{i}^{y} = \frac{\sum_{j=1}^{k} D_{ij}^{y}}{\sum_{j=1}^{k} D_{ij}^{-y}}$$

- Each observation is now represented by a vector of strangeness values.
- The linear and RBF kernels simply use the new representation.

### Strangeness Kernel: RCV1 (rank 12)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2083 | 1618 |
| Actual negative | 145 | 3967 |

Precision 93.49% (95% CI 92.41 to 94.46); recall 56.28% (95% CI 54.68 to 57.87); F score 70.26% (95% CI 69.09 to 71.42).

Parameters: nu = 0.16, linear kernel. Number of support vectors: 40.

### Strangeness Kernel: TREC Spam (rank 10)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 218 | 27 |
| Actual negative | 23 | 350 |

Precision 90.46% (95% CI 86.26 to 93.68); recall 88.98% (95% CI 84.60 to 92.44); F score 89.71% (95% CI 86.78 to 92.18).

Parameters: nu = 0.16, linear kernel. Number of support vectors: 50.

### ROC Curves: RCV1

[Figure: ROC curves for RCV1; horizontal axis FP / (FP + TN). Surviving legend entries: dark blue 93.48; Strangeness (light blue) 93.48.]

### ROC Curves: TREC Spam

[Figure: ROC curves for TREC Spam. Legend: Tree (black) 96.37; Forest (red) 98.64; Linear (green) 97.95; RBF (dark blue) 96.80; Strangeness (light blue) 96.80.]

### Ensemble of Random Forests: RCV1 (rank 9)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2553 | 1148 |
| Actual negative | 182 | 3930 |

Precision 93.35% (95% CI 92.37 to 94.23); recall 68.98% (95% CI 67.48 to 70.46); F score 79.33% (95% CI 78.33 to 80.31).

### Ensemble of Random Forests: TREC Spam (rank 13)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 184 | 61 |
| Actual negative | 0 | 373 |

Precision 100.00% (95% CI 98.65 to 100.00); recall 75.10% (95% CI 69.41 to 80.20); F score 85.78% (95% CI 82.24 to 88.84).

### Ensemble of SVMs: RCV1 (rank 1)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2820 | 881 |
| Actual negative | 156 | 3956 |

Precision 94.76% (95% CI 93.91 to 95.52); recall 76.20% (95% CI 74.80 to 77.55); F score 84.47% (95% CI 83.59 to 85.32).

### Ensemble of SVMs: TREC Spam (rank 4)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 222 | 23 |
| Actual negative | 9 | 364 |

Precision 96.10% (95% CI 93.01 to 98.06); recall 90.61% (95% CI 86.48 to 93.79); F score 93.28% (95% CI 90.76 to 95.27).

### Feature Selection

- Feature selection based on Random Forests: mean decrease in accuracy (based on random permutation)
- Feature selection based on Linear SVM: squared weight vector elements

### Feature Selection: RCV1

[Figure: two histograms of feature importance. Random Forest panel: "Mean Decrease in Accuracy (adjusted to be non-negative)". Linear Support Vector Machine panel: "Squared Weight Vector Element".]

Top features by criterion:

- log10 odds ratio estimate: [...]3 `prft`, 10.82 `attrib`, 10.70 `jcr`, 11.39 `chechny`, 11.34 `lebe`, 11.08 `maskhadov`
- Random Forest: 0.81 `compan`, 0.70 `research`, 0.70 `shar`
- Linear SVM: 1.24 `markk`, 1.06 `merit`, 0.85 `ton`

### RF Ensemble with Feature Selection: RCV1 (rank 7)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2585 | 1116 |
| Actual negative | 210 | 3902 |

Precision 92.49% (95% CI 91.47 to 93.42); recall 69.85% (95% CI 68.35 to 71.31); F score 79.59% (95% CI 78.59 to 80.55).

### SVM Ensemble with Feature Selection: RCV1 (rank 2)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2795 | 881 |
| Actual negative | 156 | 3956 |

Precision 94.71% (95% CI 93.86 to 95.48); recall 76.03% (95% CI 74.63 to 77.39); F score 84.35% (95% CI 83.46 to 85.21).

### Feature Selection: TREC Spam

[Figure: two histograms of feature importance. Random Forest panel: "Mean Decrease in Accuracy (adjusted to be non-negative)". Linear Support Vector Machine panel: "Squared Weight Vector Element".]

Top features by criterion (tokens shown as they survive in the source):

- log10 odds ratio estimate: 13.20 `received n erlfl`, 13.20 `receivedtechnolog`, 13.20 `receivedsmtpr`, 13.19 `Ximsitneficorrelatorgt`, 13.19 `Ximsitneficorrelatorlt`, 13.10 `Xamlmeolev6047120`
- Random Forest: 0.53 `datez0700`, 0.53 `returnpathzz`, 0.53 `receivedzesmtp`
- Linear SVM: 5.72 `bodyzweb`, 5.17 `receivedzgt`, 3.08 `receivedzesmtp`

### RF Ensemble with Feature Selection: TREC Spam (rank 11)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 200 | 45 |
| Actual negative | 2 | 371 |

Precision 99.01% (95% CI 96.86 to 99.79); recall 81.63% (95% CI 76.43 to 86.09); F score 89.49% (95% CI 86.39 to 92.07).

### SVM Ensemble with Feature Selection: TREC Spam (rank 6)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 220 | 25 |
| Actual negative | 15 | 358 |

Precision 93.62% (95% CI 89.95 to 96.22); recall 89.80% (95% CI 85.54 to 93.12); F score 91.67% (95% CI 88.94 to 93.89).

### Combined Ensemble: RCV1 (rank 5)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2562 | 1139 |
| Actual negative | 92 | 4020 |

Precision 96.53% (95% CI 95.79 to 97.18); recall 69.22% (95% CI 67.72 to 70.70); F score 80.63% (95% CI 79.64 to 81.59).

### Combined Ensemble: TREC Spam (rank 5)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 218 | 27 |
| Actual negative | 6 | 367 |

Precision 97.32% (95% CI 94.56 to 98.87); recall 88.98% (95% CI 84.60 to 92.44); F score 92.96% (95% CI 90.38 to 95.02).

### RF Ensemble with Self-Training: RCV1 (rank 11)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2343 | 1358 |
| Actual negative | 143 | 3969 |

Precision 94.25% (95% CI 93.28 to 95.11); recall 63.31% (95% CI 61.74 to 64.85); F score 75.74% (95% CI 74.66 to 76.80).

### RF Ensemble with Self-Training: TREC Spam (rank 9)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 200 | 45 |
| Actual negative | 0 | 373 |

Precision 100.00% (95% CI 98.75 to 100.00); recall 81.63% (95% CI 76.43 to 86.09); F score 89.89% (95% CI 86.83 to 92.43).

### SVM Ensemble with Self-Training: RCV1 (rank 3)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2684 | 1017 |
| Actual negative | 110 | 4002 |

Precision 96.06% (95% CI 95.29 to 96.74); recall 72.52% (95% CI 71.07 to 73.94); F score 82.65% (95% CI 81.71 to 83.55).

### SVM Ensemble with Self-Training: TREC Spam (rank 3)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 223 | 22 |
| Actual negative | 10 | 363 |

Precision 95.71% (95% CI 92.52 to 97.77); recall 91.02% (95% CI 86.96 to 94.12); F score 93.31% (95% CI 90.80 to 95.29).

### Co-Training Performance: RCV1 (rank 6)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 2560 | 1141 |
| Actual negative | 113 | 3999 |

Precision 95.77% (95% CI 94.96 to 96.49); recall 69.17% (95% CI 67.67 to 70.64); F score 80.33% (95% CI 79.34 to 81.29).

### Co-Training Performance: TREC Spam (rank 1)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 229 | 16 |
| Actual negative | 10 | 363 |

Precision 95.82% (95% CI 92.71 to 97.83); recall 93.47% (95% CI 89.86 to 96.06); F score 94.63% (95% CI 92.35 to 96.38).
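The precision, recall, and F score figures reported with each confusion matrix in these notes can be recomputed from the four cell counts. The sketch below is not from the slides; the function name `precision_recall_f` is hypothetical, and it assumes the F score is the balanced harmonic mean of precision and recall, which is consistent with the reported numbers.

```python
def precision_recall_f(tp, fn, fp):
    """Return (precision, recall, F score) from confusion-matrix counts.

    tp: actual positive, predicted positive
    fn: actual positive, predicted negative
    fp: actual negative, predicted positive
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F score: harmonic mean of precision and recall (assumed).
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts taken from the co-training TREC Spam confusion matrix.
p, r, f = precision_recall_f(tp=229, fn=16, fp=10)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```

The confidence intervals in the notes are not reproduced here, because the slides do not say which interval method was used.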
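The TF-IDF weighting mentioned in the RCV1 and TREC Spam notes can be illustrated with a few lines of code. This is a minimal sketch that assumes the common log-scaled form, weight = (1 + log of the term count) times log of (number of documents / number of documents containing the term), with natural logarithms; the slides do not state the log base, and the function name `tfidf_weight` is hypothetical.

```python
import math


def tfidf_weight(term_count, n_docs, doc_freq):
    """Log-scaled TF-IDF weight for one term in one document (assumed form).

    term_count: occurrences of the term in the document
    n_docs:     total number of documents in the collection
    doc_freq:   number of documents that contain the term
    """
    if term_count == 0 or doc_freq == 0:
        return 0.0  # an absent term contributes nothing (keeps vectors sparse)
    return (1.0 + math.log(term_count)) * math.log(n_docs / doc_freq)


# A term that appears 3 times in a document and in 10 of 1000 documents.
print(tfidf_weight(term_count=3, n_docs=1000, doc_freq=10))
```

A term that occurs in every document gets weight 0, which is one reason the resulting vectors are as sparse as the notes describe.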
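The strangeness representation described in the notes (the ratio of summed distances to the k nearest same-class neighbors over summed distances to the k nearest other-class neighbors) can be sketched as follows. Euclidean distance, the default `k=3`, and the function names are assumptions for illustration; the slides do not specify the distance measure or the value of k.

```python
import math


def strangeness(x, label, train_X, train_y, k=3):
    """Strangeness of point x with respect to one class label.

    Ratio of the summed distances to the k nearest training points of
    class `label` to the summed distances to the k nearest training
    points of any other class. Euclidean distance is an assumption.
    """
    same = sorted(math.dist(x, t) for t, y in zip(train_X, train_y) if y == label)
    other = sorted(math.dist(x, t) for t, y in zip(train_X, train_y) if y != label)
    denominator = sum(other[:k])
    if denominator == 0:
        return math.inf  # x coincides with its nearest other-class points
    return sum(same[:k]) / denominator


def strangeness_vector(x, classes, train_X, train_y, k=3):
    """Represent x by one strangeness value per class."""
    return [strangeness(x, c, train_X, train_y, k) for c in classes]


train_X = [(0.0,), (1.0,), (10.0,), (11.0,)]
train_y = ["ham", "ham", "spam", "spam"]
print(strangeness_vector((0.5,), ["ham", "spam"], train_X, train_y, k=2))
```

A small value means the point sits close to that class and far from the others, so the vector can be passed to a linear or RBF kernel in place of the original features.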
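The self-training procedure outlined at the start of the notes (a classifier labels unlabeled data and its most confident predictions are added to the training set) can be sketched as a loop. This is an illustration only: the nearest-centroid classifier on one-dimensional features, the margin-based confidence, and all function names are stand-ins chosen to keep the example self-contained, not the tree or SVM ensembles evaluated in the notes. It also assumes at least two classes in the labeled data.

```python
def fit_centroids(X, y):
    """Toy classifier: the mean feature value of each class (1-D features)."""
    sums, counts = {}, {}
    for value, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_with_confidence(centroids, value):
    """Predict the nearest centroid; confidence is the gap to the runner-up."""
    ranked = sorted(centroids, key=lambda label: abs(value - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin


def self_train(X, y, unlabeled, per_round=1, rounds=10):
    """Repeatedly move the most confidently labeled examples into the training set."""
    X, y, pool = list(X), list(y), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = fit_centroids(X, y)
        scored = sorted(
            ((predict_with_confidence(model, v), v) for v in pool),
            key=lambda item: item[0][1],
            reverse=True,
        )
        for (label, _margin), value in scored[:per_round]:
            X.append(value)
            y.append(label)
            pool.remove(value)
    return fit_centroids(X, y), X, y


model, X_aug, y_aug = self_train([0.0, 10.0], ["ham", "spam"], [1.0, 9.0, 4.0])
print(model, list(zip(X_aug, y_aug)))
```

Co-training follows the same loop with more than one classifier, each contributing its own most confident labels to the shared training set.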
