# Class Note for CSCI 243 with Professor Bellaachia at GW (3)

This 36-page set of class notes was uploaded by an elite notetaker on Saturday, February 7, 2015. The notes belong to a course at George Washington University taught in the fall. Since upload, they have received 15 views.


## Contents

1. Objectives
2. Classification vs. Prediction
   - 2.1 Definitions
   - 2.2 Supervised vs. Unsupervised Learning
   - 2.3 Classification and Prediction: Related Issues
3. Common Test Corpora
4. Classification
5. Decision Tree Induction
   - 5.1 Decision Tree Induction Algorithm
   - 5.2 Other Attribute Selection Measures
   - 5.3 Extracting Classification Rules from Trees
   - 5.4 Avoiding Overfitting in Classification
   - 5.5 Classification in Large Databases
6. Bayesian Classification
   - 6.1 Basics
   - 6.2 Naïve Bayesian Classifier
7. Bayesian Belief Networks
   - 7.1 Definition
8. Neural Networks: Classification by Backpropagation
   - 8.1 Neural Network Issues
   - 8.2 Backpropagation Algorithm
9. Prediction
   - 9.1 Regression Analysis and Log-Linear Models in Prediction
10. Classification Accuracy: Estimating Error Rates

## 1. Objectives

- Techniques to classify datasets and provide categorical labels (e.g., sports, technology, kids).
- Example: (credit history, salary) → credit approval (Yes/No).
- Models to predict certain future behaviors, e.g., who is going to buy PDAs.

## 2. Classification vs. Prediction

### 2.1 Definitions

- Classification:
  - Predicts categorical class labels (discrete or nominal).
  - Classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
- Prediction:
  - Models continuous-valued functions, i.e., predicts unknown or missing values.
- Typical applications:
  - Document categorization
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Treatment effectiveness analysis

### 2.2 Supervised vs. Unsupervised Learning

- Supervised learning (classification):
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data is classified based on the training set.
- Unsupervised learning (clustering):
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
### 2.3 Classification and Prediction: Related Issues

- Data preparation:
  - Data cleaning: preprocess the data in order to reduce noise and handle missing values.
  - Relevance analysis (feature selection): remove irrelevant or redundant attributes.
  - Data transformation: generalize and/or normalize the data.
- Performance analysis:
  - Predictive accuracy: the ability to classify new or previously unseen data.
  - Speed and scalability: time to construct the model; time to use the model.
  - Robustness: the model makes correct predictions while handling noise and missing values.
  - Scalability: efficiency on disk-resident databases.
  - Interpretability: the understanding and insight provided by the model.
  - Goodness of rules: decision tree size; compactness of classification rules.

## 3. Common Test Corpora

- Reuters: a collection of newswire stories from 1987 to 1991, labeled with categories.
- TREC-AP: newswire stories from 1988 to 1990, labeled with categories.
- OHSUMED: Medline articles from 1987 to 1991, with MeSH categories assigned.
- UseNet newsgroups.
- WebKB: web pages gathered from university CS departments.

## 4. Classification

- A classical problem, extensively studied by statisticians and machine learning researchers.
- Predicts categorical class labels.
- Example typical applications:
  - (credit history, salary) → credit approval (Yes/No)
  - (Temp, Humidity) → Rain (Yes/No)
  - A set of documents → sports, technology, etc.
- Another example, classifying a score into grade bands:
  - If x > 90 then grade = A
  - If 80 < x < 90 then grade = B
  - If 70 < x < 80 then grade = C
  - If 60 < x < 70 then grade = D
  - If x < 50 then grade = F
- Classification types:
  - Distance-based
  - Partitioning-based

  (Figure: distance-based and partitioning-based classification of points into Class A, Class B, and Class C.)

- Classification is a two-step process:
  - Model construction: build a model for the predetermined classes.
  - Model usage: classify unknown data samples. If the accuracy is acceptable, use the model to classify data objects whose class labels are not known.
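The score-band example is a direct instance of partitioning-based classification. A minimal sketch (the notes leave the 50 to 60 band unspecified; this sketch maps it to F as well):

```python
def grade(x):
    """Partitioning-based classification: map a numeric score to a letter grade."""
    if x > 90:
        return "A"
    if x > 80:
        return "B"
    if x > 70:
        return "C"
    if x > 60:
        return "D"
    return "F"  # the notes only state x < 50 -> F; 50..60 falls here too

grade(95)  # → "A"
```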
- Model construction: describing a set of predetermined classes.
  - Each data sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - Use a training dataset for model construction.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Example: training data fed to a classification algorithm to produce a classifier (model):

  | RANK | YEARS | TENURED |
  |---|---|---|
  | Assistant Prof | 3 | no |
  | Assistant Prof | 7 | yes |
  | Professor | 2 | yes |
  | Associate Prof | 7 | yes |
  | Assistant Prof | 6 | no |
  | Associate Prof | 3 | no |

- Model usage: for classifying future or unknown objects.
  - Estimate the accuracy of the model:
    - The known label of each test sample is compared with the classified result from the model.
    - The test accuracy rate is the percentage of test-set samples that are correctly classified by the model.
    - The test set is independent of the training set but drawn from the same probability distribution.
  - Example test data:

    | NAME | RANK | YEARS | TENURED |
    |---|---|---|---|
    | Tom | Assistant Prof | 2 | no |
    | Merlisa | Associate Prof | 7 | no |
    | George | Professor | 5 | yes |
    | Joseph | Assistant Prof | 7 | yes |

- Common techniques:
  - k-Nearest Neighbor (kNN): use the k closest training data samples to predict the category.
  - Decision Trees (DTree): construct classification trees based on training data.
  - Neural Networks (NNet): learn a nonlinear mapping from input data samples to categories.
  - Support Vector Machines (SVMs).

## 5. Decision Tree Induction

- Example training dataset (Quinlan's ID3):

  | age | income | student | credit_rating | buys_computer |
  |---|---|---|---|---|
  | <30 | high | no | fair | no |
  | <30 | high | no | excellent | no |
  | 31..40 | high | no | fair | yes |
  | >40 | medium | no | fair | yes |
  | >40 | low | yes | fair | yes |
  | >40 | low | yes | excellent | no |
  | 31..40 | low | yes | excellent | yes |
  | <30 | medium | no | fair | no |
  | <30 | low | yes | fair | yes |
  | >40 | medium | yes | fair | yes |
  | <30 | medium | yes | excellent | yes |
  | 31..40 | medium | no | excellent | yes |
  | 31..40 | high | yes | fair | yes |
  | >40 | medium | no | excellent | no |

- A decision tree for buys_computer:

  (Figure: the root tests age. For age <30, test student: no → no, yes → yes. For age 31..40 → yes. For age >40, test credit_rating: excellent → no, fair → yes.)
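Of the techniques listed, k-nearest neighbor is simple enough to sketch in a few lines. A minimal illustration on hypothetical numeric data (the squared-Euclidean distance and the value of k are assumptions, not from the notes):

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (feature_vector, label) pairs; x: query vector.
    Vote among the k closest training samples (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda pair: sq_dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# toy two-class data (hypothetical)
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
knn_predict(train, (1.1, 1.0))  # → "A"
```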
### 5.1 Decision Tree Induction Algorithm

- Basic algorithm (a greedy algorithm):
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Samples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain.
- Conditions for stopping the partitioning:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.
- Attribute selection measure: information gain.
  - Select the attribute with the highest information gain.
  - Let S contain s_i tuples of class C_i, for i = 1, ..., m.
  - Entropy of the set of tuples (it measures how informative a node is):

    I(s_1, s_2, ..., s_m) = − Σ_{i=1}^{m} (s_i / s) · log2(s_i / s)

  - Entropy after choosing attribute A with values {a_1, a_2, ..., a_v}:

    E(A) = Σ_{j=1}^{v} ((s_{1j} + ... + s_{mj}) / s) · I(s_{1j}, ..., s_{mj})

  - Information gained by branching on attribute A:

    Gain(A) = I(s_1, s_2, ..., s_m) − E(A)

- Information gain computation for the example dataset:
  - Class P: buys_computer = yes, with p = 9 samples.
  - Class N: buys_computer = no, with n = 5 samples.
  - The expected information: I(p, n) = I(9, 5) = 0.940.
  - Compute the entropy for age:

    | age | p_i | n_i | I(p_i, n_i) |
    |---|---|---|---|
    | <30 | 2 | 3 | 0.971 |
    | 31..40 | 4 | 0 | 0 |
    | >40 | 3 | 2 | 0.971 |

    E(age) = (5/14) · I(2,3) + (4/14) · I(4,0) + (5/14) · I(3,2) = 0.694

    Here (5/14) · I(2,3) means age <30 has 5 out of 14 samples, with 2 yes's and 3 no's.

  - Hence Gain(age) = I(p, n) − E(age) = 0.940 − 0.694 = 0.246.
  - Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
  - We say that age is more informative than income, student, and credit_rating, so age is chosen as the root of the tree.
- Splitting on age yields three subsets:

  age <30:

  | income | student | credit_rating | class |
  |---|---|---|---|
  | high | no | fair | no |
  | high | no | excellent | no |
  | medium | no | fair | no |
  | low | yes | fair | yes |
  | medium | yes | excellent | yes |

  age >40:

  | income | student | credit_rating | class |
  |---|---|---|---|
  | medium | no | fair | yes |
  | low | yes | fair | yes |
  | low | yes | excellent | no |
  | medium | yes | fair | yes |
  | medium | no | excellent | no |

  age 31..40 (all "yes": it is a leaf):

  | income | student | credit_rating | class |
  |---|---|---|---|
  | high | no | fair | yes |
  | low | yes | excellent | yes |
  | medium | no | excellent | yes |
  | high | yes | fair | yes |
- Recursively apply the same process to each subset.
- The ID3 algorithm:
  - function ID3(R: a set of non-categorical attributes, C: the categorical attribute, S: a training set) returns a decision tree:
    - If S is empty, return a single node with value Failure.
    - If S consists of records all with the same value for the categorical attribute, return a single node with that value.
    - If R is empty, return a single node whose value is the most frequent of the values of the categorical attribute found in the records of S. (Note that there will then be errors, that is, records that will be improperly classified.)
    - Let D be the attribute with the largest Gain(D, S) among the attributes in R.
    - Let {d_j | j = 1, 2, ..., m} be the values of attribute D.
    - Let {S_j | j = 1, 2, ..., m} be the subsets of S consisting, respectively, of the records with value d_j for attribute D.
    - Return a tree with root labeled D and arcs labeled d_1, d_2, ..., d_m going, respectively, to the trees ID3(R − {D}, C, S_1), ID3(R − {D}, C, S_2), ..., ID3(R − {D}, C, S_m).

### 5.2 Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner):
  - All attributes are assumed continuous-valued.
  - Assume there exist several possible split values for each attribute.
  - May need other tools, such as clustering, to get the possible split values.
  - Can be modified for categorical attributes.
- Formal definition:
  - If a data set T contains examples from n classes, the Gini index gini(T) is defined as

    gini(T) = 1 − Σ_{j=1}^{n} p_j²

    where p_j is the relative frequency of class j in T.
  - If a data set T is split into two subsets T_1 and T_2, with sizes N_1 and N_2 respectively, the Gini index of the split data is defined as

    gini_split(T) = (N_1 / N) · gini(T_1) + (N_2 / N) · gini(T_2)

  - The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
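The Gini measure is just as easy to sketch. A minimal illustration (the class counts reuse the buys_computer example; the binary split shown is hypothetical):

```python
def gini(counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Size-weighted Gini index of a binary split of T into T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

gini([9, 5])                 # whole dataset: 1 - (9/14)^2 - (5/14)^2 ≈ 0.459
gini_split([2, 3], [7, 2])   # a hypothetical binary split of the 14 samples
```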
### 5.3 Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example (reading the rules off the buys_computer tree):
  - IF age <30 AND student = no THEN buys_computer = no
  - IF age <30 AND student = yes THEN buys_computer = yes
  - IF age 31..40 THEN buys_computer = yes
  - IF age >40 AND credit_rating = excellent THEN buys_computer = no
  - IF age >40 AND credit_rating = fair THEN buys_computer = yes

### 5.4 Avoiding Overfitting in Classification

- Overfitting: an induced tree may overfit the training data.
  - Too many branches; some may reflect anomalies due to noise or outliers.
  - Poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. (It is difficult to choose an appropriate threshold.)
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the best pruned tree.

### 5.5 Classification in Large Databases

- Classification: a classical problem, extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  - Relatively faster learning speed than other classification methods.
  - Convertible to simple, easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

(Figure: visualization of a decision tree in SGI/MineSet 3.0.)

## 6. Bayesian Classification

- Why Bayesian?
  - Probabilistic learning: calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems.
  - Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
  - Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
  - Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

### 6.1 Basics

- Let X be a data sample whose class label is unknown.
- Let H be the hypothesis that X belongs to class C.
- Posterior probability:
  - For classification problems, determine P(H|X): the probability that the hypothesis H holds given the observed data sample X.
- Prior probability:
  - P(H) is the prior probability of hypothesis H.
  - It is the initial probability, before we observe any data.
  - It reflects the background knowledge.
- P(X): the probability that the sample data X is observed.
- P(X|H): the probability of observing the sample X given that the hypothesis H holds.
- Bayes' theorem:
  - Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) · P(H) / P(X)

  - Informally, this can be written as: posterior = (likelihood × prior) / evidence.
- Maximum a posteriori (MAP) hypothesis:
  - Since P(X) is constant across all hypotheses, we have

    h_MAP ≡ argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)

    where D is the data sample and H is the set of all available hypotheses (e.g., all available classes).
- Practical difficulties:
  - Requires initial knowledge of many probabilities.
  - Significant computational cost.

### 6.2 Naïve Bayesian Classifier

- A simplifying assumption: attributes are conditionally independent given the class:

  P(X|C_j) = Π_{k=1}^{n} P(x_k|C_j)

- The probability of the joint occurrence of, say, two elements y_1 and y_2, given the current class C, is the product of the probabilities of each element taken separately given the same class: P(y_1, y_2 | C) = P(y_1|C) · P(y_2|C).
- No dependence relation between attributes is modeled.
- This greatly reduces the computation cost: only the class distributions need to be counted.
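Bayes' theorem in action, with small made-up numbers (all three probabilities below are illustrative, not from the notes):

```python
p_h = 0.3          # prior P(H)
p_x_given_h = 0.8  # likelihood P(X|H)
p_x = 0.5          # evidence P(X)

# posterior = likelihood * prior / evidence
p_h_given_x = p_x_given_h * p_h / p_x   # 0.8 * 0.3 / 0.5 = 0.48
```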
- Once the probability P(X|C_i) is known, assign X to the class with maximum P(X|C_i) · P(C_i).
- Example:
  - Given the following data sample, from our previous example:

    X = (age <30, income = medium, student = yes, credit_rating = fair)

  - Compute P(x_k|C_i) = s_ik / s_i, where s_ik is the number of training samples of class C_i having the value x_k for attribute A_k, and s_i is the number of training samples belonging to C_i:
    - P(age <30 | buys_computer = yes) = 2/9 = 0.222
    - P(age <30 | buys_computer = no) = 3/5 = 0.6
    - P(income = medium | buys_computer = yes) = 4/9 = 0.444
    - P(income = medium | buys_computer = no) = 2/5 = 0.4
    - P(student = yes | buys_computer = yes) = 6/9 = 0.667
    - P(student = yes | buys_computer = no) = 1/5 = 0.2
    - P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
    - P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
  - Compute P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i):
    - P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    - P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  - Compute P(X|C_i) · P(C_i), where P(C_i) = s_i / s is the class prior probability (s_i is the number of training samples in C_i and s is the total number of training samples):
    - P(buys_computer = yes) = 9/14 = 0.643
    - P(buys_computer = no) = 5/14 = 0.357
    - P(X | buys_computer = yes) · P(buys_computer = yes) = 0.028
    - P(X | buys_computer = no) · P(buys_computer = no) = 0.007
  - X belongs to class buys_computer = yes.
- Advantages:
  - Easy to implement.
  - Good results obtained in most cases.
- Disadvantages:
  - The assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the naïve Bayesian classifier.
- How to deal with these dependencies? Bayesian belief networks.

## 7. Bayesian Belief Networks

- Objectives:
  - The naïve Bayesian classifier assumes that attributes are conditionally independent.
  - Belief nets are proven technology: medical diagnosis, DSS for complex machines, forecasting, modeling, information retrieval, etc.
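The whole worked example can be verified with a few lines of code. A minimal sketch of the naïve Bayesian classifier over the buys_computer table (the `score` helper is my naming, not from the notes):

```python
# rows: (age, income, student, credit_rating, buys_computer)
data = [
    ("<30", "high", "no", "fair", "no"),
    ("<30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def score(x, cls):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)            # class prior P(Ci) = si/s
    for k, v in enumerate(x):            # multiply the P(xk|Ci) = sik/si factors
        p *= sum(1 for r in rows if r[k] == v) / len(rows)
    return p

x = ("<30", "medium", "yes", "fair")
s_yes = score(x, "yes")   # ≈ 0.028
s_no = score(x, "no")     # ≈ 0.007  → classify X as buys_computer = yes
```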
### 7.1 Definition

- A Bayesian network is a causal directed acyclic graph (DAG) associated with an underlying probability distribution.
- DAG structure:
  - Each node is represented by a variable v.
  - v depends only on its parents, through the conditional probability P(v | parents(v)); v is independent of its non-descendants given assignments to its parents.
- We must specify the conditional probability distribution for each node.
- If the variables are discrete, each node is described by a table, the Conditional Probability Table (CPT), which lists the probability that the child node takes on each of its different values for each combination of values of its parents.
- Example:

  (Figure: example DAG. Given H, D has no influence on J; J has no influence on B; etc.)

- Example: a simple belief net over variables H, B, and J, with CPTs:
  - P(H = 1) = 0.05, P(H = 0) = 0.95.
  - P(B|H): P(B = 1 | H = 1) = 0.95, P(B = 1 | H = 0) = 0.03.
  - P(J|H, B): P(J = 1 | H = 1, B = 1) = 0.8, P(J = 1 | H = 0, B = 1) = 0.3, P(J = 1 | H = 0, B = 0) = 0.3.
- Challenges:
  - Efficient ways to use BNs.
  - Ways to create BNs.
  - Ways to maintain BNs.
  - Reasoning about time.

## 8. Neural Networks: Classification by Backpropagation

- A neural network is a set of connected input/output units where each connection has a weight associated with it.
- During the learning process, the network learns by adjusting the weights so as to predict the correct class labels of the input samples.
- Typical NN structure for classification:
  - One output node per class.
  - The output value is the class membership function value.
- Perceptron:
  - One of the simplest NNs; no hidden layers.
- Supervised learning:
  - For each tuple in the training set, propagate it through the NN, and adjust the weights on the edges to improve future classification.
- Algorithms: propagation, backpropagation, gradient descent.

### 8.1 Neural Network Issues

- Number of source nodes
- Number of hidden layers
- Training data
- Number of sinks
- Interconnections
- Weights
- Activation functions
- Learning technique
- When to stop learning
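With the CPT entries from the simple H, B example, the marginal P(B = 1) can be computed by summing over the parent. A sketch:

```python
p_h1 = 0.05                        # P(H = 1)
p_b1_given_h = {1: 0.95, 0: 0.03}  # P(B = 1 | H = h)

# marginalization: P(B=1) = sum over h of P(B=1|H=h) * P(H=h)
p_b1 = p_b1_given_h[1] * p_h1 + p_b1_given_h[0] * (1 - p_h1)
# 0.95*0.05 + 0.03*0.95 = 0.076
```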
### 8.2 Backpropagation Algorithm

- Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual, known class label.
- For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class; the modifications are made in the backward direction.
- A neuron: an n-dimensional input vector x is combined with a weight vector w into a weighted sum, which is passed through an activation function f to produce the output y. A training sample is thus mapped to the variable y by means of a scalar product and a nonlinear function mapping. For example:

  y = f( Σ_{i=0}^{n} w_i x_i − θ_k )

- Algorithm:
  - Initialize the weights in the NN and the bias associated with each neuron. The values are generally chosen between −1 and 1, or between −0.5 and 0.5.
  - For each sample X:
    - Propagate the inputs forward:
      - The net input of each neuron j in the hidden and output layers is computed as

        I_j = Σ_i w_ij O_i + θ_j

        where w_ij is the weight of the connection from unit i in the previous layer to unit j, O_i is the output of unit i in the previous layer, and θ_j is the bias of the unit (used to vary the activity of the unit).
      - Apply a nonlinear logistic (sigmoid) function to compute the output of neuron j:

        O_j = 1 / (1 + e^(−I_j))

        This is called a squashing function: it maps I_j to a value between 0 and 1.
    - Backpropagate the error, updating the weights and biases to reflect the network's prediction error:
      - Output layer: Err_j = O_j (1 − O_j)(T_j − O_j), where T_j is the true output and O_j is the actual output of neuron j.
      - Hidden layer: use the weighted sum of the errors of the neurons connected to neuron j in the next layer. The error of a neuron j is

        Err_j = O_j (1 − O_j) Σ_k Err_k w_jk

        where w_jk is the weight of the connection from neuron j to a neuron k in the next layer, and Err_k is the error of neuron k.
      - Adjust the weights and biases to reflect the propagated errors:

        w_ij = w_ij + l · Err_j · O_i
        θ_j = θ_j + l · Err_j

        where l is a constant learning rate, between 0 and 1.
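One forward-and-backward step for a single output neuron, using the formulas above. A sketch with made-up inputs, weights, bias, and learning rate:

```python
import math

def sigmoid(i_j):
    """Squashing function: O_j = 1 / (1 + e^-I_j)."""
    return 1.0 / (1.0 + math.exp(-i_j))

x = [0.5, -0.2]   # outputs O_i of the previous layer (made up)
w = [0.1, 0.4]    # weights w_ij (made up)
theta = 0.05      # bias of the neuron
l = 0.9           # learning rate

# forward pass: net input I_j, then output O_j
i_j = sum(wi * xi for wi, xi in zip(w, x)) + theta
o_j = sigmoid(i_j)

# backward pass: output-layer error, then weight and bias updates
t_j = 1.0                               # true output T_j
err_j = o_j * (1 - o_j) * (t_j - o_j)   # Err_j
w = [wi + l * err_j * xi for wi, xi in zip(w, x)]
theta = theta + l * err_j
```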
- Terminating conditions (generally after hundreds of thousands of iterations, or epochs):
  - All Δw_ij in the previous epoch (one iteration through the training set) are so small as to fall below a specified threshold, or
  - The percentage of samples misclassified in the previous epoch is below some threshold, or
  - A prespecified number of epochs has expired.
- See the example in the textbook, page 308.

## 9. Prediction

- Prediction (often called regression) is similar to classification:
  - First, construct a model.
  - Second, use the model to predict unknown values.
  - The major method for prediction is regression: linear and multiple regression, and nonlinear regression.
- Prediction is different from classification:
  - Classification predicts categorical class labels.
  - Prediction models continuous-valued functions.
- Many classification methods can also be used for regression:
  - Decision trees can also produce probabilities.
  - Bayesian networks output probabilities.
  - Neural networks can produce continuous output.

### 9.1 Regression Analysis and Log-Linear Models in Prediction

- Linear regression: Y = a + b·X.
  - The two parameters, a and b, specify the line and are to be estimated using the data at hand.
  - Apply the least squares criterion to the known values Y_1, Y_2, ..., X_1, X_2, .... (See example 7.6 on page 320 of the textbook.)
- Multiple regression: Y = b_0 + b_1·X_1 + b_2·X_2.
  - Many nonlinear functions can be transformed into the above form.
- Nonlinear models:
  - Can often be converted to a linear model.

## 10. Classification Accuracy: Estimating Error Rates

- Partition: training-and-testing.
  - Use two independent data sets, e.g., training set (2/3) and test set (1/3).
  - Used for data sets with a large number of samples.
- Cross-validation:
  - Divide the data set into k subsamples.
  - Use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation).
  - For data sets of moderate size.
- Bootstrapping (leave-one-out):
  - For small data sets.
- Confusion matrix:
  - This matrix shows not only how well the classifier predicts the different classes; it also describes information about the actual and detected classes.
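The least-squares fit for Y = a + b·X has a closed form. A sketch on hypothetical data points (the four (X, Y) pairs are made up for illustration):

```python
# hypothetical (X, Y) pairs
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# least-squares estimates: b = cov(X, Y) / var(X), a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
# fitted line: Y ≈ 0.15 + 1.95 X
```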
  | | Detected Positive | Detected Negative |
  |---|---|---|
  | Actual Positive | A (True Positive) | B (False Negative) |
  | Actual Negative | C (False Positive) | D (True Negative) |

- The recall (or true positive rate) and the precision (or positive predictive rate) can be derived from the confusion matrix as follows:

  Recall = A / (A + B)

  Precision = A / (A + C)
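Recall and precision computed from the confusion-matrix cells, with illustrative counts (the numbers are made up):

```python
# A = true positives, B = false negatives, C = false positives, D = true negatives
A, B, C, D = 40, 10, 5, 45   # made-up counts

recall = A / (A + B)      # true positive rate: 40/50 = 0.8
precision = A / (A + C)   # positive predictive rate: 40/45 ≈ 0.889
```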
