Classification & Information Theory
Lecture 5, Introduction to Natural Language Processing
CMPSCI 585, Fall 2004
University of Massachusetts Amherst
Aron Culotta
Slides courtesy of Andrew McCallum

Document Classification by Machine Learning
[Figure: a machine-learning system is trained on documents labeled with categories such as "Garbage Collection", "Multimedia", "User Studies", "Neural Networks", "Planning", and "Programming Language Semantics", then assigns a category to a new test document, e.g. "Temporal reasoning for planning has long been studied formally. We discuss the semantics of several planning..." → Planning.]

(We work out the Naive Bayes formulation interactively on the board.)

Recipe for Solving an NLP Task Statistically
1. Data: notation, representation
2. Problem: write down the problem in notation
3. Model: make some assumptions, define a parametric model
4. Inference: how to search through possible answers to find the best one
5. Learning: how to estimate parameters
6. Implementation: engineering considerations for an efficient implementation

Engineering Components of a Naive Bayes Document Classifier
- Split documents into training and testing sets.
- Cycle through all documents in each class.
- Tokenize the character stream into words.
- Count occurrences of each word in each class.
- Estimate P(w|c) by a ratio of counts, plus a prior.
- For each test document, calculate P(c|d) for each class.
- Record predicted and true class; keep accuracy statistics.

A Probabilistic Approach to Classification: Naive Bayes
Pick the most probable class given the evidence:
    c* = argmax_{c_j} Pr(c_j | d)
where c_j is a class (like Planning) and d is a document (like "language intelligence proof...").

Bayes rule, plus the naive independence assumption over words:
    Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d)
                ≈ Pr(c_j) ∏_i Pr(w_{d_i} | c_j) / Σ_k [ Pr(c_k) ∏_i Pr(w_{d_i} | c_k) ]
where w_{d_i} is the i-th word in d (like "proof").

Parameter Estimation in Naive Bayes
Estimate of P(c):
    P̂(c_j) = Count(d ∈ c_j) / Σ_k Count(d ∈ c_k)
Estimate of P(w|c), with add-one smoothing over the vocabulary V:
    P̂(w_i | c_j) = (1 + Σ_{d_k ∈ c_j} Count(w_i, d_k)) / (|V| + Σ_{w ∈ V} Σ_{d_k ∈ c_j} Count(w, d_k))
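The estimation and classification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's assignment code; the toy corpus and all function names are invented:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (class_label, word_list) pairs.
    Returns class priors and a smoothed word-probability function."""
    class_counts = Counter()            # Count(d in c_j)
    word_counts = defaultdict(Counter)  # Count(w_i, c_j)
    vocab = set()
    for label, words in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    n_docs = sum(class_counts.values())
    priors = {c: n / n_docs for c, n in class_counts.items()}

    def p_word(w, c):
        # add-one smoothing: (1 + Count(w, c)) / (|V| + total word count in c)
        return (1 + word_counts[c][w]) / (len(vocab) + sum(word_counts[c].values()))

    return priors, p_word

def classify(words, priors, p_word):
    # sum logs instead of multiplying probabilities, to avoid underflow
    scores = {c: math.log(priors[c]) + sum(math.log(p_word(w, c)) for w in words)
              for c in priors}
    return max(scores, key=scores.get)

# toy corpus, invented for illustration
docs = [("planning", ["temporal", "reasoning", "planning", "semantics"]),
        ("planning", ["planning", "search", "goal"]),
        ("garbage-collection", ["garbage", "collection", "memory", "heap"])]
priors, p_word = train_naive_bayes(docs)
print(classify(["planning", "goal"], priors, p_word))  # → planning
```

Note that the classifier already scores in log space, anticipating the underflow issue discussed next.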
Programming Assignment 2 Help: Small Numbers
Pr(c_j | d) ∝ Pr(c_j) ∏_i Pr(w_{d_i} | c_j) is a product of many small probabilities, which underflows. Work with logs instead:
    log Pr(c_j | d) = log Pr(c_j) + Σ_i log Pr(w_{d_i} | c_j) + constant
To get back to Pr(c_j | d): subtract a constant from every log score so the values stay in a safe range, then apply exp() and renormalize.

Common Words in Tom Sawyer (71,370 word tokens)
    Word   Freq   Use
    the    3332   determiner (article)
    and    2972   conjunction
    a      1775   determiner
    to     1725   preposition, verbal infinitive marker
    of     1440   preposition
    was    1161   auxiliary verb
    it     1027   personal/expletive pronoun
    in      906   preposition
    that    877   complementizer, demonstrative
    he      877   personal pronoun
    I       783   personal pronoun
    his     772   possessive pronoun
    you     686   personal pronoun
    Tom     679   proper noun
    with    642   preposition

Zipf's Law in Tom Sawyer
    Word    Freq (f)  Rank (r)   f·r
    the       3332        1      3332
    and       2972        2      5944
    a         1775        3      5325
    he         877       10      8770
    but        420       20      8400
    be         294       30      8820
    there      222       40      8880
    one        172       50      8600
    about      158       60      9480
    more       138       70      9660
    never      124       80      9920
    Oh         116       90     10440
    two        104      100     10400

Frequencies of Frequencies in Tom Sawyer
71,370 word tokens; 8,018 word types.
    Word freq   # word types
    1           3993
    2           1292
    3            664
    4            410
    5            243
    6            199
    7            172
    8            131
    9             82
    10            91
    11-50        540
    51-100        99
    >100         102

Zipf's Law in Tom Sawyer (continued)
    Word        Freq (f)  Rank (r)   f·r
    turned          51       200    10200
    you'll          30       300     9000
    name            21       400     8400
    comes           16       500     8000
    group           13       600     7800
    lead            11       700     7700
    friends         10       800     8000
    begin            9       900     8100
    family           8      1000     8000
    brushed          4      2000     8000
    sins             2      3000     6000
    Could            2      4000     8000
    Applausive       1      8000     8000

Zipf's law: f ∝ 1/r. In other words, there is a constant k such that f · r ≈ k.

[Figure: Zipf's law and the Brown corpus: word frequency plotted against rank on log-log axes falls close to a straight line.]

What is Information? (Information Theory)
- "The sun will come up tomorrow." (nearly certain, so it carries almost no information)
- "Greenspan was shot and killed this morning." (improbable, so it carries a great deal of information)

Efficient Encoding: Entropy of a Random Variable
- I have an 8-sided die. How many bits do I need to tell you what face I just rolled?
- My 8-sided die is unfair: P(1) = 0.5, P(2) = 0.125, P(3) = ... = P(8) = 0.0625.
- Entropy is the average length of the message needed to transmit the outcome of the random variable.
First used in:
- data compression
- transmission rates over noisy channels

Coding Interpretation of Entropy
Given some distribution over events P(X), what is the average number of bits needed to encode a message (an event, a string, a sequence)?
Entropy of P(X):
    H(p(X)) = -Σ_{x∈X} p(x) log₂ p(x)
Notation: H(X) = H(p(X)) = H(p) = H_X(p) = H(P_X).

Exercises:
- What is the entropy of a fair coin? Of a fair 8-sided die?
- What is the entropy of an unfair coin that always comes up heads?
- What is the entropy of an unfair 8-sided die that always rolls the same face?
Bounds: 0 ≤ H(X) ≤ log₂ |X|. Exercise: prove the lower bound.

Entropy and Expectation
Recall E[X] = Σ_{x∈X} x · p(x). Then
    E[ log₂ (1/p(x)) ] = Σ_{x∈X} p(x) log₂ (1/p(x)) = H(X)

[Figure: entropy of a coin with P(heads) = p, plotted as p runs from 0 to 1; it is 0 at p = 0 and p = 1, and peaks at 1 bit when p = 0.5.]

Entropy, Intuitively
- High entropy: chaos, fuzziness, the opposite of order.
- The term comes from physics: entropy does not go down unless energy is used.
- A measure of uncertainty:
  - High entropy: a lot of uncertainty about the outcome (a uniform distribution over outcomes).
  - Low entropy: high certainty about the outcome.

Claude Shannon
- Claude Shannon (1916-2001), creator of information theory.
- Laid the foundation for implementing logic in digital circuits as part of his Master's thesis (1937).
- "A Mathematical Theory of Communication" (1948).

Joint Entropy and Conditional Entropy
Two random variables: X (over space X) and Y (over space Y).
Joint entropy (no big deal: treat (X, Y) as a single event):
    H(X, Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)
Conditional entropy:
    H(X|Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x|y)
(Recall that H(X) = E[-log₂ p(x)]: this is a weighted average, and the weights are not conditional.)
H(X|Y) is how much extra information you need to supply to transmit X, given that the other person already knows Y.

Conditional Entropy, Another Way
    H(Y|X) = Σ_x p(x) H(Y | X = x)
           = Σ_x p(x) [ -Σ_y p(y|x) log₂ p(y|x) ]
           = -Σ_x Σ_y p(x) p(y|x) log₂ p(y|x)
           = -Σ_x Σ_y p(x, y) log₂ p(y|x)

Chain Rule for Entropy
Since, like a random variable, entropy is based on an expectation:
    H(X, Y) = H(X) + H(Y|X)
    H(X, Y) = H(Y) + H(X|Y)

Cross Entropy
What happens when you use a code that is suboptimal for your event distribution?
- I built my code to be efficient for a fair 8-sided die, but the die is unfair and always gives 1 or 2, uniformly.
- How many bits on average for the optimal code? How many for the suboptimal code?
    H(p, q) = -Σ_{x∈X} p(x) log₂ q(x)
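The entropy and cross-entropy definitions can be checked numerically. A short sketch, using the distributions from the slides (fair coin, fair and unfair 8-sided dice, and the die that always gives 1 or 2):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average bits when events follow p
    but the code was built for q."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

print(entropy([0.5, 0.5]))                   # fair coin → 1.0 bit
print(entropy([1/8] * 8))                    # fair 8-sided die → 3.0 bits
print(entropy([1.0, 0.0]))                   # coin that always lands heads → 0.0
print(entropy([0.5, 0.125] + [0.0625] * 6))  # the unfair die above → 2.375 bits

# code optimized for a fair 8-sided die, used on a die that always gives 1 or 2:
p = [0.5, 0.5] + [0.0] * 6
q = [1/8] * 8
print(entropy(p))                            # optimal code → 1.0 bit
print(cross_entropy(p, q))                   # suboptimal code → 3.0 bits
```

The gap between the last two numbers, 2 bits, is exactly the KL divergence defined next.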
KL Divergence
How many bits, on average, do we waste by encoding events drawn from distribution p using a code built for distribution q?
    D(p || q) = H(p, q) - H(p)
              = -Σ_x p(x) log₂ q(x) + Σ_x p(x) log₂ p(x)
              = Σ_x p(x) log₂ ( p(x) / q(x) )
A sort of "distance" between distributions p and q, but it is not symmetric, and it does not satisfy the triangle inequality.

Mutual Information
Recall: H(X) is the average number of bits needed for me to tell you which event occurred, drawn from distribution P(X).
Now suppose I first tell you the event y ∈ Y. Then H(X|Y) is the average number of bits necessary to tell you which event occurred, drawn from P(X|Y).
By how many bits does knowledge of Y lower the entropy?
    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
            = H(X) + H(Y) - H(X, Y)
            = Σ_{x,y} p(x, y) log₂ [ p(x, y) / ( p(x) p(y) ) ]

Mutual Information (properties)
- Symmetric and non-negative.
- A measure of independence:
  - I(X; Y) = 0 when X and Y are independent.
  - I(X; Y) grows both with the degree of dependence and with the entropy of the variables.
- Sometimes also called "information gain".
- Used often in NLP: clustering words, word sense disambiguation, feature selection.

Pointwise Mutual Information
Previously we measured mutual information between two random variables. We can also measure mutual information between two events x and y:
    I(x, y) = log₂ [ p(x, y) / ( p(x) p(y) ) ]
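A numeric sketch of these last two quantities; the distributions below are invented for illustration:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) log2( p(x) / q(x) )."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ),
    with the joint distribution given as a 2-D list p[x][y]."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# KL divergence is not symmetric:
p, q = [0.5, 0.5], [0.75, 0.25]
print(kl(p, q), kl(q, p))

# independent X and Y: I(X;Y) = 0
indep = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(indep))  # → 0.0

# perfectly dependent (X = Y, each fair): I(X;Y) = H(X) = 1 bit
dep = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(dep))    # → 1.0
```

The two extreme joints show the properties listed above: knowing Y tells you nothing about X in the first case, and everything in the second.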