Classification & Information Theory
Lecture 8, Introduction to Natural Language Processing
CMPSCI 585, Fall 2007
University of Massachusetts Amherst
Andrew McCallum

Today's Main Points
- Automatically categorizing text.
- Parameter estimation and smoothing: a general recipe for a statistical computational-linguistics model.
- Building a spam filter.
- Information theory: What is information? How can you measure it? Entropy, cross entropy, information gain.

Maximum Likelihood Parameter Estimation. Example: Binomial
- Toss a coin 100 times, observe r heads.
- Assume a binomial distribution: order doesn't matter, and successive flips are independent.
- The one parameter is q, the probability of flipping a head.
- The binomial gives p(r | n, q). We know r and n.
- Find \arg\max_q p(r | n, q).

Notes for the board
- Likelihood:
  p(R = r | n, q) = \binom{n}{r} q^r (1 - q)^{n - r}
- Log likelihood:
  L = \log p(r | n, q) \propto \log [ q^r (1 - q)^{n - r} ] = r \log q + (n - r) \log(1 - q)
- Set the derivative to zero and solve:
  dL/dq = r/q - (n - r)/(1 - q) = 0  =>  r(1 - q) = (n - r) q  =>  q = r/n
- Our familiar ratio of counts is the maximum likelihood estimate.

Binomial Parameter Estimation Examples
- Make 1000 coin flips, observe 300 heads: P(Heads) = 300/1000.
- Make 3 coin flips, observe 2 heads: P(Heads) = 2/3 ?
- Make 1 coin flip, observe 1 tail: P(Heads) = 0 ???
- Make 0 coin flips: P(Heads) = ?
- We have some prior belief about P(Heads) before we see any data.
- After seeing some data, we have a posterior belief.

Maximum A Posteriori Parameter Estimation
- So far we have been finding the parameters that maximize p(data | parameters), not the parameters that maximize p(parameters | data). (The parameters are random variables too.)
- By Bayes rule:
  p(q | r, n) = p(r | n, q) p(q | n) / p(r | n) = p(r | n, q) p(q) / constant
- And let the prior be p(q) \propto q (1 - q).

Maximum A Posteriori Parameter Estimation. Example: Binomial
- Posterior:
  p(r | n, q) p(q) \propto q^r (1 - q)^{n - r} \cdot q (1 - q)
- Log posterior:
  L \propto \log [ q^{r + 1} (1 - q)^{n - r + 1} ] = (r + 1) \log q + (n - r + 1) \log(1 - q)
- Set the derivative to zero and solve:
  dL/dq = (r + 1)/q - (n - r + 1)/(1 - q) = 0  =>  q = (r + 1)/(n + 2)
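The following is a small sketch (not part of the original slides) contrasting the two estimators just derived: the maximum likelihood estimate q = r/n and the MAP estimate q = (r + 1)/(n + 2) under the prior p(q) \propto q(1 - q). The counts mirror the examples on the "Binomial Parameter Estimation Examples" slide.

```python
def mle_estimate(heads, flips):
    """Maximum likelihood estimate: q = r / n."""
    return heads / flips if flips > 0 else float("nan")   # undefined with no data

def map_estimate(heads, flips):
    """MAP estimate under the prior p(q) proportional to q(1-q): q = (r + 1) / (n + 2)."""
    return (heads + 1) / (flips + 2)

# The slide's examples: lots of data, a little data, one flip, and no flips at all.
for heads, flips in [(300, 1000), (2, 3), (0, 1), (0, 0)]:
    print(f"r={heads:>3}, n={flips:>4}:  MLE={mle_estimate(heads, flips):.3f}  "
          f"MAP={map_estimate(heads, flips):.3f}")
```

With no data the MAP estimate falls back to the prior mean of 1/2, and after a single observed tail it gives 1/3 rather than the degenerate 0 that maximum likelihood produces.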
Bayesian Decision Theory
- We can use such techniques for choosing among models: which among several models best explains the data?
- Likelihood ratio:
  P(model_1 | data) / P(model_2 | data) = [ P(data | model_1) P(model_1) ] / [ P(data | model_2) P(model_2) ]

...back to our example
- French vs. English: p(French | "glacier, melange, ...") versus p(English | "glacier, melange, ...").
- We have real data for Jane Austen, William Shakespeare, ...: "O! Stars! ... thou ..." gives p(Shakespeare | "stars, thou, ...").

Statistical Spam Filtering
[Figure: a test message ("Are you free to meet with Dan Jurafsky today at 3pm? He wants to talk about computational methods for noun coreference.") is classified against training data of real email ("Speaking at awards ceremony...", "Coming home for dinner...", "Free for a research meeting at 6pm...", "Computational linguistics office hours...") and spam email ("Nigerian minister awards...", "Earn money at home today! FREE CASH! Just hours per day!").]

Document Classification by Machine Learning
[Figure: a test document ("Temporal reasoning for planning has long been studied formally. We discuss the semantics of several planning...") is assigned to one of several document categories (Machine Learning, Planning, ProgLang Semantics, Garbage Collection, Multimedia, GUI), each represented by short labeled training snippets.]

Work out the Naïve Bayes formulation interactively on the board.

Recipe for Solving an NLP Task Statistically
1. Data: notation, representation.
2. Problem: write down the problem in notation.
3. Model: make some assumptions; define a parametric model.
4. Inference: how to search through possible answers to find the best one.
5. Learning: how to estimate parameters.
6. Implementation: engineering considerations for an efficient implementation.

Engineering Components of a Naïve Bayes Document Classifier
- Split documents into training and testing sets.
- Cycle through all documents in each class.
- Tokenize the character stream into words.
- Count occurrences of each word in each class.
- Estimate P(w | c) by a ratio of counts (plus-one prior).
- For each test document, calculate P(c | d) for each class.
- Record the predicted and true class, and keep accuracy statistics.

A Probabilistic Approach to Classification: Naïve Bayes
- Pick the most probable class given the evidence:
  c* = \arg\max_c Pr(c | d)
  where c is a class (like Planning) and d is a document (like "language intelligence proof").
- Bayes rule, plus the naïve independence assumption over words:
  Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d) = [ Pr(c_j) \prod_{i=1}^{|d|} Pr(w_{d_i} | c_j) ] / [ \sum_k Pr(c_k) \prod_{i=1}^{|d|} Pr(w_{d_i} | c_k) ]
  where w_{d_i} is the i-th word in d (like "proof").

Parameter Estimation in Naïve Bayes
- Estimate of P(c_j):
  P(c_j) = [ 1 + Count(d \in c_j) ] / [ |C| + \sum_k Count(d \in c_k) ]
- Estimate of P(w_i | c_j):
  P(w_i | c_j) = [ 1 + \sum_{d_k \in c_j} Count(w_i, d_k) ] / [ |V| + \sum_{t=1}^{|V|} \sum_{d_k \in c_j} Count(w_t, d_k) ]
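Below is a minimal, self-contained sketch of the classifier the last three slides describe, using the add-one-smoothed estimates above. The tokenizer and the toy training documents are invented for illustration; none of it comes from the original notes.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude tokenizer: lowercase, keep alphabetic runs as words.
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()

def train(labeled_docs):
    """labeled_docs: list of (class_label, text) pairs.
    Returns (classes, vocab, log_prior, log_likelihood) with add-one smoothing."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)      # word_counts[c][w] = Count(w, d) summed over d in c
    vocab = set()
    for label, text in labeled_docs:
        class_doc_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    classes = sorted(class_doc_counts)
    n_docs = sum(class_doc_counts.values())
    # P(c_j) = (1 + Count(d in c_j)) / (|C| + sum_k Count(d in c_k))
    log_prior = {c: math.log((1 + class_doc_counts[c]) / (len(classes) + n_docs))
                 for c in classes}
    # P(w_i | c_j) = (1 + count of w_i in class c_j) / (|V| + total word count in c_j)
    log_likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
                             for w in vocab}
    return classes, vocab, log_prior, log_likelihood

def classify(text, model):
    """Return argmax_c [ log P(c) + sum_i log P(w_i | c) ] over words seen in training."""
    classes, vocab, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in tokenize(text):
            if w in vocab:                  # skip words never seen in any training class
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy data in the spirit of the spam-filtering slide.
training = [
    ("real", "Free for a research meeting at 6pm"),
    ("real", "Computational linguistics office hours"),
    ("spam", "Earn money at home today FREE CASH"),
    ("spam", "Free cash just hours per day"),
]
model = train(training)
print(classify("free cash at home today", model))   # expected output: spam
```

Scoring in log space rather than multiplying raw probabilities avoids numerical underflow when documents are long.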
Information Theory: What is Information?
- "The sun will come up tomorrow."
- "Condi Rice was shot and killed this morning."
- (The second, far less likely message carries much more information than the first.)

Efficient Encoding
- I have an 8-sided die. How many bits do I need to tell you what face I just rolled?
- My 8-sided die is unfair: P(1) = 1/2, P(2) = 1/8, P(3) = ... = P(8) = 1/16.

Entropy of a Random Variable
- The average length of the message needed to transmit the outcome of the random variable.
- First used in data compression and in analyzing transmission rates over a noisy channel.

Coding Interpretation of Entropy
- Given some distribution over events P(X), what is the average number of bits needed to encode a message (an event, a string, a sequence)?
- Entropy of P(X):
  H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
- Notation: H(X) = H(p_X) = H(P) = H_X(p) = H(p).
- What is the entropy of a fair coin? Of a fair 32-sided die?
- What is the entropy of an unfair coin that always comes up heads?
- What is the entropy of an unfair 6-sided die that always rolls 1 or 2?
- Upper and lower bounds? Prove the lower bound.

Entropy and Expectation
- Recall E[X] = \sum_{x \in X(\Omega)} x \, p(x).
- Then E[-\log_2 p(X)] = \sum_{x} p(x) (-\log_2 p(x)) = H(X).

Entropy of a coin
[Figure: H(p) plotted against the probability p of heads; H(p) is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 0.5.]

Entropy, intuitively
- High entropy means chaos, fuzziness, the opposite of order.
- The term comes from physics: entropy does not go down unless energy is used.
- A measure of uncertainty:
  - High entropy: a lot of uncertainty about the outcome (a uniform distribution over outcomes).
  - Low entropy: high certainty about the outcome.

Claude Shannon
- Claude Shannon (1916-2001), creator of information theory.
- Laid the foundation for implementing logic in digital circuits as part of his master's thesis (1939).
- "A Mathematical Theory of Communication" (1948).

Joint Entropy and Conditional Entropy
- Two random variables: X (over space \Omega) and Y (over \Psi).
- Joint entropy: no big deal, (X, Y) is considered a single event:
  H(X, Y) = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(x, y) \log_2 p(x, y)
- Conditional entropy:
  H(X | Y) = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(x, y) \log_2 p(x | y)
  (Recall that H(X) = E[-\log_2 p(x)]: a weighted average, and the weights are not conditional.)
- How much extra information you need to supply to transmit X, given that the other person already knows Y.

Conditional Entropy, another way
  H(Y | X) = \sum_x p(x) H(Y | X = x)
           = \sum_x p(x) [ -\sum_y p(y | x) \log_2 p(y | x) ]
           = -\sum_x \sum_y p(x) p(y | x) \log_2 p(y | x)
           = -\sum_{x, y} p(x, y) \log_2 p(y | x)

Chain Rule for Entropy
- Since, like a random variable, entropy is based on an expectation:
  H(X, Y) = H(Y | X) + H(X)
  H(X, Y) = H(X | Y) + H(Y)

Cross Entropy
- What happens when you use a code that is suboptimal for your event distribution?
- Example: I created my code to be efficient for a fair 8-sided die, but the die is unfair and always gives 1 or 2, uniformly.
- How many bits on average for the optimal code? How many bits on average for the suboptimal code?
- Cross entropy:
  H(p, q) = -\sum_{x \in X} p(x) \log_2 q(x)

KL Divergence
- What is the average number of bits wasted by encoding events from distribution p with a code built for distribution q?
  D(p || q) = H(p, q) - H(p) = -\sum_{x \in X} p(x) \log_2 q(x) + \sum_{x \in X} p(x) \log_2 p(x) = \sum_{x \in X} p(x) \log_2 [ p(x) / q(x) ]
- A sort of "distance" between distributions p and q, but:
  - it is not symmetric;
  - it does not satisfy the triangle inequality.

Mutual Information
- Recall H(X): the average number of bits needed to tell you which event occurred from distribution P(X).
- Now suppose I first tell you the event y \in Y. H(X | Y) is the average number of bits then needed to tell you which event occurred from P(X).
- By how many bits does knowledge of Y lower the entropy of X?
  I(X; Y) = H(X) - H(X | Y) = H(X) + H(Y) - H(X, Y)
          = \sum_x p(x) \log_2 [ 1 / p(x) ] + \sum_y p(y) \log_2 [ 1 / p(y) ] - \sum_{x, y} p(x, y) \log_2 [ 1 / p(x, y) ]
          = \sum_{x, y} p(x, y) \log_2 [ p(x, y) / (p(x) p(y)) ]

Mutual Information, continued
- Symmetric and non-negative.
- A measure of independence:
  - I(X; Y) = 0 when X and Y are independent;
  - I(X; Y) grows both with the degree of dependence and with the entropy of the variables.
- Sometimes also called information gain.
- Used often in NLP: clustering words, word sense disambiguation, feature selection.

Pointwise Mutual Information
- Previously we measured mutual information between two random variables.
- We can also measure mutual information between two events:
  I(x, y) = \log_2 [ p(x, y) / (p(x) p(y)) ]
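To tie the definitions together, here is a small sketch (not from the original notes) that computes entropy, cross entropy, KL divergence, and mutual information for the examples above; the 2x2 joint distribution at the end is made up for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), skipping zero-probability events."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x): average code length when coding p with a code built for q."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p) = sum p(x) log2 [p(x) / q(x)]."""
    return cross_entropy(p, q) - entropy(p)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([1 / 32] * 32))       # fair 32-sided die: 5.0 bits
print(entropy([1.0, 0.0]))          # coin that always comes up heads: 0.0 bits

# Cross-entropy example from the notes: the code is built for a fair 8-sided die (q),
# but the die actually gives 1 or 2, each half the time (p).
p = [0.5, 0.5, 0, 0, 0, 0, 0, 0]
q = [1 / 8] * 8
print(entropy(p))                   # optimal code: 1.0 bit on average
print(cross_entropy(p, q))          # suboptimal code: 3.0 bits on average
print(kl_divergence(p, q))          # 2.0 bits wasted per event

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) log2 [p(x,y) / (p(x) p(y))] for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# A made-up 2x2 joint distribution in which X and Y are correlated.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))    # about 0.278 bits
```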