Class Note for CMPSCI 591 at UMass(1)
Class Note for CMPSCI 591 at UMass(1)
Popular in Course
Popular in Department
This 57 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 59 views.
Reviews for Class Note for CMPSCI 591 at UMass(1)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 02/06/15
Andrew McCallurn UNIass Amherst including material from Chris Manning and Jason Eisner Regular Languages Lecture 2 Computational Linguistics CMPSCI 591N Spring 2006 University of Massachusetts Amherst Andrew McCaIIum A Little About Yourselves Have you programmed before Almost none at all Not much work for a software company Fortran C C C Lisp Perl Python Java Only Basic on my Tandy 286 mmmmmmmmmmmmmmmmmmmm st A Little About Yourselves Hobbies Fencing Hiking Singing Cooking Poker Working on machines like cars motorcycles airplanes Drinking Smoking Fencing Watching movies especially awesomely bad ones mmmmmmmmmmmmmmmmmmmm st A Little About Yourselves Favorite authors Kurt Vonnegut George Orwell Noam Chomsky Asimov Tolkein Pinger I avoid reading sorry Tolkein x6 CS Lewis etc Stroustrup Arthur C Clark Hemmingway x2 Salman Rushdie Obscure foreign names like Savyon Librecht Karel Capek Milan Kundera Bulgahov mmmmmmmmmmmmmmmmmmmm st A Little About Yourselves Why are you in the class Practical skills to help in my linguistic research accessing data building grammars Interested in how probabilistic methods can be integrated with algebraic grammars Possibilities of a computer that can make sense of language are very exciting I want to expand my knowledge of Al I want to focus my career in CL especially translation Want to simulate the mind s big bang I think this will help me get a job mmmmmmmmmmmmmmmmmmmm st Today s Main Points Examples of computation helping in Linguistic goals What are regular languages finite state automata and regular expressions Writing regular expressions in Python Examples on several large natural language corpora Finitestate transducers and morphology Homework assignment 1 mmmmmmmmmmmmmmmmmmmm st Some brief history 1950s Early CL on machines less powerful than pocket calculators Foundational work on automata formal languages probabilities and information theory First speech systems Davis et al Bell Labs MT heavily funded by military but basicallyjust word substitution programs Little understanding of natural language syntax semantics pragmatics mmmmmmmmmmmmmmmmmmmm st Some brief history 1960s Alvey report 1966 ends funding for MT in America the lack of real results realized ELIZA MIT Fraudulent NLP in a simple pattern matcher psycholtherapist It s true I am unhappy Do you think coming here will make you not to be unhappy I need some help that much is certain What would it mean to you if you got some help Perhaps I could earn to get along with my mother Tell me more about your family Early corpora Brown Corpus Kudera and Francis Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Some brief history 1970s Winograd s SHRDLU 1971 existence proof of NLP in tangled LISP code Could interpret questions statements commands Andrew McCallurn UMass Amherst Which cube is sitting on the table The large green one which supports the red pyramid Is there a large block behind the pyramid Yes three of them A large red one a large green cube and the blue one Put a small one onto the green cube with supports a pyramid OK including material from Chris Manning and Jason Eisner Some brief history 1980s Procedural gt Declarative including logic programming Separation of processing parser from description of linguistic knowledge Representations of meaning procedural semantics SHRDLU semantic nets Schank logic perceived as answer finally applicable to real languages Montague Perceived need for KR Lenat and Cyc Working MT in limited domains METEO mmmmmmmmmmmmmmmmmmmm st Some brief history 1990s Resurgence of finitestate methods for NLP in practice they are incredibly effective Speech recognition becomes widely usable Large amounts of digital text become widely available and reorient the field The Web Resurgence of probabilistic statistical methods led by a few centers especially IBM speech parsing Candide MT system often replacing logic for reasoning Recognition of ambiguity as key problem Emphasis on machine learning methods mmmmmmmmmmmmmmmmmmmm st Some brief history 2000s Abit early to tell But maybe Continued surge in probability Bayesian methods of evidence combination and joint inference Emphasis on meaning and knowledge representation Emphasis on discourse and dialog Strong integration of techniques and levels brining together statistical NLP and sophisticated linguistic representations Increased emphasis on unsupervised learning mmmmmmmmmmmmmmmmmmmm st Examples of Computation Helping Linguistics Kevin Knight A Computational Approach to Deciphering Unknown Scripts Mayan Writing Pronunciation model by Expectation Maximization which we will study in about 5 wee ks Figure l The Phaistos Disk c 17DOBC The disk is six inches wide doublesided and is the earliest known document printed with a form of movable type mmm mmst mhmgmm m an mmismm Examples of Computation Helping Linguistics Other examples coming later Learning Lexical Semantics Augmenting WordNet by mining the Web Automatically discovering English versus Japanese word order by grammar induction Neural Network learners go through the same periods mistakes on irregular verbs as children do and others mmmmmmmmmmmmmmmmmmmm st Noun phrase parsing Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Ed Hovy s thing Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Noam Chomsky 1928 Chomsky Hierarchy Generative Grammar LiberatarianSocialist The most cited person alive mmmmmmmmmmmmmmmmmmmm st A Language Some sentences in the language The man tOOk the bOOk From Chomsky1956 hisfirst contextfree parsetree The purple giraffe hopped through the clouds This sentence is false Some sentences not in the language The girl the sidewalk the chalk drew Backwards is sentence this loDvaD tlhlngan Hol ghojmoH be mmmmmmmmmmmmmmmmmmmm st Compact description of a language Start with some nonterminal symbol S Expand that symbol using some substitution rules keep applying rules until all nonterminals are expanded to terminals The string of terminals is in the sentence mmmmmmmmmmmmmmmmmmmm st Chomsky Hierarchy L t IngUIS lC Type 0 languages Turingequivalent example Rewrite rules a a b ATNS where a b are any string of terminals and nonterminals Contextsensitive languages Rewrite rules aXb a acb TAGs where X is nonterminal and ab as above Contextfree languages 43 Rewrite rules X a a PSGs 5 where X a b as above be 395 Regular languages 3 Rewrite rules X a aY FSAs where X Y are nonterminals and a is a string of terminals Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Regular language example Nonterminals S X Y z An expansion Terminals m o S Rules mX S gt mX moY i 3quot mooY Y gt mooo Start symbol mmmmmmmmmmmmmmmmmm st Example Sheep Language Strinqs in and out of the example Reqular Lanquaqe Inthelanguage baF baaF baaaaaV Notmthelanguage ba bF abF bbaaaF aHbabaV Finitestate Automata indicates accept state a 9 I r doublecircle Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Recognizer A recognizer for a language is a program that takes as input a string W and answers yes if W is a sentence in the language and answers no otherwise We can think of this as a machine that emits only two possible responses it input mmmmmmmmmmmmmmmmmmmm st Regular Languages related concepts Regular Languages the accepted strings Finitestate Automata Regular Expressions machinery for accepting a way to type the automata Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Finite State Automata more formally A finite state automata is a 5tuple Q 2 qo F 6qi Q finite set of N states q0 q1 q2 qN nonterminals 2 finite set of terminals 6qi transition function given state and input returns next state production rules qo the start state F the set of final states The FSA State marker Q Input tape b a a a a We will later return to a probabilistic version of this Annewnccnnmnnasnmm with Hidden Markov Models including material from Chris Manning and Jason Eisner Transition Table 5 mmmmmmmmmmmmmmmmmmmm st Input State b a l O 1 Q Q 1 Q 2 Q 2 Q 2 3 3 Q Q Q Regular Expressions The foundational operations Pattern Matches Concatenation abc abc Disjunction a b a b a bbd ad bbd Kleene star a a a aa aaa cabbCa Cbba The empty string Regular expressions Finitestate automata are Closed under these operations Andrew McCallum UMass Amherst ding material from Chris Manning and Jason Eisner Stephen Kleene 1909 1994 Attended Amherst College Best known for founding the branch of mathematical logic known as recursion theory together with Alonzo Church Kurt Godel Alan Turing and others and for inventing regular expressions KIeeneiness is next to Godeliness mmmmmmmmmmmmmmmmmmmm st Practical Applications of RegEx s Web search Word processing find substitute Validate fields in a database dates email addr URLs Searching corpus for linguistic patterns and gathering stats Finite state machines extensively used for acoustic modeling in speech recognition information extraction eg people amp company names morphology Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Two types of characters in REs Literal Every normal text character is an RE and denotes itself Metacharacters Special characters that allow you to combine REs in various ways Example a denotes a a denotes a or a or aa or aaa or mmmmmmmmmmmmmmmmmmmm st Basic Regular Expressions Pattern Matches Character Concat went went Alternatives go went go went aeiou a O u disjunc negation Aaeiou b c d f g wildcard char a z amp Loops amp skips a e a aa aaa one or more a a aa aaa zero or one colour color colour Andrew McCanurn UMass Amherst ding material from Chris Manning and Jason Ersner More Fancy Regular Expressions Special characters t tab v vertical tab n newline r carriage return Aliases shorthand d digits 09 D nondigits quot09 w alphabetic azAZ W nonalphabetic quotazAZ s whitespace tnrfv w alphabetic azAZ Examples d dollars 3 dollars 50 dollars 982 dollars woow food boo oodles Escape character is the general escape character eg is not a wildcard but matches a period if you want to use in a string it has to be escaped Andrew McCallum UMass Amherst including material from Chris Manning and Jason Ersner Yet More Fancy Regular Expressions Anchors AKA zero width characters They match positions in the text A beginning of line s end of line b word boundary ie location with w on one side but not on the other B Examples btheb thetogether Counters 1 12 3 mmmmmmmmmmmmmmmmmmmm st Even More Fancy Regular Expressions Grouping a goodbad movie He said it again and again Parens also indicate Registers saved contents bwb1 matches boohoo and baha but not boohaa The digit after the indicates which of multiple paren groups as ordered by when then were opened Grouping without the cost of register saving He went thisthat way mmmmmmmmmmmmmmmmmmmm st Extra Fancy Regular Expressions Nonconsuming tests Positive lookAHEAD 3921 Negative lookAHEAD lt Positive lookBEHIND lt1 Negative lookBEHIND Example ltHhe wed w1y mmmmmmmmmmmmmmmmmmmm st Yowza Regular Expressions Multi Iine with comments etc FROM recompilequotquotquot AFrom Anchor to start of line s maybe some spaces Pltusergtw 39user group of word characters Pltdomaingt the domain S some nonspace characters s finally a space character quotquotquotreVERBOSE Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Oral Quiz Describe the strings these will accept aba ab Aaeiouaeiou1 bdiswb Handson With little strings reshow function from NLTK import re def reshowregexp string print recompileregexp reMsubquotglt0gtquot stringrstrip Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Handson Investigating corpora regexspy regexcountpy Corpora 200k words of Wall Street Journal articles 3 books by Jane Austin Several plays by William Shakespeare Several books by Mark Twain and various other text from Walt Whitman amp others I will make these available on the course Web site mmmmmmmmmmmmmmmmmmmm st ELIZA Weizenbaum 1966 User Men are all alike ELIZA IN WHAT WAY User They re always bugging us about something or other ELIZA CAN YOU THINK OF A SPECIFIC EXAMPLE User Well my boyfriend made me come here ELIZA YOUR BOYFRIEND MADE YOU COME HERE User He says I m depressed much of the time ELIZA I M SORRY TO HEAR THAT YOU ARE DEPRESSED Implemented with reqular expression substitution s YOU ARE depressedsad I AM SORRY TO HEAR THAT YOU ARE 1 s always CAN YOU THINK OF A SPECIFIC EXAMPLE Andrew McCanuni UMass Amherst including material from Chris Manning and Jason Eisner Nondeterministic FSAs a More than one outgoing transition on a Input State b a l O 1 Q Q 1 Q 12 Q REESEr i t g itter 2 Q Q 3 3 Q Q Q Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Nondeterministic finitestate automata as Recognizers The problem When processing a string we might follow the wrong transition and reject the string when we should have accepted it One solution turn the NFA into a DFA See CMPSCI 250 Ubiquitous problem in this course How to efficiently search through various possible paths parses to find HOW do one that works the most likely one etc humans do this mmmmmmmmmmmmmmmmmmmm st Solutions Lookahead Peek ahead to help decide which path to take Parallelism At each choice take every path in parallel Backup At each choice point mark the input state If we fail go back and try another path Need a stack or queue of markers Marker Machine state Collection of current state amp markers Search state Depthfirst search or Breadthfirst search Smart heuristic search A See CMPSCI 383 may MMMM l mmmmmmmmmm Artificial Intelligence RE FSA equivalence proof How would you do it 3 material from Chris Manning and Jason Eisner Morphology The study of the subword units of meaning not to attach Making a word plural Examples If word is regular add 3 dog dogs If word ends in y change y to i and add 3 baby babies If word ends in x add es fox foxes Recognizing that foxes breaks down into morphemes fox and es called Morphological Parsing Parsing taking an input and producing some sort of structure for it Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Morphology briefly morpheme minimal meaningbearing unit stem main morpheme of a word eg fox af xes add additional meanings eg es includes pre xes suf xes infixes circum xes eg un Iy concatenative morphology nonconcatenative inflection stemmorpheme in the same class as stem eg nouns plural s possessive s derivation stemmorpheme in different class eg Iy makes and adverb from an adjective Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Morphological Parsing with Finite State Transducers We want a system that given foxes will output a parse foxes or fox PL FSAs will take input but not produce output other than accept reject Solution Finite State Transducers FST A FST is a twotape automaton that recognizes or generates pairs of strings mmmmmmmmmmmmmmmmmmmm st Example Finitestate Transducer FSTs can be used to transform a word sun ace form into morphemes or viceversa An entire lexicon can be encoded as a FST mmm LMAssAm mmmm an mmhmm FST transition table Input Q1 G Q1 State hh aza pp yy iy s eze rr ss tt mmm LMAssAm mmmm an mmhmm Fragment of a lexicon in a FST Further Closure Properties of FSAs Regular languages are also closed under the following operations Reversal If L1 is regular so is the language consisting of the set of all reversals of strings in L1 Intersection if L1 and L2 are regular languages so is the language consisting of all strings that are in both L1 and L2 Difference If L1 and L2 are regular languages so is the language consisting of all strings in L1 that are not in L2 Complementation If L1 is a regular language so is the set of all possible strings that are not in L1 mmmmmmmmmmmmmmmmmmmm st Announcement Undergraduate CMPSSCI Meeting First Friday Curriculum Information Spring Events JobsCo opsResearch positions in and out of the Department Library Carrels And More Friday February 3 2005 330 500 PM CMPS 150151 Computer Science Building Refreshments will be served mmmmmmmmmmmmmmmmmm st Next class Tuesday Feb 7 Learning Python Variables operators conditionals iteration etc functions classes modules Gather statistics from Pythonized Penn Treebank Calculate statistics from 200k words of WSJ Implement a phrase structure grammar and generate sentences from it Install Python and bring your laptop with you mmmmmmmmmmmmmmmmmmmm st First Homework assigned today Essentially Write some regular expressions Run them on some corpora Write 1 page about your experience and findings Extra credit for creativity and interesting application Feel free to come do it in office hours Due next Thursday one week from today Don t wait until Wednesday to install Python Recommended schedule Idea by Saturday Codedtested by Monday Writeup by Wednesday Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner Office Hours CS Building Rm 264 Friday 2 4pm Monday 1030am1pm Tuesday 1030am1pm Wednesday 1030am1pm Thursday 1030am1230pm If you can t make these times let me know mmmmmmmmmmmmmmmmmmmm st Aside Grammar Induction Also called Grammatical Inference Learning finitestate automata from many examples of strings in and out of the language httpwwwinfouclacbeDdupontodupontqramhtml Learning FSA and CFG structure from data mmmmmmmmmmmmmmmmmmmm st Thank you Andrew McCallum UMass Amherst including material from Chris Manning and Jason Eisner