Programming Languages and Compilers
Programming Languages and Compilers COMPSCI 164
Popular in Course
Major Konopelski DVM
verified elite notetaker
Popular in ComputerScienence
This 39 page Class Notes was uploaded by Mr. Hayley Barton on Thursday October 22, 2015. The Class Notes belongs to COMPSCI 164 at University of California - Berkeley taught by Staff in Fall. Since its upload, it has received 43 views. For similar materials see /class/226666/compsci-164-university-of-california-berkeley in ComputerScienence at University of California - Berkeley.
Reviews for Programming Languages and Compilers
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/22/15
Lexical Analysis Lec rur39e 5 Prof F ateman CS 164 Lecture 5 Outline Informal ske rch of lexical analysis Identifies Tokens in inpu r s rring Issues in lexical analysis Lookahead Ambigui ries Specifying lexers Regular expressions Examples of regular expressions Prof Fateman CS 164 Lecture 5 Lexical Analysis Wha r do we wan r To do Example if ij ZO else Z1 The inpu r is jus r a string of characters tab i f i j newline tab 2 O newline e l s e newline tab tab 2 1 Goal Par ri rion inpu r s rring in ro subs rrings where The subs rrings ar39e Tokens or lexemes Prof Fateman CS 164 Lecture 5 3 What39s a Token A syntactic category In English noun verb adjective In a programming language Identifier Integer SingleFloat DoubleFloat operator perhaps single or multiple character Comment Keyword Whitespace string constant Prof Fateman CS 164 Lecture 5 4 Defining Tokens more precisely Token categories correspond ro se rs of s rrings some of Them fini re se rs like keywords bu r some like iden rifiers unbounded Iden rifier Typically sfrings of leffers ar a igifs sfarfing wifv a e ffer Same languages resfricf lengfv of idenfifier Lisp a aes naf require firsf e er In reger a nanempfy sfring of a igifs Bounded Keyword else or if or begin ar Whi respace a nanempfy sequence of blanks newlines ana fabs Prof Fateman CS 164 Lecture 5 5 How are Tokens cr39eaTed This is The job of The Lexer39 The fir39sT pass over39 The source code The Lexer39 classifies program subsTr39ings accor39ding To r39ole We also Tack on To each Token The locaTion line and column of where iT ends so we can r39epor39T errors in The source code by locaTion Personally I hafe f1s if assumes quotbafch processing is normal and makes quotin ferac ve processing and debugging messer We read Tokens unTiI we geT To an endof file Prof Fateman CS 164 Lecture 5 6 How are Tokens used OquuT of lexical analysis is a sTream of Tokens In Lisp we jusT make a isT Token1 Token2 Token3 In an i nTeracTive sysTem we would jusT burp ouT Tokens as we find Their ends we don39T waiT for EOF This sTream is The i npuT To The nexT pass The parser parser lexer filenamequot in Lisp Parser makes use of Token disTincTions An idenTifier FOO may be TreaTed differenle from a keyword ELSE Some language have no keywords Lisp Prof Fateman CS 164 Lecture 5 7 Designing a Lexical Analyzer Tep 1 Define a finiTe seT of Tokens Tokens describe all iTems of inTeresT Choice of Tokens depends on language design of parser RecaH tif i jnttz Ontelsenttz 1 Useful Tokens for This expression InTeger Keyword operaTor IdenTifier WhiTespace I I I Prof Fateman CS 164 Lecture 5 8 Designing a Lexical Analyzer S rep 2 Describe which s rr39ings belong To each Token caTegory RecaH Iden rifier39 5777th of leffers or dig7 5 sfar ng wfv a e ffer In reger39 a nonempfy ring of dig7 5 Keyword quotese or quotI39f or quotbegWar Whi respace a nonempfy sequence of blanks newMes and fabs Prof Fateman CS 164 Lecture 5 9 Lexical Analyzer Implementation An implemen ra rion reads charac rers un ril i r finds The end of a Token Longes r possible single Token IT Then re rurns a package of up To 3 i rems Wha r kind of Thing is if Identifier Number Keyword If H is an iden rifier which one exac rly Foo Bar Where did This Token appear in The source code linecol Whi respace and commen rs are skipped Prof Fateman CS 164 Lecture 5 10 Example Recall tif i jnttz Ontelsenttz 1 Possible TokenIexeme groupings IdenTifier iJz Keyword if else MuITicharacTer oper39a ror39 no r in MiniJava In reger39 O 1 single characTer opera ror39s of The same name Illegal Eg for39 MiniJava quot everyThing else Prof Fateman CS 164 Lecture 5 11 Writing a Lexical Analyzer Comments The lexer39 usually discards uni n rer39es ringquot Tokens rha r don39T con rr ibu re ro parsing bu r has To keep Track of linecolumn coun rs If a commen r can Run off The end if means rha r you mus r check for39 EOF when skipping commen rs roo Some ri mes commen rs are embedded in commen rs Wha r To do 34 foo 43 Java handles rhis wr39ong Some rimes compiler39 directives are inside commen rs so commen rs are really NOT ignor39ed se r op rimiza rion O Prof Fateman CS 164 Lecture 5 12 True Crimes of Lexical Analysis Is if as easy as if sounds No r qui re Look of some his ror39y Prof Fateman CS 164 Lecture 5 13 Lexical Analysis in FORTRAN 39 FORTRAN r39ule Whi respace is insignifican r Eg VAR1 is The same as VA R1 Foo rno re FORTRAN whi respace r39ule mo riva red by inaccuracy of punch and oper39a ror39s Prof Fateman CS 164 Lecture 5 14 Now we see if as a Terrible design Example Consider DO 5 I 125 DO 5 I 125 The first s rar rs a loop 25 Times r39unning un ril rhe s ra remen r labeled 5quot DO 5 I 1 25 The second is D05I 1 25 an assignmen r Reading Ief rTor39igh r canno r Tell if D05I is a variable II or39 DO s rm r un ril af rer39 is reached Prof Fateman CS 164 Lecture 5 15 A nicer design limits The amount of lookahead needed Even our39 simple example has lookahead issues ii VS if amp vs ampamp not in MJ Bu r This uses only one or39 Two characters and is cer rainly resolved by The Time you find whi re space or39 ano rher character eg XampY or39 XampampY In C consider ab Is i r a b Or39 a b or39 some other breakup Prof Fateman CS 164 Lecture 5 16 Review The goal of lexical analysis is To ParTiTion The i npuT sTring i nTo lexemes IdenTify The Token Type and perhaps The value of each lexeme probably Eg converTsTringToinTeger 12345quot LefTTorighT scan gt lookahead someTimes required Prof Fateman CS 164 Lecture 5 17 Next We need a way To describe rhe lexemes of each Token class so rha r we can wri re a program or builduse a program rha r au roma rically Takes The descrip rion and wri res rhe lexer For runa rely we can s rar r from a wells rudied area in rheore rical compu rer science Formal Languages and Au roma ra Theory Prof Fateman CS 164 Lecture 5 18 Languages Def LeT 2 be a seT of characTers A anauaae overZ is a seT of sTrings of characTers drawn froniz 2 is called The aphabef Pronounced SIGMA Why are we using Greek 50 as To separaTe The language from The meTa language Presumably our alphabeT does noT include 2 as a characTer Prof Fateman CS 164 Lecture 5 Examples of Languages Alphabe r English charac rers Language English sentences No r every s rring of English charac rers is an English sen rence Alphabe r ASCII Language C programs No re ASCII charac rer se r is differen r from English charac rer se r UNICODE is ano rher much larger se r used by Java Common Lisp Prof Fateman CS 164 Lecture 5 20 One program39s language can be another395 alphabet An alphabe r ABCDE Z can be used To form a word language CAT DOG EAT HAT IN THE The alphabe r of CAT DOG EAT can be used To form a differen r language eg THE CAT IN THE HAT Prof Fateman CS 164 Lecture 5 21 Another kind of alphabet in this case for MCI Numbers Keywords Iden rifier39s an infini re se r O rher39 Tokens like Then The MT language is defined over39 This pr39eTokenized alphabe r Prof Fateman CS 164 Lecture 5 22 Notation We repea r Languages are se rs of s rrings Subse rs of The se r of all s rrings over an alphabe r 2quot Need some Tools for s ecifying which subset we wan r ro designa re a par ricu ar language For English and an alphabe r of words from The dlc rlonary we can use a grammar I r almos r works For Java programs we can also use a grammar Bu r for now we wan r To specify rhe se r of rokens Tha r is map The language of sIn le charac rers in ro sen rences rha r are rokens To ens have a simple s rruc rure Prof Fateman CS 164 Lecture 5 23 Regular Languages There are several formalisms for specifying Tokens Regular languages are The mosT popular Simple and useful Theory Easy To undersTand EfficienT implemenTaTions are possible AlmosT powerful enough Can39T do nesTed commenTs Popular almosT auTomaTicquot Tools To wriTe programs The sTandard noTaTion for regular languages is regular expressions Prof Fateman CS 164 Lecture 5 24 Atomic Regular Expressions Single charoc rer39 deno res o se r of one s rr39ing iciquotcquot Epsilon chor39oc rer39 deno res o se r of one 0 leng rh s rr39ing quotH 6 Emp ry se r is Q nof fhe same as a Size 0 Sizes 1 Prof Fateman CS 164 Lecture 5 25 Compound Regular Expressions Union If A and B are RES Then ABSSEA orseB Conco reno rion of Se rs Conco reno rion of s rr39ings ABabaeA andbeB I rer39o rion Kleene closure 14 A Where Ai Az39 times A 120 Prof Fateman CS 164 Lecture 5 26 In particular notationally confusing concatenation with multiplication union with addition A A AA AAA we 050 some fI39Ines use AA AA AAA AA Prof Fateman CS 164 Lecture 5 27 Regular Expressions Def The reguar expressans over 2 are The smallest se r of expressions including Where c e 2 Where A B are reXp over 2 H H H Where A is a temp over 2 Prof Fateman CS 164 Lecture 5 28 Syntax vs Semantics This no ra rion means This se r as W Lcc39 W LA B LAU LB LAB ab a E LA and b E LB M Lima4quot Prof Fateman CS 164 Lecture 5 29 Example Keyword Keyword quote539e or quot39f 0r quotbeq397 or 39else39 39if 39begin39 NOT 3 39618639 deno res The same 58139 GS quot6quotquot1WS39H396quot Prof Fateman CS 164 Lecture 5 30 Sometimes we use convenient names for sefs Keywords 39else39 39if39 39Then39 39when39 Prof Fateman CS 164 Lecture 5 31 Example Integers Integer a nonempfy sfn ng of a lyfs 39039391393923939339394393953939639397393983939939 integer digit digit Integers are almost digit Though is 000 an in feger Prof Fateman CS 164 Lecture 5 32 Sometimes we use in mg exps For39 example ABC deno fes fhe sef ABAC Prof Fateman CS 164 Lecture 5 33 omeTimes we need a language wiTh in iT For39 example quotquot quotquot quotquot denoTes a seT wiTh 3 sTr39ings in iT using our39 meTasymbols Because our39 programming language will usuall be in The same alphabeT as our39 descr39ipTion 012 RES we need To Take some care ThaT our39 meTasymboIs can be disTinguished from our39 alphabeT by special mar39ks like quot OfTen auThor39s leave off These marks if They Think The reader will under39sTand If you leave such mar39ks off in programs you may be in Trouble Prof Fateman CS 164 Lecture 5 34 Example Identifier Identifier 5774th of leffers or dyfs sfar ng wfh a e ffer letter 39A3939B39 39Z3939a39 39z393939 identi er letterletter digitquotlt Is letter digitquotlt the same Is letter digit the same Is letter letter digitquotlt the same Prof Fateman CS 164 Lecture 5 35 Example Whitespace Whi respace a nonempfy sequence of blanks newInes and fabs v v vnv vtv Prof Fateman CS 164 Lecture 5 36 Example Phone Numbers Regular expressions are all around you Consider 5106422020 Z digits U exchange digit3 phone digit4 area digit3 phoner1umber 3939 area 3939 exchange 3939 phone Prof Fateman CS 164 Lecture 5 37 Example Email Addresses Consider roofbeerc5berkeey ea u 2 letters U name letter address name 3939 name 3939 name 3939 name Prof Fateman CS 164 Lecture 5 38 Summary Regular expressions describe many useful languages Regular languages are exac rly Those languages expressible as regular expressions We s rill need an implemen ra rion To Take a specification of rexp R and produce a func rion Tha r answers The ques rion Given a s rring s is S 6 MR 7 HEX Prof Fateman CS 164 Lecture 5 39