New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here

Compiler Design

by: Lindsay Bergstrom Sr.

Compiler Design CIS 631

Lindsay Bergstrom Sr.
GPA 3.76

Nancy McCracken

Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Nancy McCracken
Class Notes
25 ?




Popular in Course

Popular in Computer & Information Science

This 40 page Class Notes was uploaded by Lindsay Bergstrom Sr. on Wednesday October 21, 2015. The Class Notes belongs to CIS 631 at Syracuse University taught by Nancy McCracken in Fall. Since its upload, it has received 51 views. For similar materials see /class/225594/cis-631-syracuse-university in Computer & Information Science at Syracuse University.

Popular in Computer & Information Science


Reviews for Compiler Design


Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/21/15
Decaf Language De nition Anoop Sarkar 7 anoopcs sfu ca Sepuunber102007 1 Introduction The programming assignments over the semester will build various components towards a working compiler for a programming language called Decaf This document is the speci cation for the Decaf languagel This language de nition might evolve over the course of the semester Each version of the language de nition will be marked with the date Please refer to the latest version available Decaf is a strongly typed objectoriented language with support for inheritance and encapsulation The design of Decaf has many similarities with other programming languages that are familiar to us such as C C or Java Although keep in mind that Decaf is not an exact match to any of those languages and has its own peculiar proper ties The feature set has been trimmed down considerably from what is usually part of a full edged programming language This was done to keep your programming assignments manageable Despite these limitations the Decaf compiler will be able to handle interesting and nontrivial programs 2 Sample Decaf Code Here is a simple program written in the Decaf language class GreatestCommonDivisor int a 10 int b 20 void main int x y z x a b 2 gcdx y quotlt H printint is part of the standard input output library calloutquotprintintquot z function that computes the greatest common divisor int gcdint a int b if b 0 returna else return gcdb a b 1Like Java but with less calfeine 3 Notation foo means foo is a nonterminal symbol foo means foo is a terminal symbol ie a token recognized by the lexical analyzer indicates a terminaltoken that is either an operator like lt or a single char punctuation like x means zero or one occurrence of x ie x is optional nb do not confuse this notation with terminals and zero Or more Occurrences of X x x one or more occurrences of x curly braces are used for grouping items nb do not confuse this notation with terminals and x a commaseparated list of one or more x s egiorijk separates alternatives for lhs of a CFG rule or a group nb the context determines whether I separates CFG rules eg type void vs S gt 018 la charl charZ denotes a group of characters for token de nitions eg a c denotes the three characters a b c 4 Lexical Considerations All Decaf keywords are lowercase Keywords and identi ers are casesensitive For example if is a keyword but IF is an identi er Also foo and F00 are two distinct identi ers Comments are started by and are terminated by the end of the line 41 Token De nitions The keywords are bool break callout continue class else extends false for if int new null return rot true void While For now the keywords extends new and null will not appear elsewhere in the language de nition these keywords are reserved for future extension The operator and punctuation tokens are 5 5 E 1 1 5 5 5 5 5 5 5 ltlt gtgt lt gt 5 lt gt ampamp5 I I I The token is reserved for future extension Tokens such as stringConstant which de nes a string constant like quothello worldquot or single character tokens such as or do not appear in the list of keywords above but are valid tokens and are used in de ning the reference grammar of Decaf see Section 5 Identi ers denoted by the token id are de ned as starting with an alphabetic character or the underscore character a zA ZJ and followed by zero or more alphanumeric characters including the underscore character a zA Z 9 Keywords and identi ers must be separated by white space or a token that is neither a keyword or an identi er thiswhiletrue is a single identi er not three distinct keywords See Section 42 for some examples String constants denoted by the token stringConstant will have a lexeme value that is composed of characters enclosed in double quotes A string must start and end on a single line it cannot be split over multiple lines Note that the string constant can include escape sequences like n and this is distinct from a newline character inside the string constant For example quotnquot is legal but quot quot is not legal For more on strings and character tokens see Section 43 Integer constants in Decaf denoted by the token intConstant are either decimals base 10 or they are hexadecimals base 16 A hex integer constant must begin with 0x that s a zero not the letter 0 and followed by a sequence of hex digits which include with the decimal digits plus the letters a through f either upper or lowercase Examples of integer constants 8 012 0x0 QXIZaE Character constants denoted by the token charConstant will have a lexeme value that is a single character enclosed in single quotes A character constant is any printable ASCII character ASCII values between decimal value 32 and 126 or octal 40 and 176 but a character constant cannat be a single quote or backslash character For more on strings and character tokens see Section 43 Character constants also include the character sequences eg quotquot to denote a quote to denote a single quote to denote backslash All the escape sequences are listed in Section 44 You can choose to include special tokens for whitespace and comments You can assume some intermediate step before parsing occurs can strip out these special tokens This makes the lexical analyzer more consistent since a special de nition is used for both whitespace and comments and also makes the de nition of string constants eas1er 42 Token Boundaries and Whitespace The boundaries between tokens such as integer constants keywords and identi ers are explained using the fol lowing rules In effect these rules de ne an algorithm for breaking up a sequence of characters from the set 0 9a zA ZJ into tokens If the sequence begins with 0x then these rst two characters and the longest subsequence of characters immediately following them drawn from the set 0 9a fA F form a hex integer constant The last such character is the end of the token o If the sequence begins with a decimal digit but not 0x then the longest pre x of decimal digits forms a decimal integer constant The last such character is the end of the token Note that the semantics of range checking occurs later so that a long sequence of digits eg 123456789123456789 which is clearly out of range is still scanned as a single token The semantic analyzer will come in later and reject this lexeme value as a valid integer constant If the sequence begins with an alphabetic character or then this character and the longest sequence of alphanumeric characters 0 9a zA ZJ following this initial character forms a token which is either an identi er or a keyword 0 Whitespace and other token de nitions can play a role in identifying token boundaries eg the string rot3 is a single identi er token but rot 3 is two tokens one a keyword rot and the second an integer token 3 and rot3 can only be a sequence of 4 tokens the keyword rot the left parenthesis the integer token 3 and the right parenthesisz Whitespace is de ned as one or more of the characters t n V f r plus the ASCII space character Here are some more examples that explain these rules 0x123food INTCONST X123f IDENTIFIERood 0xfood123 INTCONST xf IDENTIFIERood123 2Note that this example is here only to explain how things work in lexical analysis The actual syntax for using rot is 8x881 rot 2 or 8x881 rot 2 for rightleft rotation respectively See Section 62 for more details on syntax of expressions 123break INTCONSTC123 KWBREAK 0x123r0t3 INTCONST X123 IDENTIFIERrOt3 0x123r0t 3 INTCONST X123 KWROT INT3 1250x356 INTCONSTUZSO IDENTIFIERX356 break123 IDENTIFIERbreak123 breakwhile IDENTIFIERbreakwhile 43 String and Character Constants String constants denoted by the token stringConstant will have a lexeme value that is composed of characters enclosed in double quotes A string must start and end on a single line it cannot be split over multiple lines Note that the string constant can include escape sequences like n and this is distinct from a newline character inside the string constant For example quotnquot is legal but quot quot is not legal Character constants denoted by the token charConstant will have a lexeme value that is a single character enclosed in single quotes A character constant is any printable ASCII character ASCII values between decimal value 32 and 126 or octal 40 and 176 but a character constant cannat be a single quote or backslash character Character constants also include the character sequences e g quotquot to denote a quote to denote a single quote to denote backslash All the escape sequences are listed in Section 44 When scanning single quoted character constants charConstant or double quoted string constants stringConstant you should ensure that o Character constants cannot contain more than one character except in character constants that are escaped see Section 44 eg an is not valid but n is valid o String constants can have escape sequences see Section 44 o String and character constants are only terminated by an unescaped quote matching the open quote In particular the following character constants should be treated as an error Character constants containing zero characters Character constants with invalid closing delimiter e g String constants with invalid closing delimiter eg quotquot The above rules indicate that the lexical analyzer should not simply truncate the token after a single character if no closing quote character is found o Unterminated string and character constants must be reported as errors o Invalid escape sequences or newlines embedded in string or character constants should be reported as an error Some examples quotxquot STRINGerror unknown escape sequence quotquotquot STRINGquotquot STRINGerror unterminated string n n STRINGerror newline in string constant ab CHARerror char constant length greater than one CHARerror unterminated char constant 44 Escape Sequences An escape sequence or escaped symbol means having a preceding backslash character to distinguish a special purpose character or to distinguish the character from a previously de ned delimiter The list of escape sequences are t V r n a f b Character constants also have de ning an escaped open or close character delimiter String constants also have de ning an escaped open or close string delimiter Invalid escape sequences should be reported as errors Some of the escape sequences have a special meaning t denotes a horizontal tab v denotes a vertical tab r denotes carriage return n denotes a newline a denotes an alert bell sent to the terminal f denotes a form feed from back when terminal output was printed on paper b denotes a backspace You need to handle these escaped characters in the lexical de nition but you will not need to perform any special handling to implement their special roles That will be done by the interaction of the standard input output library and the terminal application 5 Reference Grammar The following reference grammar de nes the structure of Decaf programs It uses the notation de ned in Section 3 This reference grammar is not a contextfree grammar although it can be easily converted into one program gt class classname elddecl methoddecl classname gt id elddecl a type id id intConstant 1 type id constant methoddecl a type void id typeid block block a vardecl statement vardecl a type id type a intl boo statement gt assign methodcall if c expr 3 block else block While expr block for assign expr assign block return expr break continue block assign methodcall methodname calloutarg lvalue ltexprgt ltbmopgt arith op relop eelOp cond op constant boolconstant a a a a a a a a a lvalue expr methodname C expr I 3 callout stringConstant calloutarg I I D id expr stringConstant id id expr lvalue methodcall constant expr binop expr ltexprgt expr eXPrgt 3 arithOPH rel0p eq0pgt comOp I I I I ltlt I gtgt I I lt I gt I lt I gt I 5 ampamp5 I I I 5 intConstantl charConstantl bool constant true false To help grasp some of the differences from other programming languages that you may be used to here are some fragments of invalid code in Decaf Check for yourself using the reference grammar above exactly why each of these examples are invalid class foo int 3 int b a Invalid int foo int 3 10 Invalid for a lt b Invalid Of course the reference grammar for Decaf could be changed to accept the examples above 1714th nut change the grammar in any way just because you feel it should accept these or other examples 6 Semantics A Decaf program consists of a single class declaration associated with an identi er The class declaration consists of eld declarations and method or function declarations Field declarations introduce variables that can be accessed globally by all methods in the program 61 Types There are two basic types in Decaf 7 int for integers and bool for booleans In addition there are arrays of integers int id n there are arrays of booleans bool id n where n is an intConstant integer Arrays are declared only in the global class declaration scope All arrays are onedimensional and have a compiletime xed size Arrays are indexed from 0 to n 1 where n gt 0 is the size of the array The usual bracket notation is used to index arrays Since arrays have a compiletime xed size and cannot be declared as method parameters or local variables there is no facility to query the length of an array variable in Decaf 62 Expressions Expressions follow the normal rules from other languages like C C or Java for evaluation Integer constants evaluate to their integer value Character constants evaluate to their integer ASCII values e g A evaluates to the integer value 65 consult man ascii for the full ASCII table Note that Decaf does not have an explicit character type instead we use the type int for characters An expression that refers to an array location e g x 10 evaluates to the value contained at that location Method invocation expressions are discussed in Section 63 Relational operators are used to compare integer expressions The equality operators and are de ned for int and bool types and can be used to compare any two expressions having the same type The result of a relational operator or equality operator has type bool 5 The boolean connectives ampamp and are interpreted using short circuit evaluation as in Java This means the sideelfects of the second operand are not executed if the result of the rst operand determines the value of the whole expression ie if the result is false for ampamp or true for Precedence Operators Explanation 1 unary minus 2 logical not 3 multiplication division 4 addition subtraction 5 96 modulus op 6 ltlt gtgt rot bit shift ops 7 lt lt gt gt relational ops 8 equality 9 ampamp conditional and 10 I conditional or The precedence level for each operator is shown in the table above All operators at the same precedence level get equal precedence All operators with equal precedence associate left Binary 96 computes the modulus of two numbers Given integer operands a and b If b is positive then a b is a minus the largest multiple of b that is not greater than a If b is negative then a b is a minus the smallest multiple of b that is not less than a ie the result will be less than or equal to zero Binary rot rotates the bits of the rst argument by the number of bits speci ed by the second argument If the second argument is a positive number then rotate left and if it is negative then rotate right by the absolute amount of the second argument Number constants in Decaf are either decimals or hexadecimals Decimal numbers in Decaf are 32bit signed integers between the values 2147483647 to 2147483647 However range checking for 8digit hex constants is based on unsigned 32bit integers even though hex values greater than 214748364710 are actually negative hex xffffffff is 1 The reason for not bothering with the sign for hex digits is that they are used as bit patterns without regard for numeric value 63 Method Calls and Callouts The program must contain a declaration for a method called main that has no parameters The return type of the method main has to be the type void Execution of a Decaf program starts at this method main Other methods de ned as part of the class declaration can have zero or more parameters and must have a return type explicitly de ned Callout functions using the callout keyword are used to invoke external library functions The most useful library functions that you will use are the printint and readint functions The printint callout function is invoked as calloutquotprintintquot z where z is an integer variable that has a value The readint callout function is invoked as z calloutquotreadintquot where integer variable 2 receives the result of calling the readint library function 7 Brief History of Decaf Decaf has been used as part of Compilers courses in several universities including Stanford University MIT Uni versity of Delaware Southern Adventist University University of Tennessee among others The precise genesis of Decaf is not entirely clear Some believe it was a revision of the SOOP language developed by Maggie John son and Steve Freund at Stanford Others believe it was a simpli cation of a language called Espresso used at MIT Still others claim that Decaf was invented at the University of Tennessee In any case Decaf is a useful language for introductory compiler courses Our version of Decaf as described in this document is distinct and quite different apart from its general structure from all the other versions of Decaf Recursive Descent Parsing a TopDown Parsing Method Without Backtracking Lexical Analyzer has translated the source program into a sequence of tokens The Parser must translate the sequence of tokens into an intermediate representation Assume that the interface is that the parser can call getNextToken to get the next token from the lexical analyzer And the parser can call a function called emit that will put out intermediate representations currently unspeci ed The parser outputs error messages if the syntax of the source program is wrong Recursive Descent Strategy 0 For every BNF rule production of the form ltphraselgt 9 E the parser defines a function to parse phrasel Whose body is to parse the rule E void ParsePhrasel parse the rule E 0 Where E consists of a sequence of non terminal and terminal symbols 0 Requires no left recursion in the grammar Parsing a rule 0 A sequence of non terminal and terminal symbols Y1 Y2 Y3 Yn is recognized by parsing each symbol in turn 0 For each non terminal symbol Y call the corresponding parse function ParseY 0 For each terminal symbol y call a function expecty that Will check if y is the next symbol in the source program The terminal symbols are the token types from the lexical analyzer Suppose that a variable currentsymbol always contains the next token then expecty does if currentsymbol y then getNextToken else SyntaxError Simple parse function example 0 Suppose that there was a grammar rule ltpr0gramgt 9 class ltClassnamegt ltfie1d declgt ltmeth0d declgt 0 Then the body of ParseProgram would be expect class ParseClassname eX1960 ParseFieldDecl ParseMethodDecl eXIDCCK Lookahead 0 In general one non terminal may have more than one production so more than one function should be written to parse that non terminal 0 Instead we insist that we can decide which rule to parse just by looking ahead one symbol in the input ltsentencegt 9 if lteXprgt ltblockgt I while lteXprgt ltblockgt Then ParseSentence can have the form if currentsymbol 2 if parse first rule elsif currentsymbol 2 while parse second rule First and Follow Sets 0 FirstE Where E is the right hand side of a rule is the set of terminal symbols that may appear at the beginning of a sentence derived from E And may also include 8 if E can generate an empty string 0 FollowltNgt Where ltNgt is a non terminal symbol in the grammar is the set of terminal symbols that can follow immediately after any sentence derived from any rule of N Recognizing possibly empty sentences 0 In a strict context free grammar there may be rules in which the rhs is 8 the empty string 0 Or in an extended BNF grammar there may be the specification that some part of the rhs of the rule occurs 0 or 1 times ltphrase1gt 9gt ltphrase2gt 0 Then we recognize the possibly empty occurrence of phrase2 by if currentsymbol is in Firstltphrase2gt then ParsePhrase2 Recognizing sequences 0 In a context free grammar you often have rules that specify any number of a phrase can occur ltarglistgt 9 ltarggt ltarglistgt e 0 In extended BNF we replace this With the to indicate 0 or more occurrences ltarggt 0 We can recognize these sequences by using iteration If there is a rule of the form ltphrase1gt 9gt ltphrase2gtgt lt we can recognize the phrase2 occurrences by While currentsymbol is in Firstltphrase2gt do ParsePhrase2 Grammar Restrictions 0 In either of the previous cases Where the grammar symbol may generate sentences which are empty the grammar must be restricted suppose that ltphrase2gt is the symbol that can occur 0 times require that the sets Firstltphrase2gt and Followltphrase2 be disjoint Multiple rules 0 Suppose that there is a nonterminal symbol with multiple rules where each rhs is nonempty ltphraselgt9E1E2E3 En then we can write ParsePhrasel as follows if currentsymbol is in First E1 then ParseE1 elsif currentsymbol is in First E2 then ParseE2 elsif currentsymbol is in First En then ParseEn else Syntax Error 0 If any rhs can be empty then don t give the syntax error 0 Another grammar restriction The sets First E1 First EI1 must be disjoint Example expression grammar 0 Suppose that we have a grammar lteXprgt 9 lttermgt ltopgt lttermgt gt lt lttermgt 9 Const lteXprgt ltopgt 9 0 Parsing functions void ParseTerm if cursym const then getNextToken else if cursym void ParseEXpr ParseTerm while cursym in Firstltopgt getNextToken that gBtNBXtTOkBH ParseTerm PaISBEXPE expect First Sets 0 Here we give a more formal and more detailed definition of a First set starting with any non terminal If we have a set of rules for a nonterminal ltphraselgt ltphraselgt 9 E1E2E3 En then Firstltphraselgt FirstE1 FirstEn set union For any right hand side Y1 Y2 Y3 Yn we make cases on the form of the rule FirstaY2 Y3 Yn a for any terminal symbol a FirstN Y2 Y3 Yn FirstN for any nonterminal N that does not generate the empty string FirstNM FirstN FirstM FirstNM FirstN FirstM First 8 8 0 or 1 occurrence of N 0 or more occurrences Follow Sets 0 To define the set FollowT examine the cases of Where the non terminal T may appear on the rhs of a rule in the grammar N sTU or N98TU or N98TU 0 If U never generates an empty string then F0110wT includes FirstU 0 If U can generate an empty string then F0110wT includes FirstU and F0110wN N sT or N ST or N STgt lt F0110wT includes F0110wN The Follow set of the start symbol should contain EOT the end of text marker 0 Include the Follow set of all occurrences of T from the rhs of rules to make the set FollowT Simple Error Recovery To enable the parser to keep parsing after a syntax error the parser should be able to skip symbols until it finds a synchronizing symbol E g in parsing a sequence of declarations or statements skipping to a should enable the parser to start parsing the next declaration or statement General Error Recovery 0 A more general technique allows the syntax error routine to be given a list of symbols that it should skip to void syntaXErrorString msg Symbols StopSymbols give error With msg While currentsymbol in StopSymbols getNextSymbol assuming that there is a type called Symbols of terminal symbols we may want to pass an error code instead of a message 0 Each recursive descent procedure should also take StopSymbols as a parameter and may modify these to pass to any procedure that it calls E g if there is a procedure to parse the parameter list of a method call then it can have as a stop symbol Stop Symbols 0 If the parser is trying to parse the rhs E of a non terminal N 9 E then the stop symbols are those symbols which the parser is prepared to recognize after a sentence generated by E Remove anything ambiguous from FollowN Reference Per Brinch Hansen 0 The stop symbols should always also contain the end of text symbol EOT so that the syntax error routine never tries to skip over symbols past the end of the program C8143 Handout 07 Autumn 2007 October 3 2007 TopDown Parsmg Handout written by Maggie Johnson and revised by Julie Zelenski Possible Approaches The syntax analysis phase of a compiler verifies that the sequence of tokens extracted by the scanner represents a valid sentence in the granunar of the programming language There are two major parsing approaches topdown and bottomup In topdown parsing you start with the start symbol and apply the productions until you arrive at the desired string In bottomup parsing you start with the string and reduce it to the start symbol by applying the productions backwards As an example let s trace through the two approaches on this simple granunar that recognizes strings consisting of any number of a s followed by at least one and possibly more b s S gt AB A gt aA I e B gt b I bB Here is a topdown parse of aaab We begin with the start symbol and at each step expand one of the remaining nonterminals by replacing it with the right side of one of its productions We repeat until only terminals remain The topdown parse produces a leftmost derivation of the sentence S AB S gt AB aAB A gt aA aaAB A gt aA aaaAB A gt aA aaasB A gt s aaab B gt b A bottomup parse works in reverse We begin with the sentence of terminals and each step applies a production in reverse replacing a substring that matches the right side with the nonterminal on the left We continue until we have substituted our way back to the start symbol If you read from the bottom to top the bottomup parse prints out a rightmost derivation of the sentence aaab aaasb insert a aaaAb A gt e aaAb A gt aA aAb A gt aA Ab A gtaA AB B gtb S S gtAB In creating a parser for a compiler we normally have to place some restrictions on how we process the input In the above example it was easy for us to see which productions were appropriate because we could see the entire string aaab In a compiler s parser however we don t have longdistance vision We are usually limited to just onesymbol of lookaheud The lookahead symbol is the next symbol coming up in the input This restriction certainly makes the parsing more challenging Using the saIne graInmar from above if the parser sees only a single b in the input and it cannot lookahead any further than the symbol we are on it can t know whether to use the production B gt b or B gt bB Backtracking One solution to parsing would be to implement backtracking Based on the information the parser currently has about the input a decision is made to go with one particular production If this choice leads to a dead end the parser would have to backtrack to that decision point moving backwards through the input and start again making a different choice and so on until it either found the production that was the appropriate one or ran out of choices For exaInple consider this simple graInmar S gt babIbA A gt dch Let s follow parsing the input bcd In the trace below the column on the left will be the expansion thus far the middle is the remaining input and the right is the action attempted at each step Try S gt bab match b deadend backtrack Try S gt bA match b Try A gt d deadend backtrack Try A gt cA match c Try A gt d match d Success S bab ab Qaaaaaggagg QgtggtQgtgm As you can see each time we hit a deadend we backup to the last decision point unmake that decision and try another alternative If all alternatives have been exhausted we back up to the preceding decision point and so on This continues until we either find a working parse or have exhaustively tried all combinations without success A number of authors have described backtracking parsers the appeal is that they can be used for a variety of grammars without requiring them to fit any specific form For a 3 small grammar such as above a backtracking approach may be tractable but most prograInming language grammars have dozens of nonterminals each with several options and the resulting combinatorial explosion makes a this approach very slow and impractical We will instead look at ways to parse via efficient methods that have restrictions about the form of the grammar but usually those requirements are not so onerous that we cannot rearrange a programming language gramlnar to meet them TopDown Predictive Parsing First we will focus in on topdown parsing We will look at two different ways to implement a nonbacktracking topdown parser called a predictive parser A predictive parser is characterized by its ability to choose the production to apply solely on the basis of the next input symbol and the current nonterminal being processed To enable this the graInmar must take a particular form We call such a grammar LL1 The first quotLquot means we scan the input from left to right the second quotLquot means we create a leftmost derivation and the 1 means one input symbol of lookahead Informally an LL1 has no leftrecursive productions and has been leftfactored Note that these are necessary conditions for LL1 but not sufficient ie there exist grammars with no leftrecursion or common prefixes that are not LL1 Note also that there exist many grammars that cannot be modified to become LL1 In such cases another parsing technique must be employed or special rules must be embedded into the predictive parser Recursive Descent The first technique for implementing a predictive parser is called recursivedescent A recursivedescent parser consists of several small functions one for each nonterminal in the grammar As we parse a sentence we call the functions that correspond to the left side nonterminal of the productions we are applying If these productions are recursive we end up calling the functions recursively Let s start by examining some productions from a grammar for a simple Pascallike programming language In this prograInming language all functions are preceded by the reserved word FUNC program gt functionist functionist gt functionist function function function gt FUNC identifier parameterlist statements What might the C function that is responsible for parsing a function definition look like It expects to first find the token FUNC then it expects an identifier the naIne of the function followed by an opening parenthesis and so on As it pulls each token from the parser it must ensure that it matches the expected and if not will halt with an error For each nonterminal this function calls the associated function to handle its part of the parsing Check this out void ParseFunction if lookahead TFUNC anything not FUNC here is wrong printfquotsyntax error nquot exit0 else lookahead yyleX global 39lookahead39 holds next token ParseIdentifier if lookahead TLPAREN printfquotsyntax error nquot exit0 else lookahead yylex ParseParameterList if lookahead TRPAREN printfquotsyntax error nquot lookahead yylex ParseStatements To make things a little cleaner let s introduce a utility function that can be used to verify that the next token is what is expected and will error and exit otherwise We will need this again and again in writing the parsing routines void MatchTokenint expected if lookahead I expected printfquotsyntax error expected d got dnquot expectedlookahead exit0 else if match consume token and move on lookahead yylex Now we can tidy up the ParseFunction routine and make it clearer what it does void ParseFunction MatchTokenTFUNC ParseIdentifier MatchTokenTLPAREN ParseParameterList MatchTokenTRPAREN ParseStatements The following diagraIn illustrates how the parse tree is built pro ram functiTnlist fu 39on FUNC wifier parameerlist this part of the tree this part is parsed Is parsed by the C3 by the call to statements this part is parsed by the call to Here is the production for an ifstatement in this language ifstatement gt IF expression THEN statement ENDIF I IF expression THEN statement ELSE statement ENDIF To prepare this graInmar for recursivedescent we must leftfactor to share the common parts ifstatement gt IF expression THEN statement closeif closeif gt ENDIF I ELSE statement ENDIF Now let s look at the recursivedescent functions to parse an if statement void ParseIfStatement MatchTokenTIF ParseEXpression MatchTokenTTHEN Parsestatement ParseCloseIf void ParseCloseIf if lookahead TENDIF if we immediately find ENDIF lookahead yylex predict closeif gt ENDIF else MatchTokenTELSE otherwise we look for ELSE Parsestatement predict closeif gt ELSE stmt ENDIF MatchToken TENDIF When parsing the closing portion of the if we have to decide which of the two right hand side options to expand In this case it isn t too difficult We try to match the first token again ENDIF and on nonmatch we try to match the ELSE clause and if that doesn t match it will report an error 6 Navigating through two choices seemed simple enough however what happens where we have many alternatives on the right side statement gt assgstatement returnstatement printstatement nustatement ifstatement whilestatement blockofstatements When implementing the Parsestatement function how are we going to be able to determine which of the seven options to match for any given input Remember we are trying to do this without backtracking and just one token of lookahead so we have to be able to make immediate decision with minimal information this can be a challenge To understand how to recognize and solve problem we need a definition The first set of a sequence of symbols 41 written as Firstu is the set of terminals which start the sequences of symbols derivable from g A bit more formally consider all strings derivable from u If g gt v where y begins with some terminal that terminal is in Firstu If g gt s then s is in Firstu lnformally the first set of a sequence is a list of all the possible terminals that could start a string derived from that sequence We will work an example of calculating the first sets a bit later For now just keep in mind the intuitive meaning Finding our lookahead token in one of the first sets of the possible expansions tells us that is the path to follow Given a production with a nu1nber of alternatives A gt g1 g2 we can write a recursivedescent routine only if all the sets Firstui are disjoint The general form of such a routine would be void ParseA case below not quite legal C need to list symbols individually switch lookahead case Firstu1 predict production A gt ul code to recognize ul return case Firstu2 predict production A gt uz code to recognize uz return default printfquotsyntax error nquot exit0 If the first sets of the various productions for a nonterminal are not disjoint a predictive parser doesn39t know which choice to make We would either need to rewrite the gramlnar or use a different parsing technique for this nonterminal For programming 7 languages it is usually possible to restructure the productions or embed certain rules into the parser to resolve con icts but this constraint is one of the weaknesses of the topdown nonbacktracking approach It is a bit trickier if the nonterminal we are trying to recognize is nullable A nonterminal A is nullable if there is a derivation of A that results in s ie that nonterminal would completely disappear in the parse string ie s E FirstA In this case A could be replaced by nothing and the next token would be the first token of the symbol following A in the sentence being parsed Thus if A is nullable our predictive parser also needs to consider the possibility that the path to choose is the one corresponding to A gt 2 To deal with this we define the following The fall 0w set of a nonterminal A is the set of terminal symbols that can appear inunediately to the right of A in a valid sentence A bit more formally for every valid sentence S gt uAv where y begins with some terminal and that terminal is in FollowA Informally you can think about the follow set like this A can appear in various places within a valid sentence The follow set describes what terminals could have followed the sentential form that was expanded from A We will detail how to calculate the follow set a bit later For now realize follow sets are useful because they define the right context consistent with a given nonterminal and provide the lookahead that might signal a nullable nonterminal should be expanded to 2 With these two definitions we can now generalize how to handle A gt m uz in a recursivedescent parser In all situations we need a case to handle each member in Firstgi In addition if there is a derivation from any gi that could yield a ie if it is nullable then we also need to handle the members in FollowA void ParseA switch lookahead case Firstul code to recognize ul return case Firstuz code to recognize uz return case FollowA predict production A gtepsilon if A is nullable usually do nothing here default printfquotsyntax error nquot exit0 8 What about leftrecursive productions Now we see why these are such a problem in a predictive parser Consider this leftrecursive production that matches a list of one or more functions functionist gt functionist function function function gt FUNC identifier parameterist statement void ParseFunctionList ParseFunctionList ParseFunction Such a production will send a recursivedescent parser into an infinite loop We need to remove the leftrecursion in order to be able to write the parsing function for a functionist functionist gt functionist function function becomes functionist gt function functionist function then we must leftfactor the common parts functionist gt function morefunctions morefunctions gt function morefunctions 2 And now the parsing function looks like this void ParseFunctionList ParseFunction ParseMoreFunctionS may be empty ie expand to epsilon Computing first and follow These are the algorithms used to compute the first and follow sets Calculating first sets To calculate Firstw where u has the form X1X2Xn do the following a If X1 is a terminal add X1 to Firstu and you39re finished b Else X1 is a nonterminal so add FirstX1 s to Firstu a If X1 is anullable nonterminal ieX1 gt 2 add FirstX2 s to Firstu Furthermore if X2 can also go to s then add FirstX3 s and so on through all Xn until the first nonnullable symbol is encountered b If X1X2Xn gt 2 add a to the first set 9 Calculating follow sets For each nonterminal in the graInmar do the following 1 Place EOF in FollowS where S is the start symbol and EOF is the inputs right endmarker The endmarker might be end of file newline or a special symbol whatever is the expected end of input indication for this graInmar We will typically use as the endmarker 2 For every production A gt ng where u and y are any string of grammar symbols and B is a nonterminal everything in Firstv except a is placed in FollowB 3 For every production A gt uB or a production A gt qu where Firstv contains a ie y is nullable then everything in FollowA is added to FollowB Here is a complete example of first and follow set computation starting with this grammar S gt AB A gt Ca Is B gt BaAC Ic C gt b 8 Notice we have a leftrecursive production that must be fixed if we are to use LL1 parsing B gt BaAC It becomes B gt CB 339 gt aACB39 Is The new graInmar is S gt AB A gt Ca I B gt cB39 B39 gt aACB39 I C gt b I s It helps to first compute the nullable set ie those nonterminals X that X gt 2 since you need to refer to the nullable status of various nonterminals when computing the first and follow sets NullableG A B39 C The first sets for each nonterminal are FirstC b s FirstB39 a s FirstB c FirstA b a 8 Start with FirstC 2 add a since C is nullable and 2 since A itself is nullable FirstS b a c 10 Start with FirstA 2 add FirstB since A is nullable We don t add 2 since S itself is notnullable A can go away but B cannot It is usually convenient to compute the first sets for the nonterminals that appear toward the bottom of the parse tree and work your way upward since the nonterminals toward the top may need to incorporate the first sets of the terminals that appear beneath them in the tree To compute the follow sets take each nonterminal and go through all the rightside productions that the nonterminal is in matching to the steps given earlier FollowS SE S doesn t appear in the right hand side of any productions We put in the follow set because S is the start symbol FollowB SE B appears on the right hand side of the S gt AB production Its follow set is the same as S FollowB39 SE B39 appears on the right hand side of two productions The B39 gt aACB39 production tells us its follow set includes the follow set of B39 which is tautological From B gt cB39 we learn its follow set is the same as B FollowC a SE C appears in the right hand side of two productions The production A gt Ca tells us a is in the follow set From B39 gt aACB39 we add the FirstB39 which is just a again Because B39 is nullable we must also add FollowB39 which is FollowA c b a A appears in the right hand side of two productions From S gt AB we add FirstB which is just c B is not nullable From B39 gt aACB39 we add FirstC which is b Since C is nullable so we also include FirstB39 which is a B39 is also nullable so we include FollowB39 which adds 39 It can be convenient to compute the follows sets for the nonterminals that appear toward the top of the parse tree and work your way down but sometimes you have to circle around computing the follow sets of other nonterminals in order to complete the one you re on The calculation of the first and follow sets follow mechanical algorithms but it is very easy to get tripped up in the details and make mistakes even when you know the rules Be careful 11 Tabledriven LL1 Parsing In a recursivedescent parser the production information is embedded in the individual parse functions for each nonterminal and the runtime execution stack is keeping track of our progress through the parse There is another method for implementing a predictive parser that uses a table to store that production along with an explicit stack to keep track of where we are in the parse This gramlnar for add multiply expressions is already set up to handle precedence and associativity E gt ETT T gt TFF F gt Eint After removal of left recursion we get E gt TE39 E39 gt TE39 Is T gt FT39 T39 gt FT39Is F gt Eint One way to illustrate the process is to study some transition graphs that represent the grammar quotquot 0 A predictive parser behaves as follows Let s assume the input string is 3 4 5 Parsing begins in the start state of the symbol E and moves to the next state This transition is marked with a T which sends us to the start state for T This in turn sends us to the start state for F F has only terminals so we read a token from the input string It must either be an open parenthesis or an integer in order for this parse to be valid We consulne the integer token and thus we have hit a final state in the F transition 12 diagraIn so we return to where we came from which is the T diagraIn we have just finished processing the F nonterminal We continue with T39 and go to that start state The current lookahead is which doesn t match the required by the first production but is in the follow set for T39 so we match the second production which allows T39 to disappear entirely We finish T39 and return to T where we are also in a final state We return to the E diagram where we have just finished processing the T We move on to E39 and so on A tabledriven predictive parser uses a stack to store the productions to which it must return A parsing table stores the actions the parser should take based on the input token and what value is on top of the stack is the end of input symbol Input Tracing Here is how a predictive parser works We push the start symbol on the stack and read the first input token As the parser works through the input there are the following possibilities for the top stack symbol X and the input token a using table M 1 If X a and a end of input parser halts and parse completed successfully 2 If X a and a I successful match pop X and advance to next input token This is called a match action 3 If X a and X is a nonterminal pop X and consult table at MXa to see which production applies push right side of production on stack This is called a predict action 4 If none of the preceding cases applies or the table entry from step 3 is blank there has been a parse error Here is an exaInple parse of the string int int int PARSE REMAINING PARSER ACTION STACK INPUT E int int int Predict E gt TE39 pop E from stack push TE39 no change in input 13 intT39E39 int int Match int pop from stack Suppose instead that we were trying to parse the input The first step of the parse would give an error because there is no entry at ME Constructing The Parse Table The next task is to figure out how we built the table The construction of the table is somewhat involved and tedious the perfect task for a computer but errorprone for humans The first thing we need to do is compute the first and follow sets for the grammar E gt TE39 E39 gt TE39 Is T gt FT39 T39 gt FT39Is F gt Eint FirstE FirstT FirstF int FirstT39 a FirstE39 a FollowE FollowE39 FollowT FollowT39 FollowF Once we have the first and follow sets we build a table M with the leftmost column labeled with all the nonterminals in the gramlnar and the top row labeled with all the terminals in the grammar along with The following algorithln fills in the table cells 14 1 For each production A gt y of the gramlnar do steps 2 and 3 2 For each terminal a in FirstLQ add A gt u to MAa 3 If s in Firstu ie A is nullable add A gt u to MAb for each terminal b in FollowA If s in Firstu and is in FollowA add A gt g to MA 4 All undefined entries are errors The concept used here is to consider a production A gtu with a in Firstu The parser should expand A to LA when the current input symbol is a It s a little trickier when u s or u gt 2 In this case we should expand A to u if the current input symbol is in FollowA or if the at the end of the input has been reached and is in FollowA If the procedure ever tries to fill in an entry of the table that already has a nonerror entry the procedure fails the gramlnar is not LLl LL1 Grammars Properties These predictive topdown techniques either recursivedescent or tabledriven require a gramlnar that is LLl One fullygeneral way to determine if a grammar is LLl is to build the table and see if you have con icts In some cases you will be able to determine that a gramlnar is or isn39t LLl via a shortcut such as identifying obvious leftfactors To give a formal statement of what is required for a grammar to be LLl o No ambiguity o No left recursion o A graInmar G is LLl iff whenever A gt g v are two distinct productions of G the following conditions hold 0 for no terminal a do both u and y derive strings beginning with a ie first sets are disjoint at most one of g and y can derive the empty string if v gt s then u does not derive any string beginning with a terminal in FollowA ie first and follow must be disjoint if nullable 00 All of this translates intuitively that when trying to recognize A the parser must be able to examine just one input symbol of lookahead and uniquely determine which production to use Error Detection and Recovery A few general principles apply to errors found regardless of parsing technique being used 0 A parser should try to determine that an error has occurred as soon as possible Waiting too long before declaring an error can cause the parser to lose the actual location of the error 15 o A suitable and comprehensive message should be reported Missing semicolon on line 36 is helpful unable to shift in state 425 is not 0 After an error has occurred the parser must pick a reasonable place to resume the parse Rather than giving up at the first problem a parser should always try to parse as much of the code as possible in order to find as many real errors as possible during a single run 0 A parser should avoid cascading errors which is when one error generates a lengthy sequence of spurious error messages Recognizing that the input is not syntactically valid can be relatively straightforward An error is detected in predictive parsing when the terminal on top of the stack does not match the next input symbol or when nonterminal A is on top of the stack a is the next input symbol and the parsing table entry MAa is empty Deciding how to handle the error is bit more complicated By inserting specific error actions into the empty slots of the table you can determine how a predictive parser will handle a given error condition At the least you can provide a precise error message that describes the mismatch between expected and found Recovering from errors and being able to resume and successfully parse is more difficult The entire compilation could be aborted on the first error but most users would like to find out more than one error per compilation The problem is how to fix the error in some way to allow parsing to continue Many errors are relatively minor and involve syntactic violations for which the parser has a correction that it believes is likely to be what the programmer intended For example a missing semicolon at the end of the line or a misspelled keyword can usually be recognized For many minor errors the parser can quotfixquot the program by guessing at what was intended and reporting a warning but allowing compilation to proceed unhindered The parser might skip what appears to be an erroneous token in the input or insert a necessary but missing token or change a token into the one expected substituting BEGIN for BGEIN For more major or complex errors the parser may have no reliable correction The parser will attempt to continue but will probably have to skip over part of the input or take some other exceptional action to do so Panicmode error recovery is a simple technique that just bails out of the current construct looking for a safe symbol at which to restart parsing The parser just discards input tokens until it finds what is called a synchronizing token The set of synchronizing tokens are those that we believe confirm the end of the invalid statement and allow us to pick up at the next piece of code For a nonterminal A we could place all the symbols in FollowA into its synchronizing set If A is the nonterminal for a variable declaration and the garbled input is something like duoble d the parser might skip ahead to the semi colon and act as though the declaration didn t exist This will surely cause some more 16 cascading errors when the variable is later used but it might get through the trouble spot We could also use the symbols in FirstA as a synchronizing set for restarting the parse of A This would allow input junk double d to parse as a valid variable declaration Bibliography A Aho R Sethi J Ullman Cnmnilers39 P nciples T 39 and Tools Reading MA Addison Wesley 1986 J P Bennett 39 to Cnmnilino T 39 Berkshire England McGraW Hill 1990 D Cohen Introduction to Computer Theory New York Wiley 1986 C Fischer R LeBlanc Crafting a Compiler Menlo Park CA BenjaminCummings 1988 K Loudon Compiler Construction Boston MA PWS 1997


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Jim McGreen Ohio University

"Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

Anthony Lee UC Santa Barbara

"I bought an awesome study guide, which helped me get an A in my Math 34B class this quarter!"

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

Parker Thompson 500 Startups

"It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.