Compiler Design CIS 631
This 11-page set of class notes was uploaded by Sam Rau on Wednesday, October 21, 2015. The notes belong to CIS 631 at Syracuse University, taught by Staff in Fall. Since upload, they have received 22 views. For similar materials see /class/225600/cis-631-syracuse-university in Computer & Information Science at Syracuse University.
CS143 Handout 06
Autumn 2007, October 1, 2007

Formal Grammars
Handout written by Maggie Johnson and Julie Zelenski.

What is a grammar?
A grammar is a powerful tool for describing and analyzing languages. It is a set of rules by which valid sentences in a language are constructed. Here's a trivial example of English grammar:

<sentence> -> <subject> <verb-phrase> <object>
<subject> -> This | Computers | I
<verb-phrase> -> <adverb> <verb> | <verb>
<adverb> -> never
<verb> -> is | run | am | tell
<object> -> the <noun> | a <noun> | <noun>
<noun> -> university | world | cheese | lies

Using the above rules, or productions, we can derive simple sentences such as these:

This is a university
Computers run the world
I am the cheese
I never tell lies

Here is a leftmost derivation of the first sentence using these productions:

<sentence> -> <subject> <verb-phrase> <object>
           -> This <verb-phrase> <object>
           -> This <verb> <object>
           -> This is <object>
           -> This is a <noun>
           -> This is a university

In addition to several reasonable sentences, we can also derive nonsense like "Computers run cheese" and "This am a lies". These sentences don't make semantic sense, but they are syntactically correct because they follow the sequence of subject, verb-phrase, and object. Formal grammars are a tool for syntax, not semantics. We worry about semantics at a later point in the compiling process; in the syntax analysis phase, we verify structure, not meaning.

Vocabulary
We need to review some definitions before we can proceed:

grammar: a set of rules by which valid sentences in a language are constructed.

non-terminal: a grammar symbol that can be replaced (expanded) to a sequence of symbols.

terminal: an actual word in a language; these are the symbols in a grammar that cannot be replaced by anything else. "Terminal" is supposed to conjure up the idea that it is a dead end: no further expansion is possible.
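The leftmost derivation shown above can be simulated mechanically: at each step, the leftmost non-terminal in the working string is replaced using one production. The Python sketch below is purely illustrative; the sequence of production choices is hard-coded for this one sentence, so it demonstrates derivation rather than parsing.

```python
# Simulate the leftmost derivation of "This is a university". Each pair is
# (leftmost non-terminal to expand, replacement right-hand side); the choices
# are hard-coded for this sentence, so this is illustration, not parsing.
steps = [
    ("<sentence>",    "<subject> <verb-phrase> <object>"),
    ("<subject>",     "This"),
    ("<verb-phrase>", "<verb>"),
    ("<verb>",        "is"),
    ("<object>",      "a <noun>"),
    ("<noun>",        "university"),
]

working = "<sentence>"
for lhs, rhs in steps:
    working = working.replace(lhs, rhs, 1)  # expand only the leftmost occurrence
    print("->", working)
```

The printed sentential forms match the derivation above, ending in a string of only terminals: the sentence itself.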
production: a grammar rule that describes how to replace (exchange) symbols. The general form of a production for a non-terminal X is:

X -> Y1 Y2 Y3 ... Yn

The non-terminal X is declared equivalent to the concatenation of the symbols Y1 Y2 Y3 ... Yn. The production means that anywhere we encounter X, we may replace it by the string Y1 Y2 Y3 ... Yn. Eventually we will have a string containing nothing that can be expanded further, i.e., it will consist of only terminals. Such a string is called a sentence. In the context of programming languages, a sentence is a syntactically correct and complete program.

derivation: a sequence of applications of the rules of a grammar that produces a finished string of terminals. A leftmost derivation is where we always substitute for the leftmost non-terminal as we apply the rules (we can similarly define a rightmost derivation). A derivation is also called a parse.

start symbol: a grammar has a single non-terminal, the start symbol, from which all sentences derive:

S -> X1 X2 X3 ... Xn

All sentences are derived from S by successive replacement using the productions of the grammar.

null symbol ε: it is sometimes useful to specify that a symbol can be replaced by nothing at all. To indicate this, we use the null symbol ε, e.g., A -> B | ε.

BNF: a way of specifying programming languages using formal grammars and production rules with a particular form of notation (Backus-Naur form).

A few grammar exercises to try on your own (the alphabet in each case is {a, b}):
- Define a grammar for the language of strings with one or more a's followed by zero or more b's.
- Define a grammar for even-length palindromes.
- Define a grammar for strings where the number of a's is equal to the number of b's.
- Define a grammar where the number of a's is not equal to the number of b's. (Hint: think about it as two separate cases.)
Can you write regular expressions for these languages? Why or why not?

Parse Representation
In working with grammars, we can represent the application of the rules to derive a sentence in two ways. The first is a derivation, as shown earlier for "This is a university", where the rules are applied step-by-step and we substitute for one non-terminal at a time.
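Returning to the first of the exercises above (one or more a's followed by zero or more b's), one possible grammar is S -> A B, A -> aA | a, B -> bB | ε. The sketch below checks a direct recognizer for that grammar against the regular expression a+b*, which describes the same language; the grammar choice and function name are ours, not from the handout.

```python
import re

# One possible grammar for the first exercise:
#   S -> A B     A -> aA | a     B -> bB | ε
# The language is regular, so the regular expression a+b* also describes it.
def in_language(s: str) -> bool:
    i = 0
    while i < len(s) and s[i] == "a":   # A -> aA | a: one or more a's
        i += 1
    if i == 0:                          # A requires at least one a
        return False
    while i < len(s) and s[i] == "b":   # B -> bB | ε: zero or more b's
        i += 1
    return i == len(s)

for s in ["a", "aab", "abbb", "", "b", "aba"]:
    assert in_language(s) == bool(re.fullmatch(r"a+b*", s))
print("grammar and regex agree")
```

That the grammar has an equivalent regular expression is no accident: this language is regular (Type 3 in the hierarchy discussed below), unlike the palindrome and counting exercises.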
Think of a derivation as a history of how the sentence was parsed, because it not only includes which productions were applied, but also the order they were applied, i.e., which non-terminal was chosen for expansion at each step. There can be many different derivations for the same sentence: the leftmost, the rightmost, and so on.

A parse tree is the second method for representation. It diagrams how each symbol derives from other symbols in a hierarchical manner.

[Figure: parse tree for "This is a university", with <sentence> at the root expanding into <subject>, <verb-phrase>, and <object>.]

Although the parse tree includes all of the productions that were applied, it does not encode the order they were applied. For an unambiguous grammar (we'll define ambiguity in a minute), there is exactly one parse tree for a particular sentence.

More Definitions
Here are some other definitions we will need, described in reference to this example grammar:

S -> AB
A -> Ax | y
B -> z

alphabet: The alphabet is {S, A, B, x, y, z}. It is divided into two disjoint sets. The terminal alphabet consists of terminals, which appear in the sentences of the language: {x, y, z}. The remaining symbols are the non-terminal alphabet; these are the symbols that appear on the left side of productions and can be replaced during the course of a derivation: {S, A, B}. Formally, we use V for the alphabet, T for the terminal alphabet, and N for the non-terminal alphabet, giving us V = T ∪ N and T ∩ N = ∅.

The convention used in our lecture notes is a sans-serif font for grammar elements, lowercase for terminals, uppercase for non-terminals, and underlined lowercase (e.g., u, v) to denote arbitrary strings of terminal and non-terminal symbols (possibly null). In some textbooks, Greek letters are used for arbitrary strings of terminal and non-terminal symbols (e.g., α, β).

context-free grammar: To define a language, we need a set of productions of the general form u -> v. In a context-free grammar, u is a single non-terminal and v is an arbitrary string of terminal and non-terminal symbols.
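The alphabet definitions above can be checked mechanically for the example grammar. A small sketch (the variable names and dictionary encoding are ours, chosen to mirror the notation):

```python
# The example grammar S -> AB, A -> Ax | y, B -> z, written out explicitly.
N = {"S", "A", "B"}          # non-terminal alphabet
T = {"x", "y", "z"}          # terminal alphabet
P = {                        # productions: left-hand side -> set of alternatives
    "S": {"AB"},
    "A": {"Ax", "y"},
    "B": {"z"},
}

V = N | T                    # the alphabet V = T ∪ N
assert N & T == set()        # the two sets are disjoint: T ∩ N = ∅
# Every production has a single non-terminal on the left (context-free), and
# every symbol on the right is drawn from V.
assert all(lhs in N for lhs in P)
assert all(sym in V for alts in P.values() for rhs in alts for sym in rhs)
print("alphabet checks pass")
```

Encoding each symbol as a single character keeps the checks trivial; a real tool would tokenize multi-character symbols first.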
When parsing, we can replace u by v wherever it occurs. We shall refer to this set of productions symbolically as P.

formal grammar: We formally define a grammar as a 4-tuple (S, P, N, T). S is the start symbol (with S ∈ N), P is the set of productions, and N and T are the non-terminal and terminal alphabets. A sentence is a string of symbols in T derived from S using one or more applications of productions in P. A string of symbols derived from S, but possibly including non-terminals, is called a sentential form or a working string.

A production u -> v is used to replace an occurrence of u by v. Formally, if we apply a production p ∈ P to a string of symbols w in V to yield a new string of symbols z in V, we say that z is derived from w using p, written w =>p z. We also use:

w => z    z derives from w (production unspecified)
w =>* z   z derives from w using zero or more productions
w =>+ z   z derives from w using one or more productions

equivalence: The language L(G) defined by grammar G is the set of sentences derivable using G. Two grammars G and G' are said to be equivalent if the languages they generate, L(G) and L(G'), are the same.

Grammar Hierarchy
We owe a lot of our understanding of grammars to the work of the American linguist Noam Chomsky (yes, the Noam Chomsky known for his politics). There are four categories of formal grammars in the Chomsky Hierarchy; they span from Type 0, the most general, to Type 3, the most restrictive. More restrictions on the grammar make it easier to describe and efficiently parse, but reduce the expressive power.

Type 0: free or unrestricted grammars
These are the most general. Productions are of the form u -> v, where both u and v are arbitrary strings of symbols in V, with u non-null. There are no restrictions on what appears on the left or right-hand side, other than that the left-hand side must be non-empty.

Type 1: context-sensitive grammars
Productions are of the form uXw -> uvw, where u, v, and w are arbitrary strings of symbols in V, with v non-null, and X a single non-terminal.
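The derivation relation and L(G) can be made concrete for the example grammar: starting from S, repeatedly expand the leftmost non-terminal, keeping any working string that becomes all terminals. The sketch below is ours; it assumes single-character symbols and no ε-productions, and it prunes working strings longer than a bound so the search terminates.

```python
from collections import deque

# Enumerate the sentences of L(G) up to a length bound by breadth-first
# application of productions to working strings (sentential forms).
def sentences(P, N, start, max_len):
    found, queue, seen = set(), deque([start]), {start}
    while queue:
        w = queue.popleft()
        nt = next((i for i, c in enumerate(w) if c in N), None)
        if nt is None:                    # only terminals left: a sentence
            found.add(w)
            continue
        for rhs in P[w[nt]]:              # expand the leftmost non-terminal
            z = w[:nt] + rhs + w[nt + 1:] # one derivation step, w => z
            if len(z) <= max_len and z not in seen:
                seen.add(z)
                queue.append(z)
    return found

P = {"S": {"AB"}, "A": {"Ax", "y"}, "B": {"z"}}
print(sorted(sentences(P, {"S", "A", "B"}, "S", max_len=5)))
# ['yxxxz', 'yxxz', 'yxz', 'yz']
```

Expanding only the leftmost non-terminal loses nothing here, since every sentence of a context-free grammar has a leftmost derivation. Comparing two grammars' enumerations up to a bound gives a crude (necessary, not sufficient) check of equivalence.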
In other words, X may be replaced by v, but only when it is surrounded by u and w, i.e., in a particular context.

Type 2: context-free grammars
Productions are of the form X -> v, where v is an arbitrary string of symbols in V, and X is a single non-terminal. Wherever you find X, you can replace it with v, regardless of context.

Type 3: regular grammars
Productions are of the form X -> a, X -> aY, or X -> ε, where X and Y are non-terminals and a is a terminal. That is, the left-hand side must be a single non-terminal, and the right-hand side can be either empty, a single terminal by itself, or a single terminal followed by a single non-terminal. These grammars are the most limited in terms of expressive power.

Every Type 3 grammar is a Type 2 grammar, and every Type 2 is a Type 1, and so on. Type 3 grammars are particularly easy to parse because of the lack of recursive constructs. Efficient parsers exist for many classes of Type 2 grammars. Although Type 1 and Type 0 grammars are more powerful than Types 2 and 3, they are far less useful, since we cannot create efficient parsers for them. In designing programming languages using formal grammars, we will use Type 2, or context-free, grammars, often just abbreviated as CFG.

Issues in parsing context-free grammars
There are several efficient approaches to parsing most Type 2 grammars, and we will talk through them over the next few lectures. However, there are some issues that can interfere with parsing that we must take into consideration when designing the grammar. Let's take a look at three of them: ambiguity, recursive rules, and left-factoring.

Ambiguity
If a grammar permits more than one parse tree for some sentences, it is said to be ambiguous. For example, consider the following classic arithmetic expression grammar:

E -> E op E | ( E ) | int
op -> + | - | * | /

This grammar denotes expressions that consist of integers joined by binary operators, possibly including parentheses. As defined above, this grammar is ambiguous, because for certain sentences we can construct more than one parse tree.
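The Type 3 restriction is simple enough to check by machine: every right-hand side must be empty, a single terminal, or a terminal followed by one non-terminal. A sketch of such a checker (our own illustrative code, covering only the right-linear form described above):

```python
# Check whether every production fits the Type-3 (right-linear) shapes
# X -> a, X -> aY, or X -> ε, using "" for ε and one character per symbol.
def is_right_linear(P, N, T):
    for lhs, bodies in P.items():
        if lhs not in N:
            return False
        for rhs in bodies:
            if not (rhs == "" or
                    (len(rhs) == 1 and rhs in T) or
                    (len(rhs) == 2 and rhs[0] in T and rhs[1] in N)):
                return False
    return True

N, T = {"X", "Y"}, {"a", "b"}
regular = {"X": {"aY", "b", ""}, "Y": {"aX"}}   # fits the Type-3 shapes
not_regular = {"X": {"aXb"}}                    # context-free but not right-linear
print(is_right_linear(regular, N, T))      # True
print(is_right_linear(not_regular, N, T))  # False
```

Note the check classifies the grammar's form, not its language: a language can be regular even if a particular grammar for it is not written in right-linear form.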
For example, consider the expression 10 * 2 + 5. We parse by first applying the production E -> E op E. The parse tree on the left chooses to expand that first op to *; the one on the right, to +. We have two completely different parse trees. Which one is correct?

[Figure: two parse trees for 10 * 2 + 5; in the left tree the root operator is *, grouping the expression as 10 * (2 + 5); in the right tree the root operator is +, grouping it as (10 * 2) + 5.]

Both trees are legal in the grammar as stated, and thus either interpretation is valid. Although natural languages can tolerate some kinds of ambiguity (e.g., puns, plays on words, etc.), it is not acceptable in computer languages. We don't want the compiler just haphazardly deciding which way to interpret our expressions. Given our expectations from algebra concerning precedence, only one of the trees seems right. The right-hand tree fits our expectation that * "binds tighter": that result is computed first, then integrated into the outer expression, which has a lower-precedence operator.

It's fairly easy for a grammar to become ambiguous if you are not careful in its construction. Unfortunately, there is no magical technique that can be used to resolve all varieties of ambiguity. It is an undecidable problem to determine whether any grammar is ambiguous, much less to attempt to mechanically remove all ambiguity. However, that doesn't mean that in practice we cannot detect ambiguity or do something about it. For programming language grammars, we usually take pains to construct an unambiguous grammar, or to introduce additional disambiguating rules to throw away the undesirable parse trees, leaving only one for each sentence.

Using the above ambiguous expression grammar, one technique would leave the grammar as is, but add disambiguating rules into the parser implementation. We could code into the parser knowledge of precedence and associativity to break the tie and force the parser to build the tree on the right rather than the left. The advantage of this is that the grammar remains simple and less complicated. But as a downside, the syntactic structure of the language is no longer given by the grammar alone.
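The damage ambiguity would do is easy to see by evaluating the groupings that the two parse trees imply:

```python
# The two parse trees for 10 * 2 + 5 imply two different groupings, and the
# groupings do not even agree on the value of the expression:
right_tree = (10 * 2) + 5   # * applied first, matching algebraic precedence
left_tree = 10 * (2 + 5)    # + applied first

print(right_tree, left_tree)  # 25 70: one sentence, two meanings
```

A compiler that picked a tree haphazardly could compile the same source text to either value, which is exactly why the ambiguity must be resolved.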
Another approach is to change the grammar so that it allows only the one tree that correctly reflects our intention, and eliminates the others. For the expression grammar, we can separate expressions into multiplicative and additive subgroups and force them to be expanded in the desired order:

E -> E top E | T
top -> + | -
T -> T fop T | F
fop -> * | /
F -> ( E ) | int

Terms are the addition/subtraction expressions, and factors are used for multiplication and division. Since the base case for an expression is a term, addition and subtraction will appear higher in the parse tree, and thus receive lower precedence.

After verifying that the above rewritten grammar has only one parse tree for the earlier ambiguous expression, you might think we were home free, but now consider the expression 10 + 2 + 5. The recursion on both sides of the binary operator allows either side to match repetitions. The arithmetic operators usually associate to the left, so replacing the right-hand side with the base case will force the repetitive matches onto the left side. The final result is:

E -> E top T | T
top -> + | -
T -> T fop F | F
fop -> * | /
F -> ( E ) | int

Whew! The obvious disadvantage of changing the grammar to remove ambiguity is that it may complicate and obscure the original grammar definitions. There is no mechanical means to change any ambiguous grammar into an unambiguous one (undecidable, remember?). However, most programming languages have only limited issues with ambiguity, and these can be resolved using ad hoc techniques.

Recursive productions
Productions are often defined in terms of themselves. For example, a list of variables in a programming language grammar could be specified by this production:

variable_list -> variable | variable_list variable

Such productions are said to be recursive. If the recursive non-terminal is at the left of the right side of the production, e.g., A -> u | Av, we call the production left-recursive. Similarly, we can define a right-recursive production: A -> u | vA. Some parsing techniques have trouble with one or the other variant of recursive productions, and so sometimes we have to massage the grammar into a different, but equivalent, form.
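The final expression grammar above translates naturally into a small evaluator. In this sketch (a hand-written illustration, not a general parser), the left-recursive rules E -> E top T and T -> T fop F are realized as loops, which preserves left associativity without recursing on the left:

```python
import re

def tokenize(s):
    # Split into integer literals, operators, and parentheses.
    return re.findall(r"\d+|[-+*/()]", s)

def parse_E(toks):            # E -> E top T | T
    val = parse_T(toks)
    while toks and toks[0] in "+-":
        op = toks.pop(0)
        rhs = parse_T(toks)
        val = val + rhs if op == "+" else val - rhs
    return val

def parse_T(toks):            # T -> T fop F | F
    val = parse_F(toks)
    while toks and toks[0] in "*/":
        op = toks.pop(0)
        rhs = parse_F(toks)
        val = val * rhs if op == "*" else val / rhs
    return val

def parse_F(toks):            # F -> ( E ) | int
    tok = toks.pop(0)
    if tok == "(":
        val = parse_E(toks)
        toks.pop(0)           # consume ")"
        return val
    return int(tok)

print(parse_E(tokenize("10 * 2 + 5")))   # 25: * binds tighter than +
```

Because T sits below E, multiplication is always grouped before addition, and the loops consume repeated operators left to right, giving left associativity for free.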
Left-recursive productions can be especially troublesome in top-down parsers (we'll see why a bit later). Handily, there is a simple technique for rewriting the grammar to move the recursion to the other side. For example, consider this left-recursive rule:

X -> Xa | Xb | AB | C | DEF

To convert the rule, we introduce a new non-terminal X' that we append to the end of all non-left-recursive productions for X. The expansion for the new non-terminal is basically the reverse of the original left-recursive rule. The rewritten productions are:

X -> ABX' | CX' | DEFX'
X' -> aX' | bX' | ε

It appears we just exchanged the left-recursive rules for an equivalent right-recursive version. This might seem pointless, but some parsing algorithms prefer, or even require, only left or only right recursion.

Left-factoring
The parser usually reads tokens from left to right, and it is convenient if, upon reading a token, it can make an immediate decision about which production from the grammar to expand. However, this can be trouble if there are productions that have common first symbols on the right side. Here is an example we often see in programming language grammars:

Stmt -> if Cond then Stmt else Stmt | if Cond then Stmt | Other

The common prefix is "if Cond then Stmt". This causes problems because when a parser encounters an "if", it does not know which production to use. A useful technique called left-factoring allows us to restructure the grammar to avoid this situation. We rewrite the productions to defer the decision about which of the options to choose until we have seen enough of the input to make the appropriate choice. We factor out the common part of the two options into a shared rule that both will use, and then add a new rule that picks up where the tokens diverge:

Stmt -> if Cond then Stmt OptElse | Other
OptElse -> else Stmt | ε

In the rewritten grammar, upon reading an "if", we expand the first production and wait until "if Cond then Stmt" has been seen to decide whether to expand OptElse to else Stmt or to ε.
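The rewrite that moves recursion to the other side can be mechanized for immediate left recursion. The sketch below is ours: productions are lists of symbols, "ε" stands in for the empty string, and the helper name is illustrative.

```python
# Remove immediate left recursion from one non-terminal, following the
# X -> Xa | Xb | AB | C | DEF rewrite described above.
def remove_left_recursion(lhs, alternatives):
    prime = lhs + "'"
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == lhs]
    other = [alt for alt in alternatives if not alt or alt[0] != lhs]
    if not recursive:
        return {lhs: alternatives}       # nothing to do
    return {
        lhs:   [alt + [prime] for alt in other],
        prime: [alt + [prime] for alt in recursive] + [["ε"]],
    }

new = remove_left_recursion("X", [["X", "a"], ["X", "b"],
                                  ["A", "B"], ["C"], ["D", "E", "F"]])
for lhs, alts in new.items():
    print(lhs, "->", " | ".join(" ".join(a) for a in alts))
# X -> A B X' | C X' | D E F X'
# X' -> a X' | b X' | ε
```

This handles only the immediate case; as the next section shows, left recursion can also hide behind other non-terminals and must first be exposed by substitution.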
Hidden left-factors and hidden left-recursion
A grammar may not appear to have left recursion or left factors, yet still have issues that will interfere with parsing. This may be because the issues are hidden, and need to be first exposed via substitution. For example, consider this grammar:

A -> da | acB
B -> abB | daA | Af

A cursory examination of the grammar may not detect that the first and second productions of B overlap with the third. We substitute the expansions for A into the third production to expose this:

A -> da | acB
B -> abB | daA | daf | acBf

This exchanges the original third production of B for several new productions, one for each of the productions for A. These directly show the overlap, and we can then left-factor:

A -> da | acB
B -> aM | daN
M -> bB | cBf
N -> A | f

Similarly, the following grammar does not appear to have any left-recursion:

S -> Tu | wx
T -> Sq | vvS

Yet after substitution of S into T, the left-recursion comes to light:

S -> Tu | wx
T -> Tuq | wxq | vvS

If we then eliminate the left-recursion, we get:

S -> Tu | wx
T -> wxqT' | vvST'
T' -> uqT' | ε

Programming language case study: ALGOL
ALGOL is of interest to us because it was the first programming language to be defined using a grammar. It grew out of an international effort in the late 1950s to create a "universal programming language" that would run on all machines. At that time, FORTRAN and COBOL were the prominent languages, with new languages sprouting up all around. Programmers became increasingly concerned about portability of programs and about being able to communicate with one another on programming topics. Consequently, the ACM and GAMM (Gesellschaft für angewandte Mathematik und Mechanik) decided to come up with a single programming language that all could use on their computers, and in whose terms programs could be communicated between the users of all machines. Their first decision was not to use
FORTRAN as their universal language. This may seem surprising to us today, since it was the most commonly used language back then. However, as Alan Perlis, one of the original committee members, puts it: "Today FORTRAN is the property of the computing world, but in 1957 it was an IBM creation and closely tied to IBM hardware. For these reasons FORTRAN was unacceptable as a universal language."

ALGOL 58 was the first version of the language, followed up very soon after by ALGOL 60, which is the version that had the most impact. As a language, it introduced the following features:

- block structure and nested structures
- strong typing
- scoping
- procedures and functions
- call by value
- call by reference
- side effects (is this good or bad?)
- recursion

It may seem surprising that recursion was not present in the original FORTRAN or COBOL. You probably know that to implement recursion, we need a runtime stack to store the activation records as functions are called. In FORTRAN and COBOL, activation records were created at compile time, not at runtime. Thus, only one activation record per subroutine was created, and no stack was used. The parameters for the subroutine were copied into the activation record, and that data area was used for subroutine processing.

The ALGOL report was the first time we see BNF used to describe a programming language. Both John Backus and Peter Naur were on the ALGOL committees. They derived this description technique from an earlier paper written by Backus. The technique was adopted because they needed a machine-independent method of description. If one looks at the early definitions of FORTRAN, one can see the links to the IBM hardware; with ALGOL, the machine was not relevant.

BNF had a huge impact on programming language design and compiler construction. First, it stimulated a large number of studies on the formal structure of programming languages, laying the groundwork for a theoretical approach to language design. Second, a formal syntactic description could be used to drive a compiler
directly, as we shall see. ALGOL had a tremendous impact on programming language design, compiler construction, and language theory, but the language itself was a commercial failure. Partly this was due to design decisions (overly complex features, no I/O) along with the politics of the time (the popularity of FORTRAN, lack of support from the all-powerful IBM, resistance to BNF).

Bibliography
A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.
J. Backus, "The Syntax and Semantics of the Proposed International Algebraic Language of the Zurich ACM-GAMM Conference," Proceedings of the International Conference on Information Processing, 1959, pp. 125-132.
N. Chomsky, "On Certain Formal Properties of Grammars," Information and Control, Vol. 2, 1959, pp. 137-167.
J. P. Bennett, Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990.
D. Cohen, Introduction to Computer Theory. New York: Wiley, 1986.
J. C. Martin, Introduction to Languages and the Theory of Computation. New York, NY: McGraw-Hill, 1991.
P. Naur, "Programming Languages, Natural Languages, and Mathematics," Communications of the ACM, Vol. 18, No. 12, 1975, pp. 676-683.
J. Sammet, Programming Languages: History and Fundamentals. Englewood Cliffs, NJ: Prentice-Hall, 1969.
R. L. Wexelblat, History of Programming Languages. London: Academic Press, 1981.