Earley Parsing Algorithm

Published on Jan 17th 2020Duration: 30:53

In this video I introduce the Earley Parser Algorithm and explain how it works with a small parsing example. The Earley Algorithm was invented by Jay Earley and is used by the nearley parser generator tool which I have covered in previous videos.

Transcript

The following transcript was automatically generated by an algorithm.

• 00:00 : hello there welcome in this video I
• 00:02 : would like to explain how the early
• 00:05 : parsing algorithm works if you have been
• 00:09 : following along in my how to build a
• 00:12 : computer programming language series you
• 00:16 : should have seen how a parser can be
• 00:21 : built using a tool a parser generator a
• 00:25 : tool such as nearly Jas and in
• 00:27 : particular we did cover how to use
• 00:30 : nearly Jas in depth and to give credit
• 00:34 : where credit is due
• 00:36 : nearly it was created by Karthik Chandra
• 00:42 : and and nearly Jas is a implementation
• 00:49 : of the early parsing algorithm now what
• 00:54 : this video is going to talk about as I
• 00:56 : said is how the early parsing algorithm
• 00:59 : actually works it was created by an
• 01:02 : American computer scientist named Jay
• 01:06 : early and he at present he has left the
• 01:14 : realm of computer science and he has
• 01:17 : gone on to become a family psychologist
• 01:22 : and it's actually interesting if you
• 01:25 : google him up you would you would get
• 01:28 : this description of him which has
• 01:30 : nothing whatsoever to do with computer
• 01:33 : science but nevertheless he has really
• 01:36 : made his mark in computer science as the
• 01:40 : early parser is one of the really
• 01:42 : beautiful algorithms that that I have
• 01:46 : seen and and I hope to show you that and
• 01:51 : serve let the algorithm speak for itself
• 01:55 : now I suppose I'll talk about what is
• 01:58 : good about the early algorithm and what
• 02:01 : is perhaps not so good about it because
• 02:03 : the early algorithm is not the only
• 02:06 : parsing algorithm known to men and is
• 02:10 : actually not even the most popular
• 02:13 : there are other parsing algorithms that
• 02:16 : are actually quite a bit more popular
• 02:17 : than the early algorithm namely the
• 02:21 : recursive descent the recursive descent
• 02:24 : algorithm and bottom-up parsers parsers
• 02:29 : such as LR or as LR these algorithms I I
• 02:36 : may cover in the future so what how does
• 02:40 : it how does early algorithms stack up
• 02:43 : against these two okay so these are the
• 02:47 : other parsing algorithms so versus the
• 02:58 : early algorithm one thing that early has
• 03:02 : got is that it is able to work with
• 03:08 : either left recursive or right recursive
• 03:12 : grammars so which on the other
• 03:19 : algorithms actually have some trouble
• 03:22 : with and when you're using a recursive
• 03:27 : descent parser or a bottom up parser
• 03:29 : usually what you have to do is to when
• 03:32 : you have a left recursive grammar then
• 03:38 : and you're using one of the other
• 03:40 : algorithms you usually have to rewrite
• 03:42 : it in such a way that's not left
• 03:45 : recursive anymore well the early
• 03:47 : algorithm you don't have to worry about
• 03:49 : that it'll work with anything so I think
• 03:52 : from a developer standpoint that's
• 03:55 : actually quite a plus because you don't
• 03:59 : really have to worry about the low-level
• 04:01 : details such as exactly how to formulate
• 04:07 : another thing is that the early
• 04:10 : algorithm actually works perfectly with
• 04:12 : ambiguous grammars and ambiguous grammar
• 04:17 : what does that mean an ambiguous grammar
• 04:19 : is
• 04:20 : given one input there's actually two
• 04:24 : possible way to parse that input so if
• 04:28 : you take the sentence slow children
• 04:30 : playing there's two ways to parse it you
• 04:34 : can understand it as slow and then
• 04:37 : children playing or you could understand
• 04:40 : it is slow children are playing so
• 04:43 : that's an example of an ambiguous
• 04:45 : grammar you don't know where the
• 04:47 : parentheses are supposed to sit so the
• 04:52 : recursive descent parsing algorithm and
• 04:56 : the bottom-up parsing algorithms
• 04:59 : traditionally have trouble with
• 05:01 : ambiguous grammars but early algorithm
• 05:04 : will work with it anyway and it will
• 05:06 : actually give you multiple possible
• 05:09 : valid parse trees so if you give you
• 05:12 : both versions essentially that's what
• 05:14 : the early algorithm would do and if that
• 05:18 : is important to your application that's
• 05:20 : also cool but I say in general in
• 05:23 : computers like programming languages and
• 05:26 : ambiguous grammar is bad news anyway so
• 05:28 : perhaps that advantage is not really
• 05:31 : felt as much the third advantage of the
• 05:35 : early algorithm is that it allows for
• 05:40 : good error reporting it allows for and
• 05:44 : this reason is actually the original
• 05:47 : reason I was interested in the early
• 05:50 : algorithm I was looking for a way to
• 05:52 : provide better more useful error
• 05:55 : messages for more useful syntax error
• 05:58 : messages to users and in my research I
• 06:02 : came across the early algorithm and I
• 06:05 : found that it with this algorithm
• 06:07 : there's actually a very convenient way
• 06:09 : to report good helpful error messages
• 06:13 : and I think I may sort of cover how that
• 06:18 : works in in a separate episode okay so
• 06:21 : those are the pros of the early
• 06:23 : algorithm or the cons of the early
• 06:25 : algorithm well there's one that I can
• 06:31 : identify anyway we just it's actually
• 06:33 : less
• 06:34 : formant this algorithms runtime is
• 06:37 : actually n cubed in the worst case it in
• 06:41 : the case when the grammar is not
• 06:44 : ambiguous then it's quadratic time which
• 06:49 : means N squared whereas in the best case
• 06:53 : for recursive descent and bottom up
• 06:56 : parsers they can actually get Big O of n
• 07:03 : time so that's the main disadvantage
• 07:07 : there so that's sort of like a balance
• 07:10 : you might want to strike for me the good
• 07:13 : error reporting is so appealing and to
• 07:16 : me that I'm I really prefer the early
• 07:20 : algorithm for that reason and I also
• 07:22 : think the fact that it works with either
• 07:25 : left or right cursive grammar that's a
• 07:28 : very powerful feature as well yep so
• 07:31 : that's how it stacks up against the
• 07:32 : other hurting algorithms ok so let's get
• 07:35 : into it so I'm gonna use this example
• 07:39 : grammar which is very small and I chose
• 07:44 : a very small and simple grammar because
• 07:46 : I want to keep the Qt working out of the
• 07:51 : algorithm at a reasonable size that can
• 07:55 : fit on this computer screen here this is
• 07:58 : the starting symbol of this grammar is
• 08:01 : term so that one and a term can expand
• 08:07 : to a number a plus character and then
• 08:11 : another term and but a term can also
• 08:15 : expand to simply a number and what is a
• 08:18 : number and for simplicity stick again
• 08:22 : due to I have to fit the example in just
• 08:26 : one computer screen for simplicity sake
• 08:28 : a number only has a digit this is a
• 08:31 : character class in regular expression
• 08:33 : syntax that says this guy can be any
• 08:37 : character between 0 and 9 ok so in other
• 08:44 : words if you can imagine work
• 08:46 : string this glamour can match it can
• 08:49 : match something like 1 + 1 1 + 2 + 3 +
• 08:59 : and it can actually keep going the
• 09:02 : reason I can keep going is because this
• 09:06 : guy is a recursive production rule this
• 09:11 : rule is recursive in that the right-hand
• 09:14 : side refers back to itself so this is
• 09:17 : the grammar that we're gonna work with
• 09:18 : and for this example let's say we're
• 09:21 : gonna have the input 1 + 1 the first
• 09:26 : thing early does is generate a table a
• 09:32 : state table and because it's a state
• 09:36 : table we abbreviate it using the letter
• 09:38 : S and the table has a number of columns
• 09:43 : and we're going to denote the column so
• 09:46 : you can look at it as like a table so
• 09:49 : these are the columns and each column
• 09:51 : has a bunch of rows in it as you can
• 09:53 : imagine and then we're gonna label each
• 09:56 : column with an index number and so forth
• 09:59 : so when the algorithm initializes we're
• 10:03 : going to create the first column which
• 10:05 : will be column 0 so that that's labeled
• 10:09 : as 0 and what you do is you start with a
• 10:13 : starting symbol and in this grammar the
• 10:17 : starting symbol is turn and what we do
• 10:20 : is for each definition of the starting
• 10:25 : symbol meaning this the starting symbol
• 10:28 : is on the left hand side of the arrow
• 10:30 : we're going to copy it into this
• 10:33 : starting column so that would be term
• 10:38 : inside each rule inside the table
• 10:43 : there's this couple of additional pieces
• 10:46 : of information number one is the cursor
• 10:50 : which is written as a dot
• 10:55 : and it denotes where we're within this
• 10:59 : rule the progress of matching has
• 11:03 : reached and because here because we're
• 11:05 : at the very beginning of the algorithm
• 11:08 : we haven't even read in any characters
• 11:10 : from the input yet the were at the
• 11:12 : starting position so the dot is at the
• 11:15 : very beginning of this rule on the right
• 11:19 : hand side so this looks like so and so
• 11:26 : we copy the first possible expansion of
• 11:29 : the starting symbol term and then we're
• 11:32 : gonna do the same with the second one
• 11:36 : and again we're gonna put the dot at the
• 11:39 : beginning because we haven't read in any
• 11:42 : symbols from the input okay
• 11:45 : now along with each rule we can also
• 11:49 : gonna label each rule we Prairie
• 11:54 : shouldn't call these rules necessarily
• 11:56 : because there's some additional state
• 12:00 : associated with the rule so it was
• 12:03 : referred call each line here is a state
• 12:06 : rather than a rule but along with each
• 12:08 : state we're going to also attach a
• 12:10 : number transponding to the column within
• 12:16 : the table where this this state was
• 12:19 : initialized in this case that would be
• 12:23 : zero so I'm gonna put zero and
• 12:26 : parentheses it's zero for both of these
• 12:29 : cases now we're not done yet because for
• 12:34 : each case where the
• 12:36 : symbol to the right of this dot is a
• 12:41 : non-terminal symbol and the non-terminal
• 12:45 : symbol is something that is defined
• 12:47 : using a rule like term and number where
• 12:51 : as a terminal symbol is a symbol does
• 12:53 : it's a literal string like a plus okay
• 12:56 : so number is a non-terminal symbol and
• 12:59 : this literal plus that's a terminal
• 13:01 : symbol so for each case each state
• 13:05 : within this table we're
• 13:07 : the thing to the to the right of the
• 13:10 : stat is a non-terminal symbol this is
• 13:13 : potentially matching a number next so we
• 13:17 : really gonna take each rule its
• 13:21 : production rule in the grammar where it
• 13:24 : defines what that symbol is in this case
• 13:27 : number and copy it into this column as
• 13:30 : well so we're gonna copy number here and
• 13:34 : again we haven't read in any input yet
• 13:36 : so the dot is at the beginning of this
• 13:41 : rule as well here so what we're did and
• 13:48 : at this point to the right of the dot we
• 13:52 : have something that is actually a
• 13:55 : terminal symbol this is it so it's gonna
• 13:58 : match a literal character some digit
• 14:02 : right so and at that point we don't have
• 14:07 : to expand this anymore and then this
• 14:09 : also means because the thing immediately
• 14:11 : to the right of this dot is an actual
• 14:15 : like a terminal symbol you're actually
• 14:19 : trying to match with some string or some
• 14:21 : token here
• 14:23 : we're expecting this to come okay but at
• 14:29 : this point we're done with the 0th
• 14:31 : column of the table and for the next day
• 14:34 : we're actually gonna read in a character
• 14:36 : now I'm going to move on to state number
• 14:38 : one which is the second column of this
• 14:41 : table and for state number one we're
• 14:45 : gonna read in the first character which
• 14:47 : is the curt through one okay
• 14:51 : and whenever we read in a character
• 14:57 : whenever a read in a character what
• 15:00 : we're gonna do is go to the previous
• 15:03 : column and we're gonna look for a rule
• 15:07 : where this character matches the next
• 15:12 : expected matcher and the next expected
• 15:17 : matcher has to be a terminal sim
• 15:20 : for this to work so this rule is not
• 15:23 : gonna work because that's not a terminal
• 15:25 : symbol that's not the terminal symbol
• 15:27 : this one is and in the case that the
• 15:31 : thing immediately to the right of the
• 15:34 : dot is a terminal symbol we're gonna try
• 15:39 : to match it with a character that's
• 15:41 : coming in one does one match this
• 15:44 : character class a digit between 0 and 9
• 15:48 : yes it does
• 15:50 : so we can actually advance this the
• 15:53 : state of this rule by moving this dot
• 15:58 : over because we have progressed in the
• 16:02 : progression of this rule so what we're
• 16:05 : gonna do now is copy this rule over oh
• 16:10 : one more thing
• 16:11 : before I do that I do want to label the
• 16:15 : where this state started from which is
• 16:19 : zero and then I'm gonna move this rule
• 16:22 : over while moving the dot over by one
• 16:33 : and but there's zero the fact that this
• 16:36 : state started at the zero of column that
• 16:40 : remains unchanged okay and now we have
• 16:46 : to actually have completed this rule
• 16:50 : which means we have successively matched
• 16:53 : a number which is a non-terminal symbol
• 16:57 : so you can say yeah we have collected a
• 17:00 : number one the character one
• 17:03 : successfully parses a number and because
• 17:08 : of that we can actually go back in the
• 17:11 : previous column and find other rules
• 17:15 : where it's expecting a number as the
• 17:18 : next thing to come which for which there
• 17:20 : are two states in that in that situation
• 17:24 : both of these so for that reason we're
• 17:27 : gonna copy both of these rules over
• 17:29 : while moving the dot over so this dot
• 17:32 : originally at the starting point we're
• 17:35 : gonna move it over so we're gonna write
• 17:37 : number first the dot and then the plus
• 17:43 : sign and then time and also the the fact
• 17:47 : that this this rule started at index
• 17:51 : zero that does not change here so we're
• 17:54 : gonna copy that over although the
• 17:56 : position of the dot does change and
• 17:59 : we're gonna do the same with the next
• 18:02 : rule because that rule is also expecting
• 18:05 : a number which has been completed by
• 18:07 : this state over here in s1 so we're
• 18:12 : gonna say that gets carried over to here
• 18:21 : and so we are moving the dot over to
• 18:24 : after number and then also the fact that
• 18:30 : this this rule started at index zero
• 18:33 : also doesn't change here okay so very
• 18:36 : good and now we can see that oh we have
• 18:39 : completed the term as well which started
• 18:45 : at the beginning if there's some
• 18:47 : non-terminal symbol to the right of the
• 18:49 : dot in any state within this column then
• 18:53 : we have more expansion to do but in this
• 18:56 : case roughly don't and then I'm gonna
• 18:59 : continue with the next day so we have
• 19:01 : just parse the character one now we're
• 19:04 : gonna move to the next character oh by
• 19:07 : the way I said label the the symbols in
• 19:11 : the input so this is index 1 this is 2
• 19:17 : and this is 3 okay so so the next
• 19:22 : character to read from the input is this
• 19:25 : plus character and again we're gonna
• 19:29 : search through the column in in the
• 19:33 : previous column of the table s1 and up
• 19:36 : for a state which is expecting a plus
• 19:39 : sign immediately to the right of the dot
• 19:43 : which we conveniently find that in this
• 19:46 : rule here okay we get we're expecting a
• 19:50 : plus sign and we get it which is great
• 19:52 : because that means we can take this
• 19:55 : state copy it over and move the dot over
• 20:00 : and the fact that this rule started at
• 20:05 : state zero of the table that remains
• 20:09 : unchanged and then now we we have we
• 20:15 : have moved a dot over and now the next
• 20:18 : expected item in this production is
• 20:22 : actually another term are there any
• 20:25 : other rule that we can
• 20:27 : using this new character plus to scan no
• 20:30 : however we can expand this next term
• 20:34 : that's expected now because the this
• 20:37 : next term we can use the production
• 20:39 : rules of term to expand to figure out
• 20:42 : what is the next character or the next
• 20:46 : terminal rule
• 20:48 : yeah we're expecting so we're gonna go
• 20:52 : down the this production rule set and
• 20:55 : just copy them over so the first one is
• 20:59 : term expands to number plus and then
• 21:03 : time and we're gonna actually put the
• 21:05 : dot at the beginning because we're at
• 21:07 : the beginning of this term and this time
• 21:10 : we're gonna show that this state even
• 21:15 : though this rule is the same rule as
• 21:18 : this one but this day is different from
• 21:22 : this one because we're matching the Tron
• 21:27 : starting here we were trying to or
• 21:30 : expecting a next term after this plus
• 21:32 : sign we're starting at a different index
• 21:35 : of the input and that index is whatever
• 21:37 : the index we're dealing with right now
• 21:40 : which is two so this this this from
• 21:45 : index what does the index from which we
• 21:48 : have started trying to match the rule
• 21:51 : that differentiates this state in this
• 21:55 : state okay and so but that's not the
• 21:59 : only rule that would we need to
• 22:01 : instantiate because term can also expand
• 22:05 : to just a single number so you know so
• 22:07 : to copy that down as well and this one
• 22:11 : also started at index two but we're
• 22:13 : still not done
• 22:14 : because we can expand number as well our
• 22:17 : number is also a non-terminal symbol and
• 22:19 : we're expecting that next we expand all
• 22:22 : the non-terminal symbols until we're
• 22:25 : left with just the terminal symbols so
• 22:30 : I'm gonna look at the rule that can
• 22:32 : expand a number for which there's only
• 22:34 : one of them so I'm gonna copy that down
• 22:38 : and again that this cursor is going to
• 22:41 : be at the beginning of this one and
• 22:43 : again this rule we started matching it
• 22:46 : at index two okay all right so now we
• 22:52 : are ready to accept the next character
• 22:55 : from the input the last character of the
• 22:57 : input is 1 and and we're gonna search
• 23:02 : through the states in the last column of
• 23:06 : the table for a state that's expecting a
• 23:10 : number a yeah the digit is next thing so
• 23:16 : not that one not that one not that one
• 23:18 : that one right
• 23:20 : this one is expecting a digit as the
• 23:23 : next thing to come and because it
• 23:26 : matches what the input is the progress
• 23:29 : of this rule advances we can move this
• 23:33 : dot over like so but we don't draw it
• 23:37 : like that we copy it over to this new
• 23:40 : column in the table like so so the dot
• 23:49 : is at the end of this rule now and the
• 23:53 : fact that this rule started at index two
• 23:56 : that doesn't change and we're gonna copy
• 23:59 : that over and that's really the only
• 24:01 : rule where we were expecting a terminal
• 24:05 : symbol so we're done with that part but
• 24:08 : the next thing is to realize that oh
• 24:11 : we've completed this number symbol so
• 24:15 : because we completed this number symbol
• 24:17 : we can actually go back through the
• 24:20 : other states which were expecting a
• 24:23 : number symbol next where do we go back
• 24:26 : to to complete those rules we base it on
• 24:30 : this index number like where did this
• 24:32 : rule start well at index two we're gonna
• 24:35 : go back to the column two to find rules
• 24:40 : that were expecting a number symbol next
• 24:44 : and now we can resolve those because at
• 24:47 : this point we have completed a number
• 24:49 : symbol so looking
• 24:51 : in s2 we see actually two rules that
• 24:54 : were expecting a number as the next
• 24:56 : thing to come and we can actually move
• 24:59 : this dot over for both of them
• 25:01 : and this from index remains unchanged we
• 25:05 : just copy it over also gonna do it for
• 25:07 : this one okay
• 25:08 : so we have also you can see that well
• 25:13 : this rule we have this cursor in the
• 25:16 : middle of in the middle of this
• 25:19 : progression but this rule we have
• 25:22 : actually completed a term so we can
• 25:24 : actually again resolve the completion of
• 25:30 : a time and how do we resolve it again we
• 25:34 : look back to the from index of this
• 25:37 : state which is two so we go back to s2
• 25:40 : and find any rule that is expecting a
• 25:44 : term as the next thing to come
• 25:46 : immediately to the right of the cursor
• 25:50 : and we actually have a rule that matches
• 25:53 : that is this one here immediately to the
• 25:57 : right of the dot we have a time which is
• 26:00 : what we just completed so we can
• 26:02 : actually do the same thing again oh wait
• 26:05 : I didn't draw the arrow and so totally
• 26:07 : by completing this guy here we completed
• 26:12 : a term which is also gonna resolve this
• 26:15 : next term that was that this rule was
• 26:17 : expecting this rule this term rule
• 26:21 : actually started earlier it started at
• 26:25 : index zero all the way from the
• 26:27 : beginning so that's actually we're
• 26:30 : completing more work now so this rule
• 26:34 : will get copied over and we're gonna
• 26:37 : move the dot over and again we're gonna
• 26:41 : copy the from index over and at this
• 26:45 : point we have this dot at the end of
• 26:48 : this rule which means that we've
• 26:50 : completed this chunk yay
• 26:54 : and because we're out of characters from
• 26:58 : the input we have done we'll have done
• 26:59 : reading in Oh
• 27:01 : input at that point what you do is go
• 27:04 : through all the states in the last
• 27:06 : column of the table and find all of
• 27:09 : these states where the dot is at the end
• 27:13 : of the rule and the from index is zero
• 27:18 : and as you can see the only state where
• 27:22 : that is the case is this one and
• 27:24 : therefore this is actually a complete
• 27:27 : parse of all of the input and that's how
• 27:33 : the early algorithm works at least for
• 27:36 : this very simplistic example now this
• 27:40 : algorithm you might realize that it's
• 27:42 : kind of tedious to do by hand but
• 27:46 : luckily it's now Werdum that's why we
• 27:49 : can make a computer do this for us and
• 27:52 : again that does exactly what nearly Jas
• 27:55 : will do for us right so so what we can
• 27:59 : do is tell nearly Jas to compile this
• 28:03 : grammar so we can do it using nearly C
• 28:10 : taking in the nearly file and we'll tell
• 28:16 : it to output a J's file of the same name
• 28:20 : and after that and spit out this J's
• 28:23 : file which is a JavaScript
• 28:25 : representation of the same grammar and
• 28:28 : then with this J's file you can actually
• 28:31 : use the parser to parse an input and a
• 28:36 : easy way to do this without having to
• 28:39 : write up a JavaScript file
• 28:43 : it's used nearly test utility what you
• 28:47 : do is you give the JavaScript grandma
• 28:51 : JavaScript version of the grandma to in
• 28:53 : the early tests and then you define the
• 28:55 : input using the - I argument and then
• 29:01 : you can put in any input you desire I'll
• 29:05 : put in one plus one and here nearly will
• 29:10 : perform the parsing algorithm
• 29:13 : and actually show you each column of the
• 29:16 : table it actually caused each column of
• 29:20 : the table a chart actually chart is what
• 29:24 : I believe what J early calls it it's a
• 29:27 : chart based parsing algorithm and if I
• 29:32 : have done it correctly then this should
• 29:36 : the state of this chart should match up
• 29:40 : with what I did by hand here and if
• 29:43 : you're interested in experimenting with
• 29:45 : grammars of your own you can write
• 29:49 : grammars of your own and test it like
• 29:51 : this and and see what chart output you
• 29:58 : how you understand how well you
• 30:01 : understand early algorithm so that that
• 30:04 : is my introduction of the area algorithm
• 30:06 : I hope you enjoyed it if you do want to
• 30:09 : get into more detail about the algorithm
• 30:13 : there's actually quite a bit of