:use-module lr0.scm
[N.B. If all you want to do is parse, this section is particularly irrelevant. But if you want to hack parsing tools, this section may be useful.]
A grammar implies a set of grammar items which can be interpreted as states in an NFA. An item corresponds to a choice of one production from the grammar, and a position within that production.
For example, if a production is:
A -> B C D aka (a (b c d) . <reduction rule>)
Then the items are:
A -> * B C D aka ((a (b c d) . <reduction rule>) b c d) A -> B * C D aka ((a (b c d) . <reduction rule>) c d) A -> B C * D aka ((a (b c d) . <reduction rule>) d) A -> B C D * aka ((a (b c d) . <reduction rule>))
An empty production has exactly one item:
A -> * aka ((a () . <reduction rule>))
If an item has nothing to the left of the star, it is called an initial item.
If an item has nothing to the right of the star, it is called an final item.
An item says what we think we are parsing (an A, in the above
examples) and what we expect the next bit of input to be (for example,
in the item A -> * B C D, we next expect to parse a B).
These items are states in an NFA, the LR(0) NFA, with edges defined between them.
The first item in the grammar is conventionally the start state, although any initial item can be used for this purpose.
Final items are final NFA states. A final state implies that an entire expansion has been seen and now a reduction of that expansion is possible. Final states have no outward edges.
An NFA state (item) has two kinds of edges corresponding to whether the symbol to the right of the star is a terminal or non-terminal.
Terminal edges indicate that a particular type of input token is expected. If the lookahead is of that type, it can be shifted onto the parse stack whie crossing the terminal edge to a new state.
Non-terminal edges indicate a point where a nested construct should be
parsed. If in state I the symbol following the star is non-terminal
B, then there is an epsilon edge to all initial items associated
with productions of B. In case one of those epsilon edges leads
to a correct parse, there is also an edge labeled B that can be
used to shift the B generated by a reduction.
For example, in the state:
A -> w * B u
it is legal to shift the non-terminal B (this is sometimes called a "goto" move).
It is also legal to make an epsilon transition to states:
B -> * w B -> * u ...
presuming the corresponding productions exist. Informally: you're
allowed to absorb the next burst of input as a reduction to B,
therefore, you are also allowed at this point to begin working on
a parse of B.
The rule applies transitively so that if A -> w * B u epsilon
transitions to B -> * C v, that in turn epsilon transitions
to C -> * z.
To summarize: The lr(0) NFA has final states which specify reductions, terminal transitions, which specify permissible input, non-terminal transitions, which specify permissible feedback from the parsing of nested constructs, and epsilon edges, which specify points at which nested constructs may begin.
The above describes all the states, designates the start and final states and defines the transition function of a non-deterministic automata. The epsilon edges associated with non-terminal transitions prevent the automata from being deterministic.
But the lr(1) parser requires a deterministic lr(0) automata So the code in this file not only must construct the lr(0) NFA, but it must convert that NFA into a DFA.
To parse deterministically, we apply a few stern measures. First, we combine all states accessible by epsilon transitions into superstates. Second, when components of a superstate disagree about whether to shift some input or reduce because of it, we favor shifting. Third, we only permit reductions if we can guarantee that after a series of reductions, the look-ahead token will be shifted. Fourth, if the components of a superstate disagree about which reduction rule to apply, we pick whichever one we discover first (clearly this could be improved). Fifth and finally, if only some components of a superstate think the look-ahead token is an error, we ignore them (if all components agree the look-ahead is an error, then so do we).
Its useful to prove that a particular grammar has the property that components of an achievable superstate will never disagree about what reduction rule to apply. This can be proven automatically. One convenient technique for lalr(1) grammars is to convert them to yacc syntax and run a table generator on them. Some day, grammar debugging tools will be added to this library.
In the usual way, we can construct a DFA from an LR(0) NFA, combining multiple NFA states accessible along epsilon paths into singular DFA superstates.
Only the initial items of start-symbol productions
are included in this set. Compose with item-set-closure
to also obtain all items reachable by epsilon transitions.
is', the state that follows is after shifting s.
#f is returned if s can not be shifted from state is.
Only the items which are direct successors of items in is are
present in the successor kernel. To also include all items reachable by
epsilon transitions, compose this function with item-set-closure.
In the LR(0) NFA, if we are at the state:
A -> w * B v
Then we are also an epsilon transition (not to be confused with an empty reduction) away from also being in the items:
B -> * u B -> * r B -> * s ...
So if a superstate contains A-> w * B v, then it should also
contain the inital states of B. This rule applies transitively.
If is was returned by item-set-start-kernel then this
procedure returns the DFA start state. If is was returned by
item-set-successor-kernel, and the set argument to
item-set-successor-kernel was a DFA state, then this procedure
returns a DFA state.
Items and items sets are an abstract data type whose representation is
hidden from the definitions of the five basic superstate functions,
item-set-successor, item-set-start-kernel,
item-set-reductions, item-set-closure, and
item-set-nullable?.
The abstract data type is encapsulated in the definitions listed below:
The list is computed by looking at the production symbol of each production.
The list is computed by looking at all the symbols used in expansions, and subtracting out the non-terminals.
Go to the first, previous, next, last section, table of contents.