Dec 19, 2017
 
Grammatic analyser for C++.
File GRAM5A.ZIP from The Programmer’s Corner in
Category C Source Code
File Name File Size Zip Size Zip Type
AUTODOC5.TXT 142683 41397 deflated
C5.EXE 60132 21380 deflated
C5.Y 26902 5986 deflated
CPP5.EXE 153360 41113 deflated
CPP5.L 15576 5305 deflated
CPP5.Y 88976 20213 deflated
FREEGRM5.TXT 23325 9054 deflated
GRAMMAR5.TXT 65305 19909 deflated
SKELGRPH.C 13697 2620 deflated


Contents of the AUTODOC5.TXT file


Copyright (C) 1991, James A. Roskind, All rights reserved.
Abstracting with credit is permitted. The contents of this file may
be reproduced electronically, or in printed form, IN ITS ENTIRETY
with no changes, providing that this copyright notice is intact and
applicable in the copy. No other copying is permitted without the
expressed written consent of the author.

FILENAME: AUTODOC5.TXT
Last modified: 7/4/91


Written by:

James Roskind
516 Latania Palm Drive
Indialantic, FL 32903-3816
(407)729-4348 or (617)290-4990 x5570
[email protected]


WARNING: This file contains many lines that are much wider than 80
columns.



ACKNOWLEDGEMENT

Without the wonderful work of Bob Corbett, and his public domain
program "Berkeley Yacc", much of the experimentation leading to this
automatic documentation of yacc grammars could not have been done. As
acknowledged by Corbett, byacc is:

"based on the excellent algorithm for computing LALR(1) lookaheads
developed by Tom Pennello and Frank DeRemer. The algorithm is
described in their almost impenetrable article in TOPLAS 4,4."

Pennello and DeRemer's paper, which I believe to be *excellently*
written, was also pivotal to the production of the explanations of
conflicts. The paper's methods are basically taken and applied. In
addition, Tom Pennello (MetaWare Incorporated, Santa Cruz CA) took
pity on my efforts to provide a standard syntax for C++, and donated a
copy of a portion of MetaWare's incredible "Translator Writing System"
(the parser generator portion) to assist me. Certainly seeing what
could be done added greatly to my efforts and confidence that grammar
conflict analysis was basically automatable.



OVERVIEW

I have created a set of tools to analyze and document grammars, and
the output of those tools (when processing my YACCable C++ grammar) is
the enclosed "y.output" file. This paper describes the information
included in the attached "y.output" file, which provides extensive
"machine generated documentation" for the YACCable C++ grammar. In
addition, it motivates the presentation of such information by
discussing a number of subtle distinctions between LR and LALR
parsers, with a special focus on parsing conflicts.

If you hate to read human-written-documentation (such as this paper),
then jump right in and look at the "y.output" file that I have
included in this release. The sections each begin with a control L
(^L), and can easily be located by a text editor. Most of the
sections of the y.output are rather self-explanatory, and their
usefulness is rather obvious. Note however, that this document
contains many interesting discussions of conflicts and resolutions in
LALR parser generators, and it may be worth reading for that content
alone.

If you are out to carefully understand every conflict still present in
my C++ grammar, then you will probably have to read at least some of
the sections of this document that pertain to conflict analysis. If
you are really hardy, you could probably decode my conventions, but
spoon feeding via this document is probably the easiest approach. As
it is, the concepts contained in this document (at least with regard
to conflict analysis) are generally far from trivial.

The good news is that a vast amount of the machine generated
documentation in "y.output" is targeted at making it easier to read
and understand the grammar, and this information is *completely*
independent of the conflict analysis sections. Feel free to ignore
sections of the automatic documentation that are confusing, just as
you should feel free to skip sections of this paper that are too
simplistic (or too confusing). If all you want to do is read and
understand the grammar, then look at the original grammar (which has
comments), as well as the "Symbol and Grammar Cross Reference" found
in "y.output." If all you want to do is read about automated analysis
of conflicts, and LR vs LALR subtleties, then you should proceed to
Chapter 3.

Comments suggesting sections of this document that should be expanded
upon, or clarified, are *more* than welcome. Similarly, suggestions
for expanding the analysis that is provided in the y.output file are
also welcome. A list of things that I look forward to doing in the
future are provided at the end of this document (see chapter 4).

Finally, as a disclaimer: I don't have a PhD in parser generators, and
I haven't dedicated my life to writing grammars. I may be incorrect
in my evaluation of how parsers work, and how parser generators
generate. To date, my views have proved useful (i.e., I have written
several grammars). If my philosophizing about the causes and cures of
conflicts is off base, and you *have* dedicated your life to writing
grammars, or have received multiple PhD's in parser generation
technology, please send in the corrective argument. I would like to
make this document better. Similarly, if what I say is obviously
wrong to even your 2 month old child, I would also like to hear about
it (but you can write on behalf of your child). This request extends
all the way to typos and English-grammar corrections.


TABLE OF CONTENTS

1 INTRODUCTION TO the YACC CROSS REFERENCE SYSTEM

2 DETAILED DISCUSSION OF TABLES
2.1 Reference Grammar
2.2 Alphabetized Grammar
2.3 Sample Expansions for the Non-terminal Tokens
2.4 Summary Descriptions of the Terminal Tokens
2.5 Symbol and Grammar Cross Reference for the Tokens
2.6 Sample Stack Context and Accessing Sentences for each State
2.7 Concise list of Conflicts
2.8 Canonical Description of Conflicts
2.9 Verbose listing of state transitions
2.10 Explanations for all reductions suggested in conflicts

3 CONFLICT ANALYSIS Via the YACC CROSS REFERENCE SYSTEM
3.1 LR(1) Parsing
3.1.1 Ambiguous grammars
3.1.2 LR(1) ambiguous grammars
3.1.2.1 Removing LR(1) conflicts from unambiguous grammars
3.1.3 LALR-only ambiguous grammars, and LALR-only conflict components
3.1.3.1 LR parsers vs parsers generated by LALR parser generators
3.1.3.2 Sample LALR-only ambiguous grammar
3.1.3.2.1 Analysis of the sample LALR-only ambiguous grammar
3.1.3.2.2 Sample Removal of LALR-only conflict via state splitting
3.1.3.2.3 Sample Removal of LALR-only conflict via grammar augmentation
3.2 Interpreting Conflict Analysis Documentation
3.2.1 Conflict Reduction Demonstrations
3.2.2 Summarizing Reduction Contexts
3.2.3 Automatic LALR-only Conflict Summaries
3.2.3.1 Direct Pruning of LR conflicts From the Context Tree
3.2.3.2 Pruning of LR conflicts From the Context Tree using State Information
3.2.3.3 LALR-only conflicts With No Consequences
3.2.3.4 Significant LALR-only conflicts
3.2.4 Augmentation Proposals to Remove LALR-only Conflicts

4 FUTURE ENHANCEMENTS

A.1 Appendix A: Sample LALR-only ambiguous grammar: automatic
documentation






1 INTRODUCTION TO the YACC CROSS REFERENCE SYSTEM


The automatically generated documentation provided for the grammar
consists of 10 parts:

Reference Grammar

Alphabetized Grammar

Sample Expansions for the Non-terminal Tokens

Summary Descriptions of the Terminal Tokens

Symbol and Grammar Cross Reference for the Tokens

Sample Stack Context and Accessing Sentences for each State

Concise list of Conflicts

Canonical Description of Conflicts

Verbose listing of state transitions

Explanations for all reductions suggested in conflicts


Each section is meant to aid in reading, understanding and using a
grammar. This introduction provides a brief overview, and the
sections that follow describe in more detail the contents and use of
each of the automatically generated sections.

The first section is the "Reference Grammar." This section contains
the grammar as it was entered by the user, with intermediate actions
expanded to $$n non-terminals. This is basically a de-commented
printout of the grammar, with each production numbered.

The most obviously useful reading tool is the "Alphabetized Grammar."
For most folks that have had to look at a grammar, one of the most
painful aspects is trying to find out where a non-terminal token was
defined (If the reader is unlucky, then the non-terminal may have
several derivation rules in different sections of the grammar, and a
real nightmare ensues). This section gathers together all rules that
have a common left-hand-side, and lists them alphabetically ordered by
the left hand side. Within each group of productions, the rules are
also alphabetized symbol by symbol. Without this reference, it is
basically impossible to "read" a large grammar without a text editor.

A second problem that presents itself when trying to read a grammar is
trying to figure out what a non-terminal name really means. If the
author worked very hard, and the reader is in the same frame of mind,
then the names may have obvious meanings. Rarely is this consistently
true. The "Sample Expansions for the Non-terminal Tokens" section
provides a minimal length sample sequence of terminal-tokens that each
non-terminal could expand to. Such a list can quickly give insight
into what the author had in mind for each non-terminal name. This
list is also short enough that it is easy to print, or view in a
single window.

Another question that comes up is what exactly are the terminal
tokens? If a grammar is written using some nice convention, such as
capitalizing all terminal tokens, then this is not so difficult. On
the other hand, even if such a convention is followed, there are
several other questions such as the number of times the token was
used, what its precedence and associativity was set to, etc. The
"Summary Descriptions of the Terminal Tokens" provides all this
information at a glance.

Having provided all the basic reading tools, the "Symbol and Grammar
Cross Reference for the Tokens" ties it all together. Basically this
is a complete alphabetized grammar, with sample derivations for every
rule. As you are reading a grammar, this integrated sample/grammar
listing greatly reduces the burden of tracking down why a specific
rule was added to the grammar, and what it did to fill out the
language. You could always just read (the more compact) alphabetized
grammar, and refer to the symbol expansions (listed elsewhere), but
this integrated format provides a concise view of each production
(including a sample terminal sentence for every rule).

Reading a grammar is not all there is to producing and debugging a
grammar. The "Sample Stack Context and Accessing Sentences for each
State" section provides the first step at understanding what each
state in the parser really "means." Rather than providing just the
classical "verbose output", which shows local state information (and
is provided in the next section), this listing provides a global view
of each state by specifying the context in which each state is
entered. Historically, this information was gleaned by painfully
walking backwards through the verbose output. In this concise listing,
most of the state information is available for easy perusal. As it
turns out, because of the other integrated information, this table
typically finds little use, as does the classical verbose-listing
provided by a yacc (provided in a later section). Critical
information related to conflict states is extracted and used in other
tables automatically.

When developing a grammar, the most significant section of the
classical verbose listing is the core description of the states which
*have* conflicts. Rather than having to page through a typically
massive verbose-listing, the state descriptions for each conflict
state is shown in the "Concise list of Conflicts." Here again,
although this was historically the starting point for conflict
analysis, the usefulness of this information is greatly reduced by the
other integrated tables and listings. For a typical yacc hacker, this
listing will be a concise, familiar, and welcome sight.

The next table is the "Canonical Description of Conflicts." The major
use of this table is to a grammar developer (i.e., author), who must
compare conflicts in one version of a grammar to those found in
another. As the grammar is modified, the state numbering can vary
wildly, and the required regression comparisons were historically
tedious and delicate. Some parser generators have even been satisfied with
listing the counts of conflicts, and encouraging the user to "fear
not" when the number hasn't changed (re: the "%expect" construct).
This canonical list can be used to automatically identify which
conflict has been added or removed by any change, and is generally not
affected by restructuring or rule additions in the grammar. A minor
secondary use of this brief description format is to show complete
sentences (programs) that will induce the conflict scenarios. (Note
that although the conflict states are quickly defined in this section,
the more meaningful "conflict contexts" are analyzed in a later
section).

No automatically generated documentation for a parser generator would
be complete without the "Verbose listing of state transitions."
Historically, this was all the grammar developer (using traditional
yacc) was given to evaluate conflicts. In some sense, *all* the
information is present in this set of tables, but the difficulty of
deciphering it is known near and far. When it is necessary to
understand the exact working of the parser, this table is almost
irreplaceable. With a graphical parse tree generator (such as is
provided elsewhere in this package), most users will find it easier to
run the tree generator and observe the results than to walk
through the yacc tables by hand. If the grammar describes "bing bing
bong" (see the original papers describing yacc), then this verbose
output is sufficient. If the grammar describes a real language, get
the graphical parse tree generator working :-).

Finally, what all grammar hackers want in life is an explanation of
the conflicts in a grammar. The section entitled "Explanations for
all reductions suggested in conflicts" tries to provide this
information in a clear and concise form. My interactive yacc product
allows the user to selectively view explanations of conflicts, but
this verbose table provides a complete listing. The amount of data in
this table is generally more than should be printed, but it can be
examined relatively quickly using an editor. For every state where
there is an alternative to performing a reduction, each reduction is
listed, and the contexts in which the reduction can be performed are
exhaustively described. At the end of each section describing the
conflict in a state, a summary is produced and analyzed. One lovely
consequence of this section is an understanding of LALR-only conflicts
(conflicts that would not be a problem for an LR parser), and a proof
of their presence or absence from the grammar. This table is critical
to doing any conflict analysis, and can reduce many hours into moments
when trying to understand the cause and significance of a conflict.


2 DETAILED DISCUSSION OF TABLES


2.1 Reference Grammar

The reference grammar is, as stated earlier, a complete printout of
the grammar, in exactly the order provided by the author. Note that
every rule is numbered for reference, and all intermediate actions
have been expanded into yacc generated non-terminals. For example, an
original entry in the grammar such as:

bit_field_declarator:
bit_field_identifier_declarator
| TYPEDEFname {} ':' constant_expression
;

is printed as:

bit_field_declarator
: bit_field_identifier_declarator /* (379) */
;

$$12
: /* (380) */
;

bit_field_declarator
: TYPEDEFname $$12 ':' constant_expression /* (381) */
;

Note that the yacc generated non-terminal $$12 is used in rule number
381, in the place of the intermediate action "{}" in the original
grammar. Also note that the null derivation for rule 380 was placed
directly before the user supplied rule 381, which had the side effect
of splitting the list of derivations for "bit_field_declarator" into
two groups.

All rule listings are carefully constructed to be ready to serve as
input to yacc, and hence this rule listing can actually be used as a
"pretty print" of the original grammar. The one restriction is that
"$$12" is typically not a valid identifier for yacc, and hence an
editor would have to do a global replace of "$$" with something like
"dollar_dollar" before trying to yacc this "pretty print" form.
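The "$$" rename just described can be done mechanically before feeding the listing back through yacc. A minimal sketch (done in Python rather than an editor macro; the sample text is a fragment of the listing shown above, reproduced here for illustration):

```python
# Rename yacc-generated "$$n" non-terminals so that the reference
# grammar listing becomes legal yacc input: every "$$" turns into
# "dollar_dollar" (e.g. "$$12" -> "dollar_dollar12").
def rename_dollar_nonterminals(listing):
    return listing.replace("$$", "dollar_dollar")

sample = ("bit_field_declarator\n"
          "    : TYPEDEFname $$12 ':' constant_expression  /* (381) */\n"
          "    ;\n")
print(rename_dollar_nonterminals(sample))
```

Any global search-and-replace in an editor achieves the same effect.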

Comments and action code segments are omitted from this printout, but
the result is a more concise and readable output.


2.2 Alphabetized Grammar

The alphabetized grammar is basically in the same format as the
reference grammar. There are two major differences: 1) the listings
are alphabetized, and hence all derivations for a given left-hand-side
are collected together; 2) Intermediate actions are printed in a more
natural format, and yacc generated intermediate non-terminals are not
listed.

To demonstrate the distinction with regard to non-terminals, consider
the author supplied rule:

bit_field_declarator:
bit_field_identifier_declarator
| TYPEDEFname {} ':' constant_expression
;

The derivations for "bit_field_declarator" are listed as a group:

bit_field_declarator
: bit_field_identifier_declarator /* (379) */
| TYPEDEFname { /* $$12 */ } ':' constant_expression /* (381) */
;

Notice that the intermediate action is presented as a bracketed
comment "{ /* $$12 */ }", rather than simply a yacc non-terminal
"$$12." This offers a slight advantage in readability to a novice
reader, and makes the alphabetized grammar format acceptable for
direct input to yacc. Hence this form can even more easily be used as
a pretty print formatter for a yacc grammar. Also notice that, since
the grammar is alphabetized, all the null derivations for the yacc
generated non-terminals $$n would otherwise collect at the start of
the grammar listing and provide no real information (it is already
known that all these $$n productions are null derivations); they are
therefore omitted.

With regard to alphabetizing within a group of derivations, note that
rules are also sorted by their right-hand-side entries, to make it
easier to locate a specific rule. For example, the derivations in the
original grammar:

basic_type_name:
INT
| CHAR
| SHORT
| LONG
| FLOAT
| DOUBLE
| SIGNED
| UNSIGNED
| VOID
;

are listed alphabetically:

basic_type_name
: CHAR /* (285) */
| DOUBLE /* (289) */
| FLOAT /* (288) */
| INT /* (284) */
| LONG /* (287) */
| SHORT /* (286) */
| SIGNED /* (290) */
| UNSIGNED /* (291) */
| VOID /* (292) */
;

Caution should be taken when using this output as an input for yacc.
Note that some of yacc's disambiguating rules resolve conflicts based
on rule ordering. Hence, "alphabetizing" a grammar may disrupt this
process.
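The two-level ordering described in this section (groups sorted by left-hand side, and rules within a group sorted symbol by symbol) can be sketched as follows. The rule list is a toy fragment drawn from the examples above, not the real grammar:

```python
from collections import defaultdict

# Alphabetize a grammar: group rules by left-hand side, sort the
# groups by LHS name, and sort the rules within each group symbol by
# symbol on the right-hand side.
def alphabetize(rules):
    groups = defaultdict(list)
    for lhs, rhs in rules:
        groups[lhs].append(rhs)
    return [lhs + " : " + " ".join(rhs)
            for lhs in sorted(groups)
            for rhs in sorted(groups[lhs])]   # lists compare symbol by symbol

rules = [
    ("basic_type_name", ["INT"]),
    ("basic_type_name", ["CHAR"]),
    ("basic_type_name", ["SHORT"]),
    ("abstract_declarator", ["unary_abstract_declarator"]),
]
for line in alphabetize(rules):
    print(line)
```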


2.3 Sample Expansions for the Non-terminal Tokens

This section presents the shortest series of terminal tokens that can
possibly be derived from each non-terminal. It also gives a usage
count for each user-defined non-terminal (note that yacc-defined
non-terminals are not included in this listing).

For example, some entries in a typical table are:

Name Usage count Minimal expansion

abstract_declarator 17 '*'
access_specifier 3 PUBLIC
... ... ...
sue_declaration_specifier 4 EXTERN STRUCT IDENTIFIER
...

One column presents the number of times the non-terminal is used in
the grammar (re: usage count), while the far right column presents the
list of terminal tokens that could be derived from the non-terminal
listed on the left. Specifically, we can tell from this table that
"abstract_declarator" is used 17 times in the grammar, and that it
"always derives something" (i.e., it never derives epsilon, the null
string). Further, we can tell that the shortest sequence of
terminal tokens that it does derive has 1 terminal in it. An example
of such a sentence that can be derived from "abstract_declarator" is
the sequence "'*'" . An interested reader could, by looking at the
grammar, find the critical rules that prove this fact:

abstract_declarator : unary_abstract_declarator (618)
unary_abstract_declarator : asterisk_or_ampersand (626)
asterisk_or_ampersand : '*' (634)

The "Symbol and Grammar Cross Reference for the Tokens" described in a
later section, makes it trivial to find such series, but the easier
thing is to assume that some derivation exists, and that is why it is
listed.

The "minimal" expansion is generally not unique, but there is never a
shorter expansion. The selection of a unique expansion when there is
a "tie" for a length is based on using the expansion that is provided
by the lowest numbered rule.

If it is possible to derive the empty sentence (epsilon) from the
non-terminal, then the minimal expansion area is left empty to
demonstrate this fact.
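The table described in this section can be computed by a simple fixpoint iteration over the rules. A minimal sketch, using the abstract_declarator fragment quoted above (the extra "'&'" rule is illustrative, following the note elsewhere in this paper that an ampersand is also derivable):

```python
# Fixpoint computation of the shortest terminal expansion of each
# non-terminal.  The strict "<" keeps the first (lowest numbered) rule
# on a length tie, and a nullable non-terminal keeps an empty list.
def minimal_expansions(rules, nonterminals):
    best = {}                       # nt -> shortest known expansion
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:      # rule order breaks ties
            if all(s not in nonterminals or s in best for s in rhs):
                exp = [t for s in rhs
                         for t in (best[s] if s in nonterminals else [s])]
                if lhs not in best or len(exp) < len(best[lhs]):
                    best[lhs] = exp
                    changed = True
    return best

rules = [
    ("abstract_declarator", ["unary_abstract_declarator"]),    # (618)
    ("unary_abstract_declarator", ["asterisk_or_ampersand"]),  # (626)
    ("asterisk_or_ampersand", ["'*'"]),                        # (634)
    ("asterisk_or_ampersand", ["'&'"]),                        # illustrative
]
nts = {lhs for lhs, _ in rules}
print(minimal_expansions(rules, nts)["abstract_declarator"])   # ["'*'"]
```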


2.4 Summary Descriptions of the Terminal Tokens

This section lists, alphabetically, all the terminal tokens (i.e.,
tokens that are never found on the left-hand-side of a production, and
are typically supplied directly by the lexical analyzer). Typical
entries look like:

Name Usage Count Value Precedence Associativity

'!' 2 33 0 -
'%' 2 37 0 -
'&' 3 38 0 -

As with the non-terminals, the first column presents the number of
times each terminal was used in the grammar (usage count). The number
shown in the next column is the value that is provided in the y.tab.h
file for communication with the lexer. Typically the y.tab.h file
will show this list of terminals ordered by their value, which is
determined primarily by their order of declaration to yacc.

If the %prec directive was used in the grammar, then individual tokens
would acquire a numerical precedence starting with 1, and increasing
with each new directive. Precedence is one of several methods that
yacc employs to disambiguate conflicts.

The column titled "Associativity" has entries that are either
undefined ("-" as shown), or "Right", "Left", or "Non" (the last of
which signifies "Non-associative"). Tokens acquire this associativity
setting when %right, %left, and %nonassoc directives are encountered
in the grammar.
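The way tokens accumulate precedence levels and associativity, as described above, can be sketched as follows. The directive list is hypothetical, and the numbering (levels starting at 1, increasing with each new directive) follows the description in this section:

```python
# Build a token summary table from precedence/associativity directives:
# each new directive line opens a new, higher precedence level, and
# every token on that line gets the line's associativity.
def build_token_table(directives):
    table = {}
    for level, (assoc, tokens) in enumerate(directives, start=1):
        for tok in tokens:
            table[tok] = (level, assoc)
    return table

directives = [
    ("Left", ["'+'", "'-'"]),
    ("Left", ["'*'", "'/'"]),   # declared later, so binds tighter
]
print(build_token_table(directives)["'*'"])   # (2, 'Left')
```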


2.5 Symbol and Grammar Cross Reference for the Tokens

As mentioned in the introduction, the cross reference table ties
together almost all of the other tables of information. The cross
reference is organized exactly as was the alphabetically listed
grammar, except there is additional information about each token and
each rule. In addition, the non-terminals are merged into this
alphabetized list, and information is provided about them as well.

A typical entry for a terminal symbol in this table would look like:

AUTO
TERMINAL, used in 1 rule value = <275> re: storage_class : AUTO (278)

The above text indicates that the token "AUTO" is indeed a "TERMINAL"
token, and the usage count is given as "1". Whenever the usage count
is exactly 1 (as it is in this case), the rule that uses the token is
presented at the end of the line:

re: storage_class : AUTO (278)

Also note that the value is presented here as being "275". If there
had been any precedence or associativity settings for this terminal,
they would also have been listed. As a result, this table has all the
information about terminals that was provided in the earlier section,
as well as a bit more (i.e., which rule a terminal is used in when it
is only used once). The distinction is that in this table the
terminals are alphabetically interspersed with the non-terminals.

A typical entry for a non-terminal symbol in this table would look
like:

abstract_declarator
NON-TERMINAL, used in 17 rules
abstract_declarator : postfix_abstract_declarator (619)
'(' '*' ')'

abstract_declarator : postfixing_abstract_declarator (620)
'[' ']'

abstract_declarator : unary_abstract_declarator (618)
'*'


The first line indicates that the symbol "abstract_declarator" is
being described, and the second line asserts that it is indeed a
non-terminal. Here again the symbol's usage count is presented, and
if it were only used in a single rule, that rule would be listed (as
was done for terminal symbols).

After this overview, all of the rules that derive the non-terminal are
listed exactly in the same order as was provided for the alphabetized
grammar. The line after each rule shows the minimal terminal-token
sequence that the right-hand-side of each rule can ever expand to.
Hence, looking at the above example, we see that when
"abstract_declarator" is derived from rule 619, then it would have at
least 3 tokens, and one example is the sequence "'(' '*' ')'". Notice
that it is quickly apparent that the shortest sequence possible for an
abstract_declarator is produced via rule 618, and is simply "'*'".

As was discussed in the section on minimal expansions for non-terminal
symbols, these sequences are not generally unique. In fact, some
exploration would reveal that "unary_abstract_declarator" can also
derive the 1 long terminal sequence "'&'" as well as "'*'" which is
shown above. However, the lengths of the sample sentences are always
as short (in terms of terminal token count) as possible.

For reading grammars, this generic minimal expansion view of
non-terminals and rules is critical to understanding what a rule
"really" does, and what a symbol "really" can mean. If you are trying
to read a new grammar, you are advised to either read this
cross-reference first, or have it handy as you read the author's
commented version of the grammar. Having this information on hand can
save quite a bit of time trying to explore all the nuances of a
grammar, and identifying what each non-terminal name stands for.


2.6 Sample Stack Context and Accessing Sentences for each State

States in an LR parser are entered from prior states based on the look
ahead symbol and (typically) a parsing table (this is just what your
run-of-the-mill table driven LR parser does). However, it is not in
general possible to transition from any state to every other state,
and the table defines which transitions are possible. Reversing this
process, when the parser is in any state, there is only a small
fraction of the total states that the parser could have been in
immediately before entering this state. This "history" of prior
states, which is generally represented as a push-down-automaton (a
stack of states), provides context information for each state.

One very interesting question (which is answered by this table) is
what is the smallest yacc stack that can reach each state? To put it
another way, what is the shortest list of transitions that will
progress from the start state (in which the stack is empty) to any
specific state in the parser? As pointed out, state 0 (the internal
start state) can only directly transition to a small subset of the
possible states. Each of those states can then only transition to a
small number of possible states, etc. Eventually, all states are
reached, and there emerges a "shortest path" from the start state to
each state.

The significance of these "shortest paths" is that for each
transition, a corresponding symbol is "pushed" onto a symbol stack.
Hence for any path to a given state, the stack can identify a list of
tokens (terminals and non-terminals) that have been "seen" as the
parser moved into the state in question. Since we know the shortest
derivation for every symbol (it was presented in an earlier table for
all non-terminals, and terminals are "their own" shortest terminal
expansion), we can translate a minimal path into a minimal (in some
sense) series of terminal-tokens that should carry us from the start
state to any state in the parser. For any state, a series of terminal
tokens that carries the parser into a target state is called an
"accessing sentence" for that state. This table provides a list of
minimal stack contexts, and translates them into accessing sentences.

A typical entry in the table looks like:


state 834: 1 incoming, 1 outgoing states; entering via state 655:
Minimal stack: 0 1 52 241 423 ! 655 834
symbols: $start translation_unit identifier_declarator '{' statement_list_opt ! GOTO label
sentence: $start IDENTIFIER '{' ! GOTO IDENTIFIER


The above entry indicates that there is exactly one state that can
transition to this state ("1 incoming"), and there is only one state
that this state can transition to ("1 outgoing state"). Note that in
addition to transitioning, each state has the option of "reducing" the
stack back to a prior state, so there are no "dead ends."

Typically there are several states from which to enter a given state,
and the "minimal stack" must be constructed from one of them. In
general, the table will select one of them as leading to the optimal
(i.e., minimal stack size) context, and that one will be listed at the
end of the first line. In this case, there is only "1 incoming"
state, and that is state 655. So we see that every stack context for
state 834 contains state 655 directly beneath it.

The second line contains the list of states that are found to be
minimal in depth, and yet transition from state 0 to this target
state. Note that amidst the list of 7 states ("0 1 52 241 423 ! 655
834") is an exclamation mark. This mark indicates the point at which
the automated documentation system had several alternatives for prior
states, and it chose one of them. Hence we can see in another format
that given state 834 is atop the stack, state 655 *must* be
beneath it. In contrast, state 423 is *not* the only possible state
beneath that state 655, and so the "!" indicates that a choice was
made.

Given the minimal stack provided on the second line, each of these
states has a symbol associated with it. Specifically, state 0 (the
start state) transitions to state 1 after seeing a "translation_unit"
(the first significant entry on line three). State 1 transitions to
state 52 after seeing an "identifier_declarator" (the next entry on
line three). State 52 transitions to state 241 after seeing "'{'",
etc. Hence we see that line three contains the series of symbols that
can minimally transition from state 0 to state 834.

Finally, we can list the minimal expansion for each of the symbols
shown on line three. All sentences begin with $start, and so we
initially have only that:

sentence: $start

Since "translation_unit" has a minimal derivation of the null string,
we continue by recording only the start token on line four:

sentence: $start

Next we note that "identifier_declarator" has a minimal derivation of
"IDENTIFIER", and so we list that:

sentence: $start IDENTIFIER

Next we note that "'{'" is a terminal, and hence has itself as a
minimal expansion, so we list:

sentence: $start IDENTIFIER '{'

Next we examine "statement_list_opt" and find that it has a null
expansion, and hence nothing is added. However, since there is a "!"
to the right of "statement_list_opt", we can record that and make the
connection between the two lists more apparent:

sentence: $start IDENTIFIER '{' !

Continuing with line three we see that "GOTO" is a terminal, and hence
is its own minimal expansion, so we add it to our sentence:

sentence: $start IDENTIFIER '{' ! GOTO

Finally, we examine the last symbol on line three, which is the
non-terminal "label". Since label has a minimal expansion of
"IDENTIFIER", we can add that, and form our complete accessing
sentence:

sentence: $start IDENTIFIER '{' ! GOTO IDENTIFIER

So we see that state 834 can typically be reached (assuming all the
reductions are really usable) by entering the program prefix:

IDENTIFIER '{' GOTO IDENTIFIER

Translating this into a typical program:

foo { goto loop ....

A bit of thought about this reveals that state 834 has a context which
is typically inside of a function, and just after a "goto name" has
been seen. If you wonder what "foo" is, it represents the function
name. The reason why a more common fragment didn't emerge such as:

main ( ) { goto loop

is that "main ( )" (i.e., "IDENTIFIER '(' ')'") is another possible
"identifier_declarator", but it was not as short as the one-token
sample "IDENTIFIER". The syntax does not require the "function
returning" modifier, as this is a constraint, and not a syntax issue.
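The symbol-by-symbol construction described above can be sketched as a
short program. The expansion table below is hand-copied from this one
example (a real tool derives it from the whole grammar):

```python
# Minimal expansions for the symbols on the accessing path to state 831.
# This table is hand-derived from the example above, not computed.
MINIMAL_EXPANSION = {
    'translation_unit': [],              # minimally derives the null string
    'identifier_declarator': ['IDENTIFIER'],
    "'{'": ["'{'"],                      # terminals expand to themselves
    'statement_list_opt': [],
    'GOTO': ['GOTO'],
    'label': ['IDENTIFIER'],
}

def accessing_sentence(symbols):
    """Concatenate the minimal expansion of each symbol on the path."""
    sentence = ['$start']
    for sym in symbols:
        sentence.extend(MINIMAL_EXPANSION[sym])
    return sentence

path = ['translation_unit', 'identifier_declarator', "'{'",
        'statement_list_opt', 'GOTO', 'label']
print(' '.join(accessing_sentence(path)))
# $start IDENTIFIER '{' GOTO IDENTIFIER
```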

So we see that by examining the entries in this table, we can quickly
find a sample accessing sentence for any state, and we are spared the
time of searching and groping for expansions and contexts in the
standard verbose output of yacc. These sample sentences are most
useful when we are doing conflict analysis, and we need to understand
how we reached a state, wherein a problem emerges.


2.7 Concise list of Conflicts

This section contains nothing more than a direct subset of what is
provided in a standard yacc verbose output. Specifically, for each
state in which a conflict is identified, the conflict(s) is(are)
printed, and the item set that defines that LR0 state is printed. By
examining the item set (and looking for the "." which shows where the
non-deterministic parser is currently processing each of the rules),
you can generally see a bit of local information about the context of
a conflict. Unfortunately, since the state information is very
localized (never more than the length of a rule's right hand side), it
is generally very difficult to see under what conditions this state is
really used. If you are from the old school, and want to see the
conflict states, this listing will give you exactly what you would
otherwise search for with a text editor (in the old y.output verbose
listings).

If you choose to analyze the conflicts by hand, you will need to run
the parser forward and backwards to identify conflicts, and you will
need the entire verbose output. For quick scanning (old school
method), this concise listing is quite useful.


2.8 Canonical Description of Conflicts

As was mentioned, for each state there is a minimal state stack access
to that state, and a corresponding accessing sentence for that state.
These accessing sentences are listed in an earlier table. In the case
of conflicts, small modifications to the grammar generally move
conflicts around (i.e., change their state numbers dramatically). In
contrast, the accessing sentence for a state that was not involved (or
insignificantly involved) in a grammar modification will generally not
change. This section lists the conflicts, not based on their "state
number", but rather based on their accessing sentence. A typical
entry looks like:

state 1105:
Canonical Conflict Access: STRUCT IDENTIFIER '{' TYPEDEFname '(' TYPEDEFname ')' . ';' (2 reductions)

The critical line "Canonical Conflict Access: ..." contains the
minimal accessing sentence "STRUCT IDENTIFIER '{' TYPEDEFname '('
TYPEDEFname ')'", followed by a dot ".", followed by the token that
induces the conflict "';'", followed by a description of the conflict.

Note that an effort was made to *not* list the rules or rule numbers
that are involved in the conflict. Since there are no rule numbers,
reordering of the grammar will typically not have any effect on this
description of the state. In addition, the fact that there are
exactly "(2 reductions)" as options indicates that there is a single
"reduce/reduce" conflict. In some instances there are 3 or more
possible reductions rule choices, and the note indicating the number
of such choices carries quite a bit of information.

Typically a user extracts all the lines in the description file that
contain the novel phrase "Canonical Conflict Access:" (using "grep"
or similar), sorts the list, and then compares (via "diff") this list to
what was provided by an earlier version of the grammar. Generally, if
there is a change in the conflict count or context for a conflict, the
changed conflict(s) will become instantly visible to the author. Using
this as a tool allows for a very rapid investigation of the results of
modifying a grammar which has conflicts, and monitoring the impact of
such changes.
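The extract, sort, and compare workflow just described can be sketched
in a few lines. The two document fragments below are invented samples,
not real tool output:

```python
# A stand-in for the grep/sort/diff workflow: extract the canonical
# conflict lines from two (invented) description files and diff them.
import difflib

old_doc = """state 1105:
Canonical Conflict Access: STRUCT IDENTIFIER '{' . ';' (2 reductions)
"""
new_doc = """state 207:
Canonical Conflict Access: STRUCT IDENTIFIER '{' . ';' (3 reductions)
"""

def conflict_lines(text):
    """Extract and sort the canonical conflict descriptions."""
    return sorted(line for line in text.splitlines()
                  if 'Canonical Conflict Access:' in line)

# Any change in a conflict's context or reduction count shows up in the
# diff, even though the state numbers (1105 vs 207) moved around.
for line in difflib.unified_diff(conflict_lines(old_doc),
                                 conflict_lines(new_doc), lineterm=''):
    print(line)
```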

Note that this canonical description is in theory "not perfect." Based
on experience with grammars which do have conflicts, it is extremely
accurate in its differentiation.


2.9 Verbose listing of state transitions

These listings are exactly what a classical verbose output from yacc
is defined to produce. Consult standard yacc documentation for
details on how to read the information in these tables. The most
typical use of these tables is to "play computer", and see the parser
working. As mentioned earlier, the tool that constructs a graphical
parse tree is generally far more adept at doing this, and is much
easier to use.

I will admit that the insight gained by a novice grammar writer by
walking through the verbose tables is irreplaceable. Basically, if
you can't walk through the table performing a parse (with a piece of
paper on hand to record the yacc stack), and you want to learn to
analyze conflicts, then you should learn to use these tables. Once
you know how to use these tables, you will know how a "Deterministic
Push Down Automaton" works, you will be able to follow arguments about
ambiguities, and you should stop reading these tables :-).


2.10 Explanations for all reductions suggested in conflicts

This section is so significant, that its description has been expanded
to all of chapter 3.



3 CONFLICT ANALYSIS Via the YACC CROSS REFERENCE SYSTEM


This chapter is concerned with using the tables provided by this YACC
Cross Reference System to understand the true meaning of a conflict,
as reported by yacc. There are typically several "meanings" that are
possible. Some conflicts indicate the presence of genuine ambiguities
in a grammar. Other conflicts indicate that the tasks specified by
the grammar are beyond the capabilities of any parser that must work
with limited look ahead (i.e., impossible for an LR(1), which only has
one token of lookahead, and must make reductions at clearly prescribed
times). Finally, the most complex form involves situations that are
too difficult to be handled *entirely* by an LALR(1) parser, and could
be handled "better" (a.k.a., correctly parsed) in some situations by
an LR(1) parser.

As will be seen, there is a tremendous, and oft-confused, distinction
between a "grammar" and a "language." The goal in grammar writing is
to concisely describe a language. If the description (a.k.a., the
grammar) is ambiguous or just plain "wrong", it does not prove that
the language is indescribable, any more than an error filled program
"proves" that a problem cannot be solved by a computer. By analyzing
conflicts we seek to understand what is "wrong" in the grammar, so
that it may be modified, and more "correctly" reflect the target
language. There are some restrictions on what *is* possible, but
experience has shown that the grammar writer is more often the
restricting factor than the theoretical restrictions of context free
languages (for typical computer languages at least).

Once the "meaning" of a conflict is understood, the author has several
options for handling the conflict. In some cases, the "default"
conflict resolution heuristics might be used. In other cases, hints
may be provided to the parser generator to direct some of the
resolutions (i.e., use %prec etc.). Finally, the most theoretically
precise mechanism is to adjust the grammar so as to remove the
ambiguity. In each case it is critical to understand the elements in
the grammar that are interacting to induce the conflict. Note that
the last method, which effectively "fixes the grammar", is often (but
not always) the most tedious and difficult.

Note that although some methods of handling conflicts are easier
(i.e., let yacc apply some arbitrary disambiguation rules), they tend
to be much more dangerous. Specifically, such methods allow a naive
user to accidentally abandon whole sections of a language that is
otherwise described by a grammar. Often, by the time the author has a
clear enough understanding of a conflict that yacc ambiguity
resolution can be depended on to provide "correct" results, the author
has enough information to perform any of the repair actions with equal
effect. Users that rely on yacc resolution rules without analyzing
the conflicts are casting their fate to the winds, and waiting for
field evaluation of their products to identify deficiencies. The
bottom line is that there is no "easy way out" of analyzing conflicts,
but by using the tables described below, a relatively casual author
can conclusively prove the validity of the approach taken (i.e.,
validate the grammar that is being used).

The simplest way to analyze a conflict is exhaustive analysis. It is
not graceful, but it is perfect. Surprisingly, exhaustive analysis of
conflicts identified by yacc only requires examination of a *very*
finite list of scenarios. The JRYACC Cross Reference Tool works to
provide this exhaustive analysis in a very readable form. The
remainder of this chapter will examine the issues involved in
conflicts, explain how a finite list of scenarios (contexts) depict
all possible scenarios, and explain how to read the formatted contexts
enumerated by this tool.


3.1 LR(1) Parsing

In order to analyze conflicts, we must recall what a conflict is. This
section is far too brief for a reader that has never understood how LR
parsing is achieved, but it will suffice as a refresher for
an experienced reader. Additionally, LR parsing will be discussed from a
slant that is more focused on conflict analysis, and less on the
generation of parsing tables, compactions of tables, etc. This is a
notable departure from approaches taken in more classical texts.

LR(1) is the abbreviation for "Left-to-right-scan,
Right-most-derivation-in-reverse, 1 token of look-ahead."

The "Left-to-right" portion indicates that the parsing will examine
tokens as a simple stream, as you or I would read them (example: no
skipping around to see how the stream ends; no looking at the final
word in a file, etc.).

The phrase "right most derivation reversed" simply provides some
explicit ordering for the processing of reductions, and commonly ties
LR() parsing to a firm theoretical basis (re: right-most derivations).

The "1 token" specifies that all decisions to perform reductions
(specify a grouping of tokens as corresponding to a single
non-terminal) will be made by looking at only one more token. This
constraint means that *only* 1 token will be viewable in advance, but
more importantly, all reductions (i.e., grouping via production rules)
must be concluded on all tokens to the left *before* any additional
right context (a.k.a. hints) are provided. In some sense, this
restriction is what places the harshest constraint on a LR parser, and
causes LR(1) conflicts to appear in an otherwise unambiguous grammar.

We will now consider the three basic causes of conflicts: Ambiguous
grammars, LR conflicts, LALR-only conflicts.

3.1.1 Ambiguous grammars

In an ambiguous grammar the author has (inadvertently, or perhaps
conveniently) provided two distinct methods to parse a single series
of tokens. When a grammar is ambiguous, yacc will always indicate
that it cannot decide between two alternative parsing actions.
Fundamentally, there are two fully viable alternatives for parsing a
sentence, and there is no way to decide between them (without
consulting an oracle, or a mind reader, or something more powerful).

Since the general problem of deciding if a grammar is ambiguous is not
computable, miracles should not be looked for with the Cross Reference
tool. However, most real life grammars are very decidably ambiguous,
and the information gleaned from this analysis tool generally
isolates actual ambiguities rather quickly if they exist.

Consider the grammar:

start : m
| o ;

m: A B ;

o: A B ;

Note that the language consists of exactly one sentence "A B." Clearly
this is an example where a poor choice was made for a grammar, as the
parse of the one valid sentence in this language is not uniquely
determined. The parse of "A B" might be:

start
|
m
|
+---+
| |
A B

or it might be:

start
|
o
|
+---+
| |
A B

As with the classical ding-dong yacc example, the cause of this
conflict (which would appear as a R/R conflict) is clear: the language
is ambiguous. In a large grammar, the interactions of the many rules
are far less clear, and it is generally very difficult to locate the
underlying ambiguity.

Note that in this case the Reduce/Reduce conflict would happen when
the look-ahead symbol is an End-of-file. Assuming that a common left
context can be found for both reductions (in this case "A B"), an R/R
conflict on an End-of-file *always* indicates an "ambiguous grammar".
The genuine "ambiguity" is clear, in that the LR(1) conflict indicates
a common left context, and an EOF indicates a common
right context (i.e., no more symbols), and hence there is an
ambiguity.
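The ambiguity claim can be checked by brute force. A derivation
counter, using an ad hoc grammar encoding invented for this sketch,
finds two distinct parse trees for "A B":

```python
# Brute-force count of distinct parse trees for the ambiguous grammar
#   start : m | o ;   m : A B ;   o : A B ;
GRAMMAR = {
    'start': [['m'], ['o']],
    'm': [['A', 'B']],
    'o': [['A', 'B']],
}

def count_parses(symbol, tokens):
    """Count distinct derivations of `tokens` from `symbol`."""
    if symbol not in GRAMMAR:                 # terminal symbol
        return 1 if tokens == [symbol] else 0
    return sum(count_splits(rhs, tokens) for rhs in GRAMMAR[symbol])

def count_splits(rhs, tokens):
    """Count ways the symbol list `rhs` can derive `tokens`."""
    if not rhs:
        return 1 if not tokens else 0
    total = 0
    for cut in range(len(tokens) + 1):
        head = count_parses(rhs[0], tokens[:cut])
        if head:
            total += head * count_splits(rhs[1:], tokens[cut:])
    return total

print(count_parses('start', ['A', 'B']))  # 2  (one tree via m, one via o)
```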

Basically, if there is a real ambiguity in the grammar, then an LR(1)
conflict state will be reached, and there will exist a common right
context for that state such that either of the actions are allowable.
Although this tool only automates the location of left contexts, some
imagination on the part of the reader is usually able to isolate a
common right context (if it exists).

More often than not, when a conflict appears, the author will be very
dismayed that one of the actions is even considered by the parser.
This will indicate that the grammar was "too general" in some regard,
and being "more specific" will remove the accidentally induced
ambiguity.

Just to prevent the reader from believing that "all true ambiguities"
manifest themselves as R/R conflicts, the following is an ambiguous
grammar for the same language, that presents an S/R conflict. After
manipulating grammars for a while, one quickly learns that it is
pretty easy to translate S/R into R/R conflicts, and vice versa.

start: m B
| A B ;

m : A ;

If it is not clear that the above produces a shift/reduce conflict,
the reader is encouraged to try it out with a copy of yacc.


3.1.2 LR(1) ambiguous grammars

Consider the following parse tree for the terminal sentence "A B C D
E".

start
|
+--------+-----+
| | |
m n E
| |
+---+ +---+
| | | |
A B C D


One grammar which would support such a parse tree is:

start : m n E ;

m : A B ;

n : C D ;


Critically, in support of an LR(1) parse of this sequence, the
reduction to "m" must be made when *only* tokens "A B C" have been
examined. It would *NOT* be an LR(1) parse if the decision to reduce
"m:A B" were deferred until additional tokens were seen.
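This timing constraint is visible in a small table-driven sketch of an
LR parse for this grammar. The states and tables below are
hand-derived for illustration (they are not actual yacc output); note
that the reduction of "A B" to "m" fires the moment "C" becomes the
lookahead:

```python
# Hand-derived tables for the toy grammar
#   start : m n E ;    m : A B ;    n : C D ;
# State numbers are invented for this sketch, not yacc's.
ACTION = {
    (0, 'A'): ('shift', 3),
    (3, 'B'): ('shift', 4),
    (4, 'C'): ('reduce', 'm', 2),    # reduce m : A B as soon as C appears
    (2, 'C'): ('shift', 6),
    (6, 'D'): ('shift', 7),
    (7, 'E'): ('reduce', 'n', 2),
    (5, 'E'): ('shift', 8),
    (8, '$end'): ('reduce', 'start', 3),
    (1, '$end'): ('accept',),
}
GOTO = {(0, 'm'): 2, (2, 'n'): 5, (0, 'start'): 1}

def parse(tokens):
    """Run the shift-reduce loop; return the reductions, in order."""
    stack, reductions = [0], []
    toks = tokens + ['$end']
    i = 0
    while True:
        act = ACTION[(stack[-1], toks[i])]
        if act[0] == 'shift':
            stack.append(act[1])
            i += 1
        elif act[0] == 'reduce':
            lhs, length = act[1], act[2]
            del stack[len(stack) - length:]   # pop the rule's right side
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(lhs)
        else:                                 # accept
            return reductions

print(parse(['A', 'B', 'C', 'D', 'E']))  # ['m', 'n', 'start']
```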

Suppose we were to augment the grammar with the productions:

start : m n E
| o n F ;

m : A B ;

o : A B ;

n : C D ;

The acceptable language (i.e., list of valid terminal sentences) would
grow to be "A B C D E" as well as "A B C D F". Each of these
sentences would have unambiguous parse trees:

start
|
+--------+-----+
| | |
m n E
| |
+---+ +---+
| | | |
A B C D


and:

start
|
+--------+-----+
| | |
o n F
| |
+---+ +---+
| | | |
A B C D

and hence the grammar would remain unambiguous. However, the grammar
would clearly present an "LR(1) conflict." Specifically, after "A B"
had been seen, and "C" was provided as a look-ahead, it is impossible
to decide which of the two parse trees will end up being used. Based
on the constraint that *only* this one additional look ahead symbol be
examined before all groupings (reductions) to the left are performed,
it cannot be determined whether the reduction to "m" or to "o" should
be performed.

The above example is a very typical "real life" (almost) example of a
conflict. In this case, the left context was very easy to see, and
the problem was abundantly clear. In more complex grammars,
identifying why a conflict appears is generally far from obvious.


3.1.2.1 Removing LR(1) conflicts from unambiguous grammars

Since we have given a nice example of an "LR(1)" conflict, that I
claim is typical, we will also diverge for a moment and explain how to
"repair" the grammar to remove the conflict. As mentioned, there are
some theoretical limitations on what can be done, but many conflicts
lend themselves to the following approach. This approach is basically
applicable when it is desirable (in this case, absolutely necessary)
to have additional right context information prior to making a
reduction (or at least the associated action). Note that the approach
is only valid when the reduction is *not* time critical (example of a
time critical reduction is one which modifies the symbol table, and
must be done prior to further lexing/parsing).

We start by noting that the conflict shown is a reduce/reduce
conflict. Specifically, we are unsure of whether to reduce to "m" or
"o" when we are in the conflict context "A B" with lookahead "C". As
we have stated, it is flat out impossible for a parser to decide which
reduction to make without additional right context. Hence, if we
could empathize with the lowly LR(1) parser, we see that it is an
impossible task that has been asked of the parser (at least with the
current grammar!).

Note that one method of repair is to "allow yacc to make an arbitrary
decision". In this case, such a solution is unacceptable as it would
throw away half of the language (there were only two valid sentences
to start with). It is interesting to note that the ordering of the
rules would provide yacc with the basis for deciding.

As a first step to removing the pressure on the parser (which was
being asked to do the impossible), we remove the constraint that it
must decide to reduce to "m" when it is in the conflict context. To
do this, we "inline expand" the offensive rule in the context that was
causing problems. Specifically, we inline expand "m : A B" in the
rule "start: m n E ". The resulting grammar is then:

start : A B n E
| o n F ;

o : A B ;

n : C D ;


This transformation of the grammar does *not* change the acceptable
language, but it changes the conflict from a Reduce/Reduce into a
Shift/Reduce conflict. Specifically, in the conflict context "A B"
with lookahead "C", the parser must decide whether to reduce the "A B"
to an "o", or whether to shift the "C" (and hope that "A B n F" are
soon formed).
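The inline expansion used above is mechanical enough to automate. A
sketch, with an ad hoc rule encoding invented for this example:

```python
# Inline expansion: replace one occurrence of `target` in the rules for
# `lhs` with each of `target`'s right-hand sides.
def inline_expand(rules, lhs, target):
    """Return a new rule set with `target` inlined in the `lhs` rules."""
    out = {}
    for name, alts in rules.items():
        if name != lhs:
            out[name] = alts
            continue
        new_alts = []
        for alt in alts:
            if target in alt:
                i = alt.index(target)
                for body in rules[target]:
                    new_alts.append(alt[:i] + body + alt[i + 1:])
            else:
                new_alts.append(alt)
        out[name] = new_alts
    return out

rules = {
    'start': [['m', 'n', 'E'], ['o', 'n', 'F']],
    'm': [['A', 'B']],
    'o': [['A', 'B']],
    'n': [['C', 'D']],
}
step1 = inline_expand(rules, 'start', 'm')
print(step1['start'])   # [['A', 'B', 'n', 'E'], ['o', 'n', 'F']]
step2 = inline_expand(step1, 'start', 'o')
print(step2['start'])   # [['A', 'B', 'n', 'E'], ['A', 'B', 'n', 'F']]
```

The two printed rule sets correspond to the two grammar transformations
performed in this section.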

It is interesting to note that we have now changed the basis for
yacc's default disambiguation from being based on rule ordering
(standard for R/R conflicts), to favoring the production "start : A B
n E" independent of ordering. In the case of S/R conflicts, the shift
is always taken (assuming no %prec precedence is specified to override).
Unfortunately, yacc would still discard half the language (it will
even state (in diagnostics) that the rule "o : A B" is never reduced).

As stated earlier, it is impossible to decide conclusively to reduce
"A B" to "o" on lookahead of merely "C", hence we must further reduce
the pressure on the parser to make the decision. The final
modification is then to inline expand the pressing rule "o : A B" in
all significant contexts. The resulting grammar is then:

start : A B n E
| A B n F ;

n : C D ;

Not too surprisingly, the above grammar is completely unambiguous.
Basically, an LR(1) parser tries out a multitude of parses at the same
time, and only must be deterministic when it reaches a reduction
point. We have exploited this fact, in this last transformation by
creating what I would call a "Shift/Shift conflict." Since a parser
does not need to decide anything special, the S/S conflict is absorbed
into the Deterministic State machine, which implements this
non-deterministic multitude of parses. The non-determinism is
apparent when viewing the core entries in each state (see the
traditional verbose output for this grammar), as the states will
clearly show that it is not known which of the two rules is being
parsed, until sufficient context has been provided.

The most technically interesting aspect about the transformations that
were just performed is that they are more effective than what is
provided by a "more powerful" parser generator, such as an LR(k)
parser generator. The sophisticated reader might have noted that with
3 tokens of lookahead, the conflict context would either be "A B" with
lookahead "C D E", or "C D F", and the lookahead would be sufficient
to solve the dilemma (i.e., remove the conflict). To demonstrate the
power of the transformations just provided, suppose the original
grammar was:

start : m n E
| o n F ;

m : A B ;

o : A B ;

n : C D
| n C D ;

Note that now "n" is an arbitrarily long list of "C D" sequences. With
this grammar, no matter how large a "k" is chosen, there will still be
a conflict in an LR(k) parser. In contrast, once the transformations
specified above are made, the grammar becomes:

start : A B n E
| A B n F ;

n : C D
| n C D ;

which is LR(1)! Fundamentally, the above series of grammar
transformations provides the power of k *non*-terminals of lookahead,
which may correspond to arbitrarily many terminals, and hence exceed
the power of an LR(k) parser, for any fixed k.
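The one-token-of-lookahead property of the final grammar can be
illustrated with a small hand-written parser. The decision between "E"
and "F" is postponed until the final token, no matter how long the
"C D" list grows:

```python
def parse(tokens):
    """One-token-lookahead parse of:  start : A B n (E|F) ;  n : (C D)+"""
    toks = iter(tokens + ['$end'])
    look = next(toks)

    def eat(expected):
        nonlocal look
        if look != expected:
            raise SyntaxError('expected %s, saw %s' % (expected, look))
        look = next(toks)

    eat('A'); eat('B')
    eat('C'); eat('D')        # "n" requires at least one "C D"
    while look == 'C':        # ...and allows arbitrarily many more
        eat('C'); eat('D')
    if look not in ('E', 'F'):
        raise SyntaxError('expected E or F, saw %s' % look)
    ending = look             # only now are the alternatives distinguished
    eat(ending)
    if look != '$end':
        raise SyntaxError('trailing tokens')
    return ending

print(parse(['A', 'B', 'C', 'D', 'C', 'D', 'F']))  # F
```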


3.1.3 LALR-only ambiguous grammars, and LALR-only conflict components

We have now discussed both ambiguous grammars, and LR(1) ambiguous
grammars (i.e., grammars which induce conflicts in LR(1) parsers).
LALR-only conflicts represent a class of conflicts that are not
problematic for an LR parser, but are, by virtue of some odd
constraints, untenable for an LALR parser.

In order to discuss LALR-only conflicts, it is necessary to recall how
LR parsers differ from parsers generated by LALR parser generators. As
will be described, typical "parsers" are identical in structure, but
they differ in their complexity.


3.1.3.1 LR parsers vs parsers generated by LALR parser generators

To fully appreciate what is being discussed it is important to
understand that LR parsers are in some simple sense, perfect. To be
specific, an LR(1) parser gathers all possible information about the
entire left context (i.e., everything that has been scanned) and uses
this in deciding on future actions (i.e., shift, or reduce by some
specific rule). IF a unique decision can be made based entirely on
left context and current look-ahead, then an LR parser can make the
decision, perfectly. There is no better algorithm for parsing based
on this information. period.

It can be shown that an LR(k) parser can be implemented by a
"Deterministic Push Down Automaton" (a state machine that can push
its state onto a stack, exactly the way yacc does). Hence, the
entire left context can be summarized as one of several (but finitely
many, no matter how large the left context is) states, and the decision on
course of action can be based exclusively on the top-of-stack-state,
and the lookahead set. Hence, a full LR(1) parser ends up looking a
lot like what most folks have come to know and love as a yacc parser,
consisting of a yacc stack, and some look-ahead, and some tables to
combine the two so as to decide on actions.

Given the immense similarity between a yacc (LALR(1) generated)
parser, and an LR(1) parser, the natural question is why is there any
difference at all? Or, to put it another way, if there is a
difference, why does anyone want LALR(1) when perfect is "soooo
close?"

As it turns out, LALR parsers are easier to generate quickly (courtesy
of such work as was done by Pennello and DeRemer). Certainly such a
reason would not support avoiding the use of LR(1) parsers, as the
parser generator is only run rarely, and the parser is generally run
very often (excluding the work of grammar authors :-) ). More
importantly, the "perfection" of LR parsing comes with a price: a much
greater number of states. This increase generally means a much larger
state transition table (or lots more code), which results in a much
larger parser.

I have heard numbers quoted such as LR parsers require "double" the
number of states, and "orders of magnitude more", but the real answer
is: enough to completely separate out all the left contexts, for all
the "basic states." Looking at a typical C grammar, it is easy to
conceive of distinctive left contexts, such as '(', '{', and
'[', as being significant. If you are willing to take my word that
most states in an LALR parser *don't* distinguish between these left
contexts, we can immediately see (since almost all the basic
states can be found in such contexts) that the size of the LR tables
for such a grammar would be at least triple that of an LALR parser
(clearly such left contexts must be handled distinctly by an LR
parser, as they would vary the response to matching brackets: syntax
error vs reduce some then shift). Examining statements, we could
probably double yet again that number, by noting that the response to
"else" depends on whether we have an unmatched "if", and hence there
are probably at least six times as many states as an LALR parser
generator would produce.

Having heard how much context is lost or ignored in an LALR generated
parser, it almost starts to be surprising that any work gets done with
LALR parsers. The truth is that the states are very (very is an
understatement) cleverly chosen so that a shift is never done on a
syntax-error terminal, but some reductions *might* be done (even if
the lookahead does represent a syntax error). This tends to combine a
*lot* of states very efficiently. Unfortunately, (hence the topic of
this section), too many states are sometimes combined (a.k.a. merged).
As a result, several left contexts are merged (as far as the parser
can tell, when looking *only* at the top of the yacc stack), which
then makes the decision as to future actions indefinite (i.e., a
conflict appears).

To be specific about how this set of states is ever so cleverly
conceived, the set of states is exactly the set of states that an
LR(0) parser would use. An LR(0) has no look-ahead, and decides based
only on what has been shifted onto the stack (and possibly reduced)
exactly what to do next. Very few real life grammars are LR(0), but
such a set of states may be nicely augmented with lookahead
information (but no change in the number of states!) to form what is
called an LALR(1) generated parser. The property that a bad token
(i.e., syntax error) cannot be "shifted" by an LR(0) parser, carries
over into an LALR(1) parser, and reveals a syntax error when there is
no alternative other than a shift (i.e., possibly after a number of
reductions have been performed).

It can also be noted that since an LALR generated parser is willing to
perform reductions (but not shifts) in the face of a
syntax-error-look-ahead, this willingness can be extended to simplify
many resulting states to have more "default reductions." Since the
LALR generation process has already conceded the "immediate reporting
of a syntax error", that was provided by an LR parser, there is no
real functional loss, but typically there is significant table
compression (this is why many yacc parsers perform a lot of default
reductions rather than using available information to identify a
syntax error).


3.1.3.2 Sample LALR-only ambiguous grammar

Consider the following grammar (which can be submitted to yacc):

%token '(' ')' DIGIT
%%
program : '(' paren_fixer
| '(' number ')'
| paren_fixer ')'
| number
;

paren_fixer : DIGIT;

number : DIGIT;
%%

The above grammar specifies a language with four legal sentences, each
of which has a unique (i.e., unambiguous) parse tree. If however,
this sample is processed by yacc (or any LALR parser generator), then
the following results will appear:

yacc: 1 rule never reduced
yacc: 2 reduce/reduce conflicts.
There are 10 states.

As will be shown, there is no LR conflict, it is just that the LALR
approach to using *only* 10 states has induced the conflict.
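The claim of exactly four legal sentences is easy to verify by
exhaustively expanding the grammar (again using an ad hoc encoding
invented for this sketch):

```python
# Brute-force enumeration of the sample grammar's language.
GRAMMAR = {
    'program': [["'('", 'paren_fixer'], ["'('", 'number', "')'"],
                ['paren_fixer', "')'"], ['number']],
    'paren_fixer': [['DIGIT']],
    'number': [['DIGIT']],
}

def sentences(symbol):
    """Yield every terminal sentence derivable from `symbol`."""
    if symbol not in GRAMMAR:          # terminal: expands to itself
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        yield from concat(rhs)

def concat(rhs):
    """Yield every expansion of the symbol list `rhs`."""
    if not rhs:
        yield []
        return
    for head in sentences(rhs[0]):
        for tail in concat(rhs[1:]):
            yield head + tail

for s in sentences('program'):
    print(' '.join(s))
```

Each sentence has exactly one derivation, so the language is
unambiguous even though yacc reports two reduce/reduce conflicts.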

In order to prove this is an LALR-only conflict, we will examine the
resulting parser carefully, and prove that an LR parser would not have
any conflicts. We will also discuss how "state splitting" can cure
such a problem, and give an example of a rather novel strategy for
solving this problem within the constraints of an LALR parser
generator.

Appendix A contains the complete YACC Cross Reference output provided
for this sample grammar. You may choose to examine it now, before we
extract sections of it during the analysis. This grammar was chosen
to be about as small as I could make it, and still present this class
of conflict.

3.1.3.2.1 Analysis of the sample LALR-only ambiguous grammar

The following was extracted from Appendix A, but is annotated here
with additional commentary. We begin our exploration by looking at
the concise list of conflicts:

>Concise list of conflicts:
>
>2: reduce/reduce conflict (reduce 5, reduce 6) on $end
>2: reduce/reduce conflict (reduce 5, reduce 6) on ')'
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)

The above immediately tells us that both of the R/R conflicts occurred
in state 2. For reasons that will become apparent shortly, we will
focus our attention on the conflict involving ')'. The interested
reader is invited to perform a similar analysis of the conflict on
$end (which represents an End-of-file).

Although there are several interesting sections in Appendix A, the
most significant is the section entitled "Explanations for all
reductions suggested in conflicts". Since we are focusing on the
conflict involving the lookahead symbol ')', we would proceed to the
relevant demonstration subsection:

>
>Demonstration(s) of reduction via rule (5) in state 2, when next token is ')'
>paren_fixer : DIGIT (5)

The above heading indicates that we are about to be presented with the
analysis of how rule 5 might be used while in state 2, when the
lookahead token is a ')'.

The following extract has some line number information at the far left
for reference in the text:

>
1>From state 0
2> $start paren_fixer transitions to state 4, and then <')'> can follow.
3> DIGIT . (5)
>
4>Following the 1 states below state 2, with lookahead <')'> REDUCE via (5) is possible.
5>Minimal stack context: 0 2
6>Corresponding symbols: $start DIGIT . ')'
>Sample sentence: $start DIGIT . ')'



What all the above notation says is that the only way (this is the
*only* demonstration in this section, and this section is exhaustive)
that rule (5) can be invoked in state 2 is as follows:

a) The yacc stack beneath state 2 *must* contain state 0 (see line
"5>"). There *might* be other states beneath that, but the top
section of the stack MUST contain state 2 on top of state 0.

b) The only example of how this yacc state stack can be reached is
"DIGIT" (if there were options, then a "!" would have denoted the
start of optional/variable prefixes). Note that we can look as far to
the left as the $start, and we still do not see a "!". This
information is presented on line "6>".

As a "proof" that this scenario can lead to shifting the ')', lines
"1>", "2>", and "3>" present a reduction sequence and justification.
Specifically, line "3>" suggests that rule 5 be used to reduce "DIGIT"
to "paren_fixer". At that point, line "2>" suggests that we would
transition (assuming we were in the underlying state 0, as mentioned
in line "1>") from state 0 to state 4. Moreover, line "2>" suggests
that in state 4, ')' is acceptable rather directly. If we examine
state 4 we see that:

>state 4
>program : paren_fixer . ')' (3)
>
>')' shift 8
>. error

which shows that a shift is performed on a ')'.

Moving on to the discussion of the reduction via rule 6, and its
context:

>
>Demonstration(s) of reduction via rule (6) in state 2, when next token is <')'>
>number : DIGIT (6)
>
>
1>From state 1
2> $start '(' number transitions to state 7, and then <')'> can follow.
3> DIGIT . (6)
>
4>Following the 1 states below state 2, with lookahead <')'> REDUCE via (6) is possible.
5>Minimal stack context: 0 1 2
6>Corresponding symbols: $start '(' DIGIT . ')'
>Sample sentence: $start '(' DIGIT . ')'


We see that the only way that a reduction via rule 6 is possible (here
again our "exhaustive list" has only one entry) is when:

a) The yacc stack beneath state 2 contains state 1 (see line "5>").
There *might* be other states beneath that, but the top section of the
stack MUST contain state 2 on top of state 1.

b) The only example of how this yacc state stack can be reached is
"'(' DIGIT" (if there were options, then a "!" would have denoted the
start of optional/variable prefixes). This information is presented
on line "6>".

If we stop to think about this, then we realize that these two
scenarios (appropriate for reduction via rule 5 vs rule 6) are very
detectably different. Specifically, if the entry on the yacc stack
beneath the "official" conflict state 2 is state 0, then we should be
reducing using rule 5. If the entry underneath state 2 is state 1,
then we should be reducing via rule 6. We can wonder why yacc chose
to merge these distinct left contexts into state 2 (the contexts must
be distinct, in that they result in distinct yacc stack entries), but
it is clear that a "perfect" LR parser would not have trouble at all.
Since an LR parser takes all left context into account, it (the LR
parser) would certainly be able to distinguish these two scenarios,
and hence would not notice any conflict. As we move on to reading the
summary information, it is important to recall that state 2 is entered
from clearly distinct contexts, and it is this combining of contexts
that appears to be the source of our problem (the conflict).

The thoughtful analysis of the last paragraph is presented in the
automated documentation as follows:

>Summary of conflict contexts leading to state 2 and lookahead symbol <')'>
>Possible reductions rules include (5,6)
>paren_fixer : DIGIT (5)
>number : DIGIT (6)
1>3 conflict contexts were found.
>


Up to this point only summary information is presented. The "3
conflict contexts" spoken of on line "1>" are the yacc stacks "2",
"1 2", and "0 2".

Next, all contexts that are truly LR ambiguous contexts are cast aside
(in this grammar, there are no truly LR ambiguous contexts), and the
results shown are the LALR-only conflict contexts:

1>LALR-only-conflicts are present. Splittable states include: 2
2>3 conflict contexts (1 splittable state(s)) are LALR-only problems.
3>Unambiguous context tree is:
>
4>--2+-1(6)
5> --0(5)

Line "1>" suggests that state 2 should be "split" (we will explain
what this means in a moment), and indeed our discussion above
identified state 2 as an overburdened state.

Lines "4>" and "5>" present the "context tree" for the conflicts.
Although the symbols "+-" and "--" appear useless here, in larger
trees they form the connecting lines in a graphical presentation. The
tree is "rooted" in state 2 (see line "4>"), and has two distinct
underlying contexts "1" (shown on line "4>") and "0" (shown on line
"5>"). Note that the parenthesized numbers, such as "(6)" on line
"4>", indicate that in this context, reduction via rule 6 is
reasonable.

Since we showed earlier that the conflict in state 2 had left contexts
that are distinctive enough that a decision can be made as to an
action, we have proved that there is no LR(1) conflict in this grammar
(at least with lookahead ')' ). By doing so, we have then proved that
the above conflict is "LALR-only", which means it is problematic for
an LALR parser generator, but not for an LR parser.
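The stack-sensitive decision just described can be sketched in a few
lines. This is an illustrative sketch, not part of the tool: it
hard-codes the two contexts from Appendix A, and chooses the reduction
in conflict state 2 by peeking at the state beneath the top of the
yacc stack, exactly as a full LR(1) parser could.

```python
# Sketch of the "perfect LR" decision described above.  In conflict
# state 2, with lookahead ')', the correct reduction is determined by
# the yacc stack entry *beneath* the top (states/rules per Appendix A).
def choose_reduction(state_stack):
    """Return the rule to reduce by in state 2 on lookahead ')'."""
    assert state_stack[-1] == 2          # we are in the conflict state
    below = state_stack[-2]
    if below == 0:
        return 5                         # paren_fixer : DIGIT  (5)
    if below == 1:
        return 6                         # number : DIGIT  (6)
    raise ValueError("state 2 is only entered from states 0 and 1")

print(choose_reduction([0, 2]))      # context "DIGIT )"   -> 5
print(choose_reduction([0, 1, 2]))   # context "( DIGIT )" -> 6
```

An LALR parser consults only the top-of-stack state, so it cannot make
this peek; the state splitting described next pushes the same
information into the top state instead.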



3.1.3.2.2 Sample Removal of LALR-only conflict via state splitting

Since we have so carefully described the LALR-only conflict provided
in the grammar presented in Appendix A, we will take a brief tangent
and describe what "state splitting" is, and how it can remove such
conflicts. Understanding this example will make reading certain
elements of the conflict analysis sections a bit easier.

Recall from our discussion that state 2 was the cause of the conflict.
Also recall that we could determine the proper action (reduce via rule
5, vs reduce via rule 6) simply by looking below the top of the yacc
stack. Specifically, if the next-to-the-top stack entry was state 0,
then reduction via rule 5 was correct, if state 1, then reduction via
rule 6 was indicated. Since we wish to maintain the simplicity of a
typical yacc parser, we do not wish to alter the algorithm, which
computes action based on examining *only* the top of stack, and the
look-ahead set. Certainly, making the algorithm more complex, and
diverting to a special "if you reach this state, then look down into
the stack..." would work, but it is unnecessary.

The question is: Since we are not willing to look down into the yacc
stack, how can we force information held within the yacc stack to
propagate to the top of the stack? When posed in this way, the answer
becomes clear....

If we look at the "Sample Stack Context and Accessing Sentences for
each State" provided in appendix A, and focus on state 2, we see:

>state 2: 2 incoming, 0 outgoing states; entering via state 0:
>Minimal stack: 0 ! 2
>symbols: $start ! DIGIT
>sentence: $start ! DIGIT

Note that there are only "2 incoming" states. Indeed, we know from
our analysis that states 0 and 1 can each transition to state 2, so we
are actually considering all possible states that lead to state 2.
What we really wish we had was two separate copies of state 2, one
that was accessed via state 0, and the other that was accessed via
state 1. The easy way to achieve this is "simply make a copy of state
2", or in the jargon of this section, "split state 2."

To split state 2, we first look at what state 2 is defined to do (see
appendix A):

>state 2
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)
>
>. reduce 5

Next, since there are only 10 states (numbered 0 through 9), we create
a copy, but give it the unused number "state 10":

>state 10
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)
>
>. reduce 5


Finally, we adjust the original state 1 so that rather than its
current transition to state 2:

>state 1
>program : '(' . paren_fixer (1)
>program : '(' . number ')' (2)
>
>DIGIT shift 2
>. error
>
>paren_fixer goto 6
>number goto 7

we make it transition to our newly created split state (i.e., state
10):

>state 1
>program : '(' . paren_fixer (1)
>program : '(' . number ')' (2)
>
>DIGIT shift 10
>. error
>
>paren_fixer goto 6
>number goto 7

Now we see that when we reach state 2, we must have come from state 0,
and so reduction via rule 5 is indeed correct. On the other hand,
when we arrive in state 10, we must have come there via state 1, and
so we wish to reduce using rule 6 as the default. We can make this
adjustment to state 10:

>state 10
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)
>
>. reduce 6

We have now defined a parser that properly parses the grammar that was
supplied. Note that what was done was a splitting of state 2, to
distinguish the underlying contexts. All the steps can rather
directly be automated, and hence a mechanism for improving upon an
LALR parser was described. Note that in truth, typing over the
verbose output of yacc does not a parser make, but the integration
into a parser generator is not that difficult.
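The effect of the split can be sketched with two tiny tables. The
state numbers are taken from this section (state 10 being the copy of
state 2); after the split, the correct reduction is selected from the
top-of-stack state alone, with no peeking below it:

```python
# After splitting, state 1 shifts DIGIT to the copy (state 10) rather
# than to state 2, so the top state alone encodes the left context.
shift_on_digit = {0: 2, 1: 10}   # state -> state entered on shifting DIGIT
default_reduce = {2: 5, 10: 6}   # state -> default reduction rule

def reduce_after_digit(stack):
    """Shift a DIGIT from the state atop `stack`, then choose the
    reduction by consulting *only* the new top-of-stack state."""
    stack = stack + [shift_on_digit[stack[-1]]]
    return default_reduce[stack[-1]]

print(reduce_after_digit([0]))     # from state 0 -> state 2  -> rule 5
print(reduce_after_digit([0, 1]))  # via '(' (state 1) -> 10  -> rule 6
```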

The attentive reader will also find that the "minimal splitting" of
states is a slightly complex problem, especially when there are
demands posed by multiple conflicts, and a long and complex "conflict
context tree." Hence, optimal state splitting is a slightly interesting
challenge, but existing work in this area (re: Minimal State Finite
State Machines built on constraints including "don't care") should
provide all the necessary algorithmic horsepower.

After "reinventing" this state splitting method, I was told that the
original work on this topic was done by Pager ("A Practical General
Method for Constructing LR(k) Parsers", Acta Informatica, 1977, p
249), but I have yet to see the paper. I have in the meantime read
"Efficient Full LR(1) Parser Generation", David Spector, SIGPLAN
Notices V21, No. 12, which also describes some issues involved in
state splitting. Note that Spector's paper appears to confuse
"removal of LALR-only conflicts" with the creation of "Full LR
parsers." As has been explained here, an LR parser has certain
features involving immediate error detection (prior to *any*
reductions), and hence removal of LALR-only conflicts would provide
(to invent a name) NQLR parsing (Not Quite LR), or perchance, MTLALR
parsers (More Than LALR), but *not* full LR parsers. Additionally,
the approach of Spector removes *all* LALR-only conflict contexts,
which is not actually necessary. Detailed examples are provided in
the discussion of the YACCable C++ grammar, wherein some LALR-only
conflict components can be shown to be unimportant in light of the
default reductions taken.


3.1.3.2.3 Sample Removal of LALR-only conflict via grammar
augmentation

In the last section we described a method of removing an LALR-only
conflict by modifying the underlying state machine produced by an LALR
parser generator. Although this is probably the easiest and most
precise approach (assuming you have access to yacc source, and the
wherewithal to code it), there appears to be an interesting
alternative involving adding "useless rules" to an existing grammar.
This approach of "augmenting" a grammar has several nice attributes:

1) It allows current LALR(1) parser generators to be used with
no modification. This means that if for whatever reason you *demand*
(or your boss demands :-) ) that you continue to use a specific parser
generator, then you can. Some users have magical hooks wired into
their parser generators, others have special parser generators that
produce ultra-fast parsers, or ultra-compact parsers. Basically,
whatever the reason, there would be no need to change.

2) It appears to be possible to automatically generate the
augmenting rules. I cannot say this conclusively, but it appears
likely, and in many cases, I have existing code that succeeds. This
paper is being written prior to completion of the effort in order to
document the YACCable C++ grammar, and hopefully future work will
clarify this issue.

We define a "useless rule" as a production that we can *guarantee*
will never be used. The easiest way to create a "useless rule" is to
be sure it has a terminal on its right hand side that *never* is
provided by the lexer. For example, the following rule will never
have any impact on a parser assuming that "NEVER" is a terminal that
is never produced by the lexer:

program : NEVER ;

Obviously, the rule that we will propose is not totally "useless", as
it will perform a very interesting function. If we examine the
printout in Appendix A, we see the following at the conclusion of the
LALR-only analysis:

>The following rules might split the problematic states:
>
>program : '(' DIGIT error ; /* SPLIT state(s) 2 following state 1 for rule(s) (6) out of (5,6) */
>Minimal stack context: 1 2
>Corresponding symbols: $start '(' DIGIT ')'
>Sample sentence: $start '(' DIGIT ')'
>

For ease of creation, the token that will never be provided by the
lexer is "error." Interestingly, if the above rule is added to the
grammar provided in Appendix A, the LALR-only conflict (in fact, all
conflicts) vanish! The following is the modified grammar that has no
LALR conflicts:

%token '(' ')' DIGIT
%%
program : '(' paren_fixer
| '(' number ')'
| paren_fixer ')'
| number
;

paren_fixer : DIGIT;

number : DIGIT;

program : '(' DIGIT error ;
%%

Before explaining why this works, we pause to note a most amazing
fact. The addition of a rule that contains the "error" token has
caused the behavior of the resulting LALR parser to change when
processing sentences with no errors!!! This is most amazing. If you
are not amazed, I am amazed enough for both of us. I always had the
belief that the addition of productions to provide error recovery
would have no impact on the parser as it was processing valid
sentences. This belief is in general false (as shown above), though
it is true when no LALR-only conflict contexts are present.

Prior to explaining how the augmenting sentence is arrived at in
general, it should be comforting to the theoreticians that read this
to hear *why* such an augmentation should have any chance of helping.
The explanation is simply that LALR parser generators are constrained
to use only a fixed set of states based on an LR(0) parser. Adding
*any* rule, no matter how useless, might increase the number of basic
states. If the number of states expands in just the right place, then
the LALR parser generator will use these distinctions to disambiguate
conflicts. This is exactly the result that we have achieved.
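This effect can be checked mechanically. The following sketch is an
ordinary textbook LR(0) item-set construction (not the tool's actual
code): it counts the canonical LR(0) states for the Appendix A grammar
with and without the augmenting rule, reproducing the "10 states" and
"12 states" figures quoted in this section.

```python
# Count canonical LR(0) item sets for a grammar, before and after
# adding the "useless" augmenting rule program : '(' DIGIT error.
from collections import deque

def lr0_states(rules):
    """rules: list of (lhs, rhs-tuple); rules[0] must be the $accept rule.
    Returns the number of canonical LR(0) item sets."""
    nonterms = {lhs for lhs, _ in rules}

    def closure(items):
        items, work = set(items), deque(items)
        while work:
            r, dot = work.popleft()
            rhs = rules[r][1]
            if dot < len(rhs) and rhs[dot] in nonterms:
                for i, (lhs, _) in enumerate(rules):
                    if lhs == rhs[dot] and (i, 0) not in items:
                        items.add((i, 0))
                        work.append((i, 0))
        return frozenset(items)

    def goto(items, sym):
        return closure({(r, d + 1) for r, d in items
                        if d < len(rules[r][1]) and rules[r][1][d] == sym})

    start = closure({(0, 0)})
    states, work = {start}, deque([start])
    while work:
        st = work.popleft()
        for sym in {rules[r][1][d] for r, d in st if d < len(rules[r][1])}:
            nxt = goto(st, sym)
            if nxt and nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return len(states)

base = [("$accept", ("program",)),
        ("program", ("(", "paren_fixer")),
        ("program", ("(", "number", ")")),
        ("program", ("paren_fixer", ")")),
        ("program", ("number",)),
        ("paren_fixer", ("DIGIT",)),
        ("number", ("DIGIT",))]

augmented = base + [("program", ("(", "DIGIT", "error"))]

print(lr0_states(base), lr0_states(augmented))   # -> 10 12
```

The augmenting rule splits the state holding both "DIGIT ." items and
also creates the extra state reached on "error", which is exactly the
accidental by-product noted later in this section (12 states, versus
the 11 achievable by direct state splitting).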

To understand why the above augmentation was selected, we start by
recalling the description of the state which contained the conflict:

>state 2: 2 incoming, 0 outgoing states; entering via state 0:
>Minimal stack: 0 ! 2
>symbols: $start ! DIGIT
>sentence: $start ! DIGIT

and:

>state 2
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)
>
>. reduce 5
>

The problem that can be seen here is that the context of the state is
too small. Basically, when we are in this state, we do not recall
what the context to the left was beyond the "DIGIT" which was
indicated in:

>symbols: $start ! DIGIT

Note that the "!" shows that there are several different things that
could precede this state on the stack. Indeed, if only we could
recall whether we came from state 1, which means there was a '(', or
state 0, which means there was no left context, then we could decide
what to do in this state. Hence, we *wish* there was an entry in state
2 that made its core items look like:

>state 2
>paren_fixer : DIGIT . (5)
>number : DIGIT . (6)
> wish : '(' DIGIT . ...
>
>. reduce 5
>

If we could implant this additional entry into this state, then this
state would be guaranteed to have a left context including a '('. We
don't care where the other alternative ends up (there will have to be
a state to take care of it), but we would definitely *not* have this
LALR-only problem.

In order to have our wish come true, we recall that the only way that
an entry is made into the core for a state (i.e., where we want our
wish line to appear) is if there is a production that looks like what
we have entered. So we see that we will have to add a production to
the grammar. Since we *don't* want to affect any of the other aspects
of the grammar, we want to make sure it is a "useless rule", as
defined above. Hence the rule must look something like:

wish : '(' DIGIT . NEVER ;

The one remaining question is what the left hand side should be so that
we will achieve the desired implantation. Clearly the word "wish"
will not tie into the grammar, and will have no impact. The correct
selection is actually made by looking back at the conflict context as
described in Appendix A:


>Demonstration(s) of reduction via rule (6) in state 2, when next token is ')'
>number : DIGIT (6)
>
>
1>From state 1
2> '(' number transitions to state 7, and then <')'> can follow.
3> DIGIT . (6)
>
4>Which requires a 1 deep state stack below state 2.
5>After states: 1 2 on lookahead <')'> REDUCE via (6) is possible.
6> Minimal access sentence: '(' DIGIT . ')'
>

If we succeed in forcing:

wish : '(' DIGIT . NEVER ;

into the core items for state 2, then the prior states must also have
corresponding entries, but with the "." backed up one space. Hence we
must have:

wish : '(' . DIGIT NEVER ;

in the state that led to state 2. Looking at our demonstration, we
see that we need to have this prior wish implanted into state 1:

>state 1
>program : '(' . paren_fixer (1)
>program : '(' . number ')' (2)
>
>DIGIT shift 2
>. error
>
>paren_fixer goto 6
>number goto 7

Looking at the entries in state 1, it is clear that if we added a new
rule:

program : '(' DIGIT error ;

then it would have to be placed in state 1 exactly as desired, and we
would also have the desired entry in state 2. If you run yacc on the
augmented grammar presented at the start of this section, and examine
the verbose output, you will see that this is exactly what has
happened in the state descriptions.

The slightly sad news is that this augmented grammar has 12 states,
rather than 11, which was achievable via state splitting. Basically,
in the process of splitting the desired state, we created another
state as well (by accident).

Unfortunately, my current algorithm does not provide solutions to all
LALR-only conflicts. I believe it is close, but it is generally quite
difficult to manipulate the kernel of the states by varying the
grammar. The grammar represents a list of transitions, which are
non-deterministically explored in a virtual parser, which is
implemented as a deterministic push down automaton. All this
indirection through multiple layers of abstraction makes it rather
hard to focus on, and split a single state. The general approach is
to examine the conflict contexts provided by the YACC Cross Reference
system, and construct an appropriate production. This algorithm is
currently implemented, and the results appear with the caveat:

"The following rules might split the problematic states:"
^^^^^

As far as I am concerned, further research is required on this topic.
Perhaps a reader will direct me to some existing body of work.



3.2 Interpreting Conflict Analysis Documentation

Within the last section we examined details of LALR-only conflicts,
and provided example analysis of a small grammar using the
automatically generated documentation provided in Appendix A. In this
section, we will explore the more general format of the information
presented in the documentation section entitled: "Explanations for all
reductions suggested in conflicts." Understanding the relatively
simple scenarios provided in the last section is a general
prerequisite for reading this section. If you can understand most of
what is in Appendix A, then this section should be meaningful, and
allow you to follow what is provided in the y.output file for the
YACCable C++ grammar. Samples will commonly be taken from that file,
but the samples may not reflect the most up-to-date grammar, and
should be viewed only as samples.

The "Explanations ..." section of the automated documentation is
broken up into groups for each conflict state and lookahead symbol.
Specifically, for each state in which there is a conflict, and for
each lookahead symbol which induces a conflict, there is a
corresponding section.

The following is a typical section, taken directly from Appendix A:

>Demonstration(s) of reduction via rule (5) in state 2, when next token is <')'>
>paren_fixer : DIGIT (5)
>
>
>From state 0
> $start paren_fixer transitions to state 4, and then <')'> can follow.
> DIGIT . (5)
>
>Following the 1 states below state 2, with lookahead <')'> REDUCE via (5) is possible.
>Minimal stack context: 0 2
>Corresponding symbols: $start DIGIT . ')'
>Sample sentence: $start DIGIT . ')'
>
>
>
>Demonstration(s) of reduction via rule (6) in state 2, when next token is <')'>
>number : DIGIT (6)
>
>
>From state 1
> $start '(' number transitions to state 7, and then <')'> can follow.
> DIGIT . (6)
>
>Following the 1 states below state 2, with lookahead <')'> REDUCE via (6) is possible.
>Minimal stack context: 0 1 2
>Corresponding symbols: $start '(' DIGIT . ')'
>Sample sentence: $start '(' DIGIT . ')'
>
>
>Summary of conflict contexts leading to state 2 and lookahead symbol <')'>
>Possible reductions rules include (5,6)
>paren_fixer : DIGIT (5)
>number : DIGIT (6)
>3 conflict contexts were found.
>
>--2+-1(6)
> --0(5)
>
>LALR-only-conflicts are present. Splittable states include: 2
>3 conflict contexts (1 splittable state(s)) are LALR-only problems.
>
>LALR-only conflict contexts leading to state 2 and lookahead symbol <')'>
>Possible reductions rules include (5,6)
>paren_fixer : DIGIT (5)
>number : DIGIT (6)
>3 conflict contexts were found.
>
>Unambiguous context tree is:
>
>--2+-1(6) $start '(' DIGIT
> --0(5) $start DIGIT
>
>The following rules might split the problematic states:
>
>program : '(' DIGIT error ; /* SPLIT state(s) 2 following state 1 for rule(s) (6) out of (5,6) */
>Minimal stack context: 0 1 2
>Corresponding symbols: $start '(' DIGIT . ')'
>Sample sentence: $start '(' DIGIT . ')'
>
>$accept : DIGIT error ; /* SPLIT state(s) 2 following state 0 for rule(s) (5) out of (5,6) */
>Minimal stack context: 0 2
>Corresponding symbols: $start DIGIT . ')'
>Sample sentence: $start DIGIT . ')'


Each "section" which focuses on a state/lookahead, has one or more
"Demonstration" areas (the above example has two demonstration areas),
followed by a LALR-only-conflict analysis area. If there are
LALR-only conflicts (or conflict components), then there is a summary
of the conflict contexts (an "Unambiguous context tree"), and a list of
proposed augmentations to remove the LALR-only components.

The following sections will describe in detail the contents of the
"Reduction Demonstrations", "Automatic LALR-only Analysis", "Automatic
LALR-only Summaries", and the "LALR-only augmentation proposals".




3.2.1 Conflict Reduction Demonstrations

Sections beginning with the text:

"Demonstration(s) of reduction via rule (Rn) in state Sn,
when next token is Tn"

are conflict reduction demonstrations. The presence of such a section
indicates that there is either a Reduce/Reduce or a Shift/Reduce
conflict in state Sn, when the lookahead is token Tn. Specifically,
this section is provided to demonstrate how indeed the reduction Rn
could possibly take place.

In this section we will discuss the classical S/R conflict on
"else" as an example. To generate a demonstration, we used the
YACCable C++ grammar, and produced the following:

>Demonstration(s) of reduction via rule (472) in state 1100, when next token is <ELSE>
>selection_statement : IF '(' comma_expression ')' statement (472)
>
>
>From state 1036
> $start translation_unit identifier_declarator '{' statement_list_opt ! IF '(' comma_expression ')' statement transitions to state 1100, and then <ELSE> can follow.
> selection_statement . (459)
> IF '(' comma_expression ')' statement . (472)
>
>Following the 5 states below state 1100, with lookahead <ELSE> REDUCE via (472) is possible.
>Minimal stack context: 0 1 52 241 423 ! 656 833 950 1036 656 833 950 1036 1100
>Corresponding symbols: $start translation_unit identifier_declarator '{' statement_list_opt ! IF '(' comma_expression ')' IF '(' comma_expression ')' statement . ELSE
>Sample sentence: $start IDENTIFIER '{' ! IF '(' IDENTIFIER ')' IF '(' IDENTIFIER ')' ';' . ELSE


For a novice YACC author, it is traditionally a mystery as to why
there is a S/R conflict on ELSE, in a typical C grammar. If we
examine the state description for the conflict state, we would only
see:

>1100: shift/reduce conflict (shift 1149, reduce 472) on ELSE
>state 1100
>selection_statement : IF '(' comma_expression ')' statement . (472)
>selection_statement : IF '(' comma_expression ')' statement . ELSE statement (473)

Since this core is all that is visible, the novice YACC author would
construct the program (mentally) that looked like:

main() { if (1) foo ;
else ...

and be amazed that a parser generator even *considered* doing anything
other than shifting. After all, if a reduction of "if (1) foo ;" is
to "selection_statement", how could the "else ..." ever be
incorporated into a valid program? Unfortunately, this novice
performed insufficient analysis, and failed to see the real "conflict
context." There is a real conflict here, and the demonstration
isolates it quite precisely.

After looking at enough of the "demonstrations", a reader will be able
to quickly realize the *real* conflict takes place when a program such
as:

main() { if (0)
             if (0)
                 foo;
         else ...

or, with slightly different indentation:

main() { if (0)
             if (0)
                 foo;
             else ...

is presented. We will now explain how such examples (which hopefully
have meaning to a reader) are made visible by the lines of the
demonstration.


Each demonstration context is provided based on a "conflict
contributing state", which is some state that must be reached prior to
the actual conflict state Sn. In this case, the conflict contributing
state is state 1036. Moreover, *BEYOND* reaching state 1036, a 5 deep
yacc stack containing the states "1036 656 833 950 1036 1100" is
needed to really demonstrate this conflict.

If we were to look at the section of the verbose output that describes
states, we would see:

>state 1036: 1 incoming, 131 outgoing states; entering via state 950:
>Minimal stack: 0 1 52 241 423 ! 656 833 950 1036
>symbols: $start translation_unit identifier_declarator '{' statement_list_opt ! IF '(' comma_expression ')'
>sentence: $start IDENTIFIER '{' ! IF '(' IDENTIFIER ')'

The reader should notice the similarity between the line "symbols:
..." and the line that follows "From state 1036" in the demonstration.
This information was gathered into the "demonstration" scenario to
make it easier for a reader to follow without having to look up state
1036 access information. Recall that the "symbols:" entry provides a
list of symbols that correspond to the shortest stack that has state
1036 in it. Hence, the first line of the demonstration presents the
"stack context" which leads to this conflict contributing state 1036.

Looking back at the demonstration, to the right of the stack context
for state 1036 we see:

>... statement transitions to state 1100, and then <ELSE> can follow.
> selection_statement . (459)
> IF '(' comma_expression ')' statement . (472)

Ignoring the text "transitions to ...", we focus now on the remainder
of the text:

>... statement
> selection_statement . (459)
> IF '(' comma_expression ')' statement . (472)

these three lines suggest that the reduction via:

selection_statement : IF '(' comma_expression ')' statement (472)

followed by:

statement : selection_statement (459)

would provide an acceptable scenario for this "demonstration". Note
that the reductions that were just listed are "apparent" in the lines
of the demonstration. Specifically, the right hand side of rule 472
is shown in these three lines with the letter 'I' in "IF '('..."
appearing directly below the first 's' in "selection_statement", which
is the left hand side of rule 472. Similarly, the first 's' in
"selection_statement" (the rhs side of rule 459) appears directly
below the "s" in "statement", (which is the left hand side of rule
459).

The "5 states below state 1100" correspond to the 5 tokens
read across the line:

> IF '(' comma_expression ')' statement . (472)

In summary we see that we have the *very* reasonable option of making
a reduction via rule 472 when the next token is "ELSE", whenever we
have the left context which consists of states:


>Minimal stack context: 0 1 52 241 423 ! 656 833 950 1036 656 833 950 1036 1100

Note that state 1036 is located, as required, 5 states from the right
end of the list (5 below the top of the state stack). It turns out
that every occurrence of state 1036 is preceded by states 656-833-950,
and hence the first real option in such a stack is (as indicated by
the '!') just beneath state 656.

As mentioned earlier, each state is associated with an "accessing
symbol", and hence the above list of states has an equivalent list of
symbols:

>Corresponding symbols: $start translation_unit identifier_declarator '{' statement_list_opt ! IF '(' comma_expression ')' IF '(' comma_expression ')' statement . ELSE

Finally, each symbol has (as described earlier) a minimal set of
corresponding terminal tokens. Using these, the listing expands the
above symbol list into a sample sentence composed of only terminals:

>Sample sentence: $start IDENTIFIER '{' ! IF '(' IDENTIFIER ')' IF '(' IDENTIFIER ')' ';' . ELSE

Recall that the initial "IDENTIFIER" is meant to be a function
declarator, and although syntactically correct, a constraint would
require something more like "main()". Similarly, we note that
"statement" was expanded to a lone semicolon, but an arbitrary
statement could appear there. So we arrive at an understanding of
what the *real* problem is, and note that the parser is unsure of what
to do when it sees:

IDENTIFIER '{' ! IF '(' IDENTIFIER ')' IF '(' IDENTIFIER ')' ';' . ELSE

main () { if ( a ) if ( b ) goo; else

Indeed, there is good reason to be unsure, as the grammar is clearly
ambiguous. Fortunately, specifications such as the ANSI C standard
assert that the else should match the last "else-less" if-statement.
Since the statement "if ( b ) goo;" is an else-less if-statement, it
would be "incorrect" to reduce via rule 472, abandoning the upcoming
else. Finally, we see that the shift/reduce conflict is being handled
*exactly* as it is specified in a standard. :-) (at least in this
context...)
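The "match the nearest else-less if" behavior that the shift
implements can be mimicked by a toy recursive-descent parser. This is
a hypothetical sketch (simplified token shapes, not the grammar under
discussion): it greedily consumes an "else" as soon as one is visible,
which is precisely what shifting does.

```python
# A tiny recursive-descent parser for nested if/else over a token
# list.  Like yacc's preference for shift, it attaches each "else" to
# the nearest unmatched "if" -- the ANSI C rule quoted above.
def parse_stmt(toks, i=0):
    """Parse one statement at toks[i]; return (tree, next_index)."""
    if toks[i] == "if":
        cond = toks[i + 2]                # shape: if ( cond ) stmt
        body, i = parse_stmt(toks, i + 4)
        if i < len(toks) and toks[i] == "else":   # greedy: nearest if wins
            alt, i = parse_stmt(toks, i + 1)
            return ("if", cond, body, alt), i
        return ("if", cond, body, None), i
    return toks[i], i + 1                 # any other token: a simple stmt

tree, _ = parse_stmt(
    ["if", "(", "a", ")", "if", "(", "b", ")", "goo", "else", "boo"])
print(tree)  # the else attaches to the *inner* if:
             # ('if', 'a', ('if', 'b', 'goo', 'boo'), None)
```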

It is also reasonable to ask if there might be other contexts in which
the reduction could take place, that we have not considered. These
demonstration sections list absolutely *ALL* possible contexts for the
reduction listed. Hence, in the above example, we have verified that
in every possible context where a reduction via rule 472 in state 1100
on lookahead "ELSE" is possible (i.e., leads to eventual shifting of
"ELSE"), the reduction should be avoided (i.e., our resolution in
favor of Shift is correct).

There is only one detail in the demonstration format that was not
covered in the example above. For this detail, we cite a different
demonstration, but only point at the differing item:

>From state 286
> $start translation_unit TYPEDEFname ! TYPEDEFname '(' type_name transitions to state 354, and then <')'> can follow.
> global_or_scoped_typedefname abstract_declarator . (446)
> postfixing_abstract_declarator . (620)
> parameter_type_list . (622)
> '(' ')' type_qualifier_list_opt . (398)
> . (61)
>
>Following the 3 states below state 758, with lookahead <')'> REDUCE via (61) is possible.
>Minimal stack context: 0 1 28 ! 146 286 345 544 758
>Corresponding symbols: $start translation_unit TYPEDEFname ! TYPEDEFname '(' global_or_scoped_typedefname '(' ')' . ')'
>Sample sentence: $start TYPEDEFname ! TYPEDEFname '(' CLCL TYPEDEFname '(' ')' . ')'


In this example, attention should be placed on the "jagged" edge
provided by the series of reductions:

> ... type_name transitions to state 354, and then <')'> can follow.
> global_or_scoped_typedefname abstract_declarator . (446)
> postfixing_abstract_declarator . (620)
> parameter_type_list . (622)
> '(' ')' type_qualifier_list_opt . (398)
> . (61)

Notably, the first reduction is a "null reduction", and hence the yacc
stack prior to this reduction is represented by all tokens to the left
of this point. Secondly, the "3 deep state stack" corresponds to the
3 tokens to the left of this point, namely:

global_or_scoped_typedefname '(' ')'

Notice that the bottom left edge of this "jagged" sequence can be
visually read to extract the stack context above and beyond the
"conflict contributing state" (in this case, state 286). Hence the
line:

>Corresponding symbols: $start translation_unit TYPEDEFname ! TYPEDEFname '(' global_or_scoped_typedefname '(' ')' . ')'

consists of a part that accesses state 286, which gives context for
the conflict:

>Corresponding symbols: $start translation_unit TYPEDEFname ! TYPEDEFname '(' ...

and the jagged sequence which carries us to the conflict state 758:

> ... global_or_scoped_typedefname '(' ')' . ')'


3.2.2 Summarizing Reduction Contexts

After all the demonstration contexts have been exhibited, they are
summarized, and LALR-only conflict analysis takes place. This portion
of the conflict analysis always begins with the phrase: "Summary of
conflict contexts leading to state Sn and lookahead symbol <Tn>."
Basically, this section summarizes all the contexts into a
graphically presented tree. The following summary presentation was
taken from the analysis of the YACCable C++ grammar:

>Summary of conflict contexts leading to state 739 and lookahead symbol <')'>
>Possible reductions rules include (8,656)
>paren_identifier_declarator : scope_opt_identifier (8)
>global_opt_scope_opt_identifier : scope_opt_identifier (656)
>21 conflict contexts were found.
>
>--739+-954+-1039(8)
> | +-953(8)
> | +-837(8)
> | +-836(8)
> | --835(8)
> |
> +-741+-737(8)
> | +-544(8)
> | +-540(8)
> | --508(8)
> |
> +-1039(8,656)
> +-953(8,656)
> +-837(8,656)
> +-836(8,656)
> +-835(8,656)
> +-737(8,656)
> +-544(8,656)
> +-540(8,656)
> --508(8,656)

Notice that the tree is rooted at state 739. All the states listed to
the right of the root represent underlying yacc stack states (this is
the only time that moving down into the stack corresponds to moving to
the right in a diagram). The numbers in parentheses list the rules
that would be appropriate for reducing with if this context were
presented during a parse. For example, the line:

>--739+-954+-1039(8)

indicates there is a significant context present when state 739 is
atop state 954, atop state 1039. Specifically, when this context
arises, a reduction in state 739 using rule 8 would be possible.
Careful examination of the demonstrations provided earlier will indeed
locate the relevant demonstration, with the line:

>Following the 2 states below state 739, with lookahead <')'> REDUCE via (8) is possible.
>Minimal stack context: 0 1 52 241 423 659 835 ! 954 1039 954 739

Also note that other branches of the tree provided represent state 739
directly atop (for example) state 1039. In this context we see that
the parentheses include *both* rules 8 and 656:

>--739+-...
> |
> +-1039(8,656)

Here again, exploring the demonstrations would locate support for both
of these suggestions:

>Following the 1 states below state 739, with lookahead <')'> REDUCE via (8) is possible.
>Minimal stack context: 0 1 52 241 423 659 835 ! 954 1039 739

as well as:

>Following the 1 states below state 739, with lookahead <')'> REDUCE via (656) is possible.
>Minimal stack context: 0 1 52 241 423 659 835 ! 954 1039 739

The advantage of viewing these contexts in this condensed graphical
form is that it becomes easy to see which contexts represent truly
ambiguous left contexts (i.e., several reductions are applicable), and
which represent "LALR-only" conflict contexts. In the next section,
this fact will be exploited, and the pruning of the tree will leave
the LALR-only conflicts to be examined.
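The summarization itself amounts to grouping each demonstrated context
(a short suffix of the underlying yacc state stack) with the set of
rules it supports; contexts whose rule set contains every conflicting
rule are the truly ambiguous ones. A minimal sketch, using a few of
the state and rule numbers from the excerpts above (the data structure
itself is hypothetical, not the tool's):

```python
from collections import defaultdict

# Each demonstration contributes (underlying-state path, viable rule).
# Grouping by path yields the condensed tree's leaf annotations,
# e.g. "954 1039(8)" versus "1039(8,656)".
demonstrations = [
    ((954, 1039), 8),              # only rule 8 demonstrated here
    ((1039,), 8), ((1039,), 656),  # both rules demonstrated here
    ((953,), 8),  ((953,), 656),
]

contexts = defaultdict(set)
for path, rule in demonstrations:
    contexts[path].add(rule)

conflicting = {8, 656}
ambiguous = [p for p, r in contexts.items() if r == conflicting]
lalr_only = [p for p, r in contexts.items() if r != conflicting]
print(ambiguous)   # contexts supporting *both* reductions
print(lalr_only)   # candidates for LALR-only analysis
```

The contexts left in `lalr_only` are exactly the ones the next section
examines: left contexts where only one of the reductions can actually
lead to a shift of the lookahead.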


3.2.3 Automatic LALR-only Conflict Summaries

Having gathered together the underlying stack contexts as described in
the last section, it is a small operation to isolate "LALR-only
conflict" contexts. Recall from the discussion of LALR-only
conflicts that such conflicts are present when it is possible to
determine, based solely on left context and lookahead, what reduction
should take place. In contrast, in an LR conflict, it is impossible
to decide between several possible reductions, because they are both
possible (recall by "possible" we mean that the lookahead symbol will
be shifted eventually, or in more precise theoretical terms: the left
context after the reductions plus the lookahead symbol form a viable
prefix to a sentence in the language).

3.2.3.1 Direct Pruning of LR conflicts From the Context Tree

Consider first the simple tree taken from the analysis of the YACCable
C++ grammar:

>Summary of conflict contexts leading to state 739 and lookahead symbol <'['>
>Possible reductions rules include (8,656)
>paren_identifier_declarator : scope_opt_identifier (8)
>global_opt_scope_opt_identifier : scope_opt_identifier (656)
>12 conflict contexts were found.
>
>--739+-1039(8,656)
> +-954(8,656)
> +-953(8,656)
> +-837(8,656)
> +-836(8,656)
> +-835(8,656)
> +-741(8,656)
> +-737(8,656)
> +-544(8,656)
> +-540(8,656)
> --508(8,656)
>
>Pruning removed all conflict states: There are no LALR-only components

Notice that in this context tree, every time a context was found in
the demonstration section that proved the viability of reduction via
either rule 8 or rule 656, a second demonstration was also provided
that supported the feasibility of using the other reduction. So we
see that every context listed is representative of a full LR(1)
ambiguous left context. The fact that no LALR-only conflicts were
found is indicated in the final line of the analysis above.

In fact, if we examine some information about state 739 we see:

>state 739: 11 incoming, 0 outgoing states; ...
>Minimal stack: ...

Interestingly, there are only 11 possible states that can appear below
state 739 on the yacc stack. Since we have all 11 listed above in the
tree, we can see that in fact, no matter how we get to state 739, one
of the contexts shown will be present, and reduction via either rule 8
or 656 will be possible. The significance of this evaluation will
become clearer in the next example.


3.2.3.2 Pruning of LR conflicts From the Context Tree using State
Information


Consider another analysis section provided for the YACCable C++
grammar:

>Summary of conflict contexts leading to state 1105 and lookahead symbol <'{'>
>Possible reductions rules include (61,600)
>type_qualifier_list_opt : (61)
>simple_paren_typedef_declarator : '(' TYPEDEFname ')' (600)
>7 conflict contexts were found.
>
>--1105(61)+-1065--984--853(600)
> |
> --1041--957--843(600)
>
>Pruning removed all conflict states: There are no LALR-only components

Here again pruning removed all the contexts, but the reason is a bit
less clear. Note that what is indicated is that *IF* the top of the
yacc stack is state 1105, then reduction via rule 61 is possible no
matter what is underneath it. In contrast, the search for contexts
that allowed reduction via rule 600 required the yacc stack to be
either:

>Following the 3 states below state 1105, with lookahead <'{'> REDUCE via (600) is possible.
>Minimal stack context: 0 1 64 254 443 684 ! 853 984 1065 1105

or

>Following the 3 states below state 1105, with lookahead <'{'> REDUCE via (600) is possible.
>Minimal stack context: 0 1 64 254 443 684 ! 843 957 1041 1105

Why, you should ask, are there no LALR-only conflicts? Certainly we
can imagine that there are other states that could lurk beneath state
1105 on the yacc stack. If one of those states (other than 1065 and
1041) were under state 1105, then it would appear that only reduction
via rule 61 would be possible. Looking back at the information
gathered about state 1105 we see:

>state 1105: 2 incoming, 5 outgoing states; ...
>Minimal stack: ...

Which reveals that there are exactly 2 possible states that can lurk
beneath state 1105. Since we have listed 2 in our tree, there are no
other possibilities! The questioning should however continue about
what could lurk beneath state 1041 (for example).

>state 1041: 1 incoming, 48 outgoing states; ...
>Minimal stack: 1 64 254 443 684 ! 843 956 1041
>...

Interestingly, there is only one possible state that can lie beneath
state 1041, and so it must be state 956. We could then look at state
956 and see a similar piece of information, but the "Minimal stack:"
listing for state 1041 contains the whole story. Note that the "!" is
placed between state 684 and state 843. This indicates that there
are no other choices up until that point. To be specific, the only
state that can possibly be found beneath state 1041 is state 956, and
the only state that can be found beneath that is state 843.

Similar analysis of the other branch of the context analysis would
reveal an identical scenario. The bottom line is that if we are in
state 1105, then our demonstration has revealed that we can directly
reduce via rule 61. The examination just provided shows that whenever
we are in state 1105, the underlying yacc stack *MUST* be either "853
984 1065 1105", or "843 956 1041 1105", and in either case, reduction
via rule 600 is possible. Therefore, all left contexts of state 1105
support reduction via either rule.
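The "what can lurk beneath" walk used above is mechanical: whenever a
state has exactly one incoming state, the stack below it is forced, so
the chain can be followed until a state with several predecessors is
reached. A minimal sketch, where `incoming` is a hypothetical map from
state to the set of states that may appear directly below it (the
multi-predecessor entry for state 843 is an assumption for
illustration):

```python
# Follow single-predecessor links downward through the yacc stack,
# mirroring the state-1041 analysis: only 956 can lie beneath 1041,
# and only 843 beneath 956; below 843 the choices open up (the "!").
incoming = {1041: {956}, 956: {843}, 843: {684, 443}}

def forced_chain(state):
    """Collect the forced portion of the stack beneath a state."""
    chain = [state]
    while len(incoming.get(state, set())) == 1:
        (state,) = incoming[state]   # unpack the sole predecessor
        chain.append(state)
    return chain

print(forced_chain(1041))   # -> [1041, 956, 843]
```

The point where the walk stops corresponds exactly to the "!" in the
"Minimal stack:" listings: everything to its right is forced, and
everything to its left is merely one sample among several.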

Having explained how contexts are discarded, the next two examples
will show how contexts can remain, and true LALR-only conflicts can be
isolated.


3.2.3.3 LALR-only conflicts With No Consequences

Consider the following, also taken from the analysis of the YACCable
C++ grammar:

>Summary of conflict contexts leading to state 739 and lookahead symbol <')'>
>Possible reductions rules include (8,656)
>paren_identifier_declarator : scope_opt_identifier (8)
>global_opt_scope_opt_identifier : scope_opt_identifier (656)
>21 conflict contexts were found.
>
>--739+-...
> |
> ...
>
>LALR-only-conflicts are present. Splittable states include: 954 741 739
>12 conflict contexts (3 splittable state(s)) are LALR-only problems.
>
>LALR-only conflict contexts leading to state 739 and lookahead symbol <')'>
>Possible reductions rules include (8,656)
>paren_identifier_declarator : scope_opt_identifier (8)
>global_opt_scope_opt_identifier : scope_opt_identifier (656)
>12 conflict contexts were found.
>
>Unambiguous context tree is:
>
>--739+-954+-1039(8) $start IDENTIFIER '{' TYPEDEFname '(' ! '*' '(' '*' IDENTIFIER
> | +-953(8) $start IDENTIFIER '{' TYPEDEFname '(' ! '(' '*' IDENTIFIER
> | +-837(8) $start IDENTIFIER '{' ! CLCL TYPEDEFname '(' '*' IDENTIFIER
> | +-836(8) $start IDENTIFIER '{' ! INT '(' '*' IDENTIFIER
> | --835(8) $start IDENTIFIER '{' TYPEDEFname ! '(' '*' IDENTIFIER
> |
> --741+-737(8) $start IDENTIFIER '(' TYPEDEFname '(' ! '(' '*' IDENTIFIER
> +-544(8) $start IDENTIFIER '(' ! CLCL TYPEDEFname '(' '*' IDENTIFIER
> +-540(8) $start IDENTIFIER '(' ! INT '(' '*' IDENTIFIER
> --508(8) $start IDENTIFIER '(' ! TYPEDEFname '(' '*' IDENTIFIER


Since there is a conflict in state 739 between reduction via rules 8
and 656 when the lookahead is ')', we should recall that the default
resolution will be to use the reduction via rule 8. Hence, even
though we can show that there are 12 conflict contexts (9 leaves in
the above tree), we do not have to worry about this issue. Yes it is
true that these contexts are unambiguously deserving of reduction via
rule 8 (i.e., if we tried to reduce using rule 656 in these contexts
we would get a syntax error before the ')' was ever shifted). However,
since we are going to use rule 8 all the time, there is no reason to
worry about these LALR-only components of the LR conflict (note there
were some fully ambiguous contexts, so we refer to the entries shown
here as the "LALR-only components").

To make reading these contexts a bit easier, the sample sentence
(shown earlier during the demonstrations) for each context is
displayed to the right of the leaf. Here again we are repeating
information presented elsewhere, in order to enhance the local
readability of a section.


3.2.3.4 Significant LALR-only conflicts

As a final example, we will consider a conflict with a significant
LALR-only conflict component.

>Summary of conflict contexts leading to state 758 and lookahead symbol <','>
>Possible reductions rules include (61,74)
>type_qualifier_list_opt : (61)
>postfix_expression : global_or_scoped_typedefname '(' ')' (74)
>27 conflict contexts were found.
>
>--758+-754--527+-1039(74)
> | +-953(74)
> | +-570(74)
> | +-505(74)
> | --330(74)
> |
> --544--345+-1014(61,74)
> +-900(61,74)
> +-754(61,74)
> +-753(61)
> +-748(61,74)
> +-737(61,74)
> +-720(61,74)
> +-718(61,74)
> +-716(61,74)
> +-639(61,74)
> +-544(61,74)
> +-540(61)
> +-508(61,74)
> +-309(61,74)
> +-296(61,74)
> +-286(61,74)
> --174(61,74)
>
>LALR-only-conflicts are present. Splittable states include: 345 544 758
>6 conflict contexts (3 splittable state(s)) are LALR-only problems.
>
>LALR-only conflict contexts leading to state 758 and lookahead symbol <','>
>Possible reductions rules include (61,74)
>type_qualifier_list_opt : (61)
>postfix_expression : global_or_scoped_typedefname '(' ')' (74)
>6 conflict contexts were found.
>
>Unambiguous context tree is:
>

>--758+-754(74) $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'
> --544--345+-753(61) $start IDENTIFIER '(' '(' ! INT '(' CLCL TYPEDEFname '(' ')'
> --540(61) $start IDENTIFIER '(' ! INT '(' CLCL TYPEDEFname '(' ')'


As with the last section, we can note that since there is a conflict
between using reduction 61 and 74, we will be using rule 61. Hence
the contexts which unambiguously support the use of rule 61 can be
ignored.

Alas, the context suggested by state 758 atop state 754 is a real
problem. In this context, an attempt to use rule 61 will only lead to
a syntax error (otherwise it would have appeared). Hence we see that
although the grammar is clear, an LALR parser generator is not capable
(directly) of creating a parser that can correctly process such a
context. The word "correctly" is taken to mean "as well as a perfect
LR parser." (Recall from earlier discussions of LR parsing that the
word "perfect" is actually redundant.)

Several solutions to this dilemma are apparent. The first thing to do
is to understand the context by referring to the demonstration
provided earlier. If we are lucky, then the context might turn out to
be a case where some constraint is being violated, and hence the
problem is not significant for valid programs (i.e., error free
programs). If however we cannot ignore the problem, there are several
ways to deal with it.


The easiest way to handle LALR-only conflicts is to buy a better
parser generator. There are several commercially available supposedly
LR parser generators. My experience has been that they are NQLR (Not
Quite LR), but they are close enough that they will (hopefully)
process this context correctly. (Note that I haven't really played
with any of these parsers, but discussions with the vendors have
suggested that they are not "full LR parsers.")

The second solution is to improve the quality of the nearly-public-domain
parser generators by upgrading them to NQLR status.
Fixing up byacc would be an excellent approach.

Finally, the third solution is to augment the grammar with "useless"
rules. This approach is the topic of the next section.


3.2.4 Augmentation Proposals to Remove LALR-only Conflicts

Whenever the summary section isolates LALR-only conflicts, as
described above, the YACC Cross Reference Tool makes several
suggestions as to how to cure these problems by augmenting the
grammar. In Appendix A, we see the example:

>LALR-only-conflicts are present. Splittable states include: 2
>3 conflict contexts (1 splittable state(s)) are LALR-only problems.
>
>LALR-only conflict contexts leading to state 2 and lookahead symbol <')'>
>Possible reductions rules include (5,6)
>paren_fixer : DIGIT (5)
>number : DIGIT (6)
>3 conflict contexts were found.
>
>Unambiguous context tree is:
>
>--2+-1(6) $start '(' DIGIT
> --0(5) $start DIGIT
>
>The following rules might split the problematic states:
>
1>program : '(' DIGIT error ; /* SPLIT state(s) 2 following state 1 for rule(s) (6) out of (5,6) */
>Minimal stack context: 0 1 2
>Corresponding symbols: $start '(' DIGIT . ')'
>Sample sentence: $start '(' DIGIT . ')'
>
2>$accept : DIGIT error ; /* SPLIT state(s) 2 following state 0 for rule(s) (5) out of (5,6) */
>Minimal stack context: 0 2
>Corresponding symbols: $start DIGIT . ')'
>Sample sentence: $start DIGIT . ')'


Note that two rules are proposed on lines "1>" and "2>" above.
Since we are handling rules 5 and 6, we know that yacc will tend to
support rule 5 automatically (by default). What we need is an
augmentation that will sponsor (i.e., split out the state(s) needed to
support) rule 6. The rule on line "1>" asserts that it will perform
this feat. Note that the "Minimal stack context:" it services is "1
2", which, read left to right (lower in the stack to higher in the
stack), places yacc state 2 atop state 1. This is indeed the context that must
invoke rule 6. (The analysis for this conclusion was presented in
"3.1.3.2 Sample LALR-only ambiguous grammar").

The line containing "Corresponding symbols" lists the access symbols
for the states 1 and 2, which are "'('" and "DIGIT" respectively,
followed by the lookahead symbol ')' (in this case the symbols are
terminals, but in general they might be non-terminals).

The next line provides "Sample sentence:", which is the expansion of
each of the symbols provided on the "Corresponding symbols:" line.
Since the symbols were all terminals, their "minimal terminal
expansion" is basically the terminals (this is a bit unusual, but
since the sample grammar is so small, these sample lines are a bit
degenerate).

If we add the only rule suggested, then the grammar will, as discussed
earlier, become conflict-free. I do not believe that I have the
totally correct algorithm yet, so I am not always this lucky.

Looking next at a more complex example taken from the YACCable C++
grammar analysis:



>LALR-only-conflicts are present. Splittable states include: 345 544 758
>6 conflict contexts (3 splittable state(s)) are LALR-only problems.
>
>LALR-only conflict contexts leading to state 758 and lookahead symbol <','>
>Possible reductions rules include (61,74)
>type_qualifier_list_opt : (61)
>postfix_expression : global_or_scoped_typedefname '(' ')' (74)
>6 conflict contexts were found.
>
>Unambiguous context tree is:
>
>--758+-754(74) $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'
> --544--345+-753(61) $start IDENTIFIER '(' '(' ! INT '(' CLCL TYPEDEFname '(' ')'
> --540(61) $start IDENTIFIER '(' ! INT '(' CLCL TYPEDEFname '(' ')'
>
>The following rules might split the problematic states:
>
>postfix_expression : global_or_scoped_typedefname '(' ')' error ; /* SPLIT state(s) 758 following state 754 for rule(s) (74) out of (61,74) */
>Minimal stack context: 0 1 33 174 330 ! 527 754 758
>Corresponding symbols: $start translation_unit paren_identifier_declarator '(' '(' ! global_or_scoped_typedefname '(' ')' . ','
>Sample sentence: $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')' . ','
>
>...


Note: (In this augmentation suggestion, we are focusing on the
proposal to support reduction via rule 74. Clearly the goal is to
split state 758 so that there is a distinct version whenever state 754
precedes it. Alas, this example "augmentation" does not split the
desired state (I checked by adding it, and then running this YACC
tool). I believe the approach is sound, and I am simply not isolating
the offending context well enough in my rule proposal. Future work
should correct this situation, but I did not want to further delay a
new release of a C++ grammar until this research was completed.)

Getting back to reading the analysis, note that the minimal stack
context includes eight states "0 1 33 174 330 ! 527 754 758", but the
"!" between states 330 and state 527 indicates that only a typical
sample was provided to the left of this point (i.e., there are other
possible left stack contexts besides "0 1 33 174 330").

The line providing "Corresponding symbols:" lists the access symbols
for each of these states, and also has the embedded "!" to show where
the options begin. Note that since there were eight states in the
minimal stack context, there are eight symbols, plus the lookahead
token listed on this line.

Finally, the "Sample sentence:" provides the minimal expansion of each
of the tokens listed in the previous line. For convenience, the
"!" is also correspondingly embedded, but since "Minimal expansions"
are rarely unique, this line contains samples throughout (i.e., other
samples exist with variations to the right of the "!"). Note,
however, that it is very convenient to see this form, as it provides
simple, direct insight into what sort of conflict is being addressed.
In this case, we have:

>Sample sentence: IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')' ','

This is a complete program prefix. Specifically, we can deduce that
the program:

foo ( ( :: MY_CLASS () ....

provides a confusing program to an LALR parser. The C++ aficionado
(if not cursing, or developing a migraine) would probably parse this
as:

foo // some identifier
( // either starting an initializer, or a prototype
( // can't be a prototype, so we must be an initializer
// starting a parened-expression, or an old style cast

:: MY_CLASS () // either a parened-expression: function like cast of void
// or a "function returning MY_CLASS" type
, // well this rules out an old style cast, so...
 // it must be a parenthesized comma-expression that is
 // the *FIRST* initializer argument


Hence this segment is rather similar to code like:

int foo ((5, 6));

which is similar to old style initialization:

int foo = (5, 6);

and since a comma expression returns the value of only its last argument,
this is the same as:

int foo = (6);

or simply:

int foo = 6;

I can vaguely imagine code like this being useful, in that the
first expression might have side effects, but this is not really the
topic of this paper.

What should be noted is that a very meaningful understanding of the
conflict, the LALR-only components, and the grammar was obtained
by examining the tables shown above.



4 FUTURE ENHANCEMENTS

The following are items that I would like to add to an LALR(1) parser
generator:

1) When summarizing non-terminals, an indication should be given of
whether the non-terminals are left-recursive, right-recursive,
recursive, or none-of-the-above. Note that if a non-terminal is both
left and right recursive, then the grammar *must* be ambiguous. This
is a common error, and can easily be diagnosed by such evaluations.
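The check proposed in item (1) can be sketched directly over a rule
list. This is a minimal sketch that tests *direct* recursion only; a
real implementation would also chase indirect recursion through other
non-terminals, and the sample rules are hypothetical:

```python
# Flag non-terminals that are both left- and right-recursive: if any
# exist, the grammar must be ambiguous.  Only direct recursion
# (lhs appearing as the first or last right-hand-side symbol of its
# own rule) is checked in this sketch.

rules = [
    ("expr", ["expr", "'+'", "expr"]),   # both left- and right-recursive
    ("list", ["list", "','", "item"]),   # left-recursive only
    ("item", ["DIGIT"]),                 # not recursive
]

def recursion_report(rules):
    left, right = set(), set()
    for lhs, rhs in rules:
        if rhs and rhs[0] == lhs:
            left.add(lhs)
        if rhs and rhs[-1] == lhs:
            right.add(lhs)
    return left & right   # both => the grammar is ambiguous

print(recursion_report(rules))   # -> {'expr'}
```

Here `expr` is flagged, exactly the classic ambiguous
`E : E '+' E` situation, while purely left-recursive `list` passes.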

2) The bulk of the information presented in the "Minimal access to
states" can be presented graphically, in the same way as conflict
analysis shows the state contexts. This is actually easy to do, but
the thought appeared in a discussion too close to the release date of
the YACCable C++ grammar to be able to implement it.

3) Although I already present what I believe is the most difficult
aspect of conflict analysis (i.e., identifying the contexts that
induce the conflicts), there is a small section that is presented in
Pennello and DeRemer's paper that I omit. I summarize this by simply
asserting (during a conflict analysis context description) that upon
entering a state, some specific terminal token can "follow."
Basically, by walking forward through the parser, it is rather
straightforward to see *how* this can happen, but I don't automate
the procedure currently. Pennello and DeRemer suggest that this
analysis should be integrated into the conflict analysis presentation,
but I believe it would clutter an already large section of y.output. I
expect that the "proof" of my assertions should be presented in a
separate section. This would prevent redundantly printing the same
explanation in several places.

4) It is known that it is possible to split states in an LALR(1)
parser so that all LALR-only conflicts are removed. I believe it is
possible to augment a grammar with additional rules (which actually
are never used!) that induce the desired state splitting in the
underlying LR(0) automaton, and hence the desired state splitting in
the LALR(1) parser. I am very close to this in my LALR-only conflict
analysis, but I am not quite there. The advantages of this approach
to users who wish to, need to, or must use a specific LALR(1) parser
generator are rather significant. The research toward this end needs
to be finished.

5) It is possible to, after splitting states by whatever method, build
a NQLR(1) parser (Not Quite LR(1)) that can parse every valid sentence
that a corresponding LR(1) parser can parse. The one remaining
distinction between LR(1) and NQLR(1) parsing would be that the LR(1)
parser never performs any reduction when the lookahead token is not
eventually "shiftable." To put it another way, NQLR(1) parsers
commonly perform several reductions before realizing that there is a
syntax error afoot, whereas LR(1) parsers *never* perform a reduction
in the face of a syntax error. I believe it is possible to enhance
the parsing engine (driven by NQLR(1) tables) such that full LR(1)
parsing (re: immediate error detection) is provided, but yet the
performance of the parser on valid sentences is *not* diminished from
what is provided by a classical LALR(1) parsing engine (i.e., the
byacc skeleton; or the yacpar that is used with yacc). This may be
the topic of another paper, or simply another product. (Late breaking
news: this error detection proposal (without the state splitting) was
implemented years ago by Corbett (author of byacc and bison), and
is commonly available as a parsing engine called "bison.hairy". For
compatibility with yacc grammars, LR(0) consistent states must be made
to ignore lookahead during this process).

6) The transformations that I performed manually to create the
YACCable C++ grammar can generally be automated. As a result, it
should be possible to have a much more concise grammar, plus a set of
"grammar transformation" operations that result in the entire grammar
as it is released. I expect that part of this will involve adopting a
grammar notation that uses regular expressions on the right hand side
of productions. One problem with a grammar that is as large as the
C++ grammar is that adding "action code" is incredibly tedious, and
repetitive. I believe that these two problems have a unifying
solution that should allow actions to be specified as "overloaded
functions", wherein the parser generator would select the appropriate
function. Moreover, these automatically selected action code segments
should be carried gracefully (i.e., correctly, with proper adjustments
to $n references) through the indicated grammatical transformations.

To give but a tiny hint of the usefulness, consider how many
productions require the entry of action code like:

{ $$ = $2; }

With knowledge of the types associated with elements of the right hand
side as well as that of the left hand side, such actions are often
"obvious." Note that yacc typically complains when the default action
"$$=$1;" leads to a type mismatch, but (in all implementations that I
know of) doesn't bother to provide a superior (i.e., type matching)
choice for a default action.
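The kind of type-matching default asked for in the last paragraph can
be sketched as follows. The symbol-to-type table here is hypothetical
(a real yacc derives types from %type/%token declarations), and the
selection policy (first matching right-hand-side symbol) is one
reasonable choice among several:

```python
# Pick a type-matching default action: instead of always emitting
# "$$ = $1;", choose the first right-hand-side symbol whose declared
# type matches that of the left-hand side.

types = {"expr": "node", "'('": "token", "')'": "token"}

def default_action(lhs, rhs):
    """Return a default action matching the LHS type, if one exists."""
    want = types.get(lhs)
    for i, sym in enumerate(rhs, start=1):
        if types.get(sym) == want:
            return "$$ = $%d;" % i
    return None   # no type-matching symbol; yacc would warn instead

print(default_action("expr", ["'('", "expr", "')'"]))   # -> $$ = $2;
```

For the production `expr : '(' expr ')'` this selects `$$ = $2;`,
exactly the hand-entered action the text gives as its motivating
example.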




A.1 Appendix A: Sample LALR-only ambiguous grammar: automatic
documentation

JRYACC Cross Reference Tool, Version 1.00 pre-alpha
Copyright (c) 1991, James Roskind, All Rights Reserved

Reference Grammar

$accept
: program $end /* (0) */
;

program
: '(' paren_fixer /* (1) */
| '(' number ')' /* (2) */
| paren_fixer ')' /* (3) */
| number /* (4) */
;

paren_fixer
: DIGIT /* (5) */
;

number
: DIGIT /* (6) */
;

Alphabetized Grammar (alphabetically by tokens)


%start program

number
: DIGIT /* (6) */
;

paren_fixer
: DIGIT /* (5) */
;

program
: '(' number ')' /* (2) */
| '(' paren_fixer /* (1) */
| number /* (4) */
| paren_fixer ')' /* (3) */
;



Sample Expansions for the 3 Non-terminal Tokens (alphabetically by token)

Name Usage count Minimal expansion

number 2 DIGIT
paren_fixer 2 DIGIT
program 1 DIGIT


Summary Descriptions of the 4 Terminal Tokens (alphabetically by token)

Name Usage Count Value Precedence Associativity

'(' 2 40 0 -
')' 2 41 0 -
DIGIT 2 257 0 -
error 0 256 0 -


Symbol and Grammar Cross Reference for the 7 Tokens
(alphabetically by token name, including expansion samples)
Sample expansions are also provided for the 6 rules

'('
TERMINAL, used in 2 rules value = <40>


')'
TERMINAL, used in 2 rules value = <41>


DIGIT
TERMINAL, used in 2 rules value = <257>


error
TERMINAL, used in 0 rules value = <256>


number
NON-TERMINAL, used in 2 rules
number : DIGIT (6)
DIGIT



paren_fixer
NON-TERMINAL, used in 2 rules
paren_fixer : DIGIT (5)
DIGIT



program
NON-TERMINAL, used in 1 rule re: $accept : program $end (0)
program : '(' number ')' (2)
'(' DIGIT ')'

program : '(' paren_fixer (1)
'(' DIGIT

program : number (4)
DIGIT

program : paren_fixer ')' (3)
DIGIT ')'






Sample Stack Context and Accessing Sentences for each State

state 0: 0 incoming, 5 outgoing states; Never has a left context
Minimal stack: 0
symbols: $start
sentence: $start

state 1: 1 incoming, 3 outgoing states; entering via state 0:
Minimal stack: 0 1
symbols: $start '('
sentence: $start '('

state 2: 2 incoming, 0 outgoing states; entering via state 0:
Minimal stack: 0 ! 2
symbols: $start ! DIGIT
sentence: $start ! DIGIT

state 3: 1 incoming, 0 outgoing states; entering via state 0:
Minimal stack: 0 3
symbols: $start program
sentence: $start DIGIT

state 4: 1 incoming, 1 outgoing states; entering via state 0:
Minimal stack: 0 4
symbols: $start paren_fixer
sentence: $start DIGIT

state 5: 1 incoming, 0 outgoing states; entering via state 0:
Minimal stack: 0 5
symbols: $start number
sentence: $start DIGIT

state 6: 1 incoming, 0 outgoing states; entering via state 1:
Minimal stack: 0 1 6
symbols: $start '(' paren_fixer
sentence: $start '(' DIGIT

state 7: 1 incoming, 1 outgoing states; entering via state 1:
Minimal stack: 0 1 7
symbols: $start '(' number
sentence: $start '(' DIGIT

state 8: 1 incoming, 0 outgoing states; entering via state 4:
Minimal stack: 0 4 8
symbols: $start paren_fixer ')'
sentence: $start DIGIT ')'

state 9: 1 incoming, 0 outgoing states; entering via state 7:
Minimal stack: 0 1 7 9
symbols: $start '(' number ')'
sentence: $start '(' DIGIT ')'




Concise list of conflicts:

2: reduce/reduce conflict (reduce 5, reduce 6) on $end
2: reduce/reduce conflict (reduce 5, reduce 6) on ')'
paren_fixer : DIGIT . (5)
number : DIGIT . (6)



Canonical Description of Conflicts

state 2:
Canonical Conflict Access: $start DIGIT . $end (2 reductions)

state 2:
Canonical Conflict Access: $start DIGIT . ')' (2 reductions)

Verbose listing of state transitions

state 0
$accept : . program $end (0)

'(' shift 1
DIGIT shift 2
. error

program goto 3
paren_fixer goto 4
number goto 5


state 1
program : '(' . paren_fixer (1)
program : '(' . number ')' (2)

DIGIT shift 2
. error

paren_fixer goto 6
number goto 7


2: reduce/reduce conflict (reduce 5, reduce 6) on $end
2: reduce/reduce conflict (reduce 5, reduce 6) on ')'
state 2
paren_fixer : DIGIT . (5)
number : DIGIT . (6)

. reduce 5


state 3
$accept : program . $end (0)

$end accept


state 4
program : paren_fixer . ')' (3)

')' shift 8
. error


state 5
program : number . (4)

. reduce 4


state 6
program : '(' paren_fixer . (1)

. reduce 1


state 7
program : '(' number . ')' (2)

')' shift 9
. error


state 8
program : paren_fixer ')' . (3)

. reduce 3


state 9
program : '(' number ')' . (2)

. reduce 2


Rules never reduced:
number : DIGIT (6)


State 2 contains 2 reduce/reduce conflicts.


5 terminals, 4 nonterminals
7 grammar rules, 10 states



Explanations for all reductions suggested in conflicts


Demonstration(s) of reduction via rule (5) in state 2, when next token is <$end>
paren_fixer : DIGIT (5)


From state 0
$start program transitions to state 3, and then <$end> can follow.
'(' paren_fixer . (1)
DIGIT . (5)

Following the 2 states below state 2, with lookahead <$end> REDUCE via (5) is possible.
Minimal stack context: 0 1 2
Corresponding symbols: $start '(' DIGIT . $end
Sample sentence: $start '(' DIGIT . $end



Demonstration(s) of reduction via rule (6) in state 2, when next token is <$end>
number : DIGIT (6)


From state 0
$start program transitions to state 3, and then <$end> can follow.
number . (4)
DIGIT . (6)

Following the 1 states below state 2, with lookahead <$end> REDUCE via (6) is possible.
Minimal stack context: 0 2
Corresponding symbols: $start DIGIT . $end
Sample sentence: $start DIGIT . $end


Summary of conflict contexts leading to state 2 and lookahead symbol <$end>
Possible reductions rules include (5,6)
paren_fixer : DIGIT (5)
number : DIGIT (6)
4 conflict contexts were found.

--2+-0(6)
--1--0(5)

LALR-only-conflicts are present. Splittable states include: 2
3 conflict contexts (1 splittable state(s)) are LALR-only problems.

LALR-only conflict contexts leading to state 2 and lookahead symbol <$end>
Possible reductions rules include (5,6)
paren_fixer : DIGIT (5)
number : DIGIT (6)
3 conflict contexts were found.

Unambiguous context tree is:

--2+-0(6) $start DIGIT
--1(5) $start '(' DIGIT

The following rules might split the problematic states:

$accept : DIGIT error ; /* SPLIT state(s) 2 following state 0 for rule(s) (6) out of (5,6) */
Minimal stack context: 0 2
Corresponding symbols: $start DIGIT . $end
Sample sentence: $start DIGIT . $end

$accept : program DIGIT error ; /* SPLIT state(s) 2 following state 3 for rule(s) (5) out of (5,6) */
Minimal stack context: 0 1 2
Corresponding symbols: $start '(' DIGIT . $end
Sample sentence: $start '(' DIGIT . $end


Demonstration(s) of reduction via rule (5) in state 2, when next token is <')'>
paren_fixer : DIGIT (5)


From state 0
$start paren_fixer transitions to state 4, and then <')'> can follow.
DIGIT . (5)

Following the 1 states below state 2, with lookahead <')'> REDUCE via (5) is possible.
Minimal stack context: 0 2
Corresponding symbols: $start DIGIT . ')'
Sample sentence: $start DIGIT . ')'



Demonstration(s) of reduction via rule (6) in state 2, when next token is <')'>
number : DIGIT (6)


From state 1
$start '(' number transitions to state 7, and then <')'> can follow.
DIGIT . (6)

Following the 1 states below state 2, with lookahead <')'> REDUCE via (6) is possible.
Minimal stack context: 0 1 2
Corresponding symbols: $start '(' DIGIT . ')'
Sample sentence: $start '(' DIGIT . ')'


Summary of conflict contexts leading to state 2 and lookahead symbol <')'>
Possible reductions rules include (5,6)
paren_fixer : DIGIT (5)
number : DIGIT (6)
3 conflict contexts were found.

--2+-1(6)
--0(5)

LALR-only-conflicts are present. Splittable states include: 2
3 conflict contexts (1 splittable state(s)) are LALR-only problems.

LALR-only conflict contexts leading to state 2 and lookahead symbol <')'>
Possible reductions rules include (5,6)
paren_fixer : DIGIT (5)
number : DIGIT (6)
3 conflict contexts were found.

Unambiguous context tree is:

--2+-1(6) $start '(' DIGIT
--0(5) $start DIGIT

The following rules might split the problematic states:

program : '(' DIGIT error ; /* SPLIT state(s) 2 following state 1 for rule(s) (6) out of (5,6) */
Minimal stack context: 0 1 2
Corresponding symbols: $start '(' DIGIT . ')'
Sample sentence: $start '(' DIGIT . ')'

$accept : DIGIT error ; /* SPLIT state(s) 2 following state 0 for rule(s) (5) out of (5,6) */
Minimal stack context: 0 2
Corresponding symbols: $start DIGIT . ')'
Sample sentence: $start DIGIT . ')'
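
The two demonstrations above can be condensed into a small sketch of why these conflicts are LALR-only. The representation below is mine, but the rule numbers, the two contexts, and the conflict tokens are taken directly from the listing: in canonical LR(1) the two contexts reaching "DIGIT ." are separate states with disjoint lookaheads, and merging them (as LALR does, because they share the same core) unions the lookaheads and manufactures the reduce/reduce conflicts on both <$end> and <')'>.

```python
# LALR merges LR(1) states that share the same core, here the items
#   { paren_fixer : DIGIT . (5),  number : DIGIT . (6) }.
# Lookahead sets per rule, in each of the two canonical-LR(1) contexts:
lr1_contexts = {
    "$start DIGIT":     {5: {"')'"},  6: {"$end"}},  # via rules (3) and (4)
    "$start '(' DIGIT": {5: {"$end"}, 6: {"')'"}},   # via rules (1) and (2)
}

# Canonical LR(1): within each state the lookaheads of (5) and (6) are
# disjoint, so each state can pick its reduction without conflict.
for ctx, las in lr1_contexts.items():
    assert las[5].isdisjoint(las[6]), ctx

# LALR: merge the states (same core), unioning lookaheads per rule.
merged = {rule: set() for rule in (5, 6)}
for las in lr1_contexts.values():
    for rule, la in las.items():
        merged[rule] |= la

# After merging, rules (5) and (6) both apply on <$end> and on <')'>,
# exactly the two reduce/reduce conflicts reported for state 2.
conflict_tokens = merged[5] & merged[6]
print(sorted(conflict_tokens))  # prints ['$end', "')'"]
```

This is also why the suggested SPLIT rules help: they force the generator to keep the two "DIGIT ." contexts in distinct states, restoring the disjoint per-state lookaheads.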


---end of file autodoc5.txt---

