The APLR(1) Algorithm for Generating Compact LR(1) Parsers is Simpler and More Capable than IELR(1) | BranchTaken
The APLR(1) Algorithm for Generating Compact LR(1) Parsers is Simpler and More Capable than IELR(1)
The Hocc parser generator,<br>which is part of the Hemlock programming language project,<br>implements a novel LR(1)-family parser generation algorithm called Adequacy Preservation LR(1).<br>APLR(1) generates compact parsers that are devoid of LR(1)-relative inadequacies, even for<br>nondeterministic/ambiguous grammars. Thus APLR(1) parser automata are suitable for use with<br>Generalized LR (GLR) parsing techniques, but the<br>practical implication for deterministic parsing is that mysterious conflicts never arise during<br>grammar development. Furthermore APLR(1) is useful in combination with any LR(1)-family parser<br>generation algorithm that may introduce unnecessary state splits. As a case in point, Hocc’s<br>IELR⁺(1) algorithm is a generalized extension of the original<br>Bison IELR(1) algorithm that also supports<br>nondeterministic/ambiguous grammars, but this algorithm sometimes induces precautionary state splits<br>that are ultimately unnecessary, and APLR(1) augments IELR⁺(1) such that APLR(1) and IELR⁺(1) may be<br>used interchangeably.
Introduction
Knuth discovered the LR(k) algorithm [1] for parsing deterministic context-free<br>grammars over six decades ago and<br>showed that LR(1) is a reasonable restriction, but at the time even LR(1) was too computationally<br>intensive to be usable. The LALR(1) algorithm [2] is not as capable as canonical LR(1),<br>but despite superior alternatives since discovered, LALR(1) remains the de facto LR(1)-family<br>algorithm of choice. The relative obscurity of PGM LR(1) [3] appears to be due to a<br>combination of confusion surrounding related lane tracing research<br>[4][5][6], along with having been overshadowed by the rapid<br>dissemination of the LALR(1) algorithm via the Yacc parser<br>generator. The limited adoption of IELR(1) is due at least in part to its conceptual complexity,<br>incidentally also a lane tracing approach. In contrast, APLR(1) is a straightforward application of<br>subgraph isomorphism search, and understanding requires only basic graph theory and a limited<br>knowledge of LR(1) automaton structure.
What constitutes a practical LR(1)-family parser? Algorithmic complexity has execution and data<br>components; in order for a parser to be practical it must be both sufficiently fast and sufficiently<br>small. Furthermore, parser generation is distinct from parsing, and excessive generation-time<br>overhead can make an algorithm impractical even if parse-time performance is amazing.
At the time of LR(1) discovery, both parser generation and parsing were impractical, most critically<br>because canonical LR(1) automata can be extremely large even for modest grammars. All deterministic<br>LR(1)-family parsers use the same pushdown automaton<br>(PDA) algorithm, so research on efficient PDAs<br>generally applies; the parser generation algorithms distinguish themselves by generating smaller<br>automata, in some cases restricting what grammars can be expressed. Following is a modern<br>perspective on the tradeoffs inherent to the algorithms discussed in this report.
Algorithm<br>Year<br>LR(1)-relative power<br>Automaton size<br>Generation overhead
LR(1)<br>1965<br>Maximal<br>Moderate
LALR(1)<br>1969<br>Inadequate<br>Low
PGM LR(1)<br>1977<br>=¹<br>Compact²<br>Low
IELR(1)<br>2010<br>Compact<br>Low-moderate
IELR⁺(1)<br>2024<br>⊃³<br>Compact<br>Moderate-high
APLR(1)<br>2026<br>⊃³<br>Compact<br>Moderate-high
1 — Precedence/associativity not supported.<br>2 — Assuming weak compatibility test; strong compatibility guarantees minimality.<br>3 — Nondeterminism/ambiguity supported, technically also true of LR(1) automata.
Canonical LR(1) automata are maximal in that any further splitting of states would result in<br>reduntant identical states; LALR(1) automata are inadequate in that all possible state merging is<br>blindly performed; the compact automata are commonly minimal, but greedy algorithms make no<br>guarantee regarding minimality. Maximal automata remain unwieldy on modern computers due to<br>generated code size and low execution locality, though not intractably so. Inadequate automata are<br>to be avoided. The compact automata make no parse-time compromises, and generation-time compromises<br>are modest, quickly trending toward negligible as hardware continues to improve.
The LALR(1) algorithm [2] is so prevalent that many practitioners unquestioningly<br>accept its shortcomings. Although LALR(1) utilizes symbol lookahead during parsing, an LALR(1)<br>automaton is generated by merging states with equal LR(0) item sets, i.e. by disregarding lookahead.<br>Three types of mysterious conflicts can result. The well-known mysterious new conflicts caused<br>by LALR(1) state merging are always reduce-reduce conflicts , but it is also possible for state<br>merging to create mysterious invasive conflicts that are caused by merging shift-reduce<br>conflicts into states which would otherwise have performed a reduce action. Furthermore, it...