The APLR(1) algorithm for compact LR(1) parsers is simpler and more capable than

fanf22 pts0 comments

The APLR(1) Algorithm for Generating Compact LR(1) Parsers is Simpler and More Capable than IELR(1) | BranchTaken

The APLR(1) Algorithm for Generating Compact LR(1) Parsers is Simpler and More Capable than IELR(1)

The Hocc parser generator,<br>which is part of the Hemlock programming language project,<br>implements a novel LR(1)-family parser generation algorithm called Adequacy Preservation LR(1).<br>APLR(1) generates compact parsers that are devoid of LR(1)-relative inadequacies, even for<br>nondeterministic/ambiguous grammars. Thus APLR(1) parser automata are suitable for use with<br>Generalized LR (GLR) parsing techniques, but the<br>practical implication for deterministic parsing is that mysterious conflicts never arise during<br>grammar development. Furthermore APLR(1) is useful in combination with any LR(1)-family parser<br>generation algorithm that may introduce unnecessary state splits. As a case in point, Hocc’s<br>IELR⁺(1) algorithm is a generalized extension of the original<br>Bison IELR(1) algorithm that also supports<br>nondeterministic/ambiguous grammars, but this algorithm sometimes induces precautionary state splits<br>that are ultimately unnecessary, and APLR(1) augments IELR⁺(1) such that APLR(1) and IELR⁺(1) may be<br>used interchangeably.

Introduction

Knuth discovered the LR(k) algorithm [1] for parsing deterministic context-free<br>grammars over six decades ago and<br>showed that LR(1) is a reasonable restriction, but at the time even LR(1) was too computationally<br>intensive to be usable. The LALR(1) algorithm [2] is not as capable as canonical LR(1),<br>but despite superior alternatives since discovered, LALR(1) remains the de facto LR(1)-family<br>algorithm of choice. The relative obscurity of PGM LR(1) [3] appears to be due to a<br>combination of confusion surrounding related lane tracing research<br>[4][5][6], along with having been overshadowed by the rapid<br>dissemination of the LALR(1) algorithm via the Yacc parser<br>generator. The limited adoption of IELR(1) is due at least in part to its conceptual complexity,<br>incidentally also a lane tracing approach. In contrast, APLR(1) is a straightforward application of<br>subgraph isomorphism search, and understanding requires only basic graph theory and a limited<br>knowledge of LR(1) automaton structure.

What constitutes a practical LR(1)-family parser? Algorithmic complexity has execution and data<br>components; in order for a parser to be practical it must be both sufficiently fast and sufficiently<br>small. Furthermore, parser generation is distinct from parsing, and excessive generation-time<br>overhead can make an algorithm impractical even if parse-time performance is amazing.

At the time of LR(1) discovery, both parser generation and parsing were impractical, most critically<br>because canonical LR(1) automata can be extremely large even for modest grammars. All deterministic<br>LR(1)-family parsers use the same pushdown automaton<br>(PDA) algorithm, so research on efficient PDAs<br>generally applies; the parser generation algorithms distinguish themselves by generating smaller<br>automata, in some cases restricting what grammars can be expressed. Following is a modern<br>perspective on the tradeoffs inherent to the algorithms discussed in this report.

Algorithm<br>Year<br>LR(1)-relative power<br>Automaton size<br>Generation overhead

LR(1)<br>1965<br>Maximal<br>Moderate

LALR(1)<br>1969<br>Inadequate<br>Low

PGM LR(1)<br>1977<br>=¹<br>Compact²<br>Low

IELR(1)<br>2010<br>Compact<br>Low-moderate

IELR⁺(1)<br>2024<br>⊃³<br>Compact<br>Moderate-high

APLR(1)<br>2026<br>⊃³<br>Compact<br>Moderate-high

1 — Precedence/associativity not supported.<br>2 — Assuming weak compatibility test; strong compatibility guarantees minimality.<br>3 — Nondeterminism/ambiguity supported, technically also true of LR(1) automata.

Canonical LR(1) automata are maximal in that any further splitting of states would result in<br>reduntant identical states; LALR(1) automata are inadequate in that all possible state merging is<br>blindly performed; the compact automata are commonly minimal, but greedy algorithms make no<br>guarantee regarding minimality. Maximal automata remain unwieldy on modern computers due to<br>generated code size and low execution locality, though not intractably so. Inadequate automata are<br>to be avoided. The compact automata make no parse-time compromises, and generation-time compromises<br>are modest, quickly trending toward negligible as hardware continues to improve.

The LALR(1) algorithm [2] is so prevalent that many practitioners unquestioningly<br>accept its shortcomings. Although LALR(1) utilizes symbol lookahead during parsing, an LALR(1)<br>automaton is generated by merging states with equal LR(0) item sets, i.e. by disregarding lookahead.<br>Three types of mysterious conflicts can result. The well-known mysterious new conflicts caused<br>by LALR(1) state merging are always reduce-reduce conflicts , but it is also possible for state<br>merging to create mysterious invasive conflicts that are caused by merging shift-reduce<br>conflicts into states which would otherwise have performed a reduce action. Furthermore, it...

algorithm aplr compact parser automata ielr

Related Articles