|
NFA, DFA, PDA
NFA
NFA means Nondeterministic Finite Automata. An NFA may have more than one choice of action
(which state to go to next) for an input symbol. If it makes the wrong choice and finds
out after more symbols are read, then it has to backtrack and try another path. Consequently,
NFA's may be slow and are undesirable.
DFA
DFA means Deterministic Finite Automata. A DFA has only one choice of action
(which state to go to next) for each input symbol. Consequently, it never has to backtrack
and try another path. DFA's are desirable, because they are faster and simpler. DFA
processors do not use a stack or make reductions. Therefore, they cannot be used to parse
computer languages. They are used for lexical analysis only.
PDA
PDA means Push Down Automata. A PDA is a DFA with a stack. A PDA processor has a stack
which it uses like a memory. PDA's can be used for parsing computer languages. PDA's can
be used for lexical analysis also, however, the speed of DFA's are 5 times faster than PDA's, when
used for lexical analysis (based on my research).
Common Error
Some people talk about LALR DFA's. This is wrong terminology. An LALR parser uses a stack
and operates as a PDA processor, not a DFA.
LALR
LALR is a computer science acronym for LOOK AHEAD LEFT RIGHT. It is a
method of parsing computer languages. Parsing is recognizing patterns in input statements
that match rules in the grammar which was used to create the parser. When parsing a computer language,
an LALR parser:
- Uses a LOOK AHEAD symbol to aid the recognition process,
- Reads input statements from LEFT to right,
- Makes reductions on the RIGHT first, working backward toward the left side of the parse stack.
This is also called bottom-up parsing.
Myths
LALR cannot handle context-sensitive languages like C++.
This is not true. LRSTAR has a semantic grammar symbol that solves this problem (e.g. {typedef}).
At parsing time, when an incoming <identifier> could be ambiguous, a symbol-table lookup will change the
<identifier> to a {typedef} which is not ambiguous. Thus, LRSTAR parsers can handle difficult
computer languages like C++.
LALR does not work when you need to have actions embedded in the rules of the grammar.
This is not true. Because of the integrated symbol-table manager and automatic AST constructor, you rarely
need to have actions embedded in the rules. You can wait until the rule is recognized and then execute your
actions. Actually, for actions that produce output, it's better to build the AST while parsing and then after
parsing, execute the actions during a traversal of the AST.
LALR, bottom-up parsing, is too difficult and complicated. LL or top-down parsing is easier.
This is not true. LALR grammars are actually more natural to write than LL grammars when you don't try to embed
actions in the rules. An LL parsing algorithm is actually more complicated because it always requires more than
one lookahead symbol for real-world languages. LALR usually does NOT need more than one lookahead symbol to make
parsing decisions, so the LALR parsing algorithms is simpler.
LALR vs LL
LALR(1) is a larger language class than LL(1).
Therefore, LALR(1) is a more powerful notation than LL(1) for defining a lexer or parser,
where (1) indicates the number of lookahead symbols used by the parser to make parsing decisions.
LL(k) is often used to overcome the deficiency of LL(1), however, the resultant LL(k) parsers
are slower, more complicated and LL(k) grammars are still less natural for humans to specify than
LALR.
LALR has its complexities as well, but that is a huge topic for another web page. Quite
often, personal preference is the deciding factor. Many people prefer to write LALR grammars.
Because of the more difficult lookahead requirement in LL parsers, they are slower than LALR parsers.
A rough estimate NOT based on testing results would suggest that LALR parsers are at least 2 times the speed
of LL parsers and at most 10 times the speed of LL parsers.
SLR, LALR, LR1
LR(0) State Machine
Both SLR and LALR parser generators construct an LR(0) state machine and have the same states.
However, they differ in the computation of the lookahead set for each state. The SLR method
may report erroneous conflicts, which LALR never does. So LALR is the preferred method.
SLR Lookahead Sets
SLR is a term that means Simple LR. An SLR parser generator uses a simple method to compute
the set of lookahead symbols for each state. It uses the BNF grammar to compute the lookahead
sets. It looks at every occurrence of a nonterminal symbol in the grammar to see what terminal
symbols may follow it. This is called the Follow Set. The context in the grammar determines
the lookahead sets. This is actually too simple and consequently an SLR parser generator may
report conflicts that really do not exist in the LR(0) state machine.
LALR Lookahead Sets
LALR is a more powerful concept than SLR. An LALR parser generator uses a more complicated
method to compute the set of lookahead symbols for each state. It looks at the LR(0) state
machine rather than the grammar to compute the lookahead sets. LALR gives 100% accurate lookahead
sets for the LR(0) state machine. If any conflicts are reported by LALR, they are true conflicts.
If the conflicts are based on LALR(1), 1 symbol of lookahead, then more than 1 symbol of lookahead may
be required to resolve the conflicts or perhaps the grammar is truly ambiguous for any number of lookahead
symbols.
Canonical LR(1) Construction
LR usually means Canonical LR(1) and it's a more powerful concept than LALR. A Canonical
LR(1) parser generator computes the LR(1) state machine, instead of LR(0). This is fine
for small toy grammars, but for real-world computer languages, way too many states are created.
For the COBOL-85 language definition, over 2,000,000 states were created by an canonical LR(1)
parser generator, whereas only 1,720 states were created by an LALR parser generator.
Canonical LR offers only one advantage over LALR. The error handling part of the LR parser
algorithm is simpler, because no useless reductions will be made when an error is encountered.
Thus, error-recovery algorithms are simpler.
Minimal LR(1) Construction
Minimal LR is another type of parser generation algorithm which is the best of all. Minimal
LR gives the recognition power of LR(1) and the small table size of LALR. There are two different
ways to do the Minimal LR parser-table construction. One way is to do build the LR(0) state machine
first and then do state splitting (duplicating states to resolve conflicts). The other way is to
attempt to build the LR(1) state machine but merge states with "compatible" lookahead sets while building
the LR(1) state machine. This will keep the number of states from "exploding".
LR(1) vs. LALR(1)
In the real world, I have not yet seen a computer language that is LR(1) and not LALR(1). In my
experience, computer languages that are not LALR(1) are not LR(1) either. They are LR(2) or LR(3).
These can be handled by LALR(1) parsers with the addition of hand-written code to look ahead and make a
parsing decision. An LALR(2) parser generator would be more desirable than LR(1).
|