Donate | Contact | AMDG

PARSER GENERATOR -->
Home Information Customers Support About
Theory Terminology

NFA, DFA, PDA

NFA

NFA means Nondeterministic Finite Automata.  An NFA may have more than one choice of action (which state to go to next) for an input symbol.  If it makes the wrong choice and finds out after more symbols are read, then it has to backtrack and try another path.  Consequently, NFA's may be slow and are undesirable.

DFA

DFA means Deterministic Finite Automata.  A DFA has only one choice of action (which state to go to next) for each input symbol.  Consequently, it never has to backtrack and try another path.  DFA's are desirable, because they are faster and simpler.  DFA processors do not use a stack or make reductions.  Therefore, they cannot be used to parse computer languages.  They are used for lexical analysis only.

PDA

PDA means Push Down Automata.  A PDA is a DFA with a stack.  A PDA processor has a stack which it uses like a memory.  PDA's can be used for parsing computer languages.  PDA's can be used for lexical analysis also, however, the speed of DFA's are 5 times faster than PDA's, when used for lexical analysis (based on my research).

Common Error

Some people talk about LALR DFA's.  This is wrong terminology.  An LALR parser uses a stack and operates as a PDA processor, not a DFA.



LALR

LALR is a computer science acronym for LOOK AHEAD LEFT RIGHT.  It is a method of parsing computer languages.  Parsing is recognizing patterns in input statements that match rules in the grammar which was used to create the parser.  When parsing a computer language, an LALR parser:

  • Uses a LOOK AHEAD symbol to aid the recognition process,
  • Reads input statements from LEFT to right,
  • Makes reductions on the RIGHT first, working backward toward the left side of the parse stack. This is also called bottom-up parsing.



Myths

LALR cannot handle context-sensitive languages like C++.

This is not true.  LRSTAR has a semantic grammar symbol that solves this problem (e.g. {typedef}).  At parsing time, when an incoming <identifier> could be ambiguous, a symbol-table lookup will change the <identifier> to a {typedef} which is not ambiguous.  Thus, LRSTAR parsers can handle difficult computer languages like C++.

LALR does not work when you need to have actions embedded in the rules of the grammar.

This is not true.  Because of the integrated symbol-table manager and automatic AST constructor, you rarely need to have actions embedded in the rules.  You can wait until the rule is recognized and then execute your actions.  Actually, for actions that produce output, it's better to build the AST while parsing and then after parsing, execute the actions during a traversal of the AST. 

LALR, bottom-up parsing, is too difficult and complicated.  LL or top-down parsing is easier.

This is not true.  LALR grammars are actually more natural to write than LL grammars when you don't try to embed actions in the rules.  An LL parsing algorithm is actually more complicated because it always requires more than one lookahead symbol for real-world languages.  LALR usually does NOT need more than one lookahead symbol to make parsing decisions, so the LALR parsing algorithms is simpler.



LALR vs LL

LALR(1) is a larger language class than LL(1). 

Therefore, LALR(1) is a more powerful notation than LL(1) for defining a lexer or parser, where (1) indicates the number of lookahead symbols used by the parser to make parsing decisions.

LL(k) is often used to overcome the deficiency of LL(1), however, the resultant LL(k) parsers are slower, more complicated and LL(k) grammars are still less natural for humans to specify than LALR.

LALR has its complexities as well, but that is a huge topic for another web page.  Quite often, personal preference is the deciding factor.  Many people prefer to write LALR grammars.

Because of the more difficult lookahead requirement in LL parsers, they are slower than LALR parsers.  A rough estimate NOT based on testing results would suggest that LALR parsers are at least 2 times the speed of LL parsers and at most 10 times the speed of LL parsers.



SLR, LALR, LR1

LR(0) State Machine

Both SLR and LALR parser generators construct an LR(0) state machine and have the same states.  However, they differ in the computation of the lookahead set for each state.  The SLR method may report erroneous conflicts, which LALR never does.  So LALR is the preferred method.

SLR Lookahead Sets

SLR is a term that means Simple LR.  An SLR parser generator uses a simple method to compute the set of lookahead symbols for each state.  It uses the BNF grammar to compute the lookahead sets.  It looks at every occurrence of a nonterminal symbol in the grammar to see what terminal symbols may follow it.  This is called the Follow Set.  The context in the grammar determines the lookahead sets.  This is actually too simple and consequently an SLR parser generator may report conflicts that really do not exist in the LR(0) state machine. 

LALR Lookahead Sets

LALR is a more powerful concept than SLR.  An LALR parser generator uses a more complicated method to compute the set of lookahead symbols for each state.  It looks at the LR(0) state machine rather than the grammar to compute the lookahead sets.  LALR gives 100% accurate lookahead sets for the LR(0) state machine.  If any conflicts are reported by LALR, they are true conflicts.  If the conflicts are based on LALR(1), 1 symbol of lookahead, then more than 1 symbol of lookahead may be required to resolve the conflicts or perhaps the grammar is truly ambiguous for any number of lookahead symbols.

Canonical LR(1) Construction

LR usually means Canonical LR(1) and it's a more powerful concept than LALR.  A Canonical LR(1) parser generator computes the LR(1) state machine, instead of LR(0).  This is fine for small toy grammars, but for real-world computer languages, way too many states are created. 

For the COBOL-85 language definition, over 2,000,000 states were created by an canonical LR(1) parser generator, whereas only 1,720 states were created by an LALR parser generator. 

Canonical LR offers only one advantage over LALR.  The error handling part of the LR parser algorithm is simpler, because no useless reductions will be made when an error is encountered.  Thus, error-recovery algorithms are simpler.

Minimal LR(1) Construction

Minimal LR is another type of parser generation algorithm which is the best of all.  Minimal LR gives the recognition power of LR(1) and the small table size of LALR.  There are two different ways to do the Minimal LR parser-table construction.  One way is to do build the LR(0) state machine first and then do state splitting (duplicating states to resolve conflicts).  The other way is to attempt to build the LR(1) state machine but merge states with "compatible" lookahead sets while building the LR(1) state machine. This will keep the number of states from "exploding".

LR(1) vs. LALR(1)

In the real world, I have not yet seen a computer language that is LR(1) and not LALR(1).  In my experience, computer languages that are not LALR(1) are not LR(1) either.  They are LR(2) or LR(3).  These can be handled by LALR(1) parsers with the addition of hand-written code to look ahead and make a parsing decision.  An LALR(2) parser generator would be more desirable than LR(1).

© Copyright 2010 Paul B Mann