DEXTER

Dynamically EXTEnsible Recognizer


Questions and issues to Walter Cazzola.

DEXTER is the Dynamically EXTEnsible Recognizer: an LR parser generator that can be extended and restricted dynamically and on-the-fly. The DEXTER parser generator was developed as the parser generator for version 2 of the Neverlang framework, but the library works stand-alone as well. The following is a quick guide to the DEXTER (and LEXTER) library. Further information on the available methods can be found in the JavaDoc. For more information on the DEXTER parser please refer to the related publications.

Using DEXTER

DEXTER is a Java library, thus using it consists in adding the jar to the classpath. DEXTER is an LR parser generator, therefore it generates an LR parser for a given grammar. The grammar is defined programmatically through an API. For a more traditional approach, that is, feeding the parser generator with an input grammar from a text file, it is possible to use the Neverlang framework.

Terminals, Nonterminals and Productions

In DEXTER, terminals and nonterminals are both instances of the interface Symbol. A nonterminal is an instance of the interface NonTerminalSym and terminals are instances of the interface TerminalSym. There are two default implementations for terminals and nonterminals, called SimpleNonTerminalSym and SimpleTerminalSym. Users are free to provide their own implementation. For instance, a convenient way to quickly test grammars is to implement nonterminals and terminals as Java enums.

public class MyGrammar {
   enum NT implements NonTerminalSym {
      S, A, B, C;
   }

   enum T implements TerminalSym {
      a,b,c;
   }
}

TIP. It is advisable that each terminal and nonterminal is a singleton, thus enums are perfectly suited.

A production is an instance of the class Production. For instance, let us define the grammar for the language { b, bb, bbb, ... }. The grammar may be written as A → b A, A → b, with A as the starting symbol.

import static MyGrammar.NT.*;
import static MyGrammar.T.*;
             ...
Production p1 = new Production(A, b, A);
Production p2 = new Production(A, b);

As you can see the first nonterminal is always the left-hand-side nonterminal of the production. There is no ambiguity, because in context-free grammars only one nonterminal is on the left-hand side. The parser can be now generated using:

Dexter dx = new Dexter();
dx.setAxiom(A);
dx.add(p1);
dx.add(p2);

or the shorthand form:

Dexter dx = new Dexter();
dx.setAxiom(A);
dx.add(A, b, A);
dx.add(A, b);

In this case a production is created automatically for each add() invocation.

The Grammar class comes with the predefined singleton object Grammar.EPSILON, which represent the empty word (convenionally represented as the greek symbol "ε"), and Grammar.EOF which represent the meta-terminalsymbol end-of-file (conventionally represented as "$"). The EOF is not supposed to be used directly in productions, while EPSILON is allowed.

Removing Productions

A production p can be easily removed using the method Dexter.remove(Production). The parser is updated on the fly, and the change is immediately applied.

Parsing

Parsing can be done using the method Dexter.parse(ITokenStream). A successful parse() invocation generates a parse tree. The parse tree can be retreived using the method Dexter.getLastParseTree(). An unrecognized input causes the parser to throw the exception UnexpectedSymbolException, which contains the unexpected TerminalSym, a list of TerminalSym that were expected and the ParserState (the internal representation of a state in the LR goto-graph of the given grammar) where the error occurred.

The ITokenStream interface represent a stream of tokens; that is, a stream of TerminalSym instances. There is a predefined implementation called SimpleTokenStream that can be used for testing. For instance, the string bbbb could be represented as the stream input:

import static MyGrammar.T.b;
                   ...
ITokenStream input = new SimpleTokenStream(b, b, b, b);

The SimpleTokenStream can be used directly for testing, but a more convenient way to provide a token stream is to use a lexer that maps strings into tokens (i.e., TerminalSym instances). A default implementation is available throught LEXTER.

LEXTER

DEXTER comes with a predefined lexer called LEXTER (the LEXer for dexTER). The LEXTER lexer comes with two predefined terminal definitions, KeywordTerminalSym and RegexTerminalSym. Keywords are constant strings such as for, if, while, etc. Defining one such symbol is as simple as writing

TerminalSym whileKeyword = new KeywordTerminalSym("while");

Regexes are a more powerful way to represent character classes such as numbers, identifiers, etc.; in this case, instead of the keyword, the object constructor is given a regular expression. For instance, the regex /[a-zA-Z]+/ matches a word made of alphabetic characters:

TerminalSym identifier = new RegexTerminalSym("[a-zA-Z]+");

Multiple constructors are available. The Lexter class can be instantiated manually and filled with the terminals and nonterminals using the Lexter.add() method; LexterStream is the subclass of Lexter that implements the ITokenStream interface, and is therefore the class to use with Dexter. However, the preferred way to use Lexter is by instantiating the companion class Dexter2.

Using DEXTER with LEXTER

The companion class Dexter2 comes with shorthands to define keywords an character classes. A production of the form WhileStmt → '(' Expr ')' Body can be easily written as follows:

public class MyGrammar2 { 
   enum NT { WhileStmt, Expr, Body, Identifier; }
}

import static NT.*;
   
Dexter2 dx = new Dexter2();
           ...
dx.add(WhileStmt, "(", Expr, ")", Body);

Regex nonterminals are easily written using the static method Conversions.asRegex():

import static dexter.lexter.utils.Conversions.asRegex;
          
dx.add(Identifier, asRegex("[a-zA-Z]+"));

The more general API using explicit Production instantiation is obviously always available.

Parsing with LEXTER

Since LEXTER is able to tokenize arbitrary input strings onto streams of TerminalSym instances (the LexterStream class implements ITokenStream), Dexter2 is augmented with the method Dexter2.parse(String). Thus, once the Dexter2 instance have been initialized it is possible to parse an input String with:

dx.parse("some input string");

In case of error, the UnexpectedSymbolException contains an augmented TerminalSym implementation called QualifiedTerminalSym; this class contains a representation of the unexpected terminal symbol, along with metadata such as the line and column at which the unexpected token was found.

Downloads

LEXTER and DEXTER can be downloaded from the following links:

In order to use them, please remember to add them to your CLASSPATH.

DEXTER Staff

The DEXTER/LEXTER projects are led by Walter Cazzola.

References
  1. Walter Cazzola and Edoardo Vacchi, On the Incremental Growth and Shrinkage of LR Goto-Graphs, ACTA Informatica, 51(7):419-447. June 2014. Springer. [DOI1]
  2. Walter Cazzola and Edoardo Vacchi, DEXTER and Neverlang: A Union Towards Dynamicity, in Proceedings of the 7th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS'12), Eric Jul, Ian Rogers, and Olivier Zendra, Eds., Beijing, China, June 2012, ACM. [PDF2].

Walter Cazzola

Didactics

Publications

Funded Projects

Research Projects

Related Events