Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
Date: 1/31/2018 @ 2 p.m. ET
A Parser Generator that Speaks YOUR Language
In a previous installment, you visited the inner workings of the Yet Another Compiler-Compiler (YACC) parsing system. This month, you go for the gold: the Grammar Oriented Language Developer (GOLD) Parser. Like most parsing systems, GOLD uses the LALR(1) state machine (algorithm) to analyze syntax and a Deterministic Finite Automaton (DFA) to identify different lexical units (tokenizer). Practically all common parser generators, such as YACC/Bison, use these algorithms. However, GOLD takes a different approach than common compiler-compilers. GOLD is freeware and uses the "zlib" style license so you can integrate it with your own apps without worry.
In GOLD, the LALR and DFA algorithms are merely simple automata, using lookup tables to determine both state transition and parsing actions. As a result, most of the computational work is performed before any input text is parsed. It is the computation of these tables that is both time-consuming and complex. Once created, these parsing tables are essentially independent of any programming language binding. As if to prove this point, the GOLD developers have implemented a dozen language bindings as of this writing (as either linkable .LIB import libraries or plug-ins), including:
- ActiveX DLL
- Assembly - Intel x86
- Visual Basic .NET
How Does It Work?
The GOLD Parsing System is designed to aid in the development of compilers, interpreters, and translators while supporting multiple programming languages' backends.
Because the LALR and DFA algorithms can be implemented with simple state transition graphs, the parsing can be easily hosted by any programming language because the logic simply looks up a value in a table and then acts accordingly. The creation of these lookup tables is where all the hard work takes place. It's not surprising, then, that GOLD is composed of three pieces: Builder, parser Engine, and an output file that contains table information. In a nutshell, the Builder constructs the parse tables and the Engine executes the tables.
The Builder is a Win32 app that reads a source grammar written in the GOLD Meta-Language, produces the parse tables, and then writes them to the Compiled Grammar Table file. Builder does a couple unrelated things, such as creating enum lists and allowing you to interactively test a grammar. The Builder could theoretically be ported to other OSes, but is only available for Win32 at this time.
The Engine component performs the actual parsing and can be developed in any programming language necessary. Because all the complex work was already performed by the Builder, the Engine simply needs to read the Compiled Grammar Table (CGT) file and implement the LALR and DFA automata. Although Engine implementations vary in design, they all behave identically because they completely are table-driven with respect to the input.
Mechanically speaking, this means each time you modify the grammar file, you'll run the Builder to regenerate the CGT file. Once the grammar is completely refined, the bulk of your work shifts to the implementation of semantics and your focus changes to the Engine.
The Calculator Revisited
For comparison's sake, you will now reverse-engineer the simple calculator you developed in GNU Bison. Fortunately, the GOLD Builder includes an import tool for legacy YACC/Bison grammars to get you started. For those of you starting from scratch, the New Grammar Wizard will give you a leg up on some standard types of grammars by quizzing you on some basic language characteristics: Do newlines terminate statements or not? Does it have typical programming-language type identifiers? Does it contain string constants? And so on. The output from the File | Import A YACC/Bison Grammar menu selection is shown in Figure 1. I modified it slightly to follow the suggested practice of defining the terminal symbols using the "=" operator. Although you literally could write in '+' for the plus operator, I feel that spelling out the terminals ("PLUS") both provides abstraction while supporting maintenance and readability. The GOLD Meta-Language syntax is as close as anyone will ever get to notating in pure Backus-Naur Form (BNF), so if you have any prior experience in formal grammars it will be a cakewalk.
"Start Symbol" = <input> NEWLINE = '\n' PLUS = '+' MINUS = '-' SLASH = '/' ASTERISK = '*' LPAREN = '(' RPAREN = ')' ! ======================================= Rules <input> ::= | <input> <line> <line> ::= NEWLINE | <expr> NEWLINE <expr> ::= <expr> PLUS <term> | <expr> MINUS <term> | <term> <term> ::= <term> ASTERISK <factor> | <term> SLASH <factor> | <factor> <factor> ::= LPAREN <expr> RPAREN | NUMBER
Figure 1: Calc.grm (grammar file)
Schooling Your Grammar
Now, it's time to test the grammar! Rather than write a lot of code to do this, you simply can choose Window | Test Grammar on the Builder menu. You'll test the same expression you used in the previous article on YACC: 3 * ( 4 + 5 ). The results are shown in Figure 2:
Figure 2: Parse Actions Display
You also can have a look at the Parse Tree view that shows a nice treeview control so you can drill down to exactly the right rules in your grammar (see Figure 3).
Figure 3: Parse Tree Display