Implementing the XQuery grammar

This is a presentation that was delivered at Extreme Markup 2002. The presentation is an introductory tutorial on the use of the JavaCC and JJTree parser-generating tools to build a parse tree representing an XPath or XQuery query. Quoting from the program description for this session:

Building a parse tree representation of an XQuery or XPath query is an important first step in responding to that query. This presentation shows how to use the JavaCC parser generator tool, starting from the XQuery BNF, to build a parse tree for a particular XQuery instance, and how to tailor that tree to make it easily walkable at query-resolution time to connect in to your own backend data structures. JavaCC is used by both the W3C's Query Working Group to test new versions of the XQuery grammar, and by the author to generate the query-parsing code in Fatdog's open-source query engine product.

Please note that the actual content of the presentation is not quite as ambitious as the title might imply. Under a strict "truth in advertising" policy, we might more realistically title this talk as "First baby steps on the way to implementing the XQuery grammar". It's a good start, though, if you need to implement a grammar of your own and you're unfamiliar either with the basic concepts or these specific tools.

Introduction

JavaCC is an example of a parser generator tool. Using its own syntactic conventions (it has a grammar of its own, expressed in JavaCC), you describe your language or grammar using a set of rules (or productions) that are very similar in format to standard BNF notation. When you run the JavaCC tool on this special file (suffixed by .jj by convention), it produces Java source code, which you then wire into your application. Your application calls this code at query-analysis time, passing it the text of your query. If the query is valid according to the rules of the language you've specified, the Java actions you've embedded in the grammar get run. If your query is invalid, the parser generates an exception.

The following examples provides a simple, step-by-step tutorial on how this process works. The prime focus of this tutorial however is on using JJTree, not JavaCC. JJTree is the preprocessor tool we run in advance of JavaCC. It acts on what's called a .jjt file and produces in turn the .jj file required by JavaCC.

Writing a .jjt file for JJTree consumption lets us produce a parser which builds a parse-tree representation of (in this case) a query. We can then walk that tree at our leisure at query-analysis time to evaluate our query.

example00

A minimalistic example. Our language has a single rule or production: It states that a valid query is a single integer times a single integer. Period. Anything else will produce either a lexical error or a JavaCC ParserException at runtime. Here's a complete .jj file:


    options
    {
      STATIC=true;
    }

    PARSER_BEGIN( ExampleParser )

    package example00;
    import java.io.StringReader;

    public class ExampleParser
    {
        public static void main(String args[]) throws ParseException
        //------------------------------------
        {
            String query = "1 * 2";

            ExampleParser parser = new ExampleParser (new StringReader(query));
            parser.query();
        }
    }
    PARSER_END( ExampleParser )

    void query() : {} { <INT> "*" <INT> {System.out.println( "success!" );} }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < INT : ( ["0" - "9"] )+  > }

Of particular interest here:

This is a .jj file, not a .jjt file. In other words, we're not building a parse tree in this example, and we're only running JavaCC, not JJTree as well.

The singular action we embed in the grammar is the statement { System.out.println( "success" ); }, which only gets executed if the query() production gets resolved successfully (ie, we have a valid query). The fact that this statement is embedded in curly braces indicates that it's a user action and that we're escaping to Java code in the middle of a grammatical production.
This is the primary way you embed programming constructs in a JavaCC .jj file. This user action ends up embedded in a Java method titled "query()" that gets emitted by JavaCC. query() in turns belongs to the generated Java class, "ExampleParser".
The single file (example00.jj) contains both the template for our calling application and our grammatical rules.

Running JavaCC on this file produces a number of Java classes, the primary one being called ExampleParser (we specified the name in the PARSER_BEGIN directive.

The STATIC=true option at the top of the file says that ExampleParser will be a static class; for pedagogic purposes we can embed main() right in the .jj file and invoke it from the command line in standard Java fashion. In turn it invokes the parser code by calling its query() method. query() is the name of the topmost (and only) production in this parser.

example0

In this and the following examples, we separate the calling application and parser code, a much more realistic scenario, and in all cases first produce a .jjt file, against which we run JJTree first. The file "RunExamples0_6.java" below acts like a typical calling application and invokes the next set of seven examples.

By packaging each example in its own Java package, we can vary which example parser we're invoking by changing the import exampleNN statement at the top of the file -- vary "NN" as appropriate. In the example here, it's "0".


    import example0.*;
    import java.io.StringReader;

    public class RunExamples0_6
    {
        public static void main( String[] args )
        //--------------------------------------
        {           
            String query = "1 * 2";

            new RunExamples0_6().runQuery( query );
                
            mouseClickWindow( "Click to quit", 10, 10, 160, 100 );        
        }
        
        void runQuery( String queryStr )
        //------------------------------
        {
            ExampleParser parser = new ExampleParser( new StringReader(queryStr));

            try 
            {
                System.out.println( "\nquery = " + queryStr + "\n" );

                SimpleNode root = parser.query();
                root.dump( "" );
            }
            catch( ParseException e ) 
            { 
                System.out.print( e.getMessage() ); 
                System.out.println();
            }   
		// I'm not sure what distinguishes a ParseException 
		// from a TokenMgrError -- they appear quite similar

            catch( TokenMgrError tke )
            {
                System.out.print( tke.getMessage() );
                System.out.println();
            }  
        }

	  // a utility method -- we throw up a mouseClickWindow to keep our 
	  // console around after producing results. a click and it goes away.

        public static void mouseClickWindow( String title, int x, int y, int w, int h )
        //-----------------------------------------------------------------------------
        {
            java.awt.Frame frame = new java.awt.Frame();
            frame.setBounds( x, y, w, h );
            frame.addMouseListener (
                new java.awt.event.MouseAdapter() {
                    public void mousePressed( java.awt.event.MouseEvent event ) {
                        System.out.println( "Mouse Pressed!" );
                        System.exit( 0 );
                    }
                }
            );
            frame.setTitle( title );
            frame.show(); 
        }     
    }

And here's the parser code for example0. Of particular note:

We're returning a single SimpleNode object as the root (and only member) of this parse tree. SimpleNode is a built-in JavaCC class. The tree doesn't carry enough information at this point to be useful.

We've turned off the STATIC option. As described above, we're only generating the parser code itself, not the application that calls it as well.

If this file had other productions (and correspondingly our tree had other nodes), they'd all be of type SimpleNode. We'll see how to use the MULTI option in the next example to change that.

We've moved the multiply operator ("*") into the TOKEN section of the .jjt file.

The line of code in our calling application that starts the parsing process and retrieves the resulting parse tree is:

SimpleNode root = parser.query();


    options {
      STATIC=false;
      NODE_USES_PARSER=true;
    }
    /* MULTI=false by default == we only produce SimpleNode nodes */

    PARSER_BEGIN( ExampleParser )
    package example0;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query() : {}    { <INT> <MULT> <INT> { return jjtThis; } }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : ( ["0" - "9"] )+  > }

example1

In this example, our language is almost the same except that we've made the query itself optional and we've broken our earlier, single production into three.

We've also turned on the MULTI option, which means that the nodes that get returned by the various productions take the names of the productions themselves, rather than returning generic SimpleNode nodes. (Though they're not totally generic -- they contain an internal id field that lets you know and test which production produced them.)

What's most important here is that :

Each production returns a node corresponding to the successful evaluation of that production

We've turned on the MULTI option, so that the nodes are named the same as the production names

You get a node for each production whether you want them all or not. The examples that follow this one show both how to vary the names of the returned nodes and optionally omit them altogether, which as we'll see is frequently useful.

NOTA BENE: in the examples that follow this, I've omitted the options block from the top of the file, since it's the same in each example.


    options {
      MULTI=true;
      NODE_DEFAULT_VOID=false;
      STATIC=false;
      NODE_PREFIX="";
      NODE_USES_PARSER=true;
    }

    PARSER_BEGIN( ExampleParser )
    package example1;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode  query()       : {}  { ( multExpr() )? { return jjtThis; } }
    void        multExpr()    : {}  { integerExpr() <MULT> integerExpr() }
    void        integerExpr() : {}  { <INT> }

    SKIP  : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : ( ["0" - "9"] )+  > }

In this example, a parse of the query "1 * 2" produces a tree whose nodes are of the following types (or classes). Each of these classes is a descendant of SimpleNode. That's just how jjtree works. We make use of this feature later on.

              query
                |
             multExpr
             /     \
     integerExpr   integerExpr

example2

The grammar here is pretty much the same as previous, although we've abstracted the rules for how an integer is defined into a separate DIGITS token. (The pound sign ("#") means that this token is local and doesn't get exported out of the scanner).

The main point of this example is that we start using jjtree's #nodeName convention to decorate our parse tree with names of our own choosing, not necessarily corresponding to the production that produces them.


    PARSER_BEGIN( ExampleParser )
    package example2;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root  : {}    { ( multExpr() )? <EOF> { return jjtThis; } }
    void multExpr()     #Mult  : {}    { integerExpr() <MULT> integerExpr() }
    void integerExpr()  #Int   : {}    { <INT> }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : <DIGITS>
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

The result of this code is that a parse of the query "1 * 2" now returns a tree that looks like this:

               Root
                |
               Mult
             /     \
           Int     Int

Other than that, this example is pretty much like the prior one. Onward ...

example3

Here's the .jjt file.


    PARSER_BEGIN( ExampleParser )
    package example3;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root     : {}   { ( multExpr() )? <EOF> { return jjtThis; } } 
    void multExpr()     #Mult(>1) : {}   { integerExpr() ( <MULT> integerExpr() )? }
    void integerExpr()  #Int      : {}   { <INT> }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : <DIGITS> > }
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

We've made an important change in the rules. Before, our multExpr() production stated:

void multExpr() #Mult : {} { integerExpr() <MULT> integerExpr() }

It now states:

void multExpr() #Mult(>1) : {} { integerExpr() ( <MULT> integerExpr() )? }

The important differences are several:

We've separated what was previously a single, atomic production into one that has two terms. The second term in the new production, ( <MULT> integerExpr() )? is optional.

This means that a multExpr production can be either a single integer or a multiplication expression as before.

If it's a single integer (a new possibility allowed by this grammar), we return a single Int subnode under our Root node (indicated by the #Int in the final production in the grammar).

NOTE: Don't use #Integer as the name of the nodetype to return here (as the author did the first time he wrote this grammar). Your compiler will confuse the name of the generated class with Java's built-in Integer class and may exhibit schizoid behaviour!

If it's a multiplication, we return a Mult subnode.

What's important here is that we use a jjtree conditional to distinguish between the two cases. The factor #Mult(>1) says to only return a Mult node if the production following contains two or more subnodes. That'll only happen if integerExpr() is called twice, indicating that a multiplication has taken place. Otherwise multExpr only returns a single Int node from its integerExpr()sub-production, and not a Mult node of its own.

example4

As mentioned earlier, the trees we've been producing haven't been useful -- so far. The reason is that we need to be able to completely recapitulate the state of the query in the parse tree. We can't do that yet because the value of the integers in our query weren't saved in the tree when we built it. This example shows how to do that.

We also mentioned earlier that every node produced by a jjtree/javacc parser is a descendant of SimpleNode and that we could exploit that fact. Now's our chance. Here's the .jjt file:


    /* show how to set the value of a token and add accessors for it to SimpleNode */

    PARSER_BEGIN( ExampleParser )
    package example4;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root     : {}   { ( multExpr() )? <EOF> { return jjtThis; } } 
    void multExpr()     #Mult(>1) : {}   { integerExpr() ( <MULT> integerExpr() )? } 
    void integerExpr()  #Int      :      { Token t; } { t=<INT> 
                                             { jjtThis.setToken(t.kind,t.image); } 
                                         }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : <DIGITS> > }
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

Note in particular the production:

void integerExpr() #Int : { Token t; } { t=<INT> {jjtThis.setToken(t.kind,t.image); }}

We're declaring a variable of type Token inside the previouly unused curly braces ("{}"), and then in our production grabbing the value of the INT integer from our query, assigning it to t, and then calling the method jjtThis.setToken() with the two components of t.

What's going on here is:

Token is a JavaCC class which contains the value of whatever token we've just parsed -- in this case an integer.

We're grabbing that value of that token and calling a setToken() method with its kind (an internal field that holds an enum constant corresponding to INT) and more importantly to us, its image, or string value, as parameters.

setToken()

SimpleNode

SimpleNode()

    public class SimpleNode implements Node {
      
        ...
      
      public    int     m_type;
      public    String  m_text;

      public void setToken( int type, String text )
      //------------------------------------------- 
      {
            m_type = type; 
            m_text = text;
      }

	public String getText()  { return m_text; }
  	//---------------------

        ...
    }

m_text

example5

Here's the grammar file for example 5:


    PARSER_BEGIN( ExampleParser )  
    package example5;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root     : {}  { ( multExpr() )? <EOF> { return jjtThis; } } 
    void multExpr()     #Mult(>1) : {}  { { integerExpr() ( <MULT> multExpr() )? }
    void integerExpr()  #Int      :     { Token t; } { t=<INT> 
                                             { jjtThis.setToken(t.kind,t.image); } 
                                        }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < INT  : <DIGITS> > }
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

Again, we've just changed one production. We've gone from this in example4

void multExpr() #Mult(>1) : {} { integerExpr() ( <MULT> integerExpr() )? }

to this

void multExpr() #Mult(>1) : {} { { integerExpr() ( <MULT> multExpr() )? }

The multExpr() production now (optionally) calls itself recursively. This slight extension of the grammar now allows our language to have extensible queries such as "1 * 2 * 4", to any number of multiplicative factors.

Technically, a recursive production like this produces a right associative tree. To wit, the above query produces a parse tree that looks like the following:

This doesn't matter when we're talking multiplication and addition, but as we'll see, it doesn't quite work when we have division and subtraction.

Onward ...

example6

In this version of our language, we add a second operator (a <PLUS>) and introduce the notion of precedence.


    PARSER_BEGIN( ExampleParser )
    package example6;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root     : {}   { ( plusExpr() )? <EOF> { return jjtThis; } } 
    void plusExpr()     #Plus(>1) : {}   { multExpr() ( <PLUS> plusExpr() )? }
    void multExpr()     #Mult(>1) : {}   { integerExpr() ( <MULT> multExpr()  )? }
    void integerExpr()  #Int      :      { Token t; }	 { t=<INT>
                                              { jjtThis.setToken(t.kind,t.image); } 
                                         }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT : "*" > }
    TOKEN : { < PLUS : "+" > }
    TOKEN : { < INT  : <DIGITS> > }
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

The normal rules of arithmetic precedence say that multiplication binds tighter than addition. This means that we want to evaluate the expression

     1 + 2 * 3

     1 + ( 2 * 3 ) == 7    (correct)

as opposed to

     ( 1 + 2 ) * 3 == 9    (incorrect)

If we're expressing this in tree form, the former would be represented by the tree

           +               (correct)
        /     \
       1       *
            /      \
           2        3

and the latter by

              *            (incorrect)
           /     \
          +       3
       /     \
      1       2

The JavaCC grammar shown here provides the latter representation, as it should. This is discussed in greater depth in the following example. Note by the way that saying that multiplication binds tighter than addition is equivalent to stating that the multiply operator, rather than the addition operator, binds both terminal leaf nodes (ie, the numbers).

The next example shows why the right associative tree we're producing here doesn't quite work in the general case.

example7

This example extends the prior one to show how we actually walk and evaluate our parse tree. To do this, we use the SimpleNode.getText() accessor we described earlier to retrieve the value of the integers stored in each Integer node.

To make this example work, we need to call it from an updated version of RunExamples0_6 (now called appropriately RunExample7) that contains an eval() function that does the evaluation. Note that I've omitted the mouseClickWindow code from this example for the sake of brevity.

import example7.*;
import java.io.StringReader;
import java.lang.Integer;

public class RunExample7 implements ExampleParserTreeConstants
{
	static String query = "1 - 2 - 3 ";
    
    public static void main( String[] args )
    //--------------------------------------
    {           
        new RunExample7().runQuery( query );  
    }
    
    void runQuery( String queryStr )
    //------------------------------
    {
        ExampleParser parser = new ExampleParser( new StringReader( queryStr ));

        try 
        {
            System.out.println( "\nquery = " + queryStr + "\n" );

            SimpleNode root = parser.query();

            // dump the result tree to the console for debug purposes

            root.dump( "" );
            System.out.println();
            
            // here's where we actually walk the tree we've so painfully
            // constructed and derive an answer to our query

            int answer = eval( (SimpleNode) root.jjtGetChild(0) );
            
            System.out.println( "\nanswer = " + answer );
        }
        catch( ParseException e ) 
        { 
            System.out.print( e.getMessage() ); 
            System.out.println();
        }
        catch( TokenMgrError tke )
        {
            System.out.print( tke.getMessage() );
            System.out.println();
        } 
    }
    
    int eval( SimpleNode node )
    //-------------------------
    {
        int result;
        
        // each node stores an id which uniquely identifies its type
        // jjtree names and stores these enums in an interface
        // named ExampleParserTreeConstants by JJTree

        int id = node.id;
        String nodeName = node.getNodeName();

        SimpleNode lhs = null;
        SimpleNode rhs = null;
        
        if ( id != JJTINT )
        {
            lhs =  (SimpleNode) node.jjtGetChild(0);
            rhs =  (SimpleNode) node.jjtGetChild(1);
        }

        System.out.println( "eval(): evaluating AST nodetype " 
					+ id + " [" + nodeName + "]" );

        switch( id )
        {                               
            case JJTPLUS  :     return eval( lhs ) + eval( rhs );
            case JJTMINUS :     return eval( lhs ) - eval( rhs );
            case JJTMULT  :     return eval( lhs ) * eval( lhs );       
                                     
            case JJTINT   :     return Integer.parseInt( node.getText() );
            
            default :
            
                    throw new java.lang.IllegalArgumentException( 
                                        "didn't understand the query, man!" );
        }
    }   
}

eval() works by calling itself recursively, applying the proper arithmetic operators while doing so. The recursion ends once a terminal integer leaf is reached. The stack then starts unwinding upwards.

Here's the .jjt file itself:


    PARSER_BEGIN( ExampleParser )
    package example7;
    public class ExampleParser {}
    PARSER_END( ExampleParser )

    SimpleNode query()  #Root : {}  { ( plusExpr() )? <EOF> { return jjtThis; } }
    void plusExpr()     : {}        { minusExpr() ( <PLUS> minusExpr() #Plus(2) )* }
    void minusExpr()    : {}        { multExpr() ( <MINUS> multExpr() #Minus(2) )* }
    void multExpr()     : {}        { integerExpr() ( <MULT> integerExpr() #Mult(2) )* }
    void integerExpr()  #Int  :     { Token t; } { t=<INT> 
						{ jjtThis.setToken(t.kind,t.image); } }

    SKIP : { " " | "\t" | "\n" | "\r" }
    TOKEN : { < MULT  : "*" > }
    TOKEN : { < PLUS  : "+" > }
    TOKEN : { < MINUS : "-" > }
    TOKEN : { < INT   : <DIGITS> > }
    TOKEN : { < #DIGITS : ( ["0" - "9"] )+  > }

I've done two things with this example:

I've added a new <MINUS> operator, and

I changed the productions for all three operators so that they're now iterative as opposed to recursive. This changes the subtrees they build from being right associative to left associative.

I earlier alluded to difficulties in some cases with right associative structures. A right associative tree is a structure that results from using a production which is defined recursively in terms of itself. Sometimes this is desirable, depending on what you want your tree to look like, but it fails to work properly as mentioned earlier in the case of subtraction and division. This is explored in some depth below. To resolve this problem, we need to use iterative structures in our grammar to produce trees that are left associative.

Here's an example (from example6) of a recursive production:

void multExpr() #Mult(>1) : {} { { integerExpr() ( <MULT> multExpr() )? }

And here's an example of recasting this iteratively:

void multExpr() : {} { integerExpr() ( <MULT> integerExpr() #Mult(2) )* }

We can see more specifically how this works in the case of subtraction. If we run the query:

    1 - 2 - 3

using the current grammar, which is iterative, we get back the following tree. (This output is produced by SimpleNode.dump(), behaviour that's built into SimpleNode right out of the box. I modified the original routine to also emit the m_text contents of the node, if there are any, in brackets.)

    Root
       Minus
          Minus
             Int [ 1 ]
             Int [ 2 ]
          Int [ 3 ]

The above tree is left associative, which means that the tree builds downward toward the left. (You can see this more easily in the diagrams that follow.)

Here's a right associative tree, produced by the prior, recursive grammar.

    Root
       Minus
             Int [ 1 ]
             Minus
                 Int [ 2 ]
                 Int [ 3 ]

These look almost the same, but there's a difference. Since the normal way we evaluate is to call eval() recursively and to only start producing intermediate arithmetic results once we reach the innermost Minus node, where the two terminal Ints are bound, evaluating the current, left associative tree produces a subtraction that looks like this:

     ( 1 - 2 ) - 3    == -4

This is the way we expect subtraction to work.

Evaluating the second tree, on the other hand, means we'd be evaluating as

      1 - ( 2 - 3 )   == 0

which is not how we expect our arithmetic to behave.

It might be easier to see this if we diagram our parse trees in more traditional fashion.

Here's a right associative tree (the one that first gets us to incorrectly subtract 3 from 2):

          <MINUS>
          /     \
         1    <MINUS>
              /     \
             2       3

Here's a left associative tree (the correct way of doing it):

          <MINUS>
          /     \
      <MINUS>   3
      /     \
     1       2

And that's about it. Hope that was useful!

Howard Katz

updated 19feb03
updated 24nov03