Indexing architecture

Treewalking & query resolution

ResultList architecture

XQEngine Design Documentation

This document briefly details some of the more important architectural components of XQEngine and how they work.

The indexing architecture

The process of index building is managed by XQEngine's IndexManager class, which parcels out indexing subtasks to various subordinate objects. For example, the IndexManager maintains an array of NodeTree objects, one for each document that's been indexed. As XQEngine's SAX parser encounters element, attribute, and text nodes during indexing, a two-part key and other information is appended to the end of the current NodeTree to represent the node, and the tree is grown as needed.

A basic entry for a node consists of an integer representing the type of the node (NodeTree.ELEM, NodeTree.ATTR, or NodeTree.TEXT), the two integers representing its key if it's a named node (i.e., element or attribute), and pointers to its stored text if it's a text node or attribute content (see below). Elements and text nodes maintain pointers to their following siblings, and all nodes have a parent-pointer index that points to the node's parent.

Each element or attribute has a two-part double integer key identifying the prefix and localpart components of its QName. If the prefix is empty, the key is -1. These keys are generated by one of IndexManager's two WordManager objects, one for attributes and one for elements.. If a new element prefix is encountered, for example, the IndexManager's m_elementWM WordManager will generate a new integer ID for that prefix, or return the value WordManager.NO_ENTRY if none is present. If a pre-existing prefix is encountered on the other hand, m_elementWM will return the existing integer key reference that maps to that prefix. Each WordManager depends on two subordinate TagDictionary objects to maintain this QName-to-key mapping. A TagDictionary in turn uses a specialized form of Hashtable to maintain each key part. This WordListHashtable is guaranteed to generate a unique integer for each prefix or localpart "word".

Text in element content that's encountered during indexing is concatenated in an element-text StringBuffer maintained by the NodeTree for that document, and pointers to the start and stop locations of each text node are stored in the tree. Similarly, attribute text is concatenated in an attribute-text StringBuffer and pointers to it stored in the tree.

NodeTree's internal array structure for nodes and its two text buffers store sufficient state information from the original document that querying does not need to re-access the original in order to reconstitute its serialized form.

Treewalking and query resolution

A JavaCC-based parser generator builds a query-tree representation of each query passed in via setQuery(). If you run the SampleApp application that ships with XQEngine, you can see what the query tree looks like for any query you enter by making sure that the debugOutputToConsole flag is turned on in the run() routine:

void run() { ... m_engine.setDebugOutputToConsole( true ); ... }

The query "/bib/book," for example, produces the following query tree:

>> StartNode >> SlashRoot >> RelPath [ / ] >> QName [ bib ] >> QName [ book ]

A query tree is walked by a TreeWalker object starting at its StartNode root. Each object shown in the above debug dump corresponds to a JavaCC-generated node.

The main workhorse of the treewalking process is the TreeWalker.eval() function, which dispatches resolution of query-tree subpaths to methods that are specialized on the particular operators encountered during the walk. These in turn may recursively call eval() yet again to further walk the tree downward, returning node results upward.

In the above query, for example, eval() encounters the RelPath operator as it descends the query tree and dispatches its two QName subnodes as arguments to TreeWalker.child(). child() in turn dispatches to other routines depending on the types of the subnodes it receives as arguments. Each node in the tree is identified both by the name of the class emitted by the parser generator (SlashSlashRoot, RelPath, and QName, for example, which are primarily useful for debugging purposes as above), and by a unique id() field which is used by the eval() routine and other TreeWalker methods to determine the types of the query-tree operators and operands they are handling.

The ResultList architecture

At some point in the treewalking process a ResultList object gets generated, generally once treewalking has reached a leaf node in the query tree. A ResultList is a collection class holding a number of subordinate DocItems objects that are linked together in linked-list form. (DocItems was called DocumentItems in previous versions of the program.) A DocItems object corresponds to all the nodes satisfying a query which have been contributed by a single document. In general, a ResultList object is initially generated once the TreeWalker reaches the leaf of a query-tree subpath. Resolution proceeds upwards through the tree, with subsequent methods corresponding to nodes that are higher up adding and subtracting nodes from the ResultList as appropriate. ResultList results are passed upwards as return values of the methods handling the query-tree operator subpaths.

The TreeNode object that corresponds to a DocItems object (they share the same integer ID) is the ultimate determiner of which nodes get added and subtracted as treewalking proceeds. As a very general principle (the specifics vary, depending on the exact XPath pattern being resolved), each DocumentItems object initially determines the maximal number of nodes corresponding to the leaf of the subpath it's evaluating. As each Treewalker method returns the results of its evaluation to the method that called it, nodes deemed to no longer be viable candidate nodes are marked invalid but are otherwise not removed from the DocItems object. Rather, a cumulative m_numValidItems field gets decremented. The equivalent total aggregate count of valid items for the owning ResultList object is also adjusted accordingly. It is the final valid item count that determines the success of a query, not the field m_numTotalItems, which generally only indicates how large a ResultList object was initially, before it started getting pruned back.

One other point to note is that the actual node-related data held in a DocItems object is maintained by a subordinate IntList collection class, a dynamic array of integers that holds, in the case of DocumentItems, two integers for every node in the resultset for that particular document. The first of each two-integer pairs represents the node at the leaf end of the subpath being evaluated. This "leaf-end" value is the document-order index of the node that's actually being returned. The second integer in the entry, its corresponding "root-end" entry, either holds the enum value NodeTree.VOIDED_NODE to indicate an initially successful match that's since been invalidated, or the root-most end of the subpath currently under evaluation. Two integers representing both endpoints of each subpath under evaluation are necessary for resolution to work as evaluation moves up the subpath toward the root of the query tree.

These integer pairs can also hold atomic values, with a negative root-end value indicating a non-node entry. In the case where the root-end integer is the enum NodeTree.INT for example (-2), the integer in the leaf-end position is literally an integer and not a node. ResultList's emitXML() routine uses the value in the root-end position to serialize XML nodes or atomic values as appropriate.

created 13july03
updated 24nov03