XQEngine at SourceForge


XQEngine is a full-text search engine for XML documents. Utilizing XQuery as its front-end query language, it lets you interrogate collections of XML documents for boolean combinations of keywords, much as Google and other search engines let you do for HTML. XQuery, however, provides much more powerful search capabilities than equivalent HTML-based engines, since its XPath component lets you specify constraints on attributes and element hierarchies, in addition to the specific word content you're searching on. Refer to the W3C's XML Query website to see what the W3C and other vendors are doing with XQuery and XPath.

XQEngine is a compact (roughly 300K) embeddable component written in Java. It's not a standalone application and requires a reasonable amount of Java programming skill to use. It has a straightforward programming interface that makes that fairly easy to do. It should work well as a personal productivity tool on a single desktop, as part of a CD-based application, or on a server with low to moderate traffic.

The current version of XQEngine is 0.69. See What's new for a list of what's been added or fixed in this version of the program.

How is XQEngine unique?

XQEngine differs in several important ways from other XQuery and/or XPath implementations:
  • It handles collections of multiple documents, up to a maximum of 2,147,000,000 documents, memory permitting. Documents can be XML files on your local hard drive, remote documents on web servers, or even messages snagged on the fly. Each document can contain up to 8,400,000 nodes (elements, attributes, and text nodes). Schemas or DTDs are not required. Queries such as contains-word( //para, "XML" ) search for <para> elements containing the word "XML" across every document in your repository. The query /PLAY/TITLE returns the main title element of every play in your Shakespeare collection (see below).
  • As a pre-indexing type of engine (documents to be added to your searchable repository for later querying go through an indexing process first), XQEngine is fast. Depending on your hardware and the version of Java you're using (I'm using a 2.4-gigahertz Windows 2000 box running JDK 1.4 under Eclipse), you should be able to index all 37 plays of the well-known Shakespeare collection in about 3.0 seconds. That works out to an indexing speed of roughly 145,000 nodes, or 3 megabytes of raw data, per second. Indexing speed increases with the size of the documents you're indexing.

    Query speed is good too. The query //LINE posed against the Shakespeare collection on my machine returns 107,791 nodes in 0.04 seconds (though it takes longer than that to display them). The query contains-word(//LINE, "love") posed against Shakespeare returns 1875 hits in 0.75 seconds.

  • It has an extremely easy-to-use API (and you will need to understand at least a modicum of Java programming to use it). Most use of the engine will likely consist of invocation of two main APIs:

    1. calling setDocument( docName ) repeatedly to index each of the documents you'll want to be querying against, and
    2. calling setQuery( queryString ) to pose a query against the collection or part of it and get back the results.

  • The price is right. XQEngine is open-source software, and if you can live with the restrictions of its standard GPL license, it's free. Plus it's all out there in public view -- there are no hidden strings, and you're not beholden to me in any way if you need modifications (though I'm more than happy to help if you want to use my services). If you're interested in private-licensing the codebase for non-GPL distribution, please contact me.

  • XQEngine's primary raison d'etre is full-text search, and as mentioned the engine extends standard XQuery/XPath to accommodate that goal. The contains-word function provides a convenient syntactic shorthand for specifying the search terms you're looking for. This is quite similar to what the Query working group is likely to adopt as a keystone function for full-text retrieval.

    The query:

            contains-word(//SPEECH/LINE, "love" )

    posed against the Shakespeare collection says, return all the <LINE> elements contained within a <SPEECH> element that contain the word "love". If you wanted all <LINE> elements containing both the words "love" and the word "hate" in the same line, you could say either

            contains-word(//SPEECH/LINE, "love", "hate" )

    or

            contains-word(//SPEECH/LINE, "love hate" )

    The latter form of contains-word() uses the fact that the function parses any quoted phrase(s) into individual words separated by punctuation or whitespace. You can use whichever form is more convenient, or even a combination of the two.

  • Queries that are rooted in either a single slash (/) or double slash (//) are automatically posed against the entire collection of documents that have been indexed to date. You can also tie a query to a particular document by rooting the query with XQuery's doc() function. The query doc("http://www.fatdog.com/bib.xml" )/bib for example will return the serialized XML form of the bib element at the root of the named document. doc() can also take a numeric argument, the integer docID that's returned as a result of a setDocument() call.

  • You can index and query XML documents using a variety of addressing mechanisms:

    1. using http://- or file://-based URLs as arguments to setDocument() and doc()
    2. using local filepath addresses as arguments to setDocument() and doc()
    3. via a mechanism called explicit addressing, using the method setExplicitDocument(). This API doesn't take an address; rather, it takes as its single argument the actual serialized XML to be indexed, as in:
        setExplicitDocument( "<log><entry num=\"1\">1440 hits</entry>"
                           + "<entry num=\"2\">1650 hits</entry></log>" )
      
    4. via a unique URL-like addressing mechanism of your own devising that allows you or third-party vendors to supply XML on demand, at query time if so desired. See Custom Protocol Handlers for further detail.

      In short, you register a unique URL scheme-like String prefix and an associated protocol handler with the query engine. When the engine determines that the docName-argument to either setDocument() or doc() begins with that scheme, the query engine calls your custom protocol handler to request the XML content corresponding to that address. The custom handler is free to determine the mapping between address and content in any way it sees fit, as long as it delivers the content on request.

      Also, using ad hoc indexing, if the content being addressed in a doc() invocation is being seen by the engine for the first time, it will be automatically indexed at query time. This allows you to maintain a connection with a backend, third-party supplier in the event you don't have access to their XML in advance. In fact, that XML might not even exist natively as XML -- it might be generated from some other source format on demand -- until you query against it. Using the vendor's own proprietary addressing mechanism, you can get content delivered, indexed, and queried, all at the same time.
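The ad hoc indexing behaviour described above (index on first sight, reuse afterwards) can be pictured with a small stand-alone sketch. It is not XQEngine's implementation: the class AdHocIndexSketch, its doc() method, and the content-supplier function are all hypothetical names chosen for illustration, and the "indexing" step is reduced to a counter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative stand-in only; this is not XQEngine's code.
class AdHocIndexSketch {
    private final Map<String, Integer> docIds = new LinkedHashMap<>();
    private final Function<String, String> contentSource; // hypothetical: address -> XML text
    private int indexingRuns = 0;

    AdHocIndexSketch(Function<String, String> contentSource) {
        this.contentSource = contentSource;
    }

    // Models doc(address): index the document the first time it is seen,
    // and hand back the same integer docID on every later reference.
    int doc(String address) {
        Integer known = docIds.get(address);
        if (known != null) {
            return known;
        }
        String xml = contentSource.apply(address);
        if (xml == null) {
            throw new IllegalArgumentException("No content for " + address);
        }
        indexingRuns++;              // a real engine would parse and index xml here
        int id = docIds.size();
        docIds.put(address, id);
        return id;
    }

    int indexingRuns() {
        return indexingRuns;
    }
}
```

Querying the same address twice triggers only one indexing run in this model, which is the property that lets content be fetched from a backend at query time without re-indexing it on every query.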

What's new?

What's new in version 0.69?

Note: versions 0.67 and 0.68 were internal releases only.

Version 0.69 fixes one minor bug, adds fn:following-sibling() functionality, and most importantly adds math support for integers, doubles, and decimals. See the ChangeLog for details.

What's new in version 0.66?

Note: versions 0.64 and 0.65 were internal releases only.

Version 0.66 fixes several bugs, most importantly a bug that prevented namespaces and namespace declarations from working properly. Startup program behaviour has also been changed so that the setting UseLexicalPrefixes is now defaulted to false. In other words, normal XQuery namespace behaviour is now the default and namespace declarations are required before prefixes can be used in queries. (Use setUseLexicalPrefixes() to change this. See Namespaces).

Support has been added for XQuery's if-then-else expression.

What's new in version 0.63?

Version 0.63 fixes half a dozen bugs and adds a half dozen new functions, including boolean(), chr(), document-name(), name(), and root(). The latter three are particularly useful for specifying within queries the names of nodes and their owning documents to provide fuller context for results. The following query for example outputs all the book <title> nodes in the repository, as well as the names of the documents containing those nodes:

    for $book in //book
    return
    ( 
       <title document = "{ doc-name( root( $book )) }">
           { $book/title/text() } 
       </title>, chr(10)  
    )

The results look like this:

    <title document="bib.xml">TCP/IP Illustrated</title>
    <title document="bib.xml">Advanced Programming in the Unix environment</title>
    <title document="bib.xml">Data on the Web</title>
    <title document="bib.xml">The Economics of Technology and Content for Digital TV</title>
    <title document="book.xml">Data on the Web</title>

What's new in version 0.62?

Version 0.62 is primarily a bug-fix release, with a few new features and enhancements.

Enhancements include:

  • support for proper namespace searching, which requires a namespace declaration in the prolog before QName prefixes can be used in location paths. The original form of lexical namespace searching (in which prefixes are searched for exactly as is and no namespace declaration is required) is still allowed as a switchable option. See the XQEngine accessors for useLexicalPrefixes(). (The default is false, to conform to the XQuery specification.)

  • I've changed the Swing dialog in SampleApp to allow for longer queries. Serialized XML output is now directed to the lower panel in the dialog, with debug item info going to the console. You invoke the query somewhat kludgily at the moment by terminating the query with two consecutive carriage returns. (Once I figure out how to code an attractive Swing dialog with a "Query" button and/or menu, I'll replace this mechanism.)

  • I've gotten a good start on the mechanisms necessary for persisting the index to disk. This is needed to deal with large datasets exceeding the size of available RAM.

Refer to the readme file in the download for further information on enhancements and bug fixes.

What's new in version 0.61?

Version 0.61 adds support for a number of important XQuery expressions. The version follows the August 2003 working drafts. New features supported include:
  • FLWOR expressions. You can now use FLWORs to join node sequences from different documents. You can use the order by clause to sort either alphabetically or numerically on the results. Only a single order by is supported at this time. From the XQuery Use Cases XMP-2:

    <results>
      {
        for $b in doc("bib.xml")/bib/book,
            $t in $b/title,
            $a in $b/author
        return
            <result>
                { $t }
                { $a }
            </result>
      }
    </results>
    

  • Element and attribute constructors. You can now say things like (Use Case TREE-5):

        <section_list>
        {
           for $s in doc("book.xml")//section
           let $f := $s/figure
           return
              <section title="{ $s/title/text() }" figcount="{ count($f) }"/>
        }
        </section_list>

    Computed constructors are not yet supported.

  • You can now do general alphabetic and integer comparisons, either in predicates or as arguments to a where clause, optionally anding the arguments together, as in (Use Case XMP-1):

        <bib>
        {
           for $b in doc("bib.xml")/bib/book
           where $b/publisher = "Addison-Wesley" and $b/@year > 1991
           return
               <book year="{ $b/@year }">
               { $b/title }
               </book>
        }
        </bib>

  • The contains-word() function now allows multiple words as arguments. These are implicitly anded together, as in either contains-word( //elemName, "word1", "word2", "word3" ) or contains-word( //elemName, "word1-word2 word3" ) (using internal whitespace and/or punctuation as word separators). The comparisons are case-insensitive by default. (If you want case-sensitive searches, you can add as a final argument any value that evaluates to a boolean true().) contains-word() has also been completely rewritten from scratch to be more efficient.

  • I've added support for additional functions:

    • empty()
    • exists()
    • not()
    • true()
    • false()
    • string()

  • I've added support for some, as in (Use Case SEQ-4):

        for $p in doc("report1.xml")//section[section.title = "Procedure"]
        where not( some $a in $p//anesthesia satisfies $a << ($p//incision)[1] )
        return $p

  • I've added support for node precedes ("<<") and node follows (">>"). (See the above example.)

  • I've added support for the position() function in predicates, as in (Use Case SEQ-2):

        for $s in doc("report1.xml")//section[section.title = "Procedure"]
        return ($s//instrument)[position()<=2]

  • I've added a number of additional JUnit tests (see JUnit tests below). One of the newly added files tests a number of the XQuery Use Cases (from which most of the above code snippets are taken).
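The matching rules described for contains-word() in this list (several terms implicitly anded, phrases split on whitespace and punctuation, case-insensitive unless a flag says otherwise) can be modelled in a few lines of plain Java. This is only a sketch of the semantics, not the engine's code: the class and method names are hypothetical, and treating every non-letter, non-digit character as a separator is an assumption about the word breaker.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative model of the matching rules; not XQEngine's implementation.
class ContainsWordSketch {

    // Split on runs of characters that are neither letters nor digits (assumed rule).
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    private static String fold(String word, boolean caseSensitive) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    // True only if every word of every term occurs in nodeText (implicit AND).
    static boolean containsWord(String nodeText, boolean caseSensitive, String... terms) {
        Set<String> present = new HashSet<>();
        for (String w : words(nodeText)) {
            present.add(fold(w, caseSensitive));
        }
        for (String term : terms) {
            for (String w : words(term)) {
                if (!present.contains(fold(w, caseSensitive))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Under this model, passing "love", "hate" as two terms and passing the single phrase "love hate" give the same answer, which mirrors the equivalence of the two query forms.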

Getting started

Once you've installed XQEngine, you can get started with it in a couple of different ways:

  • You can use it. XQEngine ships with two pieces of sample code that should work right out of the box to illustrate how to work with the APIs.

  • You can read a brief tutorial-like introduction to the APIs, or

  • You can jump right into the APIs themselves.

  • As a final choice of what to do initially, here's a design document that discusses some of XQEngine's internal architecture.

Installing XQEngine

Note that the installation procedure has been much simplified. Starting with version 0.61, the XQEngine jar file contains all the data files necessary for running both the SampleApp sample application and XQEngine's built-in JUnit tests.

Unzip the XQEngine.zip file. Place the XQEngine.jar file in a directory of your choosing. If you're going to run XQEngine's JUnit tests, place the junit.jar file in a location of your own choice as well.

At a minimum, you'll need to have a SAX parser on the CLASSPATH. If you're running JDK 1.3 or 1.4, the parser is already built in.

Otherwise:

  • If you want to run the SampleApp sample application and have JSDK 1.4, cd to the directory where you've placed the XQEngine.jar file and say:

             java -jar XQEngine.jar

  • If you want to run SampleApp and have another version of the JSDK, add the XQEngine jar to your CLASSPATH and say:

              java SampleApp

    If you're writing a client application from scratch to invoke XQEngine functionality, you would invoke it in the same way.

  • If you want to run XQEngine's JUnit tests, add junit.jar to the CLASSPATH and say:

              java com.fatdog.xmlEngine.junitTest.TestDriver

SampleApp sample application

SampleApp is a simple Swing-based sample application that ships with the XQEngine distribution that shows how to use the APIs. With Java 2 (required to run XQEngine), you should be able to run SampleApp right out of the box with the following command-line invocation, once you've placed the jar file on the CLASSPATH:

     java -jar XQEngine.jar

SampleApp starts up and immediately indexes five XML files that are shipped as part of the XQEngine jar file. These files, "bib.xml", "book.xml", "report.xml", "review1.xml", and "home_1.rss" are also used as part of XQEngine's JUnit tests of the XQuery Use Cases.

SampleApp then puts up a simple two-panel Swing dialog. Enter queries in the top panel and press the [enter] key twice to commit the query (you need a blank line to indicate no further input), and get back in the bottom panel the standard serialized XML corresponding to the query. An optional flag, NodeTree.LINEFEED_AFTER_INDIVIDUAL_NODES, has been set true to pretty-print the results (otherwise there are no linefeeds between lines).

      [Screenshot of the SampleApp dialog not reproduced here.]

Here's a debug view of the same information that's sent to the console. To emit debug info, you call toString() on the ResultList object returned by a setQuery() invocation. This view shows both a summary of all hits in the ResultList (5 total in 2 different documents) and a breakdown per document. Printing debug information also shows other information not shown here.

      [Debug listing not reproduced here.]

The first number on each line of the debug listing is the document order index of the node on that line. The parent and nextSib numbers are the indexes of the node's parent and next sibling nodes, respectively. The node's document order index actually shows two entries: one each for the leaf-end and root-end indexes of the subpath that was evaluated to produce the node. (And unless you're doing serious debugging, that's probably more than you wanted to know.)
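To make the three numbers concrete, here is a minimal sketch that assigns document order indexes with a pre-order walk and records each node's parent and next-sibling index. It uses the JDK's DOM parser rather than XQEngine, ignores attributes for brevity, and the class and field names are hypothetical; the engine's own node table may well be laid out differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative numbering of nodes in document order; not XQEngine's index format.
class DocOrderSketch {

    // One row per node: its name, its parent's index, its next sibling's index (-1 = none).
    static final class Row {
        final String name;
        final int parent;
        int nextSib = -1;

        Row(String name, int parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    static List<Row> number(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<Row> rows = new ArrayList<>();
            visit(root, -1, rows);
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Pre-order walk: a node's index is simply its position in document order,
    // so "a precedes b" reduces to comparing two integers.
    private static int visit(Node node, int parent, List<Row> rows) {
        int self = rows.size();
        rows.add(new Row(node.getNodeName(), parent));
        int previous = -1;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            int child = visit(c, self, rows);
            if (previous >= 0) {
                rows.get(previous).nextSib = child;
            }
            previous = child;
        }
        return self;
    }
}
```

With indexes like these, skipping a whole subtree is a single jump to nextSib, and walking back up to an ancestor is a chain of parent lookups.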

The sample application takes the ResultList object that's returned by setQuery(), inspects it to see whether it has any valid items worth displaying (as indicated by a call to ResultList.getNumValidItems()), and then calls emitXML() on it. ResultList.emitXML() in turn calls emitXML() on each of its subordinate DocumentItem objects, one for each document in the result list. It should be fairly easy to modify this code if you want to package up the returned XML in a different fashion. See the source code for details.

Here's part of the code in SampleApp.run() that sets up the XQEngine environment prior to indexing:


  public void run() 
  {
     m_engine = new XQEngine();
        
     m_engine.setMinIndexableWordLength( 2 ); 
     m_engine.setDebugOutputToConsole( false ); 

     installSunXMLReader();
  }

XQEngine JUnit tests

JUnit testing is an excellent way of incrementally and continually testing the correct functioning of code. XQEngine maintains a set of JUnit test files located in the package com.fatdog.xmlEngine.junitTest. Looking through the JUnit test files for XQEngine is an excellent way of sampling the range of query functionality currently supported by the engine.

All JUnit tests in the package can be run by invoking

    java com.fatdog.xmlEngine.junitTest.TestDriver

There are currently over 250 separate JUnit tests in seven different test case files, each of which is a subclass of JUnit's TestCase class:
  • OneFileXPaths
  • TwoFileXPaths
  • Functions
  • TestProtocolHandler
  • Constructors
  • XQueryUseCases
  • Namespaces

The first two test case classes test a variety of XPath location-path patterns. OneFileXPaths tests against the canonical "bib.xml" file described in the XQuery Use Cases document. TwoFileXPaths subclasses the first test case file and adds a second test document, "bib_2.xml", to the repository. "bib_2.xml" is identical to "bib.xml", except that all element and attribute names have been prepended with the prefix "xqe". This allows the second test file to be used to test both the correct functioning of queries against multiple documents and the correct handling of lexical namespace prefixes.

The Functions file tests built-in functions, as its name implies, and the test-case file TestProtocolHandler tests the correct functioning of XQEngine's custom-protocol-handling mechanism. The TestProtocolHandler test makes use of XYZ_ProtocolHandler, the sample protocol handler discussed in Custom Protocol Handlers. Constructors tests the proper functioning of element and attribute constructors, including whitespace handling and normalization, and XQueryUseCases tests a variety of the use cases pulled from the working group's Use Cases document. Namespaces tests the proper behaviour of XQuery namespace declarations and prefix use in XPaths.

The suite() method in each test-case file is used to invoke and initialize XQEngine preparatory to each test run, and to instruct it which XML files to index prior to starting testing.

Namespaces and Lexical Prefixes

XQEngine's default behaviour is to require namespace declarations in queries wherever a namespace prefix is used. This is expected XQuery behaviour. However, you can call engine.setUseLexicalPrefixes(true) to change that behaviour. This allows you to issue (normally invalid) queries such as
    //rdf:*
without requiring the usual prolog declaration first:
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
This is simply for user convenience and is not part of the XQuery specification. Note that most of the JUnit tests included with the program use this convenience (except for the test file, "Namespaces.java", which tests expected namespace behaviour as per the specification).
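The difference between the two modes can be sketched in plain Java. In the lexical mode a prefixed name is compared exactly as written; in the namespace-aware mode each prefix is first resolved to its declared URI, so two different prefixes bound to the same URI match. The class PrefixMatchSketch and its methods are hypothetical and only model the comparison; they are not part of XQEngine.

```java
import java.util.Map;

// Illustrative comparison of the two matching modes; not XQEngine's code.
class PrefixMatchSketch {

    // Lexical mode: the prefix is just text, compared character for character.
    static boolean lexicalMatch(String queryQName, String docQName) {
        return queryQName.equals(docQName);
    }

    // Namespace-aware mode: compare (namespace URI, local name) pairs.
    static boolean namespaceMatch(String queryQName, Map<String, String> queryDecls,
                                  String docQName, Map<String, String> docDecls) {
        return uriOf(queryQName, queryDecls).equals(uriOf(docQName, docDecls))
                && local(queryQName).equals(local(docQName));
    }

    // Resolve a QName's prefix against its declarations; undeclared prefixes are an error.
    private static String uriOf(String qname, Map<String, String> decls) {
        int colon = qname.indexOf(':');
        if (colon < 0) {
            return "";
        }
        String uri = decls.get(qname.substring(0, colon));
        if (uri == null) {
            throw new IllegalArgumentException("undeclared prefix in " + qname);
        }
        return uri;
    }

    private static String local(String qname) {
        return qname.substring(qname.indexOf(':') + 1);
    }
}
```

The sketch also shows why the lexical mode needs no prolog: it never looks a prefix up, so there is nothing to declare.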

Custom Protocol Handlers

The custom protocol handler capability is new to XQEngine 0.60. It allows you to use a custom-defined address for a document, indicated by the presence of a predefined URL scheme-like prefix. If the engine determines that the document-address argument being passed in either the setDocument() or doc() function is neither an http://- nor a file://-based URL, it will look through its registry of custom protocol handlers to determine if the address is prefixed by a custom scheme that's been previously registered with the engine using the registerProtocolHandler() API (see the usage example below).

If the address starts with a known scheme, XQEngine will call the content() method of the IProtocolHandler that's been registered for that scheme, requesting the XML content for the document address in question. If the address isn't prepended with a known scheme, XQEngine will attempt to index the document via a local file reference as usual.

One of the benefits of using a custom protocol handler and this type of indirect addressing is that it enables you to use a custom addressing scheme for your documents, either one of your own devising or one that's previously been agreed to with other vendors. You can deliver your requests for XML content to backend functionality that can map each address in any way it sees fit to actual XML data, generating it on demand if need be. As an additional benefit, XQEngine's ability to do ad hoc indexing in a doc() invocation means that XML content to be queried against doesn't have to be delivered to the engine until query time -- in some circumstances XML content might not be available for indexing or even exist as XML prior to being queried against.

The suite() and various other test methods in the JUnit test-case file "TestProtocolHandler.java" show a typical sequence of events for registering a custom protocol handler, and then obtaining and indexing XML content from it on demand. In that scenario, a test XYZ_ProtocolHandler is first instantiated and then registered with XQEngine. The test handler is also initialized with content corresponding to several different dummy addresses that will be passed back to it indirectly later on via setDocument() and doc() calls in the test file.

Here's a brief look at a scenario for initializing and registering a custom protocol handler:


    // instantiate the handler

    XYZ_ProtocolHandler xyzTestHandler = new XYZ_ProtocolHandler();

    // feed it some test content we'll ask for later on

    xyzTestHandler.setContent( "XYZ::aProprietaryAddr_1", "<xyz>some xyz content_1</xyz>" );
    xyzTestHandler.setContent( "XYZ::aProprietaryAddr_2", "<xyz>some xyz content_2</xyz>" );

    // tell XQEngine what its "scheme" looks like
    // and what handler to call on detecting it

    m_engine.registerProtocolHandler( "XYZ::", xyzTestHandler );

The protocol handler needs to be declared like this:


    public class XYZ_ProtocolHandler implements IProtocolHandler

and it needs to implement the single IProtocolHandler callback method content(). In this example scenario, the XYZ_ProtocolHandler implementation of the IProtocolHandler interface keeps its content in a hash table and uses the address it receives from XQEngine as a key into that table. The implementation of content() for our test handler is nothing more complicated than this:


    public String content( String address )
    {
        return (String) m_contentHash.get( address );
    }

The handler looks up the incoming address. If it's either "XYZ::aProprietaryAddr_1" or "XYZ::aProprietaryAddr_2", it returns the content corresponding to that address and XQEngine indexes it. If there's no corresponding entry, content() returns null and XQEngine raises an error.
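For readers who want to experiment without the XQEngine jar on the CLASSPATH, the whole round trip can be imitated in a self-contained sketch. Everything here is a stand-in: ProtocolHandlerSketch mirrors the single content() callback described above, and HandlerRegistrySketch is a hypothetical model of prefix lookup, not the engine's registry. Unlike the engine, which falls back to a local file reference for unknown schemes, this sketch simply throws.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-in for the one-method callback; not the real IProtocolHandler.
interface ProtocolHandlerSketch {
    String content(String address);
}

// Hash-table-backed handler in the spirit of the XYZ test handler.
class HashHandlerSketch implements ProtocolHandlerSketch {
    private final Map<String, String> contentByAddress = new HashMap<>();

    void setContent(String address, String xml) {
        contentByAddress.put(address, xml);
    }

    public String content(String address) {
        return contentByAddress.get(address); // null when the address is unknown
    }
}

// Hypothetical registry: picks a handler by scheme prefix and asks it for XML.
class HandlerRegistrySketch {
    private final Map<String, ProtocolHandlerSketch> handlers = new LinkedHashMap<>();

    void register(String scheme, ProtocolHandlerSketch handler) {
        handlers.put(scheme, handler);
    }

    String resolve(String address) {
        for (Map.Entry<String, ProtocolHandlerSketch> e : handlers.entrySet()) {
            if (address.startsWith(e.getKey())) {
                String xml = e.getValue().content(address);
                if (xml == null) {
                    throw new IllegalStateException("no content for " + address);
                }
                return xml;
            }
        }
        throw new IllegalArgumentException("no handler registered for " + address);
    }
}
```

Keeping the handler ignorant of the registry, as here, is what lets the mapping from address to content live entirely on the supplier's side.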

Word Breaking

Word breaking in XQEngine is handled by the class com.fatdog.xmlEngine.word.WordBreaker. If you want to use its characters() routine to deliver words to you as they're parsed, either customized or as is, your word-breaking class needs to subclass WordBreaker, implement the newWord() method defined in the interface IWordHandler, and use its registerWordHandler() routine to register your class. WordBreaker's characters() routine will deliver each new word as it's encountered to your newWord() routine. You can override characters() to change its behaviour (something you might want to do for non-European languages, for example), as long as you respect its contract with newWord().

DocumentItem.contains_word() (the back-end routine that implements XQEngine's full-text search capability) shows an example of how to use the word-breaking routines. newWord() takes an IntList argument you can use to pass various word-breaking results back to the calling routine. In the case of contains_word(), the IntList is used to pass back the character offset in the node at which the particular searched-for word occurs.
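The callback pattern described above (a scanner that hands each word to a handler, which records the character offsets of the words it cares about) can be sketched without the engine. The names below are hypothetical: WordSink stands in for the handler role, and its signature is an assumption, since the real newWord() takes an IntList and may differ in other ways. Splitting on non-letter, non-digit characters is likewise an assumed rule.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative word breaker with offset reporting; not XQEngine's WordBreaker.
class WordBreakSketch {

    // Hypothetical callback; the real IWordHandler.newWord() signature may differ.
    interface WordSink {
        void word(String word, int offset);
    }

    // Scan once, reporting each maximal run of letters/digits and where it starts.
    static void breakWords(String text, WordSink sink) {
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean inWord = i < text.length() && Character.isLetterOrDigit(text.charAt(i));
            if (inWord && start < 0) {
                start = i;                       // a new word begins here
            } else if (!inWord && start >= 0) {
                sink.word(text.substring(start, i), start);
                start = -1;                      // back between words
            }
        }
    }

    // Character offsets at which target occurs as a whole word, ignoring case.
    static List<Integer> offsetsOf(String text, String target) {
        List<Integer> hits = new ArrayList<>();
        breakWords(text, (w, off) -> {
            if (w.equalsIgnoreCase(target)) {
                hits.add(off);
            }
        });
        return hits;
    }
}
```

Because matching happens per delivered word, "beloved" does not count as a hit for "love", which is the whole-word behaviour a full-text search needs.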

Other stuff

  • Here's a presentation on Implementing the XQuery grammar I delivered at Extreme Markup 2002 in Montreal. It's an introductory tutorial on using the JavaCC and JJTree parser-generator tools to implement grammars such as XPath and XQuery.

  • Here's an introductory article on XQuery that I wrote for IBM's developerWorks website.

  • I'm the editor of a new book on XQuery, XQuery from the Experts, recently out from Addison-Wesley. The book is written by members of the W3C's Query working group, including such well-known names in the XQuery world as Don Chamberlin, Jonathan Robie, Michael Kay, and all the other stellar participants listed on the cover below. The book is available (naturally) at Amazon.

    Here's Jonathan Robie's excellent Chapter One overview of the language, XQuery: A Guided Tour, excerpted from the book.

    And here's Don Chamberlin's superlative Chapter Two, Influences on the Design of XQuery.

Contact

   Howard Katz, Fatdog Software Inc.
   3771 Sunshine Coast Hwy
   Roberts Creek, BC
   Canada V0N 2W2

   Email: howardk@fatdog.com
   Website: www.fatdog.com

  created 6 November 1999
  moved to SourceForge 24 November 2003
  last updated 24 March 2005 (v0.69)
