The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. Apache VXQuery is designed to query such large data collections efficiently and to take advantage of parallelism. The system builds upon two other open-source frameworks: Hyracks, a parallel execution engine, and Algebricks, a language-agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specification. The main idea of this project is to integrate Lucene indexing into the VXQuery system. VXQuery already has some Lucene capabilities, such as creating a Lucene index from an XML file and executing a query using that index. This project aims to fully integrate Lucene and extend the system's indexing capabilities: enabling queries to choose dynamically at run time whether to use an index, extending indexing to HDFS folders, and allowing collection indexes to be updated when XML files are added, deleted, or modified.




  • Steven Jacobs