Abstract
There is a common need among XML projects to have fast, reliable, full-text searching of XML documents that can be applied to entire content repositories. For all but the most trivial cases, the solution must allow for complex querying of element content and attributes along with the ability to search for structural relationships among elements. For many use cases the ideal solution would minimize cost by using existing open source components as opposed to costly commercial alternatives.
This paper describes the integration of XML-aware indexing and searching components with a full-text search engine. It details the approach used, the results, some alternatives, and important lessons learned along the way.
The integrated system is designed to meet the following goals:
Indexing and searching of both full document and XML-specific content: tagname, ancestor, attribute, and processing instruction searching, as well as treeloc and document id information for post-search navigation.
Development using existing open source components.
Support for typical search query structure (AND, OR, ...).
Easy integration of custom business logic.
Very fast search times.
The XML-aware searching provided is capable of solving many business needs, allowing for complex XML queries, including:
finding an element with specific content
finding an element with attributes with specific values
finding an element with a specific ancestor or ancestor list
finding a particular processing instruction
most combinations of the above
more...
The resulting system is a fast, reliable XML search engine, which has exceeded our expectations in terms of flexibility and low development cost.
Keywords
Table of Contents
This paper describes the experiences of a small software development team that needed to create the ability to do fast, yet complex XML-aware searching. The solution had to be low cost, support many future projects, and be completed in only a few days.
The problem of how to index XML-aware content into a searchable form is a complex one. This team is not the first group of developers to take on this problem. Others have solved it to varying degrees and offer it as part of large suites of software. However it is very hard to find an inexpensive solution.
A quick perusal of the XML searching landscape led the team to a Java full-text search engine. It had no "out of the box" XML support, but its architecture was flexible and its Application Programming Interface (API) was clear and intuitive. Its search results were consistently fast and provided enough flexibility to quickly develop a solution.
The development team consisted of four standards-based developers with a variety of software development and markup-related experience totaling over 32 years. While two of the team members were tasked with implementing this solution, all four had vital influence on the final design. The effort described took place in the Fall of 2001.
The over-arching requirement for the project was to provide searching of both XML-specific information and full-document text content.
The time and resources weren't available to purchase or create a search engine from scratch. The team needed to find an open source search engine to provide the underlying indexing and searching mechanisms. The search engine chosen had to provide an intuitive API flexible enough to design our XML searching around. It also had to meet other searching and performance needs.
For the base search engine, Lucene was chosen. Lucene is a part of the Apache Software Foundation's Jakarta open source Java initiative[LU 2]. Lucene provides a lot of functionality and flexibility 'out of the box' and was key to our project's success.
To index content for searching, a means of processing an XML document to extract the desired information is needed. Both the Document Object Model (DOM)[DS 5] and the Simple API for XML (SAX)[SA 12] were considered. DOM provides a standard object model that can be used to extract the required XML information for indexing and was chosen due to previous experience and limited development time. See the Alternatives section (Section 6) for a more detailed comparison of the two technologies.
Generalized XML-aware searching requires a DTD/Schema-independent means of indexing XML documents such that you can search on the PCDATA content, tagnames, ancestors, and attributes of elements. It also requires having the ability to search for more complex relationships among elements in an XML document.
XML-aware searching requires indexing of:
DTD/Schema-independent full-text indexing of XML documents enabling searching on the PCDATA content of elements.
Searching for the text 'financing' should return a list of hits, one for each element that contains 'financing' in its direct content.
Tagnames, enabling searching on elements by tag name.
Searching for the text 'financing' in a 'para' element should return a list of hits, one for each element of type 'para' that contains 'financing' in its direct content.
Ancestor and parent tagnames, enabling searching for elements with a specific type of parent element, sequence of ancestors, or a set of ancestors in any order.
Searching for the text 'financing' in 'para' elements with a 'chapter' ancestor element should return all the 'para' elements that have a 'chapter' ancestor and contain 'financing'.
Searching for the text 'financing' in a 'para' element with a 'chapter' parent element should return all the 'para' elements whose parent is a 'chapter' element and contain 'financing'.
Absolute child numbers, enabling searching for the exact position that the element you are searching for is within its parent.
Searching for the text 'financing' in a 'para' element with a 'chapter' parent element with child number of 1 should return all the 'para' elements that contain 'financing' that are the absolute first children of 'chapter' elements.
Attribute values, enabling searching for elements with specific attribute names and values.
Searching for 'para' elements that contain an attribute named 'lang' should return all 'para' elements that contain such attributes.
Searching for elements that have a 'lang' attribute with value 'en_ww' should return all elements whose language attribute value is 'en_ww'.
Processing instructions, enabling search for processing instructions.
In developing a more complete XML-aware indexing solution, Namespace handling and other XML element types and content (such as comments), can be indexed and searched for as well. See the Alternatives section (Section 6) of this document for more details.
Defining a specific User Interface for searching was not part of this project. To the contrary, the indexing/searching scheme was designed to provide post-search information that would be usable for many custom user interfaces.
For a given text search, users expect to get back a complete list of possible matches for the desired documents/content. From this complete list, it is helpful to have a ranking/scoring associated with each potential match. The ranking helps promote displaying the most likely results first.
In an XML context, users are expected to search for either specific elements or documents containing specific elements. Given a list of possible matches, it would be ideal to be able to navigate to the exact position of an element within its containing document. This would enable user interfaces to open exactly to the element that a user searched for. To accommodate this, DOM tree location and the original document location must be stored for each element.
As part of a results display, it would also be useful to show a few lines of content for each element. For example, if the user is searching for a particular 'para' element containing 'certain text' and they get back many results, they can visually examine the few lines around each result to find the one they need. This requires either storage of element/node content or other post-processing mechanisms, such as reparsing the containing document to get the element's content.
The XML-aware searching is only as good as the underlying search engine it uses. Lucene is a server-side Java search engine toolkit with a library of packages that can be combined to create highly-configurable full-text indexing and searching. Packages available are analysis, document, index, queryparser, search, store, and util[LU 2].
The analysis package contains an abstract API and Java classes for analyzing and tokenizing input into a form used by index writing and searching.
The document package contains the primary data structure for representing actual documents.
The index package contains classes for index reading and writing.
The queryparser package contains classes that support parsing of text into searchable query objects.
The search package contains the abstract searching API as well as classes for index searching, query objects, and post-search document results.
The util package contains some supplementary data structures.
The specific features used for our implementation will be described as the corresponding design areas are discussed.
Lucene query structure is very flexible. It provides support for boolean, conjunctions, wildcard, prefix, and date ranges in query strings. Lucene provides configurable filters that are applied to indexing and searching text. These filters can be used for case-sensitivity, to filter out 'stop' words that are not meaningful/desired to be indexed, and more. This provides the structure for our XML querying.
Lucene provides a structured 'Hits' object to represent search results. The Hits object contains a list of search results with a score for each. The score is calculated using the frequency of terms in the document, the size of the document, the number of documents containing the search terms, a coordination factor (multi-term queries), and an optional user-specified 'boost' factor. The boost gives the ability to increase or decrease the effect specific query terms have on result scores[LF 4]. For each result in the Hits object, you can retrieve any content stored at index time.
For our current implementation, we are indexing documents from the filesystem into indexes on the filesystem. Search performance can be increased by using Lucene support for multiple specialized indexes. Index size optimization is provided. RAM-based indexes can also be used to maximize searching performance.
W3C Document Object Model[DS 5] was selected for XML document parsing. Given an XML document, we can build a DOM tree representation of the document. We can then iterate over the nodes in the tree and get all of the information needed to construct the XML-specific index.
The Apache XML Xerces Java Parser version 1.4.4[XE 1] was selected. Xerces is open-source, supports the XML 1.0 recommendation[XM 6], and contains support for DOM Level 2 version 1.0 in addition to supporting the DOM Level 1 API.
To describe the design we borrow some terminology from Lucene[LF 4]. The capitalized forms of words refer to these specific terms, whereas lowercase refers to the everyday meaning of the word.
| Filter |
Applied to content to transform it. Example: Lower Case Filter |
||||||
| Analyzer |
A group of Filters to be applied to content. Used at Index and Search time. |
||||||
| Document |
An object representing a unit of indexed information. A Document contains an arbitrary list of Fields. |
||||||
| Field |
A Name/Value pair. The Name is used to identify the content contained in the Value. A Field may be Stored and/or Indexed and/or Tokenized.
|
||||||
| Index |
A compact, rapidly accessible and searchable storage of Documents. |
Document (Structure)
[
Field[name:value]
Field[name:value]
Field[name:value]
Field[name:value]
]
Index [ segments of Document fragments... ]There had been much discussion and interest on the Lucene users group about how to implement XML-aware indexing, but few feasible solutions had been offered. Those solutions had succeeded in indexing some XML content, but had failed to provide any but the most trivial searching.
The most common idea was to index XML content for each XML document in a single Lucene document. It sounds easy enough, but doing this causes many design problems. Fields are simply name value pairs. Each time you encounter a new element, you create a new Field using the tagname with the text content as the value. This works for indexing text content in elements. However, if you have more than one tag with the same name, then the Field name is no longer unique and overwrites the old Field. So either you'd need a naming convention to identify each individual tag or you'd need to just append the content from all tags with the same name. Whatever you add to the Field name makes searching for it non-trivial. Instead of searching for tags named 'foo', you'd have to know its position or whatever convention was used to make it unique (i.e. 'foo1 0'). Concatenating the content loses the ability to identify particular elements within the document, so tree location would lose its value. If you solve that problem, then what about attributes? Say your document has 35 tags with the attribute 'language'. How do you associate each with the correct tag Field to make it available for queries? This quickly becomes a complex problem.
The key to our design is that for each element, processing instruction, and other desired node type, a new Lucene Document is created. This causes many small Documents to be created to represent the original XML document. Each Document contains document-specific XML content as well as element-specific XML content. The biggest critique with this design is one of scalability. It will be addressed more fully in the System Performance section (Section 7) of this document. It has been our experience that for small to moderate index sizes, search time is extremely fast.
At the core of the design are the indexing and searching processes that provide for XML-aware searching. The indexing scheme contains the XML document processing and creation of searchable indexes. The searching scheme consists of searching and post-processing of results.
The implementation code for this project contains an implementation of the design core in the form of XMLIndexer and XMLSearcher Java classes. It also provides a storage and retrieval manager (XMLSandRManager class) to control the lifecycles of indexers and searchers. Finally it provides several utility classes: SearchError, IndexError, and StemmingAnalyzer used to provide filtering for index creation and searching. The code is available for download on the ISOGEN website[II 9]. A basic User Interface client for interacting with the system is also available.
Having defined a workable indexing strategy, the basic indexing process is fairly straightforward. For each XML document to index, you build a DOM tree. You recursively iterate over the tree. For each desired Node in the tree, you create a single Lucene Document that contains the XML information to index for that node. The Documents created depend on the type of DOM Node. All Element and Processing Instruction nodes have Documents created for them. Other node types do not.
The following table shows the content stored for each Element Node. It also shows each Field's corresponding field state.
Table 1.
| Field Name | Description | Stored | Indexed | Tokenized |
|---|---|---|---|---|
| ancestors | Ancestor Tagnames | Yes | Yes | Yes |
| attributenames | Attribute Names (all) | Yes | Yes | Yes |
| att#name | Attribute Name (each) | Yes | Yes | Yes |
| childnum | Absolute Child Number | Yes | Yes | Yes |
| content | Content | No | Yes | Yes |
| docid | Document ID | Yes | Yes | Yes |
| nodetype | Node Type (DOM) | Yes | Yes | Yes |
| parent | Parent Tagname | Yes | Yes | Yes |
| tagname | Tagname | Yes | Yes | Yes |
| treeloc | Tree Location | Yes | No | No |
All but two Fields are stored, indexed, and tokenized. The content Field is not stored, so as to keep the index size as small as possible. The treeloc Field is for post-search result support only. It is not indexed as it need never be searched on. It is not tokenized so that you get the original Field value back (i.e. "0 1 4 7 1").
The following list defines each Field name and summarizes how it is generated in the code algorithms (as needed for clarity).
| ancestors |
denotes ancestry as a list of ancestor tag names. Ancestors can be thought of as a path to the document root for a given node. It is represented by a space separated list of tagnames. The order starts with the document element's tagname and ends with the parent of the current node's tagname (e.g. "document chap para" or "magazine section article title"). This ordering provides for searching of specific sequences of ancestors or simply for ancestor presence. This list is built as the DOM is recursively iterated over. Each parent node appends its name to the list of ancestors. That new list is passed in to each of its children. Thus each node has a complete ancestors list provided. Each node that a Lucene document is generated for (element, processing instruction, etc...) will store its ancestor list in a searchable Field. |
| attributenames |
a list of a particular element's attribute names. If a Node has attributes, they are retrieved. The name of each attribute is added to a space-separated list of attribute names for the current node. This list is stored in a searchable Field for that node. This allows searching for attribute presence by name within an element. |
| att#name |
the value of each attribute for a given element. Each attribute is also stored in its own searchable Field. The naming convention of 'att#' + attribute name provides for general name uniqueness among Fields in the document. Thus searching for a specific attribute by name and value within each element in a document is enabled. Example attribute Field names are 'att#lang' 'att#foo'. Should situations occur where this naming convention violates the naming used in XML documents, simply change the Field name prefix in a few code locations to one that will work. |
| childnum |
the absolute child location of the current node within its parent. The child number is translated from the tree location to a layman's expectation. For example, casual users want the first child of the parent, the zero'th child means nothing to them. Thus, it is a 1-base indexing. So if the current node's index within its parent is 0, its childnum is 1. The change from 0-base to 1-base indexing is for intuitive querying (i.e. User wants the first child of the parent, not the zero'th). Each node calculates its own childnum using tree location. |
| content |
any text contained directly within the element. Content for the text children of elements (etc...) is combined and added to the Lucene Document for that element. The text content of all nodes is combined into another Lucene Document to represent the document as a whole. Thus searches on text within the document as a whole are provided for. |
| docid |
the document identifier for the containing document. The document identifier is a string that can be used to retrieve the document for indexing and post-search processing. The docid could be a URL, filename, database object id, or other unique identifier. For our implementation, docid is a file path on the file system. The XMLIndexer Java code could be extended to take other forms of identifiers and retrieve the XML document from another source, such as a database. The document identifier is required to begin indexing a document. All Lucene Documents created for either the XML document as a whole or any of the document's nodes store this id in a searchable Field. |
| nodetype |
the W3C Document Object Model node type. Node types include entity, notation, attribute, and more. For a complete listing see the W3C DOM Document Level 2 Core Specification[DS 5]. Each node encountered in the DOM traversal stores its own nodetype as a searchable Field. For our purposes, we have added an ALL_CONTENT nodetype to indicate the Lucene Document that stores the entire PCDATA content of the XML document. Integers are used to represent the node types available from the Xerces DOM. These integers are part of the DOM Specification and are therefore mapped to representative strings for searching purposes. |
| parent |
the tagname of the parent element of the current element. As the DOM is being iterated over, each child node stores the tagname of its parent as a searchable field. |
| tagname |
the element's tag name. Each element stores its tagname in a searchable field. |
| treeloc |
the tree location of the element/node in the DOM. The tree location is represented by a space separated sequence of integers. Treeloc is computed while the DOM is being processed. Each parent node appends the child index for the child node that is being indexed next on to its treeloc. Thus the recursive nature of the tree walk, builds a complete treeloc value for each node. The tree location provides a unique identifier for each element in an XML document. It can be used to quickly navigate to locations within a DOM built for that document. |
Processing Instructions are a special case. Documents created for processing instructions have two special Fields (piname and pitext).
Table 2.
| Field Name | Value | Stored | Indexed | Tokenized |
|---|---|---|---|---|
| piname | Processing instruction's name | Yes | Yes | Yes |
| pitext | Text content of the PI | No | Yes | Yes |
The entire PCDATA content of each document is also indexed. It is indexed as a single Document having the nodetype of ALL_CONTENT.
This means that for a single XML document, many very small Lucene Documents are created. Each Document created is added to the current Index. Once the Indexing of an XML document is completed, you can search on its content.
Given an Index containing searchable content, searching is a matter of constructing a query, searching through the Index for matches to that query, and retrieving stored content for possible matches.
When adding documents to an index, an Analyzer is used to transform and otherwise filter out unwanted text content. The XMLSearcher uses a Lucene IndexSearcher to prepare for searching. When a query string is received, it is passed into a Lucene QueryParser. The QueryParser applies structure recognition and then the StemmingAnalyzer to the query string and creates a Lucene Query object. An IndexSearcher uses that Query object to search the target index and determine search results.
The QueryParser recognizes string structures that imply constraints to a query. A Query is comprised of groups of words organized by ()'s and associated by conjunctions (AND and OR). These groups can be specific terms, phrases, or specialized query forms. Terms are a specific unit of searching. Phrases are denoted by quotations to indicate exact text. The terms and phrases can be qualified with a Field name. Other modifiers (+, -, !, and NOT) are applied to terms and phrases to indicate mandatory inclusion or exclusion. There is also support for prefix queries, wildcard, fuzzy matching, and a boost factor. This query structure will hereafter be referred to as query syntax. For a more thorough explanation of Lucene's QueryParser & query syntax, see the FAQ[LF 4].
The StemmingAnalyzer uses several pluggable Lucene content Filters to tokenize and convert content into indexable and searchable forms. Currently filtering converts the target text to lower case, filters out a configurable list of common or meaningless 'stop' words, and provides porter stemming of terms (i.e. apple finds applesauce and apples). This can be customized/changed using other Filters and Analyzers.
For XML searching, the key to the syntax is the ability to qualify terms and phrases with Field names. This means that all of the terms that were indexed for each XML document and/or element/node can be searched.
To search for text content 'foo' anywhere in the document, search on: nodetype:ALL_CONTENT AND content:foo.
To search for the phrase 'foo baby' anywhere in the document, search on: nodetype:ALL_CONTENT AND content:"foo baby".
To search for an element by name 'para', search on: tagname:para. Note that tagname implies element, and therefore specifying nodetype:ELEMENT_NODE is not required.
To search for text content 'foo baby' anywhere in a 'para' element, search on: tagname:para AND content:"foo baby".
To search for attributes (lang and bar) of a 'para' element, search on: tagname:para AND attributenames:lang AND attributenames:bar .
To search for a specific attribute named 'units' with value 'inches' on a 'para' element, search on: tagname:para AND att#units:inches.
To search for a specific processing instruction 'pNote', search on: piname:pNote.
To search for an element named 'foo' directly inside an element named 'bar', search on: tagname:foo AND parent:bar.
To search for an element named 'foo' that is the second child element directly inside an element named 'bar', search on: tagname:foo AND parent:bar AND childnum:2.
To search for an element named 'foo' that has a 'bar' element in its ancestor list, search on: tagname:foo AND ancestors:bar.
To search for an element named 'foo' that specifically has an element 'bar' inside an element 'toga' in its ancestor list, search on: tagname:foo AND ancestors:"toga bar".
To search for an tree location...you can't. It is not indexed, but after searching you can retrieve the treeloc for a result. Also, you can search on document identifier, but it is not recommended, as it is tokenized and depending on format not may behave as expected. Finally, by default the 'content' fields are searched, so the query string 'apples are nutritious' will return results for any elements/nodes or documents containing that phrase.
By combining these Field-based queries with the query structure provided, you can perform very complex XML searching.
To find all elements named 'em' that contain the text 'finances' that are the third child of a 'para' element whose path to root explicitly contains a 'para' within a 'chap' within a 'document' element that is not currently in the French language, use the query string:
'content:finances AND tagname:em AND ancestors:"mydoc chap para" AND parent:para AND childnum:3 AND (NOT att#language:french)'
The Search Result list that comes back could then be organized by document id to group together all the results for a single XML document. This is not provided by default, but has been done with extension to this code.
The Indexing design provides support for basic post-search result processing. For a given search result (Lucene) Document, enough contextual information is stored to allow navigation to specific element locations in an XML document.
The document id (docid) is used to locate the original XML document. The tree location allows for unique identification of arbitrary XML elements/nodes in a document. For example, there can be only one element whose tree location is 0 1 2 5 6 in a single XML document. The tree location '0 1 2 5 6' indicates that the target element is the 7th child of the 6th child of the 3rd child of the 2nd child of the first element/node.
By Default, content for nodes is not 'Stored' and is therefore not retrievable from the Index after searching. This helps keep index size small. If only a few results are displayed at a time and document size is not excessively large, the treeloc and document id can be used to retrieve a few lines of content for the displayed elements. An example of this can be seen in the generic client user interface provided at the ISOGEN website[II 9].
The following figure shows how the stored values (treeloc, ancestors, and docid) relate to an XML document.
At the time this project started, there was really only one XML-Lucene alternative available, an example using the Extensible Stylesheet Language (XSL)[XS 11] and the SAX (Simple API for XML) for indexing. Since then two other attempts have been posted, one using XML Path Language (XPath)[XP 7] for indexing, and a Lucene binding that is part of an RTF to XML converter. This section briefly analyzes each alternative.
At various points during system design, critical choices are made. Technologies are selected and eliminated. Useful features are either supported or not. This section also lists several of the key choices for this project and their alternatives.
At the time of this project, there was really only one XML-Lucene alternative available, posted by Philip Ogren (ogren@mayo.edu)[LC 3]. It was a demonstration to show that it was possible to convert XML documents to Lucene documents with the Extensible Stylesheet Language (XSL) and SAX.
In a nutshell, one XML document results in creation of one Lucene document. For each element tag encountered in SAX, a field would be added to that Lucene document. The field name was the tagname and the field value was the text content of that element. The XSL transform applies store, index, and tokenize values to fields.
This solution was merely an example. Occurrences of multiple elements with the same name will be lost in the creation of the Lucene Document. It only tracks elements - no other XML information. It provides no tracking of individual element location (i.e. no treeloc or similar post-processing support). Hierarchy of elements was also not indexed, only element name and its direct text content.
Its current XSL implementation relies on specific element names so it is not a generic solution. For every document type, a new XSL transform would have to be written specific to its structure.
It serves as an interesting example, but the loss of element hierarchy, no post-processing support, lack of default DTD-generic behavior, and loss of multiple element occurrences with the same name assured this solution did not meet our needs.
Since this project's completion, an XPath alternative has been attempted. The code was posted by Peter Carlson from the Lucene Users group[LC 3].
The code takes a list of XPaths. The list of XPaths is applied to an XML document to create one Lucene document. Each XPath has an associated 'key' name. This key name is used for a Lucene field name, and therefore, searching.
For each XPath, a DOM of the target XML document is constructed. Parts of the DOM are iterated over as the XPath expression is evaluated. At each level of the expression, the content of all text nodes at that level are appended to a string. Once the expression is fully evaluated, that string contains all text content of all (DOM) levels from the levels reached from while iterating over the expression. That big string is indexed as a Lucene field identified by the key name.
The XPath is used only as a tool for indexing. This XPath approach solves a different type of problem than our XML-aware project. This solution appears to be good for indexing 'slices' of XML code of which only the text is indexed and searchable. If the content you wish to index can be specified quickly and concisely with a single XPath expression and you have no need of any non-text XML searching, this could be the right solution.
Unless the 'key' names are also the XPath expressions, you can not search the indexed content using XPath. Even then, XPath searching is only for the fixed set of XPaths used at indexing time.
If it cannot be eliminated with XPath syntax, then there is no easy way to filter out (i.e. not index) content at steps of the expression. There is no support for making separately searchable elements/parts from a singular XPath expression.
Using this approach, the only way of expressing control to the searching mechanism is in the field name. If searching by ancestry, element type, and element relationships for the XML content is desired, that information has to be specifiable through a very clever naming convention.
This approach has not been tested extensively. A DOM is constructed for each XPath. That DOM is then iterated over via XPath evaluation. As a result, for a large number of XPaths indexing can become slow (no actual numbers provided by author(s)).
This approach only indexes text content. Attributes, location and content of 'specific' elements, processing instructions, hierarchical relationships and all other XML information is lost. By default, this approach does not allow searching by entity name:value, has no post-processing tree navigation support, and is not a generic solution sufficient to meet heavy-duty XML indexing and querying needs.
A company named TetraSix has a Microsoft Word (RTF) to XML converter that supposedly has a Lucene binding. It is referenced from the Lucene Contributions page[LC 3], but no date was given to its contribution. Several attempts to download and evaluate this product from the company's website have failed. Therefore, it was not possible to evaluate its XML to Lucene binding.
XPath and XQL are both very useful technologies in their own right. There were several reasons that neither were used on this project.
First, neither XPath nor XQL were requirements. In trying to find the simplest solution without any unnecessary effort, we evaluated all alternatives. XQL was not yet fully defined, and therefore not considered.
Second, even though the primary content documents are in XML, typical end users may not be completely familiar with XML (and may not want to be).
Third, the querying syntax needed to seem familiar enough to end users that they could learn it quickly and use it intuitively. It could be argued that after users 'climbed' the learning curve for XPath, it may seem more intuitive to some users. However, it was necessary to make the query language simple enough that unskilled user groups could rely on previous experience at other types of searching when constructing queries.
The team ended up with an alternative 'home-grown' solution that was good enough to allow the level of querying required. It relies on the built-in search engine query syntax. It provides for the desired complexity in XML searching that is fairly similar to web-search syntax[GS 8]. Thus users with a basic understanding of XML can construct structured XML search queries with minimal learning and use it intuitively.
That being stated, there could be value in adding a layer of XPath or XQL querying over the existing functionality.
At the time of development there was no requirement to handle Namespaces. However, the addition of this ability should be fairly trivial. The problem is mapping Namespace prefixes to Namespace strings (e.g. URL, ...). This could be accomplished with indexing prefixes to return the strings.
The XML information must be stored in a searchable format so that queries can be constructed to search on it. This requires a means of processing arbitrary XML documents to extract the desired information.
In order to index the XML content we needed a way to process the document and a way to easily determine the relationships between elements in that document. The two choices were SAX and DOM (Document Object Model).
SAX is event-driven, and does not construct an in-memory object model for XML Documents. Using SAX would provide a faster indexing solution. However, it can be a lot of work to write a SAX document handler to interpret all the needed SAX events.
DOM gives you an in-memory object model in a standard hierarchical "tree of nodes" format. Using DOM, you can recursively iterate over the tree to extract the desired content. Given that you can afford the object creation overhead, DOM can be a quicker implementation.
DOM provided the functionality needed with the minimal amount of work. The time required for DOM processing is a fraction of the total indexing time. Using SAX would only provide a minimal speed up. DOM provides a standard object model to code against and was chosen for this project. For a more detailed comparison of SAX vs. DOM, O'Reilly's XML Website has a good list of white papers[OP 10].
Some performance numbers have been taken with the default implementation. More should be available during the accompanying presentation at the XML Europe Conference, 2002.
Indexing and searching performance was tested on two sets of documents.
The first set consisted of large flat XML documents (i.e. one DOM-level deep) of varying size. Their text content was trivial and all the same size (per element). These documents were generated for testing purposes only.
The second set consisted of actual, large XML documents. This set included the Book of Mormon from Jon Bosak's Religious Works and several Shakespearean plays also marked up and made available by Jon Bosak[JB 13]. These publicly available documents were used in their XML form without modification.
The testing computer had a Pentium III 933Mhz processor with 256Megs of PC100 RAM, and was running Microsoft Windows 2000.
Indexing Metrics were collected per-XML document.
Metrics included:
Size of document indexed
Total number of Elements (TE) per document
Total number of Nodes Indexed (NI)
Total DOM Processing Time (DPT)
Total Element Indexing Time (EIT)
Total Text Content Indexing Time (TCIT)
Total Overall Time (TOT) = DPT + EIT + TCIT + overhead time
Index Size (IS) - containing only the document indexed
All times were measured in seconds. All size units were measured in kilobytes (KB) unless otherwise indicated.
For each element, an element node and a text node are created. Thus the total number of nodes indexed was roughly 2x total number of entities. For both document sets this ratio was consistent.
For Document Set 1, the average index size was approximately 10x original document size.
For Document Set 2, the average index size was approximately 2.5x the original document size
Lucene was optimized for indexing text. Index size was therefore much more related to the number of elements and nodes indexed than text size. For the typical document set, the 2.5x growth of index size was acceptable.
If all XML documents indexed contain a large number of elements and little to no text, this could be a scalability issue. For 1GB of documents, up to a 10GB index would be required. This could be split up between multiple indexes, yet it remains a large requirement. However, this document set was a very extreme case. The typical test document entity content was one four letter string. Hence the 2.5 growth from the more realistic document set should be expected for the typical usage, and therefore shouldn't be a scalability problem.
For Document Set 1, Element Indexing took 30x longer than DOM Processing.
For Document Set 2, Element Indexing took 150x longer than DOM Processing.
A greater percentage of time was taken for DOM Processing when a single document contained a huge number of elements. However, even in the extreme case of Document Set 1, DOM Processing time was still a minimal amount of the overall time.
Text content indexing time was minimal for both data sets.
For Document Set 1, overhead time (TOT - (DPT + EIT + TCIT)) for indexing 2.32MB of files was 271.515 seconds. This was approximately 1/4th of the total time.
For Document Set 2, overhead time from the realistic document set was only 14 seconds. This was less than 1/20th of the total time.
This indicates that node-handling overhead should be optimized. The remainder of the overhead was due to metrics collection overhead and the user interface client used for testing.
The total time results were roughly linear with respect to document size & total number of nodes indexed.
Searches of arbitrary and varying complexity were performed on the data set indexes. The longest search time observed was 3 seconds.
A large portion of this time was due to the results display from the generic user interface client used for testing. For each result shown, the client performed post-search DOM processing to extract per-element text content from the original XML document.
After disabling this client feature, all subsequent searches were virtually instantaneous. None took longer than 1 second. This was sufficient to meet our needs.
Table 3.
| Doc | Size | TE | NI | DPT | EIT | TCIT | TOT | IS |
|---|---|---|---|---|---|---|---|---|
| h1.1k | 23KB | 1001 | 2001 | 0.06s | 8.40s | 0.4s | 8.92s | 275KB |
| h1.2k | 48KB | 2001 | 4001 | 0.1s | 16.21s | 0.9s | 17.26s | 552KB |
| h1.4k | 95KB | 4001 | 8001 | 0.12s | 33.75s | 0.25s | 36.87s | 275KB |
| h1.6k | 144KB | 6001 | 12001 | 0.18s | 52.55s | 0.33s | 61.40s | 1.62MB |
| h1.10k | 245KB | 10001 | 20001 | 0.94s | 102.29s | 0.60s | 137.38s | 2.71MB |
| h1.20k | 489KB | 20001 | 40001 | 20s | 193.45s | 1.53s | 340.38s | 5.45MB |
| h2.1k | 38KB | 1001 | 2001 | 0.09s | 8.69s | 0.4s | 9.23s | NA |
| h2.2k | 75KB | 2001 | 4001 | 0.11s | 18.30s | 0.14s | 19.33s | NA |
| h2.4k | 149KB | 4001 | 6001 | 0.24s | 33.12s | 0.21s | 36.71s | NA |
| h2.6k | 223KB | 6001 | 10001 | 0.32s | 53.70s | 0.33s | 61.83s | NA |
| h2.10k | 372KB | 10001 | 20001 | 1.00s | 99.96s | 0.56s | 130.41s | NA |
| h2.15k | 557KB | 15001 | 30001 | 1.52s | 148.0s | 0.80s | 210.63s | NA |
Totals for Document Set 1:
Size = 2.32 MB
TE = 81012 elements
NI = 162012 nodes
DPT = 25.49 seconds
EIT = 768.49 seconds
TCIT = 4.92 seconds
TOT = 1070.42 seconds
IS = 23MB (for all documents indexed)
Table 4.
| Doc | Size | TE | NI | DPT | EIT | TCIT | TOT | IS |
|---|---|---|---|---|---|---|---|---|
| bom | 1370KB | 6945 | 13443 | 0.7s | 64.14s | 4.78s | 80.68s | 2.18MB |
| dream | 147KB | 3361 | 6811 | 0.12s | 26.62s | 0.36s | 27.72s | NA |
| all_wl | 212KB | 5031 | 10143 | 0.18s | 40.23s | 0.50s | 41.96s | NA |
| hen_v | 228KB | 4971 | 10029 | 0.21s | 40.77s | 0.58s | 42.22s | NA |
| lear | 249KB | 5984 | 12101 | 0.24s | 48.19s | 0.58s | 50.04s | NA |
| mchdo | 198KB | 4727 | 9525 | 0.21s | 38.75s | 0.46s | 40.30s | NA |
| othello | 252KB | 6194 | 12500 | 0.37s | 50.77s | 0.57s | 52.85s | NA |
Totals for Document Set 2:
Size = 2.59 MB
TE = 37213 elements
NI = 74552 nodes
DPT = 2.03 seconds
EIT = 309.48 seconds
TCIT = 7.83 seconds
TOT = 335.77s
IS = 6.53 MB
This project provides the core indexing and searching functionality for complex XML-based querying. It has been customized with business logic and integrated with an ISOGEN document repository with good success. It has many points of extension, a clear design, a flexible, yet clear querying language, and a fast and highly customizable search engine in Lucene. XML Document parsing is performed with DOM, and this is easily readable and customizable. The entire implementation was built using open source or otherwise free technologies. Perhaps best of all, 90% of this code was completed in a few days time.
The task of customizing the Indexing and Searching implementation to specific business scenarios is not difficult. We have customized and integrated this indexing code into an ISOGEN document repository with good success. The integration dynamically maintained the indexing and was a non-trivial task. However, the XML indexing scheme allowed for rapid customization and accounted for a minimal amount of the total integration time.
One obvious customization would be DTD/Schema-specific filtering of nodes to be indexed. This would reduce both indexing and searching time and help minimize the index sizes. Another would be to index elements in mapping to their higher level data structures rather than just syntactic structures. A third would be to provide a layer of XPath querying by either pre-calculating and adding XPath fields to the documents indexed or as a post-search process.
All of the software tools and technologies used for this project are open source or otherwise free technologies. Lucene provided the base search engine. Xerces, an XML DOM/SAX Parser, provided the pre-index document parsing. The Java Programming Language provided platform independence for Lucene, Xerces, and this code. This more than meets the 'Low-Cost' goal. The only true cost in using this project is the cost of customizing it, specializing it, and scaling it to the desired production level. Most of this can be facilitated with relatively simple Java code; thus keeping costs minimal.
In the System Design section (Section 5) of this document, the core indexing scheme was discussed. The indexing of XML documents is DTD/Schema independent. Therefore, most XML documents can be indexed using the solution provided. This indexing scheme does not by default, but could be extended, to handle nearly all XML documents by adding support for Namespaces.
The current implementation provides searching for:
the entire PCDATA content of an XML document.
the PCDATA content within specific elements.
processing instructions by name and content.
attributes of elements by both name and value.
elements/PIs with specific parent element types.
elements/PIs at specific child locations within a parent element.
elements/PIs with specific ancestor element types.
elements/PIs with specifically ordered ancestor element type.
and many interesting relationships using combinations of these values, the DOM node type, and the provided query structure.
The end result is one of the most comprehensive XML searching abilities currently available, which has more than met our needs.
The search querying syntax/structure used is similar to common Internet search query structure. Users can create intuitive structured XML search queries with minimal learning and/or knowledge of XML. It provides for queries that can combine multiple complex XML-aware query parts and full-text searching of document content. The default querying allows for lower case queries with boolean (AND, OR, NOT) logic, and basic word stemming. The query syntax is highly customizable to business needs and can easily be extended to include wildcard, date ranges, and other types of queries.
The indexing scheme and querying structure combine to allow for a variety of User Interfaces showcasing the searching ability. For a basic example, see the User Interface Client available on the ISOGEN website[II 9].
The current implementation returns a Lucene 'Hits' object. The 'Hits' object contains a list of Document objects for potential matches. For each Document, there is an associated Score so results can be ranked.
The storage of tree location and document id during indexing allows for integration with authoring or other programs to open the desired post-search result document to the exact element location searched on.
This combination of features allows for development of a variety of User Interfaces showcasing search results.
The primary issue for scalability is the ratio of index size to original document size. Many Lucene Documents are created for each XML document and many of those contain several searchable Fields. For production systems, index size per document size should scale well. It is estimated that index sizes will be between 2.5 to 5 times the original document size. Also, indexing time results should scale linearly with respect to size of documents indexed. More testing must be performed on a larger scale to draw conclusive results.
There are four main components used to determine indexing speed: DOM processing time, element indexing time, text indexing time, and overhead.
DOM Processing time is a very small fraction of the total indexing time. Therefore, switching from DOM to SAX for document processing would only make an incremental improvement to total time.
Element indexing time is by far the largest portion of the indexing time. This time is the actual Lucene indexing time for all of the elements in a document. It is large due to multiple Lucene Documents being created for each XML document. Indexing each Field for each Document adds time to this number.
Text indexing time is a very small portion of the total indexing time. Lucene indexes text very quickly.
Overhead time refers to indexing-related overhead from node-handling, metrics collection, and the user interface client used for testing. Node-handling appears to be the largest portion of this time and is a prime candidate for optimization.
The current indexing performance is acceptable. Based on these results, perhaps the greatest indexing performance boost would come from filtering out unwanted elements from being indexed. This can be done on a DTD, Schema, or per-document basis.
Using the XML-aware searching defined in this document with Lucene as the underlying search engine provides for complex XML searching at minimal cost. This is not a production-ready complete system by any means. However, it serves as an effective starting point for such systems. It is built using only open-source technologies, so the cost of converting the system into a specialized production-scale one is directly proportional to the cost of developing relatively simple Java code to extend the system.
The code referred to in this paper contains an implementation of the design core in the form of XMLIndexer and XMLSearcher Java classes. It also provides a storage and retrieval manager (XMLSandRManager class) to control the lifecycles of indexers and searchers. Finally, it provides several utility classes: SearchError, IndexError, and StemmingAnalyzer used to provide filtering for index creation and searching. It is available for download on the ISOGEN website[II 9]. Also available for download is a basic User Interface client for interacting with the system.
Recommended future research areas are Namespace support and XPath/XQL integration. Currently, for a given XML document, many smaller Lucene Documents are created for indexing. This scheme has yet to be tested on a large scale (tens-of-thousands to millions of documents). It is recommended that the design be refactored to merge more Documents together to minimize the amount of work that the search engine has to go through to update/remove/search. Finally, converting the document processing from DOM to SAX will likely reduce indexing time.
It is the hope of the authors that these concepts and design will help spur interest in additional research on the subject. It is also hoped that this example will help provide XML Searching solutions to a wide variety of applications.
Thank you for reading.