Using a Java Query Analyzer to Translate Generic Query to SQL, XPath and XQuery

Keywords: XML, XQuery, XPath, SQL, Query Language, Database, Java, Search

Dazhi Jiao
Indiana University Digital Library Program
Bloomington
Indiana
United States of America
djiao@indiana.edu

Biography

Dazhi (David) Jiao is a programmer and analyst of the Digital Library Program and Library Electronic Text Resource Service (LETRS) at Indiana University. David has nearly 5 years of experiences in software and web application development with Java and XML technologies. His primary interest is in designing and developing framework and tools for developing XML-based digital library collections. David has actively involved in the design and implementation of several digital library collections at Indiana University as the lead developer. David holds a Master’s degree in Computer Science from Virginia Tech.


Abstract


As traditional relational databases and new native XML databases are maturing as viable storage solutions for XML data, application developers are faced with SQL extended to provide XML support along with new query languages, such as XPath and XQuery.

At the Indiana University Digital Library Program, various data-driven digital library collection applications are developed. The diversity of data sources and formats in those collections requires the use of different types of databases and storage systems. For instance, relational databases are used to store image collection metadata and Dublin Core metadata to support Open Archives Initiative (OAI) data harvesting. At the same time, many academic and research data are stored in document-centric XML files, using common scholarly XML vocabularies, such as TEI (Text Encoding Initiative). To preserve the structure of these XML data, they are stored in native XML databases, which can be queried using XQuery or XPath. The nature of digital library projects requires a robust yet user friendly searching interface for retrieving data from collections. Such an interface typically allows users to submit queries that can contain Boolean operators, wildcards, multiple-word phrases, groupings with parentheses, and filters that can narrow a user's search based on parameters such as date or media type.

At the Indiana University Digital Library Program, a Java Query Analyzer (JQA) was developed to simplify interaction with heterogeneous database and storage systems. JQA interprets generic user query input, containing familiar Internet search engine operators, and transforms that user input into SQL, XPath, or XQuery statements, which can then be used to query data collections stored in a variety of relational database or native XML storage systems. It offers a transparent API (Application Programming Interface) to perform such interpretations while hiding the logics of parsing and transforming from developers.

JQA has these characteristics:

1.Transparent. JQA provides developers with a generic query API that can interact with systems supporting SQL, XPath, or XQuery. JQA provides a generic connection interface to integrate with Java database connection APIs for relational databases and XML databases.

2.Adaptive. The API can be adapted to handle query text from different searching interfaces, such as a "simple" search, "advanced" search or query-driven browsing.

3.Extensible. Very often developers need to take advantage of vendor-specific extended syntax in SQL and XPath. The query analyzer can be extended to interpret user’s queries and transform those queries into vendor-specific query languages.

4.Configurable. Developers can use a simple XML file to configure JQA. The configuration file allows developers to map generic query inputs to specific query language statements, configure query operators and syntax, and define supports for internationalization.

In this paper, I will discuss in detail the functionality, limitations, and configuration of JQA, and I will demonstrate some applications that use JQA to translate user input into SQL, XPath and simple XQuery.


Table of Contents


1. Background
     1.1 Introduction to SQL, XPath and XQuery Query
     1.2 Examples of SQL, XPath and XQuery Query Statements
     1.3 Digital Libraries and Data Model
     1.4 Common Internet Search Engine Functionalities And Digital Libraries
2. Java Query Analyzer
     2.1 Examples of Query Translation
     2.2 JQA in Digital Library Applications Structure
     2.3 JQA Components
     2.4 The JQA Parser
     2.5 JQA Parser Visitors and Application Programming Interfaces
     2.6 JQA Configuration
3. Future Work
     3.1 Implementing Full Text Operators
     3.2 Completing XML schema for configuration file
     3.3 Improving JQA’s extensibility and implementing a plug-in framework
4. Conclusions
Acknowledgements
Bibliography

1. Background

Developing a digital library collection application can impose challenges to developers to integrate heterogeneous data storage methods for searching and browsing user interfaces. This paper will focus on how the problem is solved by providing the developers a programming interface to assist the translation from the user’s query into distinct database query languages.

1.1 Introduction to SQL, XPath and XQuery Query

Structured Query Language (SQL) is the standard relational database language. In 1986, the SQL standard was defined by the American National Standard Institute (ANSI), which was subsequently adopted in 1987 by the International Organization for Standardization (ISO). SQL is supported by most Relational Database Management Systems (RDBMS). SQL has syntax for performing both simple and complex queries. It is designed to be portable, that is, it is a recognized standard so that the same command structure and syntax can be used in different RDBMS. SQL supports a set of helpful functions that can return values determined by input parameters. However, in addition to a number of useful functions provided by the SQL standard, each database vendor maintains a list of their own internal functions that are outside the scope of the SQL standard. Furthermore, many database vendors support the ability to create user-defined functions. These functions impose vendor specific SQL statements to application development. SQL also enables developers to combine the results from two or more query statements by using operations. The result of each SELECT statement can be treated as a set and SQL set operations can be applied to those sets to arrive at a final result. The new ANSI standard SQL-2003 defines more multi-set operations. Although set operators differ among RDBMS vendors, the normal operations contain Union, Intersection and Difference.

XPath [XPath] originally emerged at the intersection of XLink [XLink] , which defines a method of using URLs to describe location within a document, and XSLT [XSLT] , which provides XML transformations based upon some demonstrable set of patterns. XPath provides a way of extracting pieces of the XML structure and passing them on as a sequence of nodes to other processes. XPath is powerful, but like regular expression, XPath can be difficult for non-XML specialists to write. The central construct of XPath is the UNIX filename or URI style path expression, which addresses nodes within the tree representation of an XML document with a sequence of steps separated by slash (/) characters. Both XPath 2.0 and SQL share the ability to process sets. XPath also has a range of operators and functions for computing values.

XQuery [XQuery] is a query language for XML proposed by the World Wide Web Consortium (W3C) Query Working Group. It is derived from an XML query language called Quilt, which in turn borrowed features from several other languages, such as XPath, XML-QL, SQL, OQL, Lorel, XQL and YATL. XQuery is a typed, functional language in which a query is represented as an expression. XQuery is related to other standards for XML, including XML itself, XML Namespace, XML Schema, XML Stylesheet Transformations (XSLT) and XPath. XPath is an integral part of XQuery – most XPath functions are supported in XQuery; and XPath and XQuery share the same data model. XQuery uses the abbreviated syntax of XPath in path expressions.

SQL, XPath and XQuery all process sets of data. SQL select statements return tuples, XPath and XQuery path expressions return a set of nodes, including their descendent nodes. The difference is tuples are tabular data while nodes are represented as a tree structure which captures the hierarchical relationship between nodes. With the expression, XQuery can go a step further than XPath to put the data into formats other than the original XML document. SQL, XPath and XQuery also support Boolean operators and functions, and vendors can add their own set of functions and operators to these languages. SQL and XQuery both support set operations of Union, Intersection and Difference. Although the syntax of the queries built using these languages are quite different from each other, there are similarities among their query semantics and structures.

Because of limited space, details of the syntax of these languages will not be discussed in this paper. Readers can easily find the information on the web and in books about SQL, XPath, XQuery specifications.

1.2 Examples of SQL, XPath and XQuery Query Statements

Here is a very simple database of film literature indexes. There are two tables in this database, a citation table, and a journal table. Below are some data from the two tables.

 
CID     TITLE                                        VOL   YEAR  JID    
------- -------------------------------------------- ----  ----- ---- 
101572  French nets cry foul over soccer.            347   1992  593    
24454   TV, World Cup soccer and the American way.   139   1989  448    
325775  German exhibits kick pic woes with soccer.   379   2000  593    
 
JID     JTITLE              ISSN                                  
------  ------------------  ---------  
448     The New York Times  0362-4331     
593     Variety             0042-2738  

In order to find all citations in the journal Variety that have the word soccer in the title, a SQL query could look like:

SELECT c.citationid, c.title, c.volume, c.jyear, j.jtitle 
FROM citations c, journal j 
WHERE c.jid=j.jid AND title LIKE '%soccer%' AND jtitle='Variety' 

An XML representation of the data might look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<CITATIONS>
  <CITATION id="101572">
    <TITLE>French nets cry foul over soccer.</TITLE>
    <JOURNAL ISBN="0042-2738">
      <TITLE>Variety</TITLE>
      <VOL>347</VOL>
      <YEAR>1992</YEAR>
    </JOURNAL>
  </CITATION>
  <CITATION id="24454">
    <TITLE>TV, World Cup soccer and the American way.</TITLE>
    <JOURNAL ISBN="0362-4331">
      <TITLE>The New York Times</TITLE>
      <VOL>139</VOL>
      <YEAR>1989</YEAR>
    </JOURNAL>
  </CITATION>
  <CITATION id="325775">
    <TITLE>German exhibs kick pic woes with soccer.</TITLE>
    <JOURNAL ISBN="0042-2738">
      <TITLE>Variety</TITLE>
      <VOL>379</VOL>
      <YEAR>2000</YEAR>
    </JOURNAL>
  </CITATION>
</CITATIONS>

To perform a query similar to the SQL query above, there could be several ways to write the XPath. One of these options could be:

//CITATIONS/CITATION[contains(TITLE/text(), 'soccer') 
and JOURNAL/TITLE/text()='Variety']

An XQuery statement to perform the same search could look like:

for $citation in //CITATIONS/CITATION
where contains($citation/TITLE, 'soccer') 
and $citation/JOURNAL/TITLE='Variety'
return $citation

1.3 Digital Libraries and Data Model

A Digital Library is a collection of information objects and services. Information objects can be anything that is represented in a digital format: books, journal articles, photos, audio recordings, etc. A digital library must provide a variety of services to all of its users, both humans and machines. To end users, the most important service of a digital library is to satisfy their needs in accessing the variety of information stored in the collection.

The diversity of the information objects in digital libraries determines the complexity of their data models. This paper will focus on retrieval of digital objects based on searching text-based records from digital library collections. These texts can be digitized books, articles, bibliographic data and indexes. In addition, metadata of these text objects or other multimedia objects are also text based. At the Indiana University Digital Library Program (IUDLP) [IUDLP] , different storage methods are used depending on project requirements and the formats of original data sources. Relational database systems are chosen to store data-centric data such as literature indexes and metadata of digital objects. In other cases, document-centric data are stored in their original formats such as Microsoft Word, PDF, or semi-structured XML or HTML files. Several common XML vocabularies used in digital libraries include Text Encoding Initiative (TEI) [TEI] and Encoded Archival Description (EAD) [EAD] . TEI is a standard that helps libraries, museums, publishers and individual scholars represent all kinds of literary and linguistic texts for online research and teaching. IUDLP have different collections of books, articles and poems that are encoded in TEI. EAD is a standard for encoding archival finding aids. Some metadata are also stored in structured XML formats, such as METS and MarcXML. The Metadata Encoding & Transmission Standards (METS) [METS] schema is a standard for encoding descriptive, administrative and structural metadata regarding objects within digital libraries. MarcXML [MarcXML] is a framework initiated and developed by the Library of Congress for working with traditional library MARC records in XML environments.

1.4 Common Internet Search Engine Functionalities And Digital Libraries

Nowadays, the Internet has become the most important source for information to many people in their daily lives. Popular search engines, such as Google [Google] , have even become a primary access to information. Some search engines search only the titles and descriptions of sites, yet more advanced engines like Google perform full text search. It uses computerized “spiders” to index billions of pages so it searches not only the title but also the content of web pages. A common characteristic among these search engines is their support for usage of advanced operators in user queries. For example, both Yahoo! [Yahoo] and Google support Boolean operators and both have established the Boolean operator AND as the default. In user queries, search engines usually have the functionality to recognize and ignore stop words, which are common words such as “to”, “be”, etc. Some search engines also support usage of Field Operators, which allow users to search within a specific field, to refine searches for more precise results. For example, a Google search on intitle:”economics” restricts the search on economics to the title of web pages. Most search engines provide users two types of search interfaces: simple search and advanced search. More often, users visit the advanced search page for these advanced features, which is typically designed to guide users in formulating more complex queries. Different search engines support these common functionalities but can vary in the details. For example, they all have a different list of fields and stop words. Besides, search engines also have special features unique to them.

The most important role that digital libraries must play for end users is to provide robust and intuitive access to information objects. The two common access methods in many Digital Library collections are browsing and searching. For some of the digital collections, browsing is also driven by searches, namely, when users request an item through the browsing interface, the application actually sends a query to the server and the results are created on the fly. Usability studies indicate that providing assistance to users in query formulation and execution are critical to the popularity of a digital collection.

At IUDLP, during the design and implementation of the searching functionalities for digital library collections at IUDLP, a common requirement is to make the application interfaces work like popular search engines, such as Google.

This is a great challenge in digital library application development. The data model of digital library collections is very different to the one used in search engines. Search engines like Google use computer programs to index web pages and the search is executed on the indexes. Other formats of data linked from these web pages, such as Microsoft Word, PDF documents and images, are also indexed and searched. All these data are originally from a single source: the Internet. In digital libraries, complicated data models are very common and data storage systems vary among applications. At IUDLP, those data models include metadata, bibliographic data, digitized text marked up in XML, and more, as discussed in previous sections. Different types of storage mechanisms, such as relational databases, xml databases and file systems are used to meet respective project requirements. When using the relational databases, due to special requirements of particular collections, various types of database functionalities are used. For example, in order to implement full text search for some projects, Oracle Text technologies is used to support indexing, searching and text analysis for text stored in Oracle databases. Oracle Text supports a set of vendor specific operators and functions to embed in SQL statements. For the storage of XML documents, different types of XML databases are being evaluated and tested. Although XPath is used to perform most queries in such databases, we are aware of XQuery becoming a standard in near future. XQuery also could provide more functionalities and full text support than XPath. With such a heterogeneous and changing storage system in the back end, a component that can translate a user’s query from the interface to different query languages could highly improve the implementation of digital library applications with a similar user interface to search engines.

2. Java Query Analyzer

The Java Query Analyzer (JQA) is a Java-based software component developed by the IUDLP to assist project development by moving query language statements out of programming source code, and by offering a simple application programming interface (API) for the translation of user queries from application user interfaces to different query languages. JQA hides the logic of a user’s query translation from developers by providing a generic API independent of a query language. JQA accepts easy query syntax that can be entered directly by end users from user interfaces, or built by server with simple programming logic. This query syntax is similar to the syntax of common search engines.

2.1 Examples of Query Translation

As discussed earlier, there are similarities between query languages’ syntax and structures. In this section we will discuss in detail how user’s query can be translated to search criteria. Criteria are restrictions placed on a query to identify the specific fields or records we want to work with. In SQL and XQuery, criteria can be identified by the WHERE keyword in a query statement. Criteria are harder to identify in XPath expressions and are often embedded in brackets. A common feature of criteria in these query languages is that they can be combined with operators.

Here are the examples used in the first section. Now they can be revisited to identify and discuss usage of criteria: The criteria in SQL and XQuery are in the lines starting with the WHERE keyword; the criteria in the XPath are enclosed in the brackets.

SQL:

SELECT c.citationid, c.title, c.volume, c.jyear, j.jtitle 
FROM citations c, journal j 
WHERE c.jid=j.jid AND title LIKE '%soccer%' AND jtitle='Variety'

XPath:

//CITATIONS/CITATION[contains(TITLE/text(), 'soccer') 
and JOURNAL/TITLE/text()='Variety']

XQuery:

for $citation in //CITATIONS/CITATION
where contains($citation/TITLE, 'soccer') 
and $citation/JOURNAL/TITLE='Variety'
return $citation

In all examples, the operator AND is used to combine different criteria. It is clear that all three query statements have very similar criteria sections for this purpose. The SQL statement is slightly different for having one criterion c.jid=j.jid to associate two tables using foreign key and primary key. These queries will retrieve citations that CONTAIN soccer in their titles and occur in journals with the EXACT title Variety . Compared in this way, these queries are similar in constructing criteria with Boolean operators and functions.

We can envision a simple user interface (see figure 1) which users can use to search \ the sample database and XML document with the above queries:

jqa_final-fig0.png

Figure 1: An Advanced Search Form

From this interface, the application must translate what users entered as in the figure above, into one of the three query statements depending on the databases system used in the back end.

With the field operators, as mentioned in the discussion of Internet search engine functionalities, users who are very familiar with the system could have an option to type in some query even in the simple search interface for the same purpose. In the simple search interface such as an example as below (see figure 2), there is usually only one text box for user to input a query.

jqa_final-fig1.png

Figure 2: A Simple Search Form

The query entered by the user in the figure is:

citation:soccer AND journal:Variety

The field operators here imply slightly different concepts than the ones in search engines. A field here maps to a particular field or index in relational database tables, or a node in XML files. The field operators in search engines map to indexes on part of the web pages, such as title, body text, or they can be mapped to metadata of web pages, such as URL, dates of page indexed, and file types. Several field operators in Google can perform special functionalities other than searching, for example, the operator define: is used to retrieve a list of definitions.

If AND is defined as the default Boolean operator between search criteria, like in Google and Yahoo!, the query can be somewhat simpler:

citation:soccer journal:Variety

Although users can enter a query text such as [1] and [2], a third query would make more sense and appear frequently in simple search forms. Assume there is a database system with an index called KEYWORD, and the keyword field (which maps to this index) is the default field:

French soccer

This query can be translated to another SQL query. Given the assumption that a product like Oracle Text [Oracle_Text] is used to support full-text indexing and searching, the query could look like:

SELECT c.citationid, c.title, c.volume, c.jyear, j.jtitle 
FROM citations c, journal j 
WHERE contains(KEYWORD, 'French') > 0 
AND contains(KEYWORD, 'soccer') > 0

In this SQL query, contains is a vendor specific function defined in Oracle Text. It performs similar full text searching functionalities with the contains function used in XQuery and XPath previously.

When users execute searches in digital library applications, most of the time, the retrieved data model, namely, the information objects users are searching for, is predefined. Users expect certain representation of the objects in the interface to display the results. In such systems, the query entered by users actually sets only the searching criteria, like the examples given above. The criteria given by users tell the application how to retrieve the information, while the application defines what to fetch and where to get them. In another word, all the application needs to fit in after receiving a user’s query is the search criteria.

Therefore, digital library applications need to implement a component to translate a user’s query into the query language criteria listed above in SQL, XPath or XQuery. The component can further embed the criteria into a query statement. From the advanced interface, it is very easy for the application to build such a query string like [1] or [2], than to build a SQL, XPath or XQuery statement. There could be a single solution in an application to translate queries from both the simple and advanced search interfaces into query language statements. JQA is the component developed to provide such a solution.

2.2 JQA in Digital Library Applications Structure

JQA can act as a component in the server, between the user interfaces and the data connections. The server passes the query entered directly by the user from the simple search interface or from the advanced search interface, or sent from the browsing interface in cases of search-driven browsing, to JQA, which will translate the query into the query language statements specified by developers in the configuration. The server then uses this returned (or maybe processed) query statement in the data connection layer to obtain results from the database.

Even though it has not been tested, JQA can also be expected work well in a heterogeneous system with multiple databases of different kinds, as a result of its ability to translate a user’s query into different query languages with the same programming interface. JQA has also been proven to work with Object/Relational mapping tools, such as ObjectRelationalBridge [OJB] .

JQA provides a simple API for developers to interact with systems supporting SQL, XPath, or XQuery transparently. Developers will only need to change the configuration file after changes are made to the back end databases or to the data connection layer. JQA allows the developers to decide whether the whole query or simply the query criteria need sto be generated, and thus becomes more flexible to the database connection layer. For example, one digital collection project developed at IUDLP uses an Object/Relational mapping tool, ObjectRelationalBridge (OJB), which allows transparent persistence for Java Objects against relational databases. OJB encapsulates the mappings from relational tables to Java Objects but has a Criteria API for developers to specify criteria of database queries. JQA is able to work closely with the OJB API by passing only the criteria to OJB, and thus maintaining the structure of the Object/Relational mappings of OJB.

The JQA API can be easily adapted to handle query text passed from search-driven browse interfaces. For digital library collections, browsing is very important because it gives users additional access points to the collection as well as facilitates exploration of the website. As we discussed earlier, an application can construct a query string in the form of query [1] and [2] from the advanced search form. This built query string or what the user entered from the spiel search form is then passed to JQA. For server driven browses, a query in the JQA syntax can be associated with each browse item. The application sends the associated query to JQA when a user asks to browse the item. For example, on a web page for users to browse citations of sports, the URL linked to the subject soccer could have the query citation:soccer as a parameter. When a user clicks on the subject soccer , this query will be sent to JQA by the server.

Very often developers need to take advantage of vendor-specific extended syntax in SQL, XPath and XQuery. Sometimes applications have to support additional functionalities such as thesaurus support. The JQA API can be extended to implement these functionalities with usage of other libraries or database supports. For example, in one image collection at IUDLP, JQA was extended to integrate the support of a controlled vocabulary to improve search and browse functionality.

Developers can use an XML file to configure JQA. The configuration file allows developers to map generic query inputs to specific query language criteria statements, configure vendor specific behavior of databases, such as operators, stop words, etc., and escape characters and words that could interfere with query processing.

JQA is very performant. For a query of ten words, the translation takes less the 0.05 seconds, if only default functionalities are used. JQA with extended functionalities, such as thesaurus support, the time bottleneck is in processing these additional functionalities. If there is complicated logic to be performed, it could affect the performance dramatically.

2.3 JQA Components

JQA consists of three components, with different functionalities: a Parser, Parser Visitors and a Configurator. The key component in JQA is the Parser, which parses the user’s query using a lexer (lexical analyzer), and creates an Abstract Syntax Tree (AST) out of the query. An AST is a tree structure created in the translation of an input string, in which each node represents an operator and the children of the node represent the operands. In JQA, nodes can be query clauses, operators and field-term pairs. Parser Visitors traverse each node in the tree, identify nodes representing field-term pairs, map those nodes to criteria segments with the help of field handlers, and finally build query statements by using Boolean operator to reconstruct the structure of the query with these criteria segments. The visitors and handlers use JQA configuration during this whole process for information about vendor specific query language syntax, and mapping of terms with a field to criteria segments in query languages. For quick access time, the configuration needs to be loaded into the memory by the Configurator and kept in memory during the lifetime of the application.

jqa_final-fig2.png

Figure 3: Illustration of JQA Processing User Query

This graph indicates the relationships between these three components and how they work together to translate a user’s query into SQL, XPath or XQuery statements. First, the Parser constructs an AST tree by analyzing the query citation:soccer AND journal:Variety . For simplicity, the AST tree in this graph is not the real tree built by JQA parser but it reflects its basic structure. The Parser Visitor uses the nodes of field/term pair in this tree and finds the corresponding field/criterion in the configuration for same fields. In the field/criterion mappings in this graph, the question mark (?) symbol represents an anchor in the criterion. For each field, the Parser Visitor uses a field handler class to replace the anchor with the term from the AST tree node. The Parser Visitor then reconstructs the whole criteria with the Boolean operator, and if the whole query is specified in the configuration, it will construct the whole query with the criteria buried in the statement.

2.4 The JQA Parser

The JQA parser was developed using an open source parser generator called Java Compiler Compiler (JavaCC) [JavaCC] . JavaCC is one of the most popular parser generators for use with Java applications. JavaCC reads a grammar specification and converts it to Java programs that recognizes matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building. The tool for the tree building is called JJTree, which is included with JavaCC. JQA defines a grammar specification, and uses JavaCC commands to generate source code for JQA parser automatically. The disadvantage of using JavaCC is the loss of control to the source code. Minor changes could be made to the files generated by JavaCC, but complicated logic makes it difficult to add or change the source code directly once they are generated.

The JQA Parser breaks a user’s query into a list of single-word terms or phrases, and operators. The parser recognizes phrases by double quotes surrounding a group of words. Multiple terms can be combined together with Boolean operators to form a more complex query. Each term or phrase can be associated with a field by specifying a field operator, similar to Internet search engines discussed earlier. If the field is not specified in the query, the parser will assign a default field to the term. The default field is configurable.

A term can be in different formats. It can be regular text, wild carded, a NULL value, or in the formats of ranges. A phrase can also be considered as a special format of a term. JQA supports the mapping of single and multiple character wildcard to wildcard symbols used in query languages. If no wildcard is specified in the configuration file for the query language, JQA passes the wildcard symbol as a regular character in the term, or even escape it if necessary. JQA supports searching fields whose values are between the lower and upper bound specified by special range syntax. Range queries can be inclusive or exclusive of the upper and lower bounds. For example, a search for year:[1928-1936] will search the year fields and pull results that contain the years 1928 to 1936, including 1928 and 1936, and a search for year:{1928-1936} will search only years between 1928 and 1936, without these two years. Special characters in phrases are escaped automatically, which means, a search of year:”{1928-1936}” will not be considered as an exclusive range search, but a phrase term {1928-1936} .

Currently JQA supports three Boolean operators: AND, OR, NOT, and their programming equivalents, &, |, !. In JQA, the NOT operator is considered to be a binary operator instead of a unary operator. Namely, the NOT operator must have a term before and after it. This decision was made upon the assumption that most end users do not understand the difference between “AND NOT” and “OR NOT”. Therefore, rather than confusing users, NOT will act as a binary operator, which means “AND NOT”. In XQuery and XPath, there is no such Boolean operator as NOT, but there is a function fn:not() . Therefore, “AND NOT” can be achieved by using and not() in these two languages. For example, in order to find all citations in the example XML file that has the word soccer in its title but not from the journal Variety , the XPath query can be modified to:

//CITATIONS/CITATION[contains(TITLE/text(), 'soccer') 
and not(JOURNAL/TITLE/text()='Variety')]

Another reason of having NOT as a binary operator is because Boolean operations in JQA can also be mapped to binary set operations in query languages. This will be discussed later.

JQA also supports using parentheses to group clauses to form sub queries. Grouping can be helpful in two cases: it can control the Boolean logic in a query thereby eliminating confusion for the users, and it can be assigned to sub queries in a field, which allows end users to use Boolean operators in advanced search forms. For example, in the advanced search form in figure 1, users can also search for citations that have both French and soccer in the title and from the journal Variety. With field grouping, when a user enters French soccer in the citation title box, and Variety in the journal title box in an interface like figure 1, the application can simply build a string of citation:(French soccer) journal:Variety , and pass this text string to JQA. The sub query French soccer will be performed on the field citation . The details of how to map the sub query of a field to query language criteria will be discussed in JQA configuration section later.

2.5 JQA Parser Visitors and Application Programming Interfaces

Besides source code for the parse tree structure, JJTree also generates a ParserVisitor interface. The ParserVisitor interface, which uses the visitor design pattern [Visitor_Pattern] , defines a set of methods for operations on different nodes in the AST tree. Implementation of the ParserVisitor interface encapsulates these operations and separates the logic of handling nodes from the parser. JQA includes several implementations of the ParserVisitor interface for different purposes. The TreeDumpVisitor class prints the tree structure into a console, the QueryDumpVisitor returns the original query string, and the TermSwitchVisitor can modify terms of given fields. The QueryBuildVisitor class supports the building of query statements from the AST tree. Developers can implement their own Parser Visitor to meet special project demands.

Simplicity of the JQA API enables developers to plug JQA supports into applications with only a few lines of code. Below is a code snippet to illustrate the usage of JQA API in an application.

String query = "citation:soccer journal:Variety";             [1]
Parser parser = null;                                         [2]
ParserVisitor visitor = null;                                 [3]
try {                                                         [4]
    parser = Parser.newParser(query);                         [5]
    ASTStart startNode = parser.Start();                      [6]
    visitor = new BuildQueryVisitor("citation");              [7]
    statement = (String)startNode.acceptVisitor(visitor);     [8]
} catch (Exception ex) {                                      [9]
    // Exception handling                                     [10]
}                                                             [11]

In lines 5 and 6, JQA parses the query text and builds the AST tree. A Parser can be created by calling the newParser method of the Parser class. The ASTStart class is the root node in the AST tree generated by JJTree. Line 7 creates a new BuildQueryVisitor instance, with the type of “citation”. The type “citation” tells the visitor to load the configuration of type “citation”. In line 8, the ASTStart node accepts the visitor and returns the query statement.

Several types of exceptions could be thrown during JQA’s interpretation of the user’s query. When the parser parses the query, a ParseException could be thrown if there are any syntax errors, such as unclosed parentheses and quotation marks, or misusage of special characters. The ParserVisitor class could throw ConfigurationException and HandlerException . The ConfigurationException is caused by errors in the configuration file, and HandlerException is thrown by JQA field handlers. Field handlers, which are implementations of the Handler interface in JQA, encapsulate the logic of how different types of terms of a field, such as wild carded terms, regular text terms or phrases, should be mapped to criteria segments in a query language. The Handler implementation can also include complex logic depending on the particular requirements in application. For example, in one image collection at Indiana University, a controlled vocabulary was used to improve search and browse functionality [Cushman] , and a thesaurus is included for extending query terms. In this application, the same field handlers retrieve narrower terms for a term entered (e.g., Animals – Camels, Dogs, Cats, etc.) in a field from the thesaurus, and thus expand the results for more complete results.

2.6 JQA Configuration

Developers can use an XML configuration file to specify JQA’s behavior and customize functionalities.

JQA configuration file consists of two parts. The first part configures the database and some characteristics of the language specific Boolean operators, keywords and syntax. The second part defines mappings of fields to query language criteria segments in the configuration file. Developers can define different types of mappings for the same field so that distinctive query statements can be generated under different circumstances.

As discussed earlier, Boolean operators differ slightly from each other between SQL and XPath/XQuery. The NOT operator in SQL is a single operator, yet in XPath and XQuery, NOT operation is accomplished by calling a function. Developers can also map JQA to set operations (Union, Intersect and Difference), if criteria are built by using sub queries in SQL, XPath and XQuery. For example, the query example title:(French soccer) journal:Variety be translated into statements with sub queries:

SQL:

The highlighted criteria in this SQL statement use the Intersection operation on two sub queries result sets. Although it might not make much sense in this example, this option makes JQA more flexible and adaptive.

Similar operations can be performed with XQuery:

for $citation in 
(
  (for $cit in //CITATIONS/CITATION
  where contains($cit/TITLE, 'soccer')
  return $cit)
  intersect
  (for $cit in //CITATIONS/CITATION
  where contains($cit/TITLE, 'French')
  return $cit
  )
)
where $citation/JOURNAL/TITLE/text()='Variety'
return $citation

Whether to use sub queries depends on the requirements and environment of particular applications, such as database performance and capabilities and complexity of the databases. Discussing these particularities is beyond the scope of this paper.

JQA by default escapes characters and words defined in the JavaCC grammar. The list of characters include: ( ) { } [ ] \ * ? : etc,. Developers can also specify additional escaping words and characters to meet the requirements of the database. For example, in SQL, % is a special character that is used for truncation searches. JQA has to escape such characters in order to make the SQL query semantically correct. Sometimes, special reserved words and characters can cause the query to fail. For example, in Oracle Text, the character single quote (‘) is a reserved character. Without escaping it, the query statement will be rejected by Oracle and thus create errors when processing the user’s query.

Developers can include a list of stop words JQA should ignore when performing the translation. When indexes on database columns or XML nodes are created, commonly used words can be omitted to increase performance and save indexing space. These words need to be ignored during searches to ensure results. By building a list in the configuration file, JQA will omit these words in a query automatically and use other words in the query to build the criteria. JQA keeps a list of these words, which the developers can use to send feedback to users that those words will not be searched.

Digital library collections could have multiple interfaces searching for different information. For example, the interface for the example database can have two types of interfaces: searching for citations, and searching for journals. The query statements will be very different. Developers can specify distinct types of queries in one application for different purposes.

3. Future Work

3.1 Implementing Full Text Operators

JQA grammar specifications need to be extended to support advanced full-text search operators. Full-Text Search is a large field that covers a vast array of functionalities. For example, proximity search, which enables users to focus a search to find terms near each other, can be accomplished by adding a NEAR operator. Implementing additional operators will require changes to the parser and the configuration. Besides, usage of these operators could be contingent upon the full text support of the back end databases. The standard SQL does not have full-text functions, but many relational database vendors have defined extensions for full-text support. Even though XPath and XQuery have specific requirements for full-text searches [XQuery_Full_Text] , most vendors only have implemented very simple full-text functionalities due to the early stage in XML database development. Therefore, implementation of the full-text search operators in JQA syntax needs to make those operators flexible, so that developers can choose whether to provide such functionalities to end users.

3.2 Completing XML schema for configuration file

An XML schema is being implemented for the configuration file. There could be several advantages of having a schema as an option to developers when they are creating the configuration file. The schema can be used to validate the configuration file and prevent errors caused by mistakes in configuration file.

3.3 Improving JQA’s extensibility and implementing a plug-in framework

The ParserVisitor and Handler interfaces make it possible for developers to implement customizable functionalities upon JQA framework. These interfaces can be improved further to provide a plug-in framework that developers can easily integrate with other tools within JQA. Once such framework is completed, some plug-ins that can provide important support to digital library applications can be included in JQA distribution.

4. Conclusions

The Indiana University Digital Library Program (IUDLP) developed a Java Query Analyzer (JQA) to simplify the development of systems obligated to provide user interfaces with common Internet search engine functionalities while accommodating heterogeneous database and storage systems. JQA interprets the generic user query, containing familiar Internet search engine operators into SQL, XPath, or XQuery statements, which can then be used to query data collections stored in a variety of relational databases or XML storage systems. It offers a transparent Application Programming Interface (API) to perform such interpretations while hiding the logic of parsing and transforming from developers.

Acknowledgements

I would like to express my gratitude to my colleagues John Walsh, Michelle Dalmau, Tamara Cameron and Ying Jin, all from the Indiana University Digital Library Program, who have provided valuable advice to the development of JQA, and offered great help in the preparation of this paper.

Bibliography

[XPath]
http://www.w3.org/TR/xpath
http://www.w3.org/TR/2001/REC-xlink-20010627/
[XSLT]
http://www.w3.org/TR/xslt
[XQuery]
http://www.w3.org/TR/xquery/
[IUDLP]
http://www.dlib.indiana.edu
[TEI]
http://www.tei-c.org
[EAD]
http://www.loc.gov/ead/
[METS]
http://www.loc.gov/standards/mets/
[MarcXML]
http://www.loc.gov/standards/marcxml/
[Google]
http://www.google.com
[Yahoo]
http://www.yahoo.com
[Oracle_Text]
http://www.oracle.com/technology/products/text/index.html
[OJB]
http://db.apache.org/ojb/
[JavaCC]
https://javacc.dev.java.net/
[Visitor_Pattern]
The Visitor Design Pattern http://exciton.cs.oberlin.edu/javaresources/DesignPatterns/VisitorPattern.htm
[Cushman]
Charles W. Cushman Photograph Collection http://www.dlib.indiana.edu/collections/cushman/
[XQuery_Full_Text]
http://www.w3.org/TR/xquery-full-text-requirements/

XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.