Keywords: XML, query, XQuery, SQL, group by, distinct, outer join
Biography
Vinayak Borkar is a Staff Software Engineer at BEA Systems, Inc. He is a lead developer for the BEA XQuery engine, a key component of BEA’s Liquid Data for WebLogic and WebLogic Integration products. Vinayak joined BEA when his previous company was acquired. His expertise lies in the areas of query optimization and distributed query processing.
Biography
Michael Carey is a Technical Director at BEA Systems, Inc. He manages the BEA XQuery engine team and is the architect for the Liquid Data product. Mike was previously a Professor of Computer Science at the University of Wisconsin and later an IBM Research Staff Member. His expertise lies in databases, especially DBMS internals and performance.
XQuery is the W3C’s emerging language standard for querying and transforming XML. XQuery is a powerful, flexible language designed to query the many kinds of structured and unstructured data that XML can represent. Despite its power, certain familiar SQL query operations, such as grouping, duplicate elimination, and outer-joins, are either difficult or impossible to express “reasonably” in XQuery. These primitives are important for data-oriented applications of XML, particularly applications that have a need for reporting (e.g., for OLAP and statistical querying).
This paper presents a small set of XQuery extensions to enable grouping, duplicate elimination, and outer-join queries all to be expressed neatly within the XQuery language. The proposal does minimal “damage” to the XQuery standard; it generalizes the current FLWOR expression syntax of XQuery and requires no changes to the underlying XQuery data model. The extensions are slated to appear in the next major revision of the BEA XQuery engine and its encompassing products.
1. Introduction
2. XQuery: A Query Language for XML
2.1 Why Not SQL?
2.2 XQuery 101
3. Making XQuery More Data-Friendly
3.1 Grouping
3.2 Duplicate Elimination
3.3 Outer Joins
3.4 FLWGDOR: The Generalized FLWOR Expression
4. Summary
Acknowledgements
Bibliography
In this paper, we motivate and propose a modest set of extensions to the emerging W3C XML Query Language standard, XQuery, to address an important class of use cases that arise frequently when applying XQuery in data-centric, query-oriented applications. While XQuery is quite well-designed in terms of providing powerful facilities for manipulating both unstructured and structured XML data, its current design is somewhat biased towards queryingunstructured data (e.g., querying documents) and transformingstructured data (e.g., transforming business messages). As we will discuss, XQuery thus comes up short when it comes to querying structured data – where SQL has traditionally shined, at least for “flat” data – due to the lack of explicit support in XQuery for certain commonly needed data-oriented capabilities. In particular, writing queries that need the equivalent of SQL’s GROUP BY, SELECT DISTINCT, and OUTER JOIN features is awkward (at best) in XQuery. Queries of this sort arise often when using XQuery as a language for data integration, such as in the BEA Liquid Data for WebLogic product, and we are extending XQuery to address BEA customer requirements as a result.
The remainder of this paper is organized as follows. First, to level the playing field for readers, we provide a very brief overview of XQuery. We focus mainly on the data-querying features of XQuery, most notably the FLWOR expression, the heart of XQuery for data-centric applications. We then propose three extensions – one for grouping, one for duplicate elimination, and one for outer-join handling – showing how each would be done today in XQuery, discussing the associated difficulties, and then showing how our extensions address these difficulties. Our extensions are intended to be “small” in the sense of requiring only modest changes to the current XQuery language design. No changes to the XQuery data model are needed, and minimal FLWOR expression extensions suffice to meet the desired goals. We close this paper with a brief discussion of the current status of these XQuery extensions.
The development of the SQL language for querying and manipulating relational data was a major force in ushering in the database age in the late 1970s and early 1980s. In late 1999, the W3C XML Query working group set out to design a similarly high-level and declarative query language for XML data. XQuery, the result of that effort, is nearing completion and is slated to become an official W3C Recommendation “real soon now” (hopefully early in 2005).
It is natural to ask why SQL, or a derivative of SQL, wasn’t the right solution to the problem of querying XML. The short answer is that there are just too many differences between XML data and relational data to make SQL a good candidate for this task:
As noted above, the W3C XML Query working group has been designing a new query language tailored to the unique needs of manipulating XML data. The result of that work is the language now called XQuery [3]. Although XQuery is a work in progress, it is finally nearing completion at the time of this writing.
XQuery is based on an underlying model of XML data called the XQuery data model. Just as the relational model laid the foundation for SQL, the XQuery data model lays the foundation for XQuery. Because XML data is naturally ordered, the XQuery data model is based on the notion of ordered trees. Central to the data model is the notion of a sequence. XML queries consume and produce sequences that consist of atomic values (based on the primitive types of XML Schema) and/or of XML nodes (element, attribute, text, document, and so on).
XQuery is a functional, side-effect-free language. Like many other functional languages, a program (which is a query in the case of XQuery) consists of a prologue and a body, where the body is an expression. The result of a given query is the result of evaluating its body expression in the environment defined by its prologue. Expressions in XQuery can be simple expressions like primitive constants (e.g., “John Doe” or 1.3), variables, arithmetic expressions, function calls, or path expressions (familiar to users of XPath). They can also be combined to form more interesting expressions via operators, functions, and syntactic constructs that include FLWOR expressions (the heart of XQuery for the data-oriented purposes of this paper), typeswitch expressions, and node constructors.
The XQuery language is rich enough to support navigation within an XML input document, the combining of data from multiple XML inputs, and the generation of new XML structures from one or more XML inputs. To generate new XML structures, XQuery takes an approach similar to Java Server Pages: A subset of the XML syntax itself is part of the XQuery language, enriched with XQuery expressions that are executed dynamically and replaced inside the XML structures with their results. One can toggle back and forth between literal XML and query expressions via curly braces.
From the standpoint of data-oriented applications, the most important expression in XQuery is the FLWOR (pronounced “flower”) expression, which is roughly analogous to SELECT-FROM-WHERE-ORDER BY queries in SQL. The components of a FLWOR expression are:
for
clause that generates one or more value sequences, binding their values to one or more query variables. The for clause in XQuery plays a role similar to the FROM clause in SQL.let
clause that binds a temporary variable to the result of a query expression. The XQuery let clause is similar to support for temporary views (i.e., the WITH clause) in some dialects of SQL.where
clause that contains Boolean predicates that restrict the FOR clause’s variable bindings. The where clause in XQuery serves the same purpose as the WHERE clause in SQL.order by
clause that contains a list of expressions that dictate the order of the FLWOR expression’s XML output items. XQuery’s order by clause is directly analogous to SQL’s ORDER BY clause.return
clause that specifies the query’s desired XML output. The XQuery return clause is analogous to the SELECT clause in SQL, but the structures that it can specify are much richer than those expressible in SQL. (For example, this is where XQuery’s JSP-like XML node construction syntax is heavily used.)In general, an XQuery FLWOR expression can consist of any number of for and/or let clauses followed by one optional where clause, one optional order by clause, and then a mandatory return clause. In other words, using regular expression notation and first-letter abbreviations for the clauses, the grammar for a FLWOR expression is: (F|L)+W?O?R.
For data handling, XQuery has much of the richness of SQL and more – it includes support for subqueries, union, intersection, difference, aggregate functions, sorting, existential and universal quantification, conditional expressions, user-defined functions, and static and dynamic typing, in addition to various constructs to support document manipulation (e.g., query primitives for order-related operations). The single biggest thing that XQuery lacks relative to SQL is support for updates; XQuery 1.0 is strictly a functional data access language, with update support being targeted for a later revision of the standard. In addition, as addressed in the remainder of the paper, it is also lacking relative to SQL’s support for certain common data-centric query operations.
As mentioned in the introduction, XQuery comes up short in several respects when it comes to querying structured data. In particular, writing queries that involve the equivalent of SQL’s GROUP BY, SELECT DISTINCT, and OUTER JOIN features is very awkward, as we will show, in XQuery as currently defined [XQuery 1.0]. Queries of this kind arise frequently when using XQuery as a language for accessing and/or integrating data like we do in BEA’s Liquid Data for WebLogic product. We now illustrate the shortcomings of XQuery in this regard and describe the syntax and the semantics of our proposed extensions.
To illustrate both the current state of XQuery as well as our extensions, it will help to have some sample data to query. We will assume the availability of a vendor-provided XQuery function called CUSTOMERS() that returns a sequence of CUSTOMER elements, perhaps drawn from a database, where each such element has the form illustrated in Example 1. The associated schema, not shown, should be evident.
<CUSTOMER> <CUSTOMER_ID>001</CUSTOMER_ID> <FIRST_NAME>John</FIRST_NAME> <LAST_NAME>Smith</LAST_NAME> <AGE>30</AGE> <STATE>MA</STATE> <COUNTRY>USA</COUNTRY> </CUSTOMER> |
Example 1: Simple XML customer data.
Applications that do any sort of simple data analysis frequently need to group and aggregate data – common examples include finding minimum and maximum salaries by department, total sales by region, and so on. Such queries are problematic in XQuery today. To illustrate, suppose we wanted to find the average ages of our customers by state and country in order to create a simple report. The query pattern recommended by the XQuery Working Group for this sort of use case, as detailed in Appendix G.5 of the XQuery 1.0 language specification [XQuery 1.0], is shown in Example 2; we also show the corresponding SQL query that would accomplish the same task if we had a relational database containing a CUSTOMERS table.
XQuery 1.0:
for $c in distinct-values(CUSTOMERS()/COUNTRY)
for $s in distinct-values(CUSTOMERS()/STATE)
let $cs-group := CUSTOMERS()[COUNTRY eq $c and STATE eq $s]
where exists($cs-group)
return
<avg-age country=”{$c}” state=”{$s}”>
{
fn:avg($cs-group/AGE)
}
</avg-age>
SQL:
select c.COUNTRY, c.STATE, avg(c.AGE)
from CUSTOMERS
group by c.COUNTRY, c.STATE
|
Example 2: XQuery vs. SQL for counting customers by state and country.
It should be clear that this use case is much more readily addressed in SQL than in XQuery 1.0. In the case of SQL, which has built-in support for this sort of grouping, one simply selects the country, state, and average age grouped by country and state. In XQuery, the thinking required to formulate the query above is far less direct; moreover, it is more difficult for an XQuery processor to “see” what really needs to be done (in terms of efficient query evaluation). As written, the XQuery 1.0 query in Example 2 can be viewed as iterating over all possible combinations of the distinct country values and the distinct state values found in the underlying data. For each such country/state pair, the query computes the sequence ($cs-group) of those customers whose country and state values match those of the pair. If the resulting sequence is non-empty (which it may not be, since not all pairs will actually occur in the data), then the query constructs a result element called <avg-age> containing the corresponding country, state, and average age values. Simple, no? No.
In addition to being rather convoluted, the XQuery query in Example 2 is also not quite equivalent to the SQL query. Suppose that, since not all countries are divided into states, our XML data is based on a schema that permits the STATE element to be missing in a CUSTOMER element. In the case of SQL, this means that STATE is allowed to be null. In the case of the SQL query in Example 2, customers without states will still be grouped into groups based on their COUNTRY values (with a null value for STATE in the output for such a group). In contrast, the XQuery query in Example 2 has a problem – such groups will simply be omitted, as the variable $s only iterates over non-empty values – and as a result, stateless countries will be erroneously excluded from the report. Example 3 shows how this query can be reformulated to handle customers with missing STATE elements by separately grouping them. Clearly, SQL fans will not be pleased when trying to handle such reporting use cases in XQuery today.
for $c in distinct-values(CUSTOMERS()/COUNTRY)
return
(
for $s in distinct-values(CUSTOMERS()/STATE)
let $cs-group := CUSTOMERS()[COUNTRY eq $c and STATE eq $s]
where exists($cs-group)
return
<avg-age country=”{$c}” state=”{$s}”>
{
fn:avg($cs-group/AGE)
}
</avg-age>
,
let $cx-group := CUSTOMERS()[COUNTRY eq $c and fn:empty(STATE)]
where exists($cx-group)
return
<avg-age country=”{$c}” state=””>
{
fn:avg($cx-group/AGE)
}
</avg-age>
)
|
Example 3: Correct XQuery for counting customers by state and country.
To address this shortcoming of XQuery, we propose to add a group by
clause to the language – i.e., to add a G to FLWOR. Example 4 illustrates the use of our extension.
for $c in CUSTOMERS()
group $c as $cs-group by $c/COUNTRY as $country, $c/STATE as $state
return
<avg-age country=”{$country}” state=”{$state}”>
{
fn:avg($cs-group/AGE)
}
</avg-age>
|
Example 4: Extended XQuery for counting customers by state and country.
As illustrated in Example 4, this extension makes the task of handling such use cases much simpler, i.e., they become much more SQL-like. The proposed group by clause has two parts – it specifies the information to be grouped, giving the resulting sequence a variable name for use later in the query, and it specifies a list of atomic values to group the information by. Example 4 essentially iterates over customers, grouping them (like in SQL) by their combined country/state values; for each such value combination, it binds the sequence of customers ($c) with that combination of values to the variable $cs-group. Following the group by clause, leading into the return clause, the query produces a series of bindings for $cs-group, $country, and $state – where a given $cs-group binding refers to the sequence of $c values associated with the current binding combination for $country and $state – and these bindings are then available for use in the return clause, which in this case outputs the country and state information and computes and outputs the desired average for each group of customers. The bindings that precede the group by clause, in this case just the $c binding, are not available following the grouping operation; they are no longer in scope after that point.
Another common operation, when querying structured data, is the elimination of duplicate data values. As currently defined, XQuery pays significant attention to duplicates, but mostly in the context of document querying – where node identity within an XML document is the prime target for duplicate elimination. In contrast, in data querying use cases, one often wants to retrieve a “projection” of some underlying data and eliminate duplicates in the result. This is another example of a task that is simple in SQL but messy in XQuery as it stands today. To illustrate, suppose that we wanted to know the unique name/country combinations present in our customer data. Example 5 shows the SQL and XQuery 1.0 queries that retrieve this information. As before, the XQuery version of the query is based on the pattern recommended in [XQuery 1.0].
XQuery 1.0:
for $ln in distinct-values(CUSTOMERS()/LAST_NAME)
for $fn in distinct-values(CUSTOMERS()/FIRST_NAME)
for $c in distinct-values(CUSTOMERS()/COUNTRY)
return
if (fn:exists(CUSTOMERS()[LAST_NAME eq $ln and
FIRST_NAME eq $fn and
COUNTRY eq $c]))
then
<combo>
<last_name>{$ln}</last_name>
<first_name>{$fn}</first_name>
<country>{$c}</country>
</combo>
else ()
SQL:
select distinct c.LAST_NAME, c.FIRST_NAME, c.COUNTRY
from CUSTOMERS
|
Example 5: XQuery vs. SQL for selecting name/country combinations.
To address the obvious need here, we propose the addition of a distinct by
clause to XQuery – i.e., adding a D to FLOWR as well. Example 6 shows how this clause would come into play in solving the use case of Example 5 more naturally.
for $c in CUSTOMERS()
distinct by $c/LAST_NAME, $c/FIRST_NAME, $c/COUNTRY
return
<combo>
<last_name>{$ln}</last_name>
<first_name>{$fn}</first_name>
<country>{$c}</country>
</combo>
|
Example 6: XQuery vs. SQL for selecting name/country combinations.
Clearly, the query in Example 6 is easier to write and read than Example 5’s XQuery. Semantically, the distinct by clause operates on the concurrent variable bindings that precede it in the FLWOR expression – in this case, just the $c bindings – and it filters them so that only one binding for each unique combination of the values in the distinct clause survive. Unlike the group by clause, the distinct by clause does not itself add any new variable bindings – it simply filters the existing bindings (like a where clause does, but filtering on duplicates instead of on a predicate).
Our third and final proposed XQuery extension addresses another task that is easy in SQL but very clumsy in XQuery today – namely, the handling of “flat outer joins”, a need that also arises in some reporting use cases. To illustrate what this is, let’s assume the addition of another vendor-provided function, ORDERS(), that returns XML fragments containing summary information about each of the orders that our customers have placed. Example 7 shows what such a fragment might contain.
<ORDER> <ORDER_ID>54321</ORDER_ID> <CUST_ID>001</CUST_ID> <DATE>2004-09-15</DATE> <STATUS>shipped</STATUS> <TOTAL>399.99</TOTAL> </ORDER> |
Example 7: Simple XML order data.
Example 8 shows the most natural way to combine customers and order information in XQuery 1.0 and SQL, respectively. In the hierarchical XML world, since each customer can have zero, one, or many orders, the most natural customer/order representation is nested – each XML fragment for a customer can simply contain the corresponding XML fragments about that customer’s orders, representing the relationship via containment. This is how Example 8’s XQuery is designed; it extracts the desired customer information and then constructs a nested orders element containing all the related order information that the user wants to see. In the flat relational world, this approach is not possible, so the SQL version of the query does a left outer join of CUSTOMERS and ORDERS. This query produces a flat table containing one row for each order that a customer has placed. Customers who have placed no orders are retained by the outer join operation – that’s why an outer join is used rather than an inner join here – and in those cases, the corresponding order information will be null in each order-less customer’s row. To keep the information about a given customer together, the SQL query includes an order by clause that will ensure that each customer’s rows are together in the query’s result set. (Note that this isn’t necessary in the XQuery case, as there is just one XML fragment produced per customer.)
XQuery 1.0:
for $c in CUSTOMERS()
return
<cust>
<cid>{data($c/CUSTOMER_ID)}</cid>
<lname>{data($c/LAST_NAME)}</lname>
<orders>
{
for $o in ORDERS()
where $o/CUST_ID eq $c/CUSTOMER_ID
return
<order id=”{$o/ORDER_ID}”>
<amt>{data($o/TOTAL)}</amt>
</order>
}
</orders>
</cust>
SQL:
select c.CUSTOMER_ID, c.LAST_NAME, o.ORDER_ID, o.TOTAL
from CUSTOMERS left outer join ORDERS
on (c.CUSTOMER_ID = o.CUST_ID)
order by c.CUSTOMER_ID
|
Example 8: XQuery vs. SQL for combining customers and orders.
Both queries in Example 8 seem quite natural, and in many ways the hierarchical XML representation seems more desirable – so what’s the problem with XQuery 1.0 here? Basically, the problem is that some current software tools, tools that are commonly used to generate reports, can only consume tabular data; these tools were designed to graph and otherwise manipulate SQL output, so for them, the outer join result would be the required result form for the customer/order data. Therein lies the problem: XQuery 1.0 is great at producing the nested form, but expressing a “flat” left outer join is clunky – the language has no natural way to do this, other than writing a sequence of expressions to first produce the sequence of customers with orders and then produce the sequence of order-less customers (or vice versa). The present XQuery 1.0 formulation is shown in Example 9. This approach is similar to what SQL query writers were required to do to solve such use cases in the days before outer joins were added to SQL.
for $c in CUSTOMERS()
for $o in ORDERS()
where $c/CUSTOMER_ID eq $o/CUST_ID
return
<cust>
<cid>{data($c/CUSTOMER_ID)}</cid>
<lname>{data($c/LAST_NAME)}</lname>
<order id=”{$o/ORDER_ID}”>
<amt>{data($o/TOTAL)}</amt>
</order>
</cust>
,
for $c in CUSTOMERS()
where fn:empty(ORDERS()[CUST_ID eq $c/CUSTOMER_ID])
return
<cust>
<cid>{data($c/CUSTOMER_ID)}</cid>
<lname>{data($c/LAST_NAME)}</lname>
<order id=””/>
</cust>
|
Example 9: Simulating a customer/order left outer join in XQuery.
In order to add true “flat outer join” expressiveness to XQuery, we propose the addition of a new flavor of for clause – one that is guaranteed to bind its variable at least once by binding it to the empty sequence if there are no other bindings generated. This is best understood by looking at Example 10, which uses this extension – ofor
, for outer for
– to solve the tabular customer/order use case. With the extension, the query works as follows: The outermost for clause generates a $c binding for each customer. The other for clause, which is an ofor clause, generates one corresponding $o binding for each of that customer’s orders; the path expression ensures the correlation. If a customer has no orders at all, the ofor clause will still generates one binding – it will bind $o to the empty sequence in that case. This ensures that the return clause will still execute once for such customers, which would not be the case if $o were bound with a normal for clause, and that a result fragment will therefore be produced even for customers who have no matching orders. (Note: ofor needs a better keyword – readers’ suggestions are invited.)
for $c in CUSTOMERS()
ofor $o in ORDERS()[CUST_ID eq $c/CUSTOMER_ID]
return
<cust>
<cid>{data($c/CUSTOMER_ID)}</cid>
<lname>{data($c/LAST_NAME)}</lname>
<order id=”{$o/ORDER_ID}”>
<amt>{data($o/TOTAL)}</amt>
</order>
</cust>
|
Example 10: Using ofor to outer join customers and orders in XQuery.
Put it all together, and what do you have? FLWGDOR!
The last extension that we are proposing is that the preceding clauses should be added into the XQuery 1.0 grammar in a very general way. Specifically, the XQuery grammar generalization that we are proposing allows the following combinations of clauses (where oF
stands for ofor): (F|L)(F|L|oF|W|G|D|O)*R
. As a result, a general FLWOR expression – spelled FLWGDOR, which is unfortunately less pronounceable than FLWOR – starts with either a for clause or a let clause, which is followed by any combination of the for, let, ofor, where, group by, distinct by, and order by clauses, and then ends with a return clause. Demonstrating the full expressiveness of the resulting extensions is beyond the scope of this paper, but Example 11 provides a small taste by showing how the FLWGDOR approach’s allowance of multiple where clauses enables SQL-style GROUP BY / HAVING use cases to be easily expressed.
for $c in CUSTOMERS()
where $c/LAST_NAME eq “Smith”
group $c as $cs-group by $c/COUNTRY as $country
where fn:count($cs-group) > 1000
return
<info country=”{$country}”>
<avg-age>{fn:avg($cs-group/AGE)}</avg-age>
<names>{$cs-group/FIRST_NAME}</names>
</info>
|
Example 11: Using FLWGDOR to analyze countries with an abundance of Smiths.
In this paper, we have proposed a small set of XQuery extensions to enable grouping, duplicate elimination, and outer-join queries all to be expressed neatly within the XQuery language. These extensions make it possible for several data-centric query tasks, tasks that are easily specified in SQL, to be much more conveniently formulated in XQuery than they can be today. Our proposed extensions do minimal “damage” to the XQuery standard in that they generalize the current FLWOR expression syntax of XQuery and require no changes to be made to the underlying XQuery data model. Two of the three extensions (grouping and duplicate elimination) are running “in house” at BEA, and all three are expected to debut in the next major revision of the BEA XQuery engine and its encompassing products, notably BEA Liquid Data for WebLogic and BEA WebLogic Integration.
XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.