This website was created to provide additional information and resources to accompany our presentation to INLS258 on XML query languages.
XML is a metalanguage - meaning that it is a syntax for designing markup languages. The general philosophy the World Wide Web Consortium used in developing XML was one of clarity and orderliness that would allow computationally inexpensive parsing of documents while providing enough flexibility to apply the language to a wide range of applications.
XML was created to be applicable to a wide range of document types, but some consider it too flexible. Because of the variety of ways that data can be represented by an XML grammar, two incompatible philosophies about how XML should be used have emerged: document-centric and data-centric.
Proponents of document-centric XML concentrate on documents intended for direct human consumption, such as Documenting the American South's encoding of historic documents in a digital archive. Such documents are almost always human generated and tend to be less uniform and less structured than data-oriented XML.
Most XML used in databases follows a "data-oriented" or "data-centric" approach where XML is used primarily as a data-transport mechanism. XML of this variety tends to be less readable by humans and is usually generated by an automated process. One of the primary focuses of current XML query languages is the creation of data-centric XML documents from relational databases.
Depending on the needs of the user, XML documents can be stored in a database in a number of ways: a text object, a "shredded" document or a native XML data-type.
Since XML documents are text, they can be stored using a text object datatype (e.g. the CLOB datatype in Oracle or the Memo field in Access). This is the easiest way of storing XML data because it does not have to be parsed by the database prior to storage. Moreover, this storage scheme is the most flexible because it is unaffected by changes to the document schema. In many document-oriented situations, this type of storage system may be perfectly adequate, such as when the database serves as the back-end for a content management system. Moreover, full-text searching is fairly easy to implement on a text-object field. However, such a setup doesn't allow external applications to have ready access to the document's internal structure. The entire document has to be loaded into memory before a particular node can be retrieved.
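The trade-off described above can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 and xml.etree.ElementTree modules in place of Oracle or Access so that it is self-contained, and the table name, column names and sample document are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; any well-formed XML text would do.
DOC = "<article><author>J. Smith</author><body>XML query languages</body></article>"

conn = sqlite3.connect(":memory:")
# The document is stored as an opaque TEXT value; the database never parses it.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO docs (content) VALUES (?)", (DOC,))

# Full-text style searching is a plain substring match on the text column.
hits = conn.execute(
    "SELECT id FROM docs WHERE content LIKE ?", ("%query languages%",)
).fetchall()

# To reach a single node, the whole document must be fetched and parsed first.
(content,) = conn.execute("SELECT content FROM docs WHERE id = 1").fetchone()
author = ET.fromstring(content).findtext("author")
print(hits, author)
```

Note that the database itself knows nothing about the author element; all structural access happens in the application after the full text has been loaded.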
If stored as a "shredded" document, the tree structure of the original XML document is partially implemented by relational tables. For example, the author information of a document might be stored separately from the remainder of the document so that it can be retrieved without having to load the entire document. A caveat of this storage scheme is that any performance gains offered by shredding the documents are directly dependent on how well the divisions match a user's information needs. While a shredded storage scheme may be useful in certain circumstances, it can make the system brittle and inflexible. Because of this lack of flexibility, shredded storage schemes are best suited to data-oriented XML documents.
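The author example above might look roughly like the following sketch, again using Python's sqlite3 and xml.etree.ElementTree as stand-ins for a production database. The shred function, the two tables and the sample document are assumptions made for illustration, not part of any particular product.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = "<article><author>J. Smith</author><body>XML query languages</body></article>"


def shred(conn, doc_id, xml_text):
    """Split one document into an author row and a 'remainder' row."""
    root = ET.fromstring(xml_text)
    author = root.find("author")
    conn.execute(
        "INSERT INTO authors (doc_id, name) VALUES (?, ?)", (doc_id, author.text)
    )
    root.remove(author)  # everything except the author stays together
    conn.execute(
        "INSERT INTO remainders (doc_id, content) VALUES (?, ?)",
        (doc_id, ET.tostring(root, encoding="unicode")),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (doc_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE remainders (doc_id INTEGER, content TEXT)")
shred(conn, 1, DOC)

# The author is now available without loading or parsing the document body.
(name,) = conn.execute("SELECT name FROM authors WHERE doc_id = 1").fetchone()
(rest,) = conn.execute("SELECT content FROM remainders WHERE doc_id = 1").fetchone()
print(name, rest)
```

The brittleness is visible here too: the split is hard-coded around the author element, so a user who needs some other node gains nothing from it.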
The preferred method of storing data-oriented documents is as "native" XML objects. A native object internally reproduces the tree-structure of the document so that any node of the document can be retrieved easily. However, upon being stored, the document will need to be parsed by the database system - making native objects the most computationally expensive storage method. Moreover, implementing full-text searching on native XML documents is considerably more difficult. These shortcomings are partially mitigated by the additional control given to the database user. In addition to simplifying retrieval of specific elements, the document can also be validated against a schema document during the parsing stage. This validation can be used to enforce entity integrity.
Xpath is one of the primary retrieval tools for XML data. The syntax was developed to allow applications to precisely specify elements and attributes of an XML document. Since the Xpath specification was completed in 1999, it is also very stable and widely supported. For many applications, such as exchanging data between database systems, Xpath is perfectly adequate. However, Xpath lacks many of the features necessary for advanced analysis and data processing. Specifically, Xpath has no native methods for sorting, grouping, or joining documents. Sorting and grouping can be accomplished by using Xpath in conjunction with XSLT, but overall Xpath is inadequate for many uses.
To overcome the shortcomings of Xpath, W3C has been developing a full-featured query language called Xquery. Xquery incorporates most (but not all) of the syntax used in Xpath to indicate an XML element or attribute and adds query language features to it. Since the Xquery standard has not been finalized (currently in final draft), it is not as widely supported as Xpath. Oracle, for example, doesn't support Xquery prior to Oracle 10g release 2, and Microsoft doesn't support Xquery prior to SQL Server 2005.
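To make the limitation concrete, here is a small sketch using the limited Xpath subset supported by Python's xml.etree.ElementTree. The bibliography data is invented for illustration. The path expression selects nodes in document order, and any sorting has to be done by the host language.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; titles and publishers are placeholders.
BIB = """<bib>
  <book><title>B Title</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>A Title</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>C Title</title><publisher>Other Press</publisher></book>
</bib>"""

root = ET.fromstring(BIB)

# Path expression with a predicate: titles of books from one publisher.
titles = [t.text for t in root.findall("./book[publisher='Morgan Kaufmann']/title")]

# The path expression has no ordering facility, so sorting happens outside it.
sorted_titles = sorted(titles)
print(titles, sorted_titles)
```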
Xquery has a number of features that make it far more desirable than Xpath as a query language. First, it supports sorting, grouping and joins. Second, it is exceptionally rich in features - including approximately 100 built-in functions. Third, it supports a dual syntax that provides a number of unique advantages over SQL.
The two Xquery syntaxes outlined in the W3C standard include a human-readable syntax and an XML-based syntax for machine use. In the latter, the tree structure makes it easy to create building blocks of queries to embed into other languages. Moreover, Xqueries written in the XML syntax can describe themselves - thus making it possible to compose a query of queries. For example, a user could select all of the queries that refer to a particular object.
The basic selection syntax for a human-readable Xquery contains the clauses For, Let, Where, Order By and Return and is commonly referred to as a FLWOR expression. Some examples of a FLWOR construct (heavily borrowed from the W3C working draft):
for $b in doc("bib.xml")//book
where $b/publisher = "Morgan Kaufmann"
  and $b/year = "1998"
return $b/title
The FOR clause indicates that the query should iterate through all the instances of the book element (bound to the $b variable) in the document. Note that this example does not use the optional LET or ORDER BY clauses, which will be demonstrated below. However, it does illustrate the use of the RETURN clause. Unlike SQL, where the returned object is always a collection of rows, an Xquery can return any text construct the user specifies. The above example might be expressed in SQL as:
SELECT title FROM book WHERE (publisher = 'Morgan Kaufmann') AND (year = '1998');
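For readers more familiar with general-purpose languages, the same FOR / WHERE / RETURN logic can be approximated in Python. This is only a sketch of the query's semantics, not an Xquery processor; the inline bib.xml content is hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; titles are placeholders.
BIB = """<bib>
  <book><title>Example Title A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Example Title B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Example Title C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>"""

root = ET.fromstring(BIB)

titles = [
    b.find("title")                                  # RETURN $b/title
    for b in root.iter("book")                       # FOR $b IN ...//book
    if b.findtext("publisher") == "Morgan Kaufmann"  # WHERE ...
    and b.findtext("year") == "1998"                 # AND ...
]

# As in Xquery, the result is a sequence of XML nodes rather than rows.
result = [ET.tostring(t, encoding="unicode").strip() for t in titles]
print(result)
```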
Here's a more complex example of Xquery:
for $d in fn:doc("depts.xml")/depts/deptno
let $e := fn:doc("emps.xml")/emps/emp[deptno = $d]
where fn:count($e) >= 10
order by fn:avg($e/salary) descending
return
  <big-dept>
    {
      $d,
      <headcount>{fn:count($e)}</headcount>,
      <avgsal>{fn:avg($e/salary)}</avgsal>
    }
  </big-dept>
In the above example the LET clause is used to indicate the join between the depts.xml document and the emps.xml document. The return object is a segment of an XML document that includes the department number ($d), the number of employees (the count of $e), and the average salary (computed from $e). The SQL version would look something like this:
SELECT d.deptno, COUNT(*) AS headcount, AVG(e.salary) AS avgsal FROM depts d JOIN emps e ON e.deptno = d.deptno GROUP BY d.deptno HAVING COUNT(*) >= 10 ORDER BY AVG(e.salary) DESC;
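The join, filter, ordering and element construction in this query can be traced step by step with a rough Python equivalent. The sample data is invented, and the headcount threshold is lowered from 10 to 2 (the hypothetical MIN_HEADCOUNT constant) so that a tiny data set still produces output.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml.
DEPTS = "<depts><deptno>10</deptno><deptno>20</deptno><deptno>30</deptno></depts>"
EMPS = """<emps>
  <emp><deptno>10</deptno><salary>100</salary></emp>
  <emp><deptno>10</deptno><salary>300</salary></emp>
  <emp><deptno>20</deptno><salary>500</salary></emp>
  <emp><deptno>30</deptno><salary>400</salary></emp>
  <emp><deptno>30</deptno><salary>600</salary></emp>
</emps>"""
MIN_HEADCOUNT = 2  # the query above uses 10; lowered for the small sample

depts = ET.fromstring(DEPTS)
emps = ET.fromstring(EMPS)

rows = []
for d in depts.findall("deptno"):                        # FOR $d IN ...
    e = [x for x in emps.findall("emp")                  # LET $e := emp[deptno = $d]
         if x.findtext("deptno") == d.text]
    if len(e) >= MIN_HEADCOUNT:                          # WHERE count($e) >= ...
        avg = mean(float(x.findtext("salary")) for x in e)
        rows.append((avg, d.text, len(e)))

rows.sort(key=lambda r: r[0], reverse=True)              # ORDER BY avg descending

big_depts = []
for avg, deptno, headcount in rows:                      # RETURN <big-dept>...
    el = ET.Element("big-dept")
    ET.SubElement(el, "deptno").text = deptno
    ET.SubElement(el, "headcount").text = str(headcount)
    ET.SubElement(el, "avgsal").text = str(avg)
    big_depts.append(ET.tostring(el, encoding="unicode"))

print("\n".join(big_depts))
```

Department 20 is dropped by the headcount filter, and the remaining departments come out ordered by average salary, highest first.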
As you can see, while the syntaxes of Xquery and SQL are very different, the two languages have many of the same structures. In fact, Xquery's syntax is so flexible that in Oracle 10g, Xquery can be used to query relational tables and XML objects at the same time. However, the flexibility of Xquery comes at the price of added complexity. Additionally, the standard does not presently include a facility for updating data using Xquery. While the W3C states that the ability to update is forthcoming, its current lack is a serious shortcoming.
SQL/XML (also variously called SQLX, XSQL, and XML-SQL) is a collection of extensions to the SQL standard aimed at expanding SQL to retrieve from and generate XML documents. SQL/XML is part of the SQL200n standard and is expected to be ratified in 2006. It is not related to the proprietary SQLXML technology used on SQL Server.
One of the goals of the SQL/XML project is to provide a conceptual mapping between SQL and XML so that SQL can reliably be used to retrieve from XML documents. Thus, SQL/XML and Xquery have a significant amount of overlap. This overlap doesn't necessarily indicate that SQL/XML is redundant; rather, it fills the niche of providing XML retrieval capabilities to a language that many database developers are already familiar with in order to produce XML documents from relational tables. Unlike Xquery, SQL/XML is not designed to retrieve data from native XML sources, and it includes the update functionality of SQL (a facility Xquery currently lacks).
Here's an example of SQL/XML publishing functions:
SELECT XMLELEMENT(NAME employee,
         XMLATTRIBUTES(emp.ID AS id),
         XMLFOREST(emp.f_name || ' ' || emp.l_name AS name, emp.salary, emp.dept))
FROM t_employee AS emp
WHERE emp.dept = 4;
The above query uses the XMLELEMENT function to indicate that the output will be an XML element rather than a row (like the RETURN clause of Xquery). XMLATTRIBUTES indicates that the column should be returned as an attribute of the parent element, and XMLFOREST indicates that each of the contained columns should be returned as an element with the same name as the column (excluding the table name) or, where one is given, its alias. Thus, a returned tuple would look like this:
<employee id="2341">
  <name>Miles Naismith</name>
  <salary>50000</salary>
  <dept>4</dept>
</employee>
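The row-to-element mapping performed by the publishing functions can be imitated by hand, which may help clarify what each function contributes. The sketch below uses sqlite3 (which has no SQL/XML support) and builds the element in Python instead; the employee_element helper and the table definition are assumptions based on the query above.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_employee "
    "(id INTEGER, f_name TEXT, l_name TEXT, salary INTEGER, dept INTEGER)"
)
conn.execute("INSERT INTO t_employee VALUES (2341, 'Miles', 'Naismith', 50000, 4)")


def employee_element(row):
    """Build one <employee> element from a relational row (hypothetical helper)."""
    emp_id, f_name, l_name, salary, dept = row
    # Plays the role of XMLELEMENT plus XMLATTRIBUTES: parent element with an id.
    el = ET.Element("employee", {"id": str(emp_id)})
    # Plays the role of XMLFOREST: one child element per column or alias.
    for tag, value in (("name", f"{f_name} {l_name}"), ("salary", salary), ("dept", dept)):
        ET.SubElement(el, tag).text = str(value)
    return ET.tostring(el, encoding="unicode")


rows = conn.execute(
    "SELECT id, f_name, l_name, salary, dept FROM t_employee WHERE dept = 4"
)
elements = [employee_element(r) for r in rows]
print(elements[0])
```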
SQL/XML can also be used by Oracle to create XML views of relational tables, serving virtual XML documents to external applications that support XML such as the MS-Office and Open Office suites.
XircL is an XML query language designed specifically for document-oriented data. It is presently implemented as the query language for Xircus, an XML-sensitive search engine. It provides for querying XML documents using a combination of simple keyword searching, path expressions, and phrase/sentence querying. An interesting feature is that it also supports weighting expressions to affect result ranking. The syntax for this weighting is: expression * factor. It uses the XML Path Language to specify path expressions and supports returning document fragments as query results through the 'return' command.
An example of XircL syntax quoted directly from The Xircus Search Engine:
"To illustrate querying with XircL we use the topc 21 from the INEX collection: 'Which authors of articles cited recent work by Heikki Mannila?' The query is expressed in XircL this way:
path(//bm/bb/au)
contains Heikki and Mannila
and
path(//bm/bb/pdt/yr) >= 1998
return /article/fm
The back matter of an article is searched for the author Heikki Mannila. The search is restricted by an exact query term, which selects references from 1998 up to now. Since we are interested in authors who cited Heikki Mannila, we just want to return the front matter stuff (author, title) of the article."
Xupdate is a language, not sponsored by the W3C, that was created by the XML-DB open-source group and designed to mitigate the shortcomings of Xquery. Unfortunately, Xupdate seems to be a stalled project - the most recent working draft is dated 2000 and the website is moribund. It is not a fully functional query language because it lacks information retrieval capabilities. Xupdate - as its name implies - concentrates on updating, appending and inserting records.
XUpdate statements are fully qualified XML documents. This provides the advantage of being able to quickly author XUpdate statements using XML editing tools and to easily store XUpdate statements directly in XML supporting databases.
While the Xupdate language project is stalled, several active open-source projects appear to be using Xupdate. Implementations of Xupdate include a Java-based implementation called Lexus, which is used in several open-source projects.
Here's an example of the Xupdate query syntax. Start with the following input data:
<?xml version="1.0"?>
<addresses version="1.0">
<address id="1">
<fullname>Andreas Laux</fullname>
<born day='1' month='12' year='1978'/>
<town>Leipzig</town>
<country>Germany</country>
</address>
</addresses>
The following XUpdate update inserts a new address element after the first address.
<?xml version="1.0"?>
<xupdate:modifications version="1.0"
xmlns:xupdate="http://www.xmldb.org/xupdate">
<xupdate:insert-after select="/addresses/address[1]" >
<xupdate:element name="address">
<xupdate:attribute name="id">2</xupdate:attribute>
<fullname>Lars Martin</fullname>
<born day='2' month='12' year='1974'/>
<town>Leizig</town>
<country>Germany</country>
</xupdate:element>
</xupdate:insert-after>
</xupdate:modifications>
The XML result is
<?xml version="1.0"?>
<addresses version="1.0">
<address id="1">
<fullname>Andreas Laux</fullname>
<born day='1' month='12' year='1978'/>
<town>Leipzig</town>
<country>Germany</country>
</address>
<address id="2">
<fullname>Lars Martin</fullname>
<born day='2' month='12' year='1974'/>
<town>Leizig</town>
<country>Germany</country>
</address>
</addresses>
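The effect of the insert-after operation can be reproduced with ordinary tree manipulation, which may help readers who have no Xupdate processor available. The sketch below is not an Xupdate implementation: the insert_after helper is hypothetical, and the input is an abbreviated version of the address data above.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the input document shown above.
SOURCE = (
    '<addresses version="1.0">'
    '<address id="1"><fullname>Andreas Laux</fullname></address>'
    "</addresses>"
)


def insert_after(parent, sibling, new_element):
    """Insert new_element immediately after sibling among parent's children."""
    position = list(parent).index(sibling)
    parent.insert(position + 1, new_element)


root = ET.fromstring(SOURCE)

# Equivalent of xupdate:element with an xupdate:attribute child.
new = ET.Element("address", {"id": "2"})
ET.SubElement(new, "fullname").text = "Lars Martin"

# Equivalent of select="/addresses/address[1]" (positions are 1-based).
first = root.find("address[1]")
insert_after(root, first, new)

ids = [a.get("id") for a in root.findall("address")]
print(ET.tostring(root, encoding="unicode"))
```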