Previous Next Contents

Chapter 1    Goals and motivations

1.1  Why query XML Documents ?

When we speak about queries and query languages, we often refer to database technology. The well known SQL, for Structured Query Language, has become a standard for relational databases management systems. For the more modern object database technology, a query language, namely OQL has been defined. The purpose of a query language is to offer a declarative means to obtain data. The query language is a tool to describe the data you want to retrieve and manipulate, and it is the role of the underlying system to find the best (e.g. efficient) way to compute the answer.

Traditional database systems are appropriate tools for data that can fit into a rigid schema. But there exist a lot of information that can hardly benefit from being stored into structured repositories. We are speaking here about documents, like for instance TeX, SGML or HTML files, and also less structured data typically exchanged via data exchange formats (e.g ASN.1). Anyway, it is often very useful to be able to query large sets of such data to fulfill specific needs. In the Web context, search engines provide such facilities. Roughly speaking, we can say that in particular, documents were until recently mostly queried using full-text indexes, which do not take into account the internal structure of documents. Although quite sophisticated query engines for structured documents have existed for some time, they have not been a mainstream application.

Being optimistic, we can say that XML brings together the best of these two worlds. Indeed, XML documents are structured documents. They blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents that contain very little structure. Still others are somewhere in between, like for instance dictionaries or technical manuals. As more and more information is either stored in XML, exchanged in XML, or presented as XML through various interfaces, the ability to query XML data sources becomes increasingly important.

1.2  Why X-OQL ?

There is an active debate about what a query language for XML should be. Several proposals were presented at a W3C workshop on query languages for XML [Mar98]. The Query Language Working Group at the W3C is still working on the standard query language for XML.

Those proposals can be divided into two groups.

On one hand, we have languages coming from the semistructured data field (see [Abi97, ABS99] ). Since XML can be seen as a semistructured data model, languages created for querying semistructured data can be seen as query languages for XML (see for instance [GMW99]). Lorel[AQM+97], XML-QL[DFF+98] and YATL[CDSS98] belong to this family. In these languages, a query consists of three parts: The information passed between these clauses can be modeled as a relation. The separation between these three clauses is clear in a language like Lorel, which uses a familiar select-from-where statement. The select statement corresponds to the constructor clause, the from statement is the pattern clause and the where statement represents the filter clause. X-OQL belongs to this category.

On the other hand, proposals like XQL [RLS98], XQuery [DeR98] or XPath [CD99] provide extensions to the XSL pattern syntax. Such language can be used to filter the nodes of interest into a document. They bring XSL-T, the transformation language of XSL [Dea99], to a full fledged pattern matching language.

We will not discuss here the respective advantages and drawbacks of these two approachs, but the question remains. Why X-OQL ?

In the context of the Active Views project [AAC+99, AAA+99], a year ago, we needed a working implementation of a query language for XML. Since there was at this time no consensus about the incoming standard (which is still the case), we choosed to create our own query language. In order to ensure a good modularity of the Active Views prototype, we also decided to implement it on top the Document Object Model, the standard application interface for XML documents. Concerning the design of the language, we were rather conservative and kept, as in Lorel, the spirit of OQL, the standard query language for objects databases [CBB+97].

I was in charge of the so called X-OQL part of Active Views. At this time, I thought I would have fun doing it. I happened to be wrong. I had a lot of fun. So you should be aware that the work around X-OQL is certainly not to be compared to official proposals to the W3C! It is more a research prototype than a strong proposal. It certainly lacks essential features, as compared to more polished works. Anyway, it often works as it should do.

We will in the sequel give a quick overview of XML [BPSM98] and the Document Object Model (DOM) [WG98]. Those two sections try to provide guidelines to readers new to XML technologies. It is strongly recommended to refer to official specifications Web pages (see [BPSM98, WG98]) for more information. We then expose briefly the design of the language and its implementation on top of DOM.

1.3  XML

XML means eXtensible Markup Language. It is a new standard adopted by the World Wide Web Consortium (W3C) to ease data exchange on the World Wild Web. Like HTML, XML makes use of tags (words bracketed by < and >) and attributes (of the form name="value"), but while HTML specifies what each tag and attribute means (and often how the text between them will look in a browser), XML does not impose a fixed a-priori set of tags, and, generally speaking, uses the tags only to delimit pieces of data, leaving the interpretation of the data completely to the application that reads it.

1.3.1  Basic syntax

Elements
Recall that XML is a textual representation of data. The basic component in XML is the element. An element is a piece of text bounded by matching tags. Consider for instance the following XML element:
<book> 
  <title> Heart of Drakness </title> 
  <author> Joseph Conrad </author> 
</book>
Inside an element we may have raw tex, other elements, a mixture of the two, or nothing. An expression such as <book> is called a start-tag and </book> an end-tag. Such tags must be balanced, ie they should be closed in inverse order they are opened. This means that overlapping tags, like
<p> <b> some text </p> some text </b>
are not allowed in an XML document. Tags in XML are defined by users. There are no predefined tags as in HTML. Finally, XML has a useful abbreviation for empty elements. The following <book></book> can be abbreviated to <book/>.

Attributes
XML allows to associate attributes with elements. Attributes are name = "value" pairs that can appear inside start-tags. In the example below an attribute is used to specify the currency of a price element:
<book>
    <title> Heart of Drakness </title> 
    <author> Joseph Conrad </author> 
    <price currency = "euro"> 20.9 </price>
</book>
As with tags, users may define arbitrary attributes. The value of an attribute is always a string, and must be enclosed between quotation marks. Whereas sub-elements with the same tag may be repeated within an element, a given attribute may occur at most once within a tag. For instance,
<price> <currency = "euro" /> <currency = "US dollar /> </price>
is correct, but
<price currency="euro" currency="US dollar">
is not.

Comments
Comments can be added anywhere into an XML document for readability. Their syntax is as follows:
<!--this is a comment-->

1.3.2  Well-Formed and Valid XML Documents


<books>
  <novel>
    <title> Heart of Darkness </title>
    <author> Joseph Conrad </author>
    <price currency = "euro"> 20.9 </price>
  </novel>
  <manual>
    <title> 
      LateX: a Document Preparation System 
    </title>
    <author> Leslie Lamport </author>
    <publisher> Addison-Wesley </publisher>
  </manual>
  <book year="1999" month="October" >
    <title> 
      Data on the Web: From Relations 
      to Semistructured Data and XML 
    </title>
    <authors>
      <author> Serge Abiteboul </author>
      <author> Peter Buneman </author>
      <author> Dan Suciu </author>
    </authors>
    <price currency = "US dollar"> 39.95 </price>
    <publisher> Morgan Kauffman </publisher>
  </book>
</books>


Figure 1.1: A simple XML document.

So far we have presented the basic XML syntax. There are very few constraints that have to be matched: tags have to nest properly and attributes within an element have to be unique. A document is well-formed if it satisfies those two constraints and if it contains one and only one element at the upper-most level. The example document in Figure  is well-formed. Actually according to the official specification, every well formed XML document must contain a prolog, but the prolog can consist of nothing. When the prolog is not empty, it must come before the first element in the document. It can contain a version declaration, a document type declaration and comments.

Version declaration
All XML documents should contain a simple version declaration, which tells the processor what version of XML the document conforms to. Here is the form that it takes.
<?xml version="1.0"?>
<greeting> Hello XML!</greeting>
The version declaration can also contain other information such as an encoding declaration, that informs the processor what kind of code the document uses. For instance UTF-8 means that the encoding of this document uses the same character set as ASCII. It may also contain a standalone declaration that tells the XML processor whether the document can be read as a standalone document, or whether it needs to look outside the document for other rules. This is a rather advanced parameter and you are directed to the specification for more information. Here is an example however:
<?xml version="1.0"? encoding="UTF-8" standalone="yes" ?>
<greeting> Hello XML!</greeting>
Document Type Declaration and DTD
Alternatively, a document can be accompanied by a Document Type Definition (DTD). It is essentially a grammar for restricting the tags and structure of a document. An XML document satisfying a DTD is considered valid. A DTD is generally stored in a standalone file. An XML processor can locate a document's DTD by looking at the document type declaration in the prolog. For instance:
<?xml version="1.0" encoding="UTF-8" ?> 
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting> Hello XML!</greeting>
means that this document must have a root element with tag name "greeting" and must satisfy the validity constraints defined in the file "hello.dtd" in order to be valid. "hello.dtd" would contain something like:
<!ELEMENT greeting (#PCDATA)>
This declaration means that a greeting element can contain only character data.

Without getting into the details of DTD syntax, here follows a simple DTD for the example document.
<!-- the root element -->
<!ELEMENT books  (novel | manual | book)*>
<!-- it can contains 0 or more subelements -->
<!-- each subelement must be a novel, a manual or a book -->

<!-- novel type -->
<!ELEMENT novel  (title, author, price?) >
<!-- a novel must contain, in this order -->
<!-- a title and an author               -->
<!-- a price may follow                  -->

<!ELEMENT manual (title, author, publisher?)>

<!ELEMENT book   (title, 
          (author|authors), 
                  price?,
                  publisher?)>

<!ELEMENT authors (author+)>
<!-- authors is a non empty set of author elements -->

<!ELEMENT author  (#PCDATA)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT price   (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>

<!-- the currency attribute for a price element -->
<!ATTLIST price currency CDATA #IMPLIED>
Assume that this DTD is stored in "booksDTD.dtd". If we want to validate the example document against this DTD, it becomes:
<?xml version="1.0" encoding="UTF-8" ?> 
<!DOCTYPE books SYSTEM "booksDTD.dtd">
<books>
  ... 
</books>

1.4  Other XML constructs

Entities and CDATA
Suppose for instance that one wants to insert this sentence:
A start <tag> must nest properly with an end </tag>.
into an XML document, as pure textual data. In this case, we do not want an XML processor to interpret the text between <tag> and </tag as XML.

One way is to use entities. An entity is a named piece of text. Whenever an XML processor encounters a reference to an entity, it replaces this reference by the corresponding text. There exists some predefined entities, such that lt for the character < or gt for >. Using entities, the sentence above become:
A start <tag> must nest properly with an end </tag>.
User defined entities can be declared into a DTD.

Another way is to use CDATA sections. This is a more appropriate method to treat markup as text in documents, so that there is no necessity to escape every special character. The syntax is:
"<![CDATA [A start <tag> must nest properly 
           with an end </tag>
   ]]>"
Processing Instructions
Processing instructions take the following form:
<?target processing instruction?>
They are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", "xMl" and so on are reserved. Processing instructions can occur anywhere and contain information that the processor must pass on to the user agent. The version declaration
<?xml version="1.0" encoding="UTF-8"?>
is an example of a processing instruction.

1.5  DOM

The W3 Document Object Model (DOM) is an application programming interface for XML documents. It defines a logical view of the structure of documents and the way a document is accessed and manipulated.

The current W3 specification [WG98] defines the Document Object Model Level 1. It is separated into two parts: Core and HTML. The Core DOM Level 1 section provides a low-level set of fundamental interfaces that can represent any structured document. It defines as well extended interfaces for representing an XML document.

We will in the sequel deal with Core DOM Level 1, and, for simplicity, refer to it as DOM.

The goal of DOM is to define a programmatic platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content and the structure of documents. In DOM, documents have a logical structure which is a tree. Consider for instance the XML document in Figure 1.1.


0.65 DOMtree.eps


Figure 1.2: A DOM representation of an XML document

It can be represented as a tree, as in Figure 1.2. The name "Document Object Model" was chosen because it is an "object model" in the traditional object-oriented design sense. Documents and components of documents are modeled using objects, and the model encompasses not only the structure of a document, but also the behavior of a document and the objects of which it is composed. In other words, the nodes in the example diagram (Figure 1.2) represent objects with identities and methods. Let us examine more precisely the example document to discover what kind of nodes a DOM tree contains. Starting from the root node, we find the Document node. This object represents a Document. It has one child node, books. This is an Element object. In DOM, a document must have one and only one element child. For instance, the simplest XML document, <doc></doc> would be represented as a document object having one element child with tag name doc. This unique child element of a document is called the document element. Getting back to Figure 1.2, we can see that the document element has three child nodes, novel, manual and book, which happen to be element nodes, and each of these have children of different kind. A node that contains only text data is called a Text node. In DOM, Text nodes does not have any children. Some elements, like book have attributes, represented as Attr objects. An attribute is a name/value pair belonging to an element.

We have looked at the objects that the example well-formed XML document can be divided into. We will in the sequel details more precisely the properties DOM defines for each kind of object. Please be aware that this presentation is an abstract of the W3 DOM Specification. For shortness and clarity a lot of details have been omitted. The reader should refer to the official document whenever exact information is needed. Before going any further it is probably better to say a little about the syntax, namely the Object Management Group's Interface Definition Language (OMG IDL) used to define the properties and methods of the DOM in the W3C DOM specification.

1.5.1  The OMG IDL

The OMG IDL is a language for defining object interfaces, and ensure interoperability between object oriented programming languages. If one wants a Java program to communicate over a network with a C++ program, one can do so using IDL and a Broker which will translate between the two languages. This broker is called the Common Object Request Broker Architecture or CORBA for short. All CORBA objects support an IDL interface; the IDL interface defines an object type. An interface can inherit from one or more other interfaces. IDL syntax is very similar to that of Java or C++, and an IDL file is functionally the CORBA language-independent analog to a C++ header file. IDL is mapped into each programming language to provide access to object interfaces from that language. An IDL interface declares a set of client accessible operations, exceptions, and typed attributes (values). Each operation has a signature that defines its name, parameters, result, and exceptions. A simple IDL interface that describes a person follows.
interface Person {
    attribute string firstname;
    attribute string lastname;
    attribute short  age;

    boolean isOfAge();
};
The Interfaces attribute definitions are shorthands for get/set methods. It is particularly unfortunate that the term that IDL uses as a get/set method is called 'attribute', which of course has nothing at all to do with element XML attributes. For this reason the DOM refers to XML element attribute as Attr. An operation may raise an exception when an error condition arises. The type of the exception indicates the kind of error that was encountered.

1.5.2  The Node interface

The Node interface is the primary datatype for the entire Document Object Model: each object in a DOM tree is a Node. The IDL definition of the Node interface is :
interface Node { 
    readonly attribute  DOMString            nodeName;
             attribute  DOMString            nodeValue; 
    readonly attribute  unsigned short       nodeType;
    readonly attribute  Node                 parentNode;
    readonly attribute  NodeList             childNodes;
    readonly attribute  Node                 firstChild;
    readonly attribute  Node                 lastChild;
    readonly attribute  Node                 previousSibling;
    readonly attribute  Node                 nextSibling;
    readonly attribute  NamedNodeMap         attributes;
    readonly attribute  Document             ownerDocument;

    Node                      insertBefore(in Node newChild, 
                                           in Node refChild)
                                           raises(DOMException);
    Node                      replaceChild(in Node newChild, 
                                           in Node oldChild)
                                           raises(DOMException);
    Node                      removeChild(in Node oldChild)
                                          raises(DOMException);
    Node                      appendChild(in Node newChild)
                                          raises(DOMException);
    boolean                   hasChildNodes();
    Node                      cloneNode(in boolean deep);
};
For convenience, the Node interface expose methods for dealing with children, despite the fact that not all objects implementing the Node interface may have children. For example, Text nodes may not have children, and adding children to such nodes results in a DOMException being raised. In the same spirit, the attributes nodeName, nodeValue and attributes are included as a mechanism to get at node information without casting down to the specific derived interface. In cases where there is no obvious mapping of these attributes for a specific nodeType (e.g., nodeValue for an Element or attributes for a Text), this returns null. Note that the specialized interfaces may contain additional and more convenient mechanisms to get and set the relevant information. The DOM also specifies a NodeList interface to handle ordered lists of Nodes, such as the children of a Node, and a NamedNodeMap interface to handle unordered sets of nodes referenced by their name attribute, such as the attributes of an Element.

1.5.3  DOM Structure Model

DOM presents documents as a hierarchy of Node objects that also implement other, more specialized interfaces. Some types of nodes may have child nodes of various types, and others are leaf nodes that cannot have anything below them in the document structure. The node types, and which node types they may have as children, are as follows:
DocumentElement (maximum of one), ProcessingInstruction, Comment, DocumentType
DocumentFragmentElement, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
DocumentTypeno children
EntityReferenceElement, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
ElementElement, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
AttrText, EntityReference
ProcessingInstructionno children
Commentno children
Textno children
CDATASectionno children
EntityElement, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
Notationno children

1.5.4  The Document interface

The Document interface represents the entire XML document. Conceptually, it is the root of the document tree, and provides the primary access to the document's data. Elements, text nodes, comments, processing instructions, etc. cannot exist outside the context of a Document. So the Document interface also contains the factory methods needed to create these objects. The Node objects created have a ownerDocument attribute which associates them with the Document within whose context they were created.
IDL Definition
interface Document : Node {
  readonly attribute  DocumentType         doctype;
  readonly attribute  DOMImplementation    implementation;
  readonly attribute  Element              documentElement;
  Element             createElement(in DOMString tagName);
  DocumentFragment    createDocumentFragment();
  Text                createTextNode(in DOMString data);
  Comment             createComment(in DOMString data);
  CDATASection        createCDATASection(in DOMString data);
  ProcessingInstruction  
                  createProcessingInstruction(in DOMString target, 
                                              in DOMString data);
  Attr                createAttribute(in DOMString name);
  EntityReference     createEntityReference(in DOMString name);
  NodeList            getElementsByTagName(in DOMString tagname);
};
Attributes

1.5.5  The Element interface

By far the vast majority of objects (apart from text) that authors encounter when traversing a document are Element nodes. Assume the following XML document:

<elementExample id="demo">
      <subelement1/>
      <subelement2><subsubelement/></subelement2>
    </elementExample>  
When represented using DOM, the top node is an Element node for "elementExample", which contains two child Element nodes, one for "subelement1" and one for "subelement2". "subelement1" contains no child nodes. Elements may have attributes associated with them; since the Element interface inherits from Node, the generic Node interface method getAttributes may be used to retrieve the set of all attributes for an element. There are methods on the Element interface to retrieve either an Attr object by name or an attribute value by name. In XML, where an attribute value may contain entity references, an Attr object should be retrieved to examine the possibly fairly complex sub-tree representing the attribute value.

IDL definition
interface Element : Node {
      readonly attribute  DOMString    tagName;
      DOMString         getAttribute(in DOMString name);
      void              setAttribute(in DOMString name, 
                                     in DOMString value); 
      void              removeAttribute(in DOMString name);    
      Attr              getAttributeNode(in DOMString name);
      Attr              setAttributeNode(in Attr newAttr); 
      Attr              removeAttributeNode(in Attr oldAttr);
      NodeList          getElementsByTagName(in DOMString name);
      void              normalize();
    };
Attributes
Methods

1.5.6  The Attr Interface

The Attr interface represents an attribute in an Element object. Typically the allowable values for the attribute are defined in a document type definition. Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, DOM does not consider them part of the document tree. Thus, the Node attributes parentNode, previousSibling, and nextSibling have a null value for Attr objects. DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with. Furthermore, Attr nodes may not be immediate children of a DocumentFragment. However, they can be associated with Element nodes contained within a DocumentFragment. In XML, where the value of an attribute can contain entity references, the child nodes of the Attr node provide a representation in which entity references are not expanded. These child nodes may be either Text or EntityReference nodes. Because the attribute type may be unknown, there are no tokenized attribute values.

IDL definition
interface Attr : Node { 
    readonly attribute DOMString name; 
    attribute DOMString value; 
};
Attributes

1.5.7  The Text Interface

The Text interface represents the textual content (termed character data in XML) of an Element or Attr. If there is no markup inside an element's content, the text is contained in a single object implementing the Text interface that is the only child of the element. If there is markup, it is parsed into a list of elements and Text nodes that form the list of children of the element.
IDL definition
interface Text : CharacterData {
        };

1.6  Querying XML documents using DOM

Let us go back to the document figure . It contains three similar records containing informations about some books. Note that each record has a different structure. Imagine a bookseller managing his catalog using such a document. He will certainly be interested in retrieving all the books written by a given author, or the novels with price less than a given value. When the document is small, this can be achieved by manually browsing the XML document. But one can easily imagine that manual browsing becomes more and more cumbersome as the size of the document grows up.

We can automate manual browsing in order to ease specific tasks. For instance, if it often happens that the bookseller looks for the titles of novels with price less than 10, he could write a program like the following :
Starting from the root, find all the direct child nodes which name is novel.
Call N this collection of nodes.
For each node n in N, if n has a child node p with price for name, coerce the text children of p into a numeric value, and compare it to 10.
If the comparison succeeds, append the title subelement of n to the result.
Translated into a Java program using DOM Java Binding, it becomes:

// the input document
XMLParser parser = DomImplementation.createParser();
Document in = parser.parse("books.xml");

// the ouput document 
Document out = domImplementation.createDocument();
// create the root element for output document
// this is the DocumentElement for out 
out.appendChild( out.createElement("result");

NodeList rootChildNodes = in.getDocumentElement().getChildNodes();

// iterate through the novel subelements of root
for(int i=0; i<rootChildNodes.getLength(); i++){
  Node novel = rootChildNodes.item(i);
  if( novel.getNodeName().compareTo("novel")==0 ){
    NodeList novelChildNodes = novel.getChildNodes();

    // iterate through the price subelements of each novel
    for(int j=0; j<novelChildNodes.getLength(); j++){
      Node price = novelChildNodes.item(j);
      if( price.getNodeName().compareTo("price")==0 ){

        // retrieve the first child node of price
        Node priceText = price.getFirstChild();
        if( priceText != null){
          if( priceText.getNodeType() == Node.TEXT_NODE){

            // if it is a text node, compute its numeric value
            float priceValue = new Float( priceText.getNodeValue() ).floatValue();
            if( priceValue < 10){

              // if this value is less than 10
              // retrieve a title subelements of this novel
              Node title = null;
              for(int k=0;k<novelChildNodes.getLength() && (title==null);k++){
                if(novelChildNodes.item(i).getNodeName().compareTo("title")){
                  title = novelChildNodes.item(k);
                }
              }

              // If it exists, append it to the output document.
              // Note that DOM does not allow cloning nodes from 
              // one document to an another.
              if( title != null){    
                Element e = out.createElement( "title");
                e.appendChild( 
                  out.createTextNode(
                    title.getFirstChild().getNodeValue()));
                out.getDocumentElement().appendChild(e);
              } 
            }
          }
        }
      }
    } 
  }
}
One must admit that this program is quite complex compared to the simplicity of the original query.

An equivalent X-OQL query would be :

select novel/title 
from books in load("books.xml")
    ,novel in books/novel
where novel/price < 10;

1.7  DOM as an intermediate layer

The X-OQL implementation only uses DOM operations to access XML data. Thus, the X-OQL module is able to query DOM trees, whatever the nature of the underlying storage and/or implementation.


Previous Next Contents