Birdstep Database Engine™ XML Interface Developers' Guide


Table of Contents
IBXML implementation
Introduction to XML
IBXML data structure
Node attributes
Node types
Searching
IBXML components
IBXML low-level interface
XML parser
The Simple API for XML (SAX)
DOM
Bibliography

IBXML implementation

Marius Larsen Jøhndal

System developer
Birdstep Technology AS
Revision History
Revision 1.1 24 May 2000
Updated and redesigned figures. Removed a couple of pointless figures. Rewrote the section on the data structure completely. Corrected several spelling errors. Now consistently uses British spelling. Updated to DocBook v3.1. Updated links to W3C documents. Updated to reflect changes in upcoming IBXML v1.2/IBAPI v1.3.
Revision 1.0 18 Feb 2000
Initial version.

The Birdstep Database Engine™ XML Interface (IBXML) is a module which enables the Birdstep Database Engine™ to use XML natively. It provides two standardised programming interfaces for XML: The Simple API for XML (SAX) and the Document Object Model (DOM) in addition to a low-level interface for the native IBXML structures. This paper discusses the SAX and DOM implementations in IBXML 1.0 and the underlying data structures used by IBXML.


Introduction to XML

The Extensible Markup Language (XML) [BPSM98] is a meta-language defined by the World Wide Web Consortium (W3C) and is a subset of the Standard Generalized Markup Language (SGML) defined in ISO standard 8879:1986. In contrast to HTML, XML is neither a predefined set of tags nor a standardised template for document production. XML is simply a syntax that can be used to annotate arbitrary character data, to describe hierarchical structures and to attribute meta-data to character data. The meaning and semantics of the data are beyond the scope of XML itself.

Figure 1. Sample XML document

<?xml version='1.0'?>
<!DOCTYPE list>
<list>
 <magazine frequency="weekly">
  <title>XML World</title>
 </magazine>
</list>
   

An XML document (see Figure 1 for a sample document) can be divided into the following parts:

  1. A prolog.

  2. An optional DOCTYPE declaration.

  3. Processing instructions.

  4. A root element with optional attributes.

  5. A hierarchy of sub-elements with optional attributes, entities, character data (CDATA) and parsed character data (PCDATA).

XML documents may be well-formed and may also be valid. An XML document is well-formed if there is one top-level element, the root element or the document element, and all elements nest properly within each other (see [BPSM98] section 2.1 for the complete definition). An XML document may have an associated document type definition (DTD), which may be an external document, part of the XML document or both. If it does have such a DTD, the XML document is valid if the document complies with the constraints expressed in the DTD (see [BPSM98] section 2.8 for the complete definition).
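The well-formedness rules above can be tried out with any XML parser. The following is a minimal sketch, assuming the expat binding shipped with the Python standard library (`xml.parsers.expat`); it is not part of IBXML, and the helper name `is_well_formed` and the sample strings are chosen here for illustration only.

```python
import xml.parsers.expat


def is_well_formed(text: str) -> bool:
    """Return True if expat accepts the text as a well-formed XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# The sample document from Figure 1.
SAMPLE = """<?xml version='1.0'?>
<!DOCTYPE list>
<list>
 <magazine frequency="weekly">
  <title>XML World</title>
 </magazine>
</list>
"""

# The elements overlap instead of nesting properly within each other.
BAD_NESTING = "<list><magazine><title>XML World</magazine></title></list>"

# Two top-level elements, so there is no single root element.
TWO_ROOTS = "<list/><list/>"

results = {
    "sample": is_well_formed(SAMPLE),
    "bad nesting": is_well_formed(BAD_NESTING),
    "two roots": is_well_formed(TWO_ROOTS),
}
```

Only the first document satisfies both conditions named above (one root element, properly nested elements); checking validity against a DTD is a separate step that this sketch does not attempt.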


IBXML data structure

IBXML is implemented using the Birdstep Database Engine™ notion of fields (members), records (inner relations) and user types. The stored XML document is represented in a database as a unidirectional, cyclic graph, e.g. the graph in Figure 2, which represents the sample XML document in Figure 1. As in graph theory, each box in Figure 2 is called a node, and each of these nodes corresponds to a single field in a Birdstep Database Engine™ database. The prolog, the DOCTYPE declaration and the processing instructions of the XML document are not shown in Figure 2.

Figure 2. IBXML storage graph of sample XML document in Figure 1

IBXML is capable of storing any well-formed XML document that conforms to [BHL99]. An XML document does not have to be valid in order to be stored by IBXML. IBXML stores each information item as defined in [CM99] as one or more nodes. All core information items of the XML Information Set are stored for all imported XML documents, while some of the peripheral information items may be stored if desired.


Node attributes

Each node in an IBXML database is part of a record, and each node has four attributes from the IBXML point of view: implicit pointers, node type, enclosing tag and contents.

The implicit pointers are provided by the record mechanism of the Birdstep Database Engine™. The implicit pointers point both to the next node in a record and to the first node in a record. The former type of pointer is called a next pointer in Birdstep Database Engine™ terminology, while the latter is called an owner pointer.

The node type is signalled using the base type mechanism of the Birdstep Database Engine™, while the enclosing tag of a node is expressed using the user type mechanism of the Birdstep Database Engine™ in collaboration with the dictionary. Each user type maps to a specific combination of a namespace URI and an element name. This means that the ordered triple (namespace URI, local element name, node type) maps uniquely to an ordered pair (user type, base type) in the dictionary. The base types used by IBXML are reserved by Birdstep Database Engine™ for this purpose and should not be used by other applications.

The contents of the nodes differ depending on the node type; some are used for storing actual XML data, e.g. a string containing an XML comment, while others are used to point to new records. Most of the pointers contained in the latter type of node are referred to as explicit pointers. These pointers point to a new record, in which the first node is a RETURN node containing an explicit pointer back to the referring node. This combination of two reciprocal, explicit pointers is used, among other things, for the notion of child elements.
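The pointer arrangement described above can be modelled in a few lines. The sketch below is purely illustrative and does not use IBAPI: the classes `Node` and `Record` and the helpers `next_node`, `owner_node`, `link_child_record` and `parent_of` are hypothetical names, and in-memory object references stand in for database addresses. What follows the last node of a record is not described here, so the sketch simply returns `None` in that case.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(eq=False, repr=False)
class Node:
    base_type: str                    # node type, e.g. "ELEMENT" or "CHARACTER"
    user_type: Optional[str] = None   # enclosing tag (hypothetical plain string)
    contents: Any = None              # XML data, or an explicit pointer
    record: Optional["Record"] = None  # set when the node joins a record


class Record:
    """An ordered sequence of nodes; stands in for a database record."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def append(self, node: Node) -> Node:
        node.record = self
        self.nodes.append(node)
        return node


def next_node(node: Node) -> Optional[Node]:
    """Implicit next pointer: the following node in the same record."""
    nodes = node.record.nodes
    position = nodes.index(node)      # identity comparison, since eq=False
    return nodes[position + 1] if position + 1 < len(nodes) else None


def owner_node(node: Node) -> Node:
    """Implicit owner pointer: the first node in the same record."""
    return node.record.nodes[0]


def link_child_record(referrer: Node) -> Record:
    """Create a new record whose RETURN node points back to the referrer."""
    child = Record()
    child.append(Node("RETURN", contents=referrer))  # explicit pointer back
    referrer.contents = child                        # explicit pointer down
    return child


def parent_of(node: Node) -> Node:
    """Follow the owner pointer, then the RETURN node's explicit pointer."""
    return owner_node(node).contents


# A <title> element holding the character data "XML World".
top = Record()
title = top.append(Node("ELEMENT"))
children = link_child_record(title)
text = children.append(Node("CHARACTER", user_type="title", contents="XML World"))
```

Here `parent_of(text)` reaches the `title` node in two steps, which mirrors how the pair of reciprocal explicit pointers expresses the parent/child relationship.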


Node types

The node types used by IBXML are listed in Table 1. The semantics of some of the node types are described in more detail in the following list:

  • DOCUMENT: Indicates a document information item. The database address points to a record which contains the following fields in this order:

    1. A RETURN node.

    2. NOTATION nodes, if any (order in document is insignificant).

    3. ENTITY nodes, if any (order in document is insignificant).

    4. PI nodes, COMMENT nodes and exactly one ELEMENT node (order in document is significant).

  • NOTATION: Indicates a notation information item. The database address points to a record which contains the following fields in this order: (RETURN, NOTATION_NAME, NOTATION_SYSID, NOTATION_PUBID, NOTATION_BASEURI).

  • ENTITY: Indicates an entity information item. The database address points to a record which contains the following fields in this order: (RETURN, ENTITY_TYPE, ENTITY_NAME, ENTITY_SYSID, ENTITY_PUBID, ENTITY_BASEURI, ENTITY_NOTATION).

  • COMMENT: Indicates a comment information item. The string represents the contents of the comment.

  • ELEMENT: Indicates an element information item. The name of the new element, i.e. both the namespace URI and the local part of the element's name, is indicated by the user type of the nodes in the record. The database address points to a record which contains the following fields in this order:

    1. A RETURN node.

    2. ATTRIBUTE nodes, if any (order in document is insignificant).

    3. NAMESPACE nodes, if any (order in document is insignificant).

    4. ELEMENT nodes, PI nodes, SKIPPED nodes, CHARACTER nodes and COMMENT nodes (order in document is significant).

  • ATTRIBUTE: Indicates an attribute information item. The database address points to a record which contains the following fields in this order: (RETURN, ATTRIBUTE_NS, ATTRIBUTE_LOCAL, ATTRIBUTE_VALUE).

  • PI: Indicates a processing instruction information item. The database address points to a record which contains the following fields in this order: (RETURN, PI_TARGET, PI_CONTENT).

  • SKIPPED: Indicates a reference to a skipped entity information item. The string contains the name of the skipped entity referenced.

  • CHARACTER: Indicates a block of characters. The string contains the characters in the block.

  • NAMESPACE: Indicates a namespace declaration information item. The database address points to a record which contains the following fields in this order: (RETURN, NAMESPACE_PREFIX, NAMESPACE_URI, NAMESPACE_VALUE).

Table 1. IBXML node types

Node contents                    Base type         Semantic base type
Document                         DOCUMENT          DBADDR
Notation                         NOTATION          DBADDR
Notation name                    NOTATION_NAME     UNI
Notation system identifier       NOTATION_SYSID    UNI
Notation public identifier       NOTATION_PUBID    UNI
Notation base URI                NOTATION_BASEURI  UNI
Entity information               ENTITY            DBADDR
Entity type                      ENTITY_TYPE       BYTE
Entity name                      ENTITY_NAME       UNI
Entity system identifier         ENTITY_SYSID      UNI
Entity public identifier         ENTITY_PUBID      UNI
Entity base URI                  ENTITY_BASEURI    UNI
Entity notation reference        ENTITY_NOTATION   DBADDR
Comment                          COMMENT           UNI
Element                          ELEMENT           DBADDR
Attribute                        ATTRIBUTE         DBADDR
Attribute namespace URI          ATTRIBUTE_NS      DBADDR
Attribute local name             ATTRIBUTE_LOCAL   UNI
Attribute content/value          ATTRIBUTE_VALUE   UNI
Processing instruction           PI                DBADDR
Processing instruction target    PI_TARGET         UNI
Processing instruction content   PI_CONTENT        UNI
Skipped entity reference         SKIPPED           UNI
Block of characters              CHARACTER         UNI
Namespace declaration            NAMESPACE         DBADDR
Namespace prefix                 NAMESPACE_PREFIX  UNI
Namespace URI                    NAMESPACE_URI     UNI
Namespace attribute value        NAMESPACE_VALUE   UNI
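The record layouts given in the node type list above can be written down as data and used to check the field order of a record. This is only a sketch of the ordering rules as stated in this section; the names `FIXED_LAYOUTS`, `GROUPED_LAYOUTS` and `record_is_valid` are hypothetical and are not part of IBXML or IBAPI.

```python
# Records whose fields appear in one fixed order.
FIXED_LAYOUTS = {
    "NOTATION": ("RETURN", "NOTATION_NAME", "NOTATION_SYSID",
                 "NOTATION_PUBID", "NOTATION_BASEURI"),
    "ENTITY": ("RETURN", "ENTITY_TYPE", "ENTITY_NAME", "ENTITY_SYSID",
               "ENTITY_PUBID", "ENTITY_BASEURI", "ENTITY_NOTATION"),
    "ATTRIBUTE": ("RETURN", "ATTRIBUTE_NS", "ATTRIBUTE_LOCAL",
                  "ATTRIBUTE_VALUE"),
    "PI": ("RETURN", "PI_TARGET", "PI_CONTENT"),
    "NAMESPACE": ("RETURN", "NAMESPACE_PREFIX", "NAMESPACE_URI",
                  "NAMESPACE_VALUE"),
}

# Records made of consecutive groups; each group may hold any number of
# nodes of the listed base types (the RETURN group holds exactly one).
GROUPED_LAYOUTS = {
    "DOCUMENT": [{"RETURN"}, {"NOTATION"}, {"ENTITY"},
                 {"PI", "COMMENT", "ELEMENT"}],
    "ELEMENT": [{"RETURN"}, {"ATTRIBUTE"}, {"NAMESPACE"},
                {"ELEMENT", "PI", "SKIPPED", "CHARACTER", "COMMENT"}],
}


def record_is_valid(kind, base_types):
    """Check a sequence of base types against the layout for `kind`."""
    base_types = list(base_types)
    if kind in FIXED_LAYOUTS:
        return tuple(base_types) == FIXED_LAYOUTS[kind]
    groups = GROUPED_LAYOUTS[kind]
    if not base_types or base_types[0] != "RETURN":
        return False
    stage = 1  # index of the group currently being filled
    for base_type in base_types[1:]:
        # Move forward to the first group that accepts this base type;
        # groups are never revisited, which enforces the ordering.
        while stage < len(groups) and base_type not in groups[stage]:
            stage += 1
        if stage == len(groups):
            return False
    if kind == "DOCUMENT":
        # A document record holds exactly one ELEMENT node.
        return base_types.count("ELEMENT") == 1
    return True
```

For example, `record_is_valid("ELEMENT", ["RETURN", "ATTRIBUTE", "ELEMENT"])` holds, whereas placing the ATTRIBUTE node after the ELEMENT node does not, because attribute nodes precede the child content in an element record.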


Searching

All nodes except the RETURN nodes are searchable, i.e. they are stored using the primary storage type of Birdstep Database Engine™. This means that simple searches, such as a search for a character string enclosed by a known element, can be performed using a single query operation, and the IBXML data structure can be traversed easily using the implicit and explicit pointers.
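The idea behind the single-query search can be illustrated with a toy index. This sketch assumes nothing about how the Birdstep Database Engine™ actually indexes fields; the class `NodeIndex`, its methods and the integer addresses are hypothetical stand-ins that only show why a node carrying its enclosing tag, node type and contents can be found with one lookup.

```python
from collections import defaultdict


class NodeIndex:
    """Toy index keyed on (user type, base type, contents)."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, user_type, base_type, contents, address):
        # `address` is a made-up stand-in for a database address.
        self._by_key[(user_type, base_type, contents)].append(address)

    def find(self, user_type, base_type, contents):
        # One dictionary lookup answers the whole question.
        return list(self._by_key.get((user_type, base_type, contents), []))


index = NodeIndex()
index.add("title", "CHARACTER", "Selected Short Stories", 9)
index.add("title", "CHARACTER", "XML World", 17)

hits = index.find("title", "CHARACTER", "XML World")
misses = index.find("author", "CHARACTER", "XML World")
```

From a hit, the rest of the document would be reached by following the implicit and explicit pointers rather than by issuing further searches.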


IBXML components

IBXML consists of four components:

  1. IBXML database I/O component with an XML parser.

  2. IBXML SAX interface component.

  3. IBXML DOM interface component.

  4. IBXML XPath component.


IBXML low-level interface

The IBXML database I/O component interfaces with IBAPI and is responsible for all reading and writing of data in the database. This component plays multiple roles when interacting with the database depending on whether data is retrieved or stored (see Figure 3). Compared to the relationship between an application and an XML parser for XML documents stored as pure text, the IBXML database I/O component functions as an XML parser when retrieving an XML document stored in a Birdstep Database Engine™ database, and it plays the role of an application or an XML generator when storing XML documents.

Figure 3. Role of IBXML I/O components.

The low-level interface of IBXML provides access to operations on IBXML databases at the node level. Nodes may be created, modified, deleted and relinked. If used only for outputting information, the IBXML low-level interface performs the same set of operations on the data stored in an IBXML database as an XML parser performs on XML documents in plain text. Application developers may use the low-level interface in special circumstances, but the functions are not primarily intended for end-user applications.


XML parser

An XML parser is a software engine capable of tokenising a stream of characters into XML syntactic constructs and analysing that stream to determine whether it is well-formed and, possibly, whether it is valid. The parser is usually invoked by a user application or another software component on a per-document basis. Some XML parsers only require XML documents to be well-formed, while others are also able to validate XML documents. XML parsers also expose information about the XML document to the caller in different ways.

As XML parsers are able to manipulate any XML document, they are reusable in several software projects, and application developers are freed from the tedious work associated with the development of software capable of reading and writing data files.

Version 1.0 of IBXML uses the expat XML parser written by James Clark for parsing XML documents stored as text. expat is subject to the Mozilla Public License v1.1 or alternatively the GNU General Public License.

expat is not a validating parser, so IBXML will not be able to validate XML documents. expat only reads the internal DTD subset and will ignore external DTDs and entity references. IBXML behaves the same way both when using expat to parse XML documents stored as plain text, and when manipulating IBXML databases: no validation of an XML document is performed, and IBXML only requires that the XML document is well-formed.
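The non-validating behaviour of expat is easy to observe. The sketch below assumes the expat binding in the Python standard library rather than IBXML itself; the document string and the helper `parse_element_names` are made up for this illustration.

```python
import xml.parsers.expat

# The internal DTD subset allows only <magazine> elements inside <list>,
# but the body uses <book>: well-formed, yet not valid.
INVALID_BUT_WELL_FORMED = """<?xml version='1.0'?>
<!DOCTYPE list [
 <!ELEMENT list (magazine*)>
 <!ELEMENT magazine EMPTY>
]>
<list><book/></list>
"""


def parse_element_names(text):
    """Parse the text with expat and return element names in document order."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(text, True)  # raises ExpatError only if not well-formed
    return names


names = parse_element_names(INVALID_BUT_WELL_FORMED)
```

The parse completes without an error and reports the `<book>` element, because the element declarations in the internal subset are read but never enforced.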


The Simple API for XML (SAX)

The IBXML SAX interface component provides the SAX version 1.0 interface. It implements drivers for parsing XML documents stored as text and for retrieving XML documents from databases, as well as handlers for writing XML documents as text and for storing XML documents in databases.

SAX is an event-driven interface to the process of parsing XML documents. It is a public-domain API developed by individuals on the XML-DEV mailing list. SAX does not have a formal specification document; instead, it is defined by a public-domain implementation in the Java™ Programming Language. It has also been documented in [SUN99].

An event-driven interface provides a mechanism for notifying the application code as the underlying parser recognises XML syntactic constructions in the document. Using SAX in an application is quite simple and follows this procedure:

  1. The application instantiates a driver for the parser.

  2. The application registers handlers for all events which the application wants to receive.

  3. The application transfers control to the parser which parses a given input source and makes calls back to the application code.

  4. The application uses the call-backs from SAX to build an internal data structure.

  5. The application destroys the SAX parser object.

  6. The application uses the internally built data structures to manipulate the XML data.
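The six steps above can be sketched as follows. This example assumes the `xml.sax` module of the Python standard library, which follows the later SAX 2 interface (a `ContentHandler` instead of the SAX 1.0 `DocumentHandler`) and is unrelated to the IBXML C++ classes; the handler class `TitleCollector` is a hypothetical name.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Builds a list of (parent element name, title text) pairs."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._stack = []   # names of the currently open elements
        self._text = None  # collected character data while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._text = []
        self._stack.append(name)

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        self._stack.pop()
        if name == "title":
            parent = self._stack[-1] if self._stack else None
            self.titles.append((parent, "".join(self._text)))
            self._text = None


DOCUMENT = b"""<?xml version='1.0'?>
<list>
 <magazine frequency="weekly">
  <title>XML World</title>
 </magazine>
</list>
"""

parser = xml.sax.make_parser()        # 1. instantiate a driver
handler = TitleCollector()
parser.setContentHandler(handler)     # 2. register the handlers
parser.parse(io.BytesIO(DOCUMENT))    # 3./4. parse; call-backs build the list
del parser                            # 5. dispose of the parser object
titles = handler.titles               # 6. work on the internal structure
```

The handler has to keep its own stack of open elements, since each call-back only describes the construct that has just been recognised.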

The communication between SAX and parsers, XML databases and other XML readers is done using drivers, while the communication between SAX and applications, XML generators and other XML writers is done primarily using handlers. The IBXML SAX interface component includes drivers for parsing XML stored as plain text (using expat) and in IBXML databases, and it features handlers for storing data in the same two formats.

The SAX interface is implemented using C++ and thus sometimes deviates from the defining public domain Java™ implementation due to language differences. Work on SAX version 2.0 for C++ (SAX/C++ 2.0) has been started, and IBXML will be updated to SAX/C++ 2.0 when this API reaches a mature state.


DOM

For some classes of applications, using SAX or interfacing directly with an XML parser may be the ideal way to process XML documents. If the application is expected to handle XML documents with as little latency as possible, or to handle documents too large to fit in memory, it needs to process each event as it occurs in the document.

The problem with using SAX is that the application has to set up event handlers for all elements the application cares about and build its own data structures on the fly as the events occur. The SAX events come in a linear sequence, and it is not possible to peek at future data or retrieve old information. Dynamic modification is also not possible. Rather than responding to each event, it would be easier if the entire tree were already loaded into memory and it were possible to walk the tree and manipulate parts of it in a simple way.

Just as XML parsers in general, and SAX in particular, add a layer of abstraction over the actual textual representation of the XML document, the Document Object Model (DOM) adds a layer of abstraction on top of the entire document. DOM standardises the object model representing an XML document and defines a language- and platform-neutral interface to the structure and style of XML documents, which a process may dynamically access and update. Elements are considered as nodes in a tree instead of being composed of start- and end-tags. Nodes may have parents and children, and they may have internal properties which can be modified using objects and methods.

Figure 4. Sample XML document

<?xml version='1.0'?>
<!DOCTYPE bookstore>
<bookstore>
 <list>
  <book>
   <title>Selected Short Stories</title>
   <author>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
   </author>
  </book>
  <magazine frequency="weekly">
   <title>XML World</title>
  </magazine>
 </list>
</bookstore>
   

Figure 5 shows the DOM tree representing the sample XML document in Figure 4. Each circle, box and diamond in the figure is referred to as a node. The top-level <bookstore>-node is the root node of the tree. Note that attributes (the diamond in Figure 5) are represented by name only, i.e. both attribute name and attribute value belong to the same node.

Figure 5. DOM tree representing sample XML document in Figure 4

The DOM specification [W98] defines two parts of DOM level 1: DOM core and DOM HTML. DOM level 1 core contains all functions necessary to manipulate XML documents, whereas DOM level 1 HTML adds support for HTML documents.
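Walking and updating a tree through the DOM core interfaces can be sketched as follows. Since the IBXML DOM component is not described at the API level here, the example assumes `xml.dom.minidom` from the Python standard library as a stand-in DOM implementation; the helper `child_elements` is a hypothetical convenience function.

```python
from xml.dom import minidom

# The sample document from Figure 4 (without the DOCTYPE declaration).
BOOKSTORE = """<?xml version='1.0'?>
<bookstore>
 <list>
  <book>
   <title>Selected Short Stories</title>
   <author>
    <first-name>John</first-name>
    <last-name>Doe</last-name>
   </author>
  </book>
  <magazine frequency="weekly">
   <title>XML World</title>
  </magazine>
 </list>
</bookstore>
"""


def child_elements(node):
    """Return the element children of a node, skipping whitespace text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


document = minidom.parseString(BOOKSTORE)
root = document.documentElement                 # the <bookstore> node

# Walk down the tree: <bookstore> -> <list> -> its element children.
list_node = child_elements(root)[0]
entries = [element.tagName for element in child_elements(list_node)]

# Read and then modify an attribute through the node's methods.
magazine = root.getElementsByTagName("magazine")[0]
frequency = magazine.getAttribute("frequency")
magazine.setAttribute("frequency", "monthly")

# Character data is held in text nodes below the <title> elements.
titles = [title.firstChild.data
          for title in root.getElementsByTagName("title")]
```

Unlike the event-driven approach, the application can move freely between parents and children (for instance via `magazine.parentNode`) and can change the tree after it has been built.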

As of IBXML version 1.0, the IBXML DOM interface component has not been implemented. It will support DOM core level 1 and, at a later stage, DOM level 2. The IBXML storage model is close to the DOM, which means that IBXML does not have to load an entire XML document into memory before the user may access it. The DOM methods and objects can be accessed while the document is in the IBAPI buffer cache mechanism. This is a major difference from the approach necessary when the XML document is stored as a sequential chain of entities.

Bibliography

[BHL99] Tim Bray, Dave Hollander, and Andrew Layman, Namespaces in XML, January 1999, World Wide Web Consortium recommendation available at http://www.w3c.org/TR/1999/REC-xml-names-19990114 .

[BPSM98] Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, February 1998, World Wide Web Consortium recommendation available at http://www.w3c.org/TR/1998/REC-xml-19980210 .

[CM99] John Cowan and David Megginson, XML Information Set, December 1999, World Wide Web Consortium working draft available at http://www.w3.org/TR/1999/WD-xml-infoset-19991220 .

[SUN99] Java API for XML Parsing Specification Version 1.0, Sun Microsystems, November 16, 1999.

[W98] Lauren Wood, Steve Byrne, Mike Champion, Scott Isaacs, Ian Jacobs, Arnaud Le Hors, Gavin Nicol, Jonathan Robie, Robert Sutor, and Chris Wilson, Document Object Model (DOM) Level 1 Specification Version 1.0, October 1998, World Wide Web Consortium recommendation available at http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001 .