|
|
|
Henry S. Thompson |
|
HCRC Language Technology Group |
|
University of Edinburgh |
|
|
|
|
|
What are schemata, anyway? |
|
The nature of document structure |
|
Schema as contract |
|
Taking control of structure definition |
|
XML Schema: the activity |
|
The W3C and its WGs |
|
The Charter and Requirements |
|
The state of play |
|
The Draft RECs |
|
A detailed walkthrough |
|
Schemas and Layered Architecture |
|
|
|
|
|
Documents have structure |
|
Document types |
|
Document instances |
|
Structure can be defined |
|
Informally (D. S. D.) |
|
SGML DTD |
|
XML DTD |
|
Schema using XML |
|
|
|
|
|
SGML DTDs for D. S. D |
|
Sperberg-McQueen |
|
Others |
|
Considered for XML itself |
|
MCF, then RDF, now DCD, by Bray et al. |
|
XML-Data, two versions, now XML-Data reduced, by
Layman et al., then Frankston and Thompson |
|
SOX, from Veo Corp. |
|
XSchema, from an ad-hoc group of designers |
|
|
|
|
|
Two relations are constitutive |
|
Part-of |
|
Kind-of |
|
Existing DSD mechanisms use Content Models to
specify part-of relations |
|
But they only specify kind-of relations
implicitly or informally |
|
Making kind-of relations explicit would make
both understanding and maintenance easier |
|
|
|
|
|
Erik Naggum used to talk about SGML allowing
users to take control of their data |
|
XML allows the same move one level up, for
developers |
|
The starting point is much simpler |
|
The architecture is congenial |
|
The demand is there |
|
We need to do this, to make the transition to
validation easier |
|
|
|
|
|
A D. S. D. is a contract between producers and
consumers |
|
It provides a guaranteed interface |
|
Producers validate to ensure they are providing
what they promised |
|
Consumers validate to check up on producers |
|
and to protect their applications |
|
Application authors validate to simplify their
task |
|
Leave error detection and analysis to the
validating parser |
|
|
|
|
|
The Schema DTD is expressed in vanilla XML |
|
Top level elements for declaring |
|
Elements :-) |
|
Types |
|
Notations |
|
. . . |
|
Subordinate element types for declaring |
|
Attributes |
|
Content models |
|
. . . |
|
|
|
|
|
SGML and XML 1.0 talk about element types |
|
XML Schema to date has been more casual and just
talked about elements |
|
Meaning either an element in an instance |
|
Or the abstraction which is described in a DTD
or Schema |
|
Further confused by XML Schema making extensive
use of type |
|
Also, schema means many different things to
different people |
|
I'll try always to say/write XML Schema. . . |
|
|
|
|
<!ELEMENT text (#PCDATA|emph|name)*> |
|
<!ATTLIST text
timestamp NMTOKEN
#REQUIRED> |
|
|
|
<element name="text"> |
|
<type
content="mixed"> |
|
<element ref="emph"/> |
|
<element ref="name"/> |
|
<attribute name="timestamp"
type="date"
minOccurs="1"/>
</type>
</element> |
|
|
|
|
A document or an application or a user
identifies a schema |
|
Each is well-formed XML |
|
The schema is valid w.r.t. the Schema DTD |
|
The document is schema-valid w.r.t. the schema |
|
The schema is schema-valid w.r.t. the schema for
schemas |
|
|
|
|
|
An XML application (XSP) which schema-validates |
|
‘Takes control’ because changing how schemata
work means |
|
changing the Schema DTD/schema for schemas |
|
upgrading XSP accordingly |
|
not changing XML itself |
|
|
|
|
XML Schema hopes to be a W3C Recommendation |
|
The W3C is the World Wide Web Consortium, a
voluntary association of companies and non-profit organisations. Membership
costs serious money and confers voting rights. The procedures are complex, with the Director (Tim Berners-Lee)
holding all the high cards, but the big vendors (e.g. Microsoft, Adobe,
Netscape) have a lot of power. |
|
|
|
|
The XML recommendation was written by the W3C’s
XML Working Group |
|
Which split itself into pieces, of which one is
the XML Schema WG |
|
Chartered in the autumn of 1998 |
|
Requirements document out in February of 1999 |
|
Due to go to Last Call early in 2000 |
|
|
|
|
Full of good and hopeful requirements |
|
DTDs and more |
|
Support inheritance |
|
Data-friendly |
|
Good inventory of primitive datatypes |
|
|
|
|
|
|
Two component documents |
|
Structures |
|
Datatypes |
|
Three public working drafts so far |
|
May 1999 |
|
September 1999 |
|
November 1999 |
|
Further (near-final) PWD out December 1999 |
|
http://www.w3.org/TR/xmlschema-1/ |
|
[contains pointers to previous drafts] |
|
|
|
|
|
Validity and well-formedness are XML 1.0
concepts |
|
They are defined over character sequences |
|
Namespace-compliant is a Namespace concept |
|
It's defined over character sequences too |
|
Schema-validity is the XML Schema concept |
|
It is defined over XML document Infosets |
|
So the whole XML Schema exercise is predicated
on and layered on top of XML 1.0 well-formedness plus Namespaces |
|
Because they are constitutive of the Infoset |
|
|
|
|
|
The XML 1.0 plus Namespaces abstract data model |
|
Defines a modest number of information items |
|
Element, attribute, namespace declaration, ... |
|
Each has required and optional properties |
|
Name, children, … |
|
|
|
|
|
It's not the DOM |
|
Much higher level |
|
It's not about implementation or interfacing at
all |
|
But you can think of it as a data structure if
that helps |
|
It's not an SGML property set/grove |
|
But it's close |
|
It doesn't have the entity problem |
|
a mixed blessing, as we will see |
|
|
|
|
|
So crucially, schemas are about infosets, not
character sequences |
|
You could schema-validate a DOM tree you built
by hand! |
|
Using a schema which exists only as a DOM tree
ditto |
|
This simplifies things tremendously |
|
but is hard to get your head around at first |
|
|
|
|
Syntax is not the Schema |
|
Namespaces are fundamental |
|
But a schema is not a namespace |
|
Separation of tag from type |
|
Simple and Complex types |
|
Modular Schema construction |
|
Powerful type construction |
|
Local tag-type association |
|
Powerful wildcards |
|
Element equivalence classes |
|
Extension mechanism |
|
Documentation mechanism |
|
|
|
|
A Toy Purchase Order schema |
|
|
|
|
For purposes of discussion, consider only the
content type aspects of types
(attributes are analogous) |
|
A content type definition (simple or complex)
consists of a set of constraints on
what's allowed as content. |
|
|
|
|
|
You can think of the type itself as the set of strings/EIIs its constraints
allow. It's helpful to think of
constraints as composed of obligations and permissions: |
|
(\d )?(\d{3}-)?\d{3}-\d{4} |
|
regexp definition facet for [US] 'phone number
type |
|
the ? and the \d can be seen as permissions, the - and the {3} as obligations |
|
1 337-6818 and 207-422-6240 belong to this type |
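A minimal Python sketch of membership in this type. It assumes the pattern reads `(\d )?(\d{3}-)?\d{3}-\d{4}`, with a literal space after the optional leading digit, and it uses Python's `re` module as a stand-in for the draft's regexp facet (the two regex dialects may differ in detail). The name `is_us_phone` and the two non-member strings are chosen here purely for illustration.

```python
import re

# Assumed reading of the slide's pattern: an optional "digit + space" prefix,
# an optional three-digit area code with hyphen, then NNN-NNNN.
US_PHONE = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")


def is_us_phone(s: str) -> bool:
    """True if the whole string is allowed by the pattern's constraints."""
    return US_PHONE.fullmatch(s) is not None


if __name__ == "__main__":
    # The first two strings come from the slide; the last two are illustrative.
    for candidate in ["1 337-6818", "207-422-6240", "337-68180", "422 6240"]:
        print(candidate, is_us_phone(candidate))
```

The optional groups correspond to the permissions; the hyphen and the fixed digit counts are the obligations every member must meet.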
|
|
|
|
|
(title?,forename*,surname) |
|
(shorthand for) content model for name |
|
the ? can be seen as permission, the , and the 'surname' as obligations (at the end of the day, each
component involves both permission AND obligation, but the balance of impact is as suggested) |
|
|
|
|
|
(title?,forename*,surname) |
|
<name>
<forename>...</forename>
<surname>...</surname>
</name> |
|
and |
|
<name>
<title>...</title>
<surname>...</surname>
</name> |
|
are both members of this type |
|
|
|
|
|
A type definition may be a restriction of
another type's definition if it reduces permissions, sometimes to the point
of inducing obligations: |
|
\d[01]\d-\d{3}-\d{4} |
|
(a restriction of the US phone number pattern (\d )?(\d{3}-)?\d{3}-\d{4}) |
|
The membership of this type, which includes |
|
207-422-6240 but not 1 337-6818 |
|
is a (proper) subset of the membership of the
original type, |
|
because by construction every member of the new
type is a member of the original. |
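The subset relation can be spot-checked in Python. This is only a sketch over a handful of sample strings, so it illustrates the claim rather than proving it. It assumes the original pattern contains a literal space after the optional leading digit; apart from the two strings quoted on the slide, the samples are made up for illustration.

```python
import re

# Assumed reading of the original pattern (literal space after the first digit).
ORIGINAL = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")
# The restriction: the area code is now obligatory and its middle digit is 0 or 1.
RESTRICTED = re.compile(r"\d[01]\d-\d{3}-\d{4}")

# Illustrative sample strings; only the first two appear on the slides.
SAMPLES = [
    "1 337-6818",
    "207-422-6240",
    "337-6818",
    "1 207-422-6240",
    "212-555-0100",
    "234-555-0100",
]


def members(pattern, candidates):
    """Return the candidates that the pattern accepts in full."""
    return {c for c in candidates if pattern.fullmatch(c)}


original_members = members(ORIGINAL, SAMPLES)
restricted_members = members(RESTRICTED, SAMPLES)

if __name__ == "__main__":
    print(sorted(restricted_members))
    print(restricted_members < original_members)  # proper subset on this sample
```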
|
|
|
|
|
Similarly, |
|
(forename+,surname) |
|
is a restriction of the original type definition
for name |
|
(title?,forename*,surname) |
|
and the same relation holds. |
|
|
|
|
|
Note first that |
|
(forename+,surname) |
|
<name>
<forename>...</forename>
<surname>...</surname>
</name> |
|
is a member of the new type, but |
|
<name>
<title>...</title>
<surname>...</surname>
</name> |
|
is not. |
|
|
|
|
|
Now consider |
|
(title?, forename*, surname,
genMark?) |
|
This type extends the original type definition for name. |
|
<name>
<forename>Al</forename>
<surname>Gore</surname>
<genMark>Jr</genMark>
</name> |
|
is an instance of this new type, but not of the
original. |
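The three content models for name can be compared with a small Python sketch. It is not how a schema processor works: each content model is rewritten by hand as a regular expression over the sequence of child element names, and only those names are checked (text and attributes are ignored). The dictionary keys and the helper `allowed_by` are hypothetical names introduced here.

```python
import re
import xml.etree.ElementTree as ET

# Each content model, hand-translated to a regex over space-terminated child names.
MODELS = {
    "original": r"(title )?(forename )*(surname )",
    "restricted": r"(forename )+(surname )",
    "extended": r"(title )?(forename )*(surname )(genMark )?",
}


def allowed_by(model: str, xml_text: str) -> bool:
    """Check the child element sequence of the root against one content model."""
    element = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(MODELS[model], child_names) is not None


FORENAME_SURNAME = "<name><forename>...</forename><surname>...</surname></name>"
TITLE_SURNAME = "<name><title>...</title><surname>...</surname></name>"
AL_GORE = (
    "<name><forename>Al</forename><surname>Gore</surname>"
    "<genMark>Jr</genMark></name>"
)

if __name__ == "__main__":
    for label, doc in [("forename+surname", FORENAME_SURNAME),
                       ("title+surname", TITLE_SURNAME),
                       ("with genMark", AL_GORE)]:
        print(label, {model: allowed_by(model, doc) for model in MODELS})
```

On these three instances, everything the restricted model accepts is also accepted by the original, and everything the original accepts is also accepted by the extended model.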
|
|
|
|
Finally note that the <any/> content model
particle, in all of its forms, introduces particularly broad permissions
into complex content types. |
|
|
|
|
A number of design decisions can now be stated: |
|
Should we make it easy to construct type
definitions which restrict or extend other type definitions, by specifying
only the method of derivation and the differences between the source and
derived type definitions? |
|
The new proposal says 'yes', you do this by
using the "source" and "derivedBy" attributes on your <type>
or <datatype> element. |
|
|
|
|
|
Consider the simple type case first: |
|
<datatype name='bodytemp'
source='decimal'>
<precision
value='4'/>
<scale
value='1'/>
<minInclusive
value='97.0'/>
<maxInclusive
value='105.0'/>
</datatype> |
|
|
|
|
|
<datatype name='healthyBodytemp'
source='bodytemp'>
<maxInclusive
value='99.5'/>
</datatype> |
|
The healthyBodytemp type definition is defined
by closing down the permitted range of bodytemp. We say it 'inherits' the other facets of bodytemp, so the
'effective type definition' of healthyBodytemp is |
|
|
|
|
|
<datatype name='healthyBodytemp'
source='decimal'>
<precision
value='4'/>
<scale
value='1'/>
<minInclusive
value='97.0'/>
<maxInclusive
value='99.5'/>
</datatype> |
|
Since it doesn't in general make sense to extend
one simple type by another, the "derivedBy" attribute is actually
redundant for <datatype>. |
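A rough Python sketch of how the effective facets could be computed by following the "source" chain. The dictionary layout and the function names are assumptions made for illustration, not part of the draft. The sketch also assumes that precision bounds the total number of digits and scale bounds the number of fraction digits, and it does not check that a derived datatype really narrows its source.

```python
from decimal import Decimal

# Facets as written on the slides; a derived datatype lists only what it changes.
DATATYPES = {
    "bodytemp": {
        "source": "decimal",
        "facets": {"precision": 4, "scale": 1,
                   "minInclusive": Decimal("97.0"),
                   "maxInclusive": Decimal("105.0")},
    },
    "healthyBodytemp": {
        "source": "bodytemp",
        "facets": {"maxInclusive": Decimal("99.5")},
    },
}


def effective_facets(name: str) -> dict:
    """Collect facets along the source chain; the nearest definition wins."""
    if name not in DATATYPES:  # a built-in such as 'decimal': no facets modelled
        return {}
    definition = DATATYPES[name]
    facets = effective_facets(definition["source"])
    facets.update(definition["facets"])
    return facets


def is_member(name: str, lexical: str) -> bool:
    """Check a finite decimal literal against the effective facets of a datatype."""
    value = Decimal(lexical)
    facets = effective_facets(name)
    parts = value.as_tuple()
    fraction_digits = max(0, -parts.exponent)
    total_digits = max(len(parts.digits), fraction_digits)
    return (total_digits <= facets["precision"]
            and fraction_digits <= facets["scale"]
            and facets["minInclusive"] <= value <= facets["maxInclusive"])


if __name__ == "__main__":
    print(effective_facets("healthyBodytemp"))
    print(is_member("healthyBodytemp", "98.6"), is_member("bodytemp", "101.0"))
```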
|
|
|
|
|
The next simplest case is extension for complex
types: |
|
<type name='name'>
<element name='title'
minOccurs='0'/>
<element name='forename'
minOccurs='0'
maxOccurs='*'/>
<element name='surname'/>
</type> |
|
|
|
|
|
<type name='fullName'
source='name'
derivedBy='extension'>
<element name='genMark'
minOccurs='0'/>
</type> |
|
|
|
|
|
<type name='fullName'>
<element name='title'
minOccurs='0'/>
<element name='forename'
minOccurs='0'
maxOccurs='*'/>
<element name='surname'/>
<element name='genMark'
minOccurs='0'/>
</type> |
|
|
|
|
Restriction for complex types is harder to
handle syntactically, because of the significance of linear order in
content models, but the semantics are completely parallel to the simple
type case: |
|
|
|
|
|
<type name='simpleName'
source='name'
derivedBy='restriction'>
<restrictions>
<element name='title'
maxOccurs='0'/>
<element
name='forename'
minOccurs='1'/>
</restrictions>
</type> |
|
|
|
|
Just as in the <datatype> case, the
content model aspects not mentioned are left alone, including the "maxOccurs='*'"
on <forename> and the whole particle for <surname>, so the
'effective content model' of
'simpleName' is |
|
|
|
|
|
<type name='simpleName'>
<element name='title'
maxOccurs='0'
minOccurs='0'/>
<!-- i.e. forbidden
-->
<element
name='forename'
minOccurs='1'
maxOccurs='*'/>
<element
name='surname'/>
</type> |
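The two derivation methods can be sketched in Python as operations on a list of particles. This is only an illustration under stated assumptions: a particle is modelled as an (element name, minOccurs, maxOccurs) tuple, unspecified bounds are assumed to default to 1, `'*'` stands for unbounded, and the function names are invented here. The sketch does not verify that a restriction actually narrows the source.

```python
# A particle is (element name, minOccurs, maxOccurs); '*' means unbounded.
# Bounds not written on the slide are assumed to default to 1.
NAME = [("title", 0, 1), ("forename", 0, "*"), ("surname", 1, 1)]


def derive_by_extension(source, added):
    """Extension appends the new particles after the source content model."""
    return list(source) + list(added)


def derive_by_restriction(source, restrictions):
    """Restriction overrides only the occurrence bounds it mentions."""
    result = []
    for name, min_occurs, max_occurs in source:
        changes = restrictions.get(name, {})
        result.append((name,
                       changes.get("minOccurs", min_occurs),
                       changes.get("maxOccurs", max_occurs)))
    return result


FULL_NAME = derive_by_extension(NAME, [("genMark", 0, 1)])
SIMPLE_NAME = derive_by_restriction(
    NAME, {"title": {"maxOccurs": 0}, "forename": {"minOccurs": 1}})

if __name__ == "__main__":
    print(FULL_NAME)
    print(SIMPLE_NAME)
```

Particles the restriction does not mention, such as surname, pass through unchanged, which is the 'left alone' behaviour described for the effective content model.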
|
|
|
|
|
Given all the example definitions above, all of |
|
<name>
<title>Ms</title>
<surname>Steinem</surname>
</name> |
|
<name xsi:type='simpleName'>
<forename>Harry</forename>
<forename>S</forename>
<surname>Truman</surname>
</name> |
|
|
|
|
|
<name xsi:type='fullName'>
<forename>Al</forename>
<surname>Gore</surname>
<genMark>Jr</genMark>
</name> |
|
all would be schema-valid per |
|
<element name='name' type='name'/> |
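A hedged Python sketch of why all three instances pass: the element is declared with type 'name', and an instance may name a derived type via xsi:type, which is then used for checking provided it derives from the declared type. Everything here is a simplification for illustration. Content models are hand-written regexes over child element names, xsi:type values are treated as plain names rather than QNames, and the namespace URI bound to `xsi` is an assumption, so substitute whatever the draft you target specifies.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI for the xsi prefix (illustrative only).
XSI = "http://www.w3.org/1999/XMLSchema-instance"

# type name -> (source type or None, regex over space-terminated child names)
TYPES = {
    "name": (None, r"(title )?(forename )*surname "),
    "fullName": ("name", r"(title )?(forename )*surname (genMark )?"),
    "simpleName": ("name", r"(forename )+surname "),
}


def derives_from(type_name, ancestor):
    """Walk the source chain looking for the ancestor type."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = TYPES[type_name][0]
    return False


def schema_valid(xml_text, declared_type):
    """Check the root element against its xsi:type, or the declared type if absent."""
    element = ET.fromstring(xml_text)
    actual_type = element.get("{%s}type" % XSI, declared_type)
    if actual_type not in TYPES or not derives_from(actual_type, declared_type):
        return False
    children = "".join(child.tag + " " for child in element)
    return re.fullmatch(TYPES[actual_type][1], children) is not None


STEINEM = "<name><title>Ms</title><surname>Steinem</surname></name>"
TRUMAN = (f'<name xmlns:xsi="{XSI}" xsi:type="simpleName">'
          "<forename>Harry</forename><forename>S</forename>"
          "<surname>Truman</surname></name>")
GORE = (f'<name xmlns:xsi="{XSI}" xsi:type="fullName">'
        "<forename>Al</forename><surname>Gore</surname>"
        "<genMark>Jr</genMark></name>")
GORE_UNTYPED = ("<name><forename>Al</forename><surname>Gore</surname>"
                "<genMark>Jr</genMark></name>")

if __name__ == "__main__":
    for doc in (STEINEM, TRUMAN, GORE, GORE_UNTYPED):
        print(schema_valid(doc, "name"))
```

Without the xsi:type attribute, the genMark instance is checked against 'name' itself and fails, which is why the attribute matters for extended types.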
|
|
|
|
|
Like I said |
|
A schema is not a namespace |
|
The connection cannot be made rigid |
|
The draft identifies three layers, first is |
|
schema-valid(EII,TypeName,ComponentSet) |
|
The TypeName is a (namespaceURI,NCName) pair |
|
The component set is made up of
(namespaceURI,NCName,component) triples |
|
|
|
|
Layer 2: transfer syntax |
|
Layer 3: web connections |
|
|
|
|
|
|
Let's look at the role of schemas in supporting
the layered architecture which is emerging all around us |
|
|
|
|
|
|
ASCII (ISO 646) solved a fundamental interchange
problem for flat text documents |
|
What bits encode what characters |
|
(For a pretty parochial definition of
'character') |
|
UNICODE/ISO 10646 extends that solution to the
whole world |
|
XML thought it was doing the same for simple
tree-structured documents |
|
The emphasis in the XML design was on
simplifying SGML to move it to the Web |
|
XML didn't touch SGML's architectural vision |
|
flexible linearisation/transfer syntax |
|
for tree-structured documents with internal
links |
|
|
|
|
|
It's a markup language used for annotating text |
|
It is concerned with logical structure |
|
to identify sections, titles, section headers,
chapters, paragraphs,… |
|
It is not concerned with appearance |
|
you say 'this is a subtitle'
not 'this is in bold, 14pt, centered' |
|
you say 'this is an example'
not 'this is in verbatim, indented by 5pts, ragged right' |
|
|
|
|
|
It's a markup language used for transferring
data |
|
It is concerned with data models |
|
to convert between application-appropriate and
transfer-appropriate forms |
|
It is not concerned with human beings |
|
It's produced and consumed by programs |
|
|
|
|
|
|
A slogan of Adam Bosworth |
|
I interpret it in two ways: |
|
At the client end |
|
Use XML plus XSL as the basis for what the user
sees on his/her screen |
|
Use XLinks from a master document to pull
together disparate sources of information |
|
At the server end |
|
Use XML as a uniform interface for any data
source onto the web |
|
Not just documents, but e.g. databases, process
control information, stock quotes |
|
|
|
|
|
<POORDERHDR>
<DATETIME qualifier="DOCUMENT">
<YEAR>1996</YEAR>
<MONTH>06</MONTH>
<DAY>30</DAY>
<HOUR>23</HOUR>
<MINUTE>59</MINUTE>
<SECOND>59</SECOND>
<SUBSECOND>0000</SUBSECOND>
<TIMEZONE>+0100</TIMEZONE>
</DATETIME>
<OPERAMT
qualifier="EXTENDED" type="T">
<VALUE>670000</VALUE>
<NUMOFDEC>2</NUMOFDEC>
<SIGN>+</SIGN>
<CURRENCY>USD</CURRENCY>
. . . |
|
|
|
|
|
|
The whole transfer syntax story just went meta,
that's what happened! |
|
XML has been a runaway success, on a much
greater scale than its designers anticipated |
|
Not for the reason they had hoped |
|
Because separation of form from content is right |
|
But for a reason they barely thought about |
|
Data must travel the web |
|
Tree structured documents are a useable transfer
syntax for just about anything |
|
So data-oriented web users think of XML as a
transfer mechanism for their data |
|
|
|
|
|
A W3C Note resulting from a meeting this August
(http://www.w3.org/TR/schema-arch) |
|
Signalled a widespread acceptance of layering: |
|
"XML has defined a transfer syntax for
tree-structured documents; |
|
"Many data-oriented applications are being
defined which build their own data structures on top of an XML document
layer, effectively using XML documents as a transfer mechanism for
structured data; " |
|
|
|
|
|
Called for support in XML Schema for specifying
mapping between the XML document data model (or XML Infoset) and
application-specific data models |
|
XML Schema is a W3C recommendation-in-progress
for defining the structure of document families |
|
A grammar for markup structure |
|
E.g. |
|
article -> title, subtitle?, section+ |
|
or |
|
POORDERHDR -> DATETIME, OPERAMT |
|
|
|
|
|
Fortunately, XML Schema is actually notated in
XML itself |
|
So there are elements defined for use in schemas
to define. . . |
|
Elements :-) |
|
Attributes |
|
Types |
|
A type is a collection of constraints on element
content and attribute values |
|
A type may be either |
|
simple, for constraining string values |
|
complex, for constraining elements which contain
other elements |
|
|
|
|
<type name='personName'>
<element name='title'
minOccurs='0'/>
<element name='forename'
minOccurs='0'
maxOccurs='*'/>
<element
name='surname'/>
<attribute name='id'
type='integer'/>
</type> |
|
|
|
<element name='owner'
type='personName'/> |
|
|
|
|
|
|
|
|
We can think of this in two ways |
|
In terms of an abstract data modelling language |
|
Entity-Relation |
|
UML |
|
RDF |
|
In concrete implementation terms |
|
Tables and rows |
|
Class instances and instance variables |
|
The first is more portable |
|
The second more immediately useful |
|
|
|
|
|
|
Regardless of what approach we take, we need |
|
A vocabulary of data model components |
|
An attachment of that vocabulary to schema
components |
|
Sample vocabularies |
|
entity, relationship, collection |
|
table, row, column |
|
instance, variable, list, dictionary |
|
Where should attachment be specified? |
|
In the schema |
|
convenient |
|
Outside it |
|
modular |
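As one possible illustration of an attachment specified outside the schema, in the 'class instances and instance variables' vocabulary: the Python sketch below maps an owner element of the personName type onto a class instance through a separate mapping table. The class, the mapping table layout, the function name, and the sample instance are all hypothetical, and none of this is a facility of XML Schema itself.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonName:
    surname: str
    title: Optional[str] = None
    forenames: List[str] = field(default_factory=list)
    id: Optional[int] = None


# The attachment, kept outside the schema:
# child element name -> (instance variable, collects a list?)
PERSON_NAME_MAPPING = {
    "title": ("title", False),
    "forename": ("forenames", True),
    "surname": ("surname", False),
}


def to_person_name(element: ET.Element) -> PersonName:
    """Build a PersonName from an element already known to be schema-valid."""
    values = {"forenames": []}
    for child in element:
        target, is_list = PERSON_NAME_MAPPING[child.tag]
        if is_list:
            values[target].append(child.text or "")
        else:
            values[target] = child.text or ""
    if "id" in element.attrib:
        values["id"] = int(element.attrib["id"])
    return PersonName(**values)


# Made-up instance for illustration.
owner = ET.fromstring(
    "<owner id='7'><forename>Al</forename><surname>Gore</surname></owner>")
person = to_person_name(owner)

if __name__ == "__main__":
    print(person)
```

Keeping the table separate from the schema is the modular option; the price is that the element names are repeated in two places.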
|
|
|
|
Probably reasonable if done in high-level (ER,
UML) terms |
|
See example infoset-xmpl.xml, infoset-uml.xsd |
|
|
|
|
Requires some duplication of structural
information |
|
Encourages cross-language working |
|
See example infoset-xmpl.xsl |
|
|
|
|
The point at which idiosyncratic scripting takes
over can be moved one layer up |
|
Using public consensual declarative standards is
a Good Thing |
|
Interoperability makes things better for
everyone |
|
|
|
|
"Schemas are coming: Start using them!" |
|
____Tim Berners-Lee, 1999-11-05 |
|