XML Schema:
An Intensive One-Day Tutorial
Henry S. Thompson
HCRC Language Technology Group
University of Edinburgh
Overview
What are schemata, anyway?
The nature of document structure
Schema as contract
Taking control of structure definition
XML Schema:  the activity
The W3C and its WGs
The Charter and Requirements
The state of play
The Draft RECs
A detailed walkthrough
Schemas and Layered Architecture
Terminology
Documents have structure
Document types
Document instances
Structure can be defined
Informally (D. S. D.)
SGML DTD
XML DTD
Schema using XML
Background
SGML DTDs for D. S. D.
Sperberg-McQueen
Others
Considered for XML itself
MCF, then RDF, now DCD, by Bray et al.
XML-Data, two versions, now XML-Data reduced, by Layman et al., then Frankston and Thompson
SOX, from Veo Corp.
XSchema, from an ad-hoc group of designers
Document Structure
Two relations are constitutive
Part-of
Kind-of
Existing D. S. D. mechanisms use Content Models to specify part-of relations
But they only specify kind-of relations implicitly or informally
Making kind-of relations explicit would make both understanding and maintenance easier
Taking Control of D. S. D.
Erik Naggum used to talk about SGML allowing users to take control of their data
XML allows the same move one level up, for developers
The starting point is much simpler
The architecture is congenial
The demand is there
We need to do this, to make the transition to validation easier
Why validate?
A D. S. D. is a contract between producers and consumers
It provides a guaranteed interface
Producers validate to ensure they are providing what they promised
Consumers validate to check up on producers
and to protect their applications
Application authors validate to simplify their task
Leave error detection and analysis to the validating parser
Reconstructing DTDs
The Schema DTD is expressed in vanilla XML
Top level elements for declaring
Elements :-)
Types
Notations
. . .
Subordinate element types for declaring
Attributes
Content models
. . .
An aside about terminology
SGML and XML 1.0 talk about element types
XML Schema to date has been more casual and just talked about elements
Meaning either an element in an instance
Or the abstraction which is described in a DTD or Schema
Further confused by XML Schema making extensive use of type
Also, schema means many different things to different people
I'll try always to say/write XML Schema. . .
A simple example
<!ELEMENT text (#PCDATA|emph|name)*>
<!ATTLIST text
        timestamp NMTOKEN #REQUIRED>
<element name="text">
 <type content="mixed">
  <element ref="emph"/>
  <element ref="name"/>
  <attribute name="timestamp"
             type="date"
             minOccurs="1"/>
 </type>
</element>
The Schema Architecture:  Static
A document or an application or a user identifies a schema
Each is well-formed XML
The schema is valid w.r.t. the Schema DTD
The document is schema-valid w.r.t. the schema
The schema is schema-valid w.r.t. the schema for schemas
The Schema Architecture:  Dynamic
An XML application (XSP) which schema-validates
‘Takes control’ because changing how schemata work means
changing the Schema DTD/schema for schemas
upgrading XSP accordingly
not changing XML itself
The W3C
XML Schema hopes to be a W3C Recommendation
The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organisations. Membership costs serious money and confers voting rights. The procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.
. . . and its WGs
The XML recommendation was written by the W3C’s XML Working Group
Which split itself into pieces, of which one is the XML Schema WG
Chartered in the autumn of 1998
Requirements document out in February of 1999
Due to go to Last Call early in 2000
Requirements document
Full of good and hopeful requirements
DTDs and more
Support inheritance
Data-friendly
Good inventory of primitive datatypes
The state of play
Two component documents
Structures
Datatypes
Three public working drafts so far
May 1999
September 1999
November 1999
Further (near-final) PWD out December 1999
http://www.w3.org/TR/xmlschema-1/
[contains pointers to previous drafts]
The XML Schema worldview
Validity and well-formedness are XML 1.0 concepts
They are defined over character sequences
Namespace-compliant is a Namespace concept
It's defined over character sequences too
Schema-validity is the XML Schema concept
It is defined over XML document Infosets
So the whole XML Schema exercise is predicated on and layered on top of XML 1.0 well-formedness plus Namespaces
Because they are constitutive of the Infoset
What's the Infoset?
The XML 1.0 plus Namespaces abstract data model
Defines a modest number of information items
Element, attribute, namespace declaration, ...
Each has required and optional properties
Name, children, …
What the Infoset isn't
It's not the DOM
Much higher level
It's not about implementation or interfacing at all
But you can think of it as a data structure if that helps
It's not an SGML property set/grove
But it's close
It doesn't have the entity problem
a mixed blessing, as we will see
The Schema and the Infoset
So crucially, schemas are about infosets, not character sequences
You could schema-validate a DOM tree you built by hand!
Using a schema which exists only as a DOM tree ditto
This simplifies things tremendously
but is hard to get your head around at first
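A minimal Python sketch of the 'DOM tree built by hand' point. No XML text is parsed anywhere; the tree is constructed node by node and a constraint is checked against it. The constraint used here (last child must be surname) and the helper names are made up for illustration, not taken from the drafts.

```python
from xml.dom.minidom import getDOMImplementation

# Build the tree directly; no character sequence is ever parsed.
document = getDOMImplementation().createDocument(None, "name", None)
root = document.documentElement
for tag, text in [("forename", "Al"), ("surname", "Gore")]:
    child = document.createElement(tag)
    child.appendChild(document.createTextNode(text))
    root.appendChild(child)

def child_element_names(element):
    """Names of the element children, in document order."""
    return [node.tagName for node in element.childNodes
            if node.nodeType == node.ELEMENT_NODE]

def ends_with_surname(element) -> bool:
    """A toy constraint, checked against the tree and not against characters."""
    names = child_element_names(element)
    return bool(names) and names[-1] == "surname"

print(child_element_names(root))
print(ends_with_surname(root))
```

The same check would work unchanged on a tree obtained from a parser, which is the sense in which validation is about information items rather than their serialisation.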
Basic XML Schema concepts
Syntax is not the Schema
Namespaces are fundamental
But a schema is not a namespace
Separation of tag from type
Simple and Complex types
Modular Schema construction
Powerful type construction
Local tag-type association
Powerful wildcards
Element equivalence classes
Extension mechanism
Documentation mechanism
Schema Walkthrough 1
A Toy Purchase Order schema
Types and Type Derivation
For purposes of discussion, consider only the content type aspects of types (attributes are analogous)
A content type definition (simple or complex) consists of a set of constraints on what's allowed as content.
Permissions and obligations
You can think of the type itself as the set of strings or element information items (EIIs) its constraints allow. It's helpful to think of constraints as composed of obligations and permissions:
 (\d )?(\d{3}-)?\d{3}-\d{4}
regexp definition facet for [US] 'phone number type
 the ? and the \d can be seen as permissions, the - and the {3} as obligations
 1 337-6818 and 207-422-6240 belong to this type
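A minimal Python sketch of the membership test for this type. It assumes the facet must match the whole string (hence fullmatch); the helper name is_us_phone is made up for illustration.

```python
import re

# The regexp facet from the slide; fullmatch() anchors it to the whole string.
US_PHONE = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")

def is_us_phone(text: str) -> bool:
    """Return True if text is a member of the phone number type."""
    return US_PHONE.fullmatch(text) is not None

print(is_us_phone("1 337-6818"))    # uses the optional leading digit
print(is_us_phone("207-422-6240"))  # uses the optional area code
print(is_us_phone("337 6818"))      # misses the obligatory '-'
```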
Complex types
  (title?,forename*,surname)
  (shorthand for) content model for name
  the ? can be seen as permission, the , and the 'surname' as obligations (at the end of the day, each component involves both permission AND obligation, but the balance of impact is as suggested)
Complex types, cont'd
 (title?,forename*,surname)
  <name>
   <forename>...</forename>
   <surname>...</surname>
  </name>
   and
  <name>
   <title>...</title>
   <surname>...</surname>
  </name>
are both members of this type
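A rough Python sketch of checking membership in this complex type. It assumes the content model can be treated as a regexp over the sequence of child element names, and it ignores text content and attributes; the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# (title?,forename*,surname) rewritten as a regexp over child names,
# each name followed by a space so that names cannot run together.
NAME_MODEL = re.compile(r"(title )?(forename )*surname ")

def matches_name_model(xml_text: str) -> bool:
    """Check the sequence of child element names against the model."""
    element = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in element)
    return NAME_MODEL.fullmatch(child_names) is not None

first = "<name><forename>...</forename><surname>...</surname></name>"
second = "<name><title>...</title><surname>...</surname></name>"
no_surname = "<name><title>...</title></name>"

print(matches_name_model(first), matches_name_model(second))
print(matches_name_model(no_surname))  # the surname obligation is not met
```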
Restriction
A type definition may be a restriction of another type's definition if it reduces permissions, sometimes to the point of inducing obligations:
 \d[01]\d-\d{3}-\d{4}
 (a restriction of the US p# type, (\d )?(\d{3}-)?\d{3}-\d{4})
The membership of this type, which includes
  207-422-6240 but not 1 337-6818
is a (proper) subset of the membership of the original type,
because by construction every member of the new type is a member of the original.
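The subset claim can be spot-checked in Python. This is only a check over a handful of sample strings chosen for illustration, not a proof; whole-string matching is assumed.

```python
import re

ORIGINAL = re.compile(r"(\d )?(\d{3}-)?\d{3}-\d{4}")
RESTRICTED = re.compile(r"\d[01]\d-\d{3}-\d{4}")

samples = ["207-422-6240", "1 337-6818", "337-6818", "234-567-8901", "hello"]

for s in samples:
    in_original = ORIGINAL.fullmatch(s) is not None
    in_restricted = RESTRICTED.fullmatch(s) is not None
    # Restriction only removes permissions, so it can never admit
    # a string that the original rejects.
    assert not in_restricted or in_original
    print(f"{s!r}: original={in_original} restricted={in_restricted}")
```

Strings such as 234-567-8901 belong to the original type only, which is what makes the subset a proper one.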
Restriction, cont'd
Similarly,
  (forename+,surname)
is a restriction of the original type definition for name
 (title?,forename*,surname)
and the same relation holds.
Restriction, cont'd
 Note first that
 (forename+,surname)
  <name>
   <forename>...</forename>
   <surname>...</surname>
  </name>
   is a member of the new type, but
  <name>
   <title>...</title>
   <surname>...</surname>
  </name>
  is not.
Extension
Now consider
  (title?, forename*, surname,
  genMark?)
 This type extends the original type definition for name.
  <name>
   <forename>Al</forename>
   <surname>Gore</surname>
   <genMark>Jr</genMark>
  </name>
is an instance of this new type, but not of the original.
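A small Python sketch comparing the original, restricted and extended content models side by side. It assumes each model can be written as a regexp over child element names; the dictionary keys and the member function are illustrative names only.

```python
import re

# Content models written as regexps over space-terminated child names.
MODELS = {
    "original":   r"(title )?(forename )*surname ",
    "restricted": r"(forename )+surname ",
    "extended":   r"(title )?(forename )*surname (genMark )?",
}

def member(model: str, children: list) -> bool:
    """True if the sequence of child names satisfies the named model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(MODELS[model], sequence) is not None

instances = {
    "forename, surname":          ["forename", "surname"],
    "title, surname":             ["title", "surname"],
    "forename, surname, genMark": ["forename", "surname", "genMark"],
}
for label, children in instances.items():
    print(label, {m: member(m, children) for m in MODELS})
```

Every member of the restricted model is a member of the original, and every member of the original is a member of the extended model, but not the other way round.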
Any
Finally note that the <any/> content model particle, in all of its forms, introduces particularly broad permissions into complex content types.
Where are we headed?
A number of design decisions can now be stated:
Should we make it easy to construct type definitions which restrict or extend other type definitions, by specifying only the method of derivation and the differences between the source and derived type definitions?
The new proposal says 'yes', you do this by using the "source" and "derivedBy" attributes on your <type> or <datatype> element.
Datatype example
Consider the simple type case first:
  <datatype name='bodytemp'
          source='decimal'>
   <precision value='4'/>
   <scale value='1'/>
   <minInclusive value='97.0'/>
   <maxInclusive value='105.0'/>
  </datatype>
Derived type
<datatype name='healthyBodytemp'
        source='bodytemp'>
   <maxInclusive value='99.5'/>
  </datatype>
The healthyBodytemp type definition is defined by closing down the permitted range of bodytemp.  We say it 'inherits' the other facets of bodytemp, so the 'effective type definition' of healthyBodytemp  is
Effective type
  <datatype name='healthyBodytemp'
           source='decimal'>
   <precision value='4'/>
   <scale value='1'/>
   <minInclusive value='97.0'/>
   <maxInclusive value='99.5'/>
  </datatype>
Since it doesn't in general make sense to extend one simple type by another, the "derivedBy" attribute is actually redundant for <datatype>.
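The facet inheritance can be sketched in Python as a dictionary merge. Assumptions, none of them taken from the draft text: precision is read as the maximum total number of digits, scale as the maximum number of fraction digits, and the function names are made up.

```python
from decimal import Decimal

BODYTEMP = {"precision": 4, "scale": 1,
            "minInclusive": Decimal("97.0"), "maxInclusive": Decimal("105.0")}

# The derived definition states only the facet that differs.
HEALTHY_DELTA = {"maxInclusive": Decimal("99.5")}

def effective(base: dict, delta: dict) -> dict:
    """Inherit every facet of base, overriding those named in delta."""
    return {**base, **delta}

def is_member(text: str, facets: dict) -> bool:
    """Check a plain decimal literal such as '98.6' against the facets."""
    value = Decimal(text)
    _sign, digits, exponent = value.as_tuple()
    fraction_digits = max(0, -exponent)
    return (len(digits) <= facets["precision"]
            and fraction_digits <= facets["scale"]
            and facets["minInclusive"] <= value <= facets["maxInclusive"])

HEALTHY = effective(BODYTEMP, HEALTHY_DELTA)
print(HEALTHY)
print(is_member("98.6", HEALTHY))
print(is_member("101.2", BODYTEMP), is_member("101.2", HEALTHY))
```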
Extension for complex types
The next simplest case is extension for complex types:
  <type name='name'>
   <element name='title'
            minOccurs='0'/>
   <element name='forename'
            minOccurs='0'
            maxOccurs='*'/>
   <element name='surname'/>
  </type>
Derived type
  <type name='fullName'
       source='name'
       derivedBy='extension'>
   <element name='genMark'
            minOccurs='0'/>
  </type>
The effective type
  <type name='fullName'>
   <element name='title'
            minOccurs='0'/>
   <element name='forename'
            minOccurs='0'
            maxOccurs='*'/>
   <element name='surname'/>
   <element name='genMark'
            minOccurs='0'/>
  </type>
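How the effective type falls out of extension can be sketched in Python. A particle is modelled here as a (name, minOccurs, maxOccurs) tuple; the defaults of 1 for unstated bounds and the use of None for '*' are assumptions of this sketch.

```python
# A particle is (element name, minOccurs, maxOccurs); None stands for '*'.
NAME = [("title", 0, 1), ("forename", 0, None), ("surname", 1, 1)]

def derive_by_extension(base, added):
    """Extension appends the new particles after the inherited ones."""
    return list(base) + list(added)

FULL_NAME = derive_by_extension(NAME, [("genMark", 0, 1)])
for particle in FULL_NAME:
    print(particle)
```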
Restriction for complex types
Restriction for complex types is harder to handle syntactically, because of the significance of linear order in content models, but the semantics are completely parallel to the simple type case:
Restriction example
<type name='simpleName'
       source='name'
       derivedBy='restriction'>
   <restrictions>
    <element name='title'
             maxOccurs='0'/>
    <element name='forename'
             minOccurs='1'/>
   </restrictions>
  </type>
Restriction and Inheritance
Just as in the <datatype> case, the content model aspects not mentioned are left alone, including the "maxOccurs='*'" on <forename> and the whole particle for <surname>, so the 'effective content model'  of 'simpleName' is
Effective type
  <type name='simpleName'>
   <element name='title'
            maxOccurs='0'
            minOccurs='0'/>
 <!-- i.e. forbidden -->
   <element name='forename'
            minOccurs='1'
            maxOccurs='*'/>
   <element name='surname'/>
  </type>
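The same idea for restriction, as a hedged Python sketch: each restriction names a particle and changes only the bounds it mentions. The tuple representation, the default bounds of 1 and the function name are assumptions made for illustration.

```python
# A particle is (element name, minOccurs, maxOccurs); None stands for '*'.
NAME = [("title", 0, 1), ("forename", 0, None), ("surname", 1, 1)]

# Each restriction names a particle and the occurrence bounds it changes.
RESTRICTIONS = {"title": {"maxOccurs": 0}, "forename": {"minOccurs": 1}}

def derive_by_restriction(base, restrictions):
    """Keep order and unmentioned bounds; override only what is stated."""
    effective = []
    for name, min_occurs, max_occurs in base:
        change = restrictions.get(name, {})
        effective.append((name,
                          change.get("minOccurs", min_occurs),
                          change.get("maxOccurs", max_occurs)))
    return effective

SIMPLE_NAME = derive_by_restriction(NAME, RESTRICTIONS)
for particle in SIMPLE_NAME:
    print(particle)
```

Keeping the base order fixed and overriding by name sidesteps the linear-order difficulty, at the cost of assuming each element name appears only once in the base content model.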
Instances
Given all the example definitions above, all of
<name>
<title>Ms</title>
<surname>Steinem</surname>
</name>
  <name xsi:type='simpleName'>
   <forename>Harry</forename>
   <forename>S</forename>
   <surname>Truman</surname>
  </name>
Another instance
  <name xsi:type='fullName'>
   <forename>Al</forename>
   <surname>Gore</surname>
   <genMark>Jr</genMark>
  </name>
all would be schema-valid per
 <element name='name' type='name'/>
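A rough Python sketch of how xsi:type selects the type to validate against. Assumptions: the instance namespace URI shown is the one used with the 1999 drafts, the xsi:type value is an unprefixed name, only one level of derivation is handled, and the effective content models are written as regexps over child names. None of the function or variable names come from the drafts.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the instance namespace URI used with the 1999 drafts.
XSI = "http://www.w3.org/1999/XMLSchema-instance"

# Effective content models from the slides, as regexps over child names.
TYPES = {
    "name":       r"(title )?(forename )*surname ",
    "simpleName": r"(forename )+surname ",
    "fullName":   r"(title )?(forename )*surname (genMark )?",
}
# Which type each derived type names as its source.
SOURCE = {"simpleName": "name", "fullName": "name"}

def schema_valid(xml_text: str, declared_type: str) -> bool:
    element = ET.fromstring(xml_text)
    actual = element.get(f"{{{XSI}}}type", declared_type)
    # An xsi:type must name the declared type or a type derived from it.
    if actual != declared_type and SOURCE.get(actual) != declared_type:
        return False
    children = "".join(child.tag + " " for child in element)
    return re.fullmatch(TYPES[actual], children) is not None

steinem = "<name><title>Ms</title><surname>Steinem</surname></name>"
truman = (f"<name xmlns:xsi='{XSI}' xsi:type='simpleName'>"
          "<forename>Harry</forename><forename>S</forename>"
          "<surname>Truman</surname></name>")
gore = (f"<name xmlns:xsi='{XSI}' xsi:type='fullName'>"
        "<forename>Al</forename><surname>Gore</surname>"
        "<genMark>Jr</genMark></name>")

for instance in (steinem, truman, gore):
    print(schema_valid(instance, "name"))
```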
Connecting Instances and Schemas
Like I said
A schema is not a namespace
The connection cannot be made rigid
The draft identifies three layers, first is
schema-valid(EII,TypeName,ComponentSet)
The TypeName is a (namespaceURI,NCName) pair
The component set is made up of (namespaceURI,NCName,component) triples
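The layer 1 interface can be sketched as a Python signature. Everything concrete here is an assumption for illustration: components are modelled as predicates, the triples are stored as a dictionary keyed by (namespaceURI, NCName), an unknown type name is treated as not valid, and the namespace URI is hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

QName = Tuple[str, str]  # (namespaceURI, NCName)
# The (namespaceURI, NCName, component) triples, keyed by their first two parts.
ComponentSet = Dict[QName, Callable[[Any], bool]]

def schema_valid(eii: Any, type_name: QName, components: ComponentSet) -> bool:
    """Layer 1: the answer depends on these three arguments and nothing else."""
    component = components.get(type_name)
    if component is None:
        return False  # assumption: an unknown type name means not valid
    return component(eii)

NS = "http://example.org/names"  # hypothetical namespace URI

def ends_with_surname(child_names) -> bool:
    """Toy component: the last child must be a surname."""
    return list(child_names[-1:]) == ["surname"]

components: ComponentSet = {(NS, "name"): ends_with_surname}

print(schema_valid(["forename", "surname"], (NS, "name"), components))
print(schema_valid(["forename", "surname"], ("", "name"), components))
```

Nothing in the signature says where the components came from, which is why the connection between a namespace and a schema document can be left to the other layers.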
Other layers
Layer 2: transfer syntax
Layer 3: web connections
Schema Walkthrough 2
The Schema for Datatypes
Schema Walkthrough 3
The Schema for Schemas
Change of Gear
Let's look at the role of schemas in supporting the layered architecture which is emerging all around us
XML is ASCII for the 21st century
ASCII (ISO 646) solved a fundamental interchange problem for flat text documents
What bits encode what characters
(For a pretty parochial definition of 'character')
Unicode/ISO 10646 extends that solution to the whole world
XML thought it was doing the same for simple tree-structured documents
The emphasis in the XML design was on simplifying SGML to move it to the Web
XML didn't touch SGML's architectural vision
flexible linearisation/transfer syntax
for tree-structured documents with internal links
Just what is XML?
It's a markup language used for annotating text
It is concerned with logical structure
to identify sections, titles, section headers, chapters, paragraphs,…
It is not concerned with appearance
you say 'this is a subtitle'
not 'this is in bold, 14pt, centered'
you say 'this is an example'
not 'this is in verbatim, indented by 5pts, ragged right'
Take Two: Just what is XML?
It's a markup language used for transferring data
It is concerned with data models
to convert between application-appropriate and transfer-appropriate forms
It is not concerned with human beings
It's produced and consumed by programs
XML as UI
A slogan of Adam Bosworth
I interpret it in two ways:
At the client end
Use XML plus XSL as the basis for what the user sees on his/her screen
Use XLinks from a master document to pull together disparate sources of information
At the server end
Use XML as a uniform interface for any data source onto the web
Not just documents, but also e.g. databases, process control information, stock quotes
Application data
Structured markup
<POORDERHDR>
<DATETIME qualifier="DOCUMENT">
 <YEAR>1996</YEAR>
  <MONTH>06</MONTH>
  <DAY>30</DAY>
  <HOUR>23</HOUR>
  <MINUTE>59</MINUTE>
  <SECOND>59</SECOND>
  <SUBSECOND>0000</SUBSECOND>
  <TIMEZONE>+0100</TIMEZONE>
 </DATETIME>
 <OPERAMT qualifier="EXTENDED" type="T">
  <VALUE>670000</VALUE>
  <NUMOFDEC>2</NUMOFDEC>
  <SIGN>+</SIGN>
  <CURRENCY>USD</CURRENCY>
. . .
What just happened!?
The whole transfer syntax story just went meta, that's what happened!
XML has been a runaway success, on a much greater scale than its designers anticipated
Not for the reason they had hoped
Because separation of form from content is right
But for a reason they barely thought about
Data must travel the web
Tree structured documents are a useable transfer syntax for just about anything
So data-oriented web users think of XML as a transfer mechanism for their data
The Cambridge Communiqué
A W3C Note resulting from a meeting in August 1999 (http://www.w3.org/TR/schema-arch)
Signalled a widespread acceptance of layering:
"XML has defined a transfer syntax for tree-structured documents;
"Many data-oriented applications are being defined which build their own data structures on top of an XML document layer, effectively using XML documents as a transfer mechanism for structured data; "
The Communiqué, cont'd
Called for support in XML Schema for specifying mapping between the XML document data model (or XML Infoset) and application-specific data models
XML Schema is a W3C recommendation-in-progress for defining the structure of document families
A grammar for markup structure
E.g.
article -> title, subtitle?, section+
or
POORDERHDR -> DATETIME, OPERAMT
Mapping between layers
Fortunately, XML Schema is actually notated in XML itself
So there are elements defined for use in schemas to define. . .
Elements :-)
Attributes
Types
A type is a collection of constraints on element content and attribute values
A type may be either
simple, for constraining string values
complex, for constraining elements which contain other elements
Type definition example
<type name='personName'>
 <element name='title'
          minOccurs='0'/>
 <element name='forename'
        minOccurs='0' maxOccurs='*'/>
 <element name='surname'/>
 <attribute name='id'
            type='integer'/>
</type>
<element name='owner'
         type='personName'/>
Mapping between layers 2
We can think of this in two ways
In terms of an abstract data modelling language
Entity-Relation
UML
RDF
In concrete implementation terms
Tables and rows
Class instances and instance variables
The first is more portable
The second more immediately useful
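A minimal Python sketch of the concrete option, mapping an element of the personName type onto a class instance with instance variables. The class, the field names and the sample owner element are all invented for illustration; a real mapping would be driven by the schema rather than written by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET

@dataclass
class PersonName:
    surname: Optional[str]
    title: Optional[str] = None
    forenames: List[str] = field(default_factory=list)
    id: Optional[int] = None

def to_person_name(element: ET.Element) -> PersonName:
    """Move from the document layer to the application layer."""
    raw_id = element.get("id")
    return PersonName(
        surname=element.findtext("surname"),
        title=element.findtext("title"),
        forenames=[e.text for e in element.findall("forename")],
        id=int(raw_id) if raw_id is not None else None,
    )

owner = ET.fromstring(
    "<owner id='7'><forename>Harry</forename><forename>S</forename>"
    "<surname>Truman</surname></owner>")
person = to_person_name(owner)
print(person)
```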
Mapping between layers 3
Regardless of what approach we take, we need
A vocabulary of data model components
An attachment of that vocabulary to schema components
Sample vocabularies
entity, relationship, collection
table, row, column
instance, variable, list, dictionary
Where should attachment be specified?
In the schema
convenient
Outside it
modular
Specifying mapping in the schema
Probably reasonable if done in high-level (ER, UML) terms
See example infoset-xmpl.xml, infoset-uml.xsd
Specifying mapping outside
Requires some duplication of structural information
Encourages cross-language working
See example infoset-xmpl.xsl
Take-home message
The point at which idiosyncratic scripting takes over can be moved one layer up
Using public consensual declarative standards is a Good Thing
Interoperability makes things better for everyone
Overall Conclusion
"Schemas are coming:  Start using them!"
         - Tim Berners-Lee, 1999-11-05