Standards come into being for a variety of reasons, and in a variety of ways, not always entirely explicable. They may be entirely market-defined; for example by manufacturers' attempts singly or as a group to control market share, or by consumers' desires to simplify purchase decisions. Standards also result from pressure applied by well-intentioned groups of experts, or as a consequence of legislation in the public interest. And finally, standards come about as the expression of some emergent consensus within some large community. This last method is the most likely to last, but the most difficult to achieve.
In creating such a consensus, there is an inevitable tension between the need to transform what is simply tried and tested into something normative and binding on the one hand, and the reluctance to straitjacket or constrain unanticipated development on the other. This is particularly true of the research and development arena, which depends for its survival on innovation, and thus the ability to provide answers to as yet unformulated questions, while at the same time being as concerned as any other community to codify existing practice. The research community is populated by experts, who need maximal flexibility and who distrust constraint, but also by novices, who need access to that accumulated expertise in a consistent and codified form, if only in order to rebel against it and thus become experts in their turn.
Standardization of the way in which information is stored and represented (rather than processed) is the key to a number of closely related problems, all of central concern to users of modern Information Technology, be they academic or commercial. For creators of language resources in particular, it addresses the difficulty of ensuring that information is reusable; the difficulty of ensuring that information represented in different ways can be seamlessly integrated; and the difficulty of facilitating loss-free information interchange between the widest choice of different platforms, different application systems and different languages.
By standardizing at the level of text representation, we can hope to retain the flexibility needed to develop new applications, while ensuring that old ones continue to function. By attempting a theory-neutral standardization, at the level where consensus exists, we avoid the need to reinvent the wheel, without requiring that everyone drive a particular brand of bicycle.
In this spirit, the TEI Guidelines, which form the topic of this paper, aim to provide not a set of normative rules for particular applications, but rather a modular and extensible framework within which particular application-specific norms can be defined. The development of such TEI-aware norms is already underway in a number of contexts: most significantly for the present audience, within EAGLES and related EU projects such as Multext, but also in a wide variety of corpus building, scholarly editing and digital library projects. Such projects have in common the need to customize and make less generic the framework defined by the TEI, retaining as they do so the capacity for interchange for which it was developed. The general principles, and many of the specific mechanisms, underlying this approach are of clear relevance to all large scale users of information technology.
This paper [NOTE: An earlier version of which was published as The Wider Relevance of the Text Encoding Initiative in OII Spectrum, Nov 1994.] describes the origins and organization of the TEI scheme, including some technical details of how it may be customized for multiple application areas, and an overview of its coverage.
The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of textual materials in electronic form. The project was sponsored and organized by three leading professional associations in the field: the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC) and the Association for Computing and the Humanities (ACH). It has been funded throughout its five years of activities on both sides of the Atlantic: primarily by the US National Endowment for the Humanities and by the European Union 3rd framework Programme for Linguistic Research and Engineering, but also with grants from the Mellon Foundation and from the Canadian Social Sciences and Humanities Research Council. Of equal significance has been the donation of time and expertise by the many members of the wider research community who have served on the TEI's Working Committees and Working Groups.
As its title suggests, the TEI is strongly interested in text. But this interest is by no means confined to the use of electronic text as a stage in the production of paper documents, and the word text should not be read too literally. The TEI is equally concerned with both textual and non-textual resources in electronic form, whether as constituents of a research database or components of non-paper publications.
Like the publishing industry, the research community has long realized that its stock in hand is not words on the page, but information, independent of any particular physical realization. As technology emerges which is genuinely adequate to the task of integrating text, graphics and audio into a seamless information-bearing vehicle, so the importance of that integrated vision becomes more apparent. By providing a description of information which is independent of realization or media, the TEI scheme, like other SGML-based approaches, enormously facilitates the construction and exploitation of multimedia technology.
In the same way, the texts with which language researchers are concerned are likely to be very heterogeneous. In the construction of language corpora such as the British National Corpus [NOTE: See http://info.ox.ac.uk/bnc for details of this 100 million word TEI-conformant corpus of modern British English.], material as diverse as newspapers, books, office memoranda, playscripts, publicity brochures, letters and diaries, transcribed lectures and interviews, TV and radio broadcasts, and unscripted conversations is integrated into a single body of material. Research needs require that this integration be carried out with minimal loss of information, and at the same time with minimal complexity: in any case, the resulting text is far removed from the conventional notion of a printed work.
Electronic texts are most obviously different from printed ones in that the former contain markup or encoding, which makes explicit various features of the text, so that they can be efficiently processed. Printed texts adopt a variety of similarly-motivated conventions (use of typeface, organization of the carrier medium etc), but these are not so readily processable as the tags of a formal markup scheme.
The goals of the TEI project initially had a dual focus: being concerned with both what textual features should be encoded (i.e. made explicit) in an electronic text, and how that encoding should be represented for loss-free, platform-independent, interchange.
Early on in the project, the Standard Generalized Markup Language (SGML; ISO 8879) was chosen as the most appropriate vehicle for the Guidelines, initially on the purely pragmatic grounds that to create a comparably expressive and versatile formal language would be a major research project in itself. In the event, despite some frequently rehearsed inelegancies, SGML has proved entirely adequate to the needs of researchers, and after five years, is still increasing its domination of the software industry, with new product announcements coming every year. The TEI was thus able to focus its efforts on the expression, using SGML, of the set of textual features indicated as its first goal above.
The prime deliverable of the TEI project is a very large number (over 400) of textual feature definitions, expressed as SGML elements and attributes, with associated documentation and examples. These elements are grouped into tag sets of various kinds, as further discussed below, and together constitute a modular scheme which can be configured to provide hardware-, software-, and application-independent support for the encoding of all kinds of text in all languages and of all times.
The TEI tag sets are necessarily based on, but not limited by, existing encoding practices; they are designed to be both comprehensive and extensible. They are collectively documented in a substantial reference manual, the Guidelines for Text Encoding for Interchange, which appeared in May 1994 after five years of extensive development work. This 1400 page manual is published both in paper and electronic hypertext form and is also available over the Internet, in a variety of formats. [NOTE: Guidelines for the encoding and interchange of machine-readable texts edited by C.M.Sperberg-McQueen and Lou Burnard (Chicago and Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1994). For details of current availability and locations, see the official TEI Web page at http://www-tei.uic.edu/orgs/tei.]
As an SGML application, the TEI scheme necessarily requires the existence of some kind of document type definition (DTD). Current approaches to DTD design may be caricatured as falling into one of three camps, depending on their answer to the question "How many DTDs does the world need?"
For many of the first users of SGML, the appropriate answer was "One": the whole purpose of the exercise being to define a template against which all texts could be checked rigorously and consistently. This approach, which might be characterized by the phrase "we know what's best for you", has an obvious place in applications such as technical documentation, but is equally obviously inappropriate where the object of the exercise is to describe texts produced before the blessings of structured document design were revealed to the world.
At the opposite extreme are those whose answer would be "none", for whom no DTD can ever be adequate to the full complexity of the texts to be described: this attitude might be caricatured as "No-one will ever understand my problem". Again, it is not impossible to imagine applications for which a DTD consisting only of elements with the content model ANY would be entirely appropriate (the first electronic edition of the Oxford English Dictionary provides one obvious example), although its usefulness in the general case is less clear.
Perhaps most numerous are those who shrug their shoulders and say "as many as it takes": the world will always need new DTDs, in the boundary case, one per document. In the name of pragmatism, this attitude risks crowding the fledgeling possibility of information interchange out of the nest entirely; nevertheless, its popularity reminds us that sometimes the document must drive the DTD, rather than the reverse.
The approach taken by the TEI attempts to combine virtues of all three of these approaches. It defines not one, but many possible DTDs, which may be tailored to the needs of a particular application in a way difficult or impossible with most other general purpose DTDs so far developed. The user of the TEI scheme is offered the opportunity of building a DTD which matches his or her requirements, but constrained to do so in a way that facilitates interchange.
We refer to this somewhat jocularly as the Chicago Pizza model. All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings. Using SGML syntax this might be summarized as follows:
<!ENTITY % base "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)">
<!ELEMENT pizza - - (%base;, (cheese & tomato), (%topping;)* )>
In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one base tag set and his or her own selection of additional tag sets or toppings.
We use the term tag set to denote simply a collection of definitions for SGML elements and their attributes. These tag sets are the basic organizing principles of the TEI scheme, and are divided into four groups: core tag sets, base tag sets, additional tag sets and auxiliary tag sets.
This modularization is achieved by the use of parameter entities in the TEI DTD, which is further discussed below. To illustrate the basic mechanism we present here the start of a minimal TEI-conformant document in which the base tag set for prose has been selected together with the additional tag set for linking:
<!DOCTYPE tei.2 [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
]>
<tei.2>
</tei.2>
Because this selection of tag sets is effected explicitly by declarations within the DTD subset, as shown above, any recipient of the document can tell which TEI tag sets are required to process it. Any deviations or modifications of the TEI definitions (for example, the renaming of elements, or the addition of new ones) may be made in a similar declarative manner. Once a given view of the TEI DTD has been defined in this way, it can be fixed or "compiled" to preclude further modification and also to remove the complexity necessarily introduced by the extensive use of indirection in the TEI DTD.
Two core tag sets are available to all TEI documents without formality. The first defines a large number of elements which may appear in almost any kind of document, whatever kind of base tag set is in use. The second defines the header, providing something analogous to an electronic title page for the electronic text.
The core tag set common to all TEI documents provides means of encoding, with a reasonable degree of sophistication, such textual features as: typographically highlighted phrases (optionally distinguishing highlighting used for emphasis, technical terms, foreign words, titles etc.); quoted phrases (optionally distinguishing amongst direct speech, quotation, glosses, cited phrases etc.); "data-like" phrases such as names, numbers and measures, dates and times, etc.; lists of all kinds; basic editorial changes (e.g. correction of apparent errors; regularization and normalization; additions, deletions and omissions); simple links and cross references, providing basic hypertextual features; and facilities for annotation, indexing, bibliographic citations and referencing systems.
There are few documents which do not exhibit some of these
features; and none of these features is particularly restricted to any
one kind of document. In some cases, an additional tagset is also
available, providing more specialized elements for those wishing to
encode aspects of these features in greater detail (for example, for
verse and drama, and for names), but the elements defined in this core
are believed to be adequate for most applications most of the time.
The TEI scheme attaches particular importance to the provision of documentary or bibliographic information about electronic texts. Such information is essential for any satisfactory interchange of texts coming from multiple sources, or for which long term uses are envisaged.
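The TEI header is one of the few mandatory elements in a TEI document. It has four major divisions, which together provide a detailed syntax for the documentation of an electronic text: the file description, the encoding description, the profile description and the revision history.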
The first of these, the file description, contains traditional bibliographic material, detailing title, intellectual responsibility and publication or distribution information relating to an electronic text, which can readily be translated into a conventional catalogue record for use by the growing number of forward-thinking academic and public libraries now coming to terms with their new role as curators of non-print electronic materials.
Several commentators, noticing how the day to day information processing of all sectors of the economy now takes place in electronic form only, have expressed concern at the difficulties faced by librarians and archivists in handling these new forms of historical records. Others, trying to come to terms with the wealth of information in "cyberspace", have lamented the absence of any effective cataloguing standards for networked resources and other forms of electronic publication. For creators of language corpora, the provision of such meta-descriptive information is essential, since without it analysis of the full complexity of language use is all but impossible. The TEI Header represents a major contribution to overcoming all these problems.
Many electronic texts are essentially derivative works, created either by keying or scanning previously existing print materials, combining or modifying previously existing electronic materials, or both. The source description part of the TEI header allows an encoder to specify the source or sources from which a text has been derived, using traditional bibliographic concepts. The pedigree of a TEI-conformant text can thus be specified, in the same way as a conventional book will generally document its publishing history. A detailed formal description of changes made in producing a text can be recorded as a distinct revision history; this is particularly useful for highly dynamic texts.
As noted above, the TEI is not a fixed encoding scheme, but offers a variety of options appropriate to different situations. Consequently, the encoding description within a TEI Header is of particular importance to users of an electronic document. It provides, in structured or unstructured form, vital information about editorial conventions or policies, design decisions and even the selection of tags actually used within the document.
The profile description is used to group together a wide range of additional descriptive information ranging from specifications of the languages used within it, the situation or social context in which it was produced, its topics or classification, to demographic or social characteristics of its authors or participants. No-one is likely to need all of these categories of information, but all of them are likely to be essential to some users.
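The four divisions described above can be pictured as a skeleton header. The sketch below is illustrative only: the element names are those of the published TEI Guidelines rather than names given in this paper, all content is invented placeholder text, and the fragment has not been validated against the TEI DTD.

```sgml
<teiHeader>
  <!-- file description: bibliographic account of the electronic text -->
  <fileDesc>
    <titleStmt>
      <title>An example text: an electronic edition</title>
    </titleStmt>
    <publicationStmt>
      <p>Placeholder: distributor and availability go here.</p>
    </publicationStmt>
    <!-- source description: what the electronic text was derived from -->
    <sourceDesc>
      <p>Placeholder: keyed from a hypothetical printed edition.</p>
    </sourceDesc>
  </fileDesc>
  <!-- encoding description: editorial conventions and design decisions -->
  <encodingDesc>
    <p>Placeholder: end-of-line hyphenation silently removed.</p>
  </encodingDesc>
  <!-- profile description: languages, context, classification -->
  <profileDesc>
    <langUsage>
      <language id="eng">English</language>
    </langUsage>
  </profileDesc>
  <!-- revision history: changes made in producing the text -->
  <revisionDesc>
    <list>
      <item>Placeholder: header first drafted.</item>
    </list>
  </revisionDesc>
</teiHeader>
```

Only the file description carries the material a library catalogue would need; the other three divisions may be as brief or as detailed as the application requires.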
A collection of TEI headers can also be regarded as a distinct document, and an auxiliary DTD is provided to support interchange of headers alone, for example between libraries or archives.
To construct a view of the TEI DTD, the user must always choose one base tag set. Six of these are currently defined, for documents which are predominantly one of prose, verse, drama, transcribed speech, dictionaries, or terminological databases. Another two are provided for use with texts which combine these basic tag sets.
The choice of a base tag set determines the basic structure of all the documents with which it is to be used, reflecting the fact that subelements likely to appear within a dictionary (for example) will be entirely different in kind from those likely to appear within a letter or a novel, and even more so from those likely to be found in a transcription of spoken language. To cater for this variety, the constituents of all divisions of a TEI <text> element are not defined explicitly, but in terms of parameter entities. The mechanism used is to provide definitions like the following within the DTD, one of which the user must over-ride by supplying an appropriate declaration in the DTD subset:
<!ENTITY % TEI.prose "IGNORE">
<!ENTITY % TEI.dictionary "IGNORE">
The body of the main DTD contains a series of alternative definitions, each enclosed within an SGML marked section named after the base which it defines, as in this simplified example:
<![ %TEI.prose; [
<!ENTITY % component "p|list" >
]]>
<![ %TEI.dictionary; [
<!ENTITY % component "entry" >
]]>
<!ENTITY % component.seq "(%component;)+">
Within the body of the DTD, elements are defined using these parameter entities only, for example:
<!ELEMENT div - - ((%component.seq;)+)>
To select a base tag set a declaration such as the following should be supplied within the DTD subset for the document:
<!ENTITY % TEI.prose "INCLUDE">
This will over-ride the declaration within the TEI DTD itself, because it is given first. If no base is declared, the DTD will not compile.
The value of the parameter entity called component.seq will thus differ in different bases. In this way it is possible for the divisions of a text using the drama base (for example) to consist of speeches and stage directions, while those of a text using the dictionary base will consist of lexical entries.
Although the actual components may differ, textual components are grouped into higher level divisions in almost any kind of text. These higher level units may be called variously chapters, sections, subdivisions, acts or parts, but all seem to behave in more or less the same way: they are incomplete in themselves, and nested hierarchically. In the TEI scheme all such objects are therefore regarded as the same kind of element, called here a division.
A type attribute may be used to distinguish amongst divisions in some respect other than their hierarchic position: the values for this attribute (as for several others in the TEI scheme) are not standardized, precisely because no consensus exists, or is likely to exist, as to a generic typology. A set of legal values should however be defined for a given application, either in the TEI Header or by a user-defined modification.
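As a hedged illustration, nested divisions distinguished by type might be encoded as follows. The values "chapter" and "section" are hypothetical: as noted above, no values are prescribed, and an application would define its own. The element name div follows the simplified examples in this paper.

```sgml
<div type="chapter" n="1">
  <head>Chapter one</head>
  <p>Opening paragraph of the chapter.</p>
  <!-- a division nested inside another; only the type value differs -->
  <div type="section" n="1.1">
    <head>A nested section</head>
    <p>Divisions nest hierarchically, whatever they are called.</p>
  </div>
</div>
```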
In the normal case, the components of all divisions in a particular base are homogeneous: they all use the same value for component.seq. However, the scheme also allows for two kinds of heterogeneity. If the general base is selected, together with two or more other bases, then different divisions of a text may have different constituents, though each division must itself be homogeneous. A mixed base is also defined, in which components from any selection of bases may be combined promiscuously across division boundaries.
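A minimal sketch of how the general base might be selected together with two others is given below. The entity names TEI.general and TEI.drama are assumed here by analogy with the TEI.prose entity shown earlier, and should be checked against the DTD actually in use.

```sgml
<!DOCTYPE tei.2 [
<!-- general base: divisions may differ, but each is homogeneous -->
<!ENTITY % TEI.general "INCLUDE">
<!-- the bases whose components are to be made available -->
<!ENTITY % TEI.prose   "INCLUDE">
<!ENTITY % TEI.drama   "INCLUDE">
]>
```

With such a selection one division of a text could consist of paragraphs and another of speeches and stage directions, but no single division could mix the two.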
This approach applies equally to the encoding of smaller units: rather than attempt to enumerate all the different analytic units which particular disciplines might find necessary, the TEI proposes two generic segmentation elements: one (<s>) for simple end-to-end segmentation, such as that commonly used in language corpora, roughly corresponding to the notion of orthographic sentence; the other (<seg>) for segments which can potentially self-nest. In either case, a type attribute may be used to distinguish different kinds of segment.
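The difference between the two elements can be sketched as follows. The sentences and the type values "clause" and "phrase" are invented for illustration and are not prescribed by the scheme.

```sgml
<p>
  <!-- s: end-to-end segmentation into orthographic sentences -->
  <s n="1">The committee met on Tuesday.</s>
  <!-- seg: segments may nest inside one another -->
  <s n="2"><seg type="clause">When <seg type="phrase">the vote</seg>
  was deferred</seg>, nobody objected.</s>
</p>
```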
Textual features, and hence the elements which encode them, may be categorized or classified in a number of ways. The TEI scheme identifies two kinds of classification scheme: attribute classes and model classes; both are used for broadly similar purposes.
Members of an attribute class share the same set of attributes. For example, all elements which represent links or associations between one element and another do so using a common set of attributes, defined by the pointer attribute class.
Members of a model class share the same structural properties: that is, they may appear at the same position within the SGML document structure. For example, the class divtop contains all elements (headings, epigraphs etc.) which can appear at the start of a textual division; all elements used to mark editorial corrections or omissions are members of the class edit; elements marking bibliographic citations etc. are all members of the class bibl and so on.
Elements may of course be members of more than one class. Classes may have super- and sub-classes, and properties (notably associated attributes) may be inherited. Classes are defined in the TEI dtd by means of parameter entities, and used extensively for DTD maintenance, documentation, and extension.
The TEI scheme supports three kinds of user modification: new elements may be added into existing classes, and existing elements renamed or undefined. These operations are carried out in a controlled manner, using the class system and without any need for extensive revision of the TEI DTD itself.
The process of adding a new element to a class may be illustrated as follows. Consider the model class divtop mentioned above. Simplifying somewhat, this element class is defined as follows:
<!ENTITY % x.divtop "">
<!ENTITY % m.divtop "%x.divtop; head | byline | epigraph">
To add a new element (say, <keywords>) to this class, enabling it to appear anywhere in the content model that other members of the class do, all that is needed is to re-define the x-entity within the document type subset:
<!ENTITY % x.divtop "keywords |">
Note the trailing vertical bar, which is required. As it happens, the element <keywords> is already defined in the TEI scheme (within the header); if it were not, an element declaration would also be necessary.
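For an element that is not already part of the scheme, the same redefinition would be accompanied by a declaration. The following is a minimal sketch only: the element name blurb is hypothetical, and its content model is deliberately trivial.

```sgml
<!DOCTYPE tei.2 [
<!ENTITY % TEI.prose "INCLUDE">
<!-- add the hypothetical element blurb to the divtop class;
     the trailing vertical bar is required -->
<!ENTITY % x.divtop "blurb |">
<!-- blurb is not defined by the TEI scheme, so it is declared here -->
<!ELEMENT blurb - - (#PCDATA)>
]>
```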
Parameter entities are also used to effect the two other kinds of modification mentioned above: the ability to undefine elements, and the ability to rename them.
Within the main TEI DTD, each element definition and its associated attribute list specification is enclosed by a marked section with the same name as the element, the default value for which is "INCLUDE". Thus, to undefine the element <mentioned>, all that is needed is a declaration like the following in the DTD subset:
<!ENTITY % mentioned "IGNORE">
A similar declaration may be used to rename any element; for example, to rename <p> as <para>:
<!ENTITY % n.p "para">
This works because all references to the <p> element throughout the TEI DTD are made indirectly, using the n.p entity. Furthermore, the original name for an element is recoverable by an SGML application, because it forms the value of a global attribute teiform of declared type FIXED.
All user-defined modifications of this kind are regarded as forming an additional tag set, which is embedded within the DTD in the same way as any other tag set, i.e. by enabling the TEI.extensions parameter entities. In this way a TEI document can make explicit the extent and nature of any modification required in the base TEI scheme for its processing. An auxiliary tag set is also provided for the documentation of additional SGML elements in a way compatible with that used for the rest of the scheme.
One particularly important class is the global attribute class. By default the following attributes are members of this class and may therefore be supplied for all elements in the TEI scheme: id, n, lang and rend.
This list may be extended: for example, selecting the additional tag set for analysis will add analytic attributes to the above list. The id and n attributes allow for the identification of any element occurrence within a TEI-conformant text. Elements carrying an id attribute value may be the object of a link or cross-reference, or any of the other re-structuring mechanisms proposed by the TEI for circumventing the rigidly hierarchic structure of a simple SGML DTD. The fact that the requirement for such links is usually unpredictable is one reason for making this attribute global.
Values on id attributes must be unique (their declared value is ID). Values on the n attribute however need not be; they may be used to carry a TEI canonical reference. A method for defining the structure of such canonical reference schemes is also provided, so that documents using it can be processed automatically.
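The contrast between the two identifying attributes can be sketched as follows. The identifiers and reference values are invented, and the ref element with its target attribute is taken from the published TEI core tag set rather than from this paper.

```sgml
<div type="chapter" id="CH2" n="2">
  <!-- id values must be unique within the document -->
  <p id="CH2P1" n="2.1">A paragraph that can be the target of a link.</p>
  <!-- n values need not be unique; here they carry a reference string -->
  <p id="CH2P2" n="2.2">A second paragraph.</p>
</div>
<div type="chapter" id="CH3" n="3">
  <p>For the argument in full, see
  <ref target="CH2P1">the earlier discussion</ref>.</p>
</div>
```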
The lang attribute indicates the language, and hence the writing system, applicable to the element's content, thus providing explicit support for polyglot or multiscript texts. If no value is given, that of the element's direct parent is assumed. (A number of TEI attributes have this characteristic, which is catered for by a TEI-defined keyword.) The value of this attribute identifies a special purpose <language> element which documents the language in use, optionally associating it with an external entity in which a formal writing system declaration (WSD) may be given.
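The relationship between the lang attribute and the element it points to might be sketched as follows. The identifiers eng and fra and the sample sentence are invented; the foreign element is taken from the published TEI core tag set, and the optional association with a writing system declaration is omitted.

```sgml
<!-- in the header: one language element for each language used -->
<language id="eng">English</language>
<language id="fra">French</language>

<!-- in the text: the value of lang identifies a language element -->
<p lang="eng">She replied <foreign lang="fra">pas du tout</foreign>
and left; <q>Typical</q>, he muttered.</p>
```

Here the q element carries no lang value of its own and is therefore understood to be in the language of its parent paragraph.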
A WSD defines a language/writing system pair (for example, "Koine Greek, using TLG Beta Code"), and is formally defined by an auxiliary DTD which allows each character to be systematically defined and documented, in terms of existing international or other standards, public or private entity sets, ad hoc transliteration schemes or explicit definitions, as well as combinations of all four.
Finally, the global rend attribute may be used to give information about the physical presentation of the text in the source, where this is not otherwise given. A default rendition may be specified for all elements of a given type. No specific set of values is defined for this attribute in the current draft, though it is probable that some suitable set of DSSSL primitives will be proposed in a later version.
It should be stressed that the rend attribute is not intended for use as a means of specifying the desired formatting of an element, except insofar as this may be determined by a desire to mimic the approximate appearance of the original text. Like other SGML applications, the TEI scheme attempts to provide elements for the encoding of those textual features deemed essential to a productive use of the encoded text; however, unlike most other SGML applications, the TEI scheme recognizes that for some, it is precisely the appearance of a text which is the object of research.
Ten additional tag sets are defined by the current TEI proposals. These include tag sets for special application areas such as the orthographic transcription of speech, the detailed physical description of manuscript or print material, and the recording of an electronic variorum modelled on the traditional critical apparatus. A tag set is defined for the detailed documentation of contextual information needed by language corpora, as well as for the detailed encoding of names and dates; abstractions such as networks, graphs or trees; mathematical formulae and tables etc.
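In addition to these application-specific additional tag sets, some more general purpose additional tag sets are defined for linking and alignment, for analysis and interpretation, for feature structures, and for recording uncertainty.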
The tag set for linking and alignment extends the set of linking and pointing elements already defined in the TEI core to provide facilities for linking to arbitrary locations or spans of texts, whether or not these are in the current document, and whether or not the target is an SGML document. Mechanisms are included for recording the alignment or correspondence of parts of a text, for example in multilingual corpora, or for marking the alignment of audio or video with a transcription of it. As such, this tag set provides a usefully large subset of the facilities offered by the HyTime standard, but with a considerably simpler and more efficient interface. [NOTE: Witness the fact that, as of May 1995, support for the TEI extended pointer mechanism has already been implemented in Softquad's Panorama Pro, and Electronic Book Technology's DynaText --- the two market leaders amongst commercial SGML browsing software.]
As noted above, a generic segmentation element is defined for the identification of textual spans appropriate to any analytic scheme. An out-of-line generic <interp> element may be used to link arbitrary text segments (which may be nested or discontinuous) with any user-defined set of attribute/value pair interpretations. Specific tags are also defined for the most common requirements of linguistic analysis such as identification and typing of morphemes, words, phrases, and sentences.
A specialized tag set is also provided for the encoding of abstract interpretations of a text, either in parallel with it or embedded within it. This is based on the feature structure notation employed in theoretical linguistics, but has applications beyond linguistic theory. [NOTE: An introduction to this tag set is provided by D. T. Langendoen and G. F. Simons ``A rationale for the TEI recommendations for feature-structure markup'' in Computers and the Humanities (forthcoming, 1995); for an extended discussion of an application of the feature structure scheme to the problems of encoding historical source materials, see D. I. Greenstein and L. Burnard ``Speaking with one voice'' (ibid.).]
Using this mechanism, encoders can define arbitrarily complex bundles or sets of features identified in a text, according to their own methodological bias. They may thus embed a whole range of interpretations of a text, linguistic, literary, or thematic, within a text in a controlled manner. The syntax defined by the Guidelines not only formalizes the way in which such features are encoded, but also provides for a detailed specification of legal feature/value pair combinations and rules determining, for example, the implication of under-specified or defaulted features. This is known as a feature system declaration and is defined by an auxiliary tag set.
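The role played by a feature system declaration may be easier to see in miniature. The following Python sketch is a deliberately simplified stand-in, not the TEI mechanism itself: the dictionary layout, the feature names and the function complete are all hypothetical, chosen only to show legal values being checked and defaults being supplied for under-specified features.

```python
# Hypothetical, much simplified stand-in for a feature system declaration:
# each feature lists its legal values and, optionally, a default.
DECLARATION = {
    "category": {"values": {"noun", "verb"}, "default": None},
    "number": {"values": {"singular", "plural"}, "default": "singular"},
    "tense": {"values": {"past", "present"}, "default": None},
}

def complete(feature_structure, declaration):
    """Check a feature structure against the declaration and fill in defaults."""
    for name, value in feature_structure.items():
        if name not in declaration:
            raise ValueError(f"undeclared feature: {name}")
        if value not in declaration[name]["values"]:
            raise ValueError(f"illegal value {value!r} for feature {name}")
    completed = dict(feature_structure)
    for name, spec in declaration.items():
        # An under-specified feature takes its declared default, if any.
        if name not in completed and spec["default"] is not None:
            completed[name] = spec["default"]
    return completed

word = complete({"category": "noun"}, DECLARATION)
print(word)
```

A structure that names an undeclared feature, or gives a declared feature an illegal value, is rejected with an error, which corresponds roughly to the validation that a declaration makes possible.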
An additional tag set is also provided for the encoding of degrees of uncertainty or ambiguity in the encoding of a text. These particular tag sets exhibit in a particularly noticeable form one of the chief strengths of the TEI approach to encoding: it provides the encoder with a well-defined set of tools which can be used to make explicit his or her reading of a text. No claim to absolute authority is made by any encoder, nor ever should be; the TEI scheme merely allows encoders to come clean about what they have perceived in a text, to whatever degree of detail seems appropriate.
A user of the TEI scheme may combine as many or as few additional tag sets as suit his or her needs. The existence of tag sets for particular application areas in the current draft reflects, to some extent, accidents of history: no claim to systematic or encyclopaedic coverage is implied. Indeed, it is confidently expected that new tag sets will be added, and that their definition will form an important part of the continued work of this and successor projects.
The TEI Guidelines have taken more than five years to reach their present state, the first at which they can be said to be reasonably complete. In retrospect, it is doubtless true that they could have been created much more quickly with less involvement from the research community, or a clearer statement from it of a set of particular goals. But that statement would have inevitably limited the scope of the resulting scheme, providing exactly the kind of strait-jacket which we wished to avoid. Moreover, by prioritizing any one research agenda however well-articulated, we would have effectively disenfranchised and alienated all others. A little like the early Church fathers then, the TEI chose to provide as broad and as catholic a means of salvation as possible.
At the same time, the TEI scheme applies rigorously the principle essentia non sunt multiplicanda praeter necessitatem [NOTE: Generally attributed to William of Occam (1300-1349), this recommendation is known as Occam's Razor; it may be translated as ``Essences should not be unnecessarily multiplied'' and refers properly to the distinction made by the Scholastics between essence (those properties of an entity which define its type) and accidents (those properties specific only to one instance of an entity)]. Rather than defining discrete elements for different kinds of list (bulleted, glossary, enumerated etc.), the TEI scheme defines a single list element which bears a type attribute to distinguish amongst these various kinds. In the same way, all kinds of links between document elements, whatever their semantics, are encoded using the same tags. To cover the indefinite number of elements potentially needed to handle all kinds of analysis and interpretation, a small number of generic tags are proposed which (in the case of the feature structure tag set referred to above) are sufficiently abstract and general to cater for almost any kind of interpretative judgment.
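The practical consequence of having one list element with a type attribute, rather than one element per kind of list, is that software need recognize only a single element and can branch on an attribute value. The Python sketch below illustrates this; the element name item and the attribute values ordered and bulleted are assumptions made for the example, and the rendering conventions are invented.

```python
import xml.etree.ElementTree as ET

def render(list_xml):
    """Render a list element as plain-text lines, dispatching on its type."""
    element = ET.fromstring(list_xml)
    kind = element.get("type", "simple")
    lines = []
    for number, item in enumerate(element.findall("item"), start=1):
        text = (item.text or "").strip()
        if kind == "ordered":
            lines.append(f"{number}. {text}")
        elif kind == "bulleted":
            lines.append(f"* {text}")
        else:
            # Unknown or unmarked kinds fall back to a plain rendering.
            lines.append(text)
    return lines

print(render('<list type="ordered"><item>core</item><item>base</item></list>'))
print(render('<list type="bulleted"><item>core</item><item>base</item></list>'))
```

A new kind of list then requires only a new attribute value and a new branch in the processing code, with no change to the set of elements.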
At the same time, there remain many situations in which the TEI's desire to exclude no-one has led to a multiplication of distinctions at first sight rather bewildering. It seems, to say the least, unlikely that anyone will ever encode a document using every possible element defined by the union of every TEI tag set, though such a monster DTD is indeed possible.
As published, the Guidelines constitute a substantial document unsuitable for casual browsing, even in electronic form. The TEI therefore plans to make available a number of smaller introductory tutorials focused on particular application areas. Two such have already appeared: one dealing with terminological systems, [NOTE: Melby, Alan et al Terminology Interchange Format (TIF): a tutorial (Vienna, Infoterm, 1993) ] and the other on encoding of manuscript transcriptions [NOTE: Robinson, Peter Encoding of Primary Sources Using SGML, Oxford, Office for Humanities Communication, 1994].
A third tutorial has also recently been completed, documenting a special pedagogically-motivated subset of some 200 elements, selected from the whole TEI scheme (not just the core). Known as TEI Lite, this DTD has already been used in two electronic publishing projects and is in use at electronic text repositories at the Universities of Oxford, Virginia and Michigan, and elsewhere. [NOTE: At the time of writing, the document defining this scheme is only available in electronic form, as Sperberg-McQueen, C.M. and Lou Burnard TEI Lite: An Introduction to the TEI encoding scheme (Chicago and Oxford, May 1995), from the URLs http://www-tei.uic.edu/orgs/tei or http://info.ox.ac.uk/~archive/teilite]
The real proof of the effectiveness of the TEI design will come only with its widespread adoption, tailored to the particular needs of individual projects. As far as can be judged from the long list of early implementors, such evidence will soon be forthcoming.
This article has focussed chiefly on the complexity and generality of the TEI scheme, with a view to demonstrating its intellectual adequacy and its potential as a model for many SGML applications.
It has also attempted to demonstrate how a simple modular scheme can be implemented in such a way as to maximize the interchange space within which information interchange takes place.
The origins of the TEI scheme in the academic world mean that it has been designed with the widest possible set of applications in mind. Optimizing it for particular sets of users will be a new challenge.