Text Encoding for Information Interchange: An Introduction to the Text Encoding Initiative

Author: Lou Burnard
Document Date: July 1995

- Standardization and the TEI
- What is the TEI?
- Organization of the TEI scheme
- The TEI core
- Elements available to all bases
- The header
- The TEI base tag sets
- Textual Divisions
- The TEI Class System and Modification Mechanisms
- The global attributes
- The TEI additional tag sets
- From General to Specific
- Conclusions

Standardization and the TEI

Standards come into being for a variety of reasons, and in a variety of ways, not always entirely explicable. They may be entirely market-defined; for example by manufacturers' attempts singly or as a group to control market share, or by consumers' desires to simplify purchase decisions. Standards also result from pressure applied by well-intentioned groups of experts, or as a consequence of legislation in the public interest. And finally, standards come about as the expression of some emergent consensus within some large community. This last method is the most likely to last, but the most difficult to achieve.

In creating such a consensus, there is an inevitable tension between the need to transform what is simply tried and tested into something normative and binding on the one hand, and the reluctance to straitjacket or constrain unanticipated development on the other. This is particularly true of the research and development arena, which depends for its survival on innovation, and thus the ability to provide answers to as yet unformulated questions, while at the same time being as concerned as any other community to codify existing practice. The research community is populated by experts, who need maximal flexibility and who distrust constraint, but also by novices, who need access to that accumulated expertise in a consistent and codified form, if only in order to rebel against it and thus become experts in their turn.

Standardization of the way in which information is stored and represented (rather than processed) is the key to a number of closely related problems, all of central concern to users of modern Information Technology, be they academic or commercial. For creators of language resources in particular, it addresses the difficulty of ensuring that information is reusable; the difficulty of ensuring that information represented in different ways can be seamlessly integrated; and the difficulty of facilitating loss-free information interchange between the widest choice of different platforms, different application systems and different languages.

By standardizing at the level of text representation, we can hope to retain the flexibility needed to develop new applications, while ensuring that old ones continue to function. By attempting a theory-neutral standardization, at the level where consensus exists, we avoid the need to reinvent the wheel, without requiring that everyone drive a particular brand of bicycle.

In this spirit, the TEI Guidelines, which form the topic of this paper, aim to provide not a set of normative rules for particular applications, but rather a modular and extensible framework within which particular application-specific norms can be defined. The development of such TEI-aware norms is already under way in a number of contexts, most significantly for the present audience within EAGLES and related EU projects such as Multext, but also in a wide variety of corpus building, scholarly editing and digital library projects. Such projects have in common the need to customize and make less generic the framework defined by the TEI, retaining as they do so the capacity for interchange for which it was developed. The general principles, and many of the specific mechanisms, underlying this approach are of clear relevance to all large scale users of information technology.

This paper [NOTE: An earlier version of which was published as The Wider Relevance of the Text Encoding Initiative in OII Spectrum, Nov 1994.] describes the origins and organization of the TEI scheme, including some technical details of how it may be customized for multiple application areas, and an overview of its coverage.

What is the TEI?

The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of textual materials in electronic form. The project was sponsored and organized by three leading professional associations in the field: the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC) and the Association for Computing and the Humanities (ACH). It has been funded throughout its five years of activities on both sides of the Atlantic: primarily by the US National Endowment for the Humanities and by the European Union 3rd framework Programme for Linguistic Research and Engineering, but also with grants from the Mellon Foundation and from the Canadian Social Sciences and Humanities Research Council. Of equal significance has been the donation of time and expertise by the many members of the wider research community who have served on the TEI's Working Committees and Working Groups.

As its title suggests, the TEI is strongly interested in text. But this interest is by no means confined to the use of electronic text as a stage in the production of paper documents, and the word text should not be read too literally. The TEI is equally concerned with both textual and non-textual resources in electronic form, whether as constituents of a research database or components of non-paper publications.

Like the publishing industry, the research community has long realized that its stock in hand is not words on the page, but information, independent of any particular physical realization. As technology emerges which is genuinely adequate to the task of integrating text, graphics and audio into a seamless information-bearing vehicle, so the importance of that integrated vision becomes more apparent. By providing a description of information which is independent of realization or media, the TEI scheme, like other SGML-based approaches, enormously facilitates the construction and exploitation of multimedia technology.

In the same way, the texts with which language researchers are concerned are likely to be very heterogeneous. In the construction of language corpora such as the British National Corpus [NOTE: See http://info.ox.ac.uk/bnc for details of this 100 million word TEI-conformant corpus of modern British English.], material as diverse as newspapers, books, office memoranda, playscripts, publicity brochures, letters and diaries, transcribed lectures and interviews, TV and radio broadcasts, and unscripted conversations is integrated into a single body of material. Research needs require that this integration be carried out with minimal loss of information, and at the same time with minimal complexity: in any case, the resulting text is far removed from the conventional notion of a printed work.

Electronic texts are most obviously different from printed ones in that the former contain markup or encoding, which makes explicit various features of the text, so that they can be efficiently processed. Printed texts adopt a variety of similarly-motivated conventions (use of typeface, organization of the carrier medium etc), but these are not so readily processable as the tags of a formal markup scheme.

The goals of the TEI project initially had a dual focus: being concerned with both what textual features should be encoded (i.e. made explicit) in an electronic text, and how that encoding should be represented for loss-free, platform-independent, interchange.

Early on in the project, the Standard Generalized Markup Language (SGML; ISO 8879) was chosen as the most appropriate vehicle for the Guidelines, initially on the purely pragmatic grounds that to create a comparably expressive and versatile formal language would be a major research project in itself. In the event, despite some frequently rehearsed inelegancies, SGML has proved entirely adequate to the needs of researchers, and after five years, is still increasing its domination of the software industry, with new product announcements coming every year. The TEI was thus able to focus its efforts on the expression, using SGML, of the set of textual features indicated as its first goal above.

The prime deliverable of the TEI project is a very large number (over 400) of textual feature definitions, expressed as SGML elements and attributes, with associated documentation and examples. These elements are grouped into tag sets of various kinds, as further discussed below, and together constitute a modular scheme which can be configured to provide hardware-, software-, and application-independent support for the encoding of all kinds of text in all languages and of all times.

The TEI tag sets are necessarily based on, but not limited by, existing encoding practices; they are designed to be both comprehensive and extensible. They are collectively documented in a substantial reference manual, the Guidelines for Text Encoding for Interchange, which appeared in May 1994 after five years of extensive development work. This 1400 page manual is published both in paper and electronic hypertext form and is also available over the Internet, in a variety of formats. [NOTE: Guidelines for the encoding and interchange of machine-readable texts edited by C.M.Sperberg-McQueen and Lou Burnard (Chicago and Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1994). For details of current availability and locations, see the official TEI Web page at http://www-tei.uic.edu/orgs/tei.]

Organization of the TEI scheme

As an SGML application, the TEI scheme necessarily requires the existence of some kind of document type definition (DTD). Current approaches to DTD design may be caricatured as falling into one of three camps, depending on their answer to the question "How many DTDs does the world need?"

For many of the first users of SGML, the appropriate answer was "one": the whole purpose of the exercise being to define a template against which all texts could be checked rigorously and consistently. This approach, which might be characterized by the phrase "we know what's best for you", has an obvious place in applications such as technical documentation, but is equally obviously inappropriate where the object of the exercise is to describe texts produced before the blessings of structured document design were revealed to the world.

At the opposite extreme are those whose answer would be "none", for whom no DTD can ever be adequate to the full complexity of the texts to be described: this attitude might be caricatured as "no-one will ever understand my problem". Again, it is not impossible to imagine applications for which a DTD consisting only of elements with the content model ANY would be entirely appropriate (the first electronic edition of the Oxford English Dictionary provides one obvious example), although its usefulness in the general case is less clear.

Perhaps most numerous are those who shrug their shoulders and say "as many as it takes": the world will always need new DTDs, in the boundary case, one per document. In the name of pragmatism, this attitude risks crowding the fledgling possibility of information interchange out of the nest entirely; nevertheless, its popularity reminds us that sometimes the document must drive the DTD, rather than the reverse.

The approach taken by the TEI attempts to combine virtues of all three of these approaches. It defines not one, but many possible DTDs, which may be tailored to the needs of a particular application in a way difficult or impossible with most other general purpose DTDs so far developed. The user of the TEI scheme is offered the opportunity of building a DTD which matches his or her requirements, but constrained to do so in a way that facilitates interchange.

We refer to this somewhat jocularly as the Chicago Pizza model. All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings. Using SGML syntax this might be summarized as follows:

<!ENTITY % base    "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)">
<!ELEMENT pizza - - (%base;, (cheese & tomato), (%topping;)*)>

In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one base tag set and his or her own selection of additional tag sets or toppings.

We use the term tag set to denote simply a collection of definitions for SGML elements and their attributes. These tag sets are the basic organizing principles of the TEI scheme, and are divided into four groups:

core: tag sets defining elements likely to be needed by all documents, and therefore available by default in all cases.
base: tag sets defining particular classes of document whose gross structure may vary; in general, only one base tag set is appropriate for a given document.
additional: tag sets defining sets of elements which may be found in any class of document but which are typically associated with some specialized application or detailed subject area.
auxiliary: tag sets comprising elements with highly specialized roles, typically for description of some part of the encoding scheme, and which make up a DTD independent of the main one.

In general, elements appear in only one tag set, though the current model allows for the redefinition of elements within different base tag sets. Elements may not be defined in more than one additional tag set.

This modularization is achieved by the use of parameter entities in the TEI DTD, which is further discussed below. To illustrate the basic mechanism we present here the start of a minimal TEI-conformant document in which the base tag set for prose has been selected together with the additional tag set for linking:

<!DOCTYPE tei.2 [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
]>
<tei.2>
</tei.2>

Because this selection of tag sets is effected explicitly by declarations within the DTD subset, as shown above, any recipient of the document can tell which TEI tag sets are required to process it. Any deviations or modifications of the TEI definitions (for example, the renaming of elements, or the addition of new ones) may be made in a similar declarative manner. Once a given view of the TEI DTD has been defined in this way, it can be fixed or compiled to preclude further modification and also to remove the complexity necessarily introduced by the extensive use of indirection in the TEI DTD.

The TEI core

Two core tag sets are available to all TEI documents without formality. The first defines a large number of elements which may appear in almost any kind of document, whatever kind of base tag set is in use. The second defines the header, providing something analogous to an electronic title page for the electronic text.

Elements available to all bases

The core tag set common to all TEI documents provides means of encoding, with a reasonable degree of sophistication, such textual features as:

typographically highlighted or quoted phrases (optionally distinguishing highlighting used for emphasis, technical terms, foreign words, titles etc.);
quoted phrases (optionally distinguishing amongst direct speech, quotation, glosses, cited phrases etc.);
data-like phrases such as names, numbers and measures, dates and times, etc.;
lists of all kinds;
basic editorial changes (e.g. correction of apparent errors; regularization and normalization; additions, deletions and omissions);
simple links and cross references, providing basic hypertextual features;
facilities for annotation, indexing, bibliographic citations and referencing systems.

There are few documents which do not exhibit some of these features; and none of these features is particularly restricted to any one kind of document. In some cases, an additional tag set is also available, providing more specialized elements for those wishing to encode aspects of these features in greater detail (for example, for verse and drama, and for names), but the elements defined in this core are believed to be adequate for most applications most of the time.

The header

The TEI scheme attaches particular importance to the provision of documentary or bibliographic information about electronic texts. Such information is essential for any satisfactory interchange of texts coming from multiple sources, or for which long term uses are envisaged.

The TEI header is one of the few mandatory elements in a TEI document. It has four major divisions which together provide a detailed syntax for the documentation of:

the electronic document itself and the sources from which it was derived;
the encoding system which has been applied;
descriptive information categorizing the document and its subject matter;
its revision history.

The first of these, the file description, contains traditional bibliographic material, detailing title, intellectual responsibility and publication or distribution information relating to an electronic text, which can readily be translated into a conventional catalogue record for use by the growing number of forward-thinking academic and public libraries now coming to terms with their new role as curators of non-print electronic materials.

Several commentators, noticing how the day to day information processing of all sectors of the economy now takes place in electronic form only, have expressed concern at the difficulties faced by librarians and archivists in handling these new forms of historical records. Others, trying to come to terms with the wealth of information in cyberspace, have lamented the absence of any effective cataloguing standards for networked resources and other forms of electronic publication. For creators of language corpora, the provision of such meta-descriptive information is essential, since without it analysis of the full complexity of language use is all but impossible. The TEI Header represents a major contribution to overcoming all these problems.

Many electronic texts are essentially derivative works, created either by keying or scanning previously existing print materials, combining or modifying previously existing electronic materials, or both. The source description part of the TEI header allows an encoder to specify the source or sources from which a text has been derived, using traditional bibliographic concepts. The pedigree of a TEI-conformant text can thus be specified, in the same way as a conventional book will generally document its publishing history. A detailed formal description of changes made in producing a text can be recorded as a distinct revision history; this is particularly useful for highly dynamic texts.

As noted above, the TEI is not a fixed encoding scheme, but offers a variety of options appropriate to different situations. Consequently, the encoding description within a TEI Header is of particular importance to users of an electronic document. It provides, in structured or unstructured form, vital information about editorial conventions or policies, design decisions and even the selection of tags actually used within the document.

The profile description is used to group together a wide range of additional descriptive information about a text, ranging from specifications of the languages used within it, the situation or social context in which it was produced, its topics or classification, to demographic or social characteristics of its authors or participants. No-one is likely to need all of these categories of information, but all of them are likely to be essential to some users.

A collection of TEI headers can also be regarded as a distinct document, and an auxiliary DTD is provided to support interchange of headers alone, for example between libraries or archives.
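
The four divisions described above can be sketched as a skeleton header. This is a minimal illustration only, not a complete or recommended header: the element names are those of the TEI header as we understand it, while the title, dates, names and all other content are hypothetical placeholders, and most of the optional sub-elements are omitted.

```sgml
<teiHeader>
  <!-- file description: bibliographic account of the electronic text and its source -->
  <fileDesc>
    <titleStmt>
      <title>A sample text: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished; prepared for illustration only.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Keyed from a (hypothetical) printed edition.</p>
    </sourceDesc>
  </fileDesc>
  <!-- encoding description: editorial conventions and design decisions -->
  <encodingDesc>
    <editorialDecl>
      <p>Apparent printing errors have been corrected and marked as such.</p>
    </editorialDecl>
  </encodingDesc>
  <!-- profile description: languages, context, classification -->
  <profileDesc>
    <langUsage>
      <language id="en">English</language>
    </langUsage>
  </profileDesc>
  <!-- revision description: history of changes to the file -->
  <revisionDesc>
    <change>
      <date>1 Jul 1995</date>
      <respStmt><name>A. N. Encoder</name><resp>ed.</resp></respStmt>
      <item>First draft of header</item>
    </change>
  </revisionDesc>
</teiHeader>
```

Only the file description is likely to be needed in every case; the other three divisions are shown here simply to make the overall shape of the header visible.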

The TEI base tag sets

To construct a view of the TEI DTD, the user must always choose one base tag set. Six of these are currently defined, for documents which are predominantly one of prose, verse, drama, transcribed speech, dictionaries, or terminological databases. Another two are provided for use with texts which combine these basic tag sets.

The choice of a base tag set determines the basic structure of all the documents with which it is to be used, reflecting the fact that subelements likely to appear within a dictionary (for example) will be entirely different in kind from those likely to appear within a letter or a novel, and even more so from those likely to be found in a transcription of spoken language. To cater for this variety, the constituents of all divisions of a TEI text element are not defined explicitly, but in terms of parameter entities. The mechanism used is to provide definitions like the following within the DTD, one of which the user must over-ride by supplying an appropriate declaration in the DTD subset:

<!ENTITY % TEI.prose "IGNORE">
<!ENTITY % TEI.dictionary "IGNORE">

The body of the main DTD contains a series of alternative definitions, each enclosed within an SGML marked section named after the base which it defines, as in this simplified example:


<![ %TEI.prose [
<!ENTITY % component "p|list" >
]]>

<![ %TEI.dictionary [
<!ENTITY % component "entry" >
]]>

<!ENTITY % component.seq "(%component)+">

Within the body of the DTD, elements are defined using these parameter entities only, for example:


<!ELEMENT div - - ((%component.seq)+)>

To select a base tag set a declaration such as the following should be supplied within the DTD subset for the document:


<!ENTITY % TEI.prose "INCLUDE">

This will over-ride the declaration within the TEI DTD itself, because it is given first. If no base is declared, the DTD will not compile.

The value of the parameter entity called component.seq will thus differ in different bases. In this way it is possible for the divisions of a text using the drama base (for example) to consist of speeches and stage directions, while those of a text using the dictionary base will consist of lexical entries.

Textual Divisions

Although the actual components may differ, textual components may be grouped into higher level divisions in almost any kind of text. These higher level units may variously be called chapters, sections, subdivisions, acts or parts, but all seem to behave in more or less the same way: they are incomplete in themselves, and nested hierarchically. In the TEI scheme all such objects are therefore regarded as the same kind of element, called here a division.

A type attribute may be used to distinguish amongst divisions in some respect other than their hierarchic position: the values for this attribute (as for several others in the TEI scheme) are not standardized, precisely because no consensus exists, or is likely to exist, as to a generic typology. A set of legal values should however be defined for a given application, either in the TEI Header or by a user-defined modification.

In the normal case, the components of all divisions in a particular base are homogeneous: they all use the same value for component.seq. However, the scheme also allows for two kinds of heterogeneity. If the general base is selected, together with two or more other bases, then different divisions of a text may have different constituents, though each division must itself be homogeneous. A mixed base is also defined, in which components from any selection of bases may be combined promiscuously across division boundaries.

This approach applies equally to the encoding of smaller units: rather than attempt to enumerate all the different analytic units which particular disciplines might find necessary, the TEI proposes two generic segmentation elements: one (s) for simple end-to-end segmentation, such as that commonly used in language corpora, roughly corresponding to the notion of orthographic sentence; the other (seg) for segments which can potentially self-nest. In either case, a type attribute may be used to distinguish different kinds of segment.
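
By way of illustration, nested divisions and the two segmentation elements might be used as in the following sketch. The values given for the type and n attributes are hypothetical, since (as noted above) no set of values is standardized, and the declarations needed to enable the s and seg elements are not shown.

```sgml
<body>
  <!-- divisions nest hierarchically; type distinguishes them, not the element name -->
  <div type="chapter" n="1">
    <head>The first chapter</head>
    <div type="section" n="1.1">
      <!-- s: simple end-to-end segmentation into orthographic sentences -->
      <p><s>This is one sentence.</s> <s>This is another.</s></p>
      <!-- seg: segments which may nest within one another -->
      <p><seg type="clause">when <seg type="phrase">the first draft</seg>
      was finished</seg> the work began again.</p>
    </div>
  </div>
</body>
```

The same element, div, serves for both the chapter and the section; an application wishing to constrain the permitted type values would define them separately.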

The TEI Class System and Modification Mechanisms

Textual features, and hence the elements which encode them, may be categorized or classified in a number of ways. The TEI scheme identifies two kinds of classification scheme: attribute classes and model classes; both are used for broadly similar purposes.

Members of an attribute class share the same set of attributes. For example, all elements which represent links or associations between one element and another do so using a common set of attributes, defined by the pointer attribute class.

Members of a model class share the same structural properties: that is, they may appear at the same position within the SGML document structure. For example, the class divtop contains all elements (headings, epigraphs etc.) which can appear at the start of a textual division; all elements used to mark editorial corrections or omissions are members of the class edit; elements marking bibliographic citations etc. are all members of the class bibl and so on.

Elements may of course be members of more than one class. Classes may have super- and sub-classes, and properties (notably associated attributes) may be inherited. Classes are defined in the TEI DTD by means of parameter entities, and used extensively for DTD maintenance, documentation, and extension.

The TEI scheme supports three kinds of user modification: new elements may be added into existing classes, and existing elements renamed or undefined. These operations are carried out in a controlled manner, using the class system and without any need for extensive revision of the TEI DTD itself.

The process of adding a new element to a class may be illustrated as follows. Consider the model class divtop mentioned above. Simplifying somewhat, this element class is defined as follows:

<!ENTITY % x.divtop "">
<!ENTITY % m.divtop "%x.divtop head | byline | epigraph">

To add a new element (say, keywords) to this class, enabling it to appear anywhere in the content model that other members of the class do, all that is needed is to re-define the x-entity within the document type subset:


<!ENTITY % x.divtop "keywords |">

Note the trailing vertical bar, which is required. As it happens, the element keywords is already defined in the TEI scheme (within the header); if it were not, an element declaration would also be necessary.

Parameter entities are also used to effect the two other kinds of modification mentioned above: the ability to undefine elements, and the ability to rename them.

Within the main TEI DTD, each element definition and its associated attribute list specification is enclosed by a marked section with the same name as the element, the default value for which is "INCLUDE". Thus, to undefine the element mentioned, all that is needed is a declaration like the following in the DTD subset:

<!ENTITY % mentioned "IGNORE">

A similar declaration may be used to rename any element; for example, to rename p as para:

<!ENTITY % n.p "para">

This works because all references to the p element throughout the TEI DTD are made indirectly, using the n.p entity. Furthermore, the original name for an element is recoverable by an SGML application, because it forms the value of a global attribute teiform of declared type FIXED.

All user-defined modifications of this kind are regarded as forming an additional tag set, which is embedded within the DTD in the same way as any other tag set, i.e. by enabling the TEI.extensions parameter entities. In this way a TEI document can make explicit the extent and nature of any modification required in the base TEI scheme for its processing. An auxiliary tag set is also provided for the documentation of additional SGML elements in a way compatible with that used for the rest of the scheme.
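
The three kinds of modification might be brought together as in the following sketch. It follows the simplified form of the earlier examples in supplying all declarations directly in the DTD subset, and uses the same abbreviated DOCTYPE declaration; the element name summary and its content model are hypothetical, chosen only because a new element (unlike keywords) needs a declaration of its own.

```sgml
<!DOCTYPE tei.2 [
<!-- select the prose base, as in the earlier examples -->
<!ENTITY % TEI.prose "INCLUDE">

<!-- 1. add a new (hypothetical) element to the divtop class;
        note the trailing vertical bar -->
<!ENTITY % x.divtop "summary |">
<!ELEMENT summary - - (#PCDATA)>

<!-- 2. undefine the element called mentioned -->
<!ENTITY % mentioned "IGNORE">

<!-- 3. rename p as para -->
<!ENTITY % n.p "para">
]>
```

A recipient reading this subset can see at a glance how the document's view of the scheme departs from the unmodified TEI definitions.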

The global attributes

One particularly important class is the global attribute class. By default the following attributes are members of this class and may therefore be supplied for all elements in the TEI scheme:

id: provides an SGML identifier for an element
n: provides a possibly non-unique name or number for an element
lang: specifies the language and hence the writing system used for an element
rend: provides information about the rendering of an element where this is not otherwise specified

This list may be extended: for example, selecting the additional tag set for analysis will add analytic attributes to the above list. The id and n attributes allow for the identification of any element occurrence within a TEI-conformant text. Elements carrying an id attribute value may be the object of a link or cross-reference, or any of the other re-structuring mechanisms proposed by the TEI for circumventing the rigidly hierarchic structure of a simple SGML DTD. The fact that the requirement for such links is usually unpredictable is one reason for making this attribute global.

Values on id attributes must be unique (their declared value is ID). Values on the n attribute however need not be; they may be used to carry a TEI canonical reference. A method for defining the structure of such canonical reference schemes is also provided, so that documents using it can be processed automatically.

The lang attribute indicates the language, and hence the writing system, applicable to the element's content, thus providing explicit support for polyglot or multiscript texts. If no value is given, that of the element's direct parent is assumed. (A number of TEI attributes have this characteristic, which is catered for by a TEI-defined keyword.) The value of this attribute identifies a special purpose language element which documents the language in use, optionally associating it with an external entity in which a formal writing system declaration (WSD) may be given.

A WSD defines a language/writing system pair (for example, Koine Greek, using TLG Beta Code), and is formally defined by an auxiliary DTD which allows each character to be systematically defined and documented, in terms of existing international or other standards, public or private entity sets, ad hoc transliteration schemes or explicit definitions, as well as combinations of all four.

Finally, the global rend attribute may be used to give information about the physical presentation of the text in the source, where this is not otherwise given. A default rendition may be specified for all elements of a given type. No specific set of values is defined for this attribute in the current draft, though it is probable that some suitable set of DSSSL primitives will be proposed in a later version.

It should be stressed that the rend attribute is not intended for use as a means of specifying the desired formatting of an element, except insofar as this may be determined by a desire to mimic the approximate appearance of the original text. Like other SGML applications, the TEI scheme attempts to provide elements for the encoding of those textual features deemed essential to a productive use of the encoded text; however, unlike most other SGML applications, the TEI scheme recognizes that for some, it is precisely the appearance of a text which is the object of research.
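
The four default global attributes might appear together as in the following sketch. The identifiers, the n values and the rend value are all hypothetical (no set of rend values is defined, as noted above), and the lang values are assumed to match the identifiers of language elements declared elsewhere in the document.

```sgml
<!-- id must be unique within the document; n need not be -->
<div id="c2" n="2" type="chapter" lang="en">
  <head>Of haste</head>
  <!-- lang is inherited from the parent unless overridden, as on foreign -->
  <p id="c2p1" n="1">The motto <foreign lang="la">festina lente</foreign>
  is printed in <hi rend="italic">italic type</hi> in the source.</p>
</div>
```

Here the rend value records how the phrase appears in the source; it is not an instruction about how the phrase should be formatted on output.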

The TEI additional tag sets

Ten additional tag sets are defined by the current TEI proposals. These include tag sets for special application areas such as the orthographic transcription of speech, the detailed physical description of manuscript or print material, and the recording of an electronic variorum modelled on the traditional critical apparatus. A tag set is defined for the detailed documentation of contextual information needed by language corpora, as well as for the detailed encoding of names and dates; abstractions such as networks, graphs or trees; mathematical formulae and tables etc.

In addition to these application-specific additional tag sets, some more general purpose additional tag sets are defined for:

linking and alignment
analysis and interpretation
feature structure analysis

The tag set for linking and alignment extends the set of linking and pointing elements already defined in the TEI core to provide facilities for linking to arbitrary locations or spans of text, whether or not these are in the current document, and whether or not the target is an SGML document. Mechanisms are included for recording the alignment or correspondence of parts of a text, for example in multilingual corpora, or for marking the alignment of audio or video with its transcription. As such, this tag set provides a usefully large subset of the facilities offered by the HyTime standard, but with a considerably simpler and more efficient interface. [NOTE: Witness the fact that, as of May 1995, support for the TEI extended pointer mechanism has already been implemented in SoftQuad's Panorama Pro and Electronic Book Technologies' DynaText, the two market leaders amongst commercial SGML browsing software.]

As noted above, a generic segmentation element is defined for the identification of textual spans appropriate to any analytic scheme. An out-of-line generic interp element may be used to link arbitrary text segments (which may be nested or discontinuous) with any user-defined set of attribute/value pair interpretations. Specific tags are also defined for the most common requirements of linguistic analysis such as identification and typing of morphemes, words, phrases, and sentences.
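
The out-of-line principle can be sketched as a simple data model. This is an illustration only, not TEI syntax: the identifiers, the dictionary keys and the function interpretations_of are all hypothetical. The point is that interpretations live apart from the text and refer to segments by identifier, so that a single interpretation may cover nested or discontinuous spans.

```python
# Text segments, each with an identifier (hypothetical data).
segments = {
    "s1": "Call me",
    "s2": "Ishmael",
    "s3": "Some years ago",
}

# Interpretations held apart from the text: each is a type/value pair
# plus the identifiers of the segments it applies to.
interps = [
    {"type": "mood", "value": "imperative", "targets": ["s1", "s2"]},
    {"type": "name", "value": "personal", "targets": ["s2"]},
    {"type": "theme", "value": "memory", "targets": ["s1", "s3"]},  # discontinuous
]

def interpretations_of(seg_id):
    """Return the (type, value) pairs attached to one segment."""
    return [(i["type"], i["value"]) for i in interps if seg_id in i["targets"]]

for seg_id, text in segments.items():
    print(seg_id, repr(text), interpretations_of(seg_id))
```

Because the interpretations are not embedded in the segments themselves, new analyses can be added, or existing ones withdrawn, without disturbing the transcription.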

A specialized tag set is also provided for the encoding of abstract interpretations of a text, either in parallel with it or embedded within it. This is based on the feature structure notation employed in theoretical linguistics, but has applications beyond linguistic theory. [NOTE: An introduction to this tag set is provided by D. T. Langendoen and G. F. Simons "A rationale for the TEI recommendations for feature-structure markup" in Computers and the Humanities (forthcoming, 1995); for an extended discussion of an application of the feature structure scheme to the problems of encoding historical source materials, see D. I. Greenstein and L. Burnard "Speaking with one voice" (ibid.).]

Using this mechanism, encoders can define arbitrarily complex bundles or sets of features identified in a text, according to their own methodological bias. They may thus embed a whole range of interpretations of a text, linguistic, literary, or thematic, within a text in a controlled manner. The syntax defined by the Guidelines not only formalizes the way in which such features are encoded, but also provides for a detailed specification of legal feature/value pair combinations and rules determining, for example, the implication of under-specified or defaulted features. This is known as a feature system declaration and is defined by an auxiliary tag set.
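
The role of a feature system declaration can be sketched as follows. This is not the TEI syntax: the declaration is modelled as a plain Python dictionary, and the feature names, values, defaults and the function complete are all hypothetical, chosen only to show how legal feature/value pairs might be checked and defaulted features supplied.

```python
# Hypothetical declaration: feature name -> (legal values, default or None).
FSD = {
    "pos": ({"noun", "verb", "adjective"}, None),
    "number": ({"singular", "plural"}, "singular"),
}

def complete(fs, fsd=FSD):
    """Check a feature structure (a dict) against the declaration,
    then return a copy with defaulted features filled in."""
    for name, value in fs.items():
        if name not in fsd:
            raise ValueError(f"undeclared feature: {name}")
        legal, _default = fsd[name]
        if value not in legal:
            raise ValueError(f"illegal value {value!r} for feature {name}")
    full = dict(fs)
    for name, (_legal, default) in fsd.items():
        if name not in full and default is not None:
            full[name] = default  # supply the declared default
    return full

print(complete({"pos": "noun"}))
```

An under-specified structure such as one giving only the part of speech is completed from the declaration, while a structure using an undeclared feature or value is rejected.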

An additional tag set is also provided for the encoding of degrees of uncertainty or ambiguity in the encoding of a text. These tag sets exhibit in a particularly noticeable form one of the chief strengths of the TEI approach to encoding: it provides the encoder with a well-defined set of tools which can be used to make explicit his or her reading of a text. No claim to absolute authority is made by any encoder, nor ever should be; the TEI scheme merely allows encoders to come clean about what they have perceived in a text, to whatever degree of detail seems appropriate.

A user of the TEI scheme may combine as many or as few additional tag sets as suit his or her needs. The existence of tag sets for particular application areas in the current draft reflects, to some extent, accidents of history: no claim to systematic or encyclopaedic coverage is implied. Indeed, it is confidently expected that new tag sets will be added, and that their definition will form an important part of the continued work of this and successor projects.

From General to Specific

The TEI Guidelines have taken more than five years to reach their present state, the first at which they can be said to be reasonably complete. In retrospect, it is doubtless true that they could have been created much more quickly with less involvement from the research community, or a clearer statement from it of a set of particular goals. But that statement would have inevitably limited the scope of the resulting scheme, providing exactly the kind of straitjacket which we wished to avoid. Moreover, by prioritizing any one research agenda, however well-articulated, we would have effectively disenfranchised and alienated all others. A little like the early Church fathers then, the TEI chose to provide as broad and as catholic a means of salvation as possible.

At the same time, the TEI scheme applies rigorously the principle essentia non sunt multiplicanda praeter necessitatem [NOTE: Generally attributed to William of Occam (1300-1349), this recommendation is known as Occam's Razor; it may be translated as "Essences should not be unnecessarily multiplied" and refers properly to the distinction made by the Scholastics between essence (those properties of an entity which define its type) and accidents (those properties specific only to one instance of an entity)]. Rather than defining discrete elements for different kinds of list (bulleted, glossary, enumerated etc.), the TEI scheme defines a single list element which bears a type attribute to distinguish amongst these various kinds. In the same way, all kinds of links between document elements, whatever their semantics, are encoded using the same tags. To cope with the indefinite number of elements potentially needed for all kinds of analysis and interpretation, a small number of generic tags are proposed which (in the case of the feature structure tag set referred to above) are sufficiently abstract and general to cater for almost any kind of interpretative judgment.
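
The practical consequence of a single list element can be sketched in a few lines. The sketch assumes an XML rendering of the markup, an item child element, and the type values "ordered" and "bulleted"; these values, the plain-text layouts and the function render_list are illustrative assumptions rather than definitions taken from the Guidelines.

```python
import xml.etree.ElementTree as ET

def render_list(list_elem):
    """Render one generic list element as lines of plain text,
    choosing the layout from its type attribute."""
    kind = list_elem.get("type", "simple")
    items = [(item.text or "").strip() for item in list_elem.findall("item")]
    if kind == "ordered":
        return [f"{n}. {text}" for n, text in enumerate(items, start=1)]
    if kind == "bulleted":
        return [f"* {text}" for text in items]
    return items  # any other type: plain items, no decoration

ordered = ET.fromstring(
    '<list type="ordered"><item>core</item><item>base</item></list>')
bulleted = ET.fromstring(
    '<list type="bulleted"><item>core</item><item>base</item></list>')

print(render_list(ordered))
print(render_list(bulleted))
```

Software needs to recognize only one element; a new kind of list requires a new attribute value and a new branch in the processing, not a change to the element inventory.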

At the same time, there remain many situations in which the TEI's desire to exclude no one has led to a multiplication of distinctions at first sight rather bewildering. It seems, to say the least, unlikely that anyone will ever encode a document using every possible element defined by the union of every TEI tag set, though such a monster DTD is indeed possible.

As published, the Guidelines constitute a substantial document unsuitable for casual browsing, even in electronic form. The TEI therefore plans to make available a number of smaller introductory tutorials focused on particular application areas. Two such have already appeared: one dealing with terminological systems [NOTE: Melby, Alan et al., Terminology Interchange Format (TIF): a tutorial (Vienna, Infoterm, 1993)] and the other with the encoding of manuscript transcriptions [NOTE: Robinson, Peter, Encoding of Primary Sources Using SGML (Oxford, Office for Humanities Communication, 1994)].

A third tutorial has also recently been completed, documenting a special pedagogically-motivated subset of some 200 elements, selected from the whole TEI scheme (not just the core). Known as TEI Lite, this DTD has already been used in two electronic publishing projects and is in use at electronic text repositories at the Universities of Oxford, Virginia and Michigan, and elsewhere. [NOTE: At the time of writing, the document defining this scheme is only available in electronic form, as Sperberg-McQueen, C.M. and Lou Burnard, TEI Lite: An Introduction to the TEI encoding scheme (Chicago and Oxford, May 1995), from the URLs http://www-tei.uic.edu/orgs/tei or http://info.ox.ac.uk/~archive/teilite]

The real proof of the effectiveness of the TEI design will come only with its widespread adoption, tailored to the particular needs of individual projects. As far as can be judged from the long list of early implementors, such evidence will soon be forthcoming.

Conclusions

This article has focussed chiefly on the complexity and generality of the TEI scheme, with a view to demonstrating its intellectual adequacy and its potential as a model for many SGML applications.

It has also attempted to demonstrate how a simple modular scheme can be implemented in such a way as to maximize the space within which information interchange can take place.

The origins of the TEI scheme in the academic world mean that it has been designed with the widest possible set of applications in mind. Optimizing it for particular sets of users will be a new challenge.
