Sams Teach Yourself XML in 21 Days

Day 1

What Is XML and Why Should I Care?

The Web Grows Up
Where HTML Runs Out of Steam
So What's Wrong with...?
SGML
Why Not SGML?
Why XML?
What XML Adds to SGML and HTML
Is XML Just for Programmers?
Summary
Q&A
Exercise

Welcome to Teach Yourself XML in 21 Days! This chapter starts you on the road to mastering the Extensible Markup Language (XML). Today you will learn

The importance of XML in a maturing Internet
The weaknesses of HTML that make it unsuitable for Internet commerce
What SGML, the Standard Generalized Markup Language, is and how XML relates to it
The weaknesses of other tag and markup languages
What XML adds to both SGML and HTML
The advantages of XML for non-programmers

The Web Grows Up

Love them or hate them, the Internet and the World Wide Web (WWW) are here to stay. No matter how much you try, you can't avoid the Web playing an increasingly important role in your life.

The Internet has gone from a small experiment carried out by a bunch of nuclear research scientists to one of the most phenomenal events in computing history. It sometimes feels like we have been experiencing the modern equivalent of the Industrial Revolution: the dawning of the Information Age.

In his original proposal to CERN (the European Laboratory for Particle Research) management in 1989, Tim Berners-Lee (the acknowledged inventor of the Web) described his vision of

...a universal linked information system, in which generality and portability are more important than fancy graphics and complex extra facilities.

The Web has certainly come a long way in the last ten years, and I sometimes wonder what Berners-Lee thinks of his invention in its present form.

The Web is still in its infancy, however. Use of the Web is slowly progressing beyond the stage of family Web pages, but the dawn of electronic commerce (e-commerce) via the Internet has not yet broken. By e-commerce, I do not mean being able to order things from a Web page, such as books, records, CDs, and software. This kind of commerce has been going on for several years, and some companies--most notably Amazon.com--have made a great success of it. My definition of e-commerce goes much deeper than this. Various new initiatives have appeared in recent years that are going to change the way a lot of companies look at the Web. These include

Using the Internet to join the parts of distributed companies into one unit
Using the Internet for the exchange of financial transaction information (credit card transactions, banking transactions, and so on)
The exchange over the Internet of medical transaction data between patients, hospitals, physicians, and insurance agencies
The distribution of software via the Web, including the possibility of creating zero-install software and of modularizing the massive suites of software in programs such as Microsoft Word so that you only load, use, and pay for the parts that you need

NOTE: Every time you visit a Web site that supports Java, JavaScript, or some other scripting language, you are in fact running a program over the Web. After you've finished with it, all that's left in your Web browser's cache is possibly a few scraps of code. Several software companies--including Microsoft--want to distribute software in this way. They'd gain by constantly generating new income from their software, and you would benefit by only having to pay for the software you used at the time that you used it, and only for as long as you used it.

Whereas most of these applications are impossible using Hypertext Markup Language (HTML), XML can make all these applications (and many more) real possibilities. In a sense, XML is the enabling technology that heralds the appearance of a new form of Internet society. XML is probably the most important thing to happen to the Web since the arrival of Java.

So why can XML do what HTML can't? Read on for an explanation.

Where HTML Runs Out of Steam

Before we look at all the weaknesses of HTML, let's get one thing clear: HTML has been, and still is, a fantastic success.

Designed to be a simple tagging language for displaying text in a Web browser, HTML has done a wonderful job and will probably continue to do so for many years to come. It is no exaggeration to say that if there hadn't been HTML, there simply wouldn't have been a Web. Although Gopher, WAIS, and Hytelnet, among others, predated HTML, none of them offered the same trade-off of power for simplicity that HTML does.

Although HTML might still be considered the killer Internet application, there have been a lot of complaints leveled against it. Furthermore, people are now realizing that XML is superior to HTML. Following are some of the most frequently cited complaints against HTML (but many of them aren't really legitimate, as you will see from my comments):

HTML lacks syntactic checking: You cannot validate HTML code.
Yes and no. There are formal definitions of the structure of HTML documents--as you will learn later, HTML is an SGML application and there is a document type definition (DTD) for every version of HTML.

NOTE: The document type definition (DTD) is an SGML or XML document that describes the elements and attributes allowed inside all the documents that can be said to conform to that DTD. You will learn all about XML DTDs in later chapters.

There are also some tools (and one or two Web sites) readily available for checking the syntax of HTML documents. This raises the question of why more people don't validate their HTML documents; the answer is that validation makes little practical difference, because Web browsers are designed to accept almost anything that looks even slightly like HTML (which runs the risk that the display will look nothing like what you expected--but that's another story). Strangely enough, the only tag that is compulsory in an HTML document is the TITLE tag; equally strangely, this is one of the least common tags there is.
HTML lacks structure.
Not really. HTML has ordered heading tags (H1 to H6), and you can nest blocks of information inside DIV tags. Browsers don't care what order you use the headings in, and often the choice is simply based on the size of the font in which they are rendered. This isn't HTML's fault. The problem lies in how HTML code is used.
HTML is not content-aware.
Yes and no. Searching the Web is complicated by the fact that HTML doesn't give you a way to describe the information content--the semantics--of your documents. In XML you can use any tags you like (such as <NAME> instead of <H3>), but using attributes in tags (such as <H3 CLASS="name">) can embed just as much semantic information as custom tags can. Without any agreement on tag names, the value of custom tags becomes a bit doubtful. To worsen matters, the same tag name in one context can mean something completely different in another. Furthermore, there are the complications of foreign languages--seeing <inkoopprijs> isn't going to help very much if you don't know that it's Dutch for "purchase price".
HTML is not international.
Mostly true. There were a few proposals to internationalize HTML, and most particularly to give it a way of identifying the language used inside a tag.
HTML is not suitable for data interchange.
Mostly true. HTML's tags do little to identify the information that a document contains.
HTML is not object-oriented.
True. Modern programmers have been making a long and difficult transition to object-oriented techniques. They want to leverage these skills and have such things as inheritance, and HTML has done very little to accommodate them.
HTML lacks a robust linking mechanism.
Very true. If you've spent a few hours on the Web, you've probably encountered at least one broken link. Although broken links are the curse of Web managers the world over, there is little that can be done to prevent them. HTML's links are very much one-to-one, with the linking hard-coded in the source HTML files. If the location of one target file changes, a Webmaster may have to update dozens or even hundreds of other pages.
HTML is not reusable.
True. Depending on how well-written they are, HTML pages and fragments of HTML code can be extremely difficult to reuse because they are so specifically tailored to their place in the web of associated pages.
HTML is not extensible.
True but unfair. This is a bit like saying that an automobile makes a better motor vehicle than a bicycle. HTML was never meant to be extensible.
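
The content-awareness point above can be made concrete with a minimal sketch in Python, using only the standard library's xml.etree.ElementTree module. Both fragments are made up for illustration: the entry, name, and div elements and the class value are assumptions, and the HTML-style fragment is written as well-formed markup so that the same parser can read it. The sketch shows that the same semantic information can be recovered either from a custom tag name or from an attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using a custom tag, as XML allows.
custom = ET.fromstring("<entry><name>John Belcher</name></entry>")

# Hypothetical HTML-style fragment carrying the same meaning in an attribute.
# It is written as well-formed markup so that an XML parser can read it.
html_style = ET.fromstring('<div><h3 class="name">John Belcher</h3></div>')

# With a custom tag, the element name itself says what the content is.
name_from_tag = custom.findtext("name")

# With a class attribute, the meaning lives in the attribute value instead.
name_from_class = html_style.findtext("h3[@class='name']")

print(name_from_tag)
print(name_from_class)
```

Either way, the program can find the name only because it already knows which tag name or attribute value to look for, which is why agreement on names matters more than the mechanism.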

So what's really wrong with HTML? Not a lot, for everyday Web page use. However, looking at the future of electronic commerce on the Web, HTML is reaching its limits.

So What's Wrong with...?

All right, if HTML can't handle it, what's wrong with TeX, PDF, or RTF?

TeX is a computer typesetting language that still flourishes in scientific communities. In the early 1980s, there were online databases that returned data in TeX form that could be inserted straight into a TeX document. Adobe owns the PDF (Adobe Acrobat) standard, but it is fairly well documented. RTF is the property of Microsoft and, as many Windows Help authors will tell you, it is poorly documented and extremely unreliable. The RTF code created by Word 97 is not the same as the code created by Word 95, for example, and in some areas the two versions are completely incompatible.

All of these formats suffer from the same weaknesses: they are proprietary (owned by a commercial company or organization), they are not open, and they are not standardized. By using one of these formats, you risk being left out in the cold. Although the market represents a strong stabilizing force (as seen with RTF), when you place too much reliance on a format over which you have no control and into which you have little insight, you are leaving yourself open to a lot of problems if and when that format changes.

SGML

I'm going to try to avoid teaching you any more about SGML than I have to. Although it can be helpful to know a little about it, in many ways you're probably better off not knowing anything about it at all. The problem with learning too much about SGML is that when you move to XML you'd have to spend most of your time forgetting a lot of the things you'd just learned. XML is different enough from SGML that you can become an expert in XML without knowing a thing about SGML.

That said, XML is very much a descendant of SGML, and knowing at least a little about SGML will help put XML in context.

The Standard Generalized Markup Language (SGML), from which XML is derived, was born out of the basic need to make data storage independent of any one software package or software vendor. SGML is a meta language, or a language for describing markup languages. HTML is one such markup language and is therefore called an SGML application. There are dozens, maybe even hundreds, of markup languages defined using SGML. In XML, these applications are often called markup languages--such as the hand-held device markup language (HDML) and the FAQ markup language (QML).

In SGML, most of these markup languages haven't been given formal names; they are simply referred to by the name of their document type definition (DocBook), their purpose (LinuxDOC), their application (TEI), or even the standard they implement (J2008 - automobile parts, Mil-M-38784 - US Military).

By means of an SGML declaration (XML also has one), the SGML application specifies which characters are to be interpreted as data and which characters are to be interpreted as markup. (They do not have to include the familiar < and > characters; in SGML they could just as easily be { and } instead.)

Using the rules given in the SGML declaration and the results of the information analysis (which ultimately creates something that can easily be considered an information model), the SGML application developer identifies various types of documents--such as reports, brochures, technical manuals, and so on--and develops a DTD for each one. Using the chosen characters, the DTD identifies information objects (elements) and their properties (attributes).

The DTD is the very core of an SGML application; how well it is made largely determines the success or failure of the whole activity. Using the information elements defined in the DTD, the actual information is then marked up using the tags identified for it in the application. If the development of the DTD has been rushed, it might need continual improvement, modification, or correction. Each time the DTD is changed, the information that has been marked up with it might also need to be modified because it may be incorrect. Very quickly, the quantity of data that needs modification (now called legacy data) can become a far more serious problem--one that is more costly and time-consuming than the problem that SGML was originally introduced to solve.

You are already getting a feel for the magnitude of an SGML application. There are good reasons for this magnitude: SGML was built to last. At the back of the developers' minds were ideas about longevity and durability, as were thoughts of protecting data from changes in computer software and hardware in the future.

SGML is the industrial-strength solution: expensive and complicated, but also extremely powerful.

Why Not SGML?

The SGML on the Web initiative existed a long time before XML was even considered. Somehow, though, it never really succeeded. Basically, SGML is just too expensive and complicated for Web use on a large scale. It isn't that it can't be used--it's that it won't be used. Using SGML requires too much of an investment in time, tools, and training.

Why XML?

XML uses the features of SGML that it needs and tries to incorporate the lessons learned from HTML.

NOTE: One of the most important links between XML and SGML is XML's use of a DTD. On Day 17, "Using XML for Data," you will learn more about the developments that are underway to cut this major link to SGML and replace the DTD with something more in keeping with the data-processing requirements of XML applications.

When the designers of XML sat down to write its specifications, they had a set of design goals in mind (detailed in the recommendation document). These goals and the degree to which they have already been met are why XML is considered better than SGML:

XML can be used with existing Web protocols (such as HTTP and MIME) and mechanisms (such as URLs), and it does not impose any additional requirements. XML has been developed with the Web in mind--features of SGML that were too difficult to use on the Web were left out, and features that are needed for Web use either have been added or are inherited from applications that already work.
XML supports a wide variety of applications. It is difficult to support a lot of applications with just HTML; hence, the growth of scripting languages. HTML is simply too specific. XML adopts the generic nature of SGML, but adds flexibility to make it truly extensible.
XML is compatible with SGML, and most SGML applications can be converted into XML. In the foreseeable future, the SGML standard will be amended to make XML applications fully backward-compatible.
It is easy to write programs that process XML documents. One of the major strengths of HTML is that it's easy for even a non-programmer to throw together a few lines of scripting code that enable you to do basic processing (and there's an amazing variety of scripting languages available). HTML even includes some features of its own that enable you to carry out some basic processing (such as forms and CGI query strings). XML has learned a lesson from HTML's success and has tried to stay as simple as possible by throwing out a lot of SGML's more complex features. XML processing applications are already appearing in Java, SmallTalk, C, C++, JavaScript, Tcl, Perl, and Python, to name just a few.
The number of optional features in XML has been kept to an absolute minimum. SGML has many optional features, so SGML software has to support all of them. It can be argued that there isn't actually a single software package that supports all of SGML's features (and it's difficult to imagine an application that actually needs all of them). This degree of power immediately implies complexity, which also means size, cost, and sluggishness. The speed of the Web is already becoming a major concern; it's bad enough to wait for a document to download, but if you had to wait ages for it to be processed as well, XML would be doomed from the start.
XML documents are reasonably clear to the layperson. Although it is becoming increasingly rare, and even difficult, for HTML documents to be typed in manually, and XML documents weren't intended to be created by human beings, this remains a worthy goal. Machine encoding is limited in longevity and portability, often being tied to the system on which it was created. XML's markup is reasonably self-explanatory.

Given the time, you can print out any XML document and work out its meaning--but it goes further than this. A valid XML document

Describes the structural rules that the markup attempts to follow
Lists any external resources (external entities) that are part of the document
Declares any internal resources (internal entities) that are used within the document
Lists the types of non-XML resources (notations) used and identifies any helper applications that might be needed
Lists any non-XML resources (binaries) that are used within the document and identifies any helper applications that might be needed

The design of XML is formal and concise. The XML specification is based on Extended Backus-Naur Form (EBNF), a notation well understood by the majority of programmers. Information marked up in XML can be easily processed by computer programs. Better still, because the specification uses a notation that is familiar to computer programmers and almost completely unambiguous, it is reasonably easy for programmers to develop programs that work with XML.
XML documents are easy to create. HTML is almost famous for its ease of use, and XML capitalizes on this strength. In fact, it is actually even easier to create an XML document than an HTML document. After all, you don't have to learn any markup tags--you can create your own!
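
The idea that an XML document carries its own structural rules and internal entities can be illustrated with a small sketch. The document below is invented for this purpose: the greeting, salutation, and target elements and the who entity are assumptions, not part of any standard vocabulary. Note that Python's standard-library parser, used here, does not validate the document against its DTD; it only reads the declarations and expands the internal entity.

```python
import xml.etree.ElementTree as ET

# A made-up document whose DOCTYPE declares its own structural rules
# (the ELEMENT declarations) and one internal entity (who).
document = """<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (salutation, target)>
  <!ELEMENT salutation (#PCDATA)>
  <!ELEMENT target (#PCDATA)>
  <!ENTITY who "World">
]>
<greeting>
  <salutation>Hello</salutation>
  <target>&who;!</target>
</greeting>"""

# The parser reads the declarations and replaces &who; with its value.
# It does not check the content against the ELEMENT declarations.
root = ET.fromstring(document)
salutation = root.findtext("salutation")
target = root.findtext("target")

print(salutation, target)
```

Checking that the content really obeys the declared rules would need a validating parser, which this sketch deliberately leaves out.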

What XML Adds to SGML and HTML

XML takes the best of SGML and combines it with some of the best features of HTML, and adds a few features drawn from some of the more successful applications of both. XML takes its major framework from SGML, leaving out everything that isn't absolutely necessary. Each facility and feature was examined, and if a good case couldn't be made for its retention, it was scrapped. XML is commonly called a subset of SGML, but in technical terms it's an application profile of SGML; whereas HTML uses SGML and is an application of SGML, XML is just SGML on a smaller scale.

From HTML, XML inherits the use of Web addresses (URLs) to point to other objects. From HyTime (a very sophisticated application of SGML, officially called ISO/IEC 10744 Hypermedia/Time-based Structuring Language) and an academic application of SGML called the Text Encoding Initiative (TEI), XML inherits some other extremely powerful addressing mechanisms that allow you to point to parts and ranges of other documents rather than simple single-point targets, for example.

XML also adds a list of features that make it far more suitable than either SGML or HTML for use on an increasingly complex and diverse Web:

Modularity--Although HTML appears to have no DTD, there is an implied DTD hard-wired into Web browsers. SGML has a limitless number of DTDs, on the other hand, but there's only one for each type of document. XML enables you to leave out the DTD altogether or, using sophisticated resolution mechanisms, combine multiple fragments of either XML instances or separate DTDs into one compound instance.
Extensibility--XML's powerful linking mechanisms allow you to link to material without requiring the link target to be physically present in the object. This opens up exciting possibilities for linking together things like material to which you do not have write access, CD-ROMs, library catalogs, the results of database queries, or even non-document media such as sound fragments or parts of videos. Furthermore, it allows you to store the links separately from the objects they link (perhaps even in a database, so that the link lists can be automatically generated according to the dynamic contents of the collection of documents). This makes long-term link maintenance a real possibility.
Distribution--In addition to linking, XML introduces a far more sophisticated method of including link targets in the current instance. This opens the doors to a new world of composite documents--documents composed of fragments of other documents that are automatically (and transparently) assembled to form what is displayed at that particular moment. The content can be instantly tailored to the moment, to the media, and to the reader, and might have only a fleeting existence: a virtual information reality composed of virtual documents.
Internationality--Both HTML and SGML rely heavily on ASCII, which makes using foreign characters very difficult. XML is based on Unicode and requires all XML software to support Unicode as well. Unicode enables XML to handle not just accented Western characters, but also Asian languages. (On Day 8, "XML Objects: Exploiting Entities," you will learn all about character sets and character encoding.)
Data orientation--XML operates on data orientation rather than readability by humans. Although being humanly readable is one of XML's design goals, electronic commerce requires the data format to be readable by machines as well. XML makes this possible by defining a form of XML that can be more easily created by a machine, but it also adds tighter data control through the more recent XML schema initiatives.
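
As a rough illustration of the internationality point, the following Python sketch parses a small document that mixes Dutch and Japanese text. The prices and label elements and the lang attribute are hypothetical names chosen for this example only. The document is encoded as UTF-8 bytes before parsing, much as it would be when sent over the Web.

```python
import xml.etree.ElementTree as ET

# Sample strings: a Dutch label with a euro sign and a Japanese label.
price_nl = "inkoopprijs \u20ac 12,50"
price_ja = "価格"

# Hypothetical document; the element and attribute names are assumptions.
source = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<prices>"
    f'<label lang="nl">{price_nl}</label>'
    f'<label lang="ja">{price_ja}</label>'
    "</prices>"
)

# Encode to UTF-8 bytes, then let the parser decode them again.
data = source.encode("utf-8")
root = ET.fromstring(data)

# Collect the label text keyed by the value of the lang attribute.
labels = {label.get("lang"): label.text for label in root.iter("label")}

print(labels)
```

The non-ASCII characters come back from the parser unchanged, with no special handling in the program.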

Is XML Just for Programmers?

Having read this far, you might think that XML is only for programmers and that you can quite happily go back to using HTML. In many ways you'd be right, except for one important point: If programmers can do more with XML than they can with HTML, eventually this will filter down to you in the form of application software that you can use with your XML data. To take full advantage of these tools, however, you will need to make your data available in XML. As of yet, support for XML in Web browsers is incomplete and unreliable (you will learn how to display XML code in Mozilla and Internet Explorer 5 later on), but full support will not take long.

In the meantime, is XML just for programmers? Definitely not! One of the problems with HTML is that all the tags are optional, so you have to be somewhat familiar with all of them in order to make the best choice. Worse, your choice will be affected by the way the code looks in a particular browser. But XML is extensible, and extensibility works both ways--it also means you can use less rather than more. Instead of having to learn more than 40 HTML tags, you can mark up your text in a way that makes a lot more sense to you and then use a style sheet to handle the visible appearance. Listing 1.1 shows a typical XML document that marks up a basic sales contact entry.

Listing 1.1 A Simple XML Document

<?xml version="1.0"?>
<contacts>
  <contact>
    <name>
      <first>John</first>
      <last>Belcher</last>
    </name>
    <address>
      <street>Pennington 13322</street>
      <city>Washington</city>
      <state>DC</state>
      <zip>66522</zip>
    </address>
    <tel>555 1276</tel>
    <fax>555 9983</fax>
    <mobile>887 8887 7777</mobile>
    <email>jb@southside.com</email>
  </contact>
</contacts>
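
To give a feel for how easily a program can work with markup like this, here is a minimal sketch in Python that reads the contact entry from Listing 1.1. It uses the standard library's xml.etree.ElementTree module; the variable names (listing_1_1, summaries) are chosen for this example only, and this is just one of many ways to read the document.

```python
import xml.etree.ElementTree as ET

# The contact entry from Listing 1.1.
listing_1_1 = """<?xml version="1.0"?>
<contacts>
  <contact>
    <name>
      <first>John</first>
      <last>Belcher</last>
    </name>
    <address>
      <street>Pennington 13322</street>
      <city>Washington</city>
      <state>DC</state>
      <zip>66522</zip>
    </address>
    <tel>555 1276</tel>
    <fax>555 9983</fax>
    <mobile>887 8887 7777</mobile>
    <email>jb@southside.com</email>
  </contact>
</contacts>"""

root = ET.fromstring(listing_1_1)

# Walk every <contact> and pull out fields by their tag names.
summaries = []
for contact in root.findall("contact"):
    full_name = f"{contact.findtext('name/first')} {contact.findtext('name/last')}"
    city = contact.findtext("address/city")
    email = contact.findtext("email")
    summaries.append((full_name, city, email))

print(summaries)
```

Because the tags name the information they contain, the program asks for a city or an email address directly instead of guessing from the position or appearance of the text.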

As Listing 1.1 suggests, you can make your markup very rich in information (semantic content). The great thing about XML is that you can adapt it to your needs. When you need less you can use less, as demonstrated by Listing 1.2. (It would hardly be in keeping with all the other computer language-oriented books in the world if we didn't include some kind of "Hello World" example.)

Listing 1.2 "Hello World" in XML

<?xml version="1.0"?>
<greeting>
  <salutation>Hello</salutation>
  <target>World!</target>
</greeting>

XML is already becoming the preferred language for interfacing between databases and the Web, and it is becoming an important method for interchanging data between computers and computer applications. However, at the level of "ordinary" Web document authoring, XML still has a lot to offer. The wonderful thing about XML is that it can actually be even simpler than HTML! You can decide what tags you'll need, and how many, and you can choose names that mean something sensible either to you or to your readers. Instead of producing documents containing meaningless jumbles of H1, H2, P, LI, UL, and EM tags, you can say what you really mean and use CHAPTER, SECTION, PARAGRAPH, LIST.ITEM, UNNUMBERED.LIST, and IMPORTANT. This doesn't only make your documents more meaningful, it makes them more accessible to other people. Tools (such as search engines) will be able to make more intelligent enquiries about the content and structure of your documents and make meaningful inferences about your documents that could far exceed what you originally intended.
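
The idea of marking up text with names that mean something and leaving the appearance to a separate step can be sketched in a few lines of Python. The sample document and the TAG_MAP dictionary below are assumptions made for illustration: the dictionary merely plays the role of a very small style sheet and is not a real style sheet language.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using descriptive tag names of the author's choosing.
source = """<CHAPTER>
  <SECTION>What Is XML?</SECTION>
  <PARAGRAPH>XML lets you <IMPORTANT>name your own tags</IMPORTANT>.</PARAGRAPH>
</CHAPTER>"""

# Assumed mapping from descriptive names to HTML presentation tags.
# This dictionary only plays the role of a very small style sheet.
TAG_MAP = {
    "CHAPTER": "div",
    "SECTION": "h2",
    "PARAGRAPH": "p",
    "IMPORTANT": "em",
}

root = ET.fromstring(source)

# Rename each element in place; text and nesting are left untouched.
for element in root.iter():
    element.tag = TAG_MAP.get(element.tag, element.tag)

html = ET.tostring(root, encoding="unicode")
print(html)
```

Changing how a SECTION looks then means editing one entry in the mapping rather than every document that uses the tag.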

Summary

On this first day, you were introduced to XML as a markup language in abstract terms. You saw why XML is needed by the rapidly maturing Internet and its commercial applications. You were also given a very brief overview of why XML is seen as the solution to publishing text and data through the Internet, rather than SGML or HTML.

Just as medical students start their education by dissecting corpses, tomorrow you will dissect the anatomy of an XML document to determine what it is made of.

Q&A

Q Is XML a standard, and can I rely on it?
A XML is a recommendation of the World Wide Web Consortium (W3C), a group of vendors that includes Microsoft and Sun. This is about as close to a standard as anything on the Web gets. The W3C has committed itself to supporting XML in all its other initiatives. Also, in the regular standardization circles, the SGML standard is being updated so that XML can rely on the support and formality of SGML.
Q Do I need to learn SGML to understand XML?
A No. It might help to know a little about SGML if you're going to get involved in highly technical XML developments, but no knowledge of SGML is needed for most XML applications.
Q I know SGML; how difficult will it be for me to learn XML?
A If you already have some experience with SGML, it will take less than a day to convert your knowledge to XML and learn anything extra you'll need to know. However, you'll need the discipline to unlearn some of the things you were doing with SGML.
Q I know HTML; how difficult will it be for me to learn XML?
A This depends on how deep your knowledge of HTML is and what you intend to do with XML. If all you want to do with XML is create Web pages, you can probably master the basics in a day or two.
Q Will XML replace SGML?
A No. SGML will continue to be used in the large-scale applications where its features are most needed. XML will take over some of the work from SGML but will never replace it.
Q Will XML replace HTML?
A Eventually, yes. HTML has done a wonderful job so far, and there is every reason to believe it will continue to do so for a long time to come. Eventually, though, HTML will be rewritten as an XML application instead of being an SGML application--but you are unlikely to notice the difference.
Q I have a lot of HTML code; should I convert it to XML? If so, how?
A No. Existing HTML code can be expressed very easily in XML syntax. It will also be possible to include HTML code in XML documents, and vice versa. However, it is not quite so simple to convert an HTML authoring environment into an XML one. Currently there are no XML DTDs for HTML. Until there are, it's easier to create the HTML code using HTML (or SGML) tools and then convert the finished code.

Exercise

1. You've already seen what a basic XML document looks like. Mark up a document that you'd like to use on the Web (something personal, like a home page or the tracks on a CD).