Sams Teach Yourself XML in 21 Days

Contents


Day 1

What Is XML and Why Should I Care?


Welcome to Teach Yourself XML in 21 Days! This chapter starts you on the road to mastering the Extensible Markup Language (XML). Today you will learn

The Web Grows Up

Love them or hate them, the Internet and the World Wide Web (WWW) are here to stay. No matter how much you try, you can't avoid the Web playing an increasingly important role in your life.

The Internet has gone from a small experiment carried out by a bunch of nuclear research scientists to one of the most phenomenal events in computing history. It sometimes feels like we have been experiencing the modern equivalent of the Industrial Revolution: the dawning of the Information Age.

In his original proposal to CERN (the European Laboratory for Particle Physics) management in 1989, Tim Berners-Lee (the acknowledged inventor of the Web) described his vision of

...a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities.

The Web has certainly come a long way in the last ten years, and I sometimes wonder what Berners-Lee thinks of his invention in its present form.

The Web is still in its infancy, however. Use of the Web is slowly progressing beyond the stage of family Web pages, but the dawn of electronic commerce (e-commerce) via the Internet has not yet broken. By e-commerce, I do not mean being able to order things from a Web page, such as books, records, CDs, and software. This kind of commerce has been going on for several years, and some companies--most notably Amazon.com--have made a great success of it. My definition of e-commerce goes much deeper than this. Various new initiatives have appeared in recent years that are going to change the way a lot of companies look at the Web. These include


NOTE: Every time you visit a Web site that supports Java, JavaScript, or some other scripting language, you are in fact running a program over the Web. After you've finished with it, all that's left in your Web browser's cache is possibly a few scraps of code. Several software companies--including Microsoft--want to distribute software in this way. They'd gain by constantly generating new income from their software, and you would benefit by only having to pay for the software you used at the time that you used it, and only for as long as you used it.

Whereas most of these applications are impossible using Hypertext Markup Language (HTML), XML can make all these applications (and many more) real possibilities. In a sense, XML is the enabling technology that heralds the appearance of a new form of Internet society. XML is probably the most important thing to happen to the Web since the arrival of Java.

So why can XML do what HTML can't? Read on for an explanation.

Where HTML Runs Out of Steam

Before we look at all the weaknesses of HTML, let's get one thing clear: HTML has been, and still is, a fantastic success.

Designed to be a simple tagging language for displaying text in a Web browser, HTML has done a wonderful job and will probably continue to do so for many years to come. It is no exaggeration to say that if there hadn't been HTML, there simply wouldn't have been a Web. Although Gopher, WAIS, and Hytelnet, among others, predated HTML, none of them offered the same trade-off of power for simplicity that HTML does.

Although HTML might still be considered the killer Internet application, there have been a lot of complaints leveled against it. Furthermore, people are now realizing that XML is superior to HTML. Following are some of the most frequently cited complaints against HTML (but many of them aren't really legitimate, as you will see from my comments):


NOTE: The document type definition (DTD) is an SGML or XML document that describes the elements and attributes allowed inside all the documents that can be said to conform to that DTD. You will learn all about XML DTDs in later chapters.

So what's really wrong with HTML? Not a lot, for everyday Web page use. However, looking at the future of electronic commerce on the Web, HTML is reaching its limits.

So What's Wrong with...?

All right, if HTML can't handle it, what's wrong with TeX, PDF, or RTF?

TeX is a computer typesetting language that still flourishes in scientific communities. In the early 1980s, there were online databases that returned data in TeX form that could be inserted straight into a TeX document. Adobe owns the PDF (Adobe Acrobat) standard, but it is fairly well documented. RTF is the property of Microsoft and, as many Windows Help authors will tell you, it is poorly documented and extremely unreliable. The RTF code created by Word 97 is not the same as the code created by Word 95, for example, and in some areas the two versions are completely incompatible.

All of these formats suffer from the same weaknesses: they are proprietary (owned by a commercial company or organization), they are not open, and they are not standardized. By using one of these formats, you risk being left out in the cold. Although the market represents a strong stabilizing force (as seen with RTF), when you place too much reliance on a format over which you have no control and into which you have little insight, you are leaving yourself open to a lot of problems if and when that format changes.

SGML

I'm going to try to teach you as little as I can about SGML. Although it can be helpful to know a little about it, in many ways you're probably better off not knowing anything about it at all. The problem with learning too much about SGML is that when you move to XML you'd have to spend most of your time forgetting a lot of the things you'd just learned. XML is different enough from SGML that you can become an expert in XML without knowing a thing about SGML.

That said, XML is very much a descendant of SGML, and knowing at least a little about SGML will help put XML in context.

The Standard Generalized Markup Language (SGML), from which XML is derived, was born out of the basic need to make data storage independent of any one software package or software vendor. SGML is a meta language, or a language for describing markup languages. HTML is one such markup language and is therefore called an SGML application. There are dozens, maybe even hundreds, of markup languages defined using SGML. In XML, these applications are often called markup languages--such as the hand-held device markup language (HDML) and the FAQ markup language (QML).

In SGML, most of these markup languages haven't been given formal names; they are simply referred to by the name of their document type definition (DocBook), their purpose (LinuxDOC), their application (TEI), or even the standard they implement (J2008 - automobile parts, Mil-M-38784 - US Military).

By means of an SGML declaration (XML also has one), the SGML application specifies which characters are to be interpreted as data and which characters are to be interpreted as markup. (They do not have to include the familiar < and > characters; in SGML they could just as easily be { and } instead.)

Using the rules given in the SGML declaration and the results of the information analysis (which ultimately creates something that can easily be considered an information model), the SGML application developer identifies various types of documents--such as reports, brochures, technical manuals, and so on--and develops a DTD for each one. Using the chosen characters, the DTD identifies information objects (elements) and their properties (attributes).
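To make the idea of elements and attributes concrete, here is a minimal sketch of a DTD written in XML syntax (SGML DTDs look very similar). The "report" document type, its element names, and the status attribute are all invented for illustration; they are not taken from any real application.

```xml
<?xml version="1.0"?>
<!-- Hypothetical "report" document type; every name here is illustrative. -->
<!DOCTYPE report [
  <!-- A report is a title followed by one or more sections. -->
  <!ELEMENT report  (title, section+)>
  <!ELEMENT title   (#PCDATA)>
  <!ELEMENT section (title, para+)>
  <!ELEMENT para    (#PCDATA)>
  <!-- Each section has a "status" property that defaults to "draft". -->
  <!ATTLIST section status (draft | final) "draft">
]>
<report>
  <title>Quarterly Sales</title>
  <section status="final">
    <title>Summary</title>
    <para>Sales rose in every region.</para>
  </section>
</report>
```

The declarations between the square brackets are the DTD; the report element that follows is a document marked up according to it.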

The DTD is the very core of an SGML application; how well it is made largely determines the success or failure of the whole activity. Using the information elements defined in the DTD, the actual information is then marked up using the tags identified for it in the application. If the development of the DTD has been rushed, it might need continual improvement, modification, or correction. Each time the DTD is changed, the information that has been marked up with it might also need to be modified because it may be incorrect. Very quickly, the quantity of data that needs modification (now called legacy data) can become a far more serious problem--one that is more costly and time-consuming than the problem that SGML was originally introduced to solve.

You are already getting a feel for the magnitude of an SGML application. There are good reasons for this magnitude: SGML was built to last. At the back of the developers' minds were ideas about longevity and durability, as were thoughts of protecting data from changes in computer software and hardware in the future.

SGML is the industrial-strength solution: expensive and complicated, but also extremely powerful.

Why Not SGML?

The SGML on the Web initiative existed a long time before XML was even considered. Somehow, though, it never really succeeded. Basically, SGML is just too expensive and complicated for Web use on a large scale. It isn't that it can't be used--it's that it won't be used. Using SGML requires too much of an investment in time, tools, and training.

Why XML?

XML uses the features of SGML that it needs and tries to incorporate the lessons learned from HTML.


NOTE: One of the most important links between XML and SGML is XML's use of a DTD. On Day 17, "Using XML for Data," you will learn more about the developments that are underway to cut this major link to SGML and replace the DTD with something more in keeping with the data-processing requirements of XML applications.

When the designers of XML sat down to write its specifications, they had a set of design goals in mind (detailed in the recommendation document). These goals and the degree to which they have already been met are why XML is considered better than SGML:

Given the time, you can print out any XML document and work out its meaning--but it goes further than this. A valid XML document

What XML Adds to SGML and HTML

XML takes the best of SGML and combines it with some of the best features of HTML, and adds a few features drawn from some of the more successful applications of both. XML takes its major framework from SGML, leaving out everything that isn't absolutely necessary. Each facility and feature was examined, and if a good case couldn't be made for its retention, it was scrapped. XML is commonly called a subset of SGML, but in technical terms it's an application profile of SGML; whereas HTML uses SGML and is an application of SGML, XML is just SGML on a smaller scale.

From HTML, XML inherits the use of Web addresses (URLs) to point to other objects. From HyTime (a very sophisticated application of SGML, officially called ISO/IEC 10744 Hypermedia/Time-based Structuring Language) and an academic application of SGML called the Text Encoding Initiative (TEI), XML inherits some other extremely powerful addressing mechanisms that allow you to point to parts and ranges of other documents rather than simple single-point targets, for example.

XML also adds a list of features that make it far more suitable than either SGML or HTML for use on an increasingly complex and diverse Web:

Is XML Just for Programmers?

Having read this far, you might think that XML is only for programmers and that you can quite happily go back to using HTML. In many ways you'd be right, except for one important point: If programmers can do more with XML than they can with HTML, eventually this will filter down to you in the form of application software that you can use with your XML data. To take full advantage of these tools, however, you will need to make your data available in XML. As yet, support for XML in Web browsers is incomplete and unreliable (you will learn how to display XML code in Mozilla and Internet Explorer 5 later on), but full support should not be long in coming.

In the meantime, is XML just for programmers? Definitely not! One of the problems with HTML is that all the tags are optional, so you have to be somewhat familiar with all of them in order to make the best choice. Worse, your choice will be affected by the way the code looks in a particular browser. But XML is extensible, and extensibility works both ways--it also means you can use less rather than more. Instead of having to learn more than 40 HTML tags, you can mark up your text in a way that makes a lot more sense to you and then use a style sheet to handle the visible appearance. Listing 1.1 shows a typical XML document that marks up a basic sales contact entry.

Listing 1.1 A Simple XML Document

<?xml version="1.0"?>
<contacts>
  <contact>
    <name>
      <first>John</first>
      <last>Belcher</last>
    </name>
    <address>
      <street>Pennington 13322</street>
      <city>Washington</city>
      <state>DC</state>
      <zip>66522</zip>
    </address>
    <tel>555 1276</tel>
    <fax>555 9983</fax>
    <mobile>887 8887 7777</mobile>
    <email>jb@southside.com</email>
  </contact>
</contacts>
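As a rough sketch of how a style sheet might be attached to a document like this one, the W3C's xml-stylesheet processing instruction can name the style sheet at the top of the file. The file name contacts.css is an assumption made for this example, and whether a given browser honors the instruction depends on its level of XML support.

```xml
<?xml version="1.0"?>
<!-- contacts.css is a hypothetical style sheet; it is not supplied here. -->
<?xml-stylesheet type="text/css" href="contacts.css"?>
<contacts>
  <contact>
    <name>
      <first>John</first>
      <last>Belcher</last>
    </name>
    <email>jb@southside.com</email>
  </contact>
</contacts>
```

The markup itself says nothing about fonts or layout; those decisions live entirely in the style sheet.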

As Listing 1.1 suggests, you can make your markup very rich in information (semantic content). The great thing about XML is that you can adapt it to your needs. When you need less you can use less, as demonstrated by Listing 1.2. (It would hardly be in keeping with all the other computer language-oriented books in the world if we didn't include some kind of "Hello World" example.)

Listing 1.2 "Hello World" in XML

<?xml version="1.0"?>
<greeting>
  <salutation>Hello</salutation>
  <target>World!</target>
</greeting>

XML is already becoming the preferred language for interfacing between databases and the Web, and it is becoming an important method for interchanging data between computers and computer applications. Even at the level of "ordinary" Web document authoring, though, XML has a lot to offer. The wonderful thing about XML is that it can actually be even simpler than HTML! You can decide what tags you'll need, and how many, and you can choose names that mean something sensible either to you or to your readers. Instead of producing documents containing meaningless jumbles of H1, H2, P, LI, UL, and EM tags, you can say what you really mean and use CHAPTER, SECTION, PARAGRAPH, LIST.ITEM, UNNUMBERED.LIST, and IMPORTANT. This not only makes your documents more meaningful, it also makes them more accessible to other people. Tools (such as search engines) will be able to make more intelligent enquiries about the content and structure of your documents and draw inferences about them that could far exceed what you originally intended.
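A short sketch of what such a document might look like, using the element names suggested above. The nesting and the sample text are assumptions made for this example; nothing in XML itself dictates them.

```xml
<?xml version="1.0"?>
<!-- Element names come from the paragraph above; the structure is illustrative. -->
<CHAPTER>
  <SECTION>
    <PARAGRAPH>XML lets you choose your own element names.</PARAGRAPH>
    <UNNUMBERED.LIST>
      <LIST.ITEM>Names describe <IMPORTANT>meaning</IMPORTANT>, not appearance.</LIST.ITEM>
      <LIST.ITEM>A style sheet decides how each element is displayed.</LIST.ITEM>
    </UNNUMBERED.LIST>
  </SECTION>
</CHAPTER>
```

A tool looking for important phrases only has to collect the IMPORTANT elements; it does not need to guess whether an EM tag was meant as emphasis or just as italic type.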

Summary

On this first day, you were introduced to XML as a markup language in abstract terms. You saw why XML is needed by the rapidly maturing Internet and its commercial applications. You were also given a very brief overview of why XML is seen as the solution to publishing text and data through the Internet, rather than SGML or HTML.

Just as medical students start their education by dissecting corpses, tomorrow you will dissect the anatomy of an XML document to determine what it is made of.

Q&A

Q Is XML a standard, and can I rely on it?

A XML is a recommendation of the World Wide Web Consortium (W3C), a group of vendors that includes Microsoft and Sun. This is about as close to a standard as anything on the Web. The W3C has committed itself to supporting XML in all its other initiatives. Also, in the regular standardization circles, the SGML standard is being updated so that XML can rely on the support and formality of SGML.

Q Do I need to learn SGML to understand XML?

A No. It might help to know a little about SGML if you're going to get involved in highly technical XML developments, but no knowledge of SGML is needed for most XML applications.

Q I know SGML; how difficult will it be for me to learn XML?

A If you already have some experience with SGML, it will take less than a day to convert your knowledge to XML and learn anything extra you'll need to know. However, you'll need the discipline to unlearn some of the things you were doing with SGML.

Q I know HTML; how difficult will it be for me to learn XML?

A This depends on how deep your knowledge of HTML is and what you intend to do with XML. If all you want to do with XML is create Web pages, you can probably master the basics in a day or two.

Q Will XML replace SGML?

A No. SGML will continue to be used in the large-scale applications where its features are most needed. XML will take over some of the work from SGML but will never replace it.

Q Will XML replace HTML?

A Eventually, yes. HTML has done a wonderful job so far, and there is every reason to believe it will continue to do so for a long time to come. Eventually, though, HTML will be rewritten as an XML application instead of being an SGML application--but you are unlikely to notice the difference.

Q I have a lot of HTML code; should I convert it to XML? If so, how?

A No. Existing HTML code can be expressed very easily in XML syntax. It will also be possible to include HTML code in XML documents, and vice versa. However, it is not quite so simple to convert an HTML authoring environment into an XML one. Currently there are no XML DTDs for HTML. Until there are, it's easier to create the HTML code using HTML (or SGML) tools and then convert the finished code.
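As a minimal sketch of what expressing HTML in XML syntax can involve, the fragment below shows a few lines of typical HTML (inside the comment) and a well-formed XML rendering of the same content. The file names are made up for the example, and the lowercase element names are only one possible convention, since no XML DTD for HTML is assumed here.

```xml
<?xml version="1.0"?>
<!-- HTML source (not well-formed XML):
     <P>Visit our <A HREF=catalog.html>catalog</A><BR>
     <IMG SRC=logo.gif ALT="Logo">
-->
<html>
  <body>
    <!-- Every element is closed, attribute values are quoted, and
         empty elements use the XML empty-element form. -->
    <p>Visit our <a href="catalog.html">catalog</a><br/>
    <img src="logo.gif" alt="Logo"/></p>
  </body>
</html>
```

The changes are mechanical: close every element, quote every attribute value, and write empty elements such as br and img with a trailing slash.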

Exercise

1. You've already seen what a basic XML document looks like. Mark up a document that you'd like to use on the Web (something personal, like a home page or the tracks on a CD).



© Copyright 1999, Macmillan Computer Publishing. All rights reserved.