Welcome to Teach Yourself XML in 21 Days! This chapter starts you on the road to mastering the Extensible Markup Language (XML). Today you will learn
Love them or hate them, the Internet and the World Wide Web (WWW) are here to stay. No matter how much you try, you can't avoid the Web playing an increasingly important role in your life.
The Internet has gone from a small experiment carried out by a bunch of nuclear research scientists to one of the most phenomenal events in computing history. It sometimes feels like we have been experiencing the modern equivalent of the Industrial Revolution: the dawning of the Information Age.
In his original proposal to CERN (the European Laboratory for Particle Research) management in 1989, Tim Berners-Lee (the acknowledged inventor of the Web) described his vision of
...a universal linked information system, in which generality and portability are more important than fancy graphics and complex extra facilities.
The Web has certainly come a long way in the last ten years, and I sometimes wonder what Berners-Lee thinks of his invention in its present form.
The Web is still in its infancy, however. Use of the Web is slowly progressing beyond the stage of family Web pages, but the dawn of electronic commerce (e-commerce) via the Internet has not yet broken. By e-commerce, I do not mean being able to order things from a Web page, such as books, records, CDs, and software. This kind of commerce has been going on for several years, and some companies--most notably Amazon.com--have made a great success of it. My definition of e-commerce goes much deeper than this. Various new initiatives have appeared in recent years that are going to change the way a lot of companies look at the Web. These include
NOTE: Every time you visit a Web site that supports Java, JavaScript, or some other scripting language, you are in fact running a program over the Web. After you've finished with it, all that's left in your Web browser's cache is possibly a few scraps of code. Several software companies--including Microsoft--want to distribute software in this way. They'd gain by constantly generating new income from their software, and you would benefit by only having to pay for the software you used at the time that you used it, and only for as long as you used it.
Whereas most of these applications are impossible using Hypertext Markup Language (HTML), XML can make all these applications (and many more) real possibilities. In a sense, XML is the enabling technology that heralds the appearance of a new form of Internet society. XML is probably the most important thing to happen to the Web since the arrival of Java.
So why can XML do what HTML can't? Read on for an explanation.
Before we look at all the weaknesses of HTML, let's get one thing clear: HTML has been, and still is, a fantastic success.
Designed to be a simple tagging language for displaying text in a Web browser, HTML has done a wonderful job and will probably continue to do so for many years to come. It is no exaggeration to say that if there hadn't been HTML, there simply wouldn't have been a Web. Although Gopher, WAIS, and Hytelnet, among others, predated HTML, none of them offered the same trade-off of power for simplicity that HTML does.
Although HTML might still be considered the killer Internet application, there have been a lot of complaints leveled against it. Furthermore, people are now realizing that XML is superior to HTML. Following are some of the most frequently cited complaints against HTML (but many of them aren't really legitimate, as you will see from my comments):
NOTE: The document type definition (DTD) is an SGML or XML document that describes the elements and attributes allowed inside all the documents that can be said to conform to that DTD. You will learn all about XML DTDs in later chapters.
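As a rough preview of what the note above describes, here is a minimal sketch of a DTD placed inside the document it governs. The element names (greeting, salutation, target) and the lang attribute are assumptions chosen purely for illustration; real DTDs are usually much larger and are often kept in a separate file.

```xml
<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!-- Illustrative names only: a greeting holds one salutation, then one target -->
  <!ELEMENT greeting (salutation, target)>
  <!-- Both child elements contain plain character data -->
  <!ELEMENT salutation (#PCDATA)>
  <!ELEMENT target (#PCDATA)>
  <!-- An optional attribute allowed on the greeting element -->
  <!ATTLIST greeting lang CDATA #IMPLIED>
]>
<greeting lang="en">
  <salutation>Hello</salutation>
  <target>World!</target>
</greeting>
```

The declarations between the square brackets list the elements and attributes that are allowed; the markup after them is a document that conforms to those declarations.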
So what's really wrong with HTML? Not a lot, for everyday Web page use. However, looking at the future of electronic commerce on the Web, HTML is reaching its limits.
All right, if HTML can't handle it, what's wrong with TeX, PDF, or RTF?
TeX is a computer typesetting language that still flourishes in scientific communities. In the early 1980s, there were online databases that returned data in TeX form that could be inserted straight into a TeX document. Adobe owns the PDF (Adobe Acrobat) standard, but it is fairly well documented. RTF is the property of Microsoft and, as many Windows Help authors will tell you, it is poorly documented and extremely unreliable. The RTF code created by Word 97 is not the same as the code created by Word 95, for example, and in some areas the two versions are completely incompatible.
All of these formats suffer from the same weaknesses: they are proprietary (owned by a commercial company or organization), they are not open, and they are not standardized. By using one of these formats, you risk being left out in the cold. Although the market represents a strong stabilizing force (as seen with RTF), when you place too much reliance on a format over which you have no control and into which you have little insight, you are leaving yourself open to a lot of problems if and when that format changes.
I'm going to avoid teaching you about SGML as much as I can. Although it can be helpful to know a little about it, in many ways you're probably better off not knowing anything about it at all. The problem with learning too much about SGML is that when you move to XML you'd have to spend most of your time forgetting a lot of the things you'd just learned. XML is different enough from SGML that you can become an expert in XML without knowing a thing about SGML.
That said, XML is very much a descendant of SGML, and knowing at least a little about SGML will help put XML in context.
The Standard Generalized Markup Language (SGML), from which XML is derived, was born out of the basic need to make data storage independent of any one software package or software vendor. SGML is a meta language, or a language for describing markup languages. HTML is one such markup language and is therefore called an SGML application. There are dozens, maybe even hundreds, of markup languages defined using SGML. In XML, these applications are often called markup languages--such as the hand-held device markup language (HDML) and the FAQ markup language (QML).
In SGML, most of these markup languages haven't been given formal names; they are simply referred to by the name of their document type definition (DocBook), their purpose (LinuxDOC), their application (TEI), or even the standard they implement (J2008 - automobile parts, Mil-M-38784 - US Military).
By means of an SGML declaration (XML also has one), the SGML application specifies which characters are to be interpreted as data and which characters are to be interpreted as markup. (They do not have to include the familiar < and > characters; in SGML they could just as easily be { and } instead.)
Using the rules given in the SGML declaration and the results of the information analysis (which ultimately creates something that can easily be considered an information model), the SGML application developer identifies various types of documents--such as reports, brochures, technical manuals, and so on--and develops a DTD for each one. Using the chosen characters, the DTD identifies information objects (elements) and their properties (attributes).
The DTD is the very core of an SGML application; how well it is made largely determines the success or failure of the whole activity. Using the information elements defined in the DTD, the actual information is then marked up using the tags identified for it in the application. If the development of the DTD has been rushed, it might need continual improvement, modification, or correction. Each time the DTD is changed, the information that has been marked up with it might also need to be modified because it may be incorrect. Very quickly, the quantity of data that needs modification (now called legacy data) can become a far more serious problem--one that is more costly and time-consuming than the problem that SGML was originally introduced to solve.
You are already getting a feel for the magnitude of an SGML application. There are good reasons for this magnitude: SGML was built to last. At the back of the developers' minds were ideas about longevity and durability, as were thoughts of protecting data from changes in computer software and hardware in the future.
SGML is the industrial-strength solution: expensive and complicated, but also extremely powerful.
The SGML on the Web initiative existed a long time before XML was even considered. Somehow, though, it never really succeeded. Basically, SGML is just too expensive and complicated for Web use on a large scale. It isn't that it can't be used--it's that it won't be used. Using SGML requires too much of an investment in time, tools, and training.
XML uses the features of SGML that it needs and tries to incorporate the lessons learned from HTML.
NOTE: One of the most important links between XML and SGML is XML's use of a DTD. On Day 17, "Using XML for Data," you will learn more about the developments that are underway to cut this major link to SGML and replace the DTD with something more in keeping with the data-processing requirements of XML applications.
When the designers of XML sat down to write its specifications, they had a set of design goals in mind (detailed in the recommendation document). These goals and the degree to which they have already been met are why XML is considered better than SGML:
XML takes the best of SGML and combines it with some of the best features of HTML, and adds a few features drawn from some of the more successful applications of both. XML takes its major framework from SGML, leaving out everything that isn't absolutely necessary. Each facility and feature was examined, and if a good case couldn't be made for its retention, it was scrapped. XML is commonly called a subset of SGML, but in technical terms it's an application profile of SGML; whereas HTML uses SGML and is an application of SGML, XML is just SGML on a smaller scale.
From HTML, XML inherits the use of Web addresses (URLs) to point to other objects. From HyTime (a very sophisticated application of SGML, officially called ISO/IEC 10744 Hypermedia/Time-based Structuring Language) and an academic application of SGML called the Text Encoding Initiative (TEI), XML inherits some other extremely powerful addressing mechanisms that allow you to point to parts and ranges of other documents rather than simple single-point targets, for example.
XML also adds a list of features that make it far more suitable than either SGML or HTML for use on an increasingly complex and diverse Web:
Having read this far, you might think that XML is only for programmers and that you can quite happily go back to using HTML. In many ways you'd be right, except for one important point: If programmers can do more with XML than they can with HTML, eventually this will filter down to you in the form of application software that you can use with your XML data. To take full advantage of these tools, however, you will need to make your data available in XML. As yet, support for XML in Web browsers is incomplete and unreliable (you will learn how to display XML code in Mozilla and Internet Explorer 5 later on), but full support will not take long.
In the meantime, is XML just for programmers? Definitely not! One of the problems with HTML is that all the tags are optional, so you have to be somewhat familiar with all of them in order to make the best choice. Worse, your choice will be affected by the way the code looks in a particular browser. But XML is extensible, and extensibility works both ways--it also means you can use less rather than more. Instead of having to learn more than 40 HTML tags, you can mark up your text in a way that makes a lot more sense to you and then use a style sheet to handle the visible appearance. Listing 1.1 shows a typical XML document that marks up a basic sales contact entry.
<?xml version="1.0"?>
<contacts>
  <contact>
    <name>
      <first>John</first>
      <last>Belcher</last>
    </name>
    <address>
      <street>Pennington 13322</street>
      <city>Washington</city>
      <state>DC</state>
      <zip>66522</zip>
    </address>
    <tel>555 1276</tel>
    <fax>555 9983</fax>
    <mobile>887 8887 7777</mobile>
    <email>jb@southside.com</email>
  </contact>
</contacts>
As Listing 1.1 suggests, you can make your markup very rich in information (semantic content). The great thing about XML is that you can adapt it to your needs. When you need less you can use less, as demonstrated by Listing 1.2. (It would hardly be in keeping with all the other computer language-oriented books in the world if we didn't include some kind of "Hello World" example.)
<?xml version="1.0"?>
<greeting>
  <salutation>Hello</salutation>
  <target>World!</target>
</greeting>
XML is already becoming the preferred language for interfacing between databases and the Web, and it is becoming an important method for interchanging data between computers and computer applications. However, at the level of "ordinary" Web document authoring, XML still has a lot to offer. The wonderful thing about XML is that it can actually be even simpler than HTML! You can decide what tags you'll need, and how many, and you can choose names that mean something sensible either to you or to your readers. Instead of producing documents containing meaningless jumbles of H1, H2, P, LI, UL, and EM tags, you can say what you really mean and use CHAPTER, SECTION, PARAGRAPH, LIST.ITEM, UNNUMBERED.LIST, and IMPORTANT. This doesn't only make your documents more meaningful; it also makes them more accessible to other people. Tools (such as search engines) will be able to make more intelligent enquiries about the content and structure of your documents and make meaningful inferences about your documents that could far exceed what you originally intended.
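To make the comparison concrete, here is a minimal sketch of a document marked up with the descriptive names mentioned above. The element names are simply the ones suggested in the preceding paragraph, not part of any standard vocabulary, and the text content is invented for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical vocabulary: each name states what the content is, not how it looks -->
<CHAPTER>
  <SECTION>
    <PARAGRAPH>You choose names that describe <IMPORTANT>meaning</IMPORTANT>.</PARAGRAPH>
    <UNNUMBERED.LIST>
      <LIST.ITEM>The markup records the structure of the text.</LIST.ITEM>
      <LIST.ITEM>A style sheet decides the visible appearance.</LIST.ITEM>
    </UNNUMBERED.LIST>
  </SECTION>
</CHAPTER>
```

Because the names carry meaning, a tool reading this document could, for example, pick out every IMPORTANT phrase or count the items in each list without guessing from the formatting.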
On this first day, you were introduced to XML as a markup language in abstract terms. You saw why XML is needed by the rapidly maturing Internet and its commercial applications. You were also given a very brief overview of why XML is seen as the solution to publishing text and data through the Internet, rather than SGML or HTML.
Just as medical students start their education by dissecting corpses, tomorrow you will dissect the anatomy of an XML document to determine what it is made of.
© Copyright 1999, Macmillan Computer Publishing. All rights reserved.