The Story of XML
As far back as the
sixties, IBM scientists were working on a Generalized
Markup Language (GML) for describing documents and their
formatting. In 1986 the International Organization for
Standardization (ISO) adopted a version of this standard
called Standard Generalized Markup Language (SGML). SGML
offers a highly sophisticated system for marking up
documents so that their appearance is independent of
specific software applications. It is big, powerful,
filled with options, and well suited for large
organizations that need exacting document standards.
But early in the game,
it became apparent that SGML's sophistication made the
language quite unsuitable for quick and easy Web
publishing. For that, we needed a simplified markup
system, one at which practically anyone could quickly
gain proficiency. Enter HyperText Markup Language
(HTML), which is little more than one specific SGML
document type, or Document Type Definition (DTD).
Because it is so easy to learn and to implement, and
because early Web browsers supported it, HTML quickly
became the basis of the burgeoning Web. In fact, if SGML
had been the Web's essential markup language, the Web
probably wouldn't have attained the enormous popularity
it now enjoys.
The problem with HTML,
however, was that it quickly proved to be too
simple. It was superb for the early days of the
Web--with text-based documents that featured headings,
bulleted lists, and hyperlinks to other documents--but
as soon as Web authors started demanding multimedia and
page design capabilities, the language started
experiencing severe growing pains. Straightforward
in-line graphics were fine, but you couldn't do much to
place them, so page design suffered. And image maps
(images with embedded hyperlinks) created new problems
and needed new solutions. Then came blinking text,
tables, frames, and dynamic HTML. Every time we turn
around, it seems, someone's trying to add something new
to HTML, and every time that happens we end up with new
incompatibilities and the need for new standards.
Why is this happening?
Because, quite simply, HTML isn't extensible.
Over the years Microsoft has added tags that work only
in Internet Explorer, and Netscape has added tags that
work only in Navigator, but you, as a Web author, have
no way to add your own.
Undoubtedly, you've
experienced the frustration of HTML's limitations as a
page layout system, and just as undoubtedly you've
eagerly embraced new tags and elements as they've been
introduced. But to design serious sites, you keep
needing more. Hence Java and JavaScript, hence Active
Server Pages, hence all the continuing
developments that are making HTML more powerful. Recent
HTML developments such as Cascading Style Sheets (CSS)
and dynamic HTML offer some of the necessary strength
for customizing your Web designs more completely, but
these additions simply highlight the growing problem.
Full customization of Web page design remains at the
whim of the people who make the browsers.
The Web's many
developers have recognized for quite some time the irony
of all this: Whereas HTML offers no extensibility, its
parent system, SGML, is fully extensible. To create a
fully customized set of documents in SGML, authors
develop a DTD that will control all documents in that
set. This is time-consuming and can be extremely
complex, but it works. The question, then, is how to
capture SGML's extensibility, which serious HTML authors
require, without retaining the complexity, which almost
nobody wants. In other words, the issue is how to bridge
the gap between SGML and HTML.
Enter XML
The answer is
Extensible Markup Language, better known as XML.
Proposed in late 1996 to the World Wide Web Consortium
(W3C), XML currently exists as a pair of draft documents
at www.w3.org/pub/WWW/MarkUp/SGML/Activity. Its
intention is to offer some of SGML's power while
avoiding the language's complexity, enabling Web authors
to produce fully customized documents with a high degree
of design consistency. It can offer these things because
XML is SGML. Whereas HTML is merely one SGML
document type, XML is a simplified version of the parent
language itself.
XML is more than a
markup language. Like SGML, it's a metalanguage,
or a language that allows you to describe languages.
HTML and other markup languages let you define how the
information in a document will appear in an application
that displays it, but SGML and XML let you define the
markup language itself. In this sense, XML can actually
control HTML documents. Think of HTML as a description
system and XML (like SGML) as a system for defining
description systems and you get the idea. One benefit of
SGML is that you can use it to define and control an
unlimited variety of description systems, HTML being
only one of these, and XML offers this advantage as
well.
Like HTML and SGML, XML
will require viewing software that will interpret it
according to the author's instructions. In all
likelihood, future versions of Microsoft Internet
Explorer and Netscape Navigator will include XML
interpretation, but in the meantime, you might want to
check out JUMBO, an experimental browser originally
designed to display chemical industry documents. JUMBO
is available at www.venus.co.uk/omf/cml/, where
it displays an SGML/XML implementation called Chemical
Markup Language, or CML (see Figure 1).
One of XML's greatest
strengths is that it lets entire industries, academic
disciplines, and professional organizations develop sets
of DTDs that will standardize the presentation of
information within those disciplines. To an extent this
works against the much-ballyhooed universality of the
Web and HTML, but if you work in a specialized area,
you're probably aware of the need for systems that let
you produce documents enabling you to communicate
efficiently with your colleagues. Specialists often need
to display formulas, hierarchies, mathematical and
scientific notations, and other elements, all within
well-defined parameters. SGML's DTD system lets you do
so, and XML picks up on the DTD system without all the
complexity.
One example of XML's
advantage over HTML lies in its linking possibilities.
HTML's linking, even though it is the basis of the
entire Web, is extremely limited. You can link to
internal or external documents, but the links are
unidirectional and always connected to a hard-coded
address. That's why you get so many "Document not
found" errors.
HTML's redirection
capabilities--which automatically forward the browser to
another location--take care of some of these issues, but
the linking portion of the proposed XML standard (www.w3.org/pub/WWW/TR/WDxml-link-970406.html)
takes linking much further. With XML, Web authors can
establish multidirectional links, which not only link to
a destination location but also provide information
about links to the current location from other
locations. As an example, an author can provide a link
that will take users to a particular resource; a
cross-reference link will then show all the links that
lead to that resource, and the user can follow these
links to their sources. XML authors can also specify
what happens when a link is found, such as whether or
not the link will be followed automatically, and even
whether or not the linked document will be displayed
within the original document. As XML linking options
find their way into general use, the Web will become a
much more capable hypermedia system.
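The cross-reference idea can be sketched in a few lines of Python. This is not XML linking syntax, only an illustration of the bookkeeping a multidirectional link implies: from a table of one-way links you can derive, for any resource, the list of locations that point to it. The file names and the build_backlinks helper are hypothetical, invented for this example.

```python
from collections import defaultdict

# Hypothetical link table: (source document, destination resource).
links = [
    ("intro.xml", "glossary.xml"),
    ("faq.xml", "glossary.xml"),
    ("intro.xml", "faq.xml"),
]


def build_backlinks(pairs):
    """Map each destination to the list of sources that link to it."""
    incoming = defaultdict(list)
    for source, target in pairs:
        incoming[target].append(source)
    return dict(incoming)


backlinks = build_backlinks(links)
# Every document that leads to the glossary, in the order the links were listed.
print(backlinks["glossary.xml"])
```

A browser holding such an index could show the user every route into the current document, which is something a plain one-way HTML link cannot do.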
Valid and Well-formed
The DTD system is only
one method of creating XML documents. DTDs offer the
greatest possible flexibility and extensibility, but one
of the XML team's design goals has been to eliminate the
need for building them. As a result, there are two types
of XML document: those with DTDs and those without.
Those with DTDs that conform to the SGML standard are
called valid files. Documents that exclude DTDs
must be well-formed; that is, they must conform
to a specific set of standards. Valid files must be
well-formed too.
A valid XML document,
like a valid SGML document, opens with a Document Type
Declaration, written as <!doctype ...>. In
addition, the document might have an XML Declaration
before the Document Type Declaration to specify the version of XML in use, but
this isn't strictly needed. If present it takes the form
<?XML Version="1.0"?>
with 1.0 replaced, of
course, by whatever version is in effect. The XML
version must be available locally or over the Net, and
the XML Declaration will state its location.
The Document Type
Definition's purpose is to specify the structure for the
content of all documents of a certain type; thus the
Document Type Declaration represents the core of SGML.
It might seem strange or even impossible, therefore,
that XML could let you dispense with the <!doctype>
declaration completely. It does so by demanding that files
be well-formed, which lets the viewer interpret them as
SGML. Instead of having a DTD, the XML document must
follow a series of rules, none of which are difficult
for authors to master.
First, a document must
begin with a Required Markup Declaration (RMD) stating
that the document lacks a DTD (the code is
"NONE"). This RMD occurs in the same line as
the XML Declaration, in the form
<?XML VERSION="1.0" RMD="NONE"?>
Second, all values for
attributes must be enclosed in quotation marks. Third,
all elements must have opening and closing tags, unlike
some elements in HTML. Other requirements dictate the
type of attributes available, as well as some
restrictions on the data itself. As long as you adhere
to these rules you may omit the DTD, and that simple
fact goes a long way toward making XML more accessible
than SGML.
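You can try two of these rules with the XML parser in Python's standard library. This is a minimal sketch, assuming a parser that follows the XML 1.0 rules as later standardized rather than the draft described here, so it does not look for an RMD. The memo vocabulary and the is_well_formed helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Quoted attribute value, and every element closed.
good = '<memo priority="high"><to>Staff</to></memo>'
# Attribute value without quotation marks.
unquoted = '<memo priority=high><to>Staff</to></memo>'
# The <to> element is never closed.
unclosed = '<memo priority="high"><to>Staff</memo>'

for sample in (good, unquoted, unclosed):
    print(is_well_formed(sample), sample)
```

Only the first sample passes. The other two are habits that HTML browsers tolerate and XML does not.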
So what about HTML? Are
your current HTML documents invalid or poorly formed or
both? Not necessarily. Remember that HTML is simply one
SGML DTD; as long as a document conforms to the HTML 3.2
standard, it will be entirely, or at least mostly, well-formed.
All you have to do is ensure that it adheres to the XML
rules for well-formed documents and you're set. You can
also run your HTML files through an SGML-aware authoring
tool such as SoftQuad's HoTMetaL Pro (www.sq.com/)
or turn to parsing software such as Lark (www.textuality.com/Lark/).
Types of XML Applications
Unless XML offers the
ability to produce new kinds of applications, it won't
be of great value to the Web authoring community. Much
of the early development work is still in progress, but
XML appears to be extremely well suited to several
advanced application types.
First, because of its
data structures, XML provides a good way to develop
applications that let the user view data from various
perspectives. Such applications can make documents more
useful by sorting data according to various criteria (by
name, by number, and so on) or by providing a way to
toggle different information on and off. For instance, a
listing that contains program information for all
flavors of Windows could display only the user's version
at the click of a mouse.
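Here is a rough sketch of that idea in Python, assuming a small made-up listing. The element and attribute names (programs, program, name, platform, size) and the view function are hypothetical and do not come from any published DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical program listing covering more than one flavor of Windows.
CATALOG = """
<programs>
  <program name="Notepad" platform="Windows 95" size="34"/>
  <program name="Explorer" platform="Windows NT" size="210"/>
  <program name="Calculator" platform="Windows 95" size="58"/>
</programs>
"""


def view(xml_text, platform=None, sort_key="name"):
    """Return program records, optionally filtered by platform, sorted by one attribute."""
    root = ET.fromstring(xml_text)
    rows = [dict(p.attrib) for p in root.findall("program")]
    if platform is not None:
        rows = [r for r in rows if r["platform"] == platform]
    # Sizes are stored as text, so compare them as integers.
    if sort_key == "size":
        return sorted(rows, key=lambda r: int(r["size"]))
    return sorted(rows, key=lambda r: r[sort_key])


# Show only the Windows 95 entries, sorted by name.
for row in view(CATALOG, platform="Windows 95"):
    print(row["name"], row["size"])
```

Because the data carries its own structure, the same document can be filtered or re-sorted on the user's machine without another trip to the server.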
XML can also be applied
to an intranet (a site restricted to users inside an
organization) or an extranet (a site restricted to
select users outside an organization). If an
organization needs to present extensive amounts of data
in particular formats, complete with strong database
linkages, XML offers a solution. From an extranet
standpoint, organizations can make their information
available to clients through XML browsers, and entire
industries can band together to produce an XML standard
for information presentation.
XML is also much better
than HTML at drawing data from heterogeneous database
types and displaying that data in a consistent format.
Of course, SGML already makes all of this possible, but
XML is easier to use and faster to implement.
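As a small illustration, the sketch below takes records from two differently shaped sources and writes them out in one consistent XML format. It assumes a comma-separated export and a list of records with different field names; the data, the contacts vocabulary, and the to_xml function are all invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two hypothetical sources that describe the same kind of record differently.
CSV_SOURCE = "name,phone\nAda,555-0100\n"
DICT_SOURCE = [{"full_name": "Grace", "tel": "555-0199"}]


def to_xml(csv_text, dict_rows):
    """Merge both sources into a single <contacts> document."""
    root = ET.Element("contacts")
    for row in csv.DictReader(io.StringIO(csv_text)):
        ET.SubElement(root, "contact", name=row["name"], phone=row["phone"])
    for row in dict_rows:
        ET.SubElement(root, "contact", name=row["full_name"], phone=row["tel"])
    return ET.tostring(root, encoding="unicode")


merged = to_xml(CSV_SOURCE, DICT_SOURCE)
print(merged)
```

Whatever the sources look like, the application that displays the result only has to understand one element type.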
The first high-profile
application of XML will be Microsoft's Channel
Definition Format (CDF), included in Internet Explorer
4.0. Microsoft has based CDF on XML standards, and you
can see the DTD at www.microsoft.com/standards/cdf.htm.
This DTD shows the value of XML quite clearly: Microsoft
has defined the XML elements specifically for push
technology, with element names such as channel, item,
schedule, and tracking. The push providers need only fit
their data to the appropriate element types and their
applications will be consistent with IE 4.0's display
capabilities. This is the kind of standardization that
just can't be achieved with HTML.
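To get a feel for what fitting data to element types means, consider the Python sketch below. It borrows only the element names channel, item, and schedule mentioned above; the attributes, the nesting, and the read_channel function are assumptions made for illustration and should not be read as the actual CDF format.

```python
import xml.etree.ElementTree as ET

# Hypothetical channel description; attribute names are invented, not taken from CDF.
FEED = """
<channel title="Daily News">
  <schedule interval="60"/>
  <item href="http://example.com/top.html" title="Top story"/>
  <item href="http://example.com/sports.html" title="Sports"/>
</channel>
"""


def read_channel(xml_text):
    """Pull the title, update interval, and item list out of a channel document."""
    root = ET.fromstring(xml_text)
    schedule = root.find("schedule")
    return {
        "title": root.get("title"),
        "interval": int(schedule.get("interval")) if schedule is not None else None,
        "items": [(i.get("title"), i.get("href")) for i in root.findall("item")],
    }


info = read_channel(FEED)
print(info["title"], info["interval"], len(info["items"]))
```

Any provider whose data uses the agreed element names can be read by the same few lines, which is the point of an industry-wide DTD.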
