What Is XML?
XML is a text-based markup language that is fast becoming the standard for
data interchange on the Web. As with HTML, you identify data using tags (identifiers enclosed in angle brackets,
like this: <...>). Collectively, the tags are known as "markup".
But unlike HTML, XML tags identify the data, rather than specifying
how to display it. Where an HTML tag says something like "display this data in
bold font" (<b>...</b>), an XML tag acts like a field
name in your program. It puts a label on a piece of data that identifies it (for
example: <message>...</message>).
Note: Since identifying the data gives you some sense of what it
means (how to interpret it, what you should do with it), XML is sometimes
described as a mechanism for specifying the semantics (meaning) of the
data.
In the same way that you define the field names for a data structure, you are
free to use any XML tags that make sense for a given application. Naturally,
though, for multiple applications to use the same XML data, they have to agree
on the tag names they intend to use.
Here is an example of some XML data you might use for a messaging
application:
<message>
  <to>you@yourAddress.com</to>
  <from>me@myAddress.com</from>
  <subject>XML Is Really Cool</subject>
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>
The tags in this example identify the message as a whole, the destination and
sender addresses, the subject, and the text of the message. As in HTML, the
<to> tag has a matching end tag: </to>.
The data between the tag and its matching end tag defines an element of the
XML data. Note, too, that the content of the <to> tag is entirely contained
within the scope of the <message>..</message> tag. It is this
ability for one tag to contain others that gives XML its ability to represent
hierarchical data structures.
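To make the nesting concrete, here is a minimal sketch that walks the message and lists each element under its parent. It assumes Python's standard xml.etree.ElementTree module as the parser (this tutorial does not prescribe one), and the outline helper is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET

# The messaging example from the text.
MESSAGE_XML = """<message>
  <to>you@yourAddress.com</to>
  <from>me@myAddress.com</from>
  <subject>XML Is Really Cool</subject>
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>"""


def outline(element, depth=0):
    """Return one line per element, indented by its nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(MESSAGE_XML)
print("\n".join(outline(root)))
```

The indentation in the printed outline mirrors the containment described above: every other element sits inside <message>.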
Once again, as with HTML, whitespace is essentially irrelevant, so you can
format the data for readability and yet still process it easily with a program.
Unlike HTML, however, in XML you could easily search a data set for messages
containing "cool" in the subject, because the XML tags identify the content of
the data, rather than specifying its representation.
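A search like that might look as follows. This is only a sketch: the <messages> wrapper element, the second message, and the subjects_containing helper are assumptions made up for the example, and Python's standard xml.etree.ElementTree module stands in for whatever parser you actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set: the <messages> wrapper and the second message
# are illustrative and do not come from the tutorial.
MESSAGES_XML = """<messages>
  <message>
    <to>you@yourAddress.com</to>
    <from>me@myAddress.com</from>
    <subject>XML Is Really Cool</subject>
    <text>How many ways is XML cool? Let me count the ways...</text>
  </message>
  <message>
    <to>you@yourAddress.com</to>
    <from>me@myAddress.com</from>
    <subject>Lunch on Friday</subject>
    <text>The new place around the corner is pretty cool.</text>
  </message>
</messages>"""


def subjects_containing(xml_text, word):
    """Return the subjects of all messages whose <subject> contains word."""
    root = ET.fromstring(xml_text)
    matches = []
    for message in root.iter("message"):
        subject = message.findtext("subject", default="")
        if word.lower() in subject.lower():
            matches.append(subject)
    return matches


print(subjects_containing(MESSAGES_XML, "cool"))
```

Because the tags label the subject separately from the body, the second message is skipped even though "cool" appears in its text.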
Tags and Attributes
Tags can also contain attributes --
additional information included as part of the tag itself, within the tag's
angle brackets. The following example shows an email message structure that uses
attributes for the "to", "from", and "subject" fields:
<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>
As in HTML, the attribute name is followed by an equal sign and the attribute
value, and multiple attributes are separated by spaces. Unlike HTML, however, in
XML commas between attributes are not ignored -- if present, they generate an
error.
Since you could design a data structure like <message>
equally well using either attributes or tags, it can take a considerable amount
of thought to figure out which design is best for your purposes.
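As a rough illustration of how the attribute-based design looks to a program, the sketch below reads the attributes of <message> and then checks that a comma between attributes is rejected. It assumes Python's standard xml.etree.ElementTree module; the is_well_formed helper and the WITH_COMMAS sample are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# The attribute-style message from the text.
ATTRIBUTE_STYLE = """<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>"""

# Illustrative bad input: a comma separates the two attributes.
WITH_COMMAS = '<message to="you@yourAddress.com", from="me@myAddress.com"/>'


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(ATTRIBUTE_STYLE)
print(root.get("to"))
print(root.get("subject"))
print(is_well_formed(WITH_COMMAS))
```

With the tag-based design, the same values would instead be read from child elements, which is part of the trade-off to weigh.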
Empty Tags
One really big difference between XML and HTML is that an XML document is
always constrained to be well-formed. There are
several rules that determine when a document is well-formed, but one of the most
important is that every tag has a closing tag. So, in XML, the
</to> tag is not optional. The <to>
element is never terminated by any tag other than </to>.
Note: Another important aspect of a well-formed
document is that all tags are completely nested. So you can have
<message>..<to>..</to>..</message>, but
never <message>..<to>..</message>..</to>. A
complete list of requirements is contained in the list of XML Frequently Asked
Questions (FAQ) at http://www.ucc.ie/xml/#FAQ-VALIDWF.
(This FAQ is on the W3C "Recommended Reading" list at http://www.w3.org/XML/.)
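The nesting and closing-tag rules are easy to try out. The following sketch assumes Python's standard xml.etree.ElementTree module as the parser; check is a hypothetical helper that reports whether the parser accepted the text. The exact wording of the error message depends on the parser, so the example does not rely on it.

```python
import xml.etree.ElementTree as ET


def check(xml_text):
    """Return None if the text is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)


NESTED = "<message><to>you@yourAddress.com</to></message>"
OVERLAPPING = "<message><to>you@yourAddress.com</message></to>"
UNCLOSED = "<message><to>you@yourAddress.com</message>"

for label, sample in [("nested", NESTED),
                      ("overlapping", OVERLAPPING),
                      ("unclosed", UNCLOSED)]:
    problem = check(sample)
    print(label, "->", "well-formed" if problem is None else problem)
```

Only the first sample is accepted. The other two are rejected before any data is handed to the application.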
Sometimes, though, it makes sense to have a tag that stands by itself. For
example, you might want to add a "flag" tag that marks the message as important. A
tag like that doesn't enclose any content, so it's known as an "empty tag". You
can create an empty tag by ending it with /> instead of
>. For example, the following message contains such a tag:
<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <flag/>
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>
Note: The empty tag saves you from having to code
<flag></flag> in order to have a well-formed document.
You can control which tags are allowed to be empty by creating a Document Type
Definition, or DTD. We'll talk about that in a
few moments. If there is no DTD, then the document can contain any kind of tag
you want, as long as the document is well-formed.
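To see that <flag/> really is shorthand for <flag></flag>, the sketch below parses both forms and compares what a program gets back. It assumes Python's standard xml.etree.ElementTree module; describe_flag is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

SHORT_FORM = """<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <flag/>
  <text>How many ways is XML cool? Let me count the ways...</text>
</message>"""

# Same message, with the empty tag written out as a start/end pair.
LONG_FORM = SHORT_FORM.replace("<flag/>", "<flag></flag>")


def describe_flag(xml_text):
    """Return (is present, text content, number of children) for <flag>."""
    flag = ET.fromstring(xml_text).find("flag")
    if flag is None:
        return (False, "", 0)
    return (True, flag.text or "", len(flag))


print(describe_flag(SHORT_FORM))
print(describe_flag(LONG_FORM))
```

Both calls describe the same thing: a <flag> element that is present and has no content.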
Comments in XML Files
XML comments look just like HTML comments:
<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <!-- This is a comment -->
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>
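A parser keeps comments separate from the data itself. As a small sketch, assuming Python's standard xml.dom.minidom module (one of several parsers you could use), the comment shows up as its own node next to the <text> element:

```python
from xml.dom import minidom, Node

COMMENTED_XML = """<message to="you@yourAddress.com" from="me@myAddress.com"
         subject="XML Is Really Cool">
  <!-- This is a comment -->
  <text>
    How many ways is XML cool? Let me count the ways...
  </text>
</message>"""

document = minidom.parseString(COMMENTED_XML)
children = document.documentElement.childNodes

# Comments and elements are different kinds of node in the parsed tree.
comments = [node.data for node in children
            if node.nodeType == Node.COMMENT_NODE]
elements = [node.tagName for node in children
            if node.nodeType == Node.ELEMENT_NODE]

print(comments)
print(elements)
```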
The XML Prolog
To complete this journeyman's introduction to XML, note that an XML file
always starts with a prolog. The minimal prolog contains a declaration that
identifies the document as an XML document, like this:
<?xml version="1.0"?>
The declaration may also contain additional information, like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
The XML declaration is essentially the same as the HTML header,
<html>, except that it uses <?..?> and it
may contain the following attributes:
- version
  Identifies the version of the XML markup language used in the data. This
  attribute is not optional.
- encoding
  Identifies the character set used to encode the data. "ISO-8859-1" is
  "Latin-1", the Western European and English language character set. (The
  default is compressed Unicode: UTF-8.)
- standalone
  Tells whether or not this document references an external entity or an
  external data type specification (see below). If there are no external
  references, then "yes" is appropriate.
The prolog can also contain definitions of entities (items that are inserted when you
reference them from within the document) and specifications that tell which tags
are valid in the document. Both are declared in a Document Type Definition (DTD),
which can be defined directly within the prolog or referenced through pointers
to external specification files. But those are the
subject of later tutorials. For more information on these and many other aspects
of XML, see the Recommended Reading list of the W3C XML page at http://www.w3.org/XML/.
Note: The declaration is actually optional. But it's a good idea to
include it whenever you create an XML file. The declaration should have the
version number, at a minimum, and ideally the encoding as well. That practice
simplifies things if the XML standard is extended in the future, and if the data
ever needs to be localized for different geographical regions.
Everything that comes after the XML prolog constitutes the document's
content.
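The effect of the encoding attribute can be seen by handing a parser raw bytes. The sketch below assumes Python's standard xml.etree.ElementTree module; the subject text "Café menu" and the parses helper are made up for the example. The same Latin-1 bytes are read correctly when the declaration names the encoding, and are rejected when it is missing, because the parser then falls back to the UTF-8 default.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
# Illustrative content with one non-ASCII character (e with acute accent).
BODY = "<message><subject>Caf\u00e9 menu</subject></message>"

# Encode both variants as Latin-1 bytes, as they might be stored in a file.
with_declaration = (DECLARATION + BODY).encode("iso-8859-1")
without_declaration = BODY.encode("iso-8859-1")


def parses(data):
    """Return True if the parser accepts the bytes, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


subject = ET.fromstring(with_declaration).findtext("subject")
print(subject)
print(parses(without_declaration))
```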
Processing Instructions
An XML file can also contain processing instructions that give
commands or information to an application that is processing the XML data.
Processing instructions have the following format:
<?target instructions?>
where the target is the name of the application that is expected to do
the processing, and instructions is a string of characters that embodies
the information or commands for the application to process.
Since the instructions are application-specific, an XML file could have
multiple processing instructions that tell different applications to do similar
things, though in different ways. The XML file for a slideshow, for example,
could have processing instructions that let the speaker specify a technical or
executive-level version of the presentation. If multiple presentation programs
were used, the file might need multiple versions of the processing
instructions (although it would be nicer if such applications recognized
standard instructions).
Note: The target name "xml" (in any combination of upper or lowercase
letters) is reserved for XML standards. In one sense, the declaration is a
processing instruction that fits that standard. (However, when you're working
with the parser later, you'll see that the method for handling processing
instructions never sees the declaration.)
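Here is a sketch of how an application might pick out processing instructions. The slideshow file, the target names slideViewer and otherViewer, and their instruction strings are all invented for illustration. Python's standard xml.dom.minidom module stands in for the parser, which this tutorial does not specify.

```python
from xml.dom import minidom, Node

# Hypothetical slideshow file with instructions for two different programs.
SLIDESHOW_XML = """<?xml version="1.0"?>
<?slideViewer audience="technical"?>
<?otherViewer level=tech?>
<slideshow>
  <slide>Overview</slide>
</slideshow>"""


def processing_instructions(xml_text):
    """Return (target, instructions) pairs found before the root element."""
    document = minidom.parseString(xml_text)
    return [(node.target, node.data)
            for node in document.childNodes
            if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE]


for target, instructions in processing_instructions(SLIDESHOW_XML):
    print(target, "->", instructions)
```

Each program would act only on the instructions addressed to its own target. Note that the declaration on the first line is not reported as a processing instruction here, which matches the behavior described in the note above.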