In this tutorial we provide a quick introduction to XML.
XML stands for eXtensible Markup Language and is a text-based markup language derived from Standard Generalized Markup Language (SGML). It was designed to store and transport data, and to be both human- and machine-readable.
A markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document. The following is an example of XML markup:
<text>
<sentence>Hello, world!</sentence>
</text>
XML is a generalized way of describing hierarchical structured data. An XML document contains one or more elements, which are delimited by start and end tags.
The following is an example of a complete XML file:
<?xml version="1.0"?>
<sentence>
<token>This</token>
<token>is</token>
<token>a</token>
<token>sentence</token>
<token>.</token>
</sentence>
It contains two kinds of information: markup, such as <sentence> and <token>, and the text enclosed by that markup, such as This and sentence.
The first line is the XML declaration.
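As a minimal sketch of how a program sees these two kinds of information, the following Python snippet reads the example file with the standard library module xml.etree.ElementTree. The variable names (DOCUMENT, root, tokens) are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# The complete XML file from above, as a string. The declaration has to be
# the very first thing in the document, so the string starts with it.
DOCUMENT = (
    '<?xml version="1.0"?>\n'
    "<sentence>\n"
    "<token>This</token>\n"
    "<token>is</token>\n"
    "<token>a</token>\n"
    "<token>sentence</token>\n"
    "<token>.</token>\n"
    "</sentence>\n"
)

root = ET.fromstring(DOCUMENT)  # the outermost element, here <sentence>

# Markup: the element names. Text: the character data inside each token.
tokens = [token.text for token in root.findall("token")]

print(root.tag)  # name of the outermost element
print(tokens)    # list with the text of each <token>
```

The parser separates markup from text: element names are available as tag, and the enclosed character data as text.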
The following diagram depicts the syntax rules to write different types of markup and text in an XML document.
The XML document can optionally have an XML declaration, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
The declaration must begin with <?xml, with “xml” in lower case.
The example above contains two attributes: version, which refers to the XML version, and encoding, which specifies the character encoding used in the document.
An XML file is structured by several XML-elements, also called XML-nodes or XML-tags. XML-element names are enclosed by angle brackets < > as shown below:
<token>
Syntax of XML-Elements:
Each element is delimited by a start and an end tag:
<token>....</token>
Exceptions are empty XML-elements, which combine the start and the end tag in a single tag:
<token/>
Nesting of elements:
XML-elements may be nested, i.e. an XML-element can contain multiple XML-elements as its children. However, XML-elements must not overlap.
Incorrect XML with overlapping elements:
<?xml version="1.0"?>
<sentence>
<token>married
</sentence>
</token>
Correct XML with nested elements:
<?xml version="1.0"?>
<sentence>
<token>married</token>
</sentence>
Root element:
An XML document must have exactly one root element. The root element spans the whole document and includes all other elements.
Incorrect XML document without root element, i.e. there is no single element containing both the sentence and token elements:
<sentence>...</sentence>
<token>...</token>
Correct XML document with root element (text) including both the sentence and token elements:
<text>
<sentence>...</sentence>
<token>...</token>
</text>
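The nesting and root element rules can be checked by handing the examples to a parser. The following sketch uses Python's standard library module xml.etree.ElementTree; the helper name is_well_formed is an assumption made for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The four examples from above, written on one line each.
overlapping = "<sentence><token>married</sentence></token>"
nested = "<sentence><token>married</token></sentence>"
no_root = "<sentence>...</sentence><token>...</token>"
with_root = "<text><sentence>...</sentence><token>...</token></text>"

for name, document in [
    ("overlapping", overlapping),
    ("nested", nested),
    ("no_root", no_root),
    ("with_root", with_root),
]:
    print(name, is_well_formed(document))
```

Only the nested example and the example with a root element are accepted; the parser rejects the other two.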
Syntax of element names:
The name of XML-elements is case-sensitive, i.e. <sentence> is different from <Sentence>. This also means that the names in the start and the end tag need to be in exactly the same case.
<text>
<sentence>...</sentence>
<Sentence>...</Sentence>
</text>
In the example above we have an XML-document with a root element text and two other XML-elements, sentence and Sentence.
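A short sketch of what case sensitivity means in practice, again using Python's xml.etree.ElementTree. The variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<text><sentence>...</sentence><Sentence>...</Sentence></text>"
)

# Each name matches only the element written in exactly that case.
lower = root.findall("sentence")
upper = root.findall("Sentence")
print(len(lower), len(upper))

# A start tag and an end tag that differ in case do not match.
try:
    ET.fromstring("<sentence>...</Sentence>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print(mismatch_accepted)
```

Searching for sentence and for Sentence finds one element each, and the element whose end tag differs in case from its start tag is rejected.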
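The name of XML-elements should only contain the characters [a-zA-Z0-9_], i.e. no white spaces, dashes, diacritics or alike. White space is never allowed inside a name. XML itself does permit dashes and many non-ASCII letters, but avoiding them keeps names portable across tools.
Examples of XML-element names that are incorrect or should be avoided: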
<sentence 1>
<sentence-1>
<Sätze>
Example of correct XML-element names:
<sentence1>
<sentence_1>
<Saetze>
An XML-element can contain attributes, each of which specifies a single property of the element as a name-value pair. For example:
<a href="http://www.tutorial.com/">Tutorial</a>
Here href is the attribute name and http://www.tutorial.com/ is the attribute value.
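Syntax Rules for XML Attributes:
Attribute names are case-sensitive, i.e. HREF and href are considered two different XML attributes.
Attribute names follow the same naming restrictions as element names: [a-zA-Z0-9_], i.e. no white spaces, dashes, diacritics or alike.
An attribute must not be specified more than once in the same element. The following example shows incorrect syntax because the attribute id is specified twice: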
<token id="12" stem="tale" id="15">....</token>
Attribute values must always be enclosed in quotation marks. The following example demonstrates incorrect XML syntax because the attribute value is not enclosed in quotation marks:
<token id=12>....</token>
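The attribute rules above can be tried out with a parser. This is a minimal sketch using Python's xml.etree.ElementTree; the helper name parses and the variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Well-formed: each attribute appears once and its value is quoted.
token = ET.fromstring('<token id="12" stem="tale">....</token>')
attributes = token.attrib    # dict mapping attribute name to value
token_id = token.get("id")   # attribute values are always strings

# The two incorrect examples from above are rejected by the parser.
duplicate_ok = parses('<token id="12" stem="tale" id="15">....</token>')
unquoted_ok = parses("<token id=12>....</token>")

print(attributes, token_id, duplicate_ok, unquoted_ok)
```

Note that the value of id is the string "12", not a number; converting it is left to the application.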
References usually allow you to add or include additional text or markup in an XML document. References always begin with the symbol &, which is a reserved character, and end with the symbol ;. XML has two types of references:
Entity References: an entity reference contains a name between the start and the end delimiters. For example, in &amp; the name is amp. The name refers to a predefined string of text and/or markup.
Character References: these contain a hash mark (#) followed by a number, as in &#65;. The number always refers to the Unicode code point of a character. In this case, 65 refers to the character A.
XML predefines the following five entities for characters that are reserved for markup:
not allowed character | replacement-entity | character description |
---|---|---|
< | &lt; | less than |
> | &gt; | greater than |
& | &amp; | ampersand |
' | &apos; | apostrophe |
" | &quot; | quotation mark |
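The following sketch shows both directions: a parser resolves entity and character references when reading, and reserved characters have to be replaced by entities when writing. It uses Python's xml.etree.ElementTree and the escape function from xml.sax.saxutils; the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser replaces references by the characters they stand for.
token = ET.fromstring("<token>&lt;b&gt; &amp; &#65;</token>")
print(token.text)

# Writing: escape() replaces &, < and > by default; the extra mapping
# adds the two quotation characters from the table.
raw = 'Tom & "Jerry" <3'
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# Round trip: parsing the escaped text gives back the original string.
restored = ET.fromstring("<token>" + escaped + "</token>").text
print(restored == raw)
```

Escaping the ampersand first matters, because every entity itself starts with &; escape() takes care of that order.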
Web browsers are quite lenient with HTML that is not well-formed or invalid. They will try to figure out how to render a page even if there are errors. XML parsers are not: they stop at the first error in an XML document, and your XML application stops with them.
Therefore, whenever you work with markup languages, check that your material is error-free. Follow this piece of advice and you will avoid a lot of headaches in the future.
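One simple way to check a document is to let a parser read it and report where it stopped. The sketch below uses Python's xml.etree.ElementTree; the helper name first_error is an assumption for this illustration.

```python
import xml.etree.ElementTree as ET


def first_error(xml_text):
    """Return None for well-formed input, else (message, line, column)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser stopped
        return (str(err), line, column)
    return None


good = "<text><sentence>ok</sentence></text>"
bad = "<text>\n<sentence>broken</text>"  # <sentence> is never closed

print(first_error(good))
print(first_error(bad))
```

The parser reports only the first problem it encounters, so a document with several errors has to be checked again after each fix.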
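A document is well-formed if it is compliant with some minimal requirements:
It has exactly one root element.
Every element is closed, either by an end tag or, for empty elements, within the tag itself (/).
Elements are properly nested and do not overlap.
Attribute values are enclosed in quotation marks, and no attribute is specified twice in the same element.
The characters <, >, ", ', and & are only used as markup (either part of a tag or an entity). If they are to be used in the document as characters, entities should be used instead: &lt;, &gt;, &quot;, &apos;, &amp;.
HTML documents have to conform to a particular specification where only a closed set of elements and attributes with particular contents and data types are allowed. Anything else will produce errors.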
However, the structure and contents of XML documents can and have to be defined by their authors. The rules describing those aspects are defined in a DTD (Document Type Definition) or XML schema. A document is valid if it is well-formed and follows the rules of its DTD or schema.
If used properly, XML schemas can help you to detect annotation inconsistencies and errors (especially helpful if you are working with data created manually by humans).
There are different ways to define documents: RELAX NG compact, DTDs, XML schema languages.
The XML Document Type Definition, commonly known as DTD, is a way to describe an XML language precisely.
A DTD is used to check the vocabulary and the structure of an XML document against the grammatical rules of the corresponding XML language.
An XML DTD can either be specified inside the document, or it can be kept in a separate document and linked from the XML document. We will only deal with the internal DTD here.
The basic syntax of a DTD is as follows:
<!DOCTYPE element DTD identifier
[
declaration1
declaration2
........
]>
In the above syntax,
element tells the parser to parse the document from the specified root element.
DTD identifier is an identifier for the document type definition, which may be the path to a file on the system or the URL of a file on the internet. If the DTD points to an external path, it is called an External Subset.
A DTD is referred to as an internal DTD if the elements are declared within the XML file itself. In that case, the standalone attribute in the XML declaration can be set to yes, which means that the document works independently of any external source.
The syntax of internal DTD is as shown:
<!DOCTYPE root-element [element-declarations]>
where root-element is the name of the root element and element-declarations is where you declare the elements.
The following is a simple example of an internal DTD:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE text [
<!ELEMENT text (sentence)>
<!ELEMENT sentence (token+)>
<!ELEMENT token (#PCDATA)>
]>
<!DOCTYPE text defines that the root element is text
<!ELEMENT text defines that the text element must contain one sentence element
<!ELEMENT sentence defines that the sentence element must contain at least one token element
<!ELEMENT token defines that the token element must be of the type #PCDATA
An example of a corresponding valid XML-structure:
<text>
<sentence>
<token>These</token>
<token>days</token>
<token>.</token>
</sentence>
</text>
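Python's standard library module xml.etree.ElementTree only checks well-formedness and does not validate against a DTD. To illustrate what the three element declarations above demand, the following sketch therefore checks them by hand. It is a simplified illustration, not a DTD validator, and the function name conforms_to_text_dtd is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def conforms_to_text_dtd(xml_text):
    """Hand-written check of the rules in the example DTD above.

    Simplified: stray text between child elements is not checked.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "text":  # <!DOCTYPE text
        return False
    children = list(root)
    # <!ELEMENT text (sentence)>: exactly one sentence child
    if [child.tag for child in children] != ["sentence"]:
        return False
    tokens = list(children[0])
    # <!ELEMENT sentence (token+)>: at least one child, all named token
    if not tokens or any(token.tag != "token" for token in tokens):
        return False
    # <!ELEMENT token (#PCDATA)>: text only, no child elements
    return all(len(token) == 0 for token in tokens)


valid = (
    "<text><sentence><token>These</token><token>days</token>"
    "<token>.</token></sentence></text>"
)
empty_sentence = "<text><sentence></sentence></text>"
two_sentences = (
    "<text><sentence><token>a</token></sentence>"
    "<sentence><token>b</token></sentence></text>"
)

print(conforms_to_text_dtd(valid))
print(conforms_to_text_dtd(empty_sentence))
print(conforms_to_text_dtd(two_sentences))
```

All three documents are well-formed, but only the first follows the rules of the DTD. This is the difference between a well-formed and a valid document.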
DTD rules:
* (0 or more occurrences) or ? (0 or 1 occurrence)
+ (1 or more occurrences) or no operator (exactly 1 occurrence)
|