Corpus building with XML and TEI: Introduction to XML

In this tutorial we provide a quick introduction to XML.

Definition

XML stands for EXtensible Markup Language and is a text-based markup language derived from Standard Generalized Markup Language (SGML). This language was designed to store and transport data. And it was designed to be both human- and machine-readable.

XML is extensible: XML allows you to create your own self-descriptive tags, or language, that suits your application.
XML carries the data, it does not present it: XML allows you to store the data irrespective of how it will be presented.
XML is a public standard: XML was developed by an organization called the World Wide Web Consortium (W3C) and is available as an open standard.

A markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document. The following is an example of XML markup:

<text>
   <sentence>Hello, world!</sentence>
</text>

XML is a generalized way of describing hierarchical structured data. An XML document contains one or more elements, which are delimited by start and end tags.

XML Syntax

The following is an example shows of a complete XML file:

<?xml version="1.0"?>
<sentence>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
    <token>.</token>
</sentence>

It contains two kinds of information:

markup, the XML-tags (XML-elements) <sentence> and <token> and
the text, This is a sentence.

The first line is the XML declaration.

The following diagram depicts the syntax rules to write different types of markup and text in an XML document.

XML Declaration

The XML document can optionally have an XML declaration, e.g.:

<?xml version="1.0" encoding="UTF-8"?>

XML declarations are case sensitive,
must be the first statement of the XML document,
must begin with <?xml, with “xml” in lower case

The example above contains two attributes

version, which refers to the XML version and
encoding, which specifies the character encoding used in the document

XML Structure

An XML file is structured by several XML-elements, also called XML-nodes or XML-tags. XML-element names are enclosed by triangular brackets < > as shown below:

<token>

Syntax of XML-Elements:
Each element is delimited by a start and an end element

<token>....</token>

exceptions are empty XML-elements, which are a combination of start and end element:

<token/>

Nesting of elements:

XML-elements may be nested, i.e. an XML-element can contain multiple XML-elements as its children. However, XML-elements must not overlap.

Incorrect XML with overlapping elements:

<?xml version="1.0"?>
<sentence>
<token>married
</sentence>
</token>

Correct XML with nested elements:

<?xml version="1.0"?>
<sentence>
<token>married</token>
</sentence>

Root element:
A valid XML document can have only one root element. The root element spans the whole document, and includes all other elements.

Incorrect XML document without root element, i.e. their is no single element containing both the sentence and token element.

<sentence>...</sentence>
<token>...</token>

Correct XML document with root element (text) including both the sentence and token element.

<text>
   <sentence>...</sentence>
   <token>...</token>
</text>

Syntax of element names:

The name of XML-elements is case-sensitive, i.e. <sentence> is different from <Sentence> This also means the name of the start and the end elements need to be exactly in the same case.

<text>
<sentence>...</sentence>  
<Sentence>...</Sentence>
</text>

In the example above we have an XML-document with a root element text and two other XML-elements sentence and Sentence.

The name of XML-elements may only contain the characters [a-zA-0-9_], i.e. no white spaces, dashes, diacritics or alike.

Examples of incorrect XML-element names:

<sentence 1>
<sentence-1>
<Sätze>

Example of correct XML-element names:

<sentence1>
<sentence_1>
<Saetze>

Attributes

An XML-element can contain attributes that specify a single property for the element, as name-value pair. For example:

<a href="http://www.tutorial.com/">Tutorial</a>

Here href is the attribute name and http://www.tutorial/ is the attribute value.

Syntax Rules for XML Attributes:

attribute names have the same restrictions as XML-element names:
- they are case sensitive, i.e. HREF and href are considered two different XML attributes.
- they may only contain the characters [a-zA-0-9_], i.e. no white spaces, dashes, diacritics or alike.
an XML-element can have multiple unique attributes, i.e., an XML-element can have multiple attributes, but each attribute can occur only once

The following example shows incorrect syntax because the attribute id is specified twice:

    <token id="12" stem="tale" id="15">....</token>

attribute names are defined without quotation marks, whereas attribute values must always appear in quotation marks.

The following example demonstrates incorrect xml syntax, the attribute value is not defined in quotation marks:

    <token id=12>....</token>

XML References

References usually allow you to add or include additional text or markup in an XML document. References always begin with the symbol & ,which is a reserved character and end with the symbol ;. XML has two types of references:

Entity References: an entity reference contains a name between the start and the end delimiters. For example & where amp is name. The name refers to a predefined string of text and/or markup.

Character References: These contain references, such as A, contains a hash mark (#) followed by a number. The number always refers to the Unicode code of a character. In this case, 65 refers to the character A.

XML Text

the names of XML-elements and XML-attributes
- are case-sensitive
- may contain only the characters [a-zA-0-9_], i.e. no white spaces, dashes, diacritics or alike.
to avoid character encoding problems, all XML files should be saved as Unicode UTF-8 or UTF-16 files.
whitespace characters like blanks, tabs and line-breaks between XML-elements and between the XML-attributes will be ignored.
some characters are reserved by the XML syntax itself. Hence, they cannot be used directly. To use them, some replacement-entities are used, which are listed below:

not allowed character	replacement-entity	character description
<	`<`	less than
>	`>`	greater than
&	`&`	ampersand
’	`'`	apostrophe
“	`"`	quotation mark

Well-formed and valid

Web browsers are quite lenient regarding not well-formed and invalid HTML. They will try to figure out how to render a page, even if there are errors. However, errors in XML documents will stop your XML applications. XML parsers will choke, XML errors are not allowed.

Therefore, whenever you work with markup languages, try to check that everything is alright to be sure that your material is error free. Follow this piece of advice and you will avoid lot of headaches in the future.

Well-formed documents

A document is well-formed if it is compliant with some minimal requirements:

the document contains a document type declaration
a single element, known as the root element, contains all the other elements in the document.
all elements are well formed (if they are):
- contain an open and close tag
- empty elements may be single elements, when properly terminated (using a /)
- properly nested so that they do not overlap
- element and attribute names follow the rules for allowed characters
- attribute values are quoted
<, >, ", ', and & are only used as markup (either part of a tag or a entity). If they are to be used in the document as character, entities should be used instead: <, >, ", ', &.
it contains only properly encoded legal Unicode characters

Valid documents

HTML documents have to conform to a particular specification where only a closed set of elements and attributes with particular contents and data types are allowed. Anything else will produce errors.

However, the structure and contents of XML documents can and have to be defined. The rules describing those aspects are defined in a DTD (Document Type Definition) or XML schema. A document is valid if:

it is well-formed, and
it observes the rules dictated by its DTD or XML schema.

If used properly, XML schemas can help you to detect annotation inconsistencies and errors (specially helpful if you are working with data created manually by humans).

There are different ways to define documents: Relax NG compact, DTDs, XML schema languages.>

DTDs

The XML Document Type Declaration, commonly known as DTD, is a way to describe XML language precisely.
DTDs check vocabulary and validity of the structure of XML documents against grammatical rules of appropriate XML language.

An XML DTD can be either specified inside the document, or it can be kept in a separate document and then linked separately. We will only deal with the internal DTD here.

Syntax

The basic syntax of a DTD is as follows:

<!DOCTYPE element DTD identifier
[
   declaration1
   declaration2
   ........
]>

In the above syntax,

the DTD starts with <!DOCTYPE delimiter.
an element tells the parser to parse the document from the specified root element.
DTD identifier is an identifier for the document type definition, which may be the path to a file on the system or URL to a file on the internet. If the DTD is pointing to external path, it is called External Subset.
The square brackets [ ] enclose an optional list of entity declarations called Internal Subset.

Internal DTD

A DTD is referred to as an internal DTD if elements are declared within the XML files. To refer it as internal DTD, standalone attribute in XML declaration must be set to yes. This means, the declaration works independent of external source.

The syntax of internal DTD is as shown:

<!DOCTYPE root-element [element-declarations]>

where root-element is the name of root element and element-declarations is where you declare the elements.

The following is a simple example of an internal DTD:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE text [
   <!ELEMENT text (sentence)>
   <!ELEMENT sentence (token+)>
   <!ELEMENT token (#PCDATA)>
]>

<!DOCTYPE text defines that the root element is text
<!ELEMENT text defines that the text element must contain one sentence element
<!ELEMENT sentence defines that the sentence element must contain at least one token element
<!ELEMENT token defines that the token element must be of the type #PCDATA

An example of a corresponding valid XML-structure:

<text>
  <sentence>
    <token>These</token>
    <token>days</token>
    <token>.</token>
  </sentence>
</text>

DTD rules:

the document type declaration must appear at the start of the document (preceded only by the XML header) — it is not permitted anywhere else within the document.
similar to the DOCTYPE declaration, the element declarations must start with an exclamation mark.
the Name in the document type declaration must match the element type of the root element.
multiple child elements in an element declaration may be
- optional, followed by * (0 to more occurences) or ? (0 to 1 occurences)
- obligatory, followed by + (1 to more occurrences) or not operator (1 occurrence)
- sequential, separated by commas, and must appear in the document in this order
- either/or, separated by |

Links

Back