This tutorial is a shortened version (omitting everything Python-related) of the introduction to XML that was part of the Working with Corpora at Saarland University tutorials by @interrogator, @alvations, and @chozelinek, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on work at https://github.com/resbaz/nltk.

Session: XML

In this session we provide a quick introduction to XML.

XML and HTML

Both XML and HTML are markup languages. Markup languages are systems for annotating documents in such a way that the annotation is syntactically distinguishable from the content. What does that mean? We normally want to keep text and metatextual information separate. Metatextual information can be metadata, linguistic annotation, formatting, content description, and so on.

Two well-known markup formats are XML and HTML, and they are in fact very similar. Both descend from SGML, and documents in both formats can be represented and manipulated through the DOM. However, HTML consists of a pre-defined, closed set of tags, with a specification that web browsers use to present web content, whereas XML is not restricted to a particular set of elements or a particular purpose. Users can define the structure of the document, its elements, attributes, etc.

Because most of what we will learn about XML also applies to HTML (loosely speaking, HTML can be regarded as one fixed vocabulary of the kind that XML lets you define yourself), and because there are plenty of resources on the web for learning HTML, we will focus on XML.

Documents as trees

DOM stands for Document Object Model. It is the specification of how HTML and XML documents are represented as a structure of objects, and of how that structure is accessed and manipulated to create, edit, or remove content.

We can think of DOM as a tree structure:

<?xml version="1.0" encoding="UTF-8"?>
<TextCorpus lang="de">
    <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
        <token ID="t_2">nach</token>
        <token ID="t_3">New</token>
        <token ID="t_4">York</token>
        <token ID="t_5">.</token>
        <token ID="t_6">Sie</token>
        <token ID="t_7">will</token>
        <token ID="t_8">dort</token>
        <token ID="t_9">Urlaub</token>
        <token ID="t_10">machen</token>
        <token ID="t_11">.</token>
    </tokens>
</TextCorpus>
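
To make the tree view concrete, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. Note that ElementTree is not an implementation of the W3C DOM API; it is used here only because it exposes the same parent-child tree. The helper name outline and the shortened token list are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the TextCorpus example above.
doc = """<TextCorpus lang="de">
    <text>Karin fliegt nach New York.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
        <token ID="t_2">nach</token>
    </tokens>
</TextCorpus>"""

root = ET.fromstring(doc)


def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["    " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)
print("\n".join(tree_lines))
```

The printed outline has TextCorpus at the top, text and tokens one level below it, and the token elements one level below tokens.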

XML

XML stands for eXtensible Markup Language. The language was designed to store and transport data, and to be both human- and machine-readable. Unlike in HTML, the structure of the document, the elements, their attributes, and the content are not pre-defined, which makes XML a very flexible framework.

XML is a generalized way of describing hierarchically structured data. An XML document contains one or more elements, which are delimited by start and end tags.

<s>This is a sentence.</s>

Elements can be nested to any depth. An element inside another one is said to be a subelement or child. The outermost element of an XML document is called the root element, and a document can have only one.

<s>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
    <token>.</token>
</s>

Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element. Attribute values must be quoted, with either single or double quotes.

<s id="s_0">
    <token pos1="DT" pos2="DET">This</token>
    <token pos1="VBZ" pos2="VERB">is</token>
    <token pos1='DT' pos2='DET'>a</token>
    <token pos1='NN' pos2='NOUN'>sentence</token>
    <token pos1='.' pos2='PUNCT'>.</token>
</s>
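
As a rough illustration of these rules, the following sketch reads attributes with Python's standard xml.etree.ElementTree module and shows that a repeated attribute name is rejected by the parser. The variable names are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are interchangeable around attribute values.
sentence = ET.fromstring("""<s id="s_0">
    <token pos1="DT" pos2="DET">This</token>
    <token pos1="VBZ" pos2="VERB">is</token>
    <token pos1='DT' pos2='DET'>a</token>
</s>""")

# Element.get() returns the value of an attribute, or None if it is absent.
sentence_id = sentence.get("id")
pairs = [(tok.text, tok.get("pos1")) for tok in sentence.iter("token")]
print(sentence_id, pairs)

# Repeating an attribute name within one start tag makes the parser fail.
try:
    ET.fromstring('<token pos1="DT" pos1="DET">This</token>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```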

If an element has more than one attribute, the order of the attributes is not significant. An element's attributes form an unordered set of name-value pairs. There is no limit to the number of attributes you can define on an element.

<s id="s_0">
    <token pos1="DT" pos2="DET">This</token>
    <token pos2="VERB" pos1="VBZ">is</token>
    <token pos1="DT" pos2="DET">a</token>
    <token pos2="NOUN" pos1="NN">sentence</token>
    <token pos1="." pos2="PUNCT">.</token>
</s>
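
A small sketch of what "unordered" means in practice, assuming Python's standard xml.etree.ElementTree module: the parser exposes attributes as a dictionary, and two dictionaries with the same name-value pairs compare equal regardless of the order in which the pairs were written.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<token pos1="VBZ" pos2="VERB">is</token>')
second = ET.fromstring('<token pos2="VERB" pos1="VBZ">is</token>')

# Element.attrib is a plain dict, so comparison ignores attribute order.
same_attributes = first.attrib == second.attrib
print(same_attributes)
```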

Elements can have text content. Elements that contain no text and no children are empty. Elements that contain both text and child elements are said to contain mixed content.

This is an element with text content:

<s>This is a sentence.</s>

This is an empty element:

<comment type="gesture"/>

This is an element with mixed content:

<s>This is a sentence with <italics>mixed</italics> content.</s>
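
The three cases can be inspected with a short sketch, again assuming Python's standard xml.etree.ElementTree module. One detail specific to that library (not to XML itself) is that text following a child element is stored on the child as its tail.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring("<s>This is a sentence.</s>")
empty = ET.fromstring('<comment type="gesture"/>')
mixed = ET.fromstring(
    "<s>This is a sentence with <italics>mixed</italics> content.</s>"
)

# Text content: the text sits directly on the element.
plain_text = plain.text

# Empty element: no text and no children.
empty_is_empty = empty.text is None and len(empty) == 0

# Mixed content: text before the child, the child's own text, and the
# text after the child (kept by ElementTree as the child's "tail").
before = mixed.text
inside = mixed[0].text
after = mixed[0].tail

# itertext() walks all text pieces in document order.
full_text = "".join(mixed.itertext())
print(full_text)
```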

Finally, XML documents can declare their character encoding in an XML declaration on the first line, before the root element.

<?xml version="1.0" encoding="UTF-8"?>
<s>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
    <token>.</token>
</s>
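
The following sketch suggests why the declared encoding matters. It assumes Python's standard xml.etree.ElementTree module and uses ISO-8859-1 with a made-up token purely for illustration: the same bytes parse correctly with the declaration and fail without it, because a parser then assumes UTF-8.

```python
import xml.etree.ElementTree as ET

# The same content, once with and once without an encoding declaration,
# both stored as ISO-8859-1 bytes.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<s><token>Übung</token></s>"
).encode("iso-8859-1")
undeclared = "<s><token>Übung</token></s>".encode("iso-8859-1")

# With the declaration the parser decodes the bytes correctly.
word = ET.fromstring(declared)[0].text
print(word)

# Without it the parser assumes UTF-8, and these bytes are not valid UTF-8.
try:
    ET.fromstring(undeclared)
    undeclared_parses = True
except ET.ParseError:
    undeclared_parses = False
print(undeclared_parses)
```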

(Mark Pilgrim. Dive Into Python 3. http://www.diveintopython3.net/xml.html)

Well-formed and valid

Web browsers are quite lenient with HTML that is not well-formed or not valid. They will try to figure out how to render a page even if it contains errors. XML applications are not: a conforming XML parser stops at the first well-formedness error instead of guessing what was meant.

Therefore, whenever you work with markup languages, check your material to make sure it is error-free. Follow this piece of advice and you will avoid a lot of headaches in the future.

Well-formed documents

A document is well-formed if it is compliant with some minimal requirements:

  • if the document contains an XML declaration, it is on the first line (a document type declaration is optional and is not needed for well-formedness)
  • a single element, known as the root element, contains all the other elements in the document.
  • all elements are well formed, that is, they are:
    • opened and subsequently closed, or
    • if empty, properly terminated, and
    • properly nested so that they do not overlap
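  • < and & are only used as markup (as part of a tag or an entity reference). To use them as literal characters, write the entities &lt; and &amp; instead. The entities &gt;, &quot;, and &apos; are available for >, ", and ' as well, and a quote character must be escaped when it appears inside an attribute value delimited by the same kind of quote.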
  • there are rules about the characters that can be used in element names and elsewhere
  • tags are case-sensitive
  • attribute values have to be quoted
  • it contains only properly encoded legal Unicode characters
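
Several of the requirements above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree parser as the checker and the escape function from xml.sax.saxutils; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    "ok": is_well_formed("<s><token>This</token></s>"),
    "overlapping": is_well_formed("<s><b><i>text</b></i></s>"),
    "two_roots": is_well_formed("<s>one</s><s>two</s>"),
    "case_mismatch": is_well_formed("<s>text</S>"),
    "unquoted_value": is_well_formed("<s id=s_0>text</s>"),
    "bare_ampersand": is_well_formed("<s>Q & A</s>"),
    # escape() replaces &, < and > with the corresponding entities.
    "escaped": is_well_formed("<s>" + escape("Q & A, 1 < 2") + "</s>"),
}

for name, result in checks.items():
    print(name, result)
```

Only the first and the last entries are accepted; every other sample violates one of the rules in the list.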

Valid documents

HTML documents have to conform to a particular specification in which only a closed set of elements and attributes, with particular contents and data types, is allowed. Use anything else and a validator will report an error.

The structure and contents of XML documents, by contrast, are not fixed in advance: they can, and for validation must, be defined by the user. The rules describing them are written in a DTD (Document Type Definition) or an XML schema. A document is valid if:

  • it is well-formed, and
  • it observes the rules dictated by its DTD or XML schema.

If used properly, XML schemas can help you to detect annotation inconsistencies and errors (especially helpful if you are working with data created manually by humans).
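
Python's standard library does not validate against a DTD or a schema, so the sketch below is not a real validator. It is a hand-written check, under made-up rules, that only illustrates the kind of inconsistency a schema would catch. The function name check_sentence, the tag set ALLOWED_POS2, and the sample sentences are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: the root is <s> with an id attribute, and every
# child is a <token> whose pos2 value comes from a closed tag set.
ALLOWED_POS2 = {"DET", "VERB", "NOUN", "PUNCT"}


def check_sentence(xml_text):
    """Return a list of rule violations; an empty list means all checks passed."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    problems = []
    if root.tag != "s" or "id" not in root.attrib:
        problems.append("root must be <s> with an id attribute")
    for position, child in enumerate(root):
        if child.tag != "token":
            problems.append(f"child {position}: unexpected element <{child.tag}>")
        elif child.get("pos2") not in ALLOWED_POS2:
            problems.append(f"child {position}: bad pos2 value {child.get('pos2')!r}")
    return problems


good = '<s id="s_0"><token pos2="DET">This</token><token pos2="VERB">is</token></s>'
bad = '<s id="s_1"><token pos2="DET">This</token><token pos2="VREB">is</token><tok>a</tok></s>'

good_problems = check_sentence(good)
bad_problems = check_sentence(bad)
print(good_problems)
print(bad_problems)
```

Both sample sentences are well-formed, but only the first one follows the rules: the second contains a mistyped tag value and an unexpected element name, which is exactly the sort of manual annotation slip that validation is meant to surface.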

There are several schema languages for defining documents. My favorite is RELAX NG in its compact syntax: it is quite easy to understand, write, and read. It is much more powerful than DTDs, yet easier to use than other XML schema languages.