In this tutorial we will learn how to part-of-speech-tag a text using the GUI of TreeTagger.
Tagging is the task of labeling each word in a sequence of words with the appropriate part-of-speech (pos). The labels asigned are specified in a so-called tagset, a set of part-of-speech tags. The size and choice of the tagset can vary greatly. Usually the size is between 50 and 200 tags.
TreeTagger:
- tool for annotating text with part-of-speech and lemma information
- developed by Helmut Schmid in the TC project at the Institut für Maschinelle Sprachverarbeitung (IMS) at the University of Stuttgart
- supported languages include German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Polish and old French
- adaptable to other languages if a lexicon and a manually tagged training corpus are available
- Graphical User Interface for the Windows version of the TreeTagger (developed by Ciarán Ó Duibhín) - works also on Linux with Wine installed
- TreeTagger with GUI and parameter files for English and German: zip
Tagsets used by the trained parameter files of TreeTagger:
How to get started
- download the zip-file with TreeTagger, GUI and parameter files for English, German and French
- unpack the zip-file on your computer or on a USB-stick
- it creates a directory
TreeTagger
containing:
- directories:
bin
(program files), lib
(parameter files)
- files:
INSTALL.txt
, README.txt
- if you do not have a Perl interpreter installed you need to install one (normally this is the case)
Tag your first text with TreeTagger
- download the sample files and unpack them on your computer or USB-stick
- go to the directory
TreeTagger/bin
- open the
TreeTagger GUI
by double-clicking on wintreetagger.exe
- this should open a pop-up window
- the instructions here will use the
English
language setting. If you prefer German
you can click on the small German flag at the bottom right.

- the GUI gives access to all parameters of the TreeTagger grouped in:
Language
, Task
, Output for each token
, Input Options
, Tokenization Options
, Tagging Options
- we will need to change the
Language
as the sample files are in Latin-1
- choose the second
English
- you will see that the
Model
information below the Language
-box changes to Model english-par
, Trained on Latin-1
- load a plain-text-file by clicking in the window below
Input File
- a pop-up window for browsing your file system will open
- go to the directory were you unpacked the sample files and choose
grimm_sample.txt
- click on
Open
or Öffnen
- set a name for the output file by clicking in the window below
Output File
- a pop-up window for browsing your file system will open
- you will already be in the directory were you choose the input file from
- choose
grimm_sample.txt
and append a .tagged
so that the output file will be grimm_sample.txt.tagged
- click on
Save
or Speichern
- click on
Run
- once the TreeTagger is finished, you will wee a small pop-up window reading
TreeTagger finished
- click on
OK
- the tagged file will not pop up, but it has been written to the directory you have chosen
The tagged file
- with the default settings, TreeTagger will have
tokenized, lemmatized
and part-of-speach tagged
your text
- the tagged file is now in a one-word-per-line format
- each line has three
TAB
separated columns
Tagging text with XML/SGML tags
- the sample file
grimm_sample.txt
contained plain-text only
- the sample file
grimm_sample.xml
additionally contains meta-data information and annotations using XML/SGML
tags
- What happens if you tag the text
grimm_sample.xml
with the same settings we used for grimm_sample.txt
? - give it a try!
- the
XML/SGML
tags are treaded by TreeTagger as if they were normal words and are assigned a part-of-speach tag and a lemma
- however, what we want TreeTagger to do is ignore the
XML/SGML
tags, leaving them as they are
- in order to tell TreeTagger to ignore the
XML/SGML
tags, we need to tick the Input Option
SGML tags present
- the
XML/SGML
tags have to be on a separate line!
- tag
grimm_sample.xml
with this option and have a look at the output file.