Welcome to the Middle Polish Diachronic Lemmatised Corpus




The PolDiLemma corpus is a diachronic corpus of political, religious, scientific and historical texts by various authors of the Middle Polish period (16th to 18th centuries).

The period is characterised by the slow development of a supra-regional standard language: a process of standardisation based on the variety used by the Polish nobility and shaped by Latin and other foreign languages as well as by various social and regional varieties.

All texts are freely licensed and were gathered from the Federacja Bibliotek Cyfrowych (Digital Libraries Federation). The Middle Polish texts illustrate the history of the language and offer first-hand evidence of the development of Polish in its historical context.

Studying the history of the language is one way to become familiar with aspects of the history of Poland in general. It also helps to build valuable methodological knowledge in diachronic linguistics and philology.


Historical word forms may differ from their modern counterparts in orthography and/or grammatical categories, and making these old forms accessible is a linguistic challenge. The PolDiLemma-Tool is a Python tool that generates possible Middle Polish inflected word forms, together with their part-of-speech tags, from a given modern Polish word. It contains a Middle Polish morphology in XFST format (compiled and plain text) and a stemmer (Morfologik: a full-form lexicon and finite-state-based stemmer).

The tool should run on Linux and Windows machines (tested on Ubuntu 12.04, Ubuntu 14.04, and Windows 8).

Note: The XFST tool must be downloaded separately from www.fsmbook.com. Morfologik requires Java.
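
To illustrate the general idea of generating candidate historical forms from a modern word, here is a minimal Python sketch. It is not the PolDiLemma-Tool itself and uses neither XFST nor Morfologik. The rule table TOY_RULES, the functions candidate_spellings and tagged_candidates, and the tag "subst" are hypothetical names chosen for illustration, and the two rewrite rules are toy placeholders, not the Middle Polish morphology shipped with the tool.

```python
from itertools import product

# Toy, hypothetical rewrite rules: modern character -> possible older spellings.
# These are illustrative placeholders, NOT the rules used by PolDiLemma.
TOY_RULES = {
    "j": ["j", "y", "i"],
    "ó": ["ó", "o"],
}


def candidate_spellings(word, rules=TOY_RULES):
    """Return all spellings obtained by rewriting each character independently."""
    # For every character, collect its alternatives (or just itself if no rule applies).
    options = [rules.get(ch, [ch]) for ch in word]
    # Combine the alternatives position by position and remove duplicates.
    return sorted({"".join(combo) for combo in product(*options)})


def tagged_candidates(word, pos_tag):
    """Pair every candidate spelling with the (assumed) part-of-speech tag."""
    return [(form, pos_tag) for form in candidate_spellings(word)]


if __name__ == "__main__":
    for form, tag in tagged_candidates("mój", "subst"):
        print(form, tag)
```

A real morphology would of course also have to cover inflection and context-dependent rules; the sketch only shows how one modern input can fan out into several tagged candidate forms that can then be looked up in historical texts.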


6 July 2018

PolDiLemma is part of the CLARIN resource family of historical corpora.

31 May 2014

PolDiLemma is published on this page.
