Old and Middle French are rarely studied outside a small network of French universities. Implementing NLP functionality for these languages would make them easier to study and would open up access to the rich literature and culture expressed in them. Famous examples include the "Chanson de Roland", Chrétien de Troyes' Arthurian legends, Marie de France's "Lais", and Christine de Pizan's writings.

This project aims to extend basic CLTK functionality to Old and Middle French texts written between c. 900 and c. 1500 CE by implementing a tokenizer, a stopword list, named entity recognition, a PoS tagger, and a lemmatizer with English translations for as many words as possible. The underlying data will come from transcribed and digitized texts released under Creative Commons licenses. For example, a number of Old French texts from the BNF's 19th-century editions have been digitized and made available at gallica.fr. Lemmas will be sourced from Godefroy's 1901 "Lexique de l'Ancien Français" and the "Dictionnaire Electronique de Chrétien de Troyes", which has the advantage of English-language definitions.
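
To make the first two components concrete, here is a minimal sketch of how a tokenizer and stopword filter for Old French might behave. It is plain Python and does not use or reproduce any CLTK API. The function names, the regular expression, and the tiny stopword set are all hypothetical choices made for illustration; a real stopword list would be derived from corpus frequencies.

```python
import re

# Hypothetical sample of Old French function words, chosen by hand for
# illustration only. This is NOT the stopword list the project would ship.
STOPWORDS = {"li", "la", "le", "les", "de", "et", "en", "a", "que", "qui",
             "l'", "d'", "qu'"}

# Assumed token pattern: an elided clitic (letters plus apostrophe),
# or a run of word characters, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+'|\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into tokens.

    Elided forms such as "l'emperere" are split into "l'" and "emperere".
    """
    text = text.replace("\u2019", "'")  # normalize typographic apostrophes
    return TOKEN_RE.findall(text.lower())


def remove_stopwords(tokens):
    """Drop punctuation tokens and tokens found in STOPWORDS."""
    return [t for t in tokens if t[0].isalnum() and t not in STOPWORDS]


if __name__ == "__main__":
    line = "Carles li reis, nostre emperere magnes"  # Chanson de Roland, line 1
    print(tokenize(line))
    print(remove_stopwords(tokenize(line)))
```

Splitting the elided article from its noun before stopword filtering is one possible design choice; it lets "l'" be treated as the same kind of function word as "li" or "la" rather than being fused to the following word.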


Natasha Voake


  • Patrick Burns
  • Marius Jøhndal