Lemmatization is a core preprocessing task in NLP that allows for meaningful comparison of the different forms of a given word in a corpus. For highly inflected languages like Latin and Greek, the task is particularly important for two reasons: (1) words often have a dozen or more possible forms (and, unlike "go" in English, this is the norm rather than a characteristic of irregularly formed words only), and (2) the generally small size of the available corpora often demands that counts for a given feature, such as words, be based on the broadest measure possible. Currently available Latin lemmatizers rely largely on dictionary-based lookup methods, which are a good solution for frequently occurring and unambiguous forms but handle less frequent or ambiguous forms less well. For my Google Summer of Code project, I propose to rewrite the CLTK Latin and Greek lemmatizers to handle a higher percentage of forms by applying a backoff tagging approach, in which a sequence of lemmatizers is tried in order and each passes the forms it cannot resolve on to the next.
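
To make the backoff idea concrete, here is a minimal sketch in plain Python. It is not the CLTK API and not the proposed implementation: the function names (`dictionary_lemmatizer`, `suffix_lemmatizer`, `identity_lemmatizer`, `backoff_lemmatize`), the three-entry lookup table, and the suffix rules are all hypothetical and chosen only for illustration. Each lemmatizer returns a lemma or `None`, and a token falls through the chain until one of them answers.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A lemmatizer maps a token to a lemma, or returns None if it cannot decide.
Lemmatizer = Callable[[str], Optional[str]]


def dictionary_lemmatizer(table: Dict[str, str]) -> Lemmatizer:
    """Look the token up in a form-to-lemma table (toy data, assumed)."""
    def lemmatize(token: str) -> Optional[str]:
        return table.get(token.lower())
    return lemmatize


def suffix_lemmatizer(rules: Sequence[Tuple[str, str]]) -> Lemmatizer:
    """Apply the first matching (suffix, replacement) rule, if any."""
    def lemmatize(token: str) -> Optional[str]:
        lower = token.lower()
        for suffix, replacement in rules:
            if lower.endswith(suffix) and len(lower) > len(suffix):
                return lower[: -len(suffix)] + replacement
        return None
    return lemmatize


def identity_lemmatizer(token: str) -> Optional[str]:
    """Last resort: return the lowercased token itself as its lemma."""
    return token.lower()


def backoff_lemmatize(
    tokens: Sequence[str], chain: Sequence[Lemmatizer]
) -> List[Tuple[str, Optional[str]]]:
    """Try each lemmatizer in order; back off to the next one on None."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break
        results.append((token, lemma))
    return results


# Illustrative chain: dictionary lookup first, then suffix rules, then identity.
chain = [
    dictionary_lemmatizer({"est": "sum", "virum": "vir", "arma": "arma"}),
    suffix_lemmatizer([("arum", "a"), ("orum", "us")]),
    identity_lemmatizer,
]

print(backoff_lemmatize(["est", "puellarum", "cano"], chain))
```

In this toy chain, "est" is resolved by the dictionary, "puellarum" falls through to the suffix rules, and "cano" reaches the identity fallback. Ordering the chain from most to least reliable is the design choice that lets a dictionary keep handling frequent forms while later stages pick up what it misses.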




  • James Tauber
  • Kyle P. Johnson