Classical Language Toolkit

NLP for the Ancient World

Technologies
python, javascript
Topics
web, natural language processing

We develop the Classical Language Toolkit (CLTK) because we believe it is revolutionizing the study of the ancient world. It does so by removing barriers to entry for those doing natural language processing (NLP) in Classical languages (namely, the languages of the surviving literature of the entirety of Eurasia and North Africa, from roughly 3000 B.C. to A.D. 1500).

Due to how academic disciplines have evolved over the past 200 years, our earliest civilizations are often studied in isolation from one another. This is tragic, for today we know that the ancient world – from Rome to Mesopotamia to India to China – consisted of deeply interconnected networks of ideas, technologies, art, and beliefs. As a framework for multidisciplinary research, the CLTK will help scholars discover the commonalities of what were once thought disparate cultures.

As software, the CLTK is a suite of NLP tools suited to the special needs of ancient languages. We have three goals. The most basic is to offer low-level libraries for doing NLP in particular Classical languages (e.g., Ancient Greek, Sanskrit). Developed with an extensible architecture, our code is easily hacked to support new languages. Second, the CLTK offers tools for students and scholars to do reproducible scientific research. For instance, it has version-controlled linguistic corpora and a suite of functions for stylometrics. Third, it is a framework for multidisciplinary language research. With pre-trained models (such as Word2Vec for vector space models), we provide easy-to-use tools to capture the transmission and evolution of knowledge, from the earliest human societies to the dawn of the modern era.

2018 Program

Successful Projects

Contributor
Eleftheria
Mentor
James Tauber, Todd Cook
Organization
Classical Language Toolkit
Extending NLP functionality for Germanic Languages
NLP is severely lacking in meaningful functionalities for Germanic languages. Normalization, POS tagging and stemming modules (all significant parts...
Contributor
Andrew Deloucas
Mentor
Willis Monroe, Tyler Kirby
Organization
Classical Language Toolkit
The Road to CDLI’s Corpora Integration into CLTK: an Undertaking
This project focuses on integrating Cuneiform Digital Library Initiative (CDLI) corpora into the Classical Language Toolkit (CLTK). Currently, CLTK...
Contributor
James Gawley
Mentor
Patrick J. Burns, Kyle Johnson
Organization
Classical Language Toolkit
Expanding the CLTK with Synonyms, Translations and Word Embeddings
The CLTK features the most sophisticated algorithm available for lemmatizing classical Latin. Lemmatization is the process by which inflected...