NLP for the ancient world

We have developed the Classical Language Toolkit (CLTK) because we believe it will revolutionize the study of the ancient world. It will do so by removing barriers to entry for those doing natural language processing (NLP) in Classical languages, by which we mean the languages of the surviving literature of all of Eurasia and North Africa, from roughly 3000 B.C. to A.D. 1500.

Due to how academic disciplines have evolved over the past 200 years, our earliest civilizations are often studied in isolation from one another. This is tragic, for today we know that the ancient world – from Rome to Mesopotamia to India to China – consisted of deeply interconnected networks of ideas, technologies, art, and beliefs. As a framework for multidisciplinary research, the CLTK will help scholars of tomorrow discover the shared origins of what were once thought disparate cultures.

As software, the CLTK is a suite of NLP tools suited to the special needs of ancient languages. We have three goals. The first and most basic is to offer low-level libraries for doing NLP in particular Classical languages (e.g., Ancient Greek, Sanskrit). Developed with an extensible architecture, our code is easily hacked to support new languages. Second, the CLTK offers tools for students and scholars to do reproducible scientific research. For instance, it has version-controlled linguistic corpora and a suite of functions for stylometrics. Third, it is a framework for multidisciplinary language research. With pre-trained models (such as Word2Vec for vector space models, Moses for machine translation, and LDA for topic modeling), we provide, or will soon provide, easy-to-use tools to capture the transmission and evolution of knowledge, from the earliest human societies to the dawn of the modern era.


Technologies

  • python
  • javascript
  • java

Topics

  • natural language processing
  • web
  • machine translation
  • machine learning
  • human language technologies

Classical Language Toolkit 2016 Projects

  • diyclassics
    CLTK Latin/Greek Backoff Lemmatizer
    Lemmatization is a core preprocessing task for NLP which allows for meaningful comparisons of different forms of a given word in a corpus. For highly...
  • suheb
    Enhancements in the CLTK webapp
    The CLTK webapp aims to provide a modern reading environment for documents present in the CLTK corpora. This project aims to enhance the reading...
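
The backoff idea behind the lemmatizer project can be illustrated with a minimal sketch: each lemmatizer in a chain either returns a lemma or passes the token on to the next one. This is not the CLTK API or the project's actual implementation; the lexicon, the suffix rules, and all function names below are hypothetical and exist only for illustration.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# A lemmatizer returns a lemma, or None when it cannot decide.
Lemmatizer = Callable[[str], Optional[str]]

# Hypothetical toy lexicon of Latin forms (assumption, not CLTK data).
TOY_LEXICON = {"est": "sum", "sunt": "sum", "virum": "vir"}

# Hypothetical suffix rules, tried in order (first-declension endings only).
TOY_SUFFIX_RULES = [
    (re.compile(r"arum$"), "a"),
    (re.compile(r"am$"), "a"),
]


def dictionary_lemmatizer(token: str) -> Optional[str]:
    """Look the token up in the lexicon; None means 'back off'."""
    return TOY_LEXICON.get(token)


def suffix_lemmatizer(token: str) -> Optional[str]:
    """Apply the first matching suffix rule; None means 'back off'."""
    for pattern, replacement in TOY_SUFFIX_RULES:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return None


def identity_lemmatizer(token: str) -> Optional[str]:
    """Last resort: return the token unchanged."""
    return token


def backoff_lemmatize(
    tokens: Sequence[str], chain: Sequence[Lemmatizer]
) -> List[Tuple[str, Optional[str]]]:
    """Try each lemmatizer in order and keep the first non-None answer."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break
        results.append((token, lemma))
    return results


CHAIN = [dictionary_lemmatizer, suffix_lemmatizer, identity_lemmatizer]
lemmas = backoff_lemmatize(["sunt", "puellam", "cano"], CHAIN)
```

Ordering the chain from most to least reliable means a form found in the lexicon is never overridden by a rough suffix rule, while unknown forms still receive some answer.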
