Contributor
James Gawley

Expanding the CLTK with Synonyms, Translations and Word Embeddings


Mentors
Patrick J. Burns, Kyle Johnson
Organization
Classical Language Toolkit

The CLTK features the most sophisticated algorithm available for lemmatizing classical Latin. Lemmatization is the process by which inflected word-forms are grouped together under their dictionary headings. This allows us to gather accurate word-usage statistics, analyze authorship, and model subject matter in classical corpora. However, the CLTK lemmatizer cannot currently identify synonyms for a given word or suggest translations into other languages.
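
To illustrate why grouping inflected forms matters for word-usage statistics, here is a minimal sketch in Python. The table `TOY_LEMMATA` and the function `lemma_counts` are hypothetical stand-ins written for this example; they are not the CLTK lemmatizer or any part of its API, and a real lemmatizer would resolve far more forms and handle ambiguity.

```python
from collections import Counter

# Hypothetical toy table mapping inflected Latin forms to dictionary headwords.
# A real lemmatizer would derive these mappings; this table is illustration only.
TOY_LEMMATA = {
    "amat": "amo",
    "amant": "amo",
    "amavit": "amo",
    "puellam": "puella",
    "puellae": "puella",
}


def lemma_counts(tokens):
    """Count tokens by headword; forms missing from the table count as themselves."""
    return Counter(TOY_LEMMATA.get(t.lower(), t.lower()) for t in tokens)


counts = lemma_counts(["Amat", "puellam", "amant", "puellae", "et"])
```

Without the grouping step, "amat" and "amant" would be counted as two unrelated words; with it, both contribute to the count for "amo".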

I propose to modify the existing CLTK lemmatizer to look up synonyms and translations for Latin and Greek. I will adapt CLTK’s unique ‘backoff’ approach to lemmatization in order to measure the probability of each possible synonym or translation for a target word given its context. Further, I propose to incorporate vector models for Latin and Greek based on word embeddings trained using the word2vec algorithm. Once synonyms, translations, and vector models are incorporated into CLTK, users will be able to perform cutting-edge tasks like sentence-level document alignment. This will open new horizons for digitally assisted classical scholarship.
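
One way to picture a backoff approach applied to synonym lookup is a chain of strategies tried in order, where each strategy either returns scored candidates or declines so that the next one can try. The sketch below is only an assumption about how such a chain might look, not the proposed implementation. All names (`TOY_SYNONYMS`, `TOY_VECTORS`, `by_context`, `by_dictionary`, `identity`, `synonyms`) and the two-dimensional vectors are hypothetical, and cosine similarity against the averaged context vector stands in for the context-dependent probability, which the proposal does not specify.

```python
import math

# Hypothetical toy data; real synonym lists and embeddings would come from
# lexical resources and trained models, not from hand-written tables.
TOY_SYNONYMS = {
    "gladius": ["ensis", "ferrum"],
    "amo": ["diligo"],
}
TOY_VECTORS = {
    "ensis": [0.9, 0.1],
    "ferrum": [0.2, 0.8],
    "miles": [1.0, 0.0],
    "pugna": [0.8, 0.2],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def by_context(lemma, context):
    """Rank candidates by similarity to the mean context vector, or decline."""
    candidates = [c for c in TOY_SYNONYMS.get(lemma, []) if c in TOY_VECTORS]
    ctx = [TOY_VECTORS[w] for w in context if w in TOY_VECTORS]
    if not candidates or not ctx:
        return None  # not enough information: back off to the next strategy
    mean = [sum(dim) / len(ctx) for dim in zip(*ctx)]
    scored = [(c, cosine(TOY_VECTORS[c], mean)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def by_dictionary(lemma, context):
    """Return dictionary synonyms with uniform weights, ignoring context."""
    candidates = TOY_SYNONYMS.get(lemma)
    if not candidates:
        return None
    return [(c, 1.0 / len(candidates)) for c in candidates]


def identity(lemma, context):
    """Last resort: the word is its own only candidate."""
    return [(lemma, 1.0)]


BACKOFF_CHAIN = [by_context, by_dictionary, identity]


def synonyms(lemma, context=()):
    """Try each strategy in order and return the first non-empty answer."""
    for lookup in BACKOFF_CHAIN:
        result = lookup(lemma, context)
        if result is not None:
            return result
    return []


ranked = synonyms("gladius", ["miles", "pugna"])
fallback = synonyms("amo")
unknown = synonyms("rex")
```

In this toy setup, a military context ranks "ensis" ahead of "ferrum" for "gladius", a word with no usable vectors falls back to the plain dictionary list, and an unknown word is returned unchanged. Note that the scores from different strategies are not on a common scale here; a real system would need to calibrate them.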