Meaningful Adversarial Examples for Natural Language Models

A project that creates adversarial examples, and the resulting counterfactuals, for text classifiers by exploiting relations in the word-embedding vector space. This makes it possible to apply meaningful alterations to input documents and probe a model for biases. What would happen if the subject of a document were female instead of male? A person of color instead of white? How would a state-of-the-art model change its predictions under such alterations?

This project aims to address these questions by building a framework both for testing models against such biases and for creating augmented datasets that discourage their development.
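As an illustrative sketch of the core idea (not the project's actual implementation), a counterfactual can be produced by shifting each token along a direction in embedding space, e.g. the `man`→`woman` direction, and snapping the result to the nearest vocabulary word. The tiny hand-made embedding table below is purely hypothetical; a real system would load pretrained vectors such as word2vec or GloVe.

```python
import numpy as np

# Toy embedding table for illustration only; real usage would load
# pretrained vectors (e.g. GloVe or word2vec) instead.
EMBEDDINGS = {
    "he":    np.array([1.0, 0.0, 0.2]),
    "she":   np.array([-1.0, 0.0, 0.2]),
    "him":   np.array([1.0, -0.5, 0.1]),
    "her":   np.array([-1.0, -0.5, 0.1]),
    "man":   np.array([1.0, 0.5, 0.0]),
    "woman": np.array([-1.0, 0.5, 0.0]),
}

def nearest(vec, vocab):
    """Return the vocabulary word whose vector is closest to `vec` by cosine similarity."""
    best, best_sim = None, -2.0
    for word, wvec in vocab.items():
        sim = vec @ wvec / (np.linalg.norm(vec) * np.linalg.norm(wvec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

def counterfactual(tokens, source="man", target="woman"):
    """Shift every in-vocabulary token along the source->target embedding
    direction and snap it back to the nearest vocabulary word; tokens
    outside the vocabulary pass through unchanged."""
    direction = EMBEDDINGS[target] - EMBEDDINGS[source]
    return [
        nearest(EMBEDDINGS[tok] + direction, EMBEDDINGS) if tok in EMBEDDINGS else tok
        for tok in tokens
    ]

print(counterfactual("the man said he saw him".split()))
# → ['the', 'woman', 'said', 'she', 'saw', 'her']
```

The gender-swapped document can then be fed back into the classifier to check whether its prediction changes.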


Panagiotis Lantavos


  • Madhumita Sushil
  • Markus Beuckelmann