Contributor
Rudresh Panchal

Multi Pronged Approach to Text Anonymization


Mentors
Tom De Smedt, Guy De Pauw
Organization
CLiPS, University of Antwerp

Text anonymization is the processing of text to strip it of identifying attributes, thereby hiding sensitive details and protecting the identity of users.

This project consists of two principal parts: entity/identifier recognition and the subsequent anonymization. First, sensitive chunks of text will be identified using several approaches, including Named Entity Recognition, regular-expression-based pattern matching, and TF-IDF-based detection of rare tokens. Once identified, the sensitive attributes will be suppressed, generalized, or deleted/replaced. Approaches to generalization include word-vector-based obfuscation and the use of part holonyms.
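To make the two-step flow concrete, here is a minimal sketch of the pattern-matching route followed by suppression. It is an illustration under stated assumptions, not the project's implementation: the two regular expressions, the labels `EMAIL` and `PHONE`, and the function names `find_sensitive_spans` and `suppress` are all hypothetical.

```python
import re

# Hypothetical patterns for illustration only; real identifiers need
# broader and locale-aware coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_sensitive_spans(text):
    """Step 1 (recognition): return (start, end, label) for every match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def suppress(text, spans):
    """Step 2 (anonymization): replace each span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:
            # Skip a match that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Contact jane.doe@example.com or call +32 470 12 34 56."
print(suppress(text, find_sensitive_spans(text)))
# Contact <EMAIL> or call <PHONE>.
```

Keeping recognition and anonymization as separate functions mirrors the two parts described above, so other recognizers (for example an NER model or a rare-token detector) could feed the same list of spans to the second step.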

This system will be built on top of a Django web app. It will include a dashboard where users can map attributes to the appropriate action and configure them. The aim is a seamless, end-to-end solution for a firm's or user's text anonymization needs.
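The attribute-to-action mapping that a user would configure could look roughly like the sketch below. Everything in it is an assumption made for illustration: the attribute labels, the action names, the `HOLONYMS` lookup table, and the `anonymize_token` helper are hypothetical, and the Django layer that would store and edit this configuration is omitted.

```python
# Hypothetical part-holonym lookup: a place mapped to the larger
# region that contains it. A real system would use a lexical resource.
HOLONYMS = {"Antwerp": "Belgium", "Mumbai": "India"}


def suppress(token, label):
    """Replace the token with a placeholder naming its attribute type."""
    return f"<{label}>"


def delete(token, label):
    """Remove the token entirely."""
    return ""


def generalize(token, label):
    """Swap the token for a broader term, falling back to suppression."""
    return HOLONYMS.get(token, f"<{label}>")


ACTIONS = {"suppress": suppress, "delete": delete, "generalize": generalize}

# Example of what a user might set from the dashboard: attribute -> action.
user_config = {
    "PERSON": "suppress",
    "LOCATION": "generalize",
    "PHONE": "delete",
}


def anonymize_token(token, label, config):
    """Apply the configured action; unconfigured attributes pass through."""
    action_name = config.get(label)
    if action_name is None:
        return token
    return ACTIONS[action_name](token, label)


print(anonymize_token("Antwerp", "LOCATION", user_config))
# Belgium
```

Storing action names rather than functions in the configuration keeps it serializable, which would make it straightforward to persist per user.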