Multi Pronged Approach to Text Anonymization
- Mentors
- Tom De Smedt, Guy De Pauw
- Organization
- CLiPS, University of Antwerp
Text anonymization is the processing of text to strip it of identifying attributes, thereby hiding sensitive details and protecting the identity of users.
This project consists of two principal parts: entity/identifier recognition and the subsequent anonymization. First, sensitive chunks of text will be identified using several approaches, including Named Entity Recognition, regular-expression-based pattern matching, and TF-IDF-based rare token detection. Once identified, the sensitive attributes will be suppressed, generalized, or deleted/replaced. Approaches to generalization include word-vector-based obfuscation and the use of part holonyms.
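The recognition step can be illustrated with a minimal sketch of the regular-expression approach only. The pattern set, the labels `EMAIL` and `PHONE`, and the function name `find_pattern_entities` are assumptions made for illustration, not part of the project's actual code; Named Entity Recognition and TF-IDF-based detection are not shown.

```python
import re

# Hypothetical patterns for illustration; real identifiers would need
# locale-specific and far more robust expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_pattern_entities(text):
    """Return (start, end, label) spans for every pattern match, ordered by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or call +32 3 265 41 11."
    for start, end, label in find_pattern_entities(sample):
        print(label, repr(sample[start:end]))
```

Returning character offsets rather than the matched strings keeps recognition separate from anonymization, so any later step can decide what to do with each span.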
The system will be built on top of a Django web app and will include a dashboard where users can map each attribute type to the appropriate action and configure it. The goal is an end-to-end solution for a firm's or user's text anonymization needs.
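The attribute-to-action mapping might look roughly like the sketch below. Everything here is hypothetical: the `ACTION_CONFIG` dictionary stands in for whatever the dashboard would store, the `HOLONYMS` table is a one-entry stand-in for a part-holonym lookup, and the spans are assumed to be non-overlapping `(start, end, label)` tuples produced by an earlier recognition step.

```python
# Hypothetical configuration: entity label -> action chosen by the user.
ACTION_CONFIG = {"EMAIL": "replace", "PHONE": "suppress", "CITY": "generalize"}

# Hypothetical generalization table standing in for a part-holonym lookup.
HOLONYMS = {"Antwerp": "Belgium"}


def apply_action(value, label, action):
    """Return the anonymized form of one sensitive value."""
    if action == "suppress":
        return "*" * len(value)          # mask every character
    if action == "replace":
        return f"<{label}>"              # substitute a placeholder tag
    if action == "generalize":
        return HOLONYMS.get(value, f"<{label}>")  # fall back to a tag
    if action == "delete":
        return ""
    raise ValueError(f"unknown action: {action}")


def anonymize(text, spans, config=ACTION_CONFIG):
    """Apply the configured action to each (start, end, label) span.

    Spans are processed right to left so earlier offsets stay valid;
    labels missing from the config default to suppression.
    """
    for start, end, label in sorted(spans, reverse=True):
        action = config.get(label, "suppress")
        text = text[:start] + apply_action(text[start:end], label, action) + text[end:]
    return text


if __name__ == "__main__":
    sample = "Mail ann@x.org from Antwerp"
    spans = [(5, 14, "EMAIL"), (20, 27, "CITY")]
    print(anonymize(sample, spans))
```

Defaulting unknown labels to suppression is a deliberately conservative choice for this sketch: an unconfigured attribute is masked rather than leaked.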