34 - Text Anonymizer

Removing sensitive information before a document is shared with others. Deidentification and obfuscation of persons and organizations rely on Named Entity Recognition.

Rob van Zoest

05 Oct 2020

Text anonymization is the task of removing sensitive information before a document is shared with others. Deidentification and obfuscation are often applied to strings that identify persons (name, social security number, email address, etc.) and organizations, and to other sensitive details in crime records and patient dossiers.

A simple solution is to run Named Entity Recognition and replace each mention it finds. For anonymization, you replace the mention with nothing (the black marker). For pseudonymization, you replace the mention with a unique tag, so that repeated mentions of the same entity can still be told apart from other entities.
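The difference between the two replacement strategies can be shown in a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular tool: `find_mentions` is a hypothetical stand-in for a real Named Entity Recognition model (here just a tiny hand-written gazetteer plus an email pattern), and the names `Mention`, `anonymize`, `pseudonymize`, the tag format `[LABEL_n]` and the example sentence are all invented for this example.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # entity type, e.g. PERSON or ORG


# Assumption: a toy gazetteer and regex stand in for a trained NER model.
GAZETTEER = {"PERSON": ["Jane Doe"], "ORG": ["Acme Corp"]}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_mentions(text: str) -> list[Mention]:
    """Return non-overlapping mentions, sorted by position in the text."""
    found = [Mention(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                found.append(Mention(m.start(), m.end(), label))
    # Earliest first; at the same start, prefer the longest mention.
    found.sort(key=lambda m: (m.start, -(m.end - m.start)))
    kept, last_end = [], 0
    for m in found:
        if m.start >= last_end:  # skip mentions that overlap a kept one
            kept.append(m)
            last_end = m.end
    return kept


def anonymize(text: str, mentions: list[Mention], mask: str = "█████") -> str:
    """Black out every mention with the same mask (the black marker)."""
    out, pos = [], 0
    for m in mentions:
        out += [text[pos:m.start], mask]
        pos = m.end
    out.append(text[pos:])
    return "".join(out)


def pseudonymize(text: str, mentions: list[Mention]) -> str:
    """Replace every mention with a tag that is unique per entity string."""
    tags: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}
    out, pos = [], 0
    for m in mentions:
        key = (m.label, text[m.start:m.end])
        if key not in tags:  # first time this entity is seen
            counters[m.label] = counters.get(m.label, 0) + 1
            tags[key] = f"[{m.label}_{counters[m.label]}]"
        out += [text[pos:m.start], tags[key]]
        pos = m.end
    out.append(text[pos:])
    return "".join(out)


text = "Jane Doe of Acme Corp wrote to jane.doe@acme.example. Jane Doe signed it."
mentions = find_mentions(text)
print(anonymize(text, mentions))
print(pseudonymize(text, mentions))
# [PERSON_1] of [ORG_1] wrote to [EMAIL_1]. [PERSON_1] signed it.
```

In the pseudonymized output both occurrences of the same name receive the same tag, so a reader can still follow who did what, while the anonymized output only shows that something was removed. In practice the quality of the result depends on the recognizer: any mention it misses stays in the text.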

Deidentification levels (source)

Text anonymization has use cases within governments and organizations that want to limit what they disclose and have to comply with legal frameworks such as the General Data Protection Regulation (GDPR).

This article is part of the project Periodic Table of NLP Tasks. Read more about the making of the Periodic Table and the project to systemize NLP tasks.
