39 - Deduplication

Finding texts that are exactly the same or highly similar. Similarity can be measured lexically, or semantically using embeddings.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

10 Oct 2020 • 1 min read

Deduplication is a relevant task if you have little control over how your input data is collected. For example, you can end up with a lot of duplicate documents when scraping internet articles or collecting tweets.

There are different methods for creating a unique document set. To find texts that are exactly the same, you can use hashing. To compare the similarity between texts, you can use fuzzy string matching or subsequence matching. These pairwise comparisons often perform poorly when you want to scale up, because the number of comparisons grows quadratically with the number of documents. A solution is to cluster the documents first. You then only compare documents within the same cluster instead of comparing all documents against each other.
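The methods above can be sketched with only the Python standard library. This is a minimal illustration, not a production pipeline: `exact_dedup`, `near_dedup`, and `block_key` are hypothetical helper names, the 0.9 threshold is an arbitrary assumption, and the blocking key (the first three characters) is a deliberately crude stand-in for a real clustering step.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher


def exact_dedup(texts):
    """Keep the first occurrence of each text, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def block_key(text):
    """Hypothetical blocking key: the first three characters, lowercased."""
    return text[:3].lower()


def near_dedup(texts, threshold=0.9):
    """Drop texts that are at least `threshold` similar to an earlier
    kept text in the same block (assumed threshold, tune per dataset)."""
    blocks = defaultdict(list)
    kept = []
    for text in texts:
        bucket = blocks[block_key(text)]
        # Only compare against texts in the same block, not the whole set.
        if any(SequenceMatcher(None, text, other).ratio() >= threshold
               for other in bucket):
            continue
        bucket.append(text)
        kept.append(text)
    return kept


docs = [
    "Breaking: storm hits the coast",
    "Breaking: storm hits the coast",
    "Breaking: storm hits the coast!",
    "Markets close higher on Friday",
]
unique_docs = near_dedup(exact_dedup(docs))
```

Hashing removes the verbatim copy cheaply, and the fuzzy comparison then only runs inside each block. The trade-off is that near-duplicates landing in different blocks are never compared, so the choice of blocking or clustering method matters.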

To compare semantic similarity between texts, you can use distributed or contextualized word representations. Each text is then represented by a vector, and the (cosine) distance between two vectors indicates how similar the texts are.
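As a rough sketch of the vector comparison, the function below computes cosine similarity in plain Python. The three vectors are made-up placeholders, not the output of any real embedding model; in practice they would come from whichever word or sentence representation you choose.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real text embeddings.
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.2, 0.3]
vec_c = [-0.1, 0.9, -0.4]

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1: similar direction
sim_ac = cosine_similarity(vec_a, vec_c)  # below 0: dissimilar direction
```

Cosine distance is simply `1 - cosine_similarity(a, b)`, so a pair of texts could be treated as duplicates when the similarity exceeds a threshold you pick for your data.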

Deduplication can benefit a lot from cleaning your text first, for example by lowercasing all text or replacing URLs.
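A minimal sketch of such a cleaning step is shown below. The `normalize` function, the `<url>` placeholder token, and the simple URL pattern are all assumptions made for illustration; a real pipeline may need a more careful URL matcher.

```python
import re

# Simplified pattern: matches http(s) URLs up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def normalize(text, url_token="<url>"):
    """Lowercase, replace URLs with a placeholder, collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(url_token, text)
    return " ".join(text.split())


first = normalize("Read MORE at https://example.com/a?x=1")
second = normalize("read more at   https://example.com/b")
```

After normalization both example strings become identical, so even exact matching by hash would treat them as duplicates.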

This article is part of the Periodic Table of NLP Tasks, a project to systematize NLP tasks.
