Documents

39 - Deduplication

Deduplication is the task of finding texts that are exactly the same or highly similar. Similarity can be measured lexically, on the surface form of the text, or semantically, using embeddings.

Deduplication is a relevant task if you have little control over how your input data is collected. For example, scraping internet articles or collecting tweets often yields many duplicate documents.

There are different methods for creating a unique document set. To find texts that are exactly the same, you can use hashing. To compare similarity between texts, you can use fuzzy string matching or subsequence matching. These pairwise comparisons often perform poorly at scale, because the number of comparisons grows quadratically with the size of the document set. A solution is to cluster the documents first, so that you only compare documents within the same cluster instead of comparing every document against every other.
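The approaches above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation: the function names, the 0.9 threshold, and the `block_key` parameter are assumptions made for the example. In particular, `block_key` is a crude stand-in for a real clustering step: it only groups texts that share a cheap key, and duplicates that land in different groups are missed.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations


def drop_exact_duplicates(texts):
    """Keep the first occurrence of each text, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def near_duplicate_pairs(texts, threshold=0.9):
    """Return index pairs whose SequenceMatcher ratio is at least `threshold`.

    Every pair is compared, so the work grows quadratically with len(texts).
    """
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
            pairs.append((i, j))
    return pairs


def near_duplicate_pairs_blocked(texts, block_key, threshold=0.9):
    """Same comparison, but only between texts that share a block key.

    `block_key` is a hypothetical helper supplied by the caller; it plays
    the role of the cluster assignment described in the text.
    """
    blocks = {}
    for index, text in enumerate(texts):
        blocks.setdefault(block_key(text), []).append(index)
    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs
```

For example, calling `near_duplicate_pairs_blocked(docs, block_key=lambda t: t[:3].lower())` only compares texts that start with the same three characters. A real pipeline would more likely derive the groups from a clustering algorithm or a locality-sensitive hash, which is outside the scope of this sketch.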

To compare texts by semantic similarity, you can use distributed or contextualized word representations. Each text is then represented by a vector, and the (cosine) distance between two vectors indicates how similar the texts are.
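As a rough sketch of the vector comparison, the snippet below computes cosine similarity in plain Python and flags pairs above a threshold. It assumes the text vectors already exist; how they are produced (which embedding model, how word vectors are pooled into a text vector) is left open here. The function names and the 0.95 threshold are illustrative choices, not values from the article.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_duplicate_pairs(vectors, threshold=0.95):
    """Return index pairs whose cosine similarity is at least `threshold`.

    `vectors` is assumed to hold one precomputed embedding per text.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine_similarity(vectors[i], vectors[j]) >= threshold
    ]
```

Cosine distance is simply one minus this similarity, so a high similarity corresponds to a small distance. Like the lexical comparison, this loop is quadratic in the number of texts, so for larger collections it is usually combined with clustering or an approximate nearest-neighbour index.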

Deduplication can benefit a lot from cleaning your text first, for example by lowercasing all text or replacing URLs.
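A cleaning step along these lines might look as follows. This is a sketch under assumptions: the URL pattern is deliberately simple and only catches `http://` and `https://` links, the `<url>` placeholder is an arbitrary choice, and collapsing whitespace is an extra step added for illustration that the text above does not mention.

```python
import re

# Simple pattern: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def normalize(text, url_token="<url>"):
    """Lowercase the text, replace URLs with a placeholder, tidy whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(url_token, text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()
```

After normalization, two tweets that differ only in capitalization or in a shortened link map to the same string, so even exact matching by hash can catch them.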



This article is part of the Periodic Table of NLP Tasks, a project to systemize NLP tasks.