60 - Document Similarity

Estimating the degree of similarity between the semantic representation of two documents.

Document similarity can be estimated with a range of feature extraction techniques, each of which turns a document into a representation that can be compared with others. Some examples:

  • The statistical techniques BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency), which are, respectively, the current default and the former default similarity algorithms in Elasticsearch and Lucene.
  • Latent Semantic Analysis (LSA, also called Latent Semantic Indexing, LSI) for vectorization of documents. It is often assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a truncated singular value decomposition (SVD) to the term-document matrix, a procedure closely related to principal component analysis, and keeps only the directions in the vector space that account for most of the variation in the data.
  • Latent Dirichlet Allocation (LDA), a probabilistic topic model that represents each document as a mixture of topics and each topic as a distribution over words.
  • Doc2Vec (aka paragraph2vec, aka sentence embeddings), a neural network method that extends the word2vec algorithm to the unsupervised learning of continuous representations for larger blocks of text.
  • USE (Universal Sentence Encoder), which encodes text into high-dimensional vectors. Pretrained models are available for English, along with a multilingual model.
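
As a rough illustration of the first technique in the list, the sketch below builds TF-IDF vectors and compares documents by cosine similarity, using only the Python standard library. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three sample sentences are all assumptions made for this example. It is not the scoring implemented in Elasticsearch or Lucene.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch): terms that occur in
    # every document still keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: freq * idf[t] for t, freq in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
print(f"related: {sim_related:.3f}  unrelated: {sim_unrelated:.3f}")
```

The two sentences about the cat share several terms and therefore score above zero, while the third sentence shares no terms with the first and scores exactly zero. That second result also shows the limitation of purely lexical methods: two documents on the same subject with no words in common would score zero too, which is the gap that LSA, LDA, Doc2Vec and USE try to close.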

This article is part of the project Periodic Table of NLP Tasks.