Phrases and Entities

24 - N-grams

Detecting N-grams yields common multi-word expressions with a high probability of occurrence, such as the Bi-gram ‘red wine’.

An N-gram is a sequence of N words with a high probability of occurrence. It is also called a collocation, a multi-word expression or a common phrase. A Bi-gram example is ‘red wine’ and a Tri-gram example is ‘summer of 69’. The probability is the conditional probability of a word given the previous word: (the number of times the previous word occurs directly before this word) / (the total number of times the previous word occurs in the corpus).
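The conditional probability above can be computed directly from word counts. A minimal sketch in plain Python, using a made-up toy corpus for illustration:

```python
from collections import Counter

# Toy corpus (illustrative only)
tokens = "i like red wine and red grapes but i prefer red wine".split()

# Count single words and adjacent word pairs
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P('wine' | 'red') = count('red wine') / count('red')
p = bigram_counts[("red", "wine")] / unigram_counts["red"]
print(p)  # 'red wine' occurs 2 times, 'red' occurs 3 times -> 2/3
```

A high value like this flags ‘red wine’ as a likely collocation: when ‘red’ appears, ‘wine’ usually follows.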

You can detect N-grams with Gensim’s Phrases model or with the CountVectorizer from scikit-learn.

N-grams are sometimes used for next-word prediction. This is a simple solution, but costly in memory and computation. Character N-grams are a cheaper variant, but LSTMs generally give better results.
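The next-word-prediction idea can be sketched with the same Bi-gram counts: given a word, predict the word that most often followed it in the corpus. Again a toy corpus, for illustration only:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# For each word, count which words follow it and how often
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' follows 'the' most often
```

The cost problem is visible here: a real model must store counts for every observed word pair (or triple, for Tri-grams), which grows quickly with vocabulary size.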



This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.