Phrases and Entities

24 - N-grams

Detecting N-grams yields common multi-word expressions with a high probability of occurrence, such as the Bi-gram ‘red wine’.

An N-gram is a sequence of N words with a high probability of occurrence. It is also called a collocation, a multi-word expression or a common phrase. A Bi-gram example is ‘red wine’ and a Tri-gram example is ‘summer of 69’. The probability is the conditional probability of a word given the previous word: (the number of times the previous word occurs directly before this word) / (the total number of times the previous word occurs in the corpus).
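The conditional probability above can be computed directly from word counts. A minimal sketch in plain Python, using a made-up toy corpus for illustration:

```python
from collections import Counter

# Toy corpus (illustrative only)
tokens = "i like red wine and red grapes but i prefer red wine".split()

# Count single words and adjacent word pairs
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P('wine' | 'red') = count('red wine') / count('red')
p = bigram_counts[("red", "wine")] / unigram_counts["red"]
print(p)  # 'red wine' occurs 2 times, 'red' occurs 3 times -> 2/3
```

A high value like this flags ‘red wine’ as a likely collocation: when ‘red’ appears, ‘wine’ usually follows.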

You can detect N-grams with Gensim’s Phrases model or with the CountVectorizer from scikit-learn.

N-grams are sometimes used for next-word prediction. This is a simple solution, but costly in memory and computation. Character N-grams are a cheaper variant, but LSTMs generally give better results.
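The next-word-prediction idea can be sketched with the same Bi-gram counts: given a word, predict the word that most often followed it in the corpus. Again a toy corpus, for illustration only:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# For each word, count which words follow it and how often
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' follows 'the' most often
```

The cost problem is visible here: a real model must store counts for every observed word pair (or triple, for Tri-grams), which grows quickly with vocabulary size.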



This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.