An N-gram is a sequence of N words that occur together with a high probability. Such a sequence can also be called a collocation, a multi-word expression or a common phrase. An example of a bi-gram is ‘red wine’, and of a tri-gram ‘summer of 69’. The probability of a word given the previous word is calculated as: (the number of times the previous word is followed by this word) / (the total number of times the previous word occurs in the corpus).
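A minimal sketch of this calculation in Python, using a tiny made-up corpus (the corpus text and function names are illustrative, not from the article):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text collection
corpus = "i drank red wine and she drank red wine too".split()

# Count single words and adjacent word pairs (bi-grams)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("red", "wine"))  # 'red' is always followed by 'wine' -> 1.0
print(bigram_probability("wine", "and"))  # 'wine' occurs twice, once before 'and' -> 0.5
```

A pair like ‘red wine’ scores high because ‘wine’ almost always follows ‘red’ in the corpus, which is exactly what makes it a collocation.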
N-grams are sometimes used for next-word prediction. This is a simple solution, but costly in terms of memory and computation. A cheaper variant uses character N-grams, but LSTMs generally give better results.
This article is part of the project Periodic Table of NLP Tasks. Read more about the making of the Periodic Table and the project to systemize NLP tasks.