Sentences and Paragraphs

35 - Sentencizer

Finding the words that together form a sentence, or from another viewpoint, detecting sentence boundaries.

Once you tokenized your textual data, a sentencizer should find the words that together form a sentence.

Starting with a titlecased word, followed by lowercase words, until there is a dot. That might be the simplest (erroneous) version of rulebased sentence boundary detection (SBD) logic.

More sophisticated SBD is, for example, done by the spaCy library. The sentence segmentation is performed by the Dependency Parser, which predicts the sentence boundary by the dependency tags.

In NLTK you can train (unsupervised) a sentence-tokenizer on your own training data. It builds a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.



This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.