35 - Sentencizer

Finding the words that together form a sentence, or from another viewpoint, detecting sentence boundaries.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

06 Oct 2020

Once you have tokenized your textual data, a sentencizer should find the words that together form a sentence.

Start with a titlecased word, continue through the lowercase words that follow, and stop at the first dot. That might be the simplest (and error-prone) version of rule-based sentence boundary detection (SBD) logic.
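To see why that rule is erroneous, here is a minimal sketch in Python. It simplifies the rule a little: a match starts at any capital letter and runs to the next dot. The function name naive_sentencize and the example texts are made up for this illustration.

```python
import re

# Naive rule: a sentence starts at a capital letter and runs up to the
# next dot. Text that does not fit the pattern is silently skipped.
NAIVE_SENTENCE = re.compile(r"[A-Z][^.]*\.")


def naive_sentencize(text):
    """Return every span that matches the naive 'capital ... dot' rule."""
    return [match.group(0) for match in NAIVE_SENTENCE.finditer(text)]


# Works on simple text.
print(naive_sentencize("The cat sat on the mat. It purred."))
# ['The cat sat on the mat.', 'It purred.']

# Breaks on abbreviations and decimal numbers.
print(naive_sentencize("Dr. Smith paid 3.50 euros. He left."))
# ['Dr.', 'Smith paid 3.', 'He left.']
```

In the second example the abbreviation "Dr." is cut off as its own sentence, the decimal point ends a sentence early, and "50 euros." disappears completely because it does not start with a capital letter.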

More sophisticated SBD is done by, for example, the spaCy library. Its sentence segmentation is performed by the dependency parser: sentence boundaries are derived from the predicted dependency parse.

In NLTK you can train a sentence tokenizer on your own data without labels (unsupervised). It builds a model of abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries.
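The idea of learning abbreviations from unlabeled text can be sketched with a toy heuristic in plain Python. This is not NLTK's algorithm and does not use the NLTK API. It only assumes that a short word which always appears with a trailing dot is probably an abbreviation. The names learn_abbreviations and split_sentences, the thresholds, and the example corpus are all made up for this illustration.

```python
from collections import Counter


def learn_abbreviations(corpus, min_count=2, max_len=4):
    """Toy heuristic: short words that only ever occur with a trailing dot."""
    with_dot = Counter()
    without_dot = Counter()
    for token in corpus.split():
        word = token.lower()
        if word.endswith("."):
            with_dot[word[:-1]] += 1
        else:
            without_dot[word] += 1
    return {
        word
        for word, count in with_dot.items()
        if count >= min_count and without_dot[word] == 0 and len(word) <= max_len
    }


def split_sentences(text, abbreviations):
    """End a sentence at every dot that does not follow a learned abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token[:-1].lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final dot
        sentences.append(" ".join(current))
    return sentences


# Unlabeled "training" text: no sentence boundaries are marked.
corpus = (
    "Dr. Jansen saw the patient. The patient thanked Dr. Jansen. "
    "Prof. Smit met Dr. Jansen at noon. At noon Prof. Smit left."
)
abbreviations = learn_abbreviations(corpus)
print(sorted(abbreviations))
# ['dr', 'prof']
print(split_sentences("Dr. Jansen left. Prof. Smit stayed.", abbreviations))
# ['Dr. Jansen left.', 'Prof. Smit stayed.']
```

Words such as "patient" and "noon" also occur without a dot in the corpus, so they are not mistaken for abbreviations. A real trained tokenizer relies on much richer statistics than this toy version.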

This article is part of the project Periodic Table of NLP Tasks, a project to systematize NLP tasks.
