Raw Text Cleaning consists of several pre-processing steps that aim to improve the quality of subsequent NLP tasks.
- De-hyphenate words that were split across a line break.
- Replace redundant whitespace like multiple line-ends, tabs and spaces.
- Handle HTML tags like <b>, <div>, <span>, <br/>
- Smart Decapitalization for SCREAMING words
- Replace rare special characters with a common equivalent, for example a standard hyphen (-) for the figure dash (‒), en dash (–), em dash (—), horizontal bar (―) or swung dash (⁓), which should not be confused with the tilde (~).
- Replace accented letters (áççêñtèd) with their plain form.
- Remove footnotes, page numbers, headers and references.
- Replace number words with their numeric values: ‘hundred thousand’ to 100,000
- Replace emojis with their description, like 😶 to ‘no mouth’
- Remove URLs and email addresses
Etcetera, etcetera! Many of the other NLP tasks described here can also be used for text cleaning.
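Several of the steps listed above can be sketched with nothing but the Python standard library. The following is a minimal illustration, not a reference implementation: the function names, the regular expressions and the "four or more capitals" rule for SCREAMING words are all assumptions made for this example. Footnote removal, number words and emoji descriptions are left out, since they usually need document-specific rules or dedicated libraries.

```python
import html
import re
import unicodedata

# Rare dash characters mapped to a plain hyphen; the tilde (~) is left untouched.
# Figure dash, en dash, em dash, horizontal bar, swung dash.
DASH_TABLE = str.maketrans({ch: "-" for ch in "\u2012\u2013\u2014\u2015\u2053"})


def dehyphenate(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


def strip_html(text):
    """Replace anything that looks like a tag by a space, then decode entities."""
    return html.unescape(re.sub(r"</?[A-Za-z][^>]*>", " ", text))


def remove_urls_and_emails(text):
    """Drop email addresses first, then http(s):// and www. style URLs."""
    text = re.sub(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", " ", text)
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def decapitalize(text):
    """Lowercase all-caps words of 4+ letters; shorter ones may be acronyms (assumed heuristic)."""
    return re.sub(r"\b[A-Z]{4,}\b", lambda m: m.group(0).lower(), text)


def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and line breaks into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def clean_text(text):
    """Apply the steps in an order where line breaks are still present for de-hyphenation."""
    text = dehyphenate(text)
    text = strip_html(text)
    text = remove_urls_and_emails(text)
    text = text.translate(DASH_TABLE)
    text = strip_accents(text)
    text = decapitalize(text)
    return normalize_whitespace(text)


if __name__ == "__main__":
    sample = "Text clean-\ning is <b>VERY</b>   useful \u2013 see https://example.com"
    print(clean_text(sample))
```

The order matters: de-hyphenation has to run before whitespace is collapsed, because it relies on the line break to recognise a split word. Note that this simple rule also merges genuine compounds that happen to break at a hyphen (for example "well-" / "known"), so a dictionary check may be needed in practice.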
This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.