Word Processing

21 - Normalization

Beyond stemming or lemmatizing, you may still need to edit words so that variants map to a single, more standard form.

For example, you can convert number words into digits, handle emoji, expand contractions (I’m → I am), collapse repeated letters (Yeaaaaaahhhh → Yeah), or neutralize gendered words to reduce gender bias in your model (mapping he, his, she, her, etc. to a default form).
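Two of these steps, expanding contractions and collapsing repeated letters, can be sketched with regular expressions. This is a minimal illustration, not a complete normalizer: the `CONTRACTIONS` table and the function names `expand_contractions` and `squeeze_repetitions` are made up for this example, and a real contraction list would be much longer.

```python
import re

# Tiny illustrative contraction table (an assumption; real lists are far longer).
CONTRACTIONS = {
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions; expansions come back in lowercase."""
    # Treat the curly apostrophe like the straight one before the lookup.
    text = text.replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def squeeze_repetitions(text: str, max_run: int = 2) -> str:
    """Shorten any run of one character longer than max_run to max_run."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)
```

With the default `max_run=2`, `squeeze_repetitions("Yeaaaaaahhhh")` gives `"Yeaahh"`, which keeps legitimate double letters such as the "oo" in "good" intact. Setting `max_run=1` yields `"Yeah"` but would also turn "good" into "god", so getting all the way to the dictionary form usually needs a vocabulary lookup on top.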

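You can also normalize spèçíâl characters with Python's unicodedata module. Decomposing the text and dropping the combining marks removes accents. Curly quotes can likewise be converted to their ASCII equivalents, although this needs an explicit replacement step, because Unicode normalization leaves them unchanged. The result is simpler text, at the cost of losing the directionality of the quotes.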
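A minimal sketch of this idea, using only the standard library. The helper names `strip_accents` and `to_plain` and the `QUOTES` table are chosen for this example, and the table covers only the four most common curly quotes.

```python
import unicodedata

# Explicit punctuation map: Unicode normalization does not touch curly quotes.
QUOTES = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_plain(text: str) -> str:
    """Remove accents and replace curly quotes with straight ASCII quotes."""
    return strip_accents(text).translate(QUOTES)


print(strip_accents("spèçíâl"))        # special
print(to_plain("\u201cI\u2019m\u201d"))  # "I'm"
```

Letters that are not built from a base letter plus an accent (for example ø or ß) pass through `strip_accents` unchanged, so a fully ASCII result may need further mapping.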

Word Normalization example (source)



This article is part of the project Periodic Table of NLP Tasks, which describes the making of the Periodic Table and the effort to systematize NLP tasks.