Text is made of characters, but files are made of bytes. Those bytes represent characters according to some encoding (often loosely called a character set). Any text-processing work therefore starts with loading the textual data using the right encoding. Joel Spolsky, co-founder and former CEO of Stack Overflow, wrote an interesting overview of the topic: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
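To make the byte/character distinction concrete, here is a minimal Python sketch. It shows the same string stored under two encodings, what happens when bytes are decoded with the wrong one, and a simple fallback loader. The helper name `decode_with_fallback` and the candidate list `("utf-8", "latin-1")` are illustrative assumptions, not part of any library; real data may need a different candidate list or a dedicated detection tool.

```python
# The same text, stored as bytes under two different encodings.
text = "caf\u00e9"                       # "café"
utf8_bytes = text.encode("utf-8")        # é is stored as two bytes: C3 A9
latin1_bytes = text.encode("latin-1")    # é is stored as one byte:  E9

# Decoding with the wrong encoding can silently produce garbage (mojibake):
# each of the two UTF-8 bytes is read as a separate Latin-1 character.
mojibake = utf8_bytes.decode("latin-1")


def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Hypothetical helper: return the first decode that raises no error.

    Assumption: the candidates are ordered from strictest to most permissive.
    Latin-1 accepts every byte sequence, so it only makes sense as the last one.
    """
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Reached only if every candidate failed: keep the text readable by
    # replacing undecodable bytes with U+FFFD.
    return data.decode(encodings[0], errors="replace")


print(len(text), len(utf8_bytes), len(latin1_bytes))
print(decode_with_fallback(utf8_bytes) == decode_with_fallback(latin1_bytes))
```

Trying UTF-8 first is a deliberate choice in this sketch: it is strict enough to reject most non-UTF-8 input, whereas a permissive encoding such as Latin-1 never raises an error and would hide a wrong guess.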
This article is part of the Periodic Table of NLP Tasks project. Read more about the making of the Periodic Table and the project to systematize NLP tasks.