
Python Hints and Tips

Text Preprocessing

In this paper, we cover the basic steps of text preprocessing. These steps convert text from human language into a machine-readable format for further processing. We also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:

  • converting all letters to lower or upper case
  • converting numbers into words or removing numbers
  • removing punctuation, accent marks and other diacritics
  • removing extra whitespace
  • expanding abbreviations
  • removing stop words, sparse terms, and particular words
  • text canonicalization
We will describe text normalization steps in detail below.
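
As a rough preview, several of the steps listed above can be chained with the Python standard library alone. This is a minimal sketch, not a definitive pipeline: the function name `normalize` and the tiny `STOP_WORDS` set are hypothetical, chosen only for illustration, and number-to-word conversion, abbreviation expansion and canonicalization are left out.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def normalize(text, stop_words=STOP_WORDS):
    """Apply a few basic normalization steps to a string."""
    # Convert all letters to lower case.
    text = text.lower()
    # Remove accent marks: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove numbers (the alternative would be converting them to words).
    text = re.sub(r"\d+", "", text)
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting on whitespace also discards leading, trailing and repeated spaces.
    tokens = text.split()
    # Remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)


print(normalize("  The Café sold 3 crêpes, and 2 teas!  "))
# prints: cafe sold crepes teas
```

The order of the steps matters in this sketch: lowercasing comes first so that the stop-word lookup matches, and whitespace is cleaned up last because removing numbers and punctuation can leave extra spaces behind.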