Cleaning Text Data

Step 0. Download data

Note: In this tutorial, we will be using Kafka's "The Metamorphosis", which tells the story of the salesman Gregor Samsa, who wakes one morning to find himself inexplicably transformed into a huge insect.

Link: http://www.gutenberg.org/cache/epub/5200/pg5200.txt

Open the file, delete the header and footer information, and save the result as "metamorphosis_clean.txt".
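The tutorial removes the header and footer by hand. If you prefer to script it, the following is a minimal sketch. It assumes the Project Gutenberg file marks the book body with lines starting with '*** START OF' and '*** END OF'; the function name and the sample string are made up for illustration.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the Gutenberg start and end marker lines.

    Assumption: marker lines begin with '*** START OF' and '*** END OF'.
    If either marker is missing, the text is returned unchanged.
    """
    lines = raw_text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith('*** START OF') and start is None:
            start = i
        elif line.startswith('*** END OF'):
            end = i
            break
    if start is None or end is None or end <= start:
        return raw_text
    # Keep only the lines strictly between the two markers
    return '\n'.join(lines[start + 1:end]).strip()


# Tiny stand-in for the downloaded file
sample = (
    "Header line\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n"
    "\n"
    "Book text goes here.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n"
    "License text"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Credit lines that follow the start marker may survive this step, so it is still worth checking the result by eye before saving it.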

Step 1. Sneak peek into the data

  • Look at the structure, paragraphs and punctuation of the data

  • Determine how much of this data is useful for us

  • Know your objective. For instance, if we are trying to develop a Kafka language model, we may want to keep the punctuation, quotes and letter case
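A quick way to take this first look is to print a few summary numbers and a short preview. The helper below is a hypothetical sketch (the name peek and the sample string are not from the tutorial); it treats blank lines as paragraph breaks.

```python
def peek(text, n_chars=200):
    """Summarize raw text: size, line and paragraph counts, and a short preview."""
    # Paragraphs are assumed to be separated by a blank line
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    return {
        'characters': len(text),
        'lines': len(text.splitlines()),
        'paragraphs': len(paragraphs),
        'preview': text[:n_chars],
    }


sample = (
    "Gregor woke up.\n"
    "He looked at the window.\n"
    "\n"
    "\"What has happened?\" he thought."
)
summary = peek(sample)
print(summary)
```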

Step 2. White space/Punctuation/Normalize Case

< Manual cleaning >

2.1 Load data

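# 1. Load data
filename = 'metamorphosis_clean.txt'
# The with-block closes the file automatically, even if reading fails.
# encoding='utf-8' assumes the cleaned file was saved as UTF-8.
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()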

2.2 Tokenization

Tokenization: Split paragraphs -> sentences -> words. Text is most commonly split on white space.

  • Split by white space

  • Split by white space and remove punctuation

2.2.a Split by white space

Warning: Punctuation stays attached to the words. E.g.: 'room,'
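A minimal sketch of this step, using a short made-up sentence in place of the full text loaded in 2.1:

```python
# Stand-in sentence; in the tutorial, `text` holds the whole book
sample = "Gregor looked around the room, then at the window."

# str.split() with no arguments splits on any run of white space
words = sample.split()
print(words)
```

Note that 'room,' and 'window.' keep their punctuation, which is the issue the warning describes.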

2.2.b Split by white space and remove punctuation

  • Create a mapping table

  • Apply mapping table over a list to translate/strip punctuation
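The two steps above can be sketched with str.maketrans and str.translate from the standard library. The sample sentence is made up for illustration.

```python
import string

sample = "Gregor looked around the room, then at the window."
words = sample.split()

# Mapping table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

# Apply the table to each word to strip its punctuation
stripped = [w.translate(table) for w in words]
print(stripped)
```

Because string.punctuation includes the apostrophe, a word such as "What's" would become "Whats" with this approach.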

2.3 Capitalization

  • Use .lower() method to reduce every word to lowercase

  • Warning: Lowercasing removes distinctions such as 'US' (the country) vs. 'us' (the pronoun)
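A short sketch of case normalization on an illustrative word list, which also shows the 'US' / 'us' collision:

```python
words = ['Gregor', 'moved', 'to', 'the', 'US', 'with', 'us']

# Convert every word to lowercase
lowered = [w.lower() for w in words]
print(lowered)
```

After this step 'US' and 'us' are the same token, so skip it if that distinction matters for your objective.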

< Using Natural Language Toolkit, or NLTK >

2.1 Install NLTK
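NLTK is usually installed with pip (pip install nltk), and its tokenizers need an extra data download. The sketch below checks for the package and wraps the download in a function. The helper names are hypothetical, and which resource name is needed ('punkt' or 'punkt_tab') depends on the NLTK version, so requesting both is an assumption.

```python
import importlib.util


def nltk_is_installed():
    """Return True if the nltk package can be imported in this environment."""
    return importlib.util.find_spec('nltk') is not None


def download_tokenizer_data():
    """Fetch the tokenizer models used by word_tokenize and sent_tokenize.

    Needs network access. Older NLTK releases look for 'punkt', newer ones
    for 'punkt_tab', so both are requested here (an assumption).
    """
    import nltk  # imported lazily so this file loads without NLTK
    for resource in ('punkt', 'punkt_tab'):
        nltk.download(resource)


if not nltk_is_installed():
    print("NLTK is not installed; try: pip install nltk")
```

Call download_tokenizer_data() once after installing; the data is cached locally for later runs.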

2.2 Tokenization

  • Split by word: word_tokenize(text)

  • Split by sentence: sent_tokenize(text)

  • Filter out punctuation: isalpha()

2.2.a Split by word
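A hedged sketch of word tokenization with NLTK. It assumes NLTK and its tokenizer data are installed; the wrapper name nltk_words and the sample sentence are made up, and the try/except only keeps the example from crashing when the setup is incomplete.

```python
def nltk_words(text):
    """Split text into word and punctuation tokens with NLTK."""
    from nltk.tokenize import word_tokenize  # needs NLTK plus its tokenizer data
    return word_tokenize(text)


sample = "Gregor looked around the room, then at the window."
try:
    tokens = nltk_words(sample)
    print(tokens)
except (ImportError, LookupError) as err:
    # ImportError: NLTK is missing; LookupError: tokenizer data not downloaded
    tokens = None
    print("NLTK setup incomplete:", err)
```

Unlike a plain white-space split, word_tokenize returns punctuation as separate tokens, so 'room,' becomes 'room' and ','.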

2.2.b Split by sentence
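Sentence splitting follows the same pattern. This sketch again assumes NLTK and its tokenizer data are available; nltk_sentences and the sample text are illustrative names, not part of the tutorial.

```python
def nltk_sentences(text):
    """Split text into sentences with NLTK."""
    from nltk.tokenize import sent_tokenize  # needs NLTK plus its tokenizer data
    return sent_tokenize(text)


sample = "Gregor woke up. He looked at the window."
try:
    sentences = nltk_sentences(sample)
    print(sentences)
except (ImportError, LookupError) as err:
    # ImportError: NLTK is missing; LookupError: tokenizer data not downloaded
    sentences = None
    print("NLTK setup incomplete:", err)
```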

2.2.c Filter out punctuation
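Once the text is tokenized, str.isalpha() keeps only tokens made up entirely of letters. The token list below is written out by hand to mimic tokenizer output, so the sketch runs without NLTK.

```python
# Hand-written token list standing in for the output of word_tokenize
tokens = ['Gregor', 'looked', 'around', 'the', 'room', ',', 'then', 'at', 'the', 'window', '.']

# Keep only purely alphabetic tokens
words = [t for t in tokens if t.isalpha()]
print(words)
```

This filter also drops tokens that contain digits, hyphens or apostrophes (for example "n't"), which may or may not be what you want.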

Step 3. Stopwords + Stemming

3.1 Stopwords

Stopwords: Words that carry little meaning on their own and are commonly removed during NLP preprocessing. E.g.: the, and, me, myself
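Filtering stop words is a simple membership test. To keep this sketch free of downloads it uses a small hand-picked stop list, which is an assumption for illustration; NLTK provides a much longer English list through stopwords.words('english') once its 'stopwords' corpus has been downloaded.

```python
# Stand-in stop list for illustration only
STOP_WORDS = {'the', 'and', 'me', 'myself', 'at', 'then'}

# Lowercased, punctuation-free words, as produced by the earlier steps
words = ['gregor', 'looked', 'around', 'the', 'room', 'then', 'at', 'the', 'window']

# Keep only the words that are not in the stop list
content_words = [w for w in words if w not in STOP_WORDS]
print(content_words)
```

Using a set for the stop list keeps each lookup fast, which matters when filtering a whole book.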

3.2 Stemming

Stemming reduces each word to its root (stem). The stem is not always a dictionary word.

  • leafs -> leaf

  • leaves -> leav
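The two examples above can be reproduced with NLTK's Porter stemmer, which is one common choice; the tutorial does not say which stemmer it used, so treat that as an assumption. The wrapper name stem_words is hypothetical.

```python
def stem_words(words):
    """Reduce each word to its stem with NLTK's Porter stemmer."""
    from nltk.stem.porter import PorterStemmer  # imported lazily; needs NLTK
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]


try:
    stems = stem_words(['leafs', 'leaves'])
    print(stems)
except ImportError as err:
    # NLTK is not installed in this environment
    stems = None
    print("NLTK setup incomplete:", err)
```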

Step 4. Other tools

  • Lemmatization

    • Related to stemming but more context-aware

    • More accurate but slower. Useful for sentiment analysis

    • leafs -> leaf

    • leaves -> leaf

  • Word embedding/Text vectors

    • Representing words as vectors

    • Words used in similar contexts are placed close together in the vector space

    • Word2Vec: A popular technique for learning word embeddings.
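To make "close together" concrete, closeness between word vectors is often measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors; real embeddings are learned from text (for example by Word2Vec) and typically have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, invented for illustration only
vectors = {
    'insect': [0.9, 0.1, 0.2],
    'vermin': [0.8, 0.2, 0.3],
    'window': [0.1, 0.9, 0.4],
}

sim_close = cosine_similarity(vectors['insect'], vectors['vermin'])
sim_far = cosine_similarity(vectors['insect'], vectors['window'])
print(sim_close, sim_far)
```

With these toy values, 'insect' scores higher against 'vermin' than against 'window', which is the behavior a trained embedding is expected to show for words used in similar contexts.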
