Cleaning Text Data
Step 0. Download data
Note: In this tutorial, we will be using Kafka's "The Metamorphosis", which tells the story of the salesman Gregor Samsa, who wakes one morning to find himself inexplicably transformed into a huge insect.
Link: http://www.gutenberg.org/cache/epub/5200/pg5200.txt
Open the file, delete the header and footer information, and save the file as "metamorphosis_clean.txt"
Step 1. Sneak peek into the data
See the structure, paragraphs, and punctuation of the data
Determine how much of this data is useful for us
Know your objective. For instance, if we are trying to develop a Kafka language model, we may want to keep the punctuation, quotes and cases
Step 2. White space/Punctuation/Normalize Case
< Manual cleaning >
2.1 Load data
# 1. Load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
2.2 Tokenization
Tokenization: Split paragraph -> sentences -> words. Commonly split on white space
Split by white space
Split by white space and remove punctuation
2.2.a Split by white space
Warning: We may end up with punctuation included. Eg: 'room,'
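The whitespace split can be sketched as follows; the sample sentence here is illustrative, standing in for the loaded Metamorphosis text.

```python
# Split the raw text on whitespace; punctuation stays attached to words.
text = "One morning, Gregor Samsa woke in his room, transformed."
words = text.split()
print(words[:4])  # ['One', 'morning,', 'Gregor', 'Samsa'] -- note the trailing comma
```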
2.2.b Split by white space and remove punctuation
Create a mapping table
Apply mapping table over a list to translate/strip punctuation
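The mapping-table approach can be sketched with `str.maketrans` and `translate`; again the sample sentence is illustrative, not the book text.

```python
import string

# Build a translation table mapping every punctuation character to None,
# then strip punctuation from each whitespace-split token.
text = "One morning, Gregor Samsa woke in his room, transformed."
table = str.maketrans('', '', string.punctuation)
words = [w.translate(table) for w in text.split()]
print(words)  # 'morning,' becomes 'morning'
```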
2.3 Capitalization
Use .lower() method to reduce every word to lowercase
Warning: 'US' and 'us' may differ in meaning
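Case normalization is a one-liner over the token list; the sample tokens are illustrative.

```python
# Normalize case: reduce every token to lowercase.
words = ['One', 'morning', 'Gregor', 'Samsa', 'woke']
lowered = [w.lower() for w in words]
print(lowered)  # ['one', 'morning', 'gregor', 'samsa', 'woke']
```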
< Using Natural Language Toolkit, or NLTK >
2.1 Install NLTK
Install the NLTK library
Download NLTK data (toy grammars, trained models, etc.) ~3.25GB
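The install step might look like the following. The named data packages (`punkt`, `stopwords`, `wordnet`) are just the ones used later in this tutorial; the full ~3.25GB bundle can be fetched with `all` instead.

```shell
# Install the NLTK library
pip install nltk

# Download only the data packages used below (or use: python -m nltk.downloader all)
python -m nltk.downloader punkt stopwords wordnet
```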
2.2 Tokenization
Split by word: word_tokenize(text)
split by sentence: sent_tokenize(text)
Filter out punctuation: isalpha()
2.2.a Split by word
2.2.b Split by sentence
2.2.c Filter out punctuation
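Filtering with `isalpha()` keeps only purely alphabetic tokens; the sample token list is illustrative.

```python
# Keep only alphabetic tokens; drops punctuation tokens like ',' and '.',
# and mixed tokens like '1915'.
tokens = ['One', 'morning', ',', 'Gregor', 'woke', '.', '1915']
words = [t for t in tokens if t.isalpha()]
print(words)  # ['One', 'morning', 'Gregor', 'woke']
```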
Step 3. Stopwords + Stemming
3.1 Stopwords
Stopwords: Words that carry little standalone meaning and are often removed during NLP. Eg: the, and, me, myself
3.2 Stemming
Stemming reduces a word to its root (stem), usually by chopping suffixes; the stem need not be a valid word.
leafs -> leaf
leaves -> leav
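The examples above match NLTK's Porter stemmer, which needs no extra data download:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['leafs', 'leaves', 'running']:
    print(word, '->', stemmer.stem(word))  # leafs -> leaf, leaves -> leav
```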
Step 4. Other tools
Lemmatization
Related to stemming, but more context-aware: it reduces a word to its dictionary form (lemma)
More accurate but slower. Useful for sentiment analysis
leafs -> leaf
leaves -> leaf
Word embedding/Text vectors
Representing words as vectors
Words that appear in similar contexts are placed close together in the vector space
Word2Vec: One of the most popular techniques for learning word embeddings.
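The "similar context -> close in space" idea can be illustrated with cosine similarity. These are toy hand-made vectors, not trained embeddings; real Word2Vec vectors come from training on a corpus.

```python
import math

# Toy 3-d "embeddings" (made up for illustration only).
vectors = {
    'insect':   [0.9, 0.1, 0.2],
    'beetle':   [0.8, 0.2, 0.3],
    'salesman': [0.1, 0.9, 0.7],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors['insect'], vectors['beetle']))    # high: similar context
print(cosine(vectors['insect'], vectors['salesman']))  # lower: different context
```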