Cleaning Text Data
Step 0. Download data
Step 1. Sneak peek into the data
Step 2. White space/Punctuation/Normalize Case
< Manual cleaning >
2.1 Load data
# 1. Load data
filename = 'metamorphosis_clean.txt'
# The with-block closes the file automatically, even if read() raises.
with open(filename, 'rt') as file:
    text = file.read()
2.2 Tokenization
2.2.a Split by white space
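A minimal sketch of this step. The sample string is an assumption that stands in for the `text` loaded in 2.1, so the snippet runs without the book file. Calling `split()` with no arguments splits on any run of whitespace, which means punctuation stays attached to the neighbouring word.

```python
# Sample string standing in for the loaded text (an assumption, not the book file).
text = "The door opened slowly.  Nobody was there, he thought."

# split() with no arguments splits on any run of whitespace (spaces, tabs, newlines).
words = text.split()
print(words)
```

Tokens such as `slowly.` and `there,` still carry their punctuation, which is what the next step addresses.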
Output:
2.2.b Split by white space and remove punctuation
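One common way to do this with the standard library is a translation table built from `string.punctuation`. The sketch below assumes a short sample string in place of the loaded text. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and dashes would pass through unchanged.

```python
import string

# Sample string standing in for the loaded text (an assumption, not the book file).
text = "The door opened slowly.  Nobody was there, he thought."
words = text.split()

# Build a table that deletes every character listed in string.punctuation.
table = str.maketrans('', '', string.punctuation)

# Apply the table to each token to strip its punctuation.
stripped = [w.translate(table) for w in words]
print(stripped)
```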
Output:
2.3 Capitalization
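Normalizing case is usually a single `lower()` call per token. The token list below is a hypothetical example rather than output from the book. Lowercasing merges variants such as `The` and `the`, at the cost of losing distinctions like proper nouns.

```python
# Hypothetical tokens, as they might look after punctuation removal.
words = ['The', 'door', 'opened', 'slowly', 'Nobody']

# Convert every token to lowercase so 'The' and 'the' count as the same word.
lowered = [w.lower() for w in words]
print(lowered)
```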
< Using Natural Language Toolkit, or NLTK >
2.1 Install NLTK
2.2 Tokenization
2.2.a Split by word
2.2.b Split by sentence
2.2.c Filter out punctuation
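A tokenizer such as NLTK's `word_tokenize` returns punctuation marks as separate tokens, so they can be filtered afterwards. The sketch below assumes a hand-written token list in place of real tokenizer output and keeps only purely alphabetic tokens.

```python
# Hand-written tokens standing in for tokenizer output (an assumption).
tokens = ['The', 'door', 'opened', ',', 'but', 'nobody', 'was', 'there', '.']

# str.isalpha() is True only when every character in the token is a letter.
words = [t for t in tokens if t.isalpha()]
print(words)
```

This filter is blunt: it also drops tokens that mix letters with digits or apostrophes, such as `n't`.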
Step 3. Stopwords + Stemming
3.1 Stopwords
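Stopword removal is a membership test against a list of very common words. The tiny set below is a hypothetical stand-in chosen for illustration; NLTK provides a fuller English list through `nltk.corpus.stopwords.words('english')` once the `stopwords` corpus has been downloaded.

```python
# Tiny illustrative stopword set (an assumption, not NLTK's list).
stop_words = {'the', 'a', 'an', 'and', 'but', 'was', 'he', 'in', 'of', 'to'}

# Hypothetical lowercased tokens.
words = ['the', 'door', 'opened', 'but', 'nobody', 'was', 'there']

# Keep only the tokens that are not in the stopword set.
filtered = [w for w in words if w not in stop_words]
print(filtered)
```

Using a `set` keeps each lookup constant-time, which matters when filtering a whole book.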
3.2 Stemming
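Stemming reduces inflected forms to a shared root. The sketch below is a deliberately naive suffix stripper written for illustration; it is not the Porter algorithm, and `naive_stem` is a hypothetical helper. For real use, NLTK's `nltk.stem.porter.PorterStemmer` applies a much more careful rule set.

```python
def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ('ing', 'ed', 'ly', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    # No suffix matched, or the remaining stem would be too short.
    return word

# Hypothetical lowercased tokens.
words = ['walking', 'walked', 'quickly', 'doors', 'was']
stems = [naive_stem(w) for w in words]
print(stems)
```

The length guard is what leaves a short word like `was` untouched instead of turning it into `wa`.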
Step 4. Other tools