Cleaning Text Data


Step 0. Download data

Note: In this tutorial, we will be using Kafka's "The Metamorphosis", which tells the story of the salesman Gregor Samsa, who wakes one morning to find himself inexplicably transformed into a huge insect.

Link: http://www.gutenberg.org/cache/epub/5200/pg5200.txt

Open the file, delete the header and footer information, and save the file as "metamorphosis_clean.txt".
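
If you prefer to script this step, here is a minimal sketch using only the standard library. The '*** START'/'*** END' marker strings are an assumption about the Project Gutenberg header and footer, so verify them against the actual file (or simply trim it by hand as described above).

# Optional: download and trim the file in Python instead of by hand
# Note: the '*** START'/'*** END' markers are assumptions; check the file and adjust.
import urllib.request

url = 'http://www.gutenberg.org/cache/epub/5200/pg5200.txt'
raw = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')

start = raw.find('*** START')
start = raw.find('\n', start) + 1 if start != -1 else 0   # skip the marker line itself
end = raw.find('*** END')
end = end if end != -1 else len(raw)

with open('metamorphosis_clean.txt', 'w', encoding='utf-8') as f:
    f.write(raw[start:end])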

Step 1. Sneak peek into the data

  • See the structure, paragraphs and punctuation of the data (a quick peek is sketched below)

  • Determine how much of this data is useful for us

  • Know your objective. For instance, if we are trying to develop a Kafka language model, we may want to keep the punctuation, quotes and case
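
A minimal sketch for this first look, assuming the cleaned file from Step 0 is in the working directory: it prints the opening characters and a rough size, so you can judge structure and punctuation before committing to any cleaning.

# Step 1. Sneak peek into the data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

print(text[:500])                             # opening paragraphs: structure and punctuation
print(len(text), 'characters')                # rough size of the corpus
print(len(text.split()), 'whitespace-separated tokens')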

Step 2. White space/Punctuation/Normalize Case

< Manual cleaning >

2.1 Load data

# 1. Load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

2.2 Tokenization

Tokenization: split paragraph -> sentences -> words, most commonly by splitting on white space

  • Split by white space

  • Split by white space and remove punctuation

2.2.a Split by white space

Warning: We may end up with punctuation included. Eg: 'room,'

# 2. Tokenization
# 2.2.a. split by whitespace
words = text.split()
print(words[:100])

Output:

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']

2.2.b Split by white space and remove punctuation

  • Create a mapping table

  • Apply mapping table over a list to translate/strip punctuation

# 2. Tokenization
# 2.2.b Split by whitespace and remove punctuation
import string

words = text.split()

# list the characters in the punctuation constant
print(string.punctuation)

# create a mapping table
# (the 3rd parameter lists the characters to be removed)
mappingTable = str.maketrans('', '', string.punctuation)

# apply mapping to remove punctuation
stripped = [w.translate(mappingTable) for w in words]
print(stripped[:100])

Output:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']

2.3 Capitalization

  • Use .lower() method to reduce every word to lowercase

  • Warning: 'US' and 'us' may differ in meaning

# 2.3 Capitalization
lowerWords = [w.lower() for w in stripped]
print(lowerWords[:100])

Output:

['one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'he', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'his', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'whats', 'happened', 'to', 'me', 'he', 'thought', 'it', 'wasnt', 'a', 'dream', 'his', 'room', 'a', 'proper', 'human']

< Using Natural Language Toolkit, or NLTK >

2.1 Install NLTK

  • Install the NLTK library: http://www.nltk.org/install.html

  • Download NLTK data (toy grammars, trained models, etc.), ~3.25 GB: http://www.nltk.org/data.html (a minimal sketch for both steps follows below)
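
A minimal sketch of the setup, assuming pip is available: nltk.download() with no argument opens the interactive downloader, while naming specific packages (only 'punkt' and 'stopwords' are needed for this page) avoids fetching the full ~3.25 GB collection.

# Install the library from the command line:
# pip install nltk

# Then download only the data this page uses (instead of everything)
import nltk
nltk.download('punkt')       # tokenizer models used by word_tokenize/sent_tokenize
nltk.download('stopwords')   # stopword lists used in Step 3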

2.2 Tokenization

  • Split by word: word_tokenize(text)

  • Split by sentence: sent_tokenize(text)

  • Filter out punctuation: isalpha()

2.2.a Split by word

# 1. load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

# 2. Tokenization
# 2a. Split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

Output:

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']

2.2.b Split by sentence

# 2b. Split by sentences
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[:10])

Output:

['One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin.', 'He lay on\nhis armour-like back, and if he lifted his head a little he could\nsee his brown belly, slightly domed and divided by arches into stiff\nsections.', 'The bedding was hardly able to cover it and seemed ready\nto slide off any moment.', 'His many legs, pitifully thin compared\nwith the size of the rest of him, waved about helplessly as he\nlooked.', '"What\'s happened to me?"', 'he thought.', "It wasn't a dream.", 'His room,\na proper human room although a little too small, lay peacefully\nbetween its four familiar walls.', 'A collection of textile samples\nlay spread out on the table - Samsa was a travelling salesman - and\nabove it there hung a picture that he had recently cut out of an\nillustrated magazine and housed in a nice, gilded frame.', 'It showed\na lady fitted out with a fur hat and fur boa who sat upright,\nraising a heavy fur muff that covered the whole of her lower arm\ntowards the viewer.']

2.2.c Filter out punctuation

# 3. Filter out punctuation
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
Output:

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']

Step 3. Stopwords + Stemming

3.1 Stopwords

Stopwords: words with little value to the analysis that are typically removed during NLP preprocessing. E.g.: the, and, me, myself

# 1. Import stopwords from NLTK library
from nltk.corpus import stopwords
print(stopwords.words('english'))

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# 2. Tokenize words + remove punctuation
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
words = [word for word in tokens if word.isalpha()]

# 3. Remove stopwords
stop_words = set(stopwords.words('english'))
clean_words = [w for w in words if w not in stop_words]
print(clean_words[:100])

Output:

['One', 'morning', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'He', 'lay', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'The', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'What', 'happened', 'thought', 'It', 'dream', 'His', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'A', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'Samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'It', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff']

3.2 Stemming

Stemming reduces a word to its root form. For example:

  • leafs -> leaf

  • leaves -> leav

# 1. Import a stemming model (Porter; NLTK also provides Snowball)
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

# 2. Tokenize words + remove punctuation
tokens = word_tokenize(text)
words = [word for word in tokens if word.isalpha()]

# 3. Stemming of words
# Note: the Porter stemmer generally lowercases the words it stems
stemmed = [porter.stem(word) for word in words]
print(stemmed[:100])

Output:

['one', 'morn', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', 'He', 'lay', 'on', 'hi', 'back', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', 'hi', 'mani', 'leg', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'wave', 'about', 'helplessli', 'as', 'he', 'look', 'what', 'happen', 'to', 'me', 'he', 'thought', 'It', 'wa', 'a', 'dream', 'hi', 'room', 'a', 'proper', 'human', 'room']

Step 4. Other tools

  • Lemmatization

    • Related to stemming, but more context aware (a small sketch follows after this list)

    • More accurate but slower; useful for sentiment analysis

    • leafs -> leaf

    • leaves -> leaf

  • Word embedding/Text vectors

    • Representing words as vectors

    • Words that appear in similar contexts end up close together in the vector space

    • Word2Vec: one of the most popular techniques for learning word embeddings
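
A minimal lemmatization sketch with NLTK's WordNetLemmatizer, assuming the WordNet data has been downloaded; note that lemmatize() treats words as nouns unless a part-of-speech is passed.

# Lemmatization with NLTK (requires the WordNet data)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')   # some NLTK versions also require this resource

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('leafs'))              # leaf
print(lemmatizer.lemmatize('leaves'))             # leaf (compare the stemmer's 'leav')
print(lemmatizer.lemmatize('running', pos='v'))   # run, with a part-of-speech hint

For word embeddings, a library such as gensim (which implements Word2Vec) is the usual next step, but that is beyond the scope of this page.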
