2. NLP Supplemental Data Cleaning

Stemming

A process of reducing inflected (or sometimes derived) words to their word stem or root.

For example:

  • Stemming/stemmed will be reduced to their base word Stem
  • Electricity/electrical will be reduced to their base word Electr
  • Berries/berry will be reduced to their base word Berri
  • Connection/connected/connective will be reduced to their base word Connect

But it won't always be right. For example:

  • Meanness/meaning will be reduced to the base word Mean, which is not right

Why Stemming?

  • Reduces the corpus of words the model is exposed to.
  • Explicitly correlates words with similar meaning.
  • For example:
    • Grow/grows/growing would each take up separate space in the model's vocabulary. Using stemming, we can reduce all three to their root word (which is Grow), thus saving memory (see the short sketch below).
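As a rough sketch of that vocabulary reduction, the snippet below uses the Porter Stemmer (introduced in the next section) to collapse six surface forms into two stems. The word list here is just an illustrative assumption, not part of the dataset used later.

# minimal sketch: stemming shrinks the set of unique tokens
import nltk

ps = nltk.PorterStemmer()

words = ['grow', 'grows', 'growing', 'connection', 'connected', 'connective']
stems = [ps.stem(word) for word in words]

print('unique raw words:', len(set(words)))   # 6
print('unique stems    :', len(set(stems)))   # 2 -> {'connect', 'grow'}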

Using Stemming

Test out Porter Stemmer

In [1]:
# importing nltk package
import nltk

ps = nltk.PorterStemmer()
In [2]:
# checking out what functions and methods are in PorterStemmer. We will be using stem a lot.
dir(ps)
Out[2]:
['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_abc_impl',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'unicode_repr',
 'vowels']
In [3]:
#example
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))
grow
grow
grow
In [4]:
#example 2
print(ps.stem('run'))
print(ps.stem('running'))
print(ps.stem('runner'))
run
run
runner

Reading in raw text data used in the first post

In [5]:
# importing packages
import nltk
import pandas as pd
import re
import string

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t', header = None, names = ['label', 'body_text'])

data.head()
Out[5]:
label body_text
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2 ham Nah I don't think he goes to usf, he lives around here though
3 ham Even my brother is not like to speak with me. They treat me like aids patent.
4 ham I HAVE A DATE ON SUNDAY WITH WILL!!
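Note: the stopwords list above (and the WordNet data used by the lemmatizer later in this post) comes from NLTK corpora. If they are not already present locally, they can be fetched once; a small setup sketch, assuming a standard NLTK installation:

import nltk

# one-time downloads of the corpora used in this post
nltk.download('stopwords')   # stopword lists
nltk.download('wordnet')     # WordNet data for WordNetLemmatizer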

Cleaning up text used in the first post

In [6]:
# combining 3 separate functions used in the first post into one function
def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation])  # remove punctuation
    tokens = re.split(r'\W+', text)                                            # tokenize
    text = [word for word in tokens if word not in stopwords]                  # remove stopwords
    return text

# applying the function to each message
data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data.head()
Out[6]:
label body_text body_text_nostop
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 ham Nah I don't think he goes to usf, he lives around here though [nah, dont, think, goes, usf, lives, around, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. [even, brother, like, speak, treat, like, aids, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! [date, sunday]

Stemming Text

In [7]:
# defining the stemming function
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

# applying the function to each tokenized message
data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))

data.head()
Out[7]:
label body_text body_text_nostop body_text_stemmed
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... [ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...
2 ham Nah I don't think he goes to usf, he lives around here though [nah, dont, think, goes, usf, lives, around, though] [nah, dont, think, goe, usf, live, around, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. [even, brother, like, speak, treat, like, aids, patent] [even, brother, like, speak, treat, like, aid, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! [date, sunday] [date, sunday]

Lemmatizing

A process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma. Put differently, it is a process of using vocabulary analysis of words, aiming to remove inflectional endings and return the dictionary form of a word.

How is it different from stemming?

  • The goal of both techniques is to condense derived words into their base forms.
  • Stemming is typically faster as it simply chops off the end of a word using heuristics, without any understanding of the context in which a word is used.
  • On the other hand, lemmatizing is typically more accurate as it uses more informed analysis to create groups of words with similar meaning based on the context around the word (see the sketch after this list).
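As a rough illustration of that last point, WordNetLemmatizer accepts an optional part-of-speech argument, and telling it whether a word is a noun or a verb can change the lemma it returns. A small sketch, assuming NLTK and the WordNet corpus are available; the word 'meeting' is just an illustrative choice:

import nltk

wn = nltk.WordNetLemmatizer()

# the same surface form lemmatizes differently depending on the part of speech
print(wn.lemmatize('meeting', pos='n'))   # meeting (kept as the noun)
print(wn.lemmatize('meeting', pos='v'))   # meet (reduced to the verb's base form)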

Using Lemmatizing

In [8]:
# importing nltk package
import nltk

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()
In [9]:
# checking out what functions and methods are in WordNetLemmatizer. We will be using lemmatize a lot.
dir(wn)
Out[9]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 'lemmatize',
 'unicode_repr']
In [10]:
#example 1
print('Stemming Example')
print(ps.stem('meanness'))
print(ps.stem('meaning'))
print('')
print('Lemmatizing Example')
print(wn.lemmatize('meanness'))
print(wn.lemmatize('meaning'))
Stemming Example
mean
mean

Lemmatizing Example
meanness
meaning
In [11]:
#example 2
print('Stemming Example')
print(ps.stem('goose'))
print(ps.stem('geese'))
print('')
print('Lemmatizing Example')
print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))
Stemming Example
goos
gees

Lemmatizing Example
goose
goose

Reading in raw text data used in the first post

In [12]:
# importing packages
import nltk
import pandas as pd
import re
import string

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t', header = None, names = ['label', 'body_text'])

data.head()
Out[12]:
label body_text
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2 ham Nah I don't think he goes to usf, he lives around here though
3 ham Even my brother is not like to speak with me. They treat me like aids patent.
4 ham I HAVE A DATE ON SUNDAY WITH WILL!!

Cleaning up text used in the first post

In [13]:
# combining 3 separate functions used in the first post into one function
def clean_text(text):
    text = "".join([char for char in text if char not in string.punctuation])  # remove punctuation
    tokens = re.split(r'\W+', text)                                            # tokenize
    text = [word for word in tokens if word not in stopwords]                  # remove stopwords
    return text

# applying the function to each message
data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data.head()
Out[13]:
label body_text body_text_nostop
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 ham Nah I don't think he goes to usf, he lives around here though [nah, dont, think, goes, usf, lives, around, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. [even, brother, like, speak, treat, like, aids, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! [date, sunday]

Lemmatizing Text

In [14]:
# defining the lemmatize function
def lemmatize(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

# applying the function to each tokenized message
data['body_text_lemmatized'] = data['body_text_nostop'].apply(lambda x: lemmatize(x))

data.head(10)
Out[14]:
label body_text body_text_nostop body_text_lemmatized
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... [ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 ham Nah I don't think he goes to usf, he lives around here though [nah, dont, think, goes, usf, lives, around, though] [nah, dont, think, go, usf, life, around, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. [even, brother, like, speak, treat, like, aids, patent] [even, brother, like, speak, treat, like, aid, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! [date, sunday] [date, sunday]
5 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre...
6 spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170...
7 spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile... [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ...
8 ham I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today]
9 spam SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,... [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t...

Author: Amandeep Saluja