3. Vectorizing Raw Data

Vectorizing is the process of encoding text as integers to create feature vectors. A feature vector is an n-dimensional vector of numerical features that represents some object. The goal is to create a Document-Term Matrix (DTM).

Why Vectorize?

  • When looking at a word, Python only sees a string of characters.
  • Raw text needs to be converted to numbers so that Python and the machine learning algorithms can understand it.

Method 1: Count Vectorization

This creates a DTM where each cell holds a count of the number of times that word occurred in that document.
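
A minimal sketch of what that looks like on a hypothetical three-message corpus (the messages below are made up, not from our dataset):

# toy corpus to illustrate a document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["free entry win win", "call me soon", "win a free prize"]

vect = CountVectorizer()
dtm = vect.fit_transform(corpus)  # rows = documents, columns = unique words

# row 0 shows win = 2 because "win" occurs twice in the first message
print(pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()))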

Reading in raw text data

In [1]:
# importing packages
import nltk
import pandas as pd
import re
import string

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

# instantiating the Porter stemmer
ps = nltk.PorterStemmer()

# loading the English stopword list
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

data.head()
Out[1]:
label body_text
0 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 ham Nah I don't think he goes to usf, he lives around here though
2 ham Even my brother is not like to speak with me. They treat me like aids patent.
3 ham I HAVE A DATE ON SUNDAY WITH WILL!!
4 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...

Cleaning up text by creating a function to remove punctuation, tokenize, remove stopwords, and stem

In [2]:
# combining the 3 separate functions used in the first post into one function
def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
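
As a quick sanity check, running the function on the fourth message in our dataset should give back only the stemmed, non-stopword tokens (consistent with the cleaned text shown later in this post):

# 'I', 'HAVE', 'A', 'ON', 'WITH' and 'WILL' are all NLTK stopwords once lowercased
print(clean_text("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# expected: ['date', 'sunday']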

Importing and applying CountVectorizer

In [3]:
# importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# instantiating CountVectorizer with our custom analyzer
count_vect = CountVectorizer(analyzer=clean_text)

# fitting the vectorizer to body_text and transforming it

# X_counts = count_vect.fit(data['body_text']) # this will only fit the data

X_counts = count_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_counts.shape)

# since the list is too long, we will print out only the last 20 unique words found across our dataset
print(count_vect.get_feature_names()[-20:])
(5567, 8104)
['yupz', 'ywhere', 'z', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zero', 'zhong', 'zindgi', 'zoe', 'zogtoriu', 'zoom', 'zouk', 'zyada', 'é', 'ü', 'üll', '〨ud']
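
Once fitted, the same vectorizer can encode new, unseen messages against the vocabulary it learned; any word that was not in the training data is simply ignored. A minimal sketch (the message below is made up):

# transform() reuses the learned vocabulary instead of building a new one
new_message = ["Free entry to win a prize today"]  # hypothetical new text
X_new = count_vect.transform(new_message)

print(X_new.shape)  # (1, 8104) -- one row, same 8104 columns as X_counts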

Processing count_vect on a sample dataset for better understanding

In [4]:
data_sample = data[0:20]

# instantiating a CountVectorizer for the sample
count_vect_sample = CountVectorizer(analyzer=clean_text)

# fitting the sample vectorizer to body_text and transforming it

# X_counts_sample = count_vect_sample.fit(data_sample['body_text']) # this will only fit the data

X_counts_sample = count_vect_sample.fit_transform(data_sample['body_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_counts_sample.shape)

# printing out all the unique words found across our dataset
print(count_vect_sample.get_feature_names())
(20, 192)
['08002986030', '08452810075over18', '09061701461', '1', '100', '100000', '11', '12', '150pday', '16', '2', '20000', '2005', '21st', '3', '4', '4403ldnw1a7rw18', '4txtú120', '6day', '81010', '87077', '87121', '87575', '9', '900', 'aft', 'aid', 'alreadi', 'alright', 'anymor', 'appli', 'ard', 'around', 'b', 'brother', 'call', 'caller', 'callertun', 'camera', 'cash', 'chanc', 'claim', 'click', 'co', 'code', 'colour', 'comin', 'comp', 'copi', 'cost', 'credit', 'cri', 'csh11', 'cup', 'custom', 'da', 'date', 'dont', 'eg', 'eh', 'england', 'enough', 'entitl', 'entri', 'even', 'fa', 'feel', 'ffffffffff', 'final', 'fine', 'finish', 'first', 'free', 'friend', 'go', 'goalsteam', 'goe', 'gonna', 'gota', 'ha', 'hl', 'home', 'hour', 'httpwap', 'im', 'info', 'ive', 'jackpot', 'joke', 'k', 'kim', 'kl341', 'lar', 'latest', 'lccltd', 'like', 'link', 'live', 'lor', 'lunch', 'macedonia', 'make', 'may', 'meet', 'mell', 'membership', 'messag', 'minnaminungint', 'miss', 'mobil', 'month', 'nah', 'name', 'nation', 'naughti', 'network', 'news', 'next', 'nurungu', 'oh', 'oru', 'patent', 'pay', 'per', 'pobox', 'poboxox36504w45wq', 'pound', 'press', 'prize', 'questionstd', 'r', 'ratetc', 'receiv', 'receivea', 'rememb', 'repli', 'request', 'reward', 'scotland', 'select', 'send', 'serious', 'set', 'six', 'smth', 'soon', 'sooner', 'speak', 'spell', 'stock', 'str', 'stuff', 'sunday', 'talk', 'tc', 'team', 'text', 'think', 'though', 'tkt', 'today', 'tonight', 'treat', 'tri', 'trywal', 'tsandc', 'txt', 'u', 'updat', 'ur', 'urgent', 'use', 'usf', 'v', 'valid', 'valu', 'vettam', 'want', 'wap', 'watch', 'way', 'week', 'wet', 'win', 'winner', 'wkli', 'word', 'wwwdbuknet', 'xxxmobilemovieclub', 'xxxmobilemovieclubcomnqjkgighjjgcbl', 'ye', 'ü']

Vectorizers output sparse matrices

A sparse matrix is a matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix is stored by recording only the locations and values of its non-zero elements.
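
A minimal sketch of the idea with scipy (the values are made up):

import numpy as np
from scipy.sparse import csr_matrix

# a mostly-zero matrix: only 2 of its 12 entries are non-zero
dense = np.array([[0, 0, 3, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 0]])

sparse = csr_matrix(dense)

print(sparse.nnz)   # 2 stored elements instead of 12
print(sparse)       # lists each (row, column) position and its value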

In [5]:
X_counts_sample
Out[5]:
<20x192 sparse matrix of type '<class 'numpy.int64'>'
	with 218 stored elements in Compressed Sparse Row format>
In [6]:
# storing array as a dataframe
X_counts_df = pd.DataFrame(X_counts_sample.toarray())
X_counts_df
Out[6]:
0 1 2 3 4 5 6 7 8 9 ... 182 183 184 185 186 187 188 189 190 191
0 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 0 ... 0 0 1 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 1 0 0 0 1 1 ... 0 1 0 0 0 0 0 0 0 0
9 0 0 0 1 0 1 0 0 0 0 ... 0 0 0 0 1 1 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 192 columns

In [7]:
# replacing the numeric column indices with the actual words
X_counts_df.columns = count_vect_sample.get_feature_names()
X_counts_df
Out[7]:
08002986030 08452810075over18 09061701461 1 100 100000 11 12 150pday 16 ... wet win winner wkli word wwwdbuknet xxxmobilemovieclub xxxmobilemovieclubcomnqjkgighjjgcbl ye ü
0 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 0 ... 0 0 1 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 1 0 0 0 1 1 ... 0 1 0 0 0 0 0 0 0 0
9 0 0 0 1 0 1 0 0 0 0 ... 0 0 0 0 1 1 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 192 columns

Method 2: N-Grams

This creates a DTM where counts still occupy the cells, but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.

Example:

"NLP is an interesting topic"

n   Name        Tokens
2   bigram      ["nlp is", "is an", "an interesting", "interesting topic"]
3   trigram     ["nlp is an", "is an interesting", "an interesting topic"]
4   four-gram   ["nlp is an interesting", "is an interesting topic"]
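
A minimal sketch of how these n-grams can be built from a token list (nltk.ngrams would do the same job):

def ngrams(tokens, n):
    # slide a window of size n across the tokens and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "nlp is an interesting topic".split()
print(ngrams(tokens, 2))  # ['nlp is', 'is an', 'an interesting', 'interesting topic']
print(ngrams(tokens, 3))  # ['nlp is an', 'is an interesting', 'an interesting topic']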

Reading in raw text data

In [8]:
# importing packages
import nltk
import pandas as pd
import re
import string

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

# instantiating the Porter stemmer
ps = nltk.PorterStemmer()

# loading the English stopword list
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

data.head()
Out[8]:
label body_text
0 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 ham Nah I don't think he goes to usf, he lives around here though
2 ham Even my brother is not like to speak with me. They treat me like aids patent.
3 ham I HAVE A DATE ON SUNDAY WITH WILL!!
4 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...

Cleaning up text by creating a function to remove punctuation, tokenize, remove stopwords, and stem

In [9]:
# combining the 3 separate functions used in the first post into one function
def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords]) # change from before: tokens are re-joined into a single string
    return text

# applying the function to every message
data['cleaned_text'] = data['body_text'].apply(clean_text)
data.head()
Out[9]:
label body_text cleaned_text
0 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
1 ham Nah I don't think he goes to usf, he lives around here though nah dont think goe usf live around though
2 ham Even my brother is not like to speak with me. They treat me like aids patent. even brother like speak treat like aid patent
3 ham I HAVE A DATE ON SUNDAY WITH WILL!! date sunday
4 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend...
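
The re-join is what makes n-grams possible here: when CountVectorizer is given a callable analyzer (as in Methods 1 and 3), its ngram_range parameter is ignored, so instead we clean the text up front and let the default word analyzer build the n-grams. A minimal sketch on the cleaned text of row 3 ('date sunday'):

from sklearn.feature_extraction.text import CountVectorizer

# bigrams over an already-cleaned string
demo_vect = CountVectorizer(ngram_range=(2, 2))
demo_vect.fit_transform(["date sunday"])
print(demo_vect.get_feature_names())  # ['date sunday'] -- one bigram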

Importing and applying CountVectorizer (w/ N-Grams)

In [10]:
# importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# instantiating CountVectorizer using only bigrams, i.e. ngram_range=(2,2)
ngram_vect = CountVectorizer(ngram_range=(2,2))

# fitting the vectorizer to cleaned_text and transforming it

# X_counts = ngram_vect.fit(data['cleaned_text']) # this will only fit the data

X_counts = ngram_vect.fit_transform(data['cleaned_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_counts.shape)

# since the list is too long, we will print out only the last 20 unique bigrams found across our dataset
print(ngram_vect.get_feature_names()[-20:])
(5567, 31260)
['ywhere dogbreath', 'zac doesnt', 'zaher got', 'zebra anim', 'zed 08701417012', 'zed 08701417012150p', 'zed pobox', 'zero save', 'zhong se', 'zindgi wo', 'zoe 18', 'zoe hit', 'zogtoriu stare', 'zoom cine', 'zouk nichol', 'zyada kisi', 'üll finish', 'üll submit', 'üll take', '〨ud even']

Processing ngram_vect on a sample dataset for better understanding

In [11]:
data_sample = data[0:20]

# instantiating a bigram CountVectorizer for the sample
ngram_vect_sample = CountVectorizer(ngram_range=(2,2))

# fitting the sample vectorizer to cleaned_text and transforming it

# X_counts_sample = ngram_vect_sample.fit(data_sample['cleaned_text']) # this will only fit the data

X_counts_sample = ngram_vect_sample.fit_transform(data_sample['cleaned_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_counts_sample.shape)

# since the list is too long, we will print out only the last 20 unique bigrams found across our dataset
print(ngram_vect_sample.get_feature_names()[-20:])
(20, 198)
['urgent week', 'use credit', 'usf live', 'valid 12', 'valu network', 'vettam set', 'want talk', 'wap link', 'way feel', 'way gota', 'way meet', 'week free', 'win cash', 'win fa', 'winner valu', 'wkli comp', 'word claim', 'wwwdbuknet lccltd', 'xxxmobilemovieclub use', 'ye naughti']

Vectorizers output sparse matrices

A sparse matrix is a matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix is stored by recording only the locations and values of its non-zero elements.

In [12]:
X_counts_sample
Out[12]:
<20x198 sparse matrix of type '<class 'numpy.int64'>'
	with 199 stored elements in Compressed Sparse Row format>
In [13]:
# storing array as a dataframe
X_counts_df = pd.DataFrame(X_counts_sample.toarray())
X_counts_df
Out[13]:
0 1 2 3 4 5 6 7 8 9 ... 188 189 190 191 192 193 194 195 196 197
0 0 0 0 0 0 0 0 0 1 1 ... 0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
6 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 1 0 0 0 1 1 1 0 0 ... 0 0 1 0 0 0 0 0 0 0
9 0 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 1 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

20 rows × 198 columns

In [14]:
# replacing the numeric column indices with the actual words
X_counts_df.columns = ngram_vect_sample.get_feature_names()
X_counts_df
Out[14]:
09061701461 claim 100 20000 100000 prize 11 month 12 hour 150pday 6day 16 tsandc 20000 pound 2005 text 21st may ... way meet week free win cash win fa winner valu wkli comp word claim wwwdbuknet lccltd xxxmobilemovieclub use ye naughti
0 0 0 0 0 0 0 0 0 1 1 ... 0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
6 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 1 0 0 0 1 1 1 0 0 ... 0 0 1 0 0 0 0 0 0 0
9 0 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 1 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

20 rows × 198 columns

Method 3: Term Frequency Inverse Document Frequency (TF-IDF)

This creates a DTM where the columns still represent single unique terms (unigrams), but each cell holds a weight meant to represent how important that word is to its document.

w(i,j) = tf(i,j) × log(N / df(i))

where tf(i,j) is the number of times word i occurs in document j, df(i) is the number of documents containing word i, and N is the total number of documents.

The lower the weight, the lower the importance of the word; conversely, the rarer a word is across documents, the higher its weight will be.
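
A minimal sketch of the formula with made-up numbers (note that sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its values will differ slightly from this textbook version):

import math

N = 20    # total number of documents
df = 2    # documents that contain the word (hypothetical)
tf = 1    # times the word occurs in this document (hypothetical)

weight = tf * math.log(N / df)
print(round(weight, 4))  # 2.3026 -- a rare word gets a high weight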

Reading in raw text data

In [15]:
# importing packages
import nltk
import pandas as pd
import re
import string

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

# instantiating the Porter stemmer
ps = nltk.PorterStemmer()

# loading the English stopword list
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

data.head()
Out[15]:
label body_text
0 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 ham Nah I don't think he goes to usf, he lives around here though
2 ham Even my brother is not like to speak with me. They treat me like aids patent.
3 ham I HAVE A DATE ON SUNDAY WITH WILL!!
4 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...

Cleaning up text by creating a function to remove punctuation, tokenize, remove stopwords, and stem

In [16]:
# combining the 3 separate functions used in the first post into one function
def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

Importing and applying TfidfVectorizer

In [17]:
# importing TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer # change from before: TfidfVectorizer instead of CountVectorizer

# instantiating TfidfVectorizer with our custom analyzer
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

# fitting the vectorizer to body_text and transforming it

# X_tfidf = tfidf_vect.fit(data['body_text']) # this will only fit the data

X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_tfidf.shape)

# since the list is too long, we will print out only the last 20 unique words found across our dataset
print(tfidf_vect.get_feature_names()[-20:])
(5567, 8104)
['yupz', 'ywhere', 'z', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zero', 'zhong', 'zindgi', 'zoe', 'zogtoriu', 'zoom', 'zouk', 'zyada', 'é', 'ü', 'üll', '〨ud']

Processing tfidf_vect on a sample dataset for better understanding

In [18]:
data_sample = data[0:20]

# instantiating a TfidfVectorizer for the sample
tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text)

# fitting the sample vectorizer to body_text and transforming it

# X_tfidf_sample = tfidf_vect_sample.fit(data_sample['body_text']) # this will only fit the data

X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['body_text']) # this will fit and vectorize the data

# printing out the shape of our document-term matrix
print(X_tfidf_sample.shape)

# printing out all the unique words found across our dataset
print(tfidf_vect_sample.get_feature_names())
(20, 192)
['08002986030', '08452810075over18', '09061701461', '1', '100', '100000', '11', '12', '150pday', '16', '2', '20000', '2005', '21st', '3', '4', '4403ldnw1a7rw18', '4txtú120', '6day', '81010', '87077', '87121', '87575', '9', '900', 'aft', 'aid', 'alreadi', 'alright', 'anymor', 'appli', 'ard', 'around', 'b', 'brother', 'call', 'caller', 'callertun', 'camera', 'cash', 'chanc', 'claim', 'click', 'co', 'code', 'colour', 'comin', 'comp', 'copi', 'cost', 'credit', 'cri', 'csh11', 'cup', 'custom', 'da', 'date', 'dont', 'eg', 'eh', 'england', 'enough', 'entitl', 'entri', 'even', 'fa', 'feel', 'ffffffffff', 'final', 'fine', 'finish', 'first', 'free', 'friend', 'go', 'goalsteam', 'goe', 'gonna', 'gota', 'ha', 'hl', 'home', 'hour', 'httpwap', 'im', 'info', 'ive', 'jackpot', 'joke', 'k', 'kim', 'kl341', 'lar', 'latest', 'lccltd', 'like', 'link', 'live', 'lor', 'lunch', 'macedonia', 'make', 'may', 'meet', 'mell', 'membership', 'messag', 'minnaminungint', 'miss', 'mobil', 'month', 'nah', 'name', 'nation', 'naughti', 'network', 'news', 'next', 'nurungu', 'oh', 'oru', 'patent', 'pay', 'per', 'pobox', 'poboxox36504w45wq', 'pound', 'press', 'prize', 'questionstd', 'r', 'ratetc', 'receiv', 'receivea', 'rememb', 'repli', 'request', 'reward', 'scotland', 'select', 'send', 'serious', 'set', 'six', 'smth', 'soon', 'sooner', 'speak', 'spell', 'stock', 'str', 'stuff', 'sunday', 'talk', 'tc', 'team', 'text', 'think', 'though', 'tkt', 'today', 'tonight', 'treat', 'tri', 'trywal', 'tsandc', 'txt', 'u', 'updat', 'ur', 'urgent', 'use', 'usf', 'v', 'valid', 'valu', 'vettam', 'want', 'wap', 'watch', 'way', 'week', 'wet', 'win', 'winner', 'wkli', 'word', 'wwwdbuknet', 'xxxmobilemovieclub', 'xxxmobilemovieclubcomnqjkgighjjgcbl', 'ye', 'ü']

Vectorizers output sparse matrices

A sparse matrix is a matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix is stored by recording only the locations and values of its non-zero elements.

In [19]:
X_tfidf_sample
Out[19]:
<20x192 sparse matrix of type '<class 'numpy.float64'>'
	with 218 stored elements in Compressed Sparse Row format>
In [20]:
# storing array as a dataframe
X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray())
X_tfidf_df
Out[20]:
0 1 2 3 4 5 6 7 8 9 ... 182 183 184 185 186 187 188 189 190 191
0 0.000000 0.198986 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.174912 0.000000 0.198986 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.231645 0.000000 0.000000 0.000000 0.000000 0.231645 0.000000 0.000000 ... 0.000000 0.000000 0.231645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.197682 0.000000 0.000000 0.000000 0.000000 0.000000 0.197682 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.224905 0.000000 0.000000 0.000000 0.224905 0.197695 ... 0.000000 0.197695 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.252972 0.000000 0.252972 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.252972 0.252972 0.000000 0.000000 0.000000 0.000000
10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.272652 0.272652 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.291197 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.291197 0.000000
13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
14 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.185730 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.377964
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

20 rows × 192 columns

In [21]:
# replacing the numeric column indices with the actual words
X_tfidf_df.columns = tfidf_vect_sample.get_feature_names()
X_tfidf_df
Out[21]:
08002986030 08452810075over18 09061701461 1 100 100000 11 12 150pday 16 ... wet win winner wkli word wwwdbuknet xxxmobilemovieclub xxxmobilemovieclubcomnqjkgighjjgcbl ye ü
0 0.000000 0.198986 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.174912 0.000000 0.198986 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.231645 0.000000 0.000000 0.000000 0.000000 0.231645 0.000000 0.000000 ... 0.000000 0.000000 0.231645 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.197682 0.000000 0.000000 0.000000 0.000000 0.000000 0.197682 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.224905 0.000000 0.000000 0.000000 0.224905 0.197695 ... 0.000000 0.197695 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.252972 0.000000 0.252972 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.252972 0.252972 0.000000 0.000000 0.000000 0.000000
10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.272652 0.272652 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.291197 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.291197 0.000000
13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
14 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.185730 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.377964
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

20 rows × 192 columns

Author: Amandeep Saluja