5. Building Machine Learning Classifiers

What is ML?

  • The field of study that gives computers the ability to learn without being explicity programmed. (By Arthur Samuel, 1959)
  • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measures by P, improves with experience E. (By Tom Mitchell, 1998)
  • Algorithms that can figure out how to perform important tasks by generalizing from examples. (By Univeristy of Washington, 2012)
  • Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. (By NVIDA, 2016)

Two broad types of ML:

  1. Supervised Learning: Inferring a function from labeled training data to make predictions on unseen data.
    • Eg: Predict whether any given email is spam based on known information about the email.
  1. Unsupervised Learning: Deriving structure from the data where we don't know the effect of any of the variales.
    • Eg: Based on the content of an email, group similar emails together in distinct folders.

Cross-Validation and Evaluation Metrics

Cross-Validation

Holdout Test Set: Sample of data not used in fitting a model for the purpose of evaluating the model's ability to generalize unseen data.

K-Fold Cross-Validation: The dull data set is divided into k-subsets and the holdout method is repeated k times. Each time, one of the k-subsets is used as the test set and the other k-1 (k minus 1) subsets are put together to be used to train the model.

Evaluation Metrics

  1. Accuracy (# predicted correctly / total # of observations)
  2. Percision (# predicted as spam that are actually spam / total # predicted as spam)
  3. Recall (# predicted as spam that are actually spam / total # that are actually spam)

Random Forest

Random Forest makes use of Ensemble method which is a technique that creates multiple models and then combines them to produce better results than any of the single models individually.

Random Forest is an Ensemble learning method that constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.

Benefits:

  • Can be used for classification or regression.
  • Easily handles outliers, missing values, etc.
  • Accepts various types of inputs (continous, ordinal, etc.).
  • Less likely to overfit.
  • Outputs feature importance.

Building a Random Forest Model

Reading and cleaning raw text data

In [1]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# defining parameters
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

#fitting and vectorizing count_vect to our body_text
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# building a datafram
X_features = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()
Out[1]:
body_len punc% 0 1 2 3 4 5 6 7 ... 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103
0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8106 columns

Explore RandomForestClassifier Attributes and Hyperparameters

In [2]:
# importing RandomForest
from sklearn.ensemble import RandomForestClassifier

# exploring the RandomForestClassifier
print(dir(RandomForestClassifier)) # we will be using feature_importances_, fit, and predict functions
print(RandomForestClassifier()) # we will be using max_depth and n_estimators parameters

# Note: RF is built on relatively few full build decision trees
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_estimator_type', '_get_param_names', '_make_estimator', '_set_oob_score', '_validate_X_predict', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Explore RandomForestClassifier through Cross-Validation

In [3]:
# importing packages
from sklearn.model_selection import KFold, cross_val_score
In [4]:
# instantiate our RF classifier
rf = RandomForestClassifier(n_jobs=-1) # n_jobs=-1 allows model to run faster bu building the individual decision trees in parallel

# setting up KFold
k_fold = KFold(n_splits=5) # spilitting dataset into 5 subsets

# scoring our model
cross_val_score(rf, X_features, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)

# ''' 
#     Parameters for the model:
#     rf = model used
#     X_features = input features
#     data['label'] = label
#     cv = how we are splitting the original dataset
#     scoring = metric of scoring to use our model
#     n_jobs = allows model to run faster bu building the individual decision trees in parallel 
# '''
Out[4]:
array([0.97127469, 0.97755835, 0.96675651, 0.96495957, 0.96136568])

Building a Random Forest Model on a holdout test set

Reading and cleaning raw text data

In [5]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# defining parameters
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

#fitting and vectorizing count_vect to our body_text
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# building a datafram
X_features = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()
Out[5]:
body_len punc% 0 1 2 3 4 5 6 7 ... 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103
0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8106 columns

Explore RandomForestClassifier through Holdout Test

In [6]:
# importing packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
In [7]:
# splitting test and train dataset
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
In [8]:
from sklearn.ensemble import RandomForestClassifier

# instantiate our RF classifier
rf = RandomForestClassifier(n_estimators=50, max_depth=20,n_jobs=-1)

# fitting the rf model
rf_model = rf.fit(X_train, y_train)
In [9]:
# a quick look at feature importance
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]
Out[9]:
[(0.0601724589565967, 'body_len'),
 (0.04331012686364079, 7350),
 (0.03529203415761097, 4796),
 (0.031698826882396225, 1803),
 (0.02141447446567646, 2031),
 (0.019594484496858064, 5724),
 (0.017852883055057472, 392),
 (0.01703567710369901, 3134),
 (0.01652325505991233, 6285),
 (0.016355554031669006, 7027)]
In [10]:
# making prediction
y_pred = rf_model.predict(X_test)

# looking at actual performance metrics
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary') 
In [11]:
print("Precision: {}\nRecall: {}\nAccuracy {}".format(round(precision, 3),
                                                         round(recall, 3),
                                                         round((y_pred==y_test).sum() / len(y_pred), 3)))
Precision: 1.0
Recall: 0.596
Accuracy 0.94

What does it mean? (in context of spam model)

  • Precision of 100% measns when the model identified something as spam, it was 100% right.
  • Recall of 59.6% means of all the spam that comes into email, 59.4% was properly palced in the spam folder.
  • Accuracy of 94% means of all the spam that comes into email, 94.3% of the emails were correctly identified as spam or ham

Grid Search exhaustively search all parameter combination in a given grid to determine the best model.

Reading and cleaning raw text data

In [12]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# defining parameters
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

#fitting and vectorizing count_vect to our body_text
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# building a datafram
X_features = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()
Out[12]:
body_len punc% 0 1 2 3 4 5 6 7 ... 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103
0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8106 columns

In [13]:
# importing packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.ensemble import RandomForestClassifier
In [14]:
# splitting test and train dataset
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
In [15]:
# creating a function for RF classifier
def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth,n_jobs=-1) # instantiate our RF classifier
    rf_model = rf.fit(X_train, y_train) # fitting the rf model
    y_pred = rf_model.predict(X_test) #making prediction 
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary') #generating result matrix 
    print("No. of Estimators used: {} / Depth: {} ---- Precision: {} Recall: {} Accuracy {}".format(n_est, 
                                                                                                         depth,
                                                                                                         round(precision, 3),
                                                                                                         round(recall, 3),
                                                                                                         round((y_pred==y_test).sum() / len(y_pred), 3)))
In [16]:
# creating nested for loop for n_est and depth defined in the function
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)
No. of Estimators used: 10 / Depth: 10 ---- Precision: 1.0 Recall: 0.213 Accuracy 0.89
No. of Estimators used: 10 / Depth: 20 ---- Precision: 1.0 Recall: 0.568 Accuracy 0.94
No. of Estimators used: 10 / Depth: 30 ---- Precision: 1.0 Recall: 0.69 Accuracy 0.957
No. of Estimators used: 10 / Depth: None ---- Precision: 1.0 Recall: 0.768 Accuracy 0.968
No. of Estimators used: 50 / Depth: 10 ---- Precision: 1.0 Recall: 0.239 Accuracy 0.894
No. of Estimators used: 50 / Depth: 20 ---- Precision: 1.0 Recall: 0.542 Accuracy 0.936
No. of Estimators used: 50 / Depth: 30 ---- Precision: 1.0 Recall: 0.703 Accuracy 0.959
No. of Estimators used: 50 / Depth: None ---- Precision: 0.992 Recall: 0.794 Accuracy 0.97
No. of Estimators used: 100 / Depth: 10 ---- Precision: 1.0 Recall: 0.303 Accuracy 0.903
No. of Estimators used: 100 / Depth: 20 ---- Precision: 1.0 Recall: 0.581 Accuracy 0.942
No. of Estimators used: 100 / Depth: 30 ---- Precision: 1.0 Recall: 0.703 Accuracy 0.959
No. of Estimators used: 100 / Depth: None ---- Precision: 1.0 Recall: 0.8 Accuracy 0.972

Evaluating Random Forest model wtih Grid Search and Cross-validation (GridSearchCV)

Reading and cleaning raw text data

In [17]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data
X_tfidf_feat = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['body_text']) # this will fit and vectorize the data
X_count_feat = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_count.toarray())], axis=1)

Exploring parameter settings with GridSearch CV

In [18]:
# importing packages
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
In [19]:
# instantiate our GS classifier
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
        'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1, return_train_score=True)

gs_fit = gs.fit(X_tfidf_feat, data['label']) # fitting the rf model

pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
Out[19]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
10 25.442637 0.724721 0.411232 0.070535 None 150 {'max_depth': None, 'n_estimators': 150} 0.980269 0.976640 0.974843 ... 0.974852 0.003422 1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000
11 48.211681 3.069400 0.516039 0.041333 None 300 {'max_depth': None, 'n_estimators': 300} 0.979372 0.977538 0.975741 ... 0.974493 0.004169 2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000
7 24.115877 2.300286 0.416890 0.099565 90 150 {'max_depth': 90, 'n_estimators': 150} 0.978475 0.978437 0.974843 ... 0.974133 0.004638 3 0.999326 0.999102 0.999102 0.999551 0.998877 0.999192 0.000229
8 46.765986 0.149628 0.599653 0.079758 90 300 {'max_depth': 90, 'n_estimators': 300} 0.979372 0.979335 0.973944 ... 0.974133 0.004639 3 0.999326 0.999326 0.999102 0.999775 0.999102 0.999326 0.000246
6 3.042821 0.411000 0.159411 0.048053 90 10 {'max_depth': 90, 'n_estimators': 10} 0.973991 0.977538 0.975741 ... 0.973594 0.004042 5 0.998428 0.998653 0.997979 0.997979 0.996857 0.997979 0.000619

5 rows × 22 columns

In [20]:
# instantiate our RF classifier
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
        'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1, return_train_score=True)

gs_fit = gs.fit(X_count_feat, data['label']) # fitting the rf model

pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
Out[20]:
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
8 35.876793 0.556350 0.435337 0.052178 90 300 {'max_depth': 90, 'n_estimators': 300} 0.979372 0.975741 0.973944 ... 0.973954 0.003606 1 0.999326 0.998877 0.999102 0.999326 0.998877 0.999102 0.000201
7 18.402876 0.377368 0.276875 0.018294 90 150 {'max_depth': 90, 'n_estimators': 150} 0.980269 0.973944 0.974843 ... 0.973415 0.004212 2 0.999102 0.998428 0.998653 0.999326 0.998877 0.998877 0.000317
11 37.849599 0.555546 0.484837 0.097426 None 300 {'max_depth': None, 'n_estimators': 300} 0.978475 0.975741 0.974843 ... 0.973235 0.004045 3 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000
10 20.302975 0.615619 0.330630 0.054201 None 150 {'max_depth': None, 'n_estimators': 150} 0.977578 0.972147 0.973046 ... 0.972517 0.003157 4 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000
3 3.089313 0.190326 0.151011 0.022885 60 10 {'max_depth': 60, 'n_estimators': 10} 0.975785 0.971249 0.976640 ... 0.972157 0.003647 5 0.994609 0.992142 0.994387 0.991917 0.991917 0.992995 0.001232

5 rows × 22 columns

Gradient Boosting (GB)

GB is an ensemble learning method that takes an iterative approach to combining weak learners to create a strong learner by focusing on mistakes of prior iterations.

Pros:

  • Extremely powerful
  • Accepts various types of inputs
  • Can be used for classification or regression
  • Outputs feature importance

Cons:

  • Longer to train (can't be parallelized)
  • More likely to overfit
  • More difficult to properly tune

Reading and cleaning raw text data

In [21]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# defining parameters
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

#fitting and vectorizing count_vect to our body_text
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data

# building a datafram
X_features = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()
Out[21]:
body_len punc% 0 1 2 3 4 5 6 7 ... 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103
0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8106 columns

Exploring GBClassifier Attributes and HyperParameters

In [22]:
from sklearn.ensemble import GradientBoostingClassifier

print(dir(GradientBoostingClassifier))
print(GradientBoostingClassifier())
['_SUPPORTED_LOSS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_initialized', '_check_params', '_clear_state', '_decision_function', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_init_decision_function', '_init_state', '_is_initialized', '_make_estimator', '_resize_state', '_staged_decision_function', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'feature_importances_', 'fit', 'get_params', 'n_features', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'staged_decision_function', 'staged_predict', 'staged_predict_proba']
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
In [23]:
# importing packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
In [24]:
# splitting test and train dataset
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
In [25]:
# creating a function for RF classifier
def train_GB(n_est, max_depth, lr):
    gb = GradientBoostingClassifier(n_estimators=n_est, max_depth=max_depth,learning_rate=lr) # instantiate our GB classifier
    gb_model = gb.fit(X_train, y_train) # fitting the gb model
    y_pred = gb_model.predict(X_test) #making prediction 
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary') #generating result matrix 
    print("No. of Estimators used: {} / Depth: {} / Learning Rate: {} ---- Precision: {} Recall: {} Accuracy {}".format(n_est, 
                                                                                                         depth,
                                                                                                         lr,
                                                                                                         round(precision, 3),
                                                                                                         round(recall, 3),
                                                                                                         round((y_pred==y_test).sum() / len(y_pred), 3)))
In [26]:
# creating nested for loop for n_est and depth defined in the function
# it is going to take a minimum of 40 minutes to run this code
for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.01, 0.1, 1]:
            train_GB(n_est, max_depth, lr)
C:\Users\saluj\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.0 Recall: 0.0 Accuracy 0.866
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.973 Recall: 0.725 Accuracy 0.961
No. of Estimators used: 50 / Depth: None / Learning Rate: 1 ---- Precision: 0.902 Recall: 0.745 Accuracy 0.955
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.01 ---- Precision: 1.0 Recall: 0.02 Accuracy 0.869
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.926 Recall: 0.752 Accuracy 0.959
No. of Estimators used: 50 / Depth: None / Learning Rate: 1 ---- Precision: 0.907 Recall: 0.785 Accuracy 0.961
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.01 ---- Precision: 1.0 Recall: 0.013 Accuracy 0.868
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.899 Recall: 0.779 Accuracy 0.959
No. of Estimators used: 50 / Depth: None / Learning Rate: 1 ---- Precision: 0.871 Recall: 0.819 Accuracy 0.96
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.01 ---- Precision: 1.0 Recall: 0.007 Accuracy 0.867
No. of Estimators used: 50 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.912 Recall: 0.765 Accuracy 0.959
No. of Estimators used: 50 / Depth: None / Learning Rate: 1 ---- Precision: 0.844 Recall: 0.832 Accuracy 0.957
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.974 Recall: 0.51 Accuracy 0.933
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.966 Recall: 0.765 Accuracy 0.965
No. of Estimators used: 100 / Depth: None / Learning Rate: 1 ---- Precision: 0.902 Recall: 0.805 Accuracy 0.962
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.959 Recall: 0.624 Accuracy 0.946
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.927 Recall: 0.765 Accuracy 0.961
No. of Estimators used: 100 / Depth: None / Learning Rate: 1 ---- Precision: 0.898 Recall: 0.772 Accuracy 0.958
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.939 Recall: 0.718 Accuracy 0.956
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.901 Recall: 0.792 Accuracy 0.961
No. of Estimators used: 100 / Depth: None / Learning Rate: 1 ---- Precision: 0.885 Recall: 0.826 Accuracy 0.962
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.932 Recall: 0.732 Accuracy 0.957
No. of Estimators used: 100 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.917 Recall: 0.812 Accuracy 0.965
No. of Estimators used: 100 / Depth: None / Learning Rate: 1 ---- Precision: 0.866 Recall: 0.826 Accuracy 0.96
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.976 Recall: 0.55 Accuracy 0.938
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.966 Recall: 0.772 Accuracy 0.966
No. of Estimators used: 150 / Depth: None / Learning Rate: 1 ---- Precision: 0.934 Recall: 0.758 Accuracy 0.961
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.951 Recall: 0.658 Accuracy 0.95
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.936 Recall: 0.785 Accuracy 0.964
No. of Estimators used: 150 / Depth: None / Learning Rate: 1 ---- Precision: 0.896 Recall: 0.805 Accuracy 0.961
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.914 Recall: 0.711 Accuracy 0.952
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.916 Recall: 0.805 Accuracy 0.964
No. of Estimators used: 150 / Depth: None / Learning Rate: 1 ---- Precision: 0.869 Recall: 0.799 Accuracy 0.957
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.01 ---- Precision: 0.933 Recall: 0.745 Accuracy 0.959
No. of Estimators used: 150 / Depth: None / Learning Rate: 0.1 ---- Precision: 0.924 Recall: 0.812 Accuracy 0.966
No. of Estimators used: 150 / Depth: None / Learning Rate: 1 ---- Precision: 0.851 Recall: 0.805 Accuracy 0.955

Evaluating Gradient Boosting with GridSearchCV

Reading and cleaning raw text data

In [27]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text']) # this will fit and vectorize the data
X_tfidf_feat = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['body_text']) # this will fit and vectorize the data
X_count_feat = pd.concat([data['body_len'], data['punc%'], pd.DataFrame(X_count.toarray())], axis=1)

Exploring parameter settings with GridCV

In [28]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
In [29]:
# instantiate our GS classifier
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

gs = GridSearchCV(gb, param, cv=5, n_jobs=-1, return_train_score=True)

cv_fit = gs.fit(X_tfidf_feat, data['label']) # fitting the rf model

pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
Out[29]:
mean_fit_time std_fit_time mean_score_time std_score_time param_learning_rate param_max_depth param_n_estimators params split0_test_score split1_test_score ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
5 387.705381 31.857941 0.195528 0.022183 0.1 15 150 {'learning_rate': 0.1, 'max_depth': 15, 'n_est... 0.964126 0.977538 ... 0.969643 0.004437 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
1 193.099268 1.252441 0.203705 0.014330 0.1 7 150 {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... 0.965022 0.978437 ... 0.969463 0.005103 2 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3 301.014254 2.678897 0.203527 0.009649 0.1 11 150 {'learning_rate': 0.1, 'max_depth': 11, 'n_est... 0.964126 0.979335 ... 0.969463 0.005164 2 1.0 1.0 1.0 1.0 1.0 1.0 0.0
0 126.180470 4.195428 0.254553 0.031022 0.1 7 100 {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... 0.962332 0.977538 ... 0.968205 0.005304 4 1.0 1.0 1.0 1.0 1.0 1.0 0.0
4 285.958562 3.705573 0.189598 0.012141 0.1 15 100 {'learning_rate': 0.1, 'max_depth': 15, 'n_est... 0.964126 0.973046 ... 0.968205 0.002913 4 1.0 1.0 1.0 1.0 1.0 1.0 0.0

5 rows × 23 columns

In [30]:
# instantiate our GS classifier
# it is going to take a minimum of 40 minutes to run this code
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

gs = GridSearchCV(gb, param, cv=5, n_jobs=-1, return_train_score=True)

cv_fit = gs.fit(X_count_feat, data['label']) # fitting the rf model

pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
Out[30]:
mean_fit_time std_fit_time mean_score_time std_score_time param_learning_rate param_max_depth param_n_estimators params split0_test_score split1_test_score ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
3 293.519093 1.613671 0.199143 0.002615 0.1 11 150 {'learning_rate': 0.1, 'max_depth': 11, 'n_est... 0.965022 0.976640 ... 0.969822 0.004146 1 1.0 1.0 1.000000 1.0 1.000000 1.00000 0.00000
5 377.113846 29.543739 0.197419 0.018796 0.1 15 150 {'learning_rate': 0.1, 'max_depth': 15, 'n_est... 0.964126 0.975741 ... 0.969822 0.003944 1 1.0 1.0 1.000000 1.0 1.000000 1.00000 0.00000
4 277.299901 2.055962 0.205615 0.012847 0.1 15 100 {'learning_rate': 0.1, 'max_depth': 15, 'n_est... 0.962332 0.977538 ... 0.969283 0.005386 3 1.0 1.0 1.000000 1.0 1.000000 1.00000 0.00000
2 204.725275 1.143846 0.203540 0.015488 0.1 11 100 {'learning_rate': 0.1, 'max_depth': 11, 'n_est... 0.963229 0.975741 ... 0.968205 0.004727 4 1.0 1.0 1.000000 1.0 1.000000 1.00000 0.00000
0 124.647944 4.132663 0.198463 0.014606 0.1 7 100 {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... 0.962332 0.979335 ... 0.967667 0.006028 5 1.0 1.0 0.999775 1.0 0.999775 0.99991 0.00011

5 rows × 23 columns

Model Selection: Data Prep

Process:

  1. Split the data into training and test set.
  2. Train vectorizers on training set and use that to transform test set.
  3. Fit best Random Forest model and best Gradient Boosting model on training set and predict on test set.
  4. Thoroughly evaluate results of these 2 models to select the best model.
In [1]:
# importing packages
import nltk
import pandas as pd
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


#importing Porter Stemmer
ps = nltk.PorterStemmer()

#assigning stopwords
stopwords = nltk.corpus.stopwords.words('english')

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t')
data.columns = ['label', 'body_text']

# defining a function to count punctuations
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation]) # total count of punctuation
    return round(count/(len(text) - text.count(" ")), 3)*100 # No. of punctuations divided by length of message (excluding whitespaces as we did above)

# to calculate the correct length of messages, we will deduct count of whitespaces
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))

# to calculate % of puncuation
data['punc%'] = data['body_text'].apply(lambda x: count_punct(x))

# combining 3 seperate function to clean our text  
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

Split into train/test set

In [2]:
# importing packages
from sklearn.model_selection import train_test_split

# splitting test and train dataset
X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punc%']], data['label'], test_size=0.2)

Vectorizing text

In [4]:
# defining parameters
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

# fitting tfidf_vect to our body_text
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text']) # this will only fit the data in X_train

# training the fit training set
tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])

# using the same vectorizer to transform our test set
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

# tfidf_train and tfidf_test will have the same number of columns as they are both
# transformed using tfidf_vect_fit which was trained on the training set. So it will
# recognize only the words which are in the training set

# concatenating vectorized data back with body length and punctuatuation to give us X_features
X_train_vect = pd.concat([X_train[['body_len', 'punc%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)

X_test_vect = pd.concat([X_test[['body_len', 'punc%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()
Out[4]:
body_len punc% 0 1 2 3 4 5 6 7 ... 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102
0 117 4.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 98 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 35 8.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 51 25.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 77 13.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 7105 columns

Final evaluation of models

In [7]:
# importing functions 
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time
In [9]:
# instantiate our RF classifier
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1) # n_jobs=-1 allows model to run faster by building the individual decision trees in parallel

start = time.time()
# fitting the rf model
rf_model = rf.fit(X_train_vect, y_train)

end = time.time()

fit_time = (end - start)


start = time.time()
# making prediction
y_pred = rf_model.predict(X_test_vect)

end = time.time()

pred_time = (end - start)

# looking at actual performance metrics
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary') 

print("Fit Time: {}\nPreict Time: {}\nPrecision: {}\nRecall: {}\nAccuracy {}".format(
                                                         round(fit_time, 3),
                                                         round(pred_time, 3),
                                                         round(precision, 3),
                                                         round(recall, 3),
                                                         round((y_pred==y_test).sum() / len(y_pred), 3)))
Fit Time: 3.451
Preict Time: 0.226
Precision: 1.0
Recall: 0.764
Accuracy 0.967

Analyzing results with Gradient Boosting

In [10]:
# instantiate our GB classifier
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
# fitting the gb model
gb_model = gb.fit(X_train_vect, y_train)

end = time.time()

fit_time = (end - start)


start = time.time()
# making prediction
y_pred = gb_model.predict(X_test_vect)

end = time.time()

pred_time = (end - start)

# looking at actual performance metrics
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary') 

print("Fit Time: {}\nPreict Time: {}\nPrecision: {}\nRecall: {}\nAccuracy {}".format(
                                                         round(fit_time, 3),
                                                         round(pred_time, 3),
                                                         round(precision, 3),
                                                         round(recall, 3),
                                                         round((y_pred==y_test).sum() / len(y_pred), 3)))
Fit Time: 344.131
Preict Time: 0.151
Precision: 0.939
Recall: 0.783
Accuracy 0.962

Author: Amandeep Saluja