1. Natural Language Processing (NLP) Basics

What are NLP and NLTK (Natural Language Toolkit)?

  • NLP is a field concerned with a computer's ability to understand, analyze, manipulate, and potentially generate human language. It encompasses many topics such as:
    • Sentiment Analysis
    • Topic Modeling
    • Text Classification
    • Sentence Segmentation and Part-of-Speech Tagging
  • NLTK is a suite of open-source tools that makes building NLP processes in Python easier.

Installing NLTK package

In [1]:
!pip install nltk
Requirement already satisfied: nltk in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (3.4)
Requirement already satisfied: six in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from nltk) (1.12.0)
Requirement already satisfied: singledispatch in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from nltk) (3.4.0.3)

Downloading the NLTK Data

In [2]:
# opens the NLTK downloader pop-up window; download all the data from there
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[2]:
True
In [3]:
# viewing all the functions, attributes, and methods within the nltk package
dir(nltk)
Out[3]:
['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGraph',
 'DependencyProduction',
 'DictionaryConditionalProbDist',
 'DictionaryProbDist',
 'DiscourseTester',
 'DrtExpression',
 'DrtGlueReadingCommand',
 'ELEProbDist',
 'EarleyChartParser',
 'Expression',
 'FStructure',
 'FeatDict',
 'FeatList',
 'FeatStruct',
 'FeatStructReader',
 'Feature',
 'FeatureBottomUpChartParser',
 'FeatureBottomUpLeftCornerChartParser',
 'FeatureChartParser',
 'FeatureEarleyChartParser',
 'FeatureIncrementalBottomUpChartParser',
 'FeatureIncrementalBottomUpLeftCornerChartParser',
 'FeatureIncrementalChartParser',
 'FeatureIncrementalTopDownChartParser',
 'FeatureTopDownChartParser',
 'FreqDist',
 'HTTPPasswordMgrWithDefaultRealm',
 'HeldoutProbDist',
 'HiddenMarkovModelTagger',
 'HiddenMarkovModelTrainer',
 'HunposTagger',
 'IBMModel',
 'IBMModel1',
 'IBMModel2',
 'IBMModel3',
 'IBMModel4',
 'IBMModel5',
 'ISRIStemmer',
 'ImmutableMultiParentedTree',
 'ImmutableParentedTree',
 'ImmutableProbabilisticMixIn',
 'ImmutableProbabilisticTree',
 'ImmutableTree',
 'IncrementalBottomUpChartParser',
 'IncrementalBottomUpLeftCornerChartParser',
 'IncrementalChartParser',
 'IncrementalLeftCornerChartParser',
 'IncrementalTopDownChartParser',
 'Index',
 'InsideChartParser',
 'JSONTaggedDecoder',
 'JSONTaggedEncoder',
 'KneserNeyProbDist',
 'LancasterStemmer',
 'LaplaceProbDist',
 'LazyConcatenation',
 'LazyEnumerate',
 'LazyIteratorList',
 'LazyMap',
 'LazySubsequence',
 'LazyZip',
 'LeftCornerChartParser',
 'LidstoneProbDist',
 'LineTokenizer',
 'LogicalExpressionException',
 'LongestChartParser',
 'MLEProbDist',
 'MWETokenizer',
 'Mace',
 'MaceCommand',
 'MaltParser',
 'MaxentClassifier',
 'Model',
 'MultiClassifierI',
 'MultiParentedTree',
 'MutableProbDist',
 'NaiveBayesClassifier',
 'NaiveBayesDependencyScorer',
 'NgramAssocMeasures',
 'NgramTagger',
 'NonprojectiveDependencyParser',
 'Nonterminal',
 'OrderedDict',
 'PCFG',
 'Paice',
 'ParallelProverBuilder',
 'ParallelProverBuilderCommand',
 'ParentedTree',
 'ParserI',
 'PerceptronTagger',
 'PhraseTable',
 'PorterStemmer',
 'PositiveNaiveBayesClassifier',
 'ProbDistI',
 'ProbabilisticDependencyGrammar',
 'ProbabilisticMixIn',
 'ProbabilisticNonprojectiveParser',
 'ProbabilisticProduction',
 'ProbabilisticProjectiveDependencyParser',
 'ProbabilisticTree',
 'Production',
 'ProjectiveDependencyParser',
 'Prover9',
 'Prover9Command',
 'ProxyBasicAuthHandler',
 'ProxyDigestAuthHandler',
 'ProxyHandler',
 'PunktSentenceTokenizer',
 'QuadgramCollocationFinder',
 'RSLPStemmer',
 'RTEFeatureExtractor',
 'RUS_PICKLE',
 'RandomChartParser',
 'RangeFeature',
 'ReadingCommand',
 'RecursiveDescentParser',
 'RegexpChunkParser',
 'RegexpParser',
 'RegexpStemmer',
 'RegexpTagger',
 'RegexpTokenizer',
 'ReppTokenizer',
 'ResolutionProver',
 'ResolutionProverCommand',
 'SExprTokenizer',
 'SLASH',
 'Senna',
 'SennaChunkTagger',
 'SennaNERTagger',
 'SennaTagger',
 'SequentialBackoffTagger',
 'ShiftReduceParser',
 'SimpleGoodTuringProbDist',
 'SklearnClassifier',
 'SlashFeature',
 'SnowballStemmer',
 'SpaceTokenizer',
 'StackDecoder',
 'StanfordNERTagger',
 'StanfordPOSTagger',
 'StanfordSegmenter',
 'StanfordTagger',
 'StemmerI',
 'SteppingChartParser',
 'SteppingRecursiveDescentParser',
 'SteppingShiftReduceParser',
 'TYPE',
 'TabTokenizer',
 'TableauProver',
 'TableauProverCommand',
 'TaggerI',
 'TestGrammar',
 'Text',
 'TextCat',
 'TextCollection',
 'TextTilingTokenizer',
 'TnT',
 'TokenSearcher',
 'ToktokTokenizer',
 'TopDownChartParser',
 'TransitionParser',
 'Tree',
 'TreebankWordTokenizer',
 'Trie',
 'TrigramAssocMeasures',
 'TrigramCollocationFinder',
 'TrigramTagger',
 'TweetTokenizer',
 'TypedMaxentFeatureEncoding',
 'Undefined',
 'UniformProbDist',
 'UnigramTagger',
 'UnsortedChartParser',
 'Valuation',
 'Variable',
 'ViterbiParser',
 'WekaClassifier',
 'WhitespaceTokenizer',
 'WittenBellProbDist',
 'WordNetLemmatizer',
 'WordPunctTokenizer',
 '__author__',
 '__author_email__',
 '__builtins__',
 '__cached__',
 '__classifiers__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__keywords__',
 '__license__',
 '__loader__',
 '__longdescr__',
 '__maintainer__',
 '__maintainer_email__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__url__',
 '__version__',
 'absolute_import',
 'accuracy',
 'add_logs',
 'agreement',
 'align',
 'alignment_error_rate',
 'aline',
 'api',
 'app',
 'apply_features',
 'approxrand',
 'arity',
 'association',
 'bigrams',
 'binary_distance',
 'binary_search_file',
 'binding_ops',
 'bisect',
 'blankline_tokenize',
 'bleu',
 'bleu_score',
 'bllip',
 'boolean_ops',
 'boxer',
 'bracket_parse',
 'breadth_first',
 'brill',
 'brill_trainer',
 'build_opener',
 'call_megam',
 'casual',
 'casual_tokenize',
 'ccg',
 'chain',
 'chart',
 'chat',
 'choose',
 'chunk',
 'cistem',
 'class_types',
 'classify',
 'clause',
 'clean_html',
 'clean_url',
 'cluster',
 'collections',
 'collocations',
 'combinations',
 'compat',
 'config_java',
 'config_megam',
 'config_weka',
 'conflicts',
 'confusionmatrix',
 'conllstr2tree',
 'conlltags2tree',
 'corenlp',
 'corpus',
 'crf',
 'custom_distance',
 'data',
 'decisiontree',
 'decorator',
 'decorators',
 'defaultdict',
 'demo',
 'dependencygraph',
 'deque',
 'discourse',
 'distance',
 'download',
 'download_gui',
 'download_shell',
 'downloader',
 'draw',
 'drt',
 'earleychart',
 'edit_distance',
 'elementtree_indent',
 'entropy',
 'equality_preds',
 'evaluate',
 'evaluate_sents',
 'everygrams',
 'extract_rels',
 'extract_test_sentences',
 'f_measure',
 'featstruct',
 'featurechart',
 'filestring',
 'find',
 'flatten',
 'fractional_presence',
 'getproxies',
 'ghd',
 'glue',
 'grammar',
 'guess_encoding',
 'help',
 'hmm',
 'hunpos',
 'ibm1',
 'ibm2',
 'ibm3',
 'ibm4',
 'ibm5',
 'ibm_model',
 'ieerstr2tree',
 'improved_close_quote_regex',
 'improved_open_quote_regex',
 'improved_open_single_quote_regex',
 'improved_punct_regex',
 'in_idle',
 'induce_pcfg',
 'inference',
 'infile',
 'inspect',
 'install_opener',
 'internals',
 'interpret_sents',
 'interval_distance',
 'invert_dict',
 'invert_graph',
 'is_rel',
 'islice',
 'isri',
 'jaccard_distance',
 'json_tags',
 'jsontags',
 'lancaster',
 'lazyimport',
 'lfg',
 'line_tokenize',
 'linearlogic',
 'load',
 'load_parser',
 'locale',
 'log_likelihood',
 'logic',
 'mace',
 'malt',
 'map_tag',
 'mapping',
 'masi_distance',
 'maxent',
 'megam',
 'memoize',
 'metrics',
 'misc',
 'mwe',
 'naivebayes',
 'ne_chunk',
 'ne_chunk_sents',
 'ngrams',
 'nonprojectivedependencyparser',
 'nonterminals',
 'numpy',
 'os',
 'pad_sequence',
 'paice',
 'parse',
 'parse_sents',
 'pchart',
 'perceptron',
 'pk',
 'porter',
 'pos_tag',
 'pos_tag_sents',
 'positivenaivebayes',
 'pprint',
 'pr',
 'precision',
 'presence',
 'print_function',
 'print_string',
 'probability',
 'projectivedependencyparser',
 'prover9',
 'punkt',
 'py25',
 'py26',
 'py27',
 'pydoc',
 'python_2_unicode_compatible',
 'raise_unorderable_types',
 'ranks_from_scores',
 'ranks_from_sequence',
 're',
 're_show',
 'read_grammar',
 'read_logic',
 'read_valuation',
 'recall',
 'recursivedescent',
 'regexp',
 'regexp_span_tokenize',
 'regexp_tokenize',
 'register_tag',
 'relextract',
 'repp',
 'resolution',
 'ribes',
 'ribes_score',
 'root_semrep',
 'rslp',
 'rte_classifier',
 'rte_classify',
 'rte_features',
 'rtuple',
 'scikitlearn',
 'scores',
 'segmentation',
 'sem',
 'senna',
 'sent_tokenize',
 'sequential',
 'set2rel',
 'set_proxy',
 'sexpr',
 'sexpr_tokenize',
 'shiftreduce',
 'simple',
 'sinica_parse',
 'skipgrams',
 'skolemize',
 'slice_bounds',
 'snowball',
 'spearman',
 'spearman_correlation',
 'stack_decoder',
 'stanford',
 'stanford_segmenter',
 'stem',
 'str2tuple',
 'string_span_tokenize',
 'string_types',
 'subprocess',
 'subsumes',
 'sum_logs',
 'sys',
 'tableau',
 'tadm',
 'tag',
 'tagset_mapping',
 'tagstr2tree',
 'tbl',
 'text',
 'text_type',
 'textcat',
 'texttiling',
 'textwrap',
 'tkinter',
 'tnt',
 'tokenize',
 'tokenwrap',
 'toktok',
 'toolbox',
 'total_ordering',
 'transitionparser',
 'transitive_closure',
 'translate',
 'tree',
 'tree2conllstr',
 'tree2conlltags',
 'treebank',
 'treetransforms',
 'trigrams',
 'tuple2str',
 'types',
 'unify',
 'unique_list',
 'untag',
 'usage',
 'util',
 'version_file',
 'version_info',
 'viterbi',
 'weka',
 'windowdiff',
 'word_tokenize',
 'wordnet',
 'wordpunct_tokenize',
 'wsd']

What can you do with NLTK?

In [4]:
# an example: viewing the stopwords in the nltk library
from nltk.corpus import stopwords

# view every 25th English stopword, up to index 500
stopwords.words('english')[0:500:25]
Out[4]:
['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']
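
As a quick taste of what else the toolkit offers, here is a minimal sketch of NLTK's sentence and word tokenizers (this assumes the punkt models were fetched by nltk.download() above; the sample string is made up):

# splitting a made-up string into sentences and words with NLTK's tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

sample = "NLTK makes NLP easier. It ships with tokenizers, stemmers, and corpora."
print(sent_tokenize(sample))  # ['NLTK makes NLP easier.', 'It ships with tokenizers, stemmers, and corpora.']
print(word_tokenize(sample))  # ['NLTK', 'makes', 'NLP', 'easier', '.', 'It', ...]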

Unstructured data

Unstructured data has no predefined structure: it could be binary data, or text with no delimiters and no indication of rows.

Reading in text data (Method 1: Lengthy way)

In [5]:
# Reading the raw file
rawData = open("SMSSpamCollection.tsv").read()

# Printing out the raw data
rawData[0:500]
Out[5]:
"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"
In [6]:
# replacing \t with \n
parsedData = rawData.replace('\t', '\n')
print(parsedData[0:500])
ham
I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.
spam
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham
Nah I don't think he goes to usf, he lives around here though
ham
Even my brother is not like to speak with me. They treat me like aid
In [7]:
# converting the text into a list by splitting on newlines
parsedData = parsedData.split('\n')
parsedData[0:4]
Out[7]:
['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]
In [8]:
# creating a list of ham/spam labels (every other item, starting at index 0)
labelList = parsedData[0::2]

# creating a list of message texts (every other item, starting at index 1)
textList = parsedData[1::2]

print(labelList[0:5])
print(textList[0:5])
['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']
In [9]:
# combining labelList and textList using pandas
import pandas as pd

fullCorpus = pd.DataFrame({
    'label': labelList,
    'body_list': textList})

fullCorpus.head() # will throw an error as arrays are not of same length
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-4d9c660d1021> in <module>
      4 fullCorpus = pd.DataFrame({
      5     'label': labelList,
----> 6     'body_list': textList})
      7 
      8 fullCorpus.head() # will throw an error as arrays are not of same length

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    390                                  dtype=dtype, copy=copy)
    391         elif isinstance(data, dict):
--> 392             mgr = init_dict(data, index, columns, dtype=dtype)
    393         elif isinstance(data, ma.MaskedArray):
    394             import numpy.ma.mrecords as mrecords

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
    210         arrays = [data[k] for k in keys]
    211 
--> 212     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    213 
    214 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
     49     # figure out the index, if necessary
     50     if index is None:
---> 51         index = extract_index(arrays)
     52     else:
     53         index = ensure_index(index)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
    315             lengths = list(set(raw_lengths))
    316             if len(lengths) > 1:
--> 317                 raise ValueError('arrays must all be same length')
    318 
    319             if have_dicts:

ValueError: arrays must all be same length
In [10]:
# checking the lengths of both lists
print(len(labelList))
print(len(textList))
5571
5570
In [11]:
# printing out the last 5 items of labelList
print(labelList[-5:])
['ham', 'ham', 'ham', 'ham', '']
In [12]:
# dropping the last (empty) item from labelList

fullCorpus = pd.DataFrame({
    'label': labelList[0:-1],
    'body_list': textList})

fullCorpus.head()
Out[12]:
label body_list
0 ham I've been searching for the right words to tha...
1 spam Free entry in 2 a wkly comp to win FA Cup fina...
2 ham Nah I don't think he goes to usf, he lives aro...
3 ham Even my brother is not like to speak with me. ...
4 ham I HAVE A DATE ON SUNDAY WITH WILL!!

Reading in text data (Method 2: Easy way using pandas)

In [13]:
fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep = '\t', header = None, names = ['label', 'body_list'])
fullCorpus.head()
Out[13]:
label body_list
0 ham I've been searching for the right words to tha...
1 spam Free entry in 2 a wkly comp to win FA Cup fina...
2 ham Nah I don't think he goes to usf, he lives aro...
3 ham Even my brother is not like to speak with me. ...
4 ham I HAVE A DATE ON SUNDAY WITH WILL!!

Exploring the dataset

In [14]:
# checking the number of rows and columns (method 1)
fullCorpus.shape
Out[14]:
(5568, 2)
In [15]:
# checking the number of rows and columns using a print statement (method 2)
print('The fullCorpus dataset has {0} columns and {1} rows'.format(fullCorpus.shape[1], fullCorpus.shape[0]))
The fullCorpus dataset has 2 columns and 5568 rows
In [16]:
# checking how many ham and spam rows there are in the label column
print("There are {0} ham and {1} spam rows".format(len(fullCorpus[fullCorpus.label == 'ham']),
                                                  len(fullCorpus[fullCorpus.label == 'spam'])))
There are 4822 ham and 746 spam rows
In [17]:
# checking if there is any missing data
print("There are {0} null rows in label column".format(fullCorpus['label'].isnull().sum()))
print("There are {0} null rows in body_list column".format(fullCorpus['body_list'].isnull().sum()))
There are 0 null rows in label column
There are 0 null rows in body_list column

Regular Expressions (Regex)

Regex is a text string used to describe a search pattern. If you want to check your regex search pattern, you can visit https://regex101.com/. It is probably (in my opinion) one of the best resources for testing your regex.

Use of Regex:

  • Identifying the whitespace between words/tokens.
  • Identifying/creating delimiters or end-of-line escape characters.
  • Removing punctuation or numbers from the text.
  • Cleaning HTML tags from text (see the sketch after these lists).
  • Identifying some textual patterns.

Use cases:

  • Confirming passwords meet criteria (see the sketch after these lists).
  • Searching URL for some substring.
  • Searching for files on computers.
  • Document scraping.
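
As a quick illustration of two items above (cleaning HTML tags and confirming passwords), here is a minimal sketch; the sample strings and the set of special characters checked are made-up assumptions:

# stripping HTML tags from a made-up string with re.sub
import re

html = '<p>Hello <b>world</b></p>'
print(re.sub('<[^>]+>', '', html))  # 'Hello world'

# confirming a made-up password contains at least one digit and one special character
password = 'Secr3t!pass'
print(bool(re.search('[0-9]', password) and re.search('[!@#$%^&*]', password)))  # True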

How to use Regex?

In [18]:
# importing regex package
import re

# sample sentences 
re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

Splitting a sentence into a list of words

In [19]:
# \s matches a single whitespace character
re.split('\s', re_test)
Out[19]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
In [20]:
# \s matches a single whitespace character
re.split('\s', re_test_messy)
Out[20]:
['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']
In [21]:
# \s+ matches one or more whitespace characters
re.split('\s+', re_test_messy)
Out[21]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
In [22]:
# \s+ matches one or more whitespace characters
re.split('\s+', re_test_messy1) # will not work as there is no whitespace in the string
Out[22]:
['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']
In [23]:
# \W+ matches one or more non-word characters (Method 1: splitting on the characters between the words)
re.split('\W+', re_test_messy1)
Out[23]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
In [24]:
# \S+ matches one or more non-whitespace characters (Method 2: finding the words and ignoring the characters which split them)
re.findall('\S+', re_test)
Out[24]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
In [25]:
# \S+ matches one or more non-whitespace characters (Method 2: finding the words and ignoring the characters which split them)
re.findall('\S+', re_test_messy)
Out[25]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
In [26]:
# \w+ matches one or more word characters (Method 2: finding the words and ignoring the characters which split them)
re.findall('\w+', re_test_messy1)
Out[26]:
['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

Replacing a specific string

In [27]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'
In [28]:
# importing regex package
import re
In [29]:
# using findall to find words which contain characters from a-z
re.findall('[a-z]+', pep8_test) # regex is case sensitive
Out[29]:
['try', 'to', 'follow', 'guidelines']
In [30]:
# using findall to find words which contain characters from A-Z
re.findall('[A-Z]+', pep8_test) # regex is case sensitive
Out[30]:
['I', 'PEP']
In [31]:
# using findall to get the PEP8 token. [A-Z0-9]+ matches runs of upper case letters OR numbers
re.findall('[A-Z0-9]+', pep8_test)
Out[31]:
['I', 'PEP8']
In [32]:
# using findall to get only the PEP8 token. [A-Z]+[0-9]+ matches upper case letters followed by numbers
re.findall('[A-Z]+[0-9]+', pep8_test)
Out[32]:
['PEP8']
In [33]:
# using findall to get only the PEP8 token. [A-Z]+[0-9]+ matches upper case letters followed by numbers
re.findall('[A-Z]+[0-9]+', pep7_test)
Out[33]:
['PEP7']
In [34]:
# using findall to get only the PEP8 token. [A-Z]+[0-9]+ matches upper case letters followed by numbers
re.findall('[A-Z]+[0-9]+', peep8_test)
Out[34]:
['PEEP8']
In [35]:
# replacing whatever the pattern matches (PEP8 here, PEP7 and PEEP8 below) with 'PEP8 Python style' using the sub function
re.sub('[A-Z]+[0-9]+', 'PEP8 Python style', pep8_test)
Out[35]:
'I try to follow PEP8 Python style guidelines'
In [36]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python style', pep7_test)
Out[36]:
'I try to follow PEP8 Python style guidelines'
In [37]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python style', peep8_test)
Out[37]:
'I try to follow PEP8 Python style guidelines'

Other examples of regex methods

  • re.search()
  • re.match()
  • re.fullmatch()
  • re.finditer()
  • re.escape()
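
A minimal sketch of the difference between the first two: re.search() scans the whole string, while re.match() only checks the beginning (the sample string is made up):

import re

text = 'regex methods: search scans the whole string'
print(re.search('methods', text))  # finds 'methods' anywhere in the string
print(re.match('methods', text))   # None -- match() only looks at the start
print(re.match('regex', text))     # matches, since the string starts with 'regex'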

Machine Learning Pipeline

  1. Raw Text
  2. Tokenize (telling the model what to look at)
  3. Clean text (removing stop words and punctuation, stemming, etc.)
  4. Vectorize (converting text into numeric form)
  5. Machine learning algorithm (fit/train model)
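
To make these five steps concrete, here is a minimal, hypothetical sketch using scikit-learn (which this notebook does not otherwise use; the toy corpus, the CountVectorizer, and the Naive Bayes classifier are illustrative assumptions, not the method built later in the course):

# a toy run of steps 1-5 on a two-message corpus
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['Free entry to win cash now', 'See you at dinner tonight']  # 1. raw text
labels = ['spam', 'ham']

vectorizer = CountVectorizer()              # 2-4. tokenize, clean, vectorize
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)      # 5. fit the model
print(model.predict(vectorizer.transform(['win cash now'])))  # likely ['spam']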

Implementation: Removing punctuation

In [38]:
# importing pandas
import pandas as pd

# displaying up to 100 characters per column
pd.set_option('display.max_colwidth', 100)

# reading the text file
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t', header = None, names = ['label', 'body_text'])

data.head()
Out[38]:
label body_text
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2 ham Nah I don't think he goes to usf, he lives around here though
3 ham Even my brother is not like to speak with me. They treat me like aids patent.
4 ham I HAVE A DATE ON SUNDAY WITH WILL!!

Removing punctuation

In [39]:
# importing string package
import string

string.punctuation
Out[39]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [40]:
# using list comprehension to remove punctuation

# defining a function to remove punctuation (this version returns a list of characters, not a string)
def remove_punc(text):
    text_nopunct = [char for char in text if char not in string.punctuation]
    return text_nopunct

# applying remove_punc function to the data
data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punc(x))

data.head()
Out[40]:
label body_text body_text_clean
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... [I, v, e, , b, e, e, n, , s, e, a, r, c, h, i, n, g, , f, o, r, , t, h, e, , r, i, g, h, t,...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... [F, r, e, e, , e, n, t, r, y, , i, n, , 2, , a, , w, k, l, y, , c, o, m, p, , t, o, , w,...
2 ham Nah I don't think he goes to usf, he lives around here though [N, a, h, , I, , d, o, n, t, , t, h, i, n, k, , h, e, , g, o, e, s, , t, o, , u, s, f, ,...
3 ham Even my brother is not like to speak with me. They treat me like aids patent. [E, v, e, n, , m, y, , b, r, o, t, h, e, r, , i, s, , n, o, t, , l, i, k, e, , t, o, , s,...
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! [I, , H, A, V, E, , A, , D, A, T, E, , O, N, , S, U, N, D, A, Y, , W, I, T, H, , W, I, L, L]
In [41]:
# using join to recreate original sentence

# defining a function to remove punctuation
def remove_punc(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

# applying remove_punc function to the data
data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punc(x))

data.head()
Out[41]:
label body_text body_text_clean
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... Ive been searching for the right words to thank you for this breather I promise i wont take your...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2 ham Nah I don't think he goes to usf, he lives around here though Nah I dont think he goes to usf he lives around here though
3 ham Even my brother is not like to speak with me. They treat me like aids patent. Even my brother is not like to speak with me They treat me like aids patent
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! I HAVE A DATE ON SUNDAY WITH WILL
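
For reference, str.translate is a common alternative that does the same removal in a single call; a minimal sketch on a standalone made-up string:

# removing punctuation with str.translate instead of a list comprehension
import string

sample = "Nah, I don't think he goes to usf!"
print(sample.translate(str.maketrans('', '', string.punctuation)))
# 'Nah I dont think he goes to usf'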

Implementation: Tokenization

Tokenization is the process of splitting a string or sentence into a list of words (tokens).

In [42]:
# importing regex package
import re

# defining the function
def tokenize(text):
    tokens = re.split('\W+', text)
    return tokens

# we use lower() to convert any upper case letters, as Python string matching is case sensitive
data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

data.head()
Out[42]:
label body_text body_text_clean body_text_tokenized
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... Ive been searching for the right words to thank you for this breather I promise i wont take your... [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...
2 ham Nah I don't think he goes to usf, he lives around here though Nah I dont think he goes to usf he lives around here though [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. Even my brother is not like to speak with me They treat me like aids patent [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! I HAVE A DATE ON SUNDAY WITH WILL [i, have, a, date, on, sunday, with, will]
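
NLTK's own word_tokenize could be used here instead of the regex split; a minimal sketch (note that, unlike re.split('\W+', ...), it keeps punctuation and contraction parts as separate tokens):

from nltk.tokenize import word_tokenize

print(word_tokenize('i have a date on sunday with will'))
# ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']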

Implementation: Removing Stopwords

In [43]:
# importing nltk to access its stopwords corpus
import nltk

# assigning all English stopwords to a variable
stopwords = nltk.corpus.stopwords.words('english')
stopwords
Out[43]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]
In [44]:
# defining a function to remove stopwords
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopwords]
    return text

# applying the function
data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

data.head()
Out[44]:
label body_text body_text_clean body_text_tokenized body_text_nostop
0 ham I've been searching for the right words to thank you for this breather. I promise i wont take yo... Ive been searching for the right words to thank you for this breather I promise i wont take your... [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...
1 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 ham Nah I don't think he goes to usf, he lives around here though Nah I dont think he goes to usf he lives around here though [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] [nah, dont, think, goes, usf, lives, around, though]
3 ham Even my brother is not like to speak with me. They treat me like aids patent. Even my brother is not like to speak with me They treat me like aids patent [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] [even, brother, like, speak, treat, like, aids, patent]
4 ham I HAVE A DATE ON SUNDAY WITH WILL!! I HAVE A DATE ON SUNDAY WITH WILL [i, have, a, date, on, sunday, with, will] [date, sunday]
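
The three cleaning steps above can be folded into a single helper; a minimal sketch reusing the stopwords list defined earlier:

# combining punctuation removal, tokenization, and stopword removal
import re
import string

def clean_text(text):
    text = ''.join(char for char in text if char not in string.punctuation)
    tokens = re.split('\W+', text.lower())
    return [word for word in tokens if word and word not in stopwords]

print(clean_text('I HAVE A DATE ON SUNDAY WITH WILL!!'))  # ['date', 'sunday']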

Author: Amandeep Saluja