Feature engineering is the process of creating new features or transforming existing features to get the most out of the data.

Some examples of creating new features:

- Length of the text field
- Percentage of characters in the text that are punctuation
- Percentage of characters in the text that are capitalized
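The third example is not implemented later in this notebook, so here is a minimal sketch of how it could be computed. The function name `pct_upper`, the column name `upper%`, and the two sample messages are assumptions made for illustration; they are not part of the SMS dataset used below.

```python
import pandas as pd

def pct_upper(text):
    """Percentage of non-space characters that are uppercase letters."""
    n_chars = len(text) - text.count(" ")
    if n_chars == 0:  # guard against empty or all-space messages
        return 0.0
    n_upper = sum(1 for char in text if char.isupper())
    return round(100 * n_upper / n_chars, 1)

# toy messages, made up for illustration only
sample = pd.DataFrame({"body_text": ["WIN a FREE prize NOW", "see you at lunch"]})
sample["upper%"] = sample["body_text"].apply(pct_upper)
print(sample)
```

Spaces are excluded from the denominator so that the feature is comparable with the length and punctuation features built in this notebook.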

Once a feature exists, we can also apply transformation techniques to it, such as:

- Power Transformations (square, square root, etc.)
- Standardizing data
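As a rough sketch of both techniques, the snippet below applies a square-root transformation and a z-score standardization to a small series. The values and the variable names are made up for illustration, and using the population standard deviation (`ddof=0`) is an assumption; the sample standard deviation is an equally common choice.

```python
import pandas as pd

# toy values, made up for illustration only
lengths = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0], name="body_len")

# power transformation: square root of every value
sqrt_lengths = lengths ** 0.5

# standardization: subtract the mean, divide by the standard deviation
standardized = (lengths - lengths.mean()) / lengths.std(ddof=0)

print(sqrt_lengths)
print(standardized)
```

After standardization the series has mean 0 and standard deviation 1, which puts features measured on different scales on a common footing.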

In [1]:

```
import pandas as pd
data = pd.read_csv('SMSSpamCollection.tsv', sep = '\t', names = ['label', 'body_text'])
data.head()
```

Out[1]:

Our hypothesis is that spam messages tend to be longer than legitimate (ham) messages. We will create a message-length feature and then check whether the data supports this hypothesis.

In [2]:

```
# message length excluding spaces (only ' ' is subtracted, not tabs or newlines)
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data.head()
```

Out[2]:

In [3]:

```
import string

def count_punct(text):
    """Return the percentage of non-space characters that are punctuation."""
    count = sum(1 for char in text if char in string.punctuation)  # total punctuation characters
    # divide by the message length excluding spaces, as for body_len above
    return round(count / (len(text) - text.count(" ")), 3) * 100

# apply the function to every message
data['punc%'] = data['body_text'].apply(count_punct)
data.head()
```

Out[3]:

In [4]:

```
# importing packages
from matplotlib import pyplot
%matplotlib inline
import numpy as np
```

In [5]:

```
# define 40 evenly spaced bin edges for the histogram using numpy
bins = np.linspace(0, 200, 40)
# body length distribution for spam
pyplot.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, density=True, label='spam')
# body length distribution for ham
pyplot.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, density=True, label='ham')
# create the legend
pyplot.legend(loc='upper left')
pyplot.show()
```

In [6]:

```
# define 40 evenly spaced bin edges for the histogram using numpy
bins = np.linspace(0, 50, 40)  # the feature is a percentage
# punctuation % distribution for spam
pyplot.hist(data[data['label'] == 'spam']['punc%'], bins, alpha=0.5, density=True, label='spam')
# punctuation % distribution for ham
pyplot.hist(data[data['label'] == 'ham']['punc%'], bins, alpha=0.5, density=True, label='ham')
# create the legend
pyplot.legend(loc='upper right')
pyplot.show()
```
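The overlaid histograms can be complemented with per-class summary statistics, which give a numeric view of how well each feature separates spam from ham. The sketch below uses a small stand-in frame named `toy` with made-up values so that it runs on its own; in the notebook the same `groupby` call could be applied to `data`.

```python
import pandas as pd

# toy stand-in for the notebook's `data` frame; all values are made up
toy = pd.DataFrame({
    "label": ["spam", "spam", "ham", "ham", "ham"],
    "body_len": [120, 140, 30, 50, 40],
    "punc%": [6.0, 4.0, 3.0, 5.0, 7.0],
})

# per-class mean and median of each engineered feature
summary = toy.groupby("label")[["body_len", "punc%"]].agg(["mean", "median"])
print(summary)
```

A large gap between the class means (relative to the spread within each class) suggests the feature may be useful to a classifier; similar means suggest it carries little signal.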

In [11]:

```
# define 40 evenly spaced bin edges for the histogram using numpy
bins = np.linspace(0, 200, 40)
# body length distribution over all messages (spam and ham together)
pyplot.hist(data['body_len'], bins)
# title
pyplot.title('Body Length Distribution')
pyplot.show()
```

In [12]:

```
# define 40 evenly spaced bin edges for the histogram using numpy
bins = np.linspace(0, 50, 40)  # the feature is a percentage
# punctuation % distribution over all messages (spam and ham together)
pyplot.hist(data['punc%'], bins)
# title
pyplot.title('Punctuation % Distribution')
pyplot.show()
```

Transformation is a process that alters each data point in a given column in a systematic way.

**Base Form**: $$ y^x $$

| X | Base Form | Transformation |
|---|---|---|
| -2 | $$ y^{-2} $$ | $$ \frac{1}{y^2} $$ |
| -1 | $$ y^{-1} $$ | $$ \frac{1}{y} $$ |
| -0.5 | $$ y^{-\frac{1}{2}} $$ | $$ \frac{1}{\sqrt{y}} $$ |
| 0 | $$ y^{0} $$ | $$ \log(y) $$ |
| 0.5 | $$ y^{\frac{1}{2}} $$ | $$ \sqrt{y} $$ |
| 1 | $$ y^{1} $$ | $$ y $$ |
| 2 | $$ y^{2} $$ | $$ y^2 $$ |
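The table can be expressed as a small helper function. This is a sketch only: the name `ladder_transform` is hypothetical, and it follows the table's convention of using the logarithm in place of the zeroth power. The logarithm and the negative exponents are only defined for strictly positive values, so the sample array below (made up for illustration) contains no zeros.

```python
import numpy as np

def ladder_transform(y, x):
    """Apply y**x, using log(y) when x == 0 as in the table above."""
    y = np.asarray(y, dtype=float)
    if x == 0:
        return np.log(y)
    return y ** x

values = np.array([1.0, 4.0, 16.0])  # made-up, strictly positive values

for exponent in [-2, -1, -0.5, 0, 0.5, 1, 2]:
    print(exponent, ladder_transform(values, exponent))
```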

**Transformation Process**

- Determine what range of exponents to test
- Apply each transformation to each value of your chosen feature
- Use a criterion to determine which of the transformations yields the best distribution
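The three steps above can be sketched as follows. The source does not name a specific criterion, so using absolute skewness (closer to 0 means more symmetric) is an assumption made here for illustration, as are the sample values in `punct`. Visual inspection of histograms is an equally valid way to choose.

```python
import pandas as pd

# toy right-skewed values, made up for illustration only
punct = pd.Series([0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 10.0, 25.0])

# step 1: range of exponents to test (1/i roots for i = 1..5)
exponents = [1 / i for i in [1, 2, 3, 4, 5]]

# step 2: apply each transformation to every value
# step 3: score each result by its absolute skewness
skewness = {p: abs((punct ** p).skew()) for p in exponents}

# the exponent whose transformed distribution is least skewed
best = min(skewness, key=skewness.get)
print(skewness)
print("least skewed exponent:", best)
```

Skewness is only one possible score. It says nothing about multiple peaks or heavy tails, so it is worth checking the winning transformation against a plot as well.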

In [13]:

```
# loop through several root transformations and plot the data with each one applied
for i in [1, 2, 3, 4, 5]:
    pyplot.hist(data['punc%'] ** (1 / i), bins=40)
    pyplot.title("Transformation: 1/{}".format(i))
    pyplot.show()
```

Author: Amandeep Saluja