# Machine Learning (Part 1)

## Linear Regression

The goal of this post is to work through the machine learning concepts in the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by implementing them from scratch. More than 90% of my code is similar to the code given in the book. I replicated the author's methodology and rewrote the code myself to understand it better by doing it.

### Problem

Predict the Better Life Index (BLI) life satisfaction score of Cyprus from its GDP per capita, using a model trained on the GDP per capita and BLI of other countries.

### Process:

1. Importing required packages.
2. Setting up matplotlib axis and tick label sizes for the scatter plot.
3. Defining a function to wrangle the raw data from two different sources and merge them.
4. Using that function to get the data we need and visualizing it with matplotlib.
5. Training the model and predicting the BLI of Cyprus.

#### Importing required packages

In [1]:
import pandas as pd
import numpy as np
import os
import sklearn.linear_model

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# to make this notebook's output stable across runs
np.random.seed(42)


#### Setting up matplotlib to plot good figures

In [2]:
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


#### Setting up directory to store figures generated using matplotlib and loading the data files

In [3]:
root_dir = r'C:\Users\saluj\Desktop\Machine Learning'


#### Defining a function to store images in the folder (not necessary for this project)

In [4]:
def save_figure(fig_id, tight_layout=True):
    path = os.path.join(root_dir, "images", fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
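
One thing to watch for: `plt.savefig` raises an error if the target folder does not exist yet. Below is a minimal sketch of a variant that creates the folder first. The name `save_figure_safe`, the `images_dir` parameter, and the temporary demo folder are hypothetical and not part of the original code.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt


def save_figure_safe(fig_id, images_dir, tight_layout=True):
    """Save the current figure as <images_dir>/<fig_id>.png, creating the folder if needed."""
    os.makedirs(images_dir, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(images_dir, fig_id + ".png")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format="png", dpi=300)
    return path


# Usage with a throwaway folder (hypothetical location)
demo_dir = os.path.join(tempfile.mkdtemp(), "images")
plt.plot([1, 2, 3], [2, 4, 6])
saved_path = save_figure_safe("demo_line", demo_dir)
plt.close()
```

Returning the path makes it easy to confirm where the file ended up.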


#### Defining a function to read, format, and combine OECD and GDP data

In [5]:
def prepare_country_stats(oecd_bli, gdp_per_capita):

    # keep only the OECD rows where Inequality is "Total"
    oecd_bli = oecd_bli[oecd_bli["Inequality"]=="Total"]

    # pivot the OECD data: Country becomes the index, each distinct Indicator becomes a column, filled from Value
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")

    # rename the 2015 column in gdp_per_capita to "GDP per capita" (modifies the caller's DataFrame in place)
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)

    # set the Country column as the index of gdp_per_capita (also in place)
    gdp_per_capita.set_index("Country", inplace=True)

    # merge both tables on their Country index
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)

    # sort by GDP per capita in increasing order (the sort_values default)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)

    # row positions (after sorting) of the countries left out of the sample
    remove_indices = [0, 1, 6, 8, 33, 34, 35]

    # row positions of the countries kept in the sample
    keep_indices = list(set(range(36)) - set(remove_indices))

    # return the sample with two columns, GDP per capita and Life satisfaction, indexed by Country
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]
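
To see what the filter, pivot, and merge steps do, here is a small sketch on made-up data. The country names and all values are invented for illustration, and the non-in-place `rename`/`set_index` calls are a stylistic choice for the sketch, not the original code.

```python
import pandas as pd

# Toy long-format table shaped like the OECD file (all values are made up)
oecd_toy = pd.DataFrame({
    "Country":    ["Aland", "Aland", "Aland", "Borduria", "Borduria", "Borduria"],
    "Indicator":  ["Life satisfaction", "Life satisfaction", "Employment rate",
                   "Life satisfaction", "Life satisfaction", "Employment rate"],
    "Inequality": ["Total", "Men", "Total", "Total", "Men", "Total"],
    "Value":      [6.5, 6.4, 70.0, 5.0, 5.1, 60.0],
})

# Toy GDP table with one extra country that has no OECD rows
gdp_toy = pd.DataFrame({
    "Country": ["Borduria", "Aland", "Syldavia"],
    "2015":    [20000.0, 40000.0, 30000.0],
})

# 1. Keep only the "Total" rows, then turn each Indicator into its own column
totals = oecd_toy[oecd_toy["Inequality"] == "Total"]
wide = totals.pivot(index="Country", columns="Indicator", values="Value")

# 2. Rename the year column and index the GDP table by Country
gdp_indexed = gdp_toy.rename(columns={"2015": "GDP per capita"}).set_index("Country")

# 3. Inner merge on the Country index: only countries present in both tables survive
merged = pd.merge(left=wide, right=gdp_indexed, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")  # ascending by default

print(merged)
```

Because the merge is an inner join on the index, Syldavia drops out of the result, and the rows come back ordered from the lowest to the highest GDP per capita.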


In [6]:
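oecd_bli = pd.read_csv(r"C:\Users\saluj\Desktop\Machine_Learning\oecd_bli_2015.csv", thousands=',')
# GDP per capita data (file name assumed to be gdp_per_capita.csv in the same folder)
gdp_per_capita = pd.read_csv(r"C:\Users\saluj\Desktop\Machine_Learning\gdp_per_capita.csv",
                             thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")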
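
The `thousands`, `delimiter`, and `na_values` options control how the raw text is parsed. A minimal sketch with an in-memory, made-up tab-separated snippet (not the real GDP file) shows their effect:

```python
import io

import pandas as pd

# Made-up tab-separated text imitating the layout of a GDP file
raw_text = "Country\t2015\nAland\t40,000.5\nBorduria\tn/a\n"

gdp_toy = pd.read_csv(
    io.StringIO(raw_text),
    thousands=",",    # "40,000.5" is parsed as the number 40000.5
    delimiter="\t",   # columns are separated by tabs, not commas
    na_values="n/a",  # "n/a" becomes a missing value (NaN)
)

print(gdp_toy)
print(gdp_toy.dtypes)
```

Without `thousands=","`, a value such as `40,000.5` would be read as text, and the column could not be used directly as a numeric feature.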


#### Using prepare_country_stats function to format the files

In [7]:
full_country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)


#### Setting up variables to visualize the data in matplotlib

In [8]:
X = np.c_[full_country_stats["GDP per capita"]]
y = np.c_[full_country_stats["Life satisfaction"]]

# Visualizing the data
full_country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
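
The reason for wrapping the columns in `np.c_` is that scikit-learn expects the features as a 2D array with one row per sample and one column per feature. A small sketch on made-up numbers shows the difference in shape:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for full_country_stats
stats_toy = pd.DataFrame({
    "GDP per capita": [20000.0, 30000.0, 40000.0],
    "Life satisfaction": [5.0, 6.0, 6.5],
})

flat = stats_toy["GDP per capita"].to_numpy()  # 1D array, shape (3,)
X_toy = np.c_[stats_toy["GDP per capita"]]     # 2D column, shape (3, 1)

print(flat.shape, X_toy.shape)
```

An equivalent way to get the same column shape would be `flat.reshape(-1, 1)`.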


#### Selecting a linear model

In [9]:
model = sklearn.linear_model.LinearRegression()


#### Training the model

In [10]:
model.fit(X, y)

Out[10]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
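
Fitting a one-feature linear model means finding an intercept θ0 and a slope θ1 such that life satisfaction ≈ θ0 + θ1 × GDP per capita. The sketch below, on made-up data rather than the OECD sample, reads those two parameters from a fitted model and checks them against an ordinary least-squares solution computed with NumPy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: GDP per capita (feature) and life satisfaction (target)
X_toy = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])
y_toy = np.array([[5.1], [5.4], [6.1], [6.4]])

model_toy = LinearRegression().fit(X_toy, y_toy)
theta0 = model_toy.intercept_[0]  # intercept
theta1 = model_toy.coef_[0][0]    # slope for the single feature

# Same fit written as a least-squares problem on the design matrix [1, x]
design = np.c_[np.ones(len(X_toy)), X_toy]
solution, *_ = np.linalg.lstsq(design, y_toy, rcond=None)

print(theta0, theta1)
print(solution.ravel())
```

Both approaches should agree up to floating-point error, since they minimize the same sum of squared residuals.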

#### Making a prediction for Cyprus Life Satisfaction using our model

In [11]:
X_new = [[22587]]  # Cyprus' GDP per capita
model.predict(X_new)

Out[11]:
array([[5.96242338]])
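
A prediction from this kind of model is just the fitted line evaluated at the new input. The sketch below uses made-up training data (not the OECD sample) to show that `predict` matches the hand-computed θ0 + θ1 × x, and that the input must be 2D, which is why `X_new` is written as `[[22587]]`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: GDP per capita (feature) and life satisfaction (target)
X_toy = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])
y_toy = np.array([[5.1], [5.4], [6.1], [6.4]])
model_toy = LinearRegression().fit(X_toy, y_toy)

x_new = 25000.0                           # hypothetical GDP per capita
predicted = model_toy.predict([[x_new]])  # 2D input: one row, one feature

# The same value computed directly from the fitted line
by_hand = model_toy.intercept_[0] + model_toy.coef_[0][0] * x_new

print(predicted, by_hand)
```

Because the targets were passed as a 2D column, the prediction also comes back as a 2D array with one row and one column.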

Author: Amandeep Saluja