Machine Learning (Part 1)

Linear Regression

The goal of this post is to work through the machine learning concepts presented in the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by implementing them from scratch. More than 90% of my code is similar to the code given in the book. I replicated the author's methodology and rewrote the code myself to understand it better by doing it.

Problem

Estimate the life satisfaction score of Cyprus, one of the indicators in the Better Life Index (BLI), from its GDP per capita, using the GDP per capita and BLI data of other countries.

Process:

  1. Importing required packages.
  2. Setting up the x- and y-axis label sizes for the scatter plot.
  3. Defining a function to wrangle the raw data from 2 different sources and merge them together.
  4. Using the above function to get the data we need and visualizing it using matplotlib.
  5. Training the model and getting BLI of Cyprus.

Importing required packages

In [1]:
import pandas as pd
import numpy as np
import os
import sklearn.linear_model

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# to make this notebook's output stable across runs
np.random.seed(42)

Setting up matplotlib to plot good figures

In [2]:
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Setting up the directory used to store figures generated with matplotlib

In [3]:
root_dir = r'C:\Users\saluj\Desktop\Machine Learning'

Defining a function to store images in the folder (not necessary for this project)

In [4]:
def save_figure(fig_id, tight_layout=True):
    path = os.path.join(root_dir, "images", fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

Defining a function to format and combine the OECD and GDP data

In [5]:
def prepare_country_stats(oecd_bli, gdp_per_capita):
    
    #filter the OECD data where Inequality is Total
    oecd_bli = oecd_bli[oecd_bli["Inequality"]=="Total"]  
    
    #pivot the OECD data making country as an index, converting distinct Indicators into columns, and assigning respective value in the data 
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    
    #rename the 2015 column in gdp_per_capita data to GDP per capita
    #(rename returns a copy, so the caller's dataframe is left unchanged and the cell can be re-run)
    gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
    
    #set the Country column as the index in gdp_per_capita data
    gdp_per_capita = gdp_per_capita.set_index("Country")
    
    #merge both the files on index(column)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    
    #sort the rows by the GDP per capita column in ascending order (the sort_values default)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    
    #hard-coded row positions (after sorting) of the countries to leave out of the sample
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    
    #row positions of the sample data: every position from 0 to 35 not listed above
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    
    #return a dataframe of the sample data with 2 columns, GDP per capita and Life satisfaction, indexed by Country
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]
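
To make the pivot and merge steps easier to follow, here is a minimal sketch on made-up data. The countries "A", "B", and "C" and all values are hypothetical stand-ins; only the column names mirror the ones the function above uses.

```python
import pandas as pd

# Hypothetical long-format table shaped like the OECD file (values are made up).
toy_bli = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["Life satisfaction", "Air pollution"] * 2,
    "Inequality": ["Total"] * 4,
    "Value": [7.0, 10.0, 5.5, 20.0],
})

# Hypothetical GDP table; country "C" has no row in toy_bli.
toy_gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "2015": [50000.0, 20000.0, 30000.0],
})

# One row per country, one column per indicator.
wide = toy_bli.pivot(index="Country", columns="Indicator", values="Value")

# Rename the year column and index by country so both tables share an index.
gdp = toy_gdp.rename(columns={"2015": "GDP per capita"}).set_index("Country")

# pd.merge defaults to an inner join, so only countries in both tables survive.
merged = pd.merge(left=wide, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")

print(merged[["GDP per capita", "Life satisfaction"]])
```

Country "C" drops out because it has no matching row in the pivoted table, and "B" comes before "A" because the sort is ascending.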

Loading the data files

In [6]:
oecd_bli = pd.read_csv(r"C:\Users\saluj\Desktop\Machine_Learning\oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(r"C:\Users\saluj\Desktop\Machine_Learning\gdp_per_capita.csv", 
                             thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")

Using the prepare_country_stats function to format and merge the data

In [7]:
full_country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)

Setting up the feature (X) and target (y) arrays for the model and visualizing the data with matplotlib

In [8]:
X = np.c_[full_country_stats["GDP per capita"]]
y = np.c_[full_country_stats["Life satisfaction"]]

# Visualizing the data
full_country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
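
The np.c_ call above turns a one-dimensional column into a two-dimensional array with one row per sample, which is the shape scikit-learn estimators expect for X. A small sketch with a stand-in list (the values are made up) shows the difference:

```python
import numpy as np

values = [1.0, 2.0, 3.0]   # stand-in for a DataFrame column

flat = np.array(values)    # one-dimensional: shape (3,)
column = np.c_[values]     # two-dimensional: 3 samples, 1 feature

print(flat.shape, column.shape)  # (3,) (3, 1)
```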

Selecting a linear model

In [9]:
model = sklearn.linear_model.LinearRegression()

Training the model

In [10]:
model.fit(X, y)
Out[10]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
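
With a single feature, the least-squares line that fit() looks for has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the mean point. The sketch below illustrates that math in plain Python. It is not scikit-learn's implementation; fit_line is a hypothetical helper and the data points are made up.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 1.0 2.0
```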

Predicting the life satisfaction of Cyprus with our model

In [11]:
X_new = [[22587]]  # Cyprus' GDP per capita
model.predict(X_new)
Out[11]:
array([[5.96242338]])
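
A fitted linear model predicts by computing intercept + slope * x. The sketch below checks this against predict() on a small made-up dataset. The names toy_model, X_toy, and y_toy and all numbers are hypothetical, so the result will not match the Cyprus figure above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up GDP per capita and life satisfaction values, as column arrays.
X_toy = np.c_[[10000.0, 20000.0, 30000.0, 40000.0]]
y_toy = np.c_[[5.0, 5.6, 6.1, 6.9]]

toy_model = LinearRegression().fit(X_toy, y_toy)

x_new = 22587.0

# With a 2-D y, intercept_ has shape (1,) and coef_ has shape (1, 1).
by_hand = toy_model.intercept_[0] + toy_model.coef_[0][0] * x_new
by_predict = toy_model.predict([[x_new]])[0][0]

print(by_hand, by_predict)
```

Both numbers agree up to floating-point rounding, which shows that predict() is just evaluating the fitted line at the new point.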

Author: Amandeep Saluja