Splitting rows in a column with seperator into multiple rows

The goal of this project is to split a column into multiple rows having the same seperator. The process will be as followed:

  1. Installing and importing necessary packages.
  2. Reading the csv file using pandas.
  3. Reformatting all column names by replacing spaces between name with an underscore.
  4. Replacing NaN values in the target column with "--".
  5. Creating a function with splits rows with seperator into multiple rows.
  6. Checking the end result.

Installing packages

In [1]:
! pip install pandas
Requirement already satisfied: pandas in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (0.24.2)
Requirement already satisfied: pytz>=2011k in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from pandas) (2.8.0)
Requirement already satisfied: numpy>=1.12.0 in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from pandas) (1.16.2)
Requirement already satisfied: six>=1.5 in c:\users\saluj\appdata\local\continuum\anaconda3\lib\site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)

Importing packages

In [2]:
import pandas as pd

Reading the file

In [3]:
df1 = pd.read_csv("SampleData.csv", encoding = "ISO-8859-1")

df1
Out[3]:
Date Country Financial Advisor Legal Advisor
0 30/06/2018 USA NaN NaN
1 30/06/2018 France Rothschild & Co Dechert LLP
2 29/06/2018 France Capitalmind Corporate Finance Advisory; Transl... Brunswick Societe d'Avocats
3 29/06/2018 USA Qatalyst Group; Raymond James & Associates, Inc. Goodwin Procter LLP; Kirkland & Ellis LLP
4 29/06/2018 Italy NaN NaN
5 29/06/2018 Belgium NaN NaN
6 29/06/2018 Singapore Jamieson Corporate Finance LLP ; Rothschild & Co NaN

Replacing spaces in column name with an underscore and replacing NaN values with "--"

In [4]:
df1.columns = df1.columns.str.replace(' ', '_')

print(df1.head(2))

df1 = df1.fillna("--")

print('Result after replacing NaN with --')

print(df1.head(2))
         Date Country Financial_Advisor Legal_Advisor
0  30/06/2018     USA               NaN           NaN
1  30/06/2018  France   Rothschild & Co   Dechert LLP
Result after replacing NaN with --
         Date Country Financial_Advisor Legal_Advisor
0  30/06/2018     USA                --            --
1  30/06/2018  France   Rothschild & Co   Dechert LLP

Creating the split row function into multiple rows

In [5]:
def splitDataFrameList(df,target_column,separator):
    
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    row_accumulator = []

    def splitListToRows(row, separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)

    df.apply(splitListToRows, axis=1, args = (separator, ))
    new_df = pd.DataFrame(row_accumulator)
    return new_df

Calling out the function and checking the result

In [6]:
df1 = splitDataFrameList(df1,'Financial_Advisor',';')

df1
Out[6]:
Country Date Financial_Advisor Legal_Advisor
0 USA 30/06/2018 -- --
1 France 30/06/2018 Rothschild & Co Dechert LLP
2 France 29/06/2018 Capitalmind Corporate Finance Advisory Brunswick Societe d'Avocats
3 France 29/06/2018 Translink Corporate Finance Brunswick Societe d'Avocats
4 USA 29/06/2018 Qatalyst Group Goodwin Procter LLP; Kirkland & Ellis LLP
5 USA 29/06/2018 Raymond James & Associates, Inc. Goodwin Procter LLP; Kirkland & Ellis LLP
6 Italy 29/06/2018 -- --
7 Belgium 29/06/2018 -- --
8 Singapore 29/06/2018 Jamieson Corporate Finance LLP --
9 Singapore 29/06/2018 Rothschild & Co --

Calling out the function on another column and checking the result

In [7]:
df1 = splitDataFrameList(df1,'Legal_Advisor',';')

df1
Out[7]:
Country Date Financial_Advisor Legal_Advisor
0 USA 30/06/2018 -- --
1 France 30/06/2018 Rothschild & Co Dechert LLP
2 France 29/06/2018 Capitalmind Corporate Finance Advisory Brunswick Societe d'Avocats
3 France 29/06/2018 Translink Corporate Finance Brunswick Societe d'Avocats
4 USA 29/06/2018 Qatalyst Group Goodwin Procter LLP
5 USA 29/06/2018 Qatalyst Group Kirkland & Ellis LLP
6 USA 29/06/2018 Raymond James & Associates, Inc. Goodwin Procter LLP
7 USA 29/06/2018 Raymond James & Associates, Inc. Kirkland & Ellis LLP
8 Italy 29/06/2018 -- --
9 Belgium 29/06/2018 -- --
10 Singapore 29/06/2018 Jamieson Corporate Finance LLP --
11 Singapore 29/06/2018 Rothschild & Co --

Author: Amandeep Saluja

My Portfolio