Titanic — Machine Learning from Disaster — Data Cleaning


Introduction

This article walks through the full process of cleaning the data and developing a machine learning model on the Titanic dataset, one of the most famous datasets that nearly every data scientist explores at the start of their career. The dataset provides information on the fate of the Titanic's passengers, summarized by sex, age, economic status (class), and survival.

Data Cleaning

# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# upload the data
train = pd.read_csv('../datasets/train.csv')
test = pd.read_csv('../datasets/test.csv')

Check the null values:

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# train data
sns.heatmap(train.isnull(), yticklabels=False, ax=ax[0], cbar=False, cmap='viridis')
ax[0].set_title('Train data')

# test data
sns.heatmap(test.isnull(), yticklabels=False, ax=ax[1], cbar=False, cmap='viridis')
ax[1].set_title('Test data');
Heatmap to visualize the null values before cleaning
  • There are missing values in the train set: Age, Cabin and Embarked.
  • There are missing values in the test set: Age, Fare and Cabin.
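The same gaps can also be confirmed numerically with `isnull().sum()`. A minimal sketch on a toy frame (the column names mirror the Titanic ones; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real train set
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})

# Count of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)
```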

1. Embarked feature in Train

First, we need to check how many ports appear in the Embarked column, then fill the missing values with the most common one:

train.Embarked.value_counts()

from collections import Counter
train.Embarked = train.Embarked.replace(np.nan, Counter(train.Embarked).most_common(1)[0][0])
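The `Counter(...).most_common(1)[0][0]` idiom picks the modal value. A small sketch of it on a toy Series (dropping NaN before counting is a safety tweak not in the original, so the mode can never be NaN itself):

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy Embarked column with one missing value
embarked = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'S'])

# Most common port, ignoring NaN explicitly
most_common_port = Counter(embarked.dropna()).most_common(1)[0][0]
filled = embarked.replace(np.nan, most_common_port)
```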

2. Fare feature in Test

To fill the missing value in Fare we can use the Pclass column: the passenger with the missing fare is in third class, so we fill it with the mean third-class fare from the train set.

class3_mean = train[train['Pclass']==3]['Fare'].mean()
test['Fare'] = test['Fare'].replace({np.nan:class3_mean})
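The same fill can be sketched end to end on toy data (the numbers below are made up for illustration, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test sets
train_toy = pd.DataFrame({'Pclass': [3, 3, 1], 'Fare': [7.25, 7.75, 71.28]})
test_toy = pd.DataFrame({'Pclass': [3], 'Fare': [np.nan]})

# Mean fare of third class in the (toy) train data
class3_mean = train_toy.loc[train_toy['Pclass'] == 3, 'Fare'].mean()

# Fill the missing test fare with that mean
test_toy['Fare'] = test_toy['Fare'].fillna(class3_mean)
```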

3. Age feature in Train and Test

# defining a function 'impute_age'
def impute_age(age_pclass):  # age_pclass is a pair: ['Age', 'Pclass']

    # age_pclass[0] is 'Age'
    Age = age_pclass[0]

    # age_pclass[1] is 'Pclass'
    Pclass = age_pclass[1]

    # if Age is missing, fill it based on the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        else:
            return 25
    else:
        return Age
# (for train) grab age and apply the impute_age, our custom function
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
# (for test) grab age and apply the impute_age, our custom function
test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)
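The imputation can be sanity-checked on a toy frame; the class-wise fill values (38 / 30 / 25) are the ones used in the article:

```python
import numpy as np
import pandas as pd

def impute_age(age_pclass):
    # age_pclass is a pair: [Age, Pclass]
    age, pclass = age_pclass
    if pd.isnull(age):
        # Class-wise fill values from the article: 38 / 30 / 25
        return {1: 38, 2: 30}.get(pclass, 25)
    return age

# Toy frame: one missing age per class 1 and 3, one known age
toy = pd.DataFrame({'Age': [np.nan, 22.0, np.nan], 'Pclass': [1, 2, 3]})
toy['Age'] = toy[['Age', 'Pclass']].apply(impute_age, axis=1)
```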
4. Cabin feature in Train and Test

Cabin is mostly missing, so instead of imputing it we turn it into a binary flag:
  • If there is a value for Cabin — replace it with 1
  • If the value is missing/null — replace it with 0
#Train:
train.loc[train['Cabin'].notnull(), 'Cabin'] =1
train['Cabin'] = train['Cabin'].replace({np.nan:0})
train['Cabin'] = train['Cabin'].astype(int)
#Test:
test.loc[test['Cabin'].notnull(), 'Cabin'] =1
test['Cabin'] = test['Cabin'].replace({np.nan:0})
test['Cabin'] = test['Cabin'].astype(int)
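The same two rules can also be written in a single idiomatic line with `notnull().astype(int)` — a sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy Cabin column: known, missing, known
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46']})

# True where a cabin is recorded, then cast to 1/0
toy['Cabin'] = toy['Cabin'].notnull().astype(int)
```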
Heatmap to visualize the null values after cleaning

Looks great! No more missing values. :)

Feature Engineering

To include categorical variables as predictors in statistical and machine learning models, we need to convert them into dummy (one-hot encoded) variables.

# Train
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train = pd.concat([train, sex,embark],axis=1)
train=train.drop(['Sex','Embarked'], axis=1)
train.rename(columns={"male": "sex_male", "Q": "Embarked_Q","S": "Embarked_S"}, inplace=True)
# Same for the Test
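The effect of `get_dummies` with `drop_first=True` is easiest to see on a toy frame: the first category (alphabetically) is dropped, so two sexes become one `male` column and the remaining ports each get their own column:

```python
import pandas as pd

# Toy frame with the two categorical columns
toy = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'Q']})

# drop_first=True removes the first category alphabetically
sex = pd.get_dummies(toy['Sex'], drop_first=True)        # keeps 'male'
embark = pd.get_dummies(toy['Embarked'], drop_first=True)  # drops 'Q', keeps 'S'

out = pd.concat([toy.drop(['Sex', 'Embarked'], axis=1), sex, embark], axis=1)
```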

Model Preparation

  1. First, we select our features: drop the identifier-like columns (PassengerId, Name, Ticket) and the target (Survived):
features_drop = ['PassengerId', 'Name', 'Ticket', 'Survived']
selected_features = [x for x in train.columns if x not in features_drop]
X_train = train[selected_features]
y_train = train['Survived']
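With `X_train` and `y_train` in hand, a model can be fitted. The article does not specify which estimator it uses, so the following is only a sketch with scikit-learn's `LogisticRegression` on a toy stand-in for the cleaned data (column names and values are assumptions for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the cleaned features and target
X_train = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'sex_male': [0, 1, 1, 0],
    'Age': [38, 25, 30, 27],
})
y_train = pd.Series([1, 0, 0, 1])

# Fit and predict back on the train set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
preds = model.predict(X_train)
```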
