House Prices: Data Cleaning
Done by: Monirah Abdulaziz, Lamalharbi, and Aisha Y Hakami
This article walks through the process of cleaning a dataset that contains information about houses: their location, price, and other aspects such as square footage. When working with this kind of data, we need to modify and aggregate some features to increase their usefulness and to see which columns matter and which do not. The main goal of this article is to prepare the data for the modeling stage, so that the other variables can give us a good prediction of the house price.
Data Dictionary
- There are 81 features in this dataset.
- This is the data dictionary of the dataset before the cleaning steps:
Check nulls:
df.isnull().sum()
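For context, df here is the Kaggle House Prices training and test sets concatenated, so that every cleaning step is applied to both at once (the test rows simply have no SalePrice). A minimal setup sketch, assuming the standard train.csv and test.csv files and the usual imports:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file names from the Kaggle competition
train = pd.read_csv('train.csv')   # 1460 rows, includes SalePrice
test = pd.read_csv('test.csv')     # 1459 rows, no SalePrice

# Stack both sets so the cleaning steps below apply to train and test together
df = pd.concat([train, test], ignore_index=True, sort=False)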
Handling Missing Values
Divide the data into numerical ("quan") and categorical ("qual") features. We build the lists from the test set's columns so that SalePrice is not included, and we drop Id from the numerical list:
quan = list(test.loc[:, test.dtypes != 'object'].drop('Id', axis=1).columns.values)
qual = list(test.loc[:, test.dtypes == 'object'].columns.values)
Find out how many missing values there are for the quantitative and qualitative features:
print('Quantitative\n')
hasNAN = df[quan].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
print ('-' * 100)
print('Qualitative\n')
hasNAN = df[qual].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
The output is shown here:
Quantitative
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtHalfBath 2
BsmtFullBath 2
GarageArea 1
GarageCars 1
TotalBsmtSF 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
dtype: int64
----------------------------------------------------------------------------------------------------
Qualitative
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
GarageCond 159
GarageQual 159
GarageFinish 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MSZoning 4
Functional 2
Utilities 2
Exterior1st 1
SaleType 1
Exterior2nd 1
KitchenQual 1
Electrical 1
dtype: int64
Rather than using statistical methods such as the mean, median, or mode to fill the missing values, we prefer to do it manually. The benefit of this approach is that it gives us a deeper insight into and understanding of the data, and it also lets us enhance some features later in the feature engineering stage.
Remember the meaning of these features:
- LotFrontage: Linear feet of street connected to the property
- LotShape: General shape of a property
- Neighborhood: Physical locations within Ames city limits
print(df[['LotShape', 'Neighborhood', 'LotFrontage']].loc[df['LotFrontage'].isnull()])
The output:
LotShape Neighborhood LotFrontage
7 IR1 NWAmes NaN
12 IR2 Sawyer NaN
14 IR1 NAmes NaN
16 IR1 NAmes NaN
24 IR1 Sawyer NaN
... ... ... ...
2847 Reg CollgCr NaN
2850 IR2 CollgCr NaN
2900 IR1 Timber NaN
2901 IR1 Timber NaN
2908 IR1 Mitchel NaN
[486 rows x 3 columns]
Filling missing values for numerical features.
Most of these NaNs mean that the corresponding facility or structure does not exist, so we use zero in most cases.
LotFrontage is different: we fill it with the median LotFrontage of houses in the same Neighborhood with the same LotShape, then fall back to the overall median for any values that remain missing.
# Fill with the median within each (Neighborhood, LotShape) group, then the overall median
df["LotFrontage"] = df["LotFrontage"].fillna(df.groupby(["Neighborhood", "LotShape"])["LotFrontage"].transform("median"))
df.LotFrontage.fillna(df.LotFrontage.median(), inplace=True)
A NaN here should mean no garage. We temporarily use year = 0 and will come back to this later.
df.GarageYrBlt.fillna(0, inplace=True)
Use zero to fill the following nulls:
df.MasVnrArea.fillna(0, inplace=True)
df.BsmtHalfBath.fillna(0, inplace=True)
df.BsmtFullBath.fillna(0, inplace=True)
df.GarageArea.fillna(0, inplace=True)
df.GarageCars.fillna(0, inplace=True)
df.TotalBsmtSF.fillna(0, inplace=True)
df.BsmtUnfSF.fillna(0, inplace=True)
df.BsmtFinSF2.fillna(0, inplace=True)
df.BsmtFinSF1.fillna(0, inplace=True)
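The same zero fills can be written more compactly; an equivalent sketch (the column list simply mirrors the calls above):
# Columns where a missing value means the structure does not exist, so zero is a sensible fill
zero_fill_cols = ['MasVnrArea', 'BsmtHalfBath', 'BsmtFullBath', 'GarageArea', 'GarageCars',
                  'TotalBsmtSF', 'BsmtUnfSF', 'BsmtFinSF2', 'BsmtFinSF1']
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)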
Electrical is categorical, so we fill it with its most common value (the mode):
df.Electrical.fillna(df.Electrical.mode()[0], inplace=True)
Filling missing values for categorical features with 'NA' (not available):
df.PoolQC.fillna('NA', inplace=True)
df.MiscFeature.fillna('NA', inplace=True)
df.Alley.fillna('NA', inplace=True)
df.Fence.fillna('NA', inplace=True)
df.FireplaceQu.fillna('NA', inplace=True)
df.GarageCond.fillna('NA', inplace=True)
df.GarageQual.fillna('NA', inplace=True)
df.GarageFinish.fillna('NA', inplace=True)
df.GarageType.fillna('NA', inplace=True)
df.BsmtExposure.fillna('NA', inplace=True)
df.BsmtCond.fillna('NA', inplace=True)
df.BsmtQual.fillna('NA', inplace=True)
df.BsmtFinType2.fillna('NA', inplace=True)
df.BsmtFinType1.fillna('NA', inplace=True)
df.MasVnrType.fillna('None', inplace=True)
df.Exterior2nd.fillna('None', inplace=True)
df.KitchenQual.fillna('NA', inplace=True)
df.Exterior1st.fillna('NA', inplace=True)
df.SaleType.fillna('NA', inplace=True)
df.Functional.fillna('NA', inplace=True)
df.Utilities.fillna('NA', inplace=True)
MSZoning should be highly correlated with the location, so we fill it with the mode of each individual Neighborhood:
for i in df.Neighborhood.unique():
    if df.MSZoning[df.Neighborhood == i].isnull().sum() > 0:
        df.loc[df.Neighborhood == i, 'MSZoning'] = \
            df.loc[df.Neighborhood == i, 'MSZoning'].fillna(df.loc[df.Neighborhood == i, 'MSZoning'].mode()[0])
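An equivalent, more compact version of the same per-Neighborhood fill, sketched with groupby/transform (it assumes every Neighborhood has at least one non-missing MSZoning value):
# Fill each missing MSZoning with the most common value in its Neighborhood
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
                   .transform(lambda s: s.fillna(s.mode()[0]))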
Visualize the remaining nulls in the qualitative features with a heatmap:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 6))
sns.heatmap(df[qual].isnull(), yticklabels=False, cbar=False, cmap='viridis', ax=ax)
Check duplicates
# Check for duplicate rows
df.duplicated(subset=None, keep='first').sum()
The output is 0, so there are no duplicate rows to drop.
Feature Engineering
If we one-hot encode the categorical features as they are, we will get a huge number of columns. To avoid that, we decided to combine category values that have a similar meaning or that appear only rarely.
- Heating
df.Heating.value_counts()
The output:
GasA 2874
GasW 27
Grav 9
Wall 6
OthW 2
Floor 1
Name: Heating, dtype: int64
So we combine all values other than GasA (GasW, Grav, Wall, OthW, and Floor) into a single "Other" category, since each of them appears only a few times.
df["Heating"] = df["Heating"].apply(lambda x: "Other" if x != "GasA" else x)
- MasVnrType
df.MasVnrType.value_counts()
The output:
None 1766
BrkFace 879
Stone 249
BrkCmn 25
Name: MasVnrType, dtype: int64
So we merge BrkCmn into None, since it appears only a few times.
df["MasVnrType"] = df["MasVnrType"].apply(lambda x: "None" if x in ["None", "BrkCmn"] else x)
And do the same steps for all categorical features.
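Rather than repeating a hand-written lambda for every column, the rare-value grouping could be wrapped in a small helper. This is only an illustrative sketch; the threshold, the "Other" label, and the example columns are our own choices, not part of the original code:
def collapse_rare(series, min_count=30, other_label='Other'):
    """Replace category values that appear fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Hypothetical usage on other sparse categorical columns
# df['RoofMatl'] = collapse_rare(df['RoofMatl'])
# df['Condition2'] = collapse_rare(df['Condition2'])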
Converting categorical variables into labels:
df.Alley = df.Alley.map({'NA':0, 'Grvl':1, 'Pave':2})
df.BsmtCond = df.BsmtCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.BsmtExposure = df.BsmtExposure.map({'NA':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4})
df['BsmtFinType1'] = df['BsmtFinType1'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df['BsmtFinType2'] = df['BsmtFinType2'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df.BsmtQual = df.BsmtQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterCond = df.ExterCond.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterQual = df.ExterQual.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.FireplaceQu = df.FireplaceQu.map({'NA':0, 'TA':1, 'Gd':2, 'other':3})
df.Functional = df.Functional.map({'Min1':2, 'Typ':1})
df.GarageCond = df.GarageCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.GarageQual = df.GarageQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.HeatingQC = df.HeatingQC.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.KitchenQual = df.KitchenQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.LandSlope = df.LandSlope.map({'Sev':1, 'Mod':2, 'Gtl':3})
df.PavedDrive = df.PavedDrive.map({'N':1, 'P':2, 'Y':3})
df.PoolQC = df.PoolQC.map({'NA':0, 'Ex':1})
df.Street = df.Street.map({'Grvl':1, 'Pave':2})
df.Utilities = df.Utilities.map({'NA':0, 'ELO':1, 'NoSeWa':2, 'NoSewr':3, 'AllPub':4})
Update the lists of numerical and categorical features:
newquan = ['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual',
           'ExterCond', 'ExterQual', 'FireplaceQu', 'Functional', 'GarageCond',
           'GarageQual', 'HeatingQC', 'KitchenQual', 'LandSlope', 'PavedDrive', 'PoolQC',
           'Street', 'Utilities']
quan = quan + newquan
for i in newquan: qual.remove(i)
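As a quick sanity check (a small sketch; it assumes the rare-value grouping above was applied so that the maps cover every remaining category), we can confirm that no values were left unmapped, since .map() turns any unlisted category into NaN:
# Every count should be zero; a non-zero count means a category was missing from its map
print(df[newquan].isnull().sum())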
MSSubClass is actually a categorical feature: the numeric code 20 (and the other codes) has no quantitative meaning, so we map the codes to string labels.
df.MSSubClass = df.MSSubClass.map({20:'class1', 30:'class2', 40:'class3', 45:'class4',
                                   50:'class5', 60:'class6', 70:'class7', 75:'class8',
                                   80:'class9', 85:'class10', 90:'class11', 120:'class12',
                                   150:'class13', 160:'class14', 180:'class15', 190:'class16'})
Keeping YrSold (Year Sold, YYYY) is enough; MoSold (Month Sold, MM) is largely redundant, so we drop it.
df = df.drop('MoSold', axis=1)
Update the lists of numerical and categorical features:
quan.remove('MoSold')
quan.remove('MSSubClass')
qual.append('MSSubClass')
Create new age features from the year columns:
df['Age'] = df.YrSold - df.YearBuilt
df['AgeRemod'] = df.YrSold - df.YearRemodAdd
df['AgeGarage'] = df.YrSold - df.GarageYrBlt
For the houses without a garage, we filled GarageYrBlt with zero, which makes AgeGarage roughly 2000 (the sale year minus zero). Here we replace their AgeGarage with the maximum value among the houses that do have a garage:
max_AgeGarage = np.max(df.AgeGarage[df.AgeGarage < 1000])
df['AgeGarage'] = df['AgeGarage'].map(lambda x: max_AgeGarage if x > 1000 else x)
Some of the values are negative because the work was done after the house was sold. In these cases, we change them to zero to avoid negative ages.
df.Age = df.Age.map(lambda x: 0 if x < 0 else x)
df.AgeRemod = df.AgeRemod.map(lambda x: 0 if x < 0 else x)
df.AgeGarage = df.AgeGarage.map(lambda x: 0 if x < 0 else x)
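The same clipping can be done in one step with pandas' clip; an equivalent sketch:
# Clip the three age features at zero so no negative ages remain
df[['Age', 'AgeRemod', 'AgeGarage']] = df[['Age', 'AgeRemod', 'AgeGarage']].clip(lower=0)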
Drop the original time variables:
df = df.drop(['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1)
Update the list of numerical features
for i in ['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']: quan.remove(i)
quan = quan + ['Age', 'AgeRemod', 'AgeGarage']
Detect the Outliers
The thresholds below are somewhat arbitrary (note that the GrLivArea rule compares SalePrice against 13, which assumes SalePrice has already been log-transformed):
index_drop = df.LotFrontage[df.LotFrontage > 300].index
index_drop = np.append(index_drop, df.LotArea[df.LotArea > 100000].index)
index_drop = np.append(index_drop, df.BsmtFinSF1[df.BsmtFinSF1 > 4000].index)
index_drop = np.append(index_drop, df.TotalBsmtSF[df.TotalBsmtSF > 6000].index)
index_drop = np.append(index_drop, df['1stFlrSF'][df['1stFlrSF'] > 4000].index)
index_drop = np.append(index_drop, df.GrLivArea[(df.GrLivArea > 4000) & (df.SalePrice < 13)].index)
index_drop = np.unique(index_drop)
Make sure we only remove rows from the training set (the first 1460 rows of df):
index_drop = index_drop[index_drop < 1460]
df = df.drop(index_drop).reset_index(drop=True)
print("{} examples in the training set are dropped.".format(len(index_drop)))
The output:
7 examples in the training set are dropped.
Transform the numerical features that are skewed.
It is necessary to transform the skewed numerical features because many regression models assume that the features are roughly normally distributed, with symmetrical shapes. For most machine learning algorithms, especially linear models, normally distributed features give better results.
from scipy.stats import skew
for i in quan:
    print(i + ': {}'.format(round(skew(df[i]), 2)))
The output:
LotFrontage: 0.76
LotArea: 3.6
OverallQual: 0.19
OverallCond: 0.57
MasVnrArea: 2.63
BsmtFinSF1: 0.98
BsmtFinSF2: 4.16
BsmtUnfSF: 0.92
TotalBsmtSF: 0.67
1stFlrSF: 1.26
2ndFlrSF: 0.86
LowQualFinSF: 12.07
GrLivArea: 1.07
BsmtFullBath: 0.62
BsmtHalfBath: 3.94
FullBath: 0.17
HalfBath: 0.69
BedroomAbvGr: 0.33
KitchenAbvGr: 4.3
TotRmsAbvGrd: 0.75
Fireplaces: 0.73
GarageCars: -0.22
GarageArea: 0.22
WoodDeckSF: 1.84
OpenPorchSF: 2.53
EnclosedPorch: 4.0
3SsnPorch: 11.36
ScreenPorch: 3.94
PoolArea: 17.67
MiscVal: 21.94
Alley: 4.13
BsmtCond: -3.6
BsmtExposure: 1.12
BsmtFinType1: -0.15
BsmtFinType2: 3.16
BsmtQual: -1.27
ExterCond: 1.31
ExterQual: 0.78
FireplaceQu: 0.61
Functional: 3.39
GarageCond: -3.38
GarageQual: -3.26
HeatingQC: -0.55
KitchenQual: 0.39
LandSlope: -4.98
PavedDrive: -2.97
PoolQC: 26.93
Street: -16.18
Utilities: -33.93
Age: 0.6
AgeRemod: 0.45
AgeGarage: 0.97
Transform the features whose absolute skewness is greater than 0.5:
skewed_features = np.array(quan)[np.abs(skew(df[quan])) > 0.5]
df[skewed_features] = np.log1p(df[skewed_features])
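To verify the effect, the skewness can be recomputed on the transformed columns; a quick sketch (most values should now be much closer to zero):
# Recompute skewness after the log1p transform
for i in skewed_features:
    print(i + ': {}'.format(round(skew(df[i]), 2)))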
Dummy variables for the categorical features
Create a list of dummy columns to drop: the last column generated from each categorical feature.
dummy_drop = []
for i in qual:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]
Create dummy variables
df = pd.get_dummies(df,columns=qual)
Drop the last column generated from each categorical feature
df = df.drop(dummy_drop, axis=1)
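Dropping one dummy per feature avoids perfectly collinear columns (the dummy-variable trap). As a side note, pandas can do this in a single step, although it drops the first category of each feature rather than the last; an alternative sketch, not to be run in addition to the two steps above:
# Alternative to the two steps above: drop the first dummy of each categorical feature
# df = pd.get_dummies(df, columns=qual, drop_first=True)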
Check null values
fig, ax = plt.subplots( figsize = (18, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
- Half of the SalePrice column is null; those rows are the test set that was concatenated with the training set at the start.
Split the Data
X_train = df[:-1459].drop(['SalePrice', 'Id'], axis=1)
y_train = df[:-1459]['SalePrice']
X_test = df[-1459:].drop(['SalePrice', 'Id'], axis=1)
Scaling
Fit the scaler on the training set only, then use it to transform both the training and test sets.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train[quan] = scaler.fit_transform(X_train[quan])
X_test[quan] = scaler.transform(X_test[quan])
X_train.shape, X_test.shape
The output:
((1453, 193), (1459, 193))
Now we have 193 features!
Feature Importance with XGBoost
As Jason Brownlee describes, XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance.
XGBoost is a scalable and accurate implementation of gradient boosting machines, and it has proven to push the limits of computing power for boosted-tree algorithms, as it was built and developed for the sole purpose of model performance and computational speed.
Feature importance is defined only for tree boosters
!pip install xgboost
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
imp = pd.DataFrame(xgb.feature_importances_, columns=['Importance'], index=X_train.columns)
imp = imp.sort_values(['Importance'], ascending=False)
print(imp)
The output:
Importance
GarageCond 0.298127
GarageCars 0.157924
OverallQual 0.079050
CentralAir_Y 0.052237
KitchenQual 0.037871
... ...
Fence_GdPrv 0.000000
Condition2_RRAn 0.000000
Condition2_PosN 0.000000
Condition2_PosA 0.000000
Exterior2nd_Stone 0.000000
[193 rows x 1 columns]
The 10 most important features are the top rows of the imp table above.
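A short plotting sketch (assuming matplotlib is imported as plt) to visualize them as a horizontal bar chart:
# Plot the ten most important features according to XGBoost
imp.head(10).sort_values('Importance').plot(kind='barh', figsize=(10, 6), legend=False)
plt.xlabel('Importance')
plt.show()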
Next, we apply RFECV: feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
from sklearn.metrics import make_scorer
from sklearn.feature_selection import RFECV

# Define a function to calculate RMSE
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred)**2))

# Define a function to calculate negative RMSE (as a score)
def nrmse(y_true, y_pred):
    return -1.0 * rmse(y_true, y_pred)

neg_rmse = make_scorer(nrmse)
estimator = XGBRegressor()
selector = RFECV(estimator, cv=3, n_jobs=-1, scoring=neg_rmse)
selector = selector.fit(X_train, y_train)
print("The number of selected features is: {}".format(selector.n_features_))
features_kept = X_train.columns.values[selector.support_]
X_train = X_train[features_kept]
X_test = X_test[features_kept]
The output is:
The number of selected features is: 42
So in the end we work with 42 cleaned features rather than the original 81.
Thank you for reading.