House Prices — Data Cleaning

Monirah abdulaziz
8 min read · Mar 1, 2021


Done by: Monirah abdulaziz, Lamalharbi, and Aisha Y Hakami

source: shorturl.at/anA57

This article walks through the process of cleaning a dataset containing information about each house's location, price, and other attributes such as square footage. When working with this sort of data, we need to modify and aggregate some features to increase their usefulness and to see which columns matter to us and which do not. The main goal of this article is to prepare the data for the modeling stage, so that we can build a good prediction of the house price from the other variables.

Data Dictionary

  • There are 81 features in this dataset.
  • This is the data dictionary of the dataset before the cleaning steps:
The head of the data dictionary
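The steps below operate on a single DataFrame df that holds the Kaggle training and test files stacked together (this is why, later on, roughly half of SalePrice shows up as null). As a minimal sketch of how such a frame might be assembled; the file names train.csv and test.csv are assumptions from the competition download, not shown in the original notebook:

import numpy as np
import pandas as pd

# Assumed file names from the Kaggle "House Prices" competition download
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Stack train on top of test so every cleaning step is applied to both at once;
# the test rows have no SalePrice, which is why roughly half of that column
# appears as null later on.
df = pd.concat([train, test], ignore_index=True)
print(df.shape)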

Check nulls:

df.isnull().sum()
Null values

Handling Missing Values

Divide the features into numerical (“quan”) and categorical (“qual”) groups. The lists are built from the test split so that the train-only SalePrice column is excluded.

quan = list(test.loc[:, test.dtypes != 'object'].drop('Id', axis=1).columns.values)
qual = list(test.loc[:, test.dtypes == 'object'].columns.values)

Find out how many missing values there are in the quantitative and categorical features:

print('Quantitative\n')
hasNAN = df[quan].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
print('-' * 100)
print('Qualitative\n')
hasNAN = df[qual].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)

The output is shown here:

Quantitative
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtHalfBath 2
BsmtFullBath 2
GarageArea 1
GarageCars 1
TotalBsmtSF 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
dtype: int64
----------------------------------------------------------------------------------------------------
Qualitative

PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
GarageCond 159
GarageQual 159
GarageFinish 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MSZoning 4
Functional 2
Utilities 2
Exterior1st 1
SaleType 1
Exterior2nd 1
KitchenQual 1
Electrical 1
dtype: int64

Rather than filling the missing values with generic statistics such as the mean, median, or mode, we prefer to handle them manually. The benefit of this approach is a deeper insight into and understanding of the data, and it also lets us enhance some features later in the feature engineering stage.

Remember the meaning of these features:

  • LotFrontage: Linear feet of street connected to property
  • LotShape: General shape of a property
  • Neighborhood: Physical locations within Ames city limits
print(df[['LotShape', 'Neighborhood', 'LotFrontage']].loc[df['LotFrontage'].isnull()])

The output:

LotShape Neighborhood  LotFrontage
7 IR1 NWAmes NaN
12 IR2 Sawyer NaN
14 IR1 NAmes NaN
16 IR1 NAmes NaN
24 IR1 Sawyer NaN
... ... ... ...
2847 Reg CollgCr NaN
2850 IR2 CollgCr NaN
2900 IR1 Timber NaN
2901 IR1 Timber NaN
2908 IR1 Mitchel NaN

[486 rows x 3 columns]

Filling missing values for numerical features.

Most of the NaNs mean that the corresponding facility or structure does not exist, so we use zero in most cases. LotFrontage is the exception: we fill it with the median LotFrontage of houses that share the same Neighborhood and LotShape, and fall back to the overall median for any rows that are still missing.

df["LotFrontage"] = df["LotFrontage"].fillna(df.groupby(["Neighborhood", "LotShape"])["LotFrontage"].transform("median"))
df.LotFrontage.fillna(df.LotFrontage.median(), inplace=True)

A NaN in GarageYrBlt should mean there is no garage. We temporarily use a year of 0 here and will come back to this later.

df.GarageYrBlt.fillna(0, inplace=True)

Use zero to fill the following numerical nulls, and the mode for Electrical:

df.MasVnrArea.fillna(0, inplace=True) 
df.BsmtHalfBath.fillna(0, inplace=True)
df.BsmtFullBath.fillna(0, inplace=True)
df.GarageArea.fillna(0, inplace=True)
df.GarageCars.fillna(0, inplace=True)
df.TotalBsmtSF.fillna(0, inplace=True)
df.BsmtUnfSF.fillna(0, inplace=True)
df.BsmtFinSF2.fillna(0, inplace=True)
df.BsmtFinSF1.fillna(0, inplace=True)
df.Electrical.fillna(df.Electrical.mode()[0], inplace=True)

Fill missing values for the categorical features with “NA” (not available):

df.PoolQC.fillna('NA', inplace=True)
df.MiscFeature.fillna('NA', inplace=True)
df.Alley.fillna('NA', inplace=True)
df.Fence.fillna('NA', inplace=True)
df.FireplaceQu.fillna('NA', inplace=True)
df.GarageCond.fillna('NA', inplace=True)
df.GarageQual.fillna('NA', inplace=True)
df.GarageFinish.fillna('NA', inplace=True)
df.GarageType.fillna('NA', inplace=True)
df.BsmtExposure.fillna('NA', inplace=True)
df.BsmtCond.fillna('NA', inplace=True)
df.BsmtQual.fillna('NA', inplace=True)
df.BsmtFinType2.fillna('NA', inplace=True)
df.BsmtFinType1.fillna('NA', inplace=True)
df.MasVnrType.fillna('None', inplace=True)
df.Exterior2nd.fillna('None', inplace=True)
df.KitchenQual.fillna('NA', inplace=True)
df.Exterior1st.fillna('NA', inplace=True)
df.SaleType.fillna('NA', inplace=True)
df.Functional.fillna('NA', inplace=True)
df.Utilities.fillna('NA', inplace=True)

MSZoning should correlate strongly with the location, so we fill it with the mode value of each individual Neighborhood:

for i in df.Neighborhood.unique():
    if df.MSZoning[df.Neighborhood == i].isnull().sum() > 0:
        df.loc[df.Neighborhood == i, 'MSZoning'] = \
            df.loc[df.Neighborhood == i, 'MSZoning'].fillna(
                df.loc[df.Neighborhood == i, 'MSZoning'].mode()[0])
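For reference, the same neighborhood-wise mode fill can be written more compactly with groupby/transform. This is only a sketch of an equivalent alternative (it assumes every Neighborhood has at least one non-null MSZoning value), not the code used above:

# Fill each missing MSZoning with the most frequent value in its Neighborhood.
# Assumes every neighborhood group contains at least one non-null MSZoning.
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
                   .transform(lambda s: s.fillna(s.mode()[0]))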

Qualitative

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 6))
sns.heatmap(df[qual].isnull(), yticklabels=False, cbar=False, cmap='viridis')
The heat map is empty: no nulls remain

Check for duplicates

# Count duplicated rows
df.duplicated(subset=None, keep='first').sum()
0

Feature Engineering

One-hot encoding every category as-is would produce a huge number of features. To avoid that, we combine category values that have a similar meaning or that occur only rarely.

df.Heating.value_counts()

GasA     2874
GasW       27
Grav        9
Wall        6
OthW        2
Floor       1
Name: Heating, dtype: int64

So we combine all values other than GasA (GasW, Grav, Wall, OthW, and Floor) into “Other”, since each of them appears only a few times.

df["Heating"] = df["Heating"].apply(lambda x: "Other" if x != "GasA" else x)
  • MasVnrType
df.MasVnrType.value_counts()

None       1766
BrkFace     879
Stone       249
BrkCmn       25
Name: MasVnrType, dtype: int64

So we combine BrkCmn into None, since it appears only a few times.

df["MasVnrType"] = df["MasVnrType"].apply(lambda x: "None" if x in ["None", "BrkCmn"] else x)

We apply the same steps to the rest of the categorical features; a generic helper for this is sketched below.
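Rather than writing one apply per column, the rare-value collapsing can be factored into a small helper. This is only a sketch of the idea; the threshold of 10 occurrences and the "Other" label are our own choices, not values from the original notebook:

def collapse_rare(series, min_count=10, other_label='Other'):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Example usage (equivalent in spirit to the manual Heating mapping above):
# df['Heating'] = collapse_rare(df['Heating'])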

Converting ordinal categorical variables into numeric labels:

df.Alley = df.Alley.map({'NA':0, 'Grvl':1, 'Pave':2})
df.BsmtCond = df.BsmtCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.BsmtExposure = df.BsmtExposure.map({'NA':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4})
df['BsmtFinType1'] = df['BsmtFinType1'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df['BsmtFinType2'] = df['BsmtFinType2'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df.BsmtQual = df.BsmtQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterCond = df.ExterCond.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterQual = df.ExterQual.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.FireplaceQu = df.FireplaceQu.map({'NA':0, 'TA':1, 'Gd':2, 'other':3})
df.Functional = df.Functional.map({'Min1':2, 'Typ':1})
df.GarageCond = df.GarageCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.GarageQual = df.GarageQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.HeatingQC = df.HeatingQC.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.KitchenQual = df.KitchenQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.LandSlope = df.LandSlope.map({'Sev':1, 'Mod':2, 'Gtl':3})
df.PavedDrive = df.PavedDrive.map({'N':1, 'P':2, 'Y':3})
df.PoolQC = df.PoolQC.map({'NA':0, 'Ex':1})
df.Street = df.Street.map({'Grvl':1, 'Pave':2})
df.Utilities = df.Utilities.map({'NA':0, 'ELO':1, 'NoSeWa':2, 'NoSewr':3, 'AllPub':4})

Update the lists of numerical and categorical features:

newquan = ['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual',
           'ExterCond', 'ExterQual', 'FireplaceQu', 'Functional', 'GarageCond',
           'GarageQual', 'HeatingQC', 'KitchenQual', 'LandSlope', 'PavedDrive', 'PoolQC',
           'Street', 'Utilities']
quan = quan + newquan
for i in newquan: qual.remove(i)

MSSubClass is actually a categorical feature: the numeric code 20, for example, carries no quantitative meaning, and the same holds for the other values.

df.MSSubClass = df.MSSubClass.map({20:'class1', 30:'class2', 40:'class3', 45:'class4',
                                   50:'class5', 60:'class6', 70:'class7', 75:'class8',
                                   80:'class9', 85:'class10', 90:'class11', 120:'class12',
                                   150:'class13', 160:'class14', 180:'class15', 190:'class16'})

Keeping YrSold is enough: MoSold (month sold, MM) is largely redundant next to YrSold (year sold, YYYY), so we drop it.

df = df.drop('MoSold', axis=1)

Update the lists of numerical and categorical features, and derive age features from the year columns:

quan.remove('MoSold')
quan.remove('MSSubClass')
qual.append('MSSubClass')
df['Age'] = df.YrSold - df.YearBuilt
df['AgeRemod'] = df.YrSold - df.YearRemodAdd
df['AgeGarage'] = df.YrSold - df.GarageYrBlt

For houses without a garage, we filled GarageYrBlt with zero, which makes AgeGarage roughly 2000.
Here we replace those AgeGarage values with the maximum value among the houses that do have garages.

max_AgeGarage = np.max(df.AgeGarage[df.AgeGarage < 1000])
df['AgeGarage'] = df['AgeGarage'].map(lambda x: max_AgeGarage if x > 1000 else x)

Some of the ages are negative because the work was done after the house was sold. In these cases, we set them to zero to avoid negative ages.

df.Age = df.Age.map(lambda x: 0 if x < 0 else x)
df.AgeRemod = df.AgeRemod.map(lambda x: 0 if x < 0 else x)
df.AgeGarage = df.AgeGarage.map(lambda x: 0 if x < 0 else x)

Drop the original time variables:

df = df.drop(['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1)

Update the list of numerical features

for i in ['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']: quan.remove(i)
quan = quan + ['Age', 'AgeRemod', 'AgeGarage']

Detect the Outliers

The cut-off values below are somewhat arbitrary.

# These are somewhat arbitrary 
index_drop = df.LotFrontage[df.LotFrontage > 300].index
index_drop = np.append(index_drop, df.LotArea[df.LotArea > 100000].index)
index_drop = np.append(index_drop, df.BsmtFinSF1[df.BsmtFinSF1 > 4000].index)
index_drop = np.append(index_drop, df.TotalBsmtSF[df.TotalBsmtSF > 6000].index)
index_drop = np.append(index_drop, df['1stFlrSF'][df['1stFlrSF'] > 4000].index)
index_drop = np.append(index_drop, df.GrLivArea[(df.GrLivArea > 4000) & (df.SalePrice < 13)].index)
index_drop = np.unique(index_drop)

Make sure we only remove data from the training set

index_drop = index_drop[index_drop < 1460]
df = df.drop(index_drop).reset_index(drop=True)
print("{} examples in the training set are dropped.".format(len(index_drop)))

The output:

7 examples in the training set are dropped.

Transform the numerical features that are skewed.

It is necessary to transform the numerical features that are skewed, because many regression models assume that the features are roughly normally distributed, i.e., that they have symmetrical shapes. For most machine learning algorithms, especially linear models, approximately normally distributed features give better results.

from scipy.stats import skew

for i in quan:
    print(i + ': {}'.format(round(skew(df[i]), 2)))

The output:

LotFrontage: 0.76
LotArea: 3.6
OverallQual: 0.19
OverallCond: 0.57
MasVnrArea: 2.63
BsmtFinSF1: 0.98
BsmtFinSF2: 4.16
BsmtUnfSF: 0.92
TotalBsmtSF: 0.67
1stFlrSF: 1.26
2ndFlrSF: 0.86
LowQualFinSF: 12.07
GrLivArea: 1.07
BsmtFullBath: 0.62
BsmtHalfBath: 3.94
FullBath: 0.17
HalfBath: 0.69
BedroomAbvGr: 0.33
KitchenAbvGr: 4.3
TotRmsAbvGrd: 0.75
Fireplaces: 0.73
GarageCars: -0.22
GarageArea: 0.22
WoodDeckSF: 1.84
OpenPorchSF: 2.53
EnclosedPorch: 4.0
3SsnPorch: 11.36
ScreenPorch: 3.94
PoolArea: 17.67
MiscVal: 21.94
Alley: 4.13
BsmtCond: -3.6
BsmtExposure: 1.12
BsmtFinType1: -0.15
BsmtFinType2: 3.16
BsmtQual: -1.27
ExterCond: 1.31
ExterQual: 0.78
FireplaceQu: 0.61
Functional: 3.39
GarageCond: -3.38
GarageQual: -3.26
HeatingQC: -0.55
KitchenQual: 0.39
LandSlope: -4.98
PavedDrive: -2.97
PoolQC: 26.93
Street: -16.18
Utilities: -33.93
Age: 0.6
AgeRemod: 0.45
AgeGarage: 0.97

Transform those with absolute skewness greater than 0.5:

skewed_features = np.array(quan)[np.abs(skew(df[quan])) > 0.5]
df[skewed_features] = np.log1p(df[skewed_features])
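As a quick sanity check, one can recompute the skewness of the transformed columns; this sketch simply reuses the skew loop from above:

# Skewness after log1p: most values should now be much closer to zero
for i in skewed_features:
    print('{}: {}'.format(i, round(skew(df[i]), 2)))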

Dummy variables for the categorical features

Create a list of dummy variables to drop: the last column generated from each categorical feature. Dropping one level per feature avoids perfect collinearity among the dummies.

dummy_drop = []
for i in qual:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]

Create dummy variables

df = pd.get_dummies(df,columns=qual) 

Drop the last column generated from each categorical feature

df = df.drop(dummy_drop,axis=1)
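For reference, pandas can drop one dummy level per feature automatically via drop_first=True (it drops the first level rather than the last, so the kept column names differ slightly from the manual approach). A tiny self-contained illustration, not part of the original pipeline:

import pandas as pd

# drop_first removes one dummy column per feature automatically,
# which is the same idea as the manual dummy_drop step above.
toy = pd.DataFrame({'Heating': ['GasA', 'Other', 'GasA']})
print(pd.get_dummies(toy, columns=['Heating'], drop_first=True))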

Check for null values

fig, ax = plt.subplots(figsize=(18, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Heat map of null values
  • Half of the SalePrice column is null; these rows come from the test set that was merged with the training data at the start.
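A quick check (just a sketch) confirms that the remaining nulls all come from the SalePrice of the test rows:

# The only remaining nulls should be the SalePrice of the 1459 test rows
print(df.SalePrice.isnull().sum())
print(df.drop('SalePrice', axis=1).isnull().sum().sum())  # expected to be 0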

Split the Data

X_train = df[:-1459].drop(['SalePrice', 'Id'], axis=1)
y_train = df[:-1459]['SalePrice']
X_test = df[-1459:].drop(['SalePrice', 'Id'], axis=1)

Scaling

Fit the scaler on the training set only, then transform both the training and test sets.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train[quan] = scaler.fit_transform(X_train[quan])
X_test[quan] = scaler.transform(X_test[quan])
X_train.shape, X_test.shape

The output:

((1453, 193), (1459, 193))

Now we have 193 features!

As Jason Brownlee describes, XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

XGBoost is a scalable and accurate implementation of gradient boosting machines; it pushes the limits of computing power for boosted tree algorithms because it was built for the sole purposes of model performance and computational speed.

Feature importance is defined only for tree boosters, so we fit an XGBoost regressor and inspect its importances:

!pip install xgboost
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
imp = pd.DataFrame(xgb.feature_importances_, columns=['Importance'], index=X_train.columns)
imp = imp.sort_values(['Importance'], ascending=False)
print(imp)

The output:

Requirement already satisfied: xgboost in c:\users\maemt\anaconda3\lib\site-packages (1.3.1)
Requirement already satisfied: numpy in c:\users\maemt\anaconda3\lib\site-packages (from xgboost) (1.19.2)
Requirement already satisfied: scipy in c:\users\maemt\anaconda3\lib\site-packages (from xgboost) (1.5.2)
Importance
GarageCond 0.298127
GarageCars 0.157924
OverallQual 0.079050
CentralAir_Y 0.052237
KitchenQual 0.037871
... ...
Fence_GdPrv 0.000000
Condition2_RRAn 0.000000
Condition2_PosN 0.000000
Condition2_PosA 0.000000
Exterior2nd_Stone 0.000000
[193 rows x 1 columns]

The 10 most important features are:

The 10 most important features

Next, we rank features with recursive feature elimination and cross-validated selection of the best number of features (RFECV).

from sklearn.metrics import make_scorer
from sklearn.feature_selection import RFECV

# Define a function to calculate RMSE
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred)**2))

# Define a function to calculate negative RMSE (as a score)
def nrmse(y_true, y_pred):
    return -1.0 * rmse(y_true, y_pred)

neg_rmse = make_scorer(nrmse)

estimator = XGBRegressor()
selector = RFECV(estimator, cv=3, n_jobs=-1, scoring=neg_rmse)
selector = selector.fit(X_train, y_train)
print("The number of selected features is: {}".format(selector.n_features_))

features_kept = X_train.columns.values[selector.support_]
X_train = X_train[features_kept]
X_test = X_test[features_kept]

The output is:

The number of selected features is: 42

So in the end we work with 42 clean features rather than the original 81.
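As a last step, the cleaned matrices could be saved for the modeling notebook; a minimal sketch, with file names of our own choosing:

# Persist the cleaned, scaled, feature-selected matrices for modeling
X_train.assign(SalePrice=y_train).to_csv('train_clean.csv', index=False)
X_test.to_csv('test_clean.csv', index=False)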

Thank you for reading.
