House Prices —Data Cleaning

source: shorturl.at/anA57

Data Dictionary

  • There are 81 features in this dataset,
  • This is the data dictionary of the dataset before cleaning steps:
The head of the data dictionary
df.isnull().sum()
Null values

Handling Missing Values

Divide the data into numerical (“quan”), and categorical (“qual”) features.

quan = list( test.loc[:,test.dtypes != ‘object’].drop(‘Id’,axis=1).columns.values )
qual = list( test.loc[:,test.dtypes == ‘object’].columns.values )
print('Quantitative\n')
hasNAN = df[quan].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
print ('-' * 100)
print('Qualitative\n')
hasNAN = df[qual].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)

The output is shown here:

Quantitative
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtHalfBath 2
BsmtFullBath 2
GarageArea 1
GarageCars 1
TotalBsmtSF 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
dtype: int64
----------------------------------------------------------------------------------------------------
Qualitative

PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
GarageCond 159
GarageQual 159
GarageFinish 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MSZoning 4
Functional 2
Utilities 2
Exterior1st 1
SaleType 1
Exterior2nd 1
KitchenQual 1
Electrical 1
dtype: int64
  • LotFrontage: Linear feet of street-connected to property
  • LotShape: General shape of a property
  • Neighborhood: Physical locations within Ames city limits
print(df[[‘LotShape’, ‘Neighborhood’,’LotFrontage’]].loc[df[‘LotFrontage’].isnull()])
LotShape Neighborhood  LotFrontage
7 IR1 NWAmes NaN
12 IR2 Sawyer NaN
14 IR1 NAmes NaN
16 IR1 NAmes NaN
24 IR1 Sawyer NaN
... ... ... ...
2847 Reg CollgCr NaN
2850 IR2 CollgCr NaN
2900 IR1 Timber NaN
2901 IR1 Timber NaN
2908 IR1 Mitchel NaN

[486 rows x 3 columns]

Filling missing values for numerical features.

Most of the NAN should mean that The corresponding facility/structure doesn’t exist, so we use zero for most cases

df.LotFrontage.fillna(df.LotFrontage.median(), inplace=True)
df[“LotFrontage”] = df[“LotFrontage”].fillna(df.groupby([“Neighborhood”, “LotShape”])[“LotFrontage”].transform(“median”))

NAN should mean no garage. we temporarily use yr = 0 here. Will come back to this later.

df.GarageYrBlt.fillna(0, inplace=True)

Use zero to fill following nulls

df.MasVnrArea.fillna(0, inplace=True) 
df.BsmtHalfBath.fillna(0, inplace=True)
df.BsmtFullBath.fillna(0, inplace=True)
df.GarageArea.fillna(0, inplace=True)
df.GarageCars.fillna(0, inplace=True)
df.TotalBsmtSF.fillna(0, inplace=True)
df.BsmtUnfSF.fillna(0, inplace=True)
df.BsmtFinSF2.fillna(0, inplace=True)
df.BsmtFinSF1.fillna(0, inplace=True)
df.Electrical.fillna(df.Electrical.mode()[0], inplace=True)

Filling missing values for categorical features with not available (NA)

df.PoolQC.fillna('NA', inplace=True)
df.MiscFeature.fillna('NA', inplace=True)
df.Alley.fillna('NA', inplace=True)
df.Fence.fillna('NA', inplace=True)
df.FireplaceQu.fillna('NA', inplace=True)
df.GarageCond.fillna('NA', inplace=True)
df.GarageQual.fillna('NA', inplace=True)
df.GarageFinish.fillna('NA', inplace=True)
df.GarageType.fillna('NA', inplace=True)
df.BsmtExposure.fillna('NA', inplace=True)
df.BsmtCond.fillna('NA', inplace=True)
df.BsmtQual.fillna('NA', inplace=True)
df.BsmtFinType2.fillna('NA', inplace=True)
df.BsmtFinType1.fillna('NA', inplace=True)
df.MasVnrType.fillna('None', inplace=True)
df.Exterior2nd.fillna('None', inplace=True)
df.KitchenQual.fillna('NA', inplace=True)
df.Exterior1st.fillna('NA', inplace=True)
df.SaleType.fillna('NA', inplace=True)
df.Functional.fillna('NA', inplace=True)
df.Utilities.fillna('NA', inplace=True)

MSZoning should highly correlate with the location, so I use the mode values of individual Neighborhoods

for i in df.Neighborhood.unique():
if df.MSZoning[df.Neighborhood == i].isnull().sum() > 0:
df.loc[df.Neighborhood == i,'MSZoning'] = \
df.loc[df.Neighborhood == i,'MSZoning'].fillna(df.loc[df.Neighborhood == i,'MSZoning'].mode()[0])

Qualitative

fig = plt.subplots(nrows = 1, ncols = 1, figsize = (18, 6))
sns.heatmap(df[qual].isnull(), yticklabels=False, cbar=False, cmap=’viridis’)
heat map empty from nulls

Check duplicate

# Eliminating the duplicates.
df.duplicated(subset=None, keep=’first’).sum()
0

Feature Engineering

If we use one-hot encoding we will get a huge number of features, to avoid that, we decide to combine features that have similar data meaning.

df.Heating.value_counts()GasA     2874
GasW 27
Grav 9
Wall 6
OthW 2
Floor 1
Name: Heating, dtype: int64
df[“Heating”] = df[“Heating”].apply(lambda x: “Other” if x != “GasA” else x)
  • MasVnrType
df.MasVnrType.value_counts()None       1766
BrkFace 879
Stone 249
BrkCmn 25
Name: MasVnrType, dtype: int64
df[“MasVnrType”] = df[“MasVnrType”].apply(lambda x: “None” if x in [“None”, “BrkCmn”] else x)

Converting categorical variables into labels:

df.Alley = df.Alley.map({‘NA’:0, ‘Grvl’:1, ‘Pave’:2})
df.BsmtCond = df.BsmtCond.map({‘NA’:0, ‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.BsmtExposure = df.BsmtExposure.map({‘NA’:0, ‘No’:1, ‘Mn’:2, ‘Av’:3, ‘Gd’:4})
df[‘BsmtFinType1’] = df[‘BsmtFinType1’].map({‘NA’:0, ‘Unf’:1, ‘LwQ’:2, ‘Rec’:3, ‘BLQ’:4, ‘ALQ’:5, ‘GLQ’:6})
df[‘BsmtFinType2’] = df[‘BsmtFinType2’].map({‘NA’:0, ‘Unf’:1, ‘LwQ’:2, ‘Rec’:3, ‘BLQ’:4, ‘ALQ’:5, ‘GLQ’:6})
df.BsmtQual = df.BsmtQual.map({‘NA’:0, ‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.ExterCond = df.ExterCond.map({‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.ExterQual = df.ExterQual.map({‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.FireplaceQu = df.FireplaceQu.map({‘NA’:0, ‘TA’:1, ‘Gd’:2, ‘other’:3})
df.Functional = df.Functional.map({ ‘Min1’:2, ‘Typ’:1})
df.GarageCond = df.GarageCond.map({‘NA’:0, ‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.GarageQual = df.GarageQual.map({‘NA’:0, ‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.HeatingQC = df.HeatingQC.map({‘Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.KitchenQual = df.KitchenQual.map({‘NA’:0,’Po’:1, ‘Fa’:2, ‘TA’:3, ‘Gd’:4, ‘Ex’:5})
df.LandSlope = df.LandSlope.map({‘Sev’:1, ‘Mod’:2, ‘Gtl’:3})
df.PavedDrive = df.PavedDrive.map({‘N’:1, ‘P’:2, ‘Y’:3})
df.PoolQC = df.PoolQC.map({‘NA’:0,’Ex’:1})
df.Street = df.Street.map({‘Grvl’:1, ‘Pave’:2})
df.Utilities = df.Utilities.map({‘NA’:0,’ELO’:1, ‘NoSeWa’:2, ‘NoSewr’:3, ‘AllPub’:4})
newquan = [‘Alley’,’BsmtCond’,’BsmtExposure’,’BsmtFinType1',’BsmtFinType2',’BsmtQual’,
‘ExterCond’,’ExterQual’,’FireplaceQu’,’Functional’,’GarageCond’,
‘GarageQual’,’HeatingQC’,’KitchenQual’,’LandSlope’,’PavedDrive’,’PoolQC’,
‘Street’,’Utilities’]
quan = quan + newquan
for i in newquan: qual.remove(i)
df.MSSubClass = df.MSSubClass.map({20:’class1', 30:’class2', 40:’class3', 45:’class4',
50:’class5', 60:’class6', 70:’class7', 75:’class8',
80:’class9', 85:’class10', 90:’class11', 120:’class12',
150:’class13', 160:’class14', 180:’class15', 190:’class16'})
df=df.drop(‘MoSold’,axis=1)
quan.remove(‘MoSold’)
quan.remove(‘MSSubClass’)
qual.append(‘MSSubClass’)
df[‘Age’] = df.YrSold — df.YearBuilt
df[‘AgeRemod’] = df.YrSold — df.YearRemodAdd
df[‘AgeGarage’] = df.YrSold — df.GarageYrBlt
max_AgeGarage = np.max(df.AgeGarage[df.AgeGarage < 1000])
df[‘AgeGarage’] = df[‘AgeGarage’].map(lambda x: max_AgeGarage if x > 1000 else x)
df.Age = df.Age.map(lambda x: 0 if x < 0 else x)
df.AgeRemod = df.AgeRemod.map(lambda x: 0 if x < 0 else x)
df.AgeGarage = df.AgeGarage.map(lambda x: 0 if x < 0 else x)
df=df.drop([‘YrSold’,’YearBuilt’,’YearRemodAdd’,’GarageYrBlt’],axis=1)
for i in [‘YrSold’,’YearBuilt’,’YearRemodAdd’,’GarageYrBlt’]: quan.remove(i)
quan = quan + [‘Age’,’AgeRemod’,’AgeGarage’]

Detect the Outliers

These are somewhat arbitrary

# These are somewhat arbitrary 
index_drop = df.LotFrontage[df.LotFrontage > 300].index
index_drop = np.append(index_drop, df.LotArea[df.LotArea > 100000].index)
index_drop = np.append(index_drop, df.BsmtFinSF1[df.BsmtFinSF1 > 4000].index)
index_drop = np.append(index_drop, df.TotalBsmtSF[df.TotalBsmtSF > 6000].index)
index_drop = np.append(index_drop, df[‘1stFlrSF’][df[‘1stFlrSF’] > 4000].index)
index_drop = np.append(index_drop, df.GrLivArea[(df.GrLivArea > 4000) & (df.SalePrice < 13)].index)
index_drop = np.unique(index_drop)
index_drop = index_drop[index_drop < 1460]
df = df.drop(index_drop).reset_index(drop=True)
print(“{} examples in the training set are dropped.”.format(len(index_drop)))
7 examples in the training set are dropped.

Transform the numerical features that are skewed.

It is necessary to transform the numerical features that are skewed. This is because lots of regression models building assume that the features are distributed normally and have asymmetrical shapes. For most machine learning algorithms, especially linear models, normally distributed features gives us better results.

from scipy.stats import skew
for i in quan:
print(i+’: {}’.format(round(skew(df[i]),2)))
LotFrontage: 0.76
LotArea: 3.6
OverallQual: 0.19
OverallCond: 0.57
MasVnrArea: 2.63
BsmtFinSF1: 0.98
BsmtFinSF2: 4.16
BsmtUnfSF: 0.92
TotalBsmtSF: 0.67
1stFlrSF: 1.26
2ndFlrSF: 0.86
LowQualFinSF: 12.07
GrLivArea: 1.07
BsmtFullBath: 0.62
BsmtHalfBath: 3.94
FullBath: 0.17
HalfBath: 0.69
BedroomAbvGr: 0.33
KitchenAbvGr: 4.3
TotRmsAbvGrd: 0.75
Fireplaces: 0.73
GarageCars: -0.22
GarageArea: 0.22
WoodDeckSF: 1.84
OpenPorchSF: 2.53
EnclosedPorch: 4.0
3SsnPorch: 11.36
ScreenPorch: 3.94
PoolArea: 17.67
MiscVal: 21.94
Alley: 4.13
BsmtCond: -3.6
BsmtExposure: 1.12
BsmtFinType1: -0.15
BsmtFinType2: 3.16
BsmtQual: -1.27
ExterCond: 1.31
ExterQual: 0.78
FireplaceQu: 0.61
Functional: 3.39
GarageCond: -3.38
GarageQual: -3.26
HeatingQC: -0.55
KitchenQual: 0.39
LandSlope: -4.98
PavedDrive: -2.97
PoolQC: 26.93
Street: -16.18
Utilities: -33.93
Age: 0.6
AgeRemod: 0.45
AgeGarage: 0.97
skewed_features = np.array(quan)[np.abs(skew(df[quan])) > 0.5]
df[skewed_features] = np.log1p(df[skewed_features])

Dummy variables for the categorical features

Create a list of dummy variables that I will drop, which will be the last column generated from each categorical feature

dummy_drop = []
for i in qual:
dummy_drop += [ i+’_’+str(df[i].unique()[-1]) ]
df = pd.get_dummies(df,columns=qual) 

Drop the last column generated from each categorical feature

df = df.drop(dummy_drop,axis=1)

Check nulls values

fig, ax = plt.subplots( figsize = (18, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap=’viridis’)
Heat map of null values
  • Half of the SalePrice is null which is the test set before merging datasets.

Split the Data

X_train = df[:-1459].drop([‘SalePrice’,’Id’], axis=1)
y_train = df[:-1459][‘SalePrice’]
X_test = df[-1459:].drop([‘SalePrice’,’Id’], axis=1)

Scaling

Fit the training set only, then transform both the training and test sets.

scaler = RobustScaler()
X_train[quan]= scaler.fit_transform(X_train[quan])
X_test[quan]= scaler.transform(X_test[quan])
X_train.shape, X_test.shape

The output:

((1453, 193), (1459, 193))
!pip install xgboost
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
imp = pd.DataFrame(xgb.feature_importances_ ,columns = ['Importance'],index = X_train.columns)
imp = imp.sort_values(['Importance'], ascending = False)
print(imp)
Requirement already satisfied: xgboost in c:\users\maemt\anaconda3\lib\site-packages (1.3.1)
Requirement already satisfied: numpy in c:\users\maemt\anaconda3\lib\site-packages (from xgboost) (1.19.2)
Requirement already satisfied: scipy in c:\users\maemt\anaconda3\lib\site-packages (from xgboost) (1.5.2)
Importance
GarageCond 0.298127
GarageCars 0.157924
OverallQual 0.079050
CentralAir_Y 0.052237
KitchenQual 0.037871
... ...
Fence_GdPrv 0.000000
Condition2_RRAn 0.000000
Condition2_PosN 0.000000
Condition2_PosA 0.000000
Exterior2nd_Stone 0.000000
[193 rows x 1 columns]
most 10 important features
from sklearn.metrics import make_scorer 
from sklearn.feature_selection import RFECV
# Define a function to calculate RMSE
def rmse(y_true, y_pred):
return np.sqrt(np.mean((y_true-y_pred)**2))
# Define a function to calculate negative RMSE (as a score)
def nrmse(y_true, y_pred):
return -1.0*rmse(y_true, y_pred)
neg_rmse = make_scorer(nrmse)
estimator = XGBRegressor()
selector = RFECV(estimator, cv = 3, n_jobs = -1, scoring = neg_rmse)
selector = selector.fit(X_train, y_train)
print(“The number of selected features is: {}”.format(selector.n_features_))features_kept = X_train.columns.values[selector.support_]
X_train = X_train[features_kept]
X_test = X_test[features_kept]

The output is:

The number of selected features is: 42

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store