House Prices: Data Cleaning
Done by: Monirah Abdulaziz, Lamalharbi, and Aisha Y Hakami
This article walks through the process of cleaning a dataset that contains information about houses: their location, price, and other aspects such as square footage. When working with this kind of data, we need to modify and aggregate some features to increase their usefulness and to see which columns matter and which do not. The main goal of this article is to prepare the data for the modeling stage, so that the other variables can give us a good prediction of the house price.
Data Dictionary
- There are 81 features in this dataset.
- This is the data dictionary of the dataset before the cleaning steps:
Check nulls:
df.isnull().sum()
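For context, df here is the Kaggle House Prices training and test sets concatenated, so that every cleaning step is applied to both at once (the test rows simply have no SalePrice). A minimal setup sketch, assuming the standard train.csv and test.csv files and the usual imports:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file names from the Kaggle competition
train = pd.read_csv('train.csv')   # 1460 rows, includes SalePrice
test = pd.read_csv('test.csv')     # 1459 rows, no SalePrice

# Stack both sets so the cleaning steps below apply to train and test together
df = pd.concat([train, test], ignore_index=True, sort=False)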
Handling Missing Values
Divide the data into numerical ("quan") and categorical ("qual") features. We build the lists from the test set's columns so that SalePrice is not included, and we drop Id from the numerical list:
quan = list(test.loc[:, test.dtypes != 'object'].drop('Id', axis=1).columns.values)
qual = list(test.loc[:, test.dtypes == 'object'].columns.values)
Find out how many missing values there are for the quantitative and qualitative features:
print('Quantitative\n')
hasNAN = df[quan].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
print ('-' * 100)
print('Qualitative\n')
hasNAN = df[qual].isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN)
The output is shown here:
Quantitative
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtHalfBath 2
BsmtFullBath 2
GarageArea 1
GarageCars 1
TotalBsmtSF 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
dtype: int64
----------------------------------------------------------------------------------------------------
Qualitative
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
GarageCond 159
GarageQual 159
GarageFinish 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MSZoning 4
Functional 2
Utilities 2
Exterior1st 1
SaleType 1
Exterior2nd 1
KitchenQual 1
Electrical 1
dtype: int64
Rather than using statistical methods such as the mean, median, or mode to fill the missing values, we prefer to do it manually. The benefit of this approach is that it gives us a deeper insight into and understanding of the data, and it also lets us enhance some features later in the feature engineering stage.
Remember the meaning of these features:
- LotFrontage: Linear feet of street connected to the property
- LotShape: General shape of a property
- Neighborhood: Physical locations within Ames city limits
print(df[['LotShape', 'Neighborhood', 'LotFrontage']].loc[df['LotFrontage'].isnull()])
The output:
LotShape Neighborhood LotFrontage
7 IR1 NWAmes NaN
12 IR2 Sawyer NaN
14 IR1 NAmes NaN
16 IR1 NAmes NaN
24 IR1 Sawyer NaN
... ... ... ...
2847 Reg CollgCr NaN
2850 IR2 CollgCr NaN
2900 IR1 Timber NaN
2901 IR1 Timber NaN
2908 IR1 Mitchel NaN
[486 rows x 3 columns]
Filling missing values for numerical features.
Most of these NaNs mean that the corresponding facility or structure does not exist, so we use zero in most cases.
LotFrontage is different: we fill it with the median LotFrontage of houses in the same Neighborhood with the same LotShape, then fall back to the overall median for any values that remain missing.
# Fill with the median within each (Neighborhood, LotShape) group, then the overall median
df["LotFrontage"] = df["LotFrontage"].fillna(df.groupby(["Neighborhood", "LotShape"])["LotFrontage"].transform("median"))
df.LotFrontage.fillna(df.LotFrontage.median(), inplace=True)
A NaN here should mean no garage. We temporarily use year = 0 and will come back to this later.
df.GarageYrBlt.fillna(0, inplace=True)
Use zero to fill the following nulls:
df.MasVnrArea.fillna(0, inplace=True)
df.BsmtHalfBath.fillna(0, inplace=True)
df.BsmtFullBath.fillna(0, inplace=True)
df.GarageArea.fillna(0, inplace=True)
df.GarageCars.fillna(0, inplace=True)
df.TotalBsmtSF.fillna(0, inplace=True)
df.BsmtUnfSF.fillna(0, inplace=True)
df.BsmtFinSF2.fillna(0, inplace=True)
df.BsmtFinSF1.fillna(0, inplace=True)
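The same zero fills can be written more compactly; an equivalent sketch (the column list simply mirrors the calls above):
# Columns where a missing value means the structure does not exist, so zero is a sensible fill
zero_fill_cols = ['MasVnrArea', 'BsmtHalfBath', 'BsmtFullBath', 'GarageArea', 'GarageCars',
                  'TotalBsmtSF', 'BsmtUnfSF', 'BsmtFinSF2', 'BsmtFinSF1']
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)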
Electrical is categorical, so we fill it with its most common value (the mode):
df.Electrical.fillna(df.Electrical.mode()[0], inplace=True)
Filling missing values for categorical features with 'NA' (not available):
df.PoolQC.fillna('NA', inplace=True)
df.MiscFeature.fillna('NA', inplace=True)
df.Alley.fillna('NA', inplace=True)
df.Fence.fillna('NA', inplace=True)
df.FireplaceQu.fillna('NA', inplace=True)
df.GarageCond.fillna('NA', inplace=True)
df.GarageQual.fillna('NA', inplace=True)
df.GarageFinish.fillna('NA', inplace=True)
df.GarageType.fillna('NA', inplace=True)
df.BsmtExposure.fillna('NA', inplace=True)
df.BsmtCond.fillna('NA', inplace=True)
df.BsmtQual.fillna('NA', inplace=True)
df.BsmtFinType2.fillna('NA', inplace=True)
df.BsmtFinType1.fillna('NA', inplace=True)
df.MasVnrType.fillna('None', inplace=True)
df.Exterior2nd.fillna('None', inplace=True)
df.KitchenQual.fillna('NA', inplace=True)
df.Exterior1st.fillna('NA', inplace=True)
df.SaleType.fillna('NA', inplace=True)
df.Functional.fillna('NA', inplace=True)
df.Utilities.fillna('NA', inplace=True)
MSZoning should be highly correlated with the location, so we fill it with the mode of each individual Neighborhood:
for i in df.Neighborhood.unique():
    if df.MSZoning[df.Neighborhood == i].isnull().sum() > 0:
        df.loc[df.Neighborhood == i, 'MSZoning'] = \
            df.loc[df.Neighborhood == i, 'MSZoning'].fillna(df.loc[df.Neighborhood == i, 'MSZoning'].mode()[0])
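An equivalent, more compact version of the same per-Neighborhood fill, sketched with groupby/transform (it assumes every Neighborhood has at least one non-missing MSZoning value):
# Fill each missing MSZoning with the most common value in its Neighborhood
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
                   .transform(lambda s: s.fillna(s.mode()[0]))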
Visualize the remaining nulls in the qualitative features with a heatmap:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 6))
sns.heatmap(df[qual].isnull(), yticklabels=False, cbar=False, cmap='viridis', ax=ax)
Check duplicates
# Check for duplicate rows
df.duplicated(subset=None, keep='first').sum()
The output is 0, so there are no duplicate rows to drop.
Feature Engineering
If we one-hot encode the categorical features as they are, we will get a huge number of columns. To avoid that, we decided to combine category values that have a similar meaning or that appear only rarely.
- Heating
df.Heating.value_counts()
The output:
GasA 2874
GasW 27
Grav 9
Wall 6
OthW 2
Floor 1
Name: Heating, dtype: int64
So we combine all values other than GasA (GasW, Grav, Wall, OthW, and Floor) into a single "Other" category, since each of them appears only a few times.
df["Heating"] = df["Heating"].apply(lambda x: "Other" if x != "GasA" else x)
- MasVnrType
df.MasVnrType.value_counts()
The output:
None 1766
BrkFace 879
Stone 249
BrkCmn 25
Name: MasVnrType, dtype: int64
So we merge BrkCmn into None, since it appears only a few times.
df["MasVnrType"] = df["MasVnrType"].apply(lambda x: "None" if x in ["None", "BrkCmn"] else x)
And do the same steps for all categorical features.
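Rather than repeating a hand-written lambda for every column, the rare-value grouping could be wrapped in a small helper. This is only an illustrative sketch; the threshold, the "Other" label, and the example columns are our own choices, not part of the original code:
def collapse_rare(series, min_count=30, other_label='Other'):
    """Replace category values that appear fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Hypothetical usage on other sparse categorical columns
# df['RoofMatl'] = collapse_rare(df['RoofMatl'])
# df['Condition2'] = collapse_rare(df['Condition2'])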
Converting categorical variables into labels:
df.Alley = df.Alley.map({'NA':0, 'Grvl':1, 'Pave':2})
df.BsmtCond = df.BsmtCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.BsmtExposure = df.BsmtExposure.map({'NA':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4})
df['BsmtFinType1'] = df['BsmtFinType1'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df['BsmtFinType2'] = df['BsmtFinType2'].map({'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
df.BsmtQual = df.BsmtQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterCond = df.ExterCond.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.ExterQual = df.ExterQual.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.FireplaceQu = df.FireplaceQu.map({'NA':0, 'TA':1, 'Gd':2, 'other':3})
df.Functional = df.Functional.map({'Min1':2, 'Typ':1})
df.GarageCond = df.GarageCond.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.GarageQual = df.GarageQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.HeatingQC = df.HeatingQC.map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.KitchenQual = df.KitchenQual.map({'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})
df.LandSlope = df.LandSlope.map({'Sev':1, 'Mod':2, 'Gtl':3})
df.PavedDrive = df.PavedDrive.map({'N':1, 'P':2, 'Y':3})
df.PoolQC = df.PoolQC.map({'NA':0, 'Ex':1})
df.Street = df.Street.map({'Grvl':1, 'Pave':2})
df.Utilities = df.Utilities.map({'NA':0, 'ELO':1, 'NoSeWa':2, 'NoSewr':3, 'AllPub':4})
Update the lists of numerical and categorical features:
newquan = ['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual',
           'ExterCond', 'ExterQual', 'FireplaceQu', 'Functional', 'GarageCond',
           'GarageQual', 'HeatingQC', 'KitchenQual', 'LandSlope', 'PavedDrive', 'PoolQC',
           'Street', 'Utilities']
quan = quan + newquan
for i in newquan: qual.remove(i)
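As a quick sanity check (a small sketch; it assumes the rare-value grouping above was applied so that the maps cover every remaining category), we can confirm that no values were left unmapped, since .map() turns any unlisted category into NaN:
# Every count should be zero; a non-zero count means a category was missing from its map
print(df[newquan].isnull().sum())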
MSSubClass is actually a categorical feature: the numeric code 20 (and the other codes) has no quantitative meaning, so we map the codes to string labels.
df.MSSubClass = df.MSSubClass.map({20:'class1', 30:'class2', 40:'class3', 45:'class4',
                                   50:'class5', 60:'class6', 70:'class7', 75:'class8',
                                   80:'class9', 85:'class10', 90:'class11', 120:'class12',
                                   150:'class13', 160:'class14', 180:'class15', 190:'class16'})
Keeping YrSold (Year Sold, YYYY) is enough; MoSold (Month Sold, MM) is largely redundant, so we drop it.
df = df.drop('MoSold', axis=1)
Update the lists of numerical and categorical features:
quan.remove('MoSold')
quan.remove('MSSubClass')
qual.append('MSSubClass')
Create new age features from the year columns:
df['Age'] = df.YrSold - df.YearBuilt
df['AgeRemod'] = df.YrSold - df.YearRemodAdd
df['AgeGarage'] = df.YrSold - df.GarageYrBlt
For the houses without a garage, we filled GarageYrBlt with zero, which makes AgeGarage roughly 2000 (the sale year minus zero). Here we replace their AgeGarage with the maximum value among the houses that do have a garage:
max_AgeGarage = np.max(df.AgeGarage[df.AgeGarage < 1000])
df['AgeGarage'] = df['AgeGarage'].map(lambda x: max_AgeGarage if x > 1000 else x)
Some of the values are negative because the work was done after the house was sold. In these cases, we change them to zero to avoid negative ages.
df.Age = df.Age.map(lambda x: 0 if x < 0 else x)
df.AgeRemod = df.AgeRemod.map(lambda x: 0 if x < 0 else x)
df.AgeGarage = df.AgeGarage.map(lambda x: 0 if x < 0 else x)
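The same clipping can be done in one step with pandas' clip; an equivalent sketch:
# Clip the three age features at zero so no negative ages remain
df[['Age', 'AgeRemod', 'AgeGarage']] = df[['Age', 'AgeRemod', 'AgeGarage']].clip(lower=0)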
Drop the original time variables:
df = df.drop(['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1)
Update the list of numerical features
for i in ['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']: quan.remove(i)
quan = quan + ['Age', 'AgeRemod', 'AgeGarage']
Detect the Outliers
The thresholds below are somewhat arbitrary (note that the GrLivArea rule compares SalePrice against 13, which assumes SalePrice has already been log-transformed):
index_drop = df.LotFrontage[df.LotFrontage > 300].index
index_drop = np.append(index_drop, df.LotArea[df.LotArea > 100000].index)
index_drop = np.append(index_drop, df.BsmtFinSF1[df.BsmtFinSF1 > 4000].index)
index_drop = np.append(index_drop, df.TotalBsmtSF[df.TotalBsmtSF > 6000].index)
index_drop = np.append(index_drop, df['1stFlrSF'][df['1stFlrSF'] > 4000].index)
index_drop = np.append(index_drop, df.GrLivArea[(df.GrLivArea > 4000) & (df.SalePrice < 13)].index)
index_drop = np.unique(index_drop)
Make sure we only remove rows from the training set (the first 1460 rows of df):
index_drop = index_drop[index_drop < 1460]
df = df.drop(index_drop).reset_index(drop=True)
print("{} examples in the training set are dropped.".format(len(index_drop)))
The output:
7 examples in the training set are dropped.
Transform the numerical features that are skewed.
It is necessary to transform the skewed numerical features because many regression models assume that the features are roughly normally distributed, with symmetrical shapes. For most machine learning algorithms, especially linear models, normally distributed features give better results.
from scipy.stats import skew
for i in quan:
    print(i + ': {}'.format(round(skew(df[i]), 2)))
The output:
LotFrontage: 0.76
LotArea: 3.6
OverallQual: 0.19
OverallCond: 0.57
MasVnrArea: 2.63
BsmtFinSF1: 0.98
BsmtFinSF2: 4.16
BsmtUnfSF: 0.92
TotalBsmtSF: 0.67
1stFlrSF: 1.26
2ndFlrSF: 0.86
LowQualFinSF: 12.07
GrLivArea: 1.07
BsmtFullBath: 0.62
BsmtHalfBath: 3.94
FullBath: 0.17
HalfBath: 0.69
BedroomAbvGr: 0.33
KitchenAbvGr: 4.3
TotRmsAbvGrd: 0.75
Fireplaces: 0.73
GarageCars: -0.22
GarageArea: 0.22
WoodDeckSF: 1.84
OpenPorchSF: 2.53
EnclosedPorch: 4.0
3SsnPorch: 11.36
ScreenPorch: 3.94
PoolArea: 17.67
MiscVal: 21.94
Alley: 4.13
BsmtCond: -3.6
BsmtExposure: 1.12
BsmtFinType1: -0.15
BsmtFinType2: 3.16
BsmtQual: -1.27
ExterCond: 1.31
ExterQual: 0.78
FireplaceQu: 0.61
Functional: 3.39
GarageCond: -3.38
GarageQual: -3.26
HeatingQC: -0.55
KitchenQual: 0.39
LandSlope: -4.98
PavedDrive: -2.97
PoolQC: 26.93
Street: -16.18
Utilities: -33.93
Age: 0.6
AgeRemod: 0.45
AgeGarage: 0.97
Transform the features whose absolute skewness is greater than 0.5:
skewed_features = np.array(quan)[np.abs(skew(df[quan])) > 0.5]
df[skewed_features] = np.log1p(df[skewed_features])
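To verify the effect, the skewness can be recomputed on the transformed columns; a quick sketch (most values should now be much closer to zero):
# Recompute skewness after the log1p transform
for i in skewed_features:
    print(i + ': {}'.format(round(skew(df[i]), 2)))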
Dummy variables for the categorical features
Create a list of dummy columns to drop: the last column generated from each categorical feature.
dummy_drop = []
for i in qual:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]
Create dummy variables
df = pd.get_dummies(df,columns=qual)
Drop the last column generated from each categorical feature
df = df.drop(dummy_drop, axis=1)
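Dropping one dummy per feature avoids perfectly collinear columns (the dummy-variable trap). As a side note, pandas can do this in a single step, although it drops the first category of each feature rather than the last; an alternative sketch, not to be run in addition to the two steps above:
# Alternative to the two steps above: drop the first dummy of each categorical feature
# df = pd.get_dummies(df, columns=qual, drop_first=True)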
Check null values
fig, ax = plt.subplots( figsize = (18, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
- Half of the SalePrice column is null; those rows are the test set that was concatenated with the training set at the start.
Split the Data
X_train = df[:-1459].drop(['SalePrice', 'Id'], axis=1)
y_train = df[:-1459]['SalePrice']
X_test = df[-1459:].drop(['SalePrice', 'Id'], axis=1)
Scaling
Fit the scaler on the training set only, then use it to transform both the training and test sets.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train[quan] = scaler.fit_transform(X_train[quan])
X_test[quan] = scaler.transform(X_test[quan])
X_train.shape, X_test.shape
The output:
((1453, 193), (1459, 193))
Now we have 193 features!
Feature Importance with XGBoost
As Jason Brownlee describes, XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance.
XGBoost is a scalable and accurate implementation of gradient boosting machines, and it has proven to push the limits of computing power for boosted-tree algorithms, as it was built and developed for the sole purpose of model performance and computational speed.
Feature importance is defined only for tree boosters
!pip install xgboost
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
imp = pd.DataFrame(xgb.feature_importances_, columns=['Importance'], index=X_train.columns)
imp = imp.sort_values(['Importance'], ascending=False)
print(imp)
The output:
Importance
GarageCond 0.298127
GarageCars 0.157924
OverallQual 0.079050
CentralAir_Y 0.052237
KitchenQual 0.037871
... ...
Fence_GdPrv 0.000000
Condition2_RRAn 0.000000
Condition2_PosN 0.000000
Condition2_PosA 0.000000
Exterior2nd_Stone 0.000000
[193 rows x 1 columns]
The 10 most important features are the top rows of the imp table above.
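A short plotting sketch (assuming matplotlib is imported as plt) to visualize them as a horizontal bar chart:
# Plot the ten most important features according to XGBoost
imp.head(10).sort_values('Importance').plot(kind='barh', figsize=(10, 6), legend=False)
plt.xlabel('Importance')
plt.show()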
Next, we apply RFECV: feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
from sklearn.metrics import make_scorer
from sklearn.feature_selection import RFECV

# Define a function to calculate RMSE
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred)**2))

# Define a function to calculate negative RMSE (as a score)
def nrmse(y_true, y_pred):
    return -1.0 * rmse(y_true, y_pred)

neg_rmse = make_scorer(nrmse)
estimator = XGBRegressor()
selector = RFECV(estimator, cv=3, n_jobs=-1, scoring=neg_rmse)
selector = selector.fit(X_train, y_train)
print("The number of selected features is: {}".format(selector.n_features_))
features_kept = X_train.columns.values[selector.support_]
X_train = X_train[features_kept]
X_test = X_test[features_kept]
The output is:
The number of selected features is: 42
So in the end we work with 42 cleaned features rather than the original 81.
Thank you for reading.