House Prices - Advanced Regression Techniques (Python)#

Predict sales prices and practice feature engineering, RFs, and gradient boosting

Author: Lingsong Zeng
Date: 12/31/2024

Introduction#

Overview#

This competition runs indefinitely with a rolling leaderboard.

Description#

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Practice Skills#

  • Creative feature engineering

  • Advanced regression techniques like random forest and gradient boosting

Acknowledgments#

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.


Evaluation#

Goal#

It is my job to predict the sales price for each house. For each Id in the test set, predict the value of the SalePrice variable.

Metric#

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
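For reference, here is a minimal sketch of how the leaderboard score can be approximated locally (the prices below are made-up illustrative values; the notebook itself trains on log1p(SalePrice), which for positive house prices is essentially the same transform):

# RMSE between log(predicted) and log(observed) sale prices
import numpy as np
from sklearn.metrics import root_mean_squared_error

y_true = np.array([200000.0, 120000.0, 350000.0])  # hypothetical observed prices
y_pred = np.array([210000.0, 115000.0, 300000.0])  # hypothetical predicted prices

score = root_mean_squared_error(np.log(y_true), np.log(y_pred))
print(f"RMSE on log prices: {score:.5f}")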

# Import required libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import root_mean_squared_error
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor, DMatrix
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, StackingRegressor

Data#

File descriptions#

  • train.csv - the training set

  • test.csv - the test set

  • data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here

  • sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

# Define paths for train and test datasets
train_path = os.path.join('data', 'house-prices', 'raw', 'train.csv')
test_path = os.path.join('data', 'house-prices', 'raw', 'test.csv')

# Load datasets
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

Here we use os.path.join to build the paths dynamically instead of hard-coding them, which gives better compatibility across platforms.

Different operating systems use different path separators (e.g. Windows uses \, while Linux and macOS use /). os.path.join automatically chooses the correct separator based on the operating system.
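A quick illustration (the separator in the printed path depends on the operating system running the code):

import os

path = os.path.join('data', 'house-prices', 'raw', 'train.csv')
print(path)
# Linux/macOS: data/house-prices/raw/train.csv
# Windows:     data\house-prices\raw\train.csv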

# Basic dataset overview
print("Train dataset shape:", train.shape)
print("Test  dataset shape:", test.shape)
Train dataset shape: (1460, 81)
Test  dataset shape: (1459, 80)

The training and test datasets are roughly the same size. The test dataset has one fewer column than the training dataset: our target column SalePrice.

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
categorical_features = [
    'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
    'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 
    'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
    'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 
    'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 
    'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 
    'SaleType', 'SaleCondition', 'OverallQual', 'OverallCond'
]

numerical_features = [
    'LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
    '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
    'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
    'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
    'MiscVal', 'MoSold', 'YrSold'
]

For categorical_features and numerical_features, we cannot simply select features by dtype (int64 or float64), because some features that are stored as integers are actually categorical, such as MSSubClass:

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Its values are stored as int64, but it is actually a categorical feature. Therefore, I manually split the columns into categorical_features and numerical_features according to the descriptions in data_description.txt.
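A quick sketch of why selecting columns by dtype alone would go wrong here (this assumes train has been loaded as above; select_dtypes is used only for illustration):

# Dtype-based selection would sweep MSSubClass (and OverallQual/OverallCond) into the numerical features
numeric_by_dtype = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
print('MSSubClass' in numeric_by_dtype)   # True, even though it is a dwelling-type code
print('OverallQual' in numeric_by_dtype)  # True as well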

none_features = {
    'Alley': 'NoAlley',
    'BsmtQual': 'NoBsmt',
    'BsmtCond': 'NoBsmt',
    'BsmtExposure': 'NoBsmt',
    'BsmtFinType1': 'NoBsmt',
    'BsmtFinType2': 'NoBsmt',
    'FireplaceQu': 'NoFireplace',
    'GarageType': 'NoGarage',
    'GarageFinish': 'NoGarage',
    'GarageQual': 'NoGarage',
    'GarageCond': 'NoGarage',
    'PoolQC': 'NoPool',
    'Fence': 'NoFence',
    'MiscFeature': 'NoFeature'
}

According to data_description.txt, NA in some features does not indicate a missing value; it means the house simply does not have that feature, as with Alley:

Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access

Therefore, these features need a different missing-value treatment. I collected them in none_features above so they can be filled with explicit "no feature" labels before the rest of the preprocessing.

for feature, value in none_features.items():
    train[feature] = train[feature].fillna(value)
    test[feature] = test[feature].fillna(value)
ordinal_features = [
    "MSSubClass", "OverallQual", "OverallCond", "LotShape", "LandSlope",
    "ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure",
    "BsmtFinType1", "BsmtFinType2", "HeatingQC", "KitchenQual",
    "Functional", "FireplaceQu", "GarageFinish", "GarageQual", "GarageCond",
    "PavedDrive", "PoolQC", "Fence"
]

nominal_features = [
    "MSZoning", "Street", "Alley", "LandContour", "Utilities", "LotConfig",
    "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle",
    "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType",
    "Foundation", "Heating", "CentralAir", "Electrical", "GarageType",
    "MiscFeature", "SaleType", "SaleCondition"
]

Similarly, I distinguish between ordinal_features and nominal_features here to facilitate the subsequent preprocessing. The difference is whether the category values have a natural order. Below are examples of an ordinal feature and a nominal feature, respectively.

ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

In the preprocessing below, ordinal features get label encoding (via an explicit ordinal mapping), while nominal features get one-hot encoding; a toy illustration follows.
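As a toy illustration of the two encodings (the mini-mapping and data below are illustrative; the full mappings are defined later in the Preprocessing section):

import pandas as pd

toy = pd.DataFrame({
    'ExterQual': ['Gd', 'TA', 'Ex'],  # ordinal: categories have a natural order
    'MSZoning':  ['RL', 'RM', 'FV'],  # nominal: categories have no order
})

# Ordinal feature: map categories to ranked integers
toy['ExterQual'] = toy['ExterQual'].map({'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0})

# Nominal feature: one-hot encode into indicator columns
toy = pd.get_dummies(toy, columns=['MSZoning'])
print(toy)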

EDA#

# Summary statistics for SalePrice
print(train['SalePrice'].describe())
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
# Plot the distribution of SalePrice
sns.histplot(train['SalePrice'], bins=20, kde=True)
plt.title('SalePrice Distribution')
plt.show()

# Check skewness and kurtosis of SalePrice
print("Skewness of SalePrice:", train['SalePrice'].skew())
print("Kurtosis of SalePrice:", train['SalePrice'].kurt())
[Figure: histogram of SalePrice with KDE]
Skewness of SalePrice: 1.8828757597682129
Kurtosis of SalePrice: 6.536281860064529

From the histogram and the skewness value (1.88), we can see that the distribution of SalePrice is right-skewed, which can hurt model training since many models work best with a roughly symmetric target. So we apply a log transformation to SalePrice.

train['LogSalePrice'] = np.log1p(train['SalePrice'])

We use np.log1p() instead of np.log() to handle zero values: np.log1p() computes log(x + 1), so a value of 0 maps to 0 rather than to negative infinity, and it is also more numerically precise for small x.
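As a quick sanity check, np.expm1() inverts np.log1p(), which is what the save_submission helper later relies on to map predictions back to dollar prices (np is the NumPy module imported at the top of the notebook):

prices = np.array([0.0, 34900.0, 755000.0])  # includes 0 to show the edge case

logged = np.log1p(prices)    # log(x + 1); log1p(0) == 0, no -inf
restored = np.expm1(logged)  # exp(x) - 1 undoes the transform

print(np.allclose(restored, prices))  # True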

sns.histplot(train['LogSalePrice'], bins=20, kde=True)
plt.title("Log-Transformed SalePrice Distribution")
plt.show()

print("Skewness of (Log-Transformed) SalePrice:", train['LogSalePrice'].skew())
print("Kurtosis of (Log-Transformed) SalePrice:", train['LogSalePrice'].kurt())
[Figure: histogram of log-transformed SalePrice with KDE]
Skewness of (Log-Transformed) SalePrice: 0.12134661989685333
Kurtosis of (Log-Transformed) SalePrice: 0.809519155707878

From the charts and statistics, we can see that after logarithmic transformation, the distribution of SalePrice is closer to normal distribution.

The following is the distribution of the remaining numerical features:

train[numerical_features].hist(figsize=(15, 15), bins=10, xlabelsize=8, ylabelsize=8);
[Figure: histograms of the numerical features]

Feature Engineering#

def add_features(data, create_interactions=True, create_base_features=True):
    """
    Add new features to the dataset.
    
    Parameters:
    - data: DataFrame, the input data.
    - create_interactions: bool, whether to create interaction and polynomial features.
    - create_base_features: bool, whether to create base features like `HouseAge`.
    
    Returns:
    - DataFrame with new features added.
    """
    # Create a DataFrame to store new features
    new_features = pd.DataFrame(index=data.index)
    
    # Base features
    if create_base_features:
        new_features['HouseAge'] = data['YrSold'] - data['YearBuilt']
        new_features['RemodelAge'] = data['YrSold'] - data['YearRemodAdd']
        new_features['TotalSF'] = data['1stFlrSF'] + data['2ndFlrSF'] + data['TotalBsmtSF']
    
    # Interaction and polynomial features
    if create_interactions:
        new_features['GrLivArea_OverallQual'] = data['GrLivArea'] * data['OverallQual']
        new_features['GrLivArea^2'] = data['GrLivArea'] ** 2
        new_features['OverallQual^2'] = data['OverallQual'] ** 2
    
    # Add the new features to the original dataset
    data = pd.concat([data, new_features], axis=1)
    return data
train = add_features(train)
test = add_features(test)
new_features = [
    'HouseAge', 'RemodelAge', 'TotalSF', 
    'GrLivArea_OverallQual', 'GrLivArea^2', 'OverallQual^2'
]
numerical_features.extend(new_features)
train[new_features].hist(figsize=(8, 8), bins=10, xlabelsize=8, ylabelsize=8);
[Figure: histograms of the newly engineered features]

Preprocessing#

test_ids = test['Id']
X_train = train.drop(columns=['Id', 'SalePrice', 'LogSalePrice'])
y_train = train['LogSalePrice']
X_test = test.drop(columns=['Id'])

Here we drop the extra columns and use the log-transformed SalePrice as y. We also save the Id column of the test set, which will be needed when submitting to Kaggle later. (The expected submission format is shown in sample_submission.csv.)

ordinal_mappings = {
    "MSSubClass": {
        20: 1, 30: 2, 40: 3, 45: 4, 50: 5,
        60: 6, 70: 7, 75: 8, 80: 9, 85: 10,
        90: 11, 120: 12, 150: 13, 160: 14,
        180: 15, 190: 16
    },
    "LotShape": {"Reg": 3, "IR1": 2, "IR2": 1, "IR3": 0},
    "LandSlope": {"Gtl": 2, "Mod": 1, "Sev": 0},
    "ExterQual": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0},
    "ExterCond": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0},
    "BsmtQual": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NoBsmt": 0},
    "BsmtCond": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NoBsmt": 0},
    "BsmtExposure": {"Gd": 4, "Av": 3, "Mn": 2, "No": 1, "NoBsmt": 0},
    "BsmtFinType1": {"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec": 3, "LwQ": 2, "Unf": 1, "NoBsmt": 0},
    "BsmtFinType2": {"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec": 3, "LwQ": 2, "Unf": 1, "NoBsmt": 0},
    "HeatingQC": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0},
    "KitchenQual": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0},
    "Functional": {
        "Typ": 7, "Min1": 6, "Min2": 5, "Mod": 4,
        "Maj1": 3, "Maj2": 2, "Sev": 1, "Sal": 0
    },
    "FireplaceQu": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NoFireplace": 0},
    "GarageFinish": {"Fin": 3, "RFn": 2, "Unf": 1, "NoGarage": 0},
    "GarageQual": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NoGarage": 0},
    "GarageCond": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NoGarage": 0},
    "PavedDrive": {"Y": 2, "P": 1, "N": 0},
    "PoolQC": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "NoPool": 0},
    "Fence": {"GdPrv": 4, "MnPrv": 3, "GdWo": 2, "MnWw": 1, "NoFence": 0},
    "MiscFeature": {"Gar2": 4, "Shed": 3, "TenC": 2, "Othr": 1, "NoFeature": 0},
    "Alley": {"Grvl": 2, "Pave": 1, "NoAlley": 0},
    "OverallQual": {i: i for i in range(1, 11)},  # 1-10
    "OverallCond": {i: i for i in range(1, 11)}   # 1-10
}

LabelEncoder would encode Ex as 0 and TA as 4, which reverses the actual quality ranking. This happens because LabelEncoder assigns values in alphabetical order of the categories, which can imply an order that does not exist. To avoid this, we define ordinal_mappings explicitly.
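A small demonstration of the pitfall (LabelEncoder is not used in this notebook; it is shown only to motivate the explicit mapping):

from sklearn.preprocessing import LabelEncoder

quality = ['Ex', 'Gd', 'TA', 'Fa', 'Po']
encoded = LabelEncoder().fit_transform(quality)
for label, code in zip(quality, encoded):
    print(label, '->', code)  # Ex->0, Fa->1, Gd->2, Po->3, TA->4: alphabetical, not by quality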

for feature in categorical_features:

    # assign type as category
    X_train[feature] = X_train[feature].astype("category")
    X_test[feature] = X_test[feature].astype("category")

    # label encoding
    if feature in ordinal_mappings:
        X_train[feature] = X_train[feature].map(ordinal_mappings[feature])
        X_test[feature] = X_test[feature].map(ordinal_mappings[feature])


original_columns = set(X_train.columns)

# one-hot encoding
X_train = pd.get_dummies(X_train, columns=nominal_features, dummy_na=True)
X_test = pd.get_dummies(X_test, columns=nominal_features, dummy_na=True)

# align train and test
X_train, X_test = X_train.align(X_test, join='outer', axis=1)
new_columns = list(set(X_train.columns) - original_columns)

# Fill missing values with 0
X_train[new_columns] = X_train[new_columns].fillna(0)
X_test[new_columns] = X_test[new_columns].fillna(0)
# Define the Iterative Imputer with Random Forest Regressor
iter_imputer = IterativeImputer(
    estimator=RandomForestRegressor(
        n_estimators=100,  # Number of trees
        n_jobs=-1,         # Use all available cores
        random_state=42
    ),
    max_iter=10,
    random_state=42
)

# Fit and transform the train and test
X_train_full = iter_imputer.fit_transform(X_train)
X_test_full = iter_imputer.transform(X_test)
X_train_full = pd.DataFrame(X_train_full, columns=X_train.columns, index=X_train.index)
X_test_full = pd.DataFrame(X_test_full, columns=X_test.columns, index=X_test.index)
scaler = StandardScaler()
X_train_full[numerical_features] = scaler.fit_transform(X_train_full[numerical_features])
X_test_full[numerical_features] = scaler.transform(X_test_full[numerical_features])

We now split the training data for validation. RepeatedKFold evaluates model stability better than a single train_test_split, especially on small datasets; for simplicity, only the first split is used as the hold-out set below.

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
for train_idx, valid_idx in rkf.split(X_train_full, y_train):
    X_train_split, X_valid_split = X_train_full.iloc[train_idx], X_train_full.iloc[valid_idx]
    y_train_split, y_valid_split = y_train.iloc[train_idx], y_train.iloc[valid_idx]
    break  # Only return the first split for simplicity

Modeling#

def save_submission(model, X_test, test_ids, filename='submission.csv'):
    """
    Generate predictions using the provided model and save them to a CSV file.
    
    Parameters:
    - model: Trained model that implements the `predict` method.
    - X_test: ndarray or DataFrame, the test dataset containing the features.
    - test_ids: array-like, unique identifiers for the test samples.
    - filename: str, name of the output submission file. Default is 'submission.csv'.
    
    The function predicts the target variable on the test set, applies an exponential
    transformation to reverse log-transformation (if applied during training), and saves
    the predictions along with the test IDs to a CSV file.
    """
    predictions = np.expm1(model.predict(X_test)) # inverse log-transform
    output_path = os.path.join('data', 'house-prices', 'processed(py)', filename)
    submission = pd.DataFrame({
        'Id': test_ids,
        'SalePrice': predictions
    })
    submission.to_csv(output_path, index=False)
    print(f"Submission file saved to: {output_path}")

kNN#

def train_knn(X_train, y_train):
    # Define the kNN model
    knn = KNeighborsRegressor()

    # Set up hyperparameter grid for tuning
    param_grid = {
        'n_neighbors': range(1, 11), # 1-10
        'weights': ['uniform', 'distance'],
        'p': [1, 2]  # Manhattan (p=1) or Euclidean (p=2) distance
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=knn,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',
        cv=5,       # 5-fold cross-validation
        verbose=1,
        n_jobs=-1   # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
knn_model = train_knn(X_train_split, y_train_split)
Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best parameters: {'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best RMSE: 0.15161816222293886
knn_valid_pred = knn_model.predict(X_valid_split)
knn_valid_rmse = root_mean_squared_error(y_valid_split, knn_valid_pred)
print(f"Validation RMSE (kNN): {knn_valid_rmse}")
Validation RMSE (kNN): 0.17032522997840102
save_submission(knn_model, X_test_full, test_ids, 'knn.csv')
Submission file saved to: data\house-prices\processed(py)\knn.csv

SVM#

def train_svm(X_train, y_train):
    # Define the SVM model
    svm_model = SVR()

    # Set up hyperparameter grid for tuning
    param_grid = {
        'C': [0.1, 1, 10],      # Regularization strength
        'epsilon': [0.01, 0.1], # Epsilon-insensitive loss
        'kernel': ['linear', 'rbf'],       # Kernel types
        'gamma': ['scale', 'auto']
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=svm_model,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error', # minimize RMSE
        cv=5,       # 5-fold cross-validation
        verbose=1,
        n_jobs=-1   # use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
svm_model = train_svm(X_train_split, y_train_split)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best parameters: {'C': 1, 'epsilon': 0.01, 'gamma': 'scale', 'kernel': 'rbf'}
Best RMSE: 0.11875760276477298
svm_valid_pred = svm_model.predict(X_valid_split)
svm_valid_rmse = root_mean_squared_error(y_valid_split, svm_valid_pred)
print(f"Validation RMSE (SVM): {svm_valid_rmse}")
Validation RMSE (SVM): 0.12985131122581856
save_submission(svm_model, X_test_full, test_ids, 'svm.csv')
Submission file saved to: data\house-prices\processed(py)\svm.csv

Linear Regression#

linear_model = LinearRegression()
linear_model.fit(X_train_split, y_train_split)

linear_valid_pred = linear_model.predict(X_valid_split)
linear_valid_rmse = root_mean_squared_error(y_valid_split, linear_valid_pred)
print(f"Validation RMSE (Linear Regression): {linear_valid_rmse}")
Validation RMSE (Linear Regression): 108051429.92181346
save_submission(linear_model, X_test_full, test_ids, 'linear_regression.csv')
Submission file saved to: data\house-prices\processed(py)\linear_regression.csv
C:\Users\ArnoZ\AppData\Local\Temp\ipykernel_30220\1089569966.py:15: RuntimeWarning: overflow encountered in expm1
  predictions = np.expm1(model.predict(X_test)) # inverse log-transform

Lasso#

def train_lasso(X_train, y_train):
    # Define the Lasso model
    lasso = Lasso(max_iter=10000, random_state=42, warm_start=True)

    # Set up hyperparameter grid for tuning
    param_grid = {
        # More fine-grained alpha values for better regularization strength tuning
        'alpha': np.logspace(-4, 2, 30)  # From 0.0001 to 100, 30 evenly spaced values
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=lasso,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error', # Minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters (Lasso):", grid_search.best_params_)
    print("Best RMSE (Lasso):", -grid_search.best_score_)

    # Return the best Ridge model
    return grid_search.best_estimator_
lasso_model = train_lasso(X_train_split, y_train_split)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best parameters (Lasso): {'alpha': np.float64(0.0006723357536499335)}
Best RMSE (Lasso): 0.12663467044628274
lasso_valid_pred = lasso_model.predict(X_valid_split)
lasso_valid_rmse = root_mean_squared_error(y_valid_split, lasso_valid_pred)
print(f"Validation RMSE (Lasso): {lasso_valid_rmse}")
Validation RMSE (Lasso): 0.1338938110142784
save_submission(lasso_model, X_test_full, test_ids, 'lasso.csv')
Submission file saved to: data\house-prices\processed(py)\lasso.csv

Ridge#

def train_ridge(X_train, y_train):
    # Define the Ridge model
    ridge = Ridge(max_iter=10000, random_state=42)

    # Set up hyperparameter grid for tuning
    param_grid = {
        'alpha': np.logspace(-3, 3, 13) # from 0.001 to 1000
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=ridge,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error', # minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Return the best Ridge model
    return grid_search.best_estimator_
ridge_model = train_ridge(X_train_split, y_train_split)
Fitting 5 folds for each of 13 candidates, totalling 65 fits
Best parameters: {'alpha': np.float64(3.1622776601683795)}
Best RMSE: 0.1278182559275935
ridge_valid_pred = ridge_model.predict(X_valid_split)
ridge_valid_rmse = root_mean_squared_error(y_valid_split, ridge_valid_pred)
print(f"Validation RMSE (Ridge): {ridge_valid_rmse}")
Validation RMSE (Ridge): 0.12965845670827755
save_submission(ridge_model, X_test_full, test_ids, 'ridge.csv')
Submission file saved to: data\house-prices\processed(py)\ridge.csv

ElasticNet#

def train_elasticnet(X_train, y_train):
    # Define the ElasticNet model
    elasticnet = ElasticNet(max_iter=10000, random_state=42)

    # Set up hyperparameter grid for tuning
    param_grid = {
        'alpha': np.logspace(-4, 2, 13),      # from 0.0001 to 100
        'l1_ratio': np.linspace(0.1, 1.0, 10) # balance between L1 and L2 penalties
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=elasticnet,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',  # minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Return the best ElasticNet model
    return grid_search.best_estimator_
elasticnet_model = train_elasticnet(X_train_split, y_train_split)
Fitting 5 folds for each of 130 candidates, totalling 650 fits
Best parameters: {'alpha': np.float64(0.001), 'l1_ratio': np.float64(0.4)}
Best RMSE: 0.12578716262255227
elasticnet_valid_pred = elasticnet_model.predict(X_valid_split)
elasticnet_valid_rmse = root_mean_squared_error(y_valid_split, elasticnet_valid_pred)
print(f"Validation RMSE (Ridge): {elasticnet_valid_rmse}")
Validation RMSE (Ridge): 0.12972983276620062
save_submission(elasticnet_model, X_test_full, test_ids, 'elasticnet.csv')
Submission file saved to: data\house-prices\processed(py)\elasticnet.csv

Decision Tree#

def train_tree(X_train, y_train):
    # Define the Decision Tree Regressor
    tree = DecisionTreeRegressor(random_state=42)

    # Set up hyperparameter grid for tuning
    param_grid = {
        'max_depth': [3, 5, 7, 10, 15, None],
        'min_samples_split': [2, 5, 10, 20, 50],
        'min_samples_leaf': [1, 2, 5, 10, 20]
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=tree,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',  # Minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
tree_model = train_tree(X_train_split, y_train_split)
Fitting 5 folds for each of 150 candidates, totalling 750 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best RMSE: 0.17498563457387445
C:\Users\ArnoZ\AppData\Roaming\Python\Python312\site-packages\numpy\ma\core.py:2846: RuntimeWarning: invalid value encountered in cast
  _data = np.array(data, dtype=dtype, copy=copy,
tree_valid_pred = tree_model.predict(X_valid_split)
tree_valid_rmse = root_mean_squared_error(y_valid_split, tree_valid_pred)
print(f"Validation RMSE (Tree): {tree_valid_rmse}")
Validation RMSE (Tree): 0.12965845670827755
save_submission(tree_model, X_test_full, test_ids, 'decision_tree.csv')
Submission file saved to: data\house-prices\processed(py)\decision_tree.csv

Random Forest#

def train_random_forest(X_train, y_train):
    # Define the Random Forest Regressor
    rf = RandomForestRegressor(random_state=42)

    # Set up hyperparameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 500],  # Number of trees in the forest
        'max_depth': [10, 20, None],      # Maximum depth of the tree
        'min_samples_split': [2, 5, 10],  # Minimum samples required to split an internal node
        'min_samples_leaf': [1, 2, 4],    # Minimum samples required to be at a leaf node
        'max_features': ['sqrt', 'log2'], # Number of features to consider at every split
        'bootstrap': [True] # Whether bootstrap sampling is used
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=rf,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',  # Minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
rf_model = train_random_forest(X_train_split, y_train_split)
Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best RMSE: 0.1352914997113765
rf_valid_pred = rf_model.predict(X_valid_split)
rf_valid_rmse = root_mean_squared_error(y_valid_split, rf_valid_pred)
print(f"Validation RMSE (Random Forest): {rf_valid_rmse}")
Validation RMSE (Random Forest): 0.1484356652188091
save_submission(rf_model, X_test_full, test_ids, 'random_forest.csv')
Submission file saved to: data\house-prices\processed(py)\random_forest.csv

Bagging#

def train_bagging(X_train, y_train):
    # Define the base estimator (weak learner)
    base_estimator = DecisionTreeRegressor(random_state=42)

    # Define the Bagging Regressor
    bagging_model = BaggingRegressor(
        estimator=base_estimator,  # Base estimator to bag (decision tree)
        random_state=42
    )

    # Set up hyperparameter grid for tuning
    param_grid = {
        'n_estimators': [10, 50, 100, 200], # Number of base estimators
        'max_samples': [0.5, 0.7, 1.0],     # Fraction of samples to draw
        'max_features': [0.5, 0.7, 1.0],    # Fraction of features to draw
        'estimator__max_depth': [3, 5, 10, None] # Depth of decision trees
    }

    # Use GridSearchCV to find the best hyperparameters
    grid_search = GridSearchCV(
        estimator=bagging_model,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',  # Minimize RMSE
        cv=5,  # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # Use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
bag_model = train_bagging(X_train_split, y_train_split)
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best parameters: {'estimator__max_depth': None, 'max_features': 0.5, 'max_samples': 1.0, 'n_estimators': 200}
Best RMSE: 0.13509426431781696
bag_valid_pred = bag_model.predict(X_valid_split)
bag_valid_rmse = root_mean_squared_error(y_valid_split, bag_valid_pred)
print(f"Validation RMSE (Bagging): {bag_valid_rmse}")
Validation RMSE (Bagging): 0.14377696858693548
save_submission(bag_model, X_test_full, test_ids, 'bagging.csv')
Submission file saved to: data\house-prices\processed(py)\bagging.csv

XGBoost#

def train_xgb(X_train, y_train):
    # Define the XGBoost Regressor
    xgb = XGBRegressor(
        objective='reg:squarederror',  # Regression objective
        random_state=42,
        tree_method = "hist",
        device = "cuda",      # enable gpu
        n_jobs=-1  # Use all available cores
    )

    # Set up hyperparameter grid for tuning
    param_grid = {
        'n_estimators': [100, 300], # Number of boosting rounds
        'learning_rate': [0.05, 0.1, 0.2], # Step size shrinkage
        'max_depth': [3, 5, 7],     # Maximum tree depth
        'subsample': [0.8, 1.0],    # Fraction of samples used for training each tree
        'colsample_bytree': [0.8, 1.0], # Fraction of features used for training each tree
        'gamma': [0, 0.1, 0.2, 0.3],    # Minimum loss reduction required to split
        'reg_alpha': [0, 0.1, 1, 10],   # L1 regularization
        'reg_lambda': [0, 0.1, 1, 10],  # L2 regularization
        'min_child_weight': [1, 3, 5, 10] # Minimum child weight
    }

    # Use RandomizedSearchCV to sample hyperparameter combinations
    grid_search = RandomizedSearchCV(
        estimator=xgb,
        param_distributions=param_grid,
        n_iter=100,
        scoring='neg_root_mean_squared_error', # minimize RMSE
        cv=5,      # 5-fold cross-validation
        verbose=1,
        n_jobs=-1  # use all available cores
    )

    # Fit the grid search to the training data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and RMSE
    print("Best parameters:", grid_search.best_params_)
    print("Best RMSE:", -grid_search.best_score_)

    # Use the best model to predict on the test set
    return grid_search.best_estimator_
xgb_model = train_xgb(X_train_split, y_train_split)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'subsample': 0.8, 'reg_lambda': 0.1, 'reg_alpha': 0, 'n_estimators': 300, 'min_child_weight': 10, 'max_depth': 5, 'learning_rate': 0.05, 'gamma': 0, 'colsample_bytree': 0.8}
Best RMSE: 0.12320914496713031
xgb_valid_pred = xgb_model.predict(X_valid_split)
xgb_valid_rmse = root_mean_squared_error(y_valid_split, xgb_valid_pred)
print(f"Validation RMSE (XGBoost): {xgb_valid_rmse}")
Validation RMSE (XGBoost): 0.14073594419026517
C:\Users\ArnoZ\AppData\Roaming\Python\Python312\site-packages\xgboost\core.py:158: UserWarning: [00:10:02] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0c55ff5f71b100e98-1\xgboost\xgboost-ci-windows\src\common\error_msg.cc:58: Falling back to prediction using DMatrix due to mismatched devices. This might lead to higher memory usage and slower performance. XGBoost is running on: cuda:0, while the input data is on: cpu.
Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.

This warning will only be shown once.

  warnings.warn(smsg, UserWarning)
save_submission(xgb_model, X_test_full, test_ids, 'xgboost.csv')
Submission file saved to: data\house-prices\processed(py)\xgboost.csv

References#

  • Kaggle Competition: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

  • Dataset Description: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data