House Prices - Advanced Regression Techniques (R)#
Predict sales prices and practice feature engineering, RFs, and gradient boosting
Author: Lingsong Zeng
Date: 04/20/2020
Introduction#
Overview#
This competition runs indefinitely with a rolling leaderboard.
Description#
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Practice Skills#
Creative feature engineering
Advanced regression techniques like random forest and gradient boosting
Acknowledgments#
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.
Photo by Tom Thain on Unsplash.
Evaluation#
Goal#
It is my job to predict the sales price for each house. For each Id in the test set, predict the value of the SalePrice variable.
Metric#
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
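For reference, a minimal sketch of this metric in R (assuming actual and predicted are vectors of sale prices on the original dollar scale):
# RMSE computed on log prices, matching the leaderboard metric (sketch)
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}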
# Load required packages
suppressWarnings({
library(readr) # Read CSV files
library(dplyr) # Data manipulation
library(tidyr) # Data tidying
library(ggplot2) # Data visualization
library(moments) # Skewness & kurtosis analysis
library(scales) # Scaling and formatting plots
library(VIM) # kNN imputation for missing values
library(car) # Variance Inflation Factor (VIF) for multicollinearity
library(caret) # Machine learning framework (model training & tuning)
library(e1071) # Support Vector Machines (SVM)
library(kernlab) # Advanced SVM implementation
library(glmnet) # Lasso, Ridge, and ElasticNet regression
library(rpart) # Decision Tree modeling
library(rpart.plot) # Decision Tree visualization
library(randomForest) # Random Forest (bagging)
library(xgboost) # Extreme Gradient Boosting (XGBoost)
})
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Attaching package: 'scales'
The following object is masked from 'package:readr':
col_factor
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
Loading required package: lattice
Attaching package: 'e1071'
The following objects are masked from 'package:moments':
kurtosis, moment, skewness
Attaching package: 'kernlab'
The following object is masked from 'package:scales':
alpha
The following object is masked from 'package:ggplot2':
alpha
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-8
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
The following object is masked from 'package:dplyr':
combine
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
Data#
File descriptions#
train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
# Construct file paths
train_path <- file.path("data", "house-prices", "raw", "train.csv")
test_path <- file.path("data", "house-prices", "raw", "test.csv")
# Read data
train <- read_csv(train_path)
test <- read_csv(test_path)
# Extract the Id column
Id <- test$Id
# Remove the Id column
train <- train %>% select(-Id)
test <- test %>% select(-Id)
Rows: 1460 Columns: 81
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1459 Columns: 80
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here we use file.path to build paths dynamically instead of hard-coding them, which gives better compatibility across platforms. Different operating systems use different path separators (e.g. Windows uses \, while Linux and macOS use /), and file.path automatically chooses the correct separator for the operating system it runs on.
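For example, a quick illustration of how the components above are joined (output shown for a Unix-like system):
# file.path joins the components with the separator appropriate for the platform
file.path("data", "house-prices", "raw", "train.csv")
# "data/house-prices/raw/train.csv"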
The training and test datasets are roughly the same size. The test dataset has one fewer column than the training dataset: the target column, SalePrice.
glimpse(train)
Rows: 1,460
Columns: 80
$ MSSubClass <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
$ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R…
$ LotFrontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
$ LotArea <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
$ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
$ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", …
$ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
$ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
$ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I…
$ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
$ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "…
$ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",…
$ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
$ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
$ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi…
$ OverallQual <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
$ OverallCond <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
$ YearBuilt <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
$ YearRemodAdd <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
$ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G…
$ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
$ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "…
$ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "…
$ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",…
$ MasVnrArea <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
$ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T…
$ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
$ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "…
$ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T…
$ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T…
$ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N…
$ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", …
$ BsmtFinSF1 <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
$ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", …
$ BsmtFinSF2 <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ BsmtUnfSF <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
$ TotalBsmtSF <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
$ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
$ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E…
$ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
$ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
$ `1stFlrSF` <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
$ `2ndFlrSF` <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
$ LowQualFinSF <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ GrLivArea <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
$ BsmtFullBath <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
$ BsmtHalfBath <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FullBath <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
$ HalfBath <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
$ BedroomAbvGr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
$ KitchenAbvGr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T…
$ TotRmsAbvGrd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
$ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
$ Fireplaces <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
$ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", …
$ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch…
$ GarageYrBlt <dbl> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
$ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", …
$ GarageCars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
$ GarageArea <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
$ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G…
$ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
$ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
$ WoodDeckSF <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
$ OpenPorchSF <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
$ EnclosedPorch <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
$ `3SsnPorch` <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ScreenPorch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
$ PoolArea <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,…
$ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, …
$ MiscVal <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
$ MoSold <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
$ YrSold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
$ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
$ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
$ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
Because we will use all features for modeling later, we need to convert every feature stored as chr to factor. To do that, we first separate the categorical features from the numerical features.
categorical_features <- c(
'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
'SaleType', 'SaleCondition', 'OverallQual', 'OverallCond'
)
numerical_features <- c(
'LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
'2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold'
)
For categorical_features and numerical_features, the reason we cannot simply select features by their storage type (numeric vs. character) is that some features, such as MSSubClass, are stored as numbers but are actually categorical:
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
Its values are stored as numbers, but it is actually a categorical feature. Therefore, I manually separate all categorical_features and numerical_features according to the descriptions in data_description.txt.
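As a quick sanity check (a small sketch, not part of the original pipeline), selecting columns purely by storage type would place MSSubClass among the numeric features:
# Selecting by storage type alone misclassifies MSSubClass as numeric
train %>%
  select(where(is.numeric)) %>%
  names() %>%
  head(10)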
# Convert all categorical features to factor
train <- train %>%
mutate(across(all_of(categorical_features), as.factor))
test <- test %>%
mutate(across(all_of(categorical_features), as.factor))
glimpse(train)
Rows: 1,460
Columns: 80
$ MSSubClass <fct> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
$ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, …
$ LotFrontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
$ LotArea <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
$ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pa…
$ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg, I…
$ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, L…
$ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
$ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner…
$ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
$ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, So…
$ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery,…
$ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Ar…
$ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2f…
$ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Story, …
$ OverallQual <fct> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
$ OverallCond <fct> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
$ YearBuilt <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
$ YearRemodAdd <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
$ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gable, …
$ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, Co…
$ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd, Vi…
$ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd, Vi…
$ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, Ston…
$ MasVnrArea <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
$ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
$ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
$ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBlock…
$ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, Gd, …
$ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
$ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, Av, …
$ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec, G…
$ BsmtFinSF1 <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
$ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf, U…
$ BsmtFinSF2 <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ BsmtUnfSF <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
$ TotalBsmtSF <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
$ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Ga…
$ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, Ex, …
$ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
$ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, …
$ `1stFlrSF` <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
$ `2ndFlrSF` <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
$ LowQualFinSF <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ GrLivArea <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
$ BsmtFullBath <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
$ BsmtHalfBath <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FullBath <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
$ HalfBath <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
$ BedroomAbvGr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
$ KitchenAbvGr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
$ TotRmsAbvGrd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
$ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Typ, …
$ Fireplaces <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
$ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, Gd, …
$ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attchd, …
$ GarageYrBlt <dbl> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
$ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf, F…
$ GarageCars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
$ GarageArea <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
$ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, TA, …
$ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
$ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
$ WoodDeckSF <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
$ OpenPorchSF <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
$ EnclosedPorch <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
$ `3SsnPorch` <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ScreenPorch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
$ PoolArea <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA, N…
$ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, NA, …
$ MiscVal <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
$ MoSold <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
$ YrSold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
$ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, New…
$ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Normal,…
$ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
# Define the variables where NA means "None"
none_features <- c(
"Alley", "BsmtQual", "BsmtCond", "BsmtExposure",
"BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType",
"GarageFinish", "GarageQual", "GarageCond", "PoolQC",
"Fence", "MiscFeature"
)
According to the description in data_description.txt, NA in some features does not indicate a missing value but rather that the house does not have that feature at all, such as Alley:
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
Therefore, these features need a different imputation strategy from genuinely missing values. I collect all such features here to prepare for the missing-value imputation in the subsequent preprocessing.
# Replace NA with "None" and ensure "None" is the lowest factor level
train <- train %>%
mutate(
across(all_of(none_features),
~ factor(
replace_na(as.character(.), "None"), # fill NA by "None"
levels = c("None", sort(unique(as.character(na.omit(.))))) # redefine factor levels
)
)
)
test <- test %>%
mutate(
across(
all_of(none_features),
~ factor(
replace_na(as.character(.), "None"),
levels = c("None", sort(unique(as.character(na.omit(.)))))
)
)
)
# Verify the changes
summary(train[none_features])
Alley BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
None:1369 None: 37 None: 37 None: 38 None: 37 None: 38
Grvl: 50 Ex :121 Fa : 45 Av :221 ALQ :220 ALQ : 19
Pave: 41 Fa : 35 Gd : 65 Gd :134 BLQ :148 BLQ : 33
Gd :618 Po : 2 Mn :114 GLQ :418 GLQ : 14
TA :649 TA :1311 No :953 LwQ : 74 LwQ : 46
Rec :133 Rec : 54
Unf :430 Unf :1256
FireplaceQu GarageType GarageFinish GarageQual GarageCond PoolQC
None:690 None : 81 None: 81 None: 81 None: 81 None:1453
Ex : 24 2Types : 6 Fin :352 Ex : 3 Ex : 2 Ex : 2
Fa : 33 Attchd :870 RFn :422 Fa : 48 Fa : 35 Fa : 2
Gd :380 Basment: 19 Unf :605 Gd : 14 Gd : 9 Gd : 3
Po : 20 BuiltIn: 88 Po : 3 Po : 7
TA :313 CarPort: 9 TA :1311 TA :1326
Detchd :387
Fence MiscFeature
None :1179 None:1406
GdPrv: 59 Gar2: 2
GdWo : 54 Othr: 2
MnPrv: 157 Shed: 49
MnWw : 11 TenC: 1
ordinal_features <- c(
"MSSubClass", "OverallQual", "OverallCond", "LotShape", "LandSlope",
"ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure",
"BsmtFinType1", "BsmtFinType2", "HeatingQC", "KitchenQual",
"Functional", "FireplaceQu", "GarageFinish", "GarageQual", "GarageCond",
"PavedDrive", "PoolQC", "Fence"
)
nominal_features <- c(
"MSZoning", "Street", "Alley", "LandContour", "Utilities", "LotConfig",
"Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle",
"RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType",
"Foundation", "Heating", "CentralAir", "Electrical", "GarageType",
"MiscFeature", "SaleType", "SaleCondition"
)
Similarly, I distinguish between ordinal_features and nominal_features here to facilitate subsequent preprocessing. The main difference between them is whether the variable's values have a natural order. Below are examples of an ordinal feature and a nominal feature, respectively.
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
In the following preprocessing, ordinal features will receive Label Encoding, while nominal features will receive One-Hot Encoding.
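To make the distinction concrete, here is a minimal sketch (toy vectors, not columns from the dataset) of what the two encodings produce:
# Label Encoding: an ordered factor maps directly to integer codes 1..k
quality <- factor(c("TA", "Gd", "Ex", "Fa"),
                  levels = c("Po", "Fa", "TA", "Gd", "Ex"))
as.integer(quality)          # 3 4 5 2 -- preserves the quality ranking

# One-Hot Encoding: an unordered factor becomes one 0/1 indicator column per level
zoning <- factor(c("RL", "RM", "RL"))
model.matrix(~ zoning - 1)   # columns zoningRL and zoningRM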
EDA#
summary(train$SalePrice)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34900 129975 163000 180921 214000 755000
options(repr.plot.width = 10, repr.plot.height = 8)
# Plot the distribution of SalePrice
train %>%
ggplot(aes(x = SalePrice)) +
geom_histogram(aes(fill = after_stat(count)), bins = 30, color = "black", alpha = 0.8) +
scale_fill_viridis_c(name = "SalePrice", option = "plasma") +
geom_vline(
aes(xintercept = mean(SalePrice, na.rm = TRUE)),
linetype = "dashed", linewidth = 1.2, color = "red"
) +
scale_x_continuous(labels = comma_format()) + # avoid scientific notation
labs(
x = "SalePrice",
y = "Count",
title = "SalePrice Distribution",
caption = "Histogram of SalePrice"
) +
theme_minimal(base_size = 14)
# Calculate skewness and kurtosis
skewness_value <- skewness(train$SalePrice, na.rm = TRUE)
kurtosis_value <- kurtosis(train$SalePrice, na.rm = TRUE)
# Print skewness and kurtosis
cat("Skewness of SalePrice:", skewness_value, "\n")
cat("Kurtosis of SalePrice:", kurtosis_value, "\n")
Skewness of SalePrice: 1.879009
Kurtosis of SalePrice: 6.496789
From the histogram and the calculated skewness (1.88), we can see that the distribution of SalePrice is right-skewed. This may hurt model training, so we apply a log transformation to SalePrice.
# Log transformation of SalePrice
train <- train %>%
mutate(SalePrice_log = log(SalePrice))
# Plot the distribution of log-transformed SalePrice
ggplot(train, aes(x = SalePrice_log)) +
geom_histogram(aes(fill = after_stat(count)), bins = 30, color = "black", alpha = 0.8) +
scale_fill_viridis_c(name = "Log SalePrice", option = "plasma") +
geom_vline(
aes(xintercept = mean(SalePrice_log, na.rm = TRUE)),
linetype = "dashed", linewidth = 1.2, color = "red"
) +
scale_x_continuous(labels = comma_format()) + # avoid scientific notation
labs(
x = "Log(SalePrice)",
y = "Count",
title = "Log-Transformed SalePrice Distribution",
caption = "Histogram of Log-Transformed SalePrice"
) +
theme_minimal(base_size = 14)
# Calculate skewness and kurtosis for log-transformed SalePrice
skewness_log <- skewness(train$SalePrice_log, na.rm = TRUE)
kurtosis_log <- kurtosis(train$SalePrice_log, na.rm = TRUE)
# Print skewness and kurtosis
cat("Log-Transformed SalePrice_log Skewness:", skewness_log, "\n")
cat("Log-Transformed SalePrice_log Kurtosis:", kurtosis_log, "\n")
Log-Transformed SalePrice_log Skewness: 0.1210859
Log-Transformed SalePrice_log Kurtosis: 0.7974482
From the charts and statistics, we can see that after the log transformation, the distribution of SalePrice is much closer to a normal distribution.
# Remove the original SalePrice column
train <- train %>% select(-SalePrice)
The following is the distribution of the remaining numerical features:
options(repr.plot.width = 15, repr.plot.height = 15)
# Convert train dataset to long format and remove NA/Inf values
train %>%
select(all_of(numerical_features)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
filter(!is.na(Value) & is.finite(Value)) %>% # Remove NA and Inf values
# Plot histograms for all numerical features
ggplot(aes(x = Value)) +
geom_histogram(bins = 10, fill = "blue", color = "black", alpha = 0.7) +
facet_wrap(~ Feature, scales = "free") + # Free scales to adjust for different ranges
labs(title = "Distribution of Numerical Features", x = "Value", y = "Count") +
theme_minimal(base_size = 14)
Feature Engineering#
# Function to add features
add_features <- function(data, create_interactions = TRUE, create_base_features = TRUE) {
# Create a new dataframe to store features
new_features <- data.frame(row.names = rownames(data))
# Base features
if (create_base_features) {
new_features$HouseAge <- data$YrSold - data$YearBuilt
new_features$RemodelAge <- data$YrSold - data$YearRemodAdd
new_features$TotalSF <- data$`1stFlrSF` + data$`2ndFlrSF` + data$TotalBsmtSF
}
# Merge new features into the original dataset
data <- cbind(data, new_features)
return(data)
}
# Apply feature engineering to train and test datasets
train <- add_features(train)
test <- add_features(test)
# Define new features
new_features <- c("HouseAge", "RemodelAge", "TotalSF")
# Update numerical_features list
numerical_features <- c(numerical_features, new_features)
options(repr.plot.width = 12, repr.plot.height = 4)
# Plot histograms for new features
train %>%
select(all_of(new_features)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 10, fill = "blue", color = "black", alpha = 0.7) +
facet_wrap(~ Feature, scales = "free") +
labs(title = "Distribution of Newly Created Features", x = "Value", y = "Count") +
theme_minimal(base_size = 14)
Preprocessing#
Handling Missing Values#
# Function to find columns with NA
check_na <- function(data) {
na_count <- colSums(is.na(data))
na_count <- sort(na_count[na_count > 0], decreasing = TRUE) # Only keep columns with NA
return(as.data.frame(na_count))
}
# Check missing values in train and test
na_train <- check_na(train)
na_test <- check_na(test)
# Print missing value summary
cat("Missing Values in Train Dataset:\n")
print(na_train)
cat("\nMissing Values in Test Dataset:\n")
print(na_test)
Missing Values in Train Dataset:
na_count
LotFrontage 259
GarageYrBlt 81
MasVnrType 8
MasVnrArea 8
Electrical 1
Missing Values in Test Dataset:
na_count
LotFrontage 227
GarageYrBlt 78
MasVnrType 16
MasVnrArea 15
MSZoning 4
Utilities 2
BsmtFullBath 2
BsmtHalfBath 2
Functional 2
Exterior1st 1
Exterior2nd 1
BsmtFinSF1 1
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
KitchenQual 1
GarageCars 1
GarageArea 1
SaleType 1
TotalSF 1
NA for these variables means the structure does not exist, so they should be filled with 0:
GarageYrBlt (year the garage was built)
MasVnrArea (masonry veneer area)
BsmtFinSF1, BsmtFinSF2, BsmtUnfSF (basement areas)
TotalBsmtSF (total basement area)
GarageCars (number of garage spaces)
GarageArea (garage area)
TotalSF (total area, usually equal to 1stFlrSF + 2ndFlrSF + TotalBsmtSF)
zero_fill_features <- c("GarageYrBlt", "MasVnrArea", "BsmtFinSF1",
"BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF",
"GarageCars", "GarageArea", "TotalSF")
train[zero_fill_features] <- lapply(train[zero_fill_features], function(x) replace(x, is.na(x), 0))
test[zero_fill_features] <- lapply(test[zero_fill_features], function(x) replace(x, is.na(x), 0))
NA for these variables likely reflects missing data entries, so they should be filled with the most common category (mode):
MasVnrType (masonry veneer type)
Electrical (electrical system)
MSZoning (zoning classification)
Utilities (public utilities)
Functional (home functionality)
Exterior1st, Exterior2nd (exterior wall material)
KitchenQual (kitchen quality)
SaleType (sale type)
mode_fill_features <- c("MasVnrType", "Electrical", "MSZoning",
"Utilities", "Functional", "Exterior1st",
"Exterior2nd", "KitchenQual", "SaleType")
fill_mode <- function(x) {
mode_value <- names(sort(table(x), decreasing = TRUE))[1] # get mode
replace(x, is.na(x), mode_value)
}
train[mode_fill_features] <- lapply(train[mode_fill_features], fill_mode)
test[mode_fill_features] <- lapply(test[mode_fill_features], fill_mode)
NA values for these variables are likely true missing values; filling them with the median (per neighborhood, as below) reduces the impact of outliers:
LotFrontage (linear feet of street connected to the property)
train <- train %>%
group_by(Neighborhood) %>%
mutate(LotFrontage = ifelse(is.na(LotFrontage), median(LotFrontage, na.rm = TRUE), LotFrontage)) %>%
ungroup()
test <- test %>%
group_by(Neighborhood) %>%
mutate(LotFrontage = ifelse(is.na(LotFrontage), median(LotFrontage, na.rm = TRUE), LotFrontage)) %>%
ungroup()
Filled with 0 (integer features):
BsmtFullBath, BsmtHalfBath (basement bathrooms)
If there is no basement, the number of bathrooms should be 0.
bath_fill_features <- c("BsmtFullBath", "BsmtHalfBath")
train[bath_fill_features] <- lapply(train[bath_fill_features], function(x) replace(x, is.na(x), 0))
test[bath_fill_features] <- lapply(test[bath_fill_features], function(x) replace(x, is.na(x), 0))
Recheck NA
na_train_after <- check_na(train)
na_test_after <- check_na(test)
cat("After processing, missing values in Train Dataset:\n")
print(na_train_after)
cat("\nAfter processing, missing values in Test Dataset:\n")
print(na_test_after)
After processing, missing values in Train Dataset:
[1] na_count
<0 rows> (or 0-length row.names)
After processing, missing values in Test Dataset:
[1] na_count
<0 rows> (or 0-length row.names)
Encoding#
Label Encoding#
Although we previously converted all chr features to fct, when a feature is converted to a factor in R the levels are assigned in lexicographic (alphabetical/numeric) order unless specified otherwise. For example, the default level order of factor(c("Good", "Bad", "Excellent")) is Bad → Excellent → Good (alphabetical). If a categorical feature represents a ranking (e.g., Ex > Gd > TA > Fa > Po), incorrect ordering will mislead models that rely on factor levels (e.g., linear regression, decision trees). Therefore, to be on the safe side, we perform manual Label Encoding here to confirm that our preprocessing of ordinal_features is correct.
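A small illustration (toy vector, not a dataset column) of the default ordering and how explicit levels fix it:
# Default: levels are sorted alphabetically, which scrambles the intended ranking
levels(factor(c("Good", "Bad", "Excellent")))
# "Bad" "Excellent" "Good"

# Supplying explicit levels restores the intended order (worst to best)
levels(factor(c("Good", "Bad", "Excellent"),
              levels = c("Bad", "Good", "Excellent")))
# "Bad" "Good" "Excellent"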
factor_levels <- list(
# Basement Quality (higher is better)
BsmtQual = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
# Basement Condition (higher is better)
BsmtCond = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
# Basement Exposure (higher means more exposure to outside)
BsmtExposure = c("None", "No", "Mn", "Av", "Gd"),
# Basement Finishing Type (higher is more finished)
BsmtFinType1 = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
BsmtFinType2 = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"),
# Heating Quality (higher is better)
HeatingQC = c("Po", "Fa", "TA", "Gd", "Ex"),
# Kitchen Quality (higher is better)
KitchenQual = c("Po", "Fa", "TA", "Gd", "Ex"),
# Functional Condition of House
Functional = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"),
# Fireplace Quality (higher is better)
FireplaceQu = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
# Garage Finish (higher is more finished)
GarageFinish = c("None", "Unf", "RFn", "Fin"),
# Garage Quality (higher is better)
GarageQual = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
# Garage Condition (higher is better)
GarageCond = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
# Paved Driveway (Y = Paved, P = Partial, N = Dirt/Gravel)
PavedDrive = c("N", "P", "Y"),
# Pool Quality (higher is better)
PoolQC = c("None", "Fa", "TA", "Gd", "Ex"),
# Fence Quality (higher means better privacy/security)
Fence = c("None", "MnWw", "GdWo", "MnPrv", "GdPrv"),
# Sale Condition (Normal is typical)
SaleCondition = c("AdjLand", "Alloca", "Family", "Normal", "Abnorml", "Partial"),
# Lot Shape (Regular is the best)
LotShape = c("IR3", "IR2", "IR1", "Reg"),
# Land Slope (Gtl = gentle slope, best)
LandSlope = c("Sev", "Mod", "Gtl"),
# MSSubClass (Type of dwelling, as a categorical feature)
MSSubClass = c("20", "30", "40", "45", "50", "60", "70", "75", "80", "85",
"90", "120", "150", "160", "180", "190"),
# Overall Quality (numeric but treated as an ordinal feature)
OverallQual = as.character(1:10), # 1 (Worst) to 10 (Best)
# Overall Condition (numeric but treated as an ordinal feature)
OverallCond = as.character(1:10) # 1 (Poor) to 10 (Excellent)
)
for (feature in names(factor_levels)) {
train[[feature]] <- factor(train[[feature]], levels = factor_levels[[feature]])
test[[feature]] <- factor(test[[feature]], levels = factor_levels[[feature]])
}
Tree-based models such as Random Forest (RF), XGBoost, LightGBM, and CatBoost can work with the factor-encoded categorical data directly, and unnecessary preprocessing (such as one-hot encoding) can sometimes hurt their performance. We therefore keep a copy of the data at this stage for the tree models.
tree_train <- as.data.frame(train)
tree_test <- as.data.frame(test)
One-Hot Encoding#
# Ensure test dataset has the same factor levels as train
for (col in categorical_features) {
if (col %in% names(test)) {
test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
}
# Apply One-Hot Encoding using predefined categorical_features
ohe <- dummyVars(~ ., data = train[, categorical_features])
# Apply transformation to train and test sets
train_ohe <- predict(ohe, train)
test_ohe <- predict(ohe, test)
# Merge transformed categorical features with numerical features
train <- cbind(train[, numerical_features], train_ohe, SalePrice_log = train$SalePrice_log)
test <- cbind(test[, numerical_features], test_ohe)
# Ensure all variables are numeric
str(train)
'data.frame': 1460 obs. of 344 variables:
$ LotFrontage : num 65 80 68 60 84 85 75 80 51 50 ...
$ LotArea : num 8450 9600 11250 9550 14260 ...
$ YearBuilt : num 2003 1976 2001 1915 2000 ...
$ YearRemodAdd : num 2003 1976 2002 1970 2000 ...
$ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
$ BsmtFinSF1 : num 706 978 486 216 655 ...
$ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : num 856 1262 920 756 1145 ...
$ 1stFlrSF : num 856 1262 920 961 1145 ...
$ 2ndFlrSF : num 854 0 866 756 1053 ...
$ LowQualFinSF : num 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : num 1710 1262 1786 1717 2198 ...
$ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : num 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : num 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : num 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : num 1 1 1 1 1 1 1 1 2 2 ...
$ TotRmsAbvGrd : num 8 6 6 7 9 5 7 7 8 5 ...
$ Fireplaces : num 0 1 1 1 1 0 1 2 2 2 ...
$ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
$ GarageCars : num 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
$ WoodDeckSF : num 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : num 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch : num 0 0 0 272 0 0 0 228 205 0 ...
$ 3SsnPorch : num 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : num 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : num 0 0 0 0 0 0 0 0 0 0 ...
$ MiscVal : num 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : num 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : num 2008 2007 2008 2006 2008 ...
$ HouseAge : num 5 31 7 91 8 16 3 36 77 69 ...
$ RemodelAge : num 5 31 6 36 8 14 2 36 58 58 ...
$ TotalSF : num 2566 2524 2706 2473 3343 ...
$ MSSubClass.20 : num 0 1 0 0 0 0 1 0 0 0 ...
$ MSSubClass.30 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.40 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.45 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.50 : num 0 0 0 0 0 1 0 0 1 0 ...
$ MSSubClass.60 : num 1 0 1 0 1 0 0 1 0 0 ...
$ MSSubClass.70 : num 0 0 0 1 0 0 0 0 0 0 ...
$ MSSubClass.75 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.80 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.85 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.90 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.120 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.150 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.160 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.180 : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSSubClass.190 : num 0 0 0 0 0 0 0 0 0 1 ...
$ MSZoning.C (all) : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSZoning.FV : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSZoning.RH : num 0 0 0 0 0 0 0 0 0 0 ...
$ MSZoning.RL : num 1 1 1 1 1 1 1 1 0 1 ...
$ MSZoning.RM : num 0 0 0 0 0 0 0 0 1 0 ...
$ Street.Grvl : num 0 0 0 0 0 0 0 0 0 0 ...
$ Street.Pave : num 1 1 1 1 1 1 1 1 1 1 ...
$ Alley.None : num 1 1 1 1 1 1 1 1 1 1 ...
$ Alley.Grvl : num 0 0 0 0 0 0 0 0 0 0 ...
$ Alley.Pave : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotShape.IR3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotShape.IR2 : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotShape.IR1 : num 0 0 1 1 1 1 0 1 0 0 ...
$ LotShape.Reg : num 1 1 0 0 0 0 1 0 1 1 ...
$ LandContour.Bnk : num 0 0 0 0 0 0 0 0 0 0 ...
$ LandContour.HLS : num 0 0 0 0 0 0 0 0 0 0 ...
$ LandContour.Low : num 0 0 0 0 0 0 0 0 0 0 ...
$ LandContour.Lvl : num 1 1 1 1 1 1 1 1 1 1 ...
$ Utilities.AllPub : num 1 1 1 1 1 1 1 1 1 1 ...
$ Utilities.NoSeWa : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotConfig.Corner : num 0 0 0 1 0 0 0 1 0 1 ...
$ LotConfig.CulDSac : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotConfig.FR2 : num 0 1 0 0 1 0 0 0 0 0 ...
$ LotConfig.FR3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ LotConfig.Inside : num 1 0 1 0 0 1 1 0 1 0 ...
$ LandSlope.Sev : num 0 0 0 0 0 0 0 0 0 0 ...
$ LandSlope.Mod : num 0 0 0 0 0 0 0 0 0 0 ...
$ LandSlope.Gtl : num 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood.Blmngtn : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.Blueste : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.BrDale : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.BrkSide : num 0 0 0 0 0 0 0 0 0 1 ...
$ Neighborhood.ClearCr : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.CollgCr : num 1 0 1 0 0 0 0 0 0 0 ...
$ Neighborhood.Crawfor : num 0 0 0 1 0 0 0 0 0 0 ...
$ Neighborhood.Edwards : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.Gilbert : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.IDOTRR : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.MeadowV : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.Mitchel : num 0 0 0 0 0 1 0 0 0 0 ...
$ Neighborhood.NAmes : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.NoRidge : num 0 0 0 0 1 0 0 0 0 0 ...
$ Neighborhood.NPkVill : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.NridgHt : num 0 0 0 0 0 0 0 0 0 0 ...
$ Neighborhood.NWAmes : num 0 0 0 0 0 0 0 1 0 0 ...
$ Neighborhood.OldTown : num 0 0 0 0 0 0 0 0 1 0 ...
$ Neighborhood.Sawyer : num 0 0 0 0 0 0 0 0 0 0 ...
[list output truncated]
# Find columns that contain only zeros
zero_vars <- names(train)[sapply(train, function(col) all(col == 0))]
# Print all columns that are all zeros
print(zero_vars)
[1] "MSSubClass.150" "BsmtQual.Po" "BsmtCond.Ex" "KitchenQual.Po"
[5] "Functional.Sal" "PoolQC.TA" "OverallCond.10"
# If there are all zero columns, remove them
if (length(zero_vars) > 0) {
train <- train %>% select(-all_of(zero_vars))
test <- test %>% select(-all_of(zero_vars)) # Make sure test is also synchronized
}
Normalize Data (Standard Scaling)#
# Apply Standard Scaling (Z-score normalization)
scaler <- preProcess(train[, numerical_features], method = c("center", "scale"))
# Normalize train and test datasets
train[, numerical_features] <- predict(scaler, train[, numerical_features])
test[, numerical_features] <- predict(scaler, test[, numerical_features])
# Check summary to confirm scaling
summary(train[, numerical_features])
LotFrontage LotArea YearBuilt YearRemodAdd
Min. :-2.193290 Min. :-0.9234 Min. :-3.28670 Min. :-1.6888
1st Qu.:-0.454694 1st Qu.:-0.2969 1st Qu.:-0.57173 1st Qu.:-0.8654
Median :-0.008901 Median :-0.1040 Median : 0.05735 Median : 0.4424
Mean : 0.000000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.436893 3rd Qu.: 0.1087 3rd Qu.: 0.95131 3rd Qu.: 0.9268
Max. :10.823886 Max. :20.5112 Max. : 1.28240 Max. : 1.2174
MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
Min. :-0.5706 Min. :-0.9727 Min. :-0.2886 Min. :-1.2837
1st Qu.:-0.5706 1st Qu.:-0.9727 1st Qu.:-0.2886 1st Qu.:-0.7791
Median :-0.5706 Median :-0.1319 Median :-0.2886 Median :-0.2031
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.3383 3rd Qu.: 0.5889 3rd Qu.:-0.2886 3rd Qu.: 0.5449
Max. : 8.2824 Max. :11.4018 Max. : 8.8486 Max. : 4.0029
TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF
Min. :-2.4103 Min. :-2.1434 Min. :-0.7949 Min. :-0.1202
1st Qu.:-0.5965 1st Qu.:-0.7259 1st Qu.:-0.7949 1st Qu.:-0.1202
Median :-0.1503 Median :-0.1956 Median :-0.7949 Median :-0.1202
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5489 3rd Qu.: 0.5914 3rd Qu.: 0.8728 3rd Qu.:-0.1202
Max. :11.5170 Max. : 9.1296 Max. : 3.9356 Max. :11.6438
GrLivArea BsmtFullBath BsmtHalfBath FullBath
Min. :-2.24835 Min. :-0.8197 Min. :-0.241 Min. :-2.8408
1st Qu.:-0.73450 1st Qu.:-0.8197 1st Qu.:-0.241 1st Qu.:-1.0257
Median :-0.09794 Median :-0.8197 Median :-0.241 Median : 0.7895
Mean : 0.00000 Mean : 0.0000 Mean : 0.000 Mean : 0.0000
3rd Qu.: 0.49723 3rd Qu.: 1.1074 3rd Qu.:-0.241 3rd Qu.: 0.7895
Max. : 7.85288 Max. : 4.9617 Max. : 8.136 Max. : 2.6046
HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
Min. :-0.7614 Min. :-3.5137 Min. :-4.7499 Min. :-2.7795
1st Qu.:-0.7614 1st Qu.:-1.0621 1st Qu.:-0.2114 1st Qu.:-0.9338
Median :-0.7614 Median : 0.1637 Median :-0.2114 Median :-0.3186
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.2272 3rd Qu.: 0.1637 3rd Qu.:-0.2114 3rd Qu.: 0.2967
Max. : 3.2157 Max. : 6.2928 Max. : 8.8656 Max. : 4.6033
Fireplaces GarageYrBlt GarageCars GarageArea
Min. :-0.9509 Min. :-4.1189 Min. :-2.3646 Min. :-2.21220
1st Qu.:-0.9509 1st Qu.: 0.1967 1st Qu.:-1.0265 1st Qu.:-0.64769
Median : 0.6003 Median : 0.2386 Median : 0.3116 Median : 0.03283
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.6003 3rd Qu.: 0.2915 3rd Qu.: 0.3116 3rd Qu.: 0.48184
Max. : 3.7027 Max. : 0.3114 Max. : 2.9879 Max. : 4.42001
WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
Min. :-0.7519 Min. :-0.7042 Min. :-0.3592 Min. :-0.1163
1st Qu.:-0.7519 1st Qu.:-0.7042 1st Qu.:-0.3592 1st Qu.:-0.1163
Median :-0.7519 Median :-0.3269 Median :-0.3592 Median :-0.1163
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5884 3rd Qu.: 0.3221 3rd Qu.:-0.3592 3rd Qu.:-0.1163
Max. : 6.0855 Max. : 7.5516 Max. : 8.6723 Max. :17.2113
ScreenPorch PoolArea MiscVal MoSold
Min. :-0.2701 Min. :-0.06867 Min. :-0.08766 Min. :-1.9684
1st Qu.:-0.2701 1st Qu.:-0.06867 1st Qu.:-0.08766 1st Qu.:-0.4889
Median :-0.2701 Median :-0.06867 Median :-0.08766 Median :-0.1191
Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
3rd Qu.:-0.2701 3rd Qu.:-0.06867 3rd Qu.:-0.08766 3rd Qu.: 0.6207
Max. : 8.3386 Max. :18.29991 Max. :31.15459 Max. : 2.1002
YrSold HouseAge RemodelAge TotalSF
Min. :-1.3672 Min. :-1.20819 Min. :-1.1603 Min. :-2.7175
1st Qu.:-0.6142 1st Qu.:-0.94373 1st Qu.:-0.9181 1st Qu.:-0.6785
Median : 0.1387 Median :-0.05117 Median :-0.4336 Median :-0.1132
Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.8917 3rd Qu.: 0.57693 3rd Qu.: 0.8745 3rd Qu.: 0.5318
Max. : 1.6446 Max. : 3.28765 Max. : 1.7950 Max. :11.1778
Modeling#
kNN#
# Define training control
control <- trainControl(method = "cv", number = 5)
# Train kNN model (search for best k over even values from 2 to 20)
set.seed(123)
knn_model <- train(SalePrice_log ~ .,
data = train,
method = "knn",
trControl = control,
tuneGrid = expand.grid(k = seq(2, 20, by = 2)))
knn_model
k-Nearest Neighbors
1460 samples
336 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
k RMSE Rsquared MAE
2 0.1845686 0.7900069 0.1276179
4 0.1792032 0.8048398 0.1211112
6 0.1758587 0.8133593 0.1193890
8 0.1755424 0.8161537 0.1199109
10 0.1743246 0.8207036 0.1192279
12 0.1732819 0.8249960 0.1188608
14 0.1733588 0.8257621 0.1195074
16 0.1735761 0.8261160 0.1197313
18 0.1740578 0.8256934 0.1201630
20 0.1741199 0.8273405 0.1204882
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 12.
options(repr.plot.width = 12, repr.plot.height = 8)
plot(knn_model,
cex = 1.25,
lwd = 1.75,
pch = 16)
knn_predictions <- predict(knn_model, newdata = test)
knn_predictions <- exp(knn_predictions) # revert log-transform
knn_submission <- data.frame(
Id = Id,
SalePrice = knn_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "knn(r).csv")
write.csv(knn_submission, save_dir, row.names = FALSE)
SVM#
# Train SVM model with radial basis function (RBF) kernel
set.seed(123)
svm_model <- train(SalePrice_log ~ .,
data = train,
method = "svmRadial",
trControl = control,
tuneLength = 10, # Automatically searches for best parameters
scale = FALSE)
Here we set scale = FALSE because we already standardized numerical_features in the preprocessing stage. Turning off the built-in scaling also keeps the SVM from rescaling the 0/1 indicator columns generated by one-hot encoding.
svm_model
Support Vector Machines with Radial Basis Function Kernel
1460 samples
336 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 0.1400580 0.8793136 0.09348299
0.50 0.1332413 0.8900571 0.08837307
1.00 0.1290501 0.8962326 0.08551372
2.00 0.1262895 0.9005448 0.08413747
4.00 0.1250289 0.9023650 0.08437091
8.00 0.1267093 0.8999574 0.08680582
16.00 0.1295349 0.8962134 0.08979895
32.00 0.1332403 0.8906846 0.09331982
64.00 0.1367442 0.8850155 0.09608184
128.00 0.1374458 0.8838847 0.09664535
Tuning parameter 'sigma' was held constant at a value of 0.002345221
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.002345221 and C = 4.
svm_predictions <- predict(svm_model, newdata = test)
svm_predictions <- exp(svm_predictions) # revert log-transform
svm_submission <- data.frame(
Id = Id,
SalePrice = svm_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "svm(r).csv")
write.csv(svm_submission, save_dir, row.names = FALSE)
Linear Regression#
# Train Linear Regression model
set.seed(123)
lr_model <- train(SalePrice_log ~ .,
data = train,
method = "lm",
trControl = control)
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
lr_model
Linear Regression
1460 samples
336 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results:
RMSE Rsquared MAE
0.1659552 0.8347659 0.09375622
Tuning parameter 'intercept' was held constant at a value of TRUE
lr_predictions <- predict(lr_model, newdata = test)
lr_predictions <- exp(lr_predictions) # revert log-transform
lr_submission <- data.frame(
Id = Id,
SalePrice = lr_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "lr(r).csv")
write.csv(lr_submission, save_dir, row.names = FALSE)
Warning message in predict.lm(modelFit, newdata):
"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"
Lasso#
# Convert train and test to matrices for glmnet
x_train <- as.matrix(train %>% select(-SalePrice_log))
y_train <- train$SalePrice_log
x_test <- as.matrix(test)
# Train Lasso Regression (alpha = 1)
set.seed(123)
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_model
Call: cv.glmnet(x = x_train, y = y_train, alpha = 1)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.005178 45 0.02198 0.005656 105
1se 0.020902 30 0.02708 0.005587 39
lasso_predictions <- predict(lasso_model, newx = x_test, s = "lambda.min")
lasso_predictions <- exp(lasso_predictions) # revert log-transform
lasso_submission <- data.frame(
Id = Id,
SalePrice = as.vector(lasso_predictions) # avoid including column names
)
save_dir <- file.path("data", "house-prices", "processed(r)", "lasso(r).csv")
write.csv(lasso_submission, save_dir, row.names = FALSE)
Ridge#
# Train Ridge Regression (alpha = 0)
set.seed(123)
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_model
Call: cv.glmnet(x = x_train, y = y_train, alpha = 0)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.2403 78 0.02060 0.003955 336
1se 1.0648 62 0.02413 0.002847 336
ridge_predictions <- predict(ridge_model, newx = x_test, s = "lambda.min")
ridge_predictions <- exp(ridge_predictions) # revert log-transform
ridge_submission <- data.frame(
Id = Id,
SalePrice = as.vector(ridge_predictions) # avoid including column names
)
save_dir <- file.path("data", "house-prices", "processed(r)", "ridge(r).csv")
write.csv(ridge_submission, save_dir, row.names = FALSE)
ElasticNet#
# Train Elastic Net (alpha = 0.5)
set.seed(123)
elastic_model <- cv.glmnet(x_train, y_train, alpha = 0.5)
elastic_model
Call: cv.glmnet(x = x_train, y = y_train, alpha = 0.5)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.01136 44 0.02169 0.005466 108
1se 0.04180 30 0.02646 0.005162 45
elastic_predictions <- predict(elastic_model, newx = x_test, s = "lambda.min")
elastic_predictions <- exp(elastic_predictions) # revert log-transform
elastic_submission <- data.frame(
Id = Id,
SalePrice = as.vector(elastic_predictions) # avoid including column names
)
save_dir <- file.path("data", "house-prices", "processed(r)", "elastic(r).csv")
write.csv(elastic_submission, save_dir, row.names = FALSE)
Decision Tree#
# Train Decision Tree
set.seed(123)
dt_model <- train(
x = tree_train %>% select(-SalePrice_log),
y = tree_train$SalePrice_log,
method = "rpart",
trControl = control,
tuneGrid = expand.grid(cp = seq(0.0001, 0.0005, length = 10)) # More detailed search
)
dt_model
CART
1460 samples
82 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.0001000000 0.1776512 0.8045802 0.1284635
0.0001444444 0.1777805 0.8042541 0.1285073
0.0001888889 0.1779049 0.8039320 0.1285442
0.0002333333 0.1777713 0.8041841 0.1279156
0.0002777778 0.1779330 0.8037993 0.1282692
0.0003222222 0.1778216 0.8038857 0.1283249
0.0003666667 0.1777085 0.8041335 0.1280759
0.0004111111 0.1775049 0.8045104 0.1281117
0.0004555556 0.1775586 0.8043727 0.1283061
0.0005000000 0.1779052 0.8034988 0.1283916
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.0004111111.
dt_predictions <- predict(dt_model, newdata = tree_test)
dt_predictions <- exp(dt_predictions) # revert log-transform
dt_submission <- data.frame(
Id = Id,
SalePrice = dt_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "dt(r).csv")
write.csv(dt_submission, save_dir, row.names = FALSE)
Bagging#
# Train Bagging model (explicit feature selection)
set.seed(123)
bagging_model <- train(
x = tree_train %>% select(-SalePrice_log),
y = tree_train$SalePrice_log,
method = "treebag",
trControl = control
)
bagging_model
Bagged CART
1460 samples
82 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results:
RMSE Rsquared MAE
0.1731034 0.8162952 0.123942
bagging_predictions <- predict(bagging_model, newdata = tree_test)
bagging_predictions <- exp(bagging_predictions) # revert log-transform
bagging_submission <- data.frame(
Id = Id,
SalePrice = bagging_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "bagging(r).csv")
write.csv(bagging_submission, save_dir, row.names = FALSE)
Random Forest#
# Train Random Forest model
set.seed(123)
rf_model <- train(SalePrice_log ~ .,
data = tree_train,
method = "rf",
trControl = control,
tuneGrid = expand.grid(mtry = c(2, 4, 8, 16, 32, 64)),
ntree = 500) # Optimize mtry and use 500 trees
rf_model
Random Forest
1460 samples
82 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.2249974 0.8100716 0.15743714
4 0.1770039 0.8484972 0.11895918
8 0.1544689 0.8719228 0.10174286
16 0.1462462 0.8785325 0.09590145
32 0.1427482 0.8800158 0.09394998
64 0.1418422 0.8785970 0.09393652
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 64.
rf_predictions <- predict(rf_model, newdata = tree_test)
rf_predictions <- exp(rf_predictions) # revert log-transform
rf_submission <- data.frame(
Id = Id,
SalePrice = rf_predictions
)
save_dir <- file.path("data", "house-prices", "processed(r)", "rf(r).csv")
write.csv(rf_submission, save_dir, row.names = FALSE)
XGBoost#
# Define training control for cross-validation
xgb_control <- trainControl(
method = "cv", # Cross-validation
number = 5, # 5-fold CV
verboseIter = FALSE, # Don't show training progress
allowParallel = TRUE # Enable parallel processing
)
# Define the tuning grid for hyperparameters
xgb_grid <- expand.grid(
nrounds = 400, # Number of boosting rounds
eta = c(0.01, 0.05), # Learning rate
max_depth = c(2, 3, 4), # Tree depth
gamma = c(0, 0.1), # Minimum loss reduction
colsample_bytree = c(0.8, 1), # Feature sampling ratio
min_child_weight = c(1, 2), # Minimum sum of instance weight
subsample = c(0.8, 1) # Row sampling ratio
)
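As a quick sanity check, the grid above uses a single nrounds value, so the cross-validation evaluates 96 hyperparameter combinations (1 x 2 x 3 x 2 x 2 x 2 x 2):
# Number of hyperparameter combinations evaluated by the 5-fold CV
nrow(xgb_grid)   # 96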
# Train the XGBoost model
set.seed(123)
xgb_model <- train(
SalePrice_log ~ .,
data = tree_train,
method = "xgbTree",
trControl = xgb_control,
tuneGrid = xgb_grid,
  verbosity = 0,                  # Suppress xgboost training messages
nthread = 18, # Adjust based on CPU cores
metric = "RMSE",
maximize = FALSE
)
xgb_model
eXtreme Gradient Boosting
1460 samples
82 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
eta max_depth gamma colsample_bytree min_child_weight subsample
0.01 2 0.0 0.8 1 0.8
0.01 2 0.0 0.8 1 1.0
0.01 2 0.0 0.8 2 0.8
0.01 2 0.0 0.8 2 1.0
0.01 2 0.0 1.0 1 0.8
0.01 2 0.0 1.0 1 1.0
0.01 2 0.0 1.0 2 0.8
0.01 2 0.0 1.0 2 1.0
0.01 2 0.1 0.8 1 0.8
0.01 2 0.1 0.8 1 1.0
0.01 2 0.1 0.8 2 0.8
0.01 2 0.1 0.8 2 1.0
0.01 2 0.1 1.0 1 0.8
0.01 2 0.1 1.0 1 1.0
0.01 2 0.1 1.0 2 0.8
0.01 2 0.1 1.0 2 1.0
0.01 3 0.0 0.8 1 0.8
0.01 3 0.0 0.8 1 1.0
0.01 3 0.0 0.8 2 0.8
0.01 3 0.0 0.8 2 1.0
0.01 3 0.0 1.0 1 0.8
0.01 3 0.0 1.0 1 1.0
0.01 3 0.0 1.0 2 0.8
0.01 3 0.0 1.0 2 1.0
0.01 3 0.1 0.8 1 0.8
0.01 3 0.1 0.8 1 1.0
0.01 3 0.1 0.8 2 0.8
0.01 3 0.1 0.8 2 1.0
0.01 3 0.1 1.0 1 0.8
0.01 3 0.1 1.0 1 1.0
0.01 3 0.1 1.0 2 0.8
0.01 3 0.1 1.0 2 1.0
0.01 4 0.0 0.8 1 0.8
0.01 4 0.0 0.8 1 1.0
0.01 4 0.0 0.8 2 0.8
0.01 4 0.0 0.8 2 1.0
0.01 4 0.0 1.0 1 0.8
0.01 4 0.0 1.0 1 1.0
0.01 4 0.0 1.0 2 0.8
0.01 4 0.0 1.0 2 1.0
0.01 4 0.1 0.8 1 0.8
0.01 4 0.1 0.8 1 1.0
0.01 4 0.1 0.8 2 0.8
0.01 4 0.1 0.8 2 1.0
0.01 4 0.1 1.0 1 0.8
0.01 4 0.1 1.0 1 1.0
0.01 4 0.1 1.0 2 0.8
0.01 4 0.1 1.0 2 1.0
0.05 2 0.0 0.8 1 0.8
0.05 2 0.0 0.8 1 1.0
0.05 2 0.0 0.8 2 0.8
0.05 2 0.0 0.8 2 1.0
0.05 2 0.0 1.0 1 0.8
0.05 2 0.0 1.0 1 1.0
0.05 2 0.0 1.0 2 0.8
0.05 2 0.0 1.0 2 1.0
0.05 2 0.1 0.8 1 0.8
0.05 2 0.1 0.8 1 1.0
0.05 2 0.1 0.8 2 0.8
0.05 2 0.1 0.8 2 1.0
0.05 2 0.1 1.0 1 0.8
0.05 2 0.1 1.0 1 1.0
0.05 2 0.1 1.0 2 0.8
0.05 2 0.1 1.0 2 1.0
0.05 3 0.0 0.8 1 0.8
0.05 3 0.0 0.8 1 1.0
0.05 3 0.0 0.8 2 0.8
0.05 3 0.0 0.8 2 1.0
0.05 3 0.0 1.0 1 0.8
0.05 3 0.0 1.0 1 1.0
0.05 3 0.0 1.0 2 0.8
0.05 3 0.0 1.0 2 1.0
0.05 3 0.1 0.8 1 0.8
0.05 3 0.1 0.8 1 1.0
0.05 3 0.1 0.8 2 0.8
0.05 3 0.1 0.8 2 1.0
0.05 3 0.1 1.0 1 0.8
0.05 3 0.1 1.0 1 1.0
0.05 3 0.1 1.0 2 0.8
0.05 3 0.1 1.0 2 1.0
0.05 4 0.0 0.8 1 0.8
0.05 4 0.0 0.8 1 1.0
0.05 4 0.0 0.8 2 0.8
0.05 4 0.0 0.8 2 1.0
0.05 4 0.0 1.0 1 0.8
0.05 4 0.0 1.0 1 1.0
0.05 4 0.0 1.0 2 0.8
0.05 4 0.0 1.0 2 1.0
0.05 4 0.1 0.8 1 0.8
0.05 4 0.1 0.8 1 1.0
0.05 4 0.1 0.8 2 0.8
0.05 4 0.1 0.8 2 1.0
0.05 4 0.1 1.0 1 0.8
0.05 4 0.1 1.0 1 1.0
0.05 4 0.1 1.0 2 0.8
0.05 4 0.1 1.0 2 1.0
RMSE Rsquared MAE
0.2674426 0.8427061 0.23459442
0.2665025 0.8435777 0.23389561
0.2677552 0.8420158 0.23487203
0.2666581 0.8436973 0.23402377
0.2673682 0.8421636 0.23439542
0.2663729 0.8438931 0.23386450
0.2671335 0.8429491 0.23427391
0.2663047 0.8445997 0.23384919
0.2676488 0.8418476 0.23454578
0.2666718 0.8433375 0.23398229
0.2674658 0.8420859 0.23439216
0.2663977 0.8441113 0.23384652
0.2674369 0.8420214 0.23450497
0.2664955 0.8436680 0.23393618
0.2675489 0.8423303 0.23456927
0.2663613 0.8445533 0.23385743
0.2624078 0.8580467 0.23194906
0.2617033 0.8585858 0.23177669
0.2625388 0.8570844 0.23172941
0.2616960 0.8586496 0.23162674
0.2626995 0.8572430 0.23207201
0.2618082 0.8582050 0.23182066
0.2625111 0.8582894 0.23203975
0.2617648 0.8584586 0.23161683
0.2626191 0.8569038 0.23192935
0.2618372 0.8582525 0.23179081
0.2626499 0.8561484 0.23176740
0.2615355 0.8585311 0.23126732
0.2626204 0.8570590 0.23201030
0.2618337 0.8578479 0.23178995
0.2628234 0.8565929 0.23216312
0.2617571 0.8588351 0.23159877
0.2606377 0.8655160 0.23090431
0.2600518 0.8648643 0.23082648
0.2605597 0.8664500 0.23096934
0.2599090 0.8650757 0.23045348
0.2610646 0.8647910 0.23153417
0.2601972 0.8648333 0.23080260
0.2603953 0.8652052 0.23077552
0.2600749 0.8649822 0.23053299
0.2609384 0.8631836 0.23126733
0.2599903 0.8634732 0.23069477
0.2604561 0.8650215 0.23096952
0.2602902 0.8638525 0.23064996
0.2609973 0.8640400 0.23128405
0.2599727 0.8640398 0.23053968
0.2609179 0.8642933 0.23107640
0.2600315 0.8646526 0.23052310
0.1317881 0.8918497 0.08705925
0.1328718 0.8900443 0.08964776
0.1304375 0.8940294 0.08705810
0.1314973 0.8923925 0.08895901
0.1304682 0.8939120 0.08663787
0.1322679 0.8908917 0.08996039
0.1307059 0.8934723 0.08733899
0.1312586 0.8925804 0.08949422
0.1337954 0.8885581 0.08975669
0.1359362 0.8854774 0.09270269
0.1337471 0.8887607 0.08980617
0.1361690 0.8849986 0.09251612
0.1342147 0.8879668 0.08947439
0.1363596 0.8847613 0.09287078
0.1333915 0.8891499 0.08960885
0.1359700 0.8854499 0.09268589
0.1290224 0.8959769 0.08493457
0.1266440 0.8996855 0.08537632
0.1269220 0.8991915 0.08447927
0.1285390 0.8967898 0.08641749
0.1263128 0.9003197 0.08335503
0.1274231 0.8981546 0.08600511
0.1271342 0.8986386 0.08488650
0.1282467 0.8972557 0.08592856
0.1304566 0.8940208 0.08840227
0.1335931 0.8891449 0.09106910
0.1319498 0.8916188 0.08940726
0.1345240 0.8876467 0.09157519
0.1308307 0.8931915 0.08821965
0.1353757 0.8863153 0.09263494
0.1325931 0.8904917 0.08961553
0.1350417 0.8870217 0.09242410
0.1274684 0.8982670 0.08400865
0.1265582 0.8996254 0.08542955
0.1266545 0.8995481 0.08367745
0.1274532 0.8985573 0.08495362
0.1259326 0.9006082 0.08391668
0.1270658 0.8989115 0.08554302
0.1262197 0.9002912 0.08384277
0.1276361 0.8981920 0.08526380
0.1305978 0.8936195 0.08805463
0.1324574 0.8912089 0.09047717
0.1308349 0.8935777 0.08858110
0.1340192 0.8885791 0.09124193
0.1299655 0.8948461 0.08773780
0.1355903 0.8859485 0.09312126
0.1318776 0.8916703 0.08907883
0.1351936 0.8869130 0.09290608
Tuning parameter 'nrounds' was held constant at a value of 400
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 400, max_depth = 4, eta
= 0.05, gamma = 0, colsample_bytree = 1, min_child_weight = 1 and subsample
= 0.8.
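A quick look at which features drive the tuned booster; a sketch using caret's varImp, which wraps xgboost's gain-based importance:
# Top features by gain for the selected XGBoost model
xgb_imp <- varImp(xgb_model)
plot(xgb_imp, top = 20)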
# Predict final values using model
xgb_predictions <- predict(xgb_model, newdata = tree_test)
xgb_predictions <- exp(xgb_predictions) # revert log-transform
# Prepare submission file
xgb_submission <- data.frame(
Id = Id,
SalePrice = xgb_predictions
)
# Save submission file
save_dir <- file.path("data", "house-prices", "processed(r)", "xgb(r).csv")
write.csv(xgb_submission, save_dir, row.names = FALSE)
Stacking#
# Step 1: Get predictions from base models
xgb_pred_train <- predict(xgb_model, newdata = tree_train)
xgb_pred_test <- predict(xgb_model, newdata = tree_test)
# Get the best lambda value from cross-validation
ridge_lambda <- ridge_model$lambda.min # Alternatively, use ridge_model$lambda.1se
# Predict using the trained ridge regression model
ridge_pred_train <- predict(ridge_model$glmnet.fit, newx = x_train, s = ridge_lambda)
ridge_pred_test <- predict(ridge_model$glmnet.fit, newx = x_test, s = ridge_lambda)
svm_pred_train <- predict(svm_model, newdata = train)
svm_pred_test <- predict(svm_model, newdata = test)
# Step 2: Construct Stacking Dataset
stack_train <- data.frame(
XGB = xgb_pred_train,
Ridge = ridge_pred_train,
SVM = svm_pred_train,
SalePrice_log = y_train
)
stack_test <- data.frame(
XGB = xgb_pred_test,
Ridge = ridge_pred_test,
SVM = svm_pred_test
)
# Step 3: Train Meta-Learner (Using Ridge)
set.seed(123)
stack_model <- train(
SalePrice_log ~ .,
data = stack_train,
method = "glmnet", # Using Ridge as meta-learner
trControl = trainControl(method = "cv", number = 5),
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 0.1, length = 100)),
metric = "RMSE"
)
stack_model
glmnet
1460 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1169, 1169, 1167, 1168, 1167
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.000100000 0.06920248 0.9712447 0.05080970
0.001109091 0.06920248 0.9712447 0.05080970
0.002118182 0.06920248 0.9712447 0.05080970
0.003127273 0.06920248 0.9712447 0.05080970
0.004136364 0.06920248 0.9712447 0.05080970
0.005145455 0.06920248 0.9712447 0.05080970
0.006154545 0.06920248 0.9712447 0.05080970
0.007163636 0.06920248 0.9712447 0.05080970
0.008172727 0.06920248 0.9712447 0.05080970
0.009181818 0.06920248 0.9712447 0.05080970
0.010190909 0.06920248 0.9712447 0.05080970
0.011200000 0.06920248 0.9712447 0.05080970
0.012209091 0.06920248 0.9712447 0.05080970
0.013218182 0.06920248 0.9712447 0.05080970
0.014227273 0.06920248 0.9712447 0.05080970
0.015236364 0.06920248 0.9712447 0.05080970
0.016245455 0.06920248 0.9712447 0.05080970
0.017254545 0.06920248 0.9712447 0.05080970
0.018263636 0.06920248 0.9712447 0.05080970
0.019272727 0.06920248 0.9712447 0.05080970
0.020281818 0.06920248 0.9712447 0.05080970
0.021290909 0.06920248 0.9712447 0.05080970
0.022300000 0.06920248 0.9712447 0.05080970
0.023309091 0.06920248 0.9712447 0.05080970
0.024318182 0.06920248 0.9712447 0.05080970
0.025327273 0.06920248 0.9712447 0.05080970
0.026336364 0.06920248 0.9712447 0.05080970
0.027345455 0.06920248 0.9712447 0.05080970
0.028354545 0.06920248 0.9712447 0.05080970
0.029363636 0.06920248 0.9712447 0.05080970
0.030372727 0.06920248 0.9712447 0.05080970
0.031381818 0.06920248 0.9712447 0.05080970
0.032390909 0.06920248 0.9712447 0.05080970
0.033400000 0.06920248 0.9712447 0.05080970
0.034409091 0.06920248 0.9712447 0.05080970
0.035418182 0.06920248 0.9712447 0.05080970
0.036427273 0.06920248 0.9712447 0.05080970
0.037436364 0.06920248 0.9712447 0.05080970
0.038445455 0.06920248 0.9712447 0.05080970
0.039454545 0.06923313 0.9712287 0.05082905
0.040463636 0.06937357 0.9711465 0.05091653
0.041472727 0.06956622 0.9710309 0.05103645
0.042481818 0.06976022 0.9709146 0.05115911
0.043490909 0.06995150 0.9708012 0.05128140
0.044500000 0.07013157 0.9706984 0.05139659
0.045509091 0.07030972 0.9705980 0.05151029
0.046518182 0.07048914 0.9704972 0.05162472
0.047527273 0.07066635 0.9703987 0.05173732
0.048536364 0.07083584 0.9703078 0.05184512
0.049545455 0.07100077 0.9702217 0.05194948
0.050554545 0.07116689 0.9701353 0.05205444
0.051563636 0.07133350 0.9700491 0.05215954
0.052572727 0.07149832 0.9699652 0.05226330
0.053581818 0.07165521 0.9698892 0.05236164
0.054590909 0.07181169 0.9698143 0.05245916
0.055600000 0.07196931 0.9697392 0.05255705
0.056609091 0.07212720 0.9696645 0.05265457
0.057618182 0.07228372 0.9695917 0.05275052
0.058627273 0.07243319 0.9695259 0.05284224
0.059636364 0.07258188 0.9694616 0.05293430
0.060645455 0.07273166 0.9693971 0.05302806
0.061654545 0.07288252 0.9693324 0.05312241
0.062663636 0.07303221 0.9692694 0.05321531
0.063672727 0.07317972 0.9692089 0.05330636
0.064681818 0.07332201 0.9691539 0.05339481
0.065690909 0.07346507 0.9690990 0.05348354
0.066700000 0.07360915 0.9690439 0.05357251
0.067709091 0.07375427 0.9689886 0.05366209
0.068718182 0.07389835 0.9689349 0.05375123
0.069727273 0.07404163 0.9688826 0.05384009
0.070736364 0.07418000 0.9688354 0.05392562
0.071745455 0.07431875 0.9687887 0.05401078
0.072754545 0.07445848 0.9687419 0.05409597
0.073763636 0.07459919 0.9686949 0.05418213
0.074772727 0.07474020 0.9686483 0.05426877
0.075781818 0.07488067 0.9686028 0.05435484
0.076790909 0.07501922 0.9685597 0.05443918
0.077800000 0.07515565 0.9685192 0.05452271
0.078809091 0.07529295 0.9684786 0.05460752
0.079818182 0.07543118 0.9684380 0.05469333
0.080827273 0.07557034 0.9683972 0.05477985
0.081836364 0.07571018 0.9683565 0.05486657
0.082845455 0.07584907 0.9683172 0.05495237
0.083854545 0.07598771 0.9682788 0.05503829
0.084863636 0.07612216 0.9682446 0.05512334
0.085872727 0.07625636 0.9682114 0.05520811
0.086881818 0.07639144 0.9681781 0.05529287
0.087890909 0.07652739 0.9681446 0.05537764
0.088900000 0.07666422 0.9681111 0.05546289
0.089909091 0.07680158 0.9680778 0.05554891
0.090918182 0.07693860 0.9680453 0.05563521
0.091927273 0.07707571 0.9680133 0.05572193
0.092936364 0.07721019 0.9679839 0.05580700
0.093945455 0.07734451 0.9679553 0.05589165
0.094954545 0.07747966 0.9679266 0.05597654
0.095963636 0.07761563 0.9678979 0.05606194
0.096972727 0.07775242 0.9678691 0.05614964
0.097981818 0.07789003 0.9678402 0.05623880
0.098990909 0.07802758 0.9678119 0.05632857
0.100000000 0.07816523 0.9677842 0.05641875
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.03844545.
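The meta-learner's coefficients show how the three base models are blended; a sketch, reading them off the glmnet fit at the CV-selected lambda:
# Ridge meta-learner coefficients at the selected lambda
coef(stack_model$finalModel, s = stack_model$bestTune$lambda)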
# Step 4: Make final predictions
stack_predictions <- predict(stack_model$finalModel, newx = as.matrix(stack_test), s = stack_model$bestTune$lambda)
# Step 5: Convert predictions back to original scale
stack_predictions <- exp(stack_predictions) # revert log-transform (target was modeled as log(SalePrice))
# Prepare submission file
stack_submission <- data.frame(
Id = Id,
  SalePrice = as.vector(stack_predictions)  # drop the single-column matrix structure returned by predict()
)
# Save submission file
save_dir <- file.path("data", "house-prices", "processed(r)", "stack(r).csv")
write.csv(stack_submission, save_dir, row.names = FALSE)
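Finally, the cross-validated models can be compared side by side with caret's resamples(); a sketch assuming the four tree-based fits are still in memory and were trained with the same 5-fold scheme (each set the same seed before train):
# Compare CV RMSE / R-squared / MAE across the tree-based models
model_resamples <- resamples(list(
  DecisionTree = dt_model,
  Bagging      = bagging_model,
  RandomForest = rf_model,
  XGBoost      = xgb_model
))
summary(model_resamples)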
References#
Kaggle Competition: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
Dataset Description: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data