Feature Engineering: The Secret Weapon That Decoded California Housing Prices
The Problem with Raw Data
Accurate housing price prediction is the holy grail for real estate investors and market analysts. But here’s the secret: raw data is rarely good enough.
When I started my project to predict California housing prices, my initial models (the baseline) performed reasonably well. However, I knew the raw dataset was missing crucial, common-sense factors. For example, the data had total_rooms and households, but no measure of the average number of rooms per household, a factor that intuitively affects property value.
This led to my core research question: How much does intelligent feature engineering actually improve the performance of machine learning models in predicting California housing prices?
Experiment 1: The Baseline Challenge
My first step was establishing a baseline performance using common preprocessing (imputation and scaling) but no feature creation. I tested six standard regression models on the California Housing dataset:
| Model | Baseline RMSE (USD) |
|---|---|
| Random Forest | 53,747 |
| Gradient Boosting | 55,486 |
| K-Nearest Neighbors | 60,685 |
| Linear Regression | 67,354 |
| Decision Tree | 75,036 |
| SVR | 116,889 |
The Random Forest Regressor was the clear winner, with a Root Mean Squared Error (RMSE) of approximately $53,747; in rough terms, a typical prediction missed the true price by around that amount. A decent starting point, but a margin I aimed to shrink.
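For context, the baseline runs used a plain scikit-learn preprocessing pipeline. Here is a minimal sketch of that setup; the file path, column selection, and split are assumptions rather than my exact code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

housing = pd.read_csv("housing.csv")  # hypothetical path to the California Housing data

# Numeric predictors only, for simplicity; median_house_value is the target.
X = housing.drop(columns=["median_house_value"]).select_dtypes(include=[np.number])
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),    # fill missing values (e.g. total_bedrooms)
    ("scaler", StandardScaler()),                     # put features on a common scale
    ("model", RandomForestRegressor(random_state=42)),
])

# Cross-validated RMSE: sklearn reports negative MSE, so flip the sign and take the root.
scores = cross_val_score(baseline, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
print(f"Baseline RMSE: {np.sqrt(-scores).mean():,.0f}")
```

Swapping the final estimator in the same pipeline gives the other five baseline scores.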
Experiment 2: Creating Basic Ratio Features
Next, I focused on creating new, meaningful features by combining existing columns; these are ratios a human expert would instinctively check when evaluating a property (see the code sketch after this list):

- rooms_per_household: total_rooms / households (Size proxy)
- bedrooms_per_room: total_bedrooms / total_rooms (Quality/Density proxy)
- population_per_household: population / households (Density proxy)
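In pandas, these ratios are one-liners. A minimal sketch, assuming the raw columns are loaded in a DataFrame named housing (a hypothetical name):

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three ratio features appended."""
    df = df.copy()
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
    df["population_per_household"] = df["population"] / df["households"]
    return df

housing = add_ratio_features(housing)  # housing: the raw California Housing DataFrame
```

With these three columns appended, I re-ran the models: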
| Model | Baseline RMSE (USD) | Basic Features RMSE (USD) | Change |
|---|---|---|---|
| Random Forest | 53,747 | 53,253 | -0.92% |
| Gradient Boosting | 55,486 | 54,833 | -1.18% |
| K-Nearest Neighbors | 60,685 | 66,817 | +10.09% |
Adding these three ratio features slightly improved the RMSE of the top models, nudging the Random Forest down to $53,253 and the Gradient Boosting to $54,833. However, they noticeably worsened the K-Nearest Neighbors model, a reminder that not all feature engineering benefits all model types equally: distance-based learners like KNN are sensitive to extra, correlated columns, which likely distorted its notion of which houses are "similar".
Experiment 3: Advanced Spatial Features
To capture geography, which is paramount in real estate, I engineered a more advanced kind of feature: spatial proximity. I calculated the Haversine distance (the great-circle distance over the earth's surface) from each house to two major economic hubs, Los Angeles and San Francisco. The transform step of my custom feature transformer looked like this:
```python
def transform(self, X, y=None):
    X = X.copy()
    # Calculate new features
    X['rooms_per_household'] = X['total_rooms'] / X['households']
    X['bedrooms_per_room'] = X['total_bedrooms'] / X['total_rooms']
    X['population_per_household'] = X['population'] / X['households']
    # Calculate distance to each city
    for city, (lon, lat) in self.cities.items():
        X[f'distance_to_{city}'] = self.haversine_distance(X, lon, lat)
    return X
```
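The method above relies on a self.cities mapping and a haversine_distance helper that aren't shown in the snippet. Here is a minimal reconstruction of those pieces, with approximate city coordinates; treat it as a sketch, not my exact implementation:

```python
import numpy as np

# Approximate (longitude, latitude) of the two economic hubs
CITIES = {
    "los_angeles": (-118.24, 34.05),
    "san_francisco": (-122.42, 37.77),
}

def haversine_distance(X, lon, lat, earth_radius_km=6371.0):
    """Great-circle distance (km) from every row of X to the point (lon, lat).

    Expects X to carry 'longitude' and 'latitude' columns in degrees.
    """
    lon1, lat1 = np.radians(X["longitude"]), np.radians(X["latitude"])
    lon2, lat2 = np.radians(lon), np.radians(lat)
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a))
```

Inside the transformer, CITIES would be stored as self.cities and the helper attached as a method or static method.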
This new feature set included the three ratios from Experiment 2 plus the two new distance metrics.
| Model | Basic Features RMSE (USD) | Advanced Features RMSE (USD) | Change |
|---|---|---|---|
| Random Forest | 53,253 | 54,652 | +2.62% |
| Gradient Boosting | 54,833 | 54,396 | -0.79% |
Advanced Feature Insight
Interestingly, the Gradient Boosting Regressor took the lead in this phase, with a small but measurable drop to $54,396. The untuned Random Forest actually got worse, and the best untuned score overall remained the Random Forest on basic ratios ($53,253). Still, the fact that Gradient Boosting improved suggested the distance features carried real signal, a hunch the tuning phase would confirm.
The Final Leap: Hyperparameter Tuning
To unlock the full potential of feature engineering, I took the Random Forest Regressor, the model that performed best on every feature set once tuned, and fine-tuned its hyperparameters using GridSearchCV.
I performed this tuning across all three feature sets (Baseline, Basic, Advanced) to see which combination yielded the absolute lowest error.
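A minimal sketch of that search follows; the exact parameter grid is an assumption, chosen simply to bracket the winning settings shown in the table below, and X_train_prepared / y_train are placeholder names for the engineered feature matrix and target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 30],    # the winning value was 30 on every feature set
    "max_features": [4, 6, 8],   # the winning value was 8 (baseline/basic) or 6 (advanced)
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_train_prepared, y_train)  # placeholder names, see above

print(grid_search.best_params_)
print(np.sqrt(-grid_search.best_score_))  # cross-validated RMSE of the best candidate
```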
| Feature Set | Best Post-Tuning RMSE (USD) | Best Model Parameters | Total Improvement (vs. Baseline) |
|---|---|---|---|
| Baseline | 49,659 | max_features=8, n_estimators=30 | 7.6% |
| Basic Ratios | 49,763 | max_features=8, n_estimators=30 | 7.4% |
| Advanced (Ratios + Distance) | 47,231 | max_features=6, n_estimators=30 | 12.1% |
Key Takeaway
The combination of Advanced Feature Engineering (Ratios + Distance) and Hyperparameter Tuning delivered the absolute best result: an RMSE of $47,231. This represents a 12.1% reduction in prediction error compared to the initial baseline model.
Conclusion: Feature Engineering is Not Optional
This project makes a strong case that feature engineering is critical for building high-accuracy predictive models in real estate. Ratio-based features (like rooms per household) and spatial features (distance to major cities) gave the models meaningful context that the raw data lacked.
The Random Forest Regressor, combined with the engineered features and proper tuning, delivered the most accurate housing price predictions of the models tested, while remaining relatively easy to interrogate through feature importances. With a typical error of under $47,500, the final model offers a meaningful gain in reliability for informed decision-making.