Machine Learning (Part 13): Understanding Random Forests


Welcome back to our journey through the intricate realm of Machine Learning! In this chapter, we're delving into the dynamic and powerful world of Random Forests, an ensemble method that builds upon the principles of Decision Trees. Prepare to understand the theoretical foundations, witness the magic of ensemble learning, and implement Random Forests in Python for a hands-on experience.

If you missed the previous part, where we delved into Decision Trees, click here.

What are Random Forests?

Ensemble learning involves combining multiple models to enhance overall performance. A Random Forest, a popular ensemble method, constructs a multitude of Decision Trees and aggregates their predictions.

Components

  • Decision Trees: The fundamental building blocks, where each tree is constructed using a random subset of the training data and features.

  • Bootstrapping: Random Forests employ bootstrapping, a resampling technique, to create multiple subsets of the training data for training each tree.

  • Feature Randomization: At each split, only a random subset of features is considered, adding an extra layer of randomness (a minimal sketch of both ideas follows this list).
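
To make these two sources of randomness concrete, here is a minimal sketch using NumPy and an assumed toy dataset, showing how one tree's bootstrap sample and one random feature subset might be drawn. Scikit-learn handles all of this internally, so this is purely illustrative.

# Illustrative only: how one tree's bootstrap sample and feature subset could be drawn
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 150, 8                    # assumed toy dimensions
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Bootstrapping: sample rows with replacement (same size as the original data)
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_boot, y_boot = X[bootstrap_idx], y[bootstrap_idx]

# Feature randomization: pick a random subset of features to consider at a split
max_features = int(np.sqrt(n_features))           # a common default choice
feature_idx = rng.choice(n_features, size=max_features, replace=False)
X_split_candidates = X_boot[:, feature_idx]

print(X_boot.shape, feature_idx)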

Role of Decision Trees

  • Decision Trees as Building Blocks: Each tree in the Random Forest is crafted using a random subset of the training data and features.

  • Creating Diversity: By introducing randomness in the data and feature selection, each tree becomes a unique learner.

Random Forest Process

  1. Building Individual Trees:

    • For each tree, a distinct subset of data and features is employed.

    • The trees are constructed independently based on these subsets.

  2. Aggregating Predictions:

    • Predictions from all trees are collected during the aggregation phase.

    • For classification tasks, the majority vote determines the final prediction.

    • For regression tasks, the average of predictions is taken (a short sketch of both aggregation rules follows this list).
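
To illustrate the aggregation step, here is a minimal, self-contained sketch using hypothetical per-tree predictions: a majority vote for classification and a simple mean for regression.

# Illustrative aggregation of hypothetical per-tree predictions
from collections import Counter
import numpy as np

# Classification: each tree votes for a class label
tree_votes = ['cat', 'dog', 'cat', 'cat', 'dog']
majority_class = Counter(tree_votes).most_common(1)[0][0]
print(majority_class)            # 'cat'

# Regression: each tree predicts a number; the forest averages them
tree_predictions = np.array([310_000, 295_000, 305_000, 320_000])
forest_prediction = tree_predictions.mean()
print(forest_prediction)         # 307500.0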

Example: House Price Prediction

[Figure: Predicting house prices with a Random Forest regressor]
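
As a rough sketch of this idea, here is an entirely hypothetical toy example (three made-up features per house and invented prices) showing how a Random Forest regressor is fit and queried. The full, real-data walkthrough follows in the implementation section below.

# Hypothetical toy example: predicting a house price from a few made-up features
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns: [square metres, bedrooms, age in years] -- invented values
X_houses = np.array([
    [ 80, 2, 30],
    [120, 3, 10],
    [ 60, 1, 45],
    [150, 4,  5],
    [ 95, 2, 20],
    [110, 3, 15],
])
y_prices = np.array([250_000, 420_000, 180_000, 510_000, 300_000, 380_000])  # invented prices

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_houses, y_prices)

new_house = np.array([[100, 3, 12]])
print(forest.predict(new_house))   # averaged prediction across the 50 trees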

Advantages of Random Forests

  • Reducing Overfitting: The ensemble approach mitigates overfitting, a common challenge in complex models.

  • Handling High-Dimensional Data: Random Forests effectively handle datasets with a large number of features.

Disadvantages of Random Forests

  1. Complexity and Interpretability: Random Forests' ensemble nature can lead to a complex model, making it less interpretable.

  2. Resource Intensive: Training Random Forests can be computationally expensive, requiring more resources than simpler models.

Implementing Random Forests in Python

Let's implement a Random Forest in Python using the well-known California housing dataset, a regression task where the goal is to predict median house prices.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['PRICE'] = california_housing.target

# Prepare data
X = data.drop('PRICE', axis=1)
y = data['PRICE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Visualize predicted vs. actual prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual Prices vs Predicted Prices')
plt.show()

Output: the script prints the mean squared error and displays a scatter plot of actual versus predicted prices; the closer the points lie to the diagonal, the better the predictions.
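
Continuing from the script above, the fitted model also exposes a feature_importances_ attribute. The sketch below is one optional way to visualize which features drive the predictions, using the seaborn import from earlier.

# Optional follow-up: visualize which features the forest relies on most
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=importances.values, y=importances.index)
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importances')
plt.show()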

When to use Random Forests?

  1. Versatile Tasks: Random Forests excel in both classification and regression scenarios (a brief classification sketch follows this list).

  2. High-Dimensional Data: They effectively handle datasets with a large number of features.

  3. Robustness to Outliers: Random Forests exhibit resilience to outliers due to the randomness in data sampling.

  4. Mitigating Overfitting: The ensemble approach helps mitigate overfitting, making Random Forests suitable for complex datasets.
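
Since the walkthrough above used regression, here is a brief, self-contained classification sketch on scikit-learn's built-in iris dataset, where the forest's prediction is the majority vote of its trees.

# Brief classification sketch: majority voting on the iris dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

print(f'Accuracy: {accuracy_score(y_te, clf.predict(X_te)):.3f}')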

Conclusion

In this chapter, we embarked on a theoretical journey through the core concepts of Random Forests, unveiling the mechanisms of ensemble learning. Through practical implementation, we witnessed the robustness of Random Forests in predicting California house prices. As we continue our odyssey in Machine Learning, stay tuned for our next chapter, where we'll take a deep dive into Support Vector Machines.
