This method is based on Bayes' theorem, which is a rule in probability theory that helps us
update our beliefs when we get new information. In the context of regression, it means that
we start with some initial guess about the relationship between our input data and the
output, and then we update that guess using the actual data we have.
The formula for Bayesian Regression is a bit more complex than traditional regression
techniques, but at its core, it involves calculating a probability distribution over the possible
lines or relationships that could explain our data. This distribution helps us to understand
not only the most likely relationship but also how confident we are about it.
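Written out, Bayes' theorem for the regression parameters (call them β) takes the form

p(β | y, X) ∝ p(y | X, β) × p(β)

that is, the posterior distribution over the parameters is proportional to the likelihood of the observed data times our prior belief about β. The "initial guess" mentioned above is the prior p(β), and the data update it through the likelihood.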
y = Xβ + ε
where:
- y is the vector of observed target values,
- X is the matrix of input features,
- β is the vector of regression coefficients, and
- ε is the random error (noise) term.
The goal of Bayesian Regression is to estimate the most likely relationship between input
and output variables while quantifying the uncertainty about that relationship.
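As a minimal sketch of this idea, scikit-learn's BayesianRidge fits exactly this kind of model and can report both a prediction and its uncertainty. The toy data below (y roughly equal to 3x plus noise) is illustrative only:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Toy data: y is roughly 3x plus Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=50)

model = BayesianRidge()
model.fit(X, y)

# return_std=True returns the standard deviation of the predictive
# distribution -- the model's uncertainty about each prediction
y_mean, y_std = model.predict(np.array([[5.0]]), return_std=True)
print(y_mean[0], y_std[0])
```

The second return value is what distinguishes the Bayesian approach: the model tells you not only its best estimate but also how confident it is in it.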
The dataset consists of 20,640 instances, each representing a census block group in California.
There are eight features in the dataset:
- MedInc: median income in the block group
- HouseAge: median house age in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: latitude of the block group
- Longitude: longitude of the block group
The target, MedHouseVal, is the median house value, expressed in hundreds of thousands of dollars.
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
Out[2]: (first rows of the DataFrame, with columns MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal)
In [4]: california_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
In [5]: california_df.describe().T
Out[5]: (summary statistics per column: count, mean, std, min, 25%, 50%, 75%, max)
Missing values per column (output of california_df.isnull().sum()):
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseVal 0
dtype: int64
In [7]: california_df.duplicated().sum()
Out[7]: 0
In the plots we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.
Note: For learning purposes I have shown how to do MedHouseVal transformation for
skewness in my previous tutorial of Simple Linear Regression. Feel free to check that out!
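The plotting code below references california_df_scaled, whose construction is not shown above. A plausible reconstruction, assuming standard z-score scaling of the features (the choice of StandardScaler here is an assumption), is:

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

california = fetch_california_housing()
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseVal'] = california.target

# Scale only the feature columns; the target keeps its original units
scaler = StandardScaler()
scaled = scaler.fit_transform(california_df[california.feature_names])
california_df_scaled = pd.DataFrame(scaled, columns=california.feature_names)
california_df_scaled['MedHouseVal'] = california_df['MedHouseVal']
```

Scaling the features (but not the target) is consistent with how the two DataFrames are used in the scatter plots and in Step 4 below.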
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 4, figsize=(20, 10))
axs[0,0].scatter(california_df_scaled['Latitude'], california_df['MedHouseVal'])
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')
axs[0,1].scatter(california_df_scaled['Longitude'], california_df['MedHouseVal'])
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')
axs[0,2].scatter(california_df_scaled['HouseAge'], california_df['MedHouseVal'])
axs[0,2].set_xlabel('Housing Median Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('Housing Median Age vs Median House Value')
axs[0,3].scatter(california_df_scaled['AveRooms'], california_df['MedHouseVal'])
axs[0,3].set_xlabel('Average Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Average Rooms vs Median House Value')
axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df['MedHouseVal'])
axs[1,0].set_xlabel('Average Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Average Bedrooms vs Median House Value')
axs[1,1].scatter(california_df_scaled['Population'], california_df['MedHouseVal'])
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')
axs[1,2].scatter(california_df_scaled['AveOccup'], california_df['MedHouseVal'])
axs[1,2].set_xlabel('Average Occupancy')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Average Occupancy vs Median House Value')
axs[1,3].axis('off')  # eighth panel unused
plt.tight_layout()
plt.show()
Step 4: Define Dependent and Independent Variables
In [17]: X = california_df_scaled.drop(['MedHouseVal'],axis=1)
y = california_df['MedHouseVal'] # We don't scale the dependent variable
Step 5(a): Split the data into training and testing sets in the ratio 70:30
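A 70:30 split is typically done with train_test_split; the random_state value below is an assumed choice for reproducibility, and the stand-in arrays merely mimic the shapes of X and y from Step 4:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as X and y from Step 4
X = np.random.rand(20640, 8)
y = np.random.rand(20640)

# test_size=0.3 gives the 70:30 split; random_state=42 is an assumed value
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (14448, 8) (6192, 8)
```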
Step 5(b): Set the hyperparameters to be tuned and create the Bayesian
Ridge model:
Step 5(d): Train the Bayesian Ridge model with the best parameters:
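Step 5(d) plus the evaluation that follows can be sketched as below. The stand-in data and the "best" hyperparameters (BayesianRidge's defaults) are placeholders for the grid-search results, which are not reproduced here:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data; in the tutorial X_train/X_test come from Step 5(a)
rng = np.random.RandomState(0)
X_all = rng.rand(500, 8)
y_all = X_all @ rng.rand(8) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42)

# The best hyperparameters would come from the Step 5(b) grid search;
# BayesianRidge's defaults are used here as placeholders
best_model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6,
                           lambda_1=1e-6, lambda_2=1e-6)
best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.2f}")
```

On the real California data, this evaluation is what produces the RMSE and R-squared figures discussed next.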
The RMSE is 0.68. This value tells you that, on average, the model's predictions are off by about 0.68 units from the actual values; since MedHouseVal is expressed in hundreds of thousands of dollars, this corresponds to roughly $68,000.
Conclusion:
In conclusion, this Bayesian Regression tutorial showcased how to effectively model the
relationship between input features and the target variable, specifically the median house
value in California. We began by loading and preprocessing the dataset, followed by
visualizing the relationships between the variables using scatter plots.
We then split the data into training and testing sets and employed Bayesian Regression to
fit a model to the training data. Throughout the process, we fine-tuned the model's
hyperparameters to achieve an optimal balance between bias and variance. Subsequently,
we used the model to make predictions on the test data and evaluated its performance
using mean squared error and R-squared score.
The results, with an R^2 score of 0.65 and an RMSE of 0.68, indicate that the Bayesian
Regression model has captured a reasonable amount of the relationship between the input
features and the target variable, the median house value in California. While the
performance is decent, there is still room for improvement. Further refinements, such as
feature engineering, model selection, or hyperparameter tuning, could potentially enhance
the model's performance, enabling more accurate predictions of house values.