
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF ECONOMICS AND MANAGEMENT
----o0o----

REPORT – DATA SCIENCE FOR BUSINESS


TOPIC 15: HOUSE PRICES

Lecturer: Le Hai Ha

Student: Nguyen Thi Tu

Std.ID: 20192664
Table of Contents
Introduction
I. Introduction about project: House Prices – Advanced Regression Techniques
1.1 Topic description
1.2 About Linear regression
1.3 Practice Skills
II. Project solution
2.1 Importing libraries
2.2 Importing and reading data
2.3 Data exploration and visualization
2.3.1 Histogram and Normal probability plot
2.3.2 Relationships with numerical variables and categorical features
2.4 Data preprocessing
2.4.1 Separating target and features
2.4.2 Looking at NaN % within the data
2.4.3 Data cleaning with missing values
2.5 Feature engineering
2.6 Modeling
Conclusion
Introduction
Business data science is the study of data to extract insights that are meaningful to business
operations. It is a multidisciplinary approach that combines the principles and practices of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
volumes of data. This analysis helps data scientists ask and answer questions such as what
happened, why it happened, what will happen, and how the results can be used and for what
purpose.
After finishing the course, students know how to use several tools such as RapidMiner, Python,
and SQL; how to collect and process structured and unstructured data; and how to administer,
aggregate, and extract data, exploit information, and detect hidden trends and new knowledge in
data. Learning about this field now will bring promising results for learners. Investment in this
field is also an important orientation of modern society, indispensable for the development of
each country and each economy, and Vietnam will certainly not stand aside from this orientation.

I. Introduction about project: House Prices – Advanced Regression Techniques


1.1 Topic description
Ask a home buyer to describe their dream house, and they probably won't begin with the height
of the basement ceiling or the proximity to an east-west railroad. But this playground
competition's dataset proves that much more influences price negotiations than the number of
bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames,
Iowa, this topic challenges you to predict the final price of each home. The Ames Housing
dataset was compiled by Dean De Cock for use in data science education. It's an incredible
alternative for data scientists looking for a modernized and expanded version of the often-cited
Boston Housing dataset.
The topic's goal is to predict the sale price of each house: for each Id in the test set, I must
predict the value of the SalePrice variable.
1.2 About Linear regression
Linear regression attempts to model the relationship between two variables by fitting a linear
equation to observed data. One variable is considered to be an explanatory variable, and the
other is considered to be a dependent variable.
Linear-regression models are relatively simple and provide an easy-to-interpret mathematical
formula that can generate predictions. Linear regression can be applied to various areas in
business and academic study; it is used in everything from the biological, behavioral,
environmental, and social sciences to business. Linear-regression models have become a proven
way to make scientific and reliable predictions. Because linear regression is a long-established
statistical procedure, the properties of linear-regression models are well understood, and the
models can be trained very quickly.
Linear regression is a commonly used type of predictive analysis. The linear regression model
is primarily used for the following (a minimal fitting example follows this list):
 Determining how good the predictor variables are
 Forecasting values
 Forecasting trends
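As a quick illustration of the idea (not part of the report's own code), here is a minimal
scikit-learn sketch that fits a line to a handful of made-up points; the data values are
invented for demonstration only:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: one explanatory variable (e.g., living area) and one
    # dependent variable (e.g., price); the numbers are illustrative only
    X = np.array([[50], [80], [100], [120]])
    y = np.array([150, 210, 260, 310])

    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)  # estimated intercept and slope
    print(model.predict([[90]]))          # prediction for a new observation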
1.3 Practice Skills
Comprehensive Data Exploration with Python
 Understand how variables are distributed and how they interact
 Apply different transformations before training machine learning models
House Prices EDA
 Learn to use visualization techniques to study missing data and distributions
 Includes correlation heatmaps, pairplots, and t-SNE to help inform appropriate inputs to a linear model
A Study on Regression Applied to the Ames Dataset
 Demonstrate effective tactics for feature engineering
 Explore linear regression with different regularization methods, including Ridge, LASSO, and ElasticNet, using scikit-learn
Regularized Linear Models
 Build a basic linear model
 Try more advanced algorithms, including XGBoost and neural nets using Keras

II. Project solution


2.1 Importing libraries
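The report shows this step as a screenshot; below is a minimal sketch of the imports a notebook
like this typically relies on (the exact set used in the report may differ):

    # Core libraries for data handling, visualization, and statistics
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats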

2.2 Importing and reading data
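Again, the original code is a screenshot; here is a plausible sketch, assuming the standard
Kaggle train.csv and test.csv files sit next to the notebook:

    # Read the Kaggle competition files (file paths are assumptions)
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    print(train.shape)  # (1460, 81) for the original Kaggle training set
    train.head()        # in a notebook, displays the first 5 rows of all 81 columns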


[Output: the first 5 rows × 81 columns of the training data]
2.3 Data exploration and visualization
Before working with any kind of data, it is important to understand it. A crucial step toward this
aim is exploratory data analysis (EDA): a combination of visualizations and statistical analyses
that helps us better understand the data we are working with and gain insight into the
relationships within it. So, let's explore our target variable and how the other features
influence it.
2.3.1 Histogram and Normal probability plot
The point here is to test 'SalePrice' in a very lean way. I pay attention to:

 Histogram – kurtosis and skewness.
 Normal probability plot – the data distribution should closely follow the diagonal that represents the normal distribution.
In the literature, acceptable values for skewness are between -0.5 and 0.5, while for kurtosis
they are between -2 and 2. Looking at the plot, we can clearly see that the distribution does not
seem to be normal but is highly right-skewed. The non-normality of our distribution is also
supported by the Shapiro-Wilk test for normality (the p-value is very small, which allows us to
reject the hypothesis of normality). A sketch of these checks follows.
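The following is a sketch of the checks described above (histogram, skewness/kurtosis, normal
probability plot, and the Shapiro-Wilk test), assuming the train DataFrame from section 2.2:

    # Histogram of the target with a kernel density estimate
    sns.histplot(train['SalePrice'], kde=True)
    plt.show()

    # Skewness and kurtosis of the target
    print("Skewness: %f" % train['SalePrice'].skew())
    print("Kurtosis: %f" % train['SalePrice'].kurt())

    # Normal probability (QQ) plot against the theoretical normal quantiles
    stats.probplot(train['SalePrice'], plot=plt)
    plt.show()

    # Shapiro-Wilk test: a small p-value rejects the hypothesis of normality
    stat, p_value = stats.shapiro(train['SalePrice'])
    print("Shapiro-Wilk p-value:", p_value)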
2.3.2 Relationships with numerical variables and categorical features
[Figures: plots of “SalePrice” against “GrLivArea”, “TotalBsmtSF”, “OverallQual”, and “YearBuilt”]
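The plots themselves are not reproduced here; this sketch shows how they can be generated
(scatter plots for the continuous variables, box plots for the ordinal ones, mirroring the
summary below):

    # Scatter plots: SalePrice against the continuous variables
    for col in ['GrLivArea', 'TotalBsmtSF']:
        train.plot.scatter(x=col, y='SalePrice')
        plt.show()

    # Box plots: SalePrice against the ordinal variables
    for col in ['OverallQual', 'YearBuilt']:
        plt.figure(figsize=(12, 6))
        sns.boxplot(x=train[col], y=train['SalePrice'])
        plt.xticks(rotation=90)
        plt.show()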


In summary
Stories aside, I conclude that:
 “GrLivArea” and “TotalBsmtSF” seem to be linearly related with “SalePrice”. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of “TotalBsmtSF”, we can see that the slope of the linear relationship is particularly high.
 “OverallQual” and “YearBuilt” also seem to be related with “SalePrice”. The relationship seems stronger in the case of “OverallQual”, where the box plot shows how sale prices increase with the overall quality.
I just analyzed four variables, but there are many others we should analyze. The trick here
seems to be the choice of the right features (feature selection), not the definition of complex
relationships between them (feature engineering).
2.4 Data preprocessing
Now that we have some insights about data, we need to preprocess them for the modeling part.
The main steps are:

 Looking at potential NaNs
 Dealing with categorical features (e.g., dummy coding)
 Normalization
Usually, in a real-world project, the test data are not available until the end. For this reason,
the test data should contain the same types of data as the training set, so that both can be
preprocessed in the same way. Here, the test set is available. It contains some values not
present in the training dataset, so the use of dummy coding could raise several issues (I spent a
lot of time figuring out why I was not able to make predictions on the test set). The easiest way
to solve this problem (which is not applicable if the test data are not available) is to
concatenate the train and test sets, preprocess them, and divide them again.
2.4.1 Separating target and features
We remove the target variable in order to focus on the features.
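A sketch of this step; dropping the Id column as well is my assumption, and the concatenation
anticipates the preprocessing strategy described in 2.4:

    # Separate the target from the features
    y = train['SalePrice']
    train_features = train.drop(['Id', 'SalePrice'], axis=1)
    test_features = test.drop(['Id'], axis=1)

    # Concatenate train and test so both are preprocessed identically,
    # then split them again before modeling (see section 2.4)
    features = pd.concat([train_features, test_features]).reset_index(drop=True)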

2.4.2 Looking at NaN % within the data
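A plausible sketch of how the NaN percentages can be computed, assuming the concatenated
features DataFrame from 2.4.1:

    # Percentage of missing values per column, sorted in descending order
    nan_percent = features.isnull().sum() / len(features) * 100
    nan_percent = nan_percent[nan_percent > 0].sort_values(ascending=False)
    print(nan_percent)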

Are we sure that all these NaNs are real missing values? Looking at the given description file,
we can see that the majority of these NaNs reflect the absence of something (a pool, a garage,
and so on), and for this reason they are not true NaNs. We can impute them (for numerical
features) or substitute them with the values given in the description file.
2.4.3 Data cleaning with missing values
Important questions when thinking about missing data:
 How prevalent is the missing data?
 Is the missing data random, or does it have a pattern?
The answer to these questions is important for practical reasons, because missing data can imply
a reduction of the sample size, which can prevent us from proceeding with the analysis.
Moreover, from a substantive perspective, we need to ensure that the missing-data process is not
biased and hiding an inconvenient truth.

Let's analyse this to understand how I handle the missing data.


I'll consider that when more than 15% of the data is missing, I should delete the corresponding
variable and pretend it never existed. This means that we will not try any trick to fill the
missing data in these cases. According to this, there is a set of variables (e.g. 'PoolQC',
'MiscFeature', 'Alley', etc.) that I should delete. None of these variables seems to be very
important, since most of them are not aspects we think about when buying a house (maybe that's
the reason why the data is missing?). Moreover, looking closer at the variables, I could say that
variables like 'PoolQC', 'MiscFeature' and 'FireplaceQu' are strong candidates for outliers, so
we'll be happy to delete them.
As for the remaining cases, I can see that the 'GarageX' variables have the same number of
missing data points. I bet the missing data refers to the same set of observations (although I
will not check it; it's just 5%, and we should not spend $20 solving a $5 problem). Since the
most important information regarding garages is expressed by 'GarageCars', and considering that
we are just talking about 5% of missing data, I'll delete the mentioned 'GarageX' variables. The
same logic applies to the 'BsmtX' variables.
Regarding 'MasVnrArea' and 'MasVnrType', I consider that these variables are not essential.
Furthermore, they have a strong correlation with 'YearBuilt' and 'OverallQual'. Thus, we will not
lose information if we delete 'MasVnrArea' and 'MasVnrType'.
Finally, the data has one missing observation in 'Electrical'. Since it is just one observation,
I'll delete this observation and keep the variable.
In summary, to handle the missing data, I'll delete all the variables with missing data, except
the variable 'Electrical'. In 'Electrical' we'll just delete the observation with missing data
(a sketch of this step follows).
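A sketch of the strategy just described, applied to the training data: every variable with more
than one missing observation is dropped, and the single row missing 'Electrical' is removed.

    # Count missing values per variable
    missing = train.isnull().sum().sort_values(ascending=False)

    # Drop every variable with more than one missing value
    train = train.drop(columns=missing[missing > 1].index)

    # Drop the single observation with a missing 'Electrical' value
    train = train.drop(train.loc[train['Electrical'].isnull()].index)

    print(train.isnull().sum().max())  # 0 -> no missing data left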
2.5 Feature engineering
I create some new features combining the ones that we already have. These could help to
increase the performance of the model!

 Converting non-numeric predictors stored as numbers into strings
 Creating dummy variables from categorical features
 Fetching all numeric features
 Normalizing skewed features using a log transformation
Now let's try to transform our target distribution into a normal one. To do this, I use a log
transformation, and I use a QQ-plot to see the effect of the transformation. Here is SalePrice
before the transformation (a sketch of this step follows):
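A sketch of the transformations listed above; the skewness threshold of 0.5 is my assumption,
and np.log1p is used so zero values remain valid:

    from scipy.stats import skew

    # Log-transform the target to bring its distribution closer to normal
    y = np.log1p(train['SalePrice'])

    # Log-transform skewed numeric features (threshold is an assumption)
    numeric_feats = features.select_dtypes(include=[np.number]).columns
    skewness = features[numeric_feats].apply(lambda s: skew(s.dropna()))
    skewed_feats = skewness[skewness.abs() > 0.5].index
    features[skewed_feats] = np.log1p(features[skewed_feats])

    # Dummy-code the categorical features
    features = pd.get_dummies(features)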
2.6 Modeling

Then, I separated the train and test sets:

And created the RMSE metric (a sketch follows):
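The report's implementation is shown as a screenshot; here is a plausible sketch using
scikit-learn's cross-validation, where the concatenated features are split back into the train
and test parts:

    from sklearn.model_selection import cross_val_score

    # Split the preprocessed features back into train and test parts
    X = features.iloc[:len(y), :]
    X_test = features.iloc[len(y):, :]

    # Cross-validated RMSE: returns the mean and standard deviation over k folds
    def rmse_cv(model, X, y, k=5):
        mse = -cross_val_score(model, X, y,
                               scoring='neg_mean_squared_error', cv=k)
        rmse = np.sqrt(mse)
        return rmse.mean(), rmse.std()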


This is a results table with the RMSE mean and RMSE standard deviation for 10 types of regression models:
Now let's take a look at the top 20 most important variables for our model. This could give us
further insight into how the algorithm works and which data it relies on most to arrive at the
final prediction (a sketch of this step follows).
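The report does not show how the importances are extracted; one plausible sketch ranks features
by the absolute size of their coefficients in a regularized linear model (Ridge and alpha=10 are
assumptions):

    from sklearn.linear_model import Ridge

    # Fit a regularized linear model and rank features by coefficient magnitude
    model = Ridge(alpha=10).fit(X, y)
    coefs = pd.DataFrame({'feature': X.columns, 'coef': model.coef_})
    coefs = coefs.reindex(coefs['coef'].abs().sort_values(ascending=False).index)
    print(coefs.head(20))  # the 20 most influential variables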

[Coefficient table output: 354 rows × 2 columns]


Now we need to optimize the hyperparameters, tuning the model to obtain better performance (one
possible search is sketched below).
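A minimal sketch of one possible tuning step, using a grid search over the regularization
strength (the model choice and grid values are assumptions):

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # Search the Ridge regularization strength with 5-fold cross-validation
    param_grid = {'alpha': [0.1, 1, 5, 10, 30, 100]}
    search = GridSearchCV(Ridge(), param_grid,
                          scoring='neg_mean_squared_error', cv=5)
    search.fit(X, y)
    print(search.best_params_)
    print(np.sqrt(-search.best_score_))  # cross-validated RMSE of the best model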

So, I have the results table:


Conclusion
I reached the end of my project!
Throughout this kernel I put into practice many advanced regression techniques. I philosophized
about the variables, analysed 'SalePrice' alone and together with the most correlated variables,
dealt with missing data and outliers, tested some of the fundamental statistical assumptions,
and even transformed categorical variables into dummy variables. That's a lot of work that
Python helped me make easier.
