You are on page 1of 7

2139472 Lab_2

December 3, 2021

1 Data Mining

2 Lab_2

3 Vinay Sirohi

4 2139472

5 Select appropriate dataset and apply data reduction


5.0.1 Importing the useful libraries

[27]: import numpy as np


from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

5.1 Step 1: Reading and Understanding the Data


[2]: housing = pd.read_csv('Housing.csv')
housing.head()

[2]: price area bedrooms bathrooms stories mainroad guestroom basement \


0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes

1
hotwaterheating airconditioning parking prefarea furnishingstatus
0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished

[3]: housing.shape

[3]: (545, 13)

[4]: housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 545 non-null int64
1 area 545 non-null int64
2 bedrooms 545 non-null int64
3 bathrooms 545 non-null int64
4 stories 545 non-null int64
5 mainroad 545 non-null object
6 guestroom 545 non-null object
7 basement 545 non-null object
8 hotwaterheating 545 non-null object
9 airconditioning 545 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB

5.2 Step 2: Visualising the Data


[5]: plt.figure(figsize=[5,5])
plt.scatter(housing.area, housing.price , color = '')
plt.show()

2
[6]: sns.pairplot(housing)
plt.show()

3
5.3 Step 3: Data Preparation
5.3.1 We can see that your dataset has many columns with values as ‘Yes’ or ‘No’.
5.3.2 But in order to fit a regression line, we would need numerical values and not
string. Hence, we need to convert them to 1s and 0s, where 1 is a ‘Yes’ and 0
is a ‘No’.
[7]: housing.columns

[7]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',


'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'parking', 'prefarea', 'furnishingstatus'],
dtype='object')

4
[8]: labelencoder = LabelEncoder()
housing['mainroad']=labelencoder.fit_transform(housing['mainroad'])
housing['guestroom']=labelencoder.fit_transform(housing['guestroom'])
housing['basement']=labelencoder.fit_transform(housing['basement'])
housing['hotwaterheating']=labelencoder.
,→fit_transform(housing['hotwaterheating'])

housing['airconditioning']=labelencoder.
,→fit_transform(housing['airconditioning'])

housing['prefarea']=labelencoder.fit_transform(housing['prefarea'])
housing

[8]: price area bedrooms bathrooms stories mainroad guestroom \


0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
.. … … … … … … …
540 1820000 3000 2 1 1 1 0
541 1767150 2400 3 1 1 0 0
542 1750000 3620 2 1 1 1 0
543 1750000 2910 3 1 1 0 0
544 1750000 3850 3 1 2 1 0

basement hotwaterheating airconditioning parking prefarea \


0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0
.. … … … … …
540 1 0 0 2 0
541 0 0 0 0 0
542 0 0 0 0 0
543 0 0 0 0 0
544 0 0 0 0 0

furnishingstatus
0 furnished
1 furnished
2 semi-furnished
3 furnished
4 furnished
.. …
540 unfurnished
541 semi-furnished
542 unfurnished

5
543 furnished
544 unfurnished

[545 rows x 13 columns]

5.3.3 Dummy Variables


The variable furnishingstatus has three levels. We need to convert these levels into integer as well.

[9]: housing = pd.get_dummies(housing)


housing.head()

[9]: price area bedrooms bathrooms stories mainroad guestroom \


0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1

basement hotwaterheating airconditioning parking prefarea \


0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0

furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0

furnishingstatus_unfurnished
0 0
1 0
2 0
3 0
4 0

5.4 5. Data Reduction (PCA)


[38]: X=housing.values
pca = PCA(n_components=6)
pca.fit(X)

#The amount of variance that each PC explains

6
var= pca.explained_variance_ratio_
print(var)

#Cumulative Variance explains


var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

print(var1)

[0.21278547 0.19219357 0.13747189 0.11440193 0.08688183 0.06171814]


[21.28 40.5 54.25 65.69 74.38 80.55]

You might also like