1 Data Mining 2 Lab - 2 3 Vinay Sirohi 4 2139472 5 Select Appropriate Dataset and Apply Data Reduction

2139472 Lab_2
December 3, 2021
1 Data Mining
2 Lab_2
3 Vinay Sirohi
4 2139472
5 Select appropriate dataset and apply data reduction

5.0.1 Importing the useful libraries
[27]: import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
5.1 Step 1: Reading and Understanding the Data

[2]: housing = pd.read_csv('Housing.csv')
housing.head()
[2]: price area bedrooms bathrooms stories mainroad guestroom basement \

0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes
1
hotwaterheating airconditioning parking prefarea furnishingstatus
0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished
[3]: housing.shape
[3]: (545, 13)
[4]: housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 545 non-null int64
1 area 545 non-null int64
2 bedrooms 545 non-null int64
3 bathrooms 545 non-null int64
4 stories 545 non-null int64
5 mainroad 545 non-null object
6 guestroom 545 non-null object
7 basement 545 non-null object
8 hotwaterheating 545 non-null object
9 airconditioning 545 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
5.2 Step 2: Visualising the Data

[5]: plt.figure(figsize=[5,5])
plt.scatter(housing.area, housing.price , color = '')
plt.show()
2
[6]: sns.pairplot(housing)
plt.show()
3
5.3 Step 3: Data Preparation
5.3.1 We can see that your dataset has many columns with values as ‘Yes’ or ‘No’.
5.3.2 But in order to fit a regression line, we would need numerical values and not
string. Hence, we need to convert them to 1s and 0s, where 1 is a ‘Yes’ and 0
is a ‘No’.
[7]: housing.columns
[7]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',

'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'parking', 'prefarea', 'furnishingstatus'],
dtype='object')
4
[8]: labelencoder = LabelEncoder()
housing['mainroad']=labelencoder.fit_transform(housing['mainroad'])
housing['guestroom']=labelencoder.fit_transform(housing['guestroom'])
housing['basement']=labelencoder.fit_transform(housing['basement'])
housing['hotwaterheating']=labelencoder.
,→fit_transform(housing['hotwaterheating'])
housing['airconditioning']=labelencoder.
,→fit_transform(housing['airconditioning'])
housing['prefarea']=labelencoder.fit_transform(housing['prefarea'])
housing
[8]: price area bedrooms bathrooms stories mainroad guestroom \

0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
.. … … … … … … …
540 1820000 3000 2 1 1 1 0
541 1767150 2400 3 1 1 0 0
542 1750000 3620 2 1 1 1 0
543 1750000 2910 3 1 1 0 0
544 1750000 3850 3 1 2 1 0
basement hotwaterheating airconditioning parking prefarea \

0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0
.. … … … … …
540 1 0 0 2 0
541 0 0 0 0 0
542 0 0 0 0 0
543 0 0 0 0 0
544 0 0 0 0 0
furnishingstatus
0 furnished
1 furnished
2 semi-furnished
3 furnished
4 furnished
.. …
540 unfurnished
541 semi-furnished
542 unfurnished
5
543 furnished
544 unfurnished
[545 rows x 13 columns]
5.3.3 Dummy Variables

The variable furnishingstatus has three levels. We need to convert these levels into integer as well.
[9]: housing = pd.get_dummies(housing)

housing.head()
[9]: price area bedrooms bathrooms stories mainroad guestroom \

0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
basement hotwaterheating airconditioning parking prefarea \

0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0
furnishingstatus_unfurnished
0 0
1 0
2 0
3 0
4 0
5.4 5. Data Reduction (PCA)

[38]: X=housing.values
pca = PCA(n_components=6)
pca.fit(X)
#The amount of variance that each PC explains
6
var= pca.explained_variance_ratio_
print(var)
#Cumulative Variance explains

var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[0.21278547 0.19219357 0.13747189 0.11440193 0.08688183 0.06171814]

[21.28 40.5 54.25 65.69 74.38 80.55]

1 Data Mining 2 Lab - 2 3 Vinay Sirohi 4 2139472 5 Select Appropriate Dataset and Apply Data Reduction

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 Data Mining 2 Lab - 2 3 Vinay Sirohi 4 2139472 5 Select Appropriate Dataset and Apply Data Reduction

Uploaded by

Copyright:

Available Formats

2139472 Lab_2

5 Select appropriate dataset and apply data reduction

[27]: import numpy as np

5.1 Step 1: Reading and Understanding the Data

[2]: price area bedrooms bathrooms stories mainroad guestroom basement \

[3]: (545, 13)

5.2 Step 2: Visualising the Data

[7]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',

[8]: price area bedrooms bathrooms stories mainroad guestroom \

basement hotwaterheating airconditioning parking prefarea \

[545 rows x 13 columns]

5.3.3 Dummy Variables

[9]: housing = pd.get_dummies(housing)

[9]: price area bedrooms bathrooms stories mainroad guestroom \

basement hotwaterheating airconditioning parking prefarea \

5.4 5. Data Reduction (PCA)

#The amount of variance that each PC explains

#Cumulative Variance explains

[0.21278547 0.19219357 0.13747189 0.11440193 0.08688183 0.06171814]

You might also like