
Name: Rimjhim Kumari

Roll no 22527
IT- 34 Knowledge Representation and Artificial Intelligence: ML, DL

Q.1 Download a dataset from the link:


https://www.kaggle.com/datasets/priyanshusethi/happiness-classification-dataset and perform
the following:
• Pre-processing and exploratory analysis of the data.
• By examining the data, identify the machine learning problem to be solved.
• Compare and analyse the performances of a minimum of 3 machine learning algorithms
that can be applied to solve the problem.

Rubrics:
• Perform minimum three steps of data pre-processing (1 mark for each step): any 1 of the steps performed - 1 Mark (Satisfactory); 2 steps performed - 2 Marks (Good); minimum 3 steps performed - 3 Marks (Very Good). [CO3: 3 Marks]
• Perform Exploratory Data Analysis: Summary Statistics, Data Visualization, Correlation analysis (2 marks each): any 1 of the 3 mentioned steps performed - 2 Marks (Satisfactory); 2 of the 3 mentioned steps performed - 4 Marks (Good); all 3 mentioned steps performed - 6 Marks (Very Good). [CO4: 6 Marks]
• Problem specification and Identifying Algorithms (4 Marks): correctly specifying the problem - 1 Mark (Satisfactory); correctly specifying the problem and correctly categorising the attributes - 2 Marks (Good); correctly specifying the problem, correctly categorising the attributes and identifying the algorithms - 4 Marks (Very Good). [CO4: 4 Marks]
• Implementation of the Algorithms (3 marks for each algorithm): implementing 1 algorithm - 3 Marks (Satisfactory); implementing 2 algorithms - 6 Marks (Good); implementing 3 algorithms - 9 Marks (Very Good). [CO3: 9 Marks]
• Performance Evaluation: computing accuracies - 1 Mark (Satisfactory); computing accuracies and comparing performances - 2 Marks (Good); computing accuracies, comparing performances, and finalising the Model - 3 Marks (Very Good). [CO5: 3 Marks]

Ans>> Pre-processing and exploratory analysis of the data.

The dataset we will be working with comprises responses from a survey conducted among
residents of different cities. It includes the following features:

• infoavail: Availability of information about city services.


• housecost: Cost of housing in the city.
• schoolquality: Overall quality of public schools.
• policetrust: Trust in the local police.
• streetquality: Maintenance of streets and sidewalks.
• events: Availability of social community events.
• happy: Decision attribute indicating happiness, with values 0 (unhappy) and 1
(happy).

We will explore the relationships between these features and happiness, perform data analysis
and visualization, and build a classification model to predict happiness based on the given
attributes.

>>Load libraries & data
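
The code for this step appears in the full listing at the end of this answer; it is reproduced here so the walkthrough can be read top to bottom. The CSV path is the local path used in that listing and should be changed to wherever the downloaded Kaggle file is stored.

# importing the libraries used throughout and loading the dataset
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:\Users\rahul\Desktop\happydata.csv")  # adjust to the local path of the Kaggle CSV
df.head()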


Basic Statistical Analysis

df.info()
Output>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 infoavail 143 non-null int64
1 housecost 143 non-null int64
2 schoolquality 143 non-null int64
3 policetrust 143 non-null int64
4 streetquality 143 non-null int64
5 ëvents 143 non-null int64
6 happy 143 non-null int64
dtypes: int64(7)
memory usage: 7.9 KB

>>df.describe()

Output>

        infoavail   housecost  schoolquality  policetrust  streetquality      ëvents       happy
count  143.000000  143.000000     143.000000   143.000000     143.000000  143.000000  143.000000
mean     4.314685    2.538462       3.265734     3.699301       3.615385    4.216783    0.538462
std      0.799820    1.118155       0.992586     0.888383       1.131639    0.848693    0.500271
min      1.000000    1.000000       1.000000     1.000000       1.000000    1.000000    0.000000
25%      4.000000    2.000000       3.000000     3.000000       3.000000    4.000000    0.000000
50%      5.000000    3.000000       3.000000     4.000000       4.000000    4.000000    1.000000
75%      5.000000    3.000000       4.000000     4.000000       4.000000    5.000000    1.000000
max      5.000000    5.000000       5.000000     5.000000       5.000000    5.000000    1.000000

>>df['happy'].value_counts()

1 77
0 66
Name: happy, dtype: int64
The dataset has 77 rows labelled happy (1) and 66 rows labelled unhappy (0).

Check missing values

df.isna().sum()
infoavail 0
housecost 0
schoolquality 0
policetrust 0
streetquality 0
ëvents 0
happy 0
dtype: int64
Observations:

• No missing values in the table.
• No outliers in any feature, since all the predictor features consist of values
in the finite set {1, 2, 3, 4, 5}.

# Dropping duplicate rows
df = df.drop_duplicates()

# Distribution plots (histogram with KDE) for each predictor
sns.displot(df['infoavail'],kde=True)
sns.displot(df['housecost'],kde=True)
sns.displot(df['schoolquality'],kde=True)
sns.displot(df['policetrust'],kde=True)
sns.displot(df['streetquality'],kde=True)
sns.displot(df['ëvents'],kde=True)
Output> one distribution plot per feature (plots not reproduced here).
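
The rubric also lists correlation analysis as part of the exploratory step. A minimal sketch using the same df is given below; the resulting heatmap is not reproduced here.

# correlation analysis: pairwise Pearson correlations between all columns
corr = df.corr()
print(corr['happy'].sort_values(ascending=False))  # how strongly each feature correlates with the target
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()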
Algorithms

Train-Test Split
x = df.iloc[:,:6]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x , df["happy"],
test_size=0.2,
random_state=0)

x_train.shape, x_test.shape

output >((100, 6), (25, 6))
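
Since the two classes are only mildly imbalanced (77 happy vs 66 unhappy before de-duplication), an optional refinement is a stratified split, which keeps the class ratio roughly equal in the train and test sets. This variant is only a sketch and is not the split used for the results reported below.

# optional: stratified split preserving the happy/unhappy ratio
x_train, x_test, y_train, y_test = train_test_split(x, df["happy"],
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=df["happy"])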


>> Logistic Regression
# importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lreg = LogisticRegression()
lreg.fit(x_train,y_train)
output >> LogisticRegression()

# predicting values and checking accuracy


lpred = lreg.predict(x_test)
accuracy_score(lpred , y_test)
output > 0.6

Decision Trees
#importing library
from sklearn import tree
#implementing decision trees
dtr = tree.DecisionTreeClassifier()
dtr.fit(x_train,y_train)
Output >> DecisionTreeClassifier()

#predicting values and testing accuracy


dpred=dtr.predict(x_test)
accuracy_score(dpred,y_test)
Output >> 0.56
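
An unconstrained decision tree can easily overfit a training set of only 100 rows. A common adjustment is to limit the tree depth; the snippet below is only a sketch of that idea (max_depth=3 is an arbitrary choice, and its accuracy may differ from the value reported above).

# limiting the tree depth to reduce overfitting on the small training set
dtr_small = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
dtr_small.fit(x_train, y_train)
print(accuracy_score(dtr_small.predict(x_test), y_test))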
KNN
#importing libraries

from sklearn.neighbors import KNeighborsClassifier


#implementing KNN
knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

Output >> KNeighborsClassifier()


#predicting values and testing accuracy
kpred = knn.predict(x_test)
accuracy_score(kpred,y_test)
Output >> 0.6
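
KNeighborsClassifier uses n_neighbors=5 by default. A small sweep over k can show whether a different neighbourhood size suits this data better; the sketch below prints the test accuracy for several odd values of k (results not reported here).

# trying several values of k for KNN
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(x_train, y_train)
    print(k, accuracy_score(knn_k.predict(x_test), y_test))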

SVM
# importing library
from sklearn import svm
#implementing SVM
sv = svm.SVC()
sv.fit(x_train,y_train)
Output >>SVC()
#predicting values and testing accuracy
spred = sv.predict(x_test)
accuracy_score(spred,y_test)
Output>>0.56
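
SVC is sensitive to feature scale. All predictors here already share the same 1 to 5 scale, so scaling matters less, but wrapping the classifier in a pipeline with StandardScaler is the usual idiom and is sketched below for completeness.

# SVM with feature scaling applied inside a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
sv_scaled = make_pipeline(StandardScaler(), svm.SVC())
sv_scaled.fit(x_train, y_train)
print(accuracy_score(sv_scaled.predict(x_test), y_test))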
Random Forest
#importing library
from sklearn.ensemble import RandomForestClassifier
#implementing random forests
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
Output >>RandomForestClassifier()

#predicting values and testing accuracy


fpred=rfr.predict(x_test)
accuracy_score(fpred,y_test)
Output >>0.52
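
The fitted random forest also exposes feature importances, which give a rough idea of which survey questions drive the happiness prediction. A short sketch:

# ranking the predictors by importance according to the fitted random forest
importances = pd.Series(rfr.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))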
>> Entire code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the data (adjust the path to the downloaded CSV)
df = pd.read_csv(r"C:\Users\rahul\Desktop\happydata.csv")

# basic inspection and summary statistics
df.sample(5)
df.head()
df.columns
df.info()
df.describe()
df['happy'].value_counts()

# missing values and duplicates
df.isna().sum()
df = df.drop_duplicates()

# distribution plots for each predictor
sns.displot(df['infoavail'],kde=True)
sns.displot(df['housecost'],kde=True)
sns.displot(df['schoolquality'],kde=True)
sns.displot(df['policetrust'],kde=True)
sns.displot(df['streetquality'],kde=True)
sns.displot(df['ëvents'],kde=True)

# train-test split
x = df.iloc[:,:6]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, df["happy"],
                                                    test_size=0.2,
                                                    random_state=0)
x_train.shape, x_test.shape

# logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lreg = LogisticRegression()
lreg.fit(x_train, y_train)
lpred = lreg.predict(x_test)
accuracy_score(lpred, y_test)

# decision tree
from sklearn import tree
dtr = tree.DecisionTreeClassifier()
dtr.fit(x_train, y_train)
dpred = dtr.predict(x_test)
accuracy_score(dpred, y_test)

# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
kpred = knn.predict(x_test)
accuracy_score(kpred, y_test)

# SVM
from sklearn import svm
sv = svm.SVC()
sv.fit(x_train, y_train)
spred = sv.predict(x_test)
accuracy_score(spred, y_test)

# random forest
from sklearn.ensemble import RandomForestClassifier
rfr = RandomForestClassifier()
rfr.fit(x_train, y_train)
fpred = rfr.predict(x_test)
accuracy_score(fpred, y_test)

>> Comparing the accuracy results

Algorithm              Accuracy observed on the test set
Logistic Regression    0.60
Decision Trees         0.56
Random Forest          0.52
KNN                    0.60
SVM                    0.56

Logistic Regression and KNN give the highest accuracy (0.60) on the held-out test set, so Logistic Regression is selected as the final model; with only 25 test samples, the differences between the algorithms are small.
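
With so few test samples, a single train/test split gives fairly noisy accuracy estimates. Before finalising a model, a more stable comparison can be obtained with k-fold cross-validation over the de-duplicated data. The sketch below reuses the models defined above (max_iter is raised for Logistic Regression only to avoid convergence warnings); the resulting scores are not reported here.

# 5-fold cross-validation for a more stable comparison of the candidate models
from sklearn.model_selection import cross_val_score
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Trees": tree.DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": svm.SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, x, df["happy"], cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")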
