
Name: Rimjhim Kumari

Roll no 22527
IT- 34 Knowledge Representation and Artificial Intelligence: ML, DL

Q.1 Download a dataset from the link:


https://www.kaggle.com/datasets/priyanshusethi/happiness-classification-dataset and perform
the following:
• Pre-processing and exploratory analysis of the data.
• By examining the data, identify the machine learning problem to be solved.
• Compare and analyse the performances of a minimum of 3 machine learning algorithms
that can be applied to solve the problem.

Rubrics:
• Perform minimum three steps of data pre-processing (1 mark for each step): any 1 of the steps performed - 1 Mark (Satisfactory); 2 steps performed - 2 Marks (Good); minimum 3 steps performed - 3 Marks (Very Good). [CO3: 3 Marks]
• Perform Exploratory Data Analysis: Summary Statistics, Data Visualization, Correlation analysis (2 marks each): any 1 of the 3 mentioned steps performed - 2 Marks (Satisfactory); 2 of the 3 mentioned steps performed - 4 Marks (Good); all 3 mentioned steps performed - 6 Marks (Very Good). [CO4: 6 Marks]
• Problem specification and Identifying Algorithms (4 Marks): correctly specifying the problem - 1 Mark (Satisfactory); correctly specifying the problem and correctly categorising the attributes - 2 Marks (Good); correctly specifying the problem, correctly categorising the attributes and identifying the algorithms - 4 Marks (Very Good). [CO4: 4 Marks]
• Implementation of the Algorithms (3 marks for each algorithm): implementing 1 algorithm - 3 Marks (Satisfactory); implementing 2 algorithms - 6 Marks (Good); implementing 3 algorithms - 9 Marks (Very Good). [CO3: 9 Marks]
• Performance Evaluation: computing accuracies - 1 Mark (Satisfactory); computing accuracies and comparing performances - 2 Marks (Good); computing accuracies, comparing performances, and finalising the Model - 3 Marks (Very Good). [CO5: 3 Marks]

Ans>> Pre-processing and exploratory analysis of the data.

The dataset we will be working with comprises responses from a survey conducted among
residents of different cities. It includes the following features:

• infoavail: Availability of information about city services.


• housecost: Cost of housing in the city.
• schoolquality: Overall quality of public schools.
• policetrust: Trust in the local police.
• streetquality: Maintenance of streets and sidewalks.
• events: Availability of social community events.
• happy: Decision attribute indicating happiness, with values 0 (unhappy) and 1
(happy).

We will explore the relationships between these features and happiness, perform data analysis
and visualization, and build a classification model to predict happiness based on the given
attributes.

>>Load libraries & data
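
The code for this step appears in the full listing at the end of this answer; it is reproduced here so the walkthrough can be read top to bottom. The CSV path is the local path used in that listing and should be changed to wherever the downloaded Kaggle file is stored.

# importing the libraries used throughout and loading the dataset
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:\Users\rahul\Desktop\happydata.csv")  # adjust to the local path of the Kaggle CSV
df.head()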


Basic Statistical Analysis

df.info()
Output>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 infoavail 143 non-null int64
1 housecost 143 non-null int64
2 schoolquality 143 non-null int64
3 policetrust 143 non-null int64
4 streetquality 143 non-null int64
5 ëvents 143 non-null int64
6 happy 143 non-null int64
dtypes: int64(7)
memory usage: 7.9 KB

>>df.describe()

Output>

        infoavail   housecost  schoolquality  policetrust  streetquality      ëvents       happy
count  143.000000  143.000000     143.000000   143.000000     143.000000  143.000000  143.000000
mean     4.314685    2.538462       3.265734     3.699301       3.615385    4.216783    0.538462
std      0.799820    1.118155       0.992586     0.888383       1.131639    0.848693    0.500271
min      1.000000    1.000000       1.000000     1.000000       1.000000    1.000000    0.000000
25%      4.000000    2.000000       3.000000     3.000000       3.000000    4.000000    0.000000
50%      5.000000    3.000000       3.000000     4.000000       4.000000    4.000000    1.000000
75%      5.000000    3.000000       4.000000     4.000000       4.000000    5.000000    1.000000
max      5.000000    5.000000       5.000000     5.000000       5.000000    5.000000    1.000000

>>df['happy'].value_counts()

1 77
0 66
Name: happy, dtype: int64
The dataset has 77 rows labelled happy (1) and 66 rows labelled unhappy (0).

Check missing values

df.isna().sum()
infoavail 0
housecost 0
schoolquality 0
policetrust 0
streetquality 0
ëvents 0
happy 0
dtype: int64
Observations:

• No missing values in the table.
• No outliers in any feature, since all the predictor features consist of values
in the finite set {1, 2, 3, 4, 5}.

# Dropping duplicate rows
df = df.drop_duplicates()

# Distribution plots (histogram with KDE) for each predictor
sns.displot(df['infoavail'],kde=True)
sns.displot(df['housecost'],kde=True)
sns.displot(df['schoolquality'],kde=True)
sns.displot(df['policetrust'],kde=True)
sns.displot(df['streetquality'],kde=True)
sns.displot(df['ëvents'],kde=True)
Output> one distribution plot per feature (plots not reproduced here).
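
The rubric also lists correlation analysis as part of the exploratory step. A minimal sketch using the same df is given below; the resulting heatmap is not reproduced here.

# correlation analysis: pairwise Pearson correlations between all columns
corr = df.corr()
print(corr['happy'].sort_values(ascending=False))  # how strongly each feature correlates with the target
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()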
Algorithms

Train-Test Split
x = df.iloc[:,:6]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x , df["happy"],
test_size=0.2,
random_state=0)

x_train.shape, x_test.shape

output >((100, 6), (25, 6))
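
Since the two classes are only mildly imbalanced (77 happy vs 66 unhappy before de-duplication), an optional refinement is a stratified split, which keeps the class ratio roughly equal in the train and test sets. This variant is only a sketch and is not the split used for the results reported below.

# optional: stratified split preserving the happy/unhappy ratio
x_train, x_test, y_train, y_test = train_test_split(x, df["happy"],
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=df["happy"])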


>> Logistic Regression
# importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lreg = LogisticRegression()
lreg.fit(x_train,y_train)
output >> LogisticRegression()

# predicting values and checking accuracy


lpred = lreg.predict(x_test)
accuracy_score(lpred , y_test)
output > 0.6

Decision Trees
#importing library
from sklearn import tree
#implementing decision trees
dtr = tree.DecisionTreeClassifier()
dtr.fit(x_train,y_train)
Output >> DecisionTreeClassifier()

#predicting values and testing accuracy


dpred=dtr.predict(x_test)
accuracy_score(dpred,y_test)
Output >> 0.56
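
An unconstrained decision tree can easily overfit a training set of only 100 rows. A common adjustment is to limit the tree depth; the snippet below is only a sketch of that idea (max_depth=3 is an arbitrary choice, and its accuracy may differ from the value reported above).

# limiting the tree depth to reduce overfitting on the small training set
dtr_small = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
dtr_small.fit(x_train, y_train)
print(accuracy_score(dtr_small.predict(x_test), y_test))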
KNN
#importing libraries

from sklearn.neighbors import KNeighborsClassifier


#implementing KNN
knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

Output >> KNeighborsClassifier()


#predicting values and testing accuracy
kpred = knn.predict(x_test)
accuracy_score(kpred,y_test)
Output >> 0.6
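
KNeighborsClassifier uses n_neighbors=5 by default. A small sweep over k can show whether a different neighbourhood size suits this data better; the sketch below prints the test accuracy for several odd values of k (results not reported here).

# trying several values of k for KNN
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(x_train, y_train)
    print(k, accuracy_score(knn_k.predict(x_test), y_test))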

SVM
# importing library
from sklearn import svm
#implementing SVM
sv = svm.SVC()
sv.fit(x_train,y_train)
Output >>SVC()
#predicting values and testing accuracy
spred = sv.predict(x_test)
accuracy_score(spred,y_test)
Output>>0.56
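
SVC is sensitive to feature scale. All predictors here already share the same 1 to 5 scale, so scaling matters less, but wrapping the classifier in a pipeline with StandardScaler is the usual idiom and is sketched below for completeness.

# SVM with feature scaling applied inside a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
sv_scaled = make_pipeline(StandardScaler(), svm.SVC())
sv_scaled.fit(x_train, y_train)
print(accuracy_score(sv_scaled.predict(x_test), y_test))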
Random Forest
#importing library
from sklearn.ensemble import RandomForestClassifier
#implementing random forests
rfr = RandomForestClassifier()
rfr.fit(x_train,y_train)
Output >>RandomForestClassifier()

#predicting values and testing accuracy


fpred=rfr.predict(x_test)
accuracy_score(fpred,y_test)
Output >>0.52
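
The fitted random forest also exposes feature importances, which give a rough idea of which survey questions drive the happiness prediction. A short sketch:

# ranking the predictors by importance according to the fitted random forest
importances = pd.Series(rfr.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))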
>> Entire code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the data (adjust the path to the downloaded CSV)
df = pd.read_csv(r"C:\Users\rahul\Desktop\happydata.csv")

# basic inspection and summary statistics
df.sample(5)
df.head()
df.columns
df.info()
df.describe()
df['happy'].value_counts()

# missing values and duplicates
df.isna().sum()
df = df.drop_duplicates()

# distribution plots for each predictor
sns.displot(df['infoavail'],kde=True)
sns.displot(df['housecost'],kde=True)
sns.displot(df['schoolquality'],kde=True)
sns.displot(df['policetrust'],kde=True)
sns.displot(df['streetquality'],kde=True)
sns.displot(df['ëvents'],kde=True)

# train-test split
x = df.iloc[:,:6]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, df["happy"],
                                                    test_size=0.2,
                                                    random_state=0)
x_train.shape, x_test.shape

# logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lreg = LogisticRegression()
lreg.fit(x_train, y_train)
lpred = lreg.predict(x_test)
accuracy_score(lpred, y_test)

# decision tree
from sklearn import tree
dtr = tree.DecisionTreeClassifier()
dtr.fit(x_train, y_train)
dpred = dtr.predict(x_test)
accuracy_score(dpred, y_test)

# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
kpred = knn.predict(x_test)
accuracy_score(kpred, y_test)

# SVM
from sklearn import svm
sv = svm.SVC()
sv.fit(x_train, y_train)
spred = sv.predict(x_test)
accuracy_score(spred, y_test)

# random forest
from sklearn.ensemble import RandomForestClassifier
rfr = RandomForestClassifier()
rfr.fit(x_train, y_train)
fpred = rfr.predict(x_test)
accuracy_score(fpred, y_test)

>> Comparing the accuracy results

Algorithm              Accuracy observed on the test set
Logistic Regression    0.60
Decision Trees         0.56
Random Forest          0.52
KNN                    0.60
SVM                    0.56

Logistic Regression and KNN give the highest accuracy (0.60) on the held-out test set, so Logistic Regression is selected as the final model; with only 25 test samples, the differences between the algorithms are small.
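
With so few test samples, a single train/test split gives fairly noisy accuracy estimates. Before finalising a model, a more stable comparison can be obtained with k-fold cross-validation over the de-duplicated data. The sketch below reuses the models defined above (max_iter is raised for Logistic Regression only to avoid convergence warnings); the resulting scores are not reported here.

# 5-fold cross-validation for a more stable comparison of the candidate models
from sklearn.model_selection import cross_val_score
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Trees": tree.DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": svm.SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, x, df["happy"], cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")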
