PBL Report Aidm

Project-Based Learning Report
On
Implement Measuring central tendency, measuring dispersion of data in
python with real time database data mining techniques
Submitted in the partial fulfillment of the requirements

For Project-based learning in Artificial Intelligence and Data Mining
in
Electronics & Communication Engineering
By
2014111034 Tushar Chaubey
2014111044 Sania Goyal
2014111045 Swastik Gupta
Under the guidance of the Course In-charge
Prof. V.P. Kaduskar
Department of Electronics & Communication Engineering
Bharati Vidyapeeth
(Deemed to be University)
College of Engineering,
Pune – 411043
Academic Year: 2023-24
Bharati Vidyapeeth
(Deemed to be University)
College of Engineering,
Pune – 411043
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING
CERTIFICATE
Certified that the Project Based Learning report entitled, “Implement Measuring central
tendency, measuring dispersion of data in python with real time database data mining
techniques” is work done by
2014111034 Tushar Chaubey

2014111044 Sania Goyal
2014111045 Swastik Gupta
in partial fulfillment of the requirements for the award of credits for Project Based Learning
(PBL) in Artificial Intelligence and Data Mining of Bachelor of Technology Semester VII, in
Electronics and Communication.
Date: 17/10/2023
Prof. V.P. Kaduskar, Course In-charge
Dr. Arundhati A. Shinde, Professor & Head, E.C.E
INDEX
Sr. No. Title
1 Description of problem statement
2 Implement Measuring Central Tendency & Dispersion with a Student Performance dataset
3 Result of Performance Evaluation
4 Conclusion
5 Project Outcome
6 Appendix
Description of Problem Statement
Implement Measuring central tendency, measuring dispersion of data in python with real time
database data mining techniques.
Measuring Central Tendency:
Central tendency is a statistical measure that identifies a single value as representative of an
entire dataset. It gives you an idea about the center of the data. The most common measures of
central tendency are the mean, median, and mode:
1. Mean: The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with. To calculate it, add up all of the values and
divide by the number of observations in your dataset.
The calculation of the mean incorporates all values in the data. If you change any value,
the mean changes. However, the mean doesn’t always locate the center of the data
accurately: in a skewed dataset, extreme values pull it away from the typical value.
2. Median: The median is the middle value. It is the value that splits the dataset in half,
making it a natural measure of central tendency.
To find the median, order your data from smallest to largest, and then find the data point
that has an equal number of values above it and below it. With an odd number of values,
the median is the single middle value; with an even number, it is the average of the two
middle values.
3. Mode: The mode is the value that occurs the most frequently in your data set, making it a
different type of measure of central tendency than the mean or median.
To find the mode, sort the values in your dataset by numeric values or by categories.
Then identify the value that occurs most often.
On a bar chart, the mode is the highest bar. If the data have multiple values that are tied
for occurring the most frequently, you have a multimodal distribution. If no value repeats,
the data do not have a mode.
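The three measures above can be sketched with Python's standard statistics module. The scores list is hypothetical and chosen only for illustration; it is not taken from the project dataset.

```python
from statistics import mean, median, multimode

# Hypothetical scores, chosen only for illustration
scores = [55, 60, 60, 72, 88]

print(mean(scores))       # arithmetic average: 335 / 5
print(median(scores))     # middle value of the sorted data
print(multimode(scores))  # list of all most-frequent values

# A tie between two values gives a multimodal result
print(multimode([1, 1, 2, 2]))
```

Note that multimode returns every value tied for the highest count, so for data in which no value repeats it returns all the values rather than reporting that there is no mode.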
Measuring Dispersion:
Dispersion refers to the spread or variability of a dataset. It indicates how much the values differ
from the central tendency. Common measures of dispersion include variance, standard deviation,
and range:
1. Variance: The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in the
set is from the mean (average), and thus from every other number in the set. Variance is
often denoted by the symbol σ². It is used by both analysts and traders to
determine volatility and market security. For a sample of n values x_i with mean x̄, the
variance is
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Dividing by n − 1 gives the sample variance; dividing by n gives the population variance.
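2. Standard Deviation: The standard deviation is the positive square root of the arithmetic
mean of the squares of the deviations of the given values from their arithmetic mean. It is
denoted by the Greek letter sigma, σ. It is also referred to as the root mean square deviation.
For n values y_i with mean ȳ, the standard deviation is given as
$$\sigma = \left[ \frac{\sum_i (y_i - \bar{y})^2}{n} \right]^{1/2} = \left[ \frac{\sum_i y_i^2}{n} - \bar{y}^2 \right]^{1/2}$$
For a grouped frequency distribution with frequencies f_i and N = Σ_i f_i, it is
$$\sigma = \left[ \frac{\sum_i f_i (y_i - \bar{y})^2}{N} \right]^{1/2} = \left[ \frac{\sum_i f_i y_i^2}{N} - \bar{y}^2 \right]^{1/2}$$
These forms divide by the number of values; replacing n with n − 1 gives the sample standard deviation.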
3. Range: Range is the difference between the maximum and minimum values in a dataset.
Range = Highest Value - Lowest Value
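A minimal sketch of these dispersion measures, using Python's standard statistics module on a hypothetical list of values. It shows both the divide-by-n (population) and divide-by-(n − 1) (sample) forms, and checks the shortcut "mean of squares minus square of the mean" against the definition.

```python
from statistics import pvariance, pstdev, variance, stdev

# Hypothetical values, chosen only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

pop_var = pvariance(values)    # sum of squared deviations divided by n
pop_std = pstdev(values)       # square root of the population variance
sample_var = variance(values)  # sum of squared deviations divided by n - 1
sample_std = stdev(values)     # square root of the sample variance
data_range = max(values) - min(values)

# Shortcut form: mean of the squares minus the square of the mean
shortcut_var = sum(v * v for v in values) / n - (sum(values) / n) ** 2

print(pop_var, pop_std, sample_var, sample_std, data_range, shortcut_var)
```

When the same quantities are computed with numpy or pandas, the divisor is worth checking: np.var and np.std divide by n unless ddof=1 is passed, whereas the pandas var and std methods divide by n − 1 by default.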
Implementing Central Tendency and Dispersion with Real-time Database
Data Mining Techniques:
To implement these concepts using real-time database data mining techniques in Python, you
would typically follow these steps:
1. Connect to the Database: Use a library like sqlalchemy to establish a connection to the database.
2. Retrieve Data: Execute SQL queries to retrieve the required data from the database and store it in
a pandas DataFrame.
3. Calculate Central Tendency: Use pandas and numpy functions to calculate mean, median, and
mode from the retrieved data.
4. Calculate Dispersion: Use numpy functions to calculate variance, standard deviation, and range
from the retrieved data.
5. Display or Use the Results: Print the calculated central tendency and dispersion values or use
them for further analysis or visualization.
Remember, the specific implementation details can vary based on the type of database you're
working with (e.g., MySQL, PostgreSQL, MongoDB) and the data retrieval methods supported
by the chosen database library. Additionally, handling real-time data might involve setting up a
streaming mechanism or periodic polling of the database to get the latest data for analysis.
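The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's implementation: an in-memory SQLite database from the standard library stands in for the real database, the table name scores, its columns, and the inserted rows are hypothetical, and the statistics module is used in place of pandas and numpy so the sketch runs without extra packages.

```python
import sqlite3
from statistics import mean, median, multimode, pstdev, pvariance

# An in-memory SQLite database stands in for the real-time source (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INTEGER, math_score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 72), (2, 69), (3, 90), (4, 47), (5, 72)],  # hypothetical rows
)

def summarize(connection):
    """Query the current rows and return central tendency and dispersion."""
    rows = connection.execute("SELECT math_score FROM scores").fetchall()
    values = [row[0] for row in rows]
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),
        "variance": pvariance(values),  # population form, divides by n
        "std": pstdev(values),
        "range": max(values) - min(values),
    }

summary = summarize(conn)
print(summary)

# Periodic polling: once new rows arrive, calling summarize() again
# recomputes the statistics on the latest data
conn.execute("INSERT INTO scores VALUES (6, 100)")
updated = summarize(conn)
print(updated["mean"])
```

With a production database, the connection would come from the chosen database library, and the query result could be loaded into a pandas DataFrame before the same statistics are computed.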
Calculate and visualize Mean, Median, and Mode in Python.

The code below is a Python script that shows the calculation of measures of central tendency
(mean, median, and mode) for a normally distributed dataset and a skewed dataset (not
symmetrical around its mean). The code uses the numpy, matplotlib, pandas, and seaborn
libraries. We use histograms to visualize Mean, Median, and Mode in Python. In the histogram
on the left, the mean, median, and mode are close to each other because the data is symmetrically
distributed around the center. However, in the histogram on the right, which represents a skewed
dataset, the mean is significantly influenced by extreme values in the tails of the distribution,
while the median and mode are less affected.
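The effect described here can be checked numerically without plotting. The sketch below is an assumption-laden stand-in for the script described: it uses the standard random module rather than numpy, draws a symmetric sample from a normal distribution and a right-skewed sample from an exponential distribution (both sets of parameters are illustrative), and compares the gap between mean and median.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# Symmetric data: normal distribution (parameters are illustrative)
symmetric = [random.gauss(50, 10) for _ in range(10_000)]
# Right-skewed data: exponential distribution with mean 1
skewed = [random.expovariate(1.0) for _ in range(10_000)]

sym_gap = mean(symmetric) - median(symmetric)
skew_gap = mean(skewed) - median(skewed)

print(f"symmetric: mean - median = {sym_gap:.3f}")
print(f"skewed:    mean - median = {skew_gap:.3f}")
```

For the symmetric sample the gap should be close to zero, while for the right-skewed sample the mean sits clearly above the median because the long right tail pulls it upward.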
Code and Output:
➢ Measuring Central Tendency & Dispersion with a Student_Performance dataset:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df=pd.read_csv("StudentsPerformance.csv")
df.head()
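Figure 1: StudentPerformance
# Column names, dtypes and non-null counts
df.info()
Gender Count
df.gender.value_counts()
# Total marks: sum of the three subject scores (maximum 300)
df['total marks'] = df['math score'] + df['reading score'] + df['writing score']
df.head(2)
# Number of students for each (parental education, total marks) pair
df.groupby(['parental level of education', 'total marks']).size().reset_index()
# Mean of total marks across all students
df['total marks'].mean()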
plt.figure(figsize=(5, 5))
sns.barplot(x=df['parental level of education'], y=df['total marks'])
plt.xticks(rotation=60)
plt.ylim(0, 300)
# Dashed reference line at the overall mean of total marks (203.312 for this dataset)
plt.axhline(y=df['total marks'].mean(), color='r', linestyle='dashed', label="Mean value line")
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper center')
plt.show()
Figure 2: Total marks vs Parental Education
➢ Who scores the most on average for math, reading and writing based on
● Gender
● Test preparation course
# numeric_only=True skips the text columns, which pandas 2.0+ refuses to average
df.mean(numeric_only=True)
plt.subplots_adjust(left=0.125, bottom=0.1, right=1.0, top=1.0, wspace=0.8, hspace=0.2)
plt.subplot(131)
plt.title('Math Scores')
sns.barplot(hue="gender", y="math score", x="test preparation course", data=df, palette='bright')
# Reference line at the overall mean math score (66.067 for this dataset)
plt.axhline(y=df['math score'].mean(), color='r', linestyle='dashed', label="avg math score")
plt.subplot(132)
plt.title('Reading Scores')
sns.barplot(hue="gender", y="reading score", x="test preparation course", data=df, palette='hls')
# Reference line at the overall mean reading score (69.117 for this dataset)
plt.axhline(y=df['reading score'].mean(), color='g', linestyle='-', label="avg reading score")
plt.subplot(133)
plt.title('Writing Scores')
sns.barplot(hue="gender", y="writing score", x="test preparation course", data=df, palette='colorblind')
# Reference line at the overall mean writing score (67.997 for this dataset)
plt.axhline(y=df['writing score'].mean(), color='b', linestyle=':', label="avg writing score")
plt.show()
Figure 3: Scores vs Test preparation
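The comparison drawn in the bar plots can also be read off as numbers by grouping and averaging. The sketch below uses a small hypothetical DataFrame with the same column names as the dataset; the values are invented for illustration and are not results from StudentsPerformance.csv.

```python
import pandas as pd

# Hypothetical rows using the same column names as the dataset
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "test preparation course": ["completed", "none", "completed", "none"],
    "math score": [70, 60, 80, 62],
    "reading score": [80, 70, 72, 60],
    "writing score": [82, 68, 70, 58],
})

score_cols = ["math score", "reading score", "writing score"]

# Mean of each score for every (gender, test preparation) combination
group_means = sample.groupby(["gender", "test preparation course"])[score_cols].mean()
print(group_means)

# Mean of each score by test preparation course alone
prep_means = sample.groupby("test preparation course")[score_cols].mean()
print(prep_means)
```

Applied to the full DataFrame df, the same groupby calls would give the group averages that the bar heights represent.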

➢ Who scores the most on average for math, reading and writing based on
● Gender
● Test preparation course
sns.set_palette('bright')
sns.histplot(x='math score', data=df, kde=True, hue='gender')
plt.title('math scores of male and female')
plt.show()
sns.set_palette('hls')
sns.histplot(x='reading score', data=df, kde=True, hue='gender')
plt.title('reading scores of male and female')
plt.show()
sns.set_palette('colorblind')
sns.histplot(x='writing score', data=df, kde=True, hue='gender')
plt.title('writing scores of male and female')
plt.show()
Figure 4: No. of students vs Math score

Figure 5: No. of students vs Reading score
Figure 6: No. of students vs Writing score

sns.set_palette('bright')
sns.histplot(x='math score', data=df, kde=True, hue='test preparation course')
plt.title('math scores wrt test preparation course')
plt.show()
sns.set_palette('hls')
sns.histplot(x='reading score', data=df, kde=True, hue='test preparation course')
plt.title('reading scores wrt test preparation course')
plt.show()
sns.set_palette('colorblind')
sns.histplot(x='writing score', data=df, kde=True, hue='test preparation course')
plt.title('writing scores wrt test preparation course')
plt.show()
Figure 7: No. of students vs Math score

Figure 8: No. of students vs Reading score
Figure 9: No. of students vs Writing score

The management wants to award bonus points to the top 25% of students based on their math
score. How can this be achieved? The 75th percentile of the math score gives the cut-off:
students scoring at or above it form the top 25%.
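# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()
# 75th percentile of the math score: the cut-off for the top 25% of students
np.percentile(df['math score'], 75)

# The top edge of the box marks the same 75th percentile
sns.boxplot(data=df, y='math score', color=sns.color_palette('colorblind')[0])
plt.show()
Figure 10: Boxplot of math score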

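# Students at or above the 75th-percentile math score (77.0 for this dataset),
# counted per combination of attributes
x = df[df['math score'] >= 77.0].groupby(['gender', 'race/ethnicity', 'parental level of education',
    'lunch', 'test preparation course', 'math score']).size().reset_index()
x.rename(columns={0: 'Count'}, inplace=True)
x
x.info()
x.gender.value_counts()
# Median of every numeric column
print(df.median(numeric_only=True))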
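The top-25% selection can be sketched on a small hypothetical list of math scores. This example uses statistics.quantiles from the standard library; its "inclusive" method interpolates linearly between data points, which should agree with the default behaviour of np.percentile used in the report's code.

```python
from statistics import quantiles

# Hypothetical math scores, chosen only for illustration
math_scores = [45, 52, 58, 61, 66, 70, 74, 77, 83, 91, 95]

# With n=4 the cut points are Q1, Q2 (the median) and Q3
q1, q2, q3 = quantiles(math_scores, n=4, method="inclusive")

# Interquartile range: a dispersion measure, the height of the box in a boxplot
iqr = q3 - q1

# Students at or above the 75th percentile receive the bonus
bonus_students = [s for s in math_scores if s >= q3]
print(q3, iqr, bonus_students)
```

Because the comparison is "at or above" the cut-off, tied scores at the threshold can make the selected group slightly larger than exactly 25% of the students.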
Conclusion
In this Project-Based Learning exercise on the topic ‘Implement Measuring central tendency,
measuring dispersion of data in python with real time database data mining techniques’, we
studied measures of central tendency and dispersion using a student performance
dataset. The related concepts were understood and implemented in the Python programming
language.
Project Outcome
This Project-Based Learning addresses Course Outcome 4 (CO4): Apply the basic concept
of data mining and its functionality. The concepts were understood and implemented in Python.
Appendix
GitHub link:
saniagoyal/Measuring-Central-Tendency-Dispersion-with-a-Student_Performance-dataset (github.com)
