
Project-Based Learning Report

On
Implement Measuring central tendency, measuring dispersion of data in
python with real time database data mining techniques

Submitted in the partial fulfillment of the requirements


For Project-based learning in Artificial Intelligence and Data Mining

in
Electronics & Communication Engineering
By
2014111034 Tushar Chaubey
2014111044 Sania Goyal
2014111045 Swastik Gupta

Under the guidance of the Course In-charge

Prof. V.P. Kaduskar

Department of Electronics & Communication Engineering

Bharati Vidyapeeth
(Deemed to be University)
College of Engineering,
Pune – 411043

Academic Year: 2023-24

Bharati Vidyapeeth
(Deemed to be University)
College of Engineering,
Pune – 411043

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

CERTIFICATE

Certified that the Project Based Learning report entitled, “Implement Measuring central
tendency, measuring dispersion of data in python with real time database data mining
techniques” is work done by

2014111034 Tushar Chaubey
2014111044 Sania Goyal
2014111045 Swastik Gupta

in partial fulfillment of the requirements for the award of credits for Project Based Learning
(PBL) in Artificial Intelligence and Data Mining of Bachelor of Technology Semester VII, in
Electronics and Communication.

Date: 17/10/2023

Prof. V.P. Kaduskar, Course In-charge
Dr. Arundhati A. Shinde, Professor & Head, E.C.E

INDEX

Sr. No.   Title

1         Description of problem statement
2         Implement Measuring Central Tendency & Dispersion with a Student Performance dataset
3         Result of Performance Evaluation
4         Conclusion
5         Project Outcome
6         Appendix

Description of Problem Statement

Implement Measuring central tendency, measuring dispersion of data in python with real time
database data mining techniques.

Measuring Central Tendency:

Central tendency is a statistical measure that identifies a single value as representative of an
entire dataset. It gives you an idea about the center of the data. The most common measures of
central tendency are the mean, median, and mode:

1. Mean: The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with. Calculating the mean is simple: add up
all of the values and divide by the number of observations in your dataset.

The calculation of the mean incorporates all values in the data. If you change any value,
the mean changes. However, the mean doesn’t always locate the center of the data
accurately.

2. Median: The median is the middle value. It is the value that splits the dataset in half,
making it a natural measure of central tendency.
To find the median, order your data from smallest to largest, and then find the data point
that has an equal number of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd number of
values.

3. Mode: The mode is the value that occurs the most frequently in your data set, making it a
different type of measure of central tendency than the mean or median.
To find the mode, sort the values in your dataset by numeric values or by categories.
Then identify the value that occurs most often.
On a bar chart, the mode is the highest bar. If the data have multiple values that are tied
for occurring the most frequently, you have a multimodal distribution. If no value repeats,
the data do not have a mode.
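
The three definitions above can be checked on a small example. The following is a minimal sketch that uses Python's standard statistics module; the marks list is made-up illustrative data, not taken from the project dataset.

```python
from statistics import mean, median, multimode

# Hypothetical marks for seven students (illustrative values only).
marks = [52, 61, 61, 70, 74, 88, 98]

average = mean(marks)      # sum of the values divided by the number of values
middle = median(marks)     # 4th of the 7 sorted values
modes = multimode(marks)   # every value tied for the highest count

# With an even number of values, the median is the mean of the two middle values.
even_median = median([52, 61, 70, 74])

# When no value repeats, multimode returns all values: there is no single mode.
no_repeat = multimode([1, 2, 3])

print(average, middle, modes, even_median, no_repeat)
```

With pandas, the equivalent calls on a column would be Series.mean(), Series.median(), and Series.mode().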

Measuring Dispersion:
Dispersion refers to the spread or variability of a dataset. It indicates how much the values differ
from the central tendency. Common measures of dispersion include variance, standard deviation,
and range:

1. Variance: The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in the
set is from the mean (average), and thus from every other number in the set. Variance is
often denoted by the symbol σ². It is used by both analysts and traders to
determine volatility and market security.

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Here x̄ is the mean and n is the number of observations. The divisor n – 1 gives the
sample variance; dividing by n instead gives the population variance.

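2. Standard Deviation: The standard deviation is the positive square root of the arithmetic
mean of the squares of the deviations of the given values from their arithmetic mean. It is
denoted by the Greek letter sigma, σ. It is also referred to as the root mean square deviation.
The standard deviation is given as

$$\sigma = \left[\frac{\sum_i (y_i - \bar{y})^2}{n}\right]^{1/2} = \left[\frac{\sum_i y_i^2}{n} - \bar{y}^2\right]^{1/2}$$

For a grouped frequency distribution, it is

$$\sigma = \left[\frac{\sum_i f_i (y_i - \bar{y})^2}{N}\right]^{1/2} = \left[\frac{\sum_i f_i y_i^2}{N} - \bar{y}^2\right]^{1/2}$$

where f_i is the frequency of the value y_i and N = Σ f_i is the total number of observations.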

3. Range: Range is the difference between the maximum and minimum values in a dataset.

Range = Highest Value - Lowest Value
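
As a worked check of the dispersion formulas above, the sketch below computes the sample variance (divisor n – 1), the population standard deviation in both its definitional and shortcut forms, the grouped-frequency form, and the range. The observations are hypothetical, and only the standard library is used.

```python
from math import sqrt
from statistics import mean

# Hypothetical observations (illustrative values only).
values = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
n = len(values)
x_bar = mean(values)

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in values)

sample_var = sum_sq_dev / (n - 1)   # variance with divisor n - 1
pop_var = sum_sq_dev / n            # variance with divisor n
pop_std = sqrt(pop_var)             # root mean square deviation

# Shortcut form: mean of the squares minus the square of the mean.
shortcut_std = sqrt(sum(x * x for x in values) / n - x_bar ** 2)

# Grouped form: the same data as {value: frequency}.
grouped = {2: 2, 3: 1, 4: 1, 5: 2, 6: 1, 8: 2, 9: 1}
N = sum(grouped.values())
g_mean = sum(f * y for y, f in grouped.items()) / N
g_std = sqrt(sum(f * (y - g_mean) ** 2 for y, f in grouped.items()) / N)

data_range = max(values) - min(values)

print(sample_var, pop_var, pop_std, shortcut_std, g_std, data_range)
```

The definitional, shortcut, and grouped forms agree up to floating-point rounding. In numpy, np.var and np.std divide by n by default and take ddof=1 for the n – 1 divisor, whereas pandas Series.var and Series.std default to n – 1.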

Implementing Central Tendency and Dispersion with Real-time Database
Data Mining Techniques:
To implement these concepts using real-time database data mining techniques in Python, you
would typically follow these steps:

1. Connect to the Database: Use a library like sqlalchemy to establish a connection to the
database.

2. Retrieve Data: Execute SQL queries to retrieve the required data from the database and store
it in a pandas DataFrame.

3. Calculate Central Tendency: Use pandas and numpy functions to calculate mean, median, and
mode from the retrieved data.

4. Calculate Dispersion: Use numpy functions to calculate variance, standard deviation, and
range from the retrieved data.

5. Display or Use the Results: Print the calculated central tendency and dispersion values or
use them for further analysis or visualization.

Remember, the specific implementation details can vary based on the type of database you're
working with (e.g., MySQL, PostgreSQL, MongoDB) and the data retrieval methods supported
by the chosen database library. Additionally, handling real-time data might involve setting up a
streaming mechanism or periodic polling of the database to get the latest data for analysis.
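
The steps above can be sketched end to end. To stay self-contained, the example below swaps in the standard-library sqlite3 and statistics modules for sqlalchemy, pandas, and numpy, and uses an in-memory table as a stand-in for a live database. The table name scores and column name math_score are hypothetical. Calling summarize again after new rows arrive imitates periodic polling.

```python
import sqlite3
from statistics import mean, median, multimode, pstdev, pvariance


def summarize(conn):
    """Read the current rows and return central tendency and dispersion."""
    # 'scores' and 'math_score' are hypothetical table and column names.
    rows = conn.execute("SELECT math_score FROM scores").fetchall()
    data = [row[0] for row in rows]
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": multimode(data),
        "variance": pvariance(data),  # population form (divide by n)
        "std": pstdev(data),
        "range": max(data) - min(data),
    }


# Stand-in for a live database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (math_score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(60,), (70,), (70,), (80,)])

first = summarize(conn)   # first poll

conn.execute("INSERT INTO scores VALUES (100)")  # a new row arrives
second = summarize(conn)  # second poll reflects the updated data

conn.close()
print(first)
print(second)
```

With sqlalchemy and pandas, the retrieval step would typically be pandas.read_sql_query on a connection created from sqlalchemy.create_engine, after which the DataFrame methods can replace the statistics calls.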

Calculate and visualize Mean, Median, and Mode in Python.


The code below is a Python script that calculates the measures of central tendency
(mean, median, and mode) for a normally distributed dataset and for a skewed dataset (one that
is not symmetrical around its mean). The code uses the numpy, matplotlib, pandas, and seaborn
libraries, and histograms are used to visualize the mean, median, and mode. In the histogram
on the left, the mean, median, and mode are close to each other because the data is symmetrically
distributed around the center. In the histogram on the right, which represents a skewed
dataset, the mean is pulled noticeably toward the extreme values in the tail of the distribution,
while the median and mode are less affected.
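
As a rough illustration of this effect, the sketch below generates one symmetric and one right-skewed sample and compares their mean, median, and an approximate mode. It is a simplified stand-in that uses only the standard library (random, statistics) and omits the histograms; the distribution parameters, the sample size, and the binned_mode helper are assumptions made for this example.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# Symmetric data: normal distribution centered on 50.
symmetric = [random.gauss(50, 10) for _ in range(10_000)]
# Right-skewed data: exponential distribution with mean 10 and a long right tail.
skewed = [random.expovariate(1 / 10) for _ in range(10_000)]


def binned_mode(data, width=1.0):
    """Approximate the mode of continuous data as the center of the fullest bin."""
    counts = {}
    for x in data:
        bin_index = int(x // width)  # floor division picks the bin
        counts[bin_index] = counts.get(bin_index, 0) + 1
    fullest = max(counts, key=counts.get)
    return (fullest + 0.5) * width


for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, round(mean(data), 2), round(median(data), 2), binned_mode(data))
```

For the symmetric sample the mean and median nearly coincide. For the skewed sample the mean sits above the median, which in turn sits above the mode, because the long right tail pulls the mean upward.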

Code and Output:

 Measuring Central Tendency & Dispersion with a Student_Performance dataset:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df=pd.read_csv("StudentsPerformance.csv")
df.head()

Figure 1: StudentPerformance

df.info()

# Gender count
df['gender'].value_counts()
# Total marks = sum of the three subject scores
df['total marks'] = df['math score'] + df['reading score'] + df['writing score']
df.head(2)

df.groupby(['parental level of education','total marks']).size().reset_index()

# Compute the mean once and reuse it for the reference line instead of hard-coding it
mean_total = df['total marks'].mean()
print(mean_total)
plt.figure(figsize=(5,5))
sns.barplot(x='parental level of education', y='total marks', data=df)
plt.xticks(rotation=60)
plt.ylim(0,300)
plt.axhline(y=mean_total, color='r', linestyle='dashed', label="Mean value line")
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper center')
plt.show()
Figure 2: Total marks vs Parental Education

 Who scores the most on average for math, reading and writing based on
● Gender
● Test preparation course

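# Average only the numeric columns; recent pandas versions raise an error on text columns
df.mean(numeric_only=True)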

# Column means used for the reference lines (computed rather than hard-coded)
math_avg = df['math score'].mean()
reading_avg = df['reading score'].mean()
writing_avg = df['writing score'].mean()

plt.figure(figsize=(15,5))
plt.subplots_adjust(left=0.125, bottom=0.1, right=1.0, top=1.0, wspace=0.8, hspace=0.2)
# Panel 1: math scores by test preparation course and gender
plt.subplot(131)
plt.title('Math Scores')
sns.barplot(hue="gender", y="math score", x="test preparation course", data=df, palette='bright')
plt.axhline(y=math_avg, color='r', linestyle='dashed', label="avg math score")
plt.legend(bbox_to_anchor=(1.3, 1), loc='upper center')
# Panel 2: reading scores
plt.subplot(132)
plt.title('Reading Scores')
sns.barplot(hue="gender", y="reading score", x="test preparation course", data=df, palette='hls')
plt.axhline(y=reading_avg, color='g', linestyle='-', label="avg reading score")
plt.legend(bbox_to_anchor=(1.35, 1), loc='upper center')
# Panel 3: writing scores
plt.subplot(133)
plt.title('Writing Scores')
sns.barplot(hue="gender", y="writing score", x="test preparation course", data=df, palette='colorblind')
plt.axhline(y=writing_avg, color='b', linestyle=':', label="avg writing score")
plt.legend(bbox_to_anchor=(1.4, 1), loc='upper center')
plt.show()

Figure 3: Scores vs Test preparation


 Who scores the most on average for math, reading and writing based on
● Gender
● Test preparation course
# One figure per histogram; an extra plt.figure() call here would only produce an empty figure
sns.set_palette('bright')
plt.figure(figsize=(15,6))
sns.histplot(x='math score', data=df, kde=True, hue='gender')
plt.title('math scores of male and female')
plt.show()
sns.set_palette('hls')
plt.figure(figsize=(15,6))
sns.histplot(x='reading score', data=df, kde=True, hue='gender')
plt.title('reading scores of male and female')
plt.show()
sns.set_palette('colorblind')
plt.figure(figsize=(15,6))
sns.histplot(x='writing score', data=df, kde=True, hue='gender')
plt.title('writing scores of male and female')
plt.show()

Figure 4: No. of students vs Math score


Figure 5: No. of students vs Reading score

Figure 6: No. of students vs Writing score


# One figure per histogram; an extra plt.figure() call here would only produce an empty figure
sns.set_palette('bright')
plt.figure(figsize=(15,6))
sns.histplot(x='math score', data=df, kde=True, hue='test preparation course')
plt.title('math scores wrt test preparation course')
plt.show()

sns.set_palette('hls')
plt.figure(figsize=(15,6))
sns.histplot(x='reading score', data=df, kde=True, hue='test preparation course')
plt.title('reading scores wrt test preparation course')
plt.show()
sns.set_palette('colorblind')
plt.figure(figsize=(15,6))
sns.histplot(x='writing score', data=df, kde=True, hue='test preparation course')
plt.title('writing scores wrt test preparation course')
plt.show()

Figure 7: No. of students vs Math score


Figure 8: No. of students vs Reading score

Figure 9: No. of students vs Writing score


The management needs your help to give bonus points to the top 25% of students based on
their math score. How will you help the management achieve this?

df.describe()

np.percentile(df['math score'], 75)


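# Pass the column name and a single color; palette without hue is deprecated in recent seaborn
sns.boxplot(data=df, y='math score', color=sns.color_palette('colorblind')[0])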
plt.show()

Figure 10: Boxplot of math score


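# 77.0 is the 75th-percentile math score from the np.percentile call above
x = df[df['math score'] >= 77.0].groupby(
    ['gender', 'race/ethnicity', 'parental level of education',
     'lunch', 'test preparation course', 'math score']).size().reset_index()
# The size() column is unnamed (0); give it a readable name
x.rename(columns={0: 'Count'}, inplace=True)
x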

x.info()

x.gender.value_counts()
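
The selection logic used above (find the 75th percentile, then keep every score at or above it) can be illustrated on a small list. This is a minimal sketch with hypothetical scores. It uses statistics.quantiles with method="inclusive", which interpolates linearly between data points in the same way as the default behaviour of np.percentile.

```python
from statistics import quantiles

# Hypothetical math scores (illustrative values only).
math_scores = [45, 52, 58, 63, 67, 71, 77, 82, 90, 96]

# Quartile cut points: q1 = 25th, q2 = 50th, q3 = 75th percentile.
q1, q2, q3 = quantiles(math_scores, n=4, method="inclusive")

# Students at or above the 75th percentile receive the bonus.
top_quarter = [score for score in math_scores if score >= q3]

print(q3, top_quarter)
```

Because the filter uses >=, tied scores at the threshold are all included, so the selected group can be slightly larger than exactly 25% of the students.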

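# Median of the numeric columns only; recent pandas versions raise an error on text columns
print(df.median(numeric_only=True))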
Conclusion

In this Project-Based Learning exercise on the topic ‘Implement Measuring central tendency,
measuring dispersion of data in python with real time database data mining techniques’, we
studied measures of central tendency and dispersion using a student performance dataset.
The related concepts were understood and implemented in the Python programming language.

Project Outcome

This Project-Based Learning exercise addresses Course Outcome 4 (CO4): Apply the basic concept
of data mining and its functionality. The concepts were understood and implemented using Python.


Appendix
Github link-
saniagoyal/Measuring-Central-Tendency-Dispersion-with-a-Student_Performance-dataset.
(github.com)
