PBL Report Aidm
On
Implement Measuring central tendency, measuring dispersion of data in
python with real time database data mining techniques
in
Electronics & Communication Engineering
By
2014111034 Tushar Chaubey
2014111044 Sania Goyal
2014111045 Swastik Gupta
Bharati Vidyapeeth
(Deemed to be University)
College of Engineering,
Pune – 411043
CERTIFICATE
Certified that the Project Based Learning report entitled, “Implement Measuring central
tendency, measuring dispersion of data in python with real time database data mining
techniques” is work done by
in partial fulfillment of the requirements for the award of credits for Project Based Learning
(PBL) in Artificial Intelligence and Data Mining of Bachelor of Technology Semester VII, in
Electronics and Communication.
Date: 17/10/2023
INDEX
4 Conclusion
5 Project Outcome
6 Appendix
Description of Problem Statement
Implement Measuring central tendency, measuring dispersion of data in python with real time
database data mining techniques.
1. Mean: The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with. To calculate it, add up all of the values and
divide by the number of observations in your dataset.
The calculation of the mean incorporates all values in the data, so if you change any value,
the mean changes. However, the mean doesn’t always locate the center of the data
accurately: in a skewed distribution, or when outliers are present, extreme values pull it
away from where most observations lie.
2. Median: The median is the middle value. It is the value that splits the dataset in half,
making it a natural measure of central tendency.
To find the median, order your data from smallest to largest, and then find the data point
that has an equal number of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd number of
values.
3. Mode: The mode is the value that occurs the most frequently in your data set, making it a
different type of measure of central tendency than the mean or median.
To find the mode, sort the values in your dataset by numeric values or by categories.
Then identify the value that occurs most often.
On a bar chart, the mode is the highest bar. If the data have multiple values that are tied
for occurring the most frequently, you have a multimodal distribution. If no value repeats,
the data do not have a mode.
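The three measures above can be illustrated with a short sketch. It uses Python's standard-library statistics module rather than pandas or numpy, and the score lists are hypothetical values made up for illustration, not data from the project.

```python
from statistics import mean, median, multimode

# Hypothetical exam scores, made up for illustration only.
odd_scores = [72, 69, 90, 47, 76]
even_scores = [72, 69, 90, 47, 76, 71]

# Mean: sum of the values divided by the number of observations.
print(mean(odd_scores))             # 70.8

# Median with an odd count: the single middle value after sorting.
print(median(odd_scores))           # 72

# Median with an even count: the average of the two middle values.
print(median(even_scores))          # 71.5

# multimode returns every value tied for the highest frequency,
# so a multimodal dataset yields more than one entry.
print(multimode([1, 2, 2, 3, 3]))   # [2, 3]

# When no value repeats, every value is tied at a count of one, so
# multimode returns them all; the text above treats this as "no mode".
print(multimode([1, 2, 3]))         # [1, 2, 3]
```

Checking whether the list returned by multimode is as long as the number of distinct values is one simple way to detect the "no mode" case.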
Measuring Dispersion:
Dispersion refers to the spread or variability of a dataset. It indicates how much the values differ
from the central tendency. Common measures of dispersion include variance, standard deviation,
and range:
1. Variance: The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in the
set is from the mean (average), and thus from every other number in the set. Variance is
often denoted by the symbol σ². It is used by both analysts and traders to
determine volatility and market security. For a sample of n observations x_i with mean x̄,
it is computed as
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
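2. Standard Deviation: A standard deviation is the positive square root of the arithmetic
mean of the squares of the deviations of the given values from their arithmetic mean
(when it is estimated from a sample, the divisor n – 1 replaces n, as in the variance
formula). It is denoted by the Greek letter sigma, σ, and is also referred to as the root
mean square deviation. The standard deviation is the positive square root of the variance:
$$\sigma = \sqrt{\sigma^2}$$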
3. Range: Range is the difference between the maximum and minimum values in a dataset.
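The dispersion measures can be checked by hand against library functions. The following is a minimal sketch using the standard-library statistics module; the list scores is a hypothetical example, not project data. It also shows the difference between the sample form (divisor n – 1) and the population form (divisor n).

```python
from statistics import mean, variance, pvariance, stdev, pstdev

# Hypothetical values, made up for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(scores)
n = len(scores)

# Sample variance written out from the formula: squared deviations
# from the mean, summed and divided by (n - 1).
manual_variance = sum((x - x_bar) ** 2 for x in scores) / (n - 1)

# statistics.variance and statistics.stdev use the same (n - 1) divisor.
sample_variance = variance(scores)
sample_std = stdev(scores)

# statistics.pvariance and statistics.pstdev divide by n instead.
population_variance = pvariance(scores)
population_std = pstdev(scores)

# Range: maximum minus minimum.
data_range = max(scores) - min(scores)

print("sample variance:", sample_variance, "manual:", manual_variance)
print("sample std dev:", sample_std)
print("population variance:", population_variance)
print("population std dev:", population_std)
print("range:", data_range)
```

The two divisors give noticeably different results on small datasets, so it is worth stating which form a reported variance or standard deviation uses.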
Implementing Central Tendency and Dispersion with Real-time Database
Data Mining Techniques:
To implement these concepts using real-time database data mining techniques in Python, you
would typically follow these steps:
1. Connect to the Database: Use a library like sqlalchemy to establish a connection to the
database.
2. Retrieve Data: Execute SQL queries to retrieve the required data from the database and
store it in a pandas DataFrame.
3. Calculate Central Tendency: Use pandas and numpy functions to calculate mean, median, and
mode from the retrieved data.
4. Calculate Dispersion: Use numpy functions to calculate variance, standard deviation, and
range (maximum minus minimum) from the retrieved data. Note that np.var and np.std divide
by n by default; pass ddof=1 to obtain the sample form with the n – 1 divisor.
5. Display or Use the Results: Print the calculated central tendency and dispersion values or
use them for further analysis or visualization.
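The steps above can be sketched end to end. This is an illustration under stated assumptions, not the project's implementation: it opens an in-memory SQLite database with the standard-library sqlite3 module in place of sqlalchemy, uses the statistics module in place of pandas and numpy, and works on a hypothetical table named scores filled with made-up rows.

```python
import sqlite3
from statistics import mean, median, multimode, variance, stdev

# Step 1: connect. An in-memory SQLite database stands in for the real
# database (assumption); the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INTEGER, math_score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 72), (2, 69), (3, 90), (4, 47), (5, 76), (6, 72)],
)
conn.commit()

# Step 2: retrieve the column of interest with a SQL query.
rows = conn.execute("SELECT math_score FROM scores").fetchall()
values = [row[0] for row in rows]

# Steps 3 and 4: central tendency and dispersion (sample form, n - 1).
summary = {
    "mean": mean(values),
    "median": median(values),
    "mode": multimode(values),
    "variance": variance(values),
    "std_dev": stdev(values),
    "range": max(values) - min(values),
}

# Step 5: display the results.
for name, value in summary.items():
    print(f"{name}: {value}")

conn.close()
```

With pandas available, the retrieval step would more commonly load the query result into a DataFrame, after which the same statistics are available as column methods.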
Remember, the specific implementation details can vary based on the type of database you're
working with (e.g., MySQL, PostgreSQL, MongoDB) and the data retrieval methods supported
by the chosen database library. Additionally, handling real-time data might involve setting up a
streaming mechanism or periodic polling of the database to get the latest data for analysis.
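Periodic polling, as mentioned above, can be sketched as a loop that fetches only the rows added since the previous poll and refreshes the statistics. The table name readings, the helper fetch_new_rows, and the simulated inserts are all hypothetical; a real deployment would poll an external database on a longer interval.

```python
import sqlite3
import time
from statistics import mean

def fetch_new_rows(conn, last_id):
    """Return (id, value) rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, value FROM readings WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()

# Hypothetical in-memory table standing in for a live data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")

values = []    # every value seen so far
last_id = 0    # highest row id already processed

for incoming in [10.0, 20.0, 30.0]:
    # Simulate a new record arriving between two polls.
    conn.execute("INSERT INTO readings (value) VALUES (?)", (incoming,))

    # Poll: pick up only the rows that were not seen before.
    new_rows = fetch_new_rows(conn, last_id)
    if new_rows:
        last_id = new_rows[-1][0]
        values.extend(value for _, value in new_rows)
        print(f"n={len(values)} mean={mean(values)}")

    time.sleep(0.01)  # short polling interval, for the demo only

conn.close()
```

Tracking the last processed id keeps each poll cheap, at the cost of missing updates to rows that were already read.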
distributed around the center. However, in the histogram on the right, which represents a skewed
dataset, the mean is significantly influenced by extreme values in the tails of the distribution,
while the median and mode are less affected.
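The effect of extreme values on the mean can be reproduced numerically. The sketch below uses two small hypothetical lists, one symmetric and one with a single large value appended to create a long right tail.

```python
from statistics import mean, median, multimode

# Hypothetical data: a symmetric list, and the same list with one
# extreme value added to skew it to the right.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = symmetric + [60]

for label, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(label, "mean:", mean(data),
          "median:", median(data),
          "mode:", multimode(data))

# The mean moves from 3 to 8.7 because of the single extreme value,
# while the median and mode both stay at 3.
```

This is why the median is often preferred as the summary of a skewed dataset.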
Code and Output:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df=pd.read_csv("StudentsPerformance.csv")
df.head()
Figure 1: StudentPerformance
df.info()
Gender Count
df.gender.value_counts()
# Total marks: sum of the three subject scores for each student
df['total marks'] = df['math score'] + df['reading score'] + df['writing score']
df.head(2)
mean_total = df['total marks'].mean()
mean_total
plt.figure(figsize=(5,5))
sns.barplot(x=df['parental level of education'],y=df['total marks'])
plt.xticks(rotation=60)
plt.ylim(0,300)
# Draw the computed mean (203.312 in the original run) instead of a hard-coded value
plt.axhline(y = mean_total, color = 'r', linestyle = 'dashed', label = "Mean value line")
plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper center')
plt.show()
Figure 2: Total marks vs Parental Education
Who scores the most on average for math, reading and writing based on
● Gender
● Test preparation course
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)
df.mean(numeric_only=True)
plt.figure(figsize=(15,5))
plt.subplots_adjust(left=0.125, bottom=0.1, right=1.0, top=1.0,wspace=0.8, hspace=0.2)
plt.subplot(131)
plt.title('Math Scores')
sns.barplot(hue="gender", y="math score", x="test preparation course", data=df,palette='bright')
plt.axhline(y = 66.067, color = 'r', linestyle = 'dashed', label = "avg math score")
plt.legend(bbox_to_anchor = (1.3, 1), loc = 'upper center')
plt.subplot(132)
plt.title('Reading Scores')
sns.barplot (hue="gender", y="reading score", x="test preparation course", data=df,palette='hls')
plt.axhline(y = 69.117, color = 'g', linestyle = '-', label = "avg reading score")
plt.legend(bbox_to_anchor = (1.35, 1), loc = 'upper center')
plt.subplot(133)
plt.title('Writing Scores')
sns.barplot(hue="gender", y="writing score", x="test preparation course", data=df, palette='colorblind')
plt.axhline(y = 67.997, color = 'b', linestyle = ':', label = "avg writing score")
plt.legend(bbox_to_anchor = (1.4, 1), loc = 'upper center')
plt.show()
sns.set_palette('hls')
plt.figure(figsize=(15,6))
sns.histplot(x='reading score', data=df, kde=True, hue='test preparation course')
plt.title('reading scores wrt test preparation course')
plt.show()
sns.set_palette('colorblind')
plt.figure(figsize=(15,6))
sns.histplot(x='writing score', data=df, kde=True, hue='test preparation course')
plt.title('writing scores wrt test preparation course')
plt.show()
df.describe()
# x is not defined anywhere in this listing; df is assumed here
df.info()
df.gender.value_counts()
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)
print(df.median(numeric_only=True))
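The question raised earlier, who scores the most on average by gender and by test preparation course, can also be answered numerically, together with a dispersion measure for each group. The sketch below is a standard-library illustration only: the rows are hypothetical records that reuse the dataset's column names, and summarize_by is a helper name introduced here, not part of the project code.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records reusing the dataset's column names; the values
# are made up and do not come from StudentsPerformance.csv.
rows = [
    {"gender": "female", "test preparation course": "none", "math score": 72},
    {"gender": "female", "test preparation course": "completed", "math score": 69},
    {"gender": "male", "test preparation course": "none", "math score": 47},
    {"gender": "male", "test preparation course": "completed", "math score": 76},
    {"gender": "female", "test preparation course": "none", "math score": 90},
    {"gender": "male", "test preparation course": "none", "math score": 71},
]

def summarize_by(records, key, value="math score"):
    """Group records by `key`; return {group: (mean, sample std dev)}."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    # stdev needs at least two values per group.
    return {group: (mean(vals), stdev(vals)) for group, vals in groups.items()}

by_gender = summarize_by(rows, "gender")
by_course = summarize_by(rows, "test preparation course")

for group, (avg, spread) in by_gender.items():
    print(f"gender={group}: mean={avg:.2f} std={spread:.2f}")
for group, (avg, spread) in by_course.items():
    print(f"course={group}: mean={avg:.2f} std={spread:.2f}")
```

On the real DataFrame, a pandas groupby on the same columns would be the usual way to obtain the equivalent per-group summary.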
Conclusion
In this Project-Based Learning exercise on the topic ‘Implement Measuring central tendency,
measuring dispersion of data in python with real time database data mining techniques’, we
studied measures of central tendency and dispersion using a student performance dataset.
The related concepts were understood and implemented in the Python programming language.
Project Outcome
This Project-Based Learning exercise falls under Course Outcome 4 (CO4): Apply the basic
concept of data mining and its functionality. The concepts were understood and implemented
using Python.
Appendix
GitHub link:
saniagoyal/Measuring-Central-Tendency-Dispersion-with-a-Student_Performance-dataset
(github.com)