Mini Project Report On

Exploratory Data Analysis using Python

Submitted in partial fulfillment of the requirement for
award of the degree of
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering
(Artificial Intelligence)
2023-24
By
Riya Gupta (2200681520081), 3rd sem
Raziya (2200681520080), 3rd sem
Under the guidance of
Mr. Aamir Sohail
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI)
MEERUT INSTITUTE OF ENGINEERING & TECHNOLOGY, MEERUT
AFFILIATED TO
DR. A. P. J. ABDUL KALAM TECHNICAL UNIVERSITY
LUCKNOW
JAN 2024
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
CHAPTER 1- INTRODUCTION
CHAPTER 2- WORKFLOW OF PROJECT
CHAPTER 3- TECHNOLOGY USED
CHAPTER 4- DATA DESCRIPTION
CHAPTER 5- PROJECT DESCRIPTION
APPENDICES- IMPLEMENTATION CODE
REFERENCES
DECLARATION
We hereby declare that the project titled “EXPLORATORY
DATA ANALYSIS USING PYTHON”, which is being submitted
as a Mini Project in the Department of Computer Science and
Engineering (Artificial Intelligence) to Meerut Institute of
Engineering and Technology, Meerut (U.P.), is an authentic record
of our genuine work done under the guidance of Mr. Aamir
Sohail of CSE (AI), Meerut Institute of Engineering and
Technology, Meerut.
Date: 15 JAN 2024
Place: MIET, Meerut
Name of Student:
Raziya (2200681520080)
Riya Gupta (2200681520081)
CERTIFICATE
This is to certify that the mini project report titled
“EXPLORATORY DATA ANALYSIS USING PYTHON”,
submitted by Riya Gupta (2200681520081) and Raziya
(2200681520080), has been carried out under the guidance
of Mr. Aamir Sohail of CSE (AI), Meerut Institute of
Engineering and Technology, Meerut. This project report is
approved for the Mini Project in the 3rd semester in CSE (AI) from
Meerut Institute of Engineering and Technology, Meerut.
Supervisor: Mr. Aamir Sohail
Date: 15 JAN 2024

ACKNOWLEDGEMENT
We express our sincere gratitude to our guide, Mr. Aamir
Sohail of CSE (AI), Meerut Institute of Engineering and
Technology, Meerut, for his valuable suggestions, guidance, and
supervision throughout the project work. We would also like to
thank our Head of Department, Dr. Rambir Singh of CSE
(AI/AI&ML), for his expert advice from time to time. We owe
sincere thanks to all the faculty members of our department for
their kind encouragement.
Date: 15 JAN 2024
Place: MIET, Meerut
Name of Student:
Riya Gupta (2200681520081)
Raziya (2200681520080)

ABSTRACT
Exploratory Data Analysis (EDA) is an essential initial step in
the data analysis process. This project carries it out with Python
libraries such as Pandas, Matplotlib, Seaborn, and NumPy. After the
dataset is loaded with Pandas, the initial exploration involves examining
the first few rows, obtaining information on data types and
non-null counts, and generating descriptive statistics.
Univariate analysis employs histograms and box plots to
visualize the distribution and variability of individual
variables. Bivariate analysis utilizes scatter plots and pair plots
to explore relationships between two variables. For categorical
data, count plots and pie charts offer insights into the
distribution of different categories. Correlation analysis is
facilitated through heatmaps, revealing the strength and
direction of relationships between numerical variables. EDA
serves to unveil patterns, identify outliers, and guide
subsequent analysis, providing a comprehensive
understanding of the dataset's characteristics through a
combination of statistical summaries and visually intuitive
representations.
INTRODUCTION
• Exploratory Data Analysis (EDA) with Python: we use a real-world dataset of
student score statistics obtained via web scraping and perform various data
cleaning and analysis tasks to gain insights into the dataset. This covers
important EDA concepts such as summarizing the data using summary
statistics, counting unique values in a column, and grouping data by a specific
column.
• EDA is crucial for understanding data: analyze structure, patterns, and
relationships to inform further analysis and decision-making.
• Powerful Python libraries: Pandas, NumPy, Matplotlib, Seaborn, and Plotly
simplify data manipulation, visualization, and analysis.
• Load data (CSV, etc.).
• Explore basic information: missing values, data types, summary statistics.
• Analyse individual variables (univariate): distributions, central tendency,
outliers.
• Analyse relationships between pairs of variables (bivariate): scatter plots,
correlation matrix.
• Explore relationships between multiple variables (multivariate): pair plots,
dimensionality reduction.
• Clean and pre-process data: handle missing values, transform data, handle
outliers.
• Engineer new features from existing ones.
• Effectively visualize insights to communicate findings.
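The summarizing, counting, and grouping steps mentioned in the first point can be sketched as follows. This is a minimal illustration on a small invented table, not the project's dataset; only the column names are modelled on those used in the implementation code.

```python
import pandas as pd

# Tiny invented sample; the values are made up for illustration.
sample = pd.DataFrame({
    "Gender": ["female", "male", "female", "male", "female"],
    "TestPrep": ["none", "completed", "none", "none", "completed"],
    "MathScore": [71, 69, 87, 45, 76],
})

# Summary statistics (count, mean, std, quartiles, ...) for a numerical column.
summary = sample["MathScore"].describe()

# Number of distinct values in a column and how often each one occurs.
n_unique = sample["Gender"].nunique()
gender_counts = sample["Gender"].value_counts()

# Mean score for each group of a categorical column.
mean_by_prep = sample.groupby("TestPrep")["MathScore"].mean()

print(summary)
print(n_unique)
print(gender_counts)
print(mean_by_prep)
```

The same three calls apply unchanged to a full DataFrame loaded with pd.read_csv.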

WORKFLOW OF PROJECT
TECHNOLOGY USED
1. Programming Languages:
• Python: Widely adopted for its simplicity and extensive ecosystem of
libraries, Python is a go-to language for EDA. Pandas offers data
structures such as DataFrames for manipulation, while Matplotlib, Seaborn,
and Plotly provide versatile visualization options.
2. Data Manipulation Libraries:
• Pandas: A powerful library for data manipulation and analysis. It supports
tasks such as cleaning data, handling missing values, and reshaping
datasets.
• NumPy: Fundamental for numerical operations in Python, NumPy
supports efficient array operations and mathematical functions.
3. Data Visualization Libraries:
• Matplotlib: A foundational library for static plotting, Matplotlib
provides a fine level of control over visualizations.
• Seaborn: Built on Matplotlib, Seaborn simplifies the creation of
attractive statistical graphics with fewer lines of code.
4. Notebook Environments:
• Jupyter Notebooks: These interactive notebooks allow combining code,
visualizations, and narrative text, making them popular for data
exploration.
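As a small illustration of how the two data manipulation libraries divide the work, the sketch below uses NumPy for an array-wide calculation and Pandas to hold the result in a labelled table. The scores are invented, and the AverageScore column is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Invented scores for three students (rows) in three subjects (columns).
scores = np.array([[71, 71, 74],
                   [69, 90, 88],
                   [87, 93, 91]])

# NumPy computes the mean of every row at once, without an explicit loop.
row_means = scores.mean(axis=1)

# Pandas adds column labels and keeps the derived column next to the data.
table = pd.DataFrame(scores, columns=["MathScore", "ReadingScore", "WritingScore"])
table["AverageScore"] = row_means  # hypothetical derived column

print(table)
```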
DATA DESCRIPTION
• The data is multivariate in nature because it has several columns.
• It has both categorical and numerical features (attributes).
• It has 15 columns.
• Column names:
1. Serial no.
2. Gender
3. Ethnic Group
4. Parent Edu
5. Lunch Type
6. Test Prep
7. Parent Marital Status
8. Practice Sport
9. Is First Child
10. Nr Siblings
11. Transport Means
12. Wkly Study Hours
13. Math Score
14. Reading Score
15. Writing Score
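Separating the columns into categorical and numerical features, as described above, can be sketched with select_dtypes. The sample rows are invented, and the exact column spellings (LunchType, NrSiblings) are assumptions modelled on the names used in the implementation code.

```python
import pandas as pd

# Invented rows; column spellings are assumed, not taken from the real file.
sample = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "LunchType": ["standard", "free/reduced", "standard"],
    "NrSiblings": [3.0, 0.0, 4.0],
    "MathScore": [71, 69, 87],
})

# Columns holding numbers are treated as numerical features.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (here: text columns) is treated as categorical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```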

PROJECT DESCRIPTION
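➢ Import Python libraries:
➢ Reading Data Set:
➢ Analysing the Data:
df.head() displays the first 5 observations of the dataset.
df.info() lists the data type and non-null count of each column, which shows
that some variables have missing values.
df.isnull().sum() calculates the number of missing values in each column.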
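The missing value calculation can be extended to report percentages as well as counts. The sketch below is a minimal example on an invented table; the report layout and its column names (missing, percent) are illustrative choices, not part of the project.

```python
import pandas as pd

# Invented sample with a few missing entries (None becomes a missing value).
sample = pd.DataFrame({
    "EthnicGroup": ["group A", None, "group C", "group B"],
    "TestPrep": ["none", None, None, "completed"],
    "MathScore": [71, 69, 87, 45],
})

# Number of missing values in each column.
missing_count = sample.isnull().sum()

# Share of missing values in each column, as a percentage.
missing_pct = (sample.isnull().mean() * 100).round(1)

# Combine both into one small report table.
report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```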
➢ Data Reduction:
Some columns or variables can be dropped if they do not add value to our analysis.
Here the unnamed index column ("Unnamed: 0") is dropped.
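Dropping a column that adds no value can be sketched as below. The sample table is invented; "Unnamed: 0" is the column name that the implementation code drops, and errors="ignore" is an optional safeguard added here for illustration.

```python
import pandas as pd

# Invented sample that includes a leftover index column from the CSV file.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Gender": ["female", "male", "female"],
    "MathScore": [71, 69, 87],
})

# drop() returns a new DataFrame; errors="ignore" avoids a KeyError
# if the column has already been removed (e.g. when a notebook cell is re-run).
reduced = sample.drop(columns=["Unnamed: 0"], errors="ignore")

print(reduced.columns.tolist())
```

Because drop() does not modify the original table, the result has to be assigned back (as in df = df.drop(...)) for the change to persist.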
➢ EDA Univariate Analysis:
• Univariate analysis can be done for both categorical and numerical variables.
• Categorical variables can be visualized using a count plot, bar chart, pie plot, etc.
• Numerical variables can be visualized using a histogram or box plot.
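The two kinds of univariate plot can be sketched side by side. This example uses plain Matplotlib on invented data rather than the Seaborn calls in the implementation code, and it writes the figure to a hypothetical file name (univariate_sketch.png) so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data.
sample = pd.DataFrame({
    "Gender": ["female", "male", "female", "female", "male", "female"],
    "MathScore": [71, 69, 87, 45, 76, 73],
})

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(8, 4))

# Categorical variable: one bar per category, height = number of rows.
gender_counts = sample["Gender"].value_counts()
ax_cat.bar(gender_counts.index.tolist(), gender_counts.values)
ax_cat.set_title("Gender (bar chart)")

# Numerical variable: histogram of the score distribution.
ax_num.hist(sample["MathScore"], bins=5)
ax_num.set_title("MathScore (histogram)")

fig.tight_layout()
fig.savefig("univariate_sketch.png")  # hypothetical output file name
```

In a Jupyter notebook, plt.show() would display the figure instead of saving it.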
COUNT PLOT:
The count plot shows that the dataset contains more female students than male students.
BOXPLOT:
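A box plot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the default whisker rule in Matplotlib and Seaborn. The sketch below applies that rule by hand to a small invented series of scores.

```python
import pandas as pd

# Invented scores with one very low and one very high value.
scores = pd.Series([8, 55, 60, 62, 65, 68, 70, 72, 75, 100], name="MathScore")

# First and third quartiles and the interquartile range.
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```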
PIE PLOT:
➢ EDA Multivariate Analysis:
Multivariate analysis looks at more than two variables at once. It is
one of the most useful methods for determining relationships and analyzing
patterns in a dataset.
A heat map of a correlation matrix shows whether each pair of variables has a
positive or negative correlation. In this project, the heat map displays the
mean scores for each level of parent education.
From this chart we conclude that the parents' education has a
positive impact on the students' scores.
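A correlation heat map of the three score columns can be sketched as follows. The data is invented, and the drawing uses plain Matplotlib (imshow) rather than Seaborn, so the styling is an assumption; the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Invented scores for five students.
sample = pd.DataFrame({
    "MathScore": [71, 69, 87, 45, 76],
    "ReadingScore": [71, 90, 93, 56, 78],
    "WritingScore": [74, 88, 91, 42, 75],
})

# Pearson correlation between every pair of numerical columns.
corr = sample.corr()

fig, ax = plt.subplots(figsize=(4, 4))
# Fix the colour scale to [-1, 1] so positive and negative values are comparable.
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient into its cell.
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax)
fig.tight_layout()
fig.savefig("correlation_sketch.png")  # hypothetical output file name
```

With Seaborn, sns.heatmap(corr, annot=True) draws a comparable annotated chart in a single call.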
IMPLEMENTATION CODE
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset and take a first look at it.
    df = pd.read_csv("student_scores.csv")
    print(df.head())
    print(df.describe())
    df.info()
    print(df.isnull().sum())

    # Drop the unnamed index column, which adds nothing to the analysis.
    df = df.drop("Unnamed: 0", axis=1)
    print(df.head())

    # Count plot: gender distribution.
    plt.figure(figsize=(5, 5))
    ax = sns.countplot(data=df, x="Gender")
    ax.bar_label(ax.containers[0])
    plt.title("Gender Distribution")
    plt.show()

    # Heat map: mean scores for each level of parent education.
    score_means = {"MathScore": "mean", "ReadingScore": "mean", "WritingScore": "mean"}
    gb = df.groupby("ParentEduc").agg(score_means)
    print(gb)
    sns.heatmap(gb, annot=True)
    plt.title("Relationship between parents' education and students' scores")
    plt.show()

    # Heat map: mean scores for each parent marital status.
    gb1 = df.groupby("ParentMaritalStatus").agg(score_means)
    print(gb1)
    sns.heatmap(gb1, annot=True)
    plt.title("Relationship between parents' marital status and students' scores")
    plt.show()

    # Box plots: spread and outliers of the math and reading scores.
    sns.boxplot(data=df, x="MathScore")
    plt.show()
    sns.boxplot(data=df, x="ReadingScore")
    plt.show()

    # Pie chart: share of each ethnic group (missing values are not counted).
    print(df["EthnicGroup"].unique())
    group_counts = df["EthnicGroup"].value_counts().sort_index()
    plt.pie(group_counts, labels=group_counts.index, autopct="%1.2f%%")
    plt.title("Distribution of ethnic groups")
    plt.show()

    # Count plot: number of students in each ethnic group.
    ax = sns.countplot(data=df, x="EthnicGroup")
    ax.bar_label(ax.containers[0])
    plt.show()
REFERENCES
1. https://www.kaggle.com/
2. https://www.geeksforgeeks.org/
3. https://towardsdatascience.com/exploratory-data-analysis-eda-python
