
Mini Project Report on

Exploratory Data Analysis using Python


Submitted in partial fulfillment of the requirement for
award of the degree of

BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering
(Artificial Intelligence)
2023-24
By
Riya Gupta (2200681520081), 3rd sem
Raziya (2200681520080), 3rd sem

Under the guidance of

Mr. Aamir Sohail

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


(AI)

MEERUT INSTITUTE OF ENGINEERING & TECHNOLOGY,


MEERUT
AFFILIATED TO
DR. A. P. J. ABDUL KALAM TECHNICAL UNIVERSITY
LUCKNOW

JAN 2024
TABLE OF CONTENTS

DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT

CHAPTER 1- INTRODUCTION
CHAPTER 2- WORKFLOW OF PROJECT
CHAPTER 3- TECHNOLOGY USED
CHAPTER 4- DATA DESCRIPTION
CHAPTER 5- PROJECT DESCRIPTION
APPENDICES- IMPLEMENTATION CODE
REFERENCES
DECLARATION

We hereby declare that the project titled “EXPLORATORY DATA
ANALYSIS USING PYTHON”, which is being submitted as a Mini
Project in the Department of Computer Science and Engineering
(Artificial Intelligence) to Meerut Institute of Engineering and
Technology, Meerut (U.P.), is an authentic record of our genuine
work done under the guidance of “Mr. Aamir Sohail” of “CSE (AI)”,
Meerut Institute of Engineering and Technology, Meerut.

Date: 15 JAN 2024
Place: MIET, Meerut

Name of Student:
Raziya (2200681520080)
Riya Gupta (2200681520081)
CERTIFICATE

This is to certify that the mini project report titled
“EXPLORATORY DATA ANALYSIS USING PYTHON”, submitted by
“Riya Gupta (2200681520081), Raziya (2200681520080)”, has been
carried out under the guidance of “Mr. Aamir Sohail” of “CSE (AI)”,
Meerut Institute of Engineering and Technology, Meerut. This
project report is approved for Mini Project in the 3rd semester in
CSE (AI) from Meerut Institute of Engineering and Technology,
Meerut.

Supervisor: Mr. Aamir Sohail

Date: 15 JAN 2024


ACKNOWLEDGEMENT

We express our sincere gratitude to our guide “Mr. Aamir Sohail”
of “CSE (AI)”, Meerut Institute of Engineering and Technology,
Meerut, for his valuable suggestions, guidance and supervision
throughout the project work. We would also like to thank our Head
of Department “Dr. Rambir Singh” of “CSE (AI/AI&ML)” for his
expert advice from time to time. We owe sincere thanks to all the
faculty members of our department for their kind encouragement.

Date: 15 JAN 2024
Place: MIET, Meerut

Name of Student:
Riya Gupta (2200681520081)
Raziya (2200681520080)


ABSTRACT
Exploratory Data Analysis (EDA) is an essential initial step in
the data analysis process. This project carries it out with Python
libraries such as Pandas, Matplotlib, Seaborn, and NumPy. After
the dataset is loaded with Pandas, the initial exploration
involves examining the first few rows, obtaining information on
data types and non-null counts, and generating descriptive
statistics.
Univariate analysis employs histograms and box plots to
visualize the distribution and variability of individual
variables. Bivariate analysis utilizes scatter plots and pair plots
to explore relationships between two variables. For categorical
data, count plots and pie charts offer insights into the
distribution of different categories. Correlation analysis is
facilitated through heatmaps, revealing the strength and
direction of relationships between numerical variables. EDA
serves to unveil patterns, identify outliers, and guide
subsequent analysis, providing a comprehensive
understanding of the dataset's characteristics through a
combination of statistical summaries and visually intuitive
representations.
INTRODUCTION
• Exploratory Data Analysis (EDA) with Python: we use a real-world dataset of
student scores obtained via web scraping and perform various data cleaning
and analysis tasks to gain insights into the dataset. The project covers
important EDA concepts such as summarizing the data using summary
statistics, counting unique values in a column, and grouping data by a specific
column.

• EDA is crucial for understanding data: analyze structure, patterns, and
relationships to inform further analysis and decision-making.

• Powerful Python libraries: Pandas, NumPy, Matplotlib, Seaborn, and Plotly
simplify data manipulation, visualization, and analysis.

• Load data (CSV, etc.)

• Explore basic information: missing values, data types, summary statistics.

• Analyse individual variables (Univariate): distributions, central tendency,
outliers.

• Analyse relationships between pairs of variables (Bivariate): scatter plots,
correlation matrix.

• Explore relationships between multiple variables (Multivariate): pair plots,
dimensionality reduction.

• Clean and pre-process data: handle missing values, transform data, handle
outliers.

• Engineer new features from existing ones.

• Effectively visualize insights to communicate findings.
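
Some of the steps listed above (counting unique values, grouping by a column, and engineering a new feature) can be sketched as follows. This is only a minimal illustration: the small DataFrame is made-up toy data, not the project's dataset, and `AverageScore` is a hypothetical column name introduced for this example. The other column names follow the implementation code in the appendix.

```python
import pandas as pd

# Toy data for illustration only (not the project's dataset).
df = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "MathScore": [70, 60, 90, 80],
    "ReadingScore": [80, 50, 95, 75],
    "WritingScore": [75, 55, 100, 70],
})

# Count how many distinct values a column holds.
n_genders = df["Gender"].nunique()

# Group by a column and summarise a numeric column.
mean_math_by_gender = df.groupby("Gender")["MathScore"].mean()

# Engineer a new feature: the mean of the three scores for each student.
# "AverageScore" is a hypothetical column name.
score_cols = ["MathScore", "ReadingScore", "WritingScore"]
df["AverageScore"] = df[score_cols].mean(axis=1)

print(n_genders)
print(mean_math_by_gender)
print(df[["Gender", "AverageScore"]])
```

A derived column such as this one can then be analysed like any other numerical variable.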


WORKFLOW OF PROJECT
TECHNOLOGY USED
1. Programming Languages:
• Python: Widely adopted for its simplicity and extensive ecosystem of
libraries, Python is a go-to language for EDA. Pandas offers data
structures such as DataFrames for manipulation, while Matplotlib, Seaborn,
and Plotly provide versatile visualization options.
2. Data Manipulation Libraries:
• Pandas: A powerful library for data manipulation and analysis. It allows
for tasks such as cleaning data, handling missing values, and reshaping
datasets with ease.

• NumPy: Fundamental for numerical operations in Python, NumPy
supports efficient array operations and mathematical functions.

3. Data Visualization Libraries:
• Matplotlib: A foundational library for static plotting, Matplotlib
provides a fine level of control over visualizations.

• Seaborn: Built on Matplotlib, Seaborn simplifies the creation of
attractive statistical graphics with fewer lines of code.

4. Notebook Environments:
• Jupyter Notebooks: These interactive notebooks allow combining code,
visualizations, and narrative text, making them popular for data
exploration.
DATA DESCRIPTION

• DATA IS MULTIVARIATE IN NATURE BECAUSE IT HAS SEVERAL COLUMNS.

• IT HAS BOTH CATEGORICAL AS WELL AS NUMERICAL FEATURES (ATTRIBUTES).

• IT HAS 15 COLUMNS.

• COLUMN NAMES: 1. Serial no.

2. Gender

3. Ethnic Group

4. Parent Edu

5. Lunch Type

6. Test Prep

7. Parent Marital Status

8. Practice Sport

9. Is First Child

10. Nr Siblings

11. Transport Means

12. Wkly Study Hours

13. Math Score

14. Reading Score

15. Writing Score
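
The split between categorical and numerical features described above can be checked programmatically. The sketch below uses a made-up toy DataFrame, not the project's dataset. The identifiers `LunchType` and `NrSiblings` are assumed spellings of the "Lunch Type" and "Nr Siblings" columns, so they may differ in the actual CSV file.

```python
import pandas as pd

# Toy data for illustration only; column spellings are assumptions.
df = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "LunchType": ["standard", "free/reduced", "standard"],
    "NrSiblings": [1, 0, 3],
    "MathScore": [71, 58, 90],
})

# Numerical features: columns with an integer or floating-point dtype.
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# Categorical features: everything that is not numeric (here, text columns).
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```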


PROJECT DESCRIPTION
Import Python libraries:

Reading Data Set:

Analysing the Data:


df.head() displays the first 5 rows (observations) of the dataset.

df.info() lists each column's data type and non-null count, which shows that
several variables have missing values.


Missing value calculation
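
A minimal sketch of the missing value calculation is shown here, assuming a made-up toy DataFrame rather than the project's dataset. It reports the missing count for each column and the same figure as a percentage of all rows.

```python
import pandas as pd

# Toy data for illustration only (not the project's dataset).
df = pd.DataFrame({
    "EthnicGroup": ["group A", None, "group C", "group B"],
    "ParentEduc": ["high school", "some college", None, None],
    "MathScore": [71, 58, 90, 64],
})

# Number of missing entries in each column.
missing_count = df.isnull().sum()

# Share of missing entries in each column, as a percentage of all rows.
missing_percent = df.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary)
```

The percentage view helps decide whether a column is worth keeping or whether rows with missing values can be dropped safely.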

Data Reduction:
Some columns or variables can be dropped if they do not add value to our analysis.
EDA Univariate Analysis:
• Univariate analysis can be done for both Categorical and Numerical variables.
• Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.
• Numerical Variables can be visualized using Histogram, Box Plot.
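
The numbers behind these plots can also be computed directly. The sketch below uses made-up toy data, not the project's dataset. It counts the categories that a count plot would draw and applies the common 1.5 × IQR rule, which box plots conventionally use to mark outliers. The variable names are illustrative.

```python
import pandas as pd

# Toy data for illustration only (not the project's dataset).
df = pd.DataFrame({
    "Gender": ["female", "female", "male", "female", "male"],
    "MathScore": [10, 60, 65, 70, 75],
})

# Categorical variable: frequency of each category (what a count plot shows).
gender_counts = df["Gender"].value_counts()

# Numerical variable: quartiles and interquartile range (what a box plot shows).
q1 = df["MathScore"].quantile(0.25)
q3 = df["MathScore"].quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df["MathScore"] < lower) | (df["MathScore"] > upper)]

print(gender_counts)
print(lower, upper)
print(outliers)
```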
COUNT PLOT:

From the above chart, we observe that the number of females in the data is
greater than the number of males.
BOXPLOT:
PIE PLOT:

EDA Multivariate Analysis:


Multivariate analysis looks at more than two variables. Multivariate analysis is
one of the most useful methods to determine relationships and analyze patterns
for any dataset.
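A heat map of a correlation matrix shows the correlation between numerical
variables, indicating whether each pair is positively or negatively correlated.
In this project, the heat maps show the mean score for each category of a
grouping column.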
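
For reference, a correlation matrix of the kind a heat map can display may be computed as in the following sketch. The toy values are made up and deliberately perfectly linear so that the signs are easy to read. They are not taken from the project's dataset.

```python
import pandas as pd

# Toy data for illustration only; values are deliberately perfectly linear.
df = pd.DataFrame({
    "MathScore": [50, 60, 70, 80],
    "ReadingScore": [55, 65, 75, 85],
    "WritingScore": [90, 80, 70, 60],
})

# Pearson correlation between every pair of numerical columns.
score_cols = ["MathScore", "ReadingScore", "WritingScore"]
corr = df[score_cols].corr()

# Values near +1 indicate a positive relationship, values near -1 a negative one.
print(corr)
```

Passing `corr` to `sns.heatmap(corr, annot = True)` would draw this matrix as a heat map.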

From the above chart, we conclude that a higher level of parental education is
associated with higher student scores.
IMPLEMENTATION CODE

• import numpy as np

• import pandas as pd

• import matplotlib.pyplot as plt

• import seaborn as sns

• df = pd.read_csv("student_scores.csv")

• print(df.head())

• df.describe()

• df.info()

• df.isnull().sum()

• df = df.drop("Unnamed: 0", axis = 1)

• print(df.head())

• plt.figure(figsize = (5,5))

• ax = sns.countplot(data = df, x = "Gender")

• ax.bar_label(ax.containers[0])

• plt.title("Gender Distribution")

• plt.show()

• gb = df.groupby("ParentEduc").agg({"MathScore":'mean', "ReadingScore":"mean",
"WritingScore":"mean"})

• print(gb)

• sns.heatmap(gb, annot = True)

• plt.title("Relationship between parent's education and student's score")

• plt.show()

• gb1 = df.groupby("ParentMaritalStatus").agg({"MathScore":'mean', "ReadingScore":"mean",
"WritingScore":"mean"})

• print(gb1)

• sns.heatmap(gb1, annot = True)

• plt.title("Relationship between parent's marital status and student's score")

• plt.show()

• sns.boxplot(data = df, x = "MathScore")

• plt.show()

• sns.boxplot(data = df, x = "ReadingScore")

• plt.show()

• print(df["EthnicGroup"].unique())

• groupA = df.loc[(df['EthnicGroup'] == "group A")].count()

• groupB = df.loc[(df['EthnicGroup'] == "group B")].count()

• groupC = df.loc[(df['EthnicGroup'] == "group C")].count()

• groupD = df.loc[(df['EthnicGroup'] == "group D")].count()

• groupE = df.loc[(df['EthnicGroup'] == "group E")].count()

• l = ["group A", "group B", "group C", "group D", "group E"]

• mlist = [groupA["EthnicGroup"], groupB["EthnicGroup"],
groupC["EthnicGroup"], groupD["EthnicGroup"], groupE["EthnicGroup"]]

• plt.pie(mlist, labels = l, autopct = "%1.2f%%")

• plt.title("Distribution of ethnic groups")

• plt.show()

• ax = sns.countplot(data = df, x = 'EthnicGroup')

• ax.bar_label(ax.containers[0])
REFERENCES

1. https://www.kaggle.com/
2. https://www.geeksforgeeks.org/
3. https://towardsdatascience.com/exploratory-data-analysis-eda-python
