
Name: Pratik Vasant Bhosale

Class: MBA 1 (BA)


Roll No. 05
Subject: Data Exploration and Visualization

Article Review
On
Data Exploration and Analysis Using Python
Author: Raji Rai

Data exploration is a key aspect of data analysis and model
building. Without spending significant time on understanding
the data and its patterns, one cannot expect to build efficient
predictive models. Data exploration, which includes data
cleaning and preprocessing, takes up a major chunk of the time
in a data science project.
In this article, the author explains the various steps involved
in data exploration through simple explanations and Python code
snippets.
Data sources can vary from databases to websites. Data as
sourced is known as raw data. Raw data cannot be used directly
for model building, as it is usually inconsistent and not suitable
for prediction. It has to be treated for anomalies and missing
values. Variables can be of different types, such as character,
numeric, categorical, and continuous.
Identifying the predictor and target variables is also a key
step in model building. The target is the dependent variable, and
a predictor is an independent variable on which the prediction
is based. Categorical or discrete variables are those that cannot
be mathematically manipulated. They are made up of fixed values
such as 0 and 1. Continuous variables, on the other hand, can be
summarized using mathematical functions such as the average or
sum of all values. A few lines of Python code are enough to
identify the types of variables in a dataset.
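
As a minimal sketch of that last point, the snippet below inspects
variable types with pandas. The toy dataset and its column names are
hypothetical and are not taken from the article.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [32000.0, 45500.5, 61000.0, 58250.75],
    "gender": ["M", "F", "F", "M"],
    "purchased": [0, 1, 1, 0],  # target variable coded as fixed values 0 and 1
})

# Storage type of every column.
print(df.dtypes)

# Split the columns into numeric and non-numeric (character) groups.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
character_cols = df.select_dtypes(exclude="number").columns.tolist()

# A numeric column with very few distinct values is probably categorical,
# even though it is stored as a number.
low_cardinality = [c for c in numeric_cols if df[c].nunique() <= 2]
print(numeric_cols, character_cols, low_cardinality)
```

The threshold of two distinct values is an arbitrary choice for this
example; on a real dataset it would need to be judged per column.
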
Univariate analysis is used to highlight missing and outlier
values. Here each variable is analysed on its own for range and
distribution. The approach differs for categorical and
continuous variables. For categorical variables, a frequency
table shows the distribution of each category. For continuous
variables, one has to understand the central tendency and spread
of the variable. These can be measured using the mean, median,
mode, etc., and visualized using a box plot or histogram.
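
The univariate checks described above might look roughly like this in
pandas. The data is made up for illustration, and the 1.5 x IQR rule
used to flag outliers is the usual box plot convention, assumed here
rather than taken from the article.

```python
import pandas as pd

# Hypothetical data with one missing category and one extreme income.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", None],
    "income": [30.0, 42.0, 38.0, 45.0, 40.0, 400.0],
})

# Categorical variable: frequency table, keeping missing values visible.
freq = df["gender"].value_counts(dropna=False)
print(freq)

# Continuous variable: central tendency and spread.
center = {"mean": df["income"].mean(), "median": df["income"].median()}
print(center)
print(df["income"].describe())

# Flag outliers with the 1.5 * IQR rule that box plot whiskers use.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```

Note how the single extreme value pulls the mean far above the median,
which is one reason to look at both.
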

Bivariate analysis is used to find the relationship between
two variables. It can be performed for any combination of
categorical and continuous variables. A scatter plot is suitable
for analyzing two continuous variables; it indicates whether the
relationship between them is linear or non-linear. Bar charts
help to understand the relationship between two categorical
variables. Certain statistical tests are also used to understand
bivariate relationships more effectively. The SciPy library has
extensive modules for performing these tests in Python.
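
The review does not say which tests the article uses, so the sketch
below simply assumes three common choices from scipy.stats, one per
variable combination: Pearson correlation, the chi-square test of
independence, and a two-sample t-test. The data is hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 74, 79, 85],
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "passed": ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"],
})

# Continuous vs continuous: Pearson correlation coefficient.
r, p_r = stats.pearsonr(df["hours"], df["score"])

# Categorical vs categorical: chi-square test on a contingency table.
table = pd.crosstab(df["gender"], df["passed"])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Categorical vs continuous: two-sample t-test between the groups.
scores_m = df.loc[df["gender"] == "M", "score"]
scores_f = df.loc[df["gender"] == "F", "score"]
t_stat, p_t = stats.ttest_ind(scores_m, scores_f)

print(r, p_r)
print(chi2, p_chi, dof)
print(t_stat, p_t)
```

With only eight rows the p-values here mean very little; the point is
only to show which call fits which pair of variable types.
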
The Matplotlib and Seaborn libraries can be used to plot
different relational graphs that help in visualizing the
bivariate relationship between different types of variables.
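
A small sketch of two such plots follows, using Matplotlib only (through
pandas for the bar chart) to keep it short; Seaborn offers similar plots
at a higher level. The data and column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 74, 79, 85],
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "passed": ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"],
})

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Two continuous variables: scatter plot.
ax_scatter.scatter(df["hours"], df["score"])
ax_scatter.set_xlabel("hours")
ax_scatter.set_ylabel("score")

# Two categorical variables: grouped bar chart of the cross-tabulated counts.
counts = pd.crosstab(df["gender"], df["passed"])
counts.plot(kind="bar", ax=ax_bar)
ax_bar.set_ylabel("count")

fig.tight_layout()
# Call fig.savefig("some_name.png") or switch backend and plt.show() to view it.
```
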
Missing values in the dataset can reduce model fit. They can
lead to a biased model, as the data cannot be analysed
completely: the behavior of a variable and its relationship with
other variables cannot be deduced correctly, which can lead to
wrong prediction or classification. Missing values may occur due
to problems in data extraction or data collection, and can be
categorized as MCAR (missing completely at random), MAR
(missing at random), and NMAR (not missing at random).
Missing values can be treated by deletion,
mean/mode/median imputation, KNN imputation, or using
prediction models.
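
Three of these treatments can be sketched as follows. The data is
hypothetical, and the use of scikit-learn's KNNImputer with two
neighbours is an assumption made for this example; the review does not
say which tool the article uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0, 45.0],
    "income": [30.0, 35.0, 41.0, np.nan, 52.0],
    "city": ["Pune", "Mumbai", None, "Pune", "Pune"],
})

# 1. Deletion: drop every row that has at least one missing value.
dropped = df.dropna()

# 2. Simple imputation: median or mean for numbers, mode for categories.
simple = df.copy()
simple["age"] = simple["age"].fillna(simple["age"].median())
simple["income"] = simple["income"].fillna(simple["income"].mean())
simple["city"] = simple["city"].fillna(simple["city"].mode()[0])

# 3. KNN imputation on the numeric columns only.
num_cols = ["age", "income"]
knn = df.copy()
knn[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

print(dropped)
print(simple)
print(knn)
```

Deletion is the simplest option but here it discards two of five rows,
which shows why imputation is often preferred on small datasets.
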
Outliers can occur naturally in data or can be due to data
entry errors. They can drastically change the results of data
analysis and statistical modeling. Outliers are easily detected
by visualization methods such as box plots, histograms, and
scatter plots. Like missing values, outliers can be handled by
deleting observations, transforming them, binning or grouping
them, treating them as a separate group, or imputing values.
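
The handling options listed above could be sketched like this. The
values, the 1.5 x IQR detection rule, the bin edges, and the choice of a
log transform are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with one extreme observation.
s = pd.Series([30.0, 38.0, 40.0, 42.0, 45.0, 400.0], name="income")

# Detect outliers with the 1.5 * IQR rule (the box plot whisker rule).
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Option 1: delete the outlying observations.
deleted = s[~is_outlier]

# Option 2: transform the variable to shrink the effect of large values.
log_transformed = np.log1p(s)

# Option 3: bin the values so the extreme one falls into a "high" group.
binned = pd.cut(s, bins=[0, 35, 45, np.inf], labels=["low", "mid", "high"])

# Option 4: impute outliers with the median of the remaining values.
imputed = s.mask(is_outlier, s[~is_outlier].median())

print(deleted.tolist())
print(binned.tolist())
print(imputed.tolist())
```

Which option is appropriate depends on whether the outlier is a genuine
observation or a data entry error.
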
Author - Raji Rai
Source – towardsdatascience.com
