Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS(A)
Assignment No.: 1
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
Its output is as follows −
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
7 34 Lee 3.78
8 40 David 2.98
9 30 Gasper 4.80
10 51 Betina 4.10
11 46 Andres 3.65
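The dictionary d that all the listings below build on was not reproduced in this handout. Assuming the column values shown in the output above, it can be reconstructed as:

```python
import pandas as pd

# Reconstructed from the printed output above (an assumption, since the
# original definition of d was omitted from the handout)
d = {'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
     'Name': ['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
              'Lee', 'David', 'Gasper', 'Betina', 'Andres'],
     'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.60, 3.80, 3.78,
                2.98, 4.80, 4.10, 3.65]}
df = pd.DataFrame(d)
print(df)
```

With this dictionary in place, the sums and means in the listings that follow reproduce the outputs shown (Age sums to 382 and Rating to 44.92).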
sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())
Its output is as follows −
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
Each individual column is added individually (Strings are appended).
axis=1
This syntax will give the output as shown below.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum(axis=1))
Its output is as follows −
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Returns the average value
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.mean())
Its output is as follows −
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the Bessel-corrected standard deviation of the numerical columns.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.std())
Its output is as follows −
Age 9.232682
Rating 0.661628
dtype: float64
Functions & Description
Let us now understand the functions under Descriptive Statistics in Python Pandas. The
following table list down the important functions −
Note − Since a DataFrame is a heterogeneous data structure, generic operations don't work with
all functions.
Functions like sum() and cumsum() work with both numeric and character (string) data
elements without any error. Though in practice character aggregations are rarely used,
these functions do not throw any exception.
Functions like abs() and cumprod() throw an exception when the DataFrame contains character
or string data, because such operations cannot be performed on strings.
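A minimal sketch of this behaviour (the small DataFrame here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James'], 'Delta': [1, -2]})

# sum() works on both columns: numbers are added, strings are concatenated
totals = df.sum()
print(totals['Name'])   # 'TomJames'
print(totals['Delta'])  # -1

# abs() raises a TypeError because it cannot be applied to strings
try:
    df.abs()
except TypeError:
    print('abs() failed on the string column')
```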
Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Its output is as follows −
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function gives the count, mean, std and quartile (IQR) values. It excludes the character
columns and gives a summary of the numeric columns. 'include' is the argument used to
specify which columns need to be considered for summarizing.
It takes a list of values; by default, 'number'.
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))
Its output is as follows −
Name
count 12
unique 12
top Ricky
freq 1
Now, use the following statement and check the output −
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))
Its output is as follows −
Age Name Rating
count 12.000000 12 12.000000
unique NaN 12 NaN
top NaN Ricky NaN
freq NaN 1 NaN
mean 31.833333 NaN 3.743333
std 9.232682 NaN 0.661628
min 23.000000 NaN 2.560000
25% 25.000000 NaN 3.230000
50% 29.500000 NaN 3.790000
75% 35.500000 NaN 4.132500
max 51.000000 NaN 4.800000
Observations: Thus students are able to write a program to get the statistical characteristics of
a dataset using pandas.
Prepared by: Dr. Mrs. J. N. Jadhav, Associate Professor, Dept. of CSE, DYPCET
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: T.Y.B.Tech DS(A)
Assignment No.: 2
Title: Programs for analysis of data through different plots (scatter, bubble, area, stacked)
and charts (line, bar, table, pie, histogram)
Plots
1. Scatter Plot
It is a type of plot using Cartesian coordinates to display values for two variables for a set of
data. It is displayed as a collection of points. Their position on the horizontal axis determines the
value of one variable. The position on the vertical axis determines the value of the other variable.
A scatter plot can be used when one variable can be controlled and the other variable depends on
it. It can also be used when both continuous variables are independent.
Visual:
Plotly code:
import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length")
fig.show()
Seaborn code:
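The seaborn listing was omitted from this handout; a minimal equivalent sketch (with a few iris-like rows standing in for the full iris dataset) could look like:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# a few illustrative rows; the real example would load the full iris dataset
df = pd.DataFrame({'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
                   'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0]})
ax = sns.scatterplot(data=df, x='sepal_width', y='sepal_length')
plt.show()
```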
According to the correlation of the data points, scatter plots are grouped into different types.
Positive Correlation
In these types of plots, an increase in the independent variable indicates an increase in the
variable that depends on it. A scatter plot can have a high or low positive correlation.
Negative Correlation
In these types of plots, an increase in the independent variable indicates a decrease in the
variable that depends on it. A scatter plot can have a high or low negative correlation.
No Correlation
Two groups of data visualized on a scatter plot are said to have no correlation if there is no clear
relationship between them.
2. Bubble Chart
A bubble chart displays three attributes of data. They are represented by x location, y location,
and size of the bubble.
Visualization:
Plotly code:
import plotly.express as px
df = px.data.gapminder()
# standard Plotly Express gapminder bubble chart: bubble size encodes population
fig = px.scatter(df.query("year == 2007"), x="gdpPercap", y="lifeExp",
                 size="pop", color="continent", hover_name="country",
                 log_x=True, size_max=60)
fig.show()
Seaborn code:
import plotly.express as px  # only for the gapminder sample data
import seaborn as sns
import matplotlib.pyplot as plt
df = px.data.gapminder().query("year == 2007")
b = sns.scatterplot(data=df, x="gdpPercap", y="lifeExp",
                    size="pop", sizes=(20, 2000), legend=False)
b.set(xscale="log")
plt.show()
Bubble charts are grouped into different types based on the number of variables in the dataset
and the type of data they display.
Simple Bubble Chart
It is the basic type of bubble chart and is equivalent to the normal bubble chart.
Labelled Bubble Chart
The bubbles on this bubble chart are labelled for easy identification. This is to deal with different
groups of data.
Multivariable Bubble Chart
This chart has four dataset variables. The fourth variable is distinguished with a different colour.
3D Bubble Chart
This is a bubble chart designed in a 3-dimensional space. The bubbles here are spherical.
Area Chart
It is represented by the area between the lines and the axis. The area is proportional to the
amount it represents.
Overlapping Area Chart
In this chart, the coloured segments overlap each other. They are placed above each other.
Stacked Area Chart
In this chart, the coloured segments are stacked on top of one another, so they do not intersect.
100% Stacked Area Chart
In this chart, the area occupied by each group of data is measured as a percentage of its amount
from the total data. Usually, the vertical axis totals a hundred per cent.
3D Area Chart
This chart is measured on a 3-dimensional space.
We will look at visual representation and code for the most common type below.
Visual:
Plotly:
import plotly.express as px
df = px.data.gapminder()
fig = px.area(df, x="year", y="pop", color="continent",
line_group="country")
fig.show()
Seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# the original values for team_A and team_B were truncated in this handout;
# the values below are illustrative
df = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                   'team_A': [20, 12, 15, 14, 19, 23, 25, 29],
                   'team_B': [5, 7, 7, 9, 12, 9, 9, 4],
                   'team_C': [11, 8, 10, 6, 6, 5, 9, 12]})
plt.stackplot(df['period'], df['team_A'], df['team_B'], df['team_C'])
plt.show()
Charts:
Line Graph
It displays a sequence of data points as markers. The points are typically ordered by their x-axis
value and joined with straight line segments. A line graph is used to visualize a trend in data
over intervals of time.
import plotly.express as px
df = px.data.gapminder().query("country=='Canada'")
fig = px.line(df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()
import seaborn as sns
sns.lineplot(data=df, x="year", y="lifeExp")
Simple Line Graph
A simple line graph plots only one line on the graph. One of the axes defines the independent
variable, while the other axis contains the dependent variable.
Multiple Line Graph
Multiple line graphs contain more than one line. They represent multiple variables in a dataset.
This type of graph can be used to study more than one variable over the same period.
import plotly.express as px
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.line(df, x='year', y='lifeExp', color='country', symbol="country")
fig.show()
In seaborn as:
Here is the illustration:
Compound Line Graph
It is an extension of a simple line graph, used when dealing with different groups of data
from a larger dataset. Every line graph in it is shaded downwards to the x-axis, and each group
is stacked upon the one below it.
Here is an illustration:
Bar Graph
A bar graph is a graph that presents categorical data with rectangle-shaped bars. The heights or
lengths of these bars are proportional to the values that they represent. The bars can be vertical
or horizontal.
Following is the Plotly code for a simple bar graph:
import plotly.express as px
data_canada = px.data.gapminder().query("country == 'Canada'")
fig = px.bar(data_canada, x='year', y='pop')
fig.show()
The following are types of bar graphs:
Grouped Bar Graph
Grouped bar graphs are used when the datasets have subgroups that need to be visualized on the
graph. The subgroups are differentiated by distinct colours. Here is an illustration of such a
graph:
import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color="time")
fig.show()
import seaborn as sb
df = sb.load_dataset('tips')
df = df.groupby(['size', 'sex']).agg(mean_total_bill=("total_bill", 'mean'))
df = df.reset_index()
sb.barplot(x="size", y="mean_total_bill", hue="sex", data=df)
Stacked Bar Graph
The stacked bar graphs are used to show dataset subgroups. However, the bars are stacked on top
of one another instead of being placed side by side.
import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()
import pandas
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
df = pandas.DataFrame(dict(
number=[2, 5, 1, 6, 3],
count=[56, 21, 34, 36, 12],
select=[29, 13, 17, 21, 8]
))
# draw the larger totals first, then overlay the subgroup to mimic stacking
bar_plot1 = sns.barplot(x='number', y='count', data=df, label="count", color="red")
bar_plot2 = sns.barplot(x='number', y='select', data=df, label="select", color="green")
plt.legend(ncol=2, loc="upper right", frameon=True)
plt.show()
Segmented Bar Graph
This is the type of stacked bar graph where each stacked bar shows the percentage of its discrete
value from the total value. The total percentage is 100%. Here is an illustration:
Pie Chart
A pie chart is a circular statistical graphic. To illustrate numerical proportion, it is divided into
slices. In a pie chart, the arc length of every slice is proportional to the amount it
represents, as are its central angle and area. It is named after a sliced pie.
import plotly.express as px
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
fig = px.pie(df, values='pop', names='country', title='Population of European continent')
fig.show()
Seaborn doesn’t have a default function to create pie charts, but the following syntax in
matplotlib can be used to create a pie chart and add a seaborn color palette (the data and
labels below are illustrative, as the original values were truncated in this handout):
import matplotlib.pyplot as plt
import seaborn as sns
data = [15, 25, 25, 30, 5]
labels = ['A', 'B', 'C', 'D', 'E']
colors = sns.color_palette('pastel')[0:5]
plt.pie(data, labels=labels, colors=colors, autopct='%.0f%%')
plt.show()
These are types of pie charts:
Simple Pie Chart
This is the basic type of pie chart. It is often called just a pie chart.
Exploded Pie Chart
One or more sectors of the chart are separated (termed as exploded) from the chart in an
exploded pie chart. It is used to emphasize a particular element in the data set.
import plotly.graph_objects as go
labels = ['Oxygen', 'Hydrogen', 'Carbon_Dioxide', 'Nitrogen']
values = [4500, 2500, 1053, 500]
# pull the first sector out of the pie to "explode" it
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0.2, 0, 0, 0])])
fig.show()
In seaborn, the explode attribute of the pie method in matplotlib can be used as shown below
(the data and labels are illustrative, as the original values were truncated in this handout):
import matplotlib.pyplot as plt
import seaborn as sns
data = [15, 25, 25, 30, 5]
labels = ['A', 'B', 'C', 'D', 'E']
colors = sns.color_palette('pastel')[0:5]
plt.pie(data, labels=labels, colors=colors, autopct='%.0f%%', explode=[0, 0, 0, 0.2, 0])
plt.show()
Donut Chart
In this pie chart, there is a hole in the centre. The hole makes it look like a donut, from which it
gets its name.
import plotly.graph_objects as go
labels = ['Oxygen','Hydrogen','Carbon_Dioxide','Nitrogen']
values = [4500, 2500, 1053, 500]
# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.show()
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randint(20, 100, 6)
plt.pie(data)
# draw a white circle over the centre to create the hole
circle = plt.Circle((0, 0), 0.7, color='white')
p = plt.gcf()
p.gca().add_artist(circle)
plt.show()
Pie of Pie
A pie of pie is a chart that generates an entirely new pie chart detailing a small sector of the
existing pie chart. It can be used to reduce the clutter and emphasize a particular group of
elements.
Here is an illustration:
Bar of Pie
This is similar to the pie of pie, except that a bar chart is what is generated.
Here is an illustration:
3D Pie Chart
In seaborn / matplotlib, the shadow attribute of plt.pie can be set to True to give a comparable
depth effect.
Histogram
A histogram is an approximate representation of the distribution of numerical data. The range of
values is divided into non-overlapping intervals called bins or buckets. A rectangle is erected
over each bin whose height is proportional to the number of data points in the bin. Histograms
give a feel of the underlying distribution of the data.
Here is a visual:
Plotly code:
import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()
Seaborn code:
The following are common shapes a histogram can take:
Normal Distribution
In this histogram, the bars form the classic symmetric bell shape, with most values clustered
around a central peak.
Bimodal Distribution
In this histogram, there are two groups of bars that each follow a normal distribution. It is the
result of two different processes being mixed in one dataset.
Visualization:
Plotly code:
import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill", y="tip", color="sex", marginal="rug",
hover_data=df.columns)
fig.show()
Seaborn:
Skewed Distribution
This is an asymmetric graph with an off-centre peak. The peak tends towards the beginning or
end of the graph. A histogram can be said to be right-skewed or left-skewed depending on the
direction in which the peak tapers off.
Random Distribution
This histogram does not have a regular pattern. It produces multiple peaks. It can be called a
multimodal distribution.
Edge Peak Distribution
This distribution is similar to that of a normal distribution, except for a large peak at one of its
ends.
Comb Distribution
The comb distribution is like a comb. The height of rectangle-shaped bars is alternately tall and
short.
Observations: Thus students are able to analyze data through different plots and charts.
D. Y. Patil College of Engineering and Technology, Kolhapur
Assignment No.: 3
Data transformation is the process of extracting good, reliable data from source systems. This
involves converting data from one structure (or no structure) to another so you can integrate it
with a data warehouse or with different applications. It allows you to expose the information to
advanced business intelligence tools to create valuable performance reports and forecast future
trends.
Data transformation includes two primary stages: understanding and mapping the data; and
transforming the data.
The best way to select, implement and integrate data deduplication can vary depending on how
the deduplication is performed. Here are some general principles that you can follow in selecting
the right deduplicating approach and then integrating it into your environment.
What deduplication ratio a company achieves will depend heavily on the following factors:
Type of data
The challenge most companies have is quickly and effectively gathering this data. Agentless data
gathering and information classification tools from Aptare Inc., Asigra Inc., Bocada
Inc. and Kazeon Systems Inc. can assist in performing these assessments while requiring
minimal or no changes to your servers in the form of agent deployments.
Step 2: Establish how much you can change your backup environment
Deploying backup software that uses software agents will require installing agents on each server
or virtual machine and rebooting the server after the agent is installed. This approach generally results in
faster backup times and higher deduplication ratios than using a data deduplication appliance.
However, it can take more time and require many changes to a company's backup environment.
Using a data deduplication appliance typically requires no changes to servers, though a company
will need to tune its backup software according to whether the appliance is configured as a file
server or a virtual tape library (VTL).
Step 3: Verify scalability
The amount of data that a company initially plans to back up and what it actually ends up
backing up are usually two very different numbers. A company usually finds deduplication so
effective when it starts using it in its backup process that it quickly scales its use and deployment
beyond initial intentions, so you should confirm that deduplicating hardware appliances can scale
both performance and capacity. You should also verify that the hardware and software
deduplication products can provide global deduplication and replication features to maximize
duplication's benefits throughout the enterprise, facilitate technology refreshes and/or capacity
growth, and efficiently bring in deduplicated data from remote offices.
Step 4: Check the level of integration between backup software and hardware appliances
The level of integration that a hardware appliance has with backup software (or vice versa) can
expedite backups and recoveries. For example, ExaGrid Systems Inc. ExaGrid appliances
recognize backup streams from CA ARCserve and can better deduplicate data from that backup
software than streams from backup software that it doesn't recognize. Enterprise backup software
is also starting to better manage disk storage systems so data can be placed on different disk
storage systems with different tiers of disk, so they can back up and recover data more quickly
short term and then more cost-effectively store it long term.
Step 5: Plan for the first backup
The first backup using agent-based deduplication software can potentially be a harrowing
experience. It can create a significant amount of overhead on the server and take much longer
than normal to complete because it needs to deduplicate all of the data. However, once the first
backup is complete, it only needs to back up and deduplicate changed data going forward. Using
a hardware appliance, the experience tends to be the opposite. The first backup may occur
quickly but backups may slow over time depending on how scalable the hardware appliance is,
how much data is changing and how much data growth that a company is experiencing.
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 4
The rationale for Mode is to replace the population of missing values with the most frequent value
since this is the most likely occurrence.
If the data is time-series data, one of the most widely used imputation methods is the last
observation carried forward (LOCF). Whenever a value is missing, it is replaced with the last
observed value.
This method is advantageous as it is easy to understand and communicate. Although simple, this
method strongly assumes that the value of the outcome remains unchanged by the missing data,
which seems unlikely in many settings.
An approach similar to LOCF works in the opposite direction, taking the first observation after
the missing value and carrying it backward (“next observation carried backward”, or NOCB).
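In pandas, LOCF and NOCB correspond directly to forward fill and backward fill; a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

locf = s.ffill()   # last observation carried forward: 1, 1, 1, 4, 4
nocb = s.bfill()   # next observation carried backward: 1, 4, 4, 4, NaN
print(locf.tolist())
print(nocb.tolist())
```

Note that a trailing gap cannot be filled by NOCB (there is no later observation to carry back), which is why the last value stays missing.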
4) Linear Interpolation
Interpolation is a mathematical method that fits a function to the data and uses this function to
fill in the missing data. The simplest type of interpolation is linear interpolation, which draws a
straight line between the last value before the missing data and the first value after it. Of course,
we could have a pretty complex pattern in the data, and linear interpolation may not be enough.
There are several different types of interpolation. Just in pandas, we have options like: ‘linear’,
‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘spline’,
‘piecewise_polynomial’ and many more.
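A sketch of linear interpolation with pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.interpolate(method='linear')  # the gap is filled along a straight line
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```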
5) Common-Point Imputation
For a rating scale, use the middle point or the most commonly chosen value. For example, on a
five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). It
is similar to mean imputation but more suitable for ordinal values.
Missing-Category Imputation
This is perhaps the most widely used method of missing data imputation for categorical variables.
This method consists of treating missing data as an additional label or category of the variable. All
the missing observations are grouped in the newly created label ‘Missing’. It does not assume
anything about the missingness of the values. It is very well suited when the number of missing
data points is high.
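A sketch of missing-category imputation with pandas (the colour data is illustrative):

```python
import pandas as pd

s = pd.Series(['red', None, 'blue', None])
filled = s.fillna('Missing')   # missing entries become their own category
print(filled.tolist())
```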
Frequent-Category Imputation
Replacement of missing values by the most frequent category is the equivalent of mean/median
imputation for categorical data. It consists of replacing all occurrences of missing values within
a variable with the variable's most frequent label or category.
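A sketch of frequent-category imputation (illustrative data):

```python
import pandas as pd

s = pd.Series(['red', 'blue', 'red', None])
most_frequent = s.mode()[0]        # the most common category ('red' here)
filled = s.fillna(most_frequent)
print(filled.tolist())
```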
Arbitrary-Value Imputation
Arbitrary value imputation consists of replacing all occurrences of missing values within a
variable with an arbitrary value. Ideally, the arbitrary value should be different from the
median/mean/mode and not within the normal range of values of the variable. Typically used
arbitrary values are 0, 999, -999 (or other combinations of 9s) or -1 (if the distribution is
positive). Sometimes the data already contain an arbitrary value from the originator for the
missing values. This works reasonably well for numerical features that are predominantly
positive in value, and for tree-based models in general. This used to be a more common method
when out-of-the-box machine learning libraries and algorithms were not very adept at working
with missing data.
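A sketch of arbitrary-value imputation (illustrative data):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.1, np.nan, 2.7, np.nan])
filled = s.fillna(-999)   # -999 is deliberately outside the normal range
print(filled.tolist())
```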
Missing-Indicator Imputation
When data are not missing completely at random, we can capture the importance of missingness
by creating an additional variable indicating whether the data was missing for that observation.
The additional variable is binary: it takes only the values 0 and 1, 0 indicating that a value was
present for that observation, and 1 indicating that the value was missing. Typically, mean/median
imputation is done together with adding this variable, to capture the observations where the data
was missing.
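A sketch combining a missing indicator with mean imputation (illustrative data):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.1, np.nan, 2.7, np.nan])
missing_flag = s.isna().astype(int)   # 1 where the value was missing
imputed = s.fillna(s.mean())          # mean imputation alongside the flag
print(missing_flag.tolist())
print(imputed.tolist())
```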
Multiple Imputation
Multiple Imputation (MI) is a statistical technique for handling missing data. The key concept of
MI is to use the distribution of the observed data to estimate a set of plausible values for the
missing data. Random components are incorporated into these estimated values to reflect their
uncertainty. Multiple datasets are created and then analysed individually but identically to obtain
a set of parameter estimates, which are then pooled into a single set of estimates. The benefit of
multiple imputation is that restoring the natural variability of the missing values incorporates the
uncertainty due to the missing data, which results in valid statistical inference. As a flexible way
of handling more than one missing variable, apply a Multiple Imputation by Chained Equations
(MICE) approach. Refer to the reference section to get more information on MI and MICE.
Below is a schematic representation of MICE.
Imputation Using Predictive Models
Missing values can also be predicted from the other variables with a machine-learning model.
This should be done in conjunction with some cross-validation scheme to avoid leakage. It can
be very effective and can help with the final model. There are many options for such a predictive
model, including a neural network. Here I am listing a few which are very popular.
Linear Regression
In regression imputation, the existing variables are used to predict the missing value, and the
predicted value is substituted as if it were an actually obtained value. This approach has several
advantages: the imputation retains a great deal of data over listwise or pairwise deletion and
avoids significantly altering the standard deviation or the shape of the distribution. However, as
with mean substitution, while a regression imputation substitutes a value predicted from the
other variables, no novel information is added, while the sample size is increased and the
standard error is reduced.
Random Forest
Random forest is a non-parametric imputation method applicable to various variable types that
works well with data both missing at random and not missing at random. Random forest uses
multiple decision trees to estimate missing values and outputs OOB (out of the bag) imputation
error estimates. One caveat is that random forest works best with large datasets, and using random
forest on small datasets runs the risk of overfitting.
k-Nearest Neighbours (k-NN)
k-NN imputes the missing attribute values based on the nearest K neighbours. Neighbours are
determined based on a distance measure. Once K neighbours are determined, the missing value is
imputed by taking the mean/median or mode of the known values of the missing attribute among
those neighbours.
Maximum likelihood
Maximum likelihood methods assume that the observed data are a sample drawn from a
multivariate normal distribution. After the parameters are estimated using the available data, the
missing data are estimated based on the parameters which have just been estimated. Several
strategies use the maximum likelihood method to handle the missing data.
Expectation-Maximization
Expectation-Maximization (EM) is a maximum likelihood method used to create a new data set.
All missing values are imputed with values estimated by the maximum likelihood methods. This
approach begins with the expectation step, during which the parameters (e.g., variances,
covariances, and means) are estimated, perhaps using the listwise deletion. Those estimates are
then used to create a regression equation to predict the missing data. The maximization step uses
those equations to fill in the missing data. The expectation step is then repeated with the new
parameters, where the new regression equations are determined to “fill in” the missing data. The
expectation and maximization steps are repeated until the system stabilizes.
Sensitivity analysis
Sensitivity analysis is defined as the study of how the uncertainty in the output of a model can be
allocated to the different sources of uncertainty in its inputs. When analysing the
missing data, additional assumptions on the missing data are made, and these assumptions are
often applicable to the primary analysis. However, the assumptions cannot be definitively
validated for correctness. Therefore, the National Research Council has proposed that the
sensitivity analysis be conducted to evaluate the robustness of the results to the deviations from
the MAR assumption.
Not all algorithms fail when there is missing data. Some algorithms can be made robust to
missing data, such as k-Nearest Neighbours, which can ignore a column from a distance measure
when a value is missing. Other algorithms can use the missing value as a unique and different
value when building the predictive model, such as classification and regression trees. An
algorithm like XGBoost handles missing data natively. If your imputation does not
work well, try a model that is robust to missing data.
Observations: Thus students are able to implement data transformation: handling missing data
and filling missing data.
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 5
import math

# Assumed setup: the top of the original listing was truncated in this handout,
# so the sample data below is illustrative. Smoothing by bin means expects
# the data to be sorted.
x = sorted([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
bi = 3                        # number of bins
X_dict = dict(enumerate(x))   # index -> value, in sorted order
x_old = dict(X_dict)          # original values, kept for comparison
x_new = {}                    # smoothed values
binn = []                     # mean of each bin
avrg = 0
i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))
# performing binning: accumulate values and emit the bin mean each time a bin fills
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = 0
        avrg = avrg + h
        i = i + 1
rem = len(x) % bi
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))
# store the new (smoothed) value of each data point
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = binn[j]
        i = i + 1
print("number of data in each bin")
print(math.ceil(len(x) / bi))
for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))
Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small
observation errors. The original data values are divided into small intervals known as bins, and
then they are replaced by a representative value calculated for each bin. This has a smoothing
effect on the input data and may also reduce the chances of overfitting in the case of small
datasets.
There are 2 methods of dividing data into bins:
1. Equal Frequency Binning: bins have an equal number of elements.
2. Equal Width Binning: bins have equal width, with the boundaries of each bin defined as
[min + w], [min + 2w], …, [min + nw], where w = (max − min) / (number of bins).
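The equal-width boundaries can be computed directly from this formula; using the sample data from the examples below:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3
w = (max(data) - min(data)) / n_bins          # bin width: (215 - 5) / 3 = 70.0
edges = [min(data) + w * i for i in range(n_bins + 1)]
print(edges)   # [5.0, 75.0, 145.0, 215.0]
```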
Equal frequency:
Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal Width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
Code : Implementation of Binning Technique:
Python
# equal frequency binning
def equifreq(arr1, m):
    a = len(arr1)
    n = int(a / m)          # number of elements per bin
    for i in range(0, m):
        arr = []
        for j in range(i * n, (i + 1) * n):
            if j >= a:
                break
            arr = arr + [arr1[j]]
        print(arr)
# equal width binning
def equiwidth(arr1, m):
    a = len(arr1)
    min1 = min(arr1)
    max1 = max(arr1)
    w = int((max1 - min1) / m)        # width of each bin
    arr = []                          # bin boundaries
    for i in range(0, m + 1):
        arr = arr + [min1 + w * i]
    arri = []
    for i in range(0, m):
        temp = []
        for j in arr1:
            if j >= arr[i] and j <= arr[i + 1]:
                temp += [j]
        arri += [temp]
    print(arri)
# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
# no of bins
m = 3
print("equal frequency binning")
equifreq(data, m)
print("equal width binning")
equiwidth(data, 3)
Output :
equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
Observations: Thus students are able to understand the process of Discretization and binning of
data.
Program Co-ordinator HOD Dean Academics
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 6
As you can see, three dummy variables are created for the three categorical values of the
temperature attribute. We can create dummy variables in Python using the get_dummies() method.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_')
Parameters:
data = input data, i.e. a pandas DataFrame, list, set, NumPy array, etc.
prefix = string prefix to prepend to the dummy column names.
prefix_sep = separator placed between the prefix and the category value.
Return Type: Dummy variables.
Step-by-step Approach:
Import necessary modules
Consider the data
Perform operations on data to get dummies
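The steps above can be sketched as follows (the temperature values are illustrative):

```python
import pandas as pd

# Consider a small categorical column
df = pd.DataFrame({'temperature': ['hot', 'cold', 'warm', 'hot']})

# Perform the operation to get one dummy column per category
dummies = pd.get_dummies(df['temperature'], prefix='temp')
print(dummies)
```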
Example 1:
Python3
Output:
Example 2:
Consider List arrays to get dummies
Python3
Output:
Example 3:
Here is another example, to get dummy variables.
Python3
Output:
Observations: Thus students are able to understand the concept of dummy variables.
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 7
Normal Distribution
Normal distribution represents the behaviour of most situations in the universe (that is why it is
called a “normal” distribution, I guess!). The large sum of (small) random variables often turns
out to be normally distributed, contributing to its widespread application. Any distribution is
known as a normal distribution if its mean, median and mode coincide and its density curve is
bell-shaped and symmetric about the mean.
A normal distribution is highly different from a binomial distribution. However, if the number of
trials approaches infinity, then the shapes will be quite similar.
The mean and variance of a random variable X which is said to be normally distributed are given
by:
Mean -> E(X) = µ
Variance -> Var(X) = σ^2
A standard normal distribution is defined as the distribution with mean 0 and standard deviation 1.
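Any normal variable can be reduced to the standard normal by z-scoring; a quick sketch (the sample values are illustrative):

```python
data = [2.0, 4.0, 6.0, 8.0]
mu = sum(data) / len(data)                                     # sample mean
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5  # std deviation
z = [(x - mu) / sigma for x in data]                           # standardized values
# z now has mean 0 and standard deviation 1
print(z)
```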
Poisson Distribution
Suppose you work at a call center; approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day is modeled by a Poisson
distribution.
4. The number of suicides reported in a particular city.
5. The number of printing errors on each page of a book.
You can now think of many examples following the same course. The Poisson distribution is
applicable in situations where events occur at random points of time and space, and our interest
lies only in the number of occurrences of the event.
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over a short interval must equal the probability of success over a
longer interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some
notation used in the Poisson distribution: λ is the rate at which an event occurs, t is the length
of a time interval, and X is the number of events in that interval.
Here, X is called a Poisson Random Variable and the probability distribution of X is called
Poisson distribution.
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
The mean µ is the parameter of this distribution. µ is also defined as λ times the length of that interval.
The graph shown below illustrates the shift in the curve due to increase in mean.
It is perceptible that as the mean increases, the curve shifts to the right.
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are
equally likely, and that is the basis of a uniform distribution. Unlike a Bernoulli distribution, all
the n possible outcomes of a uniform distribution are equally likely.
You can see that the shape of the uniform distribution curve is rectangular, which is why the
uniform distribution is also called the rectangular distribution.
The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.
Let's try calculating the probability that the daily sales will fall between 15 and 30.
The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5
Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667.
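These two calculations can be sketched in a few lines (an illustrative addition to the bouquet example, with a = 10 and b = 40):

```python
# Sketch of the bouquet example: daily sales X ~ Uniform(a = 10, b = 40),
# where P(c < X < d) = (d - c) / (b - a).
a, b = 10, 40

def uniform_prob(c, d):
    # clip the interval to [a, b] before applying the formula
    c, d = max(c, a), min(d, b)
    return (d - c) / (b - a)

print(uniform_prob(15, 30))   # (30-15)/(40-10) = 0.5
print(uniform_prob(20, 40))   # P(X > 20) = (40-20)/(40-10) ≈ 0.667
```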
The standard uniform density has parameters a = 0 and b = 1, so the PDF of the standard uniform
density is f(x) = 1 for 0 ≤ x ≤ 1.
Gamma Distribution
Gamma(λ, r) or Gamma(α, β). Continuous. In the same Poisson process as for the exponential
distribution, the gamma distribution gives the time to the r-th event. Thus, Exponential(λ) =
Gamma(λ, 1). The gamma distribution also has applications when r is not an integer; for that
generality the factorial function is replaced by the gamma function, Γ(x), described above. There
are two common parameterizations; the connection is α = r and β = 1/λ, where 1/λ is the
expected time to the first event in a Poisson process.
For Gamma(λ, r), the density is
f(x) = (1/Γ(r)) λ^r x^(r−1) e^(−λx) = x^(α−1) e^(−x/β) / (β^α Γ(α)), for x ∈ [0, ∞),
with mean µ = r/λ = αβ and variance σ² = r/λ² = αβ². Gamma distributions arise as
conjugate priors for the distributions in the Poisson process. In Gamma(α, β), α counts the
number of events and β sets the time scale.
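The gamma moment formulas can be checked numerically. The sketch below (an illustrative addition with assumed parameters α = 3 and β = 2) samples from the distribution and compares the sample moments with αβ and αβ²:

```python
# Numerical check (assumed parameters α=3, β=2): for Gamma(α, β),
# the mean is αβ and the variance is αβ²
# (equivalently r/λ and r/λ² with α = r, β = 1/λ).
import numpy as np

alpha, beta = 3, 2.0
rng = np.random.default_rng(42)
sample = rng.gamma(shape=alpha, scale=beta, size=200_000)

print(sample.mean())    # close to αβ = 6
print(sample.var())     # close to αβ² = 12
```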
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 8
print(df)
Its output is as follows −
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415
Using reindexing, we have created a DataFrame with missing values. In the
output, NaN means Not a Number.
Check for Missing Values
To make detecting missing values easier (and across different array dtypes), Pandas provides
the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −
Example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
Its output is as follows −
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The fillna function can “fill
in” NA values with non-null data in a couple of ways, which we have illustrated in the following
sections.
Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Its output is as follows −
one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580
Method              Action
pad / ffill         Fill values forward
bfill / backfill    Fill values backward
print(df.fillna(method='pad'))
Its output is as follows −
one two three
a 0.077988 0.476149 0.965836
b 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
Drop Missing Values
If you want to simply exclude the missing values, then use the dropna function along with
the axis argument. By default, axis=0, i.e., along row, which means that if any value within a
row is NA then the whole row is excluded.
Example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
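With axis=1, dropna excludes columns instead of rows. A short sketch (with assumed values, not from the manual):

```python
# Sketch (assumed values): dropna(axis=1) drops *columns* containing NaN
# instead of rows.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                    'two': [4.0, 5.0, 6.0],
                    'three': [np.nan, 8.0, 9.0]})

print(df2.dropna(axis=1))   # only 'two' survives; it has no missing values
```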
Observations: Thus students are able to understand the concept of data cleaning.
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 9
\(\displaystyle Var(X)=\frac{1}{n}\sum_{i=1}^n(x_i-\overline{x})^2\)
numpy.var(a, axis=None, dtype=None, ddof=0)
ddof : int, optional. ddof stands for delta degrees of freedom; it is the divisor used in the
calculation, which is N − ddof, where N is the number of elements. The default value of ddof
is 0.
>>> import numpy as np
>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a = np.var(B,axis=0)
>>> b = np.var(B,axis=1)
>>> print(a)
[ 13.98979592 12.8877551 6.12244898 3.92857143]
>>> print(b)
[ 6.546875 5.921875 8.796875 7.546875 2.875 16.5 19.0625 ]
1. Skewness refers to whether the distribution has a longer tail on one side or the other, or has
left-right symmetry. There have been different skewness coefficients proposed over the years.
The most common way to calculate it is by taking the mean of the cubes of the differences of each
point from the mean and then dividing it by the cube of the standard deviation. This gives
a coefficient that is independent of the units of the observations.
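This calculation can be written out directly (an illustrative sketch with an assumed small sample, mirroring the formula described above):

```python
# Worked sketch (assumed sample data): skewness as the mean of cubed
# deviations divided by the cube of the standard deviation.
import numpy as np

data = np.array([2, 8, 0, 4, 1, 9, 9, 0])
m = data.mean()
skew = ((data - m) ** 3).mean() / data.std() ** 3
print(skew)   # positive, so the right tail is longer
```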
2. Kurtosis quantifies whether the shape of a distribution matches that of the normal
distribution, i.e., how heavy its tails are relative to a Gaussian.
scipy.stats.kurtosis(a, axis=0, fisher=True)
fisher: if True, Fisher's definition is used; if False, Pearson's definition is used. Default is True.
>>> import numpy as np
>>> import scipy.stats
>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],
[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a=scipy.stats.kurtosis(B,axis=0,fisher=False) #Pearson Kurtosis
>>> b=scipy.stats.kurtosis(B,axis=1) #Fisher's Kurtosis
>>> print(a,b)
[ 2.19732518 1.6138826 2.516175 2.30595041] [-1.11918934 -1.25539366
-0.86157952 -1.24277613 -1.30245747 -1.22038567 -1.46061811]
3. Percentiles and quartiles with Python
numpy.percentile(a, q, axis=None, interpolation='linear')
a: array containing numbers whose percentile is required
q: percentile to compute (must be between 0 and 100)
axis: axis or axes along which the percentile is computed; the default is to compute the
percentile of the flattened array
interpolation: it can take the values 'linear', 'lower', 'higher', 'midpoint' or 'nearest'.
This parameter specifies the method which is to be used when the desired quartile lies
between two data points, say i and j.
linear: returns i + (j-i)*fraction, fraction here is the fractional part of the index surrounded by i
and j
lower: returns i
higher: returns j
midpoint: returns (i+j)/2
nearest: returns the nearest point whether i or j
numpy.percentile() agrees with the manual calculation of percentiles (as shown above) only
when interpolation is set as ‘lower’.
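A short sketch of both behaviors (an illustrative addition with an assumed array; note that recent NumPy versions rename the `interpolation` parameter to `method`):

```python
# Sketch (assumed array): quartiles via np.percentile; recent NumPy
# versions name the parameter `method` instead of `interpolation`.
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(np.percentile(a, 25))                   # linear (default): 3.25
print(np.percentile(a, 25, method='lower'))   # returns an actual data point: 3
```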
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 10
Splitting Data into Groups
Splitting is a process in which we divide the data into groups by applying certain conditions
to the dataset. To split the data, we use the groupby() function, which splits the data into
groups based on some criteria. Pandas objects can be split on any of their axes. The abstract
definition of grouping is to provide a mapping of labels to group names. There are multiple
ways to split data, like:
obj.groupby(key)
obj.groupby(key, axis=1)
obj.groupby([key1, key2])
Note: here we refer to the grouping objects as the keys.
Grouping data with one key:
In order to group data with one key, we pass only one key as an argument to the groupby() function.
Now we group the data by Name using the groupby() function.
Now we print the first entries in all the groups formed.
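The single-key code block is not reproduced in this copy; here is a hedged sketch using employee data like the sample defined later in this assignment:

```python
# Hedged sketch of the missing code block: grouping by a single key
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

grp = df.groupby('Name')   # split the rows into groups by Name
print(grp.groups)          # mapping of each group key to its row labels
print(grp.first())         # first entry of every group
```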
Grouping data with multiple keys :
In order to group data with multiple keys, we pass multiple keys to the groupby() function.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age':[27, 24, 22, 32,
33, 36, 27, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd',
'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df)
Now we group the data by "Name" and "Qualification" together, using multiple keys in the
groupby() function.
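The multi-key code block is missing here as well; a hedged sketch using the same employee data:

```python
# Hedged sketch of the missing code block: grouping the employee
# DataFrame by two keys at once.
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                                     'B.Tech', 'B.com', 'Msc', 'MA']})

grp = df.groupby(['Name', 'Qualification'])
print(grp.groups)   # each group key is a (Name, Qualification) tuple
```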
Grouping data by sorting keys :
Group keys are sorted by default in the groupby operation. The user can pass sort=False for
potential speedups.
Now we apply groupby() with sort=False in order to attain potential speedups.
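A hedged sketch of the sort behavior (the manual's code is not reproduced in this copy):

```python
# Hedged sketch: sort=False keeps groups in order of first appearance
# rather than sorting the keys, which can be faster on large data.
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi'],
                   'Age': [27, 24, 22, 32]})

print(df.groupby('Name').sum())              # keys sorted: Anuj, Jai, Princi
print(df.groupby('Name', sort=False).sum())  # appearance order: Jai, Anuj, Princi
```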
Grouping data with object attributes :
The groups attribute is like a dictionary whose keys are the computed unique groups and whose
corresponding values are the axis labels belonging to each group.
Now we access the groups like we would a dictionary, using its keys.
Iterating through groups
In order to iterate over the groups, we can loop through the GroupBy object, similar to
iterating over an itertools object.
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df)
# iterating an element
# of group
grp = df.groupby('Name')
for name, group in grp:
    print(name)
    print(group)
    print()
Now we iterate over the groups formed with multiple keys.
# iterating an element
# of group containing
# multiple keys
grp = df.groupby(['Name', 'Qualification'])
for name, group in grp:
    print(name)
    print(group)
    print()
As shown in the output, the group name will be a tuple.
Selecting a group
In order to select a single group, we use GroupBy.get_group(); this function selects one group
by its key.
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age':[27, 24, 22, 32,
                33, 36, 27, 32],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                    'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                          'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df)
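The get_group call itself is missing from this copy; a self-contained hedged sketch:

```python
# Hedged sketch of the missing get_group call: selecting the rows of a
# single group by its key.
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi'],
                   'Age': [27, 24, 22, 32]})

grp = df.groupby('Name')
print(grp.get_group('Jai'))   # only the rows where Name == 'Jai'
```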
Observations: Thus students are able to implement grouping with groupby, a useful concept in
data science.
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 11
Theory:
Methodology
The runs test is a hypothesis-testing methodology that is widely used in statistical analysis to
test whether a set of values was generated randomly or not. Being a hypothesis test, it comes
with a pair of hypotheses: the null hypothesis that the sequence is random, and the alternative
that it is not.
Test Data
We will be using the data provided by NIST.
The data come from the NIST/SEMATECH e-Handbook, Section 1.4.2.5.1 "Background and Data",
available as a text file at www.itl.nist.gov.
Octave/Matlab Implementation and Results
Using online Octave:
Upload text file
Import data to array
Load ‘statistics’ package
Run runstest
Python Implementation
Implement the test using Python.
Running the code produces the results below (calculated Z score, critical Z score at 95% confidence):
(2.8355606218883844, 1.6448536269514722)
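The manual's implementation is not included in this copy. The following self-contained sketch implements the above/below-median runs test and computes the Z score; the NIST data file is not reproduced, so a small assumed sequence is used (it therefore does not reproduce the Z value shown above):

```python
# Self-contained sketch of the runs test (above/below the median).
# The NIST data file is not included, so an assumed sequence is used.
from math import sqrt
from statistics import median

def runs_test(x):
    med = median(x)
    signs = [v > med for v in x if v != med]   # drop values equal to the median
    n1 = sum(signs)                            # observations above the median
    n2 = len(signs) - n1                       # observations below the median
    # number of runs = 1 + number of sign changes
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / sqrt(variance)  # Z score

data = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]   # perfectly alternating: too many runs
z = runs_test(data)
print(z, 1.6448536269514722)            # compare |Z| with the 95% critical value
```

Since |Z| exceeds 1.6449, the alternating sequence is (as expected) rejected as random.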
D. Y. Patil College of Engineering and Technology, Kolhapur
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)
Assignment No.: 12
Now, we will learn in a stepwise manner how to create a dashboard in Tableau Desktop.
Selecting the New Dashboard option or clicking on the Dashboard icon will open a new
window named Dashboard 1. You can change the name of the dashboard as per your liking.
2. Dashboard pane
In the window where we can create our dashboard, we get a lot of tabs and options related to
dashboarding. On the left, we have a Dashboard pane which shows the dashboard size, list of
available sheets in a workbook, objects, etc.
From the Dashboard tab, we can set the size of our dashboard. We can enter custom dimensions
like the width and height of the dashboard as per our requirements.
Or, you can select from a list of available fixed dashboard sizes as shown in the screenshot
below.
3. Layout pane
Right next to the Dashboard pane is the Layout pane where we can enhance the appearance and
layout of the dashboard by setting the position, size, border, background, and paddings.
4. Adding a sheet
Now, we’ll add a sheet onto our empty dashboard. To add a sheet, drag and drop a sheet from
the Sheets column present in the Dashboard tab. It will display all the visualizations we have on
that sheet on our dashboard. If you wish to change or adjust the size and place of the
visual/chart/graph, click on the graph then click on the small downward arrow given at the
right. A drop-down list appears having the option Floating, select it. This will unfix your chart
from one position so that you can adjust it as per your liking.
Have a look at the picture below to see how you can drag a sheet or visual around on the
dashboard and adjust its size.
5. Adding more sheets
In a similar way, we can add as many sheets as we require and arrange them on the dashboard
properly.
Also, you can apply the filter or selections on one graph and treat it like a filter for all the other
visuals on the dashboard. To add a filter to a dashboard in Tableau, select Use as Filter option
given on the right of every visual.
Then on the selected visual, we make selections. For instance, we select the data point
corresponding to New Jersey in the heat map shown below. As soon as we select it, all the rest of
the graphs and charts change their information and make it relevant to New Jersey. Notice in
the Region section, the only region left is East which is where New Jersey is located.
6. Adding objects
Another set of tools that we get to make our dashboard more interactive and dynamic is in
the Objects section. We can add a wide variety of objects such as a web page, button, text box,
extension, etc.
From the objects pane, we can add a button and also select the action of that button, that is, what
that button should do when you click on it. Select the Edit Button option to explore the options
you can select from for a button object.
For instance, we add a web page of our DataFlair official site as shown in the screenshot below.
7. Final dashboard
Now, we move towards making a final dashboard in Tableau with all its elements in place. As
you can see in the screenshot below, we have three main visualizations on our current dashboard
i.e. a segmented area chart, scatter chart and a line chart showing the sales and profits forecast.
On the right pane, we have the list of legends showing Sub-category names, a forecast indicator
and a list of clusters.
We can add filters on this dashboard by clicking on a visual. For instance, we want to add a filter
based on months on the scatter plot showing sales values for different clusters. To add a months
filter, we click on the small downward arrow and then select Filters option. Then we
select Months of Order Date option. You can select any field based on which you wish to
create a new filter.
This will give us a slider filter to select a range of months for which we want to see our data.
You can adjust the position of the filter box and drag and drop it at whichever place you want.
You can make more changes to the filter by right-clicking on it. Also, you can change the type
of filter from the drop-down menu such as Relative Date, Range of Date, Start Date, End Date,
Browse Periods, etc.
Similarly, you can add and edit more filters on the dashboard.
8. Presentation mode
Once our dashboard is ready, we can view it in the Presentation Mode. To enable the
presentation mode, click on the icon present on the bar at the top as shown in the screenshot
below or press F7.
This opens our dashboard in the presentation mode. So far we were working in the Edit Mode. In
the presentation mode, it neatly shows all the visuals and objects that we have added on the
dashboard. We can see how the dashboard will look when we finally present it to others or share
it with other people for analysis.
Here, we can also apply the filter range to our data. The dashboard is interactive and will change
the data according to the filters we apply or selections we make.
For instance, we selected the brand Pixel from our list of items from the sub-category field. This
instantly changes the information on the visuals and makes it relevant to only Pixel.
9. Share workbook with others
We can also share all the worksheets and dashboard that we create together as a workbook with
other users. To share the workbook with others, click on the share icon (highlighted in red).
Next, you need to enter the server address of a Tableau server.
Note – You must have a Tableau Online or Tableau Server account in order to do this.
Observations: Thus students are able to create a simple dashboard using Tableau.