
D. Y.

Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year-2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS(A)

Assignment No.: 1

Title: Program to get statistical characteristics of dataset using pandas


Theory: A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:
 DataFrame − “index” (axis=0, default), “columns” (axis=1)
Let us create a DataFrame and use this object throughout this assignment for all the operations.
Example
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)
Its output is as follows −
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
7 34 Lee 3.78
8 40 David 2.98
9 30 Gasper 4.80
10 51 Betina 4.10
11 46 Andres 3.65

sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())
Its output is as follows −
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
Each column is aggregated individually (strings are concatenated).
axis=1
Passing axis=1 sums each row across the numeric columns, as shown below.
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum(axis=1, numeric_only=True))
Its output is as follows −
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Returns the average value
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.mean(numeric_only=True))
Its output is as follows −
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the Bessel-corrected (sample) standard deviation of the numerical columns.
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.std(numeric_only=True))
Its output is as follows −
Age 9.232682
Rating 0.661628
dtype: float64

Functions & Description
Let us now understand the functions under Descriptive Statistics in Python Pandas. The
following table lists the important functions −

Sr.No. Function Description

1 count() Number of non-null observations

2 sum() Sum of values

3 mean() Mean of Values

4 median() Median of Values

5 mode() Mode of values

6 std() Standard Deviation of the Values

7 min() Minimum Value

8 max() Maximum Value

9 abs() Absolute Value

10 prod() Product of Values

11 cumsum() Cumulative Sum

12 cumprod() Cumulative Product

Note − Since a DataFrame is a heterogeneous data structure, generic operations do not work with
all functions.
 Functions like sum() and cumsum() work with both numeric and character (or string) data
elements without any error. Though in practice character aggregations are rarely used,
these functions do not throw any exception.
 Functions like abs() and cumprod() throw an exception when the DataFrame contains character
or string data, because such operations cannot be performed.
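A minimal sketch showing a few more of these functions on the same DataFrame (the selection of functions here is only illustrative):

import pandas as pd

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)

# median and mode of the numeric columns
print(df[['Age', 'Rating']].median())
print(df['Age'].mode())

# cumulative sum of the Age column
print(df['Age'].cumsum())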
Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Its output is as follows −
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function gives the count, mean, std, min, max and quartile (25%, 50%, 75%) values. By
default it excludes the character columns and gives a summary of the numeric columns.
'include' is the argument used to specify which columns need to be considered for summarizing.
It takes a list of values; by default, 'number'.

 object − Summarizes String columns


 number − Summarizes Numeric columns
 all − Summarizes all columns together (Should not pass it as a list value)
Now, use the following statement in the program and check the output −
import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))
Its output is as follows −
Name
count 12
unique 12
top Ricky
freq 1
Now, use the following statement and check the output −

import pandas as pd
import numpy as np

#Create a Dictionary of series


d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))
Its output is as follows −
Age Name Rating
count 12.000000 12 12.000000
unique NaN 12 NaN
top NaN Ricky NaN
freq NaN 1 NaN
mean 31.833333 NaN 3.743333
std 9.232682 NaN 0.661628
min 23.000000 NaN 2.560000
25% 25.000000 NaN 3.230000
50% 29.500000 NaN 3.790000
75% 35.500000 NaN 4.132500
max 51.000000 NaN 4.800000

Observations: Thus students are able to write a program to get statistical characteristics of
dataset using pandas.

Prepared by- DR. Mrs J. N. Jadhav. Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year-2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: T.Y.B.Tech DS(A)

Assignment No.: 2

Title: Programs for analysis of data through different plots (scatter, bubble, area, stacked)
and charts (line, bar, table, pie, histogram)

 Plots

1. Scatter Plot

It is a type of plot using Cartesian coordinates to display values for two variables for a set of

data. It is displayed as a collection of points. Their position on the horizontal axis determines the

value of one variable. The position on the vertical axis determines the value of the other variable.

A scatter plot can be used when one variable can be controlled and the other variable depends on

it. It can also be used when both continuous variables are independent.

Visual:

Plotly code:

import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length")
fig.show()

Seaborn code:

import seaborn as sns


tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip")

According to the correlation of the data points, scatter plots are grouped into different types.

These correlation types are listed below

Positive Correlation

In these types of plots, an increase in the independent variable indicates an increase in the

variable that depends on it. A scatter plot can have a high or low positive correlation.

Negative Correlation

In these types of plots, an increase in the independent variable indicates a decrease in the

variable that depends on it. A scatter plot can have a high or low negative correlation.

No Correlation

Two groups of data visualized on a scatter plot are said to not correlate if there is no clear

correlation between them.

2. Bubble Chart
A bubble chart displays three attributes of data. They are represented by x location, y location,
and size of the bubble.

Visualization:

Plotly code:

import plotly.express as px
df = px.data.gapminder()

fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp",


size="pop", color="continent",
hover_name="country", log_x=True, size_max=60)
fig.show()

Seaborn code:

import matplotlib.pyplot as plt


import seaborn as sns
from gapminder import gapminder # import data set

data = gapminder.loc[gapminder.year == 2007]

b = sns.scatterplot(data=data, x="gdpPercap", y="lifeExp", size="pop",
                    legend=False, sizes=(20, 2000))

b.set(xscale="log")

plt.show()

Bubble charts are categorized into different types based on the number of variables in the dataset, the type of data visualized, and the number of dimensions involved.

Simple Bubble Chart

It is the basic type of bubble chart and is equivalent to the normal bubble chart.

Labelled Bubble Chart

The bubbles on this bubble chart are labelled for easy identification. This is to deal with different

groups of data.

Multivariable Bubble Chart

This chart has four dataset variables. The fourth variable is distinguished with a different colour.

Map Bubble Chart

It is used to illustrate data on a map.

3D Bubble Chart

This is a bubble chart designed in a 3-dimensional space. The bubbles here are spherical.

3. Area Chart

It is represented by the area between the lines and the axis. The area is proportional to the

amount it represents.

These are types of area charts:

Simple area Chart

In this chart, the coloured segments overlap each other; they are placed above one another.

Stacked Area Chart

In this chart, the coloured segments are stacked on top of one another. Thus they do not intersect.

100% Stacked area Chart

In this chart, the area occupied by each group of data is measured as a percentage of its amount

from the total data. Usually, the vertical axis totals a hundred per cent.

3-D Area Chart

This chart is drawn in a 3-dimensional space.

We will look at visual representation and code for the most common type below.

Visual:

Plotly:

import plotly.express as px
df = px.data.gapminder()
fig = px.area(df, x="year", y="pop", color="continent",
line_group="country")
fig.show()

Seaborn:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

df = pd.DataFrame({'period': [1, 2, 3, 4, 5, 6, 7, 8],
                   'team_A': [20, 12, 15, 14, 19, 23, 25, 29],
                   'team_B': [5, 7, 7, 9, 12, 9, 9, 4],
                   'team_C': [11, 8, 10, 6, 6, 5, 9, 12]})

plt.stackplot(df.period, df.team_A, df.team_B, df.team_C)
plt.show()

4. Stacked Bar Graph

The stacked bar graphs are used to show dataset subgroups. However, the bars are stacked on top

of each other. Here is an illustration:

Here is a code snippet on how to do it in plotly:

import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()

Seaborn code snippet:

import pandas
import matplotlib.pylab as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
df = pandas.DataFrame(dict(
    number=[2, 5, 1, 6, 3],
    count=[56, 21, 34, 36, 12],
    select=[29, 13, 17, 21, 8]
))
bar_plot1 = sns.barplot(x='number', y='count', data=df, label="count", color="red")
bar_plot2 = sns.barplot(x='number', y='select', data=df, label="select", color="green")
plt.legend(ncol=2, loc="upper right", frameon=True)
plt.show()
 Charts:

Line Graph

It displays a sequence of data points as markers. The points are ordered typically by their x-axis

value. These points are joined with straight line segments. A line graph is used to visualize a

trend in data over intervals of time.

The following is an illustration of Canadian life expectancy by years in Line Graph.

Here is how to do it in plotly:

import plotly.express as px
df = px.data.gapminder().query("country=='Canada'")
fig = px.line(df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()

Here is how to do it in seaborn:

import seaborn as sns
sns.lineplot(data=df, x="year", y="lifeExp")

Here are types of line graphs:

Simple Line Graph

A simple line graph plots only one line on the graph. One of the axes defines the independent

variable. The other axis contains a variable that depends on it.

Multiple Line Graph

Multiple line graphs contain more than one line. They represent multiple variables in a dataset.

This type of graph can be used to study more than one variable over the same period.

It can be drawn in plotly as:

import plotly.express as px
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.line(df, x='year', y='lifeExp', color='country', symbol="country")
fig.show()

Here is the illustration:

In seaborn as:

import seaborn as sns


sns.lineplot(data=df, x='year', y='lifeExp', hue='country')

Here is the illustration:

Compound Line Graph

It is an extension of a simple line graph. It is used when dealing with different groups of data

from a larger dataset. Its every line graph is shaded downwards to the x-axis. It has each group

stacked upon one another.

Here is an illustration:
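Plotly has no dedicated compound line graph function; one way to sketch it (an assumption, not the only approach) is a stacked area chart over the same Oceania subset used above:

import plotly.express as px

df = px.data.gapminder().query("continent == 'Oceania'")
# each country's line is shaded down to the x-axis and stacked on the previous one
fig = px.area(df, x="year", y="lifeExp", color="country", line_group="country")
fig.show()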

 

Bar Graph

A bar graph is a graph that presents categorical data with rectangle-shaped bars. The heights or

lengths of these bars are proportional to the values that they represent. The bars can be vertical or

horizontal. A vertical bar graph is sometimes called a column graph.

Following is an illustration of a bar graph indicating the population in Canada by years.

Following is the code indicating how to do it in plotly.

import plotly.express as px
data_canada = px.data.gapminder().query("country == 'Canada'")
fig = px.bar(data_canada, x='year', y='pop')
fig.show()

Following is the representational code of doing it in seaborn.

import seaborn as sns


sns.set_theme(style="whitegrid")
ax = sns.barplot(x="year", y="pop", data=data_canada)

This is how it looks:

The following are types of bar graphs:

Grouped Bar Graph

Grouped bar graphs are used when the datasets have subgroups that need to be visualized on the

graph. The subgroups are differentiated by distinct colours. Here is an illustration of such a

graph:

Here is a code snippet on how to do it in plotly:

import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color="time", barmode="group")
fig.show()

Here is a code snippet on how to do it in seaborn:

import seaborn as sb
df = sb.load_dataset('tips')
df = df.groupby(['size', 'sex']).agg(mean_total_bill=("total_bill", 'mean'))
df = df.reset_index()
sb.barplot(x="size", y="mean_total_bill", hue="sex", data=df)
Stacked Bar Graph

The stacked bar graphs are used to show dataset subgroups, with the bars stacked on top of each other. The Plotly and Seaborn code snippets shown earlier under Plots apply here as well.
Segmented Bar Graph

This is the type of stacked bar graph where each stacked bar shows the percentage of its discrete

value from the total value. The total percentage is 100%. Here is an illustration:
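A segmented bar can be sketched in Plotly by normalising the stacked bars to percentages; this sketch assumes the layout attribute barnorm:

import plotly.express as px

df = px.data.tips()
fig = px.bar(df, x="sex", y="total_bill", color="time")
# normalise each stacked bar so that its segments sum to 100%
fig.update_layout(barnorm="percent")
fig.show()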

 Pie Chart

A pie chart is a circular statistical graphic. To illustrate numerical proportion, it is divided into

slices. In a pie chart, for every slice, each of its arc lengths is proportional to the amount it

represents. The central angle and area of each slice are also proportional. It is named after a sliced pie.

Here is how to do it in plotly:

import plotly.express as px
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")

df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
fig = px.pie(df, values='pop', names='country', title='Population of European continent')
fig.show()

And here is how it looks:

Seaborn doesn’t have a default function to create pie charts, but the following syntax in

matplotlib can be used to create a pie chart and add a seaborn color palette:

import matplotlib.pyplot as plt


import seaborn as sns

data = [15, 25, 25, 30, 5]


labels = ['Group 1', 'Group 2', 'Group 3', 'Group 4', 'Group 5']

colors = sns.color_palette('pastel')[0:5]

plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')


plt.show()

This is how it looks:

These are types of pie charts:

Simple Pie Chart

This is the basic type of pie chart. It is often called just a pie chart.

Exploded Pie Chart

One or more sectors of the chart are separated (termed as exploded) from the chart in an

exploded pie chart. It is used to emphasize a particular element in the data set.

This is a way to do it in plotly:

import plotly.graph_objects as go

labels = ['Oxygen','Hydrogen','Carbon_Dioxide','Nitrogen']
values = [4500, 2500, 1053, 500]

# pull is given as a fraction of the pie radius


fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0, 0.2, 0])])
fig.show()

And this is how it looks:

In seaborn the explode attribute of the pie method in matplotlib can be used as:

import matplotlib.pyplot as plt


import seaborn as sns

data = [15, 25, 25, 30, 5]


labels = ['Group 1', 'Group 2', 'Group 3', 'Group 4', 'Group 5']

colors = sns.color_palette('pastel')[0:5]

plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%', explode = [0, 0, 0, 0.2, 0])
plt.show()
Donut Chart

In this pie chart, there is a hole in the centre. The hole makes it look like a donut from which it

derives its name.

The way to do it in plotly is:

import plotly.graph_objects as go

labels = ['Oxygen','Hydrogen','Carbon_Dioxide','Nitrogen']
values = [4500, 2500, 1053, 500]

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.show()

And this is how it looks:

This is how it can be done with matplotlib (seaborn itself has no donut function):

import numpy as np
import matplotlib.pyplot as plt
data = np.random.randint(20, 100, 6)
plt.pie(data)
circle = plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(circle)
plt.show()
Pie of Pie

A pie of pie is a chart that generates an entirely new pie chart detailing a small sector of the

existing pie chart. It can be used to reduce the clutter and emphasize a particular group of

elements.

Here is an illustration:

 

Bar of Pie

This is similar to the pie of pie, except that a bar chart is what is generated.

Here is an illustration:

3D Pie Chart

This is a pie chart that is represented in a 3-dimensional space. Here is an illustration:

In matplotlib, the shadow attribute can be set to True to give the pie a pseudo-3D appearance.

import matplotlib.pyplot as plt


labels = ['Python', 'C++', 'Ruby', 'Java']
sizes = [215, 130, 245, 210]
# Plot
plt.pie(sizes, labels=labels,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Histogram

A histogram is an approximate representation of the distribution of numerical data. The data is

divided into non-overlapping intervals called bins or buckets. A rectangle is erected over a bin

whose height is proportional to the number of data points in the bin. Histograms give a feel of

the density of the distribution of the underlying data.

Here is a visual:

Plotly code:

import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()

Seaborn code:

import seaborn as sns


penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins, x="flipper_length_mm")

Histograms are classified into different types depending on the shape of the distribution, as below:

Normal Distribution

This chart is usually bell-shaped.

Bimodal Distribution

In this histogram, there are two groups of histogram charts that are of normal distribution. It is a

result of combining two variables in a dataset.

Visualization:

Plotly code:

import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill", y="tip", color="sex", marginal="rug",
hover_data=df.columns)
fig.show()

Seaborn:

import seaborn as sns


iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
Skewed Distribution

This is an asymmetric graph with an off-centre peak. The peak tends towards the beginning or

end of the graph. A histogram can be said to be right or left-skewed depending on the direction

where the peak tends towards.

Random Distribution

This histogram does not have a regular pattern. It produces multiple peaks. It can be called a

multimodal distribution.

Edge Peak Distribution
This distribution is similar to that of a normal distribution, except for a large peak at one of its
ends.
Comb Distribution

The comb distribution is like a comb. The height of rectangle-shaped bars is alternately tall and

short.

Observations: Thus students are able to analyze data through different plots and charts.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academic

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 3

Title: Implementation of data transformation-reshaping and deduplication of data


Theory:

Data transformation is the process of extracting good, reliable data from source systems. It
involves converting data from one structure (or no structure) to another so you can integrate it
with a data warehouse or with different applications. It allows you to expose the information to
advanced business intelligence tools to create valuable performance reports and forecast future
trends.

Data transformation includes two primary stages: understanding and mapping the data; and
transforming the data.

The best way to select, implement and integrate data deduplication can vary depending on how
the deduplication is performed. Here are some general principles that you can follow in selecting
the right deduplicating approach and then integrating it into your environment.

Step 1: Assess your backup environment

What deduplication ratio a company achieves will depend heavily on the following factors:

 Type of data

 Change rate of the data

 Amount of redundant data

 Type of backup performed (full, incremental or differential)

 Retention length of the archived or backup data

The challenge most companies have is quickly and effectively gathering this data. Agentless data
gathering and information classification tools from Aptare Inc., Asigra Inc., Bocada
Inc. and Kazeon Systems Inc. can assist in performing these assessments while requiring
minimal or no changes to your servers in the form of agent deployments.

Step 2: Establish how much you can change your backup environment

Deploying backup software that uses software agents will require installing agents on each server
or virtual machine and doing server reboots after it's installed. This approach generally results in
faster backup times and higher deduplication ratios than using a data deduplication appliance.
However, it can take more time and require many changes to a company's backup environment.
Using a data deduplication appliance typically requires no changes to servers, though a company
will need to tune its backup software according to if the appliance is configured as a file server or
a virtual tape library (VTL).

Step 3: Purchase a scalable storage architecture

The amount of data that a company initially plans to back up and what it actually ends up
backing up are usually two very different numbers. A company usually finds deduplication so
effective when it starts using it in its backup process that it quickly scales its use and deployment
beyond initial intentions, so you should confirm that deduplicating hardware appliances can scale
both performance and capacity. You should also verify that the hardware and software
deduplication products can provide global deduplication and replication features to maximize
duplication's benefits throughout the enterprise, facilitate technology refreshes and/or capacity
growth, and efficiently bring in deduplicated data from remote offices.

Step 4: Check the level of integration between backup software and hardware appliances

The level of integration that a hardware appliance has with backup software (or vice versa) can
expedite backups and recoveries. For example, ExaGrid Systems Inc. ExaGrid appliances
recognize backup streams from CA ARCserve and can better deduplicate data from that backup
software than streams from backup software that it doesn't recognize. Enterprise backup software
is also starting to better manage disk storage systems so data can be placed on different disk storage systems with different tiers of disk, so they can back up and recover data more quickly
short term and then more cost-effectively store it long term.

Step 5: Perform the first backup

The first backup using agent-based deduplication software can potentially be a harrowing
experience. It can create a significant amount of overhead on the server and take much longer
than normal to complete because it needs to deduplicate all of the data. However, once the first
backup is complete, it only needs to back up and deduplicate changed data going forward. Using
a hardware appliance, the experience tends to be the opposite. The first backup may occur
quickly but backups may slow over time depending on how scalable the hardware appliance is,
how much data is changing and how much data growth the company is experiencing.
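Since the assignment itself asks for reshaping and deduplication of data, a minimal pandas sketch (with hypothetical column names chosen only for illustration) could look like this:

import pandas as pd

# hypothetical sales records containing one exact duplicate row
df = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West', 'East'],
    'quarter': ['Q1',   'Q2',   'Q1',   'Q2',   'Q2'],
    'sales':   [100,    120,     90,    110,    120],
})

# deduplication: drop rows that are exact duplicates
deduped = df.drop_duplicates()

# reshaping: pivot the long data into a region x quarter table ...
wide = deduped.pivot(index='region', columns='quarter', values='sales')
# ... and melt it back into long form
long_again = wide.reset_index().melt(id_vars='region', value_name='sales')

print(deduped)
print(wide)
print(long_again)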

Observations: Thus students are able to implement data transformation-reshaping and


deduplication of data

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 4

Title: Implementation of data transformation - handling missing data, filling missing data


Theory:

1) Mode Imputation

The rationale for mode imputation is to replace the missing values with the most frequent value, since this is the most likely occurrence.
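A minimal pandas sketch of this idea (hypothetical data, chosen only for illustration):

import pandas as pd
import numpy as np

s = pd.Series(['red', 'blue', np.nan, 'red', np.nan, 'red'])
# replace missing values with the most frequent value
print(s.fillna(s.mode()[0]))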


2) Last Observation Carried Forward (LOCF)

If data is time-series data, one of the most widely used imputation methods is the last observation
carried forward (LOCF). Whenever a value is missing, it is replaced with the last observed value.
This method is advantageous as it is easy to understand and communicate. Although simple, this
method strongly assumes that the value of the outcome remains unchanged by the missing data,
which seems unlikely in many settings.
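In pandas, LOCF corresponds to a forward fill; a minimal sketch on a small, made-up time series:

import pandas as pd
import numpy as np

ts = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan],
               index=pd.date_range('2023-01-01', periods=5))
# carry the last observed value forward
print(ts.ffill())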


3) Next Observation Carried Backward (NOCB)

A similar approach like LOCF works oppositely by taking the first observation after the missing
value and carrying it backward (“next observation carried backwards”, or NOCB).
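In pandas, NOCB corresponds to a backward fill; a minimal sketch:

import pandas as pd
import numpy as np

ts = pd.Series([np.nan, 12.0, np.nan, np.nan, 15.0])
# carry the next observed value backwards
print(ts.bfill())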


4) Linear Interpolation

Interpolation is a mathematical method that adjusts a function to data and uses this function to
extrapolate the missing data. The simplest type of interpolation is linear interpolation, which means drawing a straight line between the value before the missing data and the value after it. Of course, we could have a pretty complex pattern in the data, and linear interpolation might not be enough. There are several
different types of interpolation. Just in Pandas, we have the following options like: ‘linear’,
‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘spline’,
‘piecewise polynomial’ and many more.
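A minimal pandas sketch of linear interpolation (made-up values):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
# straight-line interpolation between the known neighbours;
# other methods such as 'quadratic' or 'spline' require SciPy
print(s.interpolate(method='linear'))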


5) Common-Point Imputation

For a rating scale, using the middle point or most commonly chosen value. For example, on a
five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). It is
similar to the mean value but more suitable for ordinal values.

6) Adding a category to capture NA

This is perhaps the most widely used method of missing data imputation for categorical variables. This method consists of treating missing data as an additional label or category of the variable. All the missing observations are grouped in the newly created label ‘Missing’. It does not assume anything about the missingness of the values. It is very well suited when the number of missing values is high.
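A minimal pandas sketch (hypothetical city names):

import pandas as pd
import numpy as np

city = pd.Series(['Pune', np.nan, 'Mumbai', np.nan, 'Kolhapur'])
# treat missingness as its own category
print(city.fillna('Missing'))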


7) Frequent category imputation

Replacement of missing values by the most frequent category is the equivalent of mean/median
imputation. It consists of replacing all occurrences of missing values within a variable with the
variable's most frequent label or category.


8) Arbitrary Value Imputation

Arbitrary value imputation consists of replacing all occurrences of missing values within a
variable with an arbitrary value. Ideally, the arbitrary value should be different from the
median/mean/mode and not within the normal values of the variable. Typically used arbitrary
values are 0, 999, -999 (or other combinations of 9’s) or -1 (if the distribution is positive).
Sometimes data already contain an arbitrary value from the originator for the missing values. This
works reasonably well for numerical features predominantly positive in value and for tree-based
models in general. This used to be a more common method when the out-of-the-box machine
learning libraries and algorithms were not very adept at working with missing data.
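A minimal pandas sketch with an arbitrary value of -999 (the value itself is only an example):

import pandas as pd
import numpy as np

income = pd.Series([42000.0, np.nan, 55000.0, np.nan])
# fill with an arbitrary value well outside the normal range of the variable
print(income.fillna(-999))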


9) Adding a variable to capture NA

When data are not missing completely at random, we can capture the importance of missingness
by creating an additional variable indicating whether the data was missing for that observation (1)
or not (0). The additional variable is binary: it takes only the values 0 and 1, 0 indicating that a
value was present for that observation, and 1 indicating that the value was missing. Typically,
mean/median imputation is done to add a variable to capture those observations where the data
was missing.
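A minimal pandas sketch combining a missing-value indicator with mean imputation (hypothetical column name):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 31]})
# 1 where the value was missing, 0 where it was present
df['age_missing'] = df['age'].isnull().astype(int)
# then impute the original column, e.g. with the mean
df['age'] = df['age'].fillna(df['age'].mean())
print(df)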


10) Random Sampling Imputation

Random sampling imputation is in principle similar to mean/median imputation because it aims to


preserve the statistical parameters of the original variable, for which data is missing. Random
sampling consists of taking a random observation from the pool of available observations and
using that randomly extracted value to fill the NA. In Random Sampling, one takes as many
random observations as missing values are present in the variable. Random sample imputation
assumes that the data are missing completely at random (MCAR). If this is the case, it makes
sense to substitute the missing values with values extracted from the original variable distribution.
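A minimal pandas sketch of random sample imputation (made-up values; random_state fixed only for reproducibility):

import pandas as pd
import numpy as np

s = pd.Series([4.0, np.nan, 7.0, 9.0, np.nan, 5.0])
missing = s.isnull()
# draw as many random observed values as there are missing values
sampled = s.dropna().sample(n=missing.sum(), replace=True, random_state=0)
sampled.index = s[missing].index      # align the draws with the missing positions
s_imputed = s.copy()
s_imputed[missing] = sampled
print(s_imputed)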

Multiple Imputation

Multiple Imputation (MI) is a statistical technique for handling missing data. The key concept of
MI is to use the distribution of the observed data to estimate a set of plausible values for the
missing data. Random components are incorporated into these estimated values to show their
uncertainty. Multiple datasets are created and then analysed individually but identically; the resulting parameter estimates are then combined into a single set. The
benefit of the multiple imputations is that restoring the natural variability of the missing values
incorporates the uncertainty due to the missing data, which results in a valid statistical inference.
As a flexible way of handling more than one missing variable, apply a Multiple Imputation by
Chained Equations (MICE) approach. Refer to the reference section to get more information on
MI and MICE. Below is a schematic representation of MICE.
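One practical way to apply a MICE-style approach in Python is scikit-learn's experimental IterativeImputer; the sketch below is only illustrative, and the estimator and parameters are choices rather than part of the method itself:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [8.0, 8.0, np.nan]])

# each feature with missing values is modelled from the other features, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))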

Predictive/Statistical models that impute the missing data

This should be done in conjunction with some cross-validation scheme to avoid leakage. This can
be very effective and can help with the final model. There are many options for such a predictive
model, including a neural network. Here I am listing a few which are very popular.

Linear Regression

In regression imputation, the existing variables are used to predict the missing values, and the predicted value is then substituted as if it were an actually obtained value. This approach has several advantages because the
imputation retains a great deal of data over the listwise or pairwise deletion and avoids
significantly altering the standard deviation or the shape of the distribution. However, as in a

mean substitution, while a regression imputation substitutes a value predicted from other
variables, no novel information is added, while the sample size has been increased and the
standard error is reduced.

Random Forest

Random forest is a non-parametric imputation method applicable to various variable types that
work well with both data missing at random and not missing at random. Random forest uses
multiple decision trees to estimate missing values and outputs OOB (out-of-bag) imputation
error estimates. One caveat is that random forest works best with large datasets, and using random
forest on small datasets runs the risk of overfitting.

k-NN (k Nearest Neighbour)

k-NN imputes the missing attribute values based on the nearest K neighbour. Neighbours are
determined based on a distance measure. Once K neighbours are determined, the missing value is
imputed by taking mean/median or mode of known attribute values of the missing attribute.
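A minimal sketch with scikit-learn's KNNImputer (made-up data, k chosen arbitrarily):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# each missing entry is filled with the mean of that feature over the 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))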

Maximum likelihood

The assumption that the observed data are a sample drawn from a multivariate normal distribution
is relatively easy to understand. After the parameters are estimated using the available data, the
missing data are estimated based on the parameters which have just been estimated. Several strategies use the maximum likelihood method to handle the missing data.

Expectation-Maximization

Expectation-Maximization (EM) is the maximum likelihood method used to create a new data set.
All missing values are imputed with values estimated by the maximum likelihood methods. This
approach begins with the expectation step, during which the parameters (e.g., variances,
covariances, and means) are estimated, perhaps using the listwise deletion. Those estimates are
then used to create a regression equation to predict the missing data. The maximization step uses
those equations to fill in the missing data. The expectation step is then repeated with the new
parameters, where the new regression equations are determined to “fill in” the missing data. The
expectation and maximization steps are repeated until the system stabilizes.

Sensitivity analysis

Sensitivity analysis is defined as the study which defines how the uncertainty in the output of a
model can be allocated to the different sources of uncertainty in its inputs. When analysing the
missing data, additional assumptions on the missing data are made, and these assumptions are
often applicable to the primary analysis. However, the assumptions cannot be definitively
validated for correctness. Therefore, the National Research Council has proposed that the
sensitivity analysis be conducted to evaluate the robustness of the results to the deviations from
the MAR assumption.

Algorithms that Support Missing Values

Not all algorithms fail when there is missing data. Some algorithms can be made robust to
missing data, such as k-Nearest Neighbours, that can ignore a column from a distance measure
when a value is missing. Some algorithms can use the missing value as a unique and different
value when building the predictive model, such as classification and regression trees. An algorithm like XGBoost takes any missing data into consideration. If your imputation does not
work well, try a model that is robust to missing data.

Observations: Thus students are able to implement data transformation: handling missing data and
filling missing data.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 5

Title: Program on Discretization and binning of data


Theory:
Binning or Discretization
import math
from collections import OrderedDict

x = []
print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# X_dict will store the data indexed by position
X_dict = OrderedDict()
# x_old will store the original data
x_old = {}
# x_new will store the data after binning
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# sort the (index, value) pairs by value
x_dict = sorted(X_dict.items(), key=lambda item: item[1])

# list of bin means
binn = []
# a variable to find the mean of each bin
avrg = 0

i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# performing binning on the sorted data
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = 0
        avrg = avrg + h
        i = i + 1
rem = len(x) % bi
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))

# store the new (smoothed) value of each data point
i = 0
j = 0
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = binn[j]
        i = i + 1

print("number of data in each bin")
print(math.ceil(len(x) / bi))

for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small
observation errors. The original data values are divided into small intervals known as bins and
then replaced by a general value calculated for that bin. This has a smoothing effect on the
input data and may also reduce the chance of overfitting in the case of small datasets.
There are 2 methods of dividing data into bins:
1. Equal Frequency Binning: bins have an equal number of elements (equal frequency).
2. Equal Width Binning: bins have equal width, with the boundaries of the bins at [min
+ w], [min + 2w], …, [min + nw], where w = (max – min) / (number of bins).
Equal frequency: 

Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal Width:  
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
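Before the manual implementation below, note that the same two schemes can also be sketched directly with pandas: qcut() gives (approximately) equal-frequency bins and cut() gives equal-width bins.

import pandas as pd

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
s = pd.Series(data)

# approximately equal-frequency bins
print(pd.qcut(s, q=3))
# equal-width bins
print(pd.cut(s, bins=3))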
Code: Implementation of Binning Technique (Python):

# equal frequency
def equifreq(arr1, m):
    a = len(arr1)
    n = int(a / m)
    for i in range(0, m):
        arr = []
        for j in range(i * n, (i + 1) * n):
            if j >= a:
                break
            arr = arr + [arr1[j]]
        print(arr)

# equal width
def equiwidth(arr1, m):
    a = len(arr1)
    w = int((max(arr1) - min(arr1)) / m)
    min1 = min(arr1)
    arr = []
    for i in range(0, m + 1):
        arr = arr + [min1 + w * i]
    arri = []
    for i in range(0, m):
        temp = []
        for j in arr1:
            if j >= arr[i] and j <= arr[i + 1]:
                temp += [j]
        arri += [temp]
    print(arri)

# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# no of bins
m = 3

print("equal frequency binning")
equifreq(data, m)

print("\n\nequal width binning")
equiwidth(data, 3)

Output : 
equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]

equal width binning


[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]

Observations: Thus students are able to understand the process of Discretization and binning of
data.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 6

Title: Program on Handling dummy variables


Theory:
A dataset may contain various types of values; sometimes it contains categorical values. In order
to use those categorical values efficiently in a program, we create dummy variables.
A dummy variable is a binary variable that indicates whether a separate categorical variable
takes on a specific value.
Explanation:

As you can see three dummy variables are created for the three categorical values of the
temperature attribute. We can create dummy variables in python using get_dummies() method.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_')
Parameters:

 data = input data, i.e. a pandas DataFrame, list, set, numpy array, etc.
 prefix = string to prepend to the dummy column names
 prefix_sep = separator placed between the prefix and the category value (default '_')
Return Type: Dummy variables.
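A minimal sketch of the prefix and prefix_sep parameters on the same kind of data (values chosen only for illustration):

import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold']})
# column names become temp-Cold, temp-Hot, temp-Warm
print(pd.get_dummies(df['Temperature'], prefix='temp', prefix_sep='-'))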
Step-by-step Approach:
 Import necessary modules
 Consider the data
 Perform operations on data to get dummies
Example 1: 
Python3

# import required modules


import pandas as pd
import numpy as np
# create dataset
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold'],
                   })
# display dataset
print(df)
# create dummy variables
pd.get_dummies(df)

Output:

Example 2:
Consider List arrays to get dummies
Python3

# import required modules


import pandas as pd
import numpy as np
# create dataset
s = pd.Series(list('abca'))
# display dataset
print(s)
# create dummy variables
pd.get_dummies(s)

Output:

Example 3: 
Here is another example, to get dummy variables.
Python3

# import required modules


import pandas as pd
import numpy as np
# create dataset
df = pd.DataFrame({'A': ['hello', 'vignan', 'geeks'],
                   'B': ['vignan', 'hello', 'hello'],
                   'C': [1, 2, 3]})
 
# display dataset
print(df)
# create dummy variables
pd.get_dummies(df)

Output:

Observations: Thus students are able to understand the concept of dummy variables.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 7

Title: Implementation of different distributions (normal, Poisson, uniform, gamma)


Theory:

Normal Distribution

Normal distribution represents the behavior of most of the situations in the universe (That is

why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables

often turns out to be normally distributed, contributing to its widespread application. Any

distribution is known as Normal distribution if it has the following characteristics:

1. The mean, median and mode of the distribution coincide.


2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the center and the other half to the right.

A normal distribution is highly different from Binomial Distribution. However, if the number of

trials approaches infinity then the shapes will be quite similar.

The PDF of a random variable X following a normal distribution is given by:

\(\displaystyle f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\)

The mean and variance of a random variable X which is said to be normally distributed is given

by:

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.

The graph of a random variable X ~ N (µ, σ) is shown below.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation

1. For such a case, the PDF becomes:

\(\displaystyle f(x)=\frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}\)
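A minimal implementation sketch with NumPy and SciPy (parameters chosen arbitrarily for illustration):

import numpy as np
from scipy.stats import norm

mu, sigma = 0, 1                                     # standard normal parameters
samples = np.random.normal(mu, sigma, size=1000)     # draw random values

# evaluate the PDF at a few points
x = np.linspace(-4, 4, 9)
print(norm.pdf(x, loc=mu, scale=sigma))
print(samples.mean(), samples.std())                 # should be close to mu and sigma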

Poisson Distribution

Suppose you work at a call center, approximately how many calls do you get in a day? It can be

any number. Now, the entire number of calls at a call center in a day is modeled by Poisson

distribution. Some more examples are

1. The number of emergency calls recorded at a hospital in a day.


2. The number of thefts reported in an area on a day.
3. The number of customers arriving at a salon in an hour.

4. The number of suicides reported in a particular city.
5. The number of printing errors at each page of the book.

You can now think of many examples following the same course. Poisson Distribution is

applicable in situations where events occur at random points of time and space wherein our

interest lies only in the number of occurrences of the event.

A distribution is called Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event.

2. The probability of success over a short interval must equal the probability of success over a

longer interval.

3. The probability of success in an interval approaches zero as the interval becomes smaller.

Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some

notations used in Poisson distribution are:

 λ is the rate at which an event occurs,


 t is the length of a time interval,
 And X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called

Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

\(\displaystyle P(X=k)=\frac{e^{-\mu}\,\mu^{k}}{k!},\quad k=0,1,2,\ldots\)

The mean µ is the parameter of this distribution. µ is also defined as the λ times length of that

interval. The graph of a Poisson distribution is shown below:

The graph shown below illustrates the shift in the curve due to increase in mean.

It is perceptible that as the mean increases, the curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ

Variance -> Var(X) = µ
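A minimal implementation sketch with NumPy and SciPy (the rate is chosen arbitrarily for illustration):

import numpy as np
from scipy.stats import poisson

mu = 4                                   # mean number of events per interval
samples = np.random.poisson(lam=mu, size=1000)

# PMF values P(X = k) for k = 0 .. 9
k = np.arange(0, 10)
print(poisson.pmf(k, mu))
print(samples.mean(), samples.var())     # both should be close to mu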

 Uniform Distribution

When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are

equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the

n number of possible outcomes of a uniform distribution are equally likely.

A variable X is said to be uniformly distributed over [a, b] if the density function is:

\(\displaystyle f(x)=\frac{1}{b-a}\ \text{for } a\le x\le b,\qquad f(x)=0\ \text{otherwise}\)

The graph of a uniform distribution curve looks like

You can see that the shape of the Uniform distribution curve is rectangular, the reason why

Uniform distribution is called rectangular distribution.

For a Uniform Distribution, a and b are the parameters. 

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of

40 and a minimum of 10.

Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667

The mean and variance of X following a uniform distribution is:

Mean -> E(X) = (a+b)/2

Variance -> V(X) =  (b-a)²/12

The standard uniform density has parameters a = 0 and b = 1, so the PDF for standard uniform

density is given by:

\(f(x)=1\ \text{for } 0\le x\le 1,\qquad f(x)=0\ \text{otherwise}\)
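A minimal implementation sketch with NumPy and SciPy, reusing a = 10 and b = 40 from the flower-shop example above:

import numpy as np
from scipy.stats import uniform

a, b = 10, 40
samples = np.random.uniform(low=a, high=b, size=1000)

# P(15 <= X <= 30) via the CDF; SciPy's uniform uses loc=a and scale=b-a
print(uniform.cdf(30, loc=a, scale=b - a) - uniform.cdf(15, loc=a, scale=b - a))
print(samples.mean())                     # should be close to (a + b) / 2 = 25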

Gamma distribution.

Gamma(λ, r) or Gamma(α, β). Continuous. In the same Poisson process for the exponential

distribution, the gamma distribution gives the time to the r-th event. Thus, Exponential(λ) =

Gamma(λ, 1). The gamma distribution also has applications when r is not an integer. For that

generality the factorial function is replaced by the gamma function, Γ(x). There

is an alternate parameterization Gamma(α, β) of the family of gamma distributions. The

connection is α = r, and β = 1/λ which is the expected time to the first event in a Poisson process.

Gamma(λ, r):

\(\displaystyle f(x)=\frac{1}{\Gamma(r)}\,\lambda^{r}x^{r-1}e^{-\lambda x}=\frac{x^{\alpha-1}e^{-x/\beta}}{\beta^{\alpha}\Gamma(\alpha)},\quad x\in[0,\infty)\)

\(\mu = r/\lambda = \alpha\beta,\qquad \sigma^{2} = r/\lambda^{2} = \alpha\beta^{2},\qquad m(t) = (1-t/\lambda)^{-r} = (1-\beta t)^{-\alpha}\)

Application to Bayesian statistics. Gamma distributions are used in Bayesian statistics as

conjugate priors for the distributions in the Poisson process. In Gamma(α, β), α counts the

number of occurrences observed while β keeps track of the elapsed time.
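A minimal implementation sketch with NumPy and SciPy (r and λ chosen arbitrarily; SciPy uses the shape/scale parameterisation):

import numpy as np
from scipy.stats import gamma

r, lam = 3, 2.0                  # shape r and rate lambda
alpha, beta = r, 1.0 / lam       # alternate parameterisation alpha, beta

samples = np.random.gamma(shape=alpha, scale=beta, size=1000)

x = np.linspace(0.1, 5, 5)
print(gamma.pdf(x, a=alpha, scale=beta))
print(samples.mean(), samples.var())   # should be close to r/lam and r/lam**2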

Observations: Thus students are able to implement different distributions (normal, Poisson, uniform, gamma).

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 8

Title: Program on Data cleaning


Theory:
Missing data is always a problem in real life scenarios. Areas like machine learning and data
mining face severe issues in the accuracy of their model predictions because of poor quality of
data caused by missing values. In these areas, missing value treatment is a major point of focus
to make their models more accurate and valid.
When and Why Is Data Missed?
Let us consider an online survey for a product. Many times, people do not share all the
information related to them. Some share their experience but not how long they have been using
the product; others share how long they have been using the product and their experience, but not
their contact information. Thus, in one way or another, a part of the data is always missing, and
this is very common in real-world data.
Let us now see how we can handle missing values (say NA or NaN) using Pandas.
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',


'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
Its output is as follows −
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415

Using reindexing, we have created a DataFrame with missing values. In the
output, NaN means Not a Number.
Check for Missing Values
To make detecting missing values easier (and across different array dtypes), Pandas provides
the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −
Example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',


'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())
Its output is as follows −
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The fillna function can “fill
in” NA values with non-null data in a couple of ways, which we have illustrated in the following
sections.
Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Its output is as follows −
one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN

c 0.744328 -1.735166 1.749580

NaN replaced with '0':


one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580
Here, we are filling with value zero; instead we can also fill with any other value.
Fill NA Forward and Backward
Using the concepts of filling discussed in the ReIndexing Chapter we will fill the missing values.

Method             Action
pad / ffill        Fill values forward
bfill / backfill   Fill values backward
Example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',


'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df.fillna(method='pad'))
Its output is as follows −
one two three
a 0.077988 0.476149 0.965836
b 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
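The backward-fill variant from the table works the same way; a minimal sketch reusing the DataFrame above:

print(df.fillna(method='bfill'))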
Drop Missing Values
If you want to simply exclude the missing values, then use the dropna function along with
the axis argument. By default, axis=0, i.e., along row, which means that if any value within a
row is NA then the whole row is excluded.
Example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])


print(df.dropna())
Its output is as follows −
one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
Replace Missing (or) Generic Values
Many times, we have to replace a generic value with some specific value. We can achieve this by
applying the replace method.
Replacing NA with a scalar value is equivalent behavior of the fillna() function.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
Its output is as follows −
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60

Observations: Thus students are able to understand the concept of data cleaning.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 9

Title: Implementation of descriptive statistics (variance, skewness, kurtosis, percentile)


Theory:
1.
Variance is…
…the average of the squared differences from the mean. For a dataset \(X=\{a_1,a_2,\ldots,a_n\}\) with the mean as \(\overline{x}\), variance is

\(\displaystyle Var(X)=\frac{1}{n}\sum_{i=1}^n(a_i-\overline{x})^2\)

Variance with Python


Variance can be calculated in python using different libraries like numpy, pandas, and
statistics.

numpy.var(a, axis=None, dtype=None, ddof=0)


Parameters are the same as numpy.mean except

ddof : int, optional(ddof stands for delta degrees of freedom. It is the divisor used in the
calculation, which is N – ddof, where N is the number of elements. The default value of ddof
is 0)
>>> import numpy as np
>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a = np.var(B,axis=0)
>>> b = np.var(B,axis=1)
>>> print(a)
[ 13.98979592 12.8877551 6.12244898 3.92857143]
>>> print(b)
[ 6.546875 5.921875 8.796875 7.546875 2.875 16.5 19.0625 ]
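The same variance can also be obtained with the pandas and statistics libraries mentioned above; a minimal sketch using the first row of A (i.e. the first column of B):

>>> import pandas as pd
>>> import statistics
>>> data = [10, 14, 11, 7, 9.5, 15, 19]
>>> s = pd.Series(data)
>>> print(s.var(ddof=0))               # population variance, ~13.9898, same as numpy's default
>>> print(s.var())                     # pandas defaults to the sample variance (ddof=1), ~16.3214
>>> print(statistics.pvariance(data))  # population variance, ~13.9898
>>> print(statistics.variance(data))   # sample variance, ~16.3214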
2. Skewness refers to whether the distribution has a longer tail on one side or the other, or has left-right symmetry. Different skewness coefficients have been proposed over the years. The most common way to calculate it is to take the mean of the cubes of the differences of each point from the mean and then divide it by the cube of the standard deviation. This gives a coefficient that is independent of the units of the observations.

It can be positive (representing a right-skewed distribution), negative (representing a left-skewed distribution), or zero (representing an unskewed distribution).
Skewness with Python
scipy.stats.skew(a, axis=0)
>>> import numpy as np
>>> import scipy.stats
>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a=scipy.stats.skew(B,axis=0)
>>> print(a)
[ 0.45143419 -0.30426514 0.38321624 -0.39903339]
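As a check of the definition above, the coefficient can also be computed directly from the formula; a minimal sketch reusing B from the session above (it reproduces scipy's default, biased, estimate):

>>> x = B[:, 0]                                            # first column of B
>>> m = x.mean()
>>> manual_skew = np.mean((x - m) ** 3) / np.std(x) ** 3   # mean of cubes / cube of std
>>> print(manual_skew)                                     # ~0.4514, matches scipy.stats.skew(x)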

3. Kurtosis quantifies whether the shape of a distribution matches that of the Gaussian (normal) distribution; in other words, it measures how heavy the tails of the distribution are.

Kurtosis with Python
scipy.stats.kurtosis(a, axis=0, fisher=True)
The parameters remain the same except fisher

fisher: if True then Fisher’s definition is used and if False, Pearson’s definition is used.
Default is True
>>> import numpy as np
>>> import scipy.stats
>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],
[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a=scipy.stats.kurtosis(B,axis=0,fisher=False) #Pearson Kurtosis
>>> b=scipy.stats.kurtosis(B,axis=1) #Fisher's Kurtosis
>>> print(a,b)
[ 2.19732518 1.6138826 2.516175 2.30595041] [-1.11918934 -1.25539366
-0.86157952 -1.24277613 -1.30245747 -1.22038567 -1.46061811]
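As a check, Pearson's kurtosis is the fourth central moment divided by the squared variance, and Fisher's definition simply subtracts 3 from it (excess kurtosis); a minimal sketch reusing B from the session above:

>>> x = B[:, 0]
>>> m = x.mean()
>>> pearson = np.mean((x - m) ** 4) / np.var(x) ** 2   # ~2.1973, matches the first value of a
>>> fisher = pearson - 3                               # Fisher's (excess) kurtosis
>>> print(pearson, fisher)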

4. Percentiles and quartiles with Python
numpy.percentile(a, q, axis=None, interpolation='linear')
a: array containing the numbers whose percentile is required
q: percentile to compute (must be between 0 and 100)
axis: axis or axes along which the percentile is computed; the default is to compute the percentile of the flattened array
interpolation: it can take the values 'linear', 'lower', 'higher', 'midpoint' or 'nearest'.

64
This parameter specifies the method which is to be used when the desired quartile lies
between two data points, say i and j.
linear: returns i + (j-i)*fraction, fraction here is the fractional part of the index surrounded by i
and j
lower: returns i
higher: returns j
midpoint: returns (i+j)/2
nearest: returns the nearest point whether i or j
numpy.percentile() agrees with a manual calculation of percentiles only when interpolation is set to 'lower'.

>>> import numpy as np


>>> A=np.array([[10,14,11,7,9.5,15,19],[8,9,17,14.5,12,18,15.5],
[15,7.5,11.5,10,10.5,7,11],[11.5,11,9,12,14,12,7.5]])
>>> B=A.T
>>> a=np.percentile(B,27,axis=0, interpolation='lower')
>>> b=np.percentile(B,25,axis=1, interpolation='lower')
>>> c=np.percentile(B,75,axis=0, interpolation='lower')
>>> d=np.percentile(B,50,axis=0, interpolation='lower')
>>> print(a)
[ 9.5 9. 7.5 9. ]
>>> print(b)
[ 8. 7.5 9. 7. 9.5 7. 7.5]
>>> print(c)
[ 14. 15.5 11. 12. ]
>>> print(d)
[ 11. 14.5 10.5 11.5]
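As a small illustration of the interpolation options described above, consider a tiny array where the 37.5th percentile falls at fractional index 1.125, i.e. between i=2 and j=3 (a sketch; note that NumPy 1.22 and later rename the interpolation parameter to method):

>>> import numpy as np
>>> a = [1, 2, 3, 4]
>>> print(np.percentile(a, 37.5, interpolation='linear'))    # i + (j-i)*0.125 -> 2.125
>>> print(np.percentile(a, 37.5, interpolation='lower'))     # i -> 2
>>> print(np.percentile(a, 37.5, interpolation='higher'))    # j -> 3
>>> print(np.percentile(a, 37.5, interpolation='midpoint'))  # (i+j)/2 -> 2.5
>>> print(np.percentile(a, 37.5, interpolation='nearest'))   # fraction 0.125 is closer to i -> 2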

Observations: Thus students are able to implement descriptive statistics (variance, skewness, kurtosis and percentile).

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 10

Title: Implementation of grouping and group by.


Theory:
Groupby is a simple but extremely valuable technique that is widely used in data science. We create groups of categories and apply a function to each group. In real data science projects you deal with large amounts of data and try things over and over, so Groupby is used for efficiency: it aggregates data efficiently, both in performance and in the amount of code required. Groupby mainly refers to a process involving one or more of the following steps:
 
 Splitting: the data is split into groups by applying some condition on the dataset.
 Applying: a function is applied to each group independently.
 Combining: the results are combined into a single data structure.
The following steps help in understanding the process involved in the Groupby concept (a small code sketch follows the steps).
1. Group the unique values from the Team column.
2. Now there is a bucket for each group.
3. Toss the other data into the buckets.
4. Apply a function on the Weight column of each bucket.
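A minimal sketch of these four steps, assuming a small hypothetical frame with the Team and Weight columns pictured in the original figure:

import pandas as pd

# hypothetical data: a Team column to group on and a Weight column to aggregate
df = pd.DataFrame({'Team': ['A', 'B', 'A', 'B', 'C'],
                   'Weight': [60, 72, 65, 80, 58]})

# split into buckets by Team, apply mean() to each bucket's Weight, combine the results
print(df.groupby('Team')['Weight'].mean())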
 
Splitting Data into Groups
Splitting is the process of dividing data into groups by applying some condition on the dataset. To split the data we use the groupby() function, which splits the data into groups based on some criteria. Pandas objects can be split on any of their axes; the abstract definition of grouping is to provide a mapping of labels to group names. There are multiple ways to split data, for example:
 
 obj.groupby(key)
 obj.groupby(key, axis=1)
 obj.groupby([key1, key2])
Note: we refer to the grouping objects as the keys.
Grouping data with one key:
In order to group data with one key, we pass only one key as an argument to the groupby() function.
 
# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}
    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we group a data of Name using groupby() function. 
 

# using groupby function
# with one key
df.groupby('Name')
print(df.groupby('Name').groups)

Output : 
 

  
Now we print the first entries in all the groups formed. 
 

# applying groupby() function to
# group the data on Name value.
gk = df.groupby('Name')

# Let's print the first entries
# in all the groups formed.
print(gk.first())

Output : 
 

  
Grouping data with multiple keys : 
In order to group data with multiple keys, we pass multiple keys in groupby function.  
 
# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}
    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we group a data of “Name” and “Qualification” together using multiple keys in groupby
function. 
 

# Using multiple keys in
# groupby() function
df.groupby(['Name', 'Qualification'])
 
print(df.groupby(['Name', 'Qualification']).groups)

Output : 
 

  
Grouping data by sorting keys : 
Group keys are sorted by default during the groupby operation. The user can pass sort=False for potential speedups.
 
# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32], }
    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we apply groupby() without sort 


 

# using groupby function
# without using sort
print(df.groupby(['Name']).sum())

Output : 
 

Now we apply groupby() using sort in order to attain potential speedups 
 

# using groupby function
# with sort=False
print(df.groupby(['Name'], sort=False).sum())

Output : 
 

  
Grouping data with object attributes : 
The groups attribute is like a dictionary whose keys are the computed unique groups and whose corresponding values are the axis labels belonging to each group.
 
# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}
    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we group data like we do in a dictionary using keys. 
 

# using keys for grouping data
print(df.groupby('Name').groups)

Output : 
 

  
 
Iterating through groups
In order to iterate over the groups, we can iterate through the groupby object, which yields a (group name, group) pair for each group, much like iterating over a dictionary's items.
 

# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}

    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we iterate over each group formed.


 


# iterating an element
# of group
 
grp = df.groupby('Name')
for name, group in grp:
    print(name)
    print(group)
    print()

Output : 
 

Now we iterate an element of group containing multiple keys 
 


# iterating an element
# of group containing
# multiple keys
 
grp = df.groupby(['Name', 'Qualification'])
for name, group in grp:
    print(name)
    print(group)
    print()

Output : 
As shown in the output, the group name will be a tuple.
 

  
 
Selecting a group
In order to select a group, we can use GroupBy.get_group(); this function selects a single group.
 

# importing pandas module
import pandas as pd
  
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],

        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}
    
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)
  
print(df)

Now we select a single group using Groupby.get_group. 


 

# selecting a single group
grp = df.groupby('Name')
print(grp.get_group('Jai'))

Output : 
 

Now we select an object grouped on multiple columns 


 

# selecting object grouped
# on multiple columns
grp = df.groupby(['Name', 'Qualification'])
print(grp.get_group(('Jai', 'Msc')))

Output : 
 

Observations: Thus students are able to implement grouping and groupby, a useful concept in data science.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 11

Title: Implementation of hypothesis testing -T Test

Theory:
Methodology

The runs test is a hypothesis-testing methodology widely used in statistical analysis to test whether a set of values is generated randomly or not. Being a hypothesis test, it has a pair of hypotheses: a null hypothesis and an alternative one.


Null hypothesis: The values are randomly generated.
Alternative hypothesis: The values are NOT randomly generated.
A Z score is generated on the data following the general formula:
(Observed - Expected) / Standard Deviation
The score is then tested against the two-tailed critical value for the confidence level we specify. If the score is higher, we conclude that our alternative hypothesis holds. Otherwise, if the score is lower, we cannot say anything about the randomness of the data at this significance level. We will be using a 95% confidence level (alpha = 0.05) throughout this assignment.
Definitions and Formulas
A run: A series of positive or negative values:
Data: 1 -2 -3 4 5 6 -7
Runs: [1], [-2, -3], [4, 5, 6], [-7]
Score formula
(Number of runs - Expected number of runs) / Standard Deviation
Expected value formula
n_p = Number of positive values
n_n = Number of negative values
Expected number of runs = (2 * n_p * n_n) / (n_p + n_n) + 1
Standard Deviation formula
Variance of runs = (2 * n_p * n_n * (2 * n_p * n_n - n_p - n_n)) / ((n_p + n_n)^2 * (n_p + n_n - 1))
Standard Deviation of runs = sqrt(Variance of runs)

Test Data
We will use the data provided by NIST.
1.4.2.5.1. Background and Data
The following are the data used for this case study. The reader can download the data as a text file from www.itl.nist.gov.
Octave/Matlab Implementation and Results
Using online Octave:
Upload text file
Import data to array
Load ‘statistics’ package
Run runstest

Python Implementation
Implement the test using Python.
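The original listing is not reproduced in this manual; the following is a minimal sketch of how the runs test could be implemented from the formulas above. The file name randomness.txt is a placeholder for the downloaded NIST text file, and norm.ppf(1 - alpha) gives the one-sided critical value (about 1.645) quoted in the result below.

import numpy as np
from scipy.stats import norm

def runs_test(values, alpha=0.05):
    values = np.asarray(values, dtype=float)
    # a run is a maximal block of consecutive positive or negative values;
    # all-positive data should be centred first (e.g. subtract the median) so both signs occur
    signs = np.sign(values)
    signs = signs[signs != 0]                      # drop exact zeros

    runs = 1 + np.count_nonzero(np.diff(signs))    # a new run starts at every sign change
    n_p = np.count_nonzero(signs > 0)              # number of positive values
    n_n = np.count_nonzero(signs < 0)              # number of negative values

    expected = (2 * n_p * n_n) / (n_p + n_n) + 1
    variance = (2 * n_p * n_n * (2 * n_p * n_n - n_p - n_n)) / \
               ((n_p + n_n) ** 2 * (n_p + n_n - 1))
    z = (runs - expected) / np.sqrt(variance)

    z_critical = norm.ppf(1 - alpha)               # one-sided critical value, ~1.645 for alpha=0.05
    return z, z_critical

# values = np.loadtxt('randomness.txt')            # placeholder name for the NIST data file
# print(runs_test(values - np.median(values)))     # centre the data before testing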
Running the code produces the results below (calculated Z score, critical Z score at 95% confidence):
(2.8355606218883844, 1.6448536269514722)

Observations: Thus students are able to implement hypothesis testing - T Test.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2022-23

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 12

Title: Create simple dashboard using tableau.


Theory:

A dashboard is a collection of different kinds of visualizations or views that we create in Tableau. We can bring together different elements of multiple worksheets and put them on a single dashboard. The dashboard option enables us to import and add charts and graphs from worksheets to create a dashboard. On a dashboard, we can place relevant charts and graphs in one view and analyze them for better insights.

Now, we will learn in a stepwise manner how to create a dashboard in Tableau Desktop.

1. Open a new dashboard


You can open a dashboard window either from the Dashboard option given on the menu bar or
from the Dashboard icon highlighted in red on the bottom bar.

Selecting the New Dashboard option or clicking on the Dashboard icon will open a new window named Dashboard 1. You can change the name of the dashboard as per your liking.

2. Dashboard pane
In the window where we can create our dashboard, we get a lot of tabs and options related to
dashboarding. On the left, we have a Dashboard pane which shows the dashboard size, list of
available sheets in a workbook, objects, etc.

From the Dashboard tab, we can set the size of our dashboard. We can enter custom dimensions
like the width and height of the dashboard as per our requirements.

Or, you can select from a list of available fixed dashboard sizes as shown in the screenshot
below.

3. Layout pane
Right next to the Dashboard pane is the Layout pane where we can enhance the appearance and
layout of the dashboard by setting the position, size, border, background, and paddings.

4. Adding a sheet
Now, we’ll add a sheet onto our empty dashboard. To add a sheet, drag and drop a sheet from
the Sheets column present in the Dashboard tab. It will display all the visualizations we have on
that sheet on our dashboard. If you wish to change or adjust the size and place of the
visual/chart/graph, click on the graph then click on the small downward arrow given at the
right. A drop-down list appears having the option Floating, select it. This will unfix your chart
from one position so that you can adjust it as per your liking.

Have a look at the picture below to see how you can drag a sheet or visual around on the
dashboard and adjust its size.

5. Adding more sheets
In a similar way, we can add as many sheets as we require and arrange them on the dashboard
properly.

Also, you can apply the filter or selections on one graph and treat it like a filter for all the other
visuals on the dashboard. To add a filter to a dashboard in Tableau, select Use as Filter option
given on the right of every visual.

Then on the selected visual, we make selections. For instance, we select the data point
corresponding to New Jersey in the heat map shown below. As soon as we select it, all the rest of
the graphs and charts change their information and make it relevant to New Jersey. Notice in
the Region section, the only region left is East which is where New Jersey is located.

6. Adding objects
Another set of tools that we get to make our dashboard more interactive and dynamic is in
the Objects section. We can add a wide variety of objects such as a web page, button, text box,
extension, etc.

From the objects pane, we can add a button and also select the action of that button, that is, what
that button should do when you click on it. Select the Edit Button option to explore the options
you can select from for a button object.

For instance, we add a web page of our DataFlair official site as shown in the screenshot below.

7. Final dashboard
Now, we move towards making a final dashboard in Tableau with all its elements in place. As
you can see in the screenshot below, we have three main visualizations on our current dashboard
i.e. a segmented area chart, a scatter chart and a line chart showing the sales and profits forecast.
On the right pane, we have the list of legends showing Sub-category names, a forecast indicator
and a list of clusters.

We can add filters on this dashboard by clicking on a visual. For instance, we want to add a filter
based on months on the scatter plot showing sales values for different clusters. To add a months
filter, we click on the small downward arrow and then select Filters option. Then we
select Months of Order Date option. You can select any field based on which you wish to
create a new filter.

This will give us a slider filter to select a range of months for which we want to see our data.
You can adjust the position of the filter box and drag and drop it at whichever place you want.

You can make more changes into the filter by right-clicking on it. Also, you can change the type
of filter from the drop-down menu such as Relative Date, Range of Date, Start Date, End Date,
Browse Periods, etc.

Similarly, you can add and edit more filters on the dashboard.

8. Presentation mode
Once our dashboard is ready, we can view it in the Presentation Mode. To enable the
presentation mode, click on the icon present on the bar at the top as shown in the screenshot
below or press F7.

This opens our dashboard in the presentation mode. So far we were working in the Edit Mode. In
the presentation mode, it neatly shows all the visuals and objects that we have added on the
dashboard. We can see how the dashboard will look when we finally present it to others or share
it with other people for analysis.

Here, we can also apply the filter range to our data. The dashboard is interactive and will change
the data according to the filters we apply or selections we make.

For instance, we selected the brand Pixel from our list of items from the sub-category field. This
instantly changes the information on the visuals and makes it relevant to only Pixel.
9. Share workbook with others
We can also share all the worksheets and dashboard that we create together as a workbook with
other users. To share the workbook with others, click on the share icon (highlighted in red).
Next, you need to enter the server address of a Tableau server.
Note – You must have a Tableau Online or Tableau Server account in order to do this.

Observations: Thus students are able to create simple dashboard using tableau.

Prepared by- DR.Mrs J. N. Jadhav Associate Professor Deptt of CSE, DYPCET

Program Co-ordinator HOD Dean Academics
