Data Science - Unit II

What is Data science
• Data science is the study of data to extract meaningful insights for
business.
• It is a multidisciplinary approach that combines principles and
practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of
data.
• Data science is important because it combines tools, methods, and
technology to generate meaning from data.
Data Science Applications
Data Science Tools
Pandas
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating
data.
• The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
Features of Pandas
• Pandas provides fast, flexible data structures, such as the DataFrame,
which are designed to work with structured data easily and
intuitively.
• Pandas (Python data analysis) is a core tool in the data science life cycle.
• It is one of the most popular and widely used Python libraries for data
science, along with NumPy and Matplotlib.
Pandas Major Applications
Example
• Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
• Result
   calories  duration
0       420        50
1       380        40
2       390        45
Named Indexes
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
• Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
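Once rows carry names, they can be looked up by label. The following is a minimal sketch, assuming the same calories/duration data as above; it uses loc, the label-based accessor in pandas.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Select one row by its label; the result is a Series
day2 = df.loc["day2"]
print(day2)

# Select a single cell by row label and column name
day3_calories = df.loc["day3", "calories"]
print(day3_calories)
```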
Let us assume that we are creating a data frame with students' data.
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows −
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
# importing the pandas library
import pandas as pd
# creating a dataframe object
student_register = pd.DataFrame()
# assigning values to the rows and columns of the dataframe
student_register['Name'] = ['Abhijit', 'Smriti', 'Akash', 'Roshni']
student_register['Age'] = [20, 19, 20, 14]
student_register['Student'] = [False, True, True, False]
print(student_register)
Add a new student to the dataframe
# creating a new pandas Series object
new_person = pd.Series(['Mansi', 19, True],
                       index=['Name', 'Age', 'Student'])
# DataFrame.append() was removed in pandas 2.0, so the row is added
# with pd.concat(), which returns a new dataframe
student_register = pd.concat([student_register, new_person.to_frame().T],
                             ignore_index=True)
print(student_register)
Python | DataFrame.to_markdown() in Pandas
With the DataFrame.to_markdown() method, we can get a Markdown table
from a given dataframe. The method relies on the optional tabulate
package, which must be installed.
Syntax : DataFrame.to_markdown()
Return : The Markdown table as a string.

Example 1
# import pandas
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [1.1, 2.2, 3.3]},
                  index=['a', 'a', 'b'])
# Using the DataFrame.to_markdown() method
gfg = df.to_markdown()
print(gfg)
Output :
|    |   A |   B |
|:---|----:|----:|
| a  |   1 | 1.1 |
| a  |   2 | 2.2 |
| b  |   3 | 3.3 |
Example 2
# import pandas
import pandas as pd
df = pd.DataFrame({"A": [3, 4, 5],
                   "B": ['c', 'd', 'e']},
                  index=['I', 'II', 'III'])
# Using the DataFrame.to_markdown() method
gfg = df.to_markdown()
print(gfg)
Output :
|     |   A | B   |
|:----|----:|:----|
| I   |   3 | c   |
| II  |   4 | d   |
| III |   5 | e   |
Add a new column in Pandas Data Frame
Using a Dictionary
Pandas is the Python library used for data analysis and
manipulation.
There are several ways to add a new column to a data frame.
Here we add a new column by using a dictionary.
Let's take an example.
# Python program to illustrate
# Add a new column in Pandas
# Importing the pandas Library

import pandas as pd
# creating a data frame with some data values.

data_frame = pd.DataFrame([[i] for i in range(7)], columns =['data'])
print (data_frame)
Output:
data
0 0
1 1
2 2
3 3
4 4
5 5
6 6
Map function: add the column "new_data_1", which holds the name of the
weekday for each value in the column named "data".
Call map() and pass the dictionary; it looks up each value as a key and
returns the associated value.
# Python program to illustrate
# Add a new column in Pandas
# Data Frame Using a Dictionary
import pandas as pd
data_frame = pd.DataFrame([[i] for i in range(7)], columns =['data'])
# Introducing weeks as dictionary
weeks = {0:'Sunday', 1:'Monday', 2:'Tuesday', 3:'Wednesday',
4:'Thursday', 5:'Friday', 6:'Saturday'}
# Mapping the dictionary keys to the data frame.
data_frame['new_data_1'] = data_frame['data'].map(weeks)
print (data_frame)
Output:
data new_data_1
0 0 Sunday
1 1 Monday
2 2 Tuesday
3 3 Wednesday
4 4 Thursday
5 5 Friday
6 6 Saturday
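One detail worth knowing about map() with a dictionary is what happens when a value has no matching key. The sketch below uses a deliberately incomplete dictionary; the column name day_name and the placeholder text "unknown" are illustrative choices, not part of the example above.

```python
import pandas as pd

data_frame = pd.DataFrame({"data": [0, 1, 7]})
weeks = {0: "Sunday", 1: "Monday"}

# Values with no matching dictionary key (here 7) map to NaN
data_frame["day_name"] = data_frame["data"].map(weeks)
print(data_frame)

# Optionally replace the missing results with a placeholder
data_frame["day_name"] = data_frame["day_name"].fillna("unknown")
print(data_frame)
```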
import pandas as pd
# Create pandas DataFrame
data = pd.DataFrame({"x1": ["x", "y", "x", "y", "x", "x"],
                     "x2": range(15, 21),
                     "x3": ["a", "b", "c", "d", "e", "f"],
                     "x4": range(20, 8, -2)})
print(data)
Example 1: Remove Column from pandas DataFrame
data_drop = data.drop("x3", axis=1)  # Drop variable from DataFrame
print(data_drop)                     # Print updated DataFrame
Example 2: Add New Column to pandas DataFrame
x5 = ["foo", "bar", "foo", "bar", "foo", "bar"]
print(x5)
data_add = data.assign(x5=x5)
print(data_add)
Example 3: Merge Two pandas DataFrames
data_add = pd.DataFrame({"x3":["c", "d", "e", "f", "g", "h"],
"y1":range(101, 107),
"y2":["foo", "bar", "foo", "bar", "foo", "foo"]})
print(data_add)
data_merge = pd.merge(data, data_add,
on = "x3",
how = "outer")
print(data_merge)
Advantages of using pandas in real-time applications
• data alignment and integrated handling of missing data
• data set merging and joining
• reshaping and pivoting of data sets
• hierarchical axis indexing to work with high-dimensional data in a
lower-dimensional data structure
• label-based slicing
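A few of the advantages listed above can be seen in a short sketch. The series, the small sales table, and all of their names are made up for illustration only.

```python
import pandas as pd

# Data alignment: values are matched by index label, not by position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one Series become NaN
print(total)

# Integrated handling of missing data
print(total.fillna(0))

# Reshaping: pivot a long table into a wide one
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "item": ["pen", "book", "pen", "book"],
    "units": [5, 2, 7, 3],
})
wide = sales.pivot(index="month", columns="item", values="units")
print(wide)

# Label-based slicing: unlike positional slicing, both end labels are included
print(s1.loc["a":"b"])
```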
NumPy
• NumPy is a Python library.
• NumPy is used for working with arrays.
• NumPy is short for "Numerical Python".

Use of NumPy
import numpy as np
arr = np.array( [[ 1, 2, 3], [ 4, 2, 5]] )
print("Array is of type: ", type(arr))
print("No. of dimensions: ", arr.ndim)
print("Shape of array: ", arr.shape)
print("Size of array: ", arr.size)
print("Array stores elements of type: ", arr.dtype)
• Output :
Array is of type:  <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
• This section introduces NumPy, the widely used array-processing
library in Python.
• What is NumPy? NumPy is a general-purpose array-processing
package.
• It provides a high-performance multidimensional array object, and
tools for working with these arrays.
• It is the fundamental package for scientific computing with Python.
• It is open-source software. It contains various features including
these important ones:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number
capabilities
• Basic operations: NumPy provides a wide range of built-in arithmetic
functions.
• Operations on a single array: We can use overloaded arithmetic
operators to do element-wise operations on an array and create a new
array.
• With the +=, -=, *= operators, the existing array is modified in place.
import numpy as np
a = np.array([1, 2, 5, 3])
# add 1 to every element
print ("Adding 1 to every element:", a+1)
# subtract 3 from each element
print ("Subtracting 3 from each element:", a-3)
# multiply each element by 10
print ("Multiplying each element by 10:", a*10)
# square each element

print ("Squaring each element:", a**2)
# modify existing array
a *= 2
print ("Doubled each element of original array:", a)
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print ("\nOriginal array:\n", a)

print ("Transpose of array:\n", a.T)
Output :
Adding 1 to every element: [2 3 6 4]

Subtracting 3 from each element: [-2 -1 2 0]
Multiplying each element by 10: [10 20 50 30]
Squaring each element: [ 1 4 25 9]
Doubled each element of original array: [ 2 4 10 6]
Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:
[[1 3 9]
[2 4 6]
[3 5 0]]
Unary operators: Many unary operations are provided as methods of the
ndarray class.
This includes sum, min, max, etc. These functions can also be applied
row-wise or column-wise by setting an axis parameter.
# Python program to demonstrate
# unary operators in numpy
import numpy as np
arr = np.array([[1, 5, 6],
                [4, 7, 2],
                [3, 1, 9]])
# maximum element of array

print ("Largest element is:", arr.max())
print ("Row-wise maximum elements:",
arr.max(axis = 1))
# minimum element of array
print ("Column-wise minimum elements:",
arr.min(axis = 0))
# sum of array elements

print ("Sum of all array elements:",
arr.sum())
# cumulative sum along each row

print ("Cumulative sum along each row:\n",
arr.cumsum(axis = 1))
Output :
Largest element is: 9

Row-wise maximum elements: [6 7 9]
Column-wise minimum elements: [1 1 2]
Sum of all array elements: 38
Cumulative sum along each row:
[[ 1 6 12]
[ 4 11 13]
[ 3 4 13]]
Binary operators: These operations apply to arrays elementwise and a
new array is created.
You can use all basic arithmetic operators like +, -, /, *, etc. In case of +=,
-=, *= operators, the existing array is modified.
# Python program to demonstrate
# binary operators in Numpy
import numpy as np
a = np.array([[1, 2],
[3, 4]])
b = np.array([[4, 3],
[2, 1]])
# add arrays
print ("Array sum:\n", a + b)
# multiply arrays (elementwise multiplication)
print ("Array multiplication:\n", a*b)
# matrix multiplication
print ("Matrix multiplication:\n", a.dot(b))
Output:
Array sum:
[[5 5]
[5 5]]
Array multiplication:
[[4 6]
[6 4]]
Matrix multiplication:
[[ 8 5]
[20 13]]
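The difference between elementwise and matrix multiplication is easy to mix up. As a small sketch on the same two arrays, the @ operator can be used as an alternative spelling of a.dot(b):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[4, 3],
              [2, 1]])

# * multiplies element by element; @ performs matrix multiplication
elementwise = a * b
matrix_product = a @ b

print(elementwise)
print(matrix_product)

# @ gives the same result as a.dot(b)
print(np.array_equal(matrix_product, a.dot(b)))
```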
SciPy Introduction
• SciPy is a scientific computation library that uses NumPy underneath.
• SciPy stands for Scientific Python.
• It provides more utility functions for optimization, stats and signal processing.
• Like NumPy, SciPy is open source so we can use it freely.
• SciPy was created by NumPy's creator Travis Oliphant.

Why Use SciPy?
• If SciPy uses NumPy underneath, why can we not just use NumPy?
• SciPy has optimized and added functions that are frequently used in
NumPy and Data Science.
Constants in SciPy
• As SciPy is more focused on scientific implementations, it provides
many built-in scientific constants.
• These constants can be helpful when you are working with Data
Science.
Unit Categories
The units are placed under these categories:
• Metric
• Binary
• Mass
• Angle
• Time
• Length
• Pressure
• Volume
• Speed
• Temperature
• Energy
• Power
• Force
Mass
• Return the specified unit in kg (e.g. gram returns 0.001)
from scipy import constants
print(constants.gram) #0.001
print(constants.metric_ton) #1000.0
print(constants.grain) #6.479891e-05
print(constants.lb) #0.45359236999999997
print(constants.pound) #0.45359236999999997
print(constants.oz) #0.028349523124999998
print(constants.ounce) #0.028349523124999998
print(constants.stone) #6.3502931799999995
print(constants.long_ton) #1016.0469088
Time:
Return the specified unit in seconds (e.g. hour returns 3600.0)
from scipy import constants
print(constants.minute) #60.0
print(constants.hour) #3600.0
print(constants.day) #86400.0
print(constants.week) #604800.0
print(constants.year) #31536000.0
print(constants.Julian_year) #31557600.0
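Because each constant holds the size of one unit expressed in the SI base unit, the constants can be used directly for unit conversion. The quantities below (5 pounds, 2 hours, 172800 seconds) are arbitrary values chosen for illustration.

```python
from scipy import constants

# Each constant is the size of one unit in SI terms,
# so multiplying converts to SI and dividing converts back.
weight_kg = 5 * constants.pound   # 5 lb expressed in kilograms
duration_s = 2 * constants.hour   # 2 hours expressed in seconds
days = 172800 / constants.day     # 172800 s expressed in days

print(weight_kg)
print(duration_s)
print(days)
```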
SciPy Sparse Data
What is Sparse Data
Sparse data is data that has mostly unused elements (elements that
don't carry any information ).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values are
not zero.
How to Work With Sparse Data
SciPy has a module, scipy.sparse that provides functions to deal with sparse data.
There are primarily two types of sparse matrices that we use:
CSC - Compressed Sparse Column. For efficient arithmetic, fast column slicing.
CSR - Compressed Sparse Row. For fast row slicing, faster matrix vector products
We will use the CSR matrix in this tutorial.

CSR Matrix
We can create a CSR matrix by passing an array into the function scipy.sparse.csr_matrix().
Example
Create a CSR matrix from an array:
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])
print(csr_matrix(arr))
Result
(0, 5) 1
(0, 6) 1
(0, 8) 2
From the result we can see that there are 3 items with a value.
The 1. item is in row 0 position 5 and has the value 1.
The 2. item is in row 0 position 6 and has the value 1.
The 3. item is in row 0 position 8 and has the value 2.

Counting nonzeros with the count_nonzero() method:
Example
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])
print(csr_matrix(arr).count_nonzero())
Removing zero-entries from the matrix with the eliminate_zeros() method:
Example
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat)
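To see why the CSR format saves space, it can help to look at what it actually stores. The sketch below inspects the three internal arrays of a CSR matrix and converts it to the CSC format and back to a dense array; it reuses the same small array as the examples above.

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])
mat = csr_matrix(arr)

# CSR keeps only the non-zero values plus their positions
print(mat.data)     # the stored values, row by row
print(mat.indices)  # the column index of each stored value
print(mat.indptr)   # where each row starts inside data/indices

# Convert to CSC when column slicing matters more
csc = mat.tocsc()
print(csc.data)

# Convert back to an ordinary dense array
print(mat.toarray())
```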
SciPy Interpolation
• The function interp1d() is used to interpolate a distribution with 1
variable.
• It takes x and y points and returns a callable function that can be
called with new x values and returns the corresponding y values.
Example
For given xs and ys interpolate values from 2.1, 2.2... to 2.9:
from scipy.interpolate import interp1d

import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
Result:
[5.2 5.4 5.6 5.8 6. 6.2 6.4 6.6 6.8]
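By default interp1d() joins the known points with straight lines. As a sketch of the alternative, the kind parameter selects a smoother method; the curved data set y = x squared below is an assumption chosen so that the two methods give visibly different answers.

```python
import numpy as np
from scipy.interpolate import interp1d

xs = np.arange(10)
ys = xs ** 2  # a curved relationship: y = x squared

linear = interp1d(xs, ys)               # default: straight lines between points
cubic = interp1d(xs, ys, kind="cubic")  # smooth cubic spline through the points

x_new = 2.5
print(linear(x_new))  # halfway between the known values 4 and 9
print(cubic(x_new))   # close to the true value 2.5 ** 2
```

For curved data the cubic variant usually lands closer to the underlying function, at the cost of a little more computation.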

SciPy Graphs
Working with Graphs
Graphs are an essential data structure.
SciPy provides us with the module scipy.sparse.csgraph for working with such
data structures.
Adjacency Matrix
An adjacency matrix is an n x n matrix where n is the number of elements in a graph.
The values represent the connections between the elements.

Example:
For a graph with elements A, B and C, the connections are:
A & B are connected with weight 1.
A & C are connected with weight 2.
C & B are not connected.

The adjacency matrix would look like this:
    A B C
A: [0 1 2]
B: [1 0 0]
C: [2 0 0]
Connected Components
Find all of the connected components with the
connected_components() method.
Example
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
arr = np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0]
])
newarr = csr_matrix(arr)
print(connected_components(newarr))
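connected_components() returns two things: the number of components and an array that gives each element its component number. A sketch with a hypothetical five-element graph that falls into two separate groups makes the return value easier to read:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Elements 0, 1, 2 are linked to each other; elements 3 and 4 form a separate pair
arr = np.array([
    [0, 1, 2, 0, 0],
    [1, 0, 0, 0, 0],
    [2, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
])

n_components, labels = connected_components(csr_matrix(arr), directed=False)
print(n_components)  # number of separate groups
print(labels)        # group number assigned to each element
```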
SciPy Spatial Data
Working with Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
We deal with spatial data problems on many tasks.
E.g. finding if a point is inside a boundary or not.
SciPy provides us with the module scipy.spatial, which has functions for working with
spatial data.
Example
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]
])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)

plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()
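The task mentioned earlier, finding whether a point is inside a boundary, can be sketched with the same triangulation. The two query points below are arbitrary examples; find_simplex() is the Delaunay method that locates the triangle containing a point.

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[2, 4], [3, 4], [3, 0], [2, 2], [4, 1]])
tri = Delaunay(points)

# find_simplex returns the index of the triangle containing each query point,
# or -1 when the point lies outside the triangulated region
queries = np.array([[3, 2], [0, 0]])
inside = tri.find_simplex(queries) >= 0
print(inside)
```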
Matplotlib
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the
import module statement:
import matplotlib
Example
import matplotlib
print(matplotlib.__version__)
Matplotlib Pyplot
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and
are usually imported under the plt alias:
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Example
Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Result:
Matplotlib Plotting
Plotting x and y points
The plot() function is used to draw points (markers) in a diagram.
By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.

• Parameter 1 is an array containing the points on the x-axis.
• Parameter 2 is an array containing the points on the y-axis.
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two
arrays [1, 8] and [3, 10] to the plot function.
Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Result:
The x-axis is the horizontal axis.
The y-axis is the vertical axis.

Matplotlib Markers
Markers
You can use the keyword argument marker to emphasize each point
with a specified marker:
Example
Mark each point with a circle:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, marker = 'o')
plt.show()
Result:
Matplotlib Line
Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line:
Example
Use a dotted line:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linestyle = 'dotted')
plt.show()
Example
Use a dashed line:
plt.plot(ypoints, linestyle = 'dashed')
plt.show()

Line Color
You can use the keyword argument color or the shorter c to set the color of the line:
Example
Set the line color to red:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, color = 'r')
plt.show()
Multiple Lines
You can plot as many lines as you like by simply adding more plt.plot()
functions:
Example
Draw two lines by specifying a plt.plot() function for each line:
import matplotlib.pyplot as plt
import numpy as np
y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])
plt.plot(y1)
plt.plot(y2)
plt.show()
Line Width
You can use the keyword argument linewidth or the shorter lw to
change the width of the line.
The value is a floating-point number, in points:
Example
Plot with a 20.5pt wide line:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linewidth = 20.5)
plt.show()
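The marker, line style, color, and width arguments can all be combined in a single plot() call. The sketch below assumes an off-screen backend ("Agg") and writes the image to an in-memory buffer so that it also runs without a display; in an interactive session, drop the matplotlib.use() line and call plt.show() instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window is opened
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# One call sets marker, line style, color and width together
lines = plt.plot(ypoints, marker="o", linestyle="dashed", color="r", linewidth=2.5)

# Save the finished figure as PNG data in memory
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
```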
Create Labels for a Plot
With Pyplot, you can use the xlabel() and ylabel() functions to set a
label for the x- and y-axis.
Example
Add labels to the x- and y-axis:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.show()
Matplotlib Adding Grid Lines
With Pyplot, you can use the grid() function to add grid lines to the plot.
import matplotlib.pyplot as plt
import numpy as np
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two
arrays of the same length, one for the values of the x-axis, and one for
values on the y-axis:
Example
A simple scatter plot:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Matplotlib Pie Charts
Creating Pie Charts

With Pyplot, you can use the pie() function to draw pie charts:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
plt.pie(y)
plt.show()
Sorting
• Sorting is the process of arranging data into meaningful order so that
you can analyze it more effectively.
• For example, you might want to order sales data by calendar month
so that you can produce a graph of sales performance.
• You can use Discoverer to sort data as follows: sort text data into
alphabetical order.
What are the examples of data sorting?
Some common examples include sorting alphabetically (A to Z or Z to
A), by value (largest to smallest or smallest to largest), by day of the
week (Mon, Tue, Wed..), or by month names (Jan, Feb..) etc.
What is sorting in real life example?
• The contact list in your phone is sorted, so you can easily find the
contact you want because the data is already arranged in order.
• While shopping on Flipkart or Amazon, you sort items based on your
choice, that is, price low to high or high to low.
SORTING
This section discusses how to sort a Pandas DataFrame using various methods in Python.
# importing pandas library
import pandas as pd
# creating and initializing a nested list
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
# creating a pandas dataframe
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df
OUTPUT
In order to sort the data frame in pandas, function sort_values() is
used.
Pandas sort_values() can sort the data frame in Ascending or
Descending order.
# Sorting by column 'Country'
df.sort_values(by=['Country'])
Sorting the Data frame in Descending order
# Sorting by column "Population"

df.sort_values(by=['Population'], ascending=False)
Sorting Data frames by multiple columns
# Sorting by columns "Country" and then "Continent"
df.sort_values(by=['Country', 'Continent'])
Sorting Data frames by multiple columns but
different order
# Sorting by columns "Country" in descending
# order and then "Continent" in ascending order
df.sort_values(by=['Country', 'Continent'],
ascending=[False, True])
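Note that sort_values() does not change the data frame it is called on; it returns a sorted copy. A minimal sketch with a made-up three-row table shows this, along with the ignore_index option for renumbering the rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["France", "India", "Brazil"],
    "Population": [44310863, 3.72e8, 76039390],
})

# sort_values returns a new DataFrame; df itself keeps its original order
by_population = df.sort_values(by="Population", ascending=False)
print(by_population)

# ignore_index=True renumbers the rows 0, 1, 2, ... after sorting
renumbered = df.sort_values(by="Population", ignore_index=True)
print(renumbered)
```

To keep the sorted result, assign it to a variable as done here.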
Grouping
• Grouping data is the process of organizing data into related sets.
• This can be done in a number of ways, including by category, by
attribute, or by value.
• Grouping data can be helpful for data analysis and for understanding
patterns in data.
What does grouping mean in data?
• Grouped data means the data (or information) given in the form of
class intervals such as 0-20, 20-40 and so on.
• Ungrouped data is defined as the data given as individual points (i.e.
values or numbers) such as 15, 63, 34, 20, 25, and so on.
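Turning ungrouped values into class intervals can be sketched with pandas. The example below takes the five values from the bullet above and counts them per interval; the variable names and the choice of interval edges up to 80 are assumptions for illustration.

```python
import pandas as pd

marks = pd.Series([15, 63, 34, 20, 25])

# right=False makes each interval include its lower edge: [0, 20), [20, 40), ...
bins = [0, 20, 40, 60, 80]
intervals = pd.cut(marks, bins=bins, right=False)

# Count how many values fall into each class interval
grouped = intervals.value_counts(sort=False)
print(grouped)
```

With right=False the value 20 is counted in the interval 20-40 rather than 0-20, which settles the usual question of where a boundary value belongs.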
Purpose of grouping data
• Data is grouped so that it becomes understandable and can be
interpreted.
• Grouped data is helpful to make calculations of certain values which
will help in describing and analyzing the data
Aggregation and Grouping
Pandas DataFrame.groupby()
• In Pandas, the groupby() function lets us split a data set into groups
and work with each group separately.
• Its primary task is to split the data into various groups.
• These groups are categorized based on some criteria.
• The objects can be divided along any of their axes.
This operation consists of the following steps for aggregating/grouping
the data:
• Splitting datasets
• Analyzing data
• Aggregating or combining data
We can also apply a function to each subset. The following
operations can be performed on the groups:
Aggregation: It computes a summary statistic for each group.
Transformation: It performs some group-specific operation.
Filtration: It discards groups of data based on some condition.
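The three kinds of operation can be sketched on a small made-up table. The column names team, points, and team_mean, and the threshold of 40, are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 40],
})
grouped = df.groupby("team")["points"]

# Aggregation: one summary value per group
totals = grouped.sum()
print(totals)

# Transformation: one result per original row, computed within its group
df["team_mean"] = grouped.transform("mean")
print(df)

# Filtration: keep only the groups that satisfy a condition
big_teams = df.groupby("team").filter(lambda g: g["points"].sum() > 40)
print(big_teams)
```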
Return value of groupby()
DataFrame.groupby() returns a GroupBy object.
import pandas as pd
ipl_data = {'Name': ['Priya', 'Rudra', 'Dev', 'Nisha', 'Arpita',
'Shipra', 'Kakali', 'Kunal', 'Neha', 'Rup', 'Rim', 'Ram'],
'Rank': [1, 2, 3, 4, 5,6 ,7 ,8,9 , 10,11,12],
'DOB': [2000,2000,2002,1999,2001,2000,1998,1999,2000,2002,2001,2000],
'Points':[676,709,963,873,790,802,956,688,794,801,890,890]}
df = pd.DataFrame(ipl_data)
print (df)
output
Name Rank DOB Points
0 Priya 1 2000 676
1 Rudra 2 2000 709
2 Dev 3 2002 963
3 Nisha 4 1999 873
4 Arpita 5 2001 790
5 Shipra 6 2000 802
6 Kakali 7 1998 956
7 Kunal 8 1999 688
8 Neha 9 2000 794
9 Rup 10 2002 801
10 Rim 11 2001 890
11 Ram 12 2000 890
The following example groups the data by the DOB column and computes the mean of Points for each group.
import pandas as pd
import numpy as np
ipl_data = {'Name': ['Priya', 'Rudra', 'Dev', 'Nisha', 'Arpita',
'Shipra', 'Kakali', 'Kunal', 'Neha', 'Rup', 'Rim', 'Ram'],
'Rank': [1, 2, 3, 4, 5,6 ,7 ,8,9 , 10,11,12],
'DOB': [2000,2000,2002,1999,2001,2000,1998,1999,2000,2002,2001,2000],
'Points':[676,709,963,873,790,802,956,688,794,801,890,890]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('DOB')
print(grouped['Points'].agg('mean'))
output
DOB
1998 956.0
1999 780.5
2000 774.2
2001 840.0
2002 882.0
Name: Points, dtype: float64
Example
# import the pandas library

import pandas as pd
import numpy as np
data = {'Name': ['Parker', 'Smith', 'John', 'William'],
'Percentage': [82, 98, 91, 87],
'Course': ['B.Sc','B.Ed','M.Phill','BA']}
df = pd.DataFrame(data)
grouped = df.groupby('Course')
print(grouped['Percentage'].agg('mean'))
Output
Course
B.Ed 98.0
B.Sc 82.0
BA 87.0
M.Phill 91.0
Name: Percentage, dtype: float64
Advantage of grouping
Properly structured, group projects can reinforce skills that are relevant
to both group and individual work, including the ability to: Break
complex tasks into parts and steps.
Plan and manage time.
Refine understanding through discussion and explanation.
Plotting
• Plotting and data visualization can tell different types of stories
about the relationships between features and target variables, e.g. comparing
different quantities, studying trends, quantifying relationships, or displaying
proportions.
• Plotting, or data visualization, is one of the oldest and most important
branches of data science.
• The box plot is one of the most used plots in the field of Data Science.
• It shows the distribution of quantitative data in a way that facilitates
comparisons between variables or across levels of a categorical
variable.
• It helps us to detect outliers more easily compared to swarm
plots.
• Plotting is a graphical representation of a data set that shows a
relationship between two or more variables.
• MATLAB plots play an essential role in the fields of mathematics,
science, engineering, technology, and finance for statistics and data
analysis.
• There are several functions available in MATLAB to create
2-dimensional and 3-dimensional plots.
Creating Plots
MATLAB makes it easy to create plots. For example, in 2D, take a vector
of a-coordinates, a = (a1, ..., an), and a vector of b-coordinates, b = (b1, ..., bn),
locate the points (ai, bi), with i = 1, 2, ..., n, and then connect them by
straight lines.
The MATLAB command to plot a graph is plot(a, b).
Both vectors must have the same length. The vectors a = (1, 2, 3, 4, 5) and
b = (0, 1, -1, 1, 0) produce a zigzag line through five points.
>> a = [1, 2, 3, 4, 5];
>> b = [0, 1, -1, 1, 0];
>> plot(a, b)
Filtering the missing data in data science
• Missing values can be handled by deleting the rows or columns having
null values.
• If more than half of the rows in a column are null, the entire
column can be dropped.
• Rows that have null values in one or more columns can
also be dropped.
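The rule of thumb above (drop a column when more than half of its rows are null, then drop the remaining incomplete rows) can be sketched as follows. The table and its column names a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],  # 3 of 4 values missing
    "c": [1, np.nan, 3, 4],            # 1 of 4 values missing
})

# Fraction of missing values in each column
missing_share = df.isna().mean()
print(missing_share)

# Keep only the columns where at most half of the values are missing
kept = df.loc[:, missing_share <= 0.5]
print(kept)

# Then drop the rows that still contain a missing value
clean = kept.dropna()
print(clean)
```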
Pandas Filter Rows with NaN Value from DataFrame Column
• You can filter out rows with NaN values from a pandas DataFrame
column (string, float, datetime, etc.) by using the DataFrame.dropna() and
DataFrame.notnull() methods.
• Python has no Null keyword, so missing data is represented as None
or NaN.
• NaN stands for Not a Number and is one of the common ways to
represent a missing value in the data.
# Create a pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark","Java"],
'Fee' :[22000,25000,np.nan,np.nan,np.nan,np.nan],
'Duration':['30days',np.nan,'30days','N/A', np.nan,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
Courses Fee Duration
0 Spark 22000.0 30days
1 PySpark 25000.0 NaN
2 Spark NaN 30days
3 Python NaN N/A
4 PySpark NaN NaN
5 Java NaN NaN
Using DataFrame.dropna() to Filter Rows with NaN Value
# Using DataFrame.dropna() method drop all rows that have NaN/None.
df2=df.dropna()
print(df2)
# Output:
#   Courses      Fee Duration
# 0   Spark  22000.0   30days
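dropna() with no arguments removes a row if any column is missing. To filter on a single column instead, notnull() can be used as a row mask. This is a sketch on a shortened version of the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python"],
    "Fee": [22000, 25000, np.nan, np.nan],
    "Duration": ["30days", np.nan, "30days", "N/A"],
})

# notnull() gives True where a value is present; use it as a row filter
has_fee = df[df["Fee"].notnull()]
print(has_fee)

# dropna(subset=...) expresses the same filter
same_rows = df.dropna(subset=["Fee"])
print(same_rows)

# The text "N/A" is an ordinary string, so pandas does not treat it as missing
print(df["Duration"].notnull())
```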
