
Course On Machine Learning & Data Science

AI in Business and Solving Issues

Businesses have now started realizing the value of data in improving their products and processes. Various use cases from across business domains are discussed to show the adoption of AI. The use cases are taken from retail, healthcare, banking, manufacturing & energy.

AI IN RETAIL

 Contextual Commerce

Benefits: Contextual commerce is online content (videos, articles, reviews, photos) from which consumers can buy the items featured within it directly, without being redirected to another site.
Technologies: Optimization, NLP, ML
Companies: PUMA, Bazaar, Ted Baker

 Conversational Commerce

Benefits: Computers converse with clients in human languages; they understand their needs and emotions and assist them in selection.
Technologies: NLP, Speech recognition
Companies: Amazon (Alexa)

 Actionable Analytics

Benefits: Analysis of data that can be put into well-defined actions geared towards specific results, such as inventory management, pricing and targeted campaigns.
Technologies: ML, Big Data
Companies: Amazon, Flipkart

 Predictive Marketing

Benefits: Extracting information from customer data sets to determine patterns and predict future outcomes and trends. Can help generate more revenue by targeting only potential customers.
Technologies: ML, Expert Systems
Companies: Myntra, Amazon

 Guided Sales

Benefits: Understanding needs and suggesting the best match.
Technologies: ML, Expert Systems
Companies: Tesco

AI Based Retail Strategies:

 NLP, Chatbots
 Personalization
 Sentiment Analysis
 Branding
 Marketing
 ML
 Inventory
 Store Layout
 Promotions
 Pricing
 360 degree view
 Image Analytics
 Offline Stores
 Kiosk
 Virtual Trial Mirrors
 Virtual Reality: shop in a real store from anywhere (Buy+, Alibaba)
 Augmented Reality: how will things look in their actual place? (iOS 11)

Python for Data Science:

Python is an open source, general-purpose programming language. It supports both structured and object-oriented styles of programming. It can be utilized for developing a wide range of applications, including web applications, data analytics, machine learning applications etc.

DATA TYPES: Python provides various data types and data structures for storing and processing data. For handling single values, there are data types like int, float, str, and bool. For handling data in groups, Python provides data structures like list, tuple, dictionary, set, etc.

LIBRARIES: Python has a wide range of libraries and built-in functions which aid in the rapid development of applications. Python libraries are collections of pre-written code for performing specific tasks. This eliminates the need to rewrite code from scratch.

EXAMPLE: John is a software developer. His project requires developing an application that connects to various database servers like MySQL, PostgreSQL, MongoDB etc. To implement this requirement from scratch, John would need to invest time and effort in understanding the underlying architectures of the respective databases. Instead, John can choose to use pre-defined libraries to perform the database operations, which abstracts away the complexities involved.

Use of libraries will help John in the following ways:

Faster application development – Libraries promote code reusability and help developers save time and focus on building the functional logic.

Enhanced code efficiency – Use of pre-tested libraries enhances the quality and stability of the application.

Code modularization – Libraries can be coupled or decoupled based on requirements.

Over the last two decades, Python has emerged as a first-choice tool for tasks that involve scientific computing, including the analysis and visualization of large datasets. Python has gained popularity, particularly in the field of data science, because of its large and active ecosystem of third-party libraries. A few of the popular libraries in data science include NumPy, Pandas, Matplotlib and Scikit-Learn.

NUMPY:

Basic example: A Python list can be used to store a group of elements together in a sequence. It can contain heterogeneous elements. Following are some examples of lists:

item_list = ['Bread', 'Milk', 'Eggs', 'Butter', 'Cocoa']

student_marks = [78, 47, 96, 55, 34]

hetero_list = [1, 2, 3.0, 'text', True, 3+2j]

To perform operations on the list elements, one needs to iterate

through the list. For example, if five extra marks need to be awarded to all the entries in the student marks list, the following approach can be used:
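A minimal sketch of this loop-based approach (the original code is not reproduced in the text, so this illustration reuses the student_marks list from above):

#Adding 5 extra marks to every entry using a plain loop
student_marks = [78, 47, 96, 55, 34]

for i in range(len(student_marks)):
    student_marks[i] = student_marks[i] + 5

print(student_marks)  # [83, 52, 101, 60, 39]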
Here, len() is used because we cannot determine the numerical range of the list in advance, as the number of data entries can vary, and the for loop needs a numerical range to iterate over.

It can be observed that there is use of a loop. The code is lengthy and becomes computationally expensive as the size of the list increases.

Data Science is a field that utilizes scientific methods and algorithms to generate insights from data. These insights can then be made actionable and applied across a broad range of application domains. Data Science deals with large datasets. Operating on such data with lists and loops is time consuming and computationally expensive.

Let us understand why Python lists can become a bottleneck if they are used for large data:

%%time
#Used to calculate total operation time
list1 = list(range(1, 1000000))
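Only the first line of that timing cell is given; a sketch completing the list-based version, assuming it mirrors the element-wise addition done with Numpy below, could be:

list2 = list(range(2, 1000001))

#Element-wise addition of the two lists, one pair at a time
list3 = [x + y for x, y in zip(list1, list2)]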

The problem with lists is that the run time is higher: about 395 ms. The same operation using Numpy:

%%time
#Used to calculate total operation time

#Importing Numpy
import numpy as np

#Creating numpy arrays of 1 million numbers
a = np.arange(1, 1000000)
b = np.arange(2, 1000001)

c = a + b

It can be observed that the same operation is completed in 12 milliseconds, compared to the 395 milliseconds taken by the Python list. As the data size and the complexity of the operations increase, the difference between the performance of Numpy and Python lists broadens. In Data Science, there are millions of records to be dealt with. The performance limitations faced by using Python lists can be managed by using advanced Python libraries like Numpy.

Numeric-Python (Numpy) is a Python library that is used for numeric and scientific operations. It serves as a building
block for many libraries available in Python.

Data structures in Numpy

The main data structure of NumPy is the ndarray or n-dimensional array. The ndarray is a multidimensional container of elements of the same type. It can easily deal with matrix and vector operations.

1. As the array size increases, Numpy can execute more parallel operations, thereby making computation faster. When the array size gets close to 5,000,000 elements, NumPy gets around 120 times faster than a Python list.
2. NumPy has many optimized built-in mathematical functions. These functions help in performing a variety of complex mathematical computations faster and with very minimal code.
3. Another great feature of NumPy is that it has multidimensional array data structures that can represent vectors and matrices. This is useful, as a lot of machine learning algorithms rely on matrix operations.

IMPORTING NUMPY: The Numpy library needs to be imported into the environment before it can be used, as shown below. 'np' is the standard alias used for Numpy.

A Numpy array can be created by using the array() function. The array() function in Numpy returns an array object named ndarray.

Syntax: np.array(object, dtype)

object – a Python object (for example, a list)

dtype – the data type of the object (for example, integer)

Example: Consider the following marks scored by students.
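A minimal sketch of the import and array creation (using the student marks from earlier as an assumed example):

#Importing Numpy with its standard alias
import numpy as np

#Creating a Numpy array from a list of marks
student_marks_arr = np.array([78, 47, 96, 55, 34])
print(student_marks_arr)        # [78 47 96 55 34]
print(type(student_marks_arr))  # <class 'numpy.ndarray'>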

Let us take another example: there are various columns in this dataset. Each column contains multiple values. These values can be represented as lists of items.

Since each column contains homogeneous values, Numpy arrays can be used to represent them.

Let us understand how to represent the car 'horsepower' values in a Numpy array.

This can be achieved by creating the Numpy array from a list of lists. Let us also understand how to represent the car 'mpg', 'horsepower' and 'acceleration' values in a Numpy array.

The numpy.ndarray.shape attribute returns a tuple that describes the shape of the array. For example:

a one-dimensional array having 10 elements will have a shape of (10,)

a two-dimensional array having 10 elements distributed evenly in two rows will have a shape of (2, 5)

Let us comprehend how to find out the shape of the car attributes array. Here, 3 represents the number of rows and 5 represents the number of elements in each row.

'dtype' refers to the data type of the data contained by the array. Numpy supports multiple datatypes like integer, float, string, boolean etc. Below is an example of using the dtype property to identify the data type of the elements in an array.

The Numpy dtype can be changed as per requirements. For example, an array of integers can be converted to float. Below is an example of using dtype as an argument of the np.array() function to convert the data type of elements from integer to float.
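A small sketch of these shape and dtype operations (reusing the horsepower values as an assumed example):

import numpy as np

#1D array of horsepower values
horsepower_arr = np.array([130, 165, 150, 150, 140])
print(horsepower_arr.shape)  # (5,)
print(horsepower_arr.dtype)  # int64 (platform dependent)

#Converting the data type from integer to float via the dtype argument
horsepower_float = np.array([130, 165, 150, 150, 140], dtype=float)
print(horsepower_float)      # [130. 165. 150. 150. 140.]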

#Creating a 2D array consisting of car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
car_hp_arr

OUTPUT: array([['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'], ['130', '165', '150', '150', '140']], dtype='<U25')

OPERATIONS ON NUMPY:

The elements in the ndarray are accessed using an index within square brackets []. In Numpy, both positive and negative indices can be used to access elements in the ndarray. Positive indices start from the beginning of the array, while negative indices start from the end of the array. Array indexing starts from 0 in positive indexing and from -1 in negative indexing.

ACCESSING AN ELEMENT FROM A 1D ARRAY:

#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])

#accessing the second car from the array
cars[1]

OUTPUT: 'buick skylark 320'

ACCESSING AN ELEMENT FROM A 2D ARRAY:

#Accessing horsepower - 0 represents car_names and 1 represents horsepower
car_hp_arr[1]

OUTPUT: array(['130', '165', '150', '150', '140'], dtype='<U25')

#Accessing the second car - 0 represents the 1st row and 1 represents the 2nd element of the row
car_hp_arr[0, 1]

OUTPUT: 'buick skylark 320'

#Accessing the name of the last car using negative indexing
car_hp_arr[0, -1]

OUTPUT: 'ford torino'

SLICING OF NUMPY ARRAY:

Slicing is a way to access and obtain subsets of an ndarray in Numpy.

Syntax: array_name[start : end] – the slice starts at index 'start' and ends at index 'end - 1'.

#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])

#Accessing the name, horsepower and acceleration of the first three cars
car_hp_acc_arr[0:3, 0:3]

OUTPUT: array([['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite'], ['130', '165', '150'], ['18', '15', '18']], dtype='<U25')

EG: MEAN & MEDIAN CALCULATION IN AN ARRAY:
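A minimal sketch of the mean and median calculation on the horsepower values (the original code is not reproduced in the text, so this is an illustration):

import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

print(np.mean(horsepower_arr))    # 147.0
print(np.median(horsepower_arr))  # 150.0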

MIN & MAX:
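A minimal sketch of the min and max operations on the same array (values assumed from the earlier example):

import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

print(np.min(horsepower_arr))  # 130
print(np.max(horsepower_arr))  # 165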
INDEXING:

The 'where' function can be used to locate elements that satisfy a condition. Given a condition, the 'where' function returns the indexes of the array where the condition holds. Using these indexes, the respective values from the array can be obtained.

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]

#creating a numpy array from the horsepower list
horsepower_arr = np.array(horsepower)

x = np.where(horsepower_arr >= 150)
print(x)  # gives the indices

#With the indices, we can find those values
horsepower_arr[x]

FILTERING:

Getting some elements out of an existing array based on certain conditions and creating a new array out of them is called filtering. The following code can be used to accomplish this:

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]

#creating a numpy array from the horsepower list
horsepower_arr = np.array(horsepower)

#creating a filter array
filter_arr = horsepower_arr > 135
newarr = horsepower_arr[filter_arr]

print(filter_arr)
print(newarr)

SORTING: A NumPy array can be sorted by passing the array to the function np.sort(array) or by calling array.sort().

So, what is the difference between these two, given that they are used for the same functionality? The difference is that the array.sort() method modifies the original array in place, whereas np.sort(array) returns a sorted copy and leaves the original unchanged.

The mathematical operations can be performed on Numpy arrays. Numpy makes use of optimized, pre-compiled code to perform mathematical operations on each array element. This eliminates the need for loops, thereby enhancing the performance. This process is called vectorization.

Numpy provides various mathematical functions such as sum(), add(), subtract(), log(), sin() etc. which use vectorization. In addition to arithmetic operations, several other mathematical operations like exponents, logarithms and trigonometric functions are also available in Numpy. This makes Numpy a very useful tool for scientific computing.

Figure 1: NUMPY OPERATORS

##SUM
student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))

OUTPUT: 359

Award extra marks in subjects as follows:

English: +2
Mathematics: +2
Physics: +5
Chemistry: +10
Biology: +1

additional_marks = [2, 2, 5, 10, 1]
student_marks_arr += additional_marks
student_marks_arr

OUTPUT: array([80, 94, 41, 74, 90])

BROADCASTING:

"Broadcasting" refers to how Numpy handles arrays with different

shapes during arithmetic operations. The array of smaller size is stretched or copied across the larger array.

For example, consider the following arithmetic operations across two arrays:

import numpy as np

# Array 1
array1 = np.array([5, 10, 15])

# Array 2
array2 = np.array([5])

array3 = array1 * array2
array3

OUTPUT: array([25, 50, 75])

In this example, array2 is stretched or copied to match array1 during the arithmetic operation, resulting in a new array array3 with the same shape as array1.

The following diagram explains broadcasting:

 In the first operation, the shape of the first array is 1x3 and the shape of the second array is 1x1. Hence, according to broadcasting rules, the second array gets stretched to match the shape of the first array, and the shape of the resulting array is 1x3.
 In the second operation, the shape of the first array is 3x3 and the shape of the second array is 1x3. Hence, according to broadcasting rules, the second array gets stretched to match the shape of the first array, and the shape of the resulting array is 3x3.
 In the third operation, the shape of the first array is 3x1 and the shape of the second array is 1x3. Hence, according to broadcasting rules, both the first and second arrays get stretched, and the shape of the resulting array is 3x3.

Figure 2: Scores of 4 students in 2 subjects

#Making an array
students_marks = np.array([[67, 45], [90, 92], [66, 72], [32, 40]])
students_marks

Now the teacher wants to award five extra marks in Chemistry and ten extra marks in Physics.

students_marks = np.array([[67, 45], [90, 92], [66, 72], [32, 40]])

#Broadcasting
students_marks += [5, 10]
students_marks

Output: array([[72, 55], [95, 102], [71, 82], [37, 50]])

The students' marks array is a 2D array of shape 4x2. The marks to be added are in the form of a 1D array of size 1x2. According to the broadcasting rules, the marks to be added get stretched to match the shape of the student marks array, and the shape of the resulting array is 4x2.

IMAGE AS NUMPY MATRIX:

Images are stored as arrays of hundreds, thousands or even millions of picture elements called pixels. Therefore, images can also be treated as Numpy arrays, as they can be represented as matrices of pixels.

Certain basic operations and manipulations can be carried out on images using Numpy and the scikit-image package. Scikit-image is an image processing package. The package is imported as skimage.

IMPORTING IMAGES:

#Importing path, numpy, pyplot and the skimage i/o library
import os.path
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread
from skimage import data_dir

#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))

To view the image as a matrix, the below command can be used:

print(img)

Let us understand the type, dimensions and shape of the image.

print('Type of image: ', type(img))
print('Dimensions of image: ', img.ndim)
print('Shape of image:', img.shape)

OUTPUT:

So far, you have become familiar with how to retrieve the basic attributes of the image. Let us proceed to understand some examples of indexing and selection on images.

Cutting the rocket out of the image:

#Slicing out the rocket
img_slice = img.copy()
img_slice = img_slice[0:300, 360:480]
plt.figure()
plt.imshow(img_slice)

Assigning the values corresponding to the sliced image as 0:

img[0:300, 360:480, :] = 0
plt.imshow(img)

img_slice[np.greater_equal(img_slice[:, :, 0], 100) &

plt.figure()
plt.imshow(img_slice)

The place where the sliced rocket image was present initially is now filled with black color, because 0 is assigned to the values corresponding to the sliced image.

Replacing the 'rocket' back in its original place:

img[0:300, 360:480, :] = img_slice
plt.imshow(img)

In the above picture, the black region from the previous step is replaced with the sliced 'rocket'.

To summarize:

 Numpy offers multi-dimensional arrays.
 It provides array operations that are better than Python list operations in terms of speed, efficiency and ease of writing code.
 Numpy provides fast and convenient operations in the form of vectorization and broadcasting.
 Numpy offers additional capabilities to perform linear algebra and scientific computing. These are out of the scope of this module.

Arange

This method returns evenly spaced values between the given interval, excluding the end limit. The values are generated based on the step value; by default, the step value is 1.

Linspace

This method returns the given number of evenly spaced values between the given interval. By default, the number of values generated between a given interval is 50.
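A small sketch of both methods (interval values assumed for illustration):

import numpy as np

#Evenly spaced values from 1 up to (but excluding) 10, default step 1
print(np.arange(1, 10))      # [1 2 3 4 5 6 7 8 9]

#The same interval with a step of 2
print(np.arange(1, 10, 2))   # [1 3 5 7 9]

#5 evenly spaced values between 0 and 1, both ends included
print(np.linspace(0, 1, 5))  # [0.   0.25 0.5  0.75 1.  ]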

Zeros:

Returns an array of the given shape, filled with zeros.

Ones:

Returns an array of the given shape, filled with ones.

Full:

Returns an array of the given shape, filled with the given value, irrespective of datatype.

Eye:

Returns an identity matrix of the given shape.

Random:

NumPy has numerous ways to create arrays of random numbers. Random numbers can be drawn from a uniform distribution by passing the required length to the random.rand function.

#generating 5 random numbers from a uniform distribution
np.random.rand(5)

Similarly, to generate random numbers from a normal distribution, use the random.randn function.

Random numbers of type 'integer' can also be generated using the random.randint function. Below is an example of creating five random numbers between 1 and 10.

#random integer values: low=1, high=10, number of values=5
np.random.randint(1, 10, size=5)
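A sketch of the zeros, ones, full and eye calls (shapes assumed for illustration):

import numpy as np

print(np.zeros((2, 3)))    # 2x3 array of zeros
print(np.ones((2, 3)))     # 2x3 array of ones
print(np.full((2, 3), 7))  # 2x3 array filled with the value 7
print(np.eye(3))           # 3x3 identity matrix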

Similarly, two-dimensional arrays of random numbers can also be created by passing the shape instead of the number of values.

To generate random numbers from a predefined set of values present in an array, the choice() method can be used. The choice() method takes an array as a parameter and randomly returns values based on the requested size. We can also make a multi-dimensional array with random values drawn from a predefined set of values.

PANDAS:

Pandas is an open-source library for real world data analysis in Python. It is built on top of Numpy. Using Pandas, data can be cleaned, transformed, manipulated and analyzed. It is suited for different kinds of data, including tabular data as in a SQL table or an Excel spreadsheet, time series data, and observational or statistical datasets.

The steps involved in performing data analysis using Pandas are as follows:

The first step is to read the data. There are multiple formats in which data can be obtained, such as '.csv', '.json', '.xlsx' etc. Below are the examples:

Figure 3: Spreadsheet form of data
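A sketch of the random-shape and choice() calls mentioned above (the set of values is assumed for illustration):

import numpy as np

#2D array of random numbers: pass a shape instead of a length
np.random.rand(2, 3)

#Randomly picking 3 values from a predefined set
np.random.choice([10, 20, 30, 40, 50], size=3)

#A 2x3 array drawn from the same predefined set
np.random.choice([10, 20, 30, 40, 50], size=(2, 3))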

Figure 4: JSON file (JavaScript Object Notation file)

Figure 5: Comma separated values (csv) file

Exploring the data

The next step is to explore the data. Exploring data helps to:

 know the shape (number of rows and columns) of the data
 understand the nature of the data by obtaining subsets of the data
 identify missing values and treat them accordingly
 get insights about the data using descriptive statistics

Performing operations on the data

Some of the operations supported by Pandas for data manipulation are as follows:

 Grouping operations
 Sorting operations
 Masking operations
 Merging operations
 Concatenating operations

Visualizing data

The next step is to visualize the data to get a clear picture of the various relationships among the data. The following plots can help visualize the data:

 Scatter plot
 Box plot
 Bar plot
 Histogram and many more

Generating Insights

All the above steps help in generating insights about the data.

Pandas is one of the most popular data wrangling and analysis tools because it:

 has the capability to load huge sizes of data easily.
 provides us with extremely streamlined forms of data representation.
 can handle heterogeneous data, has an extensive set of data manipulation features and makes data flexible and customizable.

To get started with Pandas, Numpy and Pandas need to be imported. In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified

with labels instead of simple integer indices.

The basic data structures of Pandas are Series and DataFrame.

SERIES: A Series is a one-dimensional labelled array. It supports different datatypes like integer, float, string etc. Let us understand more about Series with the following example, where the values are listed vertically along with their index.

The Pandas Series object can be used to represent this data in a meaningful manner. A Series is created using the following syntax:

Syntax:

pd.Series(data, index, dtype)

data – It can be a list, a list of lists or even a dictionary.

index – The index can be explicitly defined for different values, if required.

dtype – This represents the data type used in the series (optional parameter).

Series.values provides the values.

Series.index provides the index.

Slicing and indexing work the same way as in lists and Numpy arrays.

By default, a Series creates an integer index. A custom index can also be defined.
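A small sketch of a Series with default and custom indices (the marks values are reused from earlier; the student names are assumed for illustration):

import pandas as pd

#Series with the default integer index
marks = pd.Series([78, 47, 96, 55, 34])
print(marks.values)  # [78 47 96 55 34]
print(marks.index)   # RangeIndex(start=0, stop=5, step=1)

#Series with a custom index
marks = pd.Series([78, 47, 96, 55, 34],
                  index=['Ram', 'Meena', 'John', 'Sam', 'Rita'])
print(marks['John'])  # 96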

A Series can also be viewed as a specialized dictionary, where the keys act as the index and the corresponding values act as the values, as in a dict.

A Series gives a useful way to view and manipulate one-dimensional data. But when data is present in rows and columns, it becomes necessary to make use of the Pandas DataFrame object.

A DataFrame is a collection of Series, where each Series represents a column from a table.

Let us create a DataFrame object using Series objects as shown below:

Syntax:

pd.DataFrame(data, index, columns)

data – data can contain Series or list-like objects. If data is a dictionary, column order follows the insertion order.

index – index for the DataFrame that is created. By default, it will be RangeIndex(0, 1, 2, …, n) if no explicit index is provided.

columns – If data contains column labels, these will be used. Else, default to RangeIndex(0, 1, 2, …, n).

To represent this data, we first build two Series from dictionaries, one for price and one for manufacturer. The DataFrame then combines them, with 'Price' and 'Manufacturer' as two columns.
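A sketch of building the DataFrame from two Series (the car names and prices here are assumed for illustration):

import pandas as pd

#Two Series sharing the same index
price = pd.Series({'ford torino': 6000, 'amc rebel sst': 5500})
manufacturer = pd.Series({'ford torino': 'Ford', 'amc rebel sst': 'AMC'})

#Combining them into a DataFrame with two columns
cars = pd.DataFrame({'Price': price, 'Manufacturer': manufacturer})
print(cars)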

The output shows the DataFrame containing multiple columns. The car names act as the indices, and 'Price' and 'Manufacturer' act as the columns or 'features' of this small dataset. It combines the two dictionaries and provides the data in the form of a table, where the common keys of the two dictionaries become the row index.

To access individual features, the following code can be used:

cars['Price']
cars['Manufacturer']

DIFFERENT APPROACHES TO CREATE A DATAFRAME:

From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series. Here, a single dictionary with keys and values represents the above data; we have to specify the column name, which makes it easier to form the DataFrame.

From a list of dictionaries

Consider the following data of marks for four students.
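A sketch of both approaches (the values and subject names are assumed for illustration):

import pandas as pd

#Single-column DataFrame from one Series; the column name is specified
price = pd.Series({'ford torino': 6000, 'amc rebel sst': 5500})
pd.DataFrame(price, columns=['Price'])

#DataFrame from a list of dictionaries - each dictionary becomes one row
marks = [{'English': 78, 'Maths': 92},
         {'English': 47, 'Physics': 55}]
pd.DataFrame(marks)  # missing entries appear as NaN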

The data is converted into a DataFrame using the pd.DataFrame command. Each dictionary element in the list is taken as a row, and the subject names become the labels. We can also arrange data that is incomplete or unevenly structured in this way.

Note: NaN (Not a Number) represents missing values.

From an existing file

In most real-world scenarios, the data is in different file formats like csv, xlsx, json etc. Pandas supports reading the data from these files. Below is an example of creating a DataFrame from a json file.
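A minimal sketch of reading a json file (the file name here is illustrative, not the one from the demo):

import pandas as pd

#Reading a json file into a DataFrame
df_marks = pd.read_json('student_marks.json')
print(df_marks.head())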

The axis keyword:

One of the important parameters used while performing operations on DataFrames is 'axis'. Axis takes two values: 0 and 1.

axis = 0 represents row-specific operations.

axis = 1 represents column-specific operations.

Reading the data from XYZ Custom Cars

Pandas can read a variety of files: for example, tables of fixed-width formatted lines (read_fwf), Excel sheets (read_excel), html files (read_html), json files (read_json) etc.

import pandas as pd
import numpy as np

df = pd.read_csv('auto_mpg.csv')
print(df)

HEAD & TAIL FUNCTIONS:

To view the first few rows or the last few rows, the functions that can be used are df.head() and df.tail() respectively. If the number of rows to be viewed is not passed, the head and tail functions return five rows by default. An example for head is given below.

X = df.head()
print(X)

DESCRIBE:

The describe function can be used to generate a quick summary of data statistics. It provides the mean, max, min and standard deviation values for the data.

X = df.describe()

print(X)

INFO:

To know the datatypes and the number of rows containing non-null values for the respective columns, the info() function can be used.

x = df.info()
print(x)

We can see that info() returns information about the non-null rows in the available data.

DROPPING NULL VALUES:

It can be observed that the 'horsepower' attribute has some null values. The easiest approach is to remove the rows with any null values. This can be achieved using the dropna() function.

df.dropna(inplace = True)
#'inplace' makes changes to the original DataFrame

After dropping the rows with null horsepower values, it can be observed that the number of rows has been reduced to 392.

NOTE: df.fillna(value) can be used to fill all the missing values. The missing values are typically filled with the mean, median, mode, or a constant value.

Selecting a subset of the data

In addition to data access techniques, Pandas also provides techniques for indexing and selection. Selecting a specific column in a DataFrame can be achieved in the following ways:

 Passing the column name as shown below. The output is a Series containing the car names.
 Passing the column name inside a list. The output is a DataFrame containing just one column.
 To extract a subset of the data, we can pass several column names in a list, as shown in the sketch below.

Setting a custom index:

A custom index can be set on the DataFrame, as seen with Series, according to the requirements. The following example depicts the same:
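A sketch of these selections (column names assumed from the auto_mpg dataset used above):

#A Series containing the car names
df['name']

#A single-column DataFrame
df[['name']]

#A subset with several columns
df[['name', 'horsepower', 'acceleration']]

#Setting the car name as a custom index
df = df.set_index('name')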

Here, the names of the cars are given as the index for easy understanding. 'iloc' and 'loc' are the two indexing techniques that help us in selecting specific rows and columns.

1. iloc – Access a group of rows and columns by integer index. The 'iloc' indexer follows implicit indexing.

Syntax: df.iloc[Rows, Columns]

In the following demos, 'df' refers to the XYZ Custom Cars DataFrame.

Figure 7: Custom cars DataFrame

Examples: first column and 2nd row value; last column and 2nd row value; a subset from the DataFrame.

2. loc – Access a group of rows and columns by custom index (here it is the car NAME). The loc indexer follows explicit indexing. To select a subset of columns, the column names can be passed as a list.

Note: While retrieving records using loc, the upper bound of the slice is inclusive.
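A sketch of these selections (row and column labels assumed from the examples above):

#2nd row, first column, by integer position
df.iloc[1, 0]

#2nd row, last column
df.iloc[1, -1]

#A subset of rows and columns by position
df.iloc[0:6, 0:3]

#The same kind of selection by label; note the inclusive upper bound
df.loc['chevrolet chevelle malibu':'ford torino', ['mpg', 'horsepower']]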

Here, for rows from zero to five, the subset containing cylinders, horsepower and name is taken.

Adding/Removing columns of a DataFrame:

Consider the given example, the marks of 4 students. The teacher wants to insert a 'Total marks' column which gives the sum of the marks of all subjects (see the sketch below).

Problem statement: Retrieve the details of all the cars built in year 72.

Y = df.loc[df['model_year'] == 72].head()
print(Y)

Problem statement: Retrieve the details of all the cars built in Japan having 6 cylinders.

X = df.loc[(df['origin'] == 'japan') & (df['cylinders'] == 6)]
print(X)
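A sketch of adding and removing a column (the DataFrame name marks_df is assumed for illustration):

#Inserting a 'Total marks' column as the row-wise sum of the subject columns
marks_df['Total marks'] = marks_df.sum(axis=1)

#Removing a column again with drop (axis=1 selects columns)
marks_df = marks_df.drop('Total marks', axis=1)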

Problem Statement:

XYZ Custom Cars want to categorize cars into different categories as follows:

Category       | Description                                                                        | Features
Fuel efficient | Cars designed with low power and high fuel efficiency                              | High MPG, low horsepower, low weight
Muscle Cars    | Intermediate sized cars designed for high performance                              | High displacement, high horsepower, moderate weight
SUV            | Big sized cars designed for high performance, long-distance trips, family comfort  | High horsepower, high weight
Race car       | Cars specifically designed for race tracks                                         | Low weight, high acceleration

Their experienced engineers and mechanics have come up with the following parameters for these categories:

Category       | Features involved
Fuel efficient | MPG > 29, Horsepower < 93.5, Weight < 2500
Muscle Cars    | Displacement > 262, Horsepower > 126, Weight in range [2800, 3600]
SUV            | Horsepower > 140, Weight > 4500
Race car       | Weight < 2223, Acceleration > 17

SOLUTION:

# Fuel efficient
# MPG > 29, Horsepower < 93.5, Weight < 2500
df.loc[(df['mpg'] > 29) & (df['horsepower'] < 93.5) & (df['weight'] < 2500)]

Output: This returns 83 rows x 9 columns of the DataFrame (83 cars).

# Muscle cars
# Displacement > 262, Horsepower > 126, Weight in range [2800, 3600]
df.loc[(df['displacement'] > 262) & (df['horsepower'] > 126) & (df['weight'] >= 2800) & (df['weight'] <= 3600)]

Output: This returns an 11 x 9 table (11 cars).

Race cars and SUVs are classified from the entire DataFrame in the same way using the df.loc function.

MASKING OPERATION: The masking operation replaces values where the condition is True.

The teacher does not want to reveal the marks of students who have failed. The condition is that if a student has scored marks >= 33, then they have passed, otherwise they have failed. The marks of the failed students have to be replaced with 'Fail'. So, how can the task be performed?

Syntax:

DataFrame.mask(cond, other = nan, inplace = False, axis = None)

cond – Where cond is False, keep the original value. Where True, replace with the corresponding value from other.

other – Entries where cond is True are replaced with the corresponding value from other.

inplace – Whether to perform the operation in place on the data.

axis – alignment axis.

Figure 8: Masking code
Figure 9: Creating a DataFrame
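A minimal sketch of the masking step (the DataFrame name marks_df is assumed for illustration):

#Replacing marks below the pass mark with 'Fail'
x = marks_df.mask(marks_df < 33, 'Fail')
print(x)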

SORTING:

Now the teacher wishes to sort the data based on the Physics marks. The df.sort_values function is used to get the data sorted. Here, 'x' denotes the masked DataFrame from the previous step.

Pandas preserves the index and column labels in the output. For binary operations such as addition and multiplication, Pandas will automatically align the indices when passing the objects to the functions.

PROBLEM: The teacher wants to encrypt the marks for confidentiality reasons. Therefore, the teacher decides to save the marks as the sine of the original marks. For example, if Subodh has scored 67 in Chemistry, then his encrypted marks will be sin(67) = -0.855520.

For sin to work, the DataFrame should contain only numeric values. Similarly, np.cos can also be used. The encrypted marks carry the same indices as the original marks. This is called index preservation.

Resetting the index:

In case of a requirement where the index has to be restored to the default index, the reset_index() function must be used. It adds the existing index as a new column in the DataFrame. This can be done as follows:

Broadcasting refers to a set of rules for operating between data of different sizes and shapes. Consider the same example of the students' marks. The teacher wants to increase the marks of all the students as follows:

Chemistry: +5
Physics: +10
Mathematics: +10
English: +2
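A compact sketch of these steps (the DataFrame name marks_df and its column order are assumed for illustration):

import numpy as np

#Sorting by Physics marks
marks_df.sort_values(by='Physics')

#Encrypting the marks as the sine of the original values; indices are preserved
encrypted = np.sin(marks_df)

#Restoring the default integer index; the old index becomes a column
encrypted.reset_index()

#Broadcasting the per-subject increments across all the students
#(column order assumed: Chemistry, Physics, Mathematics, English)
marks_df += [5, 10, 10, 2]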

Apply:

This method is used to apply a function along an axis of the DataFrame.

Syntax:

DataFrame.apply(func, axis = 0, result_type = None)

func – the function to apply to each column or row.

axis – the axis along which the function is applied.

result_type – one of 'expand', 'reduce' or 'broadcast'. In the demo, 'broadcast' is used.

 'broadcast': the results will be broadcast to the original shape of the DataFrame; the original index and columns will be retained.

The teacher wants to get the total marks scored by each student, as well as the total marks in each subject, for the same example. Later, the students were unable to attend the next set of exams due to the pandemic. Hence, the teacher decides to award them average marks based on their previous performance.

AGGREGATION:

Consider the scenario where the board of XYZ Custom Cars wants to know the minimum and maximum of all the numerical columns. The aggregation operation is used to aggregate using one or more operations over the specified axis.

Syntax:

DataFrame.agg(func, axis = 0)

func – the function to use for aggregating the data. If a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.

axis – If 0 or 'index': apply the function to each column. If 1 or 'columns': apply the function to each row.

GROUPING:

XYZ Custom Cars want to know the number of cars manufactured in each year. This would require a grouping operation. Pandas supports a groupby feature to group the data for aggregate operations.

Syntax:

DataFrame.groupby(by = column_name, axis, sort)

One of the engineers suggests checking the mean, minimum and maximum horsepower based on the number of cylinders and the model year. For such a requirement, the 'agg' function can be combined with the groupby function as shown below.

The teacher also wants to combine the marks of these students.

Solution: Use concatenation to combine the marks.

Syntax: pd.concat(data1, data2, sort)
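A combined sketch of apply, agg, groupby and concat (DataFrame and column names are assumed for illustration):

import numpy as np
import pandas as pd

#Total marks per student (across columns) and per subject (across rows)
marks_df.apply(np.sum, axis=1)
marks_df.apply(np.sum, axis=0)

#Average marks broadcast back to the original shape of the DataFrame
marks_df.apply(np.mean, axis=1, result_type='broadcast')

#Minimum and maximum of the columns
df.agg(['min', 'max'])

#Number of cars manufactured in each year
df.groupby('model_year')['name'].count()

#Mean, min and max horsepower by cylinders and model year
df.groupby(['cylinders', 'model_year']).agg({'horsepower': ['mean', 'min', 'max']})

#Combining the marks of two groups of students
pd.concat([marks_df1, marks_df2], sort=False)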

Sometimes, while using the above concat, a column mismatch may happen; this is resolved using the merge function. In Pandas, the merge keyword automatically performs an inner join. For other types of joins, the 'how' parameter must be specified.

The engineers at XYZ Custom Cars want to know the frequency distribution of different numbers of cylinders across different years. For such a condition, crosstab is used. It gives a tabular representation of the frequency distribution.

A Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. If we want to create a spreadsheet-style pivot table as a DataFrame, Pandas provides us with an option.

Syntax:

pd.pivot_table(data, index, aggfunc)

data: DataFrame

index: column to be set as index

aggfunc: function/list of functions, default = numpy.mean

Pandas also provides options to visualize the data. Here are some examples:

Syntax:

df.plot(x, y, marker, kind)

x = value on the X axis

y = value on the y axis
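A sketch of merge, crosstab and pivot_table (DataFrame and key names are assumed for illustration):

import pandas as pd

#Inner join on a shared key column; 'how' selects other join types
pd.merge(marks_df1, marks_df2, on='Name', how='outer')

#Frequency distribution of cylinder counts across model years
pd.crosstab(df['model_year'], df['cylinders'])

#Spreadsheet-style pivot table: mean horsepower per model year
pd.pivot_table(df, index='model_year', values='horsepower', aggfunc='mean')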

marker = shape of the points, in the case of specific plots like a scatter plot

kind = type of plot

For example, a scatter plot can be used to visualize the trend of acceleration in different years (refer to the further plots in the course slides).

DATA VISUALIZATION:

Data Visualization is the graphical representation of data or information using visual elements like graphs, charts, and maps. This representation helps us understand the patterns, trends, and outliers in the data, and it makes data easily understandable and explainable. With the increase in the volume of data, discovering the patterns in data has become challenging. By making use of data visualization, a huge chunk of complex data can be displayed in a way that is easy to understand and also appealing to the eyes.

In Data Visualization, a plot is the basic structure for the graphical representation of data. Let us understand the plot and its components.

Plot

A plot is the basic visualization element that helps to visualize the data. To visualize (plot) the data, figure and axes objects are required.

 Figure: The Figure is the top-level container that acts as the window or page on which everything is drawn. It can contain multiple independent subplots, multiple Axes, a title, a legend, etc.

 Axes: The Axes is the area on which data is plotted. It can have labels or ticks associated with it. There can be multiple Axes in a figure, but a given Axes object can only be in one figure. (These are the individual graphs, each with its ordinate and abscissa.)

Let us understand Axes, Figures, and Plots with the help of the images below:

A plot comprises several elements, such as the title, labels, axes and legend, that add more meaning to the visualization. A sample representation of the plots is shown below.

1. Title: The title is the name of the plot. By default, the title will be present at the top-center of the plot.
2. Axis: This consists of the X and Y axes which provide the horizontal and vertical coordinates of the data points.
3. Labels: The labels are used to name the X and Y axes with appropriate names respectively.
4. Legend: A legend is a set of key-value pairs that contains the list of plots and their respective labels.
5. Apart from these, there are several customizations available for specific plots.

MATPLOTLIB:

Matplotlib is one of the most basic and popular Python libraries used for data visualization. It was developed to imitate the plotting capabilities of MATLAB.

matplotlib.pyplot is used for two-dimensional graphics in Python programming. It can be used in the Python shell, scripts, web application servers, and other graphical user interface toolkits. Matplotlib uses libraries such as NumPy as a base for the underlying operations.

Following are the various approaches to plotting in Matplotlib:

 The MATLAB way of plotting, using matplotlib.pyplot. It is simple to use.
 The object-oriented way of plotting, for more control and customization.

Importing Matplotlib

To make use of the functionalities present in the Matplotlib library, the package must be imported into the environment. matplotlib.pyplot is the simplest way of plotting. It creates the default elements like the Figure and Axes required, and then plots the data.

This approach can be used to plot different kinds of graphs like line, bar and scatter plots, histograms etc. Matplotlib can be imported into the local environment or IDE using:

import matplotlib.pyplot as plt
import numpy as np

#creating two arrays
X = np.array([1, 2, 3, 4, 5])
Y = X**2

Now, let us plot the values using matplotlib.pyplot.

Syntax:

plt.plot(x, y)

x = data on the horizontal axis
y = data on the vertical axis

#plotting the values
plt.plot(X, Y)

We can also create the plotting elements using the object-oriented approach. The use of the object-oriented approach is recommended, as it gives more control over the customization of the plots. First, a figure is created:

Output:

<Figure size 432x288 with 0 Axes>

Next, the axes for the figure are set. The axes object is the region where the data can be plotted. A figure can have any number of axes objects.

Plotting a line on the axes:

Syntax:

ax.plot(x, y, color, label)  #ax represents the axes

x = data on the horizontal axis
y = data on the vertical axis

Here, two lines are plotted, between x & y and between x & z, with two different colours. We are creating a plot completely from scratch, adding a title, labels and a legend.

Creating subplots (plots within a plot) can also be done in the matplotlib library.
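A sketch of this object-oriented construction (data values assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = x**2
z = x**3

#Creating the figure and axes explicitly
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])  # left, bottom, width, height

#Two lines with different colours, plus title, labels and legend
ax.plot(x, y, color='blue', label='x squared')
ax.plot(x, z, color='red', label='x cubed')
ax.set_title('Sample plot')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
plt.show()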

Though it may seem like plot 2 is embedded in plot 1 due to the placement of the axes, plot 1 and plot 2 are completely different plots and can only be accessed through ax1 and ax2 respectively.

The subplots method is used to create a common layout for multiple plots. 'm x n' plots can be created using subplots, where m and n represent the number of rows and columns respectively. Now, let us create a subplot layout that has 1 row and 2 columns.

Syntax:

plt.subplots(nrows, ncols)

ax1 specifies the axes for the first plot and ax2 specifies the axes for the second plot. A subplot is created with two different plots aligned in 1 row and 2 columns. In this layout, it can be seen that the y label of the second plot overlaps with the first plot. To avoid this, 'fig.tight_layout()' must be added.
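A sketch of the 1x2 subplot layout (data values assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)

ax1.plot(x, x**2)
ax1.set_ylabel('x squared')

ax2.plot(x, x**3)
ax2.set_ylabel('x cubed')

fig.tight_layout()  # prevents the second y label from overlapping the first plot
plt.show()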
TYPES OF PLOTS:

 Box plot
 Scatter plot
 Bar chart
 Histogram
 Pie chart
 Line chart

BOX PLOT:

A boxplot gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on the five-number summary: "minimum", first quartile (Q1), median, third quartile (Q3), and "maximum".

First, let us plot the average mileage 'mpg' from the data using a boxplot.

Syntax:

ax.boxplot(data)  #ax represents the axes

There is no data for city mileage, but city mileage is 25% less than the average mileage, i.e. 'mpg'. The next step is to derive the data for city mileage: a new column 'city_mileage' is created. Then, the distributions of the average mileage and the city mileage have to be compared.
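A sketch of the boxplot comparison (the 0.75 factor follows from "25% less than 'mpg'"):

import matplotlib.pyplot as plt

#Deriving city mileage and comparing both distributions
df['city_mileage'] = df['mpg'] * 0.75

fig, ax = plt.subplots()
ax.boxplot([df['mpg'], df['city_mileage']])
plt.show()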

SCATTER PLOT:

A scatter plot uses dots or markers to represent a value on the axes. The scatter plot is one of the simplest plots, which can accept both quantitative and qualitative values, and has a wide variety of applications in primitive data analysis.

Several meaningful insights can be drawn from a scatter plot, for example, identifying the type of correlation between variables before diving deeper into predictions.

Syntax:

ax.scatter(x, y, marker)  #ax represents the axes

x = data on the horizontal axis
y = data on the vertical axis
marker = shape of the data points (for example, 'o' for circles, 's' for squares etc.)

Visualizing the correlation between mileage and horsepower based on the origin of the cars:
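A sketch of this plot (column names assumed from the auto_mpg dataset):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for origin, group in df.groupby('origin'):
    ax.scatter(group['horsepower'], group['mpg'], marker='o', label=origin)
ax.set_xlabel('horsepower')
ax.set_ylabel('mpg')
ax.legend()
plt.show()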
In the above scatter plot, different colors are assigned to different origins. From the plot, it can be concluded that mileage and horsepower are negatively correlated to some extent, and the following insights can be drawn:

1. Which country's cars have the highest mileage?
Ans: Japan

2. Which country's cars have the highest horsepower?
Ans: USA

BAR CHART:

A bar chart is a graph with rectangular bars that usually compares different categories. Each bar represents a particular category. The length of the bar indicates the total number of values or items in that category.

The bar graphs can be plotted vertically or horizontally. The example shown below depicts the number of cars manufactured by each company.

Syntax:

ax.bar(x, height, width, bottom, align)  #ax represents the axes

x = data on the horizontal axis
height = height of the bars
width = width of the bars. Default: 0.8
bottom = Y coordinates of the bar bases. Default: 0
align = alignment of the bars to the x coordinates. Default value: 'center'

HISTOGRAM:

A histogram also represents data as rectangular bars. Unlike the bar chart, it is used for continuous data. Each bar groups the numbers into intervals (bins), and the height of the bar is based on the number of values that fall into the corresponding interval.

A histogram is ideally suited for obtaining the frequency distribution of given data, and one such example is shown below. This can be done with the help of a histogram as follows:

Syntax:

ax.hist(x, bins)  #ax represents the axes

x = input values; a single array or a sequence of arrays
bins = int, sequence, or str. If bins is an integer, it defines the number of equal-width bins in the range.
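A sketch of the histogram (the bin count is assumed for illustration):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(df['horsepower'], bins=20)
ax.set_xlabel('horsepower')
ax.set_ylabel('number of cars')
plt.show()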

From the histogram, it is observed that most of the cars have a horsepower value ranging between 70 and 110.

PIE CHART:

A pie chart divides the entire dataset into distinct groups. The chart consists of a circle split into wedges, and each wedge represents a group. The size of the wedge is proportional to the number of items in each group compared to the others. The sum of the wedges in a pie chart will always be 100%. An example of a pie chart is shown below.

The details on the origin of the cars and their numbers can be presented to the stakeholders visually for their easy understanding. Let us visualize the data using a pie chart as follows:

Syntax:

ax.pie(x, labels)  #ax represents the axes

x = wedge sizes; a one-dimensional array
labels = sequence of strings providing the labels for the wedges

A basic pie chart has been plotted using the ax.pie() function in matplotlib. This plot could also be made more visually appealing by customizing the other parameters present in the ax.pie() function.

Following are the parameters:

explode – to get an elevated view for the selected wedge.
colors – to customize the colors of the plot.
autopct – to add the percentage of the distribution in the pie chart.
shadow – to add a shadow to the plot.
startangle – to change the starting angle of the pie chart.
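A sketch of such a pie chart with percentage labels (column name assumed from the auto_mpg dataset):

import matplotlib.pyplot as plt

origin_counts = df['origin'].value_counts()

fig, ax = plt.subplots()
ax.pie(origin_counts, labels=origin_counts.index, autopct='%1.1f%%', startangle=90)
plt.show()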
connected to obtain the patterns and
draw meaningful inferences.
An image depicting the changes in the
stock prices over time is shown below.

As a data analyst, a line chart can be


created to visualize the relationship
between mileage, horsepower, and the
weight of the cars manufactured in
different years.
Let us create a DataFrame containing
the mean values of mileage,
From the plot above, following are the horsepower and the weight of the cars
insights: based on the model year.
1. Which will be the most suitable Matplotlib supports several line styles
region to start a new branch? such as solid line, dashed line (-----),
dotted line (…..), dashdot (-.-.-.-) etc. In
42
Course On Machine Learning & Data science

addition to color and linestyle, the width of the line can be customized to get a unique plot. Below is a small example of how linestyle, width, and color can be used.
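A sketch of these line customizations (data values assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])

fig, ax = plt.subplots()
ax.plot(x, x**2, linestyle='--', linewidth=2, color='green')
ax.plot(x, x**3, linestyle=':', linewidth=3, color='purple')
plt.show()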
example as follows:
# To save a figure --
fig.savefig("multiple-axes-plots.jpg",dpi =
200)
Here, we have saved the image as
"multiple-axes-plots.jpg". 'dpi' – Dots Per
Inch indicates the resolution of the image,
higher the number, more will be the
resolution.
While using plt.plot() to create a
plot, ‘plt.savefig()’ can be used to save the
figure as an image
#when graph plotted using plt
plt.savefig("filename.jpg",dpi = 200)
Seaborn
Several others such as texting and
annotate can also be done in matplot line Seaborn is a statistical data visualization
graphs which can be seen in the ppt. library in python. It is integrated to work
with Pandas DataFrames with a straight
https://
forward approach.
infyspringboard.onwingspan.com/web/
en/viewer/web-module/ Seaborn extends the plotting capabilities
lex_auth_013331489980080128217_share of Matplotlib and provides a high-level
d? interface to generate attractive plots that
collectionId=lex_auth_0136097078904913 are visually appealing.
9215&collectionType=Learning Plotly
%20Path&pathId=lex_auth_01333063698
060902494_shared,lex_auth_0133313795 Plotly is another data visualization library
25869568215_shared that is used to generate highly interactive
plots.

43
Course On Machine Learning & Data science

SCIKIT-LEARN:

Scikit-learn (also referred to as sklearn) is a Python library widely used for machine learning. It is characterized by a clean, uniform and streamlined API.

Machine Learning (ML) is a branch of artificial intelligence that aims at building systems that can learn from data, identify patterns and make decisions with minimal human intervention.

PROBLEM: Engineers at XYZ Custom Cars now want to create a machine learning model that can predict the mpg of any car that comes to their garage (MPG refers to miles per gallon). A linear regression model is to be built for this problem. The different stages to be followed in building the ML model are shown below:

STEP 1 DATA LOADING:
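A minimal sketch of the loading step, using the auto_mpg.csv file read earlier in these notes:

import pandas as pd

df = pd.read_csv('auto_mpg.csv')
df.head()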
STEP 2 DATA PREPROCESSING:

Data properties:

df.info()

Dropping null values:

It can be observed that the 'horsepower' attribute has some null values. The easiest

approach is to remove the rows with null values as shown below.

Predictors and target:

The target variable is 'mpg', which has to be predicted. The predictors are the variables that are used to predict the target. Here, except for the name of the car, all the other variables are included as predictors (from cylinders to origin in the DataFrame).

#Creating the matrix of predictors (from cylinders to origin)
X = df.iloc[:, 1:8]

#Creating the target - mpg is column 0, as it is to be predicted
y = df.iloc[:, 0]

Since the origin feature is a categorical variable, the get_dummies function from Pandas can be used to encode it, as shown below:

X = pd.get_dummies(X)
X

From the below image, it can be observed that the categorical variable 'origin' has been encoded with 0s and 1s.

Train and test split:

The data must be divided into two parts: first, a training set on which the model can be trained; second, a testing set on which the model can be validated. The sklearn library is used for this, as shown below:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

X_train and y_train correspond to the training predictors and target respectively. X_test and y_test correspond to the testing predictors and target respectively. 'test_size = 0.2' means that 20% of the data will be used as the test set.

Since the variables in the data have different units of measurement and different scales, it would be a good idea to standardize them. A standard scaler performs this operation by transforming the columns such that the mean of every column or variable is 0 and the standard deviation is 1.

#Applying the standard scaler on the data
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

IMPORTING THE REQUIRED MODEL:

The linear regression model is used to build the model. A linear regression model uses the following equation:

y = B0 + B1*X1 + B2*X2 + ... + Bn*Xn

In this case, y refers to the target and X1, X2, ..., Xn refer to the predictors. B0 is the intercept and B1, B2, ..., Bn are the coefficients.

The below code demonstrates building the linear regression model using the sklearn library on the training data set.

#Importing and fitting the model on the training set
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

#Fitting the model on the training data:
reg.fit(X_train, y_train)

#Checking the coefficients (slopes) and the intercept.
#'m' represents the coefficients and 'c' represents the intercept.
m = reg.coef_
c = reg.intercept_
m, c

In the next step, the linear regression model created is used for prediction against the training and testing data sets.

#Predicting the target mpg from the predictors in the training data set
#Predicted data stored in y_pred_train
y_pred_train = reg.predict(X_train)

#Predicting the target mpg from the predictors in the testing data set
#Predicted data stored in y_pred_test
y_pred_test = reg.predict(X_test)

There are different metrics used to evaluate the performance of the model. Here, the R-squared score is used.

# Prediction accuracy in terms of how close the predicted value of the target mpg
# is to the real value in the training data set
from sklearn.metrics import r2_score

r2_S = r2_score(y_train, y_pred_train)
r2_S

The R-squared score for training accuracy comes out to be 81%.

Evaluating performance on the test set:

# Prediction accuracy in terms of how close the predicted value of the target mpg
# is to the real value in the testing data set
from sklearn.metrics import r2_score

r2_S = r2_score(y_test, y_pred_test)
r2_S

The accuracy comes out to be 83%. Thus, scikit-learn helps to train, test and evaluate machine learning models.

DATA VISUALIZATION:

Data visualisation is the graphical representation of data or information using visual elements like graphs, charts and maps. With the increase in the volume of data, discovering patterns in the data becomes challenging. Through data visualisation, a huge chunk of complex data can be displayed in a way that is easily comprehensible as well as pleasing to the eyes.

Data visualisation:

 helps in finding patterns and connections between variables.
 requires less effort from the reader to understand the visuals.
 condenses a large amount of information into a small space for quick analysis.
 provides relevant answers and clarity on certain questions swiftly.

There could potentially be two types of visualisations based on the types of stakeholders involved:

1. For self-consumption during data exploration, feature engineering, etc.
2. For presenting or communicating the insights (from the data) to a target audience, typically decision makers. This sort of visualisation is usually performed to prepare the final results/reports that may enable the target audience in decision making.

The different types of data collected from various sources are as follows:

 Temporal Data: Data with a time component attached to it. For example, the opening and closing values of stocks in a year. A plot that can represent the sequence in this data and the pattern changes over time is required.

 Geospatial Data: Data with a physical location as an attribute. For example, the locations of volcanoes around the world. A plot that can represent this data on a geographical map is required.
 Topical Data: Data concerned with topics. For example, feedback from customers. A plot that can represent the relationships in this data is required.
 Network Data: Data in the form of nodes and links between nodes. For example, social networking data. A plot that can represent the relationships between nodes is required.
 Tree Data: Data which is basically network data but with some hierarchy in it. For example, an organisational structure. A plot that can represent the tree structure is required.

Generally, data can be of two types:

 Qualitative/Categorical Data: Data that deals with characteristics and descriptions. It is further categorised as:
1. Binary: Data that is dichotomous. For example, True/False, Yes/No, 1/0 etc.
2. Nominal: Data with no ordering or ranking. For example, different colours, blood groups, nationality etc.
3. Ordinal: Data with a specific order or ranking. For example, height (short, medium, tall), income (low, medium, high), etc.

 Quantitative/Numerical Data: Data that is numerical in nature and can be measured. It is further categorised as:
1. Discrete: Data that can be counted (whole numbers). For example, the number of floors in a building, the number of students in a classroom, etc.
2. Continuous: Data that can take any value within a range. For example, weight, the mileage of a car, etc.

Data analysts perform exploratory data analysis to handle missing values, outliers, etc. and to analyse the relationships between the variables. Data analysts require knowledge of the statistical concepts used in data analysis to:

 select the right type of graph
 infer information like outliers, correlation of variables, redundant features etc.

Outliers:

Outliers are the extreme values present in the dataset. They affect the properties of the data, like the mean and variance, which are used in model building. Hence, they may impact the accuracy of the model.

So, the questions that arise are: how to know if a value is an outlier, and how to deal with such values? Let us find out.

Quartiles

Quartiles divide the data points into four equal-sized groups, or quarters.

Following are the steps to find the quartiles:

 sort the dataset in ascending order.
 find the median of the sorted dataset (the median divides the dataset into two halves - this is Quartile 2, or Q2).
 repeat step 2 with the first and second halves of the data (this gives Q1 and Q3, dividing the dataset into four equal parts).

With the help of quartiles, a value called the Inter-Quartile Range (IQR) can be calculated using the formula:

IQR = Q3 - Q1

Inter-Quartile Range

The Inter-Quartile Range, also called the mid-spread, H-spread, or IQR, indicates where most of the data is lying. As the IQR is calculated using the median, the outlying values do not affect it. A formula is used to calculate the upper limit and lower limit of this range. Any data point lying outside these limits is an outlier.

Upper Limit: Q3 + (1.5 * IQR)
Lower Limit: Q1 - (1.5 * IQR)
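A sketch of the IQR rule applied to one column (column name assumed from the auto_mpg dataset):

#Flagging horsepower outliers with the IQR rule
q1 = df['horsepower'].quantile(0.25)
q3 = df['horsepower'].quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = df[(df['horsepower'] < lower_limit) | (df['horsepower'] > upper_limit)]
print(outliers)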

Identifying Patterns in Data

Visualising the tuples as scatter plots can be useful to spot gaps in the values and hence identify the data points crucial to the dataset. It can help draw decisive inferences about the type of predictor or classifier to be used.

A line chart is used to analyse historic variations and trends in data.

A bar chart is a graph with rectangular bars that compares different categories.

A histogram represents data as rectangular bars. Unlike the bar chart, it is used for continuous data, to obtain the frequency distribution of the given data.

A dist plot, or distribution plot, depicts the variation in a data distribution. It represents the overall distribution of continuous data variables. The dist plot depicts the data with a histogram and a line in combination with it.

A joint plot is a combination of two univariate plots and one bivariate plot. The bivariate plot (in the center) helps in analysing the relationship between two variables. The univariate plots describe the distribution of data in each variable as marginal plots.

A pair plot depicts pairwise relationships between all the variables in a dataset in a matrix format. Each row and column in the matrix represents a variable in the dataset. The plots present on the diagonal are univariate plots, as the variables are compared with themselves, and the others are bivariate scatter plots.

A heat map is a graphical representation of data where similar values are depicted by the same colours. The colours vary based on the intensity of the results. One example of a heat map is finding the correlation between the variables in a dataset, as depicted in the figure below.

A network is a set of objects (called nodes or vertices) that are connected to each other. The connections between the nodes are called edges or links.

If the edges in a network are directed, i.e., pointing in only one direction, the network is called a directed network. When drawing a directed network, the edges are typically drawn as arrows indicating the direction. If all the edges are bidirectional, the network is an undirected network.

A word cloud is a visual representation of free-form text, like a collage. It is typically used to depict the keyword metadata of websites, articles, reviews, feedback etc. The frequency and

significance of the words are depicted by the font, font size and colour of the text in the cluster. Words with greater significance and occurrence are depicted in a bigger and bolder font towards the central location of the cluster, while other latent words occupy peripheral places with smaller fonts and faded colours. The most insignificant words, stop words and irrelevant information are eliminated from the cluster while plotting it.

A word cloud finds its usage mostly in Natural Language Processing.

A choropleth map is a pictorial representation of data on a geographical map. The intensity of colour in a region on the map corresponds to the respective values.

The figure below depicts the choropleth map of the Covid-19 distribution in India as of 14th October, 2020. It represents the count of the spread on the given date. A deeper shade corresponds to a higher value, while a lighter shade marks the safer regions.

A few of the popular Python libraries used for data visualisation are listed as follows:

 Matplotlib
 Seaborn
 Plotly

Seaborn is a statistical data visualisation library in Python. It is integrated to work with Pandas data frames with a more straightforward approach. Seaborn extends the plotting capabilities of matplotlib and provides a high-level interface to generate attractive plots that are visually appealing.
