
Advanced Data Science Training

© The Knowledge Academy Ltd 1


About The Knowledge Academy
• World Class Training Solutions
• Subject Matter Experts
• Highest Quality Training Material
• Accelerated Learning Techniques
• Project, Programme, and Change
Management, ITIL® Consultancy
• Bespoke Tailor Made Training Solutions
• PRINCE2®, MSP®, ITIL®, Soft Skills, and More

© The Knowledge Academy Ltd 2


Administration
• Trainer
• Fire Procedures
• Facilities
• Days/Times
• Breaks
• Special Needs
• Delegate ID check
• Phones and Mobile devices

© The Knowledge Academy Ltd 3


Outlines
• Module 1: Python for Data Analysis – NumPy

• Module 2: Python for Data Analysis – Pandas

• Module 3: Python for Data Visualisation – Matplotlib

• Module 4: Python for Data Visualisation – Seaborn

© The Knowledge Academy Ltd 4


Outlines
• Module 5: Machine Learning

• Module 6: Natural Language Processing

• Module 7: Deep Learning

• Module 8: Big Data

• Module 9: Working with Data in R

• Module 10: Regression in R

© The Knowledge Academy Ltd 5


Outlines
• Module 11: Modelling Data in Power BI

• Module 12: Shaping and Combining Data using Power BI

• Module 13: Interactive Data Visualisations

© The Knowledge Academy Ltd 6


Module 1: Python for Data Analysis -
NumPy

© The Knowledge Academy Ltd 7


Introduction to NumPy
What is NumPy?
• NumPy (short for Numerical Python) is a Python library used for working with arrays. It
also has functions for working in the domains of linear algebra, matrices, and the Fourier
transform. You can use it freely because it is open source

Why Use NumPy?

• Python lists can serve the purpose of arrays, but their processing is slow. NumPy aims
to provide an array object that is up to 50x faster than traditional Python lists

• In NumPy, the array object is called ndarray, and it provides many supporting functions
that make working with ndarray easy. Arrays are used frequently in data science, where
resources and speed are essential

© The Knowledge Academy Ltd 8


Introduction to NumPy
Why is NumPy Faster Than Lists?

• Unlike lists, NumPy arrays are stored in one continuous place in memory, so processes
can access and manipulate them efficiently. In computer science, this behaviour is known
as locality of reference

• This is the main reason why NumPy is faster than lists. Moreover, it is optimised to work
with the latest CPU (Central Processing Unit) architectures

Which language is NumPy written in?

• NumPy is a Python library written partially in Python, but most of the parts that require
fast computation are written in C or C++

© The Knowledge Academy Ltd 9


NumPy Arrays
1. Arrays in NumPy: The homogeneous multidimensional array is the main object of
NumPy

• It is a table of elements (typically numbers), all of the same type, indexed by a tuple of
non-negative integers

• Dimensions are called axes in NumPy, and the number of axes is the rank

• NumPy's array class is called ndarray; it is also known by the alias array

Example

Output
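The slide's code screenshot is not reproduced here; a minimal sketch of the kind of example it illustrates (array values are illustrative assumptions) is:

import numpy as np

# Create a 2-D ndarray from a nested Python list
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(type(arr))    # <class 'numpy.ndarray'>
print(arr.ndim)     # 2 (number of axes, i.e. the rank)
print(arr.shape)    # (2, 3)
print(arr.dtype)    # e.g. int64 (platform-dependent)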

© The Knowledge Academy Ltd 10


NumPy Arrays
2. Array Indexing: To analyse and manipulate the array object, it is essential to
understand the basics of array indexing. NumPy provides various ways to do array
indexing

Slicing

• NumPy arrays can be sliced, just like lists in Python. Because arrays can be
multidimensional, you need to specify a slice for every dimension of the array

Integer array indexing

• In this method, lists of indices are passed for every dimension. A one-to-one mapping of
the corresponding elements is done in order to construct a new arbitrary array, as shown
in the sketch below
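A small sketch of both indexing styles, assuming an illustrative 3×4 array:

import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Slicing: first two rows, columns 1 and 2
print(arr[:2, 1:3])                 # [[2 3] [6 7]]

# Integer array indexing: elements (0, 0), (1, 2) and (2, 3)
print(arr[[0, 1, 2], [0, 2, 3]])    # [ 1  7 12]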

© The Knowledge Academy Ltd 11


NumPy Arrays
(Continued)

Boolean array indexing

• This method is used to pick elements from the array that satisfy some condition

Example

Output
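A minimal sketch of Boolean array indexing (the array values are assumptions):

import numpy as np

arr = np.array([10, 15, 20, 25, 30])

# Boolean array indexing: keep only the elements greater than 18
mask = arr > 18
print(mask)         # [False False  True  True  True]
print(arr[mask])    # [20 25 30]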

© The Knowledge Academy Ltd 12


NumPy Arrays
3. Basic operations: NumPy provides a plethora of built-in arithmetic functions

Operations on a single array

• Overloaded arithmetic operators can be used to perform element-wise operations on
an array, creating a new array. The existing array is modified in the case of the -=, +=,
and *= operators

Example

Output
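A sketch of element-wise operations on a single array (values are illustrative):

import numpy as np

a = np.array([1, 2, 5, 3])

print(a + 1)    # add 1 to every element: [2 3 6 4]
print(a * 10)   # multiply every element by 10: [10 20 50 30]

a -= 2          # -=, += and *= modify the existing array in place
print(a)        # [-1  0  3  1]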

© The Knowledge Academy Ltd 13


NumPy Arrays
(Continued)

Unary operators

• Various unary operations are provided as methods of the ndarray class, including min,
sum, max, etc. By setting the axis parameter, these functions can also be applied column-
wise or row-wise

Example

Output
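A sketch of unary ndarray methods with and without the axis parameter (values assumed):

import numpy as np

arr = np.array([[1, 5, 6],
                [4, 7, 2],
                [3, 1, 9]])

print(arr.max())          # largest element in the whole array: 9
print(arr.min(axis=0))    # column-wise minimum: [1 1 2]
print(arr.sum(axis=1))    # row-wise sum: [12 13 13]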

© The Knowledge Academy Ltd 14


NumPy Arrays
(Continued)

Binary operators

• These operations are applied element-wise on arrays, and a new array is created. All
basic arithmetic operators such as +, -, /, etc., can be used. The existing array is modified
in the case of the +=, -=, and *= operators

Example

Output

© The Knowledge Academy Ltd 15


Aggregations: Min, Max and more
• The Python numpy module provides various statistical or aggregate functions for
working with single-dimensional or multi-dimensional arrays

• min, sum, mean, max, average, median, product, standard deviation, argmin, variance,
percentile, argmax, cumsum, cumprod, and corrcoef are among the NumPy aggregate
functions

• The following arrays are used in order to demonstrate these NumPy aggregate
functions:

© The Knowledge Academy Ltd 16


Aggregations: Min, Max and more
(Continued)

Python NumPy sum

• The Python NumPy sum function calculates the sum of the values in an array

• The sum function accepts an optional argument named axis, which can be used to
calculate the sum along a given axis. For instance, axis = 0 returns the sum of each
column in a NumPy array

© The Knowledge Academy Ltd 17


Aggregations: Min, Max and more
(Continued)

• The sum of each row in an array is returned by axis = 1
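The slides' arrays are not reproduced; a sketch with an assumed 2×3 array covering both axes is:

import numpy as np

y = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(y))            # sum of all elements: 21
print(np.sum(y, axis=0))    # sum of each column: [5 7 9]
print(np.sum(y, axis=1))    # sum of each row: [ 6 15]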

© The Knowledge Academy Ltd 18


Aggregations: Min, Max and more
(Continued)

Python NumPy average

• The average of a given array is returned by Python NumPy average function

• Average of x and Y axis

© The Knowledge Academy Ltd 19


Aggregations: Min, Max and more
(Continued)

• Without using the axis name calculate numpy array Average

© The Knowledge Academy Ltd 20


Aggregations: Min, Max and more
(Continued)

Python NumPy min

• The Python numpy min function returns the minimum value along a given axis or in a
given array

© The Knowledge Academy Ltd 21


Aggregations: Min, Max and more
(Continued)

• Here, we are finding the numpy array minimum value in the X and Y-axis

© The Knowledge Academy Ltd 22


Aggregations: Min, Max and more
(Continued)

Python NumPy max

• The maximum number in a given axis or from a given array is returned by the Python
numpy max function

© The Knowledge Academy Ltd 23


Aggregations: Min, Max and more
(Continued)

• By using numpy max function find the maximum value in the X and Y-axis

© The Knowledge Academy Ltd 24


Aggregations: Min, Max and more
(Continued)

Python NumPy mean

• The Python numpy mean function returns the average or mean of a given array, or
along a given axis. Mathematically, the mean is the sum of all the items in an array
divided by the number of items

© The Knowledge Academy Ltd 25


Aggregations: Min, Max and more
(Continued)

• Mean value of x and Y-axis (or every row and column)

• Here, we are calculating Mean without using the axis name

© The Knowledge Academy Ltd 26


Computation on Arrays: Broadcasting
• The term broadcasting refers to how NumPy treats arrays with different dimensions
during arithmetic operations, subject to specific constraints. The smaller array is
broadcast across the larger array so that they have compatible shapes

• Broadcasting provides a means of vectorising array operations so that looping occurs in
C rather than Python, since NumPy is implemented in C

• It does this without making unnecessary copies of data, which leads to efficient
algorithm implementations

• In some cases, broadcasting is a bad idea because it leads to inefficient memory
utilisation, which slows down computation

© The Knowledge Academy Ltd 27


Computation on Arrays: Broadcasting
(Continued)

• Example

Output

Broadcasting Rules:

• The following are the rules in order to broadcast two arrays together:

1. If the arrays do not have the same rank, prepend the shape of the lower-rank array
with 1s until both shapes have the same length

© The Knowledge Academy Ltd 28


Computation on Arrays: Broadcasting
2. In a dimension, the two arrays are compatible if they have the same size in the
dimension or if one of the arrays has size 1 in that dimension

3. If arrays are compatible with all dimensions then they can be broadcasted together

4. After broadcasting, every array acts as if it had shape equivalent to the element-wise
maximum of shapes of the two input arrays

5. In any dimension where one array had size 1, as well as the other array had size greater
than 1, the first array acts as if it were copied along that dimension

© The Knowledge Academy Ltd 29


Computation on Arrays: Broadcasting
(Continued)

Example 1: Single Dimension array

Output

© The Knowledge Academy Ltd 30


Computation on Arrays: Broadcasting
(Continued)

Example 2: Two Dimensional Array

Output
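The slide's code is not shown; a minimal sketch of broadcasting a 1-D array across a 2-D array (values assumed) is:

import numpy as np

a = np.array([[0, 10, 20],
              [30, 40, 50]])    # shape (2, 3)
b = np.array([1, 2, 3])         # shape (3,)

# b is broadcast across each row of a
print(a + b)
# [[ 1 12 23]
#  [31 42 53]]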

© The Knowledge Academy Ltd 31


Computation on Arrays: Broadcasting
(Continued)

Plotting a two-dimensional function:

• Broadcasting is also often used for displaying images based on two-dimensional
functions, for example when we want to define a function z = f(x, y)

Output
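The particular function on the slide is not shown; a sketch of evaluating an assumed z = f(x, y) over a grid via broadcasting is:

import numpy as np

# Build a grid via broadcasting: x is a row, y is a column vector
x = np.linspace(0, 5, 50)                   # shape (50,)
y = np.linspace(0, 5, 40)[:, np.newaxis]    # shape (40, 1)

# z = f(x, y) evaluated over the whole 40 x 50 grid by broadcasting
z = np.sin(x) ** 2 + np.cos(10 + y * x) * np.cos(x)
print(z.shape)    # (40, 50)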

© The Knowledge Academy Ltd 32


Comparison, Boolean Logic, and Masks
Python NumPy Comparison Operators
• The Python NumPy comparison functions and operators are used to compare the array
items and return Boolean True or False

• greater, greater_equal, less, less_equal, equal, and not_equal are the NumPy
comparison functions. The NumPy comparison operators are <, <=, >, >=, == and !=

• The numpy random randint function can be used to generate random two-dimensional
and three-dimensional integer arrays

© The Knowledge Academy Ltd 33


Comparison, Boolean Logic, and Masks
(Continued)

• The first statement generates a two-dimensional array with 5 rows and 8 columns; the
values lie between 10 and 50

arr1 = np.random.randint(10, 50, size = (5, 8))

• The second statement generates a random three-dimensional array of size 2×3×6; the
generated random values lie between 1 and 20

arr2 = np.random.randint(1, 20, size = (2, 3, 6))

© The Knowledge Academy Ltd 34


Comparison, Boolean Logic, and Masks
Python Numpy Array greater
• First, we create an array of random elements. After that, we examine whether the
array elements are greater than 0, 1, and 2. True is returned where the condition holds;
otherwise, False is returned

Output
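A minimal sketch of the greater comparison on a random array (the threshold is illustrative):

import numpy as np

arr = np.random.randint(0, 5, size=6)
print(arr)

# Element-wise comparison: True where the element is greater than 2
print(np.greater(arr, 2))
print(arr > 2)    # the operator form is equivalent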

© The Knowledge Academy Ltd 35


Comparison, Boolean Logic, and Masks
(Continued)

• Here, the Python NumPy greater function is used on 2-dimensional and 3-dimensional
arrays

• The first greater function checks whether the values in the 2-D array are greater than
30 or not

• If true, Boolean True is returned; otherwise, False is returned. Next, we check whether
the array elements in a 3-D array are greater than 10 or not
© The Knowledge Academy Ltd 36


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 37


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array greater_equal

• The Python NumPy greater_equal function checks whether the given array elements
are greater than or equal to a specified number. It returns True if so; otherwise, False

• The first NumPy statement checks whether the items in the array are greater than or
equal to 2. The second statement checks whether the items in a random 2-dimensional
array are greater than or equal to 25

• The third statement checks whether the items of a randomly generated 3-D array are
greater than or equal to 7

© The Knowledge Academy Ltd 38


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 39


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array less

• The Python NumPy less function checks whether the elements in a given array are less
than a specified number

• If so, Boolean True is returned; otherwise, False. The syntax of the Python NumPy less
function is:

numpy.less(array_name, integer_value)

© The Knowledge Academy Ltd 40


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 41


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array less_equal

• The Python NumPy less_equal function checks whether each element in a given array is
less than or equal to a specified number. If so, Boolean True is returned; otherwise, False

• The following is the syntax of the Python NumPy less_equal function:

numpy.less_equal(array_name, integer_value)

© The Knowledge Academy Ltd 42


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 43


Comparison, Boolean Logic, and Masks
Boolean numpy arrays
• A boolean array is a NumPy array with boolean (True/False) values. Such an array can
be obtained by applying a comparison operator to another NumPy array:

© The Knowledge Academy Ltd 44


Comparison, Boolean Logic, and Masks
Logical operations on Boolean arrays
• By using logical operators, Boolean arrays can be combined:

operator    meaning
~           negation (logical "not")
&           logical "and"
|           logical "or"
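A minimal sketch of combining Boolean arrays with these operators (array values assumed):

import numpy as np

a = np.array([1, 5, 3, 9, 7])

mask1 = a > 2             # [False  True  True  True  True]
mask2 = a < 8             # [ True  True  True False  True]

print(mask1 & mask2)      # logical "and": [False  True  True False  True]
print(mask1 | ~mask2)     # logical "or" combined with negation
print(a[mask1 & mask2])   # boolean masks can be used to filter: [5 3 7]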

© The Knowledge Academy Ltd 45


Comparison, Boolean Logic, and Masks
(Continued)

• Example 1:

Output

© The Knowledge Academy Ltd 46


Comparison, Boolean Logic, and Masks
(Continued)

• Example 2:

Output

© The Knowledge Academy Ltd 47


Comparison, Boolean Logic, and Masks
(Continued)

• Example 3:

Output

© The Knowledge Academy Ltd 48


Comparison, Boolean Logic, and Masks
Masks

• The numpy.ma.mask_rows() function masks the rows of a 2-dimensional array that
contain masked values. The numpy.ma.mask_rows() function is a shortcut to
mask_rowcols with axis equal to 0

Example

Output
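The slide's example is not shown; a short sketch of mask_rows (the masked value is an assumption) is:

import numpy as np
import numpy.ma as ma

a = np.zeros((3, 3), dtype=int)
a[1, 1] = 1
a = ma.masked_equal(a, 1)    # mask the value 1

# mask_rows masks every row of a 2-D array that contains a masked value
print(ma.mask_rows(a))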

© The Knowledge Academy Ltd 49


Fancy Indexing
• Fancy indexing is like the simple indexing, but we pass arrays of indices instead of single
scalars

• It permits us to quickly access as well as change complicated subsets of an array's values

Exploring Fancy Indexing

• Fancy indexing is conceptually simple: it means passing an array of indices in order to
access multiple array elements at once

• For instance, consider the below-written array:

Output
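A sketch of fancy indexing on an assumed random array:

import numpy as np

rng = np.random.default_rng(42)
x = rng.integers(0, 100, size=10)
print(x)

# Fancy indexing: pass a list or array of indices to access several elements at once
ind = [3, 7, 4]
print(x[ind])

# The shape of the result follows the shape of the index array
ind2 = np.array([[3, 7],
                 [4, 5]])
print(x[ind2])    # a 2 x 2 result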

© The Knowledge Academy Ltd 50


Fancy Indexing
(Continued)

• Let us suppose we want to access three different elements

• Alternatively, we can pass a single list or array of indices to get the same result:

© The Knowledge Academy Ltd 51


Fancy Indexing
(Continued)

• While utilising fancy indexing, the shape of the result reflects the shape of the index
arrays instead of the shape of the array being indexed:

Output

• Even fancy indexing works in multiple dimensions. See the example shown below:

Output

© The Knowledge Academy Ltd 52


Fancy Indexing
(Continued)

• As with standard indexing, the first index refers to the row and the second to the
column:

Output

• The broadcasting rules are followed by the pairing of indices in fancy indexing.
Therefore, for instance, we get a two-dimensional result if we combine a column vector
as well as a row vector within the indices:

Output

© The Knowledge Academy Ltd 53


Fancy Indexing
(Continued)

• It is important to remember with fancy indexing that the return value reflects the
broadcasted shape of the indices, rather than the shape of the array being indexed

Combined Indexing

• Fancy indexing can be combined with the other indexing schemes for more powerful
operations:

Output

© The Knowledge Academy Ltd 54


Fancy Indexing
(Continued)

• We can combine simple as well as fancy indices:

• We can combine fancy indexing with slicing as well:

© The Knowledge Academy Ltd 55


Fancy Indexing
(Continued)

• Even, fancy indexing can be combined with masking:

Output

• All of these indexing options combined lead to a very flexible group of operations for
accessing as well as modifying array values

© The Knowledge Academy Ltd 56


Fancy Indexing
Modifying Values with Fancy Indexing
• Fancy indexing can be used to access parts of an array, and it can be used to modify
parts of an array as well. For instance, assume we have an array of indices and we want
to set the corresponding items in an array to some value:

• For this, any assignment-type operator can be used. For instance:
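A minimal sketch of modifying values with fancy indexing (values and indices are assumptions):

import numpy as np

x = np.arange(10)
i = np.array([2, 1, 8, 4])

x[i] = 99     # set the selected items to a value
print(x)      # [ 0 99 99  3 99  5  6  7 99  9]

x[i] -= 10    # any assignment-type operator can be used
print(x)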

© The Knowledge Academy Ltd 57


Fancy Indexing
(Continued)

• Notice that repeated indices with these operations can cause potentially unexpected
outcomes

• The outcome of this operation is to first assign A[0] = 2, followed by A[0] = 8. The result
is that A[0] contains the value 8

© The Knowledge Academy Ltd 58


Sorting Arrays
• The term sorting means placing elements in an ordered sequence

• An ordered sequence is any sequence whose elements follow an order, such as
ascending or descending, alphabetical or numeric

• The NumPy ndarray object has a function called sort(), which will sort an array

Example

Output
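A sketch of sorting numeric, string, and 2-D arrays (array contents are assumptions):

import numpy as np

arr = np.array([3, 2, 0, 1])
print(np.sort(arr))         # returns a sorted copy: [0 1 2 3]

fruits = np.array(['banana', 'cherry', 'apple'])
print(np.sort(fruits))      # string arrays can be sorted too

arr2d = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr2d))       # each row of a 2-D array is sorted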

© The Knowledge Academy Ltd 59


Sorting Arrays
(Continued)

• You can also sort string arrays, or arrays of any other data type:

Output

• The following is the example to Sort a Boolean array:

Output

© The Knowledge Academy Ltd 60


Sorting Arrays
Sorting a 2-D Array
• Using the sort() method on a 2-D array, each row of the array will be sorted:

Output

© The Knowledge Academy Ltd 61


NumPy’s Structured Array
• NumPy's structured array is similar to a struct in the C programming language. It is
used to group data of different sizes and types

• Structured arrays use data containers called fields. Every data field can contain data of
any size and type. Array elements can be accessed with the help of dot notation

Structured Array Properties

• All structs in the array have the same number of fields

• All structs have the same field names
© The Knowledge Academy Ltd 62


NumPy’s Structured Array
(Continued)

• For instance, consider a student structured array with different fields such as year,
name, and marks

• Every record in the student array has the same structure, so each record supplies a
value for every field

Output
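The slide's code is not reproduced; a sketch of such a structured array (field names and values are assumptions) is:

import numpy as np

# A structured array with named fields of different types
student = np.array([('Asha', 2021, 8.5), ('Ben', 2020, 7.1)],
                   dtype=[('name', 'U10'), ('year', 'i4'), ('marks', 'f4')])

print(student['name'])                    # access a field by name
print(student[0])                         # access a whole record
print(np.sort(student, order='marks'))    # sort by a field, as on the next slide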

© The Knowledge Academy Ltd 63


NumPy’s Structured Array
(Continued)

Example

• The structured array can be sorted using the numpy.sort() function by passing the
order parameter. This parameter takes the name of the field by which the array should
be sorted

Output

© The Knowledge Academy Ltd 64


Module 2: Python for Data Analysis -
Pandas

© The Knowledge Academy Ltd 65


Installing pandas
• Perform the following steps to install pandas:

Step 1: Choose Anaconda Prompt (Anaconda3) and Run as an administrator

© The Knowledge Academy Ltd 66


Installing pandas
Step 2: Execute the pip install pandas command. Pandas will be installed successfully in
Anaconda

© The Knowledge Academy Ltd 67


Pandas Objects
• At a fundamental level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
instead of simple integer indices

• Series, DataFrame, and Index are the three fundamental Pandas data structures

• Import Numpy and Pandas
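The import statements the slide refers to are simply:

import numpy as np
import pandas as pd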

© The Knowledge Academy Ltd 68


Pandas Objects
(Continued)

The Pandas Series Object

• A Pandas Series is a 1-D array of indexed data. It can be created from a list or an array,
as shown in the following sketch:
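A minimal sketch (the values are illustrative):

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(data.values)    # the values are a familiar NumPy array
print(data.index)     # the index is a pd.Index (a RangeIndex by default)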

© The Knowledge Academy Ltd 69


Pandas Objects
(Continued)

• As shown in the output, the Series wraps both a sequence of values and a sequence of
indices, which we can access with the values and index attributes. The values are simply
a familiar NumPy array:

• The index is an array-like object of type pd.Index

© The Knowledge Academy Ltd 70


Pandas Objects
(Continued)

• Like with a NumPy array, data can be accessed by the associated index using the
familiar Python square-bracket notation:

© The Knowledge Academy Ltd 71


Pandas Objects
(Continued)

• The Pandas Series is much more general and flexible than the 1-D NumPy array that it
emulates

Series as generalized NumPy array

• The Series object is basically interchangeable with a 1-D NumPy array

• The essential difference is the presence of the index: whereas the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values

© The Knowledge Academy Ltd 72


Pandas Objects
(Continued)

• This explicit index definition gives the Series object additional capabilities. The index
need not be an integer, but can consist of values of any desired type. For instance, we
can use strings as an index:

© The Knowledge Academy Ltd 73


Pandas Objects
(Continued)

• And the item access works as expected

• Even, non-sequential or non-contiguous indices can be used

© The Knowledge Academy Ltd 74


Pandas Objects
(Continued)

Series as specialized dictionary

• A dictionary is a structure which maps arbitrary keys to a set of arbitrary values, and a
Series is a structure which maps typed keys to a set of typed values

• This typing is significant: just as the type-specific compiled code behind a NumPy array
makes it more efficient than a Python list for certain operations, the type information of
a Pandas Series makes it much more efficient than a Python dictionary for certain
operations

© The Knowledge Academy Ltd 75


Pandas Objects
(Continued)

• The Series-as-dictionary analogy can be made even more explicit by constructing a
Series object directly from a Python dictionary:
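A minimal sketch of constructing a Series from a dictionary (state names and figures are illustrative):

import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

print(population)
print(population['Texas'])                     # dictionary-style item access
print(population['California':'New York'])     # array-style slicing also works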

© The Knowledge Academy Ltd 76


Pandas Objects
(Continued)

• A Series will be built whose index is drawn from the dictionary keys (sorted by default
in older versions of Pandas; insertion order is preserved in current versions). Typical
dictionary-style item access can be performed from here:

• Array-style operations such as slicing is also supported by the Series:

© The Knowledge Academy Ltd 77


Pandas Objects
(Continued)

Constructing Series objects

• For instance, data can be a NumPy array or list, in which case index defaults to an
integer sequence:

© The Knowledge Academy Ltd 78


Pandas Objects
(Continued)

• Data can be a scalar, which is repeated in order to fill the specified index:

• Data can be a dictionary, in which index defaults to the sorted dictionary keys

© The Knowledge Academy Ltd 79


Pandas Objects
(Continued)

• The index can be set explicitly in every case if a different result is preferred:

© The Knowledge Academy Ltd 80


Pandas Objects
(Continued)

The Pandas DataFrame Object

• In Pandas, the next primary structure is the DataFrame

• The DataFrame can be thought of either as a generalization of a NumPy array or as a
specialisation of a Python dictionary

© The Knowledge Academy Ltd 81


Pandas Objects
(Continued)

DataFrame as a generalized NumPy array

• Suppose a Series is an analogue of a 1-D array with flexible indices. In that case, a
DataFrame is an analogue of a 2-D array with both flexible column names and flexible
row indices

• For showing this, first, make a new Series listing the area of each of the five states:

© The Knowledge Academy Ltd 82


Pandas Objects
(Continued)

• To construct a single 2-D object containing this information, we can use a dictionary:
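A sketch of building a DataFrame from a dictionary of Series (state data is illustrative); it also shows the index and columns attributes used on the next slide:

import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297})

states = pd.DataFrame({'population': population, 'area': area})
print(states)
print(states.index)      # the row labels
print(states.columns)    # the column labels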

© The Knowledge Academy Ltd 83


Pandas Objects
(Continued)

• Similar to the Series object, the DataFrame has an index attribute which provides
access to the index labels:

• In addition, the DataFrame has a columns attribute, which is an Index object containing
the column labels:

© The Knowledge Academy Ltd 84


Pandas Objects
(Continued)

• Therefore, we can think of a DataFrame as a generalization of a 2-D NumPy array,
where both the rows and columns have a generalized index for accessing the data

DataFrame as specialized dictionary

• Likewise, we can consider a DataFrame as a specialization of a dictionary. Where a
dictionary maps a key to a value, a DataFrame maps a column name to a Series of
column data

© The Knowledge Academy Ltd 85


Pandas Objects
(Continued)

• For instance, the 'area' attribute returns the Series object holding the areas:

• data[0] will return the first row of a 2-D NumPy array, whereas data['col0'] will return
the first column of a DataFrame

© The Knowledge Academy Ltd 86


Pandas Objects
(Continued)

Constructing DataFrame objects

• A Pandas DataFrame can be constructed in a variety of ways. The following are several
examples:

o From a single Series object: A DataFrame is a collection of Series objects, and a
single-column DataFrame can be constructed from a single Series

© The Knowledge Academy Ltd 87


Pandas Objects
(Continued)

o From a list of dicts: Any list of dictionaries can be made into a DataFrame

o Even if a few keys are missing in the dictionary, they will be filled by Pandas with
NaN which means "not a number" values:

© The Knowledge Academy Ltd 88


Pandas Objects
(Continued)

o From a dictionary of Series objects: A DataFrame can be constructed from a


dictionary of Series objects as well:

© The Knowledge Academy Ltd 89


Data Indexing and Selection
Data Selection in Series

• As we saw in the previous slides, a Series object acts in many ways like a one-
dimensional NumPy array, as well as in many ways like a standard Python dictionary

• If we keep these two overlapping analogies in mind, it will help us to understand the
patterns of data indexing as well as selection in these arrays

© The Knowledge Academy Ltd 90


Data Indexing and Selection
(Continued)

Series as dictionary

• Like a dictionary, the Series object provides a mapping from a group of keys to a
collection of values:

© The Knowledge Academy Ltd 91


Data Indexing and Selection
(Continued)

• We can also use dictionary-like Python expressions as well as methods to examine the
keys or indices as well as values:

© The Knowledge Academy Ltd 92


Data Indexing and Selection
(Continued)

• Series objects can even be altered with a dictionary-like syntax. Just as you can extend a
dictionary by assigning to a new key, you can extend a Series by assigning to a new
index value:

© The Knowledge Academy Ltd 93


Data Indexing and Selection
(Continued)

• This easy mutability of the objects is a useful feature: under the hood, Pandas is making
decisions about memory layout as well as data copying that might need to take place;
the user generally does not need to worry about these issues

Series as one-dimensional array

• A Series builds on this dictionary-like interface as well as provides array-style item


selection through the same fundamental mechanisms as NumPy arrays which is, slices,
masking, as well as fancy indexing. Examples of these are as follows:

© The Knowledge Academy Ltd 94


Data Indexing and Selection
(Continued)

© The Knowledge Academy Ltd 95


Data Indexing and Selection
(Continued)

• Among these, slicing may be the source of the most confusion. Notice that when slicing
with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when
slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice

• These slicing and indexing conventions can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as data[1] will use
the explicit indices, while a slicing operation like data[1:3] will use the implicit
Python-style index

© The Knowledge Academy Ltd 96


Data Indexing and Selection
(Continued)

• Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes which explicitly expose certain indexing schemes

© The Knowledge Academy Ltd 97


Data Indexing and Selection
(Continued)

• These are not functional methods, but attributes which expose a particular slicing
interface to the data in the Series

• First, the loc attribute allows indexing and slicing that always references the explicit
index:

© The Knowledge Academy Ltd 98


Data Indexing and Selection
(Continued)

• The iloc attribute permits indexing as well as slicing which always references the
implicit Python-style index:
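A minimal sketch contrasting loc and iloc (the Series values are assumptions):

import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])       # explicit index: 'a'
print(data.loc[1:3])     # explicit slicing includes the endpoint
print(data.iloc[1])      # implicit (positional) index: 'b'
print(data.iloc[1:3])    # implicit slicing excludes the endpoint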

• A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent
to standard []-based indexing. The purpose of the ix indexer becomes more apparent in
the context of DataFrame objects; note that ix has been deprecated and removed in
recent versions of Pandas

© The Knowledge Academy Ltd 99


Data Indexing and Selection
(Continued)

DataFrame as a dictionary

• The first analogy we will consider is the DataFrame as a dictionary of related Series
objects. Let us return to our example of areas and populations of states:

Output

© The Knowledge Academy Ltd 100


Data Indexing and Selection
(Continued)

• The individual Series which make up the columns of the DataFrame can be retrieved
through dictionary-style indexing of the column name:

• Equivalently, we can use attribute-style access with column names which are strings:

© The Knowledge Academy Ltd 101


Data Indexing and Selection
(Continued)

• This attribute-style column access actually accesses the exact same object as the
dictionary-style access:

• Though this is a useful shorthand, remember that it does not work for all cases! Such
as, if the column names are not strings, or if the column names conflict with methods
of the DataFrame, this attribute-style access is not possible

© The Knowledge Academy Ltd 102


Data Indexing and Selection
(Continued)

• For instance, the DataFrame has a pop() method, so data.pop will point to this rather
than the "pop" column:

• In specific, you should avoid the temptation to try column assignment through attribute
(i.e., use data['pop'] = z rather than data.pop = z)

© The Knowledge Academy Ltd 103


Data Indexing and Selection
(Continued)

• Like with the Series objects discussed earlier, this dictionary-style syntax can also be
used to alter the object, in this case adding a new column:
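A minimal sketch of adding a new column with dictionary-style syntax (column names and values assumed):

import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# Dictionary-style syntax adds a new column to the DataFrame
data['density'] = data['pop'] / data['area']
print(data)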

© The Knowledge Academy Ltd 104


Data Indexing and Selection
(Continued)

• This shows a preview of the straightforward syntax of element-by-element arithmetic
between Series objects

Additional indexing conventions

• There are a couple of extra indexing conventions which might seem at odds with the
preceding discussion, but which can nonetheless be very useful in practice. First, while
indexing refers to columns, slicing refers to rows:

© The Knowledge Academy Ltd 105


Data Indexing and Selection
(Continued)

• Similarly, direct masking operations are also interpreted row-wise rather than column-
wise:

© The Knowledge Academy Ltd 106


Data Indexing and Selection
(Continued)

• These two conventions are syntactically alike to those on a NumPy array, as well as
while these may not quite fit the mold of the Pandas conventions, they are
nevertheless quite useful in practice

© The Knowledge Academy Ltd 107


Operating on Data in Pandas
• One of the significant pieces of NumPy is the capability to perform fast element-wise
operations, both with fundamental arithmetic (like addition, subtraction, multiplication,
etc.) as well as with more complex operations (trigonometric functions, exponential
and logarithmic functions, etc.)

• Pandas inherits much of this functionality from NumPy, as well as the ufuncs (Universal
functions) which we introduced in Computation on NumPy Arrays: Universal Functions
are key to this

• Pandas includes a couple of valuable twists, however: for unary operations such as
negation and trigonometric functions, these ufuncs will preserve index and column
labels in the output, and for binary operations such as addition and multiplication,
Pandas will automatically align indices when passing the objects to the ufunc

© The Knowledge Academy Ltd 108


Operating on Data in Pandas
(Continued)

• This means that keeping the context of data and combining data from different sources
(both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof
with Pandas

• We will additionally see that there are well-defined operations between 1-D Series
structures and 2-D DataFrame structures

© The Knowledge Academy Ltd 109


Operating on Data in Pandas
(Continued)

Ufuncs: Index Preservation

• Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series as well as DataFrame objects

• Start by defining a simple Series and DataFrame on which to demonstrate this:
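A minimal sketch of such objects and of applying a ufunc to them (the random seed and column names are assumptions):

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])

# Applying a NumPy ufunc returns another Pandas object with indices preserved
print(np.exp(ser))
print(np.sin(df * np.pi / 4))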

© The Knowledge Academy Ltd 110


Operating on Data in Pandas
(Continued)

© The Knowledge Academy Ltd 111


Operating on Data in Pandas
(Continued)

• If we apply a NumPy ufunc on either of these objects, the result will be another Pandas
object with the indices preserved:

• Or, for a little more complex calculation:

© The Knowledge Academy Ltd 112


Operating on Data in Pandas
(Continued)

UFuncs: Index Alignment

• For binary operations on two Series or DataFrame objects, Pandas will align indices in
the process of performing the operation

Index alignment in Series

• For example, suppose we are combining two different data sources, and have only the
top three US states by area and the top three US states by population:

© The Knowledge Academy Ltd 113


Operating on Data in Pandas
(Continued)

• Let's see what happens when we divide these to compute the population density:

• The resulting array holds the union of indices of the two input arrays, which could be
determined by using standard Python set arithmetic on these indices:

© The Knowledge Academy Ltd 114


Operating on Data in Pandas
(Continued)

• Any item for which one or the other does not have an entry is marked with NaN, or
"Not a Number," which is how Pandas marks missing data

• This index matching is implemented this way for any of Python's built-in arithmetic
expressions; any absent values are filled in with NaN by default:

© The Knowledge Academy Ltd 115


Operating on Data in Pandas
(Continued)

• If using NaN values is not the desired behaviour, the fill value can be altered using
suitable object methods in place of the operators

• For instance, calling A.add(B) is equivalent to calling A + B, but allows optional explicit
specification of the fill value for any elements in A or B that might be missing:
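A minimal sketch of index alignment and of add() with fill_value (the Series values are assumptions):

import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

print(A + B)                    # unmatched indices produce NaN
print(A.add(B, fill_value=0))   # missing entries are treated as 0 instead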

© The Knowledge Academy Ltd 116


Operating on Data in Pandas
(Continued)

• A similar kind of alignment takes place for both columns and indices when performing
operations on DataFrames:

© The Knowledge Academy Ltd 117


Operating on Data in Pandas
(Continued)

• Notice that indices are aligned properly irrespective of their order in the two objects, as
well as indices in the result are sorted

• As was the case with Series, we can use the associated object's arithmetic method as
well as pass any wanted fill_value to be used in place of missing entries

• Here we will fill with the mean of all values in A (computed by first stacking the rows of
A):

© The Knowledge Academy Ltd 118


Operating on Data in Pandas
(Continued)

• The following table lists Python operators and their equivalent Pandas object methods:

Python Operator    Pandas Method(s)

+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()

© The Knowledge Academy Ltd 119


Operating on Data in Pandas
(Continued)

Python Operator Pandas Method(s)

% mod()
** pow()

© The Knowledge Academy Ltd 120


Handling Missing Data
• Missing data can occur when no information is provided for one or more items or for a
whole unit

• Missing data is a very big problem in real-life scenarios. Missing data is also referred to
as NA (Not Available) values in Pandas

• Many datasets simply arrive with missing data in the DataFrame, either because the
data exists but was not collected or because it never existed

• For instance, some users being surveyed may choose not to share their income, and
some may choose not to share their address; in this way many datasets end up with
missing values

© The Knowledge Academy Ltd 121


Handling Missing Data
(Continued)

• In Pandas, missing data is represented by two values:

o None: None is a Python singleton object which is often used for missing data in
Python code

o NaN: NaN (an acronym for Not a Number) is a special floating-point value
recognised by all systems that use the standard IEEE floating-point representation

• Pandas treats None and NaN as essentially interchangeable for indicating missing or
null values

© The Knowledge Academy Ltd 122


Handling Missing Data
(Continued)

• To facilitate this convention, there are various useful functions for detecting, removing,
as well as replacing null values in Pandas DataFrame

• isnull()

• notnull()

• dropna()

• fillna()

• replace()

© The Knowledge Academy Ltd 123


Handling Missing Data
(Continued)

• interpolate()

Checking for missing values using isnull() and notnull()

• To check for missing values in a Pandas DataFrame, we use the functions isnull() and
notnull()

• Both functions help in checking whether a value is NaN or not. These functions can also
be used on a Pandas Series in order to find null values in a series

© The Knowledge Academy Ltd 124


Handling Missing Data
(Continued)

Checking for missing values using isnull()

• To check for null values in a Pandas DataFrame, we use the isnull() function. This
function returns a DataFrame of Boolean values which are True for NaN values

Example 1

© The Knowledge Academy Ltd 125


Handling Missing Data
(Continued)

Example 2

© The Knowledge Academy Ltd 126


Handling Missing Data
(Continued)

• As shown in the output image, only the rows having Gender = NULL are displayed

© The Knowledge Academy Ltd 127


Handling Missing Data
(Continued)

Checking for missing values using notnull()

• To check for null values in a Pandas DataFrame, we use the notnull() function. This
function returns a DataFrame of Boolean values which are False for NaN values

Example 3

Output

© The Knowledge Academy Ltd 128


Handling Missing Data
(Continued)

Example 4

• As shown in the output image, only the rows having Gender = NOT NULL are displayed

© The Knowledge Academy Ltd 129


Handling Missing Data
(Continued)

Filling missing values using fillna(), replace() and interpolate()

• To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some value of their own

• All of these functions help in filling null values in the datasets of a DataFrame

• The interpolate() function is also used to fill NA values in the DataFrame, but it uses
various interpolation methods to compute the missing values rather than hard-coding
the value

© The Knowledge Academy Ltd 130


Handling Missing Data
(Continued)

Example 1: Filling null values with a single value

Output
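A sketch covering Examples 1–3 (filling with a single value, with the previous value, and with the next value); the DataFrame contents are assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56]})

print(df.fillna(0))    # fill every null value with a single value
print(df.ffill())      # fill nulls with the previous value (forward fill)
print(df.bfill())      # fill nulls with the next value (backward fill)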

© The Knowledge Academy Ltd 131


Handling Missing Data
(Continued)

Example 2: Filling null values with the previous ones

Output

© The Knowledge Academy Ltd 132


Handling Missing Data
(Continued)

Example 3: Filling null value with the next ones

Output

© The Knowledge Academy Ltd 133


Handling Missing Data
(Continued)

Example 4: Filling null values in CSV File

Output

© The Knowledge Academy Ltd 134


Handling Missing Data
(Continued)

• Now we are going to fill all the null values in Gender column with “No Gender”

Output

© The Knowledge Academy Ltd 135


Handling Missing Data
(Continued)

Example 5: Filling a null values using replace() method

Output

© The Knowledge Academy Ltd 136


Handling Missing Data
(Continued)

• Now we are going to replace all the NaN values in the data frame with the value -99

Output

© The Knowledge Academy Ltd 137


Handling Missing Data
(Continued)

Example 6: Using interpolate() function to fill the missing values using linear method.

Output
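A sketch covering Example 5 (replace()) and Example 6 (interpolate()); the data is an assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 5, np.nan, 7]})

print(df.replace(to_replace=np.nan, value=-99))    # replace every NaN with -99
print(df.interpolate(method='linear'))             # fill NaN by linear interpolation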

© The Knowledge Academy Ltd 138


Handling Missing Data
(Continued)

• Interpolate the missing values using the linear method. Note that the linear method
ignores the index and treats the values as equally spaced

• As we can see in the output, the values in the first row could not be filled, as the
direction of filling is forward and there is no previous value which could have been used
in the interpolation

© The Knowledge Academy Ltd 139


Hierarchical Indexing
• Python is a great language for data analysis, mainly because of the excellent ecosystem
of data-centric Python packages

• Moreover, Pandas makes importing and analysing data simple

• The Pandas MultiIndex.to_hierarchical() function returns a reshaped MultiIndex to
conform to the shapes given by n_repeat and n_shuffle (note that this function is
deprecated and has been removed in recent versions of Pandas)

• It is useful for replicating and rearranging a MultiIndex for combination with another
Index with n_repeat items

© The Knowledge Academy Ltd 140


Hierarchical Indexing
(Continued)

Example 1

• In order to repeat the labels in the MultiIndex, use MultiIndex.to_hierarchical()


function

Output

© The Knowledge Academy Ltd 141


Hierarchical Indexing
(Continued)

• Now, repeat the labels of the MultiIndex two times

Output

• As you can see in the following output figure, the labels in the returned MultiIndex are
repeated 2 times

© The Knowledge Academy Ltd 142


Hierarchical Indexing
(Continued)

Example 2: Use MultiIndex.to_hierarchical() function to repeat and reshuffle the labels in


the MultiIndex

Output

© The Knowledge Academy Ltd 143


Hierarchical Indexing
(Continued)

• Now let's repeat and reshuffle the labels of the MultiIndex 2 times

Output

• As you can see in the output figure, the labels are repeated as well as reshuffled twice
in the returned MultiIndex

© The Knowledge Academy Ltd 144


Concat and Append
• The concat function performs concatenation operations along an axis. Let us create
different objects and do concatenation.

Output
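The slides' DataFrames are not reproduced; a sketch covering concat and the variations used on this and the following slides (keys, ignore_index, axis=1) with assumed data is:

import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks': [98, 87]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks': [89, 91]}, index=[1, 2])

print(pd.concat([one, two]))                        # concatenation along the index
print(pd.concat([one, two], keys=['x', 'y']))       # associate keys with each piece
print(pd.concat([one, two], ignore_index=True))     # resultant object gets its own index
print(pd.concat([one, two], axis=1))                # new columns are added along axis=1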

© The Knowledge Academy Ltd 145


Concat and Append
(Continued)

• Assume we want to associate particular keys with each of the pieces of the chopped up
DataFrame. This can be done by using the keys argument:

Output

© The Knowledge Academy Ltd 146


Concat and Append
(Continued)

• The index of the resultant is duplicated; every index is repeated

• Set ignore_index to True if the resultant object has to follow its own indexing

Output

© The Knowledge Academy Ltd 147


Concat and Append
(Continued)

• Note, the index changes entirely, and the Keys are overridden as well

• The new columns will be added if two objects need to be added along axis=1

Output

© The Knowledge Academy Ltd 148


Concat and Append
(Continued)

Concatenating Using append

• A useful shortcut to concat is the append instance method on DataFrame and Series.
These methods predate concat and concatenate along axis=0, namely the index (note
that append() has been deprecated in recent versions of Pandas in favour of concat()):

Output

© The Knowledge Academy Ltd 149


Concat and Append
(Continued)

• Multiple objects can also be taken by the append function:

Output

© The Knowledge Academy Ltd 150


Merge and Join
• A Pandas DataFrame is a 2-D, size-mutable, potentially heterogeneous tabular data
structure with labelled columns and rows

• A DataFrame is a 2-D data structure, which means data is aligned in a tabular form in
columns and rows

• There are various methods we can use to merge, join, and concatenate DataFrames

• Methods and functions such as df.join(), df.merge(), and pd.concat() help in merging,
joining, and concatenating different DataFrames

• We use the concat() function to concatenate DataFrames. This function helps in
concatenating DataFrames, and we can concatenate a DataFrame in various ways

© The Knowledge Academy Ltd 151


Merge and Join
(Continued)

• The following are some ways:

1. Concatenating DataFrame by using .concat()
2. Concatenating DataFrame by ignoring indexes
3. Concatenating DataFrame by using .append()
4. Concatenating DataFrame by setting logic on axes
5. Concatenating DataFrame with mixed ndims
6. Concatenating DataFrame with group keys

© The Knowledge Academy Ltd 152


Merge and Join
(Continued)

Concatenating DataFrame using .concat():

• We use the .concat() function to concatenate DataFrames; it concatenates the given
DataFrames and returns a new DataFrame

• Before applying .concat() function:

Output

© The Knowledge Academy Ltd 153


Merge and Join
(Continued)

• Output after applying .concat() function

Output

© The Knowledge Academy Ltd 154


Merge and Join
(Continued)

Concatenating DataFrame by using .append()

• The .append() function is used to concatenate DataFrames. It concatenates along
axis=0, namely the index

• This function existed before .concat(). The following output is obtained before applying
the .append() function:

Output

© The Knowledge Academy Ltd 155


Merge and Join
(Continued)
• The following output is obtained after applying the .append() function

Output

© The Knowledge Academy Ltd 156


Merge and Join
(Continued)

Concatenating DataFrame by ignoring indexes:

• When concatenating DataFrames by ignoring indexes, we ignore indexes that do not
carry meaningful information

• You may wish to append the DataFrames and ignore the fact that they may have
overlapping indexes. We use the ignore_index argument to do that

© The Knowledge Academy Ltd 157


Merge and Join
(Continued)

• Output before applying ignoring indexes methodology

Output

© The Knowledge Academy Ltd 158


Merge and Join
(Continued)

• Output after applying ignoring indexes methodology

Output

© The Knowledge Academy Ltd 159


Merge and Join
(Continued)

Concatenating DataFrame with group keys:

• To concatenate DataFrames with group keys, we override the column names using the
keys argument

• The keys argument overrides the column names when creating a new DataFrame based
on existing Series

© The Knowledge Academy Ltd 160


Merge and Join
(Continued)

• Output before applying group key methodology

Output

© The Knowledge Academy Ltd 161


Merge and Join
(Continued)

• Output after using keys as an argument

Output

© The Knowledge Academy Ltd 162


Aggregations and Grouping
Aggregations
• Various methods are available for performing aggregations on data once the
expanding, rolling, and ewm objects are created

Applying Aggregations on DataFrame

• Create a DataFrame and apply aggregations

Output
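The slide's code is not shown; a sketch of creating a rolling object and the aggregations used on this and the following slides (random data and assumed column names) is:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2020', periods=10),
                  columns=['A', 'B', 'C', 'D'])

r = df.rolling(window=3, min_periods=1)

print(r.aggregate('sum'))                   # aggregation on the whole DataFrame
print(r['A'].aggregate('sum'))              # aggregation on a single column
print(r[['A', 'B']].aggregate('sum'))       # aggregation on multiple columns
print(r['A'].aggregate(['sum', 'mean']))    # multiple functions on a single column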

© The Knowledge Academy Ltd 163


Aggregations and Grouping
(Continued)

• We can aggregate by selecting a column through the standard get item method, or
passing a function to the whole DataFrame

Apply Aggregation on a Whole Dataframe

Output

© The Knowledge Academy Ltd 164


Aggregations and Grouping
(Continued)

Apply Aggregation on a Single Column of a Dataframe

Output

© The Knowledge Academy Ltd 165


Aggregations and Grouping
(Continued)

Apply Aggregation on Multiple Columns of a DataFrame

Output

© The Knowledge Academy Ltd 166


Aggregations and Grouping
(Continued)

Apply Multiple Functions on a Single Column of a DataFrame

Output

© The Knowledge Academy Ltd 167


Aggregations and Grouping
(Continued)

Apply Multiple Functions on Multiple Columns of a DataFrame

Output

© The Knowledge Academy Ltd 168


Aggregations and Grouping
Groupby
• Any groupby operation on the original object involves one of the following operations:

o Applying a function

o Splitting the Object

o Combining the results

© The Knowledge Academy Ltd 169


Aggregations and Grouping
(Continued)

• We split the data into sets in many situations, and we apply some functionality on each
subset. We can perform the following operations in the apply functionality

o Aggregation: calculating a summary statistic

o Transformation: perform some group-specific operation

o Filtration: discarding the data with some condition

© The Knowledge Academy Ltd 170


Aggregations and Grouping
(Continued)

• Let us now create a DataFrame object as well as perform all the operations on it:

Output

© The Knowledge Academy Ltd 171


Aggregations and Grouping
(Continued)

Split Data into Groups

• Pandas object can be split into any of their objects. Various ways are there in order to
split an object such as:

o obj.groupby('key')

o obj.groupby(['key1','key2'])

o obj.groupby(key,axis=1)

© The Knowledge Academy Ltd 172


Aggregations and Grouping
(Continued)

• Now see how the grouping objects can be applied to the DataFrame object

Output
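The slides' DataFrame is not reproduced; a sketch with an assumed dataset covering the grouping operations shown on this and the following slides is:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
            'Rank': [1, 2, 2, 3, 3],
            'Year': [2014, 2015, 2014, 2015, 2014],
            'Points': [876, 789, 863, 673, 741]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team').groups)             # view the groups
print(df.groupby(['Team', 'Year']).groups)   # group by multiple columns

for name, group in df.groupby('Year'):       # iterate through the groups
    print(name)
    print(group)

print(df.groupby('Year').get_group(2014))    # select a single group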

© The Knowledge Academy Ltd 173


Aggregations and Grouping
(Continued)

View Groups

© The Knowledge Academy Ltd 174


Aggregations and Grouping
(Continued)

Group by with multiple columns

© The Knowledge Academy Ltd 175


Aggregations and Grouping
(Continued)

Iterating through Groups

• With the groupby object in hand, we can iterate through the object; each iteration
yields a group name and the corresponding group

Output

© The Knowledge Academy Ltd 176


Aggregations and Grouping
(Continued)

• By default, the groupby object has the same label name as the group name

Select a Group

• Using the get_group() method, we can choose a single group

Output

© The Knowledge Academy Ltd 177


Pivot Tables
• pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All') creates a
spreadsheet-style pivot table as a DataFrame

• Levels in the pivot table will be stored in MultiIndex objects on the index and columns
of the resulting DataFrame

Example

Output
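A minimal sketch of pivot_table (column names and values are assumptions):

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'C': [1, 3, 2, 4]})

# Spreadsheet-style pivot table: rows from A, columns from B, mean of C
table = pd.pivot_table(df, values='C', index='A', columns='B', aggfunc='mean')
print(table)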

© The Knowledge Academy Ltd 178


Pivot Tables
(Continued)

Output

© The Knowledge Academy Ltd 179


Pivot Tables
(Continued)

Output

© The Knowledge Academy Ltd 180


Vectorised String Operations
• One of Python's strengths is its relative ease in handling and manipulating string data

• Pandas builds on this and provides a comprehensive collection of vectorised string
operations, which become an important part of the kind of munging needed when
working with (read: cleaning up) real-world data

Introducing Pandas String Operations

• As we know that tools like numpy as well as pandas generalize arithmetic operations so
that we can easily as well as quickly perform the same operation on numerous array
elements

© The Knowledge Academy Ltd 181


Vectorised String Operations
(Continued)

Example

• This vectorisation of operations simplifies the syntax of operating on arrays of data: we
no longer have to worry about the size or shape of the array, but just about what
operation we want done

© The Knowledge Academy Ltd 182


Vectorised String Operations
(Continued)

• For arrays of strings, NumPy does not provide such simple access, as well as thus you
are stuck using a more verbose loop syntax:

• This is perhaps sufficient to work with some data, but it will break if there are any
missing values

© The Knowledge Academy Ltd 183


Vectorised String Operations
(Continued)

• Pandas includes features to address both this need for vectorised string operations as
well as for properly handling missing data through the str attribute of Pandas Series as
well as Index objects containing strings

• So, for instance, suppose we create a Pandas Series with this data:

© The Knowledge Academy Ltd 184


Vectorised String Operations
(Continued)

• Now we can call a single method that will capitalise all the entries, while skipping over
any missing values, as shown in the sketch below
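A minimal sketch (the example names are assumptions):

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The str accessor applies the string method element-wise and skips missing values
print(names.str.capitalize())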

• Using tab completion on this str attribute will list all the vectorised string methods
available to Pandas

© The Knowledge Academy Ltd 185


Vectorised String Operations
(Continued)

Tables of Pandas String Methods

• If you have a good understanding of string manipulation in Python, most of the Pandas
string syntax is intuitive enough that it is probably sufficient to just list a table of the
available methods

• The examples in this section use the following series of names:

© The Knowledge Academy Ltd 186


Vectorised String Operations
(Continued)

Methods similar to Python string methods

• Nearly all Python's built-in string methods are mirrored by a Pandas vectorised string
method. Here is a list of Pandas str methods which mirror Python string methods:

len()    ljust()    rjust()    center()    zfill()    strip()    translate()

© The Knowledge Academy Ltd 187


Vectorised String Operations
(Continued)

startswith()    endswith()    rfind()    isalpha()    isdigit()    lower()    upper()

© The Knowledge Academy Ltd 188


Vectorised String Operations
(Continued)

• Notice that these methods (shown on the previous slides) have various return values.
Some, like lower(), return a series of strings:

• But some others return numbers:

© The Knowledge Academy Ltd 189


Vectorised String Operations
(Continued)

• Or Boolean values:

• Still others return lists or other compound values for each element:

© The Knowledge Academy Ltd 190


Working with Time Series
• Although time series functionality is also present in scikit-learn, Pandas provides a
more complete set of features for it

• With this part of Pandas we can attach a date and time to each record and fetch the
DataFrame records by them

• Using the Pandas time series functionality, we can find the data within a specific range
of dates and times

Example 1

Output
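A minimal sketch of the date_range call the text describes (the exact code on the slide is not reproduced):

import pandas as pd

# Timestamps at one-minute frequency between 1/1/2019 and 8/1/2019
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='min')
print(len(range_date))    # 10081
print(range_date[:5])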

© The Knowledge Academy Ltd 191


Working with Time Series
(Continued)

• In this code, we have created timestamps at one-minute frequency for the date range
1/1/2019 – 8/1/2019. The frequency can be varied from hours to minutes or seconds

• This function helps you track records of data stored per minute. As we can see in the
output, the length of the datetime stamp is 10081

Example 2

Output

© The Knowledge Academy Ltd 192


Working with Time Series
(Continued)

• We are checking the type of our object named range_date

Example 3

Output

© The Knowledge Academy Ltd 193


Working with Time Series
(Continued)

• We first created a time series and then converted this data into a DataFrame, using the
random function to generate random data and map it over the DataFrame. Then we use
the print function to check the result

• To do time series manipulation, we need a datetime index so that the DataFrame is
indexed on the timestamp

© The Knowledge Academy Ltd 194


Working with Time Series
(Continued)

Example 4

© The Knowledge Academy Ltd 195


Working with Time Series
(Continued)

• This code converts the elements of data_rng to strings. Because there is a lot of data,
we slice the list string_data and print only the first ten values

• We obtained all the values in the series range_date by using a for-each loop over the
list. We always have to specify the start and end date when using date_range

Example 5

Output

© The Knowledge Academy Ltd 196


eval() and query()
query()
• For data analysis, python is an excellent language, mainly because of the incredible
ecosystem of data-centric Python packages

• Pandas makes importing and analyzing data much easier

• Analysing data requires many filtering operations. Pandas provides various methods for
filtering a DataFrame; Dataframe.query() is one of them

© The Knowledge Academy Ltd 197


eval() and query()
(Continued)

Example 1: Single condition filtering

The data is filtered based on a single condition in this example. The spaces in column
names have been replaced with ‘_’ before applying the query() method

Output
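The slide's dataset is not reproduced; a sketch of single-condition filtering with query() (column names are assumptions) is:

import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Avg_Score': [80, 95, 70, 88]})

# Single-condition filtering; spaces in column names were replaced with '_'
print(df.query('Avg_Score > 85'))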

© The Knowledge Academy Ltd 198


eval() and query()
eval()
• The Pandas dataframe.eval() function is used to evaluate an expression in the context
of the calling DataFrame instance

• The expression is evaluated over the columns of the dataframe

Example 1

• In order to evaluate the sum of all column elements in the dataframe and insert the
resulting column in the dataframe use eval() function

Output

© The Knowledge Academy Ltd 199


eval() and query()
(Continued)

• Now, evaluate the sum over all the columns and add the resultant column to the
dataframe:

Output
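A minimal sketch of eval() inserting the sum of all columns as a new column (column names and values are assumptions):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Evaluate an expression over the columns and insert the result as a new column
df.eval('D = A + B + C', inplace=True)
print(df)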

© The Knowledge Academy Ltd 200


eval() and query()
(Continued)

Example 2: Use the eval() function to evaluate the sum of any two column elements in the
dataframe and insert the resulting column into the dataframe. The dataframe has a NaN value

Output

© The Knowledge Academy Ltd 201


eval() and query()
(Continued)

• Now, evaluate the sum of column “B” with “C”

Output

• Note that the resulting column 'D' has a NaN value in the last row, as the corresponding
cell used in the evaluation was a NaN cell

© The Knowledge Academy Ltd 202


Module 3: Python for Data Visualization –
Matplotlib

© The Knowledge Academy Ltd 203


Overview of Matplotlibs
Introduction to Matplotlib
• Matplotlib is an amazing visualisation library in Python for 2D plots of arrays

• Matplotlib is a multi-platform data visualisation library built on NumPy arrays and designed to work with the broader SciPy stack

• One of the greatest advantages of visualisation is that it permits us visual access to large
amounts of data in easily digestible visuals

• Matplotlib consists of various plots such as line, bar, scatter, histogram etc.

© The Knowledge Academy Ltd 204


Overview of Matplotlib
Installation
• Windows, Linux as well as MacOS distributions have matplotlib and most of its
dependencies as wheel packages

• Run the command to install Matplotlib package:
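The usual command (assuming pip is available on your PATH) is:

pip install matplotlib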

© The Knowledge Academy Ltd 205


Overview of Matplotlib
Importing Matplotlib
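The conventional import, used throughout the examples that follow, is:

import matplotlib.pyplot as plt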

© The Knowledge Academy Ltd 206


Overview of Matplotlib
Basic Plots in Matplotlib

• Matplotlib comes with an extensive diversity of plots

• Plots help to understand trends, patterns, as well as to make correlations

• They are essentially instruments for reasoning about quantitative information

© The Knowledge Academy Ltd 207


Overview of Matplotlib
(Continued)

Line Plot

© The Knowledge Academy Ltd 208


Overview of Matplotlib
(Continued)

Bar Plot

© The Knowledge Academy Ltd 209


Overview of Matplotlib
(Continued)

Histogram

© The Knowledge Academy Ltd 210


Overview of Matplotlib
(Continued)

Scatter Plot

© The Knowledge Academy Ltd 211


Object-Oriented Interface
• In the object-oriented approach, we create figure objects and then call methods or attributes on those objects. This interface is better for dealing with a canvas that has several plots on it

• To commence with, we create a figure instance that provides an empty canvas

fig = plt.figure()

• Now, add axes to the created figure. The add_axes() method takes a list of 4 elements corresponding to the left, bottom, width, and height of the axes. Every number should be between 0 and 1

ax=fig.add_axes([0,0,1,1])

© The Knowledge Academy Ltd 212


Two Interfaces
(Continued)

• Set title and labels for x and y axis

ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')

• Call the plot() method of the axes object

ax.plot(x,y)

© The Knowledge Academy Ltd 213


Two Interfaces
(Continued)

Example:
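Putting the pieces above together, a minimal sketch of the example might look like this (the sine-wave data x and y are generated here for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig = plt.figure()                 # empty canvas
ax = fig.add_axes([0, 0, 1, 1])    # [left, bottom, width, height]
ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')
ax.plot(x, y)
plt.show()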

© The Knowledge Academy Ltd 214


Simple Line Plots and Scatter Plots
Simple Line plot

Example 1

© The Knowledge Academy Ltd 215


Simple Line Plots and Scatter Plots
(Continued)

Example 2: Straight Line

© The Knowledge Academy Ltd 216


Simple Line Plots and Scatter Plots
(Continued)

Example 3: Curved line

© The Knowledge Academy Ltd 217


Simple Line Plots and Scatter Plots
(Continued)

Example 4: Multiple lines

© The Knowledge Academy Ltd 218


Simple Line Plots and Scatter Plots
(Continued)

Example 5: Dotted line

© The Knowledge Academy Ltd 219


Simple Line Plots and Scatter Plots
Scatter Plots
• Example 1

© The Knowledge Academy Ltd 220


Simple Line Plots and Scatter Plots
(Continued)

• Example 2

© The Knowledge Academy Ltd 221


Simple Line Plots and Scatter Plots
(Continued)

• Example 3

© The Knowledge Academy Ltd 222


Visualising Errors
• In the visualisation of data and results, showing errors effectively can make a plot convey much more complete information about the data

Basic Errorbars
• With a single Matplotlib function, a basic errorbar can be created:
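A hedged sketch of a basic errorbar, with invented data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
dy = 0.8                                     # assumed constant error on each point
y = np.sin(x) + dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k')        # fmt controls the marker/line style
plt.show()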

© The Knowledge Academy Ltd 223


Visualising Errors
(Continued)

• Example 1:

© The Knowledge Academy Ltd 224


Visualising Errors
(Continued)

• Example 2:

© The Knowledge Academy Ltd 225


Contour Plots
• Firstly, we have to import the functions for plotting

Visualising a Three-Dimensional Function


• We will start by showing a contour plot by using a function z=f(x,y)

© The Knowledge Academy Ltd 226


Contour Plots
(Continued)

• plt.contour function is used to create a contour plot

• This function takes three arguments: a grid of x values, a grid of y values, as well as a
grid of z values

• The x as well as y values signify positions on the plot, and the contour levels will
represent the z values

© The Knowledge Academy Ltd 227


Contour Plots
(Continued)

• The np.meshgrid function is used to build two-dimensional grids from one-dimensional arrays:

• The lines in the plotting can be color-coded by specifying a colourmap with the cmap
argument
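A minimal sketch combining meshgrid, contour, and cmap; the function f is invented for illustration:

import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)                  # 2D grids built from the 1D arrays
Z = f(X, Y)

plt.contour(X, Y, Z, 20, cmap='RdGy')     # 20 equally spaced levels, colour-coded by cmap
plt.show()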

© The Knowledge Academy Ltd 228


Contour Plots
(Continued)

• Also, we will specify that we want more lines to be drawn, i.e. 20 equally spaced
intervals within the data range:

© The Knowledge Academy Ltd 229


Contour Plots
(Continued)

• Matplotlib has a wide range of colourmaps that you can easily browse in IPython by typing plt.cm. and then pressing the Tab key

plt.cm.<TAB>

© The Knowledge Academy Ltd 230


Contour Plots
(Continued)

• We can also apply a filled contour plot by using the plt.contourf() function

• Moreover, we will add a plt.colorbar() command, which automatically creates an additional axis with labelled colour information for the plot:

© The Knowledge Academy Ltd 231


Contour Plots
(Continued)

• The colorbar makes it clear that the black regions are peaks. On the other hand, the red
regions are valleys

• Also, we can use the plt.imshow() function to interpret a two-dimensional grid of data as an image

© The Knowledge Academy Ltd 232


Histograms, Binnings, and Density
Example of Histograms

© The Knowledge Academy Ltd 233


Histograms, Binnings, and Density
(Continued)

• The hist() function has several options to tune both the calculation as well as the
display; here is an example of more customised histogram:

© The Knowledge Academy Ltd 234


Histograms, Binnings, and Density
(Continued)

• The plt.hist docstring has more information on other customisation options available

• The combination of histtype='stepfilled' with some transparency (alpha) tends to be very useful when comparing histograms of several distributions:
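A hedged sketch of that comparison, with invented normal distributions:

import numpy as np
import matplotlib.pyplot as plt

x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)   # shared styling for all three
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs)
plt.show()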

© The Knowledge Academy Ltd 235


Histograms, Binnings, and Density
(Continued)

• The np.histogram() function computes the frequency of the data distribution (the bin counts) without drawing it

© The Knowledge Academy Ltd 236


Histograms, Binnings, and Density
Binnings

plt.hexbin: Hexagonal binnings

• The two-dimensional histogram creates a tessellation of squares across the axes. The
regular hexagon is another natural shape for such a tessellation

• The plt.hexbin routine is provided by Matplotlib for this purpose; it represents a two-dimensional dataset binned within a grid of hexagons:
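A minimal sketch of plt.hexbin on invented two-dimensional data:

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T

plt.hexbin(x, y, gridsize=30, cmap='Blues')   # bin the points into a grid of hexagons
plt.colorbar(label='count in bin')
plt.show()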

© The Knowledge Academy Ltd 237


Histograms, Binnings, and Density
(Continued)

• plt.hexbin has a number of interesting options, including the ability to specify weights
for each point, as well as to alter the output in each bin to any NumPy aggregate (mean
of weights, standard deviation of weights, etc.)

Kernel Density Estimation


• Another common technique of evaluating densities in multiple dimensions is kernel
density estimation (KDE)

© The Knowledge Academy Ltd 238


Histograms, Binnings, and Density
(Continued)

Example:

© The Knowledge Academy Ltd 239


Customising Plot Legends
• Plot legends in data science give meaning to a visualisation, assigning meaning to the
several plot elements

• The plt.legend() command is used to create the simplest legend, which is generated automatically for any labelled plot elements:
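A minimal sketch of the simplest legend, with invented sine and cosine curves:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend()                                  # picks up every labelled plot element
plt.show()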

© The Knowledge Academy Ltd 240


Customising Plot Legends
(Continued)

• But, there are several ways we might want to customise such a legend. For instance, we
can define the location as well as turn off the frame:

© The Knowledge Academy Ltd 241


Customising Plot Legends
(Continued)

• We can use the ncol command for specifying the number of columns in the legend:

© The Knowledge Academy Ltd 242


Customising Plot Legends
(Continued)

• We can use a fancybox (rounded box) or add a shadow, alter the transparency (alpha
value) of the frame, or alter the padding around the text:

© The Knowledge Academy Ltd 243


Customising Plot Legends
Choosing Elements for the Legend

• We can fine-tune which elements as well as labels appear in the legend using the
objects returned by the plot commands

• The plt.plot() command can create multiple lines at once and returns a list of the created line instances. Passing any of these to plt.legend() tells it which lines to identify, along with the labels we would like to specify:

© The Knowledge Academy Ltd 244


Customising Plot Legends
(Continued)

• Now, applying labels to the plot elements which show on the legend:

© The Knowledge Academy Ltd 245


Customising Plot Legends
(Continued)

Multiple Legends

© The Knowledge Academy Ltd 246


Customising Colorbars
• Firstly, import functions:

• The simplest colorbar can be created with the plt.colorbar function:

© The Knowledge Academy Ltd 247


Customising Colorbars
Customising Colorbars
• The colormap can be specified by using the cmap argument to the plotting function that
is creating the visualisation:

© The Knowledge Academy Ltd 248


Customising Colorbars
(Continued)

Color limits and extensions

© The Knowledge Academy Ltd 249


Customising Colorbars
Discrete Color Bars
• plt.cm.get_cmap() function is used for discrete color bars

© The Knowledge Academy Ltd 250


Multiple Subplots
• These subplots might be insets, grids of plots, or other more complex layouts

• By using plt.axes

© The Knowledge Academy Ltd 251


Multiple Subplots
(Continued)

• Example of fig.add_axes()

© The Knowledge Academy Ltd 252


Multiple Subplots
(Continued)

• plt.subplot() creates a single subplot within a grid

© The Knowledge Academy Ltd 253


Multiple Subplots
(Continued)

• The command plt.subplots_adjust is used for adjusting the spacing between these
plots. The following example uses the equivalent object-oriented command named
fig.add_subplot():
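A hedged sketch of a 2 x 3 grid built with fig.add_subplot() and spaced with subplots_adjust:

import matplotlib.pyplot as plt

fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)        # spacing between the subplots
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)                  # 2 rows x 3 columns, i-th subplot
    ax.text(0.5, 0.5, str((2, 3, i)), fontsize=18, ha='center')
plt.show()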

© The Knowledge Academy Ltd 254


Multiple Subplots
(Continued)

• We can also specify subplot locations as well as extents:

© The Knowledge Academy Ltd 255


Text Annotation
• The following is an example of drawing text at several locations using these transforms:

© The Knowledge Academy Ltd 256


Text Annotation
(Continued)

• Note that by default, the text is aligned above as well as to the left of the specified
coordinates: here the "." at the commencement of each string will approximately mark
the given coordinate location

• The transData coordinates give the common data coordinates associated with the x- as
well as y-axis labels

• The transAxes coordinates give the location from the bottom-left corner of the axes
(here the white box), as a fraction of the axes size

© The Knowledge Academy Ltd 257


Text Annotation
(Continued)

• The transFigure coordinates are similar, but specify the position from the bottom-left of the figure (here the gray box), as a fraction of the figure size

• Notice now that if we alter the axes boundaries, it is only the transData coordinates
that will be affected, whereas the others remain static:

© The Knowledge Academy Ltd 258


Text Annotation
(Continued)

Arrows and Annotation


• Along with tick marks as well as text, another useful annotation mark is the simple
arrow

• Drawing arrows in Matplotlib is often much harder than you might expect. Although a plt.arrow() function is available, the arrows it creates are SVG (Scalable Vector Graphics) objects that are subject to the varying aspect ratio of your plots, and the result is rarely what the user intended

© The Knowledge Academy Ltd 259


Text Annotation
(Continued)

• plt.annotate() function creates some text as well as an arrow, and the arrows can be
very flexibly specified

• Here, we will use annotate with several of its options:
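A minimal sketch of plt.annotate with an arrow; the curve and coordinates are invented:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 2),
            arrowprops=dict(facecolor='black', shrink=0.05))   # arrow from the text to the point
ax.set_ylim(-2, 3)
plt.show()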

© The Knowledge Academy Ltd 260


Three-Dimensional Plotting in Matplotlib
• Three-dimensional plots are enabled by importing the mplot3d toolkit, included with
the main Matplotlib installation:

• Once this submodule is imported, three-dimensional axes can be created by passing the keyword projection='3d' to any of the normal axes creation routines:
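A minimal sketch of creating the three-dimensional axes:

from mpl_toolkits import mplot3d   # enables the 3D projection
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')     # empty three-dimensional axes
plt.show()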

© The Knowledge Academy Ltd 261


Three-Dimensional Plotting in Matplotlib
(Continued)

Three-dimensional Points and Lines


• The most fundamental three-dimensional plot is a line or a collection of scatter points created from sets of (x, y, z) triples

• These can be created by using the ax.plot3D as well as ax.scatter3D functions
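A hedged sketch of a three-dimensional line and scatter, with invented data:

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')

zline = np.linspace(0, 15, 1000)
ax.plot3D(np.sin(zline), np.cos(zline), zline, 'gray')      # 3D line

zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')   # 3D scatter coloured by z
plt.show()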

© The Knowledge Academy Ltd 262


Three-Dimensional Plotting in Matplotlib
(Continued)

Three-dimensional Contour Plots

© The Knowledge Academy Ltd 263


Three-Dimensional Plotting in Matplotlib
(Continued)

• In the following code, we will use an elevation of 60 degrees (that is, 60 degrees above
the x-y plane) as well as an azimuth of 35 degrees (that is, rotated 35 degrees counter-
clockwise about the z-axis):

© The Knowledge Academy Ltd 264


Three-Dimensional Plotting in Matplotlib
(Continued)

Wireframes and Surface Plots


• Two other types of three-dimensional plots which work on gridded data are wireframes and surface plots

• These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualise

© The Knowledge Academy Ltd 265


Three-Dimensional Plotting in Matplotlib
(Continued)

• The following is an example of using a wireframe:

© The Knowledge Academy Ltd 266


Three-Dimensional Plotting in Matplotlib
(Continued)

• A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a colormap to the filled polygons can aid observation of the topology of the surface being visualised:

© The Knowledge Academy Ltd 267


Three-Dimensional Plotting in Matplotlib
(Continued)

• Note that even though the grid of values for a surface plot needs to be two-
dimensional, it need not be rectilinear

• Here is an example of creating a partial polar grid which, when used with a 3D surface plot, can give us a slice into the function we are visualising:

© The Knowledge Academy Ltd 268


Module 4: Python for Data Visualization -
Seaborn

© The Knowledge Academy Ltd 269


Install Seaborn and Load a Dataset For
Analysis
Install Seaborn

Using Pip Installer

• For installing the latest version of Seaborn, you could use pip:
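The usual command (assuming pip is available) is:

pip install seaborn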

Output

© The Knowledge Academy Ltd 270


Install Seaborn and Load a Dataset For
Analysis
(Continued)
Consider the following dependencies of Seaborn:

• Python 2.7 or 3.4+
• NumPy
• SciPy
• Pandas
• Matplotlib

© The Knowledge Academy Ltd 271


Install Seaborn and Load a Dataset For
Analysis
Load a Dataset For Analysis

• seaborn.load_dataset(name, cache=True, data_home=None, **kws)

• This function gives quick access to a small number of example datasets that are useful
for documenting seaborn and generating reproducible illustrations for bug reports

• For normal usage, it is not necessary

© The Knowledge Academy Ltd 272


Install Seaborn and Load a Dataset For
Analysis
(Continued)

• Remember that some of the datasets contain a small amount of preprocessing applied
for defining a proper ordering for the categorical variables

• To see a list of available datasets, use get_dataset_names()

Parameters: name: str

• Name of the dataset ({name}.csv)

© The Knowledge Academy Ltd 273


Install Seaborn and Load a Dataset For
Analysis
(Continued)

Cache: boolean, optional

• If True, try to load from the local cache first, and save to the cache if a download is required

data_home: string, optional

• The directory in which to cache data; see get_data_home()

© The Knowledge Academy Ltd 274


Install Seaborn and Load a Dataset For
Analysis
(Continued)

Kws: keys and values, optional

• Additional keyword arguments are passed through to pandas.read_csv()

Returns: df: pandas.DataFrame

• Tabular data, possibly with some preprocessing applied

© The Knowledge Academy Ltd 275


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
• Histograms depict the data distribution by forming bins along the range of the data, then drawing bars to show the number of observations that fall in each bin

• Seaborn comes with some built-in datasets, and we use one of them here

© The Knowledge Academy Ltd 276


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
(Continued)

Output

© The Knowledge Academy Ltd 277


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
Kernel Density Estimate Curve

• KDE is a procedure for estimating the probability density function of a continuous random variable, and it is used for non-parametric analysis

• Setting the hist flag to False in distplot would yield the kernel density estimate plot
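A hedged sketch using the built-in 'tips' dataset; note that distplot is deprecated in newer Seaborn releases in favour of displot/kdeplot, but it matches the usage described here:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('tips')
sns.distplot(df['total_bill'], hist=False)   # hist=False leaves only the KDE curve
plt.show()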

© The Knowledge Academy Ltd 278


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
(Continued)

Output

© The Knowledge Academy Ltd 279


Regression Analysis by Using the Seaborn
lmplot
• Most of the time, we use datasets that include multiple quantitative variables, and the goal of the analysis is usually to relate those variables to each other. Regression lines can do this

• Usually, we check for multicollinearity while building the regression model, where we have to examine the correlation between all combinations of continuous variables

• Significant action is then taken to remove multicollinearity if it exists

© The Knowledge Academy Ltd 280


Regression Analysis by Using the Seaborn
lmplot
(Continued)

• In Seaborn, there are two main functions for visualising a linear relationship determined through regression. These functions are regplot() and lmplot()

regplot(): accepts the x and y variables in a variety of formats, including simple NumPy arrays, pandas Series objects, or references to variables in a pandas DataFrame

lmplot(): has data as a required parameter, and the x and y variables must be specified as strings. This data format is called "long-form" data
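A minimal sketch of lmplot on the built-in 'tips' dataset, with x and y passed as column names (long-form data):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
sns.lmplot(x='total_bill', y='tip', data=tips)   # scatter plus fitted regression line
plt.show()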

© The Knowledge Academy Ltd 281


Basic Aesthetic Themes and Styles Available
in Seaborn
• Aesthetics is a set of principles which is concerned with the nature and appreciation of
beauty, particularly in art.

• Visualisation is an art of interpreting data in a useful and most comfortable way

• The Matplotlib library is highly customisable, but knowing which settings to tweak to achieve an attractive, expected plot is something one must be aware of in order to make good use of it

• Unlike Matplotlib, Seaborn comes packed with customised themes and a high-level
interface for controlling and customising the look of Matplotlib figures

© The Knowledge Academy Ltd 282


Basic Aesthetic Themes and Styles Available
in Seaborn
(Continued)

Output

© The Knowledge Academy Ltd 283


Basic Aesthetic Themes and Styles Available
in Seaborn
Seaborn Figure Styles

• The interface for manipulating styles is set_style(). Using this function, you can set the theme of the plot. As of the latest version, there are the following five themes (a minimal sketch follows the list):

o darkgrid
o whitegrid
o dark
o white
o ticks
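A minimal sketch of switching themes with set_style(); the plotted data is invented:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')                        # any of: darkgrid, whitegrid, dark, white, ticks
sns.boxplot(data=np.random.normal(size=(20, 6)))
plt.show()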

© The Knowledge Academy Ltd 284


Basic Aesthetic Themes and Styles Available
in Seaborn
(Continued)

Output

© The Knowledge Academy Ltd 285


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
Scatter Plots

• Dots are used in the scatter plot for representing values in two distinct numeric
variables

• The position of each dot on the vertical and horizontal axis indicates values for a single
data point

• Scatter plots are used for observing relationships between variables

© The Knowledge Academy Ltd 286


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

© The Knowledge Academy Ltd 287


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
Hexbin plots

• A hexbin plot is useful for representing the relationship between two numerical variables when you have a lot of data points

• Instead of overlapping points, the plotting window is split into numerous hexbins, and the number of points per hexbin is counted

• The colour indicates this number of points. This can be done directly using the hexbin function of Matplotlib

© The Knowledge Academy Ltd 288


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

Output

© The Knowledge Academy Ltd 289


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
KDE plots

• Kernel Density Estimate is used for visualising the Probability Density of a continuous
variable

• It shows the probability density at different values of a continuous variable

© The Knowledge Academy Ltd 290


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

Output

© The Knowledge Academy Ltd 291


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
• Violin Plot is a way for visualising the distribution of numerical data of distinct variables

• It corresponds to a box plot, but with a rotated density plot on each side, providing more information about the density estimate on the y-axis

• The density is also mirrored and flipped over, and the resulting shape of violin plot is
filled in, creating an image resembling the violin

• The benefit of a violin plot is that it could depict the nuances in the distribution that are
not perceptible in a boxplot

© The Knowledge Academy Ltd 292


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

• On the other hand, the boxplot more clearly indicates the outliers in the data

• Violin plots contain more information than box plots, but they are less popular: their meaning can be more difficult to grasp, and many readers are not familiar with the violin plot representation

© The Knowledge Academy Ltd 293


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

Example of Boxplot:

© The Knowledge Academy Ltd 294


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

Example of Violin plot:

© The Knowledge Academy Ltd 295


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Categorical Plots

• Categorical plots are used to visualise the relationship between variables. The variables can be either numerical or categorical, such as a class, group, or division

• Besides being a statistical plotting library, Seaborn also provides some default datasets. We will be using one such default dataset, known as 'tips'

• The 'tips' dataset holds information about people who ate at a restaurant: whether or not they left a tip for the waiters, their gender, whether they smoke, and so on

© The Knowledge Academy Ltd 296


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 297


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Barplot

• A barplot is used to aggregate categorical data according to some method, by default the mean

• It can also be thought of as a visualisation of a group-by action

Syntax:

barplot([x, y, hue, data, order, hue_order, …])

© The Knowledge Academy Ltd 298


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 299


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Stripplot

• Basically, it creates a scatter plot based on the category

Syntax:

stripplot([x, y, hue, data, order, …])

© The Knowledge Academy Ltd 300


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 301


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Swarmplot

• It is similar to a strip plot, except that the points are adjusted so that they do not overlap. Some people also like combining the idea of a violin plot and a strip plot to form a single plot

• One disadvantage of swarm plots is that they do not scale well to huge numbers of points and take a lot of computation to arrange

• So, if we need to visualise a swarm plot clearly, we can plot it on top of a violin plot

© The Knowledge Academy Ltd 302


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Syntax: swarmplot([x, y, hue, data, order, …])

Output

© The Knowledge Academy Ltd 303


Recall Some of the Use Cases and Features of
Seaborn
Important Features of Seaborn
• Built in themes for styling matplotlib graphics

• Visualising univariate and bivariate data

• Fitting and visualising linear regression models

• Plotting statistical time series data

• Seaborn works well with NumPy and Pandas data structures

• It comes with built in themes for styling Matplotlib graphics

© The Knowledge Academy Ltd 304


Module 5: Introduction to Machine
Learning

© The Knowledge Academy Ltd 305


Introduction
• Machine Learning refers to the study of algorithms and statistical models used by
computer systems as a way of effectively performing tasks without the need for specific
instructions, but relying on patterns and inference instead

• The following describes the two ways a system can improve:

1. By acquiring new knowledge, facts, and skills

2. By adapting its behaviour, solving problems more accurately, and more efficiently

• There are three main elements that comprise Machine Learning:

1. Base knowledge in which the system is aware of the answer thus enabling the system
to learn

© The Knowledge Academy Ltd 306


Introduction
(Continued)

2. The computational algorithm which is at the core of making determinations

3. Variables and features used to make decisions

• Machine Learning is the main subarea of artificial intelligence

• Machine Learning allows the computers or machines to routinely adjust and customise
themselves instead of being explicitly programmed to carry out specific tasks

• These programs or algorithms are specifically designed to improve their performance P


at some task T with experience E:

o T: Recognising hand-written words

© The Knowledge Academy Ltd 307


Introduction
(Continued)

o P: Percentage of words correctly classified

o E: Database of human-labelled images of handwritten words

• The following are real life examples of Machine Learning:

o While shopping on the internet, users are presented with advertisements related to
their purchases

o When a person shopping online checks a product on the internet, the site then recommends similar products

© The Knowledge Academy Ltd 308


Introduction
(Continued)

o When using an app to book a cab ride, the app will provide an estimation of the
price of that ride. When using these services, how do they minimise the detours?
The answer is machine learning

• Some Other Real-Life Examples of Machine Learning:

1. Virtual Personal Assistants

o Siri and Alexa are a few of the popular examples of virtual personal assistants

o Virtual Assistants are integrated in a variety of platforms. For example:

o Smartphones: Samsung Bixby on Samsung S8

© The Knowledge Academy Ltd 309


Introduction
(Continued)

o Smart Speakers: Amazon Echo and Google Home

o Mobile Apps: Google Allo

2. Social Media Services

o Social media platforms are utilising machine learning for their own benefits as well
as for the benefit of the user. Below are a few examples:

o Face Recognition: Upload a picture of you with a friend and Facebook instantly
recognizes that friend

o Similar Pins: Computer Vision is used by Pinterest to recognise objects in images and recommend similar pins accordingly
© The Knowledge Academy Ltd 310
Introduction
3. Online Fraud Detection

o Machine learning is proving its potential to make cyberspace a secure place and
tracking monetary frauds online is one of its examples

o For example: PayPal is using ML for protection against money laundering

4. Online Customer Support

o Most websites will offer the option to chat to customer support. In most cases, you
talk to a chatbot rather than a live executive to answer your queries

o These bots tend to extract information from the website and present it to the
customers

© The Knowledge Academy Ltd 312


Introduction
Difference Between Traditional Programming and Machine Learning
Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program

© The Knowledge Academy Ltd 313


Importance of Machine Learning
• Machine learning has become a key technique for problem solving in a variety of fields and applications:

o Computational Biology – drug discovery, tumour detection, DNA sequencing
o Computational Finance – credit scoring, algorithmic trading
o Image Processing and Computer Vision – motion detection, object detection
o Natural Language Processing – voice recognition
o Energy Production – price and load forecasting
o Automotive, aerospace, and manufacturing – predictive maintenance

© The Knowledge Academy Ltd 314


Types of Machine Learning
• Machine Learning has three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning

• Supervised Learning (task driven – predict the next value) covers Classification (categorical outputs) and Regression (continuous outputs):
o Classification: Support Vector Machines, Discriminant Analysis, Naïve Bayes, Nearest Neighbour
o Regression: Linear Regression, GLM, SVR, GPR, Ensemble Methods, Decision Trees, Neural Networks

• Unsupervised Learning (data driven) is mainly Clustering: K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model

• Reinforcement Learning: learn from mistakes

© The Knowledge Academy Ltd 315


How Machine Learning Works?
• Machine Learning uses both Supervised and Unsupervised Learning. Supervised
Learning trains a model on known input and output data so that it can predict future
outputs. Unsupervised learning identifies hidden patterns or intrinsic structures in input
data
o Unsupervised Learning: group and interpret data based only on input data (Clustering)

o Supervised Learning: develop a predictive model based on both input and output data (Classification and Regression)

© The Knowledge Academy Ltd 316


How Machine Learning Works?
Training the machine learning algorithm (workflow): starting from a training data set, the ML algorithm is trained to produce a model; if the accuracy is not acceptable, the algorithm is trained again; once the accuracy is acceptable, the machine learning algorithm is deployed, and new input data is introduced to the model to make a prediction

© The Knowledge Academy Ltd 317


Machine Learning Mathematics
• Machine Learning theory is a field that combines probability, computer science, statistics, and algorithms to learn iteratively from data and identify hidden patterns that can later be used to build intelligent applications

Why mathematics is significant for machine learning?

o Selecting the right algorithm

o Identifying underfitting and overfitting

o Choosing parameter settings and validation strategies

o Estimating the right confidence interval and uncertainty

© The Knowledge Academy Ltd 318


Machine Learning Mathematics
Importance of maths topics required for Machine Learning (approximate share):

• Linear Algebra – 35%
• Probability Theory and Statistics – 25%
• Multivariate Calculus – 15%
• Algorithms and Complexity – 15%
• Others – 10%

© The Knowledge Academy Ltd 319


Module 6: Natural Language Processing

© The Knowledge Academy Ltd 320


Introduction to NLP
• Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI)

• It serves the interaction between computers and humans with the help of natural language

• The final purpose of NLP is to read, decipher, understand, and make sense of human languages in a valuable manner

• Most NLP techniques depend upon machine learning to derive meaning from human languages

© The Knowledge Academy Ltd 321


Introduction to NLP
• Below is an example of a typical interaction between humans and machines using NLP:

1. A human talks to the machine

2. The audio is captured by the machine

3. Audio gets converted into text

4. Text data is being processed

5. Data is converted into audio

6. The machine responds to the human by playing the audio file

© The Knowledge Academy Ltd 322


Introduction to NLP
• The following are the common applications that have NLP as their driving force:

o Language translator applications like Google Translate

o IVR (Interactive Voice Response) applications that are used in call centres to respond to
specific user requests

o Word processors like Grammarly that employ NLP for checking grammatical errors

o Personal Assistant applications like Siri, Alexa etc.

© The Knowledge Academy Ltd 323


Introduction to NLP
Components of NLP

The following are five main components of NLP:

1. Morphological and Lexical Analysis
2. Syntactic Analysis
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

© The Knowledge Academy Ltd 324


Introduction to NLP
Components of NLP (processing pipeline): an input sentence passes through morphological processing (supported by a lexicon), syntax analysis/parsing (supported by a grammar), semantic analysis (supported by semantic rules), and pragmatic analysis (supported by contextual information)

© The Knowledge Academy Ltd 325


Introduction to NLP
1. Morphological and Lexical Analysis

• Lexical analysis deals with the vocabulary of a language: its words and expressions

• It involves analysing, identifying, and explaining the structure of words

• It consists of dividing a text into paragraphs, words, and sentences

• Individual words are analysed into their components, and non-word tokens such as punctuation are separated from the words

© The Knowledge Academy Ltd 326


Introduction to NLP
2. Semantic Analysis

• Semantic analysis assigns meanings to the structures produced by the syntactic analyser

• It transfers linear sequences of words into structures

• It demonstrates how the words are associated with each other

• Semantics concentrates on the literal meaning of words, phrases, and sentences

• This extracts the dictionary meaning, or the real meaning, from the given context only

© The Knowledge Academy Ltd 327


Introduction to NLP
3. Pragmatic Analysis

• This analysis handles communicative and social content, as well as its effect on interpretation

• In this analysis, the key emphasis is always on what was said, which is then reinterpreted according to its actual intended meaning

• This analysis helps users work out this intended effect by applying a set of rules that characterise cooperative dialogues

© The Knowledge Academy Ltd 328


Introduction to NLP
4. Syntax Analysis

• The words are the smallest units of syntax

• Syntax basically refers to the principles and rules governing the sentence structure of any individual language

• Syntax concentrates on the appropriate ordering of words, which can affect meaning

• It involves analysing the words in a sentence to grasp the grammatical structure of the sentence

© The Knowledge Academy Ltd 329


Introduction to NLP
5. Discourse Integration

• Discourse integration implies a sense of the context

• The significance of any single sentence depends upon the sentences that come before it

• It also considers the meaning of the following sentence

• As an example, in the sentence “He wanted that”, the word “that” depends upon the
previous discourse context

© The Knowledge Academy Ltd 330


NLP and Writing Systems
• One particular element that determines the best approach for text pre-processing is the type of writing system used for a language

Writing systems can be:

• Logographic: an enormous number of individual symbols represent words
• Syllabic: individual symbols represent syllables
• Alphabetic: individual symbols represent sounds

© The Knowledge Academy Ltd 331


NLP Examples
The following are the common applications of NLP:

I. Information retrieval and Web Search

• Search engines such as Google, Bing, Yahoo etc. base their machine translation
technology on NLP deep learning models

• NLP lets algorithms read text on a webpage, interpret its meaning, and translate it to
another language

II. Question Answering

• To ask questions in Natural Language, type in keywords

© The Knowledge Academy Ltd 332


NLP Examples
III. Grammar Correction

• NLP techniques are broadly used by word processing software such as MS-word for spelling
correction and grammar checks

IV. Machine Translation

• Translating text or speech from one natural language to another using computer applications

© The Knowledge Academy Ltd 333


Advantages of NLP
• The following are the advantages of NLP:

a) Users can ask as many questions as they like about any subject and get a response instantly, within seconds

b) NLP systems provide solutions to the questions in natural language

c) These systems provide exact answers to the questions, no unwanted or unnecessary


information

d) The accuracy of the answers depends upon the quantity of relevant information provided in the question

© The Knowledge Academy Ltd 334


Advantages of NLP
e) NLP can work with highly unstructured data sources

f) Enables us to analyse more language-based data compared to humans, and without


fatigue or bias

g) NLP helps computers communicate with humans in their own language and therefore scales up language-related tasks

© The Knowledge Academy Ltd 335


NLP Applications
• There are so many applications of Natural language processing (NLP) in the real world

• Some of them are as follows:

o Machine Translation
o Statistical Machine Translation
o Information Retrieval
o Speech Recognition
o Information Extraction
o Question Answering Systems
o Word Sense Disambiguation
o Text Classification
o Topic Modelling
o Optical Character Recognition
o Language Detection

© The Knowledge Academy Ltd 336


Module 7: Deep Learning

© The Knowledge Academy Ltd 337


Deep Learning
• Deep learning is a machine learning technique that trains machines to do what comes naturally to humans: they learn by example

• It is a key technology behind driverless cars, allowing them to distinguish a pedestrian from a lamppost or to recognise a stop sign

• It powers voice control in consumer devices such as tablets, phones, TVs, and hands-free speakers

© The Knowledge Academy Ltd 338


Deep Learning
• Deep learning is getting attention lately as it is achieving results that were not possible before

• In deep learning, a computer model learns to perform classification tasks directly from text, images, or sound

• Deep learning models can obtain state-of-the-art accuracy, sometimes exceeding human-level performance

• The models are trained by using a huge set of labelled data and neural network architectures that contain multiple layers

© The Knowledge Academy Ltd 339




Importance of Deep Learning
• As the name suggests, Artificial Intelligence is about making machines artificially intelligent, so that they act and think like humans

• The amount of useful data available and an increase in computational speed are the two factors that have made the whole world invest in this field

• If a robot is hard coded, i.e. all the logic has been manually coded into the system, then it is not AI; simple robots do not imply AI

• Machine learning means making a machine learn from its experience and enhance its performance over time, as in the case of a human baby

• The concept of machine learning became possible only when an adequate amount of data was made available for training machines. It assists in dealing with complex systems

© The Knowledge Academy Ltd 341


Importance of Deep Learning
(Continued)

• Deep learning is essentially a subset of machine learning, but in this case the machine learns in the way humans are believed to learn

• The structure of a deep learning model is similar to that of the human brain, with a large number of nodes mirroring the brain's neurons; this gives rise to the artificial neural network

• When traditional machine learning algorithms are applied, we need to select input features manually from a complex data set and then train on them, which is a tedious job for the machine learning scientist; with neural networks, we do not need to select useful input features manually

© The Knowledge Academy Ltd 342


Importance of Deep Learning
(Continued)

• There are several types of neural networks to manage the complexity of data set and
algorithm

• Deep learning has allowed industry experts to overcome challenges that were not possible a decade ago, such as image and speech recognition and Natural Language Processing

• Industries like entertainment, journalism, manufacturing, the digital sector, healthcare, banking and finance, and automotive depend on it

• Trending successes of deep learning include voice assistants, mail services, self-driving cars, video recommendations, and intelligent chatbots

© The Knowledge Academy Ltd 343


How Deep Learning Works
• Neural networks are composed of layers of nodes, similar to the human brain, which is made of neurons. Nodes within individual layers are connected to nodes in adjacent layers

• In the human brain, a single neuron receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and are assigned weights accordingly

• A heavily weighted node will have more impact on the next layer of nodes. The final layer puts together the weighted inputs to produce an output

• Deep learning systems need powerful hardware because they process huge amounts of data and involve many complex mathematical calculations

• Even with such advanced hardware, deep learning training calculations can take weeks

© The Knowledge Academy Ltd 344


How Deep Learning Works
(Continued)

• Deep learning systems need a large amount of data to return accurate results; accordingly, information is fed to them as huge data sets

• When processing the data, artificial neural networks are able to classify it using the answers obtained from a series of true/false questions involving highly complex mathematical computations

• For instance, facial recognition programs work by learning to identify and detect the edges and lines of faces, then more significant parts of the faces, and finally complete representations of the faces

• As the program trains itself, the likelihood of getting the right answers improves with time

© The Knowledge Academy Ltd 345


Module 8: Big Data

© The Knowledge Academy Ltd 346


Big Data Analytics
Introduction

• Big data analysis is the often complex process of


analysing large and varied data sets that can help
companies make informed business decisions

• Big data is a branch related to the analysis,


processing, and storage of large collections of data
that usually originate from different sources

• Big data includes complex transactions and data


sources that require special technologies and
methods to draw vision out of data

© The Knowledge Academy Ltd 347


Big Data Analytics
(Continued)

• The analysis of big data datasets is an interdisciplinary attempt that combines statistics,
mathematics, computer science, and subject matter expertise

• It produces value from the storage and processing of substantial quantities of digital
information that cannot be analysed with conventional computing techniques

© The Knowledge Academy Ltd 348


Big Data Analytics
The definition of big data includes five V's, which together describe data complexity: Volume, Velocity, Variety, Veracity, and Value

© The Knowledge Academy Ltd 349


Big Data Analytics
Sources of Big Data

• Below are the different sources of big data:

1. Archives

2.Enterprise Data

3.Transactional Data

4. Social Media

5. Activity Generated

6. Public Data

© The Knowledge Academy Ltd 350


Big Data Analytics
1. Archives

• A significant amount of data is archived by an


organisation, most of which is rarely required

• As hardware is getting cheaper, organisations do not


want to reject any data; rather, they prefer storing and
capturing as much data as they can

• This data can include scanned copies of agreements, documents, ex-employee records, etc. This type of data, which is accessed less frequently, is known as archive data

© The Knowledge Academy Ltd 351


Big Data Analytics
2. Enterprise Data

• In enterprises, there are large volumes of data in


different formats

• Flat files, word documents, pdf documents, emails,


legacy formats, HTML pages, presentations, and XMLs
are some of the common formats

• The data that is spread in different formats across the


organisation is known as enterprise data

© The Knowledge Academy Ltd 352


Big Data Analytics
3. Transactional Data

• Every enterprise has different applications that include


performing various kinds of transactions like CRM
Systems, Mobile Applications, Web Applications and
many more

• There are one or more relational databases as backend


infrastructure to support the transactions in these
applications

• This is mostly structured data and is known as


transactional data

© The Knowledge Academy Ltd 353


Big Data Analytics
4. Social Media

• There is a significant amount of data generated on


different social networks like Facebook, Twitter etc.

• The social networks involve mostly unstructured data


formats which include images, audio, text, videos, etc.

• This category of the data source is known as social


media

© The Knowledge Academy Ltd 354


Big Data Analytics
5. Activity Generated

• Machines generate a significant amount of data that


exceeds the volume of data generated by humans

• These comprise data from cell phone towers, medical


devices, industrial machinery, satellites, and other data
generated mostly by machines

• These data types are known as activity generated data

© The Knowledge Academy Ltd 355


Big Data Analytics
6. Public Data

• Public data includes those data that is available publicly


such as research data published by research institutes,
sample open source data feeds, census data, data
published by governments etc.

• This publicly accessible data is known as public data

© The Knowledge Academy Ltd 356


State of Practice in Analytics
• Current business problems offer numerous opportunities for organisations to become
increasingly more analytics and data-driven

• Business Drivers for Advanced Analytics:

o Optimise business operations – sales, pricing, profitability, efficiency
o Identify business risk – customer churn, fraud, default
o Predict new business opportunities – upsell, cross-sell, best new customer prospects
o Comply with laws or regulatory requirements – Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)

© The Knowledge Academy Ltd 357


State of Practice in Analytics
(Continued)

• The table describes the four categories of common business problems that organisations
contest with where they have a chance to use advanced analytics to create a
competitive advantage

• Rather than just performing standard reporting on these areas, advanced analytical
techniques can be applied by the organisations to optimise processes and derive more
values from these regular tasks

• The initial three examples don't describe new problems. Organisations have been
attempting to decrease customer churn, increase sales, and cross-sell customers for
many years

© The Knowledge Academy Ltd 358


State of Practice in Analytics
(Continued)

• The last example describes emerging regulatory necessities

• Multiple compliance and regulatory laws have been in presence for quite a long time;
however extra requirements are added every year, that represents added complexity
and data requirements for organisations

• Anti-money laundering (AML) related laws and fraud prevention require advanced
analytical techniques for complying and managing appropriately

© The Knowledge Academy Ltd 359


Main Roles for New Big Data Ecosystem
• There are three key roles for the New Big Data Ecosystem

o Deep Analytical Talent – advanced training in quantitative disciplines, e.g. statistics, maths, and machine learning
o Data-savvy professionals – savvy, but less technical than the first group
o Technology and data enablers – support people, e.g. DB admins, programmers, etc.

© The Knowledge Academy Ltd 360


Phases of Data Analytics Lifecycle
Discovery

• Discovery is the phase 1 where the team learns the business domain, including
appropriate history such as whether the business unit or organization has attempted
similar projects in the past from which they can learn

• The team analyses the resources available to support the project in terms of technology,
people, time, and data

• In this step, essential activities include framing the business problem as an analytics
challenge that can be solved throughout subsequent phases and formulating initial
hypotheses (IHs) to test and start learning the data

© The Knowledge Academy Ltd 361


Phases of Data Analytics Lifecycle
Data Preparation

• Data preparation requires the existence of an analytical sandbox, in which the team can
work with data and perform analytics for the duration of the project

• In this phase, the team needs to execute extract, transform and load (ETL) or extract,
load, and transform (ELT) to retrieve data into the sandbox

• Sometimes the ETL and ELT are abbreviated as the ETLT

• In the ETLT process, data should be transformed so that the team can work with the
data and analyse it. The team also requires to familiarise itself with the data thoroughly
and take steps to condition the data

© The Knowledge Academy Ltd 362


Phases of Data Analytics Lifecycle
Model Planning

• In this phase, the team determines the techniques, methods, and workflow it intends to
follow for the subsequent model building phase

• The team examines the data to learn about the relationships between variables and
subsequently selects key variables and the most relevant models

© The Knowledge Academy Ltd 363


Phases of Data Analytics Lifecycle
Model Building

• Phase 4 is model building, where the team develops datasets for training, testing, and
production purposes

• The team then builds and executes models based on the work done in the model planning phase


© The Knowledge Academy Ltd 364


Phases of Data Analytics Lifecycle
Communicate Results

• Communicate results is phase 5, where the team, in collaboration with major


stakeholders, decides if the results of the project are a success or a failure based on the
criteria developed in the Discovery phase

• In this, the team should quantify the business value, identify key findings, and develop
a narrative to summarise and convey findings to stakeholders

© The Knowledge Academy Ltd 365


Phases of Data Analytics Lifecycle
Operationalise

• Phase 6 is Operationalise, where the team delivers final reports, briefings, code, and
technical documents

• Also, the team may run a pilot project in a production environment to implement the
models

© The Knowledge Academy Ltd 366


Module 9: Working with Data in R

© The Knowledge Academy Ltd 367


Data Manipulation in R
• We can represent data for data analysis with the help of data structures

• Data manipulation in R is used to prepare data for further analysis and visualisation

• The most important aspect of computing with data manipulation in R is that it enables the subsequent analysis and visualisation of the data

• The following are the basic data structures in R:

o Vectors
o Matrices
o Lists
o Data Frames

© The Knowledge Academy Ltd 368


Data Manipulation in R
Creating Subsets of Data in R
• The following are the different methods of subsetting in R are:

1. $ - The dollar sign operator selects a single element of data

2. [[ - like $ in R, the double square brackets operator in R also returns a single element

3. [ - The single square bracket operator in R returns multiple elements of data

© The Knowledge Academy Ltd 369


Data Manipulation in R
(Continued)

Example:

• To retrieve 5 rows and all columns of the built-in dataset iris, the command below is used

Output:
Input:

© The Knowledge Academy Ltd 370


Data Manipulation in R
Creating Subgroups or Bins of Data

1. cut() function in R

• cut() function groups the values of a variable into larger bins

Input:

Output:

© The Knowledge Academy Ltd 371


Data Manipulation in R
(Continued)

2. table() function in R

• We can use the R table() command to count the observations in each level of a factor

Input:

Output:

© The Knowledge Academy Ltd 372


Data Manipulation in R
Combining and Merging Datasets in R
• The following are the ways to combine the different sets of data:

By Adding Columns using cbind() in R

By Adding Rows using rbind() function in R

By Combining Data With Different Shapes using merge() function in R

© The Knowledge Academy Ltd 373


Data clean up
Introduction

• It is the process of transforming the raw data into consistent data and analysing it

• The main aim of data cleaning is to improve the statistical statements based on the data
and their reliability

• It can profoundly influence the statistical statements based on the data

© The Knowledge Academy Ltd 374


Data clean up
Steps to clean data

Initial Exploratory Analysis

Visualise Your Data

Cleaning The Errors

© The Knowledge Academy Ltd 375


Data clean up
(Continued)

Initial Exploratory Analysis:

• The first step involves an initial exploration of the data frame that was just imported into R

• The important thing is to understand how to import data into R and save it as a data
frame

© The Knowledge Academy Ltd 376


Data clean up
(Continued)

Output:

© The Knowledge Academy Ltd 377


Data clean up
(Continued)

The first thing to check is the class of your data frame:

• class(data)

o Here we can clearly see that our dataset is saved as a data frame:

[1] "data.frame"

o Next, we want to check the number of rows and columns in the data frame

© The Knowledge Academy Ltd 378


Data clean up
(Continued)

The code and its result:

[1] 1460 81 – we can see that the data frame has 1460 rows and 81 columns

We can view the statistical for all the columns of the data frame using the code that shown
in the next slide:

© The Knowledge Academy Ltd 379


Data clean up
(Continued)

• summary(data)

Output:

© The Knowledge Academy Ltd 380


Data clean up
(Continued)

Visual Exploratory Analysis:

• There are two types of plots that you should use during the data cleaning process:

Histogram BoxPlot

© The Knowledge Academy Ltd 381


Data clean up
(Continued)

Histogram:

• The histogram is useful to see the overall distribution of numeric columns

• We can determine whether the distribution of data is normal or unimodal or bi-modal or


any other kind of distribution of interest

• The histogram is useful to figure out if there are outliers in the particular numerical
columns under study

© The Knowledge Academy Ltd 382


Data clean up
(Continued)

The code and output is given below:

install.packages("plyr")
library(plyr)
hist(data$Dist_Taxi)

© The Knowledge Academy Ltd 383


Data clean up
(Continued)

BoxPlot:

• It is very useful because it shows the median (the second quartile), along with the first and third quartiles

• BoxPlots are the best way of spotting outliers in your data frame

© The Knowledge Academy Ltd 384


Data clean up
(Continued)

The output is given below:

boxplot(data$Dist_Taxi)

© The Knowledge Academy Ltd 385


Data clean up
(Continued)

Correcting the Errors:

• In this step the main focus is to correct all the errors that you have seen

• If you want to rename a column of your data frame, one approach is:

data$carpet_area <- data$Carpet

• With this code we copied the Carpet column into a new column named "carpet_area" (see the sketch below for a true in-place rename)
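
If you prefer a genuine rename rather than a copy, one possible approach (a minimal sketch, assuming the column is called Carpet):

names(data)[names(data) == "Carpet"] <- "carpet_area"   # rename the column without duplicating it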

© The Knowledge Academy Ltd 386


Data clean up
(Continued)

• Some columns may have an incorrect type associated with them, for example a column containing text elements stored as a numeric column

• In such a case, we can change the type of the column by using the following code:

data$Dist_Taxi <- as.character(data$Dist_Taxi)
class(data$Dist_Taxi)

© The Knowledge Academy Ltd 387


Reading and Exporting Data
• R is a programming language used in statistical computing

• It is used by data analysts, researchers, and statisticians

• R provides powerful tools for manipulating data, which can then be used for predictive modelling

• In order to analyse data, we often need to access it from different databases by using SQL commands

• The data then needs to be read in and exported using different file formats

• There are many pre-defined procedures for doing this

© The Knowledge Academy Ltd 388


Reading and Exporting Data
• The most popular one is ACCESS, where you can create access descriptors that describe data stored in a DBMS

• Access descriptors enable you to create view descriptors, which function in the same way as the PROC SQL command

• Once the data is accessed, it is analysed depending on the requirement

• The data is then exported to a different location altogether

• The advantage is that this can be done using many different file types

• Care needs to be taken when you are exporting data from one file type to another

© The Knowledge Academy Ltd 389


Reading and Exporting Data
• Let us take a look at Data Export Formats:

1. CSV
2. TSV
3. SPSS
4. HTML
5. Fixed Field Text, and many more

• The following two export formats are available for Data Table Exports only:

a. CSV (Comma Separated Values)


b. TSV (Tab Separated Values)

© The Knowledge Academy Ltd 390


Reading and Exporting Data
• The others are:

i. XML
ii. SPSS
iii. HTML
iv. Fixed field text
v. Tableau
vi. JSON

• CSV (Comma Separated Values) files can be opened in MS Excel. They can also be imported into other statistical software

• TSV (Tab Separated Values) is a simple text format for storing data in a tabular structure. TSV and CSV are compatible file formats for importing data into Qualtrics

© The Knowledge Academy Ltd 391


Reading and Exporting Data
• When you decide to export data from Excel, you will be exporting it to an XLSX file

• XML is used for putting your raw data into a database. It is a general-purpose mark-up language and is compatible with Excel

• For statistical analysis, a software package called SPSS is used

• HTML format is used to view your data in a table on a web browser

• Fixed Field Text is a flat file format. It is accompanied by a separate data map file

• Many organisations use a data analysis application called Tableau. JSON (JavaScript Object Notation) is also available for use

© The Knowledge Academy Ltd 392


Importing Data
• Importing data means to import data from various sources into the R programming
environment

Process of Importing Data in R

© The Knowledge Academy Ltd 393


Importing Data
1. Using the Combine Command

• In R programming, we make use of the c() function to concatenate or combine various data values together

• In the following example, vector1, vector2, and vector3 are variables that store integer values separately. We make use of the c() function to combine these values together, as in the sketch below
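
A minimal sketch matching that description (the integer values are hypothetical):

vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector3 <- c(7, 8, 9)
combined <- c(vector1, vector2, vector3)   # combine the three vectors
print(combined)                            # 1 2 3 4 5 6 7 8 9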

© The Knowledge Academy Ltd 394


Importing Data
2. Entering Numerical Items as Data

• We can enter numerical data by typing the values, separated by commas, into the c() command

• Let us create a data set by using the c() command:

• In this example, data1 is the object that stores our data. We type our numerical values between the two parentheses, separated by commas, and then type data1 to display the dataset
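
A minimal sketch (the numerical values are hypothetical):

data1 <- c(23, 17, 12, 15, 9)   # type the values between the parentheses, separated by commas
data1                           # display the dataset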

© The Knowledge Academy Ltd 395


Importing Data
(Continued)

• We will create an object data2 that stores our data. We will also specify data1 as one of
the member components

3. Entering Text Items as Data

• We make use of single-quotes or double-quotes to enter character data

• Whatever data these quotes enclose is interpreted as character data, i.e. a text item

© The Knowledge Academy Ltd 396


Importing Data
(Continued)

• In the following example, we take our data in the form of characters, the days of the week, and store them in the day1 object

• We then pass day1 together with another element into a new vector. In this case, however, the new element is not text but a number. When numbers and text are combined, R converts the number into text (see the sketch below)
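
A minimal sketch of this coercion (hypothetical values):

day1 <- c("Mon", "Tue", "Wed", "Thu", "Fri")
mixed <- c(day1, 7)   # combining text and a number: R converts 7 into the text "7"
class(mixed)          # [1] "character"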

© The Knowledge Academy Ltd 397


Importing Data
4. Using the scan() command

• Instead of typing input data with commas between the values, we can use the scan() command, which doesn't require you to enter a comma after every input value

• The scan() command can also be used to take data from files as well as from the clipboard

• When called with nothing between its parentheses, the scan() command invokes a prompt through which you enter the data

© The Knowledge Academy Ltd 398


Importing Data
(Continued)

• In the above example, we created a data frame which is then stored as a file called 'data.txt' on the local disk. This text file can be accessed using the scan() function as follows:
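
A minimal sketch of writing the data frame out and reading it back with scan() (assumes a data frame named data already exists):

write.table(data, "data.txt", row.names = FALSE)   # store the data frame as data.txt
values <- scan("data.txt", what = "character")     # read the file back, one field at a time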

© The Knowledge Academy Ltd 399


Importing Data
5. Using the Clipboard to Make Data

• To copy and paste the data more interactively, we can use the clipboard

• We can enter the input data such as spreadsheets with the help of scan() command

• The key steps to import spreadsheet data into R are as follows:

o If the spreadsheet contains data of a numerical type, type the scan() command in R before switching to the spreadsheet

o After highlighting the important cells, we copy them to the clipboard

© The Knowledge Academy Ltd 400


Importing Data
(Continued)

o After returning to R, paste the data from the clipboard. R then waits until an empty line is entered before stopping the data entry process, which makes it easier to copy and paste data as required

o Finally, to complete the data entry procedure, a blank line is entered

• If the data is separated by spaces, simply copy and paste. However, if some other
character or symbol separates the data, we must enter it in R before importing the data

© The Knowledge Academy Ltd 401


Importing Data
6. Using Scan() to Retrieve Data from CSV file

• We can retrieve data from a CSV file using the scan() command. We will save our previously created data frame 'data' as a CSV file

• Now, we scan our CSV file and set the what argument to 'character', as in the sketch below
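
A minimal sketch (again assuming the data frame data from the earlier slides):

write.csv(data, "data.csv", row.names = FALSE)                  # save the data frame as a CSV file
csv_values <- scan("data.csv", what = "character", sep = ",")   # every field is read as a character string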

© The Knowledge Academy Ltd 402


Importing Data
7. Reading a File of Data from a Disk

• We can use the scan() command to read a data file from our system's local disk

• Data can be read from a file and written to a vector with the help of the scan() command. In the scan() function, we add the file name as follows:

© The Knowledge Academy Ltd 403


Importing Data
8. Reading Bigger Data Files

• We used the scan() command in the sections above to read data from simple files. R can also handle much larger files containing more complicated data

• There are different ways and means of reading such large data sets stored in a variety of text formats (a short sketch follows the list below):

o To read from a CSV file: read.csv() or read.csv2()

o To read data files laid out as tables: read.table()

o To read files that contain values separated by tabs: read.delim()
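
A minimal sketch of these readers (the file names are hypothetical):

df_csv <- read.csv("data.csv")                    # comma-separated values
df_tab <- read.table("data.txt", header = TRUE)   # whitespace-separated table
df_tsv <- read.delim("data.tsv")                  # tab-separated values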

© The Knowledge Academy Ltd 404


Module 10: Regression in R

© The Knowledge Academy Ltd 405


Regression Analysis
(Continued)

• Regression is of the following two types:

© The Knowledge Academy Ltd 406


Linear Regression
• Using linear regression, an analyst can compress data points from a sample into a straight line

• A "strong" or "loose" correlation can then be determined by the closeness of the points to the regression line

• A more scattered plot pattern in relation to the line suggests a loose correlation, while a tighter clustering of plot points suggests a stronger one

• Regression lines can be positive or negative
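
A minimal sketch of fitting such a line in R (uses the built-in mtcars data as an illustrative assumption):

fit <- lm(mpg ~ wt, data = mtcars)   # compress the (wt, mpg) points into a straight line
plot(mtcars$wt, mtcars$mpg)          # scatter of the original points
abline(fit, col = "blue")            # the fitted regression line
summary(fit)$r.squared               # how tightly the points cluster around the line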

© The Knowledge Academy Ltd 407


Logistic Regression
• In logistic regression, we fit a regression curve y = f(x), where y is a categorical variable

• It is used to estimate the probability of y given a set of predictors x

• The predictors can be categorical, continuous, or a combination of both

• It is a classification algorithm which comes under nonlinear regression

• This model is used to predict a binary outcome (1/0, True/False, Yes/No) given a set of independent variables

• Also, by using dummy variables, it helps to represent categorical/binary results

© The Knowledge Academy Ltd 408


Logistic Regression
(Continued)

• It is a regression model in which the response variable has binary values such as 0/1
or True/False. Hence, we are able to calculate the probability of the binary response

• Expression of R Logistic Regression:
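
y = 1 / (1 + e^-(a + b*x))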

• x and y are the predictor variable and the response variable respectively

• a and b are the coefficients which are numeric constants

© The Knowledge Academy Ltd 409


Logistic Regression
Syntax of Logistic Regression

• In logistic regression, the basic syntax for glm() function is:

glm( formula, data, family)

• Description of the parameters used:
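
In brief (standard glm() usage):

o formula: the symbol describing the relationship between the response variable and the predictors

o data: the data set giving the values of these variables

o family: an R object specifying the details of the model; its value is binomial for logistic regression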

© The Knowledge Academy Ltd 410


Logistic Regression
Building Logistic Regression Model in R Programming

• In this example, we use the BreastCancer dataset, which is available in the mlbench package

• First, we import the data and display the information about the BreastCancer dataset with the str() function (a minimal sketch follows):
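
A minimal sketch of what these steps might look like (assumes the mlbench package, which provides BreastCancer):

library(mlbench)
data(BreastCancer)
str(BreastCancer)                                # display the structure of the dataset

bc <- BreastCancer
bc$Cl.thickness <- as.numeric(bc$Cl.thickness)   # turn the ordered factor into a numeric score
model <- glm(Class ~ Cl.thickness, data = bc, family = binomial)   # logistic regression on one predictor
summary(model)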

• To execute the code, press Ctrl+Enter

© The Knowledge Academy Ltd 411


Logistic Regression
(Continued)

Output:

© The Knowledge Academy Ltd 412


Logistic Regression
Applications of Logistic Regression with R

• Logistic Regression helps in categorisation and image segmentation

• In geographic image processing, we use logistic regression

• We use logistic regression in handwriting recognition

• Healthcare is an application area of logistic regression

• More generally, we use this type of regression whenever we need to predict the probability of a binary outcome

© The Knowledge Academy Ltd 413


Multiple Regression
• Multiple regression is an extension of linear regression to relationships involving more than two variables

• In a simple linear relationship there is one predictor and one response variable, but in multiple regression there is more than one predictor variable and one response variable

• The mathematical equation (General) for multiple regression can be expressed as

y = a + b1x1 + b2x2 + ... + bnxn

© The Knowledge Academy Ltd 414


Multiple Regression
(Continued)

Following is the description of the parameters which are used in the equation on the
previous slide −

• y is the response variable

• a, b1, b2...bn are the coefficients

• x1, x2, ...xn are the predictor variables.

© The Knowledge Academy Ltd 415


Multiple Regression
lm() Function

• The lm() function creates the relationship model between the response variable and the predictors

lm(y ~ x1 + x2 + x3 ..., data)

o formula is a symbol that defines the relation between predictor variables and the
response variable.

o data is the parameter on which the formula will be applied

© The Knowledge Academy Ltd 416


Multiple Regression
Example

• Input Data

o Take the data set "mtcars" which is available by default in the R environment

o It gives a comparison among different car models in terms of weight of the car ("wt"), mileage per gallon ("mpg"), horse power ("hp"), cylinder displacement ("disp"), and some more parameters

o The goal of the model is to establish the relationship between "wt", "hp", and "disp" as predictor variables and "mpg" as the response variable

© The Knowledge Academy Ltd 417


Multiple Regression
(Continued)

input <- mtcars[, c("mpg","disp","hp","wt")]
print(head(input))

Output

© The Knowledge Academy Ltd 418


Multiple Regression
(Continued)

• Create Relationship Model & get the Coefficients

input <- mtcars[, c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Show the model.
print(model)

# Get the intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)

Output

© The Knowledge Academy Ltd 419


Multiple Regression
(Continued)

• Create Equation for Regression Model

Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.15 + (-0.000937)*x1 + (-0.0311)*x2 + (-3.8008)*x3

• Apply Equation to predict New Values

o We can use the previously created regression equation for predicting the mileage
when a new set of values for weight, horse power and displacement is provided

© The Knowledge Academy Ltd 420


Multiple Regression
(Continued)

o For a car with wt = 2.91, hp = 102 and disp = 221 the predicted mileage is :

Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91
print(Y)

Output

© The Knowledge Academy Ltd 421


Normal Distribution
• It is often observed that data collected randomly from independent sources is normally distributed

• On plotting a graph, we get a bell-shaped curve with the count of the values on the vertical axis and the value of the variable on the horizontal axis

• The middle part of the curve is the mean of the dataset

• To work with the normal distribution, R programming has four inbuilt functions

dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

© The Knowledge Academy Ltd 422


Normal Distribution
(Continued)

o x represents a vector of numbers

o p represents a vector of probabilities

o n represents the number of observations

o mean represents the mean value of the sample data. Its default value is 0

o sd represents the standard deviation. Also, its default value is 1

© The Knowledge Academy Ltd 423


Normal Distribution
Example of dnorm()

• Build a sequence of numbers between -20 and 20 that increases by 0.2

x <- seq(-20, 20, by = .2)
y <- dnorm(x, mean = 5.0, sd = 1.0)
plot(x, y, main = "Normal Distribution", col = "brown")

Output

© The Knowledge Academy Ltd 424


Binomial Distribution
• The binomial distribution explores the probability of success of an event having only
two possible outcomes in a series of experiments

• For example, tossing a coin always gives a head or a tail. Using the binomial distribution, the probability of finding exactly 3 heads in 10 repeated tosses of a coin can be estimated

• To work with the binomial distribution, R programming has four inbuilt functions

dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

© The Knowledge Academy Ltd 425


Binomial Distribution
(Continued)

o x represents a vector of numbers

o p represents a vector of probabilities

o n represents the number of observations

o size represents the number of trials

o prob defines the probability of success of each trial

© The Knowledge Academy Ltd 426


Binomial Distribution
Example of dbinom()

• dbinom() function gives the distribution of probability density at each point

# Create a sample of 50 numbers which are incremented by 5.
x <- seq(0, 50, by = 5)

# Create the binomial distribution.
y <- dbinom(x, 50, 0.5)

# Plot the graph.
plot(x, y, main = "Binomial Distribution")

Output

© The Knowledge Academy Ltd 427


Binomial Distribution
Example of pbinom()

• pbinom() function gives the cumulative probability of an event

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26, 51, 0.5)
print(x)

Output

© The Knowledge Academy Ltd 428


Binomial Distribution
Example of qbinom()

• The qbinom() function takes a probability value and gives a number whose cumulative value matches that probability

# How many heads will have a cumulative probability of 0.25
# when a coin is tossed 51 times.
y <- qbinom(0.25, 51, 1/2)
print(y)

Output

© The Knowledge Academy Ltd 429


Binomial Distribution
Example of rbinom()

• The rbinom() function generates the required number of random values for a given probability

# Find 8 random values from a sample of 150 with probability of 0.4.
y <- rbinom(8, 150, 0.4)
print(y)

Output

© The Knowledge Academy Ltd 430


Module 11: Modelling Data

© The Knowledge Academy Ltd 431


What are the Relationships?
• In Power BI, a relationship is used to describe the connection or relation between two or more tables

• Relationships are used to perform analysis based on multiple tables

• Relationships help to display the data, and the correct information, across multiple tables

• Relationships are also used to calculate accurate results

© The Knowledge Academy Ltd 432


Viewing Relationships
• The model view displays all of the tables, columns, and relationships in your model

• This view can be mainly useful when your model contains complex relationships
between many tables

• Click on the Model icon placed at the left side of the window to see a view of the
existing model

• Hovering your cursor over a relationship line shows the columns that are used, as shown on the next slide:

© The Knowledge Academy Ltd 433


Viewing Relationships
(Continued)

© The Knowledge Academy Ltd 434


Creating Relationships
• The following are the steps to create a relationship manually:

Step 1: On the Modeling tab, select Manage Relationships > New

© The Knowledge Academy Ltd 435


Creating Relationships
Step 2: In the Create relationship dialog box, select a Products table in the first table drop-
down list, and then choose the column you want to use in the relationship

© The Knowledge Academy Ltd 436


Creating Relationships
Step 3: In the second table drop-down list, choose the other table you want in the
relationship and then select the other column you want to use, press OK

© The Knowledge Academy Ltd 437


Cardinality
• While creating a relationship between two tables, you get two values that can be 1 or *
on the two ends of the relationship among two tables, known as Cardinality of the
relationship

• There are four types of cardinality, as follows:

1. *-1: Many-to-One

2. 1-1: One-to-One

3. 1-*: One-to-Many

4. *-*: Many-to-Many

© The Knowledge Academy Ltd 438


Cardinality
1. Many to one (*:1)

• A many-to-one relationship is an important type of cardinality and the default type of relationship

• In a many-to-one relationship, the column in a given table can have more than one instance of a value, while the other related table, known as the lookup table, contains only one instance of a value

2. One to one (1:1)

• In a one-to-one (1:1) relationship, the column in one table has only one instance of a specific value, and the other related table also contains only one instance of a specific value

© The Knowledge Academy Ltd 439


Cardinality
3. One to many (1:*)

• In a one-to-many (1:*) relationship, the column in one table has only one instance of a
specific value, and the other related table contains more than one instance of a value

4. Many to many (*:*)

• You can develop a many-to-many relationship between tables with composite models
that removes the requirements for unique values in tables

• It also eliminates the previous workarounds, such as introducing new tables only to
build relationships

© The Knowledge Academy Ltd 440


Cross Filter Direction
• Each model relationship must be described with a cross filter direction

• Your selection decides the direction(s) that filters will propagate

• The possible cross filter options are dependent on the type of cardinality

• Single cross filter direction indicates single direction, and both show both directions

• A relationship that filters in both directions is commonly described as bi-directional

© The Knowledge Academy Ltd 441


Cross Filter Direction
(Continued)

Cardinality type                 Cross filter options
One-to-many (or Many-to-one)     Single, Both
One-to-one                       Both
Many-to-many                     Single (Table1 to Table2), Single (Table2 to Table1), Both

© The Knowledge Academy Ltd 442


What is DAX?
• DAX, which stands for Data Analysis Expressions, is a collection of operators, functions, and constants that we can use in expressions or formulas

• DAX helps us to return values after making calculations from the already available data

• To understand DAX, you just need to be familiar with Microsoft Excel formulas

© The Knowledge Academy Ltd 443


What is DAX?
(Continued)

• DAX formulas are just like the ones we write in Microsoft Excel

• However, DAX functions and Excel functions differ in certain aspects

• Excel allows its users to reference cells or arrays. If users need similar behaviour in Power BI, they must use DAX functions

• DAX provides more data types than Microsoft Excel does

© The Knowledge Academy Ltd 444


Syntax
• DAX formulas begin with an = sign after which any scalar value can be provided

• The scalar value can be an expression that evaluates to a scalar or an expression that
can be converted to a scalar

Expressions can contain any of the following:

o Scalar expressions or values, and constants that use scalar operators such as +, -, *, /, >, =, && etc.

o Operators, constants, or references to columns

o References to columns or tables

o Constants specified as a part of an expression

o A function and its result, along with its arguments and parameters

© The Knowledge Academy Ltd 445


Syntax
(Continued)

• DAX requires that all its objects, whether tables or columns, have unique names

• Also, names of objects are case insensitive, i.e. Products and PRODUCTS would refer to the same table or column

• A column name should always be fully qualified, i.e. it must be preceded by the table name and written in square brackets, e.g. Sales[Product_Id]

• Sometimes table names will contain spaces in which case they must be enclosed in
single quotations

© The Knowledge Academy Ltd 446


Syntax
(Continued)

• A fully qualified name is required in the following circumstances:

1. When the VALUES function requires arguments

2. As arguments to the ALL or ALLEXCEPT functions

3. When passed as a filter argument while using the CALCULATE or CALCULATETABLE functions

4. As an argument to the RELATEDTABLE function

5. As an argument to any time intelligence function

© The Knowledge Academy Ltd 447


Functions
• Functions in DAX can be categorised into the following:

1. Date and Time Functions
2. Filter Functions
3. Time Intelligence Functions
4. Information Functions
5. Logical Functions
6. Math and Trig Functions
7. Text Functions
8. Many others as well

© The Knowledge Academy Ltd 448


Functions
(Continued)

• DAX functions always point to either a column or a table

• To specify only selected values you will need to filter them

• DAX is also capable of returning a whole table rather than a column only

• DAX works with Time Intelligence functions to perform dynamic calculations

© The Knowledge Academy Ltd 449


Row Context
• Row context is much easier to understand as compared to filter context

• The simplest way to visualise row context is to take a table and add a calculated column

• Each row in a table contains its own row context

• For instance, if a table has two columns a and b and row 1 values are 1 and 2
respectively

• Similarly, row 2 values are 3 and 4 respectively

• If you add a column c that sums the values of columns a and b, then the column c value for row 1 would be 3, and the value for row 2 would be 7
© The Knowledge Academy Ltd 450


Calculated Columns
• With calculated columns, you can append new data to an existing table in your model

• You can create a Data Analysis Expressions (DAX) formula that defines the column's values, rather than querying and loading values into your new column from a data source

• In Power BI desktop, calculated columns are generated by using the new column feature
in Report view

• Calculated columns that you create appear in the fields list just like any other field

© The Knowledge Academy Ltd 451


Calculated Columns
(Continued)

• But they will contain a special icon showing that their values are the result of a formula:

• You can name new columns whatever you want, and add them to a report visualisation
just like other fields

© The Knowledge Academy Ltd 452


Calculated Tables
• You can create a calculated table by using the New Table feature in a data view or report
view of Power BI desktop

• For instance, suppose you are a personnel manager who has a table of Sales_2019 and
another table of Sales_2020, and you want to combine both tables into a single table
called Sales

Sales_2019 Sales_2020

© The Knowledge Academy Ltd 453


Calculated Tables
• The following are the steps to create a calculated table:

Step 1: Click on the Modeling Tab and then select New Table

© The Knowledge Academy Ltd 454


Calculated Tables
Step 2: Enter the following formula in the formula bar

© The Knowledge Academy Ltd 455


Calculated Tables
Step 3: A new table named Sales is created and appears just like any other table in
the Fields pane:

© The Knowledge Academy Ltd 456


Measures
• Measures are generally used for data analyses

• Simple summarisations such as sums, averages, counts and minimum, maximum can be
set through the Fields well

• The calculated results of measures change as you interact with your reports, allowing for fast and dynamic ad-hoc data exploration

• In Power BI Desktop, measures are created in the data view or report view

• The measures that you create appear in the Fields list with a calculator icon

© The Knowledge Academy Ltd 457


Measures
(Continued)

• You can name measures whatever you want and add them to a new or existing visualisation just like any other field

© The Knowledge Academy Ltd 458


Module 12: Shaping and Combining Data

© The Knowledge Academy Ltd 459


Shaping and Combining Data
Power BI Desktop Queries
• Queries in Power BI Desktop are as essential as datasets are in the Power BI service

• It is the queries that form the basis of the reports and visualisations in Power BI

• A query is created in Power BI as soon as the command to fetch data (or Get Data to be
more precise) is given

• However, these tasks can only be performed from the query editor in the Power BI
desktop version

• Users can use multiple queries to get the results they want

© The Knowledge Academy Ltd 460


Shaping and Combining Data
(Continued)

• This is possible if they have already imported these datasets from some external source
such as Excel, CSV (comma-separated values) file, or some databases

• Click on the Edit Queries option to start working

© The Knowledge Academy Ltd 461


Shaping and Combining Data
(Continued)

• Power BI Desktop Queries can consist of the following:

1. Data Retrieved from a Single Table

2. Data Retrieved from Multiple Tables

3. Data Having Calculated Columns in the Query

4. Data Related to Another Table based on a Calculated Column

© The Knowledge Academy Ltd 462


The Query Editor
• Once a query is loaded, the Power Query Editor view becomes more interesting

• If we connect to a web data source, Power Query Editor loads information about the data, which you can then begin to shape

• The following steps show how power query editor appears once a data connection is
established:

Step 1: In the ribbon, various buttons are now active to interact with the data in the query

Step 2: In the left pane, queries are listed as well as available for selection, shaping and
viewing

Step 3: In the centre pane, data from the selected query is displayed or available for
shaping

© The Knowledge Academy Ltd 463


The Query Editor
Step 4: The Query Settings pane displays, listing the query's properties as well as applied
steps


© The Knowledge Academy Ltd 464


Shaping Data and Applied Steps
Shaping Data
• When you shape data in the Query Editor, you provide step-by-step instructions that the Query Editor carries out, adjusting the data as it loads and presenting it for you

• In the Power BI desktop, there is a lot that can happen to the data that has been
retrieved

• While in the query editor users can opt to remove columns/rows from a dataset or may
even add new columns to the existing columns

• The new columns can be populated with a calculated value also

© The Knowledge Academy Ltd 465


Shaping Data and Applied Steps
Applied Steps
• Whenever any action takes place in the Query Editor, the Applied Steps window lists the changes that have taken place

• An icon in front of each step we applied allows us to cancel the change we made

© The Knowledge Academy Ltd 466


Shaping Data and Applied Steps
(Continued)

• The case is shown below:

• There are various ways of removing the errors

• Remove Errors is one way in which all rows containing the errors would be removed

© The Knowledge Academy Ltd 467


Shaping Data and Applied Steps
(Continued)

• As we want to keep our data and rectify the errors, we are not going to use this option

• Click the column that has ERROR displayed to show the following:

© The Knowledge Academy Ltd 468


Shaping Data and Applied Steps
(Continued)

• The Query Editor provides the user with a Context Menu for Applied Steps

• The options in the menu include Rename, Delete, Delete Until End, Insert Step After, etc.

• So just choose the step and click Delete

© The Knowledge Academy Ltd 469


Advanced Editor
• The advanced editor enables you to view the code that power query editor is creating
with each step

• It also allows you to create your own shaping code

• To enable the advanced editor, select View from the ribbon, then select Advanced
Editor

© The Knowledge Academy Ltd 470


Advanced Editor
(Continued)

• A window appears that displays the existing query code as shown below:

© The Knowledge Academy Ltd 471


Formatting Data
• With conditional formatting for tables in Power BI Desktop, you can specify customised cell colours, including colour gradients, based on field values

• You can also represent cell values with data bars, active web links, or KPI icons

• You can apply conditional formatting to any text or data field, as long as you base the formatting on a field that contains numeric, colour name or hex code, or web URL values

• The following are the steps to apply conditional formatting:

Step 1: Select a Table or Matrix visualisation in Power BI desktop

© The Knowledge Academy Ltd 472


Formatting Data
Step 2: In the Fields section of the Visualisations pane, select the down-arrow next to the
field in the Values well that you want to format

© The Knowledge Academy Ltd 473


Formatting Data
Step 3: Click on Conditional formatting and then select the type of formatting to apply

© The Knowledge Academy Ltd 474


Transforming Data
• The following are the steps for transforming data:

Step 1: Open Power BI, choose Excel option from the Get Data

© The Knowledge Academy Ltd 475


Transforming Data
Step 2: Select the Excel file named as Employee and click Open

© The Knowledge Academy Ltd 476


Transforming Data
Step 3: Select the data which you want to transform and click on the Transform Data button

© The Knowledge Academy Ltd 477


Transforming Data
Step 4: The transformed data will be displayed as follows:

© The Knowledge Academy Ltd 478


Combining Data
• When we have two or more different data sources for creating our reports, combining them proves to be an efficient method

• To combine data, the easiest way would be to establish a relationship between the data
sources on a column

• Let us take a scenario where the user has a list of cities and their country codes in one data source, and the country codes and their respective country names in another, as shown on the next slide:

© The Knowledge Academy Ltd 479


Combining Data
(Continued)

© The Knowledge Academy Ltd 480


Combining Data
(Continued)

• Import both the files into the Power BI desktop

• Once done, click on the Relationships icon on the right-hand side to create a
relationship as shown:

© The Knowledge Academy Ltd 481


Combining Data
(Continued)

• Next, click on the Reports icon to see a result of the data you have combined

© The Knowledge Academy Ltd 482


Combining Data
(Continued)

• The data you entered has been mapped automatically using Power BI desktop

© The Knowledge Academy Ltd 483


Combining Data
(Continued)

• Data can also be combined from the web with some existing data

• Suppose we have some data that has organisation names and their respective US
country codes but not the country names, and a report requires that organisations be
listed with country names then we could take the web as a data source

Step 1: Choose Get Data > Web and provide the URL from where to retrieve the country
codes and country names. Click OK

© The Knowledge Academy Ltd 484


Combining Data
Step 2: Choose Table, and click Load

Step 3: Rest of the process is the same, click Relationships and then Reports

© The Knowledge Academy Ltd 485


Module 13: Interactive Data
Visualisations

© The Knowledge Academy Ltd 486


Page Layout and Formatting
• Page view settings are accessible in both the Power BI service and Power BI Desktop, with only a small difference in the interface

• In a Power BI report, the first set of page view settings controls the display of your report page relative to the browser window; choose between:

o Fit to page (default): Contents are scaled to fit the page best

o Fit to width: Contents are scaled to fit within the width of the page

o Actual size: Contents appear at full size

© The Knowledge Academy Ltd 487


Page Layout and Formatting
(Continued)

• The second set of page view settings controls the positioning of objects on the report canvas; choose between:

o Show gridlines: Turning on gridlines helps you to position objects on the report canvas

o Snap to grid: Use with Show gridlines to precisely position and align objects on the report canvas

o Lock objects: Lock all objects on the canvas so that they cannot be resized or moved

© The Knowledge Academy Ltd 488


Page Layout and Formatting
(Continued)

o Selection pane: The Selection pane lists all objects on the canvas, and you can
decide which to show and which to hide

© The Knowledge Academy Ltd 489


Page Layout and Formatting
Page Size Settings
• Page size settings are accessible only by report owners

• These settings are available in the Visualisations pane and control the actual size (in
pixels) as well as the display ratio of the report canvas:

o 4:3 ratio

o 16:9 ratio (default)

o Letter

© The Knowledge Academy Ltd 490


Page Layout and Formatting
(Continued)

o Custom (height and width in pixels)

© The Knowledge Academy Ltd 491


Multiple Visualisations
• Graphs and visualisations are an integral part of Power BI – both desktop and service

• There are innumerable visualisations in Power BI, and according to Microsoft the list will keep on growing

• As of now, Power BI offers visualisations that help the user build simple charts and also measure performance

• Power BI offers the following types of visualisations:

1. Area Charts
2. Doughnut Charts
3. Bar Charts
4. Funnel Charts
© The Knowledge Academy Ltd 492


Multiple Visualisations
(Continued)

5. Column Charts
6. Gauge Charts
7. Cards: Single Row or Multi-Row
8. Matrix Charts
9. Combo Charts
10. Pie Charts
11. Scatter Charts
12. Slicer Charts
13. Standalone Images
14. Tables

© The Knowledge Academy Ltd 493


Multiple Visualisations
(Continued)

15. Waterfall Charts
16. Tree Maps
17. KPIs
18. Bubble Charts
19. Line Charts
20. Maps

© The Knowledge Academy Ltd 494


Creating Charts
• The following are the steps to create a chart in Power BI report:

Step 1: Create a visualisation by selecting a field from the Fields pane

Step 2: Start with a numeric field like Customers > City > Customer_ID. Power BI creates a
column chart with a single column, and you can select the desired chart from visualizations

© The Knowledge Academy Ltd 495


Using Geographic Data
• Power BI integrates with Bing Maps to produce default map coordinates (a process called geocoding) so you can create maps

• The data may be placed in the Location, Latitude, and Longitude buckets of the visual's field well

• The following are the steps to represent the data by using a map chart in Power Bi
report:

Step 1: Start with a geography field, such as Geo > City > Customer_ID

Step 2: Power BI and bing maps create a map visualisation

© The Knowledge Academy Ltd 496


Using Geographic Data
(Continued)

© The Knowledge Academy Ltd 497


Histograms
• In Power BI, a histogram chart is used to show the frequency distribution of your data

• The histogram chart feature is not built into the Visualizations pane; you have to add it

• The following are the steps to add the histogram chart to the Visualizations pane:

Step 1: Go to the Visualizations pane and click on Get more visuals as shown in the given figure

© The Knowledge Academy Ltd 498


Histograms
Step 2: Enter histogram in the App Source search bar and click on histogram chart Add
button

© The Knowledge Academy Ltd 499


Histograms
Step 3: Click on OK. The histogram chart feature is successfully added in the visualization
pane

© The Knowledge Academy Ltd 500


Histograms
• The following are the steps to represent the data by using a histogram chart in Power BI
report:

Step 1: Click on Get Data icon on the ribbon and select Excel to import a workbook

© The Knowledge Academy Ltd 501


Histograms
Step 2: Select the table that contains that dataset, and then click on load

© The Knowledge Academy Ltd 502


Histograms
Step 3: To create the histogram, click on the histogram chart icon on the visualizations pane
and then add the appropriate fields:

© The Knowledge Academy Ltd 503


Power BI Admin Portal
• The admin portal allows you to manage a Power BI tenant for your organisation

• The admin portal consists of items such as usage metrics, access to the Microsoft 365
admin centre, and settings

• The full admin portal is open to all users who are global admins or have the Power BI service administrator role

• Make sure your account is marked as a Global Admin in Microsoft 365 or Azure Active Directory (Azure AD), or has the Power BI service administrator role, to get access to the Power BI admin portal

© The Knowledge Academy Ltd 504


Power BI Admin Portal
(Continued)

• Below are the steps to get the Power BI admin portal:

Step 1: Select the settings gear in the top right side of the Power BI service

Step 2: Click on the Admin portal

© The Knowledge Academy Ltd 505


Power BI Admin Portal
Step 3: The admin portal contains twelve tabs

© The Knowledge Academy Ltd 506


Service Settings
• The following are the steps to manage Common Data Service settings:

Step 1: You can manage and observe the settings for your environments by signing in to
the Power Platform admin centre

Step 2: Go to the Environments page, select an environment and then click on Settings

© The Knowledge Academy Ltd 507


Service Settings
Step 3: Setting for the selected environment can be managed in the given window:

© The Knowledge Academy Ltd 508


Desktop Settings
• Creators of Power BI content should become aware of the settings available in Power BI options and data source settings

• The configuration of these settings determines the available functionality, default behaviours, user interface options, performance, and the security of the data being used

• GLOBAL options apply to all Power BI Desktop files created or accessed by the user

© The Knowledge Academy Ltd 509


Desktop Settings
(Continued)

• CURRENT FILE options, by contrast, must be set for each Power BI Desktop file

© The Knowledge Academy Ltd 510


Dashboard and Report Settings
• Power BI includes a list of dashboard settings that are available to you

• To illustrate this, we are going to use the Adam Insights dashboard available in the Power BI workspace

• The following are the steps to change this Power BI dashboard settings:

Step 1: On the top right corner, click on the … button and then select the Settings option
from the context menu

© The Knowledge Academy Ltd 511


Dashboard and Report Settings
(Continued)

© The Knowledge Academy Ltd 512


Dashboard and Report Settings
Step 2: Select the Settings option to open the dashboard settings window

© The Knowledge Academy Ltd 513


Dashboard and Report Settings
Step 3: Click on Save button

© The Knowledge Academy Ltd 514


Congratulations

Congratulations on completing this course!


Keep in touch
info@theknowledgeacademy.com
Thank you

© The Knowledge Academy Ltd 515
