
Advanced Data Science Training

© The Knowledge Academy Ltd 1


About The Knowledge Academy
• World Class Training Solutions
• Subject Matter Experts
• Highest Quality Training Material
• Accelerated Learning Techniques
• Project, Programme, and Change
Management, ITIL® Consultancy
• Bespoke Tailor Made Training Solutions
• PRINCE2®, MSP®, ITIL®, Soft Skills, and More

© The Knowledge Academy Ltd 2


Administration
• Trainer
• Fire Procedures
• Facilities
• Days/Times
• Breaks
• Special Needs
• Delegate ID check
• Phones and Mobile devices

© The Knowledge Academy Ltd 3


Outlines
• Module 1: Python for Data Analysis – NumPy

• Module 2: Python for Data Analysis – Pandas

• Module 3: Python for Data Visualisation – Matplotlib

• Module 4: Python for Data Visualisation – Seaborn

© The Knowledge Academy Ltd 4


Outlines
• Module 5: Machine Learning

• Module 6: Natural Language Processing

• Module 7: Deep Learning

• Module 8: Big Data

• Module 9: Working with Data in R

• Module 10: Regression in R

© The Knowledge Academy Ltd 5


Outlines
• Module 11: Modelling Data in Power BI

• Module 12: Shaping and Combining Data using Power BI

• Module 13: Interactive Data Visualisations

© The Knowledge Academy Ltd 6


Module 1: Python for Data Analysis -
NumPy

© The Knowledge Academy Ltd 7


Introduction to NumPy
What is NumPy?
• NumPy (short for Numerical Python) is a Python library used for working with arrays. It
also has functions for working in the domains of linear algebra, matrices, and the Fourier
transform. You can use it freely because it is open source

Why Use NumPy?

• Python lists can serve the purpose of arrays, but their processing is slow. NumPy aims
to provide an array object that is up to 50x faster than traditional Python lists

• In NumPy, the array object is called ndarray, and it provides many supporting functions
that make working with ndarray easy. Arrays are used frequently in data science, where
resources and speed are essential

© The Knowledge Academy Ltd 8


Introduction to NumPy
Why is NumPy Faster Than Lists?

• Unlike lists, NumPy arrays are stored in one continuous place in memory, so processes
can access and manipulate them efficiently. In computer science, this behaviour is known
as locality of reference

• This is the main reason why NumPy is faster than lists. Moreover, it is optimised to work
with the latest CPU (Central Processing Unit) architectures

Which language is NumPy written in?

• NumPy is a Python library written partially in Python, but most of the parts that require
fast computation are written in C or C++

© The Knowledge Academy Ltd 9


NumPy Arrays
1. Arrays in NumPy: The homogeneous multidimensional array is the main object of
NumPy

• It is a table of elements (typically numbers), all of the same type, indexed by a tuple of
non-negative integers

• Dimensions are called axes in NumPy, and the number of axes is the rank

• NumPy's array class is called ndarray; it is also known by the alias array

Example

Output
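The slide's code screenshot is not reproduced here; a minimal sketch of the kind of example it illustrates (array values are illustrative assumptions) is:

import numpy as np

# Create a 2-D ndarray from a nested Python list
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(type(arr))    # <class 'numpy.ndarray'>
print(arr.ndim)     # 2 (number of axes, i.e. the rank)
print(arr.shape)    # (2, 3)
print(arr.dtype)    # e.g. int64 (platform-dependent)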

© The Knowledge Academy Ltd 10


NumPy Arrays
2. Array Indexing: To analyse and manipulate the array object, it is essential to
understand the basics of array indexing. NumPy provides various ways to do array
indexing

Slicing

• NumPy arrays can be sliced, just like lists in Python. Because arrays can be
multidimensional, you need to specify a slice for every dimension of the array

Integer array indexing

• In this method, lists of indices are passed for every dimension. A one-to-one mapping of
the corresponding elements is done in order to construct a new arbitrary array, as shown
in the sketch below
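A small sketch of both indexing styles, assuming an illustrative 3×4 array:

import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Slicing: first two rows, columns 1 and 2
print(arr[:2, 1:3])                 # [[2 3] [6 7]]

# Integer array indexing: elements (0, 0), (1, 2) and (2, 3)
print(arr[[0, 1, 2], [0, 2, 3]])    # [ 1  7 12]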

© The Knowledge Academy Ltd 11


NumPy Arrays
(Continued)

Boolean array indexing

• This method is used to pick elements from the array that satisfy some condition

Example

Output
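A minimal sketch of Boolean array indexing (the array values are assumptions):

import numpy as np

arr = np.array([10, 15, 20, 25, 30])

# Boolean array indexing: keep only the elements greater than 18
mask = arr > 18
print(mask)         # [False False  True  True  True]
print(arr[mask])    # [20 25 30]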

© The Knowledge Academy Ltd 12


NumPy Arrays
3. Basic operations: NumPy provides a plethora of built-in arithmetic functions

Operations on a single array

• Overloaded arithmetic operators can be used to perform element-wise operations on
an array, creating a new array. The existing array is modified in the case of the -=, +=,
and *= operators

Example

Output
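A sketch of element-wise operations on a single array (values are illustrative):

import numpy as np

a = np.array([1, 2, 5, 3])

print(a + 1)    # add 1 to every element: [2 3 6 4]
print(a * 10)   # multiply every element by 10: [10 20 50 30]

a -= 2          # -=, += and *= modify the existing array in place
print(a)        # [-1  0  3  1]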

© The Knowledge Academy Ltd 13


NumPy Arrays
(Continued)

Unary operators

• Various unary operations are provided as methods of the ndarray class, including min,
sum, max, etc. By setting the axis parameter, these functions can also be applied column-
wise or row-wise

Example

Output
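A sketch of unary ndarray methods with and without the axis parameter (values assumed):

import numpy as np

arr = np.array([[1, 5, 6],
                [4, 7, 2],
                [3, 1, 9]])

print(arr.max())          # largest element in the whole array: 9
print(arr.min(axis=0))    # column-wise minimum: [1 1 2]
print(arr.sum(axis=1))    # row-wise sum: [12 13 13]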

© The Knowledge Academy Ltd 14


NumPy Arrays
(Continued)

Binary operators

• These operations are applied element-wise on arrays, and a new array is created. All
basic arithmetic operators such as +, -, /, etc., can be used. The existing array is modified
in the case of the +=, -=, and *= operators

Example

Output

© The Knowledge Academy Ltd 15


Aggregations: Min, Max and more
• The Python numpy module provides various statistical or aggregate functions for
working with single-dimensional or multi-dimensional arrays

• min, sum, mean, max, average, median, product, standard deviation, argmin, variance,
percentile, argmax, cumsum, cumprod, and corrcoef are among the NumPy aggregate
functions

• The following arrays are used in order to demonstrate these NumPy aggregate
functions:

© The Knowledge Academy Ltd 16


Aggregations: Min, Max and more
(Continued)

Python NumPy sum

• The Python NumPy sum function calculates the sum of the values in an array

• The sum function accepts an optional argument named axis, which can be used to
calculate the sum along a given axis. For instance, axis = 0 returns the sum of each
column in a NumPy array

© The Knowledge Academy Ltd 17


Aggregations: Min, Max and more
(Continued)

• The sum of each row in an array is returned by axis = 1
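The slides' arrays are not reproduced; a sketch with an assumed 2×3 array covering both axes is:

import numpy as np

y = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(y))            # sum of all elements: 21
print(np.sum(y, axis=0))    # sum of each column: [5 7 9]
print(np.sum(y, axis=1))    # sum of each row: [ 6 15]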

© The Knowledge Academy Ltd 18


Aggregations: Min, Max and more
(Continued)

Python NumPy average

• The average of a given array is returned by Python NumPy average function

• Average of x and Y axis

© The Knowledge Academy Ltd 19


Aggregations: Min, Max and more
(Continued)

• Without using the axis name calculate numpy array Average

© The Knowledge Academy Ltd 20


Aggregations: Min, Max and more
(Continued)

Python NumPy min

• The Python numpy min function returns the minimum value along a given axis or in a
given array

© The Knowledge Academy Ltd 21


Aggregations: Min, Max and more
(Continued)

• Here, we are finding the numpy array minimum value in the X and Y-axis

© The Knowledge Academy Ltd 22


Aggregations: Min, Max and more
(Continued)

Python NumPy max

• The maximum number in a given axis or from a given array is returned by the Python
numpy max function

© The Knowledge Academy Ltd 23


Aggregations: Min, Max and more
(Continued)

• By using numpy max function find the maximum value in the X and Y-axis

© The Knowledge Academy Ltd 24


Aggregations: Min, Max and more
(Continued)

Python NumPy mean

• The Python numpy mean function returns the average or mean of a given array, or
along a given axis. Mathematically, the mean is the sum of all the items in an array
divided by the number of items

© The Knowledge Academy Ltd 25


Aggregations: Min, Max and more
(Continued)

• Mean value of x and Y-axis (or every row and column)

• Here, we are calculating Mean without using the axis name

© The Knowledge Academy Ltd 26


Computation on Arrays: Broadcasting
• The term broadcasting refers to how NumPy treats arrays with different dimensions
during arithmetic operations, subject to specific constraints. The smaller array is
broadcast across the larger array so that they have compatible shapes

• Broadcasting provides a means of vectorising array operations so that looping occurs in
C rather than Python, since NumPy is implemented in C

• It does this without making unnecessary copies of data, which leads to efficient
algorithm implementations

• In some cases, broadcasting is a bad idea because it leads to inefficient memory
utilisation, which slows down computation

© The Knowledge Academy Ltd 27


Computation on Arrays: Broadcasting
(Continued)

• Example

Output

Broadcasting Rules:

• The following are the rules in order to broadcast two arrays together:

1. If the arrays do not have the same rank, prepend the shape of the lower-rank array
with 1s until both shapes have the same length

© The Knowledge Academy Ltd 28


Computation on Arrays: Broadcasting
2. In a dimension, the two arrays are compatible if they have the same size in the
dimension or if one of the arrays has size 1 in that dimension

3. If arrays are compatible with all dimensions then they can be broadcasted together

4. After broadcasting, every array acts as if it had shape equivalent to the element-wise
maximum of shapes of the two input arrays

5. In any dimension where one array had size 1, as well as the other array had size greater
than 1, the first array acts as if it were copied along that dimension

© The Knowledge Academy Ltd 29


Computation on Arrays: Broadcasting
(Continued)

Example 1: Single Dimension array

Output

© The Knowledge Academy Ltd 30


Computation on Arrays: Broadcasting
(Continued)

Example 2: Two Dimensional Array

Output
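The slide's code is not shown; a minimal sketch of broadcasting a 1-D array across a 2-D array (values assumed) is:

import numpy as np

a = np.array([[0, 10, 20],
              [30, 40, 50]])    # shape (2, 3)
b = np.array([1, 2, 3])         # shape (3,)

# b is broadcast across each row of a
print(a + b)
# [[ 1 12 23]
#  [31 42 53]]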

© The Knowledge Academy Ltd 31


Computation on Arrays: Broadcasting
(Continued)

Plotting a two-dimensional function:

• Broadcasting is also often used for displaying images based on two-dimensional
functions, for example when we want to define a function z = f(x, y)

Output
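The particular function on the slide is not shown; a sketch of evaluating an assumed z = f(x, y) over a grid via broadcasting is:

import numpy as np

# Build a grid via broadcasting: x is a row, y is a column vector
x = np.linspace(0, 5, 50)                   # shape (50,)
y = np.linspace(0, 5, 40)[:, np.newaxis]    # shape (40, 1)

# z = f(x, y) evaluated over the whole 40 x 50 grid by broadcasting
z = np.sin(x) ** 2 + np.cos(10 + y * x) * np.cos(x)
print(z.shape)    # (40, 50)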

© The Knowledge Academy Ltd 32


Comparison, Boolean Logic, and Masks
Python NumPy Comparison Operators
• The Python NumPy comparison functions and operators are used to compare the array
items and return Boolean True or False

• greater, greater_equal, less, less_equal, equal, and not_equal are the NumPy
comparison functions. The NumPy comparison operators are <, <=, >, >=, == and !=

• The numpy random randint function can be used to generate random two-dimensional
and three-dimensional integer arrays

© The Knowledge Academy Ltd 33


Comparison, Boolean Logic, and Masks
(Continued)

• The first statement generates a two-dimensional array with 5 rows and 8 columns; the
values lie between 10 and 50

arr1 = np.random.randint(10, 50, size = (5, 8))

• The second statement generates a random three-dimensional array of size 2×3×6; the
generated random values lie between 1 and 20

arr2 = np.random.randint(1, 20, size = (2, 3, 6))

© The Knowledge Academy Ltd 34


Comparison, Boolean Logic, and Masks
Python Numpy Array greater
• First, we create an array of random elements. After that, we examine whether the
array elements are greater than 0, 1, and 2. True is returned where the condition holds;
otherwise, False is returned

Output
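A minimal sketch of the greater comparison on a random array (the threshold is illustrative):

import numpy as np

arr = np.random.randint(0, 5, size=6)
print(arr)

# Element-wise comparison: True where the element is greater than 2
print(np.greater(arr, 2))
print(arr > 2)    # the operator form is equivalent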

© The Knowledge Academy Ltd 35


Comparison, Boolean Logic, and Masks
(Continued)

• Here, the Python NumPy greater function is used on 2-dimensional and 3-dimensional
arrays

• The first greater function checks whether the values in the 2-D array are greater than
30 or not

• If true, Boolean True is returned; otherwise, False is returned. Next, we check whether
the array elements in a 3-D array are greater than 10 or not
© The Knowledge Academy Ltd 36


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 37


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array greater_equal

• The Python NumPy greater_equal function checks whether the given array elements
are greater than or equal to a specified number. It returns True if so; otherwise, False

• The first NumPy statement checks whether the items in the array are greater than or
equal to 2. The second statement checks whether the items in a random 2-dimensional
array are greater than or equal to 25

• The third statement checks whether the items of a randomly generated 3-D array are
greater than or equal to 7

© The Knowledge Academy Ltd 38


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 39


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array less

• The Python NumPy less function checks whether the elements in a given array are less
than a specified number

• If so, Boolean True is returned; otherwise, False. The syntax of the Python NumPy less
function is:

numpy.less(array_name, integer_value)

© The Knowledge Academy Ltd 40


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 41


Comparison, Boolean Logic, and Masks
(Continued)

Python NumPy Array less_equal

• The Python NumPy less_equal function checks whether each element in a given array is
less than or equal to a specified number. If so, Boolean True is returned; otherwise, False

• The following is the syntax of the Python NumPy less_equal function:

numpy.less_equal(array_name, integer_value)

© The Knowledge Academy Ltd 42


Comparison, Boolean Logic, and Masks
(Continued)

Output

© The Knowledge Academy Ltd 43


Comparison, Boolean Logic, and Masks
Boolean numpy arrays
• A boolean array is a NumPy array with boolean (True/False) values. Such an array can
be obtained by applying a comparison operator to another NumPy array:

© The Knowledge Academy Ltd 44


Comparison, Boolean Logic, and Masks
Logical operations on Boolean arrays
• By using logical operators, Boolean arrays can be combined:

operator    meaning
~           negation (logical "not")
&           logical "and"
|           logical "or"
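A minimal sketch of combining Boolean arrays with these operators (array values assumed):

import numpy as np

a = np.array([1, 5, 3, 9, 7])

mask1 = a > 2             # [False  True  True  True  True]
mask2 = a < 8             # [ True  True  True False  True]

print(mask1 & mask2)      # logical "and": [False  True  True False  True]
print(mask1 | ~mask2)     # logical "or" combined with negation
print(a[mask1 & mask2])   # boolean masks can be used to filter: [5 3 7]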

© The Knowledge Academy Ltd 45


Comparison, Boolean Logic, and Masks
(Continued)

• Example 1:

Output

© The Knowledge Academy Ltd 46


Comparison, Boolean Logic, and Masks
(Continued)

• Example 2:

Output

© The Knowledge Academy Ltd 47


Comparison, Boolean Logic, and Masks
(Continued)

• Example 3:

Output

© The Knowledge Academy Ltd 48


Comparison, Boolean Logic, and Masks
Masks

• The numpy.ma.mask_rows() function masks the rows of a 2-dimensional array that
contain masked values. The numpy.ma.mask_rows() function is a shortcut to
mask_rowcols with axis equal to 0

Example

Output
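The slide's example is not shown; a short sketch of mask_rows (the masked value is an assumption) is:

import numpy as np
import numpy.ma as ma

a = np.zeros((3, 3), dtype=int)
a[1, 1] = 1
a = ma.masked_equal(a, 1)    # mask the value 1

# mask_rows masks every row of a 2-D array that contains a masked value
print(ma.mask_rows(a))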

© The Knowledge Academy Ltd 49


Fancy Indexing
• Fancy indexing is like the simple indexing, but we pass arrays of indices instead of single
scalars

• It permits us to quickly access as well as change complicated subsets of an array's values

Exploring Fancy Indexing

• Fancy indexing is conceptually simple: it means passing an array of indices in order to
access multiple array elements at once

• For instance, consider the below-written array:

Output
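A sketch of fancy indexing on an assumed random array:

import numpy as np

rng = np.random.default_rng(42)
x = rng.integers(0, 100, size=10)
print(x)

# Fancy indexing: pass a list or array of indices to access several elements at once
ind = [3, 7, 4]
print(x[ind])

# The shape of the result follows the shape of the index array
ind2 = np.array([[3, 7],
                 [4, 5]])
print(x[ind2])    # a 2 x 2 result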

© The Knowledge Academy Ltd 50


Fancy Indexing
(Continued)

• Let us suppose we want to access three different elements

• Alternatively, we can pass a single list or array of indices to get the same result:

© The Knowledge Academy Ltd 51


Fancy Indexing
(Continued)

• While utilising fancy indexing, the shape of the result reflects the shape of the index
arrays instead of the shape of the array being indexed:

Output

• Even fancy indexing works in multiple dimensions. See the example shown below:

Output

© The Knowledge Academy Ltd 52


Fancy Indexing
(Continued)

• As with standard indexing, the first index refers to the row and the second to the
column:

Output

• The broadcasting rules are followed by the pairing of indices in fancy indexing.
Therefore, for instance, we get a two-dimensional result if we combine a column vector
as well as a row vector within the indices:

Output

© The Knowledge Academy Ltd 53


Fancy Indexing
(Continued)

• It is important to remember with fancy indexing that the return value reflects the
broadcasted shape of the indices, rather than the shape of the array being indexed

Combined Indexing

• Fancy indexing can be combined with the other indexing schemes for more powerful
operations:

Output

© The Knowledge Academy Ltd 54


Fancy Indexing
(Continued)

• We can combine simple as well as fancy indices:

• We can combine fancy indexing with slicing as well:

© The Knowledge Academy Ltd 55


Fancy Indexing
(Continued)

• Even, fancy indexing can be combined with masking:

Output

• All of these indexing options combined lead to a very flexible group of operations for
accessing as well as modifying array values

© The Knowledge Academy Ltd 56


Fancy Indexing
Modifying Values with Fancy Indexing
• Fancy indexing can be used to access parts of an array, and it can be used to modify
parts of an array as well. For instance, assume we have an array of indices and we want
to set the corresponding items in an array to some value:

• For this, any assignment-type operator can be used. For instance:
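A minimal sketch of modifying values with fancy indexing (values and indices are assumptions):

import numpy as np

x = np.arange(10)
i = np.array([2, 1, 8, 4])

x[i] = 99     # set the selected items to a value
print(x)      # [ 0 99 99  3 99  5  6  7 99  9]

x[i] -= 10    # any assignment-type operator can be used
print(x)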

© The Knowledge Academy Ltd 57


Fancy Indexing
(Continued)

• Notice that repeated indices with these operations can cause potentially unexpected
outcomes

• The outcome of this operation is to first assign A[0] = 2, followed by A[0] = 8. The result
is that A[0] contains the value 8

© The Knowledge Academy Ltd 58


Sorting Arrays
• The term sorting means placing elements in an ordered sequence

• An ordered sequence is any sequence whose elements follow an order, such as
ascending or descending, alphabetical or numeric

• The NumPy ndarray object has a function called sort(), which will sort an array

Example

Output
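A sketch of sorting numeric, string, and 2-D arrays (array contents are assumptions):

import numpy as np

arr = np.array([3, 2, 0, 1])
print(np.sort(arr))         # returns a sorted copy: [0 1 2 3]

fruits = np.array(['banana', 'cherry', 'apple'])
print(np.sort(fruits))      # string arrays can be sorted too

arr2d = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr2d))       # each row of a 2-D array is sorted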

© The Knowledge Academy Ltd 59


Sorting Arrays
(Continued)

• You can also sort string arrays, or arrays of any other data type:

Output

• The following is the example to Sort a Boolean array:

Output

© The Knowledge Academy Ltd 60


Sorting Arrays
Sorting a 2-D Array
• Using the sort() method on a 2-D array, each row of the array will be sorted:

Output

© The Knowledge Academy Ltd 61


NumPy’s Structured Array
• NumPy's structured array is similar to a struct in the C programming language. It is
used to group data of different sizes and types

• Structured arrays use data containers called fields. Every data field can contain data of
any size and type. Array elements can be accessed with the help of dot notation

Structured Array Properties

• All structs in the array have the same number of fields

• All structs have the same field names
© The Knowledge Academy Ltd 62


NumPy’s Structured Array
(Continued)

• For instance, consider a student structured array with different fields such as year,
name, and marks

• Every record in the student array has the same structure, so each record supplies a
value for every field

Output
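The slide's code is not reproduced; a sketch of such a structured array (field names and values are assumptions) is:

import numpy as np

# A structured array with named fields of different types
student = np.array([('Asha', 2021, 8.5), ('Ben', 2020, 7.1)],
                   dtype=[('name', 'U10'), ('year', 'i4'), ('marks', 'f4')])

print(student['name'])                    # access a field by name
print(student[0])                         # access a whole record
print(np.sort(student, order='marks'))    # sort by a field, as on the next slide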

© The Knowledge Academy Ltd 63


NumPy’s Structured Array
(Continued)

Example

• The structured array can be sorted using the numpy.sort() function by passing the
order parameter. This parameter takes the name of the field by which the array should
be sorted

Output

© The Knowledge Academy Ltd 64


Module 2: Python for Data Analysis -
Pandas

© The Knowledge Academy Ltd 65


Installing pandas
• Perform the following steps to install pandas:

Step 1: Choose Anaconda Prompt (Anaconda3) and Run as an administrator

© The Knowledge Academy Ltd 66


Installing pandas
Step 2: Execute the pip install pandas command. Pandas will be installed successfully in
Anaconda

© The Knowledge Academy Ltd 67


Pandas Objects
• At a fundamental level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
instead of simple integer indices

• Series, DataFrame, and Index are the three fundamental Pandas data structures

• Import Numpy and Pandas
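The import statements the slide refers to are simply:

import numpy as np
import pandas as pd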

© The Knowledge Academy Ltd 68


Pandas Objects
(Continued)

The Pandas Series Object

• A Pandas Series is a 1-D array of indexed data. It can be created from a list or an array,
as shown in the following sketch:
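A minimal sketch (the values are illustrative):

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(data.values)    # the values are a familiar NumPy array
print(data.index)     # the index is a pd.Index (a RangeIndex by default)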

© The Knowledge Academy Ltd 69


Pandas Objects
(Continued)

• As shown in the output, the Series wraps both a sequence of values and a sequence of
indices, which we can access with the values and index attributes. The values are simply
a familiar NumPy array:

• The index is an array-like object of type pd.Index

© The Knowledge Academy Ltd 70


Pandas Objects
(Continued)

• Like with a NumPy array, data can be accessed by the associated index using the
familiar Python square-bracket notation:

© The Knowledge Academy Ltd 71


Pandas Objects
(Continued)

• The Pandas Series is much more general and flexible than the 1-D NumPy array that it
emulates

Series as generalized NumPy array

• The Series object is basically interchangeable with a 1-D NumPy array

• The essential difference is the presence of the index: whereas the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values

© The Knowledge Academy Ltd 72


Pandas Objects
(Continued)

• This explicit index definition gives the Series object additional capabilities. The index
need not be an integer, but can consist of values of any desired type. For instance, we
can use strings as an index:

© The Knowledge Academy Ltd 73


Pandas Objects
(Continued)

• And the item access works as expected

• Even, non-sequential or non-contiguous indices can be used

© The Knowledge Academy Ltd 74


Pandas Objects
(Continued)

Series as specialized dictionary

• A dictionary is a structure which maps arbitrary keys to a set of arbitrary values, and a
Series is a structure which maps typed keys to a set of typed values

• This typing is significant: just as the type-specific compiled code behind a NumPy array
makes it more efficient than a Python list for certain operations, the type information of
a Pandas Series makes it much more efficient than a Python dictionary for certain
operations

© The Knowledge Academy Ltd 75


Pandas Objects
(Continued)

• The Series-as-dictionary analogy can be made even more explicit by constructing a
Series object directly from a Python dictionary:
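A minimal sketch of constructing a Series from a dictionary (state names and figures are illustrative):

import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

print(population)
print(population['Texas'])                     # dictionary-style item access
print(population['California':'New York'])     # array-style slicing also works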

© The Knowledge Academy Ltd 76


Pandas Objects
(Continued)

• A Series will be built whose index is drawn from the dictionary keys (sorted by default
in older versions of Pandas; insertion order is preserved in current versions). Typical
dictionary-style item access can be performed from here:

• Array-style operations such as slicing is also supported by the Series:

© The Knowledge Academy Ltd 77


Pandas Objects
(Continued)

Constructing Series objects

• For instance, data can be a NumPy array or list, in which case index defaults to an
integer sequence:

© The Knowledge Academy Ltd 78


Pandas Objects
(Continued)

• Data can be a scalar, which is repeated in order to fill the specified index:

• Data can be a dictionary, in which index defaults to the sorted dictionary keys

© The Knowledge Academy Ltd 79


Pandas Objects
(Continued)

• The index can be set explicitly in every case if a different result is preferred:

© The Knowledge Academy Ltd 80


Pandas Objects
(Continued)

The Pandas DataFrame Object

• In Pandas, the next primary structure is the DataFrame

• The DataFrame can be thought of either as a generalization of a NumPy array or as a
specialisation of a Python dictionary

© The Knowledge Academy Ltd 81


Pandas Objects
(Continued)

DataFrame as a generalized NumPy array

• Suppose a Series is an analogue of a 1-D array with flexible indices. In that case, a
DataFrame is an analogue of a 2-D array with both flexible column names and flexible
row indices

• For showing this, first, make a new Series listing the area of each of the five states:

© The Knowledge Academy Ltd 82


Pandas Objects
(Continued)

• To construct a single 2-D object containing this information, we can use a dictionary:
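A sketch of building a DataFrame from a dictionary of Series (state data is illustrative); it also shows the index and columns attributes used on the next slide:

import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297})

states = pd.DataFrame({'population': population, 'area': area})
print(states)
print(states.index)      # the row labels
print(states.columns)    # the column labels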

© The Knowledge Academy Ltd 83


Pandas Objects
(Continued)

• Similar to the Series object, the DataFrame has an index attribute which provides
access to the index labels:

• In addition, the DataFrame has a columns attribute, which is an Index object containing
the column labels:

© The Knowledge Academy Ltd 84


Pandas Objects
(Continued)

• Therefore, we can think of a DataFrame as a generalization of a 2-D NumPy array,
where both the rows and columns have a generalized index for accessing the data

DataFrame as specialized dictionary

• Likewise, we can consider a DataFrame as a specialization of a dictionary. Where a
dictionary maps a key to a value, a DataFrame maps a column name to a Series of
column data

© The Knowledge Academy Ltd 85


Pandas Objects
(Continued)

• For instance, the 'area' attribute returns the Series object holding the areas:

• data[0] will return the first row of a 2-D NumPy array, whereas data['col0'] will return
the first column of a DataFrame

© The Knowledge Academy Ltd 86


Pandas Objects
(Continued)

Constructing DataFrame objects

• A Pandas DataFrame can be constructed in a variety of ways. The following are several
examples:

o From a single Series object: A DataFrame is a collection of Series objects, and a
single-column DataFrame can be constructed from a single Series

© The Knowledge Academy Ltd 87


Pandas Objects
(Continued)

o From a list of dicts: Any list of dictionaries can be made into a DataFrame

o Even if a few keys are missing in the dictionary, they will be filled by Pandas with
NaN which means "not a number" values:

© The Knowledge Academy Ltd 88


Pandas Objects
(Continued)

o From a dictionary of Series objects: A DataFrame can be constructed from a


dictionary of Series objects as well:

© The Knowledge Academy Ltd 89


Data Indexing and Selection
Data Selection in Series

• As we saw in the previous slides, a Series object acts in many ways like a one-
dimensional NumPy array, as well as in many ways like a standard Python dictionary

• If we keep these two overlapping analogies in mind, it will help us to understand the
patterns of data indexing as well as selection in these arrays

© The Knowledge Academy Ltd 90


Data Indexing and Selection
(Continued)

Series as dictionary

• Like a dictionary, the Series object provides a mapping from a group of keys to a
collection of values:

© The Knowledge Academy Ltd 91


Data Indexing and Selection
(Continued)

• We can also use dictionary-like Python expressions as well as methods to examine the
keys or indices as well as values:

© The Knowledge Academy Ltd 92


Data Indexing and Selection
(Continued)

• Series objects can even be altered with a dictionary-like syntax. Just as you can extend a
dictionary by assigning to a new key, you can extend a Series by assigning to a new
index value:

© The Knowledge Academy Ltd 93


Data Indexing and Selection
(Continued)

• This easy mutability of the objects is a useful feature: under the hood, Pandas is making
decisions about memory layout as well as data copying that might need to take place;
the user generally does not need to worry about these issues

Series as one-dimensional array

• A Series builds on this dictionary-like interface as well as provides array-style item


selection through the same fundamental mechanisms as NumPy arrays which is, slices,
masking, as well as fancy indexing. Examples of these are as follows:

© The Knowledge Academy Ltd 94


Data Indexing and Selection
(Continued)

© The Knowledge Academy Ltd 95


Data Indexing and Selection
(Continued)

• Among these, slicing may be the source of the most confusion. Notice that when slicing
with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when
slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice

• These slicing and indexing conventions can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as data[1] will use
the explicit indices, while a slicing operation like data[1:3] will use the implicit
Python-style index

© The Knowledge Academy Ltd 96


Data Indexing and Selection
(Continued)

• Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes which explicitly expose certain indexing schemes

© The Knowledge Academy Ltd 97


Data Indexing and Selection
(Continued)

• These are not functional methods, but attributes which expose a particular slicing
interface to the data in the Series

• First, the loc attribute allows indexing and slicing that always references the explicit
index:

© The Knowledge Academy Ltd 98


Data Indexing and Selection
(Continued)

• The iloc attribute permits indexing as well as slicing which always references the
implicit Python-style index:
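A minimal sketch contrasting loc and iloc (the Series values are assumptions):

import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])       # explicit index: 'a'
print(data.loc[1:3])     # explicit slicing includes the endpoint
print(data.iloc[1])      # implicit (positional) index: 'b'
print(data.iloc[1:3])    # implicit slicing excludes the endpoint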

• A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent
to standard []-based indexing. The purpose of the ix indexer becomes more apparent in
the context of DataFrame objects; note that ix has been deprecated and removed in
recent versions of Pandas

© The Knowledge Academy Ltd 99


Data Indexing and Selection
(Continued)

DataFrame as a dictionary

• The first analogy we will consider is the DataFrame as a dictionary of related Series
objects. Let us return to our example of areas and populations of states:

Output

© The Knowledge Academy Ltd 100


Data Indexing and Selection
(Continued)

• The individual Series which make up the columns of the DataFrame can be retrieved
through dictionary-style indexing of the column name:

• Equivalently, we can use attribute-style access with column names which are strings:

© The Knowledge Academy Ltd 101


Data Indexing and Selection
(Continued)

• This attribute-style column access actually accesses the exact same object as the
dictionary-style access:

• Though this is a useful shorthand, remember that it does not work for all cases! Such
as, if the column names are not strings, or if the column names conflict with methods
of the DataFrame, this attribute-style access is not possible

© The Knowledge Academy Ltd 102


Data Indexing and Selection
(Continued)

• For instance, the DataFrame has a pop() method, so data.pop will point to this rather
than the "pop" column:

• In specific, you should avoid the temptation to try column assignment through attribute
(i.e., use data['pop'] = z rather than data.pop = z)

© The Knowledge Academy Ltd 103


Data Indexing and Selection
(Continued)

• Like with the Series objects discussed earlier, this dictionary-style syntax can also be
used to alter the object, in this case adding a new column:
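A minimal sketch of adding a new column with dictionary-style syntax (column names and values assumed):

import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# Dictionary-style syntax adds a new column to the DataFrame
data['density'] = data['pop'] / data['area']
print(data)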

© The Knowledge Academy Ltd 104


Data Indexing and Selection
(Continued)

• This shows a preview of the straightforward syntax of element-by-element arithmetic
between Series objects

Additional indexing conventions

• There are a couple of extra indexing conventions which might seem at odds with the
preceding discussion, but which can nonetheless be very useful in practice. First, while
indexing refers to columns, slicing refers to rows:

© The Knowledge Academy Ltd 105


Data Indexing and Selection
(Continued)

• Similarly, direct masking operations are also interpreted row-wise rather than column-
wise:

© The Knowledge Academy Ltd 106


Data Indexing and Selection
(Continued)

• These two conventions are syntactically alike to those on a NumPy array, as well as
while these may not quite fit the mold of the Pandas conventions, they are
nevertheless quite useful in practice

© The Knowledge Academy Ltd 107


Operating on Data in Pandas
• One of the significant pieces of NumPy is the capability to perform fast element-wise
operations, both with fundamental arithmetic (like addition, subtraction, multiplication,
etc.) as well as with more complex operations (trigonometric functions, exponential
and logarithmic functions, etc.)

• Pandas inherits much of this functionality from NumPy, as well as the ufuncs (Universal
functions) which we introduced in Computation on NumPy Arrays: Universal Functions
are key to this

• Pandas includes a couple of valuable twists, however: for unary operations such as
negation and trigonometric functions, these ufuncs will preserve index and column
labels in the output, and for binary operations such as addition and multiplication,
Pandas will automatically align indices when passing the objects to the ufunc

© The Knowledge Academy Ltd 108


Operating on Data in Pandas
(Continued)

• This means that keeping the context of data and combining data from different sources
(both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof
with Pandas

• We will additionally see that there are well-defined operations between 1-D Series
structures and 2-D DataFrame structures

© The Knowledge Academy Ltd 109


Operating on Data in Pandas
(Continued)

Ufuncs: Index Preservation

• Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series as well as DataFrame objects

• Start by defining a simple Series and DataFrame on which to demonstrate this:
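A minimal sketch of such objects and of applying a ufunc to them (the random seed and column names are assumptions):

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])

# Applying a NumPy ufunc returns another Pandas object with indices preserved
print(np.exp(ser))
print(np.sin(df * np.pi / 4))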

© The Knowledge Academy Ltd 110


Operating on Data in Pandas
(Continued)

© The Knowledge Academy Ltd 111


Operating on Data in Pandas
(Continued)

• If we apply a NumPy ufunc on either of these objects, the result will be another Pandas
object with the indices preserved:

• Or, for a little more complex calculation:

© The Knowledge Academy Ltd 112


Operating on Data in Pandas
(Continued)

UFuncs: Index Alignment

• For binary operations on two Series or DataFrame objects, Pandas will align indices in
the process of performing the operation

Index alignment in Series

• For example, suppose we are combining two different data sources, and have only the
top three US states by area and the top three US states by population:

© The Knowledge Academy Ltd 113


Operating on Data in Pandas
(Continued)

• Let's see what happens when we divide these to compute the population density:

• The resulting array holds the union of indices of the two input arrays, which could be
determined by using standard Python set arithmetic on these indices:

© The Knowledge Academy Ltd 114


Operating on Data in Pandas
(Continued)

• Any item for which one or the other does not have an entry is marked with NaN, or
"Not a Number," which is how Pandas marks missing data

• This index matching is implemented this way for any of Python's built-in arithmetic
expressions; any absent values are filled in with NaN by default:

© The Knowledge Academy Ltd 115


Operating on Data in Pandas
(Continued)

• If using NaN values is not the desired behaviour, the fill value can be altered using
suitable object methods in place of the operators

• For instance, calling A.add(B) is equivalent to calling A + B, but allows optional explicit
specification of the fill value for any elements in A or B that might be missing:
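A minimal sketch of index alignment and of add() with fill_value (the Series values are assumptions):

import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

print(A + B)                    # unmatched indices produce NaN
print(A.add(B, fill_value=0))   # missing entries are treated as 0 instead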

© The Knowledge Academy Ltd 116


Operating on Data in Pandas
(Continued)

• A similar kind of alignment takes place for both columns and indices when performing
operations on DataFrames:

© The Knowledge Academy Ltd 117


Operating on Data in Pandas
(Continued)

• Notice that indices are aligned properly irrespective of their order in the two objects, as
well as indices in the result are sorted

• As was the case with Series, we can use the associated object's arithmetic method as
well as pass any wanted fill_value to be used in place of missing entries

• Here we will fill with the mean of all values in A (computed by first stacking the rows of
A):

© The Knowledge Academy Ltd 118


Operating on Data in Pandas
(Continued)

• The following table lists Python operators and their equivalent Pandas object methods:

Python Operator    Pandas Method(s)

+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()

© The Knowledge Academy Ltd 119


Operating on Data in Pandas
(Continued)

Python Operator Pandas Method(s)

% mod()
** pow()

© The Knowledge Academy Ltd 120


Handling Missing Data
• Missing data can occur when no information is provided for one or more items or for a
whole unit

• Missing data is a very big problem in real-life scenarios. Missing data is also referred to
as NA (Not Available) values in Pandas

• Many datasets simply arrive with missing data in the DataFrame, either because the
data exists but was not collected or because it never existed

• For instance, some users being surveyed may choose not to share their income, and
some may choose not to share their address; in this way many datasets end up with
missing values

© The Knowledge Academy Ltd 121


Handling Missing Data
(Continued)

• In Pandas, missing data is represented by two values:

o None: None is a Python singleton object which is often used for missing data in
Python code

o NaN: NaN (an acronym for Not a Number) is a special floating-point value
recognised by all systems that use the standard IEEE floating-point representation

• Pandas treats None and NaN as essentially interchangeable for indicating missing or
null values

© The Knowledge Academy Ltd 122


Handling Missing Data
(Continued)

• To facilitate this convention, there are various useful functions for detecting, removing,
as well as replacing null values in Pandas DataFrame

• isnull()

• notnull()

• dropna()

• fillna()

• replace()

© The Knowledge Academy Ltd 123


Handling Missing Data
(Continued)

• interpolate()

Checking for missing values using isnull() and notnull()

• To check for missing values in a Pandas DataFrame, we use the functions isnull() and
notnull()

• Both functions help in checking whether a value is NaN or not. These functions can also
be used on a Pandas Series in order to find null values in a series

© The Knowledge Academy Ltd 124


Handling Missing Data
(Continued)

Checking for missing values using isnull()

• To check for null values in a Pandas DataFrame, we use the isnull() function. This
function returns a DataFrame of Boolean values which are True for NaN values

Example 1

© The Knowledge Academy Ltd 125


Handling Missing Data
(Continued)

Example 2

© The Knowledge Academy Ltd 126


Handling Missing Data
(Continued)

• As shown in the output image, only the rows having Gender = NULL are displayed

© The Knowledge Academy Ltd 127


Handling Missing Data
(Continued)

Checking for missing values using notnull()

• To check for null values in a Pandas DataFrame, we use the notnull() function. This
function returns a DataFrame of Boolean values which are False for NaN values

Example 3

Output

© The Knowledge Academy Ltd 128


Handling Missing Data
(Continued)

Example 4

• As shown in the output image, only the rows having Gender = NOT NULL are displayed

© The Knowledge Academy Ltd 129


Handling Missing Data
(Continued)

Filling missing values using fillna(), replace() and interpolate()

• To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some value of their own

• All of these functions help in filling null values in the datasets of a DataFrame

• The interpolate() function is also used to fill NA values in the DataFrame, but it uses
various interpolation methods to compute the missing values rather than hard-coding
the value

© The Knowledge Academy Ltd 130


Handling Missing Data
(Continued)

Example 1: Filling null values with a single value

Output
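A sketch covering Examples 1–3 (filling with a single value, with the previous value, and with the next value); the DataFrame contents are assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56]})

print(df.fillna(0))    # fill every null value with a single value
print(df.ffill())      # fill nulls with the previous value (forward fill)
print(df.bfill())      # fill nulls with the next value (backward fill)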

© The Knowledge Academy Ltd 131


Handling Missing Data
(Continued)

Example 2: Filling null values with the previous ones

Output

© The Knowledge Academy Ltd 132


Handling Missing Data
(Continued)

Example 3: Filling null value with the next ones

Output

© The Knowledge Academy Ltd 133


Handling Missing Data
(Continued)

Example 4: Filling null values in CSV File

Output

© The Knowledge Academy Ltd 134


Handling Missing Data
(Continued)

• Now we are going to fill all the null values in Gender column with “No Gender”

Output

© The Knowledge Academy Ltd 135


Handling Missing Data
(Continued)

Example 5: Filling a null values using replace() method

Output

© The Knowledge Academy Ltd 136


Handling Missing Data
(Continued)

• Now we are going to replace all the NaN values in the data frame with the value -99

Output

© The Knowledge Academy Ltd 137


Handling Missing Data
(Continued)

Example 6: Using interpolate() function to fill the missing values using linear method.

Output
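A sketch covering Example 5 (replace()) and Example 6 (interpolate()); the data is an assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 5, np.nan, 7]})

print(df.replace(to_replace=np.nan, value=-99))    # replace every NaN with -99
print(df.interpolate(method='linear'))             # fill NaN by linear interpolation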

© The Knowledge Academy Ltd 138


Handling Missing Data
(Continued)

• Interpolate the missing values using the linear method. Note that the linear method
ignores the index and treats the values as equally spaced

• As we can see in the output, the values in the first row could not be filled, as the
direction of filling is forward and there is no previous value which could have been used
in the interpolation

© The Knowledge Academy Ltd 139


Hierarchical Indexing
• Python is a great language for data analysis, mainly because of the excellent ecosystem
of data-centric Python packages

• Moreover, Pandas makes importing and analysing data simple

• The Pandas MultiIndex.to_hierarchical() function returns a reshaped MultiIndex to
conform to the shapes given by n_repeat and n_shuffle (note that this function is
deprecated and has been removed in recent versions of Pandas)

• It is useful for replicating and rearranging a MultiIndex for combination with another
Index with n_repeat items

© The Knowledge Academy Ltd 140


Hierarchical Indexing
(Continued)

Example 1

• In order to repeat the labels in the MultiIndex, use MultiIndex.to_hierarchical()


function

Output

© The Knowledge Academy Ltd 141


Hierarchical Indexing
(Continued)

• Now, repeat the labels of the MultiIndex two times

Output

• As you can see in the following output figure, the labels in the returned MultiIndex are
repeated 2 times

© The Knowledge Academy Ltd 142


Hierarchical Indexing
(Continued)

Example 2: Use MultiIndex.to_hierarchical() function to repeat and reshuffle the labels in


the MultiIndex

Output

© The Knowledge Academy Ltd 143


Hierarchical Indexing
(Continued)

• Now let's repeat and reshuffle the labels of the MultiIndex 2 times

Output

• As you can see in the output figure, the labels are repeated as well as reshuffled twice
in the returned MultiIndex

© The Knowledge Academy Ltd 144


Concat and Append
• The concat function performs concatenation operations along an axis. Let us create
different objects and do concatenation.

Output
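The slides' DataFrames are not reproduced; a sketch covering concat and the variations used on this and the following slides (keys, ignore_index, axis=1) with assumed data is:

import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks': [98, 87]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks': [89, 91]}, index=[1, 2])

print(pd.concat([one, two]))                        # concatenation along the index
print(pd.concat([one, two], keys=['x', 'y']))       # associate keys with each piece
print(pd.concat([one, two], ignore_index=True))     # resultant object gets its own index
print(pd.concat([one, two], axis=1))                # new columns are added along axis=1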

© The Knowledge Academy Ltd 145


Concat and Append
(Continued)

• Assume we want to associate particular keys with each of the pieces of the chopped up
DataFrame. This can be done by using the keys argument:

Output

© The Knowledge Academy Ltd 146


Concat and Append
(Continued)

• The index of the resultant is duplicated; every index is repeated

• Set ignore_index to True if the resultant object has to follow its own indexing

Output

© The Knowledge Academy Ltd 147


Concat and Append
(Continued)

• Note, the index changes entirely, and the Keys are overridden as well

• The new columns will be added if two objects need to be added along axis=1

Output

© The Knowledge Academy Ltd 148


Concat and Append
(Continued)

Concatenating Using append

• A useful shortcut to concat is the append instance method on DataFrame and Series.
These methods predate concat and concatenate along axis=0, namely the index (note
that append() has been deprecated in recent versions of Pandas in favour of concat()):

Output

© The Knowledge Academy Ltd 149


Concat and Append
(Continued)

• Multiple objects can also be taken by the append function:

Output

© The Knowledge Academy Ltd 150


Merge and Join
• A Pandas DataFrame is a 2-D, size-mutable, potentially heterogeneous tabular data
structure with labelled columns and rows

• A DataFrame is a 2-D data structure, which means data is aligned in a tabular form in
columns and rows

• There are various methods we can use to merge, join, and concatenate DataFrames

• Methods and functions such as df.join(), df.merge(), and pd.concat() help in merging,
joining, and concatenating different DataFrames

• We use the concat() function to concatenate DataFrames. This function helps in
concatenating DataFrames, and we can concatenate a DataFrame in various ways

© The Knowledge Academy Ltd 151


Merge and Join
(Continued)

• The following are some ways:

1. Concatenating DataFrame by using .concat()
2. Concatenating DataFrame by ignoring indexes
3. Concatenating DataFrame by using .append()
4. Concatenating DataFrame by setting logic on axes
5. Concatenating DataFrame with mixed ndims
6. Concatenating DataFrame with group keys

© The Knowledge Academy Ltd 152


Merge and Join
(Continued)

Concatenating DataFrame using .concat():

• We use the .concat() function to concatenate DataFrames; it concatenates the given
DataFrames and returns a new DataFrame

• Before applying .concat() function:

Output

© The Knowledge Academy Ltd 153


Merge and Join
(Continued)

• Output after applying .concat() function

Output

© The Knowledge Academy Ltd 154


Merge and Join
(Continued)

Concatenating DataFrame by using .append()

• The .append() function is used to concatenate DataFrames. It concatenates along
axis=0, namely the index

• This function existed before .concat(). The following output is obtained before applying
the .append() function:

Output

© The Knowledge Academy Ltd 155


Merge and Join
(Continued)
• The following output is obtained after applying the .append() function

Output

© The Knowledge Academy Ltd 156


Merge and Join
(Continued)

Concatenating DataFrame by ignoring indexes:

• When concatenating DataFrames by ignoring indexes, we ignore indexes that do not
carry meaningful information

• You may wish to append the DataFrames and ignore the fact that they may have
overlapping indexes. We use the ignore_index argument to do that

© The Knowledge Academy Ltd 157


Merge and Join
(Continued)

• Output before applying ignoring indexes methodology

Output

© The Knowledge Academy Ltd 158


Merge and Join
(Continued)

• Output after applying ignoring indexes methodology

Output

© The Knowledge Academy Ltd 159


Merge and Join
(Continued)

Concatenating DataFrame with group keys:

• To concatenate DataFrames with group keys, we override the column names using the
keys argument

• The keys argument overrides the column names when creating a new DataFrame based
on existing Series

© The Knowledge Academy Ltd 160


Merge and Join
(Continued)

• Output before applying group key methodology

Output

© The Knowledge Academy Ltd 161


Merge and Join
(Continued)

• Output after using keys as an argument

Output

© The Knowledge Academy Ltd 162


Aggregations and Grouping
Aggregations
• Various methods are available for performing aggregations on data once the
expanding, rolling, and ewm objects are created

Applying Aggregations on DataFrame

• Create a DataFrame and apply aggregations

Output
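The slide's code is not shown; a sketch of creating a rolling object and the aggregations used on this and the following slides (random data and assumed column names) is:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2020', periods=10),
                  columns=['A', 'B', 'C', 'D'])

r = df.rolling(window=3, min_periods=1)

print(r.aggregate('sum'))                   # aggregation on the whole DataFrame
print(r['A'].aggregate('sum'))              # aggregation on a single column
print(r[['A', 'B']].aggregate('sum'))       # aggregation on multiple columns
print(r['A'].aggregate(['sum', 'mean']))    # multiple functions on a single column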

© The Knowledge Academy Ltd 163


Aggregations and Grouping
(Continued)

• We can aggregate by selecting a column through the standard get item method, or
passing a function to the whole DataFrame

Apply Aggregation on a Whole Dataframe

Output

© The Knowledge Academy Ltd 164


Aggregations and Grouping
(Continued)

Apply Aggregation on a Single Column of a Dataframe

Output

© The Knowledge Academy Ltd 165


Aggregations and Grouping
(Continued)

Apply Aggregation on Multiple Columns of a DataFrame

Output

© The Knowledge Academy Ltd 166


Aggregations and Grouping
(Continued)

Apply Multiple Functions on a Single Column of a DataFrame

Output

© The Knowledge Academy Ltd 167


Aggregations and Grouping
(Continued)

Apply Multiple Functions on Multiple Columns of a DataFrame

Output

© The Knowledge Academy Ltd 168


Aggregations and Grouping
Groupby
• Any groupby operation on the original object involves one of the following operations:

o Applying a function

o Splitting the Object

o Combining the results

© The Knowledge Academy Ltd 169


Aggregations and Grouping
(Continued)

• We split the data into sets in many situations, and we apply some functionality on each
subset. We can perform the following operations in the apply functionality

o Aggregation: calculating a summary statistic

o Transformation: perform some group-specific operation

o Filtration: discarding the data with some condition

© The Knowledge Academy Ltd 170


Aggregations and Grouping
(Continued)

• Let us now create a DataFrame object as well as perform all the operations on it:

Output

© The Knowledge Academy Ltd 171


Aggregations and Grouping
(Continued)

Split Data into Groups

• Pandas object can be split into any of their objects. Various ways are there in order to
split an object such as:

o obj.groupby('key')

o obj.groupby(['key1','key2'])

o obj.groupby(key,axis=1)

© The Knowledge Academy Ltd 172


Aggregations and Grouping
(Continued)

• Now see how the grouping objects can be applied to the DataFrame object

Output
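The slides' DataFrame is not reproduced; a sketch with an assumed dataset covering the grouping operations shown on this and the following slides is:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
            'Rank': [1, 2, 2, 3, 3],
            'Year': [2014, 2015, 2014, 2015, 2014],
            'Points': [876, 789, 863, 673, 741]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team').groups)             # view the groups
print(df.groupby(['Team', 'Year']).groups)   # group by multiple columns

for name, group in df.groupby('Year'):       # iterate through the groups
    print(name)
    print(group)

print(df.groupby('Year').get_group(2014))    # select a single group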

© The Knowledge Academy Ltd 173


Aggregations and Grouping
(Continued)

View Groups

© The Knowledge Academy Ltd 174


Aggregations and Grouping
(Continued)

Group by with multiple columns

© The Knowledge Academy Ltd 175


Aggregations and Grouping
(Continued)

Iterating through Groups

• With the groupby object in hand, we can iterate through the object; each iteration
yields a group name and the corresponding group

Output

© The Knowledge Academy Ltd 176


Aggregations and Grouping
(Continued)

• By default, the groupby object has the same label name as the group name

Select a Group

• Using the get_group() method, we can choose a single group

Output

© The Knowledge Academy Ltd 177


Pivot Tables
• pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All') creates a
spreadsheet-style pivot table as a DataFrame

• Levels in the pivot table will be stored in MultiIndex objects on the index and columns
of the resulting DataFrame

Example

Output
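A minimal sketch of pivot_table (column names and values are assumptions):

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'C': [1, 3, 2, 4]})

# Spreadsheet-style pivot table: rows from A, columns from B, mean of C
table = pd.pivot_table(df, values='C', index='A', columns='B', aggfunc='mean')
print(table)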

© The Knowledge Academy Ltd 178


Pivot Tables
(Continued)

Output

© The Knowledge Academy Ltd 179


Pivot Tables
(Continued)

Output

© The Knowledge Academy Ltd 180


Vectorised String Operations
• One of Python's strengths is its relative ease in handling and manipulating string data

• Pandas builds on this and provides a comprehensive collection of vectorised string
operations, which become an important part of the kind of munging needed when
working with (read: cleaning up) real-world data

Introducing Pandas String Operations

• As we know that tools like numpy as well as pandas generalize arithmetic operations so
that we can easily as well as quickly perform the same operation on numerous array
elements

© The Knowledge Academy Ltd 181


Vectorised String Operations
(Continued)

Example

• This vectorisation of operations simplifies the syntax of operating on arrays of data: we
no longer have to worry about the size or shape of the array, but just about what
operation we want done

© The Knowledge Academy Ltd 182


Vectorised String Operations
(Continued)

• For arrays of strings, NumPy does not provide such simple access, as well as thus you
are stuck using a more verbose loop syntax:

• This is perhaps sufficient to work with some data, but it will break if there are any
missing values

© The Knowledge Academy Ltd 183


Vectorised String Operations
(Continued)

• Pandas includes features to address both this need for vectorised string operations as
well as for properly handling missing data through the str attribute of Pandas Series as
well as Index objects containing strings

• So, for instance, suppose we create a Pandas Series with this data:

© The Knowledge Academy Ltd 184


Vectorised String Operations
(Continued)

• Now we can call a single method that will capitalise all the entries, while skipping over
any missing values, as shown in the sketch below
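A minimal sketch (the example names are assumptions):

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The str accessor applies the string method element-wise and skips missing values
print(names.str.capitalize())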

• Using tab completion on this str attribute will list all the vectorised string methods
available to Pandas

© The Knowledge Academy Ltd 185


Vectorised String Operations
(Continued)

Tables of Pandas String Methods

• If you have a good understanding of string manipulation in Python, most of the Pandas
string syntax is intuitive enough that it is probably sufficient to just list a table of the
available methods

• The examples in this section use the following series of names:

© The Knowledge Academy Ltd 186


Vectorised String Operations
(Continued)

Methods similar to Python string methods

• Nearly all Python's built-in string methods are mirrored by a Pandas vectorised string
method. Here is a list of Pandas str methods which mirror Python string methods:

len()    ljust()    rjust()    center()    zfill()    strip()    translate()

© The Knowledge Academy Ltd 187


Vectorised String Operations
(Continued)

startswith()    endswith()    rfind()    isalpha()    isdigit()    lower()    upper()

© The Knowledge Academy Ltd 188


Vectorised String Operations
(Continued)

• Notice that these methods (shown on the previous slides) have various return values.
Some, like lower(), return a series of strings:

• But some others return numbers:

© The Knowledge Academy Ltd 189


Vectorised String Operations
(Continued)

• Or Boolean values:

• Still others return lists or other compound values for each element:

© The Knowledge Academy Ltd 190


Working with Time Series
• Although time series functionality is also present in scikit-learn, Pandas provides a
more complete set of features for it

• With this part of Pandas we can attach a date and time to each record and fetch the
DataFrame records by them

• Using the Pandas time series functionality, we can find the data within a specific range
of dates and times

Example 1

Output
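A minimal sketch of the date_range call the text describes (the exact code on the slide is not reproduced):

import pandas as pd

# Timestamps at one-minute frequency between 1/1/2019 and 8/1/2019
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='min')
print(len(range_date))    # 10081
print(range_date[:5])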

© The Knowledge Academy Ltd 191


Working with Time Series
(Continued)

• In this code, we have created timestamps at one-minute frequency for the date range
1/1/2019 – 8/1/2019. The frequency can be varied from hours to minutes or seconds

• This function helps you track records of data stored per minute. As we can see in the
output, the length of the datetime stamp is 10081

Example 2

Output

© The Knowledge Academy Ltd 192


Working with Time Series
(Continued)

• We are checking the type of our object named range_date

Example 3

Output

© The Knowledge Academy Ltd 193


Working with Time Series
(Continued)

• We first created a time series and then converted this data into a DataFrame, using the
random function to generate random data and map it over the DataFrame. Then we use
the print function to check the result

• To do time series manipulation, we need a datetime index so that the DataFrame is
indexed on the timestamp

© The Knowledge Academy Ltd 194


Working with Time Series
(Continued)

Example 4

© The Knowledge Academy Ltd 195


Working with Time Series
(Continued)

• This code converts the elements of data_rng to strings. Because there is a lot of data,
we slice the list string_data and print only the first ten values

• We obtained all the values in the series range_date by using a for-each loop over the
list. We always have to specify the start and end date when using date_range

Example 5

Output

© The Knowledge Academy Ltd 196


eval() and query()
query()
• For data analysis, python is an excellent language, mainly because of the incredible
ecosystem of data-centric Python packages

• Pandas makes importing and analyzing data much easier

• Analysing data requires many filtering operations. Pandas provides various methods for
filtering a DataFrame; Dataframe.query() is one of them

© The Knowledge Academy Ltd 197


eval() and query()
(Continued)

Example 1: Single condition filtering

The data is filtered based on a single condition in this example. The spaces in column
names have been replaced with ‘_’ before applying the query() method

Output
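The slide's dataset is not reproduced; a sketch of single-condition filtering with query() (column names are assumptions) is:

import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Avg_Score': [80, 95, 70, 88]})

# Single-condition filtering; spaces in column names were replaced with '_'
print(df.query('Avg_Score > 85'))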

© The Knowledge Academy Ltd 198


eval() and query()
eval()
• The Pandas dataframe.eval() function is used to evaluate an expression in the context
of the calling DataFrame instance

• The expression is evaluated over the columns of the dataframe

Example 1

• In order to evaluate the sum of all column elements in the dataframe and insert the
resulting column in the dataframe use eval() function

Output

© The Knowledge Academy Ltd 199


eval() and query()
(Continued)

• Now, evaluate the sum over all the columns and add the resultant column to the
dataframe:

Output
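A minimal sketch of eval() inserting the sum of all columns as a new column (column names and values are assumptions):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Evaluate an expression over the columns and insert the result as a new column
df.eval('D = A + B + C', inplace=True)
print(df)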

© The Knowledge Academy Ltd 200


eval() and query()
(Continued)

Example 2: Use the eval() function to evaluate the sum of any two column elements in the
dataframe and insert the resulting column into the dataframe. The dataframe has a NaN value

Output

© The Knowledge Academy Ltd 201


eval() and query()
(Continued)

• Now, evaluate the sum of column “B” with “C”

Output

• Note that the resulting column 'D' has a NaN value in the last row, as the corresponding
cell used in the evaluation was a NaN cell

© The Knowledge Academy Ltd 202


Module 3: Python for Data Visualization –
Matplotlib

© The Knowledge Academy Ltd 203


Overview of Matplotlibs
Introduction to Matplotlib
• Matplotlib is an amazing visualisation library in Python for 2D plots of arrays

• Matplotlib is a multi-platform data visualisation library built on NumPy arrays and designed to work with the broader SciPy stack

• One of the greatest advantages of visualisation is that it permits us visual access to large
amounts of data in easily digestible visuals

• Matplotlib consists of various plots such as line, bar, scatter, histogram etc.

© The Knowledge Academy Ltd 204


Overview of Matplotlib
Installation
• Windows, Linux as well as MacOS distributions have matplotlib and most of its
dependencies as wheel packages

• Run the command to install Matplotlib package:
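The usual command (assuming pip is available on your PATH) is:

pip install matplotlib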

© The Knowledge Academy Ltd 205


Overview of Matplotlib
Importing Matplotlib
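The conventional import, used throughout the examples that follow, is:

import matplotlib.pyplot as plt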

© The Knowledge Academy Ltd 206


Overview of Matplotlib
Basic Plots in Matplotlib

• Matplotlib comes with an extensive diversity of plots

• Plots help to understand trends, patterns, as well as to make correlations

• They are essentially instruments for reasoning about quantitative information

© The Knowledge Academy Ltd 207


Overview of Matplotlib
(Continued)

Line Plot

© The Knowledge Academy Ltd 208


Overview of Matplotlib
(Continued)

Bar Plot

© The Knowledge Academy Ltd 209


Overview of Matplotlib
(Continued)

Histogram

© The Knowledge Academy Ltd 210


Overview of Matplotlib
(Continued)

Scatter Plot

© The Knowledge Academy Ltd 211


Object-Oriented Interface
• In the object-oriented approach, we create figure objects and then call methods or attributes on those objects. This interface is better for dealing with a canvas that has several plots on it

• To commence with, we create a figure instance that provides an empty canvas

fig = plt.figure()

• Now, add axes to the created figure. The add_axes() method takes a list of 4 elements corresponding to the left, bottom, width, and height of the axes. Every number should be between 0 and 1

ax=fig.add_axes([0,0,1,1])

© The Knowledge Academy Ltd 212


Two Interfaces
(Continued)

• Set title and labels for x and y axis

ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')

• Call the plot() method of the axes object

ax.plot(x,y)

© The Knowledge Academy Ltd 213


Two Interfaces
(Continued)

Example:
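Putting the pieces above together, a minimal sketch of the example might look like this (the sine-wave data x and y are generated here for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig = plt.figure()                 # empty canvas
ax = fig.add_axes([0, 0, 1, 1])    # [left, bottom, width, height]
ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')
ax.plot(x, y)
plt.show()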

© The Knowledge Academy Ltd 214


Simple Line Plots and Scatter Plots
Simple Line plot

Example 1

© The Knowledge Academy Ltd 215


Simple Line Plots and Scatter Plots
(Continued)

Example 2: Straight Line

© The Knowledge Academy Ltd 216


Simple Line Plots and Scatter Plots
(Continued)

Example 3: Curved line

© The Knowledge Academy Ltd 217


Simple Line Plots and Scatter Plots
(Continued)

Example 4: Multiple lines

© The Knowledge Academy Ltd 218


Simple Line Plots and Scatter Plots
(Continued)

Example 5: Dotted line

© The Knowledge Academy Ltd 219


Simple Line Plots and Scatter Plots
Scatter Plots
• Example 1

© The Knowledge Academy Ltd 220


Simple Line Plots and Scatter Plots
(Continued)

• Example 2

© The Knowledge Academy Ltd 221


Simple Line Plots and Scatter Plots
(Continued)

• Example 3

© The Knowledge Academy Ltd 222


Visualising Errors
• In the visualisation of data and results, showing errors effectively can make a plot convey much more complete information about the data

Basic Errorbars
• With a single Matplotlib function, a basic errorbar can be created:
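A hedged sketch of a basic errorbar, with invented data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
dy = 0.8                                     # assumed constant error on each point
y = np.sin(x) + dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k')        # fmt controls the marker/line style
plt.show()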

© The Knowledge Academy Ltd 223


Visualising Errors
(Continued)

• Example 1:

© The Knowledge Academy Ltd 224


Visualising Errors
(Continued)

• Example 2:

© The Knowledge Academy Ltd 225


Contour Plots
• Firstly, we have to import the functions for plotting

Visualising a Three-Dimensional Function


• We will start by showing a contour plot by using a function z=f(x,y)

© The Knowledge Academy Ltd 226


Contour Plots
(Continued)

• plt.contour function is used to create a contour plot

• This function takes three arguments: a grid of x values, a grid of y values, as well as a
grid of z values

• The x as well as y values signify positions on the plot, and the contour levels will
represent the z values

© The Knowledge Academy Ltd 227


Contour Plots
(Continued)

• The np.meshgrid function is used to build two-dimensional grids from one-dimensional arrays:

• The lines in the plotting can be color-coded by specifying a colourmap with the cmap
argument
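A minimal sketch combining meshgrid, contour, and cmap; the function f is invented for illustration:

import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)                  # 2D grids built from the 1D arrays
Z = f(X, Y)

plt.contour(X, Y, Z, 20, cmap='RdGy')     # 20 equally spaced levels, colour-coded by cmap
plt.show()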

© The Knowledge Academy Ltd 228


Contour Plots
(Continued)

• Also, we will specify that we want more lines to be drawn, i.e. 20 equally spaced
intervals within the data range:

© The Knowledge Academy Ltd 229


Contour Plots
(Continued)

• Matplotlib has a wide range of colourmaps that you can easily browse in IPython by typing plt.cm. and then pressing the Tab key

plt.cm.<TAB>

© The Knowledge Academy Ltd 230


Contour Plots
(Continued)

• We can also apply a filled contour plot by using the plt.contourf() function

• Moreover, we will add a plt.colorbar() command, which automatically creates an additional axis with labelled colour information for the plot:

© The Knowledge Academy Ltd 231


Contour Plots
(Continued)

• The colorbar makes it clear that the black regions are peaks. On the other hand, the red
regions are valleys

• Also, we can use the plt.imshow() function to interpret a two-dimensional grid of data as an image

© The Knowledge Academy Ltd 232


Histograms, Binnings, and Density
Example of Histograms

© The Knowledge Academy Ltd 233


Histograms, Binnings, and Density
(Continued)

• The hist() function has several options to tune both the calculation as well as the
display; here is an example of more customised histogram:

© The Knowledge Academy Ltd 234


Histograms, Binnings, and Density
(Continued)

• The plt.hist docstring has more information on other customisation options available

• The combination of histtype='stepfilled' with some transparency (alpha) tends to be very useful when comparing histograms of several distributions:
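A hedged sketch of that comparison, with invented normal distributions:

import numpy as np
import matplotlib.pyplot as plt

x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)   # shared styling for all three
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs)
plt.show()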

© The Knowledge Academy Ltd 235


Histograms, Binnings, and Density
(Continued)

• The np.histogram() function computes the frequency of the data distribution (the bin counts) without drawing it

© The Knowledge Academy Ltd 236


Histograms, Binnings, and Density
Binnings

plt.hexbin: Hexagonal binnings

• The two-dimensional histogram creates a tessellation of squares across the axes. The
regular hexagon is another natural shape for such a tessellation

• The plt.hexbin routine is provided by Matplotlib for this purpose; it represents a two-dimensional dataset binned within a grid of hexagons:
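A minimal sketch of plt.hexbin on invented two-dimensional data:

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T

plt.hexbin(x, y, gridsize=30, cmap='Blues')   # bin the points into a grid of hexagons
plt.colorbar(label='count in bin')
plt.show()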

© The Knowledge Academy Ltd 237


Histograms, Binnings, and Density
(Continued)

• plt.hexbin has a number of interesting options, including the ability to specify weights
for each point, as well as to alter the output in each bin to any NumPy aggregate (mean
of weights, standard deviation of weights, etc.)

Kernel Density Estimation


• Another common technique of evaluating densities in multiple dimensions is kernel
density estimation (KDE)

© The Knowledge Academy Ltd 238


Histograms, Binnings, and Density
(Continued)

Example:

© The Knowledge Academy Ltd 239


Customising Plot Legends
• Plot legends in data science give meaning to a visualisation, assigning meaning to the
several plot elements

• The plt.legend() command is used to create the simplest legend, which is generated automatically for any labelled plot elements:
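A minimal sketch of the simplest legend, with invented sine and cosine curves:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend()                                  # picks up every labelled plot element
plt.show()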

© The Knowledge Academy Ltd 240


Customising Plot Legends
(Continued)

• But, there are several ways we might want to customise such a legend. For instance, we
can define the location as well as turn off the frame:

© The Knowledge Academy Ltd 241


Customising Plot Legends
(Continued)

• We can use the ncol command for specifying the number of columns in the legend:

© The Knowledge Academy Ltd 242


Customising Plot Legends
(Continued)

• We can use a fancybox (rounded box) or add a shadow, alter the transparency (alpha
value) of the frame, or alter the padding around the text:

© The Knowledge Academy Ltd 243


Customising Plot Legends
Choosing Elements for the Legend

• We can fine-tune which elements as well as labels appear in the legend using the
objects returned by the plot commands

• The plt.plot() command can create multiple lines at once and returns a list of the created line instances. Passing any of these to plt.legend() tells it which lines to identify, along with the labels we would like to specify:

© The Knowledge Academy Ltd 244


Customising Plot Legends
(Continued)

• Now, applying labels to the plot elements which show on the legend:

© The Knowledge Academy Ltd 245


Customising Plot Legends
(Continued)

Multiple Legends

© The Knowledge Academy Ltd 246


Customising Colorbars
• Firstly, import functions:

• The simplest colorbar can be created with the plt.colorbar function:

© The Knowledge Academy Ltd 247


Customising Colorbars
Customising Colorbars
• The colormap can be specified by using the cmap argument to the plotting function that
is creating the visualisation:

© The Knowledge Academy Ltd 248


Customising Colorbars
(Continued)

Color limits and extensions

© The Knowledge Academy Ltd 249


Customising Colorbars
Discrete Color Bars
• plt.cm.get_cmap() function is used for discrete color bars

© The Knowledge Academy Ltd 250


Multiple Subplots
• These subplots might be insets, grids of plots, or other more complex layouts

• By using plt.axes

© The Knowledge Academy Ltd 251


Multiple Subplots
(Continued)

• Example of fig.add_axes()

© The Knowledge Academy Ltd 252


Multiple Subplots
(Continued)

• plt.subplot() creates a single subplot within a grid

© The Knowledge Academy Ltd 253


Multiple Subplots
(Continued)

• The command plt.subplots_adjust is used for adjusting the spacing between these
plots. The following example uses the equivalent object-oriented command named
fig.add_subplot():
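A hedged sketch of a 2 x 3 grid built with fig.add_subplot() and spaced with subplots_adjust:

import matplotlib.pyplot as plt

fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)        # spacing between the subplots
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)                  # 2 rows x 3 columns, i-th subplot
    ax.text(0.5, 0.5, str((2, 3, i)), fontsize=18, ha='center')
plt.show()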

© The Knowledge Academy Ltd 254


Multiple Subplots
(Continued)

• We can also specify subplot locations as well as extents:

© The Knowledge Academy Ltd 255


Text Annotation
• The following is an example of drawing text at several locations using these transforms:

© The Knowledge Academy Ltd 256


Text Annotation
(Continued)

• Note that by default, the text is aligned above as well as to the left of the specified
coordinates: here the "." at the commencement of each string will approximately mark
the given coordinate location

• The transData coordinates give the common data coordinates associated with the x- as
well as y-axis labels

• The transAxes coordinates give the location from the bottom-left corner of the axes
(here the white box), as a fraction of the axes size

© The Knowledge Academy Ltd 257


Text Annotation
(Continued)

• The transFigure coordinates are similar, but specify the position from the bottom-left of the figure (here the gray box), as a fraction of the figure size

• Notice now that if we alter the axes boundaries, it is only the transData coordinates
that will be affected, whereas the others remain static:

© The Knowledge Academy Ltd 258


Text Annotation
(Continued)

Arrows and Annotation


• Along with tick marks as well as text, another useful annotation mark is the simple
arrow

• Drawing arrows in Matplotlib is often much harder than you might expect. Although a plt.arrow() function is available, the arrows it creates are SVG (Scalable Vector Graphics) objects that are subject to the varying aspect ratio of your plots, and the result is rarely what the user intended

© The Knowledge Academy Ltd 259


Text Annotation
(Continued)

• plt.annotate() function creates some text as well as an arrow, and the arrows can be
very flexibly specified

• Here, we will use annotate with several of its options:
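A minimal sketch of plt.annotate with an arrow; the curve and coordinates are invented:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 2),
            arrowprops=dict(facecolor='black', shrink=0.05))   # arrow from the text to the point
ax.set_ylim(-2, 3)
plt.show()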

© The Knowledge Academy Ltd 260


Three-Dimensional Plotting in Matplotlib
• Three-dimensional plots are enabled by importing the mplot3d toolkit, included with
the main Matplotlib installation:

• Once this submodule is imported, three-dimensional axes can be created by passing the keyword projection='3d' to any of the normal axes creation routines:
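A minimal sketch of creating the three-dimensional axes:

from mpl_toolkits import mplot3d   # enables the 3D projection
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')     # empty three-dimensional axes
plt.show()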

© The Knowledge Academy Ltd 261


Three-Dimensional Plotting in Matplotlib
(Continued)

Three-dimensional Points and Lines


• The most fundamental three-dimensional plot is a line or a collection of scatter points created from sets of (x, y, z) triples

• These can be created by using the ax.plot3D as well as ax.scatter3D functions
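A hedged sketch of a three-dimensional line and scatter, with invented data:

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')

zline = np.linspace(0, 15, 1000)
ax.plot3D(np.sin(zline), np.cos(zline), zline, 'gray')      # 3D line

zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')   # 3D scatter coloured by z
plt.show()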

© The Knowledge Academy Ltd 262


Three-Dimensional Plotting in Matplotlib
(Continued)

Three-dimensional Contour Plots

© The Knowledge Academy Ltd 263


Three-Dimensional Plotting in Matplotlib
(Continued)

• In the following code, we will use an elevation of 60 degrees (that is, 60 degrees above
the x-y plane) as well as an azimuth of 35 degrees (that is, rotated 35 degrees counter-
clockwise about the z-axis):

© The Knowledge Academy Ltd 264


Three-Dimensional Plotting in Matplotlib
(Continued)

Wireframes and Surface Plots


• Two other types of three-dimensional plots which work on gridded data are wireframes and surface plots

• These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualise

© The Knowledge Academy Ltd 265


Three-Dimensional Plotting in Matplotlib
(Continued)

• The following is an example of using a wireframe:

© The Knowledge Academy Ltd 266


Three-Dimensional Plotting in Matplotlib
(Continued)

• A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a colormap to the filled polygons can aid observation of the topology of the surface being visualised:

© The Knowledge Academy Ltd 267


Three-Dimensional Plotting in Matplotlib
(Continued)

• Note that even though the grid of values for a surface plot needs to be two-
dimensional, it need not be rectilinear

• Here is an example of creating a partial polar grid which, when used with a 3D surface plot, can give us a slice into the function we are visualising:

© The Knowledge Academy Ltd 268


Module 4: Python for Data Visualization -
Seaborn

© The Knowledge Academy Ltd 269


Install Seaborn and Load a Dataset For
Analysis
Install Seaborn

Using Pip Installer

• For installing the latest version of Seaborn, you could use pip:
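The usual command (assuming pip is available) is:

pip install seaborn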

Output

© The Knowledge Academy Ltd 270


Install Seaborn and Load a Dataset For
Analysis
(Continued)
Consider the following dependencies of Seaborn:

• Python 2.7 or 3.4+
• NumPy
• SciPy
• Pandas
• Matplotlib

© The Knowledge Academy Ltd 271


Install Seaborn and Load a Dataset For
Analysis
Load a Dataset For Analysis

• seaborn.load_dataset(name, cache=True, data_home=None, **kws)

• This function gives quick access to a small number of example datasets that are useful
for documenting seaborn and generating reproducible illustrations for bug reports

• For normal usage, it is not necessary

© The Knowledge Academy Ltd 272


Install Seaborn and Load a Dataset For
Analysis
(Continued)

• Remember that some of the datasets contain a small amount of preprocessing applied
for defining a proper ordering for the categorical variables

• To see a list of available datasets, use get_dataset_names()

Parameters: name: str

• Name of the dataset ({name}.csv)

© The Knowledge Academy Ltd 273


Install Seaborn and Load a Dataset For
Analysis
(Continued)

Cache: boolean, optional

• If True, try to load from the local cache first, and save to the cache if a download is required

data_home: string, optional

• The directory in which to cache data; see get_data_home()

© The Knowledge Academy Ltd 274


Install Seaborn and Load a Dataset For
Analysis
(Continued)

Kws: keys and values, optional

• Additional keyword arguments are passed through to pandas.read_csv()

Returns: df: pandas.DataFrame

• Tabular data, possibly with some preprocessing applied

© The Knowledge Academy Ltd 275


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
• Histograms depict the data distribution by forming bins along the range of the data, then drawing bars to show the number of observations that fall in each bin

• Seaborn comes with some built-in datasets, and we use one of them here

© The Knowledge Academy Ltd 276


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
(Continued)

Output

© The Knowledge Academy Ltd 277


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
Kernel Density Estimate Curve

• KDE is a procedure for estimating the probability density function of a continuous random variable, and it is used for non-parametric analysis

• Setting the hist flag to False in distplot would yield the kernel density estimate plot
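A hedged sketch using the built-in 'tips' dataset; note that distplot is deprecated in newer Seaborn releases in favour of displot/kdeplot, but it matches the usage described here:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('tips')
sns.distplot(df['total_bill'], hist=False)   # hist=False leaves only the KDE curve
plt.show()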

© The Knowledge Academy Ltd 278


Plot the Distribution Using a Histogram and
Kernel Density Estimate Curve
(Continued)

Output

© The Knowledge Academy Ltd 279


Regression Analysis by Using the Seaborn
lmplot
• Most of the time, we use datasets that include multiple quantitative variables, and the goal of the analysis is usually to relate those variables to each other. Regression lines can do this

• Usually, we check for multicollinearity while building the regression model, where we have to examine the correlation between all combinations of continuous variables

• Significant action is then taken to remove multicollinearity if it exists

© The Knowledge Academy Ltd 280


Regression Analysis by Using the Seaborn
lmplot
(Continued)

• In Seaborn, there are two main functions for visualising a linear relationship determined through regression. These functions are regplot() and lmplot()

regplot(): accepts the x and y variables in a variety of formats, including simple NumPy arrays, pandas Series objects, or references to variables in a pandas DataFrame

lmplot(): has data as a required parameter, and the x and y variables must be specified as strings. This data format is called "long-form" data
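A minimal sketch of lmplot on the built-in 'tips' dataset, with x and y passed as column names (long-form data):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
sns.lmplot(x='total_bill', y='tip', data=tips)   # scatter plus fitted regression line
plt.show()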

© The Knowledge Academy Ltd 281


Basic Aesthetic Themes and Styles Available
in Seaborn
• Aesthetics is a set of principles which is concerned with the nature and appreciation of
beauty, particularly in art.

• Visualisation is an art of interpreting data in a useful and most comfortable way

• The Matplotlib library is highly customisable, but knowing which settings to tweak to achieve an attractive, expected plot is something one must be aware of in order to make good use of it

• Unlike Matplotlib, Seaborn comes packed with customised themes and a high-level
interface for controlling and customising the look of Matplotlib figures

© The Knowledge Academy Ltd 282


Basic Aesthetic Themes and Styles Available
in Seaborn
(Continued)

Output

© The Knowledge Academy Ltd 283


Basic Aesthetic Themes and Styles Available
in Seaborn
Seaborn Figure Styles

• The interface for manipulating styles is set_style(). Using this function, you can set the theme of the plot. As of the latest version, there are the following five themes (a minimal sketch follows the list):

o darkgrid
o whitegrid
o dark
o white
o ticks
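A minimal sketch of switching themes with set_style(); the plotted data is invented:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')                        # any of: darkgrid, whitegrid, dark, white, ticks
sns.boxplot(data=np.random.normal(size=(20, 6)))
plt.show()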

© The Knowledge Academy Ltd 284


Basic Aesthetic Themes and Styles Available
in Seaborn
(Continued)

Output

© The Knowledge Academy Ltd 285


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
Scatter Plots

• Dots are used in the scatter plot for representing values in two distinct numeric
variables

• The position of each dot on the vertical and horizontal axis indicates values for a single
data point

• Scatter plots are used for observing relationships between variables

© The Knowledge Academy Ltd 286


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

© The Knowledge Academy Ltd 287


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
Hexbin plots

• A hexbin plot is useful for representing the relationship between two numerical variables when you have a lot of data points

• Instead of overlapping points, the plotting window is split into numerous hexbins, and the number of points per hexbin is counted

• The colour indicates this number of points. This can be done directly using the hexbin function of Matplotlib

© The Knowledge Academy Ltd 288


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

Output

© The Knowledge Academy Ltd 289


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
KDE plots

• Kernel Density Estimate is used for visualising the Probability Density of a continuous
variable

• It shows the probability density at different values of a continuous variable

© The Knowledge Academy Ltd 290


Distinguish between Scatter Plots, Hexbin
Plots, and KDE Plots
(Continued)

Output

© The Knowledge Academy Ltd 291


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
• Violin Plot is a way for visualising the distribution of numerical data of distinct variables

• It corresponds to a box plot, but with a rotated density plot on each side, providing more information about the density estimate on the y-axis

• The density is also mirrored and flipped over, and the resulting shape of violin plot is
filled in, creating an image resembling the violin

• The benefit of a violin plot is that it could depict the nuances in the distribution that are
not perceptible in a boxplot

© The Knowledge Academy Ltd 292


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

• On the other hand, the boxplot more clearly indicates the outliers in the data

• Violin plots contain more information than box plots, but they are less popular: their meaning can be more difficult to grasp, and many readers are not familiar with the violin plot representation

© The Knowledge Academy Ltd 293


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

Example of Boxplot:

© The Knowledge Academy Ltd 294


Use Boxplots and Violin Plots to Visualise the
Distributions of Data
(Continued)

Example of Violin plot:

© The Knowledge Academy Ltd 295


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Categorical Plots

• Categorical plots are used to visualise the relationship between variables. The variables can be either numerical or categorical, such as a class, group, or division

• Besides being a statistical plotting library, Seaborn also provides some default datasets. We will be using one such default dataset, known as 'tips'

• The 'tips' dataset holds information about people who ate at a restaurant: whether or not they left a tip for the waiters, their gender, whether they smoke, and so on

© The Knowledge Academy Ltd 296


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 297


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Barplot

• A barplot is used to aggregate categorical data according to some method, by default the mean

• It can also be thought of as a visualisation of a group-by action

Syntax:

barplot([x, y, hue, data, order, hue_order, …])

© The Knowledge Academy Ltd 298


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 299


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Stripplot

• Basically, it creates a scatter plot based on the category

Syntax:

stripplot([x, y, hue, data, order, …])

© The Knowledge Academy Ltd 300


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Output

© The Knowledge Academy Ltd 301


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
Swarmplot

• It is similar to a strip plot, except that the points are adjusted so that they do not overlap. Some people also like combining the idea of a violin plot and a strip plot to form a single plot

• One disadvantage of swarm plots is that they do not scale well to huge numbers of points and take a lot of computation to arrange

• So, if we need to visualise a swarm plot clearly, we can plot it on top of a violin plot

© The Knowledge Academy Ltd 302


Compare the Use Cases for Swarm Plots, Bar
Plots Strip Plots, and Categorical Plots
(Continued)

Syntax: swarmplot([x, y, hue, data, order, …])

Output

© The Knowledge Academy Ltd 303


Recall Some of the Use Cases and Features of
Seaborn
Important Features of Seaborn
• Built in themes for styling matplotlib graphics

• Visualising univariate and bivariate data

• Fitting and visualising linear regression models

• Plotting statistical time series data

• Seaborn works well with NumPy and Pandas data structures

• It comes with built in themes for styling Matplotlib graphics

© The Knowledge Academy Ltd 304


Module 5: Introduction to Machine
Learning

© The Knowledge Academy Ltd 305


Introduction
• Machine Learning refers to the study of algorithms and statistical models used by
computer systems as a way of effectively performing tasks without the need for specific
instructions, but relying on patterns and inference instead

• The following describes the two ways a system can improve:

1. By acquiring new knowledge, facts, and skills

2. By adapting its behaviour, solving problems more accurately, and more efficiently

• There are three main elements that comprise Machine Learning:

1. Base knowledge in which the system is aware of the answer thus enabling the system
to learn

© The Knowledge Academy Ltd 306


Introduction
(Continued)

2. The computational algorithm which is at the core of making determinations

3. Variables and features used to make decisions

• Machine Learning is the main subarea of artificial intelligence

• Machine Learning allows the computers or machines to routinely adjust and customise
themselves instead of being explicitly programmed to carry out specific tasks

• These programs or algorithms are specifically designed to improve their performance P


at some task T with experience E:

o T: Recognising hand-written words

© The Knowledge Academy Ltd 307


Introduction
(Continued)

o P: Percentage of words correctly classified

o E: Database of human-labelled images of handwritten words

• The following are real life examples of Machine Learning:

o While shopping on the internet, users are presented with advertisements related to
their purchases

o When a person shopping online checks a product on the internet, the site then recommends similar products

© The Knowledge Academy Ltd 308


Introduction
(Continued)

o When using an app to book a cab ride, the app will provide an estimation of the
price of that ride. When using these services, how do they minimise the detours?
The answer is machine learning

• Some Other Real-Life Examples of Machine Learning:

1. Virtual Personal Assistants

o Siri and Alexa are a few of the popular examples of virtual personal assistants

o Virtual Assistants are integrated in a variety of platforms. For example:

o Smartphones: Samsung Bixby on Samsung S8

© The Knowledge Academy Ltd 309


Introduction
(Continued)

o Smart Speakers: Amazon Echo and Google Home

o Mobile Apps: Google Allo

2. Social Media Services

o Social media platforms are utilising machine learning for their own benefits as well
as for the benefit of the user. Below are a few examples:

o Face Recognition: Upload a picture of you with a friend and Facebook instantly
recognizes that friend

o Similar Pins: Computer Vision is used by Pinterest to recognise objects in images and recommend similar pins accordingly
© The Knowledge Academy Ltd 310
Introduction
3. Online Fraud Detection

o Machine learning is proving its potential to make cyberspace a secure place and
tracking monetary frauds online is one of its examples

o For example: PayPal is using ML for protection against money laundering

4. Online Customer Support

o Most websites will offer the option to chat to customer support. In most cases, you
talk to a chatbot rather than a live executive to answer your queries

o These bots tend to extract information from the website and present it to the
customers

© The Knowledge Academy Ltd 312


Introduction
Difference Between Traditional Programming and Machine Learning
Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program

© The Knowledge Academy Ltd 313


Importance of Machine Learning
• Machine learning has become a key technique for problem solving in a variety of fields and applications:

o Computational Biology – drug discovery, tumour detection, DNA sequencing
o Computational Finance – credit scoring, algorithmic trading
o Image Processing and Computer Vision – motion detection, object detection
o Natural Language Processing – voice recognition
o Energy Production – price and load forecasting
o Automotive, aerospace, and manufacturing – predictive maintenance

© The Knowledge Academy Ltd 314


Types of Machine Learning
• Machine Learning has three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning

• Supervised Learning (task driven – predict the next value) covers Classification (categorical outputs) and Regression (continuous outputs):
o Classification: Support Vector Machines, Discriminant Analysis, Naïve Bayes, Nearest Neighbour
o Regression: Linear Regression, GLM, SVR, GPR, Ensemble Methods, Decision Trees, Neural Networks

• Unsupervised Learning (data driven) is mainly Clustering: K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model

• Reinforcement Learning: learn from mistakes

© The Knowledge Academy Ltd 315


How Machine Learning Works?
• Machine Learning uses both Supervised and Unsupervised Learning. Supervised
Learning trains a model on known input and output data so that it can predict future
outputs. Unsupervised learning identifies hidden patterns or intrinsic structures in input
data
o Unsupervised Learning: group and interpret data based only on input data (Clustering)

o Supervised Learning: develop a predictive model based on both input and output data (Classification and Regression)

© The Knowledge Academy Ltd 316


How Machine Learning Works?
Training the machine learning algorithm (workflow): starting from a training data set, the ML algorithm is trained to produce a model; if the accuracy is not acceptable, the algorithm is trained again; once the accuracy is acceptable, the machine learning algorithm is deployed, and new input data is introduced to the model to make a prediction

© The Knowledge Academy Ltd 317


Machine Learning Mathematics
• Machine Learning theory is a field that combines probability, computer science, statistics, and algorithms to learn iteratively from data and identify hidden patterns that can later be used to build intelligent applications

Why mathematics is significant for machine learning?

o Selecting the right algorithm

o Identifying underfitting and overfitting

o Choosing parameter settings and validation strategies

o Estimating the right confidence interval and uncertainty

© The Knowledge Academy Ltd 318


Machine Learning Mathematics
Importance of maths topics required for Machine Learning (approximate share):

• Linear Algebra – 35%
• Probability Theory and Statistics – 25%
• Multivariate Calculus – 15%
• Algorithms and Complexity – 15%
• Others – 10%

© The Knowledge Academy Ltd 319


Module 6: Natural Language Processing

© The Knowledge Academy Ltd 320


Introduction to NLP
• Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI)

• It serves the interaction between computers and humans with the help of natural language

• The final purpose of NLP is to read, decipher, understand, and make sense of human languages in a valuable manner

• Most NLP techniques depend upon machine learning to derive meaning from human languages

© The Knowledge Academy Ltd 321


Introduction to NLP
• Below is an example of a typical interaction between humans and machines using NLP:

1. A human talks to the machine

2. The audio is captured by the machine

3. Audio gets converted into text

4. Text data is being processed

5. Data is converted into audio

6. The machine responds to the human by playing the audio file

© The Knowledge Academy Ltd 322


Introduction to NLP
• The following are the common applications that have NLP as their driving force:

o Language translator applications like Google Translate

o IVR (Interactive Voice Response) applications that are used in call centres to respond to
specific user requests

o Word processors like Grammarly that employ NLP for checking grammatical errors

o Personal Assistant applications like Siri, Alexa etc.

© The Knowledge Academy Ltd 323


Introduction to NLP
Components of NLP

The following are five main components of NLP:

1. Morphological and Lexical Analysis
2. Syntactic Analysis
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

© The Knowledge Academy Ltd 324


Introduction to NLP
Components of NLP (processing pipeline): an input sentence passes through morphological processing (supported by a lexicon), syntax analysis/parsing (supported by a grammar), semantic analysis (supported by semantic rules), and pragmatic analysis (supported by contextual information)

© The Knowledge Academy Ltd 325


Introduction to NLP
1. Morphological and Lexical Analysis

• Lexical analysis deals with the vocabulary of a language: its words and expressions

• It involves analysing, identifying, and explaining the structure of words

• It consists of dividing a text into paragraphs, words, and sentences

• Individual words are analysed into their components, and non-word tokens such as punctuation are separated from the words

© The Knowledge Academy Ltd 326


Introduction to NLP
2. Semantic Analysis

• Semantic analysis assigns meanings to the structures produced by the syntactic analyser

• It transfers linear sequences of words into structures

• It demonstrates how the words are associated with each other

• Semantics concentrates on the literal meaning of words, phrases, and sentences

• This extracts the dictionary meaning, or the real meaning, from the given context only

© The Knowledge Academy Ltd 327


Introduction to NLP
3. Pragmatic Analysis

• This analysis handles communicative and social content, as well as its effect on interpretation

• In this analysis, the key emphasis is always on what was said, which is then reinterpreted according to its actual intended meaning

• This analysis helps users work out this intended effect by applying a set of rules that characterise cooperative dialogues

© The Knowledge Academy Ltd 328


Introduction to NLP
4. Syntax Analysis

• The words are the smallest units of syntax

• Syntax basically refers to the principles and rules governing the sentence structure of any individual language

• Syntax concentrates on the appropriate ordering of words, which can affect meaning

• It involves analysing the words in a sentence to grasp the grammatical structure of the sentence

© The Knowledge Academy Ltd 329


Introduction to NLP
5. Discourse Integration

• Discourse integration implies a sense of the context

• The significance of any single sentence depends upon the sentences that come before it

• It also considers the meaning of the following sentence

• As an example, in the sentence “He wanted that”, the word “that” depends upon the
previous discourse context

© The Knowledge Academy Ltd 330


NLP and Writing Systems
• One particular element that determines the best approach for text pre-processing is the type of writing system used for a language

Writing systems can be:

• Logographic: an enormous number of individual symbols represent words
• Syllabic: individual symbols represent syllables
• Alphabetic: individual symbols represent sounds

© The Knowledge Academy Ltd 331


NLP Examples
The following are the common applications of NLP:

I. Information retrieval and Web Search

• Search engines such as Google, Bing, Yahoo etc. base their machine translation
technology on NLP deep learning models

• NLP lets algorithms read text on a webpage, interpret its meaning, and translate it to
another language

II. Question Answering

• To ask questions in Natural Language, type in keywords

© The Knowledge Academy Ltd 332


NLP Examples
III. Grammar Correction

• NLP techniques are broadly used by word processing software such as MS-word for spelling
correction and grammar checks

IV. Machine Translation

• Translating text or speech from one natural language to another using computer applications

© The Knowledge Academy Ltd 333


Advantages of NLP
• The following are the advantages of NLP:

a) Users can ask as many questions as they like about any subject and get a response instantly, within seconds

b) NLP systems provide solutions to the questions in natural language

c) These systems provide exact answers to the questions, no unwanted or unnecessary


information

d) The accuracy of the answers depends upon the quantity of relevant information provided in the question

© The Knowledge Academy Ltd 334


Advantages of NLP
e) NLP can work with highly unstructured data sources

f) Enables us to analyse more language-based data compared to humans, and without


fatigue or bias

g) NLP helps computers communicate with humans in their own language and therefore scales up language-related tasks

© The Knowledge Academy Ltd 335


NLP Applications
• There are so many applications of Natural language processing (NLP) in the real world

• Some of them are as follows:

o Machine Translation
o Statistical Machine Translation
o Information Retrieval
o Speech Recognition
o Information Extraction
o Question Answering Systems
o Word Sense Disambiguation
o Text Classification
o Topic Modelling
o Optical Character Recognition
o Language Detection

© The Knowledge Academy Ltd 336


Module 7: Deep Learning

© The Knowledge Academy Ltd 337


Deep Learning
• Deep learning is a machine learning technique that trains machines to do what comes naturally to humans: they learn by example

• It is a key technology behind driverless cars, allowing them to distinguish a pedestrian from a lamppost or to recognise a stop sign

• It powers voice control in consumer devices such as tablets, phones, TVs, and hands-free speakers

© The Knowledge Academy Ltd 338


Deep Learning
• Deep learning is getting attention lately as it is achieving results that were not possible before

• In deep learning, a computer model learns to perform classification tasks directly from text, images, or sound

• Deep learning models can obtain state-of-the-art accuracy, sometimes exceeding human-level performance

• The models are trained by using a huge set of labelled data and neural network architectures that contain multiple layers

© The Knowledge Academy Ltd 339




Importance of Deep Learning
• As the name suggests, Artificial Intelligence is about making machines artificially intelligent, so that they act and think like humans

• The amount of useful data available and an increase in computational speed are the two factors that have made the whole world invest in this field

• If a robot is hard coded, i.e. all the logic has been manually coded into the system, then it is not AI; simple robots do not imply AI

• Machine learning means making a machine learn from its experience and enhance its performance over time, as in the case of a human baby

• The concept of machine learning became possible only when an adequate amount of data was made available for training machines. It assists in dealing with complex systems

© The Knowledge Academy Ltd 341


Importance of Deep Learning
(Continued)

• Deep learning is essentially a subset of machine learning, but in this case the machine learns in the way humans are believed to learn

• The structure of a deep learning model is similar to that of the human brain, with a large number of nodes mirroring the brain's neurons; this gives rise to the artificial neural network

• When traditional machine learning algorithms are applied, we need to select input features manually from a complex data set and then train on them, which is a tedious job for the machine learning scientist; with neural networks, we do not need to select useful input features manually

© The Knowledge Academy Ltd 342


Importance of Deep Learning
(Continued)

• There are several types of neural networks to manage the complexity of data set and
algorithm

• Deep learning has allowed industry experts to overcome challenges that were not possible a decade ago, such as image and speech recognition and Natural Language Processing

• Industries like entertainment, journalism, manufacturing, the digital sector, healthcare, banking and finance, and automotive depend on it

• Trending successes of deep learning include voice assistants, mail services, self-driving cars, video recommendations, and intelligent chatbots

© The Knowledge Academy Ltd 343


How Deep Learning Works
• Neural networks are composed of layers of nodes, similar to the human brain, which is made of neurons. Nodes within individual layers are connected to nodes in adjacent layers

• In the human brain, a single neuron receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and are assigned weights accordingly

• A heavily weighted node will have more impact on the next layer of nodes. The final layer puts together the weighted inputs to produce an output

• Deep learning systems need powerful hardware because they process huge amounts of data and involve many complex mathematical calculations

• Even with such advanced hardware, deep learning training calculations can take weeks

© The Knowledge Academy Ltd 344


How Deep Learning Works
(Continued)

• Deep learning systems need a large amount of data to return accurate results; accordingly, information is fed to them as huge data sets

• When processing the data, artificial neural networks are able to classify it using the answers obtained from a series of true/false questions involving highly complex mathematical computations

• For instance, facial recognition programs work by learning to identify and detect the edges and lines of faces, then more significant parts of the faces, and finally complete representations of the faces

• As the program trains itself, the likelihood of getting the right answers improves with time

© The Knowledge Academy Ltd 345


Module 8: Big Data

© The Knowledge Academy Ltd 346


Big Data Analytics
Introduction

• Big data analysis is the often complex process of


analysing large and varied data sets that can help
companies make informed business decisions

• Big data is a branch related to the analysis,


processing, and storage of large collections of data
that usually originate from different sources

• Big data includes complex transactions and data


sources that require special technologies and
methods to draw vision out of data

© The Knowledge Academy Ltd 347


Big Data Analytics
(Continued)

• The analysis of big data datasets is an interdisciplinary attempt that combines statistics,
mathematics, computer science, and subject matter expertise

• It produces value from the storage and processing of substantial quantities of digital
information that cannot be analysed with conventional computing techniques

© The Knowledge Academy Ltd 348


Big Data Analytics
The definition of big data includes five V's, which together describe data complexity: Volume, Velocity, Variety, Veracity, and Value

© The Knowledge Academy Ltd 349


Big Data Analytics
Sources of Big Data

• Below are the different sources of big data:

1. Archives

2.Enterprise Data

3.Transactional Data

4. Social Media

5. Activity Generated

6. Public Data

© The Knowledge Academy Ltd 350


Big Data Analytics
1. Archives

• A significant amount of data is archived by an


organisation, most of which is rarely required

• As hardware is getting cheaper, organisations do not


want to reject any data; rather, they prefer storing and
capturing as much data as they can

• This data can include scanned copies of agreements, documents, ex-employee records, etc. This type of data, which is accessed less frequently, is known as archive data

© The Knowledge Academy Ltd 351


Big Data Analytics
2. Enterprise Data

• In enterprises, there are large volumes of data in


different formats

• Flat files, word documents, pdf documents, emails,


legacy formats, HTML pages, presentations, and XMLs
are some of the common formats

• The data that is spread in different formats across the


organisation is known as enterprise data

© The Knowledge Academy Ltd 352


Big Data Analytics
3. Transactional Data

• Every enterprise has different applications that include


performing various kinds of transactions like CRM
Systems, Mobile Applications, Web Applications and
many more

• There are one or more relational databases as backend


infrastructure to support the transactions in these
applications

• This is mostly structured data and is known as


transactional data

© The Knowledge Academy Ltd 353


Big Data Analytics
4. Social Media

• There is a significant amount of data generated on


different social networks like Facebook, Twitter etc.

• The social networks involve mostly unstructured data


formats which include images, audio, text, videos, etc.

• This category of the data source is known as social


media

© The Knowledge Academy Ltd 354


Big Data Analytics
5. Activity Generated

• Machines generate a significant amount of data that


exceeds the volume of data generated by humans

• These comprise data from cell phone towers, medical


devices, industrial machinery, satellites, and other data
generated mostly by machines

• These data types are known as activity generated data

© The Knowledge Academy Ltd 355


Big Data Analytics
6. Public Data

• Public data includes those data that is available publicly


such as research data published by research institutes,
sample open source data feeds, census data, data
published by governments etc.

• This publicly accessible data is known as public data

© The Knowledge Academy Ltd 356


State of Practice in Analytics
• Current business problems offer numerous opportunities for organisations to become
increasingly more analytics and data-driven

• Business Drivers for Advanced Analytics:

o Optimise business operations – sales, pricing, profitability, efficiency
o Identify business risk – customer churn, fraud, default
o Predict new business opportunities – upsell, cross-sell, best new customer prospects
o Comply with laws or regulatory requirements – Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)

© The Knowledge Academy Ltd 357


State of Practice in Analytics
(Continued)

• The table describes the four categories of common business problems that organisations
contest with where they have a chance to use advanced analytics to create a
competitive advantage

• Rather than just performing standard reporting on these areas, advanced analytical
techniques can be applied by the organisations to optimise processes and derive more
values from these regular tasks

• The initial three examples don't describe new problems. Organisations have been
attempting to decrease customer churn, increase sales, and cross-sell customers for
many years

© The Knowledge Academy Ltd 358


State of Practice in Analytics
(Continued)

• The last example describes emerging regulatory necessities

• Multiple compliance and regulatory laws have been in presence for quite a long time;
however extra requirements are added every year, that represents added complexity
and data requirements for organisations

• Anti-money laundering (AML) related laws and fraud prevention require advanced
analytical techniques for complying and managing appropriately

© The Knowledge Academy Ltd 359


Main Roles for New Big Data Ecosystem
• There are three key roles for the New Big Data Ecosystem

o Deep Analytical Talent – advanced training in quantitative disciplines, e.g. statistics, maths, and machine learning
o Data-savvy professionals – savvy, but less technical than the first group
o Technology and data enablers – support people, e.g. DB admins, programmers, etc.

© The Knowledge Academy Ltd 360


Phases of Data Analytics Lifecycle
Discovery

• Discovery is the phase 1 where the team learns the business domain, including
appropriate history such as whether the business unit or organization has attempted
similar projects in the past from which they can learn

• The team analyses the resources available to support the project in terms of technology,
people, time, and data

• In this step, essential activities include framing the business problem as an analytics
challenge that can be solved throughout subsequent phases and formulating initial
hypotheses (IHs) to test and start learning the data

© The Knowledge Academy Ltd 361


Phases of Data Analytics Lifecycle
Data Preparation

• Data preparation requires the existence of an analytical sandbox, in which the team can
work with data and perform analytics for the duration of the project

• In this phase, the team needs to execute extract, transform and load (ETL) or extract,
load, and transform (ELT) to retrieve data into the sandbox

• Sometimes the ETL and ELT are abbreviated as the ETLT

• In the ETLT process, data should be transformed so that the team can work with the
data and analyse it. The team also requires to familiarise itself with the data thoroughly
and take steps to condition the data

© The Knowledge Academy Ltd 362


Phases of Data Analytics Lifecycle
Model Planning

• In this phase, the team determines the techniques, methods, and workflow it intends to
follow for the subsequent model building phase

• The team examines the data to learn about the relationships between variables and
subsequently selects key variables and the most relevant models

© The Knowledge Academy Ltd 363


Phases of Data Analytics Lifecycle
Model Building

• Phase 4 is model building, where the team develops datasets for training, testing, and
production purposes

• The team then builds and executes models based on the work done in the model planning phase


© The Knowledge Academy Ltd 364


Phases of Data Analytics Lifecycle
Communicate Results

• Communicate results is phase 5, where the team, in collaboration with major


stakeholders, decides if the results of the project are a success or a failure based on the
criteria developed in the Discovery phase

• In this, the team should quantify the business value, identify key findings, and develop
a narrative to summarise and convey findings to stakeholders

© The Knowledge Academy Ltd 365


Phases of Data Analytics Lifecycle
Operationalise

• Phase 6 is Operationalise, where the team delivers final reports, briefings, code, and
technical documents

• Also, the team may run a pilot project in a production environment to implement the
models

© The Knowledge Academy Ltd 366


Module 9: Working with Data in R

© The Knowledge Academy Ltd 367


Data Manipulation in R
• We can represent data for data analysis with the help of data structures

• Data manipulation in R is used to prepare data for further analysis and visualisation

• The most important aspect of computing with data manipulation in R is that it enables the subsequent analysis and visualisation of the data

• The following are the basic data structures in R:

o Vectors
o Matrices
o Lists
o Data Frames

© The Knowledge Academy Ltd 368


Data Manipulation in R
Creating Subsets of Data in R
• The following are the different methods of subsetting in R are:

1. $ - The dollar sign operator selects a single element of data

2. [[ - like $ in R, the double square brackets operator in R also returns a single element

3. [ - The single square bracket operator in R returns multiple elements of data

© The Knowledge Academy Ltd 369


Data Manipulation in R
(Continued)

Example:

• To retrieve 5 rows and all columns of the built-in dataset iris, the command below is used

Output:
Input:

© The Knowledge Academy Ltd 370


Data Manipulation in R
Creating Subgroups or Bins of Data

1. cut() function in R

• cut() function groups the values of a variable into larger bins

Input:

Output:

© The Knowledge Academy Ltd 371


Data Manipulation in R
(Continued)

2. table() function in R

• We can use the R table() command to count the observations in each level of a factor

Input:

Output:

© The Knowledge Academy Ltd 372


Data Manipulation in R
Combining and Merging Datasets in R
• The following are the ways to combine the different sets of data:

By Adding Columns using cbind() in R

By Adding Rows using rbind() function in R

By Combining Data With Different Shapes using merge() function in R

© The Knowledge Academy Ltd 373


Data clean up
Introduction

• It is the process of transforming the raw data into consistent data and analysing it

• The main aim of data cleaning is to improve the statistical statements based on the data
and their reliability

• It can profoundly influence the statistical statements based on the data

© The Knowledge Academy Ltd 374


Data clean up
Steps to clean data

Initial Exploratory Analysis

Visualise Your Data

Cleaning The Errors

© The Knowledge Academy Ltd 375


Data clean up
(Continued)

Initial Exploratory Analysis:

• The first step involves an initial exploration of the data frame that was just imported into R

• The important thing is to understand how to import data into R and save it as a data
frame

© The Knowledge Academy Ltd 376


Data clean up
(Continued)

Output:

© The Knowledge Academy Ltd 377


Data clean up
(Continued)

The first thing to check is the class of your data frame:

• class(data)

o Here we can clearly see that our dataset is saved as a data frame:

[1] "data.frame"

o Next, we want to check the number of rows and columns in the data frame

© The Knowledge Academy Ltd 378


Data clean up
(Continued)

The code and its result:

[1] 1460 81 – we can see that the data frame has 1460 rows and 81 columns

We can view the statistical for all the columns of the data frame using the code that shown
in the next slide:

© The Knowledge Academy Ltd 379


Data clean up
(Continued)

• summary(data)

Output:

© The Knowledge Academy Ltd 380


Data clean up
(Continued)

Visual Exploratory Analysis:

• There are two types of plots that you should use during the data cleaning process:

Histogram BoxPlot

© The Knowledge Academy Ltd 381


Data clean up
(Continued)

Histogram:

• The histogram is useful to see the overall distribution of numeric columns

• We can determine whether the distribution of data is normal or unimodal or bi-modal or


any other kind of distribution of interest

• The histogram is useful to figure out if there are outliers in the particular numerical
columns under study

© The Knowledge Academy Ltd 382


Data clean up
(Continued)

The code and output is given below:

install.packages("plyr")
library(plyr)
hist(data$Dist_Taxi)

© The Knowledge Academy Ltd 383


Data clean up
(Continued)

BoxPlot:

• It is very useful because it shows the median (the second quartile), along with the first and third quartiles

• BoxPlots are the best way of spotting outliers in your data frame

© The Knowledge Academy Ltd 384


Data clean up
(Continued)

The output is given below:

boxplot(data$Dist_Taxi)

© The Knowledge Academy Ltd 385


Data clean up
(Continued)

Correcting the Errors:

• In this step the main focus is to correct all the errors that you have seen

• If you want to rename a column of your data frame, one approach is:

data$carpet_area <- data$Carpet

• With this code we copied the Carpet column into a new column named "carpet_area" (see the sketch below for a true in-place rename)
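
If you prefer a genuine rename rather than a copy, one possible approach (a minimal sketch, assuming the column is called Carpet):

names(data)[names(data) == "Carpet"] <- "carpet_area"   # rename the column without duplicating it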

© The Knowledge Academy Ltd 386


Data clean up
(Continued)

• Some columns may have an incorrect type associated with them, for example a column containing text elements stored as a numeric column

• In such a case, we can change the type of the column by using the following code:

data$Dist_Taxi <- as.character(data$Dist_Taxi)
class(data$Dist_Taxi)

© The Knowledge Academy Ltd 387


Reading and Exporting Data
• R is a programming language used in statistical computing

• It is used by data analysts, researchers, and statisticians

• R provides powerful tools for manipulating data, which can then be used for predictive modelling

• In order to analyse data, we often need to access it from different databases by using SQL commands

• The data then needs to be read in and exported using different file formats

• There are many pre-defined procedures for doing this

© The Knowledge Academy Ltd 388


Reading and Exporting Data
• The most popular one is ACCESS, where you can create access descriptors that describe data stored in a DBMS

• Access descriptors enable you to create view descriptors, which function in the same way as the PROC SQL command

• Once the data is accessed, it is analysed depending on the requirement

• The data is then exported to a different location altogether

• The advantage is that this can be done using many different file types

• Care needs to be taken when you are exporting data from one file type to another

© The Knowledge Academy Ltd 389


Reading and Exporting Data
• Let us take a look at Data Export Formats:

1. CSV
2. TSV
3. SPSS
4. HTML
5. Fixed Field Text, and many more

• The following two export formats are available for Data Table Exports only:

a. CSV (Comma Separated Values)


b. TSV (Tab Separated Values)

© The Knowledge Academy Ltd 390


Reading and Exporting Data
• The others are:

i. XML
ii. SPSS
iii. HTML
iv. Fixed field text
v. Tableau
vi. JSON

• CSV (Comma Separated Values) files can be opened in MS Excel. They can also be imported into other statistical software

• TSV (Tab Separated Values) is a simple text format for storing data in a tabular structure. TSV and CSV are compatible file formats for importing data into Qualtrics

© The Knowledge Academy Ltd 391


Reading and Exporting Data
• When you decide to export data from Excel, you will be exporting it to an XLSX file

• XML is used for putting your raw data into a database. It is a general-purpose mark-up language and is compatible with Excel

• For statistical analysis, a software package called SPSS is used

• HTML format is used to view your data in a table on a web browser

• Fixed Field Text is a flat file format. It is accompanied by a separate data map file

• Many organisations use a data analysis application called Tableau. JSON (JavaScript Object Notation) is also available for use

© The Knowledge Academy Ltd 392


Importing Data
• Importing data means to import data from various sources into the R programming
environment

Process of Importing Data in R

© The Knowledge Academy Ltd 393


Importing Data
1. Using the Combine Command

• In R programming, we make use of the c() function to concatenate or combine various data values together

• In the following example, vector1, vector2, and vector3 are variables that store integer values separately. We make use of the c() function to combine these values together, as in the sketch below
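
A minimal sketch matching that description (the integer values are hypothetical):

vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector3 <- c(7, 8, 9)
combined <- c(vector1, vector2, vector3)   # combine the three vectors
print(combined)                            # 1 2 3 4 5 6 7 8 9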

© The Knowledge Academy Ltd 394


Importing Data
2. Entering Numerical Items as Data

• We can enter numerical data by typing the values, separated by commas, into the c() command

• Let us create a data set by using the c() command:

• In this example, data1 is the object that stores our data. We type our numerical values between the two parentheses, separated by commas, and then type data1 to display the dataset
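
A minimal sketch (the numerical values are hypothetical):

data1 <- c(23, 17, 12, 15, 9)   # type the values between the parentheses, separated by commas
data1                           # display the dataset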

© The Knowledge Academy Ltd 395


Importing Data
(Continued)

• We will create an object data2 that stores our data. We will also specify data1 as one of
the member components

3. Entering Text Items as Data

• We make use of single-quotes or double-quotes to enter character data

• Whatever data these quotes enclose is interpreted as character data, i.e. a text item

© The Knowledge Academy Ltd 396


Importing Data
(Continued)

• In the following example, we take our data in the form of characters, the days of the week, and store them in the day1 object

• We then pass day1 together with another element into a new vector. In this case, however, the new element is not text but a number. When numbers and text are combined, R converts the number into text (see the sketch below)
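
A minimal sketch of this coercion (hypothetical values):

day1 <- c("Mon", "Tue", "Wed", "Thu", "Fri")
mixed <- c(day1, 7)   # combining text and a number: R converts 7 into the text "7"
class(mixed)          # [1] "character"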

© The Knowledge Academy Ltd 397


Importing Data
4. Using the scan() command

• Instead of typing input data with commas between the values, we can use the scan() command, which doesn't require you to enter a comma after every input value

• The scan() command can also be used to take data from files as well as from the clipboard

• When called with nothing between its parentheses, the scan() command invokes a prompt through which you enter the data

© The Knowledge Academy Ltd 398


Importing Data
(Continued)

• In the above example, we created a data frame which is then stored as a file called 'data.txt' on the local disk. This text file can be accessed using the scan() function as follows:
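
A minimal sketch of writing the data frame out and reading it back with scan() (assumes a data frame named data already exists):

write.table(data, "data.txt", row.names = FALSE)   # store the data frame as data.txt
values <- scan("data.txt", what = "character")     # read the file back, one field at a time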

© The Knowledge Academy Ltd 399


Importing Data
5. Using the Clipboard to Make Data

• To copy and paste the data more interactively, we can use the clipboard

• We can enter the input data such as spreadsheets with the help of scan() command

• The key steps to import spreadsheet data into R are as follows:

o If the spreadsheet contains data of a numerical type, type the scan() command in R before switching to the spreadsheet

o After highlighting the important cells, we copy them to the clipboard

© The Knowledge Academy Ltd 400


Importing Data
(Continued)

o After returning to R, paste the data from the clipboard. R then waits until an empty line is entered before stopping the data entry process, which makes it easier to copy and paste data as required

o Finally, to complete the data entry procedure, a blank line is entered

• If the data is separated by spaces, simply copy and paste. However, if some other
character or symbol separates the data, we must enter it in R before importing the data

© The Knowledge Academy Ltd 401


Importing Data
6. Using Scan() to Retrieve Data from CSV file

• We can retrieve data from a CSV file using the scan() command. We will save our previously created data frame 'data' as a CSV file

• Now, we scan our CSV file and set the what argument to 'character', as in the sketch below
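
A minimal sketch (again assuming the data frame data from the earlier slides):

write.csv(data, "data.csv", row.names = FALSE)                  # save the data frame as a CSV file
csv_values <- scan("data.csv", what = "character", sep = ",")   # every field is read as a character string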

© The Knowledge Academy Ltd 402


Importing Data
7. Reading a File of Data from a Disk

• We can use the scan() command to read a data file from our system's local disk

• Data can be read from a file and written to a vector with the help of the scan() command. In the scan() function, we add the file name as follows:

© The Knowledge Academy Ltd 403


Importing Data
8. Reading Bigger Data Files

• We used the scan() command in the sections above to read data from simple files. R can also handle much larger files containing more complicated data

• There are different ways and means of reading such large data sets stored in a variety of text formats (a short sketch follows the list below):

o To read from a CSV file: read.csv() or read.csv2()

o To read data files laid out as tables: read.table()

o To read files that contain values separated by tabs: read.delim()
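
A minimal sketch of these readers (the file names are hypothetical):

df_csv <- read.csv("data.csv")                    # comma-separated values
df_tab <- read.table("data.txt", header = TRUE)   # whitespace-separated table
df_tsv <- read.delim("data.tsv")                  # tab-separated values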

© The Knowledge Academy Ltd 404


Module 10: Regression in R

© The Knowledge Academy Ltd 405


Regression Analysis
(Continued)

• Regression is of the following two types:

© The Knowledge Academy Ltd 406


Linear Regression
• Using linear regression, an analyst can compress data points from a sample into a straight line

• A "strong" or "loose" correlation can then be determined by the closeness of the points to the regression line

• A more scattered plot pattern in relation to the line suggests a loose correlation, while a tighter clustering of plot points suggests a stronger one

• Regression lines can be positive or negative
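
A minimal sketch of fitting such a line in R (uses the built-in mtcars data as an illustrative assumption):

fit <- lm(mpg ~ wt, data = mtcars)   # compress the (wt, mpg) points into a straight line
plot(mtcars$wt, mtcars$mpg)          # scatter of the original points
abline(fit, col = "blue")            # the fitted regression line
summary(fit)$r.squared               # how tightly the points cluster around the line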

© The Knowledge Academy Ltd 407


Logistic Regression
• In logistic regression, we fit a regression curve y = f(x), where y is a categorical variable

• It is used to estimate the probability of y given a set of predictors x

• The predictors can be categorical, continuous, or a combination of both

• It is a classification algorithm which comes under nonlinear regression

• This model is used to predict a binary outcome (1/0, True/False, Yes/No) given a set of independent variables

• Also, by using dummy variables, it helps to represent categorical/binary results

© The Knowledge Academy Ltd 408


Logistic Regression
(Continued)

• It is a regression model in which the response variable has binary values such as 0/1
or True/False. Hence, we are able to calculate the probability of the binary response

• Expression of R Logistic Regression:
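
y = 1 / (1 + e^-(a + b*x))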

• x and y are the predictor variable and the response variable respectively

• a and b are the coefficients which are numeric constants

© The Knowledge Academy Ltd 409


Logistic Regression
Syntax of Logistic Regression

• In logistic regression, the basic syntax for glm() function is:

glm( formula, data, family)

• Description of the parameters used:
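
In brief (standard glm() usage):

o formula: the symbol describing the relationship between the response variable and the predictors

o data: the data set giving the values of these variables

o family: an R object specifying the details of the model; its value is binomial for logistic regression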

© The Knowledge Academy Ltd 410


Logistic Regression
Building Logistic Regression Model in R Programming

• In this example, we use the BreastCancer dataset, which is available in the mlbench package

• First, we import the data and display the information about the BreastCancer dataset with the str() function (a minimal sketch follows):
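
A minimal sketch of what these steps might look like (assumes the mlbench package, which provides BreastCancer):

library(mlbench)
data(BreastCancer)
str(BreastCancer)                                # display the structure of the dataset

bc <- BreastCancer
bc$Cl.thickness <- as.numeric(bc$Cl.thickness)   # turn the ordered factor into a numeric score
model <- glm(Class ~ Cl.thickness, data = bc, family = binomial)   # logistic regression on one predictor
summary(model)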

• To execute the code, press Ctrl+Enter

© The Knowledge Academy Ltd 411


Logistic Regression
(Continued)

Output:

© The Knowledge Academy Ltd 412


Logistic Regression
Applications of Logistic Regression with R

• Logistic Regression helps in categorisation and image segmentation

• In geographic image processing, we use logistic regression

• We use logistic regression in handwriting recognition

• Healthcare is an application area of logistic regression

• More generally, we use this type of regression whenever we need to predict the probability of a binary outcome

© The Knowledge Academy Ltd 413


Multiple Regression
• Multiple regression is an extension of linear regression to relationships involving more than two variables

• In a simple linear relationship there is one predictor and one response variable, but in multiple regression there is more than one predictor variable and one response variable

• The mathematical equation (General) for multiple regression can be expressed as

y = a + b1x1 + b2x2 + ... + bnxn

© The Knowledge Academy Ltd 414


Multiple Regression
(Continued)

Following is the description of the parameters which are used in the equation on the
previous slide −

• y is the response variable

• a, b1, b2...bn are the coefficients

• x1, x2, ...xn are the predictor variables.

© The Knowledge Academy Ltd 415


Multiple Regression
lm() Function

• The lm() function creates the relationship model between the response variable and the predictors

lm(y ~ x1 + x2 + x3 ..., data)

o formula is a symbol that defines the relation between predictor variables and the
response variable.

o data is the parameter on which the formula will be applied

© The Knowledge Academy Ltd 416


Multiple Regression
Example

• Input Data

o Take the data set "mtcars" which is available by default in the R environment

o It gives a comparison among different car models in terms of weight of the car ("wt"), mileage per gallon ("mpg"), horse power ("hp"), cylinder displacement ("disp"), and some more parameters

o The goal of the model is to establish the relationship between "wt", "hp", and "disp" as predictor variables and "mpg" as the response variable

© The Knowledge Academy Ltd 417


Multiple Regression
(Continued)

input <- mtcars[, c("mpg","disp","hp","wt")]
print(head(input))

Output

© The Knowledge Academy Ltd 418


Multiple Regression
(Continued)

• Create Relationship Model & get the Coefficients

input <- mtcars[, c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Show the model.
print(model)

# Get the intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)

Output

© The Knowledge Academy Ltd 419


Multiple Regression
(Continued)

• Create Equation for Regression Model

Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.15 + (-0.000937)*x1 + (-0.0311)*x2 + (-3.8008)*x3

• Apply Equation to predict New Values

o We can use the previously created regression equation for predicting the mileage
when a new set of values for weight, horse power and displacement is provided

© The Knowledge Academy Ltd 420


Multiple Regression
(Continued)

o For a car with wt = 2.91, hp = 102 and disp = 221 the predicted mileage is :

Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91
print(Y)

Output

© The Knowledge Academy Ltd 421


Normal Distribution
• It is often observed that data collected randomly from independent sources is normally distributed

• On plotting a graph, we get a bell-shaped curve with the count of the values on the vertical axis and the value of the variable on the horizontal axis

• The middle part of the curve is the mean of the dataset

• To work with the normal distribution, R programming has four inbuilt functions

dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

© The Knowledge Academy Ltd 422


Normal Distribution
(Continued)

o x represents a vector of numbers

o p represents a vector of probabilities

o n represents the number of observations

o mean represents the mean value of the sample data. Its default value is 0

o sd represents the standard deviation. Also, its default value is 1

© The Knowledge Academy Ltd 423


Normal Distribution
Example of dnorm()

• Build a sequence of numbers between -20 and 20 that increases by 0.2

x <- seq(-20, 20, by = .2)
y <- dnorm(x, mean = 5.0, sd = 1.0)
plot(x, y, main = "Normal Distribution", col = "brown")

Output

© The Knowledge Academy Ltd 424


Binomial Distribution
• The binomial distribution explores the probability of success of an event having only
two possible outcomes in a series of experiments

• For example, tossing a coin always gives a head or a tail. Using the binomial distribution, the probability of finding exactly 3 heads in 10 repeated tosses of a coin can be estimated

• To work with the binomial distribution, R programming has four inbuilt functions

dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

© The Knowledge Academy Ltd 425


Binomial Distribution
(Continued)

o x represents a vector of numbers

o p represents a vector of probabilities

o n represents the number of observations

o size represents the number of trials

o prob defines the probability of success of each trial

© The Knowledge Academy Ltd 426


Binomial Distribution
Example of dbinom()

• dbinom() function gives the distribution of probability density at each point

# Create a sample of 50 numbers which are incremented by 5.
x <- seq(0, 50, by = 5)

# Create the binomial distribution.
y <- dbinom(x, 50, 0.5)

# Plot the graph.
plot(x, y, main = "Binomial Distribution")

Output

© The Knowledge Academy Ltd 427


Binomial Distribution
Example of pbinom()

• pbinom() function gives the cumulative probability of an event

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26, 51, 0.5)
print(x)

Output

© The Knowledge Academy Ltd 428


Binomial Distribution
Example of qbinom()

• The qbinom() function takes a probability value and gives a number whose cumulative value matches that probability

# How many heads will have a cumulative probability of 0.25
# when a coin is tossed 51 times.
y <- qbinom(0.25, 51, 1/2)
print(y)

Output

© The Knowledge Academy Ltd 429


Binomial Distribution
Example of rbinom()

• The rbinom() function generates the required number of random values for a given probability

# Find 8 random values from a sample of 150 with probability of 0.4.
y <- rbinom(8, 150, 0.4)
print(y)

Output

© The Knowledge Academy Ltd 430


Module 11: Modelling Data

© The Knowledge Academy Ltd 431


What are the Relationships?
• In Power BI, a relationship is used to describe the connection or relation between two or more tables

• Relationships are used to perform analysis based on multiple tables

• Relationships help to display the data, and the correct information, across multiple tables

• Relationships are also used to calculate accurate results

© The Knowledge Academy Ltd 432


Viewing Relationships
• The model view displays all of the tables, columns, and relationships in your model

• This view can be mainly useful when your model contains complex relationships
between many tables

• Click on the Model icon placed at the left side of the window to see a view of the
existing model

• Hovering your cursor over a relationship line shows the columns that are used, as shown on the next slide:

© The Knowledge Academy Ltd 433


Viewing Relationships
(Continued)

© The Knowledge Academy Ltd 434


Creating Relationships
• The following are the steps to create a relationship manually:

Step 1: On the Modeling tab, select Manage Relationships > New

© The Knowledge Academy Ltd 435


Creating Relationships
Step 2: In the Create relationship dialog box, select a Products table in the first table drop-
down list, and then choose the column you want to use in the relationship

© The Knowledge Academy Ltd 436


Creating Relationships
Step 3: In the second table drop-down list, choose the other table you want in the
relationship and then select the other column you want to use, press OK

© The Knowledge Academy Ltd 437


Cardinality
• While creating a relationship between two tables, you get two values that can be 1 or *
on the two ends of the relationship among two tables, known as Cardinality of the
relationship

• There are four types of cardinality, as follows:

1. *-1: Many-to-One

2. 1-1: One-to-One

3. 1-*: One-to-Many

4. *-*: Many-to-Many

© The Knowledge Academy Ltd 438


Cardinality
1. Many to one (*:1)

• A many-to-one relationship is an important type of cardinality and the default type of relationship

• In a many-to-one relationship, the column in a given table can have more than one instance of a value, while the other related table, known as the lookup table, contains only one instance of a value

2. One to one (1:1)

• In a one-to-one (1:1) relationship, the column in one table has only one instance of a specific value, and the other related table also contains only one instance of a specific value

© The Knowledge Academy Ltd 439


Cardinality
3. One to many (1:*)

• In a one-to-many (1:*) relationship, the column in one table has only one instance of a
specific value, and the other related table contains more than one instance of a value

4. Many to many (*:*)

• You can develop a many-to-many relationship between tables with composite models
that removes the requirements for unique values in tables

• It also eliminates the previous workarounds, such as introducing new tables only to
build relationships

© The Knowledge Academy Ltd 440


Cross Filter Direction
• Each model relationship must be described with a cross filter direction

• Your selection decides the direction(s) that filters will propagate

• The possible cross filter options are dependent on the type of cardinality

• Single cross filter direction indicates single direction, and both show both directions

• A relationship that filters in both directions is commonly described as bi-directional

© The Knowledge Academy Ltd 441


Cross Filter Direction
(Continued)

Cardinality type                 Cross filter options
One-to-many (or Many-to-one)     Single, Both
One-to-one                       Both
Many-to-many                     Single (Table1 to Table2), Single (Table2 to Table1), Both

© The Knowledge Academy Ltd 442


What is DAX?
• DAX, which stands for Data Analysis Expressions, is a collection of operators, functions, and constants that we can use in expressions or formulas

• DAX helps us to return values after making calculations from the already available data

• To understand DAX, you just need to be familiar with Microsoft Excel formulas

© The Knowledge Academy Ltd 443


What is DAX?
(Continued)

• DAX formulas are just like the ones we write in Microsoft Excel

• However, DAX functions and Excel functions differ in certain aspects

• Excel allows its users to reference cells or arrays. If users need similar behaviour in Power BI, they must use DAX functions

• DAX provides more data types than Microsoft Excel does

© The Knowledge Academy Ltd 444


Syntax
• DAX formulas begin with an = sign after which any scalar value can be provided

• The scalar value can be an expression that evaluates to a scalar or an expression that
can be converted to a scalar

Expressions can contain any of the following:

o Scalar expressions or values, and constants that use scalar operators such as +, -, *, /, >, =, && etc.

o Operators, constants, or references to columns

o References to columns or tables

o Constants specified as a part of an expression

o A function and its result, along with its arguments and parameters

© The Knowledge Academy Ltd 445


Syntax
(Continued)

• DAX requires that all its objects, whether tables or columns, have unique names

• Also, names of objects are case insensitive, i.e. Products and PRODUCTS would refer to the same table or column

• A column name should always be fully qualified, i.e. it must be preceded by the table name and written in square brackets, e.g. Sales[Product_Id]

• Sometimes table names will contain spaces in which case they must be enclosed in
single quotations

© The Knowledge Academy Ltd 446


Syntax
(Continued)

• A fully qualified name is required in the following circumstances:

1. When the VALUES function requires arguments

2. As arguments to the ALL or ALLEXCEPT functions

3. When passed as a filter argument while using the CALCULATE or CALCULATETABLE functions

4. As an argument to the RELATEDTABLE function

5. As an argument to any time intelligence function

© The Knowledge Academy Ltd 447


Functions
• Functions in DAX can be categorised into the following:

1. Date and Time Functions
2. Filter Functions
3. Time Intelligence Functions
4. Information Functions
5. Logical Functions
6. Math and Trig Functions
7. Text Functions
8. Many others as well

© The Knowledge Academy Ltd 448


Functions
(Continued)

• DAX functions always point to either a column or a table

• To specify only selected values you will need to filter them

• DAX is also capable of returning a whole table rather than a column only

• DAX works with Time Intelligence functions to perform dynamic calculations

© The Knowledge Academy Ltd 449


Row Context
• Row context is much easier to understand as compared to filter context

• The simplest way to visualise row context is to take a table and add a calculated column

• Each row in a table contains its own row context

• For instance, if a table has two columns a and b and row 1 values are 1 and 2
respectively

• Similarly, row 2 values are 3 and 4 respectively

• If you add a column c that sums the values of columns a and b, then the column c value for row 1 would be 3, and the value for row 2 would be 7
© The Knowledge Academy Ltd 450


Calculated Columns
• With calculated columns, you can append new data to an existing table in your model

• You can create a Data Analysis Expressions (DAX) formula that defines the column's values, rather than querying and loading values into your new column from a data source

• In Power BI desktop, calculated columns are generated by using the new column feature
in Report view

• Calculated columns that you create appear in the fields list just like any other field

© The Knowledge Academy Ltd 451


Calculated Columns
(Continued)

• But they will contain a special icon showing that their values are the result of a formula:

• You can name new columns whatever you want, and add them to a report visualisation
just like other fields

© The Knowledge Academy Ltd 452


Calculated Tables
• You can create a calculated table by using the New Table feature in a data view or report
view of Power BI desktop

• For instance, suppose you are a personnel manager who has a table of Sales_2019 and
another table of Sales_2020, and you want to combine both tables into a single table
called Sales

Sales_2019 Sales_2020

© The Knowledge Academy Ltd 453


Calculated Tables
• The following are the steps to create a calculated table:

Step 1: Click on the Modeling Tab and then select New Table

© The Knowledge Academy Ltd 454


Calculated Tables
Step 2: Enter the following formula in the formula bar

© The Knowledge Academy Ltd 455


Calculated Tables
Step 3: A new table named Sales is created and appears just like any other table in
the Fields pane:

© The Knowledge Academy Ltd 456


Measures
• Measures are generally used for data analyses

• Simple summarisations such as sums, averages, counts and minimum, maximum can be
set through the Fields well

• The calculated results of measures change as you interact with your reports, allowing for fast and dynamic ad-hoc data exploration

• In Power BI Desktop, measures are created in the data view or report view

• The measures that you create appear in the Fields list with a calculator icon

© The Knowledge Academy Ltd 457


Measures
(Continued)

• You can name measures whatever you want and add them to a new or existing visualisation just like any other field

© The Knowledge Academy Ltd 458


Module 12: Shaping and Combining Data

© The Knowledge Academy Ltd 459


Shaping and Combining Data
Power BI Desktop Queries
• Queries in Power BI Desktop are as essential as datasets are in the Power BI service

• It is the queries that form the basis of the reports and visualisations in Power BI

• A query is created in Power BI as soon as the command to fetch data (or Get Data to be
more precise) is given

• However, these tasks can only be performed from the query editor in the Power BI
desktop version

• Users can use multiple queries to get the results they want

© The Knowledge Academy Ltd 460


Shaping and Combining Data
(Continued)

• This is possible if they have already imported these datasets from some external source
such as Excel, CSV (comma-separated values) file, or some databases

• Click on the Edit Queries option to start working

© The Knowledge Academy Ltd 461


Shaping and Combining Data
(Continued)

• Power BI Desktop Queries can consist of the following:

1. Data Retrieved from a Single Table

2. Data Retrieved from Multiple Tables

3. Data Having Calculated Columns in the Query

4. Data Related to Another Table based on a Calculated Column

© The Knowledge Academy Ltd 462


The Query Editor
• Once a query is loaded, the Power Query Editor view becomes more interesting

• If we connect to a web data source, Power Query Editor loads information about the data, which you can then begin to shape

• The following steps show how power query editor appears once a data connection is
established:

Step 1: In the ribbon, various buttons are now active to interact with the data in the query

Step 2: In the left pane, queries are listed as well as available for selection, shaping and
viewing

Step 3: In the centre pane, data from the selected query is displayed or available for
shaping

© The Knowledge Academy Ltd 463


The Query Editor
Step 4: The Query Settings pane displays, listing the query's properties as well as applied
steps


© The Knowledge Academy Ltd 464


Shaping Data and Applied Steps
Shaping Data
• When you shape data in the Query Editor, you provide step-by-step instructions that the Query Editor carries out, adjusting the data as it loads and presenting it for you

• In the Power BI desktop, there is a lot that can happen to the data that has been
retrieved

• While in the query editor users can opt to remove columns/rows from a dataset or may
even add new columns to the existing columns

• The new columns can be populated with a calculated value also

© The Knowledge Academy Ltd 465


Shaping Data and Applied Steps
Applied Steps
• Whenever any action takes place in the Query Editor, the Applied Steps window lists the changes that have taken place

• An icon in front of each step we applied allows us to cancel the change we made

© The Knowledge Academy Ltd 466


Shaping Data and Applied Steps
(Continued)

• The case is shown below:

• There are various ways of removing the errors

• Remove Errors is one way in which all rows containing the errors would be removed

© The Knowledge Academy Ltd 467


Shaping Data and Applied Steps
(Continued)

• As we want to keep our data and rectify the errors, we are not going to use this option

• Click the column that has ERROR displayed to show the following:

© The Knowledge Academy Ltd 468


Shaping Data and Applied Steps
(Continued)

• The Query Editor provides the user with a Context Menu for Applied Steps

• The options in the menu include Rename, Delete, Delete Until End, Insert Step After, etc.

• So just choose the step and click Delete

© The Knowledge Academy Ltd 469


Advanced Editor
• The advanced editor enables you to view the code that power query editor is creating
with each step

• It also allows you to create your own shaping code

• To enable the advanced editor, select View from the ribbon, then select Advanced
Editor

© The Knowledge Academy Ltd 470


Advanced Editor
(Continued)

• A window appears that displays the existing query code as shown below:

© The Knowledge Academy Ltd 471


Formatting Data
• With conditional formatting for tables in Power BI Desktop, you can specify customised cell colours, including colour gradients, based on field values

• You can also represent cell values with data bars, active web links, or KPI icons

• You can apply conditional formatting to any text or data field, as long as you base the formatting on a field that contains numeric, colour name or hex code, or web URL values

• The following are the steps to apply conditional formatting:

Step 1: Select a Table or Matrix visualisation in Power BI desktop

© The Knowledge Academy Ltd 472


Formatting Data
Step 2: In the Fields section of the Visualisations pane, select the down-arrow next to the
field in the Values well that you want to format

© The Knowledge Academy Ltd 473


Formatting Data
Step 3: Click on Conditional formatting and then select the type of formatting to apply

© The Knowledge Academy Ltd 474


Transforming Data
• The following are the steps for transforming data:

Step 1: Open Power BI, choose Excel option from the Get Data

© The Knowledge Academy Ltd 475


Transforming Data
Step 2: Select the Excel file named as Employee and click Open

© The Knowledge Academy Ltd 476


Transforming Data
Step 3: Select the data which you want to transform and click on the Transform Data button

© The Knowledge Academy Ltd 477


Transforming Data
Step 4: The transformed data will be displayed as follows:

© The Knowledge Academy Ltd 478


Combining Data
• When we have two or more different data sources for creating our reports, combining them proves to be an efficient method

• To combine data, the easiest way would be to establish a relationship between the data
sources on a column

• Let us take a scenario where the user has a list of cities and their country codes in one data source, and the country codes and their respective country names in another, as shown on the next slide:

© The Knowledge Academy Ltd 479


Combining Data
(Continued)

© The Knowledge Academy Ltd 480


Combining Data
(Continued)

• Import both the files into the Power BI desktop

• Once done, click on the Relationships icon on the right-hand side to create a
relationship as shown:

© The Knowledge Academy Ltd 481


Combining Data
(Continued)

• Next, click on the Reports icon to see a result of the data you have combined

© The Knowledge Academy Ltd 482


Combining Data
(Continued)

• The data you entered has been mapped automatically using Power BI desktop

© The Knowledge Academy Ltd 483


Combining Data
(Continued)

• Data can also be combined from the web with some existing data

• Suppose we have some data that has organisation names and their respective US
country codes but not the country names, and a report requires that organisations be
listed with country names then we could take the web as a data source

Step 1: Choose Get Data > Web and provide the URL from where to retrieve the country
codes and country names. Click OK

© The Knowledge Academy Ltd 484


Combining Data
Step 2: Choose Table, and click Load

Step 3: Rest of the process is the same, click Relationships and then Reports

© The Knowledge Academy Ltd 485


Module 13: Interactive Data
Visualisations

© The Knowledge Academy Ltd 486


Page Layout and Formatting
• Page view settings are accessible in both the Power BI service and Power BI Desktop, with only a small difference in the interface

• In a Power BI report, the first set of page view settings controls the display of your report page relative to the browser window; choose between:

o Fit to page (default): Contents are scaled to fit the page best

o Fit to width: Contents are scaled to fit within the width of the page

o Actual size: Contents appear at full size

© The Knowledge Academy Ltd 487


Page Layout and Formatting
(Continued)

• The second set of page view settings controls the positioning of objects on the report canvas; choose between:

o Show gridlines: Turning on gridlines helps you to position objects on the report canvas

o Snap to grid: Use with Show gridlines to precisely position and align objects on the report canvas

o Lock objects: Lock all objects on the canvas so that they cannot be resized or moved

© The Knowledge Academy Ltd 488


Page Layout and Formatting
(Continued)

o Selection pane: The Selection pane lists all objects on the canvas, and you can
decide which to show and which to hide

© The Knowledge Academy Ltd 489


Page Layout and Formatting
Page Size Settings
• Page size settings are accessible only by report owners

• These settings are available in the Visualisations pane and control the actual size (in
pixels) as well as the display ratio of the report canvas:

o 4:3 ratio

o 16:9 ratio (default)

o Letter

© The Knowledge Academy Ltd 490


Page Layout and Formatting
(Continued)

o Custom (height and width in pixels)

© The Knowledge Academy Ltd 491


Multiple Visualisations
• Graphs and visualisations are an integral part of Power BI – both desktop and service

• There are innumerable visualisations in Power BI, and according to Microsoft the list will keep on growing

• As of now, Power BI offers visualisations that help the user build simple charts and also measure performance

• Power BI offers the following types of visualisations:

1. Area Charts
2. Doughnut Charts
3. Bar Charts
4. Funnel Charts
© The Knowledge Academy Ltd 492


Multiple Visualisations
(Continued)

5. Column Charts
6. Gauge Charts
7. Cards: Single Row or Multi-Row
8. Matrix Charts
9. Combo Charts
10. Pie Charts
11. Scatter Charts
12. Slicer Charts
13. Standalone Images
14. Tables

© The Knowledge Academy Ltd 493


Multiple Visualisations
(Continued)

15. Waterfall Charts
16. Tree Maps
17. KPIs
18. Bubble Charts
19. Line Charts
20. Maps

© The Knowledge Academy Ltd 494


Creating Charts
• The following are the steps to create a chart in Power BI report:

Step 1: Create a visualisation by selecting a field from the Fields pane

Step 2: Start with a numeric field like Customers > City > Customer_ID. Power BI creates a
column chart with a single column, and you can select the desired chart from visualizations

© The Knowledge Academy Ltd 495


Using Geographic Data
• Power BI integrates with Bing Maps to produce default map coordinates (a process called geocoding) so you can create maps

• The data may be placed in the Location, Latitude, and Longitude buckets of the visual's field well

• The following are the steps to represent the data by using a map chart in Power Bi
report:

Step 1: Start with a geography field, such as Geo > City > Customer_ID

Step 2: Power BI and bing maps create a map visualisation

© The Knowledge Academy Ltd 496


Using Geographic Data
(Continued)

© The Knowledge Academy Ltd 497


Histograms
• In Power BI, a histogram chart is used to show the frequency distribution of your data

• The histogram chart feature is not built into the Visualizations pane; you have to add it

• The following are the steps to add the histogram chart to the Visualizations pane:

Step 1: Go to the Visualizations pane and click on Get more visuals as shown in the given figure

© The Knowledge Academy Ltd 498


Histograms
Step 2: Enter histogram in the App Source search bar and click on histogram chart Add
button

© The Knowledge Academy Ltd 499


Histograms
Step 3: Click on OK. The histogram chart feature is successfully added in the visualization
pane

© The Knowledge Academy Ltd 500


Histograms
• The following are the steps to represent the data by using a histogram chart in Power BI
report:

Step 1: Click on Get Data icon on the ribbon and select Excel to import a workbook

© The Knowledge Academy Ltd 501


Histograms
Step 2: Select the table that contains that dataset, and then click on load

© The Knowledge Academy Ltd 502


Histograms
Step 3: To create the histogram, click on the histogram chart icon on the visualizations pane
and then add the appropriate fields:

© The Knowledge Academy Ltd 503


Power BI Admin Portal
• The admin portal allows you to manage a Power BI tenant for your organisation

• The admin portal consists of items such as usage metrics, access to the Microsoft 365
admin centre, and settings

• The full admin portal is open to all users who are global admins or have the Power BI service administrator role

• Make sure your account is marked as a Global Admin in Microsoft 365 or Azure Active Directory (Azure AD), or has the Power BI service administrator role, to get access to the Power BI admin portal

© The Knowledge Academy Ltd 504


Power BI Admin Portal
(Continued)

• Below are the steps to get the Power BI admin portal:

Step 1: Select the settings gear in the top right side of the Power BI service

Step 2: Click on the Admin portal

© The Knowledge Academy Ltd 505


Power BI Admin Portal
Step 3: The admin portal contains twelve tabs

© The Knowledge Academy Ltd 506


Service Settings
• The following are the steps to manage Common Data Service settings:

Step 1: You can manage and observe the settings for your environments by signing in to
the Power Platform admin centre

Step 2: Go to the Environments page, select an environment and then click on Settings

© The Knowledge Academy Ltd 507


Service Settings
Step 3: Setting for the selected environment can be managed in the given window:

© The Knowledge Academy Ltd 508


Desktop Settings
• Creators of Power BI content should become aware of the settings available in Power BI options and data source settings

• The configuration of these settings determines the available functionality, default behaviours, user interface options, performance, and the security of the data being used

• GLOBAL options apply to all Power BI Desktop files created or accessed by the user

© The Knowledge Academy Ltd 509


Desktop Settings
(Continued)

• CURRENT FILE options, by contrast, must be set for each Power BI Desktop file

© The Knowledge Academy Ltd 510


Dashboard and Report Settings
• Power BI includes a list of dashboard settings that are available to you

• To illustrate this, we are going to use the Adam Insights dashboard available in the Power BI workspace

• The following are the steps to change this Power BI dashboard settings:

Step 1: On the top right corner, click on the … button and then select the Settings option
from the context menu

© The Knowledge Academy Ltd 511


Dashboard and Report Settings
(Continued)

© The Knowledge Academy Ltd 512


Dashboard and Report Settings
Step 2: Select the Settings option to open the dashboard settings window

© The Knowledge Academy Ltd 513


Dashboard and Report Settings
Step 3: Click on Save button

© The Knowledge Academy Ltd 514


Congratulations

Congratulations on completing this course!


Keep in touch
info@theknowledgeacademy.com
Thank you

© The Knowledge Academy Ltd 515
