• In NumPy, the array object is called ndarray, and it comes with many supporting functions that make working with ndarray straightforward. Arrays are frequently used in data science, where speed and resources are essential
• Unlike lists, NumPy arrays are stored in one contiguous block of memory, so processes can access and manipulate them efficiently. In computer science, this behaviour is known as locality of reference
• This is the main reason why NumPy is faster than lists. It is also optimised for the latest CPU (Central Processing Unit) architectures
• NumPy is a Python library written partially in Python, but the parts that need fast computation are written in C or C++
• An ndarray is a table of elements (typically numbers), all of the same type, indexed by a tuple of non-negative integers
• NumPy's array class is called ndarray. It is also known by the alias array
Example
Output
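• As an illustrative sketch (the values are assumed for demonstration), an ndarray can be created from a nested list and its basic attributes inspected as follows:
import numpy as np

# Create a 2-D ndarray from a nested Python list
arr = np.array([[1, 2, 4], [5, 8, 7]], dtype=float)

print(type(arr))    # <class 'numpy.ndarray'>
print(arr.ndim)     # 2 (number of dimensions)
print(arr.shape)    # (2, 3) (rows, columns)
print(arr.dtype)    # float64 (all elements share one type)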
Slicing
• NumPy arrays can be sliced, like lists in Python. Because arrays can be multidimensional, you need to specify a slice for each dimension of the array
• In integer array indexing, lists of indices are passed for every dimension. A new arbitrary array is constructed by a one-to-one mapping of the corresponding elements
• Boolean array indexing is used when picking elements from the array which satisfy some condition
Example
Output
Example
Output
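• Since the slide code is not shown here, the following is a minimal sketch of the three indexing styles described above, using made-up values:
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Slicing: one slice per dimension (rows 0-1, columns 1-2)
print(arr[:2, 1:3])                 # [[2 3] [6 7]]

# Integer array indexing: element-wise mapping of row and column indices
print(arr[[0, 1, 2], [3, 2, 1]])    # [ 4  7 10]

# Boolean array indexing: pick elements satisfying a condition
print(arr[arr > 6])                 # [ 7  8  9 10 11 12]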
Unary operators
• Various unary operations are provided as methods of the ndarray class, including min, sum, max, etc. By setting an axis parameter, these functions can also be applied column-wise or row-wise
Example
Output
Binary operators
• These operations are applied to arrays element-wise, and a new array is created. All basic arithmetic operators such as +, -, /, etc., can be used. The existing array is modified in place in the case of the +=, -= and *= operators
Example
Output
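• A hedged sketch of both the unary methods with an axis argument and the element-wise binary operations (the arrays are illustrative only):
import numpy as np

a = np.array([[1, 5, 3],
              [4, 2, 6]])
b = np.array([[10, 20, 30],
              [40, 50, 60]])

# Unary operations, optionally along an axis
print(a.sum())          # 21 (whole array)
print(a.min(axis=0))    # [1 2 3] column-wise minimum
print(a.max(axis=1))    # [5 6]   row-wise maximum

# Binary operations create a new array element-wise
print(a + b)            # [[11 25 33] [44 52 66]]
print(a * 2)            # [[ 2 10  6] [ 8  4 12]]

# In-place operators modify the existing array
a += 1
print(a)                # [[2 6 4] [5 3 7]]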
• min, sum, mean, max, average, median, product, standard deviation, argmin, variance, percentile, argmax, cumsum, cumprod, and corrcoef are the Python NumPy aggregate functions
• The following arrays are used in order to demonstrate these Python NumPy aggregate functions:
• The Python NumPy sum function calculates the sum of the values in an array
• The sum function accepts an optional argument named axis, which lets the aggregate be computed along a given axis. For instance, axis = 0 returns the sum of each column of a NumPy array
• The Python NumPy min function returns the minimum value in an array, or along a given axis
• Here, we are finding the minimum value of the NumPy array along each axis
• The Python NumPy max function returns the maximum value in an array, or along a given axis
• Using the NumPy max function, find the maximum value along each axis
• The Python NumPy mean function returns the average or mean of a given array, or along a given axis. Mathematically, the mean is the sum of all the items in an array divided by the number of items
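• A brief sketch of a few of these aggregates with the axis argument (the values are made up for illustration):
import numpy as np

x = np.array([[10, 20, 30],
              [ 5, 15, 25]])

print(np.sum(x, axis=0))     # [15 35 55] column sums
print(np.min(x, axis=1))     # [10  5]    row minimums
print(np.max(x))             # 30         overall maximum
print(np.mean(x, axis=0))    # [ 7.5 17.5 27.5]
print(np.cumsum(x, axis=1))  # running sum along each row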
• It does this without creating unnecessary copies of data, which leads to efficient algorithm implementations
• Example
Output
Broadcasting Rules:
• The following are the rules for broadcasting two arrays together:
1. If the arrays do not have the same rank, prepend the shape of the lower-rank array with 1s until both shapes have the same length
2. The two arrays are compatible in a dimension if they have the same size in that dimension, or if one of them has size 1 in that dimension
3. The arrays can be broadcast together if they are compatible in all dimensions
4. After broadcasting, each array behaves as if it had a shape equal to the element-wise maximum of the shapes of the two input arrays
5. In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension
Output
Output
Output
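• As the broadcasting outputs are not reproduced above, here is a minimal sketch of the rules in action with illustrative shapes:
import numpy as np

a = np.ones((3, 4))               # shape (3, 4)
b = np.arange(4)                  # shape (4,), treated as (1, 4)
c = np.arange(3).reshape(3, 1)    # shape (3, 1)

# (3, 4) + (1, 4): b is copied along the first dimension
print((a + b).shape)              # (3, 4)

# (3, 1) + (1, 4): both arrays are stretched to (3, 4)
print((c + b).shape)              # (3, 4)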
• greater_equal, greater, less_equal, less, equal, and not_equal are the Python NumPy comparison functions. The Python NumPy comparison operators are <, <=, >, >=, == and !=
• The numpy random randint function can be used for generating random two-dimensional and three-dimensional integer arrays
• The first array generated is a two-dimensional array with 5 rows and 8 columns, and the values are between 10 and 50
• The second array is a random three-dimensional array of size 2*3*6. The generated random values are between 1 and 20
Output
• Here, the Python NumPy greater function is used on 2-dimensional and 3-dimensional arrays
• The first greater call checks whether the values in the 2-D array are greater than 30 or not
• If true, Boolean True is returned; otherwise, False is returned. Next, we check whether the array elements in the 3-D array are greater than 10 or not
Output
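• Since the slide code is omitted, the following hedged sketch shows random integer arrays and the greater function as described above:
import numpy as np

# 2-D array of shape (5, 8) with values between 10 and 50
arr_2d = np.random.randint(10, 50, size=(5, 8))

# 3-D array of shape (2, 3, 6) with values between 1 and 20
arr_3d = np.random.randint(1, 20, size=(2, 3, 6))

# Element-wise comparison returns Boolean arrays of the same shape
print(np.greater(arr_2d, 30))   # True where elements exceed 30
print(np.greater(arr_3d, 10))   # True where elements exceed 10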
• The Python NumPy greater_equal function checks whether the given array elements are greater than or equal to a specified number. It returns True if so, otherwise False
• The first NumPy statement checks whether the items in the array are greater than or equal to 2. The second NumPy statement checks whether the items in a random 2-dimensional array are greater than or equal to 25
• The third statement checks whether the randomly generated 3-D array items are greater than or equal to 7
Output
• The Python NumPy less function checks whether the elements in a given array are less than a specified number
• If true, Boolean True is returned; otherwise, False. The syntax of this Python NumPy less function is:
numpy.less(array_name, integer_value)
Output
• The Python NumPy less_equal function checks whether each element in a provided array is less than or equal to a specified number. If true, Boolean True is returned; otherwise, False
numpy.less_equal(array_name, integer_value)
Output
Operator    Meaning
~           negation (logical “not”)
&           logical “and”
|           logical “or”
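• A short sketch of combining comparison masks with these element-wise logical operators (illustrative data):
import numpy as np

x = np.array([3, 7, 12, 18, 25])

mask = (x > 5) & (x < 20)    # logical "and" of two comparisons
print(x[mask])               # [ 7 12 18]

print(x[~(x > 5)])           # negation: values not greater than 5 -> [3]
print(x[(x < 5) | (x > 20)]) # logical "or" -> [ 3 25]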
• Example 1:
Output
• Example 2:
Output
• Example 3:
Output
Example
Output
• Fancy indexing is conceptually simple: it means passing an array of indices in order to access multiple array elements at once
Output
• Alternatively, we can pass a single list or array of indices to obtain the same result:
• When using fancy indexing, the shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed:
Output
• Fancy indexing also works in multiple dimensions. See the example shown below:
Output
• As with standard indexing, the first index refers to the row and the second to the column:
Output
• The pairing of indices in fancy indexing follows the broadcasting rules. Therefore, for instance, we get a two-dimensional result if we combine a column vector and a row vector within the indices:
Output
• With fancy indexing, always remember that the return value reflects the broadcasted shape of the indices, not the shape of the array being indexed
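• The example outputs are not reproduced above, so here is a minimal sketch of fancy indexing in one and two dimensions, including the broadcasting behaviour of index arrays:
import numpy as np

x = np.arange(12)
ind = np.array([[3, 7],
                [4, 5]])
print(x[ind])            # result has the 2x2 shape of the index array

X = np.arange(12).reshape(3, 4)
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
print(X[row, col])       # [ 2  5 11] -- paired row/column indices

# Combining a column vector and a row vector of indices broadcasts
print(X[row[:, np.newaxis], col])   # 3x3 result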
Combined Indexing
• Fancy indexing can be combined with the other indexing schemes for more powerful
operations:
Output
Output
• All of these indexing options combined lead to a very flexible set of operations for accessing and modifying array values
• Notice that repeated indices with these operations can cause some potentially unexpected outcomes
• The outcome of this operation is to first assign A[0] = 2, followed by A[0] = 8. The result is that A[0] contains the value 8
• An ordered sequence is any sequence whose elements follow an order, such as ascending or descending, alphabetical or numeric
• The NumPy ndarray object has a function named sort(), which will sort an array
Example
Output
• You can also sort arrays of strings, or of any other data type:
Output
Output
Output
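• A brief sketch of sorting numeric, string, and Boolean arrays as described (the values are illustrative):
import numpy as np

nums = np.array([3, 2, 0, 1])
print(np.sort(nums))                 # [0 1 2 3]

names = np.array(['banana', 'cherry', 'apple'])
print(np.sort(names))                # ['apple' 'banana' 'cherry']

bools = np.array([True, False, True])
print(np.sort(bools))                # [False  True  True]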
• A structured array uses data containers called fields. Each data field can contain data of any size and type. Array elements can be accessed with the help of dot notation
• For instance, consider a structured array of students with different fields such as year, name, and marks
• Every record in the student array has the same structure, similar to a struct; an array of such records is therefore referred to as a structured (struct) array
Output
Example
• The structured array can be sorted by using the numpy.sort() method and passing the order as a parameter. This parameter takes the field according to which the array is to be sorted
Output
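• As the slide code is not shown, here is a small sketch of defining a structured array and sorting it by a field via the order parameter; the field names and values are illustrative:
import numpy as np

# Define the field names and types, then the records
student = np.array([('Asha', 2019, 88.5),
                    ('Ravi', 2021, 72.0),
                    ('Meena', 2020, 91.2)],
                   dtype=[('name', 'U10'), ('year', 'i4'), ('marks', 'f4')])

print(student['name'])                    # access a whole field
print(np.sort(student, order='marks'))    # sort records by the marks field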
• Series, DataFrame and Index are the three basic Pandas data structures
• A Pandas Series is a 1-D array of indexed data. It can be created from an array or list as shown in the following screenshot:
• As shown in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:
• As with a NumPy array, data can be accessed by the associated index through the familiar Python square-bracket notation:
• The Pandas Series is much more general and flexible than the 1-D NumPy array that it emulates
• The essential difference is the presence of the index: whereas the NumPy array has an implicitly defined integer index used to obtain the values, the Pandas Series has an explicitly defined index associated with the values
• This explicit index definition gives the Series object additional capabilities. The index need not be an integer but can be made up of values of any desired type. For instance, we can use strings as an index:
• This typing is significant: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations
• By default, a Series will be built where the index is drawn from the sorted keys. Typical dictionary-style item access can be performed from here:
• For instance, data can be a list or NumPy array, in which case index defaults to an integer sequence:
• Data can be a scalar, which is repeated to fill the specified index:
• Data can be a dictionary, in which case index defaults to the sorted dictionary keys
• In each case, the index can be set explicitly if a different result is preferred:
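• Here is a short sketch of the Series construction patterns just listed, using illustrative values:
import pandas as pd

# From a list: index defaults to an integer sequence
s1 = pd.Series([0.25, 0.5, 0.75, 1.0])

# With an explicit (string) index
s2 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(s2['b'])              # 0.5 -- dictionary-style access

# From a scalar, repeated to fill the specified index
s3 = pd.Series(5, index=[100, 200, 300])

# From a dictionary: the keys become the index
s4 = pd.Series({'b': 2, 'a': 1, 'c': 3})
print(s4.values, s4.index)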
• If a Series is an analogue of a 1-D array with flexible indices, then a DataFrame is an analogue of a 2-D array with both flexible row indices and flexible column names
• To demonstrate this, first make a new Series listing the area of each of the five states:
• To construct a single 2-D object containing this information, we can use a dictionary:
• Similar to the Series object, the DataFrame has an index attribute which provides
access to the index labels:
• In addition, the DataFrame has a columns attribute, which is an Index object containing
the column labels:
• For instance, 'area' attribute returns the Series object holding the areas:
• For a 2-D NumPy array, data[0] will return the first row, whereas for a DataFrame, data['col0'] will return the first column
• A Pandas DataFrame can be constructed in various ways. The following are several examples:
o From a list of dicts: Any list of dictionaries can be made into a DataFrame
o Even if a few keys are missing in the dictionary, they will be filled by Pandas with
NaN which means "not a number" values:
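• A minimal hedged sketch of constructing a DataFrame from a dictionary of Series and from a list of dicts, with made-up state data standing in for the slide's example:
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})

# From a dictionary of Series objects
states = pd.DataFrame({'area': area, 'population': population})
print(states.index)      # row labels
print(states.columns)    # column labels

# From a list of dicts; missing keys are filled with NaN
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df)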
• As we saw in the previous slides, a Series object acts in many ways like a one-
dimensional NumPy array, as well as in many ways like a standard Python dictionary
• If we keep these two overlapping analogies in mind, it will help us to understand the
patterns of data indexing as well as selection in these arrays
Series as dictionary
• Like a dictionary, the Series object provides a mapping from a group of keys to a
collection of values:
• We can also use dictionary-like Python expressions as well as methods to examine the
keys or indices as well as values:
• Series objects can even be altered with a dictionary-like syntax. Just as you can extend a
dictionary by assigning to a new key, you can extend a Series by assigning to a new
index value:
• This easy mutability of the objects is a useful feature: under the hood, Pandas is making
decisions about memory layout as well as data copying that might need to take place;
the user generally does not need to worry about these issues
• Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice
• These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index
• Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes which explicitly expose certain indexing schemes
• These are not functional methods, but attributes which expose a particular slicing interface to the data in the Series
• First, the loc attribute allows indexing and slicing which always references the explicit index:
• The iloc attribute allows indexing and slicing which always references the implicit Python-style index:
• A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame objects, which we will discuss in a moment
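• A small sketch of the explicit versus implicit indexers on a Series with an integer index (values are illustrative):
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: always the explicit index
print(data.loc[1])       # 'a'
print(data.loc[1:3])     # includes the final label -> 'a', 'b'

# iloc: always the implicit, 0-based positional index
print(data.iloc[1])      # 'b'
print(data.iloc[1:3])    # excludes the final position -> 'b', 'c'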
DataFrame as a dictionary
• The first analogy we will consider is the DataFrame as a dictionary of related Series
objects. Let us return to our example of areas and populations of states:
Output
• The individual Series which make up the columns of the DataFrame can be retrieved
through dictionary-style indexing of the column name:
• Equivalently, we can use attribute-style access with column names which are strings:
• This attribute-style column access actually accesses exactly the same object as the dictionary-style access:
• Though this is a useful shorthand, remember that it does not work in all cases! For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, attribute-style access is not possible
• For instance, the DataFrame has a pop() method, so data.pop will point to this rather
than the "pop" column:
• In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z)
• Like with the Series objects discussed earlier, this dictionary-style syntax can also be
used to alter the object, in this case adding a new column:
• There are a couple of extra indexing conventions which might seem at odds with the preceding discussion, but nonetheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:
• Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
• These two conventions are syntactically similar to those on a NumPy array, and while they may not quite fit the mould of the Pandas conventions, they are nevertheless quite useful in practice
• Pandas inherits much of this functionality from NumPy, as well as the ufuncs (Universal
functions) which we introduced in Computation on NumPy Arrays: Universal Functions
are key to this
• Pandas contains a couple of valuable twists, however: for unary operations such as negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc
• This means that keeping the context of data and combining data from different sources, both potentially error-prone tasks with raw NumPy arrays, become essentially foolproof with Pandas
• We will additionally see that there are well-defined operations between 1-D Series
structures and 2-D DataFrame structures
• Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series as well as DataFrame objects
• If we apply a NumPy ufunc on either of these objects, the result will be another Pandas
object with the indices preserved:
• For binary operations on two Series or DataFrame objects, Pandas will align indices in
the process of performing the operation
• For example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:
• Let's see what happens when we divide these to compute the population density:
• The resulting array holds the union of the indices of the two input arrays, which could be determined by using standard Python set arithmetic on these indices:
• Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data
• This index matching is applied in the same way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:
• If using NaN values is not the desired behaviour, the fill value can be modified using suitable object methods in place of the operators
• For instance, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:
• A similar kind of alignment takes place for both columns and indices when performing operations on DataFrames:
• Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted
• As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries
• Here we will fill with the mean of all values in A (computed by first stacking the rows of A):
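• A hedged sketch of index alignment and the fill_value option described above, with illustrative numbers:
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Plain addition aligns on the union of indices; missing entries become NaN
print(A + B)                      # index 0 and 3 are NaN

# add() lets us specify a fill value for missing entries
print(A.add(B, fill_value=0))

dfA = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dfB = pd.DataFrame({'b': [10, 20], 'c': [30, 40]})
fill = dfA.stack().mean()         # mean of all values in dfA
print(dfA.add(dfB, fill_value=fill))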
• The following table lists Python operators and their equivalent Pandas object methods:
+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()
• Missing data is a very common problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in Pandas
• Many datasets simply arrive with missing data in the DataFrame, either because it exists but was not collected or because it never existed
• For instance, different users being surveyed may choose not to share their income, and some users may choose not to share their address; in this way many datasets end up with missing values
o None: None is a Python singleton object which is usually used for missing data in
Python code
o NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point representation
• Pandas treats None and NaN as essentially interchangeable for indicating missing or null values
• To facilitate this convention, there are various useful functions for detecting, removing,
as well as replacing null values in Pandas DataFrame
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
• In order to check for missing values in a Pandas DataFrame, we use the isnull() and notnull() functions
• Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series in order to find null values in a series
• In order to check for null values in a Pandas DataFrame, we use the isnull() function; this function returns a DataFrame of Boolean values which are True for NaN values
Example 1
Example 2
• As shown in the output image, only the rows having Gender = NULL are displayed
• In order to check for non-null values in a Pandas DataFrame, we use the notnull() function; this function returns a DataFrame of Boolean values which are False for NaN values
Example 3
Output
Example 4
• As shown in the output image, only the rows having Gender = NOT NULL are displayed
• In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; these functions replace NaN values with some value of their own
• The interpolate() function is also used to fill NA values in the DataFrame, but it uses various interpolation methods to fill the missing values rather than hard-coding the value
Output
Output
Output
Output
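• A small hedged sketch of the detection and filling functions listed above, on a made-up DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Score': [90, np.nan, 75, np.nan],
                   'Gender': ['F', None, 'M', None]})

print(df.isnull())                         # True where values are missing
print(df.notnull())                        # False where values are missing

print(df['Gender'].fillna('No Gender'))    # fill nulls with a constant
print(df['Score'].replace(np.nan, -99))    # replace NaN with -99
print(df['Score'].interpolate(method='linear'))  # interpolate numeric gaps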
• Now we are going to fill all the null values in the Gender column with “No Gender”
Output
Output
• Now we are going to replace all the NaN values in the data frame with the value -99
Output
Example 6: Using interpolate() function to fill the missing values using linear method.
Output
• Interpolate the missing values using the linear method. Note that the linear method ignores the index and treats the values as equally spaced
• As we can see in the output, the values in the first row could not get filled, as the direction of filling is forward and there is no previous value which could have been used in the interpolation
• Moreover, for combination with another Index with n_repeat items, it is useful to repeat and rearrange a MultiIndex
Example 1
Output
Output
• As you can see in the following output figure, the labels in the returned MultiIndex are repeated 2 times
Output
• Now let’s repeat and reshuffle the labels of the MultiIndex 2 times
Output
• As you can see in the output figure, the labels are repeated and reshuffled twice in the returned MultiIndex
Output
• Assume we want to associate particular keys with each of the pieces of the chopped up
DataFrame. This can be done by using the keys argument:
Output
• Set ignore_index to True if the resultant object has to follow its own indexing
Output
• Note that the index changes entirely, and the keys are overridden as well
• New columns will be added if two objects need to be concatenated along axis=1
Output
Output
Output
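• As the concat outputs are not reproduced, here is a minimal sketch of pd.concat() with keys, ignore_index, and axis=1, using illustrative frames:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Associate keys with each piece of the concatenated result
print(pd.concat([df1, df2], keys=['x', 'y']))

# Let the result follow its own fresh integer index
print(pd.concat([df1, df2], ignore_index=True))

# Concatenating along axis=1 adds new columns instead of rows
print(pd.concat([df1, df2], axis=1))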
• A DataFrame is a 2-D data structure, which means the data is aligned in a tabular form in rows and columns
• There are various methods by which we can merge, join and concatenate DataFrames:
1. Concatenating DataFrame by using .concat()
2. Concatenating DataFrame by ignoring indexes
3. Concatenating DataFrame by using .append()
4. Concatenating DataFrame by setting logic on axes
5. Concatenating DataFrame with mixed ndims
6. Concatenating DataFrame with group keys
Output
Output
• This function existed before .concat(). The following is the output we get before applying the .append() function:
Output
Output
• When concatenating DataFrames by ignoring indexes, we ignore those indexes which do not carry meaningful information
• You may wish to append the DataFrames and ignore the fact that they may have overlapping indexes. We use ignore_index as an argument to do that
Output
Output
• To concatenate DataFrames with group keys, we override the column names with the use of the keys argument
• The keys argument overrides the column names when creating a new DataFrame based on existing Series
Output
Output
Output
• We can aggregate by selecting a column through the standard getitem method, or by passing a function to the whole DataFrame
Output
Output
Output
Output
Output
o Applying a function
• In many situations, we split the data into sets and apply some functionality to each subset. We can perform the following operations with the apply functionality
• Let us now create a DataFrame object as well as perform all the operations on it:
Output
• Pandas objects can be split on any of their axes. There are various ways to split an object, such as:
o obj.groupby('key')
o obj.groupby(['key1','key2'])
o obj.groupby(key,axis=1)
• Now see how the grouping objects can be applied to the DataFrame object
Output
View Groups
• With the groupby object in hand, we can iterate through the object, similar to itertools.obj.
Output
• By default, the groupby object has the same label name as the group name
Select a Group
Output
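• A short hedged sketch of the groupby patterns listed above, with made-up data:
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'B', 'A', 'B'],
                   'Year': [2019, 2019, 2020, 2020],
                   'Points': [10, 7, 12, 9]})

grouped = df.groupby('Team')
print(grouped.groups)               # view the groups

for name, group in grouped:         # iterate through the groups
    print(name)
    print(group)

print(grouped.get_group('A'))       # select a single group
print(df.groupby(['Team', 'Year'])['Points'].sum())   # aggregate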
• Levels in the pivot table will be stored in MultiIndex objects on the index and columns
of the result DataFrame
Example
Output
Output
Output
• As we know, tools like NumPy and Pandas generalise arithmetic operations so that we can easily and quickly perform the same operation on many array elements
Example
• For arrays of strings, NumPy does not provide such simple access, and thus you are stuck using a more verbose loop syntax:
• This is perhaps sufficient to work with some data, but it will break if there are any missing values
• Pandas includes features to address both this need for vectorised string operations as
well as for properly handling missing data through the str attribute of Pandas Series as
well as Index objects containing strings
• So, for instance, suppose we create a Pandas Series with this data:
• Now we can call a single method which will capitalise all the entries, while skipping over any missing values:
• Using tab completion on this str attribute will list all the vectorised string methods
available to Pandas
• Nearly all Python's built-in string methods are mirrored by a Pandas vectorised string
method. Here is a list of Pandas str methods which mirror Python string methods:
len(), ljust(), rjust(), center(), zfill(), strip(), translate(), among others
• Notice that these methods (shown on the previous slides) have various return values. Some, like lower(), return a series of strings:
• Or Boolean values:
• Still others return lists or other compound values for each element:
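• A minimal sketch of the vectorised str interface described above, showing string, Boolean, and list-like return values (the data is illustrative):
import pandas as pd

names = pd.Series(['peter Parker', 'BRUCE wayne', None, 'diana prince'])

print(names.str.capitalize())        # strings, missing values skipped
print(names.str.len())               # numeric lengths
print(names.str.startswith('b'))     # Boolean values
print(names.str.split())             # lists of words for each element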
• In this module of Pandas we can include the date and time for each record and fetch the DataFrame records accordingly
• By using the Pandas time series functionality we can find the data within a specific range of dates and times
Example 1
Output
• In this code, for the date range 1/1/2019 – 8/1/2019, we have created timestamps with a frequency of minutes. We can vary the frequency from hours to seconds or minutes
• This function will help you to track the records of data stored per minute. The length of the datetime index is 10081, as we can see in the output
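• A brief hedged sketch of pd.date_range() for the range and frequency described above:
import pandas as pd

# Minute-frequency timestamps between the two dates
rng = pd.date_range(start='1/1/2019', end='1/8/2019', freq='min')
print(len(rng))       # 10081 timestamps
print(rng[:5])        # first five timestamps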
Example 2
Output
Example 3
Output
• We first created a time series, then converted this data into a DataFrame, and used the random function to generate random data and map it over the DataFrame. Then we use the print function to check the result
Example 4
• This code takes the elements of data_rng and converts them to strings. Moreover, because there is a lot of data, we slice the data and print only the first ten values of the list string_data
• We get all the values which are in the series range_date by using a for-each loop over the list. We always have to specify the start and end date when we are using date_range
Example 5
Output
• Analysing data requires many filtering operations. Pandas provides various methods for filtering a DataFrame, and DataFrame.query() is one of them
• The data is filtered based on a single condition in this example. The spaces in column names have been replaced with ‘_’ before applying the query() method
Output
Example 1
• In order to evaluate the sum of all column elements in the DataFrame and insert the resulting column into the DataFrame, use the eval() function
Output
• Now, evaluate the sum over all the columns and add the resultant column to the
dataframe:
Output
Example 2: To evaluate the sum of any two column elements in the DataFrame and insert the resulting column into the DataFrame, use the eval() function. The DataFrame has NaN values
Output
Output
• Note that the resulting column ‘D’ has a NaN value in the last row, as the corresponding cell used in the evaluation was a NaN cell
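• A small hedged sketch of query() and eval() as described; the column names are illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, np.nan], 'C': [5, 6, 7]})

# Filter rows on a single condition
print(df.query('A > 1'))

# Evaluate an expression and insert the result as a new column
df.eval('D = A + B', inplace=True)
print(df)          # D is NaN in the last row because B is NaN there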
• One of the greatest advantages of visualisation is that it gives us visual access to large amounts of data in easily digestible visuals
• Matplotlib consists of various plots such as line, bar, scatter, histogram etc.
Line Plot
Bar Plot
Histogram
Scatter Plot
fig = plt.figure()
• Now, add axes to the created figure. The add_axes() method needs a list object of 4 elements corresponding to the left, bottom, width and height of the figure. Every number should be between 0 and 1
ax=fig.add_axes([0,0,1,1])
ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')
ax.plot(x,y)
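• Putting the fragments above together, here is a minimal runnable sketch of the sine-wave figure; the data generation step is assumed, since it is not shown on the slide:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)    # assumed x range
y = np.sin(x)

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])      # [left, bottom, width, height]
ax.set_title("sine wave")
ax.set_xlabel('angle')
ax.set_ylabel('sine')
ax.plot(x, y)
plt.show()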
Example:
Example 1
• Example 2
• Example 3
Basic Errorbars
• With a single Matplotlib function, a basic errorbar can be created:
• Example 1:
• Example 2:
• This function takes three arguments: a grid of x values, a grid of y values, as well as a
grid of z values
• The x as well as y values signify positions on the plot, and the contour levels will
represent the z values
• The lines in the plot can be colour-coded by specifying a colourmap with the cmap argument
• Also, we will specify that we want more lines to be drawn, i.e. 20 equally spaced intervals within the data range:
• Matplotlib has a wide range of colourmaps that you can easily browse in IPython by typing plt.cm. and then pressing the Tab key
plt.cm.<TAB>
• We can also create a filled contour plot by using the plt.contourf() function
• The colorbar makes it clear that the black regions are peaks, while the red regions are valleys
• The hist() function has several options to tune both the calculation as well as the
display; here is an example of more customised histogram:
• The plt.hist docstring has more information on other customisation options available
• The two-dimensional histogram creates a tessellation of squares across the axes. The regular hexagon is another natural shape for such a tessellation
• Matplotlib provides the plt.hexbin routine for this purpose, which represents a two-dimensional dataset binned within a grid of hexagons:
• plt.hexbin has a number of interesting options, including the ability to specify weights
for each point, as well as to alter the output in each bin to any NumPy aggregate (mean
of weights, standard deviation of weights, etc.)
Example:
• But, there are several ways we might want to customise such a legend. For instance, we
can define the location as well as turn off the frame:
• We can use the ncol command for specifying the number of columns in the legend:
• We can use a fancybox (rounded box) or add a shadow, alter the transparency (alpha
value) of the frame, or alter the padding around the text:
• We can fine-tune which elements as well as labels appear in the legend using the
objects returned by the plot commands
• The plt.plot() command can create multiple lines at once, and it returns a list of the created line instances. Passing any of these to plt.legend() will tell it which lines to identify, along with the labels we would like to specify:
• Now, applying labels to the plot elements which should show on the legend:
Multiple Legends
• By using plt.axes
• Example of fig.add_axes()
• The command plt.subplots_adjust is used for adjusting the spacing between these plots. The following example uses the equivalent object-oriented command fig.add_subplot():
• Note that by default, the text is aligned above and to the left of the specified coordinates: here the "." at the beginning of each string will approximately mark the given coordinate location
• The transData coordinates give the common data coordinates associated with the x- as
well as y-axis labels
• The transAxes coordinates give the location from the bottom-left corner of the axes
(here the white box), as a fraction of the axes size
• The transFigure coordinates are similar, but specify the position from the bottom-left of the figure (here the grey box), as a fraction of the figure size
• Notice now that if we alter the axes boundaries, it is only the transData coordinates
that will be affected, whereas the others remain static:
• Drawing arrows in Matplotlib is often much harder than you would expect. While there is a plt.arrow() function available, the arrows it creates are SVG (Scalable Vector Graphics) objects which are subject to the varying aspect ratio of your plots, and as a result are rarely what the user intended
• The plt.annotate() function creates some text and an arrow, and the arrows can be specified very flexibly
• In the following code, we will use an elevation of 60 degrees (that is, 60 degrees above
the x-y plane) as well as an azimuth of 35 degrees (that is, rotated 35 degrees counter-
clockwise about the z-axis):
• These take a grid of values as well as project it onto the specified three-dimensional
surface, and can make the resulting three-dimensional forms quite easy to visualise
• A surface plot is similar to a wireframe plot, but each face of the wireframe is a filled polygon. Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualised:
• Note that even though the grid of values for a surface plot needs to be two-
dimensional, it need not be rectilinear
• Here is an example of creating a partial polar grid, which, when used with the surface3D plot, can give us a slice into the function we are visualising:
• For installing the latest version of Seaborn, you could use pip:
Output
• This function gives quick access to a small number of example datasets that are useful for documenting Seaborn and generating reproducible illustrations for bug reports
• Remember that some of the datasets have a small amount of preprocessing applied in order to define a proper ordering for the categorical variables
• If True, try to load the dataset from the local cache first, and save it to the cache if a download is required
• Seaborn comes with some datasets, and we have used some datasets
Output
• KDE is a procedure for estimating the probability density function of a continuous random variable, and it is used for non-parametric analysis
• Setting the hist flag to False in distplot yields the kernel density estimate plot
Output
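• A hedged sketch of the KDE plot described above, using one of Seaborn's bundled example datasets (recent Seaborn versions use kdeplot; the older distplot call mentioned on the slide is noted in a comment):
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')      # bundled example dataset

# Kernel density estimate of the total bill
# (older Seaborn: sns.distplot(tips['total_bill'], hist=False))
sns.kdeplot(tips['total_bill'])
plt.show()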
• Usually, we check for multicollinearity while building a regression model, where we have to examine the correlation between all combinations of continuous variables
• In Seaborn, there are two main functions for visualising a linear relationship determined through regression. These functions are regplot() and lmplot()
regplot(): accepts the x and y variables in a variety of formats, including simple NumPy arrays, Pandas Series objects, or references to variables in a Pandas DataFrame
lmplot(): has data as a required parameter, and the x and y variables must be specified as strings. This data format is called “long-form” data
• The Matplotlib library is highly customisable, but knowing which settings to tweak to achieve an attractive and anticipated plot is what one must be aware of in order to make use of it
• Unlike Matplotlib, Seaborn comes packed with customised themes and a high-level
interface for controlling and customising the look of Matplotlib figures
Output
• For manipulating the styles, the interface is set_style(). By using this function, you can set the theme of the plot. According to the latest version, the following are the five themes: darkgrid, whitegrid, dark, white, and ticks
Output
• Dots are used in the scatter plot for representing values in two distinct numeric
variables
• The position of each dot on the vertical and horizontal axis indicates values for a single
data point
• A hexbin plot is useful for representing the relationship between two numerical variables when you have a lot of data points
• Instead of overlapping, the plotting window is split into numerous hexbins, and the number of points per hexbin is counted
• The colour indicates this number of points. This can be done readily using the hexbin function of Matplotlib
Output
• Kernel Density Estimate is used for visualising the Probability Density of a continuous
variable
Output
• It corresponds to a box plot with a rotated kernel density plot on each side, providing more information about the density estimate on the y-axis
• The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin
• The benefit of a violin plot is that it can depict nuances in the distribution that are not perceptible in a boxplot
• On the other hand, the boxplot more clearly indicates the outliers in the data
• Violin plots contain more information than box plots, but they are less popular; their meaning can be more difficult to grasp, and many readers are not familiar with the violin plot representation
Example of Boxplot:
• It uses plots for visualising the relationship between variables. Variables can be either numerical or categorical, such as a class, group, or division
• Seaborn, besides being a statistical plotting library, also provides some default datasets. We will be using one such default dataset known as ‘tips’
• The ‘tips’ dataset holds information regarding people who had food at a restaurant: whether or not they left a tip for the waiters, their gender, whether they smoke, and so on
Output
• A barplot is used to aggregate categorical data according to some method, the mean by default
Syntax:
Output
Syntax:
Output
• It is similar to a strip plot except for the fact that the points are adjusted so that they do not overlap. Some people also like combining the idea of a violin plot and a strip plot to form this plot
• One disadvantage of swarm plots is that they do not scale well to huge numbers of points and take a lot of computation to arrange
• So, in case we need to visualise a swarm plot clearly, we can plot it on top of a violin plot
Output
1. Base knowledge in which the system is aware of the answer, thus enabling the system to learn
2. By adapting its behaviour, solving problems more accurately and more efficiently
• Machine learning allows computers or machines to automatically adjust and customise themselves instead of being explicitly programmed to carry out specific tasks
o While shopping on the internet, users are presented with advertisements related to
their purchases
o When using an app to book a cab ride, the app will provide an estimation of the
price of that ride. When using these services, how do they minimise the detours?
The answer is machine learning
o Siri, Alexa, are few of the popular examples of virtual personal assistants
o Social media platforms are utilising machine learning for their own benefits as well
as for the benefit of the user. Below are a few examples:
o Face Recognition: Upload a picture of you with a friend and Facebook instantly
recognizes that friend
o Machine learning is proving its potential to make cyberspace a secure place and
tracking monetary frauds online is one of its examples
o Most websites will offer the option to chat to customer support. In most cases, you
talk to a chatbot rather than a live executive to answer your queries
o These bots tend to extract information from the website and present it to the
customers
[Diagram: In traditional programming, data and a program are given to the computer to produce output; in machine learning, data and output are given to the computer to produce a program]
[Diagram: Machine learning applications — drugs recovery price, credit scoring, motion detection, voice recognition, tumor detection, and predictive maintenance]
[Diagram: Machine learning categories — task-driven classification and regression (e.g. Support Vector Machines, Linear Regression, GLM), data-driven clustering (e.g. K-Means, K-Medoids, Fuzzy C-Means), and learning from mistakes]
[Pie chart: Mathematics for machine learning — Linear Algebra, Multivariate Calculus, Probability Theory and Statistics, Algorithms and Complexity, and Others, with approximate shares of 35%, 25%, 15% and 15% shown]
o IVR (Interactive Voice Response) applications that are used in call centres to respond to
specific user requests
o Word processors like Grammarly that employ NLP for checking grammatical errors
[Diagram: Stages of NLP — Morphological Processing (uses a lexicon), Syntax Analysis / parsing (uses a grammar), Semantic Analysis (uses semantic rules), and Pragmatic Analysis (uses contextual information)]
• Individual words are analysed into their components, and non-word tokens such as punctuation are separated from the words
• This stage draws the dictionary meaning, or the exact meaning, from the given context only
• This analysis handles communicative and social content, as well as its effect on interpretation
• In this analysis, the key emphasis is always on what was said being reinterpreted according to what was actually meant
• This analysis helps users find the intended effect by applying a set of rules that characterise cooperative dialogues
• Syntax basically refers to the principles and rules governing the sentence structure of any individual language
• Syntax concentrates on the proper ordering of words, which can affect meaning
• The meaning of any single sentence depends upon the sentences that precede it
• For example, in the sentence “He wanted that”, the word “that” depends upon the previous discourse context
• Search engines such as Google, Bing, Yahoo etc. base their machine translation
technology on NLP deep learning models
• NLP lets algorithms read text on a webpage, interpret its meaning, and translate it to
another language
• NLP techniques are broadly used by word processing software such as MS-word for spelling
correction and grammar checks
• Translating text or speech from one natural language to another using computer applications
a) Users can ask as many questions as they like about any subject and get a response instantly within seconds
d) The accuracy of the answers depends upon the quantity of relevant information provided in the question
g) NLP helps computers communicate with humans in their own language and therefore scales up other language-related tasks
Optical character recognition, topic modelling, language detection
• The models are trained by using a huge set of labelled data and neural
network architectures that include multiple layers
• The amount of useful data available and the increase in computational speed are the two factors that have made the whole world invest in this field
• If a robot is hard-coded, i.e. all the logic has been manually coded into the system, then it is not AI; simple robots, therefore, do not automatically imply AI
• Machine learning means making a machine learn from its experience and enhance its performance over time, as in the case of a human baby
• The concept of machine learning became practical only when an adequate amount of data was made available for training machines. It assists in dealing with complex and sound systems
• Deep learning is mainly a subset of machine learning, but in this case the machine learns in the way humans are believed to learn
• The structure of a deep learning model resembles the human brain: a large number of nodes play the role of the neurons in a human brain, resulting in an artificial neural network
• When traditional machine learning algorithms are applied, we need to select input features manually from a complex data set and then train on them, which is a tedious job for a machine learning scientist; in neural networks, we do not need to select useful input features manually
• There are several types of neural networks to manage the complexity of data set and
algorithm
• Deep learning has allowed many industry experts to overcome challenges that were not possible a decade ago, such as image and speech recognition and natural language processing
• Recent successes of deep learning include voice assistants, mail services, self-driving cars, video recommendations, and intelligent chatbots
• In the human brain, a single neuron receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and weights are assigned accordingly
• A heavily weighted node exerts more influence on the next layer of nodes. The final layer puts the weighted inputs together to give an output
• Deep learning systems need powerful hardware, as they process a huge amount of data and perform many complex mathematical calculations
• In spite of such advanced hardware, deep learning training calculations can take weeks
• Deep learning systems need a large amount of data to return accurate results; accordingly, the information is fed in as huge data sets
• While the data is being processed, artificial neural networks are able to classify it using the answers obtained from a series of true/false questions involving highly complex mathematical computations
• For instance, facial identification programs work by learning to identify and detect the edges and lines of faces, then more significant parts of the faces, and finally complete representations of the faces
• As the program trains itself, the probability of getting the right answers improves over time
• The analysis of big data sets is an interdisciplinary effort that combines statistics, mathematics, computer science, and subject matter expertise
• It produces value from the storage and processing of substantial quantities of digital information that cannot be analysed with conventional computing techniques
[Diagram: Characteristics of big data — Volume, Velocity, Variety, Veracity, Value, and Complexity]
1. Archives
2. Enterprise Data
3. Transactional Data
4. Social Media
5. Activity Generated
6. Public Data
• The table describes the four categories of common business problems that organisations contend with, where they have a chance to use advanced analytics to create a competitive advantage
• Rather than just performing standard reporting on these areas, organisations can apply advanced analytical techniques to optimise processes and derive more value from these routine tasks
• The first three examples do not describe new problems. Organisations have been attempting to decrease customer churn, increase sales, and cross-sell to customers for many years
• Multiple compliance and regulatory laws have been in place for quite a long time; however, extra requirements are added every year, which represents added complexity and data requirements for organisations
• Anti-money laundering (AML) related laws and fraud prevention require advanced
analytical techniques for complying and managing appropriately
• Discovery is phase 1, where the team learns the business domain, including relevant history such as whether the business unit or organisation has attempted similar projects in the past from which it can learn
• The team analyses the resources available to support the project in terms of technology,
people, time, and data
• In this step, essential activities include framing the business problem as an analytics
challenge that can be solved throughout subsequent phases and formulating initial
hypotheses (IHs) to test and start learning the data
• Data preparation requires the existence of an analytical sandbox, in which the team can
work with data and perform analytics for the duration of the project
• In this phase, the team needs to execute extract, transform and load (ETL) or extract,
load, and transform (ELT) to retrieve data into the sandbox
• In the ETLT process, data should be transformed so that the team can work with the
data and analyse it. The team also requires to familiarise itself with the data thoroughly
and take steps to condition the data
• In this phase, the team determines the techniques, methods, and workflow it intends to
follow for the subsequent model building phase
• The team examines the data to learn about the relationships between variables and
subsequently selects key variables and the most relevant models
• Phase 4 is model building, where the team develops datasets for training, testing, and production purposes
• Here the team builds and executes models based on the work done in the model planning phase
• In this, the team should quantify the business value, identify key findings, and develop
a narrative to summarise and convey findings to stakeholders
• Phase 6 is Operationalise, where the team delivers final reports, briefings, code, and
technical documents
• Also, the team may run a pilot project in a production environment to implement the
models
• One of the most important aspects of data manipulation in R is that it enables subsequent analysis and visualisation
o Vectors
o Matrices
o Lists
o Data Frames
2. [[ - like $ in R, the double square brackets operator in R also returns a single element
Example:
• To retrieve 5 rows and all columns of the already built-in dataset iris, the following command is used:
Output:
Input:
1. cut() function in R
Input:
Output:
2. table() function in R
• We can use the R table() command, to count the observations in each level of factor
Input:
Output:
• Data cleaning is the process of transforming raw data into consistent data and analysing it
• The main aim of data cleaning is to improve the reliability of statistical statements based on the data
• The first step involves an initial exploration of the data frame that was just imported into R
• The important thing is to understand how to import data into R and save it as a data frame
Output:
The first thing to check is the class of your data frame:
• class(data)
o Here we can clearly see that our dataset is saved as a data frame
1. "data frame"
o We want to check the number of rows and columns in the data frame
1. 1460 81: We can see that the data frame has 1460 rows and 81 columns
We can view the statistical summary for all the columns of the data frame using the code shown on the next slide:
• summary(data)
Output:
• There are two types of plots that should be used during the data cleaning process:
Histogram BoxPlot
Histogram:
• The histogram is useful for figuring out whether there are outliers in the particular numerical column under study
install.packages("plyr")
library(plyr)
hist(data$Dist_Taxi)
BoxPlot:
• It is super useful because it shows the median, along with the first, second, and third
quartiles
• BoxPlots are the best way of spotting outliers in your data frame
boxplot(data$Dist_Taxi)
• In this step the main focus is to correct all the errors that you have seen
• If you want to rename a column of your data frame, the code is:
data$carpet_area<-data$Carpet
• Some columns may have an incorrect type associated with them, for example a column containing text elements stored as a numeric column
• In such a case, we can change the type of the column by using the following code:
data$Dist_Taxi<-as.character(data$Dist_Taxi)
class(data$Dist_Taxi)
• R uses cutting edge technology to manipulate data which can be used for predictive
modelling
• In order to analyse data, we need to access the data from different databases by using
SQL commands
• Then read the data and export the data using different file formats
• Access descriptors enable you to create view descriptors wherein they function in the
same way as the PROC SQL command
• Care needs to be taken when you are exporting data from one type to another type
1. CSV
2. TSV
3. SPSS
4. HTML
5. Fixed Field Text, and many more
• The following export formats are available for Data Table exports only:
i. XML
ii. XPSS
iii. HTML
iv. Fixed field text
v. Tableau
vi. JSON
• CSV (Comma Separated Value) can be opened in MS-EXCEL. It can also be converted to
other statistical software
• TSV (Tab Separated Value) is a simple text format for storing data in a tabular structure.
TSV and CSV are compatible file formats to import data to QUALTRICS
• The XML is used for putting your raw data into a database. It is a general purpose mark-
up language. It is compatible with Excel
• Fixed Field Text is a flat file format. It is accompanied by a separate data map file
• Many organisations use a data analysis application called Tableau. JSON (JavaScript Object Notation) is also available for use
• In the following example, vector1, vector2 and vector3 are the variables that store integer values separately. We make use of the c() function to combine these values together
• We can enter numerical data by typing the values, separated by commas, into the c() command
• In this example, data1 is the object that stores our data. Then, type our numerical
values between the two parentheses and these values will be separated by commas.
Type ‘data1’ to display the dataset
• We will create an object data2 that stores our data. We will also specify data1 as one of
the member components
• Whatever the quotes enclose is interpreted as a character type, i.e. a text item
• In the following example, we will take our data in the form of characters as the days of a
week and store them in the day1 object
• Then we pass day1 together with another text element into the same vector. In this case, however, day1 is not of text but of numeric type. If numbers and text are combined, R converts the numbers into text
• We can use the scan() command, which doesn't require you to enter a comma after every input value, instead of typing input data with the additional specification of commas
• The scan() command can also be used for taking data from files as well as from the clipboard
• scan() command invokes a prompt through which you enter the data. It does not take
any input between its parentheses
• In the above example, we created a data frame that is then stored as a file called
‘data.txt’ on the local disk. This text file can be accessed using the scan function as
follows:
• To copy and paste the data more interactively, we can use the clipboard
• We can enter the input data such as spreadsheets with the help of scan() command
o After returning to R, paste the data from the clipboard. R then waits until an empty line is entered before the data entry process is stopped, which makes it easier to copy and paste data as required
• If the data is separated by spaces, simply copy and paste. However, if some other
character or symbol separates the data, we must enter it in R before importing the data
• We can retrieve data from a CSV file using the scan() command. We will save our previously created data frame ‘data’ as a CSV file
• Now, we scan our CSV file and define the what attribute with ‘character’
• We can use the scan() command to get data file from our system's local memory
• Data can be read from a console and written to a vector with the help of scan()
command. In the scan() function, we add the file name as follows:
• We used the scan() command in the above sections to read data from simple files. We can also enter large amounts of complicated data into R
• There are different ways and means of reading such large data that are stored in a
variety of text formats
o We can read from files that contain values separated by tabs: > read.delim()
• This model is used to predict a binary outcome such as (1/0, True/False, Yes/No) given a set of independent variables
• It is a regression model in which the response variable has binary values such as 0/1
or True/False. Hence, we are able to calculate the probability of the binary response
• First we import the data and display the information related to the BreastCancer dataset with the str() function:
Output:
• In the case of simple linear regression there is one predictor and one response variable, but in the case of multiple regression there is more than one predictor variable and one response variable
The following is a description of the parameters used in the equation on the previous slide:
• The lm() function creates the relationship model between the response variable and the predictors
o formula is a symbol that defines the relation between predictor variables and the
response variable.
• Input Data
o Take the data set "mtcars", which is available by default in the R environment
o The goal of the model is to establish the relationship between "wt", "hp" and "disp" as predictor variables and "mpg" as the response variable
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
Output
Y = a+Xdisp.x1+Xhp.x2+Xwt.x3
Or
Y = 37.15+(-0.000937)*x1+(-0.0311)*x2+(-3.8008)*x3
o We can use the previously created regression equation to predict the mileage when a new set of values for weight, horsepower and displacement is provided
o For a car with wt = 2.91, hp = 102 and disp = 221, the predicted mileage is:
Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91
print(Y)
Output
• On plotting a graph, we get a bell shape curve with the count of the values in the
vertical axis and the value of the variable in the horizontal axis
o mean represents the mean value of the sample data. Its default value is 0
• For example, tossing a coin always gives either a head or a tail. With the binomial distribution, we can estimate the probability of finding exactly 3 heads when a coin is tossed 10 times
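In R this probability is given by dbinom():
# probability of exactly 3 heads in 10 tosses of a fair coin
dbinom(3, size = 10, prob = 0.5)   # about 0.117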
• The qbinom() function takes a probability value and gives the number whose cumulative probability matches that value
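A small sketch:
# smallest number of heads whose cumulative probability reaches 0.25
qbinom(0.25, size = 10, prob = 0.5)   # returns 4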
• This view can be mainly useful when your model contains complex relationships
between many tables
• Click on the Model icon placed at the left side of the window to see a view of the
existing model
• Hovering your cursor over a relationship line shows the columns that are used, as shown on the next slide:
1. *-1: Many-to-One
2. 1-1: One-to-One
3. 1-*: One-to-Many
4. *-*: Many-to-Many
• In a many-to-one relationship, the column in a given table can have more than one instance of a value, while the other related table, known as the lookup table, contains only one instance of a value
• In a one-to-one (1:1) relationship, the column in one table has only one instance of a specific value, and the other related table also contains only one instance of a specific value
• In a one-to-many (1:*) relationship, the column in one table has only one instance of a
specific value, and the other related table contains more than one instance of a value
• You can create a many-to-many relationship between tables with composite models, which removes the requirement for unique values in tables
• It also eliminates the previous workarounds, such as introducing new tables only to
build relationships
• The possible cross-filter options depend on the type of cardinality
• A Single cross-filter direction filters in one direction only, while Both filters in both directions
• DAX formulas are just like the ones we write in Microsoft Excel
• Excel allows its users to reference cells or arrays. If users need similar behaviour in Power BI, they require the use of DAX functions
• The scalar value can be an expression that evaluates to a scalar or an expression that
can be converted to a scalar
Expressions containing any of the following:
o Constants that use scalar operators such as +, -, *, /, >, =, && etc.
o References to columns or tables
o Scalar operators, expressions or values, and constants specified as a part of an expression
o A function along with its arguments and parameters, and the result it returns
• DAX requires that all its objects whether tables or columns must have unique names
• Also, names of objects are case insensitive, i.e. Products and PRODUCTS would refer to
the same table or column
• A column name should always be fully qualified, i.e. it must be preceded by the table name and written in square brackets, e.g. Sales[Product_Id]
• Sometimes table names will contain spaces, in which case they must be enclosed in single quotation marks, e.g. 'Sales 2019'[Product_Id]
1. When the VALUES function requires arguments
2. As arguments to the ALL or ALLEXCEPT functions
3. When a table is passed as a filter argument to the CALCULATE or CALCULATETABLE functions
4. As an argument to the RELATEDTABLE function
5. As an argument to any time intelligence function
• DAX is also capable of returning a whole table rather than a column only
• The simplest way to visualise row context is to take a table and add a calculated column
• For instance, if a table has two columns, a and b, and the row 1 values are 1 and 2 respectively
• If you add a column c that sums the values of columns a and b, then the column c value for row 1 would be 3, and for row 2 it would be the sum of that row's values (for example, 7 if row 2 held 3 and 4)
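A sketch of such a calculated column in DAX (the table name MyTable and the column names a and b are placeholders):
c = MyTable[a] + MyTable[b]   // evaluated row by row, so each row's c is that row's a plus b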
• You can create a Data Analysis Expressions (DAX) formula that defines the column's values, rather than querying and loading values into your new column from a data source
• In Power BI desktop, calculated columns are generated by using the new column feature
in Report view
• Calculated columns that you create appear in the Fields list just like any other field
• However, they carry a special icon showing that their values are the result of a formula:
• You can name new columns whatever you want, and add them to a report visualisation
just like other fields
• For instance, suppose you are a personnel manager who has a table of Sales_2019 and
another table of Sales_2020, and you want to combine both tables into a single table
called Sales
Sales_2019 Sales_2020
Step 1: Click on the Modeling Tab and then select New Table
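After selecting New Table, a formula along these lines could combine the two tables (a sketch; UNION assumes both tables have the same columns in the same order):
Sales = UNION(Sales_2019, Sales_2020)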
• Simple summarisations such as sums, averages, counts, minimums and maximums can be set through the Fields well
• The calculated results of measures always change according to your interaction with your reports, allowing for fast and dynamic ad hoc data exploration
• In Power BI Desktop, measures are created in Data view or Report view
• The measures that you create appear in the Fields list with a calculator icon
• You can name measures whatever you want and add them to a new or existing visualisation just like any other field
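For example, a simple measure might look like this (the Sales table and its Amount column are hypothetical, used only for illustration):
Total Sales = SUM(Sales[Amount])   // sums the Amount column over the current filter context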
• It is the queries that form the basis of the reports and visualisations in Power BI
• A query is created in Power BI as soon as the command to fetch data (or Get Data to be
more precise) is given
• However, these tasks can only be performed from the query editor in the Power BI
desktop version
• Users can use multiple queries to get the results they want
• This is possible if they have already imported these datasets from some external source
such as Excel, CSV (comma-separated values) file, or some databases
• When we connect to a web data source, Power Query Editor loads information about the data, which you can then use and begin to shape
• The following steps show how power query editor appears once a data connection is
established:
Step 1: In the ribbon, various buttons are now active to interact with the data in the query
Step 2: In the left pane, queries are listed as well as available for selection, shaping and
viewing
Step 3: In the centre pane, data from the selected query is displayed or available for
shaping
• In the Power BI desktop, there is a lot that can happen to the data that has been
retrieved
• While in the Query Editor, users can opt to remove columns or rows from a dataset, or even add new columns to the existing ones
• Remove Errors is one option, which would remove all rows containing errors
• As we want to keep our data and rectify the errors, we are not going to use this option
• Click the column that has ERROR displayed to show the following:
• To enable the advanced editor, select View from the ribbon, then select Advanced
Editor
• A window appears that displays the existing query code as shown below:
• You can also display cell values with data bars, or as active web links or KPI icons
• You can also apply conditional formatting to any text or data field, as long as you base the formatting on a field that has numeric values, colour names or hex codes, or web URL values
Step 1: Open Power BI and choose the Excel option from Get Data
• To combine data, the easiest way would be to establish a relationship between the data
sources on a column
• Let us take a scenario where the user has a list of cities and their country codes in one data source, and the country codes and their respective country names in another, as shown on the next slide:
• Once done, click on the Relationships icon on the right-hand side to create a
relationship as shown:
• Next, click on the Reports icon to see a result of the data you have combined
• The data you entered has been mapped automatically using Power BI desktop
• Data can also be combined from the web with some existing data
• Suppose we have some data that has organisation names and their respective US
country codes but not the country names, and a report requires that organisations be
listed with country names then we could take the web as a data source
Step 1: Choose Get Data > Web and provide the URL from where to retrieve the country
codes and country names. Click OK
Step 3: Rest of the process is the same, click Relationships and then Reports
• The second set of page view settings controls the positioning of objects on the report canvas, letting you choose between:
o Show gridlines: Turning on gridlines helps you position objects on the report canvas
o Snap to grid: Use with Show gridlines to position precisely as well as to align objects on the report canvas
o Lock objects: Lock all objects on the canvas so that they cannot be resized or moved
o Selection pane: The Selection pane lists all objects on the canvas, and you can
decide which to show and which to hide
• These settings are available in the Visualisations pane and control the actual size (in
pixels) as well as the display ratio of the report canvas:
o 4:3 ratio
o Letter
• There are innumerable visualisations in Power BI, the list is growing, and according to Microsoft it will keep on growing
• As of now, Power BI offers visualisations that help the user with simple charts and also
measure their performances
Step 2: Start with a numeric field like Customers > City > Customer_ID. Power BI creates a
column chart with a single column, and you can select the desired chart from visualizations
• It may contain the data in the Location, Latitude, and Longitude buckets of the visual's
field well
• The following are the steps to represent the data by using a map chart in a Power BI report:
Step 1: Start with a geography field, such as Geo > City > Customer_ID
Step 1: Click on Get Data icon on the ribbon and select Excel to import a workbook
• The admin portal consists of items such as usage metrics, access to the Microsoft 365
admin centre, and settings
• The full admin portal is open to all users who are global admins or have the Power BI service administrator role
• To get access to the Power BI admin portal, make sure your account is marked as a Global Admin within Microsoft 365 or Azure Active Directory (Azure AD), or has the Power BI service administrator role
Step 1: Select the settings gear in the top right side of the Power BI service
Step 1: You can manage and observe the settings for your environments by signing in to
the Power Platform admin centre
Step 2: Go to the Environments page, select an environment and then click on Settings
• GLOBAL options apply to all Power BI Desktop files created or accessed by the user
• CURRENT FILE options, on the other hand, must be set separately for each Power BI Desktop file
• To illustrate this, we are going to use the Adam Insights dashboard available in my Power BI workspace
• The following are the steps to change this Power BI dashboard's settings:
Step 1: On the top right corner, click on the … button and then select the Settings option
from the context menu