
Data Analytics Course

1. Introduction
IIFT Kolkata | MBA (IB) | Jul-Sep 2022
Delhi Campus

Dr. Shashank Shekhar Sharma
drshashank.s.sharma@gmail.com
Make Predictions

NOT Make Erroneous Predictions


What we will do / learn today!

● Overview
● Managing all the Data Science and Data Analytics jargon!
● “Data” in Data Analytics
● How Data becomes ‘Big Data’
● 30K feet view of ‘Cloud’ Computing
● Using Data for Forecasting - then and now
● Course Objectives
● The Data Analytics Process
● What can Machine Learning really do - and not do?
● Supervised, Semi-supervised and Unsupervised ML
● AI, ML and DL
● Roles in Data Science
● Tools of the trade
House Rules

1. ACP is welcome.. so is DCP

2. Try not to miss lectures.. Especially when we start learning Python

3. Ask.. there are no wrong questions.. But I may not have all right answers

4. You can sleep if you can still “sleep listen”


House Rules

No Phones

Laptops only when needed

No walking in and out


HUMAN DECISION MAKING
What is Analytics?
Different branches of Data Science

Humans: multitasking, with high adaptability and flexibility.
Machines: faster and more accurate, but only for extremely specific tasks.

Properties of Measurement
● Identity: Identity refers to each value having a unique meaning.
● Magnitude: Magnitude means that the values have an ordered relationship to
one another, so there is a specific order to the variables.
● Equal intervals: Equal intervals mean that data points along the scale are equal,
so the difference between data points one and two will be the same as the
difference between data points five and six.
● A minimum value of zero: A minimum value of zero means the scale has a true
zero point. Degrees, for example, can fall below zero and still have meaning.
But if you weigh nothing, you don’t exist.
1. Nominal scale of measurement

The nominal scale of measurement defines the identity property of data. This scale has certain characteristics, but doesn’t
have any form of numerical meaning. The data can be placed into categories but can’t be multiplied, divided, added or
subtracted from one another. It’s also not possible to measure the difference between data points.

Examples of nominal data include eye colour and country of birth. Nominal data can be broken down again into three
categories:

● Nominal with order: Some nominal data can be sub-categorised in order, such as “cold, warm, hot and very hot.”
● Nominal without order: Nominal data can also be sub-categorised as nominal without order, such as male and female.
● Dichotomous: Dichotomous data is defined by having only two categories or levels, such as ‘yes’ and ‘no’.
2. Ordinal scale of measurement

The ordinal scale defines data that is placed in a specific order. While each value is ranked, there’s no
information that specifies what differentiates the categories from each other. These values can’t be
added to or subtracted from.

An example of this kind of data would include satisfaction data points in a survey, where ‘one = happy,
two = neutral, and three = unhappy.’ Where someone finished in a race also describes ordinal data. While
first place, second place or third place shows what order the runners finished in, it doesn’t specify how
far the first-place finisher was in front of the second-place finisher.
3. Interval scale of measurement

The interval scale contains properties of nominal and ordered data, but the difference between data points can be quantified. This
type of data shows both the order of the variables and the exact differences between the variables. They can be added to or
subtracted from each other, but not multiplied or divided. For example, 40 degrees is not 20 degrees multiplied by two.

This scale is also characterised by the fact that the number zero is an existing variable. In the ordinal scale, zero means that the
data does not exist. In the interval scale, zero has meaning – for example, if you measure degrees, zero has a temperature.

Data points on the interval scale have the same difference between them. The difference on the scale between 10 and 20 degrees
is the same between 20 and 30 degrees. This scale is used to quantify the difference between variables, whereas the other two
scales are used to describe qualitative values only. Other examples of interval scales include the year a car was made or the
months of the year.
4. Ratio scale of measurement

Ratio scales of measurement include properties from all four scales of measurement. The data is nominal and defined by an
identity, can be classified in order, contains intervals and can be broken down into exact values. Weight, height and distance are
all examples of ratio variables. Data in the ratio scale can be added, subtracted, divided and multiplied.

Ratio scales also differ from interval scales in that the scale has a ‘true zero’. The number zero means that the data has no
value point. An example of this is height or weight, as someone cannot be zero centimetres tall or weigh zero kilos – or be
negative centimetres or negative kilos. Examples of the use of this scale are calculating shares or sales. Of all types of data on
the scales of measurement, data scientists can do the most with ratio data points.
Course Objectives
1. INTRODUCTION TO DATA ANALYTICS
2. PYTHON FOR DATA PRE-PROCESSING

3. MACHINE LEARNING WITH PYTHON


4. BIG DATA ANALYTICS AND GOOGLE CLOUD PLATFORM
Evaluation Component

● Case Study Presentation: 15%

● Assignment: 20%

● Project: 30%

● End Term: 35%


Data Analytics Course
2. Introduction to Data
Analytics
IIFT Kolkata | MBA (IB) | Jul-Sep 2022
DATA PRE-PROCESSING
● Data Pre-processing: collection, ingestion, preparation (~50% effort)

● Analytics / Machine Learning: computation (~25% effort)

● Delivery: presentation (~25% effort)


Missing Completely at Random (MCAR): missing values are completely independent of other data; there is no pattern. There is no relationship between the missing data and any other values, observed or unobserved (the data which is not recorded), within the given dataset. In the case of MCAR, the data could be missing due to human error, some system/equipment failure, loss of sample, or some unsatisfactory technicalities while recording the values.

Missing at Random (MAR): the reason for missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data.

Missing Not at Random (MNAR): missing values depend on the unobserved data; there is some structure/pattern in the missing data that the other observed data cannot explain.
DEALING WITH MISSING VALUES

1. Deleting the Missing values


2. Imputing the Missing Values
IMPUTING VALUES
● Replacing with an arbitrary value
● Replacing with mean, median or mode
● Replacing with previous value – forward fill
● Replacing with next value – backward fill
● Interpolation – ‘polynomial’, ‘linear’, ‘quadratic’ (Python method)
● Treat as missing (separate category for categorical variables)
● Nearest neighbours (KNN)
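A minimal sketch of these strategies using pandas and scikit-learn; the DataFrame and column names below are hypothetical.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 30, 28],
                   "salary": [50000, 62000, np.nan, 58000]})

df["age"].fillna(df["age"].mean())      # replace with mean (median()/mode()[0] work similarly)
df["age"].fillna(-1)                    # replace with an arbitrary value
df["age"].ffill()                       # forward fill: carry the previous value forward
df["age"].bfill()                       # backward fill: pull the next value back
df["age"].interpolate(method="linear")  # 'polynomial' and 'quadratic' are also available
imputed = KNNImputer(n_neighbors=2).fit_transform(df)  # fill from the nearest rows (KNN)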
Use the right evaluation metrics
Precision: how many selected instances are relevant (distinct from specificity, the true-negative rate).

Recall/Sensitivity: how many relevant instances are selected.

F1 score: harmonic mean of precision and recall.
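A minimal sketch computing these metrics with scikit-learn; the labels below are hypothetical.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]   # actual classes
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]   # model predictions

print(precision_score(y_true, y_pred))  # fraction of selected instances that are relevant
print(recall_score(y_true, y_pred))     # fraction of relevant instances that were selected
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall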


Resampling (Oversampling and Undersampling)

● Undersampling: balance the dataset by reducing the size of the abundant class.
● Oversampling: balance the dataset by increasing the size of rare samples (repetition / bootstrapping).
Ensemble different resampled datasets

Build n models that each use all the samples of the rare class and n differing samples of the abundant class.

Increasing class weight for minority class
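A minimal sketch of resampling and class weighting; the dataset is synthetic, and the samplers come from the separate imbalanced-learn (imblearn) package.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)     # grow the rare class
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink the abundant class

clf = LogisticRegression(class_weight="balanced").fit(X, y)  # increase minority-class weight in the model itself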
DATA NORMALIZATION

Normalization is one of the crucial techniques for data transformation in data mining: the data is transformed so that it falls
within a given range. When attributes are on different ranges or scales, data modelling and mining can be difficult.
Normalization helps in applying data mining algorithms and extracting data faster.
● Min-max normalization: guarantees all features will have the exact same scale but does not handle outliers well.
● Z-score normalization: handles outliers, but does not produce normalized data with the exact same scale.
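A minimal sketch of both normalizations with scikit-learn; the data is hypothetical.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])  # features on very different scales

X_minmax = MinMaxScaler().fit_transform(X)    # min-max: rescales each feature to [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # z-score: mean 0, standard deviation 1 per feature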
BUSINESS INTELLIGENCE
Data Analytics Course
3. Introduction to Machine
Learning
IIFT Kolkata | MBA (IB) | Jul-Sep 2022
Machine Learning Process
https://www.anaconda.com/
[Slide figure: an example dataset with features Length, Weight, Width and Shape]
Supervised Learning : Regression
Linear Regression
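As a reference sketch, simple linear regression fits the standard form, with ε the error term:

Y = \beta_0 + \beta_1 X + \varepsilon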
Measuring Performance : Mean Absolute Error
Measuring Performance: Mean Absolute Percentage Error
Measuring Performance: Mean Percentage Error
Measuring Performance
Measuring Performance: Mean Squared Error
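For reference, the standard definitions of these metrics, with y_i the actual and \hat{y}_i the predicted values over n points (assumed to match the slide formulas):

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert

\mathrm{MPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{y_i - \hat{y}_i}{y_i}

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2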
Higher Dimensions
Optimization
Minimizing Cost Function By Updating Weights In Every Epoch
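The per-epoch update presumably follows the standard gradient-descent rule, with α the learning rate and J(w) the cost function:

w \leftarrow w - \alpha \, \frac{\partial J(w)}{\partial w}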
Optimization in Higher Dimensions
Testing different models

Three estimates of f are shown: the linear regression line (orange curve) and two smoothing spline fits (blue and green curves).
Training error vs Test error

Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.
Variance refers to the amount by which predictor function (f)
would change if we estimated it using a different training data set.
Since the training data are used to fit the statistical learning
method, different training data sets will result in a different f. But
ideally the estimate for f should not vary too much between
training sets. However, if a method has high variance, then small
changes in the training data can result in large changes in f. In
general, more flexible statistical methods have higher variance.

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, . . . , Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.
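These two error sources combine in the standard decomposition of expected test MSE at a point x_0, with Var(ε) the irreducible error:

E\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \mathrm{Var}\!\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\!\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\varepsilon)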
Supervised Learning : Classification
Logistic Regression

Logistic regression is like a linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability.
Log odds
In logistic regression, a logit transformation is applied
on the odds—that is, the probability of success
divided by the probability of failure. This is also
commonly known as the log odds, or the natural
logarithm of odds:
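In symbols (the standard logit form, with p the probability of success):

\ln\!\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p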
Logistic Regression
Cost Function
Optimization
Optimization Function
Optimization Steps
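The cost, denoted J(w) or J(θ) on the slides, is presumably the standard binary cross-entropy (log loss), where \hat{y}_i is the predicted probability:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]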
Measuring Performance : Classification

Complexity
L1 and L2 Penalty / Regularization (regularization strength λ)
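As a sketch, the L1 (lasso) and L2 (ridge) penalties add a λ-weighted term to the cost:

J_{L1}(\theta) = J(\theta) + \lambda \sum_j \lvert \theta_j \rvert \qquad J_{L2}(\theta) = J(\theta) + \lambda \sum_j \theta_j^2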
Support Vector Machines
Non-Linear SVM
Kernel trick for SVM
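A minimal sketch of a non-linear SVM using the kernel trick in scikit-learn; the data is synthetic.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

# The RBF kernel implicitly maps the data into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))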
K-Nearest Neighbour (KNN) Classifier

● Step 1: Select the number K of neighbours.
● Step 2: Calculate the Euclidean distance from the new data point to the existing data points.
● Step 3: Take the K nearest neighbours as per the calculated Euclidean distance.
● Step 4: Among these K neighbours, count the number of data points in each category.
● Step 5: Assign the new data point to the category for which the number of neighbours is maximum.
● Step 6: Our model is ready.
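A minimal sketch of these steps with scikit-learn, using the Iris dataset as a stand-in.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Step 1: select K
knn.fit(X_train, y_train)         # Steps 2-5 run at prediction time: distances, neighbours, majority vote
print(knn.score(X_test, y_test))  # Step 6: the model is ready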
Decision Trees
Random Forest
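A minimal sketch of a random forest, an ensemble of decision trees, with scikit-learn; Iris serves as a stand-in dataset.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)  # 100 trees vote on each prediction
print(rf.feature_importances_)  # feature importance aggregated across the trees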
Unsupervised Learning
K-Means Clustering
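A minimal sketch of K-Means with scikit-learn; the data is synthetic.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the learned centroids
print(km.labels_[:10])      # cluster assignment for the first ten points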
Hierarchical Clustering
Dimensionality Reduction
Factor Analysis
Principal Component Analysis
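A minimal sketch of PCA with scikit-learn; Iris serves as a stand-in dataset.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance captured by each component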
Sources : towardsdatascience.com, University of Wisconsin, Udemy
Data Analytics Course
4. Introduction to Data
Visualization
IIFT Kolkata | MBA (IB) | Jul-Sep 2022
“Power Corrupts. PowerPoint Corrupts Absolutely.”
- Edward Tufte, Yale Professor Emeritus
Data visualisation examples
The best visual representation of a data set is determined by the relationship data scientists want to convey between data
points. Do they want to present the distribution with outliers? Do they want to compare multiple variables or analyse a single
variable over time? Are they presenting trends in their data set? Here are some of the key examples of data visualisation.

● A bar chart is used to compare two or more values in a category and how multiple pieces of data relate to each other.
● A line chart is used to visually represent trends, patterns and fluctuations in the data set. Line charts are commonly
used to forecast information.
● A scatter plot is used to show the relationship between data points in a compact visual form.
● A pie chart is used to compare the parts of a whole.
● A funnel chart is used to represent how data moves through different steps or stages in a process.
● A histogram is used to represent the distribution of data over a certain time period or interval.
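A minimal sketch of a few of these chart types with matplotlib; the data is hypothetical.

import matplotlib.pyplot as plt

categories, values = ["A", "B", "C"], [10, 24, 17]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, values)              # bar chart: compare values across categories
axes[1].plot([2019, 2020, 2021], [5, 9, 7])  # line chart: trends over time
axes[2].scatter([1, 2, 3, 4], [2, 4, 5, 8])  # scatter plot: relationship between data points
plt.show()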
6 key lessons for data storytelling

1. Understand the context

2. Choose an appropriate visual display

3. Eliminate clutter

4. Focus attention where you want it

5. Think like a designer

6. Tell a story
Exploratory vs. Explanatory analysis

Exploratory analysis is what you do to understand the data and figure out what might be noteworthy or interesting to highlight to others.

Explanatory analysis is the step beyond exploratory. Instead of simply describing what happened, you are more focused on how and why it happened and what should happen next and, in most cases, on communicating that to the necessary decision-makers and stakeholders.
Data Analytics Course
Case Study Presentation :
1st August – 15 mins each group
IIFT Kolkata | MBA (IB) | Jul-Sep 2022
5 Groups
1. https://www.kaggle.com/competitions/titanic/data

2. https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

3. https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

4. https://www.kaggle.com/competitions/digit-recognizer/data?select=train.csv

5. https://www.kaggle.com/datasets/arshid/iris-flower-dataset

6. https://www.kaggle.com/datasets/mirichoi0218/insurance

7. https://www.kaggle.com/datasets/blastchar/telco-customer-churn

8. If you want to use any other dataset from Kaggle which excites you more, please feel free to do that.
Presentation to include
● Description and Understanding of Data

● Objective of the problem

● Explanation of the features and what the data represents

● What will be the metrics for analysing the performance of the prediction model

● What could be possible algorithms for building a prediction model

● Since these are widely used datasets on Kaggle, many users have given a detailed
description of the steps and codes they used to solve the problem. You can choose to take
us through one of these to help you in bringing out the above points.