
CONTENTS OF THIS BOOK

ABOUT THE AUTHOR

PART 0 PREFACE
WHAT IS STATISTICS?

PART 1 DESCRIPTIVE STATISTICS
• INTRO: ANALYTICS NINJA & THE STATISTICAL LABYRINTH
• MEANS AND AVERAGES
• MEDIAN
• VARIANCE / STANDARD DEVIATION AND DELTA
• GLOSSARY

PART 1 INFERENTIAL STATISTICS
• DEFINE INFERENTIAL STATISTICS
• DATA TYPES
• SAMPLING TECHNIQUES
• SIMPLE RANDOM AND STRATIFIED SAMPLING
• CLUSTER AND SYSTEMATIC SAMPLING
• INFERENTIAL ANALYSIS

PART 2 PROBABILITY
• INTRO: ANALYTICS NINJA AND THE PROBABILITY CONUNDRUM
• INTRODUCTION TO PROBABILITY
• UNDERSTANDING PROBABILITY
• SAMPLE SPACE
• CLASSICAL / EMPIRICAL PROBABILITY
• PROBABILITY RULES

PART 3 HYPOTHESIS TESTING
• INTRO: HYPOTHESIS IN A NUTSHELL
• WHAT IS A HYPOTHESIS AND HOW TO WRITE ONE?
• 5 STEPS TO HYPOTHESIS TESTING
• TYPE I AND TYPE II ERRORS
• INTERPRETING THE ERROR TABLE
• HYPOTHESIS TESTING TREE

PART 4 CORRELATION AND REGRESSION
• ANALYTICS NINJA AND THE CURSE OF MULTICOLLINEARITY
• TYPES OF CORRELATIONS
• METHODS OF STUDYING CORRELATION
• PEARSON CORRELATION COEFFICIENT
• SPEARMAN’S RANK CORRELATION
• REGRESSION INTRODUCTION
• ANALYTICS NINJA AND THE REGRESSION GAMES
• SIMPLE LINEAR REGRESSION (SLR)
• MULTIPLE LINEAR REGRESSION (MLR)
• BINARY LOGISTIC REGRESSION (BLR)

EPILOGUE ANALYTICS NINJA AND THE STATISTICAL NIRVANA

ABOUT THE AUTHOR

Sunil Kappal works as an Advanced Analytics Consultant. He has more than 20 years
of experience in Data Analytics, Business Intelligence, Statistical Modeling, Predictive
Modeling and Six Sigma Methodologies.

He has earned numerous industry certificates in the fields of Business Analytics,
Project Management, Healthcare Innovation and Entrepreneurship from top-notch
universities like:
• Duke University
• University of California Irvine Extension
• Wharton School of Business

Sunil has delivered multiple lectures at the University of Texas at Dallas on the usage of
various advanced analytical techniques and machine learning best practices. He was
also invited as a guest speaker at the Symbiosis Institute of Operations Management
to talk on Big Data and Machine Learning.

Apart from delivering guest lectures and presentations on advanced analytics
techniques, Sunil also writes guest blogs for leading IT and Analytics blogs. Besides
these, he also runs his own blog on WordPress.

“If the knowledge remains with the knowledgeable it becomes alchemy”



PREFACE

Why did I write this book?

Statistics for Rookies is a book whose primary aim is to educate its readers about the
most common statistical methods and to help them make data-driven decisions “without
sweat”. The book presents some common yet very effective statistical concepts in the
most engaging and captivating manner.

The secondary aim of this book is to help cement the fundamental statistical concepts
in a way that is easy to follow and jargon-free (this means each statistical term is
demystified using a simple definition). Example: Delta, usually represented
by Δ, is nothing but the difference or change between two quantities.

This book caters to both seasoned and rookie data analysts who want to understand or
refresh the foundational concepts behind any data analytics endeavor. The uniqueness
of this book is in its simplicity: I have tried not to include too many statistical notations,
without compromising the sanctity and correctness of the various
statistical methods.

How is this book arranged?

As mentioned above, this book’s aim is to present statistical concepts in a fun way.
Therefore, each statistical topic is placed strategically and is introduced with a
cartoon strip that is very refreshing and makes the learning process fun and engaging.

In my observation, when people write on an intricate and serious topic such as statistics,
it tends to get overwhelming for readers, especially readers like me who have
a short attention span. Therefore, towards the end of each section there is a
glossary of terms defining each statistical concept that will act as a ready reckoner.

Finally, I just want to say: do not try to memorize the concepts, but do try to
understand them, and try to create a use case in your head based on your day-to-day
tasks.

Confucius Said: “He who memorizes is buying a car without an engine”

Enjoy the book!!!

Sincerely,
Sunil Kappal



WHAT IS STATISTICS?

Definition (Statistics): Statistics consists of a body of methods for collecting and
analyzing data. (Agresti & Finlay, 1997)

It is pretty clear from the above definition that statistics is not only about the tabulation
or visual representation of data. It is the science of deriving insights from data that
can be numerical (quantitative) or categorical (qualitative) in nature. In a nutshell, this
science can be used to answer questions like:

• How much data is enough to perform a statistical analysis?
• What kind of data will require what sort of data treatment?
• What methods can draw the golden nuggets out of the data? That is:
  • Summarizing and exploring the data to understand the spread of the data,
    its central tendency and its measures of association by using various
    descriptive statistical methods
  • Drawing inferences, forecasting and generalizing the patterns displayed by
    the data to reach some sort of conclusion

Furthermore, statistics is the art and science of dealing with events and phenomena
that are not certain in nature. I can confidently say that nowadays statistics is used in
every field of science.

The goal of statistics is to gain understanding from data, and any data analysis should
proceed through a structured series of steps.



DESCRIPTIVE STATISTICS



PART 1 DESCRIPTIVE STATISTICS

ANALYTICS NINJA AND THE STATISTICAL LABYRINTH


Definition: Descriptive statistics are summary statistics that quantitatively describe
the data using its measures of central tendency and measures of variability or
dispersion. The table below depicts the most commonly used descriptive statistics
and visualization methods.

Figure: 1

Descriptive statistics provide summaries of the sample along with the observations
that have been made with regard to the sample data. Such summaries can be
presented in the form of summary statistics (refer to the above visual for the summary
statistics types by data type) or easy-to-decipher graphs.

It is worth mentioning here that descriptive statistics are mostly used to summarize the
values and may not be sufficient to make conclusive generalizations about the entire
population, or to infer or predict data patterns.

MoL = Measure of Location
MoV = Measure of Variance


So far we have looked “visually” at the descriptive statistics that can be
used to explore the data by data type (refer to Figure 1). This section of the
book will help readers appreciate the nuances involved in using those summary
statistics.

Means & Averages:

At times “mean” and “average” are used interchangeably in the field of data analytics,
and I will follow the same nomenclature to avoid unnecessary confusion related to
these two terms.

The term “mean” or “average” is one of the many summary statistics which can be
used to describe the central tendency of the sample data. Computing this statistic is
pretty straightforward: sum all the values and divide by the number of values to get
the mean, or average, of the sample.

Example:
The mean or average is (1+1+2+3+4)/5 = 2.2
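
For readers who like to check things hands-on, here is the same calculation in Python (a minimal sketch of my own, not one of the book’s figures):

    values = [1, 1, 2, 3, 4]

    # Sum all the values and divide by the number of values
    mean = sum(values) / len(values)
    print(mean)  # 2.2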

The Salary Dilemma

Figure: 1.1

Analytics Ninja Tip: The mean (or average) is sensitive to extreme values, i.e. one
or two extreme values can noticeably change the mean.


Median:
The median is the value separating the higher half of the sample data from the lower
half. The median can also be viewed as another way of finding the center of the sample
data: sort the number list from low to high and then find the middle value
within the list.

Example: One number in the middle:

Number list = 3, 2, 4, 1, 1
Step 1: sort the number list low to high = 1, 1, 2, 3, 4
Step 2: find the middle value = 1, 1, (2), 3, 4
Median = 2

In the previous section, when we looked at the mean or average for the same data set,
it was 2.2. However, when we use the median statistic, it turns out the central
tendency for this data set is 2.

Example: Two numbers in the middle

With an even count of numbers, things get a little tricky. In this case, we have to
identify the middle pair of numbers and then find the value that is halfway between
them. This can easily be done by adding them together and dividing by two.

Let’s look at this example, where we have fourteen numbers, so we don’t have just
one middle number; we have a pair of middle numbers:

Number list = 3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29
Step 1: sort the number list low to high = 3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56

In the above example, the middle numbers are 21 and 23.


To find the value halfway between them, add them together and divide by 2:
21 + 23 = 44
then 44 ÷ 2 = 22
So the Median in this example is 22.

(Note that 22 is not in the number list, but that is OK because half the numbers in the list are less, and half the numbers
are greater than 22.)
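
The same two-step procedure can be sketched in a few lines of Python (my own illustration; Python’s built-in statistics.median gives the same answer):

    import statistics

    def median(numbers):
        # Step 1: sort the list low to high
        ordered = sorted(numbers)
        n = len(ordered)
        mid = n // 2
        if n % 2 == 1:
            return ordered[mid]  # odd count: single middle value
        # Even count: add the middle pair and divide by two
        return (ordered[mid - 1] + ordered[mid]) / 2

    data = [3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29]
    print(median(data))             # 22.0
    print(statistics.median(data))  # 22.0, same result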


Variance / Delta and Standard Deviation:

In my 20 years of experience, I have seen people using these terms
interchangeably when summarizing a dataset, which in my opinion is not only
incorrect but also dangerous. You can call me “a purist”, but this is the distinction
we have to make to understand the difference between these statistics.

So, just to clear the air, I will take a step back and try to define these terms in
simple English and also in their statistical forms.

Definition: Variance can be defined as the average of the squared differences from the
mean.

To put some context to the above definition: in everyday business usage, “variance” is
often the difference between an expected and an actual result, such as between a
budget and actual expenditure.

Variance Formula Demystified

Figure: 1.2

I know the above formula can be pretty daunting. Therefore, I will list out the steps to
calculate the variance in an easy-to-understand manner:

1. Work out the mean (refer to the means and averages section of the book)
2. Then, for each number: subtract the mean and square the result (squared
difference)
3. Work out the average of those squared values

Still unclear about the entire math? Let’s look at it visually to understand how to calculate
the variance using a dataset.

Note: The square root of the variance, σ, is called the standard deviation. Just like the
variance, the standard deviation is used to describe the spread. A statistic like the standard
deviation can be more meaningful when expressed in the same units as the mean, whereas
the variance is expressed in squared units.


Calculating Variance:
As an example, let’s look at the two distributions below and understand the step-by-step
approach to calculating the variance statistic:

Data Set 1: 3, 4, 4, 5, 6, 8
Data Set 2: 1, 2, 4, 5, 7, 11

Figure: 1.3

How does it all work?

Figure: 1.4
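
Since Figure 1.4 is a static visual, here is the same walk-through as a small Python sketch (my own illustration) using the two data sets from Figure 1.3; it computes the population variance, i.e. it divides by the number of values, exactly as the steps above describe:

    def variance(values):
        # Step 1: work out the mean
        mean = sum(values) / len(values)
        # Step 2: for each number, subtract the mean and square the result
        squared_diffs = [(x - mean) ** 2 for x in values]
        # Step 3: work out the average of those squared values
        return sum(squared_diffs) / len(values)

    data_set_1 = [3, 4, 4, 5, 6, 8]
    data_set_2 = [1, 2, 4, 5, 7, 11]
    print(variance(data_set_1))  # ~2.67: values cluster near the mean of 5
    print(variance(data_set_2))  # 11.0: same mean of 5, but a wider spread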

The above example helps us to appreciate the intricacies involved in
calculating the variance statistic, vis-à-vis stating that variance and delta are one and
the same thing.

Delta can be defined as a change, or a difference %, where we simply subtract the
historic value from the most recent value and divide it by the historic value to get a
delta % (some people call it variance % as well).

Example: (255 – 234)/234 ≈ 9% (delta %), and if we do not divide this by the historic
number, we get the pure delta = 21.
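
The same arithmetic in a couple of lines of Python (a sketch I am adding, using the numbers from the example above):

    recent, historic = 255, 234

    pure_delta = recent - historic              # 21
    delta_pct = (recent - historic) / historic  # ~0.0897, i.e. about 9%
    print(pure_delta, f"{delta_pct:.0%}")       # 21 9%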


Standard Deviation Explained:

Standard deviation is defined as “the deviation of the values or data from the
mean”.

The standard deviation helps us to know how the values of a particular dataset are
dispersed. A lower standard deviation indicates that the values are very close to their
average, whereas higher values mean the values are far from the mean. The standard
deviation can never be negative.

Interpreting σ, the standard deviation, using Chebyshev’s rule: for any population,

• at least 75% of the observations will lie within 2σ of μ,
• at least 89% of the observations will lie within 3σ of μ,
• at least 100(1 – 1/m²)% of the observations will lie within mσ of μ.
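
Chebyshev’s rule is easy to sanity-check by brute force. The sketch below (my own, using made-up, deliberately non-normal data) counts the fraction of a sample lying within m standard deviations of the mean and compares it with the rule’s guaranteed minimum:

    import random
    import statistics

    random.seed(42)
    # Made-up population: exponentially distributed, so clearly non-normal
    data = [random.expovariate(1.0) for _ in range(10_000)]

    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation

    for m in (2, 3, 4):
        within = sum(abs(x - mu) <= m * sigma for x in data) / len(data)
        bound = 1 - 1 / m**2  # Chebyshev's lower bound: 1 - 1/m^2
        print(f"m={m}: observed {within:.1%}, guaranteed at least {bound:.0%}")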

Things to remember about standard deviation:

• Use it when comparing unlike measures
• It is the most common measure of spread
• Standard deviation is the square root of the variance

Figure: 1.5


Glossary:

• Central tendency: A characteristic of a sample or population; intuitively, it is the
most average value.
  • Mean
  • Median
• Spread: A characteristic of a sample or population; intuitively, it describes how
much variability there is.
  • Variance: A summary statistic often used to quantify spread.
  • Standard deviation: The square root of variance, also used as a measure of
    spread.
• Frequency: The number of times a value appears in a sample.
• Histogram: A mapping from values to frequencies, or a graph that shows this
mapping.
• Distribution: A summary of the values that appear in a sample and the frequency,
or probability, of each.
• Mode: The most frequent value in a sample.
• Outlier: A value far from the central tendency.

Statistical Notation Cheat Sheet:

• Σ = summation
• X = individual value
• xᵢ = each individual value
• x̄ = the mean, or average
• N = population size
• σ² = variance
• σ = standard deviation
• μ = mean of the population data



INFERENTIAL STATISTICS



PART 1 INFERENTIAL STATISTICS

Define Inferential Statistics:

It is a set of techniques used to draw conclusions about a population by testing data
taken from a sample of that population. It is the process of generalizing the pattern of
the sample to the entire population.

Inferential statistics is used to make statements about the population.

To be able to understand the concept of inferential statistics, it is important that we
first understand two primary data variable types:
1. Qualitative Variables
2. Quantitative Variables

Note: To gain a basic understanding of what qualitative and quantitative
variables are, please visit the “What is Statistics?” section of this book.

Qualitative Variables:
The observations that fall into a particular category or class of a qualitative variable
are depicted in the form of a frequency (or count). A tabulation of such classes and
their frequencies is called a frequency distribution.

In addition to looking at the frequencies, we can also look at percentages: each frequency
can be expressed as a percentage of its class. We find the percentage by dividing the
frequency of the class by the total number of observations and multiplying it by 100.
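
As a concrete illustration (made-up survey data, my own sketch), a frequency distribution and class percentages can be tabulated in a few lines of Python:

    from collections import Counter

    # Made-up qualitative data: survey responses falling into classes
    responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"]

    freq = Counter(responses)  # frequency distribution: class -> count
    total = len(responses)

    for category, count in freq.items():
        pct = count / total * 100  # frequency / total observations x 100
        print(f"{category}: frequency = {count}, percentage = {pct:.1f}%")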
Qualitative Variables

Strengths:
• Can give a nuanced understanding of the perspectives and needs of program participants
• Can help support or explain results indicated in quantitative analysis
• Source of detailed or “rich” information which can be used to identify patterns of behavior

Limitations:
• May lend itself to working with smaller populations, which may not be representative of larger demographics
• Data analysis can be time-consuming
• Analysis can be subjective; there is potential for evaluator bias in analysis/collection

Figure: 1.6

Analytics Ninja Tip: Various types of analysis can be performed on qualitative
data, e.g. keyword identification, concept elaboration, etc.


Quantitative Variables:
A quantitative variable is something that can be quantified: something that can be
counted or measured. Continuous quantitative data can take infinitely many values
and is usually a measurement (temperature, pressure, humidity, length, time).

In layman’s terms, we can say that qualitative variables are variables that vary
in kind, like “beautiful” or “not so beautiful”, “understanding” or “not
understanding”, whereas quantitative variables vary in amount, like height, weight,
salary, etc.
Quantitative Variables

Strengths:
• Clear and specific
• Accurate and reliable if properly analyzed
• Can be easily communicated via charts and graphs
• Many large datasets already exist that can be analyzed

Limitations:
• Data collection methods provide respondents with a limited number of response options
• Can require complex sampling procedures
• May not accurately describe a complex situation
• Requires some expertise with statistical analysis

Figure: 1.7

Data (Variable) Types Hierarchy

Figure: 1.8

Analytics Ninja Tip: Quantitative variables are numerical information, the analysis of which
involves statistical techniques. Data type guides the analytical process.


Sampling Techniques:
Recall the definition of inferential statistics: generalizing the sample
population’s patterns and insights to the overall population. For us to understand
this definition and cement the idea of inferential statistics in our minds, we need to
understand the basics of inferential statistics.

These basics are:

1. Population (N)
2. Sample Population (n)
3. Sampling Techniques
• Random Sampling
• Stratified Sampling
• Distributions

Even before we start talking about the above basics, it will be a good idea to
understand what inferential statistics can do for us.

Inferential statistics helps us to move from a simple guess to an educated guess. By
deploying various inferential statistical analyses and tests, we can confirm whether
what we guessed was right or not. These guesses can be termed hypotheses, and
we will cover this part later in this book.

Definition (Population): A population is the collection of all individuals or items under
consideration in a statistical study (Weiss, 1999).

Definition (Sample): A sample is the part of the population from which
information is collected (Weiss, 1999).

Population vs. Sample

Population (N) Sample (n)


Figure: 1.9

In statistics, we rely a lot on a sample to draw inferences about the entire
population. Inferential statistics provide a way to base our conclusions about the
population on the sample, by inferring the parameters of the population from the
statistics of the sample.

The “parameters” in the above statement are quantities such as μ, the mean, and
σ, the standard deviation.

I know it is getting heavy... so let me just put it this way: inferential statistics gives
us a way to generalize the patterns observed in the sample to the overall population,
based on the inferential analyses and tests performed on the sample data.

This section can be considered the most important part of this book, where we will
develop the basic intuition for picking the most appropriate sample. It is also worth
mentioning that it is very important for a researcher to work with samples rather
than with the entire population.

So, what are those sampling techniques?

Though there is a variety of sampling techniques available, in keeping with the theme
of this book, “Statistics for Rookies”, I will discuss only two main sampling
techniques: Random Sampling and Stratified Sampling. However, for the
brainiacs who adore the intricacies of statistics, I have created a hierarchical view of
the various sampling methods.
Sampling Methods

Focus Area

Figure: 1.10

The following table briefly describes various sampling methods with their associated
pros and cons.

Random Sampling
• Definition: The sample is selected in a random fashion.
• Pros: Highly effective when all subjects participate in data collection.
• Cons: A very small sample size introduces sampling error.

Stratified Sampling
• Definition: Represents specific subgroups or strata.
• Pros: Accurate and effective representation of all subgroups; accurate estimates in
  case of similarity or dissimilarity of subgroups.
• Cons: The person applying this sampling method should have a proper understanding
  of the subgroups, otherwise misrepresentation of the sample can cause analytical
  fallacy.

Systematic Sampling
• Definition: Includes every Nth observation of the population in the study.
• Pros: Time- and cost-effective and efficient.
• Cons: May result in high sampling bias if periodicity exists.

Cluster Sampling
• Definition: Clusters of observations representing the population are identified as the
  sample population.
• Pros: Time- and cost-effective and efficient.
• Cons: High sampling errors are observed in this method compared to other sampling
  methods.

Figure: 1.10


Figure: 1.11
A survey of 2,000 people from the population of a particular state was conducted. In the
above example, the sample is the 2,000 people surveyed in that state, and 2,000 is one
example of a sample size.

Random Sample:
In a purely random sample, every unit of the population has an equal chance of being
selected, removing bias from the selection procedure. To conduct a random sample,
a population is first defined, as well as a target sample size. Units of the population
are then chosen at random.

Because the selection is random, the sample is assumed to be representative of the
population, and the information collected can be used to develop inferences about
the whole population.

Caveat:
Conducting a truly random sample may be challenging where the population is large,
dispersed, or hidden.
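
A minimal sketch of simple random sampling in Python, assuming the population is available as a list (random.sample draws without replacement, so every unit has an equal chance of selection):

    import random

    random.seed(1)
    population = list(range(1, 10_001))  # hypothetical population of 10,000 units

    sample_size = 200  # target sample size, defined up front
    sample = random.sample(population, sample_size)  # every unit equally likely
    print(sample[:10])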


Figure: 1.12

Stratified Sample:

Stratified sampling is a random sampling method where you divide members of a
population into “strata”, or homogeneous subgroups. It is a two-step process in
which the population is partitioned into subpopulations or strata.

To put it in layman’s language, stratified sampling is the process of selecting a
sample that allows identified subgroups in the defined population to be represented
in the same proportion that they exist in the population.

Steps to perform stratified sampling (a Python sketch follows below):

1. Identify and define the population
2. Determine the desired sample size
3. Identify the variable and subgroups (strata) for which you want to guarantee
exact and equal representation

Advantages:
• Precise sample
• Can be used for both proportions and stratification sampling
• Sample represents the desired strata
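
Here is the promised sketch of proportional stratified sampling (my own illustration with a hypothetical population; each stratum is sampled in proportion to its share of the population):

    import random
    from collections import defaultdict

    random.seed(1)

    # Hypothetical population: (unit_id, stratum) pairs, 600 urban / 300 rural
    population = [(i, "urban" if i % 3 else "rural") for i in range(1, 901)]
    sample_size = 90  # desired overall sample size

    # Group the units by stratum
    strata = defaultdict(list)
    for unit, stratum in population:
        strata[stratum].append(unit)

    # Draw from each stratum in proportion to its share of the population
    sample = []
    for stratum, units in strata.items():
        k = round(sample_size * len(units) / len(population))
        sample.extend(random.sample(units, k))

    print(len(sample))  # 90: 60 urban + 30 rural, same proportions as N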

Cluster Sampling:
It can be defined as the process of randomly selecting intact groups, not individuals,
within the defined population, the groups sharing similar characteristics. This can also
be classified as “multistage sampling”.

Figure: 1.13

Steps to perform cluster sampling (a Python sketch follows below):

1. Identify and define the population
2. Determine the desired sample size
3. Identify and define a logical cluster
4. List all clusters that make up the population
5. Estimate the average number of population members per cluster
6. Determine the number of clusters needed by dividing the sample size by the
estimated size of a cluster
7. Randomly select the needed number of clusters by using a table of random
numbers
8. Include in your study all population members in each selected cluster

Advantages:
• Efficient
• Researcher doesn’t need excessive details about the population members
• Very useful for educational research
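
The promised sketch of the cluster procedure (hypothetical data: schools as clusters, students as population members; whole clusters are selected at random and every member of a chosen cluster is included):

    import random

    random.seed(7)

    # Hypothetical clusters: 20 schools of 30 students each
    schools = {f"school_{i}": [f"student_{i}_{j}" for j in range(30)]
               for i in range(20)}

    desired_sample_size = 90
    avg_cluster_size = 30
    n_clusters = desired_sample_size // avg_cluster_size  # 3 clusters needed

    chosen = random.sample(sorted(schools), n_clusters)  # pick clusters at random
    sample = [s for school in chosen for s in schools[school]]
    print(chosen, len(sample))  # every member of each chosen cluster is included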
Systematic Sampling:
The process of selecting individuals within the defined population from a list by
taking every Nth name.

Figure: 1.14

Steps to perform systematic sampling (a Python sketch follows below):

1. Identify and define the population
2. Determine the desired sample size
3. Obtain a list of the population
4. Determine what N is equal to by dividing the size of the population by the
desired sample size
5. Start at some random place in the population list (close your eyes and point
your finger to a name)
6. Starting at that point, take every Nth name on the list until the desired sample
size is reached
7. If the end of the list is reached before the desired sample size is reached, go back
to the top of the list

Advantages:
• Sample selection process is simple
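
And the promised sketch of the every-Nth procedure (hypothetical list of names; N is the population size divided by the desired sample size, with a random starting point and wrap-around):

    import random

    random.seed(3)
    population = [f"name_{i}" for i in range(1, 1001)]  # hypothetical list

    desired = 50
    N = len(population) // desired             # sampling interval: 1000/50 = 20
    start = random.randrange(len(population))  # random place in the list

    # Take every Nth name, wrapping to the top of the list if needed
    sample = [population[(start + k * N) % len(population)]
              for k in range(desired)]
    print(len(sample), sample[:3])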

To conclude this section: the process of selecting a number of
individuals for a study in such a way that the individuals represent the larger group
from which they were selected is called sampling.

The respondents, members, or individuals selected for a study, whose
characteristics exemplify the larger group from which they are selected, are called a
sample.

The larger group from which individuals are selected to participate in a study is called
a population.


Inferential Analysis:
As mentioned at the start of this section, “inferential statistics is a set of techniques
to draw conclusions about a population by testing the data taken from a sample of
that population”. Similarly, inferential analysis uses statistical tests to identify a
pattern and its effects on the sample data.

Technically, inferential statistics can be defined as a set of methods that helps to
establish a relationship between an intervention and an outcome, as well as the
strength of that relationship.

The first step in inferential analysis is to understand the data distribution. It is
the data distribution that will guide the type of test that can be deployed on the
sample data.

There are two types of distributions: normal and non-normal. The standard normal
distribution’s mean is always 0, with a standard deviation of 1, and it is often called a
bell curve. The graph below is an example of what a normal distribution should look
like. When the data is normally distributed, we use parametric statistical tests.

Figure: 1.15

Non-Normal Distributions:
There are several ways in which a distribution can be non-normal. A small sample
size or too many outliers within the data set are a few common reasons for
distributions to be non-normal. When the data set is non-normal, we use non-
parametric statistical tests.


Negative and Positive Skew:

Now we will review two types of non-normal distributions, but before that, we need to
understand what skew means.

Skew: It is a graph attribute where the data is not plotted per the famous bell curve
but is elongated towards the left or the right, where a left elongation denotes a
negative skew and a right elongation denotes a positive skew.

Technical definition Skew: It is a measure of asymmetry (unevenness) of a data set

Skew Interpretation:
• Skewness < 0 = Left-skewed distribution, where most of the values are
concentrated on the right of the mean, with extreme values on the left.
• Skewness > 0 = Right-skewed distribution, where most of the values are
concentrated on the left of the mean, with extreme values on the right.

Figure: 1.16
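
Skewness can be computed as the average cubed deviation from the mean divided by the cube of the standard deviation (the third standardized moment). A small sketch with made-up data (my own illustration):

    import statistics

    def skewness(values):
        # Population skewness: mean cubed deviation / sigma cubed
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        n = len(values)
        return sum((x - mu) ** 3 for x in values) / (n * sigma ** 3)

    right_skewed = [1, 2, 2, 3, 3, 4, 18]      # long tail on the right
    left_skewed = [1, 15, 16, 17, 17, 18, 19]  # long tail on the left
    print(skewness(right_skewed))  # > 0: positive (right) skew
    print(skewness(left_skewed))   # < 0: negative (left) skew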

Kurtosis:
It is another measure of the shape of a frequency curve. It is a Greek word, which
means bulginess.

While skewness signifies the extent of asymmetry, kurtosis identifies the degree of
peakedness of a distribution. Karl Pearson classified curves into three types on the
basis of the shape of their peaks:
• Mesokurtic
• Leptokurtic
• Platykurtic


Part 1 Conclusion:
We can conclude Part 1 of this book with a simple statement: statistics is a
branch of mathematics that transforms data into information for decision-makers.
The process of decision making can be further divided into two parts: descriptive and
inferential statistics.

While descriptive statistics helps us to summarize and describe the data, inferential
statistics helps us to draw conclusions and/or make decisions related to a population
in question, based on the sample data taken from that population.



PROBABILITY



PART 2 PROBABILITY
ANALYTICS NINJA AND THE PROBABILITY CONUNDRUM – PART 1



ANALYTICS NINJA AND THE PROBABILITY CONUNDRUM – PART 2


Introduction to Probability
Definition: Probability is the chance (or likelihood) of an event happening. Whenever
we are unsure about the outcome of an event, we can talk about the probabilities of
certain outcomes using words from the probability scale.

Figure: 1.17

Probability can be described using fractions.

• The probability of a flipped coin landing on heads is 1 out of 2 = 1/2
• The probability of this spinner landing on 3 is 2 out of 6 = 2/6
• The probability of rolling a 2 on a die is 1 out of 6 = 1/6

Now that we have looked at a few basic probability examples, we can comfortably say
that the probability of a particular outcome is the proportion of times that outcome
would occur in a long run of repeated observations.

A simplified representation of such an experiment is a very long sequence of
coin flips, the outcome of interest being that the head faces upwards.


Understanding Probability
We will start with the most confusing parts first and work our way up. There is
general agreement that probability is a real value between 0 and 1, which gives it a
quantitative connotation compared to the qualitative notion of something being less
or more likely to happen.

It is important to understand that the “things” probabilities are assigned to are called
events. If E represents an event, then P represents probability, and if we write P(E) it
means the probability of event E. A situation where E might or might not happen is
called a trial or an experiment.

Sample Space:
The set of all possible outcomes of the experiment is known as the sample space
corresponding to an experiment. The sample space is usually denoted by S, and a
generic element of the sample space (a possible outcome) is denoted by s. The sample
space is chosen so that exactly one outcome will occur. The size of the sample space is
finite, countably infinite or uncountably infinite.

It is worth mentioning here that some sample spaces are better than others.
Consider the experiment of flipping two coins. It is possible to get 0 heads, 1 head, or
2 heads. Thus, the sample space could be {0, 1, 2}. Another way to look at it is
{HH, HT, TH, TT}. The second way is better because each event is as likely to
occur as any other.

When writing the sample space, it is highly desirable to have events which are equally
likely.

Another example is rolling two dice. The sums are {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
However, these sums aren't equally likely. The only way to get a sum of 2 is to roll a 1
on both dice, but you can get a sum of 4 by rolling a 1-3, 2-2, or 3-1. The following
table illustrates a better sample space for the sum obtained when rolling two dice.

              Second Die
First Die   1   2   3   4   5   6
    1       2   3   4   5   6   7
    2       3   4   5   6   7   8
    3       4   5   6   7   8   9
    4       5   6   7   8   9   10
    5       6   7   8   9   10  11
    6       7   8   9   10  11  12

Figure: 1.18

Classical Probability:
The table above lends itself to describing data in another way: using a
probability distribution. Let's consider the frequency distribution for that table.

Sum   Frequency   Relative Frequency
2     1           1/36
3     2           2/36
4     3           3/36
5     4           4/36
6     5           5/36
7     6           6/36
8     5           5/36
9     4           4/36
10    3           3/36
11    2           2/36
12    1           1/36
Figure: 1.19

If just the first and last columns were written, we would have a probability
distribution. The relative frequency of a frequency distribution is the probability of
the event occurring. This is only true, however, if the events are equally likely.

This gives us the formula for classical probability. The probability of an event
occurring is the number in the event divided by the number in the sample space.
Again, this is only true when the events are equally likely. A classical probability is
the relative frequency of each event in the sample space when each event is
equally likely.

P(E) = n(E) / n(S)
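
Using the two-dice sample space from the previous table, the classical formula can be verified directly (a sketch I am adding):

    from itertools import product

    # Sample space: all 36 equally likely outcomes of rolling two dice
    S = list(product(range(1, 7), repeat=2))

    # Event E: the sum of the two dice is 4 (rolls 1-3, 2-2, 3-1)
    E = [roll for roll in S if sum(roll) == 4]

    print(len(E), len(S), len(E) / len(S))  # 3 36 0.0833... = 3/36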

Empirical Probability:
Empirical probability is based on observation. The empirical probability of an
event is the relative frequency of a frequency distribution based upon
observation.

P(E) = f/n
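
Empirical probability comes from observation, which we can imitate by simulation (a sketch of my own; the relative frequency of heads should settle near the classical 1/2):

    import random

    random.seed(0)
    n = 10_000
    f = sum(random.choice("HT") == "H" for _ in range(n))  # observed frequency
    print(f / n)  # empirical P(heads) = f/n, close to the classical 0.5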


Key Probability Rules:

There are two rules of probability which are very important:
1. All probabilities are between 0 and 1, inclusive
2. The sum of all probabilities in the sample space (S) is 1

Other rules to keep in mind:

The probability of an event which cannot occur is 0.
The probability of any event which is not in the sample space is zero.

The probability of an event which must occur is 1.
The probability of the sample space is 1.

The probability of an event not occurring is one minus the probability of it
occurring:

P(E') = 1 - P(E)

“OR” or “UNIONS”

Mutually Exclusive Events

Two events are mutually exclusive if they cannot occur at the same time. Another
word that means mutually exclusive is disjoint.

If two events are disjoint, then the probability of them both occurring at the same
time is 0:

Disjoint: P(A and B) = 0

If two events are mutually exclusive, then the probability of either occurring is the
sum of the probabilities of each occurring.

Specific Addition Rule

Only valid when the events are mutually exclusive:

P(A or B) = P(A) + P(B)


Non-Mutually Exclusive Events


In events which aren't mutually exclusive, there is some overlap. When P(A) and P(B)
are added, the probability of the intersection (and) is added twice. To compensate for
that double addition, the intersection needs to be subtracted.

General Addition Rule


Always valid.

P(A or B) = P(A) + P(B) - P(A and B)
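
A quick check of the general addition rule with a standard deck of cards (my own example): P(heart or king) = 13/52 + 4/52 - 1/52, because the king of hearts would otherwise be counted twice.

    from fractions import Fraction

    p_heart = Fraction(13, 52)
    p_king = Fraction(4, 52)
    p_heart_and_king = Fraction(1, 52)  # the king of hearts, counted twice above

    print(p_heart + p_king - p_heart_and_king)  # 4/13, i.e. 16/52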

"AND" or Intersections

Independent Events
Two events are independent if the occurrence of one does not change the probability
of the other occurring.

An example would be rolling a 2 on a die and flipping a head on a coin. Rolling the 2
does not affect the probability of flipping the head.

If events are independent, then the probability of them both occurring is the product
of the probabilities of each occurring.

Specific Multiplication Rule


Only valid for independent events

P(A and B) = P(A) * P(B)

Dependent Events
If the occurrence of one event does affect the probability of the other occurring, then
the events are dependent.

Conditional Probability
The probability of event B occurring given that event A has already occurred is read
"the probability of B given A" and is written: P(B|A)

General Multiplication Rule


Always works.

P(A and B) = P(A) * P(B|A)


Independence Revisited
The following four statements are equivalent:

1. A and B are independent events
2. P(A and B) = P(A) * P(B)
3. P(A|B) = P(A)
4. P(B|A) = P(B)

The last two are because if two events are independent, the occurrence of one
doesn't change the probability of the occurrence of the other. This means that the
probability of B occurring, whether A has happened or not, is simply the probability
of B occurring.

Conditional Probability
Recall that the probability of an event occurring given that another event has already
occurred is called a conditional probability.

The probability that event B occurs, given that event A has already occurred is

P(B|A) = P(A and B) / P(A)

The above formula comes from the general multiplication rule (see earlier in
this section).

Since we are given that event A has occurred, we have a reduced sample space.
Instead of the entire sample space S, we now have a sample space of A, since we
know A has occurred. So the old rule, the number in the event divided by
the number in the sample space, still applies: it is the number in A and B (which must
be in A, since A has occurred) divided by the number in A. If you then divide the
numerator and denominator of the right-hand side by the number in the sample
space S, you have the probability of A and B divided by the probability of A.
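
Here is the reduced-sample-space argument made concrete with two dice (my own sketch): given that the first die shows a 1 (event A), what is the probability that the sum is 4 (event B)?

    from itertools import product

    S = list(product(range(1, 7), repeat=2))  # full sample space, 36 outcomes
    A = [r for r in S if r[0] == 1]           # first die shows 1 (6 outcomes)
    A_and_B = [r for r in A if sum(r) == 4]   # ...and the sum is 4: only (1, 3)

    # P(B|A) = n(A and B) / n(A) in the reduced sample space
    print(len(A_and_B) / len(A))                        # 1/6
    # ...which equals P(A and B) / P(A) in the full sample space
    print((len(A_and_B) / len(S)) / (len(A) / len(S)))  # 1/6 again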

Note: Refer to the cartoon strip at the start of the Probability section to see this rule
in action, where I have used Bayes' Theorem to identify the probability of a gender
committing a murder in the parking lot based on various conditions.


Glossary:

• Probability Experiment
A process which leads to well-defined results called outcomes
• Outcome
The result of a single trial of a probability experiment
• Sample Space
Set of all possible outcomes of a probability experiment
• Event
One or more outcomes of a probability experiment
• Classical Probability
Uses the sample space to determine the numerical probability that an
event will happen. Also called theoretical probability.
• Equally Likely Events
Events which have the same probability of occurring.
• Complement of an Event
All the events in the sample space except the given events.
• Empirical Probability
Uses a frequency distribution to determine the numerical probability. An
empirical probability is a relative frequency.
• Mutually Exclusive Events
Two events which cannot happen at the same time.
• Independent Events
Two events are independent if the occurrence of one does not affect the
probability of the other occurring.
• Dependent Events
Two events are dependent if the first event affects the outcome or
the occurrence of the second event in a way the probability is changed.
• Conditional Probability
The probability of an event occurring given that another event has already
occurred.
• Bayes' Theorem
A formula which allows one to find the probability that an event occurred
as the result of a particular previous event.



HYPOTHESIS TESTING



PART 3 HYPOTHESIS TESTING

HYPOTHESIS TESTING IN ACTION

SCENE - 1
A question mark (representing the
null hypothesis) is in the dock. An
attorney declares, "The accused here
is presumed innocent until proven
guilty beyond all reasonable doubt."
[We presume H₀ is true and must
prove otherwise.]

SCENE - 2
The attorney continues, "Therefore
your verdict is either that the accused
is 'guilty' [H₀ is rejected beyond all
reasonable doubt] or 'not guilty' [H₀
is not rejected]."

Ho = Null Hypothesis
Ha = Alternative Hypothesis

SCENE - 3
The judge then addresses the court:
"Ladies and gentlemen of the jury, notice
that we say 'not guilty' when the
evidence is consistent with H₀ [Presumed
innocent]. It does not imply that the
accused is actually innocent, merely that
we do not have enough evidence to be
assured of guilt. So H₀ is not 'accepted',
but instead is 'not rejected'. Ladies and
gentlemen, consider the evidence before
you."

What is a hypothesis?
A hypothesis is a primary method of research. It is considered an assumption for
research work to be tested. The main function of the hypothesis is to suggest new
observations or experiments. The hypothesis as a word is pretty interesting and has
two concepts. Considering the fact that "hypothesis" is a combination of two Greek
words, "hypo" and "thesis", it will be a great idea to look at these two concepts in
detail.

First Concept:
1. "Hypo" means "under" and "thesis" refers to "place"; therefore, it is
anything under consideration.

Second Concept:
2. "Hypo" means "less than" and "thesis" means "a generally held view"; therefore,
collectively it means less than a generally held view. This means
"less" or "no" generalization of facts.

It is worth mentioning some key extended definitions of hypothesis by
various scientists and researchers:

1. Goode & Hatt: It is a proposition which is put to a test to determine its validity.
2. Lundberg: It is a tentative or systematic generalization, the validity of which
remains to be tested.
3. Kerlinger: It is a causal relationship between two or more variables.

Therefore, based on the above definitions, we can say that a hypothesis is a
relationship between two or more variables, enabling the experimenter to test it and
giving guidance to the research activity for further analysis.

How to write a Hypothesis?

As we have understood from the above definitions and from the word
"hypothesis" itself, most hypotheses can be divided into two
sections: "if" and "then". Therefore, it is even more important for us to understand
how to write these sections and also to identify the dependent and independent
variables within the statement.


Independent Variable:
The condition being studied. It is controlled by the experimenter. Example: Knowledge

Dependent Variable:
The condition affected by the independent variable. It cannot be controlled by
the experimenter. Example: Career Growth

Writing the “If” section of your Hypothesis

1. Start your sentence with the word “If”
2. Write down one of the variables
3. Connect the statement with one of the following:
• is related to
• is affected by
• causes
4. Write down the other variable

Writing the “then” section of your Hypothesis

1. Make a comment on the relationship between those two variables.

Example “If” section:
If knowledge is related to career growth,
Example “Then” section:
then the more knowledgeable an individual is, the better career growth
he/she will have.

Final “If / then” Statement

If knowledge is related to career growth, then the more knowledgeable individual
will have better career growth.

Now that we have understood how to write a hypothesis statement, it is equally
important for us to understand the basics of hypothesis testing, which is also called
significance testing. Before we venture into the nuts and bolts of hypothesis testing,
it will be beneficial to walk through the process in 5 simple steps.


5 Steps to Hypothesis Testing:

Step 1 – State the Null and Alternative Hypothesis
Null hypothesis: Ho; alternative hypothesis: Ha.
1. Use a two-tailed test if the alternative hypothesis does not state a direction
(greater or less).
2. Use a one-tailed test if the alternative states a direction.

Step 2 – Choose the Level of Significance
1. The .01 level (1%) for consumer research
2. The .05 level (5%) for quality assurance
3. The .10 level (10%) for political polling

Step 3 – Choose the Test Statistic
z and t are common test statistics (note: use the t test when n is less than 30; if n is
30 or more, use the z test); others are the F test (to compare more than 2 means)
and chi-square for non-parametric statistics.

Step 4 – State the Decision Rule
Find the critical value of z from the normal distribution table, or the value of t from
the t distribution table, or use the chi-square or F distribution table, where
appropriate.

Step 5 – Make the Decision
Only one decision is possible in hypothesis testing: either do not reject the null
hypothesis, or reject the null hypothesis and accept the alternative hypothesis.

Figure: 1.20
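
The five steps map directly onto code. Below is a sketch of a one-sample t test (my own made-up data; scipy's ttest_1samp computes the test statistic and p-value, which we compare with the chosen level of significance):

    from scipy import stats

    # Step 1: Ho: the population mean is 50; Ha: it is not (two-tailed)
    sample = [52.1, 49.8, 53.4, 51.0, 54.2, 50.7, 52.9, 53.8]  # made-up data
    mu0 = 50

    # Step 2: level of significance
    alpha = 0.05

    # Step 3: n < 30, so the t test is the appropriate test statistic
    t_stat, p_value = stats.ttest_1samp(sample, mu0)

    # Steps 4-5: decision rule and decision
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    if p_value < alpha:
        print("Reject Ho and accept Ha")
    else:
        print("Do not reject Ho")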

Type I and Type II Errors:

In keeping with the adage "to err is human", people can make mistakes when they
perform hypothesis testing as part of a statistical analysis. They can either
make Type I errors or Type II errors. Therefore, it is very important to understand
the difference between these two types of errors, considering the fact that there is
some level of risk involved in making each type of error in every analysis, and that
the amount of risk is under the experimenter's control.

As seasoned or budding statisticians, we know that we begin any hypothesis test
with the assumption that the null hypothesis is correct; the null hypothesis is the
default position, similar to "not guilty until proven".

It is helpful to view these errors in a table, which can be seen in almost all
statistical textbooks:

Reality              Null (Ho) not rejected   Null (Ho) rejected
Null (Ho) is true    Correct Conclusion       Type 1 error
Null (Ho) is false   Type 2 error             Correct Conclusion

Figure: 2

Interpreting the error table:

We commit a Type 1 error if we reject the null hypothesis when it is true, and we
commit a Type 2 error if we fail to reject the null hypothesis when it is not true.

Note: These errors relate to the statistical concepts of risk, significance, and power.
Answering the bigger question: which type of error is worse?

Well, not to disappoint, but there is no clear answer to the above question. In some
instances, a Type 1 error can carry a lot more risk than a Type 2, and vice versa.
However, based on several experts' opinions and suggestions, using a table like the
one below can help to weigh the consequences of Type 1 and Type 2 errors.

Null (Ho): Medicine A does not relieve Condition B.

Type 1 Error (Ho true, but rejected): Medicine A does not relieve Condition B, but it
is not eliminated as a treatment option.
Result: Patients with Condition B who receive Medicine A get no relief. They may
experience side effects or even a worsening condition, up to fatality. Possible
litigation.

Type 2 Error (Ho false, not rejected): Medicine A relieves Condition B, but it is
eliminated as a treatment option.
Result: A viable treatment remains unavailable to patients. Profit potential lost.

Figure: 2.1

Keep in mind that before testing a statistical hypothesis it is important to clearly state
the nature of the claim to be tested, and since we assume the null hypothesis is true,
we control for Type I error by stating a level of significance. The level we set, called
the alpha level (symbolized as α), is the largest probability of committing a Type I
error that we will allow and still decide to reject the null hypothesis. This criterion is
usually set at .05 (α = .05), and we compare the alpha level to the p-value. When the
probability of a Type I error is less than 5% (p < .05), we decide to reject the null
hypothesis; otherwise, we retain the null hypothesis.


Hypothesis Testing Tree:

Figure: 2.2



CORRELATION & REGRESSION



PART 4 CORRELATION & REGRESSION
ANALYTICS NINJA AND THE CURSE OF MULTICOLLINEARITY




Figure: 2.3

It often gets confusing, and even terrifying, trying to solve the puzzle of which
correlation technique should be deployed based on the properties of the X and Y
variables. In this section of "Statistics for Rookies" we will discuss two types of
correlation statistics:
1. Pearson Correlation Coefficient
2. Spearman Rank Correlation

Correlation is a bivariate technique that measures the strength of the relationship
between two variables. The value of a correlation varies between +1 and -1, where +1
denotes a highly positive relationship between two variables and -1 indicates the
inverse. As the correlation coefficient approaches 0, the relationship between the two
variables gets weaker.

To be able to understand the above-mentioned correlation types, we need to first
understand what correlation is all about.

Correlation is the degree of relationship measured between two continuous or
discrete variables. The measure of correlation is called the "correlation coefficient",
and the degree of relationship is expressed by a coefficient which ranges from -1 to +1
(as stated earlier in this section).

The correlation statistic helps us to get an idea of the degree and direction of the
relationship between two variables. It deals with the association between two
or more variables.


“Correlation is not Causation”

Causation means a cause-and-effect relationship; correlation enables us to identify
whether there is an interdependency between the variables. However, it does not
imply that one is causing the other.

If two or more variables vary in such a way that movement in one is accompanied by
movement in the other, these variables are said to be correlated; this alone does not
establish a cause-and-effect relationship.

It is advised that one should always remember that causation implies correlation, but
correlation does not necessarily imply causation.

Figure: 2.4

Types of Correlations I

Positive Correlation: When the values of two variables change in the same
direction, the correlation is considered positive. Example:

• As X increases, Y increases
• As X decreases, Y decreases
• E.g., as height increases, so does weight

Negative Correlation: When the values of two variables change in opposite
directions, the correlation is considered negative. Example:

• As X increases, Y decreases
• As X decreases, Y increases
• E.g., as play time increases, grades decrease

Types of Correlation II
Linear Correlation: Correlation is said to be linear when the amount of change in
one variable tends to bear a constant ratio to the amount of change in the other.

Figure: 2.5

Nonlinear Correlation: Correlation is said to be nonlinear when the amount of
change in one variable does not bear a constant ratio to the amount of change in
the other variable.


Methods of Studying Correlations

Pearson Correlation Coefficient

This correlation technique is widely used in statistics to measure the strength of the
relationship between linearly related variables. For example, in a contact center, if we
want to measure how two metrics are related to each other (Call Duration and Non-
Talk), the Pearson correlation technique can be used to measure the degree of
relationship between these two variables.

The following formula is used to calculate the Pearson correlation:

Figure: 2.6

r = Pearson correlation coefficient
N = number of values in each data set
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores
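
The formula translates term by term into Python (a sketch with made-up paired scores; scipy.stats.pearsonr would give the same value):

    from math import sqrt

    def pearson_r(x, y):
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_xy = sum(a * b for a, b in zip(x, y))  # sum of products of pairs
        sum_x2 = sum(a * a for a in x)             # sum of squared x scores
        sum_y2 = sum(b * b for b in y)             # sum of squared y scores
        num = n * sum_xy - sum_x * sum_y
        den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return num / den

    # Made-up contact-center data: call duration vs. non-talk time (seconds)
    duration = [300, 420, 380, 540, 610, 450]
    non_talk = [40, 65, 50, 90, 110, 70]
    print(pearson_r(duration, non_talk))  # ~0.99: strong positive relationship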

Questions that can be answered by the Pearson correlation:

• Is there a statistically significant relationship between call duration and Non-Talk?
• Is there a relationship between temperature, measured in degrees Celsius, and ice
cream sales, measured by income?
• Is there a relationship between age in years and height in inches?

Assumptions
The Pearson correlation technique assumes that both variables are normally
distributed. It also assumes linearity and homoscedasticity between
the variables. Linearity assumes a straight-line relationship between the
variables, and homoscedasticity assumes that the data is normally distributed about
the regression line.


Spearman Rank Correlation

Spearman rank correlation is a non-parametric test which is used to measure the
degree of association between two variables. It was developed by Spearman,
and is thus called the Spearman rank correlation. The Spearman test does not assume
anything about the distribution of the data. It is appropriate to use the Spearman
rank correlation test when the variables are measured on a scale that is ordinal.

The following formula is used to calculate the Spearman rank correlation:

ρ = 1 – (6 ∑dᵢ²) / (n(n² – 1))

Where:
ρ = Spearman rank correlation
dᵢ = the difference between the ranks of corresponding values Xᵢ and Yᵢ
n = number of values in each data set
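
The rank-based formula as a Python sketch (my own example, deliberately free of tied values, since the simple formula assumes no ties):

    def spearman_rho(x, y):
        def ranks(values):
            # Rank 1 = smallest value (assumes no ties)
            order = sorted(range(len(values)), key=lambda i: values[i])
            r = [0] * len(values)
            for rank, i in enumerate(order, start=1):
                r[i] = rank
            return r

        rx, ry = ranks(x), ranks(y)
        n = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # squared rank diffs
        return 1 - 6 * d2 / (n * (n ** 2 - 1))

    # Made-up ordinal data: responses to two 1-10 survey questions
    q1 = [2, 5, 1, 8, 9, 4]
    q2 = [3, 6, 2, 7, 10, 5]
    print(spearman_rho(q1, q2))  # 1.0 here: the two rankings agree perfectly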

Questions that can be answered by the Spearman correlation:

• Is there a statistically significant relationship between participant responses to two
Likert-scale questions?
• Is there a statistically significant relationship between how Member Experience
Surveys are rated on a scale of 1 – 10 and the corresponding experience scores?

Assumptions
The Spearman rank correlation test doesn't make any distributional assumptions
about the data. The assumptions of the Spearman rho correlation are that the data
must be at least ordinal and that scores on one variable must be monotonically
related to the other variable.


ANALYTICS NINJA AND THE REGRESSION GAMES


What is Regression?

Regression is often referred to as a measure of the relationship between two or more
variables, where a change in a dependent variable is associated with, and depends
on, a change in one or more independent variables.

Dependent variable: the variable we wish to explain (also called the endogenous
variable)

Independent variable: the variable used to explain it (also called the exogenous
variable)

What is Regression Analysis?

Regression analysis is a statistical method used to predict the value of an unknown
variable using a known variable.

Historical Origin of Regression Analysis:

• Regression analysis was first developed by Sir Francis Galton, who studied the
relation between the heights of sons and fathers.
• Heights of sons of both tall and short fathers appeared to "revert", or "regress", to
the mean of the group.

Advantages of Regression Analysis

1. It provides estimates of the values of the dependent variable from the values of
the independent variables.
2. It helps to obtain a measure of the error involved in using the regression line as a
basis for estimates.
3. It also helps us to understand the degree of association, or correlation, that exists
between the two variables.

Assumptions of Regression Analysis

1. An actual linear relationship exists.
2. The regression analysis is used to estimate values within the range for which
it is valid.
3. In regression, there is only one dependent variable; however, more than one
independent variable can be used.
4. The dependent variable can take any random value, but the value(s) of the
independent variables are fixed.

Regression Line

It is the line which gives the best estimate of one variable from the value of the other
variable. The regression line gives the average relationship between the two variables
in mathematical form.

For two variables X and Y, there are always two lines of regression; e.g. the regression
line of X on Y gives the best fit for the value of X for any specific value of Y. For Y on X:

Y = β0 + β1X + U

Y = dependent variable
X = independent variable
β0 = intercept parameter
β1 = slope parameter
U = error term that captures the amount of variation not predicted by the slope and
intercept terms
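
The β parameters can be estimated from data by ordinary least squares. A sketch with made-up data (my own illustration), using the standard formulas β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and β0 = ȳ - β1·x̄:

    def fit_line(x, y):
        x_bar = sum(x) / len(x)
        y_bar = sum(y) / len(y)
        # Slope: co-variation of x and y divided by the variation of x
        b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
              / sum((a - x_bar) ** 2 for a in x))
        b0 = y_bar - b1 * x_bar  # the line passes through (x_bar, y_bar)
        return b0, b1

    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up, roughly y = 2x
    b0, b1 = fit_line(x, y)
    print(f"Y = {b0:.2f} + {b1:.2f}X")  # Y = 0.05 + 1.99X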


Simple Linear Regression (SLR)

1. Simple regression analysis is a statistical tool that gives us the ability to estimate
the mathematical relationship between a dependent variable (usually called y)
and an independent variable (usually called x).
2. The dependent variable is the variable for which we want to make a prediction.
3. While various non-linear forms may be used, simple linear regression models are
the most common.
4. The goal of the analyst who studies the data is to find a functional
relationship between the response variable y and the predictor variable x.
5. The primary goal of quantitative analysis is to use current information
about a phenomenon to predict its future behavior. Current information is
usually in the form of a set of data. In a simple case, when the data form a set
of pairs of numbers, we may interpret them as representing the observed
values of an independent (or predictor) variable X and a dependent (or
response) variable Y.

Is it a good idea to perform Correlation before SLR?

1. Correlation and regression analysis are related in the sense that both deal with
relationships among variables. The correlation coefficient is a measure of linear
association between two variables, and its values always lie between -1 and +1.
2. Neither correlation nor simple linear regression establishes a cause-and-effect
relationship.
3. Because it does not capture cause and effect, even SLR is a probabilistic
prediction model, not a deterministic one.

Hence, the answer is both “Yes” and “No”.
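To illustrate the “Yes” part, here is a minimal sketch (Python with NumPy assumed; the paired observations are invented) that checks the strength of the linear association before any line is fitted:

import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1])

# Pearson correlation coefficient: always between -1 and +1
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")

# A high |r| supports trying SLR, but, as noted above, it still
# says nothing about cause and effect.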



Regression Function

The statement that the relation between X and Y is statistical should be interpreted
as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(X) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation E(Y) = f(X) the
regression function.


Multiple Linear Regression (MLR)

Multiple linear regression is one of the most common forms of linear regression
analysis. As a predictive analytics tool, the MLR method is used to explain the
relationship between one continuous dependent variable and two or more
independent variables. The independent variables can be continuous or categorical.
(A minimal fitting sketch follows the assumptions below.)

Example Questions Answered by MLR:


• Do age, customer sentiment and product type explain the variance in sales
levels?
• If 'x' number of people sign up for a promotional program on a particular web
portal, how many extra sales can we expect?

Assumptions
• Regression residuals must be normally distributed
• A linear relationship is assumed between the dependent variable (Y) and the
independent variables (X)
• Absence of multicollinearity is assumed in the model, meaning that the
independent variables are not too highly correlated
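A minimal MLR sketch under these assumptions (Python with NumPy; the predictors and sales figures below are hypothetical):

import numpy as np

# Hypothetical predictors: age and a customer sentiment score
age = np.array([25, 32, 40, 28, 51, 45], dtype=float)
sentiment = np.array([0.8, 0.6, 0.9, 0.4, 0.7, 0.5])
sales = np.array([120, 115, 150, 95, 160, 130], dtype=float)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(age), age, sentiment])

# Least-squares solution for [b0, b_age, b_sentiment]
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)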

How to Deal with Multicollinearity?

This can be dealt with by centering the variables, i.e., subtracting the mean from
each observation (note that fully standardizing a variable also divides it by its
standard deviation). The process is pretty simple (see the sketch after this list):

• Calculate the mean of each continuous variable
• Subtract the mean from all observed values of that variable
• Finally, use the centered variable in your model
• Alternatively, if, like me, you would rather not make any exceptions while
creating a model, you can use the variance inflation factor (VIF) statistic to flag
predictor variables exhibiting severe multicollinearity.
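A minimal sketch of both remedies (Python with NumPy; the predictors are deliberately constructed to be nearly collinear, and the vif helper below is my own illustration, not a library call):

import numpy as np

# Two hypothetical, nearly collinear predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + np.array([0.1, -0.2, 0.3, -0.1, 0.2, -0.3])

# Steps 1-2: center each variable by subtracting its mean
x1_c = x1 - x1.mean()
x2_c = x2 - x2.mean()

def vif(target, other):
    """VIF = 1 / (1 - R^2) from regressing one predictor on the rest."""
    X = np.column_stack([np.ones_like(other), other])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

# Values above roughly 5-10 are commonly read as severe multicollinearity
print(f"VIF(x1) = {vif(x1_c, x2_c):.1f}")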

(Note: Refer to the Analytics Ninja cartoon strip at the start of this section, pages
47-48, to understand how to deal with multicollinearity in a visual way.)


Binary Logistic Regression (BLR)

Statistical techniques like regression and analysis of variance (ANOVA) are useful
when the response variable (Y) is continuous. However, if Y, also known as the Key
Performance Output Variable (KPOV), is discrete, then these methods end up being
redundant or futile.

If the response variable is binary (discrete) and the input variable(s) is/are continuous,
then we can use the BLR method. Binary logistic regression is helpful for
understanding how various factors affect the probability of an event.

To gain an in-depth understanding of binary logistic regression, it is a good idea to
break the equation down and understand it bit by bit.

Equation: logit(P) = ln(P / (1 − P)) = β0 + β1x1 + β2x2 + β3x3 + … + βnxn

• P = the probability of the event
• β1, β2, …, βn = the coefficients; we want to see whether they are statistically
significant and, if they are, what their values are
• x1, x2, …, xn = the factors, or independent variables, having some effect
(significant or not) on the probability

Binary logistic regression also has a concept of “odds” (O). This can be understood
through the example of winning a bet: if the probability of winning a bet is 0.75, the
odds in favor of winning the bet are O = 0.75/(1 − 0.75) = 3, which means you are
three times as likely to win the bet as to lose it. Those who are familiar with betting
will be in a better position to understand the workings of odds than those who are
novices and approach this logic purely from an equation perspective.
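Here is the odds arithmetic as a minimal sketch (plain Python standard library; the 0.75 probability is the betting example above):

import math

p = 0.75                   # probability of winning the bet
odds = p / (1 - p)         # O = 0.75 / 0.25 = 3, i.e., 3-to-1 in favor

# The logistic (inverse logit) function maps a linear predictor
# b0 + b1*x1 + ... back onto a probability between 0 and 1
def logistic(z):
    return 1 / (1 + math.exp(-z))

log_odds = math.log(odds)            # the logit: ln(p / (1 - p))
print(odds, logistic(log_odds))      # 3.0 and 0.75, back where we started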

Best Practices for Binary Logistic Regression:

Go “full throttle”, i.e., start with the “full model”: make sure the model includes all
the significant factors present in the data.
Then “reduce one variable at a time” and rerun the regression using the reduced
model. This will ensure that the model is pared down to only those variables that are
vital and exhibit no multicollinearity.


How to assess the model?

“The Log-Likelihood Statistic”: this is similar to the residual sum of squares in multiple
regression and indicates how much unexplained information remains after the model
has been fitted. Large values indicate a poorly fitting statistical model.
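A minimal sketch of the statistic itself (plain Python; the fitted probabilities and observed outcomes are invented):

import math

# Hypothetical fitted probabilities and observed binary outcomes
p_hat = [0.9, 0.8, 0.3, 0.7, 0.2]
y = [1, 1, 0, 1, 0]

# Log likelihood: log(p) when the event occurred, log(1 - p) when it did not
log_lik = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
              for yi, pi in zip(y, p_hat))

# -2 * log likelihood: the larger the value, the poorer the fit
print(-2 * log_lik)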

Apart from the above statistic, most applications provide exhaustive additional
information that can help assess the performance of the model. I will discuss this
regression technique in much more detail in an upcoming article.



STATISTICS CHEAT SHEETS


SAMPLE
Sample Set – the list of the population from which a sample is picked
Biased – often called “cherry picking”, where the sample is hand-picked instead of
being drawn at random from the population

Sample Types:
Non-Probability Sample: choose what you think represents the population.
• Convenience Sample: an easily accessed sample
Probability Sample: elements are selected based on probability
• Simple Random Sample
• Systematic Sample
Stratified Sampling: divide the population into subgroups based on their contribution
to the overall population.
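A minimal sketch contrasting simple random and systematic selection (plain Python standard library; the population of 100 IDs is invented):

import random

population = list(range(1, 101))   # hypothetical population of 100 IDs

# Simple random sample: every element has an equal chance of selection
srs = random.sample(population, 10)

# Systematic sample: every k-th element after a random start
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

print(srs)
print(systematic)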

PROBABILITY
Probability Rules:
Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
If A and B are mutually exclusive, then P(A ∪ B) = P(A) + P(B)
Multiplication Rule: P(A ∩ B) = P(A) · P(B|A) or P(B) · P(A|B)
If A and B are independent, then P(A ∩ B) = P(A) · P(B)
Complement Rule: P(Aᶜ) = 1 − P(A)

PROBABILITY DEFINITIONS
A and B are mutually exclusive if P(A ∩ B) = 0
A and B are independent if P(A|B) = P(A) or P(B|A) = P(B)

Probability Laws:
Law of Total Probability: P(B) = P(A) · P(B|A) + P(Aᶜ) · P(B|Aᶜ)
Bayes’ Law: P(A|B) = [P(A) · P(B|A)] / [P(A) · P(B|A) + P(Aᶜ) · P(B|Aᶜ)]
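A quick numeric check of the last two formulas (plain Python; the probabilities are made up):

# Hypothetical inputs
p_a = 0.3              # P(A)
p_b_given_a = 0.8      # P(B|A)
p_b_given_ac = 0.2     # P(B|A^c)

# Law of total probability
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_ac

# Bayes' law
p_a_given_b = p_a * p_b_given_a / p_b

print(p_b, p_a_given_b)   # 0.38 and roughly 0.632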


HYPOTHESIS TESTING

• State the hypothesis
• Identify the test statistic and its probability distribution
• Specify the significance level
• State the decision rule
• Collect data and perform calculations
• Make the statistical decision
• Make the economic or investment decision

POINTS TO REMEMBER FOR HYPOTHESIS TESTING

A P value is the probability of getting a statistic at least as extreme as the one
observed, if H0 is true.
The P value measures the strength of the evidence the data provide against H0.

“When the P Value is Low, H0 must Go!”

When the P value is not low, you cannot reject H0; consider the test
an inconclusive test.
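The decision rule, as a minimal sketch (plain Python; alpha and the p-value are illustrative):

alpha = 0.05       # chosen significance level
p_value = 0.012    # hypothetical result from a test

# "When the P Value is Low, H0 must Go!"
if p_value < alpha:
    print("Reject H0: the data provide strong evidence against it")
else:
    print("Fail to reject H0: the test is inconclusive")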


CORRELATION DOESN’T IMPLY CAUSATION

As we can clearly see in the first example, there is some level of causality at work:
buying books causes you to spend more money. So if long hours of workouts and
body mass are correlated, does that mean that working out for long hours will get
you a huge physique? I don't think so, and this is where things go awry.

There could also be a lurking variable Z influencing both X and Y. In this case Z,
perhaps a disciplined regimen followed to maintain body fat along with the course of
workouts, may be what actually drives the outcome rather than the workout hours
alone.
EPILOGUE

ANALYTICS NINJA AND THE STATISTICAL NIRVANA

That’s how the Analytics Ninja helped the Data Monger demystify his statistical
conundrum. He not only explained the tough-to-understand concepts in layman’s
language but also helped him understand the science behind those techniques.

