(C) Copyright Sunil Kappal - Data Dojo
ABOUT THE AUTHOR
Sunil Kappal works as an Advanced Analytics Consultant. He has more than 20 years
of experience in Data Analytics, Business Intelligence, Statistical Modeling, Predictive
Models and Six Sigma Methodologies.
Sunil has delivered multiple lectures at the University of Texas at Dallas on the use of
various advanced analytical techniques and machine learning best practices. He was
also invited as a guest speaker at the Symbiosis Institute of Operations Management
to talk on Big Data and Machine Learning.
Besides the above, he also runs his own blog on WordPress.
Statistics for Rookies is a book whose primary aim is to educate its readers about the
most common statistical methods, to help them make data-driven decisions "without
sweat". The book presents some common yet very effective statistical concepts in the
most engaging and captivating manner.
The secondary aim of this book is to help cement the fundamental statistical concepts
in a way that is easy to follow and jargon-free (this means each statistical term is
demystified using a simple definition). Example: a delta, usually represented by Δ, is
nothing but the difference, or change, between two quantities.
This book caters to both seasoned and rookie data analysts who want to understand
or refresh the foundational concepts for any data analytics endeavor. The uniqueness
of this book is in its simplicity: I have tried not to include too many statistical
notations, without compromising the sanctity and correctness of the various
statistical methods.
As mentioned above, this book's aim is to present statistical concepts in a fun way.
Therefore, each statistical topic is placed strategically and is introduced with a
cartoon strip that is very refreshing and makes the learning process fun and engaging.
In my observation, when people write on a topic as intricate and serious as statistics,
it tends to get overwhelming for readers, especially readers like me who have a short
attention span. Therefore, towards the end of each section there is a glossary of
terms defining each statistical concept, which will act as a ready reckoner.
Finally, I just want to say: do not try to memorize the concepts. Instead, try to
understand them, and try to create a use case in your head based on your day-to-day
tasks.
Sincerely,
Sunil Kappal
It is pretty clear from the above definition that statistics is not only about tabulation or
visual representation of data. It is the science of deriving insights from data that can
be numerical (quantitative) or categorical (qualitative) in nature. In a nutshell, this
science can be used to:
• Summarize and explore the data, to understand its spread, its central
tendency and its measures of association, using various descriptive
statistical methods
• Draw inferences, forecast, and generalize the patterns displayed by the
data in order to reach conclusions
Furthermore, statistics is the art and science of dealing with events and phenomena
that are not certain in nature. I can confidently say that nowadays statistics is used in
every field of science.
The goal of statistics is to gain understanding of data. Any data analysis should have
the following steps:
Figure: 1
Descriptive statistics provide summaries of the sample data and of the observations
that have been made about it. Such summaries can be presented in the form of
summary statistics (refer to the above visual for the summary statistics types by data
type) or easy-to-decipher graphs.
It is worth mentioning here that descriptive statistics mostly summarize the observed
values and may not be sufficient to make conclusive generalizations about the entire
population, or to infer or predict data patterns.
So far we have looked, visually, at the descriptive statistics that can be used to
explore data by data type (refer to Figure 1). This section of the book will help
readers appreciate the nuances involved in using those summary statistics.
The term "mean" or "average" is one of the many summary statistics which can be
used to describe the central tendency of the sample data. Computing this statistic is
pretty straightforward: sum all the values and divide by the number of values to get
the mean, or average, of the sample.
Example:
The mean or average is (1+1+2+3+4)/5 = 2.2
Figure: 1.1
Analytics Ninja Tip: The mean (average) is sensitive to extreme values, i.e. one
or two extreme values can change the mean considerably.
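To make the tip concrete, here is a minimal Python sketch (not from the book) that computes the mean of the example data and shows how a single made-up extreme value shifts it:

```python
# Mean of the sample from the example above
data = [1, 1, 2, 3, 4]
mean = sum(data) / len(data)   # (1+1+2+3+4)/5
print(mean)                    # 2.2

# One extreme value shifts the mean noticeably
with_outlier = data + [100]
print(sum(with_outlier) / len(with_outlier))   # (11+100)/6 = 18.5
```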
Median:
The median is the value separating the higher half of the sample data from the lower
half. The median can also be described as another way of finding the "average" of the
sample data: sort the number list from low to high and then find the middle value
within the list.
In the previous section when we looked at the mean or average for the same data set
it was 2.2. However, when we used the median statistic, it turns out the central
tendency for this data set is 2.
Let's look at this example where we have fourteen numbers and we don't have just
one middle number, we have a pair of middle numbers:
Number list = 3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29
Step 1: sort the number list low to high = 3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56
Step 2: with fourteen numbers there is no single middle number; instead there is a pair
of middle numbers, the 7th and 8th values: 21 and 23.
Step 3: the median is the average of the middle pair: (21 + 23)/2 = 22.
(Note that 22 is not in the number list, but that is OK, because half the numbers in the list are less than 22 and half
are greater.)
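The steps above can be sketched in Python; the helper function below is illustrative, not from the book:

```python
def median(values):
    """Median: the middle value of the sorted list, or the
    average of the middle pair when the count is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

numbers = [3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29]
print(median(numbers))   # 22.0 -- the average of the middle pair 21 and 23
```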
So, just to clear the air, I will take a step back and try to define these terms both in
simple English and in their statistical forms.
Definition: Variance can be defined as the average of the squared differences from the
mean.
To put some context around the above definition: in everyday usage, a variance is a
difference between an expected and an actual result, such as between a budget and
actual expenditure.
Figure: 1.2
I know the above formula can look pretty daunting. Therefore, I will list the steps to
calculate the variance in an easy-to-understand manner:
1. Work out the mean (refer to the mean and average section of the book)
2. Then, for each number: subtract the mean and square the result (the squared
difference)
3. Work out the average of those squared values
Still unclear about the math? Let's look at it visually to understand how to calculate
the variance using a dataset.
Note: The square root of the variance, σ, is called the Standard Deviation. Just like the
variance, the standard deviation is used to describe the spread. A statistic like the
standard deviation can be more meaningful because it is expressed in the same units
as the mean, whereas the variance is expressed in squared units.
Calculating Variance:
As an example, let's look at the two distributions below and walk through the
step-by-step approach to calculating the variance statistic:
Data Set 1 Data Set 2
3 1
4 2
4 4
5 5
6 7
8 11
Figure: 1.3
Figure: 1.4
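The three steps can be applied to the two data sets above in a short Python sketch (the function below is illustrative, computing the population variance, i.e. dividing by N as in the glossary):

```python
def variance(values):
    """Population variance: the average of the squared differences from the mean."""
    mean = sum(values) / len(values)                # step 1: the mean
    sq_diffs = [(x - mean) ** 2 for x in values]    # step 2: squared differences
    return sum(sq_diffs) / len(values)              # step 3: average them

data_set_1 = [3, 4, 4, 5, 6, 8]
data_set_2 = [1, 2, 4, 5, 7, 11]

print(variance(data_set_1))          # 16/6, roughly 2.67
print(variance(data_set_2))          # 11.0 -- the more spread-out set
print(variance(data_set_2) ** 0.5)   # standard deviation, roughly 3.32
```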
Example: (255 - 234)/234 = 9% (delta %); if we do not divide by the base number, we
get the pure delta: 255 - 234 = 21.
The standard deviation helps us to know how the values of a particular data set are
dispersed. A lower standard deviation indicates that the values are very close to their
average, whereas a higher value means the values are far from the mean. The
standard deviation can never be negative.
Figure: 1.5
Glossary:
• Distribution: A summary of the values that appear in a sample and the frequency,
or probability, of each.
• Mode: The most frequent value in a sample.
• Outlier: A value far from the central tendency.
• Σ = summation
• x = an individual value
• xi = each individual value
• x̄ = the mean (average)
• N = population size
• σ² = variance
• σ = standard deviation
• μ = mean of the population data
Qualitative Variables:
The observations that fall into a particular category, or class, of a qualitative variable
are recorded as a frequency (or count). A tabulation of such classes and their
frequencies is called a frequency distribution.
In addition to looking at the frequencies, we can also express each class as a
percentage. We find the percentage by dividing the frequency of the class by the total
number of observations and multiplying by 100.
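As an illustrative sketch (the survey responses below are made up), a frequency distribution and its percentages can be computed like this:

```python
from collections import Counter

# Hypothetical qualitative observations (e.g. survey responses)
observations = ["Yes", "No", "Yes", "Maybe", "Yes", "No", "Yes", "Yes"]

freq = Counter(observations)   # frequency distribution: class -> count
total = len(observations)

for category, count in freq.items():
    pct = count / total * 100  # frequency / total observations * 100
    print(f"{category}: frequency={count}, percentage={pct:.1f}%")
```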
Qualitative Variables
Strengths:
• Can give a nuanced understanding of the perspectives and needs of program participants
• Can help support or explain results indicated in quantitative analysis
• Source of detailed or "rich" information which can be used to identify patterns of behavior
Limitations:
• May lend itself to working with smaller populations, which may not be representative of larger demographics
• Data analysis can be time-consuming
• Analysis can be subjective; there is potential for evaluator bias in analysis/collection
Figure: 1.6
Quantitative Variables:
A quantitative variable is something that can be quantified: something that can be
counted or measured. Continuous quantitative data can take on infinitely many
values and is usually a measurement (temperature, pressure, humidity, length, time).
In layman's terms, we can say that qualitative variables vary in kind, like "beautiful"
or "not so beautiful", "understanding" or "not understanding", whereas quantitative
variables vary in amount, like height, weight, salary, etc.
Quantitative Variables
Strengths:
• Clear and specific
• Accurate and reliable if properly analyzed
Limitations:
• Data collection methods provide respondents with a limited number of response options
• Can require complex sampling procedures
Figure: 1.8
Analytics Ninja Tip: Quantitative variables are numerical information, the analysis of which
involves statistical techniques. Data type guides the analytical process.
Sampling Techniques:
Refer back to the definition of inferential statistics: generalizing the sample's
patterns and insights onto the overall population. To understand this definition and
cement the idea of inferential statistics in our minds, we need to understand its
basics.
Even before we start talking about those basics, it will be a good idea to understand
what inferential statistics can do for us.
Definition (Sample): A sample is the part of the population from which information is
collected (Weiss, 1999).
PART 1 INFERENTIAL STATISTICS
In statistics, we rely a lot on a sample to draw inferences about the entire
population. Inferential statistics provide a way to base our conclusions from sample
to the population by inferring the parameters of a population from data around the
statistics of the sample.
I know it is getting heavy, so let me just put it this way: inferential statistics gives us
a way to generalize, to the overall population, the patterns observed in the sample
data, based on the inferential analyses and tests performed on that sample.
This section can be considered the most important part of this book, where we will
develop the basic intuition for picking the most appropriate sample. It is also worth
mentioning that it is very important for a researcher to be able to work with samples
rather than with the entire population.
There is a variety of sampling techniques available. However, in keeping with the
theme of this book, Statistics for Rookies, I will discuss only two main sampling
techniques: Random Sampling and Stratified Sampling. For the brainiacs who adore
the intricacies of statistics, I have created a hierarchical view of various sampling
methods.
Figure: 1.10
Random Sampling
Definition: the sample is selected in a random fashion.
Pros: highly effective when all subjects participate in data collection.
Cons: a very small sample size introduces sampling error.

Stratified Sampling
Definition: represents specific subgroups, or strata.
Pros: accurate and effective representation of all subgroups; accurate estimates in
cases of similarity or dissimilarity.
Cons: the person applying this sampling method should have a proper understanding
of the subgroups, otherwise misrepresentation of the sample can cause analytical
fallacies.

Systematic Sampling
Definition: includes every Nth observation of the population in the study.
Pros: time- and cost-effective and efficient.
Cons: may result in high sampling bias if periodicity exists.

Cluster Sampling
Definition: clusters of observations representing the population are identified as the
sample population.
Pros: time- and cost-effective and efficient.
Cons: high sampling errors are observed in this method compared to other sampling
methods.
Figure: 1.10
Figure: 1.11
A survey of 2,000 people from the population of a particular state was conducted. In
this example, the sample is the 2,000 people surveyed in that state, and 2,000 is the
sample size.
Random Sample:
In a purely random sample, every unit of the population has an equal chance of being
selected, removing bias from the selection procedure. To conduct a random sample,
a population is first defined, as well as a target sample size. Units of the population
are then chosen at random.
Caveat:
Conducting a truly random sample may be challenging where the population is large,
dispersed, or hidden.
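A minimal sketch of a random sample, assuming a hypothetical population of 1,000 numbered units (the seed is fixed only to make the sketch reproducible):

```python
import random

# Hypothetical population of numbered units
population = list(range(1, 1001))   # units 1..1000

random.seed(42)                            # reproducibility for this sketch only
sample = random.sample(population, k=50)   # every unit equally likely to be chosen

print(len(sample))        # 50
print(len(set(sample)))   # 50 -- sampling without replacement, no duplicates
```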
Figure: 1.12
Stratified Sample:
Advantages:
• Precise sample
• Can be used for both proportions and stratification sampling
• Sample represents the desired strata
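A minimal sketch of proportional stratified sampling, assuming a hypothetical population with an 80/20 split between "urban" and "rural" strata (the labels and sizes are made up):

```python
import random

# Hypothetical population with a stratum label per unit
population = (
    [("urban", i) for i in range(80)] +   # 80% urban
    [("rural", i) for i in range(20)]     # 20% rural
)

def stratified_sample(units, fraction, seed=0):
    """Sample the same fraction from every stratum so each
    subgroup is represented proportionally."""
    rng = random.Random(seed)
    strata = {}
    for stratum, unit in units:
        strata.setdefault(stratum, []).append(unit)
    sample = {}
    for stratum, members in strata.items():
        k = round(len(members) * fraction)
        sample[stratum] = rng.sample(members, k)
    return sample

s = stratified_sample(population, fraction=0.10)
print({name: len(members) for name, members in s.items()})  # {'urban': 8, 'rural': 2}
```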
Figure: 1.13
Cluster Sampling:
Advantages:
• Efficient
• Researcher doesn’t need excessive details about the population members
• Very useful for educational research
Systematic Sampling:
The process of selecting individuals within the defined population from a list by
taking every Nth name.
Figure: 1.14
Advantages:
• Sample selection process is simple
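A minimal sketch of taking every Nth unit from an ordered list (the units are hypothetical; in practice the starting point is often chosen at random within the first interval):

```python
def systematic_sample(population, step):
    """Take every Nth unit from an ordered list (here N = step)."""
    return population[::step]

units = list(range(1, 21))           # an ordered list of 20 hypothetical units
print(systematic_sample(units, 5))   # [1, 6, 11, 16]
```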
Definition (Population): The larger group from which individuals are selected to
participate in a study is called the population.
Inferential Analysis:
As mentioned at the start of this section, "inferential statistics is a set of techniques
to draw conclusions about a population by testing the data taken from a sample of
that population". Similarly, inferential analysis uses statistical tests to identify a
pattern and its effects on the sample data.
The first step in the inferential analysis is to understand the data distribution. It is
the data distribution that will guide the type of test that can be deployed on the
sample data.
There are two types of distributions: normal and non-normal. A standard normal
distribution always has a mean of 0 and a standard deviation of 1, and its shape is
often called a bell curve. The graph below is an example of how a normal distribution
should look. When the data is normally distributed, we use parametric statistical tests.
Figure: 1.15
Non-Normal Distributions:
There are several ways that a distribution can be non-normal. A small sample size or
too many outliers within the data set are a few common reasons for distributions to
be non-normal. When the data set is non-normal, we use non-parametric statistical
tests.
Skew: a graph attribute where the data is not distributed per the famous bell curve
but is elongated towards the left or the right, where a left elongation denotes a
negative skew and a right elongation denotes a positive skew.
Skew Interpretation:
• Skewness <0 = Left-skewed distribution where most of the values are
concentrated on the right of the mean with extreme values on the left.
• Skewness >0 = Right skewed distribution where most of the values are
concentrated on the left of the mean with extreme values on the right.
Figure: 1.16
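As a sketch, the sign of the moment-based skewness (one common definition among several) matches the interpretation above; the two data sets below are made up for illustration:

```python
def skewness(values):
    """Moment-based skewness: the third central moment divided
    by the 1.5 power of the second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail on the right
left_skewed = [-20, 1, 2, 2, 3, 3, 3, 4]   # long tail on the left

print(skewness(right_skewed) > 0)   # True -- positive skew
print(skewness(left_skewed) < 0)    # True -- negative skew
```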
Kurtosis:
It is another measure of the shape of a frequency curve. The term comes from a
Greek word meaning bulginess.
While skewness signifies the extent of asymmetry, kurtosis identifies the degree of
peakedness of a distribution. Karl Pearson classified curves into three types on the
basis of the shape of their peaks:
• Mesokurtic
• Leptokurtic
• Platykurtic
Part 1 Conclusion:
We can conclude Part 1 of this book with a simple statement: statistics is a branch of
mathematics that transforms data into information for decision-makers. The process
can be further divided into two parts: descriptive and inferential statistics.
Descriptive statistics help us summarize and describe the data, while inferential
statistics help us draw conclusions and/or make decisions about a population in
question, based on sample data taken from that population.
Introduction to Probability
Definition: Probability is the chance(or likelihood) of an event happening. Whenever
we are unsure about the outcome of an event, we can talk about the probabilities of
certain outcomes using words from the probability scale.
Figure: 1.17
Now that we have looked at a few basic probability examples, we can comfortably say
that the probability of a particular outcome is the proportion of times that outcome
would occur in a long run of repeated observations.
Understanding Probability
We will start with the most confusing parts first and work our way up. There is a
general agreement that probability is a real value between 0 and 1 which has a
quantitative connotation to it compared to the qualitative notion of less or more likely
to happen.
Sample Space:
The set of all possible outcomes of the experiment is known as the sample space
corresponding to an experiment. The sample space is usually denoted by S, and a
generic element of the sample space (a possible outcome) is denoted by s. The sample
space is chosen so that exactly one outcome will occur. The size of the sample space is
finite, countably infinite or uncountably infinite.
It is worth mentioning here that some sample spaces are better than others.
Consider the experiment of flipping two coins. It is possible to get 0 heads, 1 head, or
2 heads, so the sample space could be {0, 1, 2}. Another way to write it is { HH, HT,
TH, TT }. The second way is better, because each of these outcomes is equally likely
to occur.
When writing the sample space, it is highly desirable to have events which are equally
likely.
Another example is rolling two dice. The possible sums are { 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12 }. However, these are not all equally likely. The only way to get a sum of 2 is to
roll a 1 on both dice, but you can get a sum of 4 by rolling 1-3, 2-2, or 3-1. The
following table illustrates a better sample space for the sum obtained when rolling
two dice.
            Second Die
First Die   1   2   3   4   5   6
    1       2   3   4   5   6   7
    2       3   4   5   6   7   8
    3       4   5   6   7   8   9
    4       5   6   7   8   9  10
    5       6   7   8   9  10  11
    6       7   8   9  10  11  12
Figure: 1.18
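The table can be reproduced by enumerating the equally likely sample space; a short sketch:

```python
from itertools import product
from collections import Counter

# Enumerate the equally likely sample space for two dice
sample_space = list(product(range(1, 7), repeat=2))   # 36 ordered pairs
sums = Counter(a + b for a, b in sample_space)

print(len(sample_space))           # 36
print(sums[2], sums[4], sums[7])   # 1 3 6 -- e.g. sum 4 comes from 1-3, 2-2, 3-1
```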
PART 2 PROBABILITY
Classical Probability:
The table on the last page lends itself to describing data another way -- using a
probability distribution. Let's consider the frequency distribution for that table.
If just the first and last columns were written, we would have a probability
distribution. The relative frequency of a frequency distribution is the probability of
the event occurring. This is only true, however, if the events are equally likely.
This gives us the formula for classical probability. The probability of an event
occurring is the number in the event divided by the number in the sample space.
Again, this is only true when the events are equally likely. A classical probability is
the relative frequency of each event in the sample space when each event is
equally likely.
Empirical Probability:
Empirical probability is based on observation. The empirical probability of an
event is the relative frequency of a frequency distribution based upon
observation.
P(E) = f/n, where f is the frequency of the event and n is the total number of observations
P(E') = 1 - P(E), where E' is the complement of E (the event E not occurring)
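A small sketch of the two formulas, using made-up observed outcomes:

```python
# Hypothetical observed outcomes of repeated trials
observed = ["head", "tail", "head", "head", "tail",
            "head", "tail", "head", "head", "tail"]

n = len(observed)
f = observed.count("head")   # frequency of the event

p_head = f / n               # empirical probability P(E) = f / n
p_not_head = 1 - p_head      # complement P(E') = 1 - P(E)
print(p_head, p_not_head)    # 0.6 0.4
```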
"OR" or Unions
If two events are disjoint, then the probability of them both occurring at the same
time is 0, and the probability of one or the other occurring is the sum of their
probabilities: P(A or B) = P(A) + P(B).
"AND" or Intersections
Independent Events
Two events are independent if the occurrence of one does not change the probability
of the other occurring.
An example would be rolling a 2 on a die and flipping a head on a coin. Rolling the 2
does not affect the probability of flipping the head.
If events are independent, then the probability of them both occurring is the product
of the probabilities of each occurring.
Dependent Events
If the occurrence of one event does affect the probability of the other occurring, then
the events are dependent.
Conditional Probability
The probability of event B occurring given that event A has already occurred is read
"the probability of B given A" and is written: P(B|A)
Independence Revisited
The following four statements are equivalent:
1. A and B are independent events
2. P(A and B) = P(A) · P(B)
3. P(B|A) = P(B)
4. P(A|B) = P(A)
The last two hold because, if two events are independent, the occurrence of one
doesn't change the probability of the occurrence of the other. This means that the
probability of B occurring, whether A has happened or not, is simply the probability
of B occurring.
Conditional Probability
Recall that the probability of an event occurring given that another event has already
occurred is called a conditional probability.
The probability that event B occurs, given that event A has already occurred, is:
P(B|A) = P(A and B) / P(A)
The above formula comes from the general multiplication principle (refer to page
no. 32 of this book).
Since we are given that event A has occurred, we have a reduced sample space:
instead of the entire sample space S, we now have a sample space of A, since we
know A has occurred. So the old rule still applies: the number in the event divided by
the number in the sample space. Here it is the number in "A and B" (which must be in
A, since A has occurred) divided by the number in A. If you then divide the numerator
and denominator of the right-hand side by the number in the sample space S, you get
the probability of "A and B" divided by the probability of A.
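The reduced-sample-space argument can be checked on the two-dice example; the events A and B below are chosen purely for illustration:

```python
from itertools import product
from fractions import Fraction

# Sample space for two dice (equally likely outcomes)
space = list(product(range(1, 7), repeat=2))

# A: the first die shows 1; B: the sum is 4
A = [s for s in space if s[0] == 1]
A_and_B = [s for s in space if s[0] == 1 and sum(s) == 4]

# P(B|A) = P(A and B) / P(A) -- the reduced sample space is A
p_b_given_a = Fraction(len(A_and_B), len(A))
print(p_b_given_a)   # 1/6 -- only (1, 3) gives sum 4 once the first die is 1
```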
Note: Refer to the cartoon strip at the start of the Probability section and see this rule
in action, where I have used the Bayes Theorem to identify the probability of a gender
committing a murder in the parking lot based on various conditions.
Glossary:
• Probability Experiment
A process which leads to well-defined results called outcomes
• Outcome
The result of a single trial of a probability experiment
• Sample Space
Set of all possible outcomes of a probability experiment
• Event
One or more outcomes of a probability experiment
• Classical Probability
Uses the sample space to determine the numerical probability that an
event will happen. Also called theoretical probability.
• Equally Likely Events
Events which have the same probability of occurring.
• Complement of an Event
All the events in the sample space except the given events.
• Empirical Probability
Uses a frequency distribution to determine the numerical probability. An
empirical probability is a relative frequency.
• Mutually Exclusive Events
Two events which cannot happen at the same time.
• Independent Events
Two events are independent if the occurrence of one does not affect the
probability of the other occurring.
• Dependent Events
Two events are dependent if the first event affects the outcome or
occurrence of the second event in such a way that the probability is changed.
• Conditional Probability
The probability of an event occurring given that another event has already
occurred.
• Bayes' Theorem
A formula which allows one to find the probability that an event occurred
as the result of a particular previous event.
SCENE - 1
A question mark (representing the
null hypothesis) is in the dock. An
attorney declares, "The accused here
is presumed innocent until proven
guilty beyond all reasonable doubt."
[We presume H₀ is true and must
prove otherwise.]
SCENE - 2
The attorney continues, "Therefore
your verdict is either that the accused
is 'guilty' [H₀ is rejected beyond all
reasonable doubt] or 'not guilty' [H₀
is not rejected]."
Ho = Null Hypothesis
Ha = Alternative Hypothesis
SCENE - 3
The judge then addresses the court:
"Ladies and gentlemen of the jury, notice
that we say 'not guilty' when the
evidence is consistent with H₀ [Presumed
innocent]. It does not imply that the
accused is actually innocent, merely that
we do not have enough evidence to be
assured of guilt. So H₀ is not 'accepted',
but instead is 'not rejected'. Ladies and
gentlemen, consider the evidence before
you."
PART 2 HYPOTHESIS TESTING
What is a hypothesis?
A hypothesis is a primary tool of research: an assumption made for the purposes of
research work, which is then tested. The main function of a hypothesis is to suggest
new observations or experiments. The word itself is pretty interesting: "hypothesis"
is a combination of two Greek words, "hypo" and "thesis", and it is worth looking at
two interpretations of this combination in detail.
First Concept:
1. "Hypo" means "under" and "thesis" refers to "a place"; therefore, a hypothesis is
anything under consideration.
Second Concept:
2. "Hypo" means "less than" and "thesis" means "a generally held view". Collectively,
this reads as "less than a generally held view", meaning "less" or "no" generalization
of facts.
A few formal definitions of the hypothesis:
1. Goode & Hatt: It is a proposition which is put to a test to determine its validity.
2. Lundberg: It is a tentative or systematic generalization, the validity of which
remains to be tested.
3. Kerlinger: It is a causal relationship between two or more variables.
As we have understood from the above definitions and from the word "hypothesis"
itself, most hypotheses can be divided into two sections: "if" and "then". Therefore, it
is all the more important for us to understand how to write these sections, and to
understand the dependent and independent variables within the statement.
Independent Variable:
The condition being studied. It is controlled by the experimenter. Example: Knowledge
Dependent Variable:
The condition affected by the independent variable. It cannot be controlled by the
experimenter. Example: Career Growth
Figure: 1.20
True to the saying "to err is human", people can make mistakes when they perform
hypothesis testing as part of a statistical analysis. They can make either a Type I error
or a Type II error. Therefore, it is very important to understand the difference
between these two types of errors. There is some level of risk of making each type of
error in every analysis, and the amount of risk is under the experimenter's control.
It will be helpful to view these errors in a table which can be seen in almost all the
statistical textbooks:
Figure: 2
We commit a Type 1 error if we reject the null hypothesis when it is true, and we
commit a Type 2 error if we fail to reject the null hypothesis when it is not true.
Note: These errors relate to the statistical concepts of risk, significance, and power.
Answering the bigger question, which type of error is worse?
Well, not to disappoint, but there is no clear answer to the above question. In some
instances a Type 1 error carries a lot more risk than a Type 2 error, and vice versa.
However, based on several experts' opinions and suggestions, using a table like the
one below can help in weighing the consequences of Type 1 and Type 2 errors.
Null hypothesis (Ho): Medicine A does not relieve Condition B.

Type 1 Error (Ho true, rejected): Medicine A does not relieve Condition B, but it is not
eliminated as a treatment option.
Result: Patients with Condition B who receive Medicine A get no relief. They may
experience side effects, or even a worse condition, up to fatality. Possible litigation.

Type 2 Error (Ho false, not rejected): Medicine A relieves Condition B, but it is
eliminated as a treatment option.
Result: A viable treatment remains unavailable to patients. Profit potential is lost.
Figure: 2.1
Keep in mind that before testing a statistical hypothesis it is important to clearly state
the nature of the claim to be tested. Since we assume the null hypothesis is true, we
control for Type I error by stating a level of significance. The level we set, called the
alpha level (symbolized as α), is the largest probability of committing a Type I error
that we will allow and still decide to reject the null hypothesis. This criterion is
usually set at .05 (α = .05), and we compare the alpha level to the p-value. When the
probability of a Type I error is less than 5% (p < .05), we decide to reject the null
hypothesis; otherwise, we retain the null hypothesis.
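The decision rule can be stated in a couple of lines (a sketch; the p-values passed in are hypothetical):

```python
def decide(p_value, alpha=0.05):
    """Reject H0 when the p-value falls below the chosen alpha level."""
    return "reject H0" if p_value < alpha else "retain H0"

print(decide(0.03))   # reject H0
print(decide(0.20))   # retain H0
```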
Figure: 2.2
Figure: 2.3
It often gets confusing, even terrifying, trying to solve the puzzle of which correlation
technique should be deployed given the properties of the X and Y variables. In this
section of Statistics for Rookies we will discuss two types of correlation statistics:
1. Pearson Correlation Coefficient
2. Spearman Rank Correlation
Correlation is a bivariate technique that measures the strength of the relationship
between two variables. The value of the correlation varies between +1 and -1, where
+1 denotes a highly positive relationship between the two variables and -1 indicates
the inverse. As the correlation coefficient approaches 0, the relationship between the
two variables gets weaker.
The correlation statistic gives us an idea of the degree and direction of the
relationship between two variables. It deals with the association between two or
more variables.
If two or more variables vary in such a way that movement in one is accompanied by
movement in the other, these variables are said to be correlated. One should always
remember that causation implies correlation, but correlation does not necessarily
imply causation.
Figure: 2.4
Types of Correlation I
Positive Correlation: when the values of two variables change in the same direction,
the correlation is considered positive. Example:
• As X increases, Y increases
• As X decreases, Y decreases
• E.g., as height increases, so does weight
Negative Correlation: when the values of two variables change in opposite directions,
the correlation is considered negative. Example:
• As X increases, Y decreases
• As X decreases, Y increases
Types of Correlation II
Linear Correlation: Correlation is said to be linear when the amount of change in
one variable tends to bear a constant ratio to the amount of change in the other.
Figure: 2.5
Figure: 2.6
Assumptions
The Pearson correlation technique assumes that both variables are normally
distributed. It also assumes linearity and homoscedasticity between the variables:
linearity assumes a straight-line relationship between the variables, and
homoscedasticity assumes that the data are normally distributed about the
regression line.
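A from-scratch sketch of the Pearson coefficient (the paired height/weight data is made up and perfectly linear, so r comes out at +1):

```python
def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product
    of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

heights = [150, 160, 170, 180, 190]   # hypothetical paired data,
weights = [50, 58, 66, 74, 82]        # perfectly linear in height
print(pearson_r(heights, weights))    # approximately 1.0
```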
The Spearman rank correlation is computed as:
ρ = 1 - (6 Σdi²) / (n(n² - 1))
Where:
ρ = Spearman rank correlation
di = the difference between the ranks of corresponding values Xi and Yi
n = the number of values in each data set
Assumptions
The Spearman rank correlation test doesn't make any distributional assumptions
about the data. Its assumptions are that the data must be at least ordinal and that
the scores on one variable must be monotonically related to the other variable.
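A sketch of the rank-based formula above (the tie-handling helper is illustrative; the shortcut formula itself assumes no tied ranks):

```python
def ranks(values):
    """Rank from 1 (smallest); ties get the average of their ranks."""
    sorted_vals = sorted(values)
    return [sum(i + 1 for i, v in enumerate(sorted_vals) if v == x)
            / sorted_vals.count(x) for x in values]

def spearman_rho(x, y):
    """Spearman rank correlation: rho = 1 - 6*sum(di^2) / (n(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotonic but non-linear relationship still gets rho = 1
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y))   # 1.0
```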
What is Regression?
Dependent variable: the variable we wish to explain (also called the endogenous
variable)
Independent variable: the variable used to explain it (also called the exogenous
variable)
Regression analysis serves three purposes:
1. It provides estimates of the values of the dependent variable from the values of
the independent variables.
2. It helps to obtain a measure of the error involved in using the regression line as
the basis of estimates.
3. It also helps us understand the degree of association, or correlation, that exists
between the two variables.
Regression Line
The regression line is the line which gives the best estimate of one variable from the
value of the other variable. It gives the average relationship between the two
variables in mathematical form.
For two variables X and Y, there are always two lines of regression. The regression
line of X on Y gives the best estimate of the value of X for any specific value of Y.
𝑌 = 𝛽0 + 𝛽1𝑋+𝑈
Y = Dependent Variable
X = Independent Variables
𝛽0 = Intercept Parameter
𝛽1 = Slope Parameter
U = Error term that captures the amount of variation not predicted by the slope and
intercept terms.
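The parameters 𝛽0 and 𝛽1 can be estimated from data by ordinary least squares. A minimal sketch; the four data points are invented and lie exactly on y = 1 + 2x, so the fit recovers that line:

```python
def fit_simple_regression(x, y):
    """OLS estimates: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0 -- the intercept and slope of y = 1 + 2x
```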
1. Simple regression analysis is a statistical tool that gives us the ability to estimate
the mathematical relationship between a dependent variable (usually called y)
and an independent variable (usually called x).
2. The dependent variable is the variable for which we want to make a prediction.
3. While various non-linear forms may be used, simple linear regression models are
the most common.
4. The goal of the analyst who studies the data is to find a functional
relationship between the response variable y and the predictor variable x.
5. The primary goal of the quantitative analysis is to use current information
about a phenomenon to predict its future behavior. Current information is
usually in the form of a set of data. In a simple case, when the data form a set
of pairs of numbers, we may interpret them as representing the observed
values of an independent (or predictor) variable X and a dependent (or
response) variable Y.
1. Correlation and regression analysis are related in the sense that both deal with
relationships among variables. The correlation coefficient is a measure of linear
association between two variables. Values of the correlation coefficient are
always between -1 and +1.
2. Neither correlation nor simple linear regression establishes a cause-and-effect
relationship.
3. Because it does not capture cause and effect, even simple linear regression is
a probabilistic prediction model, not a deterministic one.
The statement that the relation between X and Y is statistical should be interpreted
as providing the following guidelines:
1. Regard Y as a random variable
2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation E(Y) = f(x)
the regression function.
Multiple linear regression (MLR) is also one of the most common forms of linear
regression analysis. As a predictive analytics tool, the MLR method is used to explain
the relationship between one continuous dependent variable and two or more
independent variables. The independent variables can be continuous or categorical.
Assumptions
• Regression residuals must be normally distributed
• A linear relationship is assumed between the dependent variable (Y) and the
independent variables (X)
• Absence of multicollinearity is assumed in the model, meaning that the
independent variables are not too highly correlated
Note: Refer to the Analytics Ninja cartoon strip at the start of this section (pages
47-48) to understand how to deal with multicollinearity in a visual way.
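One quick, informal way to screen for multicollinearity is to inspect pairwise correlations between the independent variables. A sketch, with invented data; the 0.8 cutoff is a common rule of thumb, not a value from the text:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

# Two predictors that are near-copies of each other -- a multicollinearity red flag
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]  # roughly 2 * x1 (made-up values)
r = pearson_r(x1, x2)
if abs(r) > 0.8:
    print(f"warning: predictors highly correlated (r = {r:.4f}); drop or combine one")
```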
Statistical techniques like regression and Analysis of Variance (ANOVA) are
useful when the response variable (Y) is continuous. However, if Y, also known as the
Key Performance Output Variable (KPOV), is discrete, then these methods end up being
redundant or futile.
If the response variable is binary (discrete) and the input variable(s) is/are continuous,
then we can use the binary logistic regression (BLR) method. Binary logistic regression
helps us understand how various factors affect the probability of an event.
To gain in-depth knowledge of binary logistic regression, it is a good idea to
break the equation down and understand it bit by bit:
ln(P / (1 − P)) = β0 + β1x1 + ⋯ + βnxn
• P = Probability
• β1, β2, …, βn = the coefficients, which we want to test for statistical
significance and, where significant, estimate
• x1, x2, …, xn = the factors or independent variables having some effect (significant
or not) on the probability
Binary logistic regression also has a concept of “odds” (O), which can be understood
through the example of winning a bet. If the probability of winning a bet is 0.75, the
odds in favor of winning the bet are O = 0.75/(1 − 0.75) = 3; this means you are three
times as likely to win the bet as to lose it. Those who are familiar with betting will
be in a better position to understand the workings of odds than those who are novices
and only know this logic from the equation.
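The 0.75 betting example translates directly into code, together with the logistic (sigmoid) function that turns the linear combination of coefficients into a probability. A sketch, not any particular library's API:

```python
import math

def odds(p):
    """Odds in favor of an event: O = p / (1 - p)."""
    return p / (1 - p)

def prob_from_logit(z):
    """Logistic (sigmoid) function: p = 1 / (1 + e^(-z)), z = b0 + b1*x1 + ..."""
    return 1 / (1 + math.exp(-z))

print(odds(0.75))          # 3.0 -- three times as likely to win as to lose
print(prob_from_logit(0))  # 0.5 -- a logit of zero means a 50/50 event
```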
Go “full throttle”, or “full model”: this means ensuring that the model includes all the
potentially significant factors present in the data.
“Reduce one variable at a time”, then rerun the regression using the reduced model. This
will ensure that the model is reduced to only those variables that are vital and free of
multicollinearity.
“The log-likelihood statistic” is similar to the residual sum of squares in multiple
regression and indicates how much unexplained information remains after the
model is fitted. Large values indicate a poorly fitted statistical model.
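A toy computation makes the log-likelihood statistic concrete: the closer the model's predicted probabilities sit to the observed 0/1 outcomes, the closer the log-likelihood is to zero, and the smaller −2LL, the better the fit. The numbers below are invented:

```python
import math

def log_likelihood(y, p):
    """Binary log-likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p) over observations."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

observed = [1, 0, 1]
good_fit = log_likelihood(observed, [0.9, 0.1, 0.8])  # predictions near the outcomes
poor_fit = log_likelihood(observed, [0.2, 0.9, 0.1])  # predictions far from them
print(-2 * good_fit < -2 * poor_fit)  # True: the better fit leaves less unexplained
```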
Apart from the above statistics, most applications provide exhaustive additional
information that can help assess the performance of the model. I will discuss
this regression technique in much more detail in my upcoming article.
SAMPLE
Sample Set – the list of the population from which a sample is picked (also called
the sampling frame)
Biased – often called “cherry picking”, where the sampling is tailored to particular
characteristics rather than chosen to fairly represent the population.
Sample Types:
Non-Probability Sample: Choose what you think represents the population.
• Convenience Sample: easily accessed sample
Probability Sample: Elements selected based on probability
• Simple Random Sample
• Systematic Sample
Stratified Sampling: Divide the population into subgroups (strata) and sample each in
proportion to its contribution to the overall population.
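The probability-sampling schemes above can be sketched with Python's standard random module. The population of 100 labelled units and the 60/40 strata are invented for illustration:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
population = list(range(1, 101))  # a toy population of 100 labelled units

# Simple random sample: every unit has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sample: a random start, then every k-th unit
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sample: split into subgroups, sample each in proportion to its size
strata = {"A": population[:60], "B": population[60:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, len(group) // 10)]

print(len(simple), len(systematic), len(stratified))  # 10 10 10
```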
PROBABILITY
Probability Rules:
Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
If A and B are mutually exclusive, then P(A ∪ B) = P(A) + P(B)
Multiplication Rule: P(A ∩ B) = P(A) · P(B|A) or P(B) · P(A|B)
If A and B are independent, then P(A ∩ B) = P(A) · P(B)
Complement Rule: P(Aᶜ) = 1 − P(A)
PROBABILITY DEFINITIONS
A and B are mutually exclusive if P(A ∩ B) = 0
A and B are independent if P(A|B) = P(A) or P(B|A) = P(B)
Probability Laws:
Law of Total Probability: P(B) = P(A) · P(B|A) + P(Aᶜ) · P(B|Aᶜ)
Bayes’ Law: P(A|B) = P(A) · P(B|A) / [P(A) · P(B|A) + P(Aᶜ) · P(B|Aᶜ)]
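Bayes' Law is easiest to trust after plugging in numbers. A sketch with invented figures for a screening test: 1% prevalence, a 95% detection rate, and a 5% false-positive rate:

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A^c)P(B|A^c)]."""
    numerator = p_a * p_b_given_a
    return numerator / (numerator + (1 - p_a) * p_b_given_not_a)

# Even after a positive test, the probability of the condition is only about 16%,
# because the condition is rare to begin with (the base-rate effect).
print(round(bayes(0.01, 0.95, 0.05), 3))  # 0.161
```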
HYPOTHESIS TESTING
When the P value is not low, you cannot reject H0; consider the test
inconclusive.
As we can clearly see in the first example, there is some causality at work: buying books
causes you to spend more money. So, if long hours of workout and body mass are
correlated, does that mean that working out for long hours will make you huge? I don’t
think so, and this is where things go awry.
REFERENCES
https://quickkt.com/tutorials/artificial-intelligence/machine-learning/logistic-regression-theory/ – Logistic Regression
http://flowchart.ghkates.com/statistics-flowchart/
https://www.mathsisfun.com/median.html
https://cyfar.org/qualitative-or-quantitative-data
https://research-methodology.net
https://lc.gcumedia.com/hlt362v/the-visual-learner
http://study.com/academy/lesson/stratified-random-samples-definition-characteristics-examples.html
https://www.psychtutor.com
https://chrismadden.co.uk
https://people.richland.edu/james/lecture
https://researchskills.epigeum.com