
STATISTICS

FOR
DATA SCIENCE
Classification
Helps in segmenting say customers into
similar groups based on their
characteristics

Statistics - Methods

Association
Which items go together?

TOTAL DATA SCIENCE


Is there any association or relationship between these items?
Particularly used in market basket analysis to form association rules, e.g. customers who bought item A also bought item B, or how likely is it that customers buying item A will also buy item B?
Pattern Recognition
Consider various pattern recognition tools such as the box plot, histogram, and scatter plot.
Using a box plot to identify outliers is an important pattern recognition technique.

Predictive Modeling
Statistical techniques such as logistic regression help us to predict the behaviour of our customers, gain better insights, and take calculated decisions. For instance, a bank can predict whether a customer will default on a loan or not.
Data Types

Different
types of
data
QUALITATIVE DATA (ATTRIBUTE)

Qualitative data (commonly called attribute data) contain values that express a quality or a state. They are mostly non-numeric and can't be measured. E.g. gender, color, feelings, etc.

Qualitative
Vs
Quantitative
QUANTITATIVE DATA

They are numerical in nature and can be measured.


E.g. weight, distance, income, age, etc.
Nominal Variable
Nominal variables cannot be ordered into hierarchies, e.g. we cannot say that black is greater than red.

Ordinal variable
It has all the properties of nominal data and additionally can be ordered and placed on a scale. However, the difference between each value is not really known. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say.

Qualitative data

E.g. of nominal variable: color, gender
E.g. of ordinal variable: unsatisfied, satisfied, very satisfied
Interval
Interval scales are numeric scales in which we know both the
order and the exact differences between the values. E.g
measure of temperature, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the difference
between 80 and 70 degrees. “Interval” itself means “space in
between,” which is the important thing to remember–interval
scales not only tell us about order, but also about the value
between each item.

Ratio
It is similar to interval data. However, for data to be considered ratio data it must have a true zero, meaning that zero represents the complete absence of the quantity, so negative values are not possible. An example of ratio data is measurements of height, be that in centimetres, metres, inches or feet. It is not possible to have a negative height. Comparing this to temperature makes the difference between interval and ratio data (which may be a little confusing at first!) easy to see: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall.
Quantitative data can also be discrete or continuous.

Discrete type
Can take only certain values. Example: the number of students enrolled in the Total Data Science program. This can be 1000, 2000, 3000, etc. students, but definitely not 50.5 or 3000.75 students.

Continuous type
Can take any value within a specific interval. Example: we can say he weighs 50.7 kg or the temperature at this time is 11.12° Celsius.
Data Attributes
A data attribute is a characteristic of data that sets it apart from other data, such as location, length, or type.

From a data science point of view, attributes are the features of a data set. For example, suppose you have a sales dataset consisting of sales id, customer name, product purchased, quantity, unit price, etc. Customer name, product purchased, quantity, etc. are the attributes of each sales id.
Data Sources

Primary Data
Data that you collect yourself or that is collected by the organisation conducting the research.

Secondary Data
Data that you get from third-party sources such as government departments, Gartner, McKinsey, etc.
Types of Statistics

Descriptive Statistics
Uses the data to provide descriptions of the population, either through numerical calculations or graphs.
Data summarisation, graphs/charts, tables.

Inferential Statistics
Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population.
Drawing conclusions about the population based on the inferences from the sample.
Common Statistical Terms

POPULATION
In statistics, a population is the entire pool from which a statistical sample is drawn. A population may refer to an entire group of people, objects, events, hospital visits, or measurements.

SAMPLE
A sample is a smaller, manageable version of a larger group. Samples are used in statistical testing when population sizes are too large.

PARAMETER
In simple words, a parameter is any numerical quantity that characterises a given population or some aspect of it. This means the parameter tells us something about the whole population. Example: the average number of pages a visitor visits on a website.

STATISTIC
The numerical value associated with an observed sample. Example: the average number of visitors on a website per minute.
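The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population below is made-up data (random page counts for 10,000 hypothetical visitors), and the sample size of 200 is an arbitrary choice.

```python
import random
import statistics

# Hypothetical population: pages viewed by each of 10,000 website visitors.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.randint(1, 10) for _ in range(10_000)]

# Parameter: a number computed from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same number computed from a sample of that population.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values are usually close but not identical, which is why inferential statistics is needed to reason about the gap.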
Frequency
In simple terms, frequency tells you how often something happened, that is, how frequently an observation occurs.
For example, in the following list of numbers, the frequency of the number 9 is 5 (because it occurs 5 times): 2, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 9, 6, 1.

Frequency Distribution
A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
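As a small sketch, the frequency of each number in the list above can be counted with `collections.Counter` from the Python standard library. The tabular printout is just one possible way to show a frequency distribution.

```python
from collections import Counter

numbers = [2, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 9, 6, 1]

# Count how often each value occurs.
frequency = Counter(numbers)

print("frequency of 9:", frequency[9])

# A simple tabular frequency distribution, sorted by value.
for value in sorted(frequency):
    print(value, frequency[value])
```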
Interval Scale
Cumulative Frequency Distribution

It shows how many observations are above or below the lower boundaries of the classes.
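A minimal sketch of a cumulative frequency table, assuming a "less than or equal to" convention in which each distinct value is treated as its own class. The data are the numbers from the frequency example; real class intervals would be wider.

```python
from collections import Counter
from itertools import accumulate

numbers = [2, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 9, 6, 1]

frequency = Counter(numbers)
values = sorted(frequency)
counts = [frequency[v] for v in values]

# Running total: how many observations are at or below each value.
cumulative = list(accumulate(counts))

for value, count, running in zip(values, counts, cumulative):
    print(value, count, running)
```

The last cumulative value always equals the total number of observations.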
Central Tendency
Also known as "Measures of Location" or "Statistical Averages".

When you measure items of the same kind, you will find that most of the items cluster around a central value or the middle item. This value is called a measure of "Central Tendency".

Mean, Median, Mode


Arithmetic Mean

NB: The mean is affected by extreme values.

In Python (for example with pandas), we use .mean() to calculate the mean.

UNGROUPED DATA    GROUPED DATA
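The slide's formulas did not survive extraction, so the following is a sketch using the standard textbook definitions: for ungrouped data the mean is the sum of the values divided by their count, and for grouped data it is the sum of (frequency × class midpoint) divided by the total frequency. The grouped midpoints and frequencies below are hypothetical.

```python
import statistics

# Ungrouped data: mean = sum of values / number of values.
ungrouped = [4, 10, 7, 15, 2]
ungrouped_mean = sum(ungrouped) / len(ungrouped)

# The standard library gives the same result.
library_mean = statistics.mean(ungrouped)

# Grouped data (hypothetical classes): mean = sum(f * x) / sum(f),
# where x is the class midpoint and f is the class frequency.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)

print(ungrouped_mean, library_mean, grouped_mean)
```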
Median
The middle-most observation.
50% of the data are above this value and 50% are below this value.
Note that the data must be arranged in order of magnitude (ascending/descending order).
The median is a better representation of the data when extreme values are present, since it is not affected by extreme values.

For example, take these house prices: 25, 27, 27, 29, and 250.
The mean here is 71.6, which is affected by the extreme value 250.
The median is 27, which is a better representation of the data.

NB: The median is not affected by extreme values.
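The house price example can be checked with the standard library `statistics` module. This is a small sketch; the variable names are illustrative only.

```python
import statistics

house_prices = [25, 27, 27, 29, 250]

# The mean is pulled upward by the extreme value 250.
mean_price = statistics.mean(house_prices)

# The median is the middle value of the sorted data.
median_price = statistics.median(house_prices)

print("mean:", mean_price)
print("median:", median_price)
```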

List of numbers: 4, 10, 7, 15, 2. Calculate the median.

Solution: Let us arrange the numbers in ascending order.

In ascending order, the numbers are: 2,4,7,10,15

There are a total of 5 numbers. For an odd number of values n, the median is the ((n+1)/2)th value. Thus, the median is the ((5+1)/2)th value.

Median = 3rd value.

The 3rd value in list 2, 4, 7, 10, 15 is 7.


Mode
The item with the maximum frequency of occurrence, that is, the item that occurs most often.

NB: The mode is not affected by extreme values.

It is an important measure for identifying the item that consumers purchase the most.
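A short sketch of finding the most purchased item with `statistics.mode`. The purchase list is hypothetical; `statistics.multimode` is shown for the case where several values tie for the highest frequency.

```python
import statistics

# Hypothetical list of items bought by customers.
purchases = ["bread", "milk", "bread", "eggs", "milk", "bread"]

# The mode is the item that occurs most often.
most_purchased = statistics.mode(purchases)

# When several values tie, multimode returns all of them.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print(most_purchased, tied_modes)
```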
Measures of Dispersion
It measures the extent of the dispersion or spread of the distribution around the central tendency.

Measures of dispersion:
Standard Deviation
Coefficient of Variation
Range
Inter-Quartile Range (IQR)
The Five Number Summary
The Empirical Rule
Chebyshev Rule
Variance

Definition
The variance (σ²) is a measure of how far each value in the data set is from the mean. Here is how it is defined:
Subtract the mean from each value in the data. This gives you a measure of the distance of each value from the mean.
Square each of these distances (so that they are all positive values), and add all of the squares together.
Divide the sum of the squares by the number of values in the data set.

Note
Because of this squaring, the variance is no longer in the same unit of measurement as the original data. Taking the square root of the variance gives the standard deviation, which is back in the original unit of measure and is therefore much easier to interpret.
Standard Deviation
The standard deviation (σ) is simply the (positive) square root of the variance.

A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.
How far is each number from the mean?
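The steps for the variance and standard deviation can be sketched as follows. The data set is hypothetical, and the calculation divides by the number of values (the population form described above), so it is compared against `statistics.pvariance` and `statistics.pstdev` rather than the sample versions.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

# Step 1: subtract the mean from each value.
mean_value = sum(data) / len(data)
deviations = [x - mean_value for x in data]

# Step 2: square each distance and add the squares together.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by the number of values (population variance).
variance = sum_of_squares / len(data)

# Standard deviation: the positive square root of the variance.
std_dev = math.sqrt(variance)

print(variance, std_dev)
print(statistics.pvariance(data), statistics.pstdev(data))
```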
Coefficient of Variation
The ratio of the standard deviation to the mean.

It is a commonly used measure of the risk associated with investing in the stock market.
It is also known as relative dispersion.
The lower the value of the coefficient of variation, the more precise the estimate.
Coefficient of Variation
Population: CV = σ / μ
Sample: CV = s / x̄
Example
The weights of the Baltimore Bullets professional football team have a mean of 224 pounds with a standard deviation of 18 pounds, while the mean weight and standard deviation of their Monday opponent, the Chicago Tralblazers, are 195 and 12 pounds, respectively. Which team exhibits the greater relative dispersion in weights?
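One way to work the example is to compute the coefficient of variation (standard deviation divided by mean) for each team and compare. The helper function name below is illustrative, not from the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative dispersion: standard deviation divided by the mean."""
    return std_dev / mean

bullets_cv = coefficient_of_variation(18, 224)
opponent_cv = coefficient_of_variation(12, 195)

print(f"Bullets:  {bullets_cv:.2%}")
print(f"Opponent: {opponent_cv:.2%}")

# The team with the larger ratio has the greater relative dispersion.
greater = "Bullets" if bullets_cv > opponent_cv else "Opponent"
print("Greater relative dispersion:", greater)
```

Under this calculation the Baltimore Bullets show the greater relative dispersion (about 8% versus about 6%).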
Boxplot
The Five Number Summary
A box plot gives a visual presentation of the median and spread of a set of data.
The box plot shows the range, the inter-quartile range, and the median.
It can be used to show the distribution of the data and, in particular, to spot outliers easily.
Boxplot
Quartiles
Example
In Python, we use quantile().
We use the IQR to detect outliers in our data.
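A minimal sketch of the five-number summary and IQR-based outlier detection using only the standard library. Two things here are assumptions rather than something the slides state: the common 1.5 × IQR fence convention, and the "inclusive" quartile method of `statistics.quantiles` (other methods give slightly different quartiles). The data are the house prices used earlier.

```python
import statistics

data = sorted([25, 27, 27, 29, 250])

# Quartiles Q1, Q2 (median), Q3 using the inclusive method.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (data[0], q1, q2, q3, data[-1])

# Inter-quartile range and the assumed 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary)
print("IQR:", iqr, "outliers:", outliers)
```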
Self Reading

The Empirical Rule
• Approximately 68% of the data in a bell-shaped distribution lies within one standard deviation of the mean, or µ ± σ
• Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or µ ± 2σ
• Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or µ ± 3σ
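The percentages in the empirical rule can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```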
Correlation
A correlation describes the degree of relationship between two variables.
Some correlations you can measure:
Your caloric intake and your weight.
Your eye color and your relatives’ eye colors.
The amount of time you study and your GPA.
Correlation Coefficient
In terms of the strength of the relationship, the value of the correlation coefficient varies between +1 and -1. A value of ±1 indicates a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a + sign indicates a positive relationship and a - sign indicates a negative relationship.
Scatter Plot
for
correlation
Correlation and Causation
The fact that one variable is correlated or associated with another does not mean that one is causing the other.

If your caloric intake and your weight are correlated, it doesn't necessarily mean that caloric intake is causing your weight. It could be, but your weight might also be caused by other factors such as genetics, lack of exercise, or excessive exercise.
Types of Correlation
There are mainly four types of correlation:
Pearson correlation (widely used),
Kendall rank correlation,
Spearman correlation,
Point-Biserial correlation.
Pearson correlation
Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two.

rxy = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for ith observation)
yi = value of y (for ith observation)
In Python: corr()
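The formula itself is missing from the extracted slide, so the sketch below assumes the usual computational form of Pearson's r written with n, xi and yi: r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The function name and the study-hours data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    # Note: the denominator is zero if either variable is constant.
    return numerator / denominator

# Hypothetical data: hours studied and GPA for five students.
hours = [2, 4, 6, 8, 10]
gpa = [2.0, 2.5, 3.1, 3.4, 3.9]
r_study = pearson_r(hours, gpa)
print(round(r_study, 3))
```

A value close to +1 here would indicate a strong positive linear relationship between the two made-up variables.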
Pearson correlation
Types of research questions a Pearson correlation can examine:

Is there a statistically significant relationship between age, as measured in years, and height, measured in inches?

Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales, measured by income?

Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in dollars?
Correlation measures the relationship between variables, and regression builds on this relationship to make future predictions.
Probability
& Conditional
Probability
Why Probability
Most events are difficult to predict precisely, so what we can do is find the likelihood that an event will or will not occur. This is called probability.

Probability Example
How likely is it to rain today?
What is the chance of you getting all A's in your exams?
What is the chance of the patient's survival?
Concepts

EVENT
An outcome of an experiment. Example: when tossing a coin to get a head, the outcome (head) is our event.

EXPERIMENT
The process undertaken to obtain our outcome. Example: the process of tossing the coin is our experiment.

SAMPLE SPACE
The set of all possible outcomes of an experiment. Example: for a single coin toss, the sample space is {head, tail}. For several tosses, it is the set of all possible sequences, such as head on the first trial, tail on the second trial, tail again on the third trial, head on the fourth trial, etc.
Probability is between 0 and 1
Probability
expression
Mutually Exclusive Events

Two events are mutually exclusive or disjoint if they cannot both occur at the same time.
A clear example is the set of outcomes of a single coin toss, which can result in either heads (H) or tails (T), but not both.
Example
A die is rolled. Let us define event E1 as the set of possible outcomes where the number on the face of the die is even and event E2 as the set of possible outcomes where the number on the face of the die is odd. Are events E1 and E2 mutually exclusive?
E1 = {2,4,6}
E2 = {1,3,5}
E1 and E2 have no elements in common and therefore are
mutually exclusive.
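The same check can be sketched with Python sets: two events are mutually exclusive when their outcome sets share no elements.

```python
# Sample space for one roll of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# E1: even outcomes, E2: odd outcomes.
e1 = {n for n in sample_space if n % 2 == 0}
e2 = sample_space - e1

# Mutually exclusive events have an empty intersection.
mutually_exclusive = e1.isdisjoint(e2)

print(sorted(e1), sorted(e2), mutually_exclusive)
```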
Independent Events
Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring.

P(A and B) = P(A) · P(B)

Example
A jar contains 3 red, 5 green, 2 blue and 6 yellow marbles. A marble is chosen at random from the jar. After replacing it, a second marble is chosen. What is the probability of choosing a green and then a yellow marble?
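A worked sketch of the jar example using exact fractions. Because the first marble is replaced, the two draws are independent and the probabilities simply multiply.

```python
from fractions import Fraction

jar = {"red": 3, "green": 5, "blue": 2, "yellow": 6}
total = sum(jar.values())  # 16 marbles

# With replacement, the jar is the same for both draws.
p_green = Fraction(jar["green"], total)
p_yellow = Fraction(jar["yellow"], total)

# P(green and then yellow) = P(green) * P(yellow)
p_green_then_yellow = p_green * p_yellow

print(p_green_then_yellow, float(p_green_then_yellow))
```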
Rules for Computing Probability
Multiplication Rule (independent events)
If two events A and B are independent, then this rule says that the probability of the simultaneous occurrence of A and B is given by the product of the probabilities of A and B.

Example
Suppose that we roll a six-sided die and then flip a coin. These two events are independent. The probability of rolling a 1 is 1/6. The probability of a head is 1/2. The probability of rolling a 1 and getting a head is 1/6 × 1/2 = 1/12.
Rules for Computing Probability
Multiplication Rule (dependent events)
The probability that Events A and B both occur is equal to the probability that Event A occurs times the probability that Event B occurs, given that A has occurred.

P(A ∩ B) = P(A) · P(B|A)

Conditional Probability
Example
A basket contains 6 red marbles and 4 black marbles. Two marbles are drawn without replacement from the basket. What is the probability that both of the marbles are black?
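A worked sketch of the basket example. Because the first marble is not replaced, the probability for the second draw is conditional on the first.

```python
from fractions import Fraction

red, black = 6, 4
total = red + black

# P(A): the first marble is black.
p_first_black = Fraction(black, total)

# P(B|A): the second marble is black, given one black marble is already gone.
p_second_black_given_first = Fraction(black - 1, total - 1)

# P(A and B) = P(A) * P(B|A)
p_both_black = p_first_black * p_second_black_given_first

print(p_both_black, float(p_both_black))
```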
Rules for Computing Probability
Addition Rule (mutually exclusive events)
The probability that Event A or Event B occurs is equal to the probability that Event A occurs plus the probability that Event B occurs.

P(A ∪ B) = P(A) + P(B)

Example
A student goes to the library. The probability that she checks out
(a) a work of fiction is 0.40,
(b) a work of non-fiction is 0.30.
The student can only check out one kind of book.

What is the probability that the student checks out a work of fiction or non-fiction?
Rules for Computing Probability
Addition Rule (non-mutually exclusive events)
The probability that Event A or Event B occurs is equal to the probability that Event A occurs plus the probability that Event B occurs minus the probability that both Events A and B occur.

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Example
A student goes to the library. The probability that she checks out
(a) a work of fiction is 0.40,
(b) a work of non-fiction is 0.30, and
(c) both fiction and non-fiction is 0.20.
The student can check out both kinds of books.
What is the probability that the student checks out a work of fiction, non-fiction, or both?
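Both library examples can be worked in a few lines. Exact fractions are used here only to avoid floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

p_fiction = Fraction(40, 100)
p_nonfiction = Fraction(30, 100)

# Mutually exclusive case: the student can take only one kind of book,
# so P(A or B) = P(A) + P(B).
p_either_exclusive = p_fiction + p_nonfiction

# Non-mutually exclusive case: P(both) = 0.20 is counted twice in the sum,
# so it is subtracted once: P(A or B) = P(A) + P(B) - P(A and B).
p_both = Fraction(20, 100)
p_either_overlapping = p_fiction + p_nonfiction - p_both

print(float(p_either_exclusive), float(p_either_overlapping))
```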
Bayes’
Theorem

Bayes’ Theorem is used to revise previously calculated probabilities based on new information.

It is an extension of conditional probability.
Bayes’ Theorem
Uses of Bayes' Theorem
Classifying emails as spam or not spam given the words in the email.
Interpreting medical results.
Building Naive Bayes algorithms.
Financial risk analysis.
Practice Example: Naive Bayes
The Art Competition has entries from three painters: Pam, Pia and Pablo.

Pam put in 15 paintings; 4% of her works have won First Prize.
Pia put in 5 paintings; 6% of her works have won First Prize.
Pablo put in 10 paintings; 3% of his works have won First Prize.

What is the chance that Pam will win First Prize?

In Python: sklearn.naive_bayes
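One possible reading of the practice example, sketched with Bayes' Theorem. The interpretation is an assumption: each painter's share of the 30 entries is treated as the prior, the stated win percentage as the likelihood, and the question as asking for the probability that a First Prize painting is Pam's.

```python
from fractions import Fraction

paintings = {"Pam": 15, "Pia": 5, "Pablo": 10}
win_rate = {
    "Pam": Fraction(4, 100),
    "Pia": Fraction(6, 100),
    "Pablo": Fraction(3, 100),
}

total = sum(paintings.values())

# Prior: probability that a randomly chosen entry belongs to each painter.
prior = {name: Fraction(count, total) for name, count in paintings.items()}

# Evidence: overall probability that a randomly chosen entry wins First Prize.
evidence = sum(prior[name] * win_rate[name] for name in prior)

# Posterior: P(painter | First Prize) = P(win | painter) * P(painter) / P(win).
posterior = {name: prior[name] * win_rate[name] / evidence for name in prior}

for name, probability in posterior.items():
    print(name, probability)
```

Under these assumptions Pam's chance comes out at 1/2, with Pia and Pablo at 1/4 each.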
Self Reading

Chebyshev
Rule
Normal
Distribution
Skewed Distribution
Let's say these diagrams represent the salaries of workers of TDS in $1000s. In the left diagram, most of the workers earn between 80,000 and 120,000, with as many as 25 people earning 120,000. Few earn below 80,000. This shows unfairness in the distribution of the salary. In the same way, the diagram on the right shows the majority earning between 20,000 and 30,000, with few earning 40,000 and above. However, in the middle diagram, the salary is evenly distributed, with as many people earning high salaries as those earning low salaries.
In Python: displot/kde to see the distribution of a particular variable.
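Central Limit Theorem
The sampling distribution of the mean of independent random variables (with finite variance) will be approximately normal when the sample size is large enough, whatever the shape of the original distribution.
Figures: the result of rolling the simulated die 100 times, and a plot of the resulting distribution of sample means.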
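The die simulation can be sketched without plotting. The sample size (30 rolls) and number of samples (1000) are arbitrary choices for illustration, not values from the slides. A single die roll is uniformly distributed, yet the sample means cluster around 3.5 with a much smaller spread.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def roll_die():
    """Simulate one roll of a fair six-sided die."""
    return rng.randint(1, 6)

sample_size = 30     # rolls per sample (assumed)
num_samples = 1000   # number of samples (assumed)

# Mean of each sample of die rolls.
sample_means = [
    statistics.mean([roll_die() for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(f"mean of sample means:   {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f}")
```

Plotting a histogram of `sample_means` would show the bell shape described by the theorem.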
Testing Hypothesis

Hypothesis
An assumption regarding a population parameter.

Hypothesis testing
The use of statistics to decide whether the sample data provide enough evidence against a given hypothesis.

Example
The 600ml Coke bottle contains 600ml of Coke.
But is it really true that the 600ml Coke bottle contains 600ml of Coke?
Hypothesis Testing Formulation

Null Hypothesis
The status quo.
The 600ml Coke bottle contains 600ml of Coke.

Alternate Hypothesis
Challenges the status quo.
The 600ml Coke bottle does not contain 600ml of Coke.
Null Hypothesis : Ho: μ = 600ml

Alternative Hypothesis : Ha: μ ≠ 600ml


One-tailed vs two-tailed test

Two-tailed test: Ho: µ = 600ml, Ha: µ ≠ 600ml
One-tailed test: Ho: µ <= 600ml, Ha: µ > 600ml
Take Note
We either reject the null hypothesis or fail to reject the null hypothesis. We do not accept the null hypothesis.

We assume the null hypothesis to be true from the beginning; later, the assumption is either rejected or we fail to reject it.

When we reject the null hypothesis, we can conclude that the alternative hypothesis is supported.

If we fail to reject the null hypothesis, it does not mean that we have proven the null hypothesis is true. It means we do not have enough evidence to reject it.
Type 1 and Type 2 errors

Type 1 error: rejecting the null hypothesis while it is true.
Type 2 error: failing to reject the null hypothesis while it is false.

The same concept applies when dealing with a confusion matrix.
Hypothesis Testing Process
Determine the null hypothesis and the
alternative hypothesis.
Collect and summarize the data into a
test statistic.
Use the test statistic to determine the
p-value.
The result is statistically significant if the
p-value is less than or equal to the
level of significance.
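The process above can be sketched for the Coke bottle example with a two-tailed one-sample z-test. Everything numeric here is an assumption for illustration: the ten fill volumes are made up, the population standard deviation is assumed to be known (2 ml), and the level of significance is set to 0.05.

```python
import math
from statistics import NormalDist, mean

# Step 1: Ho: mu = 600 ml, Ha: mu != 600 ml.
mu_0 = 600.0

# Hypothetical fill volumes (ml) measured from 10 bottles.
sample = [598, 601, 597, 599, 600, 596, 598, 599, 597, 600]
sigma = 2.0    # ASSUMED known population standard deviation
alpha = 0.05   # chosen level of significance

# Step 2: summarise the data into a test statistic (z).
z = (mean(sample) - mu_0) / (sigma / math.sqrt(len(sample)))

# Step 3: two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: the result is significant if p-value <= level of significance.
reject_null = p_value <= alpha

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject Ho: {reject_null}")
```

When the population standard deviation is not known, a t-test is the usual choice instead of a z-test.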
P-Value
A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis.
The p-value measures the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

P-values are expressed as decimals, although it may be easier to understand what they are if you convert them to a percentage. For example, a p-value of 0.0254 is 2.54%. This means that, if the null hypothesis were true, there would be a 2.54% chance of getting results at least as extreme as yours by chance alone. That’s pretty small. On the other hand, a large p-value of 0.9 (90%) means that results like yours would be very common under the null hypothesis, so they give little reason to doubt it.

Therefore, the smaller the p-value, the more statistically significant your results. That's why you have to set a significance level in order to have something to compare your results with.
P-Value
Alpha levels are controlled by the researcher and are related to confidence levels. You get an alpha level by subtracting your confidence level from 100%. For example, if you want to be 98 percent confident in your research, the alpha level would be 2% (100% – 98%). When you run the hypothesis test, the test will give you a value for p. Compare that value to your chosen alpha level. For example, let’s say you chose an alpha level of 5% (0.05). If the results from the test give you:

A small p (≤ 0.05), reject the null hypothesis. This is strong evidence that the null hypothesis is invalid.
A large p (> 0.05), the evidence against the null hypothesis is weak, so you do not reject the null.

NB: If P is small, Null must Go


Confidence Level
It shows how much risk you are willing to accept. With a 95% (0.95) confidence level, you accept that about 5% (0.05) of the time your results might be wrong.

Confidence levels are expressed as a percentage (for example, a 95% confidence level). It means that, should you repeat an experiment or survey over and over again, about 95 percent of the intervals you compute would contain the true population value.
Thank You
