
STATISTICS AND

PROBABILITY
CONCEPTS
Dr. Firoz Anwar
CONTENTS
 Introduction
 Basics/Descriptive Statistics
 Scales of measurement
 Graphical exploration of data
 Descriptive characteristics for a variable

 Estimation
 Characteristics of an estimator
 Confidence interval

 Statistical hypothesis testing


 Statistical testing principle
 Testing errors
 Power analysis

 Why multivariate analysis?

Source: https://www.edureka.co/blog/what-is-data-science/
EDA
 What is EDA?
 Why EDA?
 Identify patterns and develop hypotheses.
 Test technical assumptions.
 Inform model selection and feature engineering.
 Build an intuition for the data.

Source: https://www.dataquest.io/blog/what-is-data-science/
EDA STEPS
 Form hypotheses/develop investigation themes to explore
 Wrangle data
 Assess data quality and profile
 Explore each individual variable in the dataset
 Assess the relationship between each variable and the target
 Assess interactions between variables
 Explore data across many dimensions

Source: https://www.dataquest.io/blog/what-is-data-science/
DATA TYPES

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
NOMINAL DATA
 Can be analysed using the grouping method.
 For each category, the frequency or percentage can be calculated.
 Can be represented visually, e.g. with a pie chart.
 Cannot be manipulated with arithmetic operators.
 Can be analysed using advanced statistical methods, e.g. hypothesis testing.
 Hypothesis testing can be carried out using nonparametric tests such as the chi-squared test.
 The chi-squared test aims to determine whether there is a significant difference between the
expected frequency and the observed frequency of the given values.

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
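The chi-squared test described above can be sketched in a few lines with scipy; the category counts below are invented for illustration, and H0 is an even split across the three categories.

```python
# Hypothetical example: do preference counts for three categories
# deviate from an even split? (The counts are made up.)
import numpy as np
from scipy.stats import chisquare

observed = np.array([48, 35, 17])          # observed frequencies per category
expected = np.full(3, observed.sum() / 3)  # equal frequencies under H0

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the observed frequencies
# differ significantly from the expected ones.
```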
ORDINAL DATA
 Use visualization tools.
 The most commonly used chart for representing such data is the bar chart.
 Can also be analysed using advanced statistical analysis tools such as hypothesis testing.
 Standard parametric methods such as t-test or ANOVA cannot be applied to such types of data.
 The hypothesis testing of the data can be carried out only using nonparametric tests such as
the Mann-Whitney U test or Wilcoxon Matched-Pairs test.

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
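The Mann-Whitney U test mentioned above can be sketched as follows; the satisfaction ratings (an ordinal 1-5 scale) for the two independent groups are invented.

```python
# Hypothetical example: comparing ordinal ratings (1-5) from two
# independent groups with the nonparametric Mann-Whitney U test.
from scipy.stats import mannwhitneyu

group_a = [3, 4, 2, 5, 4, 3, 4, 5]
group_b = [2, 1, 3, 2, 3, 1, 2, 2]

# H0: the two groups come from the same distribution.
stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```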
SUMMARY STATISTICS
 Mean, median, mode
 Minimum, maximum, range
 Population, sample
 Standard deviation, Standard Error, variance
SUMMARY STATISTICS
 Outlier
 Coefficient of variation
SUMMARY STATISTICS
 Percentiles, inter quartile range
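The summary statistics listed above can be computed with the standard library alone; the sample values here are invented, with one obvious outlier included.

```python
# Minimal sketch of the summary statistics above, using only the
# standard library; the data are invented (95 is a likely outlier).
import statistics as st

data = [12, 15, 11, 18, 15, 22, 15, 19, 14, 95]

mean   = st.mean(data)
median = st.median(data)
mode   = st.mode(data)
stdev  = st.stdev(data)                # sample standard deviation (n - 1)
cv     = stdev / mean                  # coefficient of variation
q1, q2, q3 = st.quantiles(data, n=4)   # quartiles
iqr    = q3 - q1                       # interquartile range

print(mean, median, mode, round(cv, 2), iqr)
```

Note how the outlier pulls the mean (23.6) far above the median (15), and inflates the standard deviation, while the median and IQR are barely affected.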
DISTRIBUTION
 Uniform distribution
 Normal/Gaussian distribution
 Exponential distribution/Power Law
 Binomial distribution
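Samples from the four distributions above can be drawn with the standard library `random` module, as a quick sketch (the parameters are arbitrary):

```python
# Drawing samples from the distributions listed above, using only
# the standard library; parameter choices are illustrative.
import random

random.seed(42)
n = 10_000

uniform  = [random.uniform(0, 1) for _ in range(n)]           # uniform on [0, 1)
normal   = [random.gauss(0, 1) for _ in range(n)]             # Gaussian(0, 1)
expon    = [random.expovariate(1.0) for _ in range(n)]        # exponential, rate 1
binomial = [sum(random.random() < 0.3 for _ in range(10))     # Binomial(10, 0.3)
            for _ in range(n)]

print(sum(uniform) / n)   # ~0.5
print(sum(normal) / n)    # ~0.0
print(sum(expon) / n)     # ~1.0
print(sum(binomial) / n)  # ~3.0
```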
HYPOTHESIS TEST
 What is Hypothesis Testing?

 What are we trying to achieve?

 Why do we need to perform Hypothesis Testing?

 “There are two possible outcomes: if the result confirms the hypothesis, then
you’ve made a measurement. If the result is contrary to the hypothesis, then
you’ve made a discovery” — Enrico Fermi
HYPOTHESIS TEST
 Null Hypothesis (H₀):
 A statement in which no difference or effect is expected. If the null hypothesis is
not rejected, no changes will be made.

 Alternate Hypothesis (H₁):
 A statement that some difference or effect is expected. Accepting the alternative
hypothesis will lead to changes in opinions or actions. It is the opposite of the null
hypothesis.

HYPOTHESIS TEST
 Sample data must provide sufficient evidence to reject the null hypothesis and conclude that
the effect exists in the population.
 Ideally, a hypothesis test fails to reject the null hypothesis when the effect is not present in the
population, and it rejects the null hypothesis when the effect exists.
HYPOTHESIS TEST
 Type-I Error:
 Denoted by alpha (α).
 A Type-I error occurs when the sample results lead to the rejection of the null hypothesis when it is in
fact true.
 Equivalent to a false positive.
 Type-I errors can be controlled: the chosen value of alpha, i.e. the level of significance, has a
direct bearing on the Type-I error rate.

 Type-II Error:
 A Type-II error occurs when, based on the sample results, the null hypothesis is not rejected when it is in
fact false. Type-II errors are equivalent to false negatives.
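The meaning of alpha can be checked by simulation: when H0 is true, a test at α = 0.05 should reject it in roughly 5% of repetitions. A sketch, with both samples deliberately drawn from the same population:

```python
# Simulating the Type-I error rate: both samples come from the same
# normal population, so H0 is true and every rejection is a false positive.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n_trials = 0.05, 2000

rejections = 0
for _ in range(n_trials):
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)   # same population as a
    _, p = ttest_ind(a, b)
    if p < alpha:
        rejections += 1                        # a false positive

rate = rejections / n_trials
print(f"empirical Type-I error rate: {rate:.3f}")  # close to alpha
```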
HYPOTHESIS TEST
 https://www.kaggle.com/code/hamelg/python-for-data-24-hypothesis-testing/notebook
NORMALITY TEST
 Assumptions
 Observations in each sample are independent and identically distributed (iid).

 Interpretation
 H0: the sample has a Gaussian distribution.
 H1: the sample does not have a Gaussian distribution.

 Shapiro-Wilk Test
 D’Agostino’s K^2 Test
 Anderson-Darling Test
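All three normality tests are available in scipy.stats; a sketch on a sample that is in fact drawn from a normal distribution, so H0 should usually not be rejected:

```python
# The three normality tests above, applied to a seeded normal sample.
import numpy as np
from scipy.stats import shapiro, normaltest, anderson

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=5, size=200)

w_stat, w_p = shapiro(sample)        # Shapiro-Wilk
k2_stat, k2_p = normaltest(sample)   # D'Agostino's K^2
ad = anderson(sample, dist="norm")   # Anderson-Darling

print(f"Shapiro-Wilk p = {w_p:.3f}")
print(f"D'Agostino K^2 p = {k2_p:.3f}")
# Anderson-Darling compares its statistic against critical values
# (index 2 corresponds to the 5% significance level).
print(f"A-D stat = {ad.statistic:.3f}, 5% critical value = {ad.critical_values[2]:.3f}")
```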
CORRELATION TEST
 Pearson’s Correlation Coefficient
 Spearman’s Rank Correlation
 Kendall’s Rank Correlation
 Chi-Squared Test
PEARSON TEST
 Assumptions
 Each observation should have a pair of values.

 Each variable should be continuous.

 There should be no outliers.

 It assumes linearity and homoscedasticity.


SPEARMAN’S RANK
CORRELATION TEST
 Assumptions
 Pairs of observations are independent.

 Two variables should be measured on an ordinal, interval or ratio scale.

 It assumes that there is a monotonic relationship between the two variables.


CORRELATION TEST
 Assumptions
 Observations in each sample are independent and identically distributed. (All)
 Observations in each sample are normally distributed. (Pearson’s)
 Observations in each sample have the same variance. (Pearson’s)
 Observations in each sample can be ranked. (Spearman, Kendall)
 Observations used in the calculation of the contingency table are independent. (Chi-square)
 25 or more examples in each cell of the contingency table. (Chi-square)

 Interpretation
 H0: the two samples are independent.
 H1: there is a dependency between the samples.
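The three rank-based and linear coefficients can be compared on a monotonic but non-linear relationship; the data below are synthetic (y = x³ plus noise), chosen to make the contrast visible.

```python
# Comparing Pearson, Spearman and Kendall on a monotonic, non-linear
# relationship; the data are synthetic.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = x ** 3 + rng.normal(scale=0.5, size=100)

r, r_p = pearsonr(x, y)          # linear association
rho, rho_p = spearmanr(x, y)     # monotonic association (ranks)
tau, tau_p = kendalltau(x, y)    # concordant vs discordant pairs

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
# Spearman/Kendall capture the monotonic cubic trend fully, while
# Pearson measures only its linear component.
```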
PEARSON VS SPEARMAN AND
KENDALL CORRELATION
 Non-parametric correlations are less powerful because they use less information in their
calculations: Pearson's correlation uses information about the mean and the deviations from
the mean, while non-parametric correlations use only the ordinal information of the paired
scores.

 With non-parametric correlation, the X and Y values can be continuous or ordinal, and
approximately normal distributions for X and Y are not required. Pearson's correlation, by
contrast, assumes that X and Y are both continuous and normally distributed.

 Correlation coefficients only measure linear (Pearson) or monotonic (Spearman and Kendall)
relationships.
PEARSON VS SPEARMAN AND
KENDALL CORRELATION
 When the data are near-normal, Kendall correlation is more robust and efficient than Spearman
correlation, so Kendall correlation is preferred for small samples or data with outliers.

 Kendall correlation has O(n²) computational complexity, compared with O(n log n) for
Spearman correlation, where n is the sample size.

 Spearman’s rho usually is larger than Kendall’s tau.

 The interpretation of Kendall's tau in terms of the probabilities of observing agreeable
(concordant) and non-agreeable (discordant) pairs is very direct.
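That direct interpretation can be made concrete: tau equals the proportion of concordant pairs minus the proportion of discordant pairs, computed here by brute force on a tiny invented sample and checked against scipy.

```python
# Kendall's tau by hand: (concordant - discordant) / total pairs,
# compared against scipy's kendalltau on the same (invented) data.
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1      # pair ordered the same way in x and y
    elif s < 0:
        discordant += 1      # pair ordered oppositely

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_scipy, _ = kendalltau(x, y)
print(tau_manual, tau_scipy)  # both 0.6
```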
Thank You
