Lecture 2
PROBABILITY CONCEPTS
Dr. Firoz Anwar
CONTENTS
Introduction
Basics/Descriptive Statistics
Scales of measurement
Graphical exploration of data
Descriptive characteristics for a variable
Estimation
Characteristics of an estimator
Confidence interval
Source: https://www.edureka.co/blog/what-is-data-science/
EDA
What is EDA?
Why do EDA?
Identify patterns and develop hypotheses.
Test technical assumptions.
Inform model selection and feature engineering.
Build an intuition for the data.
Source: https://www.dataquest.io/blog/what-is-data-science/
EDA STEPS
Form hypotheses/develop investigation themes to explore
Wrangle data
Assess data quality and profile
Explore each individual variable in the dataset
Assess the relationship between each variable and the target
Assess interactions between variables
Explore data across many dimensions
Source: https://www.dataquest.io/blog/what-is-data-science/
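A few of the EDA steps above (assessing data quality, profiling each variable, and relating each variable to the target) can be sketched with pandas. This is a minimal illustration only: the toy dataset, its column names, and the choice of `exam_score` as the target are assumptions made for the example, not part of the lecture.

```python
import pandas as pd

# Hypothetical toy dataset: column names and values are illustrative only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "hours_slept": [8, 7, None, 6, 5, 6],
    "exam_score": [52, 58, 65, 70, 78, 85],  # treated as the target here
})

# Assess data quality: count missing values in each column.
missing_per_column = df.isna().sum()

# Explore each individual variable: count, mean, std, min, quartiles, max.
profile = df.describe()

# Assess the relationship between each variable and the target
# using pairwise Pearson correlation.
target_correlations = df.corr()["exam_score"].drop("exam_score")

print(missing_per_column)
print(profile)
print(target_correlations)
```

In a real project the same three calls would be run on the wrangled dataset, and the results would feed back into the hypotheses formed in the first step.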
DATA TYPES
https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
NOMINAL DATA
Can be analysed using the grouping method.
For each category, the frequency or percentage can be calculated.
Can be represented visually, such as by using a pie chart.
Cannot be treated using mathematical operators.
Can be analysed using advanced statistical methods such as hypothesis testing.
Hypothesis testing can be carried out using nonparametric tests such as the chi-squared test.
The chi-squared test aims to determine whether there is a significant difference between the
expected frequency and the observed frequency of the given values.
https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
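The comparison of observed and expected frequencies described above can be sketched as a chi-squared goodness-of-fit test with `scipy.stats.chisquare`. The category counts below are made-up values for illustration, and the expected counts assume that all three categories are equally likely.

```python
from scipy.stats import chisquare

# Hypothetical counts for three nominal categories (e.g. three colours).
observed = [18, 22, 20]
# Expected counts under H0: all categories are equally likely.
expected = [20, 20, 20]

# The test statistic is the sum of (observed - expected)^2 / expected.
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"chi-squared statistic = {statistic:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: observed frequencies differ from expected.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

The 0.05 threshold is the conventional significance level, chosen here as an assumption; any other level could be substituted.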
ORDINAL DATA
Use visualization tools.
The most commonly used chart for representing such data is the bar chart.
Can also be analysed using advanced statistical analysis tools such as hypothesis testing.
Standard parametric methods such as t-test or ANOVA cannot be applied to such types of data.
The hypothesis testing of the data can be carried out only using nonparametric tests such as
the Mann-Whitney U test or Wilcoxon Matched-Pairs test.
https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
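As a rough sketch of a nonparametric test on ordinal data, the Mann-Whitney U test mentioned above is available as `scipy.stats.mannwhitneyu`. The two groups of 1-to-5 ratings below are invented for illustration.

```python
from scipy.stats import mannwhitneyu

# Hypothetical satisfaction ratings on a 1-5 ordinal scale for two
# independent groups.
group_a = [1, 2, 2, 3, 3, 3, 4, 2, 1, 2]
group_b = [3, 4, 4, 5, 5, 4, 3, 5, 4, 4]

# H0: the two groups come from the same distribution.
# The test relies only on the rank ordering of the ratings.
u_statistic, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"U = {u_statistic}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the groups differ.")
else:
    print("Fail to reject H0.")
```

For paired measurements (the same subjects rated twice), `scipy.stats.wilcoxon` would be the corresponding matched-pairs test.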
SUMMARY STATISTICS
Mean, median, mode
Minimum, maximum, range
Population, sample
Standard deviation, Standard Error, variance
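The quantities listed above can be computed with NumPy and the standard library, as in the following sketch. The data values are arbitrary, and the sample is assumed to be drawn from a larger population, which is why the sample variance and standard deviation use `ddof=1`.

```python
import statistics

import numpy as np

# Arbitrary illustrative sample.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
median = np.median(data)
mode = statistics.mode(data.tolist())

minimum, maximum = data.min(), data.max()
value_range = maximum - minimum

# Population formulas divide by n; sample formulas divide by n - 1.
population_std = data.std(ddof=0)
sample_variance = data.var(ddof=1)
sample_std = data.std(ddof=1)

# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error = sample_std / np.sqrt(len(data))

print(f"mean={mean}, median={median}, mode={mode}")
print(f"min={minimum}, max={maximum}, range={value_range}")
print(f"sample variance={sample_variance:.3f}, sample std={sample_std:.3f}")
print(f"population std={population_std:.3f}, standard error={standard_error:.3f}")
```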
Outlier
Coefficient of variation
Percentiles, interquartile range
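A short sketch of the coefficient of variation, percentiles, and interquartile range follows. The slides do not say how outliers are identified; the 1.5 × IQR fence used below is a common convention (Tukey's rule) and is an assumption of this example, as are the data values.

```python
import numpy as np

# Arbitrary illustrative sample with one unusually large value.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 30])

# Coefficient of variation: standard deviation relative to the mean.
coefficient_of_variation = data.std(ddof=1) / data.mean()

# 25th, 50th and 75th percentiles (NumPy's default linear interpolation).
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed outlier rule: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"CV = {coefficient_of_variation:.3f}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"outliers: {outliers.tolist()}")
```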
DISTRIBUTION
Uniform distribution
Normal/Gaussian distribution
Exponential distribution, power-law distribution
Binomial distribution
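To build intuition for these distributions, one option is to draw samples with NumPy's random generator and compare the sample means with the theoretical means. The parameter values below are arbitrary choices for illustration, and the power-law case is left out of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000

samples = {
    # Uniform on [0, 1): theoretical mean 0.5.
    "uniform": rng.uniform(low=0.0, high=1.0, size=n),
    # Normal with mean 0 and standard deviation 1: theoretical mean 0.
    "normal": rng.normal(loc=0.0, scale=1.0, size=n),
    # Exponential with scale 2 (rate 0.5): theoretical mean 2.
    "exponential": rng.exponential(scale=2.0, size=n),
    # Binomial with 10 trials and success probability 0.3: mean 10 * 0.3 = 3.
    "binomial": rng.binomial(n=10, p=0.3, size=n),
}

sample_means = {name: values.mean() for name, values in samples.items()}

for name, value in sample_means.items():
    print(f"{name:12s} sample mean = {value:.3f}")
```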
HYPOTHESIS TEST
What is Hypothesis Testing?
“There are two possible outcomes: if the result confirms the hypothesis, then
you’ve made a measurement. If the result is contrary to the hypothesis, then
you’ve made a discovery” — Enrico Fermi
Null Hypothesis (H₀):
A statement in which no difference or effect is expected. If the null hypothesis is
not rejected, no changes will be made.
Sample data must provide sufficient evidence to reject the null hypothesis and conclude that
the effect exists in the population.
Ideally, a hypothesis test fails to reject the null hypothesis when the effect is not present in the
population, and it rejects the null hypothesis when the effect exists.
Type-I Error:
Denoted by alpha (α).
A Type-I error occurs when the sample results lead to the rejection of the null hypothesis when it is in
fact true.
Equivalent to a false positive.
Type-I errors can be controlled: alpha is the significance level we select, so it directly sets the
probability of a Type-I error.
Type-II Error:
A Type-II error occurs when, based on the sample results, the null hypothesis is not rejected when it is in
fact false. Type-II errors are equivalent to false negatives.
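The link between alpha and the Type-I error rate can be checked with a small simulation. In this sketch every sample is drawn from a population where the null hypothesis is true, so each rejection is a false positive. The one-sample t-test, the sample size, and the number of repetitions are choices made for the example rather than anything prescribed by the lecture.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=0)

alpha = 0.05          # chosen significance level
n_experiments = 2000  # number of repeated experiments
sample_size = 30

false_positives = 0
for _ in range(n_experiments):
    # H0 is true by construction: the population mean really is 0.
    sample = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    result = ttest_1samp(sample, popmean=0.0)
    if result.pvalue < alpha:
        # Rejecting a true null hypothesis is a Type-I error.
        false_positives += 1

type_i_error_rate = false_positives / n_experiments
print(f"Observed Type-I error rate: {type_i_error_rate:.3f} (alpha = {alpha})")
```

The observed rate should land close to alpha. Drawing the samples from a population with a non-zero mean instead would let the same loop estimate the Type-II error rate, by counting the experiments that fail to reject.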
https://www.kaggle.com/code/hamelg/python-for-data-24-hypothesis-testing/notebook
NORMALITY TEST
Assumptions
Observations in each sample are independent and identically distributed (iid).
Interpretation
H0: the sample has a Gaussian distribution.
H1: the sample does not have a Gaussian distribution.
Shapiro-Wilk Test
D’Agostino’s K^2 Test
Anderson-Darling Test
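Two of the tests listed above can be run with SciPy as sketched below: `scipy.stats.shapiro` for Shapiro-Wilk and `scipy.stats.normaltest` for D'Agostino's K^2. The Anderson-Darling test is available as `scipy.stats.anderson` and is left out of this sketch to keep it short. The two samples are synthetic, generated only for illustration.

```python
import numpy as np
from scipy.stats import normaltest, shapiro

rng = np.random.default_rng(seed=1)

# One sample that is Gaussian by construction and one that is clearly skewed.
synthetic_samples = {
    "gaussian": rng.normal(loc=10.0, scale=2.0, size=500),
    "skewed": rng.exponential(scale=2.0, size=500),
}

alpha = 0.05
p_values = {}
for name, sample in synthetic_samples.items():
    _, p_shapiro = shapiro(sample)       # Shapiro-Wilk test
    _, p_dagostino = normaltest(sample)  # D'Agostino's K^2 test
    p_values[name] = (p_shapiro, p_dagostino)
    for test_name, p in [("Shapiro-Wilk", p_shapiro), ("D'Agostino K^2", p_dagostino)]:
        verdict = "reject H0 (not Gaussian)" if p < alpha else "fail to reject H0"
        print(f"{name:9s} {test_name:15s} p = {p:.4g} -> {verdict}")
```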
CORRELATION TEST
Pearson’s Correlation Coefficient
Spearman’s Rank Correlation
Kendall’s Rank Correlation
Chi-Squared Test
PEARSON TEST
Assumptions
Each observation should have a pair of values.
Interpretation
H0: the two samples are independent.
H1: there is a dependency between the samples.
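A minimal sketch of the Pearson test using `scipy.stats.pearsonr` is shown below. The paired data are synthetic: `y` is constructed as a linear function of `x` plus noise, so a strong dependency is expected by design.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=7)

# Synthetic paired observations: each x value has a matching y value.
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

# r is the correlation coefficient; p_value tests H0 of no linear association.
r, p_value = pearsonr(x, y)

print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
if p_value < 0.05:
    print("Reject H0: the samples are linearly dependent.")
else:
    print("Fail to reject H0.")
```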
PEARSON VS SPEARMAN AND KENDALL CORRELATION
Non-parametric correlations are less powerful because they use less information in their
calculations. Pearson's correlation uses information about the mean and the deviation from the
mean, while non-parametric correlations use only the ordinal information and scores of pairs.
With non-parametric correlation, the X and Y values can be continuous or ordinal, and
approximately normal distributions for X and Y are not required. Pearson's correlation, in
contrast, assumes that X and Y are continuous and normally distributed.
Correlation coefficients only measure linear (Pearson) or monotonic (Spearman and Kendall)
relationships.
In the normal case, Kendall correlation is more robust and efficient than Spearman correlation,
so it is preferred when samples are small or contain some outliers.
A direct pairwise computation of Kendall correlation has O(n^2) computational complexity,
compared with O(n log n) for Spearman correlation, where n is the sample size.
The interpretation of Kendall’s tau in terms of the probabilities of observing the agreeable
(concordant) and non-agreeable (discordant) pairs is very direct.
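The distinction between linear and monotonic relationships can be illustrated by computing all three coefficients on data that are monotonic but not linear. The exponential relationship below is an arbitrary choice for the example.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# y increases strictly with x, but the relationship is not linear.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)
kendall_tau, _ = kendalltau(x, y)

# Every pair of points is concordant, so both rank-based coefficients
# reach their maximum, while Pearson's r is pulled down by the curvature.
print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the rank-based coefficients depend only on ordering, any strictly increasing transformation of `y` would leave them unchanged, whereas Pearson's r would change.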
Thank You