
OMBAIML 301:

Basics of Artificial Intelligence & Machine Learning

Unit 3:
Statistical Analysis Initial Data Analysis

By: Asst. Prof. Toshi Dave


Introduction

• Data mining involves understanding the data and finding relationships within it.
• An attribute is a characteristic or feature that is measured for each observation (record) and can vary from one observation to another.
• A population is the entire group that you want to draw conclusions about.
• A sample is the specific group that you will collect data from.

Types of Data

Measuring Relationship between Attributes

• Covariance: Covariance measures how two variables vary together. Its sign indicates the direction of the linear relationship between them.
✓ Positive covariance means that the variables tend to move in the same direction.
✓ Negative covariance means that they tend to move in opposite directions.
✓ Zero covariance means that there is no linear relationship between the variables. Independent variables have zero covariance, but zero covariance alone does not imply independence.
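
The three cases above can be sketched in a few lines of Python. This is a minimal illustration using the population covariance (dividing by n); the function name `covariance` and the small data sets are hypothetical and chosen only to make the signs easy to see.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with xs
falling = [10, 8, 6, 4, 2]   # moves against xs

cov_positive = covariance(xs, rising)
cov_negative = covariance(xs, falling)

# y = x^2 depends completely on x, yet the covariance is zero
# because the relationship is not linear.
centered = [-2, -1, 0, 1, 2]
squares = [x * x for x in centered]
cov_zero = covariance(centered, squares)

print(cov_positive, cov_negative, cov_zero)
```

The last pair is a reminder that a covariance of zero only rules out a linear relationship.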

• Correlation: A statistical measure of the strength and direction of the linear relationship between two variables. Its values range from -1 to 1.
✓ -1 means perfect negative (inverse) correlation
✓ 1 means perfect positive (direct) correlation
✓ 0 means no linear relationship.

o Two common methods of calculating correlation:
1) Pearson Correlation, which measures the linear relationship between the values themselves.
2) Spearman Rank Correlation, which applies the same idea to the ranks of the values and so captures any monotonic relationship.
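
As a rough sketch of how the two methods differ, the Python below computes Pearson correlation from covariance and standard deviations, and Spearman correlation as Pearson correlation of the ranks. The helper names (`pearson`, `ranks`, `spearman`) and the data are hypothetical, and the simple ranking helper assumes there are no tied values.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]   # always increasing, but not a straight line

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because `ys` always increases with `xs`, the rank correlation is 1 (up to rounding), while the Pearson value stays below 1 since the points do not lie on a straight line.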

• Chi-square: A statistical test that compares observed data with the data expected under a hypothesis.
✓ It is also used to test whether two categorical variables in our data are associated.
✓ It helps to determine whether an observed association between two categorical variables is due to chance or to a real relationship between them.
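
A minimal sketch of the chi-square test of independence on a 2 x 2 table follows. The observed counts are made up for illustration and the function name `chi_square_statistic` is hypothetical. Expected counts are computed as (row total x column total) / grand total, and the p-value shortcut used here holds only for one degree of freedom.

```python
import math

def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are two groups, columns are two categories
observed = [[30, 10],
            [20, 40]]

statistic, dof = chi_square_statistic(observed)

# For exactly one degree of freedom, P(chi-square > s) = erfc(sqrt(s / 2))
p_value = math.erfc(math.sqrt(statistic / 2))
print(round(statistic, 3), dof, p_value)
```

A small p-value suggests the difference between observed and expected counts is unlikely to be due to chance alone.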

Pearson correlation coefficient: ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y), where σ_X and σ_Y are the standard deviations of X and Y.

Measure of Distribution

• Statistical dispersion is the extent to which numerical data vary about an average value. Dispersion therefore helps in understanding the distribution of the data.

• Skewness and kurtosis are statistical measures that describe the shape of the data distribution. Both are numerical ways to assess the shape of a data set. They are often used when checking for normality, to judge whether the distribution is asymmetric or has unusually heavy or light tails.
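
To make dispersion concrete, the sketch below computes a few common measures of spread with Python's standard `statistics` module. The data set is hypothetical, and treating it as a whole population (hence `pvariance` and `pstdev`) is an assumption made for simplicity.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements

average = statistics.mean(data)
value_range = max(data) - min(data)      # distance between the extremes
variance = statistics.pvariance(data)    # mean squared deviation from the average
std_dev = statistics.pstdev(data)        # square root of the variance

# Quartiles split the sorted data into four parts; the IQR spans the middle half
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(average, value_range, variance, std_dev, iqr)
```

The standard deviation is in the same units as the data, which usually makes it easier to interpret than the variance.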

Skewness

• A measure of the asymmetry of a data set, i.e. how far its distribution departs from a symmetrical one.
• On a bell curve, skewness appears when data points are not distributed symmetrically to the left and right of the median.
• If one tail of the curve is stretched out to the left or to the right, the distribution is said to be skewed.
• A distribution with zero skew is symmetric; the normal distribution (bell-shaped curve) is the standard example.
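
One common way to put a number on skewness is the moment coefficient: the third central moment divided by the standard deviation cubed. The sketch below uses that definition (other definitions exist); the function name `skewness` and the data sets are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(data)
    average = sum(data) / n
    m2 = sum((x - average) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - average) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]     # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]     # mirror image

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)
print(skew_symmetric, skew_right, skew_left)
```

A positive result indicates a longer right tail, a negative result a longer left tail, and zero a symmetric data set.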

Definition of Kurtosis

• Kurtosis is a numerical measure in statistics that describes the sharpness of the peak and, more precisely, the weight of the tails of a data distribution.
• It is also called the tailedness of a distribution.
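
Kurtosis can be sketched as the fourth central moment divided by the squared variance. Subtracting 3 gives the excess kurtosis, which is 0 for a normal distribution. The function name `excess_kurtosis` and the data sets below are hypothetical and chosen so the arithmetic is easy to follow.

```python
def excess_kurtosis(data):
    """Fourth central moment over squared variance, minus 3."""
    n = len(data)
    average = sum(data) / n
    m2 = sum((x - average) ** 2 for x in data) / n
    m4 = sum((x - average) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                            # evenly spread, light tails
heavy_tails = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]   # most values central, two extremes

kurt_flat = excess_kurtosis(flat)
kurt_heavy = excess_kurtosis(heavy_tails)
print(kurt_flat, kurt_heavy)
```

Negative excess kurtosis points to lighter tails than a normal distribution, positive excess kurtosis to heavier tails.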

Box & Whiskers Plot
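
A box-and-whisker plot is usually drawn from the five-number summary: minimum, first quartile, median, third quartile and maximum. The sketch below computes those values. The 1.5 x IQR rule for flagging outliers is a common convention assumed here, not something stated in the slides, and the data are hypothetical.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])   # hypothetical values, one extreme

# The box runs from Q1 to Q3, with a line at the median
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, whisker_low, whisker_high, outliers)
```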

Probability

• A fundamental concept in statistics and data analysis.
• Measures the likelihood of events and their outcomes.

Types of Probability

• Joint Probability: probability of two events occurring simultaneously.
• Marginal Probability: probability of a single event occurring, irrespective of the other event.
• Conditional Probability: probability of one event occurring, given that another event has already occurred.
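
The three types of probability can be read off a table of counts. The sketch below uses a made-up table of 100 days classified by weather and arrival time; the counts and the function names (`joint`, `marginal_weather`, `conditional_arrival_given_weather`) are hypothetical.

```python
from fractions import Fraction

# Hypothetical counts of days by (weather, arrival)
counts = {
    ("rain", "late"): 12,
    ("rain", "on_time"): 8,
    ("dry", "late"): 6,
    ("dry", "on_time"): 74,
}
total = sum(counts.values())

def joint(weather, arrival):
    """P(weather and arrival): both events together."""
    return Fraction(counts[(weather, arrival)], total)

def marginal_weather(weather):
    """P(weather): sum the joint probabilities over every arrival outcome."""
    return sum(joint(weather, arrival) for arrival in ("late", "on_time"))

def conditional_arrival_given_weather(arrival, weather):
    """P(arrival | weather) = P(weather and arrival) / P(weather)."""
    return joint(weather, arrival) / marginal_weather(weather)

p_rain_and_late = joint("rain", "late")
p_rain = marginal_weather("rain")
p_late_given_rain = conditional_arrival_given_weather("late", "rain")
print(p_rain_and_late, p_rain, p_late_given_rain)
```

The conditional probability restricts attention to rainy days only, which is why it is larger than the joint probability in this example.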
Probability Distributions

• Describe how probabilities are distributed across the different outcomes or values in a random experiment.

• Two categories:
– Continuous distributions, for variables like height or weight.
– Discrete distributions, for variables like the number of coin tosses needed to get a head.
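
The coin-toss example above follows what is known as a geometric distribution: the first head arrives on toss k only if the first k - 1 tosses are tails. A minimal sketch, assuming a fair coin by default; the function name `tosses_until_head_pmf` is hypothetical.

```python
def tosses_until_head_pmf(k, p_head=0.5):
    """P(first head occurs on toss k) = (1 - p)^(k - 1) * p."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1 - p_head) ** (k - 1) * p_head

# Probability of needing exactly 1, 2, ..., 5 tosses
pmf = {k: tosses_until_head_pmf(k) for k in range(1, 6)}

# Over all possible values of k, the probabilities add up to 1
total_probability = sum(tosses_until_head_pmf(k) for k in range(1, 61))

print(pmf)
print(total_probability)
```

Each possible count has its own probability, which is what distinguishes a discrete distribution from a continuous one.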

PDF

• Stands for Probability Density Function.
• PDFs are used in continuous probability distributions to describe the relative likelihood (density) of values near a given point. The probability of observing any single exact value is zero.
• The area under the PDF curve between two values represents the probability of the random variable falling within that range.
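
As an illustration of the area interpretation, the sketch below integrates the standard normal PDF between -1 and 1 with the trapezoidal rule and compares the result with the closed-form value from the error function. The choice of the normal distribution and the helper names (`normal_pdf`, `area_under_pdf`) are assumptions for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x (a density, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_pdf(a, b, steps=1000):
    """Approximate the area under the standard normal PDF on [a, b]
    with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a) + normal_pdf(b))
    for i in range(1, steps):
        total += normal_pdf(a + i * width)
    return total * width

prob_within_one_sd = area_under_pdf(-1.0, 1.0)
exact = math.erf(1 / math.sqrt(2))   # closed form for P(-1 < Z < 1)

print(round(prob_within_one_sd, 4), round(exact, 4))
```

The density at a single point, such as `normal_pdf(0.0)`, is a height of the curve; only an area over a range is a probability.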

CDF

• Stands for Cumulative Distribution Function.
• CDFs are used in both continuous and discrete distributions.
• A CDF gives the cumulative probability of a random variable being less than or equal to a specific value.
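
For a discrete case, a CDF can be built by accumulating the individual probabilities. The sketch below does this for a fair six-sided die, which is a hypothetical example not taken from the slides.

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6              # each face is equally likely

# CDF: running total of the probabilities, F(x) = P(X <= x)
cdf = dict(zip(faces, accumulate(pmf)))

p_at_most_3 = cdf[3]                    # P(X <= 3)
p_between_3_and_5 = cdf[5] - cdf[2]     # P(2 < X <= 5) = F(5) - F(2)

print(p_at_most_3, p_between_3_and_5, cdf[6])
```

Subtracting two CDF values gives the probability of landing in the range between them, and the CDF always reaches 1 at the largest possible value.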

THANK YOU
