
Lecture 5

Introduction to Data Science:

Basics of Statistics and Probability


Basic Terminologies

• The data are recordings of observations or events in a
scientific study, e.g., a set of measurements of individuals
from a population. The data actually obtained are variously
called the sample, the sample data, or simply the data, and all
possible samples from a study are collected in what is called
a sample space. The hypotheses, in turn, are general
statements about the target system of the scientific study, e.g.,
expressing some general fact about all individuals in the
population. A statistical hypothesis is a general statement that
can be expressed by a probability distribution over sample
space, i.e., it determines a probability for each of the possible
samples.
https://plato.stanford.edu/entries/statistics/
Statistics

• Types:
–  descriptive statistics 
–  inferential statistics

Source: Statistics And Probability Tutorial | Statistics And Probability for Data Science | Edureka
Descriptive Statistics: Central Tendency

• Mean: the sum of all values divided by the number of values,
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Note: The mean is heavily affected by outliers, so it is not a
robust measure.
Descriptive Statistics: Central Tendency

Median:
• describes the center of the data set when the data is ordered
by value.
• If two numbers are in the middle then the median is the
average of the two.

Note: The median is robust to outliers: a single extreme value
changes the median only slightly, if at all.
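
The contrast between the mean and the median can be checked with a small sketch. The salary figures below are made up for illustration and are not from the lecture; the mean and median come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical salaries in thousands; not data from the lecture.
salaries = [28, 30, 31, 33, 35]
with_outlier = salaries + [250]  # add one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(salaries)
# Even number of values: the median is the average of the two middle ones.
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

With these assumed numbers the mean more than doubles, while the median moves by only one unit.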

Descriptive Statistics: Central Tendency

Mode:
This is the most commonly occurring value in the dataset.
• The mode is robust to outliers as well.
• In the normal distribution the mean = median = mode.
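
As a minimal sketch of the mode, the block below counts how often each value occurs and then adds an extreme value to show that the mode does not move. The observations are hypothetical, not lecture data.

```python
import statistics
from collections import Counter

# Hypothetical observations; not data from the lecture.
values = [2, 3, 3, 5, 3, 7, 5]
with_outlier = values + [1000]  # one extreme value, occurring once

counts = Counter(values)                    # frequency of each distinct value
mode_before = statistics.mode(values)       # most frequently occurring value
mode_after = statistics.mode(with_outlier)  # unchanged by the single outlier

print(counts.most_common(2))
print(mode_before, mode_after)
```

When several values tie for the highest count, `statistics.multimode` returns all of them instead of only one.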

Descriptive Statistics: Central Tendency

• In a symmetrical distribution such as a Normal distribution,
these three measures are the same. In an asymmetrical (or
skewed) distribution, as below, there is a simple rule-of-thumb
formula which can be used to estimate one, given the other
two:

Mean - Mode ≈ 3 × (Mean - Median)
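
The rule of thumb can be rearranged to estimate any one of the three measures from the other two. The helper names and the example numbers below are illustrative assumptions, and the result is only an approximation for moderately skewed data.

```python
def estimate_mode(mean, median):
    """Solve Mean - Mode = 3 * (Mean - Median) for the mode."""
    return mean - 3 * (mean - median)


def estimate_median(mean, mode):
    """Solve the same relation for the median."""
    return mean - (mean - mode) / 3


def estimate_mean(median, mode):
    """Solve the same relation for the mean."""
    return (3 * median - mode) / 2


# Hypothetical right-skewed summary values: mean above median.
mode_est = estimate_mode(mean=32, median=30)
print(mode_est)
print(estimate_median(mean=32, mode=mode_est))
print(estimate_mean(median=30, mode=mode_est))
```

For a right-skewed distribution the estimate places the mode below the median, which in turn lies below the mean.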

http://www.syque.com/quality_tools/tools/Tools63.htm
Descriptive Statistics: Central Tendency

The mean salary for these ten staff is 30.7k.


Median:

Descriptive Statistics: Central Tendency

https://www.slideshare.net/indramani332211/measures-of-central-tendency-and-dispersion
Descriptive Statistics: Central Tendency

https://bjcvs.org/article/3002/en-US/operating-with-data---statistics-for-the-cardiovascular-surgeon--part-iii--comparing-groups
Descriptive Statistics: Variability (Spread)
Variance: This is essentially the sum of the squared distances
of each measurement from the mean, divided by a constant
(roughly the number of measurements). The larger the variance,
the more spread out the data are.
Standard deviation (SD): This is the square root of the variance. It
is more commonly reported than the variance because it is on the
same scale as the measurements themselves. For example, if the
measurements are in inches, then the variance is in square inches,
while the SD is in inches.
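
One way to see the point about units is to convert the same measurements to another unit: the SD scales by the conversion factor, the variance by its square. The heights below are hypothetical, and `statistics.variance` and `statistics.stdev` are the sample versions (dividing by n - 1).

```python
import math
import statistics

CM_PER_INCH = 2.54

# Hypothetical heights in inches; not data from the lecture.
heights_in = [60.0, 62.0, 65.0, 67.0, 71.0]
heights_cm = [h * CM_PER_INCH for h in heights_in]

var_in = statistics.variance(heights_in)  # square inches
sd_in = statistics.stdev(heights_in)      # inches
var_cm = statistics.variance(heights_cm)  # square centimetres
sd_cm = statistics.stdev(heights_cm)      # centimetres

print(sd_cm / sd_in)    # close to 2.54
print(var_cm / var_in)  # close to 2.54 ** 2
```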
https://web.stanford.edu/~kjytay/courses/stats32-aut2018/Session%202/Summary%20Statistics.pdf
Descriptive Statistics: Variability (Spread)

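• Average squared deviation from the mean, or variance, for a
population of size N with mean μ:
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$

• The sample formula for the variance requires dividing by
n - 1 rather than n because we lose one degree of
freedom when we calculate the mean. The formula for
the variance of a sample, notated as $s^2$, is therefore:
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$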
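
The effect of dividing by n - 1 can be illustrated by enumerating every possible sample from a tiny population. This is a sketch under stated assumptions: the population (1, 2, 3), the sample size of 2, and sampling with replacement are all chosen for illustration. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population; not data from the lecture.
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)


n = 2
# All 9 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the two candidate estimators over every possible sample.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)
```

In this setup, dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it exactly.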

Descriptive Statistics: Variability (Spread)

Consider the tiny data set (1, 2, 3, 4, 5)
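
A minimal sketch of the calculation for this data set, done step by step and then cross-checked against Python's standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

mean = sum(data) / len(data)                    # 15 / 5 = 3.0
squared_devs = [(x - mean) ** 2 for x in data]  # 4, 1, 0, 1, 4
sum_sq = sum(squared_devs)                      # 10.0

pop_var = sum_sq / len(data)           # divide by n:     10 / 5 = 2.0
sample_var = sum_sq / (len(data) - 1)  # divide by n - 1: 10 / 4 = 2.5

pop_sd = math.sqrt(pop_var)        # about 1.414
sample_sd = math.sqrt(sample_var)  # about 1.581

# Cross-check with the standard library.
assert pop_var == statistics.pvariance(data)
assert sample_var == statistics.variance(data)

print(mean, pop_var, sample_var, round(pop_sd, 3), round(sample_sd, 3))
```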

• To get back to the original units, we take the square root of the
variance: this is called the standard deviation and is signified
by σ for a population and s for a sample.
• For a population of size N with mean μ, the formula for the
standard deviation is:
  $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$

Outliers
• There is no absolute agreement among statisticians about
how to define outliers, but nearly everyone agrees that it is
important that they be identified and that appropriate
analytical techniques be used for data sets that contain
outliers.
• Basically, an outlier is a data point or observation whose value
is quite different from the others in the data set being
analyzed.

Descriptive Statistics: Variability (Spread)

• Interquartile range (IQR): This is the 3rd quartile minus the 1st
quartile.
• Note: The first two measures (variance and SD) can be heavily
influenced by outliers, while the third (interquartile range) is
robust to them.
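
The difference in robustness can be sketched by replacing one value with an extreme one and recomputing all three measures. The data are hypothetical. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice for `statistics.quantiles` is an assumption, not the lecture's definition.

```python
import statistics


def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data; not from the lecture.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = data[:-1] + [90]  # replace the largest value with an extreme one

var_before, var_after = statistics.variance(data), statistics.variance(with_outlier)
sd_before, sd_after = statistics.stdev(data), statistics.stdev(with_outlier)
iqr_before, iqr_after = iqr(data), iqr(with_outlier)

print("variance:", var_before, "->", var_after)
print("SD:      ", round(sd_before, 2), "->", round(sd_after, 2))
print("IQR:     ", iqr_before, "->", iqr_after)
```

With these numbers the variance and SD grow sharply, while the IQR stays the same because the quartiles do not depend on the most extreme value.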

https://web.stanford.edu/~kjytay/courses/stats32-aut2018/Session%202/Summary%20Statistics.pdf

https://www.slideshare.net/Sazedur92/measures-of-dispersion-73562437
Summary:
• "True" value, error, and uncertainty

https://www.iso.org/sites/JCGM/GUM/JCGM100/C045315e-html/C045315e_FILES/MAIN_C045315e/AD_e.html
• Fourth Chapter: Descriptive Statistics and Graphics


https://www.slideshare.net/HardikAgarwal3/applications-of-central-tendency
Standard deviation in the Normal distribution


http://www.syque.com/quality_tools/tools/Tools63.htm
