Introduction to Data Science: Basic Statistics and Probability

Lecture 5
Introduction to Data Science:
Basics of Statistics and Probability

Basic Terminologies
• The data are recordings of observations or events in a
scientific study, e.g., a set of measurements of individuals
from a population. The data actually obtained are variously
called the sample, the sample data, or simply the data, and all
possible samples from a study are collected in what is called
a sample space. The hypotheses, in turn, are general
statements about the target system of the scientific study, e.g.,
expressing some general fact about all individuals in the
population. A statistical hypothesis is a general statement that
can be expressed by a probability distribution over sample
space, i.e., it determines a probability for each of the possible
samples.
https://plato.stanford.edu/entries/statistics/
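The idea that a statistical hypothesis determines a probability for each possible sample can be sketched in a few lines of Python. The coin-flip study, the three-flip sample space, and the helper name `sample_probability` are assumptions chosen for illustration; none of them come from the lecture.

```python
from itertools import product

# Hypothetical study: flip a coin three times and record each outcome.
# The sample space is every possible sequence of heads (H) and tails (T).
sample_space = list(product("HT", repeat=3))


def sample_probability(sample, p_heads):
    """Probability of one sample under the hypothesis P(heads) = p_heads."""
    prob = 1.0
    for outcome in sample:
        prob *= p_heads if outcome == "H" else 1.0 - p_heads
    return prob


# The statistical hypothesis "the coin is fair" assigns a probability
# to every sample in the sample space.
fair_coin = {s: sample_probability(s, 0.5) for s in sample_space}

# A valid probability distribution sums to 1 over the sample space.
total = sum(fair_coin.values())
print(len(sample_space), total)
```

A different hypothesis, such as P(heads) = 0.7, would assign different probabilities to the same eight samples.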
Statistics
• Types:
– descriptive statistics
– inferential statistics
Source: Statistics And Probability Tutorial | Statistics And Probability for Data Science | Edureka
Descriptive Statistics: Central Tendency
• Mean: the sum of all values divided by the number of values,
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Note: The mean is heavily affected by outliers, so we say the
mean is not a robust measure.
Median:
• Describes the center of the data set when the data are ordered
by value.
• If two numbers are in the middle, the median is the
average of the two.
Note: The median is robust to outliers, so an outlier has
little or no effect on its value.
Mode:
This is the most commonly occurring value in the dataset.
• The mode is robust to outliers as well.
• In the normal distribution the mean = median = mode.
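The robustness claims for the three measures can be checked with Python's standard `statistics` module. The salary figures below are hypothetical and chosen only so that one value is an obvious outlier.

```python
import statistics

# Hypothetical salaries in thousands; the last value is an outlier.
salaries = [25, 27, 27, 28, 30, 31, 150]
without_outlier = salaries[:-1]

mean_with = statistics.mean(salaries)
mean_without = statistics.mean(without_outlier)

median_with = statistics.median(salaries)
median_without = statistics.median(without_outlier)

mode_with = statistics.mode(salaries)
mode_without = statistics.mode(without_outlier)

print("mean:  ", mean_without, "->", round(mean_with, 2))
print("median:", median_without, "->", median_with)
print("mode:  ", mode_without, "->", mode_with)
```

In this sketch the single outlier moves the mean from 28 to about 45.4, shifts the median only from 27.5 to 28, and leaves the mode at 27.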
• In a symmetrical distribution such as a Normal distribution,
these three measures are the same. In an asymmetrical (or
skewed) distribution, there is a simple rule-of-thumb
formula which can be used to estimate one, given the other
two:
Mean - Mode = 3 x (Mean - Median)
http://www.syque.com/quality_tools/tools/Tools63.htm
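Rearranging the rule of thumb Mean - Mode = 3 x (Mean - Median) gives Mode = 3 x Median - 2 x Mean. A minimal Python sketch follows; the function name `estimate_mode` and the input values are hypothetical, and the result is only an approximation for moderately skewed data.

```python
def estimate_mode(mean, median):
    """Rule-of-thumb estimate: Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


# Hypothetical summary values for a right-skewed data set.
mean, median = 50.0, 45.0
mode_estimate = estimate_mode(mean, median)

# Check that the estimate satisfies the original relationship.
print(mode_estimate, mean - mode_estimate, 3 * (mean - median))
```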
The mean salary for these ten staff is 30.7k.

Median:
https://www.slideshare.net/indramani332211/measures-of-central-tendency-and-dispersion
https://bjcvs.org/article/3002/en-US/operating-with-data---statistics-for-the-cardiovascular-surgeon--part-iii--comparing-groups
Descriptive Statistics: Variability (Spread)
Variance: This is essentially the sum of squares of the distance
of each measurement from the mean, divided by a constant
(roughly the number of measurements). The larger the variance,
the more spread apart the data is.
Standard deviation (SD): This is the square root of variance. It
is more commonly reported than variance because it is on the
same scale as the measurements themselves. For example, if the
measurements are in inches, then variance is in square inches
while the SD is in inches as well.
https://web.stanford.edu/~kjytay/courses/stats32-aut2018/Session%202/Summary%20Statistics.pdf
• Average squared deviation, or variance, for a population:
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
• The sample formula for the variance requires dividing by
n – 1 rather than n because we lose one degree of
freedom when we calculate the mean. The formula for
the variance of a sample, notated as s^2, is therefore:
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
Consider the tiny data set (1, 2, 3, 4, 5)
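For this data set the variance can be worked out step by step. The sketch below does the arithmetic by hand in Python and then cross-checks it against the standard `statistics` module; the variable names are illustrative only.

```python
import statistics

data = [1, 2, 3, 4, 5]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared distance of each measurement from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

# Step 3: divide by n for a population, by n - 1 for a sample.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

print(mean, sum_of_squares, population_variance, sample_variance)

# Cross-check against the standard library.
assert population_variance == statistics.pvariance(data)
assert sample_variance == statistics.variance(data)
```

The mean is 3, the squared deviations are 4, 1, 0, 1, 4 (sum 10), so the population variance is 10 / 5 = 2 and the sample variance is 10 / 4 = 2.5.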
• To get back to the original units, we take the square root of the
variance: this is called the standard deviation and is signified
by σ for a population and s for a sample.
• For a population, the formula for the standard deviation is:
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
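As a quick illustration, the standard deviation of the tiny data set (1, 2, 3, 4, 5) can be computed with Python's standard `statistics` module, where `pstdev` divides by N (population) and `stdev` divides by n – 1 (sample). This is a sketch for checking the definitions, not part of the original slides.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

# Population standard deviation: square root of the population variance.
population_sd = statistics.pstdev(data)

# Sample standard deviation: square root of the sample variance.
sample_sd = statistics.stdev(data)

print(round(population_sd, 3), round(sample_sd, 3))

# Both are the square roots of the corresponding variances.
assert math.isclose(population_sd, math.sqrt(statistics.pvariance(data)))
assert math.isclose(sample_sd, math.sqrt(statistics.variance(data)))
```

Because the sample formula divides by a smaller number, the sample standard deviation is always slightly larger than the population value computed from the same data.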
Outliers
• There is no absolute agreement among statisticians about
how to define outliers, but nearly everyone agrees that it is
important that they be identified and that appropriate
analytical techniques be used for data sets that contain
outliers.
• Basically, an outlier is a data point or observation whose value
is quite different from the others in the data set being
analyzed.
• Interquartile range (IQR): This is the 3rd quartile minus the 1st
quartile.
• Note: The first 2 measures (Variance & SD) can be heavily
influenced by outliers, while the third (Interquartile range) is
robust to them.
https://web.stanford.edu/~kjytay/courses/stats32-aut2018/Session%202/Summary%20Statistics.pdf
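The IQR and its robustness can be sketched in Python as follows. The data values are hypothetical, and the 1.5 x IQR cut-off used to flag outliers is one common convention assumed here, not a definition given in the lecture. Quartile values also depend on the interpolation method; this sketch uses the default of `statistics.quantiles`.

```python
import statistics

# Hypothetical measurements with one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# Quartiles: the cut points that split the sorted data into four parts.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Compare how much the SD changes when the flagged points are dropped.
trimmed = [x for x in data if x not in outliers]
sd_with = statistics.stdev(data)
sd_without = statistics.stdev(trimmed)

print("IQR:", iqr, "outliers:", outliers)
print("SD with outlier:", round(sd_with, 2), "without:", round(sd_without, 2))
```

Here only the value 102 is flagged, and removing it shrinks the standard deviation sharply, while the IQR is determined by the middle of the data and barely depends on that point.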
https://www.slideshare.net/Sazedur92/measures-of-dispersion-73562437
Summary:
• “True” value, error, and uncertainty
https://www.iso.org/sites/JCGM/GUM/JCGM100/C045315e-html/C045315e_FILES/MAIN_C045315e/AD_e.html
• Fourth Chapter: Descriptive Statistics and Graphics
https://www.slideshare.net/HardikAgarwal3/applications-of-central-tendency
Standard deviation in the Normal distribution
http://www.syque.com/quality_tools/tools/Tools63.htm
