
Probability and Statistics

LUMS
Undergraduate
SS-4-6
Numerical Descriptive Statistics
• Numerical descriptive statistics take a different approach to
answer the same set of questions:
– Provide more precise information about a dataset’s distribution.
– The increased precision comes at the cost of stronger aggregation.
• Three basic types of numerical descriptive statistics:
– Measures of Central Location: Mean, Median, Mode
– Measures of Variability: Range, Variance, Standard Deviation,
Coefficient of Variation, Percentiles and Quartiles.
– Measures of Shape: Skewness and Kurtosis
• Ideally, employ visual and numerical descriptive statistics in
tandem to shed light on information embedded in datasets.
Measures of Central Location
• Average (i.e. arithmetic mean) is the most popular
measure of central location:
– computed by adding all the observations and dividing by the
total number of observations.
– appropriate for describing quantitative data only.
– Possesses nice theoretical properties:
• Sum of deviations from mean is zero.
• Linked to the measures of variation in a dataset.
• Changing the value of a single observation changes the average.
• Central Limit Theorem
– Sensitive to outliers (extreme values) e.g. what happens to
average household income in a poor neighborhood when a
billionaire moves in?
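A minimal Python sketch (hypothetical income figures, not real data) of the billionaire example:

```python
import numpy as np

# Hypothetical annual household incomes for a small neighborhood
incomes = np.array([32_000, 35_000, 28_000, 41_000, 30_000, 38_000])
print(np.mean(incomes))          # 34000.0: representative of the area

# A billionaire moves in: a single extreme value drags the mean far upward
with_outlier = np.append(incomes, 1_000_000_000)
print(np.mean(with_outlier))     # ~142.9 million: no longer representative
print(np.median(with_outlier))   # 35000.0: the median barely moves
```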
Measures of Central Location
• Median: Place the observations in ascending order; the observation falling in the middle is the median (with an even number of observations, average the two middle values).
– Median not sensitive to outliers
– Often used for income and property values datasets.
– Cannot be computed for nominal data.
• Mode: value/class that occurs most frequently in a dataset.
– Most suitable for nominal data, but also used for ordinal data.
– Datasets may have more than one modal class.
– Not a good measure of central location for quantitative data.
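A short sketch (made-up survey responses) of the mode for nominal data, using the standard library's multimode:

```python
from statistics import multimode

# Made-up nominal data: preferred payment method from a survey
responses = ["card", "cash", "card", "mobile", "cash", "card"]
print(multimode(responses))   # ['card']: the single modal class

# A dataset may have more than one modal class
grades = ["A", "B", "B", "C", "A"]
print(multimode(grades))      # ['A', 'B']: bimodal
```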
Measures of Variability
• Measures of central location fail to tell the complete story
about a dataset’s distribution e.g. how are observations
spread out around the mean (on average)?
Measures of Variability
• Range: simplest measure of variability, calculated by
subtracting smallest observation from largest observation.
– Fails to provide information on the dispersion of the observations
located between the two end points.
• Variance, and its related measure Standard Deviation, is a
measure of variability that incorporates all the data points.
– Variance is calculated by subtracting the mean from each number in a dataset, squaring the differences, and dividing the sum of the squares by the number of observations (or by n − 1 when estimating from a sample).
– Standard deviation (square root of the variance) used to compare
the average degree of variability between two quantitative datasets.
• Commonly used as a measure of risk in finance.
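A brief numpy sketch (arbitrary sample values) computing the range, variance, and standard deviation; numpy's ddof argument selects the population (ddof=0, the default) or sample (ddof=1) formula:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])   # arbitrary sample values

data_range = data.max() - data.min()    # range: largest minus smallest
pop_var    = np.var(data)               # divides by n   (population)
samp_var   = np.var(data, ddof=1)       # divides by n-1 (sample)
samp_std   = np.std(data, ddof=1)       # square root of sample variance

print(data_range, pop_var, samp_var, samp_std)   # 5.0, ~2.92, 3.5, ~1.87
```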
Measures of Variability
• Coefficient of variation: Standard deviation of a variable
divided by its mean:
– A standardized measure of variation, when comparing the degree of
variability between variables with different means:
• Variation in salaries of managers and CEOs?
• Variation in the weights of watermelons and apples?
– Interpreted as variation in a variable as a percentage of its mean; a small sketch follows below.
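A small sketch (invented weights) of the watermelon/apple comparison: the watermelons' standard deviation is larger in absolute terms, but the coefficient of variation shows which fruit is more variable relative to its mean:

```python
import numpy as np

watermelons = np.array([5.2, 6.1, 4.8, 5.5, 6.4])       # kg, invented values
apples      = np.array([0.18, 0.22, 0.15, 0.20, 0.25])  # kg, invented values

def coeff_of_variation(x):
    """Sample standard deviation as a fraction of the mean."""
    return np.std(x, ddof=1) / np.mean(x)

print(coeff_of_variation(watermelons))  # ~0.12
print(coeff_of_variation(apples))       # ~0.19: apples vary more, relatively
```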
• All of the above-mentioned measures of variability are
sensitive to outliers.
– Measures of relative variability (e.g. percentiles) are not sensitive to outliers.
• Percentiles provide information about the position of a particular
observation relative to the entire dataset, and are often used to define
benchmarks in business applications.
Measures of Variability
– For example, suppose your SAT score of 1340 is at the 80th percentile:
this implies that 80% of students scored below you, while 20% of students
scored above you.
– Caution: this doesn't mean you scored 80% on the exam!
• Difference between Q1 (25th percentile) and Q3 (75th
percentile) is called the interquartile range:
– Median is known as Q2 (50th percentile)
– Measures the spread of the middle 50% of the observations.
– Large values are indicative of high variability and the possible presence of outliers.
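A quick numpy sketch (simulated exam scores) computing the quartiles and the interquartile range:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=500)   # simulated exam scores

q1, q2, q3 = np.percentile(scores, [25, 50, 75])  # quartiles
iqr = q3 - q1                                     # spread of the middle 50%

print(f"Q1={q1:.1f}, median={q2:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
```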
• Measures of variation don't tell us much about the symmetry of a
distribution, outliers, or the concentration of data in the tails relative
to the center of the distribution.
Measures of Shape
Normal Distribution: A special type of symmetric uni-modal
distribution that is bell shaped, frequently encountered in
statistical modelling:
Many statistical techniques require/assume that data follow a bell-shaped distribution.

[Figure: histogram with Frequency on the vertical axis and Variable on the horizontal axis; a normal distribution has a bell-shaped histogram.]
Measures of Shape
Skewness: A skewed distribution is one with a long tail
extending either to the right or the left of the distribution.

[Figure: two histograms, one positively skewed and one negatively skewed]
– Positively skewed (right skewed): implies mean > median, i.e. more outliers on the RHS.
– Negatively skewed (left skewed): implies mean < median, i.e. more outliers on the LHS.
Measures of Shape
Kurtosis: Measure of the concentration of data in the tails relative
to the center of the distribution:
– Negative Excess Kurtosis: Relatively less concentration in the tails.
– Positive Excess Kurtosis: Relatively more concentration in the tails.
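A short scipy sketch (simulated data) computing sample skewness and excess kurtosis; scipy's kurtosis returns excess kurtosis by default, so a normal sample gives values near zero for both:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_data  = rng.normal(size=10_000)       # symmetric, bell shaped
right_skewed = rng.exponential(size=10_000)  # long right tail

print(skew(normal_data), kurtosis(normal_data))    # both near 0
print(skew(right_skewed), kurtosis(right_skewed))  # skew ~2, excess kurtosis ~6
```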
Overview: Numerical Descriptive Statistics
Describing Data Numerically:
• Central Tendency: Mean, Median, Mode
• Variation: Range, Variance/Std. Deviation, Coefficient of Variation, Interquartile Range
• Shape: Skewness and Kurtosis
Some Rules of the Expectation Operator
• If $k$ is some constant, then we can mathematically prove the
following results:
– Rule-1: If $E(x_i) = \bar{x}$, then $E(x_i + k) = \bar{x} + k$
• Adding a constant to each observation changes the average by that constant.
– Rule-2: If $\mathrm{Var}(x_i) = \sigma^2$, then $\mathrm{Var}(x_i + k) = \sigma^2$
• Adding a constant to each observation does not change the variance.
– Rule-3: If $E(x_i) = \bar{x}$, then $E(kx_i) = kE(x_i) = k\bar{x}$
• Multiplying each observation by a constant changes the average by a factor of
that constant.
– Rule-4: If $\mathrm{Var}(x_i) = \sigma^2$, then $\mathrm{Var}(kx_i) = k^2\,\mathrm{Var}(x_i) = k^2\sigma^2$
• Multiplying each observation by a constant changes the variance by the squared
factor of that constant.
• We apply these rules to standardize datasets to identify outliers.
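A minimal numpy sketch (arbitrary data, k = 3) verifying the four rules numerically:

```python
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])   # arbitrary data
k = 3.0

print(np.mean(x + k), np.mean(x) + k)    # Rule-1: both 9.0
print(np.var(x + k), np.var(x))          # Rule-2: both 8.5
print(np.mean(k * x), k * np.mean(x))    # Rule-3: both 18.0
print(np.var(k * x), k**2 * np.var(x))   # Rule-4: both 76.5
```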
Standardizing Datasets
• Z-scores are used to identify outliers in a dataset. To calculate the
Z-score of each observation:
– Subtract the mean of the variable from each observation.
– Divide the result by the standard deviation of the variable.
– The resulting distribution (of Z-scores) has a mean of 0 and a standard
deviation of 1.
– Each observation's Z-score is interpreted as the number of standard
deviations it lies above or below the mean.
• Converting each observation into its corresponding Z-score
does not change a non-normal distribution into a normal
distribution.
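A brief sketch (invented data) computing Z-scores; the |z| > 2 cutoff used here is an assumed rule of thumb for flagging potential outliers (|z| > 3 is also common with larger samples):

```python
import numpy as np

data = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 45.0])  # 45.0 is a suspect value

z = (data - data.mean()) / data.std(ddof=1)  # standardize: mean 0, std 1

print(np.round(z, 2))
print(data[np.abs(z) > 2])   # flags 45.0 as unusually far from the mean
```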
The Empirical Rule
• Approximately 68% of all observations fall within one standard deviation of the mean.
• Approximately 95% of all observations fall within two standard deviations of the mean.
• Approximately 99.7% of all observations fall within three standard deviations of the mean.
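A quick numpy sketch (simulated bell-shaped data) checking the three percentages empirically:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100_000)   # simulated normal data

for k in (1, 2, 3):
    share = np.mean(np.abs(x) < k)             # fraction within k std devs
    print(f"within {k} std dev(s): {share:.3f}")
# Output is close to 0.683, 0.954, 0.997
```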
Chebychev’s Inequality
• For any type of distribution and any number $k > 1$, at least
$100 \times \left(1 - \frac{1}{k^2}\right)\%$ of the observations lie within $k$
standard deviations of either side of the mean.
• Two special cases of Chebychev’s inequality are applied
frequently, namely, when k = 2 and k = 3:
– At least 75% of the observations in any data set lie within 2 standard
deviations to either side of the mean.
– At least 88.9% (i.e. $1 - \frac{1}{9}$) of the observations in any data set lie within 3 standard
deviations to either side of the mean.
• Does the empirical rule violate Chebychev’s inequality?
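A tiny sketch comparing Chebychev's lower bounds with the empirical rule's percentages:

```python
def chebychev_bound(k):
    """Minimum fraction of observations within k standard deviations."""
    return 1 - 1 / k**2

empirical = {2: 0.954, 3: 0.997}   # empirical-rule percentages

for k in (2, 3):
    print(k, chebychev_bound(k), empirical[k])
# k=2: bound 0.75  vs ~0.954; k=3: bound ~0.889 vs ~0.997
```

Since 95% exceeds 75% and 99.7% exceeds 88.9%, the empirical rule does not violate Chebychev's inequality; it is simply a tighter statement that holds for bell-shaped data.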
