
Lecture 1 – Describing the Distribution of a Random Variable
MGMT804 – Sabancı University
Can Akkan
Fall 2022
Learning Objectives
• Histogram
– Visualizing the distribution of a single variable
• Quantifying the distribution of a sample
– Measures of central tendency
– Measures of variability

Sabancı U. - C. Akkan 2
Describing Data – Histogram
Numerical Data: Number of times registered users visited a web site in a given week
Raw data:

bin = (40, 45]


Recall: (x, y], where y > x, is an interval that contains values
❖ greater than x and
❖ less than or equal to y
Creating a Histogram
1. Divide the range of outcomes (observed values) into intervals (bins).
2. Determine the number (frequency) of observations that fall into each interval.
3. Plot a column chart.

Excel has a histogram chart tool that does all of these for us
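
The three steps above can be sketched in Python. This is a minimal illustration and not how Excel's chart tool is implemented; the helper name `bin_counts` and the sample values are hypothetical, and each bin is a half-open interval (lower, upper].

```python
import math

def bin_counts(values, lower, width, n_bins):
    """Count how many values fall into each (lo, hi] bin.

    Bins are (lower, lower+width], (lower+width, lower+2*width], ...
    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * n_bins
    for v in values:
        # ceil places a value that sits exactly on an upper limit into that bin
        k = math.ceil((v - lower) / width) - 1
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical weekly visit counts (not the course data)
visits = [41, 45, 46, 50, 52, 55, 58, 60]
counts = bin_counts(visits, lower=40, width=5, n_bins=4)

# Step 3: a crude text version of the column chart
for k, c in enumerate(counts):
    lo = 40 + 5 * k
    print(f"({lo}, {lo + 5}]: {'#' * c}")
```

Note that 45 is counted in the bin (40, 45], while a value of exactly 40 would fall outside it.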

Histogram – Rules for Bins
• Bins should include all data
• Bins should be of equal size (except the first and last ones, e.g. the 1st bin could be (−∞, 5])
– Excel calls these underflow and overflow bins
• Bins should be non-overlapping
• Bin limits which are round numbers are preferable.
– Not 12.34521 but 12 or 12.5
• Number of bins
– To see the shape of the distribution you should have at least 5 intervals (preferably 8 to 15), and at most 20 bins.
webhits_stats.xlsx
Naming Cell Ranges in Excel
• The range A2:A144 has been named WebHits.
1) Select the data range
- Click on cell A2
- Press CTRL-SHIFT-(down arrow) simultaneously.
2) Type a name into the Name Box.
- DO NOT use the space character
- Prefer _ (underscore) instead
- E.g. my_data
Plotting the histogram
• Select all data, then
Plotting the histogram
• To change the bin size, right-click on the horizontal axis

Plotting the histogram
• Formatting the horizontal axis – adjusting the bins

default
Change bin width

Improving the quality of the histogram
• Informative, clean and clear (easy to read) chart
– No horizontal gridlines → why?
– Small gap between each rectangle pair → why?
– Has a title for the chart.
– Has titles for both the x- and y-axes.
– Has no legend (as we have a single series).
• Edward Tufte’s book «The Visual Display of Quantitative Information»
– The data-ink ratio should be maximized…

Placing the chart on a separate chart sheet
1. While you are on your chart, right-click and select Move Chart from the drop-down menu, OR use the Move Chart button.
2. Select New sheet and type a name (Histogram) for the separate chart sheet.

In-Class Exercise 1-1
• Plot a histogram and move it to a separate chart sheet

Unfortunately, we cannot adjust the «major units» on the vertical axis to be integers

Choosing a Good Number of Bins
Fishermen ID   Kg of Fish Caught
1              5.83
2              6.17
3              4.81
4              9.77
5              9.17
6              10.13
7              11.48
8              5.37
9              6.88
10             9.23
11             9.57
12             10.36
13             4.67
14             9.41
15             3.72
…
35             4.89
36             5.21
37             4.38

[Figure: Histogram 1 and Histogram 2, two histograms of the same data drawn with different bin widths. x-axis: Bins (Upper Limits) for Fish Caught (Kg); y-axis: Nbr of Fishermen.]

Analyzing the Histogram
• Shape:
– How many peaks (none, 1, 2, …)?
• Peak of the distribution is called its mode.
• If a distribution has two peaks, it is called a bi-modal distribution.
– Symmetric or skewed (positive or negative)?
• Right skewed: tail extends to the right (i.e. has positive skewness)
• Left skewed: tail extends to the left (i.e. has negative skewness)

Analyzing the Histogram
shapes are important

Right-skewed distribution

Left-skewed distribution

Descriptive statistics for the distribution of a
random variable
• Also called “summary measures/statistics”
• Purpose:
– Quantifying
• central tendency and variation of data,
• the degree of linear relationship between two data series.
– Detecting outliers

Some Terminology
• Let’s say I am interested in the average height of the
students taking this course
– I may measure the height of 20 students who are taking this
course
• These 20 students are my sample (Turkish: örneklem).
• The entire class could be my population (Turkish: kitle).
• The height of each student in this sample is one observation.
– Can you think of a different population that can be represented
by the same sample?

Some Terminology
• A population is the collection of all items of interest in a
particular study.
• A sample is a portion of the population selected to
represent the whole population.
• An observation is one item in the sample
• Inferential statistics is about using data collected in the
sample to make inferences about the characteristics of
the population.
Descriptive Statistics

Summary Measures
• Let’s assume in a customer satisfaction survey
– Likert scale of 5:
1: Completely disagree,
3: Neither agree nor disagree,
5: Completely agree
– The sample size is 7
• Sample observations are:
1, 3, 3, 4, 4, 4, 5

Summary Measures –
Measures of Central Tendency
• Arithmetic average (mean)
– Average = (1+3+3+4+4+4+5)/7 = 3.43
– Excel formula: =AVERAGE(data_range)
– Has any customer given an evaluation of 3.43?
• Is this a “typical” value?
• Sample mean: $\bar{X}$ (X-bar)
• Population mean: $\mu$
• Sample size: n
• Population size: N

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

[Figure: number line from 1 to 5]

Summary Measures –
Measures of Central Tendency
• Median: half of the data is below the median
– Take all observations
– Order (sort) them
– Count observations until you are exactly halfway: this is the median value.
• If there is an even number of observations, then the median is halfway between the two middle values.
– For the sample 1, 3, 3, 4, 4, 4, 5 median is …
– For the sample 1, 3, 3, 4, 4, 5 median is (3+4)/2 = 3.5
– Excel’s formula =MEDIAN(data_range)
Summary Measures –
Measures of Central Tendency
• Mode
– Most frequently observed value
– Since Excel 2010 there are new functions:
• MODE.MULT → an array function (not a formula), just like FREQUENCY, TRANSPOSE, …
• MODE.SNGL
– For the sample 1, 3, 3, 4, 4, 4, 5 the mode is 4: it is observed three times, more often than any other value.
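
The three measures of central tendency can be checked for the survey sample with Python's standard `statistics` module. This is a sketch for comparison with the Excel functions named in the slides; the variable names are illustrative only.

```python
import statistics

ratings = [1, 3, 3, 4, 4, 4, 5]  # survey sample from the slides (n = 7)

mean = statistics.mean(ratings)        # counterpart of =AVERAGE(data_range)
median = statistics.median(ratings)    # counterpart of =MEDIAN(data_range)
modes = statistics.multimode(ratings)  # list of all most frequent values

# With an even number of observations the median is the midpoint
# of the two middle values.
even_median = statistics.median([1, 3, 3, 4, 4, 5])

print(round(mean, 2), median, modes, even_median)
```

`multimode` returns a list because a sample can have more than one most frequent value, which is the same reason MODE.MULT returns an array.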

webhits_stats.xlsx
Summary Measures –
Measures of Central Tendency
• Let’s calculate the average and median for page hits.
• Which characteristic of the data makes the average larger than the median?
• Is Turkey’s average income or median income per person larger?
• How about in other countries?
• Symmetric distributions have mean ≅ median
• Right-skewed (positively skewed) distributions have mean >> median
• Left-skewed (negatively skewed) distributions have mean << median
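
A small sketch of why a long right tail pulls the mean above the median. The income figures below are hypothetical and chosen only to produce a right-skewed sample; they are not real data.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are very large
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 150, 400]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# The two large values raise the mean a lot but barely move the median.
print(mean, median)
```

Removing the two largest values would bring the mean much closer to the median, which is the pattern expected for a roughly symmetric sample.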

Summary Measures – Measures of
Spread
• The simplest measure is the range of the data.
• Range = Max – Min
• When you have only a few data points
(< 10), range could be informative enough, otherwise …
• In statistical process control (SPC), used mostly in factories, the range is the usual measure of variability of the data collected. Sample sizes of 5 to 10 are very common.

Summary Measures – Measures of
Spread
• Interpreting variance
• Recall the customer satisfaction survey data
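1, 3, 3, 4, 4, 5 (sample size n = 6)
Sample average: $\bar{x} = 3.33$

[Figure: number line from 1 to 5 with the sample average 3.33 marked]

$$s^2 = \frac{(1-3.33)^2 + (3-3.33)^2 + (3-3.33)^2 + (4-3.33)^2 + (4-3.33)^2 + (5-3.33)^2}{5} = \frac{9.33}{5} = 1.87$$

The numerator is the sum of squared differences; the denominator 5 is (n − 1).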
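
The same calculation can be followed step by step in Python. This sketch uses the unrounded mean, for which the sum of squared differences comes out at about 9.33; the variable names are illustrative.

```python
data = [1, 3, 3, 4, 4, 5]
n = len(data)

x_bar = sum(data) / n                             # sample average
squared_diffs = [(x - x_bar) ** 2 for x in data]  # squared distance of each value from the mean
ss = sum(squared_diffs)                           # sum of squared differences
s2 = ss / (n - 1)                                 # sample variance divides by n - 1

print(round(x_bar, 2), round(ss, 2), round(s2, 2))
```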
Summary Measures – Measures of
Spread
• Variance:
– A measure of how far the values of a distribution (sample
data or population values) are from the mean.
– Note: if you run a survey that everyone in the company answers, you have the whole population. Greek letters denote population measures.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

(sample variance on the left, population variance on the right)
• $s^2$ denotes the sample variance. Excel formula VAR.S(data_range)
• $\sigma^2$ denotes the population variance. Excel formula VAR.P(data_range)

Summary Measures – Measures of Spread
• Standard deviation:
– Defined as the square root of the variance.
– Useful: it has the same unit as the mean
• Ex: If observations are in kilograms, both mean and standard
deviation are in kilograms (whereas the variance is in kilogram
squared)
• $s$ denotes the sample std deviation. Excel formula STDEV.S(data_range)
• $\sigma$ denotes the population std deviation. Excel formula STDEV.P(data_range)
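
For comparison with the four Excel functions, Python's standard `statistics` module has one function for each sample/population pair. The sketch below applies them to the six-observation survey sample; treating that sample as a whole population is done here only to show the difference between the two divisors.

```python
import math
import statistics

data = [1, 3, 3, 4, 4, 5]

s2 = statistics.variance(data)       # divides by n - 1, counterpart of VAR.S
sigma2 = statistics.pvariance(data)  # divides by N, counterpart of VAR.P
s = statistics.stdev(data)           # square root of s2, counterpart of STDEV.S
sigma = statistics.pstdev(data)      # square root of sigma2, counterpart of STDEV.P

print(round(s2, 2), round(sigma2, 2), round(s, 2), round(sigma, 2))
```

Because the population formulas divide by N instead of n − 1, they always give the smaller of the two values for the same data.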

Comparing Standard Deviations
Note: in Six Sigma quality programs, the aim is to reduce the deviation.

[Figure: two frequency distributions with the same center and different variation. Frequency distribution 1: smaller standard deviation. Frequency distribution 2 (the red one): larger standard deviation.]

Variance gets bigger when the data lie further away from the mean.

What if the centers are significantly different?


In-class Exercise 1-2

• The quality control department of a furniture company collected sample measurements of two parts of a table
• Calculate the specified summary statistics

Note: for a part that is about 4 cm long, a standard deviation of 0.45 is very large, and so is its variability.
Measures of Spread – Coefficient of
Variation
• Consider the previous in-class exercise:
– Wooden furniture factory
– Two parts: (1) Corner piece (2) Leg
– A sample of 25 was taken for each part and their lengths were measured.
– Standard deviations are 0.450cm and 0.479cm →
Can we conclude that their variation is more or less
the same?

Measures of Spread – Coefficient of
Variation
                 Corner    Leg
average          4.025     76.110
std dev          0.450     0.479
min              3.54      75.32
max              5.3       77.22
(max-min)/min    50%       3%
CV               0.112     0.006

• Coefficient of variation

$$CV = \frac{s_x}{\bar{X}}$$

– Useful because it is unitless
– Generally, CV > 1 suggests quite high spread in the data
Note: a large CV means large deviation; in modern manufacturing, CV = 0.00000 (essentially zero).
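
The CV values for the two parts can be reproduced from the averages and standard deviations given for the corner piece and the leg. A minimal sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation divided by the mean (unitless)."""
    return std_dev / mean

cv_corner = coefficient_of_variation(0.450, 4.025)   # corner piece, lengths in cm
cv_leg = coefficient_of_variation(0.479, 76.110)     # leg, lengths in cm

print(round(cv_corner, 3), round(cv_leg, 3))
```

The standard deviations are almost equal, yet relative to its length the corner piece varies roughly 18 times as much as the leg.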

webhits_stats.xlsx
Example
Let’s calculate the coefficient of variation for the page hits.
The histogram of page hits shows high variability → consistent with
the CV we calculated

Percentiles and Quartiles
• The 10th percentile of a data set is the value below which 10% of the data lie.
– Excel function: =PERCENTILE.INC(data, 0.10)
– If Ali’s height of 160 cm is the 10th percentile of Turkish adult male heights, that means 10% of Turkish adult males are shorter than 160 cm.
• 1st quartile is the 25th percentile
– also called lower quartile, denoted Q1
– Excel function: =QUARTILE.INC(data, 1) (this is the one we mostly use)
• 3rd quartile is the 75th percentile
– also called upper quartile, denoted Q3
• Interquartile range (IQR) = Q3 – Q1 → IQR is a measure of spread
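
Quartiles and the IQR can also be sketched with Python's `statistics.quantiles`. The `method="inclusive"` option interpolates between sorted observations and should correspond to Excel's .INC functions, but treat that correspondence as an assumption and compare against your own Excel output. The sample is the seven-observation survey data used earlier in the lecture.

```python
import statistics

data = [1, 3, 3, 4, 4, 4, 5]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)
```

The middle cut point `q2` is the median, so it agrees with the value returned by `statistics.median` for the same data.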

Outliers

• Extreme or untypical values


• Sometimes you want to:
– Get rid of the outliers (e.g. could be due to data entry errors)
– Work on outliers, understand why they occurred
– What you do with them depends on the context.

Identifying Outliers
• By convention outliers are grouped into two:
– Extreme (or serious) outliers
• Outer fences: Q1 − 3×IQR and Q3 + 3×IQR (3×IQR below Q1 and above Q3)
• An extreme outlier lies outside the outer fences
– Mild (or potential) outliers
• Inner fences: Q1 − 1.5×IQR and Q3 + 1.5×IQR (1.5×IQR below Q1 and above Q3)
• A mild outlier lies outside the inner fences and inside the outer fences.
– John Tukey (a famous statistician) came up with these fences
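
The fence rules can be written as a short classification function. This is a sketch under stated assumptions: the function name `classify` is illustrative, the quartiles Q1 = 5 and Q3 = 19 are example inputs, and the tested values are hypothetical. A value lying exactly on a fence is treated as being inside it.

```python
def classify(value, q1, q3):
    """Label a value as 'extreme', 'mild', or 'not an outlier' using IQR fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme"          # beyond the outer fences
    if value < inner_low or value > inner_high:
        return "mild"             # between the inner and outer fences
    return "not an outlier"       # within the inner fences

# Example quartiles: IQR = 14, inner fences at -16 and 40, outer fences at -37 and 61
q1, q3 = 5, 19
for v in (30, 45, 70):            # hypothetical observations
    print(v, classify(v, q1, q3))
```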

webhits_stats.xlsx

Outliers
Let’s see whether there are mild and extreme outliers in the data.

[Figure: histogram "Web Hits in Week 9". x-axis: Bin Upper Limits (5 to 70); y-axis: Nbr of Members (0 to 45).]

[Figure: fence diagram with Q1 = 5, Q3 = 19 and IQR = 14. Lower Inner Fence = −16 and Upper Inner Fence = 40 (1.5×IQR from the quartiles); Lower Outer Fence = −37 and Upper Outer Fence = 61 (3×IQR from the quartiles). Mild outliers lie between the inner and outer fences; extreme outliers lie beyond the outer fences.]
webhits_stats.xlsx

Box-and-whisker plot

[Figure: box-and-whisker plot. Labels from top to bottom: mild and extreme outliers; largest data point smaller than Q3 + 1.5×IQR; Q3; average; median; Q1.]
