Lec1 DescribeDistribution
Describing the Distribution of a Random Variable
MGMT804 – Sabancı University
Can Akkan
Fall 2022
Learning Objectives
• Histogram
– Visualizing the distribution of a single variable
• Quantifying the distribution of a sample
– Measures of central tendency
– Measures of variability
Sabancı U. - C. Akkan 2
Describing Data – Histogram
Numerical Data: Number of times registered users visited a web site in a given week
Raw data:
Excel has a histogram chart tool that does all of these for us
Histogram – Rules for Bins
• Bins should include all data
• Bins should be of equal size (except first and last one – e.g. 1st
bin could be (−∞, 5])
– Excel calls these overflow bins
• Bins should be non-overlapping
• Bin limits which are round numbers are preferable.
– Not 12.34521 but 12 or 12.5
• Number of bins
– To see some kind of shape of the distribution you should have at least 5
intervals (preferably 8 to 15). There should be at most 20 bins.
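The bin rules above can be sketched in a few lines of Python. This is only an illustration, not what Excel does internally: the function name `bin_upper_limit`, the visit counts, and the chosen limits (width 5, first bin (−∞, 5], last regular bin ending at 30) are all assumptions made for the example.

```python
import math
from collections import Counter

def bin_upper_limit(value, width, lowest_upper, highest_upper):
    """Return the upper limit of the bin (lower, upper] that contains value.

    Assumes highest_upper = lowest_upper + k * width for some whole number k.
    """
    if value <= lowest_upper:
        return lowest_upper          # first bin: (-inf, lowest_upper]
    if value > highest_upper:
        return math.inf              # overflow bin: (highest_upper, +inf)
    steps = math.ceil((value - lowest_upper) / width)
    return lowest_upper + steps * width

# Hypothetical weekly visit counts (not the course data)
visits = [0, 2, 3, 5, 5, 7, 8, 11, 12, 14, 19, 23, 31, 48]

# Equal-width, non-overlapping bins with round limits 5, 10, ..., 30
counts = Counter(bin_upper_limit(v, 5, 5, 30) for v in visits)

for upper in sorted(counts):
    print(upper, counts[upper])
```

Every observation lands in exactly one bin, so the bin counts add up to the sample size.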
webhits_stats.xlsx
Naming Cell Ranges in Excel
• The range A2:A144 has been named WebHits.
1) Select the data range
- Click on cell A2
- Press CTRL+SHIFT+↓ (down arrow) simultaneously.
Plotting the histogram
• To change the bin size, right-click the horizontal axis.
• Formatting the horizontal axis – adjusting the bins
default
Change bin width
Improving the quality of the histogram
• Informative, clean and clear (easy to read) chart
– No horizontal gridlines → why?
– Small gap between each rectangle pair → why?
– Has a title for the chart.
– Has titles for both the x- and y-axes.
– Has no legend (as we have a single series).
• Edward Tufte’s book «The Visual Display of Quantitative Information»
– Data/Ink ratio should be maximized…
Placing the chart on a separate chart sheet
1. While you are on your chart, right-click your mouse and select
Move Chart from the drop-down menu.
2. Select new sheet and type a name (Histogram) for the separate
chart sheet.
In-Class Exercise 1-1
• Plotting a histogram – Move it to (i.e. save it as) a separate chart sheet
Unfortunately, we cannot adjust the «major units» on the vertical axis to be integers
Choosing a Good Number of Bins
Fishermen ID   Kg of Fish Caught
1              5.83
2              6.17
3              4.81
4              9.77
5              9.17
6              10.13
7              11.48
8              5.37
9              6.88
10             9.23
11             9.57
12             10.36
13             4.67
14             9.41
15             3.72
…
35             4.89
36             5.21
37             4.38
[Figure: Histogram 1 and Histogram 2 of the same data, drawn with different bin widths. x-axis: Bins (Upper Limits) for Fish Caught (Kg); y-axis: Nbr of Fishermen]
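To see how the bin width changes the picture, the sketch below counts the same catches with bins of width 3 and of width 1. It uses only the 18 rows visible on the slide (fishermen 16 to 34 are not shown), so the counts will not match the slide's histograms; the helper name `bin_counts` is an assumption made for this example.

```python
import math
from collections import Counter

# Only the rows visible on the slide (fishermen 1-15 and 35-37)
kg_caught = [5.83, 6.17, 4.81, 9.77, 9.17, 10.13, 11.48, 5.37, 6.88,
             9.23, 9.57, 10.36, 4.67, 9.41, 3.72, 4.89, 5.21, 4.38]

def bin_counts(values, width):
    """Count values per bin, keyed by the bin's upper limit (lower, upper]."""
    return Counter(math.ceil(v / width) * width for v in values)

wide = bin_counts(kg_caught, 3)     # few wide bins
narrow = bin_counts(kg_caught, 1)   # many narrow bins

print("width 3:", sorted(wide.items()))
print("width 1:", sorted(narrow.items()))
```

With width 3 the two groups of catches (around 5 kg and around 10 kg) show up as two tall bars; with width 1 the same data are spread over more, shorter bars.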
Analyzing the Histogram
• Shape:
– How many peaks (none, 1, 2, …)?
• Peak of the distribution is called its mode.
• If a distribution has two peaks, it is called a bi-modal distribution.
– Symmetric or skewed (positive or negative)?
• Right skewed: tail extends to the right (i.e. has positive skewness)
• Left skewed: tail extends to the left (i.e. has negative skewness)
shapes are important
Right-skewed distribution
Left-skewed distribution
Descriptive statistics for the distribution of a
random variable
• Also called “summary measures/statistics”
• Purpose:
– Quantifying
• central tendency and variation of data,
• the degree of linear relationship between two data series.
– Detecting outliers
Some Terminology
• Let’s say I am interested in the average height of the
students taking this course
– I may measure the height of 20 students who are taking this
course
• These 20 students are my sample.
• The entire class could be my population.
• The height of each student in this sample is one observation.
– Can you think of a different population that can be represented
by the same sample?
• A population is the collection of all items of interest in a
particular study.
• A sample is a portion of the population selected to
represent the whole population.
• An observation is one item in the sample
• Inferential statistics is about using data collected in the
sample to make inferences about the characteristics of
the population.
Descriptive Statistics
Summary Measures
• Let’s assume in a customer satisfaction survey
– Likert scale of 5:
1: Completely disagree,
3: Neither agree nor disagree
5: Completely agree
– The sample size is 7
• Sample observations are:
1, 3, 3, 4, 4, 4, 5
Summary Measures –
Measures of Central Tendency
• Mean (average): denoted x̄ ("x bar") for a sample
• Median: half of the data is below the median
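As a quick check of these two measures, the sketch below computes the average and the median of the seven survey responses given earlier (1, 3, 3, 4, 4, 4, 5) with Python's standard `statistics` module. The variable names are chosen for this example only.

```python
from statistics import mean, median

# Customer satisfaction sample from the survey example (n = 7)
ratings = [1, 3, 3, 4, 4, 4, 5]

avg = mean(ratings)    # sum of observations / number of observations = 24 / 7
mid = median(ratings)  # middle value of the sorted data (the 4th of 7)

print(f"mean   = {avg:.2f}")
print(f"median = {mid}")
```

The single low response (1) pulls the mean below the median.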
webhits_stats.xlsx
• Let’s calculate the average and median for page hits.
• Which characteristic of the data makes the average larger than the
median?
• Is Turkey’s average income or median income per person larger?
• How about in other countries?
• Symmetric distributions have mean ≅ median
• Right-skewed (positively skewed) distributions have mean >> median
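A small sketch of this mean-versus-median behaviour, using made-up numbers (both data sets are hypothetical and are not the web hits or any real income data):

```python
from statistics import mean, median

# Hypothetical data: a symmetric set and a right-skewed set with one very large value
symmetric = [10, 20, 30, 40, 50]
right_skewed = [20, 22, 25, 27, 30, 32, 35, 40, 60, 250]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name:13s} mean = {mean(data):6.1f}   median = {median(data):6.1f}")
```

In the right-skewed set the long right tail pulls the mean (54.1) well above the median (31), while the symmetric set has mean = median = 30.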
Summary Measures – Measures of
Spread
• The simplest measure is the range of the data.
• Range = Max – Min
• When you have only a few data points
(< 10), the range can be informative enough, otherwise …
• In statistical process control (SPC), used mostly in
factories, the range is the usual measure of variability of
the collected data. Sample sizes of 5 to 10 are very common.
• Interpreting variance
• Recall the customer satisfaction survey data
1, 3, 3, 4, 4, 5 (sample size 6, mean ≈ 3.33)
• Greek letters denote population measures.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$        $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• s² denotes the sample variance. Excel formula VAR.S(data_range)
• σ² denotes the population variance. Excel formula VAR.P(data_range)
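The two variance formulas can be worked through on the six observations 1, 3, 3, 4, 4, 5 noted above. The sketch computes the sum of squared deviations by hand and cross-checks the results against Python's `statistics.variance` (divides by n − 1, like VAR.S) and `statistics.pvariance` (divides by N, like VAR.P). Variable names are chosen for this example.

```python
from statistics import variance, pvariance

data = [1, 3, 3, 4, 4, 5]
n = len(data)

x_bar = sum(data) / n                              # mean = 20 / 6
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)   # sum of squared deviations

s2 = sum_sq_dev / (n - 1)    # sample variance (divide by n - 1)
sigma2 = sum_sq_dev / n      # population variance (divide by N)

print(f"mean = {x_bar:.2f}, s^2 = {s2:.4f}, sigma^2 = {sigma2:.4f}")
print(variance(data), pvariance(data))   # library cross-check
```

The sample variance is always the larger of the two because it divides the same sum by n − 1 instead of n.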
• Standard deviation:
– Defined as the square root of the variance.
– Useful: it has the same unit as the mean
• Ex: If observations are in kilograms, both mean and standard
deviation are in kilograms (whereas the variance is in kilogram
squared)
• s denotes the sample std deviation. Excel formula STDEV.S(data_range)
• σ denotes the population std deviation. Excel formula STDEV.P(data_range)
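A minimal sketch of the relationship between variance and standard deviation, again on the observations 1, 3, 3, 4, 4, 5. `statistics.stdev` and `statistics.pstdev` are the sample and population versions (comparable to STDEV.S and STDEV.P); the variable names are assumptions for this example.

```python
import math
from statistics import variance, pvariance, stdev, pstdev

data = [1, 3, 3, 4, 4, 5]

s = math.sqrt(variance(data))       # sample std deviation = sqrt(s^2)
sigma = math.sqrt(pvariance(data))  # population std deviation = sqrt(sigma^2)

print(f"s = {s:.3f}, sigma = {sigma:.3f}")
print(stdev(data), pstdev(data))    # library functions give the same values
```

Both results are in the same unit as the observations, which is what makes the standard deviation easier to interpret than the variance.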
Comparing Standard Deviations
Six Sigma for quality: the aim is to reduce the standard deviation.
Frequency distribution 1:
Smaller standard deviation
Same center,
different variation
Measures of Spread – Coefficient of
Variation
                 Corner    Leg
average          4.025     76.110
std dev          0.450     0.479
min              3.54      75.32
max              5.3       77.22
(max-min)/min    50%       3%
CV               0.112     0.006
• Coefficient of variation
$CV = \frac{s_x}{\bar{X}}$
– Useful because it is unitless
– Generally, CV > 1 suggests quite high spread in the data (a large deviation)
– In modern manufacturing, CV = 0.00000
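The CV row of the table can be reproduced from the average and standard deviation rows. A minimal sketch, where the function name `coefficient_of_variation` is chosen for this example:

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation / mean (unitless)."""
    return std_dev / mean

# Averages and standard deviations from the Corner / Leg table
cv_corner = coefficient_of_variation(0.450, 4.025)
cv_leg = coefficient_of_variation(0.479, 76.110)

print(f"CV corner = {cv_corner:.3f}")
print(f"CV leg    = {cv_leg:.3f}")
```

The two standard deviations are almost equal (0.450 vs 0.479), but relative to their means the corner measurement varies far more than the leg measurement.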
webhits_stats.xlsx
Example
Let’s calculate the coefficient of variation for the page hits.
The histogram of page hits shows high variability → consistent with
the CV we calculated
Percentiles and Quartiles
• 10th percentile of a data set is the number below which there exists
10% of the data.
– Excel function: =PERCENTILE.INC(data, 0.10)
– If Ali’s height of 160 cm is the 10th percentile of Turkish adult male heights,
that means 10% of Turkish adult males are shorter than 160 cm.
• 1st quartile is the 25th percentile
– also called lower quartile, denoted Q1
– Excel function: =QUARTILE.INC(data, 1) (this is the one we mostly use)
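For intuition, here is a sketch of an "inclusive" percentile calculation using linear interpolation between sorted observations, which is, to my understanding, the approach behind PERCENTILE.INC and QUARTILE.INC. The function name `percentile_inc` is made up for this example, and the data are the seven survey responses used earlier; the result is cross-checked against `statistics.quantiles(..., method="inclusive")`.

```python
import math
from statistics import quantiles

def percentile_inc(data, p):
    """Percentile for 0 <= p <= 1, interpolating linearly between sorted values."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)          # fractional position in the sorted list
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

ratings = [1, 3, 3, 4, 4, 4, 5]

p10 = percentile_inc(ratings, 0.10)   # 10th percentile
q1 = percentile_inc(ratings, 0.25)    # 1st quartile
q3 = percentile_inc(ratings, 0.75)    # 3rd quartile

# Cross-check the quartiles with the standard library
q1_lib, _, q3_lib = quantiles(ratings, n=4, method="inclusive")
print(p10, q1, q3, q1_lib, q3_lib)
```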
Outliers
• Interquartile range (IQR = Q3 − Q1): a narrow measure of spread
Identifying Outliers
• By convention outliers are grouped into two:
– Extreme (or serious) outliers
• Outer fences: 3×IQR below Q1 and above Q3, i.e. Q1 − 3×IQR and Q3 + 3×IQR
• Extreme outlier lies outside the outer fences
– Mild (or potential) outliers
• Inner fences: 1.5×IQR below Q1 and above Q3, i.e. Q1 − 1.5×IQR and Q3 + 1.5×IQR
• Mild outlier lies outside the inner fences and inside
the outer fences.
– Tukey (a famous statistician) came up with
these fences
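The fence rules can be sketched as follows. The data are hypothetical (not the web hits file), the function name `classify_outliers` is made up for this example, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is assumed here to correspond to QUARTILE.INC.

```python
from statistics import quantiles

def classify_outliers(data):
    """Return inner fences, outer fences, mild outliers and extreme outliers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Mild: outside the inner fences but not outside the outer fences
    mild = [x for x in data
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Extreme: outside the outer fences
    extreme = [x for x in data if x < outer[0] or x > outer[1]]
    return inner, outer, mild, extreme

# Hypothetical sample of 9 observations
hits = [1, 3, 5, 8, 10, 14, 19, 45, 70]
inner, outer, mild, extreme = classify_outliers(hits)

print("inner fences:", inner)
print("outer fences:", outer)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Here Q1 = 5 and Q3 = 19, so IQR = 14; the value 45 falls between the upper inner fence (40) and the upper outer fence (61), and 70 lies beyond the outer fence.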
webhits_stats.xlsx
Outliers
Let’s see whether there are mild and extreme outliers in the data.
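[Figure: histogram "Web Hits in Week 9". x-axis: Bin Upper Limits (5 to 70); y-axis: Nbr of Members. Values marked below the axis: -37, -16, 0, 5, 19, 40, 61, consistent with Q1 = 5 and Q3 = 19 (IQR = 14), inner fences at -16 and 40, and outer fences at -37 and 61.]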
webhits_stats.xlsx
Box-and-whisker plot
[Figure: box-and-whisker plot with Q3, average, median, and Q1 labeled]