Descriptive Statistics Slides PDF

Chapter 2
Data Analytics
Section 1
INTRODUCTION TO DATA ANALYTICS

Chapter Overview
• This chapter will focus on three main topics
• Describing data
• Probability
• Using data and probability to make conclusions
Chapter Objectives
• Understand and apply the tools of data analysis
• Understand and apply the tools of probability
• Learn to use the R programming language to
apply the tools of data analysis and probability to
make decisions
Data Analytics
• What is business analytics?
• Descriptive analytics: the interpretation of historical data to better
understand changes that have occurred in a business.
• Predictive analytics: the use of data, statistical algorithms and machine
learning techniques to identify the likelihood of future outcomes based on
historical data.
• Prescriptive analytics: the use of machine learning to help businesses decide a
course of action based on a computer program's predictions.
• What is statistics?
• The study of data and its relationship with probability
• How does statistics relate to data analytics?

Statistics
• Why is statistics important?
• If we learn about data, we can use it to make smarter, more
profitable, business decisions
• Why is statistics practical?

• Big data
• Computational resources
Example (1)
• Walmart processes more than 1 million sales transactions
per hour
• This is BIG data
• Can Walmart use this data to understand the relationship
between sales and how nice a store is?
• Would it be profitable to renovate an old store to attract more
customers?
• Walmart can use the tools of business analytics and statistics to
answer these questions!
Example (2)
• A company wants to bring a new product to market
• It has been determined that it will only be profitable if there is a
30% acceptance rate over the old product
• The company surveys 200 customers; 32% of those
surveyed prefer the new product
• Should the company release the new product?
• What if a different set of 200 customers had been surveyed?
• Would there still be a greater than 30% acceptance rate?
• What will the whole market’s acceptance rate be?
Example (2) cont.
• We therefore need to use the tools of probability to make a
decision
• With a 32% acceptance rate from 200 customers we can only be
75% confident that the whole market’s acceptance rate will be
greater than 30%
• What if the company wants to be 95% confident that the whole
market’s acceptance rate will be greater than 30%?
• The acceptance rate from the 200 customers would need to be
35.5%
• Alternatively, the company could survey more customers
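A rough way to see where confidence figures like these can come from is a normal approximation to the sample proportion. The slides do not say which method produced the 75% and 35.5% values, so the sketch below is only an assumption; Python's standard library stands in for R here, and the function name `confidence_rate_exceeds` is hypothetical. Its results land close to, but not exactly on, the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def confidence_rate_exceeds(p_hat, n, threshold):
    """Approximate confidence that the true rate exceeds `threshold`.

    Assumes a normal approximation to the sample proportion, which is
    only one of several possible methods.
    """
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - threshold) / standard_error
    return NormalDist().cdf(z)


# Survey result from the example: 32% of 200 customers.
conf_32 = confidence_rate_exceeds(0.32, 200, 0.30)
# Sample rate the slides say is needed for 95% confidence.
conf_355 = confidence_rate_exceeds(0.355, 200, 0.30)
# Same 32% result from a larger survey (1,000 is an illustrative size).
conf_larger = confidence_rate_exceeds(0.32, 1000, 0.30)

print(round(conf_32, 2), round(conf_355, 2), round(conf_larger, 2))
```

Under this approximation, the same 32% result gives more confidence when it comes from a larger survey, which is why surveying more customers is offered as an alternative.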
Statistics
• A statistic is a number derived from data
• In the previous example the acceptance rate of 32% is a
statistic
• We surveyed several people and recorded which product they
liked better – data
• The subject of statistics is the study of these numbers

Statistics
• There are two groups of statistics
• Population statistics
• Sample statistics
• A population statistic is a number derived from the entire world
of interest
• What is the acceptance rate among ALL customers?
• A sample statistic is derived from a sub-group of the world of
interest.
• We only surveyed 200 customers
Statistics
• It is usually impossible to measure population statistics
• There are way too many people in India to measure the true
percentage of the population that is vegetarian
• We often use sample statistics to approximate population
statistics
• We can survey 10,000 people and ask if they are vegetarian or
not
• When we have bigger sample sizes the approximation
becomes better!
• Studying this approximation will be a large portion of this course.
Dan Mitchell
Section 2
MEASURES OF CENTRALITY
Outline
• Data
• Mean
• Median
• Mode
Data
• One of the goals of business statistics is to transform raw data
into actionable information
• What is raw data?
• A list of values that correspond to physical events
• Roll a die 100 times and record which number it lands on
• Daily sales of milk at your local store
• Thickness of tire wall manufactured at a factory
• Who a group of people voted for in an election
• It can be very difficult to understand a large list of numbers
• We will learn to summarize data
Mean
• We would first like to measure the ‘center’ of the data

• There are several ways we can measure the center
• One common measure of center is the mean
– The mean is defined as the arithmetic average of the
data.
– Suppose we have n data points $X_1, X_2, \ldots, X_n$
– The mean is defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Mean
• Example:
• Over the last 5 days a store sold 565, 570, 572, 568, 585
units of milk
• The average units of milk sold over the last 5 days is 572
Mean
• Suppose the store closed early on the first day and only sold 100
units of milk
• (100 + 570 + 572 + 568 + 585)/5 = 479
• The first day would be considered an outlier
• Outliers can drastically affect the mean!
• This could make us think that the mean is a bad measure of
centrality
• Sometimes it’s good to include outliers in our computations
• Sometimes it isn’t
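The two milk-sales means above can be reproduced in a few lines. This is a minimal sketch using Python's standard library rather than the R used in the course; the variable names are illustrative.

```python
from statistics import mean

milk_sales = [565, 570, 572, 568, 585]   # five normal days
early_close = [100, 570, 572, 568, 585]  # first day replaced by the outlier

mean_regular = mean(milk_sales)
mean_early = mean(early_close)

print(mean_regular)  # 572
print(mean_early)    # 479
```

A single unusual day moves the mean by 93 units, even though four of the five values are unchanged.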
Median
• Another common measure of centrality is called the median
• The median is the mid-point of the data
• The number such that half of the data points are above it and
half the data are below it
• We must be careful about whether there is an odd or even
number of data points
• To find the median sort the data and find the middle
Median
• Example: Let’s find the median of the milk sale data
• First sort the data
• 565, 570, 572, 568, 585
• 565, 568, 570, 572, 585
• The median is 570
• If we have one more day of data
• 565, 570, 572, 568, 585, 580
• 565, 568, 570, 572, 580, 585
• The median is somewhere between 570 and 572
• The convention is to define the median as the number halfway
between them (571)
Median
• Suppose the store closed early on the first day and only sold 100
units of milk
• 100, 570, 572, 568, 585
• 100, 568, 570, 572, 585
• The median is still 570!
• Even though there is an outlier in the data, the median stays
the same
• In general, the median is far less affected by outliers than the mean
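The three median cases from the milk example (odd count, even count, and the early-closing outlier) can be checked as follows. This is a sketch in Python's standard library, assumed here in place of R; `statistics.median` sorts the data and averages the two middle values when the count is even.

```python
from statistics import median

five_days = [565, 570, 572, 568, 585]
six_days = five_days + [580]             # one more day of data
early_close = [100, 570, 572, 568, 585]  # first day is an outlier

median_five = median(five_days)      # middle value of the sorted data
median_six = median(six_days)        # halfway between 570 and 572
median_early = median(early_close)   # still the middle value

print(median_five, median_six, median_early)
```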
Skewness
• By comparing the median to the mean we can get a good idea
of the asymmetry of the data
• If the median and the mean are close, this suggests the data is roughly symmetric (symmetric data always has equal mean and median)
• Suppose the mean is much lower than the median
• Then the data points less than the median tend to be further
from the median than the data points above the median
• Then the data is said to be skewed to the left
• The opposite is true if the mean is much higher than the
median
Skewness
• Let’s look at the milk data when the store closed early
• 100, 570, 572, 568, 585
• The mean is 479
• The median is 570
• The 100 sales day skewed the data to the left
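A quick check of the direction of skew is the sign of (mean − median). The helper name `mean_minus_median` below is hypothetical, and the sketch uses Python's standard library as a stand-in for R; it is a rough diagnostic, not a formal skewness statistic.

```python
from statistics import mean, median


def mean_minus_median(data):
    """Negative suggests left skew, positive suggests right skew."""
    return mean(data) - median(data)


early_close = [100, 570, 572, 568, 585]
gap = mean_minus_median(early_close)

print(gap)  # 479 - 570 = -91, so the data is skewed to the left
```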
Mode
• The last common measure of centrality we
will discuss is called the mode
• The mode is the most commonly occurring
value in the data
Mode
• A shoe store recorded the size of shoes sold to the last 12
customers
• 7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7
• Size 7 sold 2 times
• Size 11 sold 1 time
• The most common size sold is 10
• The mode of the data is 10
Mode
• Suppose a basketball player comes into the store to buy a pair
of size 16 shoes
• 7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7, 16
• The mode is still 10
• Mode is also not affected by outliers
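The shoe-size counts and the mode can be tallied directly. This sketch assumes Python's standard library (`collections.Counter` and `statistics.mode`) in place of R; the variable names are illustrative.

```python
from collections import Counter
from statistics import mode

shoe_sizes = [7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7]

size_counts = Counter(shoe_sizes)        # how many times each size sold
print(size_counts[7], size_counts[11])   # 2 1

mode_before = mode(shoe_sizes)

# Add the basketball player's size 16 purchase.
with_size_16 = shoe_sizes + [16]
mode_after = mode(with_size_16)

print(mode_before, mode_after)  # 10 10
```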
Centrality
• Mean, median and mode are all statistics
• Which statistic should you use when describing the
center of a data set?
• This depends on the data
• It is important to understand all three and how they
compare to each other
• Of course there are other measures of centrality, but these
are the 3 most commonly used statistics
Section 3
MEASURES OF DISPERSION
Outline
• Range
• Interquartile Range
• 5-Number Summary
• Variance
• Standard Deviation
Dispersion
• Data is not completely described by its centrality
• We also want to understand how spread out the data is
around the center
• If we are manufacturing bricks we want the
weight of all the bricks to be as close to the
mean as possible
Range
• The simplest measure of dispersion is the range
• Maximum value minus minimum value
• Let’s go back to the milk sales example

• 565, 570, 572, 568, 585
• The largest sales value is 585
• The smallest sales value is 565
• The range is 585 – 565 = 20
Range
• Consider the data where the store closed early on the first day
• 100, 570, 572, 568, 585
• The range is now 585 – 100 = 485
• The range is VERY sensitive to outliers
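The range calculation is a one-liner. The sketch below uses Python as a stand-in for R, and `value_range` is a hypothetical helper name.

```python
def value_range(data):
    """Maximum value minus minimum value."""
    return max(data) - min(data)


milk_sales = [565, 570, 572, 568, 585]
early_close = [100, 570, 572, 568, 585]

range_regular = value_range(milk_sales)
range_early = value_range(early_close)

print(range_regular, range_early)  # 20 485
```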

Interquartile Range
• To address this sensitivity, another measure of
dispersion, called interquartile range (IQR), is commonly
used
• IQR is the range of the middle 50% of the data
• To be specific we need to define 2 more quantities
• The first quartile, Q1, is the number such that 25% of the
data is below it and 75% of the data is above it
• The third quartile, Q3, is the number such that 75% of the
data is below it and 25% of the data is above it
• IQR = Q3 – Q1
Interquartile Range
• I don’t want to get into the specifics of how Q1 and Q3 are
calculated based on odd/even number of data points
• We will calculate it using R
• Different software packages use different
conventions
• For large data sets the difference in convention is
insignificant
• IQR is insensitive to outliers
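The point about differing conventions can be seen on the five-day milk data. The sketch below uses Python's `statistics.quantiles`, which offers two interpolation methods (`"exclusive"` and `"inclusive"`); these are Python's conventions, assumed here for illustration, and R's `quantile` function has its own set of options.

```python
from statistics import quantiles

milk_sales = [565, 570, 572, 568, 585]


def iqr(data, method):
    """Interquartile range Q3 - Q1 under the given quantile convention."""
    q1, _median, q3 = quantiles(data, n=4, method=method)
    return q3 - q1


iqr_exclusive = iqr(milk_sales, "exclusive")
iqr_inclusive = iqr(milk_sales, "inclusive")

print(iqr_exclusive, iqr_inclusive)
```

With only five data points the two conventions disagree noticeably; the gap typically shrinks as the data set grows.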
5-Number Summary
• With these statistics in mind we define the 5-number summary
as the list
• (minimum, Q1, median, Q3, maximum)
• The 5-number summary reduces a large dataset to just 5
numbers that do a good job describing the data
• You can compare (median – Q1) to (Q3 – median) to help
understand asymmetry
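As an illustration, the 5-number summary of the twelve shoe sizes from the mode example can be computed as below. This is a sketch in Python's standard library rather than R; `five_number_summary` is a hypothetical helper, and the quartiles follow Python's default ("exclusive") convention, so other software may report slightly different values.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(data, n=4)  # default 'exclusive' convention
    return (min(data), q1, med, q3, max(data))


shoe_sizes = [7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7]

summary = five_number_summary(shoe_sizes)
minimum, q1, med, q3, maximum = summary

# Compare the two halves of the middle 50% to gauge asymmetry.
lower_spread = med - q1
upper_spread = q3 - med

print(summary)
print(lower_spread, upper_spread)
```

Here the two spreads are equal, which points to a roughly symmetric middle of the data.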
Outliers
• The 5-number summary is also commonly used to identify data
points that are outliers
• Any data point less than Q1 – 1.5*IQR can be considered
a left outlier
• Any data point greater than Q3 + 1.5*IQR can be considered
a right outlier
• Remember: IQR = Q3 – Q1
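The 1.5*IQR rule can be applied to the shoe-size data once the size 16 purchase is included. The sketch below is in Python (assumed in place of R), `iqr_outliers` is a hypothetical helper, and the quartiles use Python's default convention.

```python
from statistics import quantiles


def iqr_outliers(data):
    """Return (lower_cutoff, upper_cutoff, left_outliers, right_outliers)."""
    q1, _median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    left = [x for x in data if x < lower_cutoff]
    right = [x for x in data if x > upper_cutoff]
    return lower_cutoff, upper_cutoff, left, right


shoe_sizes = [7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7, 16]
lower_cutoff, upper_cutoff, left, right = iqr_outliers(shoe_sizes)

print(lower_cutoff, upper_cutoff)  # 5.0 13.0
print(left, right)                 # [] [16]
```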
Standard Deviation
• By far, the most commonly used measure of dispersion is
standard deviation
• Standard deviation roughly measures the average distance
from the mean of all the data points
• With this we have a good understanding of how spread out
the data is around the mean
• Standard deviation does not capture asymmetry like the
5-number summary does
Standard Deviation
• To understand standard deviation we will first define variance, and then use that to
get standard deviation
• $Var(X) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
• Remember that $\bar{X}$ is the mean of the data
• Variance is the average squared distance between each data point and the mean.
• Standard deviation is defined as the square root of the variance
• $Sd(X) = \sqrt{Var(X)} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$
Standard Deviation
• Variance measures the average squared distance from the mean
– Why $n - 1$ instead of $n$?
• By taking the square root of this we get something like the average distance
from the mean
– $\sqrt{a+b} \neq \sqrt{a} + \sqrt{b}$
• If we want the average distance, why not calculate
– $\frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$
– This is called mean absolute deviation
• We will see that standard deviation is mathematically much more convenient
when we get to probability.
• They are very similar.
Standard Deviation
• Let’s go back to the milk sales example
• 565, 570, 572, 568, 585
• The mean is 572
• To calculate variance we first subtract the mean from each data point
and square the difference
• $(-7)^2, (-2)^2, 0^2, (-4)^2, 13^2$
• Then we add all these up and divide by 5 – 1 = 4
• $Var(X) = \frac{1}{4}(49 + 4 + 0 + 16 + 169) = 59.5$
• The standard deviation is the square root of this
– $Sd(X) = \sqrt{59.5} \approx 7.71$
• Mean absolute deviation is 5.2
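The worked numbers above can be verified step by step. This sketch uses Python's standard library as a stand-in for R; `statistics.variance` and `statistics.stdev` both divide by n − 1, matching the definition used here, while the mean absolute deviation is computed by hand with a divisor of n.

```python
from statistics import mean, stdev, variance

milk_sales = [565, 570, 572, 568, 585]
n = len(milk_sales)
x_bar = mean(milk_sales)

# Squared distances from the mean: 49, 4, 0, 16, 169
squared_diffs = [(x - x_bar) ** 2 for x in milk_sales]
var_by_hand = sum(squared_diffs) / (n - 1)

var_library = variance(milk_sales)  # sample variance, divides by n - 1
sd_library = stdev(milk_sales)      # square root of the sample variance

# Mean absolute deviation: average absolute distance from the mean
mean_abs_dev = sum(abs(x - x_bar) for x in milk_sales) / n

print(var_by_hand, var_library)   # 59.5 59.5
print(round(sd_library, 2))       # 7.71
print(mean_abs_dev)               # 5.2
```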
Section 4
VISUALIZING DATA
Outline
• Histogram
• Box Plot
Histogram
• A histogram is a graph that plots the relative
frequency of data
• 10.61 12.18 11.73 11.28 4.43 9.83 10.13 9.48 9.27
5.95 8.74 9.96 10.73 13.26 10.95 11.13 13.42 9.13
8.77 6.77 9.53 7.09 9.48 11.23 8.35 12.26 6.47
12.45 10.82 2.50 14.26 11.56 10.76 10.98 7.52
10.38 10.35 15.93 8.40 10.57 7.88 12.43 8.95
10.10 12.10 11.13 7.18 10.77 11.54 8.03
• Almost no two data points are the same
• Let’s group the data into a few bins
Histogram
• There is 1 number less than 4
• There are 2 numbers between 4 and 6
• …
• To make a histogram we group data together like this and then
make a bar chart
• The width of the bars represents the range of the groups
• The heights of the bars represent how many data points are
in that group
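The grouping step can be sketched as follows, with a text bar chart standing in for the plot. This is Python rather than R; the bin width of 2 is an assumption chosen to match the "between 4 and 6" group, `bin_counts` is a hypothetical helper, and the fused token "15.938.40" in the data listing is read as the two values 15.93 and 8.40.

```python
from collections import Counter

# Data from the slide; "15.938.40" is read as 15.93 and 8.40.
data = [
    10.61, 12.18, 11.73, 11.28, 4.43, 9.83, 10.13, 9.48, 9.27,
    5.95, 8.74, 9.96, 10.73, 13.26, 10.95, 11.13, 13.42, 9.13,
    8.77, 6.77, 9.53, 7.09, 9.48, 11.23, 8.35, 12.26, 6.47,
    12.45, 10.82, 2.50, 14.26, 11.56, 10.76, 10.98, 7.52,
    10.38, 10.35, 15.93, 8.40, 10.57, 7.88, 12.43, 8.95,
    10.10, 12.10, 11.13, 7.18, 10.77, 11.54, 8.03,
]

BIN_WIDTH = 2  # assumed bin width


def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


counts = bin_counts(data, BIN_WIDTH)

# Text histogram: one '#' per data point in the bin.
for start, count in counts.items():
    print(f"{start:>2} to {start + BIN_WIDTH:<2} {'#' * count}")
```

Choosing a different bin width or starting point changes the counts and therefore the shape of the chart.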
Histogram
The mode is sometimes approximated as the midpoint of the tallest bar

Histogram
• A histogram of a data set is not unique

Histogram
• The histogram is a great way to visualize your
data
• Sometimes the histogram has the form with high
bars in the middle and bars of decreasing height
as you go away from the middle
• This is called the bell curve
• Not all data follows a bell curve
Box Plot
• A box plot is a more detailed visualization of the 5-number
summary
• (min, Q1, median, Q3, max)
• The box plot plots (Q1, median, Q3) as a box
• Recall that numbers less than Q1-1.5*IQR or greater than
Q3+1.5*IQR are considered outliers
• Then it has whiskers that correspond to either max/min or
the outlier range
• All points considered outliers are plotted individually
Box Plot
• We can also see skewness in boxplots

Outliers
• Should you throw away outliers?
• Probably not…
• Unless you know the reason a data point is an outlier, and
that reason is not relevant to your analysis
• If you collect enough data on anything you’ll almost always find
some data points that look like outliers
• Sometimes extreme events happen
• We need to take those extremes into consideration when we do
analysis
• But it’s still good to identify potential outliers so we can be careful
about our analysis.
