
Chapter 2

Data Analytics
Section 1

INTRODUCTION TO DATA ANALYTICS


Chapter Overview
• This chapter will focus on three main topics
• Describing data
• Probability
• Using data and probability to make conclusions
Chapter Objectives

• Understand and apply the tools of data analysis

• Understand and apply the tools of probability

• Learn to use the R programming language to
apply the tools of data analysis and probability to
make decisions
Data Analytics
• What is business analytics?
• Descriptive analytics: the interpretation of historical data to better
understand changes that have occurred in a business.
• Predictive analytics: the use of data, statistical algorithms and machine
learning techniques to identify the likelihood of future outcomes based on
historical data.
• Prescriptive analytics: the use of machine learning to help businesses decide on a
course of action based on a computer program's predictions.

• What is statistics?
• The study of data and its relationship with probability

• How does statistics relate to data analytics?


Statistics
• Why is statistics important?
• If we learn about data, we can use it to make smarter, more
profitable, business decisions

• Why is statistics practical?


• Big data
• Computational resources
Example (1)
• Walmart processes more than 1 million sales transactions
per hour
• This is BIG data
• Can Walmart use this data to understand the relationship
between sales and how nice a store is?
• Would it be profitable to renovate an old store to attract more
customers?
• Walmart can use the tools of business analytics and statistics to
answer these questions!
Example (2)
• A company wants to bring a new product to market
• It has been determined that it will only be profitable if there is a
30% acceptance rate over the old product
• The company surveys 200 customers, and 32% of those
surveyed prefer the new product
• Should the company release the new product?
• What if a different set of 200 customers had been surveyed?
• Would there still be a greater than 30% acceptance rate?
• What will the whole market’s acceptance rate be?
Example (2) cont.
• We therefore need to use the tools of probability to make a
decision
• With a 32% acceptance rate from 200 customers we can only be
75% confident that the whole market’s acceptance rate will be
greater than 30%
• What if the company wants to be 95% confident that the whole
market’s acceptance rate will be greater than 30%?
• The acceptance rate from the 200 customers would need to be
35.5%
• Alternatively, the company could survey more customers
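The slides do not say how the confidence figures were computed. One common way to get numbers of this kind is a normal approximation to the sample proportion, sketched below in R. The helper name `confidence_above` is hypothetical, and the method is an assumption rather than the course's stated calculation. Under this approximation the 32% survey result gives roughly 0.73, in the neighborhood of the 75% quoted above, while an observed rate of 35.5% gives roughly 0.95.

```r
# Sketch: normal approximation to the sampling distribution of a proportion.
# confidence_above() is a hypothetical helper, not a base R function.
confidence_above <- function(p_hat, n, p0) {
  se <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
  pnorm((p_hat - p0) / se)             # approximate confidence that the true rate exceeds p0
}

conf_32   <- confidence_above(0.32,  200, 0.30)   # the survey in the example
conf_355  <- confidence_above(0.355, 200, 0.30)   # a higher observed acceptance rate
conf_1000 <- confidence_above(0.32, 1000, 0.30)   # same observed rate, more customers

print(round(c(conf_32, conf_355, conf_1000), 2))
```

The third call illustrates the last bullet: keeping the observed rate at 32% but surveying 1,000 customers instead of 200 raises the approximate confidence.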
Statistics
• A statistic is a number derived from data

• In the previous example the acceptance rate of 32% is a
statistic
• We surveyed several people and recorded which product they
liked better – data

• The subject of statistics is the study of these numbers


Statistics
• There are two groups of statistics
• Population statistics
• Sample statistics
• A population statistic is a number derived from the entire world
of interest
• What is the acceptance rate among ALL customers?
• A sample statistic is derived from a sub-group of the world of
interest.
• We only surveyed 200 customers
Statistics
• It is usually impossible to measure population statistics
• There are way too many people in India to measure the true
percentage of the population that is vegetarian
• We often use sample statistics to approximate population
statistics
• We can survey 10,000 people and ask if they are vegetarian or
not
• When we have bigger sample sizes the approximation
becomes better!
• Studying this approximation will be a large portion of this course.
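The idea that larger samples approximate the population better can be sketched with a small simulation in R. The population rate of 0.30 below is a made-up value chosen only for illustration, not a real figure for any country.

```r
set.seed(1)        # fixed seed so the sketch is reproducible
true_rate <- 0.30  # hypothetical population proportion, for illustration only

# Draw samples of increasing size and record each sample proportion.
sizes <- c(100, 1000, 10000, 100000)
estimates <- sapply(sizes, function(n) mean(rbinom(n, size = 1, prob = true_rate)))

# Compare each sample statistic with the (normally unknown) population value.
print(data.frame(n = sizes, estimate = estimates, error = abs(estimates - true_rate)))
```

The errors tend to shrink as the sample size grows, although any single run need not shrink at every step.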
Dan Mitchell
Section 2
MEASURES OF CENTRALITY
Outline

• Data

• Mean

• Median

• Mode
Data
• One of the goals of business statistics is to transform raw data
into actionable information
• What is raw data?
• A list of values that correspond to physical events
• Roll a die 100 times and record which number it lands on
• Daily sales of milk at your local store
• Thickness of tire wall manufactured at a factory
• Who a group of people voted for in an election
• It can be very difficult to understand a large list of numbers
• We will learn to summarize data
Mean

• We would first like to measure the ‘center’ of the data


• There are several ways we can measure the center
• One common measure of center is the mean
– The mean is defined as the arithmetic average of the
data.
– Suppose we have n data points $X_1, X_2, \ldots, X_n$
– The mean is defined as $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Mean
• Example:
• Over the last 5 days a store sold 565, 570, 572, 568, 585
units of milk
• The average units of milk sold over the last 5 days is 572
Mean
• Suppose the store closed early on the first day and only sold 100
units of milk
• (100 + 570 + 572 + 568 + 585)/5 = 479
• The first day would be considered an outlier
• Outliers can drastically affect the mean!

• This could make us think that the mean is a bad measure of
centrality
• Sometimes it’s good to include outliers in our computations
• Sometimes it isn’t
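The two averages above can be checked in R with the built-in `mean()` function. The variable names are chosen here for illustration.

```r
milk <- c(565, 570, 572, 568, 585)        # daily milk sales from the example
milk_early <- c(100, 570, 572, 568, 585)  # first day replaced by the early-closing value

mean_milk <- mean(milk)              # 572
mean_milk_early <- mean(milk_early)  # 479: one outlier pulls the mean down by 93 units

print(c(mean_milk, mean_milk_early))
```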
Median
• Another common measure of centrality is called the median
• The median is the mid-point of the data
• The number such that half of the data points are above it and
half the data are below it
• We must be careful about whether there is an odd or even
number of data points

• To find the median, sort the data and find the middle
Median
• Example: Let’s find the median of the milk sale data
• First sort the data
• 565, 570, 572, 568, 585
• 565, 568, 570, 572, 585
• The median is 570
• If we have one more day of data
• 565, 570, 572, 568, 585, 580
• 565, 568, 570, 572, 580, 585
• The median is somewhere between 570 and 572
• The convention is to define the median as the number halfway
between (571)
Median
• Suppose the store closed early on the first day and only sold 100
units of milk
• 100, 570, 572, 568, 585
• 100, 568, 570, 572, 585
• The median is still 570!
• Even though there is an outlier in the data, the median stays
the same
• In general, the median is much less affected by outliers than the mean
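The three median calculations above can be reproduced with R's built-in `median()`, which sorts the data and averages the two middle values when the count is even. The variable names are illustrative.

```r
milk <- c(565, 570, 572, 568, 585)        # five days of sales
milk_six <- c(milk, 580)                  # one more day of data
milk_early <- c(100, 570, 572, 568, 585)  # store closed early on day one

median_five <- median(milk)         # 570: middle of the sorted values
median_six <- median(milk_six)      # 571: halfway between 570 and 572
median_early <- median(milk_early)  # 570: unchanged by the outlier

print(c(median_five, median_six, median_early))
```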
Skewness
• By comparing the median to the mean we can get a good idea
of the asymmetry of the data
• If the median and the mean are about equal, the data is likely close to symmetric
• Suppose the mean is much lower than the median
• Then the data points less than the median tend to be further
from the median than the data points above the median
• Then the data is said to be skewed to the left
• The opposite is true if the mean is much higher than the
median
Skewness
• Let’s look at the milk data when the store closed early
• 100, 570, 572, 568, 585
• The mean is 479
• The median is 570
• The 100 sales day skewed the data to the left
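As a rough sketch, the comparison of mean and median can be written in R as follows. Treating the sign of the gap as the direction of skew is the rule of thumb from these slides, not a formal skewness statistic.

```r
milk_early <- c(100, 570, 572, 568, 585)  # milk sales with the early-closing day

gap <- mean(milk_early) - median(milk_early)  # 479 - 570 = -91

# Rule of thumb: mean below median suggests left skew, mean above suggests right skew.
direction <- if (gap < 0) "left" else if (gap > 0) "right" else "none"

print(gap)
print(direction)
```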
Mode

• The last common measure of centrality we
will discuss is called the mode

• The mode is the most commonly occurring
value in the data
Mode
• A shoe store recorded the size of shoes sold to the last 12
customers
• 7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7
• Size 7 sold 2 times
• Size 8 sold 2 times
• Size 9 sold 3 times
• Size 10 sold 4 times
• Size 11 sold 1 time
• The most common size sold is 10
• The mode of the data is 10
Mode
• Suppose a basketball player comes into the store to buy a pair
of size 16 shoes
• 7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7, 16
• The mode is still 10
• Mode is also not affected by outliers
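Base R has no built-in function for the statistical mode (its `mode()` function reports an object's storage type instead), so the sketch below counts values with `table()`. The helper `sample_mode` is a hypothetical name, and it returns only the first value when several are tied.

```r
sizes <- c(7, 10, 9, 10, 9, 8, 11, 10, 8, 9, 10, 7)  # shoe sizes from the example

print(table(sizes))  # how many times each size was sold

# Hypothetical helper: the most frequently occurring value (first one if tied).
sample_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

mode_sizes <- sample_mode(sizes)           # 10
mode_with_16 <- sample_mode(c(sizes, 16))  # still 10 after the size 16 sale

print(c(mode_sizes, mode_with_16))
```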
Centrality
• Mean, median and mode are all statistics

• Which statistic should you use when describing the
center of a data set?
• This depends on the data
• It is important to understand all three and how they
compare to each other

• Of course there are other measures of centrality, but these
are the 3 most commonly used statistics
Section 3
MEASURES OF DISPERSION
Outline
• Range

• Interquartile Range

• 5-Number Summary

• Variance

• Standard Deviation
Dispersion

• Data is not completely described by its centrality

• We also want to understand how spread out,
around the center, the data is

• If we are manufacturing bricks we want the
weight of all the bricks to be as close to the
mean as possible
Range
• The simplest measure of dispersion is the range
• Maximum value minus minimum value

• Let’s go back to the milk sales example


• 565, 570, 572, 568, 585
• The largest sales value is 585
• The smallest sales value is 565
• The range is 585 – 565 = 20
Range
• Consider the data where the store closed early on the first day
• 100, 570, 572, 568, 585
• The range is now 585 – 100 = 485

• The range is VERY sensitive to outliers
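In R, `range()` returns the minimum and maximum, so the range as defined here is the difference between them. A minimal sketch with illustrative variable names:

```r
milk <- c(565, 570, 572, 568, 585)        # original milk sales
milk_early <- c(100, 570, 572, 568, 585)  # with the early-closing day

range_milk <- diff(range(milk))              # 585 - 565 = 20
range_milk_early <- diff(range(milk_early))  # 585 - 100 = 485

print(c(range_milk, range_milk_early))
```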


Interquartile Range
• To address this sensitivity, another measure of
dispersion, called interquartile range (IQR), is commonly
used
• IQR is the range of the middle 50% of the data
• To be specific we need to define 2 more quantities
• The first quartile, Q1, is the number such that 25% of the
data is below it and 75% of the data is above it
• The third quartile, Q3, is the number such that 75% of the
data is below it and 25% of the data is above it
• IQR = Q3 – Q1
Interquartile Range
• I don’t want to get into the specifics of how Q1 and Q3 are
calculated based on odd/even number of data points
• We will calculate it using R
• Different software packages use different
conventions
• For large data sets the difference in convention is
insignificant
• IQR is insensitive to outliers
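A short sketch of the quartiles and IQR in R, using the milk data. `quantile()` and `IQR()` use R's default convention (type 7), so other software may report slightly different quartiles on small data sets.

```r
milk <- c(565, 570, 572, 568, 585)        # original milk sales
milk_early <- c(100, 570, 572, 568, 585)  # with the early-closing day

quartiles <- quantile(milk, c(0.25, 0.75), names = FALSE)  # Q1 = 568, Q3 = 572

iqr_milk <- IQR(milk)              # 572 - 568 = 4
iqr_milk_early <- IQR(milk_early)  # also 4: the outlier does not move Q1 or Q3

print(quartiles)
print(c(iqr_milk, iqr_milk_early))
```

Compare this with the range, which jumps from 20 to 485 for the same two data sets.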
5-Number Summary
• With these statistics in mind we define the 5-number summary
as the list
• (minimum, Q1, median, Q3, maximum)
• The 5-number summary reduces a large dataset to just 5
numbers that do a good job describing the data
• You can compare (median – Q1) to (Q3 – median) to help
understand asymmetry
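The 5-number summary can be sketched in R with `fivenum()`. Note that `fivenum()` uses Tukey's hinges for the quartiles, which can differ slightly from `quantile()` on some data sets; for the milk data the two agree.

```r
milk <- c(565, 570, 572, 568, 585)  # milk sales from the earlier example

five <- fivenum(milk)  # (minimum, Q1, median, Q3, maximum)
print(five)

# Compare the two halves of the box to get a feel for asymmetry.
lower_spread <- five[3] - five[2]  # median - Q1
upper_spread <- five[4] - five[3]  # Q3 - median
print(c(lower_spread, upper_spread))
```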
Outliers
• The 5-number summary is also commonly used to identify data
points that are outliers
• Any data point less than Q1 – 1.5*IQR can be considered
a left outlier
• Any data point greater than Q3 + 1.5*IQR can be considered
a right outlier
• Remember: IQR = Q3 – Q1
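The 1.5*IQR rule can be sketched in R as follows, using the milk data with the early-closing day. The variable names are illustrative, and the quartiles follow R's default `quantile()` convention.

```r
milk_early <- c(100, 570, 572, 568, 585)  # milk sales with the early-closing day

q1 <- quantile(milk_early, 0.25, names = FALSE)  # 568
q3 <- quantile(milk_early, 0.75, names = FALSE)  # 572
iqr <- q3 - q1                                   # 4

lower_fence <- q1 - 1.5 * iqr  # 562
upper_fence <- q3 + 1.5 * iqr  # 578

left_outliers <- milk_early[milk_early < lower_fence]
right_outliers <- milk_early[milk_early > upper_fence]

print(left_outliers)
print(right_outliers)
```

With only five data points the IQR is very narrow, so the rule flags 585 on the right as well as 100 on the left. This is a reminder that the rule only identifies candidates for a closer look.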
Standard Deviation
• By far, the most commonly used measure of dispersion is
standard deviation

• Standard deviation roughly measures the average distance
from the mean of all the data points

• With this we have a good understanding of how spread out
the data is around the mean

• Standard deviation does not capture asymmetry like the
5-number summary does
Standard Deviation

• To understand standard deviation we will first define variance, and then use that to
get standard deviation
• $\mathrm{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

• Remember that $\bar{X}$ is the mean of the data

• Variance is roughly the average squared distance between each data point and the mean
(the sum is divided by $n - 1$ rather than $n$).
• Standard deviation is defined as the square root of the variance

• $\mathrm{Sd}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
Standard Deviation
• Variance measures the average squared distance from the mean
– Why $n - 1$ instead of $n$?
• By taking the square root of this we get something like the average distance
from the mean
– Only "something like", because $\sqrt{a + b} \neq \sqrt{a} + \sqrt{b}$
• If we want the average distance, why not calculate
– $\frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$
– This is called the mean absolute deviation
• We will see that standard deviation is mathematically much more convenient
when we get to probability.
• The two measures usually give similar values.
Standard Deviation
• Let’s go back to the milk sales example
• 565, 570, 572, 568, 585
• The mean is 572
• To calculate variance we first subtract the mean from each data point
and square the difference
• $(-7)^2, (-2)^2, 0^2, (-4)^2, 13^2$
• Then we add all these up and divide by 5 - 1 = 4
• $\mathrm{Var}(X) = \frac{1}{4}(49 + 4 + 0 + 16 + 169) = 59.5$
• The standard deviation is the square root of this
– $\mathrm{Sd}(X) = \sqrt{59.5} \approx 7.71$
• The mean absolute deviation is 5.2
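The worked example can be checked in R. `var()` and `sd()` divide by n - 1, matching the formulas above. The mean absolute deviation is computed by hand here because R's built-in `mad()` is a different statistic (a scaled median absolute deviation).

```r
milk <- c(565, 570, 572, 568, 585)  # milk sales from the example

deviations <- milk - mean(milk)  # -7, -2, 0, -4, 13

manual_var <- sum(deviations^2) / (length(milk) - 1)  # 238 / 4 = 59.5
var_milk <- var(milk)                                 # same value from the built-in
sd_milk <- sd(milk)                                   # sqrt(59.5), about 7.71

mean_abs_dev <- mean(abs(deviations))  # 26 / 5 = 5.2

print(c(manual_var, var_milk, sd_milk, mean_abs_dev))
```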
Section 4
VISUALIZING DATA
Outline

• Histogram

• Box Plot
Histogram
• A histogram is a graph that plots the relative
frequency of data
• 10.61 12.18 11.73 11.28 4.43 9.83 10.13 9.48 9.27
5.95 8.74 9.96 10.73 13.26 10.95 11.13 13.42 9.13
8.77 6.77 9.53 7.09 9.48 11.23 8.35 12.26 6.47
12.45 10.82 2.50 14.26 11.56 10.76 10.98 7.52
10.38 10.35 15.93 8.40 10.57 7.88 12.43 8.95
10.10 12.10 11.13 7.18 10.77 11.54 8.03
• Almost no two data points are the same
• Let’s group the data into a few bins
Histogram
• There is 1 number less than 4
• There are 2 numbers between 4 – 6
• There are 6 numbers between 6 – 8
• There are 13 numbers between 8 – 10
• …
• To make a histogram we group data together like this and then
make a bar chart
• The width of the bars represents the range of the groups
• The heights of the bars represent how many data points are
in that group
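The binning step can be sketched in R with `hist()`. Two assumptions are made here: the fused entry in the data listing is read as the two values 15.93 and 8.40, and the bins are taken to be width 2 starting at 2, which is consistent with the counts listed above.

```r
x <- c(10.61, 12.18, 11.73, 11.28, 4.43, 9.83, 10.13, 9.48, 9.27,
       5.95, 8.74, 9.96, 10.73, 13.26, 10.95, 11.13, 13.42, 9.13,
       8.77, 6.77, 9.53, 7.09, 9.48, 11.23, 8.35, 12.26, 6.47,
       12.45, 10.82, 2.50, 14.26, 11.56, 10.76, 10.98, 7.52,
       10.38, 10.35, 15.93, 8.40, 10.57, 7.88, 12.43, 8.95,
       10.10, 12.10, 11.13, 7.18, 10.77, 11.54, 8.03)

bin_edges <- seq(2, 16, by = 2)  # assumed bins: 2-4, 4-6, ..., 14-16

# plot = FALSE returns the bin counts without drawing anything.
h <- hist(x, breaks = bin_edges, plot = FALSE)
bin_counts <- h$counts
print(data.frame(from = head(bin_edges, -1), to = bin_edges[-1], count = bin_counts))

# hist(x, breaks = bin_edges) would draw the bar chart itself.
```

Changing `bin_edges` changes the picture, which is why the histogram of a data set is not unique.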
Histogram

The mode is sometimes approximated as the midpoint of the tallest bar


Histogram

• A histogram of a data set is not unique


Histogram
• The histogram is a great way to visualize your
data
• Sometimes the histogram has high bars in the middle and
bars of decreasing height as you move away from the middle
• This shape is called the bell curve
• Not all data follows a bell curve
Box Plot
• A box plot is a more detailed visualization of the 5-number
summary
• (min, Q1, median, Q3, max)
• The box plot plots (Q1, median, Q3) as a box
• Recall that numbers less than Q1-1.5*IQR or greater than
Q3+1.5*IQR are considered outliers
• The whiskers extend from the box to the min and max, or only as far as
the outlier range when there are outliers
• All points considered outliers are plotted individually
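The numbers behind a box plot can be inspected in R with `boxplot.stats()`, which applies the 1.5*IQR rule by default. In R's convention the whiskers stop at the most extreme data points that are not outliers; other software may draw them differently. The five-point milk data set is used only to keep the sketch small.

```r
milk_early <- c(100, 570, 572, 568, 585)  # milk sales with the early-closing day

bp <- boxplot.stats(milk_early)  # coef = 1.5 is the default outlier multiplier

whisker_box <- bp$stats  # lower whisker, Q1, median, Q3, upper whisker
plotted_individually <- bp$out  # points outside the outlier range

print(whisker_box)
print(plotted_individually)

# boxplot(milk_early, horizontal = TRUE) would draw the plot itself.
```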
Box Plot

• We can also see skewness in boxplots


Outliers
• Should you throw away outliers?
• Probably not…
• Unless you know the reason a data point is an outlier, and
that reason is not relevant to your analysis
• If you collect enough data on anything you’ll almost always find
some data points that look like outliers
• Sometimes extreme events happen
• We need to take those extremes into consideration when we do
analysis
• But it’s still good to identify potential outliers so we can be careful
about our analysis.
