Mastering Outliers in Excel and in R
Duncan Williamson
© 12th April 2024
This is a free of charge proof edition and your comments are welcome: please write to me at
duncanwil@gmail.com with your suggestions.
Contents
Chapter 1: Understanding Outliers ....................................................................................................... 4
- Definition of Outliers ...................................................................................................................... 4
Examples of Outliers ......................................................................................................................... 4
1 Medical Data Example ................................................................................................................ 4
2 Financial Data Example .............................................................................................................. 6
3 Educational Data Example .......................................................................................................... 8
- Types of Outliers (Point, Contextual, Collective) ........................................................................... 10
1. Point Outliers (or Global Outliers): .......................................................................................... 10
2. Contextual Outliers (or Conditional Outliers): ......................................................................... 10
3. Collective Outliers: .................................................................................................................. 11
4. Masked Outliers: ..................................................................................................................... 11
5. Influential Outliers: ................................................................................................................. 14
- Causes and Impacts of Outliers in Data ......................................................................................... 17
- Causes of Outliers .................................................................................................................... 17
- Impacts of Outliers .................................................................................................................... 18
- Addressing Outliers ................................................................................................................... 18
- Outliers: Bane or Boon? ................................................................................................................ 18
Chapter 2: Outlier Detection Techniques ............................................................................................ 20
- Basic Statistical Methods (Standard Deviation, IQR) ..................................................................... 20
- Visualisation Methods (Box Plot, Scatter Plot) .............................................................................. 20
Standard Deviation...................................................................................................................... 20
The Box & Whisker Plot ............................................................................................................... 21
The InterQuartile Range: IQR ...................................................................................................... 22
- Advanced Techniques: DBSCAN, Clustering with Power BI ........................................................... 27
Chapter 3: Excel for Outlier Analysis ................................................................................................... 34
- Preparing Data in Excel ................................................................................................................. 34
- Using Formulas for Outlier Detection ........................................................................................... 42
Identifying Individual Outliers ..................................................................................................... 47
- Case Study: Outlier Analysis in a Business Context with Excel ...................................................... 57
Multiple Regression: Chinese Wells ............................................................................................ 57
Standardised Residuals ............................................................................................................... 59
Standard Error ............................................................................................................................. 59
Range .......................................................................................................................................... 59
Z Score Results: outlier detection ................................................................................................ 61
Chapter 1: Understanding Outliers
Definition of Outliers
An outlier is a data point that differs significantly from other observations in a dataset. It appears to
deviate markedly from other members of the sample in which it occurs. Outliers can occur by chance
in any distribution but they are often indicative either of measurement error or of the population
having a heavy tailed distribution. In the latter case, the outliers represent a phenomenon that is also
of interest for analysis.
Formally, an outlier can be defined in several ways, often based on statistical measures. For example,
under the widely used interquartile range rule, any data point more than 1.5 interquartile ranges
(IQR) below the first quartile or above the third quartile is considered an outlier. In terms of standard
deviation, a common rule is that a data point is an outlier if it is more than two (or, more
conservatively, three) standard deviations away from the mean.
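To make both rules concrete, here is a short sketch in R. The vector is made up for illustration (85 is the obvious outlier); the data files that accompany this book use their own values.

```r
# A small, made-up vector to illustrate both rules (the 85 is the outlier)
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 85)
q <- quantile(x, c(0.25, 0.75))
fences <- c(q[1] - 1.5 * IQR(x), q[2] + 1.5 * IQR(x))
x[x < fences[1] | x > fences[2]]   # flagged by the 1.5 * IQR rule
x[abs(x - mean(x)) > 2 * sd(x)]    # flagged by the two standard deviation rule
```

Here both rules flag the same value, but on less extreme data the two thresholds need not agree.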
However, the definition of an outlier is subjective and varies from one field of study to another. In
practice, whether a data point is treated as an outlier depends not only on its numerical value but
also on the context of the data and the specific analysis being performed.
Examples of Outliers
Here are three examples from different contexts to illustrate what an outlier might look like. You will
find the data sets and graphs in the files chapter_one.xlsx and outliers_chapter_one.R.
1 Medical Data Example
- Context: Imagine a dataset containing the resting heart rates of a group of healthy adults aged 25
- 40, where most values cluster around 60 - 80 beats per minute (bpm).
- Outlier: If one individual in this group has a resting heart rate of 140 bpm, this value would be
considered an outlier. It's significantly higher than the norm for this specific population and might
indicate an underlying health issue or measurement error.
This dataset includes a total of 100 entries, with the majority of the resting heart rates clustered
around the 60 - 80 bpm range. However, it also contains outliers, such as a very high value of 140
bpm and a low value of 55 bpm. These outliers could represent special cases or errors and would be
points of interest in an analysis focused on identifying and understanding outlier data.
# Reconstructed sketch: 98 typical rates plus the two outliers (full script in outliers_chapter_one.R)
set.seed(123)
heart_rates_outliers <- c(140, 55)
df_heart_rates <- data.frame(ID = 1:100,
  Heart_Rate = c(rnorm(98, mean = 70, sd = 5), heart_rates_outliers))
tail(df_heart_rates, 10)
2 Financial Data Example
- Context: Consider a dataset of daily sales figures for a retail store over a year. The sales usually
range between $1,000 and $3,000 per day.
- Outlier: If on one particular day, the sales record shows $50,000, this would be an outlier. This
could be due to a special event or an error in data recording. In contrast, if the store had a well-
advertised, major annual sale on that day, this high value might not be considered an outlier in the
context of expected sales spikes during promotional events.
# Reconstructed sketch: 362 typical days plus three outliers (full script in outliers_chapter_one.R)
set.seed(123)
sales_outliers <- c(50000, 300, 48000) # Two high sales and one low sale day
df_daily_sales <- data.frame(Date = seq(as.Date("2023-01-01"), by = "day", length.out = 365),
  Daily_Sales = c(rnorm(362, mean = 2000, sd = 500), sales_outliers))
tail(df_daily_sales, 10)
In this example:
• I've used a normal distribution to generate sales figures predominantly in the $1,000 to
$3,000 range.
• I've added a few outliers to represent atypical sales days, both extremely high and extremely
low.
• The dataset is set against a sequence of dates to mimic daily sales over a year.
• You can adjust the mean, standard deviation and outliers to better fit your specific scenario
or example.
Finally, the code creates a .csv file that is saved to the R working directory: Daily_Sales_Data.csv
3 Educational Data Example
- Context: In a standardised test taken by thousands of students, scores are normally distributed
with most students scoring around the median.
- Outlier: If a student scores significantly higher or lower than the majority, this score could be
considered an outlier. For instance, if most scores range between 60 - 85 out of 100, but one student
scores 5 or 100, these scores would be outliers. The low score might indicate a lack of understanding
or an issue with the test taking, while the perfect score could be exceptionally rare and noteworthy.
To demonstrate this example in R, I'll create a synthetic dataset representing the standardised test
scores of students, with most scores clustering around the median (in the 60 - 85 range) and a few
outliers. Here's the R code to generate and visualise this data:
# Reconstructed sketch: 96 typical scores plus four outliers (full script in outliers_chapter_one.R)
set.seed(123)
test_scores_outliers <- c(5, 100, 6, 98) # A few extremely low and high scores
df_test_scores <- data.frame(Student_ID = 1:100,
  Test_Score = c(rnorm(96, mean = 72.5, sd = 7.5), test_scores_outliers))
head(df_test_scores, 10)
tail(df_test_scores, 10)
In this example:
• Test scores are generated using a normal distribution centered around 72.5 with a standard
deviation of 7.5, simulating a typical distribution of scores.
• A few extreme values (5, 100, 6, 98) are added as outliers.
• A scatter plot is created using ggplot2 to visualise these test scores. The outliers should be
clearly visible as points far removed from the cluster of other scores.
• I added a line of code so that R would export the data to a .csv file that we can then explore
in Excel.
This code serves as a practical demonstration of how to handle and visualise data with outliers in an
educational context and the following graph has been created in Excel, in which I have highlighted
the outlier scores:
- Types of Outliers (Point, Contextual, Collective)
In the field of data analysis, outliers are data points that significantly differ from the rest of the data.
They are commonly categorised into three main types: point outliers, contextual outliers and
collective outliers, with masked outliers and influential outliers as two further, related categories.
Understanding these categories is crucial for effectively identifying and analysing outliers in
various datasets.
1. Point Outliers (or Global Outliers):
- Definition: A point outlier is an individual data point that significantly deviates from the rest of the
data in the dataset. It's an anomaly when considered in the full context of the dataset.
- Example: In a dataset of human heights, if most people are between 5 and 6 feet tall, a height of 8
feet would be a point outlier.
- Detection: These are typically detected using statistical measures like Z scores, standard
deviations or interquartile ranges (IQR).
2. Contextual Outliers (or Conditional Outliers):
- Definition: Contextual outliers are data points that are considered outliers within a specific
context or condition but might not be outliers when taken out of that context. These are often
detected in time series data or geographical data, where the context (like time or location) matters.
- Example: Consider temperature readings over a year. A temperature of 30°C would be normal in
summer but would be a contextual outlier in winter.
- Detection: Detecting these outliers involves understanding the context or conditions and often
requires more complex analysis, like segmentation of data based on those conditions.
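To make this concrete, here is a small R sketch with made-up monthly temperatures. A 30°C reading is flagged only when each season is judged against its own mean and standard deviation:

```r
# A contextual outlier: 30°C is normal in summer but anomalous in winter.
# The temperatures below are invented for illustration.
temps <- data.frame(
  temp_c = c(2, 4, 3, 1, 5, 2, 3, 30, 27, 29, 31, 28, 30, 26, 32, 29),
  season = rep(c("winter", "summer"), each = 8)
)
# z score of each reading within its own season
z_in_season <- ave(temps$temp_c, temps$season,
                   FUN = function(v) abs(v - mean(v)) / sd(v))
temps[z_in_season > 2, ] # only the 30°C winter reading is flagged
```

Pooled across the whole year, the same 30°C would look unremarkable; the segmentation by season is what exposes it.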
3. Collective Outliers:
- Definition: A collective outlier refers to a subset of data points that deviate significantly from the
overall data pattern when considered together, even though the individual data points may not be
outliers. These are often seen in time series or sequence data.
- Example: In a dataset of daily share prices, a sudden, short lived spike followed by an equally
sudden drop might not be unusual for individual data points. However, the collective pattern of these
points over a few days might be anomalous compared to the usual share price movements.
- Detection: Detecting collective outliers often involves analyzing the data points in sequence and
looking for anomalies in the pattern or behaviour over time.
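A minimal sketch of this idea in R, using a deliberately flat, made-up price series with one short spike-and-drop run. Scanning a rolling window, rather than single days, is what exposes the run:

```r
# A collective outlier: a three day run that deviates as a group.
# The price series is schematic, invented for illustration.
price <- c(rep(100, 14), 106, 107, 106, rep(100, 13))
roll <- as.numeric(stats::filter(price, rep(1/3, 3), sides = 2)) # 3 day moving average
which(abs(roll - median(roll, na.rm = TRUE)) > 3) # the run on days 15 - 17 is flagged
```

In real data the window length and the threshold would need tuning to the usual volatility of the series.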
4. Masked Outliers:
- Definition: Masked outliers are data points that may appear to be normal when viewed in the
context of the entire dataset but are actually anomalous when considered in a more refined or
appropriate context. These outliers are masked because their unusual nature is hidden by the
presence of other data points, making them harder to detect with standard outlier detection
methods.
- Example: Imagine a dataset containing the test scores of students from two different classes: Class
A and Class B. Class A students typically score between 70 - 80% and Class B students score between
40 - 50%. If the scores are combined into a single dataset without class distinction, a score of 65%
would not stand out as an outlier in the combined range of 40 - 80%. However, if we consider the
data separately for each class, a score of 65% is an outlier for Class B (too high) and potentially for
Class A (slightly low).
- Detection: Identifying masked outliers often requires a more nuanced approach to data analysis:
To set up this scenario in R, we will create two separate datasets for Class A and Class B students with
their respective score distributions and then combine them into a single dataset. We'll also include a
few scores that are outliers within each class context. Here's the R code to create this dataset:
# Reconstructed sketch of the dataset creation (the full script is in the chapter's R file);
# the class means, SDs and outliers below are assumptions chosen to match the narrative
set.seed(123)
scores_a <- c(rnorm(50, mean = 75, sd = 3), 65, 85) # 65 is unusually low for Class A
scores_b <- c(rnorm(50, mean = 45, sd = 3), 65, 35) # 65 is unusually high for Class B
df_combined_scores <- data.frame(Student_ID = 1:104,
                                 Test_Score = c(scores_a, scores_b),
                                 Class = rep(c("Class A", "Class B"), each = 52))
head(df_combined_scores, 10)
tail(df_combined_scores, 10)
# Plotting the combined test scores with colour distinction for each class
library(ggplot2)
ggplot(df_combined_scores, aes(x = Student_ID, y = Test_Score, color = Class)) +
  geom_point() +
  labs(title = "Test Scores of Students in Class A and Class B",
       x = "Student ID",
       y = "Test Score (%)") +
  theme_minimal()
In this code:
• We generate scores for Class A and Class B, along with specific outliers for each class.
• We label each score with its respective class.
• We combine the two classes into a single dataset, df_combined_scores.
• A scatter plot is created using ggplot2 to visualise the scores, with colour coding to
distinguish between the two classes.
• This setup and visualisation will illustrate how a score like 65% can be an outlier in the
context of each class, but not in the combined dataset.
The graph for this example follows and in it we can see the distinction between the Class A and the
Class B scores:
I also created the following table of descriptive statistics that we will see more of later in the book:
          Class A      Class B
Mean      75.0690      45.2541
Median    74.3226      45.0413
Mode      #N/A         #N/A
SD        3.3439       4.0116
Kurtosis  1.5514       11.3376
Skewness  0.1682       2.0546
Maximum   85.0000      65.0000
Minimum   65.0000      35.0000
Range     20.0000      30.0000
Sum       3,903.5896   2,353.2109
Count     52.0000      52.0000
Notice the Mode values: there are no modal values in this case study and I could have used the
IFERROR() function, for example =IFERROR(MODE.SNGL(range), "n/a"), to make the Mode results
more presentable: next time!
=BYCOL(array, LAMBDA(column, calculation))
We will explore other aspects of examples such as these as we go on, as per the following:
1 Segmentation: One effective method is to segment the data into more homogeneous groups based
on relevant characteristics or conditions. In the student score example, this would mean analysing
the data separately for each class.
2 Domain Knowledge: Understanding the domain or context of the data can be critical. This
knowledge can guide you in identifying relevant subgroups or conditions where outliers may be
masked.
3 Advanced Statistical Techniques: Some statistical methods are designed to detect outliers in
complex data structures, including those that can uncover masked outliers. Techniques like cluster
analysis, principal component analysis (PCA) or machine learning algorithms can be useful in these
scenarios.
Discussion: Masked outliers present a significant challenge in data analysis because they can go
unnoticed, leading to skewed analysis and incorrect conclusions. It's crucial to be aware of the
possibility of their existence, especially in datasets that are heterogeneous or that combine data
from different sources. The key to detecting masked outliers lies in a careful and thorough
examination of the data, considering its context and applying the right analytical techniques.
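The segmentation approach can be sketched in R with hypothetical scores. The values below are invented so that a 65% sits comfortably inside the pooled range but is extreme for Class B:

```r
# Unmasking an outlier by segmenting: 65% hides in the pool, stands out in Class B
scores <- data.frame(
  score = c(75, 78, 74, 76, 77, 73, 76, 75,   # Class A
            44, 46, 43, 45, 47, 44, 46, 65),  # Class B, with the masked 65
  class = rep(c("A", "B"), each = 8)
)
pooled_z <- abs(scores$score - mean(scores$score)) / sd(scores$score)
class_z <- ave(scores$score, scores$class,
               FUN = function(v) abs(v - mean(v)) / sd(v))
scores[pooled_z < 2 & class_z > 2, ] # masked in the pool, exposed within its class
```

The pooled z score of the 65% is tiny because the two class distributions widen the overall spread; only the within-class z score reveals it.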
5. Influential Outliers:
- Definition: influential outliers are data points that have a disproportionately large impact on the
outcome of statistical analyses, such as regression models or mean calculations. Unlike typical
outliers, which might simply deviate from the norm, influential outliers can significantly alter the
results and conclusions drawn from the data.
- Example: consider a simple linear regression analysis where you're looking at the relationship
between years of experience and salary in a dataset of employees. Most data points form a linear
pattern, indicating that salary increases with experience. However, if there is one employee with an
unusually high salary that doesn't fit the pattern (perhaps due to being a high level executive), this
data point could disproportionately influence the slope of the regression line, leading to a misleading
interpretation of the relationship for the general workforce.
The King Kong Effect in data refers to a situation where one value is so extreme that it skews the
overall understanding of the dataset. Let’s create a dataset suggestive of the King Kong Effect by
imagining a hypothetical scenario involving assumed weights of various animals in a zoo. In this
example, most animals have weights within a relatively narrow range, except for King Kong, who has
an extraordinarily high weight.
weights_normal_animals <- c(250, 150, 200, 300, 180, 220, 190, 210, 230, 240) # Weights in kg
# Adding King Kong's weight as an outlier (the 3,000 kg figure is an assumption)
weights_all_animals <- c(weights_normal_animals, 3000)
animal_names <- c("Lion", "Tiger", "Bear", "Giraffe", "Elephant", "Rhino", "Hippo", "Leopard",
                  "Zebra", "Buffalo", "King Kong")
df_animal_weights <- data.frame(Animal = animal_names, Weight_kg = weights_all_animals)
df_animal_weights
# Assuming you have already created the df_animal_weights data frame as previously described
# Plotting animal weights
ggplot(df_animal_weights, aes(x = Animal, y = Weight_kg)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() + # Flipping the coordinates for a horizontal bar plot
labs(title = "Weights of Animals in a Zoo (including King Kong)",
x = "Animal",
y = "Weight (kg)") +
theme_minimal()
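Before moving on to detection, it is worth seeing just how much one extreme value distorts the summaries. The sketch below is self-contained and uses an assumed 3,000 kg weight for King Kong:

```r
# The King Kong Effect on summary statistics: the mean moves, the median barely does
w <- c(250, 150, 200, 300, 180, 220, 190, 210, 230, 240) # normal animals, kg
mean(w); median(w)                    # 217 and 215: close together
mean(c(w, 3000)); median(c(w, 3000)) # 470 and 220: the mean has more than doubled
```

This is also a first glimpse of why the median, and other robust statistics, feature so heavily in outlier work.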
- Detection: This involves both identifying outliers and assessing their impact on your analysis:
1 Leverage v Residuals Plot: In regression analysis, a leverage versus residuals plot can help identify
influential outliers. High leverage points have a significant impact on the position of the regression
line, while high residuals indicate a large deviation from the predicted value.
2 Cook’s Distance: This is a measure used in regression analysis to identify influential points. It
considers both the leverage of the data point and the size of its residual. A large Cook’s distance
suggests that the data point is influential.
3 Influence Plot: Some statistical software offers an influence plot, which combines information
about leverage, residuals and Cook’s distance, providing a comprehensive view of the potential
influence of each data point.
4 Robustness Check: Re-running the analysis with and without the suspected outliers can
demonstrate their influence. A significant change in results indicates that the outliers are influential.
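Points 2 and 4 can be sketched together in R. The salary data below are synthetic, with the final observation playing the part of the executive; the 4/n cut-off for Cook's distance is a common rule of thumb, not a law:

```r
# Spotting an influential outlier with Cook's distance (synthetic data)
set.seed(42)
experience <- 1:20
salary <- 30000 + 2000 * experience + rnorm(20, sd = 1500)
salary[20] <- 250000 # a high level executive who breaks the pattern
model <- lm(salary ~ experience)
influential <- which(cooks.distance(model) > 4 / length(salary)) # rule of thumb
influential # the executive is flagged
```

A robustness check would then refit the model without the flagged rows and compare the coefficients.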
The King Kong Effect in a dataset is an example of an Influential Outlier. Here's why:
Influential Outliers are data points that have a substantial impact on the outcome of your analysis.
They don't just differ from other observations, but their presence significantly alters statistical
calculations and the results of modelling. In the King Kong example, the extreme weight of King Kong
compared to other animals in the dataset would drastically affect the mean, the variance and any
other statistic calculated from the data. Compare the King Kong Effect with the other outlier types:
Point Outliers: These are individual data points that stand out from the rest of the dataset. While the
King Kong weight is a point outlier, it's specifically its influence on the dataset that makes it an
influential outlier.
Contextual Outliers: These are data points that are considered outliers within a specific context or
condition, but may not be outliers in another context. The King Kong effect doesn't necessarily
depend on context; it's the sheer scale of the outlier that's key.
Collective Outliers: These are a collection of data points that deviate significantly from the overall
data pattern when considered together, although the individual data points may not be outliers. The
King Kong effect is typically represented by a single, extreme data point.
Masked Outliers: These are outliers that might not be detected due to the presence of other data
points. The King Kong effect is quite the opposite, as it's typically very noticeable due to the
extremeness of the outlier.
In summary, the King Kong effect epitomises the influential outlier, where one extreme value can
have a disproportionate impact on the entire dataset's analysis.
- Discussion: influential outliers are particularly critical in regression analysis and other statistical
modelling techniques because they can lead to incorrect model parameters and predictions. Their
detection and handling are essential steps in the data analysis process. Depending on the context
and the goals of the analysis, you may choose to remove, adjust or otherwise account for these
outliers to ensure that your results are reliable and representative of the underlying data. However,
it's also important to investigate why these outliers exist, as they could represent important, albeit
rare, phenomena.
Each type of outlier provides different insights and challenges in data analysis. Identifying and
understanding the nature of outliers is crucial for accurate data analysis, as their presence can
significantly impact statistical conclusions and predictive modelling.
- Causes and Impacts of Outliers in Data
Understanding the causes and impacts of outliers is crucial for accurate data analysis, model building
and decision making processes.
- Causes of Outliers
1 Measurement or Input Error: Outliers can arise from mistakes in data collection, recording or
entry. This includes transcription errors, malfunctioning measurement equipment or incorrect data
input. In such cases, the outliers do not represent actual variations in the underlying data.
2 Natural Variations: In many cases, outliers represent true but rare events in the population. For
instance, exceptionally high or low values in medical data might indicate rare medical conditions.
3 Changes in Behaviour or Conditions: Outliers can signal a shift in the underlying process generating
the data, such as a sudden market change in financial data or an emerging trend in social media
analytics.
- Impacts of Outliers
1 Statistical Analysis: Outliers can skew statistical measures like the mean, variance and standard
deviation, leading to misleading conclusions. They can also impact the assumptions underlying many
statistical tests and models, such as normality and homoscedasticity.
2 Predictive Modelling: In machine learning and predictive modeling, outliers can disproportionately
influence the model's parameters, potentially leading to overfitting. This makes the model less
generalisable and accurate on unseen data.
3 Data Interpretation Challenges: Outliers can complicate the interpretation of data visualisations,
such as histograms or scatter plots, masking true patterns or trends in the data.
4 Decision Making: In a business or policy context, unaddressed outliers can lead to misguided
strategies or decisions, especially if they are assumed to represent typical cases.
5 Opportunity for Discovery: On the positive side, outliers can reveal valuable insights. They may
point to novel phenomena, errors in process or areas for improvement. For instance, in quality
control, outliers can indicate defects or failures in manufacturing processes.
- Addressing Outliers
Professionals must develop a strategy for dealing with outliers, which includes detection, diagnosis
and appropriate treatment (removal, transformation or separate analysis). The approach depends on
the nature of the data, the context of the study and the objectives of the analysis. It's also imperative
to document the handling of outliers for transparency and reproducibility in data analysis.
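The three common treatments mentioned above can be sketched in R on a tiny made-up vector, in which 90 is the outlier:

```r
# Three common treatments for a detected outlier (illustrative values)
x <- c(10, 12, 11, 13, 12, 90)
x_removed <- x[x < 50]                  # removal
x_capped  <- pmin(x, quantile(x, 0.95)) # capping (winsorising) at the 95th percentile
x_logged  <- log(x)                     # transformation to damp the extreme value
```

Which treatment is appropriate, if any, depends on the cause of the outlier: a data entry error can be removed, whereas a genuine rare event usually should not be.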
- Outliers: Bane or Boon?
Outliers, those extreme data points that deviate significantly from the rest of the dataset, have
always intrigued statisticians and analysts. In this section, we explore the conflicting perspectives on
outliers and their implications.
On the one hand, outliers are often considered to be a bane: they can distort statistical analysis,
compromise the accuracy of predictive models and skew the interpretation of results. Outliers have
the potential to mislead decision making based on faulty data and can be a source of frustration
when trying to establish patterns or trends.
On the other hand, outliers can also offer valuable insights and information. They might represent
exceptional cases, rare events or important data points that require special attention. Outliers can
highlight anomalies that reveal hidden patterns, uncover unexpected correlations or provide a fresh
perspective on the dataset.
Understanding whether outliers are a bane or a boon depends on the context and objectives of the
analysis. In certain scenarios, outliers might indicate errors or data quality issues that need to be
addressed. However, they could also be valuable pieces of information that contribute to a
comprehensive understanding of the data and its underlying processes.
Chapter 2: Outlier Detection Techniques
- Basic Statistical Methods (Standard Deviation, IQR)
Standard Deviation
Imagine you're sitting in a lecture hall, eager to learn about one of the fundamental concepts in
business analysis: standard deviation. As your professor, I am about to introduce this concept to you,
a group of first-year undergraduate business analysis students.
Standard Deviation measures how spread out numbers are in a dataset. In business, this can tell us a
lot about consistency, risk and variability. For instance, if we're looking at the monthly sales figures of
a company, the standard deviation helps us understand how much these sales figures fluctuate over
time.
Why is this important? Well, in the business world, we love predictability. Knowing the standard
deviation of sales, costs or even share prices can help businesses make more informed decisions. A
low standard deviation means the numbers are close to the average, indicating stability. A high
standard deviation, however, suggests a lot of variability, which could mean higher risk.
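A two-line illustration in R makes the point: the two made-up series below share a mean of $2,000 but tell very different risk stories:

```r
# Same average monthly sales, very different variability (illustrative figures)
stable   <- c(1900, 2000, 2100, 1950, 2050)
volatile <- c(500, 3500, 1000, 3000, 2000)
mean(stable); mean(volatile) # both 2000
sd(stable)                   # small: predictable, lower risk
sd(volatile)                 # large: volatile, higher risk
```

The mean alone would make the two products look identical; the standard deviation is what separates them.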
As budding business analysts, mastering the concept of standard deviation will enable you to identify
trends, assess risks, and make data driven decisions. Remember, it's not just about the numbers; it's
about understanding the story they tell.
We'll start by looking at some real world examples and then move on to calculating standard
deviation ourselves. It's a step by step process and I'll guide you through it. Let's embark on this
journey to demystify statistics and harness its power for business analysis. Welcome to the world of
standard deviation!
Let’s refer back to the daily sales data we used in chapter one and we can find those data in the file,
chapter_two.xlsx.
Excel has a built in standard deviation function that we can use with very little training but we will
use both that function and the BYCOL() function that we used in chapter one.
Look at these screenshots from the daily sales data and their analysis.
Histograms are best used for assessing the variability of a set of data. The following histograms show
the full data set, including and excluding the outliers:
Again, we can see the impact that the inclusion of outliers has on the appearance of data, if nothing
else.
The Box & Whisker Plot
Still working in Excel, we can easily convert the histograms to Box & Whisker plots: again, including
and excluding outliers, on the left and right below, respectively.
Notice that I have recoloured part of the plot so that you can see the data points and the fact that
the mean and the median are quite similar in value. Note too that even though I have labelled the
second plot as excluding outliers, there are two outliers shown on it: one above the maximum bar
and one below the minimum bar. We will discuss those two values shortly as we take a more
advanced view of this plot.
The statistician John Tukey invented the box and whisker plot and created the concept and
interpretation of the interquartile range, the IQR, to go with it. Let's look at these additional
concepts now.
The InterQuartile Range: IQR
The idea behind the IQR is that it provides us with threshold values for identifying upper and lower
outliers.
In the diagram in the previous section, we can see the IQRs for the data including the outliers and for
the data without the outlier data. However, there is one further step to take with this knowledge and
that is to find what are called Tukey’s Fences. There are two versions of Tukey’s Fences, inner and
outer and for each of them there is a lower and an upper value, as follows:
• The values of the lower and upper inner and outer Tukey Fences: check the formulas for
them and note the constant values used of 1.5 for the inner fences and 3 for the outer fences
• The outlier summary which shows how many outlier values there are in the data
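The fences themselves are easy to compute in R; the sales vector below is made up for illustration:

```r
# Tukey's inner and outer fences on an illustrative sales vector
x <- c(1200, 1500, 1800, 2000, 2100, 2300, 2600, 50000)
Q1 <- quantile(x, 0.25); Q3 <- quantile(x, 0.75); iqr <- Q3 - Q1
inner <- c(Q1 - 1.5 * iqr, Q3 + 1.5 * iqr) # beyond these: "outliers"
outer <- c(Q1 - 3 * iqr, Q3 + 3 * iqr)     # beyond these: "far out" values
sum(x < inner[1] | x > inner[2]) # the outlier count for the summary
```

Note that Excel and R can disagree slightly on the fence values because they use different quartile interpolation methods by default.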
This introduction sets the stage for your learning journey into the world of business statistics,
emphasising the practical applications and importance of understanding standard deviation in
business analysis.
The R Codes
The following code includes the basic codes that we used in chapter one for the Daily Sales data and
then it adds the new code. I have identified where the new code begins.
# Defining outliers
sales_outliers <- c(50000, 300, 48000) # Two high sales and one low sale day
tail(df_daily_sales, 10)
summary(df_daily_sales)
      Date             Daily_Sales
 Min.   :2023-01-01   Min.   :  300
 1st Qu.:2023-04-02   1st Qu.: 1652
 Median :2023-07-02   Median : 1994
 Mean   :2023-07-02   Mean   : 2263
 3rd Qu.:2023-10-01   3rd Qu.: 2316
 Max.   :2024-01-01   Max.   :50000
hist(df_daily_sales$Daily_Sales,
xlab = "Daily Sales",
main = "Histogram of Daily Sales: all values",
breaks = sqrt(nrow(df_daily_sales))
) # set number of bins
# Calculate IQR
Q1 <- quantile(df_daily_sales$Daily_Sales, 0.25)
Q3 <- quantile(df_daily_sales$Daily_Sales, 0.75)
IQR <- Q3 - Q1
# NEW CODE begins here: define the filtered data used below
# (this step was missing from the extract and is reconstructed using Tukey's inner fences)
filtered_data <- subset(df_daily_sales,
  Daily_Sales >= Q1 - 1.5 * IQR & Daily_Sales <= Q3 + 1.5 * IQR)
# Plot histogram
hist(filtered_data$Daily_Sales,
  main = "Histogram Excluding Extreme Values",
  xlab = "Daily Sales",
  breaks = 10) # Adjust 'breaks' as needed
boxplot(df_daily_sales$Daily_Sales,
ylab = "Daily Sales All Values"
)
boxplot(filtered_data$Daily_Sales,
ylab = "Daily Sales No Extreme Values"
)
# Using the which() function, R can tell us which are the row numbers of the outlier
# values in our data set (reconstructed: rows beyond Tukey's inner fences)
outlier_rows <- which(df_daily_sales$Daily_Sales < Q1 - 1.5 * IQR |
  df_daily_sales$Daily_Sales > Q3 + 1.5 * IQR)
outlier_rows
# And now it is also possible to print the values of the outliers directly on the boxplot
# with the mtext() function:
formatted_outliers <- format(df_daily_sales$Daily_Sales[outlier_rows], big.mark = ",")
boxplot(filtered_data$Daily_Sales,
  ylab = "Daily Sales",
  main = "Boxplot of Daily Sales"
)
mtext(paste("Outliers:", paste(formatted_outliers, collapse = ", ")))
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering
algorithm that is particularly effective in identifying outliers as well as forming clusters of data points
in a dataset. In the context of outlier detection, DBSCAN is used because of its unique approach to
handling noise and outliers. Let’s delve into its key concepts and how it applies to outlier detection:
Density based Clustering: Unlike centroid based algorithms like K-means, DBSCAN groups together
points that are closely packed together (points with many nearby neighbours), marking as outliers
the points that lie alone in low density regions.
Epsilon (ε): A distance measure that defines the neighbourhood around a data point. If the distance
between two points is less than or equal to ε, they are considered neighbours.
MinPts (Minimum Points): The minimum number of points required to form a dense region. A point
is considered a core point if it has at least MinPts within its ε-neighbourhood.
Core, Border and Noise Points: In DBSCAN, points are categorised as core points, border points or
noise points. Noise points are considered outliers.
Process
The algorithm starts by randomly selecting a point and retrieving all points within its ε-
neighbourhood.
If the point has at least MinPts in its neighbourhood, a cluster is formed. The cluster then expands by
adding all reachable points within ε-distance that also meet the MinPts criterion.
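The worked example below uses the dbscan package in R; purely to make the process just described concrete, here is a minimal one dimensional DBSCAN sketch in Python. The sales figures and the eps and min_pts values are invented for illustration, mirroring the three planted outliers of 300, 48,000 and 50,000:

```python
def dbscan_1d(values, eps, min_pts):
    """Minimal 1-D DBSCAN: returns a cluster id per point, or -1 for noise."""
    n = len(values)
    labels = [None] * n  # None = not yet visited
    cluster = -1

    def neighbours(i):
        # every point within eps of values[i], including the point itself
        return [j for j in range(n) if abs(values[j] - values[i]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:  # noise reachable from a core point = border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point, so keep expanding
                seeds.extend(j_nbrs)
    return labels

# Invented daily sales clustered near 2,000, plus the three planted outliers
sales = [1900, 1950, 2000, 2050, 2100, 2200, 1800, 2300, 50000, 300, 48000]
labels = dbscan_1d(sales, eps=500, min_pts=3)
outliers = [v for v, lab in zip(sales, labels) if lab == -1]  # [50000, 300, 48000]
```

Every ordinary trading day lands in one dense cluster, while the three extreme values have too few neighbours within ε and are labelled noise: that is exactly how DBSCAN flags outliers.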
I have applied the DBSCAN function to the Daily Sales data in the R file and these are the codes I
used and the results they generated:
# DBSCAN Technique
# Defining outliers
sales_outliers <- c(50000, 300, 48000) # Two high sales and one low sale day
# Applying DBSCAN
# After applying DBSCAN, you can plot the results. Since you have only one dimension (Daily Sales),
# you might use a simple scatter plot, using the row indices as the x-axis and the sales values as
# the y-axis:
# In this plot, different colours will represent different clusters, and points that are outliers
# will be clearly distinguishable (typically in black if they are labeled as 0 by the algorithm).
Introducing the Aston Martin Lagonda Second Hand Car Price 2018 Data Set
We have done nothing wrong with what we have done here, but the data we were working with were created artificially. For simple tasks, artificial data can work well; then again, sometimes it cannot. In case it helps, let's take a look at some real data now, in the context of clustering the data.
Work through the code that follows and interpret the results you get: are they better than the
artificial Daily Sales Data, do you think?
# Applying DBSCAN to Aston Martin Lagonda Second Hand Car Price Data from 2018
library(dbscan)
summary(aml$Price)
str(aml)
head(aml, 10)
tail(aml, 10)
hist(aml$Price,
xlab = "Prices",
main = "Histogram of AML Second Hand Prices",
breaks = sqrt(nrow(aml))
) # set number of bins
mean(aml$Price)
# Applying DBSCAN
db_aml <- dbscan(aml_data_matrix, eps = 2675.9, minPts = 5)
# I found the eps value by using the eps_value function, see above
# View Results
print(db_aml)
# After applying DBSCAN, you can plot the results. Since you have only one dimension (Price),
# you might use a simple scatter plot, using the row indices as the x-axis and the price
# values as the y-axis.
# In this plot, different colours will represent different clusters, and points that are outliers
# will be clearly distinguishable (typically in black if they are labelled as 0 by the algorithm).
Strengths
No Assumption of Cluster Shapes: DBSCAN does not assume that clusters are spherical (as K-means does), making it suitable for detecting clusters of arbitrary shapes.
Handling Outliers: It effectively identifies outliers as noise points, which are not part of any cluster.
No Need to Specify Number of Clusters: Unlike K-means, DBSCAN does not require the user to
specify the number of clusters in advance.
Limitations
Parameter Selection: Choosing appropriate ε and MinPts can be challenging and greatly affects the
outcome.
Varying Density: DBSCAN can struggle with datasets where clusters have varying densities.
Excel is a powerful tool for data preparation and ensuring data quality is a crucial aspect of working
with Excel. Here are some professional tips and best practices for preparing data in Excel to maintain
high data quality:
1 Understanding Data Types: Excel supports different data types like text, numbers, dates and more.
Ensure that each column in your Excel sheet correctly represents the data type it is supposed to hold.
This improves accuracy and makes data analysis easier.
In case you are wondering, these are the types of files we can all open in Excel:
And these are the ways in which we can save our files in Excel:
2 Data Cleaning: This involves identifying and correcting (or removing) errors and inconsistencies in
data to improve its quality. This could include removing duplicates, correcting misspellings and
handling missing values appropriately.
Here we are talking about dirty data and throughout this book we will come across examples of where the data we have created or imported, or that have entered our system in some way, might be dirty. If data are dirty, we MUST clean them. More than that, data analysts and scientists will tell you that as much as 80 – 95% of their data analysis and science time is spent cleaning dirty data. Cleaning dirty data is a serious business.
1OOO
Ϯ000
I000
IOOO
They are all dirty because none of them is a value, even though they might look as if they are. Just copy and paste them all into an Excel file and then create a formula to multiply each of them by 1 or 10 or any number. Like this:
I know, some of them might be obviously fake numbers but dirty data often hides in plain sight.
In the file chapter_three.xlsx, go to the fortune worksheet where you will see a range of data that
has been extracted from the Fortune 1000 list of companies for 2023. Can you see any dirty data in
that range? In case it is not obvious for you, here is a clue that is sometimes really useful. Just look at
the column headings and their underlying data:
Industry | Rank | Name | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees | Change In Rank | Years On Global 500 List
General Merchandisers | 1 | Walmart | $572,754 | 2.40% | $13,673 | 1.20% | $244,860 | 2,300,000 | - | 28
Internet Services and Retailing | 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | 1,608,000 | 1 | 14
Computers, Office Equipment | 7 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | 154,000 | -1 | 20
Health Care: Pharmacy and Other Services | 10 | CVS Health | $292,111 | 8.70% | $7,910 | 10.20% | $232,999 | 258,000 | -3 | 27
Health Care: Insurance and Managed Care | 11 | UnitedHealth Group | $287,597 | 11.80% | $17,285 | 12.20% | $212,206 | 350,000 | -3 | 26
Look at the data in those columns and we can see that every value starts with $: click on any one of
those values in any one of those columns and then look at the Number format in the Home ribbon
and you will see this:
For a value to have the $ prefix, it should be showing as using the Currency or Accounting formats. In this example, these columns contain dirty data and the best way to clean it all is to use Find and Replace to replace the $ signs in all of those columns with nothing.
You can see that the Revenue, Profits and Assets columns are right aligned now and that suggests the
data are clean.
Another quick way to find potentially dirty data is to highlight it by selecting it and then taking a look
at how it appears on the Status Bar. Look at this:
Be ready, we will confront dirty data from time to time and I will work with you on sorting it out: just
be ready at all times!
3 Use of Formulas and Functions: Excel offers a wide range of formulas and functions for data
manipulation. There is no point making a list of the best functions there are in Excel because what is
important to you might not be important to me. Be ready, though, because Excel is developing very
quickly these days and in this book we will be working with the latest available functions as well as
those functions that have always been a part of Excel.
Dynamic Array Functions: these are relatively new but learning and using them will change your
approach to data analysis, including working with outliers:
SEQUENCE()
UNIQUE()
SORT()
SORTBY()
RANDARRAY()
FILTER()
XLOOKUP()
XMATCH()
Let me add the 14 additional functions that came along in August 2022:
TEXTSPLIT()
TEXTBEFORE()
TEXTAFTER()
VSTACK()
HSTACK()
CHOOSECOLS()
CHOOSEROWS()
DROP()
TAKE()
EXPAND()
TOCOL()
TOROW()
WRAPROWS()
WRAPCOLS()
Then there are the LAMBDA functions that I have already started to introduce in this book: BYCOL().
Here is the whole list:
MAP()
SCAN()
REDUCE()
MAKEARRAY()
BYCOL()
BYROW()
ISOMITTED()
4 Data Validation: Use Excel's data validation feature to set rules for what data can be entered into a
cell. For example, you can restrict a cell to only accept numeric values or dates or select from a drop
down list. This prevents incorrect data entry.
Good examples here include creating an input cell from Data Validation that includes ONLY, for
example,
By using Data Validation, we can be sure that we are only inputting data that is appropriate.
Watch out for data validation in this book to see how we use it and how beneficial it is. In the
meantime, open the file chapter_three.xlsx to see some basic examples.
On the DV worksheet, I have created three data validation cells for you, in cells B5, B7 and B9 and
there is a text box on that worksheet that illustrates how to create your own in rows 11, 13, 15. I
have not demonstrated it here but take a look at the Input Message and Error Alert options in Data
Validation to see how you might usefully use them.
Then, of course, consider how to use what you now know how to do!
Mastering Outliers in Data Analysis: A Practical Guide Using Excel and R
© Duncan Williamson 16th March 2024
Page 38 of 71
5 Consistent Formatting: Ensure consistent formatting across your dataset. This includes date
formats, decimal places for numbers and text casing: UPPER, lower, Proper.
6 Tables and Named Ranges: Use tables and named ranges to make data references clearer and
calculations more straightforward. This also helps in making your data more organised and easily
readable.
I will say here that you will hear some Excel users say that we should never use range names in Excel:
I disagree with that because in some cases, you cannot do without them. Learn how to use them and
use them sparingly: don’t create hundreds of range names for every file or you will cause yourself
problems.
There is the world of difference between a list or a range of data and the following table types:
• Excel Tables
• Pivot Tables
• Data Tables
Excel Tables
Create a list and with your cursor anywhere in that list press this key combination: Ctrl+T and you will
see a small dialogue box that confirms the range of your list and whether the first row of the range is
a header row … make sure they are right and change them if not. Then you will see this:
These days, because of laziness or a lack of understanding, many Excel users talk about a data table
when they mean a table of data. A data table is a very specific set up of data whereas a table of data
might just be an unformatted list of numbers and letters.
Excel tables are versatile and they contain a number of features that are not so obvious at first sight.
We will work with tables throughout this book and will learn much more about them.
Pivot Tables
Data Table
There are basically two forms of Data Table: one variable and two variables. Let’s begin with the one
variable version:
You can check on how a data table works by clicking on any cell in the range E4:H9 in this case and
you will see: {=TABLE(B5,B4)} and this confirms your row and column input cells. For the one variable
data table you will see {=TABLE(,B4)}, the column input cell.
7 Pivot Tables and Charts: Use pivot tables and charts for summarising data. They are powerful tools
for data analysis and can help in spotting trends, inconsistencies and outliers in your data.
Pivot Tables and Charts are subjects in their own right but here is a simple introduction for you.
On the PT worksheet in the chapter_three.xlsx file there is an Excel table called sales_dw and we will
use that to create a pivot table and a pivot chart.
In the dialogue box that opens, make sure it says sales_dw as the table name
Tell Excel to put your Pivot Table in cell I3
Click OK
Click on the down arrow next to the word All in cell J1 and you can select Hardware or Software. Click Software, then click OK and, as easy as that, you now have a Software report as opposed to the combined Hardware and Software report that you had initially.
Pivot Charts
At first you see this chart. Edit the chart to make it look much better!
We will work with Pivot Tables and Pivot Charts throughout this book so that was just your starter!
8 Documenting the Process: Keep a record of the steps taken in data preparation. This includes
formulas used, sources of data and any assumptions or rules applied. This documentation is vital for
future reference and for others who might use the data.
From my experience of my own work and the work of thousands of other people, documentation is the last thing on most people's minds. Yet it should be the first. Notice that I have documented chapter_three.xlsx as I have gone along: either in the file itself and/or in this Word file. Either way, you know what I am doing, step by step. Keep it that way: it should be your unbreakable habit to create documentation.
9 Auditing and Updating Your Files: Again, make this a habit: can we audit our own work? Yes we can! Should we audit our own work? No we shouldn't! It should be your policy for colleagues to validate everything that happens in your department or section. All of the time. Yes, it's time consuming, boring too, sometimes, but if you don't do it, you could end up on the list of Excel Blunders that Cost Millions!
The second reason why this point is so vital takes us back to the Dynamic Array Functions, the 14
new functions and the Lambda functions … they are new to you but you should be using them now.
Update your files for that reason alone.
10 Security and Sharing: Protect sensitive data using Excel’s security features like password
protection, and be cautious when sharing files. Ensure that the data shared respects privacy and
confidentiality agreements.
This really is a vital part of our work now as we are all so connected and interconnected. Malware, spyware, ransomware … look after yourself and your work.
11 Training and Continuous Learning: Excel is constantly evolving. Stay up to date with new features
and best practices through continuous learning and training.
12 Leveraging External Tools: Sometimes, Excel might not be sufficient for complex data preparation
needs. In such cases, be open to using external tools or add ons that complement Excel's capabilities.
Remember, data quality in Excel is not just concerned with technical aspects but also with attention
to detail, consistency and an ongoing commitment to maintaining the integrity of your data.
In the previous section, Preparing Data in Excel part 3, Use of Formulas and Functions, we identified
a large range of new functions that have been built into Excel and we can use some or all of them in
the context of full data analysis: dynamic array and other functions.
In this section, we are going to leverage Excel's general formula capabilities to identify data points
that deviate significantly from the norm. Outliers can significantly impact statistical analyses and data
interpretations, so detecting them is crucial for maintaining data quality. Here are key aspects to
consider:
For this section, we will use the Fortune 1000 2023 data from chapter_three.xlsx: on the fortune
worksheet.
1 Understanding Outliers: As we know, an outlier is a data point that differs significantly from other
observations. They can occur due to variability in measurement or experimental error and can also
indicate something significant like a new trend.
2 Basic Statistical Methods: You can use basic statistical methods to detect outliers. One common
method is to calculate the mean and standard deviation of your data set and then find points that fall
outside a certain range, typically beyond 1.5 to 3 standard deviations from the mean.
3 Using Excel Formulas: In Excel, you can use formulas to calculate the mean and standard deviation,
and then create conditional formulas to identify outliers. For example:
Mean: =BYCOL(D$6:I$127,LAMBDA(column,AVERAGE(column)))
Median =BYCOL(D$6:I$127,LAMBDA(column,MEDIAN(column)))
Standard Deviation: =BYCOL(D$6:I$127,LAMBDA(column,STDEV.S(column)))
Trim mean =BYCOL(D$6:I$127,LAMBDA(column,TRIMMEAN(column,0.1)))
Quartile 3 = Q3 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,3)))
Quartile 1 = Q1 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,1)))
Metric | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Mean | 90,828.66 | 20.79% | 10,238.92 | 143.57% | 315,873.77 | 148,454.89
Median | 59,936.50 | 12.80% | 5,832.00 | 46.25% | 92,556.00 | 79,000.00
SD | 86,590.35 | 25.64% | 15,678.60 | 406.86% | 691,169.81 | 263,712.43
Trimmean | 78,470.65 | 18.39% | 7,747.32 | 88.07% | 182,284.20 | 111,826.16
Q3 | 106,747.50 | 23.83% | 11,967.25 | 112.50% | 237,901.00 | 170,250.00
Q1 | 39,528.00 | 6.28% | 2,096.00 | 10.88% | 50,240.25 | 37,996.00
We can then use the following, as an example to help us to identify outliers in each column:
=IF(OR(data_point < (mean - k*stdev), data_point > (mean + k*stdev)), "Outlier", "Normal"), where k is typically 1.5 to 3.
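Outside Excel, the same rule can be sketched in a few lines of Python; this is only an illustration with invented data, using the sample standard deviation to match STDEV.S:

```python
def sd_flags(data, k=1.5):
    """Mirror of =IF(OR(x < mean - k*sd, x > mean + k*sd), "Outlier", "Normal")."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5  # sample SD, like STDEV.S
    lower, upper = mean - k * sd, mean + k * sd
    return ["Outlier" if x < lower or x > upper else "Normal" for x in data]

flags = sd_flags([10, 12, 11, 13, 12, 100], k=1.5)  # only the 100 is flagged
```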
We can apply this formula on a company by company basis, to find these examples:
Company | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Walmart | Outlier | Normal | Normal | Normal | Normal | Outlier
Amazon | Outlier | Normal | Normal | Normal | Normal | Outlier
Apple | Outlier | Normal | Outlier | Normal | Normal | Normal
CVS Health | Normal | Normal | Normal | Normal | Normal | Normal
UnitedHealth Group | Normal | Normal | Normal | Normal | Normal | Normal
Where k = 1.5: remember, we talked about John Tukey earlier and he suggested that a value of 1.5 was appropriate for this kind of analysis. We also saw that k = 3 is widely used, instead of or as well as k = 1.5.
Copy that right and down to fill the range of data and you have an opinion of whether the data for
each company appears to be an outlier.
Whilst we can do what we just did, on a company by company basis, it is more normal to identify
outliers overall, for the complete range of data: all Revenue, all Revenue Percent Change and so on.
We do this by using the Quartile 3 and Quartile 1 values in this way:
Q3 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,3)))
Q1 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,1)))
Then we complete our analysis of outliers as follows, by extending our analysis to include Lower
Inner Fences (LIF) and Upper Inner Fences (UIF):
Metric | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Mean | 90,828.66 | 20.79% | 10,238.92 | 143.57% | 315,873.77 | 148,454.89
Median | 59,936.50 | 12.80% | 5,832.00 | 46.25% | 92,556.00 | 79,000.00
SD | 86,590.35 | 25.64% | 15,678.60 | 406.86% | 691,169.81 | 263,712.43
Trimmean | 78,470.65 | 18.39% | 7,747.32 | 88.07% | 182,284.20 | 111,826.16
Q3 | 106,747.50 | 23.83% | 11,967.25 | 112.50% | 237,901.00 | 170,250.00
Q1 | 39,528.00 | 6.28% | 2,096.00 | 10.88% | 50,240.25 | 37,996.00
IQR | 67,219.50 | 0.18 | 9,871.25 | 1.02 | 187,660.75 | 132,254.00
Lower Inner Fence (LIF) | -61,301.25 | -0.20 | -12,710.88 | -1.42 | -231,250.88 | -160,385.00
Upper Inner Fence (UIF) | 207,576.75 | 0.50 | 26,774.13 | 2.65 | 519,392.13 | 368,631.00
Explanations
Lower Inner Fence for Revenues =Q1 – 1.5 * IQR = 39,528.00 – 1.5*67,219.50
Upper Inner Fence for Revenues =Q3 + 1.5 * IQR = 106,747.50 + 1.5*67,219.50
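If you want to check the fence arithmetic outside the spreadsheet, here is a small Python sketch with invented data; statistics.quantiles with method="inclusive" interpolates the same way as Excel's QUARTILE.INC:

```python
from statistics import quantiles

def tukey_fences(data, k=1.5):
    """Lower and Upper Inner Fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # matches QUARTILE.INC
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [5, 7, 8, 9, 10, 11, 12, 13, 14, 100]
lif, uif = tukey_fences(data)
outliers = [x for x in data if x < lif or x > uif]
```

With these invented values the fences work out at 1.5 and 19.5, so only the 100 falls outside them.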
Finally, we can find how many values, column by column, are outliers in this way:
Can we identify which ten Revenues values above the UIF are outliers? Which single value below the LIF for Profits Percent Change is the outlier? Let's consider the next section: Identifying Individual Outliers.
The data for the exercise are found on the data worksheet and the fully worked solution is on the
solutions worksheet
Data:
Initially, I simply used the FILTER() function to highlight the results of a Greater Than exercise. See the
formulas shown in cells L7 and L17
What I then show is whether there are any outliers in this salary table: is anyone earning significantly
less or more than anyone else: we see that here:
I used the BYCOL() function to identify the Quartile 3 (Q3) and the Quartile 1 (Q1) values, hence the
Inter Quartile Range, IQR of Q3-Q1
BYCOL() is a LAMBDA() function and such functions can be really easy to evaluate, as is the case here.
We modify the BYCOL() functions according to the column we are evaluating: Gross, PAYE, Net
Otherwise known as Tukey’s Fences, the Inner Fences give us boundaries in the data beyond which
we can find possible outliers.
The Inner Fence formulas are very straightforward as you can see in cells N24 and N25
Now that we have done the basic work, we can combine the COUNT() and FILTER() functions, as you can see in cells N28 and N29, to return the number of values in each column, Gross, PAYE and Net, that are classified as outliers.
In this example, there are no outliers and we can see that this is so by inspecting the graph of the Salary data we are reviewing:
To repeat the exercise with Tukey's outer fences, copy the Lower Inner Fence formula
=J21-1.5*J22
paste it in cell J31 and change it to
=J21-3*J22
then copy the Upper Inner Fence formula
=J20+1.5*J22
paste it in cell J32 and change it to
=J20+3*J22
I have included the Minimum and Maximum values of the three columns we are working on to help
us to read and appreciate the outlier results we have found:
4 Conditional Formatting: This is a handy tool in Excel for visually identifying outliers. You can set up
rules using the same statistical criteria to highlight outliers in your data set.
Conditional Formatting
A simple method of finding individual outliers when based on the <LIF and >UIF calculations is to use
Conditional Formatting. To achieve this, do the following, for example, for Revenues:
Click OK
Repeat, but for LESS THAN this time: choose cell N14, choose your colour preference and click OK.
All of the values that Excel has identified as outliers are now highlighted in red in the Revenues ($M)
column and here they are:
The ten largest companies by Revenue are the ten outliers in this column.
Repeat this process for all of the other variables and see what you find: choose the same colour
scheme or choose a different colour for every column, as you wish.
• Revenues ($M)
• Revenue Percent Change
• Profits ($M)
• Profits Percent Change
• Assets ($M)
• Employees
There should be 18 companies in the outlier list for Revenue Percent Change and in my case there
are. I formatted them as follows, where I show just the top 31 companies. Note, I have formatted my
results in a colour that contrasts with the Revenue ($M) column formatting.
Note: I have NOT shown all of the highlighted outliers in this screenshot.
If we know or believe that our data come from normally distributed data, we can use Z Score analysis
to find and then highlight any outliers in our data.
Excel has a function that we can use to make our life very easy: STANDARDIZE() but we can use a
formula instead of that: Z Score = (X - mean)/Standard Deviation. Working on the fortune_Z
worksheet, let’s explore both of these options.
I have put my column headers to start in cell M5 and copied that right
In cell M6 =STANDARDIZE(D6:D127,fortune!N$6,fortune!N$8)
and note that I am using the mean and standard deviation results from the fortune worksheet
Since I am using Excel 365, that formula SPILLs down automatically
I then fill that formula right, up to column R, the Employees column
That has given me the Z Scores for all data points for all variables from Revenues to Employees,
inclusive.
Point 1
Z scores of <-2, >2 or <-3, >3 are considered to be outliers but without conditional formatting, for
example, they might be difficult to identify.
Why do I talk about Z scores of <-2, >2 or <-3, >3 … what does that mean?
A cut off of ±2 is consistent with a 95% confidence interval and might be acceptable to you
A cut off of ±3 is consistent with a 99% confidence interval and might be more acceptable to you
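As a quick illustration of those cut offs, with invented data and the sample standard deviation (as STDEV.S uses):

```python
def z_scores(data):
    """(x - mean) / sd for every value, as Excel's STANDARDIZE() computes."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in data]

def count_outliers(data, cutoff):
    """How many points lie beyond +/-cutoff standard deviations from the mean."""
    return sum(1 for z in z_scores(data) if abs(z) > cutoff)

data = [10, 12, 11, 13, 12, 100]
n2 = count_outliers(data, 2)  # the +/-2 cut off
n3 = count_outliers(data, 3)  # the stricter +/-3 cut off
```

Here the extreme value has a Z score of about 2.04, so it is caught by the ±2 cut off but survives the ±3 cut off: exactly the granularity difference described above.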
In my case, I have chosen to highlight ±2 in yellow and from ±2 to ±3 in Light Red. Like this:
Similarly, for the Revenue Percent Change column, there are now only nine outliers and not the 18
we found when using Tukey’s LIF and UIF. But again, some are highlighted in red and some in yellow,
for greater granularity!
Variable | No of Outliers Tukey | No of Outliers Z Score
Revenues ($M) | 10 | 7
Revenue Percent Change | 18 | 9
Profits ($M) | 8 | 5
Profits Percent Change | 14 | 2
Assets ($M) | 14 | 6
Employees | 9 | 2
Point 2
There are many #VALUE! errors in the Profits Percent Change column: that is because of the lack of data in that column, which shows the character "-" where there is no value.
If we deleted the "-" character from the cells, a value would appear for the Z Score against those cells but that would be an error in itself. The best approach is to wrap all of our STANDARDIZE() functions in the IFERROR() function. Which you should do!
Note, on the fortune worksheet, those "-" in column G are all shown as being highlighted in Yellow … find a way to prevent that happening, since it is misleading.
Alternatively
And I have demonstrated for a few values on the fortune_Z worksheet, in columns.
5 Box and Whisker Plots: Excel allows you to create Box and Whisker plots, which are excellent for
visualising the distribution of your data and identifying outliers.
Here are the Box & Whisker plots for the fortune data, fortune_BW worksheet
I selected all of the data in the six columns we have been analysing and created a single Box &
Whisker plot from there. It’s OK but not so easy to read. So, I created six individual Box & Whisker
plots.
The Median Absolute Deviation (MAD) is a non parametric method for detecting outliers,
particularly useful in datasets that do not follow a normal distribution. Here's a detailed
explanation of how MAD works and how it can be implemented in Excel:
Understanding MAD
Concept: MAD measures the variability of a dataset and is less sensitive to outliers than standard
deviation or variance. It's based on the median, a robust measure of central tendency.
Find the Median: Use the MEDIAN function to find the median of your data. The median is already
included in the descriptive statistics found on the fortune worksheet.
Calculate Absolute Deviations: In cell M6, calculate the absolute deviation of the data point from the median. For example, the Revenue median is in cell fortune!N$7, so the formula in cell M6 is
=IFERROR(ABS(D6-fortune!N$7),"-")
Calculate the MAD: In cell M4 =MEDIAN(M6:M127): this is the MAD.
Outlier Detection
Threshold for Outliers: A common approach is to define outliers as those points that are a certain
number of MADs away from the median. A typical threshold is 2 or 3 times the MAD.
Flagging Outliers: Create a formula to flag data points that exceed this threshold. T6 =IF(M6 >2 *
M$4, "Outlier", "Normal"): you can drag that formula down and to the right to complete your
analysis, as follows:
Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Outlier | Normal | Normal | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Normal | Outlier
Outlier | Outlier | Outlier | Outlier | Outlier | Normal
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Normal | Outlier | Normal | Normal
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Outlier | Normal | Normal
In the row under the title Outlier Analysis you will find this formula, T4 =COUNTIFS(T6:T127,"Outlier"), and that counts the number of outliers identified by the MAD method using the threshold of 2.
I have put the threshold level in cell W3 and changing it to 3 rather than 2 gives us these results:
Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Outlier | Outlier | Outlier | Outlier | Normal
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Normal | Outlier | Normal | Normal
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Outlier | Normal | Normal
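The MAD rule applied above — flag any point whose absolute deviation from the median exceeds the threshold times the MAD — can be sketched in Python as follows; the data and thresholds are invented for illustration:

```python
from statistics import median

def mad_flags(data, threshold=2):
    """Flag points whose absolute deviation from the median exceeds threshold * MAD."""
    med = median(data)
    abs_dev = [abs(x - med) for x in data]
    mad = median(abs_dev)  # the Median Absolute Deviation
    return ["Outlier" if d > threshold * mad else "Normal" for d in abs_dev]

data = [10, 12, 11, 13, 12, 100]
flags2 = mad_flags(data, threshold=2)  # threshold of 2, as in cell W3
flags3 = mad_flags(data, threshold=3)  # the stricter threshold of 3
```

With these invented values only the extreme point is flagged at either threshold, but on skewed real data the two thresholds can disagree, as the two tables above show.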
Considerations
Robustness: MAD is more robust than standard deviation, making it suitable for skewed distributions or datasets with outliers: the Fortune data we are using here are not normally distributed, as you can see from the histograms below. We might conclude that the MAD method is the most appropriate of the methods we have used.
Analysing the entire 1,000 company fortune database might give us completely different insights, of
course.
In summary, MAD offers a straightforward and robust way to detect outliers, especially in non-
normally distributed datasets. Its implementation in Excel is relatively simple and can be a valuable
part of your data analysis toolkit.
A Modified Z Score is an alternative to the traditional Z score and is particularly useful for datasets
that are not normally distributed or when the sample size is small. It is a robust method of
identifying outliers, similar to the Median Absolute Deviation (MAD) method, but it incorporates
elements of the Z score approach.
Traditional Z Score: In a traditional Z score calculation, you subtract the mean from each data point
and then divide by the standard deviation. This measures how many standard deviations away a data
point is from the mean.
Modified Z Score: Instead of using the mean and standard deviation, the modified Z score uses the
median and MAD. The formula for the Modified Z-Score is
Mᵢ = 0.6745 (xᵢ − x̃) / MAD
Where
Mᵢ = the Modified Z Score of an individual value of a variable
0.6745 = a constant factor used to make the Modified Z Score more comparable to the traditional Z score in terms of identifying outliers under a normal distribution
xᵢ = an individual data value of a variable
x̃ = the median of the variable
MAD = the Median Absolute Deviation of the variable
Robustness to Skewness and Small Sample Sizes: Unlike the traditional Z score, which can be heavily
influenced by outliers (especially in a skewed distribution), the modified Z score is more robust
because it uses the median and MAD.
Better for Non Normal Distributions: Since it doesn't rely on the assumption of normality, the
modified Z score is suitable for a wider range of data types.
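Under the same definitions, the Modified Z Score is straightforward to sketch in Python (invented data; ±3.5 is the conventional threshold discussed shortly):

```python
from statistics import median

def modified_z_scores(data):
    """M_i = 0.6745 * (x_i - median) / MAD for every value."""
    med = median(data)
    mad = median(abs(x - med) for x in data)  # Median Absolute Deviation
    return [0.6745 * (x - med) / mad for x in data]

data = [10, 12, 11, 13, 12, 100]
scores = modified_z_scores(data)
flagged = [x for x, m in zip(data, scores) if abs(m) > 3.5]  # the +/-3.5 threshold
```

The extreme value scores about 59 and is flagged; the everyday values all score well inside ±3.5.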
The worksheet is exactly the same as for the MAD except that the results in columns T:Y are
changed to illustrate the Modified Z Score.
This means that the MAD values are the same as before. I have added the median values for each
variable in the range D4:I4
Median | 59,936.50 | 0.13 | 5,832.00 | 0.46 | 92,556.00 | 79,000.00
Then the results of applying the Modified Z Score analysis to our Fortune data are on the
fortune_mod_Z worksheet
Modified Z Threshold
Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
11 | 13 | 8 | 14 | 16 | 9
14.39 | -0.80 | 1.26 | -0.71 | 1.68 | 28.03
11.51 | 0.68 | 4.42 | 0.16 | 3.62 | 19.30
8.59 | 1.57 | 14.27 | 0.29 | 2.85 | 0.95
6.52 | -0.31 | 0.33 | -0.57 | 1.55 | 2.26
6.39 | -0.08 | 1.84 | -0.53 | 1.32 | 3.42
6.34 | 3.42 | 2.76 | - | 2.72 | -0.20
6.07 | -0.02 | 13.49 | 1.02 | 9.56 | 3.70
5.73 | -0.15 | -0.76 | - | -0.32 | -0.16
5.55 | 2.18 | 11.28 | 0.67 | 2.94 | 0.98
4.32 | -0.01 | -0.69 | - | -0.39 | -0.49
3.82 | 0.36 | -0.13 | -0.33 | -0.37 | 2.64
3.20 | -0.33 | -0.08 | -1.30 | 0.69 | -0.08
(the first row of numbers gives the count of outliers per column at the ±3.5 threshold; "-" indicates a missing value)
In the range T4:Y4, I have entered a formula to count the number of outliers in each column. For example, in T4:
=COUNTIFS(T6:T127,"<"&-3.5)+COUNTIFS(T6:T127,">"&3.5)
Copy that formula right, to Y4, to see the results of the Modified Z Score method.
Threshold for Outliers: Similar to the traditional Z score, a common threshold for identifying outliers
with the modified Z score is a value greater than 3.5 or less than -3.5, although this threshold can be
adjusted based on the context and the level of sensitivity desired. I have put a place holder in the
worksheet that will allow us to modify the threshold from ±3.5 although I have not used it here. You
are free to make the necessary change to cope with that!
Conditional Formatting: rather than programming the range T6:Y127 to show the words Outlier or
Normal, I have used conditional formatting to highlight where a value represents an outlier.
Interpretation: Data points with a modified Z score beyond the chosen threshold are considered
outliers. This method provides a balanced approach, mitigating the impact of extreme values while
still identifying significant deviations from the central tendency.
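The COUNTIFS logic above, counting values beyond ±3.5, is easy to mirror in Python if you want to check the worksheet counts programmatically. This is a sketch with my own function name; the sample values are the first row of the Modified Z Score excerpt:

```python
def count_outliers(scores, threshold=3.5):
    """Count scores beyond the threshold, mirroring
    COUNTIFS(range,"<"&-3.5) + COUNTIFS(range,">"&3.5)."""
    return sum(1 for s in scores if s < -threshold or s > threshold)

count = count_outliers([14.39, -0.80, 1.26, -0.71, 1.68, 28.03])
# 14.39 and 28.03 exceed +3.5, so the count here is 2
```

Changing the `threshold` argument plays the same role as the threshold placeholder cell mentioned above.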
If we compare the three outlier methodologies we have used here, this is what we find (number of outliers found by each method):

Variable                  Tukey    Z Score    Modified Z Score
Revenues ($M)               10        7            11
Revenue Percent Change      18        9            13
Profits ($M)                 8        5             8
Profits Percent Change      14        2            14
Assets ($M)                 14        6            16
Employees                    9        2             9
7 Consider the Context: Always interpret outliers in the context of your specific data. In some cases,
what appears to be an outlier could be a valid data point that is essential for your analysis.
8 Refining Your Analysis: After detecting outliers, decide whether to keep them, adjust them or
remove them, depending on your analysis's purpose and the nature of the outliers.
9 Automation and Advanced Techniques: For large datasets or more complex analyses, consider
using VBA (Visual Basic for Applications) to automate outlier detection or integrate Excel with other
tools like R or Python for more sophisticated statistical analyses.
In summary, using formulas for outlier detection in Excel requires a mix of statistical understanding and practical Excel skills. As we have seen in this section, knowing the right formulas and techniques to apply helps us to interpret the results in the context of our data and analysis goals.
Open the file chinese_wells_data.xlsx and carry out the following instructions:
This case study comprises a dataset relating to a company that digs and lines wells. What the company wants to know is: what should its standard cost of fuel be, given the independent variables provided?
Copy the data tab, to ensure we always have a clean copy of the input data
Set up the correlation matrix for these data. The correlation matrix I have created is a dynamic
matrix which means that if any of the input data were to change, the correlation matrix would
update automatically with such changes.
=CORREL(INDIRECT(K$2&$K$10&":"&K$2&$K$11),INDIRECT($I4&$K$10&":"&$I4&$K$11))
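The same dynamic-matrix idea can be sketched in Python: compute the Pearson correlation, which is what CORREL() returns, for every pair of columns. The function names and the small dataset below are placeholders of my own, not the case-study values:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, as Excel's CORREL()."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlation of every pair of named columns (dict: name -> list)."""
    return {(a, b): pearson(columns[a], columns[b])
            for a in columns for b in columns}

matrix = correlation_matrix({"x": [1.0, 2.0, 3.0, 4.0],
                             "y": [2.1, 3.9, 6.2, 7.8]})
```

Rebuilding the dictionary after any change to the input data is the Python equivalent of the matrix updating automatically in the worksheet.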
Now we can set up the descriptive statistics table for the data, using the BYCOL() function, as before:
Fill K14:P14 down to row 26 and then for every line item, change AVERAGE to the appropriate
alternative. For example, in row 16 change AVERAGE to MEDIAN, in row 17 change AVERAGE to
MODE and so on
Notice that we need different formulas for the Standard Error and for the Range, which are:
Standard Error
Range
Standardised Residuals
Standardised residuals essentially use Z Scores to alert us to the possibility of a residual value being an outlier. As part of this analysis, we assume that the data we are working on come from the normal distribution.
A standardised residual greater than 2 in absolute value suggests that the residual falls outside the 95% confidence level of the data set from which it came; an absolute value greater than 3 suggests that it falls outside the 99.7% confidence level.
We are about to find these values and then illustrate the results at both the ±2 and the ±3 levels of confidence.
In cell I4 set up the LINEST function to create a regression model of the data
If we don’t use IFERROR(), we don’t make a mistake but the output we get looks a bit scruffy.
Notice the most important point: in the range A3:F3, we see the variable names NoWells, Dist, Wt etc., but in the LINEST() output the coefficients appear in the reverse order. It's not an error; that is what Excel does for us. Just remember it and work with it.
We apply this model equation on a line-by-line basis as follows, for line 1, by creating this formula in cell G4:
Y = 31,271.8877 + 3,009.7950 × 18 − 75.8198 × 1,574 − 84.4036 × 1,077.81 + 2.7906 × 31,819 + 0.9610 × 94,713.32
Y = 54,959.22
We just fill down that formula to cell G43 and they are our estimates of the standard cost of fuel for
this example.
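To make the line-by-line calculation explicit, here is the same equation applied in Python. The coefficients are the rounded LINEST values quoted above, so the result only approximates the 54,959.22 shown in the worksheet; the function name is mine:

```python
# Rounded coefficients from the LINEST output quoted in the text:
# intercept first, then one slope per predictor in worksheet order.
INTERCEPT = 31271.8877
SLOPES = [3009.7950, -75.8198, -84.4036, 2.7906, 0.9610]

def predict(row):
    """Apply the fitted regression equation to one line of data."""
    return INTERCEPT + sum(b * x for b, x in zip(SLOPES, row))

estimate = predict([18, 1574, 1077.81, 31819, 94713.32])  # line 1 of the data
```

Because the coefficients are rounded to four decimal places, `estimate` lands within a few pounds of the worksheet value rather than matching it exactly.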
The Residuals column helps us with the first question: see column H, where
Residual = Actual Value − Estimated Value
The Standardised Residuals column helps us with the second question: see column I, where
Standardised Residual = Residual Value / Standard Error of the Residuals*
For purposes like outlier detection and understanding the relative magnitude of the residuals, using
the standard error of the residuals to calculate standardised residuals is more appropriate. This
approach normalises the residuals in terms of the expected distribution of errors, making it easier to
identify those observations that are significantly deviating from the model's predictions
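The same calculation can be sketched in Python with the standard library, dividing each residual by the standard error of the residuals and flagging those beyond ±2. The function name and the data are illustrative, not the case-study values:

```python
import statistics

def standardised_residuals(actual, predicted):
    """Residual / standard error of the residuals, for each observation."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    se = statistics.stdev(residuals)  # sample standard deviation of residuals
    return [r / se for r in residuals]

z = standardised_residuals(
    [10.0, 12.0, 11.0, 10.0, 12.0, 11.0, 30.0],
    [10.5, 11.5, 11.5, 10.5, 11.5, 11.5, 12.0])
flagged = [abs(v) > 2 for v in z]  # outside the ~95% band
```

Only the last observation, with its large residual, is flagged; the rest sit well inside the ±2 band.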
We can now highlight the results of the Standardised Residuals using conditional formatting, for
example, in this way:
±2 or ±3? I have programmed both, which is why you can see the following:
Excel has a built-in STANDARDIZE() function to help us to find the Z Scores that give insights into the outlier situation of each line item. You can find this on the data_ZScore worksheet:
From a much larger database, the Fortune 1000 for 2023, here are some of the results of the
application of the STANDARDIZE() function to the values of the variables included below, which is
followed by the histograms that describe these results:
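STANDARDIZE() simply computes (x − mean) / standard deviation; a Python sketch with my own function name and illustrative data:

```python
import statistics

def z_scores(values):
    """(x - mean) / stdev for each value, like Excel's STANDARDIZE()."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

z = z_scores([2.0, 4.0, 6.0, 8.0])
```

By construction the resulting scores have mean zero, and symmetric data give symmetric scores.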
Grubbs’ Test
If we believe that the maximum value in a variable or the minimum value on a variable is an outlier,
assuming the data are normally distributed, we can apply the Grubbs’ Test to check whether that is a
valid hypothesis.
In chapter_three.xlsx, on the chinese_grubbs worksheet, I have tested whether the NoWells value of 3 for well D1 (row 31 of the table) is an outlier, since it is so small: the minimum value.
If you suspect that the maximum value in the dataset is an outlier, the test statistic is calculated as:
G = (x_max − x̄) / s
If you suspect that the minimum value in the dataset is an outlier, the test statistic is calculated as:
G = (x̄ − x_min) / s
Mastering Outliers in Data Analysis: A Practical Guide Using Excel and R
© Duncan Williamson 16th March 2024
Page 63 of 71
And if you're not sure whether the maximum value or the minimum value in the dataset is an outlier and you want to perform a two-sided test, then the test statistic is calculated as:
G = max|x_i − x̄| / s
where x̄ is the sample mean and s is the sample standard deviation
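All three Grubbs statistics can be sketched in a few lines of Python; the function name and data are illustrative, not the chinese_wells values:

```python
import statistics

def grubbs_statistics(values):
    """Grubbs' G for the minimum, the maximum and the two-sided case."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    g_min = (mean - min(values)) / s                 # minimum suspected
    g_max = (max(values) - mean) / s                 # maximum suspected
    g_two = max(abs(x - mean) for x in values) / s   # two-sided
    return g_min, g_max, g_two

g_min, g_max, g_two = grubbs_statistics([8.0, 9.0, 10.0, 11.0, 12.0, 25.0])
```

Each G is then compared with the appropriate critical value for the chosen significance level and sample size; the critical-value lookup is not shown here.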
Results
Firstly, draw the histogram to check whether the NoWells data are from a normal distribution:
The data are not entirely normal and the skewness coefficient is -0.59378 but we will assume the
data come from a normal distribution so that we can carry on with Grubbs!
All of the work we needed to have done is shown in column J in the screenshot that follows:
The test statistics, G, in cells J8:L8 are 1.6893, 2.9950 and 2.9950. Compared with their critical values:
1 less than its critical value
2 greater than its critical value
3 greater than its critical value
The generalized (extreme Studentized deviate) ESD test (Rosner 1983) is used to detect one or more
outliers in a univariate data set that follows an approximately normal distribution.
The primary limitation of the Grubbs test [and the Tietjen-Moore test] is that the suspected number
of outliers, k, must be specified exactly. If k is not specified correctly, this can distort the conclusions
of these tests. On the other hand, the generalized ESD test (Rosner 1983) only requires that an upper
bound for the suspected number of outliers be specified.
Given the upper bound, r, the generalized ESD test essentially performs r separate tests: a test for
one outlier, a test for two outliers, and so on up to r outliers.
R_i = max_i |x_i − x̄| / s
with x̄ and s denoting the sample mean and sample standard deviation, respectively.
Remove the observation that maximises |x_i − x̄| and then recompute the above statistic with n − 1 observations. Repeat this process until r observations have been removed. This results in the r test statistics R_1, R_2, ..., R_r.
Corresponding to the r test statistics are r critical values:
λ_i = (n − i) t_{p,ν} / sqrt((n − i − 1 + t_{p,ν}²)(n − i + 1)),  i = 1, 2, ..., r
where t_{p,ν} is the 100p percentage point from the t distribution with ν degrees of freedom and
ν = n − i − 1,  p = 1 − α / (2(n − i + 1))
The number of outliers is determined by finding the largest i such that R_i > λ_i.
Simulation studies by Rosner indicate that this critical value approximation is very
accurate for n ≥ 25 and reasonably accurate for n ≥ 15.
Note that although the generalized ESD is essentially Grubbs' test applied sequentially, there are a few important distinctions:
• The generalized ESD test makes appropriate adjustments to the critical values based on the number of outliers being tested for, which the sequential application of Grubbs' test does not.
• If there is significant masking, applying Grubbs' test sequentially may stop too soon. The example below identifies three outliers at the 5% level when using the generalized ESD test. However, trying to use Grubbs' test sequentially would stop at the first iteration and declare no outliers.
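The sequence of ESD test statistics R_i can be sketched in pure Python. Note this computes only the R_i values; the critical values λ_i need the t distribution's percent point function (available in, for example, scipy.stats.t.ppf), which I have left out to keep the sketch standard-library only. The function name and data are my own:

```python
import statistics

def esd_statistics(data, r):
    """Generalized ESD: the r test statistics R_1..R_r, each computed
    after removing the previous most extreme observation."""
    values = list(data)
    results = []
    for _ in range(r):
        mean = statistics.mean(values)
        s = statistics.stdev(values)
        # most extreme observation relative to the current mean
        extreme = max(values, key=lambda x: abs(x - mean))
        results.append(abs(extreme - mean) / s)
        values.remove(extreme)  # recompute on the remaining n - 1 points
    return results

r_stats = esd_statistics([1.0, 2.0, 3.0, 4.0, 5.0, 50.0], r=2)
```

The number of outliers declared would then be the largest i for which R_i exceeds its λ_i.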
Impact Analysis
I have included Impact Analysis on the impact worksheet of the file chapter_three.xlsx
The outliers I have assumed for this demonstration are the ones we found on the ESD worksheet
when we used the IQR analysis and found the upper and lower inner fences for the dataset on that
worksheet.
• I have repeated the entire data set in the range A4:A57, complete with outliers
• I have shown the dataset less the outlier values in the range B4:B54.
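The inner fences used on the ESD worksheet follow the usual Tukey rule, Q1 − 1.5·IQR and Q3 + 1.5·IQR; a Python sketch with my own function name and illustrative data:

```python
import statistics

def tukey_inner_fences(values, k=1.5):
    """Lower and upper inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # exclusive quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_inner_fences(list(range(1, 12)))  # data 1..11
outliers = [x for x in [0, 5, 40] if x < low or x > high]
```

Note that Excel's QUARTILE.INC and Python's default "exclusive" quantile method interpolate slightly differently, so the fences may not match the worksheet to the last decimal.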
Firstly, evaluate the mean, median and standard deviation of these two datasets:
We then use the Data Analysis ToolPak to carry out the t-Test: Two-Sample Assuming Unequal Variances, which is already done for you:
To use a t-test and p-values for hypothesis testing to assess the potential impact of
removing outliers from your data, you're essentially comparing the means of two
datasets: one with outliers and one without.
Formulate Hypotheses
Null Hypothesis (H0): The mean of the dataset with outliers is equal to the mean of the dataset
without outliers. (ie Outliers do not significantly affect the mean).
Alternative Hypothesis (H1): The mean of the dataset with outliers is not equal to the mean of the
dataset without outliers. (ie Outliers significantly affect the mean).
Dataset without Outliers: This dataset is created by removing outliers from the original dataset.
Then we can perform an independent samples t-test if your data points are independent of each
other, using Excel's Data Analysis ToolPak
P-value: This value helps you determine the significance of your results.
If the p-value is less than your chosen significance level (eg 0.05), you reject the null hypothesis,
suggesting that outliers have a significant impact on the mean.
If the p-value is greater than the significance level, you fail to reject the null hypothesis, suggesting
no significant impact from outliers.
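The t statistic and the degrees of freedom that the ToolPak reports can be sketched in Python with the standard library. Computing the p-value itself needs the t distribution (for example scipy.stats.t.sf), which I have omitted; the function name and data are illustrative:

```python
import math
import statistics

def welch_t(sample1, sample2):
    """t statistic and degrees of freedom for the two-sample t-test
    assuming unequal variances (Welch), as Excel's ToolPak runs it."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    se_sq = v1 / n1 + v2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(se_sq)
    # Welch-Satterthwaite degrees of freedom
    df = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

t_stat, df = welch_t([5.0, 7.0, 9.0, 11.0], [4.0, 6.0, 8.0, 10.0])
```

With the outliers-in and outliers-out datasets as the two samples, `t_stat` and `df` correspond to the "t Stat" and "df" rows of the ToolPak output.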
Considerations
Assumptions: Ensure your data meets the assumptions of the t-test, such as normality. Outliers can
impact normality, so it’s important to check this.
Effect Size: Alongside the t-test, consider calculating the effect size (like Cohen's d: see below for the
results of this further test) to understand the magnitude of the difference.
Contextual Interpretation: Always interpret the results in the context of your specific data and
research question. Statistical significance doesn't always equate to practical significance.
Remember, removing outliers can significantly impact your analysis, so the decision to remove them
should not be taken lightly and should be well justified in the context of your specific data and
analysis objectives.
Hypothesized Mean Difference: 0
This implies that you are testing the hypothesis that there is no difference in the means of the two groups (with and without outliers).
df: 99
This suggests that the total number of observations across both groups (minus the number of groups) is 99. Degrees of freedom are used to determine the critical value of t.
t Statistic: 0.9432
The t Statistic is a measure of the difference between the two group means in terms of the standard
error. A t Statistic of 0.9432 suggests that the difference between the means is less than one
standard error away from zero.
P(T<=t) one-tail: 0.1740
This is the p-value for a one-tailed test. It indicates the probability of observing a t Statistic as extreme as 0.9432 if the null hypothesis is true. In this case, there's a 17.4% chance of finding a t Statistic this extreme due to random chance.
t Critical one-tail: 1.6606
This is the critical value of t for a one-tailed test at the chosen significance level (usually 0.05). Since
your t Statistic (0.9432) is less than the critical value, you do not reject the null hypothesis in a one-
tailed test.
P(T<=t) two-tail: 0.3479
This is the p-value for a two-tailed test. It's the probability of observing a t Statistic as extreme as 0.9432 in either direction if the null hypothesis is true. A p-value of 34.79% is quite high, indicating that the difference in means is most likely due to chance.
This is the critical value of t for a two-tailed test. Again, your t Statistic is less than this critical value.
Interpretation
Since the p-values for both the one-tailed (0.1740) and two-tailed (0.3479) tests are higher than the common significance levels (0.05 for one-tailed, 0.10 for two-tailed), you fail to reject the null hypothesis in both cases.
This suggests that there's no statistically significant difference between the means of your two
datasets (with and without outliers). In other words, the outliers do not have a significant impact
on the mean of your dataset.
It's important to remember that "failing to reject the null hypothesis" is not the same as proving the
null hypothesis is true. It simply means there isn't enough evidence in your sample data to conclude
a significant difference due to outliers.
Always consider these results in the context of your specific data and research question. While
statistically, outliers might not impact the mean significantly, they could still be relevant for your
specific case, especially if they represent true but rare events.
Formally, the toolpak uses the Welch–Satterthwaite formula to evaluate the number of degrees of freedom:
ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)² / (n1 − 1) + (s2²/n2)² / (n2 − 1) ]
Where:
s1² and s2² are the sample variances, and n1 and n2 are the sample sizes, of the two groups.
The Excel ToolPak automatically calculates these more complex degrees of freedom when you select the option for "t-Test: Two-Sample Assuming Unequal Variances". Thus, the degrees of freedom being 98 in your results reflects the underlying calculations for the two sample variances and the adjustments they necessitate.
This calculation adjusts the degrees of freedom downward when there are large discrepancies in
variance or sample size between the two groups, which helps keep the test accurate under these
conditions.
Cohen’s d
I mentioned Cohen’s d above which is a measure of effect size used to indicate the standard
difference between two means. It's calculated by subtracting the mean of one group from the mean
of another and dividing the result by a pooled standard deviation. It is easy to find Cohen’s d in Excel.
See the impact worksheet, as follows:
Interpreting Cohen's d
The value of Cohen's d reflects the size of the effect. A common interpretation is that d ≈ 0.2 indicates a small effect, d ≈ 0.5 a medium effect and d ≈ 0.8 or more a large effect.
Remember, Cohen's d is a measure of effect size, not statistical significance. It tells you how big the
difference is, not whether that difference is statistically significant.
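Cohen's d, as described above, is simple to sketch in Python using the pooled standard deviation; the function name and data are illustrative:

```python
import math
import statistics

def cohens_d(sample1, sample2):
    """Cohen's d: difference in means over the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(sample1) - statistics.mean(sample2)) / pooled_sd

d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
# means differ by 1 and the pooled standard deviation is 2, so d = 0.5
```

Here d = 0.5, which the conventional benchmarks would call a medium effect.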
I have been working in and around management accounting, financial accounting, financial
reporting, spreadsheeting and spreadsheet and financial modelling for a long time! I first started
working with spreadsheets on a Commodore 64 computer. I then graduated to a Windows powered
desktop computer on which I used Lotus 1-2-3 spreadsheet software, finally moving on to Microsoft
Excel somewhere around 1993 – 1994.
I have worked as an accountant, a college lecturer, a business school professor, head of department, head of projects, self-employed consultant, author, webmaster, blogger, freelance teacher and trainer.
I love travelling here and there so the three years that covid kept me at home were strange and
strained!
I have lived and worked in North America, Asia, Africa, Europe and Oceania and I currently live in SE
Asia and have done for around a decade or so.
Although I am by no means a professional statistician, I use some aspects of statistics as part of the
work I do: descriptive statistics, regression analysis, numerical and ratio analysis of data, visualisation
and so on. I carry out outlier analysis from time to time in such work, when I feel the need to know if
something is out of kilter with the rest of the data!
I am a lifelong learner and I keep myself busy with my formal work and with working on the kinds of
projects that reading a variety of business, accounting and analytics blogs and web sites can
generate. I also answer questions on www.quora.com as well as elsewhere and all of these ideas
keep me interested in switching on my laptop on most days.
I have a smart phone that I tend to use for many things but phoning is not one of them! The best way
of contacting me is via LinkedIn, just search for me by name, my blog is called Duncan's Diurnal
Diatribe (duncanwil.blogspot.com) and you can contact me through that. It is best, though, if you use
my email address, which is duncanwil@gmail.com and if you write to me, I will reply, providing it is
clear what you want me to do!