
Mastering Outliers in Data Analysis: A Practical Guide Using Excel and R

Duncan Williamson
© 12th April 2024

This is a free-of-charge proof edition and your comments are welcome: please write
to me at duncanwil@gmail.com with your suggestions
Contents
Chapter 1: Understanding Outliers
- Definition of Outliers
Examples of Outliers
  1 Medical Data Example
  2 Financial Data Example
  3 Educational Data Example
- Types of Outliers (Point, Contextual, Collective)
  1. Point Outliers (or Global Outliers)
  2. Contextual Outliers (or Conditional Outliers)
  3. Collective Outliers
  4. Masked Outliers
  5. Influential Outliers
- Causes and Impacts of Outliers in Data
  - Causes of Outliers
  - Impacts of Outliers
  - Addressing Outliers
- Outliers: Bane or Boon?
Chapter 2: Outlier Detection Techniques
- Basic Statistical Methods (Standard Deviation, IQR)
- Visualisation Methods (Box Plot, Scatter Plot)
  Standard Deviation
  The Box & Whisker Plot
  The InterQuartile Range: IQR
- Advanced Techniques: DBSCAN, Clustering with Power BI
Chapter 3: Excel for Outlier Analysis
- Preparing Data in Excel
- Using Formulas for Outlier Detection
  Identifying Individual Outliers
- Case Study: Outlier Analysis in a Business Context with Excel
  Multiple Regression: Chinese Wells
  Standardised Residuals
  Standard Error
  Range
  Z Score Results: outlier detection

  Grubbs’ Test
  Results
  Generalised Extreme Studentised Deviate (ESD) test
  Impact Analysis
  Formulate Hypotheses
  Interpret the Results
  Considerations
  Interpretation of the Example
  Interpretation
  A Note on Degrees of Freedom (df)
  Cohen’s d
  Interpreting Cohen's d
About the Author
- Background and experience
- Contact information and professional links

Chapter 1: Understanding Outliers
- Definition of Outliers

An outlier is a data point that differs significantly from other observations in a dataset. It appears to
deviate markedly from other members of the sample in which it occurs. Outliers can occur by chance
in any distribution but they are often indicative either of measurement error or of the population
having a heavy tailed distribution. In the latter case, the outliers represent a phenomenon that is also
of interest for analysis.

Formally, an outlier can be defined in several ways, often based on statistical measures. For example,
a common rule of thumb flags any data point more than 1.5 interquartile ranges (IQR) below the first
quartile or above the third quartile. In terms of standard deviation, a data point is often treated as an
outlier if it lies more than two (or, more conservatively, three) standard deviations away from
the mean.

However, the definition of an outlier is subjective and varies from one field of study to another. In
practice, whether a data point is treated as an outlier depends not only on its numerical value but
also on the context of the data and the specific analysis being performed.

Examples of Outliers

Here are three examples from different contexts to illustrate what an outlier might look like. You
will find the data sets and graphs in the files chapter_one.xlsx and outliers_chapter_one.R.

1 Medical Data Example

- Context: Imagine a dataset containing the resting heart rates of a group of healthy adults aged 25
- 40, where most values cluster around 60 - 80 beats per minute (bpm).

- Outlier: If one individual in this group has a resting heart rate of 140 bpm, this value would be
considered an outlier. It's significantly higher than the norm for this specific population and might
indicate an underlying health issue or measurement error.

This dataset includes a total of 100 entries, with the majority of the resting heart rates clustered
around the 60 - 80 bpm range. However, it also contains outliers, such as a very high value of 140
bpm and a low value of 55 bpm. These outliers could represent special cases or errors and would be
points of interest in an analysis focused on identifying and understanding outlier data.

The following R code helps us to illustrate these data:

# Exploring outliers with heart rates

# Load necessary libraries
library(tibble)
library(ggplot2)

# Setting seed for reproducibility
set.seed(0)

# Generating normal heart rates (around 60-80 bpm)
heart_rates_normal <- rnorm(98, mean = 70, sd = 5)

# Defining outliers
heart_rates_outliers <- c(140, 55)

# Combining normal heart rates with outliers
heart_rates <- c(heart_rates_normal, heart_rates_outliers)

# Creating a data frame
df_heart_rates <- tibble(Resting_Heart_Rate_bpm = heart_rates)

# Displaying the first and last few rows of the dataset
head(df_heart_rates)
tail(df_heart_rates, 10)

# Plotting the data
ggplot(df_heart_rates, aes(x = seq_along(Resting_Heart_Rate_bpm), y = Resting_Heart_Rate_bpm)) +
  geom_point() +
  geom_hline(yintercept = mean(df_heart_rates$Resting_Heart_Rate_bpm),
             linetype = "dashed", color = "blue") +
  labs(title = "Resting Heart Rate Data",
       x = "Observation Number",
       y = "Resting Heart Rate (bpm)") +
  theme_minimal()

This R code does the following:

• It sets a random seed for reproducibility.
• It generates 98 normal heart rates centred around 70 bpm with a standard deviation of 5.
• It defines two outliers, one high (140 bpm) and one low (55 bpm).
• It combines the normal heart rates and outliers into a single vector.
• It creates a data frame named df_heart_rates with these values.
• Finally, it displays the first few rows of the data frame using the head function and plots the data.

Here is the plot of the data from the above:

The following additional code saves the data in the example to a csv file in your R working directory:

# Exporting the data frame to a CSV file

write.csv(df_heart_rates, file = "Resting_Heart_Rates.csv", row.names = FALSE)

I created the following graphs in Excel, for comparison:

2 Financial Data Example

- Context: Consider a dataset of daily sales figures for a retail store over a year. The sales usually
range between $1,000 and $3,000 per day.

- Outlier: If on one particular day, the sales record shows $50,000, this would be an outlier. This
could be due to a special event or an error in data recording. In contrast, if the store had a well-
advertised, major annual sale on that day, this high value might not be considered an outlier in the
context of expected sales spikes during promotional events.

The R code for this example is:

# Financial Data Example

# Load necessary libraries
library(tibble)
library(ggplot2)

# Setting seed for reproducibility
set.seed(0)

# Generating normal daily sales (mostly between $1,000 and $3,000)
sales_normal <- rnorm(363, mean = 2000, sd = 500)

# Defining outliers
sales_outliers <- c(50000, 300, 48000) # Two high sales days and one low sales day

# Combining normal sales with outliers
daily_sales <- c(sales_normal, sales_outliers)

# Creating a sequence of dates for the sales data
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = length(daily_sales))

# Creating a data frame
df_daily_sales <- tibble(Date = dates, Daily_Sales = daily_sales)

# Displaying the first and last few rows of the dataset
head(df_daily_sales)
tail(df_daily_sales, 10)

# Plotting the daily sales data
ggplot(df_daily_sales, aes(x = Date, y = Daily_Sales)) +
  geom_line() +               # Line to show trends over time
  geom_point(color = "blue") + # Points to show each day's sales
  labs(title = "Daily Sales Data for a Retail Store",
       x = "Date",
       y = "Daily Sales ($)") +
  theme_minimal()

# Exporting the data frame to a CSV file
write.csv(df_daily_sales, "Daily_Sales_Data.csv", row.names = FALSE)

In this example:

• I've used a normal distribution to generate sales figures predominantly in the $1,000 to
$3,000 range.
• I've added a few outliers to represent atypical sales days, both extremely high and extremely
low.
• The dataset is set against a sequence of dates to mimic daily sales over a year.
• You can adjust the mean, standard deviation and outliers to better fit your specific scenario
or example.

This code will also generate a time series plot with:

• The x-axis representing the dates.


• The y-axis representing the daily sales figures.
• A line connecting the daily sales to show trends over time.
• Points for each day's sales to clearly mark individual data points, including outliers.

Finally, the code creates a .csv file that is saved to the R working directory: Daily_Sales_Data.csv

The graph plotted by ggplot2 is shown here:

And I created the following chart in Excel:

3 Educational Data Example

- Context: In a standardised test taken by thousands of students, scores are normally distributed
with most students scoring around the median.

- Outlier: If a student scores significantly higher or lower than the majority, this score could be
considered an outlier. For instance, if most scores range between 60 - 85 out of 100, but one student
scores 5 or 100, these scores would be outliers. The low score might indicate a lack of understanding
or an issue with the test taking, while the perfect score could be exceptionally rare and noteworthy.

To demonstrate this example in R, I'll create a synthetic dataset representing the standardised test
scores of students, with most scores clustering around the median (in the 60-85 range) and a few
outliers. Here's the R code to generate and visualise this data:

# Educational Data Example

# Load necessary libraries
library(tibble)
library(ggplot2)

# Setting seed for reproducibility
set.seed(0)

# Generating normal test scores (mostly between 60 - 85)
test_scores_normal <- rnorm(998, mean = 72.5, sd = 7.5)

# Defining outliers
test_scores_outliers <- c(5, 100, 6, 98) # A few extremely low and high scores

# Combining normal test scores with outliers
test_scores <- c(test_scores_normal, test_scores_outliers)

# Creating a sequence of student IDs
student_ids <- seq(1, length(test_scores))

# Creating a data frame
df_test_scores <- tibble(Student_ID = student_ids, Test_Score = test_scores)

# Displaying the first and last few rows of the dataset
head(df_test_scores, 10)
tail(df_test_scores, 10)

# Plotting the test scores
ggplot(df_test_scores, aes(x = Student_ID, y = Test_Score)) +
  geom_point() +
  labs(title = "Distribution of Student Test Scores",
       x = "Student ID",
       y = "Test Score") +
  theme_minimal()

# Exporting the data frame to a CSV file
write.csv(df_test_scores, "Daily_Test_Scores.csv", row.names = FALSE)

Here is the graph that R has created for us:

In this example:

• Test scores are generated using a normal distribution centered around 72.5 with a standard
deviation of 7.5, simulating a typical distribution of scores.
• A few extreme values (5, 100, 6, 98) are added as outliers.
• A scatter plot is created using ggplot2 to visualize these test scores. The outliers should be
clearly visible as points far removed from the cluster of other scores.
• I added a line of code so that R would export the data to a .csv file that we can then explore in Excel.

This code serves as a practical demonstration of how to handle and visualise data with outliers in an
educational context. The following graph was created in Excel, in which I have highlighted the
outlier scores:

In each of these examples, the outlier stands out because it deviates significantly from the pattern
set by the majority of the data points. It's important to investigate outliers to determine if they result
from data entry errors, measurement errors or if they represent a genuine phenomenon.

- Types of Outliers (Point, Contextual, Collective)

In the field of data analysis, outliers are data points that significantly differ from the rest of the data.
They are traditionally categorised into three main types: point outliers, contextual outliers and
collective outliers, and in this section we also consider two further categories: masked outliers and
influential outliers. Understanding these categories is crucial for effectively identifying and analysing
outliers in various datasets.

1. Point Outliers (or Global Outliers):

- Definition: A point outlier is an individual data point that significantly deviates from the rest of the
data in the dataset. It's an anomaly when considered in the full context of the dataset.

- Example: In a dataset of human heights, if most people are between 5 and 6 feet tall, a height of 8
feet would be a point outlier.

- Detection: These are typically detected using statistical measures like Z scores, standard
deviations or interquartile ranges (IQR).
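
As a minimal sketch in R (invented heights, in feet), the Z score approach might look like this:

# Z scores flag the 8-foot height as a point outlier
heights <- c(5.5, 5.8, 6.0, 5.4, 5.9, 5.7, 5.6, 6.1, 5.8, 8.0)
z <- (heights - mean(heights)) / sd(heights)
heights[abs(z) > 2]  # returns 8.0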

2. Contextual Outliers (or Conditional Outliers):

- Definition: Contextual outliers are data points that are considered outliers within a specific
context or condition but might not be outliers when taken out of that context. These are often
detected in time series data or geographical data, where the context (like time or location) matters.

- Example: Consider temperature readings over a year. A temperature of 30°C would be normal in
summer but would be a contextual outlier in winter.

- Detection: Detecting these outliers involves understanding the context or conditions and often
requires more complex analysis, like segmentation of data based on those conditions.
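
A minimal sketch of this idea in R, using invented temperature readings: the 30°C value is unremarkable in the pooled data but stands out once the data are segmented by season.

# Group-wise z scores reveal the contextual outlier
season <- rep(c("summer", "winter"), each = 10)
temp_c <- c(28, 30, 31, 29, 27, 32, 28, 30, 29, 31,  # summer readings
            3, 5, 4, 2, 6, 1, 4, 3, 5, 30)           # winter readings, one suspect
z_overall <- (temp_c - mean(temp_c)) / sd(temp_c)
z_within <- ave(temp_c, season, FUN = function(v) (v - mean(v)) / sd(v))
temp_c[abs(z_overall) > 2]  # nothing is flagged in the pooled data
temp_c[abs(z_within) > 2]   # the winter 30 degree reading is flagged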

3. Collective Outliers:

- Definition: A collective outlier refers to a subset of data points that deviate significantly from the
overall data pattern when considered together, even though the individual data points may not be
outliers. These are often seen in time series or sequence data.

- Example: In a dataset of daily share prices, a sudden, short lived spike followed by an equally
sudden drop might not be unusual for individual data points. However, the collective pattern of these
points over a few days might be anomalous compared to the usual share price movements.

- Detection: Detecting collective outliers often involves analysing the data points in sequence and
looking for anomalies in the pattern or behaviour over time, as sketched below.
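
As one hedged sketch of this idea in R (invented prices), a rolling window of total movement can flag a spike-and-crash run whose individual daily moves are unremarkable on their own:

# A brief spike and collapse: individually modest moves, collectively unusual
set.seed(1)
prices <- 100 + cumsum(rnorm(60, 0, 0.5))      # a quiet random walk
prices[30:33] <- prices[29] + c(2, 4, 2, 0)    # four days of spike and crash
moves <- abs(c(0, diff(prices)))               # absolute day-to-day movement
roll3 <- as.numeric(stats::filter(moves, rep(1, 3), sides = 2))  # 3-day total movement
which(roll3 > 3 * median(roll3, na.rm = TRUE)) # windows moving far faster than usual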

4. Masked Outliers:

- Definition: Masked outliers are data points that may appear to be normal when viewed in the
context of the entire dataset but are actually anomalous when considered in a more refined or
appropriate context. These outliers are masked because their unusual nature is hidden by the
presence of other data points, making them harder to detect with standard outlier detection
methods.

- Example: Imagine a dataset containing the test scores of students from two different classes: Class
A and Class B. Class A students typically score between 70 - 80% and Class B students score between
40 - 50%. If the scores are combined into a single dataset without class distinction, a score of 65%
would not stand out as an outlier in the combined range of 40 - 80%. However, if we consider the
data separately for each class, a score of 65% is an outlier for Class B (too high) and potentially for
Class A (slightly low).

- Detection: Detecting masked outliers often requires a more nuanced approach to data analysis,
which we return to after the following example.

To set up this scenario in R, we will create two separate datasets for Class A and Class B students with
their respective score distributions and then combine them into a single dataset. We'll also include a
few scores that are outliers within each class context. Here's the R code to create this dataset:

# Load necessary libraries
library(tibble)
library(ggplot2)

# Setting seed for reproducibility
set.seed(0)

# Generating test scores for Class A (mostly between 70-80%)
scores_class_A <- rnorm(50, mean = 75, sd = 3) # 50 students in Class A
outliers_class_A <- c(65, 85)                  # Outliers for Class A

# Generating test scores for Class B (mostly between 40-50%)
scores_class_B <- rnorm(50, mean = 45, sd = 3) # 50 students in Class B
outliers_class_B <- c(65, 35)                  # Outliers for Class B

# Combining scores and outliers for each class
all_scores_A <- c(scores_class_A, outliers_class_A)
all_scores_B <- c(scores_class_B, outliers_class_B)

# Creating data frames for each class with a class label
df_class_A <- tibble(Student_ID = seq(1, length(all_scores_A)),
                     Test_Score = all_scores_A,
                     Class = "A")

df_class_B <- tibble(Student_ID = seq(1, length(all_scores_B)),
                     Test_Score = all_scores_B,
                     Class = "B")

# Combining both classes into a single dataset
df_combined_scores <- rbind(df_class_A, df_class_B)

# Displaying the first and last few rows of the combined dataset
head(df_combined_scores, 10)
tail(df_combined_scores, 10)

# Plotting the combined test scores with colour distinction for each class
ggplot(df_combined_scores, aes(x = Student_ID, y = Test_Score, color = Class)) +
  geom_point() +
  labs(title = "Test Scores of Students in Class A and Class B",
       x = "Student ID",
       y = "Test Score (%)") +
  theme_minimal()

# Exporting the data frame to a CSV file
write.csv(df_combined_scores, "Combined_Scores.csv", row.names = FALSE)

In this code:

• We generate scores for Class A and Class B, along with specific outliers for each class.
• We label each score with its respective class.
• We combine the two classes into a single dataset, df_combined_scores.
• A scatter plot is created using ggplot2 to visualise the scores, with colour coding to
distinguish between the two classes.
• This setup and visualisation will illustrate how a score like 65% can be an outlier in the
context of each class, but not in the combined dataset.

The graph for this example is shown below and we can see the distinction between the Class A and
the Class B scores:

I created the following three graphs in Excel:

I also created the following table of descriptive statistics that we will see more of later in the book:

          Class A      Class B
Mean      75.0690      45.2541
Median    74.3226      45.0413
Mode      #N/A         #N/A
SD        3.3439       4.0116
Kurtosis  1.5514       11.3376
Skewness  0.1682       2.0546
Maximum   85.0000      65.0000
Minimum   65.0000      35.0000
Range     20.0000      30.0000
Sum       3,903.5896   2,353.2109
Count     52.0000      52.0000

Notice the Mode values: there are no modal values in this case study and I could have used the
IFERROR() function to make the Mode results more presentable: next time!

I used the following formula pattern to help me with the descriptive statistics:

=BYCOL(array, LAMBDA(column, calculation))

For example, for the Mean:

=BYCOL(B$3:B$54, LAMBDA(array, AVERAGE(array))) for Class A and

=BYCOL(C$3:C$54, LAMBDA(array, AVERAGE(array))) for Class B
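
If you want to cross-check the Excel figures, here is a quick sketch in R against the df_combined_scores data frame created above:

# Group means and standard deviations to compare with the Excel table
tapply(df_combined_scores$Test_Score, df_combined_scores$Class, mean)
tapply(df_combined_scores$Test_Score, df_combined_scores$Class, sd)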

We will explore other aspects of examples such as these as we go on, as per the following:

1 Segmentation: One effective method is to segment the data into more homogeneous groups based
on relevant characteristics or conditions. In the student score example, this would mean analysing
the data separately for each class.

2 Multivariate Analysis: Sometimes, considering multiple variables simultaneously can help reveal
outliers that are not detectable when looking at a single variable. Multivariate techniques can
uncover unusual combinations of values across different attributes.

3 Domain Knowledge: Understanding the domain or context of the data can be critical. This
knowledge can guide you in identifying relevant subgroups or conditions where outliers may be
masked.

4 Advanced Statistical Techniques: Some statistical methods are designed to detect outliers in
complex data structures, including those that can uncover masked outliers. Techniques like cluster
analysis, principal component analysis (PCA) or machine learning algorithms can be useful in these
scenarios.

Discussion: masked outliers present a significant challenge in data analysis because they can go
unnoticed, leading to skewed analysis and incorrect conclusions. It's crucial to be aware of the
possibility of their existence, especially in heterogeneous datasets or when combining data from
different sources. The key to detecting masked outliers lies in a careful and thorough examination of
the data, considering its context and applying the right analytical techniques.

5. Influential Outliers:

- Definition: influential outliers are data points that have a disproportionately large impact on the
outcome of statistical analyses, such as regression models or mean calculations. Unlike typical
outliers, which might simply deviate from the norm, influential outliers can significantly alter the
results and conclusions drawn from the data.

- Example: consider a simple linear regression analysis where you're looking at the relationship
between years of experience and salary in a dataset of employees. Most data points form a linear
pattern, indicating that salary increases with experience. However, if there is one employee with an
unusually high salary that doesn't fit the pattern (perhaps due to being a high level executive), this
data point could disproportionately influence the slope of the regression line, leading to a misleading
interpretation of the relationship for the general workforce.

Case Study: the King Kong Effect

The King Kong Effect in data refers to a situation where one value is so extreme that it skews the
overall understanding of the dataset. Let’s create a dataset suggestive of the King Kong Effect by
imagining a hypothetical scenario involving assumed weights of various animals in a zoo. In this
example, most animals have weights within a relatively narrow range, except for King Kong, who has
an extraordinarily high weight.

# The King Kong Effect

# Load necessary libraries
library(tibble)
library(ggplot2)

# Generating weights (in kg) for a variety of typical zoo animals
weights_normal_animals <- c(250, 150, 200, 300, 180, 220, 190, 210, 230, 240) # Weights in kg

# Adding King Kong's weight as an outlier
weight_king_kong <- 5000 # A very high weight in kg

# Combining the weights into one dataset
animal_weights <- c(weights_normal_animals, weight_king_kong)

# Creating animal names for labelling (last one is King Kong)
animal_names <- c("Lion", "Tiger", "Bear", "Giraffe", "Elephant", "Rhino", "Hippo", "Leopard",
                  "Zebra", "Buffalo", "King Kong")

# Creating a data frame
df_animal_weights <- tibble(Animal = animal_names, Weight_kg = animal_weights)

# Displaying the data frame
df_animal_weights

# Plotting animal weights
ggplot(df_animal_weights, aes(x = Animal, y = Weight_kg)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() + # Flipping the coordinates for a horizontal bar plot
  labs(title = "Weights of Animals in a Zoo (including King Kong)",
       x = "Animal",
       y = "Weight (kg)") +
  theme_minimal()

Which gives us this bar chart:

Using Excel, we find:

And again, some of the descriptive statistics, using the BYCOL() function:

          With Kong      Without Kong
Mean      651.8182       217.0000
Median    220.0000       215.0000
SD        1,442.6699     41.6467
Maximum   5,000.0000     300.0000
Minimum   150.0000       150.0000
Range     4,850.0000     150.0000
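
The same comparison is easy to reproduce in R from the animal_weights vector created above, dropping the eleventh value (King Kong):

# Summary statistics with and without King Kong
mean(animal_weights); mean(animal_weights[-11])      # 651.82 versus 217
median(animal_weights); median(animal_weights[-11])  # 220 versus 215
sd(animal_weights); sd(animal_weights[-11])          # roughly 1,442.7 versus 41.6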

- Detection: Detecting influential outliers involves both identifying them and assessing their impact on your analysis:

1 Leverage vs Residuals Plot: In regression analysis, a leverage versus residuals plot can help identify
influential outliers. High leverage points have a significant impact on the position of the regression
line, while high residuals indicate a large deviation from the predicted value.

2 Cook’s Distance: This is a measure used in regression analysis to identify influential points. It
considers both the leverage of the data point and the size of its residual. A large Cook’s distance
suggests that the data point is influential (see the sketch after this list).

3 Influence Plot: Some statistical software offers an influence plot, which combines information
about leverage, residuals and Cook’s distance, providing a comprehensive view of the potential
influence of each data point.

4 Robustness Check: Re-running the analysis with and without the suspected outliers can
demonstrate their influence. A significant change in results indicates that the outliers are influential.
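
As a minimal sketch of item 2 in R (simulated experience and salary data, with one executive-style outlier added), Cook’s distance can be computed directly from a fitted model:

# Cook's distance flags the influential salary observation
set.seed(1)
experience <- 1:20
salary <- 30000 + 2000 * experience + rnorm(20, 0, 3000)
salary[20] <- 250000                  # the high-level executive
model <- lm(salary ~ experience)
cooks <- cooks.distance(model)
which(cooks > 4 / length(salary))     # a common rule of thumb for "influential"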

The King Kong Effect in a dataset is an example of an Influential Outlier. Here's why:

Influential Outliers are data points that have a substantial impact on the outcome of your analysis.
They don't just differ from other observations, but their presence significantly alters statistical
calculations and the results of modeling. In the King Kong example, the extreme weight of King Kong
compared to other animals in the dataset would drastically affect the mean, variance, and any other

statistical analysis you perform on this data. This can lead to misleading conclusions about the typical
animal weight if not properly addressed.
This type of outlier differs from others in the following ways:

Point Outliers: These are individual data points that stand out from the rest of the dataset. While the
King Kong weight is a point outlier, it's specifically its influence on the dataset that makes it an
influential outlier.

Contextual Outliers: These are data points that are considered outliers within a specific context or
condition, but may not be outliers in another context. The King Kong effect doesn't necessarily
depend on context; it's the sheer scale of the outlier that's key.

Collective Outliers: These are a collection of data points that deviate significantly from the overall
data pattern when considered together, although the individual data points may not be outliers. The
King Kong effect is typically represented by a single, extreme data point.

Masked Outliers: These are outliers that might not be detected due to the presence of other data
points. The King Kong effect is quite the opposite, as it's typically very noticeable due to the
extremeness of the outlier.

In summary, the King Kong effect epitomises the influential outlier, where one extreme value can
have a disproportionate impact on the entire dataset's analysis.

- Discussion: influential outliers are particularly critical in regression analysis and other statistical
modelling techniques because they can lead to incorrect model parameters and predictions. Their
detection and handling are essential steps in the data analysis process. Depending on the context
and the goals of the analysis, you may choose to remove, adjust or otherwise account for these
outliers to ensure that your results are reliable and representative of the underlying data. However,
it's also important to investigate why these outliers exist, as they could represent important, albeit
rare, phenomena.

Each type of outlier provides different insights and challenges in data analysis. Identifying and
understanding the nature of outliers is crucial for accurate data analysis, as their presence can
significantly impact statistical conclusions and predictive modelling.

- Causes and Impacts of Outliers in Data

Understanding these aspects is crucial for accurate data analysis, model building and decision making
processes.

- Causes of Outliers

1 Measurement or Input Error: Outliers can arise from mistakes in data collection, recording or
entry. This includes transcription errors, malfunctioning measurement equipment or incorrect data
input. In such cases, the outliers do not represent actual variations in the underlying data.

2 Data Processing Errors: Mistakes in data processing, such as incorrect transformations or
mishandling of missing values, can create artificial outliers.

3 Sampling Variability: Outliers may occur purely by chance, especially in small sample sizes. This is a
natural aspect of statistical variability.

4 Natural Variations: In many cases, outliers represent true but rare events in the population. For
instance, exceptionally high or low values in medical data might indicate rare medical conditions.

5 Changes in Behaviour or Conditions: Outliers can signal a shift in the underlying process generating
the data, such as a sudden market change in financial data or an emerging trend in social media
analytics.

- Impacts of Outliers

1 Statistical Analysis: Outliers can skew statistical measures like the mean, variance and standard
deviation, leading to misleading conclusions. They can also impact the assumptions underlying many
statistical tests and models, such as normality and homoscedasticity.

2 Predictive Modelling: In machine learning and predictive modelling, outliers can disproportionately
influence the model's parameters, potentially leading to overfitting. This makes the model less
generalisable and less accurate on unseen data.

3 Data Interpretation Challenges: Outliers can complicate the interpretation of data visualisations,
such as histograms or scatter plots, masking true patterns or trends in the data.

4 Decision Making: In a business or policy context, unaddressed outliers can lead to misguided
strategies or decisions, especially if they are assumed to represent typical cases.

5 Opportunity for Discovery: On the positive side, outliers can reveal valuable insights. They may
point to novel phenomena, errors in process or areas for improvement. For instance, in quality
control, outliers can indicate defects or failures in manufacturing processes.

- Addressing Outliers

Professionals must develop a strategy for dealing with outliers, which includes detection, diagnosis
and appropriate treatment (removal, transformation or separate analysis). The approach depends on
the nature of the data, the context of the study and the objectives of the analysis. It's also imperative
to document the handling of outliers for transparency and reproducibility in data analysis.

- Outliers: Bane or Boon?

Outliers, those extreme data points that deviate significantly from the rest of the dataset, have
always intrigued statisticians and analysts. In this section, we explore the conflicting perspectives on
outliers and their implications.

On the one hand, outliers are often considered to be a bane: they can distort statistical analysis,
compromise the accuracy of predictive models and skew the interpretation of results. Outliers have
the potential to mislead decision making based on faulty data and can be a source of frustration
when trying to establish patterns or trends.

On the other hand, outliers can also offer valuable insights and information. They might represent
exceptional cases, rare events or important data points that require special attention. Outliers can
highlight anomalies that reveal hidden patterns, uncover unexpected correlations or provide a fresh
perspective on the dataset.

Understanding whether outliers are a bane or a boon depends on the context and objectives of the
analysis. In certain scenarios, outliers might indicate errors or data quality issues that need to be
addressed. However, they could also be valuable pieces of information that contribute to a
comprehensive understanding of the data and its underlying processes.

Chapter 2: Outlier Detection Techniques

- Basic Statistical Methods (Standard Deviation, IQR)


- Visualisation Methods (Box Plot, Scatter Plot)

Standard Deviation

Imagine you're sitting in a lecture hall, eager to learn about one of the fundamental
concepts in business analysis: standard deviation. I, as your professor, am about to introduce this
concept to you, a group of first-year undergraduate business analysis students.

Standard Deviation measures how spread out numbers are in a dataset. In business, this can tell us a
lot about consistency, risk and variability. For instance, if we're looking at the monthly sales figures of
a company, the standard deviation helps us understand how much these sales figures fluctuate over
time.

Why is this important? Well, in the business world, we love predictability. Knowing the standard
deviation of sales, costs or even share prices can help businesses make more informed decisions. A
low standard deviation means the numbers are close to the average, indicating stability. A high
standard deviation, however, suggests a lot of variability, which could mean higher risk.

As budding business analysts, mastering the concept of standard deviation will enable you to identify
trends, assess risks and make data-driven decisions. Remember, it's not just about the numbers; it's
about understanding the story they tell.

We'll start by looking at some real world examples and then move on to calculating standard
deviation ourselves. It's a step by step process and I'll guide you through it. Let's embark on this
journey to demystify statistics and harness its power for business analysis. Welcome to the world of
standard deviation!

Let’s refer back to the daily sales data we used in chapter one; you can find those data in the file
chapter_two.xlsx.

Excel has a built-in standard deviation function that we can use with very little training, but we will
use both that function and the BYCOL() function that we used in chapter one.
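
For example, assuming the daily sales sit in cells B2:B367 (an assumed layout), the two approaches might look like this:

=STDEV.S(B2:B367)

=BYCOL(B2:B367, LAMBDA(column, STDEV.S(column)))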

Look at these screenshots from the daily sales data and their analysis.

Look at these two graphs, though:

The graph on the left includes the outliers in the data set and because of that, we cannot see the
extent of the variability or deviation in the data set. The graph on the right excludes those outliers
and we can now assess the variability of the data set.

Histograms are particularly useful for assessing the variability of a set of data. The following histograms show
the full data set, including and excluding the outliers:

Again, we can see the impact that the inclusion of outliers has on the appearance of data, if nothing
else.

The Box & Whisker Plot

Still working in Excel, we can easily convert the histograms to Box & Whisker plots: again, including
and excluding outliers, on the left and right below, respectively.

What do box and whisker plots tell us?

Here is the anatomy of a box & whisker plot or box plot:

The values for the daily sales data are:

Notice that I have recoloured part of the plot so that you can see the data points and the fact that
the mean and the median are quite similar in value. Note also that even though I have labelled the
second plot as excluding outliers, two outliers still appear on it: one above the maximum
bar and one below the minimum bar. We will discuss those two values shortly as we take a more
advanced view of this plot.

The InterQuartile Range: IQR

John Tukey, the statistician who invented the box and whisker plot, also created the outlier
interpretation of the interquartile range, the IQR, that goes with it. Let’s look at these additional
concepts now.

The idea behind the IQR is that it provides us with threshold values for upper and lower outliers.

In the diagram in the previous section, we can see the IQRs for the data including the outliers and for
the data without the outlier data. However, there is one further step to take with this knowledge and
that is to find what are called Tukey’s Fences. There are two versions of Tukey’s Fences, inner and
outer, and for each of them there is a lower and an upper value, as follows:

What we see here is:

• The values of the lower and upper inner and outer Tukey Fences: check the formulas for
them and note the constant values used of 1.5 for the inner fences and 3 for the outer fences
• The outlier summary which shows how many outlier values there are in the data
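
Those fences are easy to reproduce; here is a sketch in R using the df_daily_sales data from chapter one:

# Tukey's Fences for the daily sales data
Q1 <- unname(quantile(df_daily_sales$Daily_Sales, 0.25))
Q3 <- unname(quantile(df_daily_sales$Daily_Sales, 0.75))
IQR_value <- Q3 - Q1

inner_fences <- c(Q1 - 1.5 * IQR_value, Q3 + 1.5 * IQR_value)  # beyond these: "outliers"
outer_fences <- c(Q1 - 3 * IQR_value, Q3 + 3 * IQR_value)      # beyond these: "far out" values
inner_fences
outer_fences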

This introduction sets the stage for your learning journey into the world of business statistics,
emphasising the practical applications and importance of understanding standard deviation in
business analysis.

The R Codes

The following code includes the basic codes that we used in chapter one for the Daily Sales data and
then it adds the new code. I have identified where the new code begins.

# Exploring outliers with Daily Sales

# Load necessary libraries
library(tibble)
library(ggplot2)

# Setting seed for reproducibility
set.seed(0)

# Generating normal daily sales (mostly between $1,000 and $3,000)
sales_normal <- rnorm(363, mean = 2000, sd = 500)

# Defining outliers
sales_outliers <- c(50000, 300, 48000) # Two high sales days and one low sales day

# Combining normal sales with outliers
daily_sales <- c(sales_normal, sales_outliers)

# Creating a sequence of dates for the sales data
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = length(daily_sales))

# Creating a data frame
df_daily_sales <- tibble(Date = dates, Daily_Sales = daily_sales)

# Displaying the first and last few rows of the dataset
head(df_daily_sales)
tail(df_daily_sales, 10)

# Plotting the daily sales data
ggplot(df_daily_sales, aes(x = Date, y = Daily_Sales)) +
  geom_line() +                # Line to show trends over time
  geom_point(color = "blue") + # Points to show each day's sales
  labs(title = "Daily Sales Data for a Retail Store",
       x = "Date",
       y = "Daily Sales ($)") +
  theme_minimal()

### New Codes for Chapter Two

# Summary Data to show Min, Q1, Median, Mean, Q3, Max

summary(df_daily_sales)

      Date               Daily_Sales
 Min.   :2023-01-01   Min.   :  300
 1st Qu.:2023-04-02   1st Qu.: 1652
 Median :2023-07-02   Median : 1994
 Mean   :2023-07-02   Mean   : 2263
 3rd Qu.:2023-10-01   3rd Qu.: 2316
 Max.   :2024-01-01   Max.   :50000

# draw a histogram of all of the data

hist(df_daily_sales$Daily_Sales,
xlab = "Daily Sales",
main = "Histogram of Daily Sales: all values",
breaks = sqrt(nrow(df_daily_sales))
) # set number of bins

# Histogram that excludes the extreme values

# Calculate IQR
Q1 <- quantile(df_daily_sales$Daily_Sales, 0.25)
Q3 <- quantile(df_daily_sales$Daily_Sales, 0.75)
IQR <- Q3 - Q1

# Define the outlier thresholds


lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter out outliers
filtered_data <- subset(df_daily_sales,
                        Daily_Sales > lower_bound & Daily_Sales < upper_bound)

# Plot histogram
hist(filtered_data$Daily_Sales,
main="Histogram Excluding Extreme Values",
xlab="Daily Sales",
breaks=10) # Adjust 'breaks' as needed

# Now a box plot of all of the data

boxplot(df_daily_sales$Daily_Sales,
ylab = "Daily Sales All Values"
)

# Box plot with extreme values removed

boxplot(filtered_data$Daily_Sales,
ylab = "Daily Sales No Extreme Values"
)

# Using the which() function, R can tell us the row numbers of the outlier values in our data set
out <- boxplot.stats(df_daily_sales$Daily_Sales)$out

out_ind <- which(df_daily_sales$Daily_Sales %in% c(out))
out_ind

Rows: 163 245 364 365 366

# And now it is also possible to print the values of the outliers directly on the boxplot
# with the mtext() function:

# Format the outliers with two decimal places


formatted_outliers <- sprintf("%.2f", out)

boxplot(filtered_data$Daily_Sales,
ylab = "Daily Sales",
main = "Boxplot of Daily Sales"
)
mtext(paste("Outliers: ", paste(formatted_outliers, collapse = ", ")))

# Exporting the data frame to a CSV file
write.csv(df_daily_sales, "Daily_Sales_Data_ch_two.csv", row.names = FALSE)

- Advanced Techniques: DBSCAN, Clustering with Power BI

DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a popular clustering
algorithm that is particularly effective in identifying outliers as well as forming clusters of data points
in a dataset. In the context of outlier detection, DBSCAN is used because of its unique approach to
handling noise and outliers. Let’s delve into its key concepts and how it applies to outlier detection:

Key Concepts of DBSCAN

Density based Clustering: Unlike centroid based algorithms like K-means, DBSCAN groups together
points that are closely packed together (points with many nearby neighbours), marking as outliers
the points that lie alone in low density regions.

Two Main Parameters:

Epsilon (ε): A distance measure that defines the neighbourhood around a data point. If the distance
between two points is lower or equal to ε, they are considered neighbours.
MinPts (Minimum Points): The minimum number of points required to form a dense region. A point
is considered a core point if it has at least MinPts within its ε-neighbourhood.

DBSCAN for Outlier Detection

Core, Border and Noise Points: In DBSCAN, points are categorised as core points, border points or
noise points. Noise points are considered outliers.

Core Points: Have at least MinPts within their ε-neighbourhood.


Border Points: Fewer than MinPts within their ε-neighbourhood but in the neighbourhood of a core
point.
Noise Points (Outliers): Do not have the minimum number of points in their ε-neighbourhood and
are not in the neighbourhood of a core point.
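
Before applying DBSCAN to our data, here is a minimal two-dimensional sketch (invented points; the dbscan package is assumed to be installed) of how these categories emerge from eps and MinPts:

# Two dense clouds and one isolated point: DBSCAN labels the isolated point 0 (noise)
library(dbscan)
set.seed(0)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
             c(10, 10))                # the isolated point
res <- dbscan(pts, eps = 1, minPts = 5)
res$cluster                            # cluster 0 marks noise, i.e. the outlier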

Process

The algorithm starts by randomly selecting a point and retrieving all points within its ε-
neighbourhood.

If the point has at least MinPts in its neighbourhood, a cluster is formed. The cluster then expands by
adding all reachable points within ε-distance that also meet the MinPts criterion.

If the point doesn't meet the MinPts criterion, it's marked as noise (potential outlier). However, it
might later be found in the ε-neighbourhood of a different core point and thus become part of a
cluster (as a border point).
This process continues until all points are either assigned to a cluster or marked as noise.

I have applied the DBSCAN function to the Daily Sales data in the R file. The script continues from
the Daily Sales code given earlier in this chapter; the new code and the results it generated are:


# Applying DBSCAN

# Load the dbscan package
library(dbscan)

# Convert the Daily_Sales column to a one-column matrix for dbscan
sales_data_matrix <- matrix(df_daily_sales$Daily_Sales, ncol = 1)

# Check for an appropriate value of eps, for example, 1% of the range of the data
max(df_daily_sales$Daily_Sales)
min(df_daily_sales$Daily_Sales)

eps_value <- (max(df_daily_sales$Daily_Sales) - min(df_daily_sales$Daily_Sales)) * 0.01
eps_value

db_result <- dbscan(sales_data_matrix, eps = eps_value, minPts = 5)

# Viewing the results
print(db_result)

# Plotting the Results

# After applying DBSCAN, you can plot the results. Since you have only one dimension (Daily Sales),
# you might use a simple scatter plot, using the row indices as the x-axis and the sales values as
# the y-axis:

# Plotting the clustering result

plot(1:nrow(df_daily_sales), df_daily_sales$Daily_Sales, col=db_result$cluster + 1L, pch=20, cex=2)


legend("topright", legend=unique(db_result$cluster), col=1:length(unique(db_result$cluster)) + 1,
pch=20)

# In this plot, different colours will represent different clusters, and points that are outliers
# will be clearly distinguishable (typically in black if they are labeled as 0 by the algorithm).

# Example of a k-distance plot for choosing eps
k_dist <- kNNdistplot(sales_data_matrix, k = 5)
abline(h = 0.5, col = "red") # Example line at initial eps

Introducing the Aston Martin Lagonda Second Hand Car Price 2018 Data Set

We have done nothing wrong with what we have done here, but the data we were working with were
created artificially. For simple tasks, artificial data can work well. And, then again, it can’t. In case it
helps, let’s take a look at some real data now, in the context of clustering.

Work through the code that follows and interpret the results you get: are they better than the
artificial Daily Sales Data, do you think?

# Applying DBSCAN to Aston Martin Lagonda Second Hand Car Price Data from 2018

library(dbscan)

# Replace 'path_to_file.csv' with the actual path to your CSV file


aml <- read.csv('aml_prices_2018.csv', stringsAsFactors = FALSE)

summary(aml$Price)
str(aml)

head(aml, 10)
tail(aml, 10)

hist(aml$Price,
xlab = "Prices",
main = "Histogram of AML Second Hand Prices",
breaks = sqrt(nrow(aml))
) # set number of bins

mean(aml$Price)

# Extracting the Price column and converting it to a matrix
aml_data_matrix <- matrix(aml$Price, ncol = 1)

# Applying DBSCAN
# eps here is 1% of the price range, found using the eps_value calculation shown earlier
db_aml <- dbscan(aml_data_matrix, eps = 2675.9, minPts = 5)

# View Results
print(db_aml)

# Plotting the Results

# After applying DBSCAN, you can plot the results. Since you have only one dimension (Price),
# you might use a simple scatter plot, using the row indices as the x-axis and the prices as
# the y-axis:

# Plotting the clustering result
plot(1:nrow(aml), aml$Price, col = db_aml$cluster + 1L, pch = 20, cex = 2)
legend("topright", legend = unique(db_aml$cluster), col = 1:length(unique(db_aml$cluster)) + 1,
       pch = 20)

# In this plot, different colours will represent different clusters, and points that are outliers
# will be clearly distinguishable (typically in black if they are labelled as 0 by the algorithm).

# Example of a k-distance plot for choosing eps


k_dist <- kNNdistplot(aml_data_matrix, k = 5)
abline(h = 0.5, col = "red") # Example line at initial eps

# Using the which() function, R can tell us the row numbers of the outlier values in our data set
out <- boxplot.stats(aml$Price)$out

out_ind <- which(aml$Price %in% c(out))
out_ind

# And now it is also possible to print the values of the outliers directly on the boxplot
# with the mtext() function:

# Format the outliers with two decimal places
formatted_outliers <- sprintf("%.2f", out)

boxplot(aml$Price,
        ylab = "Price",
        main = "Boxplot of AML Second Hand Prices"
)
mtext(paste("Outliers: ", paste(formatted_outliers, collapse = ", ")))

# Exporting the data frame to a CSV file


write.csv(aml, "aml_ch_two.csv", row.names = FALSE)

Advantages of DBSCAN for Outlier Detection

No Assumption of Cluster Shapes: DBSCAN does not assume that the clusters are spherical (as in K-
means), making it suitable for detecting clusters of arbitrary shapes.
Handling Outliers: It effectively identifies outliers as noise points, which are not part of any cluster.
No Need to Specify Number of Clusters: Unlike K-means, DBSCAN does not require the user to
specify the number of clusters in advance.

Limitations

Parameter Selection: Choosing appropriate ε and MinPts can be challenging and greatly affects the
outcome.
Varying Density: DBSCAN can struggle with datasets where clusters have varying densities.

In summary, DBSCAN is a powerful method for both clustering and outlier detection, particularly
useful in scenarios where data contains complex structures and you do not have a priori knowledge
of the number of clusters. Its ability to identify outliers as noise points makes it a valuable tool in
many applications, including anomaly detection, spatial data analysis, and image segmentation.

Chapter 3: Excel for Outlier Analysis

- Preparing Data in Excel

Excel is a powerful tool for data preparation and ensuring data quality is a crucial aspect of working
with Excel. Here are some professional tips and best practices for preparing data in Excel to maintain
high data quality:

1 Understanding Data Types: Excel supports different data types like text, numbers, dates and more.
Ensure that each column in your Excel sheet correctly represents the data type it is supposed to hold.
This improves accuracy and makes data analysis easier.

In case you are wondering, these are the types of files we can all open in Excel:

And these are the ways in which we can save our files in Excel:

Just consider how many ways we have of expressing ourselves in data and, more than that, how many
ways there are of getting the data type wrong … and right, of course. Data cleaning, section 2, the
next part of this chapter, is one of the most important things you can ever think about when
considering Excel and the preparation of data.

2 Data Cleaning: This involves identifying and correcting (or removing) errors and inconsistencies in
data to improve its quality. This could include removing duplicates, correcting misspellings and
handling missing values appropriately.

Here we are talking about dirty data and throughout this book we will come across examples of
where the data we have created or imported, or that has entered our system in some way, might
be dirty. If data are dirty, we MUST clean them. More than that, data analysts and scientists will tell
you that as much as 80 to 95% of their data analysis and science time is spent cleaning dirty data.
Cleaning dirty data is a serious business.

What is dirty data?

Here are some examples of dirty data:

1OOO
Ϯ000
I000
IOOO

Why are they dirty and how do we know they are?

They are all dirty because none of them is a value even though they might look as if they are. Just
copy and paste them all into an Excel file and then create a formula to multiply them each by 1 or 10
or any number. Like this:
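
For instance, if you paste 1OOO into cell A2 (purely by way of example), then =A2*10 returns the
#VALUE! error, because what looks like one thousand is in fact the digit 1 followed by three letter Os.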

I know, some of them might be obviously fake numbers but dirty data often hides in plain sight.

Fortune 1000 for 2023

In the file chapter_three.xlsx, go to the fortune worksheet where you will see a range of data that
has been extracted from the Fortune 1000 list of companies for 2023. Can you see any dirty data in
that range? In case it is not obvious for you, here is a clue that is sometimes really useful. Just look at
the column headings and their underlying data:
Industry | Rank | Name | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees | Change In Rank | Years On Global 500 List
General Merchandiers | 1 | Walmart | $572,754 | 2.40% | $13,673 | 1.20% | $244,860 | 2,300,000 | - | 28
Internet Services and Retailing | 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | 1,608,000 | 1 | 14
Computers, Office Equipment | 7 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | 154,000 | -1 | 20
Health Care: pharmacy and other services | 10 | CVS Health | $292,111 | 8.70% | $7,910 | 10.20% | $232,999 | 258,000 | -3 | 27
Health Care: Insurance and Managed Care | 11 | UnitedHealth Group | $287,597 | 11.80% | $17,285 | 12.20% | $212,206 | 350,000 | -3 | 26

In my work, I always left align any column of text and I right align any column of values. In this
context, look at the Revenues ($M) column: the heading is right aligned but the contents appear to
be values and yet they are left aligned. The same applies to the Profits ($M) and Assets ($M) columns
too.

Look at the data in those columns and we can see that every value starts with $: click on any one of
those values in any one of those columns and then look at the Number format in the Home ribbon
and you will see this:

For a value to have the $ prefix, it should be showing as using the Currency or Accounting format. In
this example, these columns contain dirty data and the best way to clean them all is to use Find and
Replace to replace the $ signs in all of those columns with nothing.

I have done that now and look at this:


Industry | Rank | Name | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees | Change In Rank | Years On Global 500 List
General Merchandiers | 1 | Walmart | 572,754 | 2.40% | 13,673 | 1.20% | 244,860 | 2,300,000 | - | 28
Internet Services and Retailing | 2 | Amazon | 469,822 | 21.70% | 33,364 | 56.40% | 420,549 | 1,608,000 | 1 | 14
Computers, Office Equipment | 7 | Apple | 365,817 | 33.30% | 94,680 | 64.90% | 351,002 | 154,000 | -1 | 20
Health Care: pharmacy and other services | 10 | CVS Health | 292,111 | 8.70% | 7,910 | 10.20% | 232,999 | 258,000 | -3 | 27
Health Care: Insurance and Managed Care | 11 | UnitedHealth Group | 287,597 | 11.80% | 17,285 | 12.20% | 212,206 | 350,000 | -3 | 26

You can see that the Revenue, Profits and Assets columns are right aligned now and that suggests the
data are clean.

Another quick way to find potentially dirty data is to highlight it by selecting it and then taking a look
at how it appears on the Status Bar. Look at this:

I have selected a few rows of the Revenues data: look at the bottom right hand corner of the
screen, where the Status Bar summarises the data as Count XX only.

If the data were properly formatted as values, we would see this instead:
Average, Count, Numerical Count …

Dates are a particular pain: if you work with data that you copy from a web page or from a file from
someone who works in a different part of the world, it is not at all unusual for the format of their
dates and other values to be different from the way you work and that alone could cause you
problems. The problems often arise when dirty data looks clean.

Be ready, we will confront dirty data from time to time and I will work with you on sorting it out: just
be ready at all times!

3 Use of Formulas and Functions: Excel offers a wide range of formulas and functions for data
manipulation. There is no point making a list of the best functions there are in Excel because what is
important to you might not be important to me. Be ready, though, because Excel is developing very
quickly these days and in this book we will be working with the latest available functions as well as
those functions that have always been a part of Excel.

Dynamic Array Functions: these are relatively new but learning and using them will change your
approach to data analysis, including working with outliers:

SEQUENCE()
UNIQUE()
SORT()
SORTBY()
RANDARRAY()
FILTER()
XLOOKUP()
XMATCH()

Let me add the 14 additional functions that came along in August 2022:

TEXTSPLIT()
TEXTBEFORE()
TEXTAFTER()
VSTACK()
HSTACK()
CHOOSECOLS()
CHOOSEROWS()
DROP()
TAKE()
EXPAND()
TOCOL()

TOROW()
WRAPCOLS()
WRAPROWS()

Then there are the LAMBDA functions that I have already started to introduce in this book: BYCOL().
Here is the whole list:

MAP()
SCAN()
REDUCE()
MAKEARRAY()
BYCOL()
BYROW()
ISOMITTED()

Finally, there is the LET() function, which lets you name intermediate calculations and reuse them inside a single formula.
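
As a minimal illustration of LET(), purely by way of example, this formula names two values and reuses them in the final calculation; the cell references are only illustrative:

=LET(price, B2, qty, C2, price * qty)

We will see LET() keep multi-step outlier calculations in a single cell later in this chapter.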

4 Data Validation: Use Excel's data validation feature to set rules for what data can be entered into a
cell. For example, you can restrict a cell to only accept numeric values or dates or select from a drop
down list. This prevents incorrect data entry.

Good examples here include creating an input cell from Data Validation that includes ONLY, for
example,

Days of the week


Months of the year
Departments in the company
Names of Heads of Department
Names of raw materials
Grades of labour

By using Data Validation, we can be sure that we are only inputting data that is appropriate.

Watch out for data validation in this book to see how we use it and how beneficial it is. In the
meantime, open the file chapter_three.xlsx to see some basic examples.

On the DV worksheet, I have created three data validation cells for you, in cells B5, B7 and B9 and
there is a text box on that worksheet that illustrates how to create your own in rows 11, 13, 15. I
have not demonstrated it here but take a look at the Input Message and Error Alert options in Data
Validation to see how you might usefully use them.

Then, of course, consider how to use what you now know how to do!
5 Consistent Formatting: Ensure consistent formatting across your dataset. This includes date
formats, decimal places for numbers and text casing: UPPER, lower, Proper.

6 Tables and Named Ranges: Use tables and named ranges to make data references clearer and
calculations more straightforward. This also helps in making your data more organised and easily
readable.

I will say here that you will hear some Excel users say that we should never use range names in Excel:
I disagree with that because in some cases, you cannot do without them. Learn how to use them and
use them sparingly: don’t create hundreds of range names for every file or you will cause yourself
problems.

There is the world of difference between a list or a range of data and the following table types:

• Excel Tables
• Pivot Tables
• Data Tables

Excel Tables

Create a list and with your cursor anywhere in that list press this key combination: Ctrl+T and you will
see a small dialogue box that confirms the range of your list and whether the first row of the range is
a header row … make sure they are right and change them if not. Then you will see this:

There is a list or range of data in A5:D10 and there is an Excel Table with the name sales in E14:H19.
Click on the list, press Ctrl+T and change the Table Name to sales. Feel free to change the colour
scheme of the Excel Table, as it will use your default scheme and you might not like that!

These days, because of laziness or a lack of understanding, many Excel users talk about a data table
when they mean a table of data. A data table is a very specific set up of data whereas a table of data
might just be an unformatted list of numbers and letters.

Excel tables are versatile and they contain a number of features that are not so obvious at first sight.
We will work with tables throughout this book and will learn much more about them.

Pivot Tables

Look at Section 7 that follows.

Data Table

There are basically two forms of Data Table: one variable and two variables. Let’s begin with the one
variable version:

And now the two variable Data Table

You can check on how a data table works by clicking on any cell in the range E4:H9 in this case and
you will see: {=TABLE(B5,B4)} and this confirms your row and column input cells. For the one variable
data table you will see {=TABLE(,B4)}, the column input cell.

7 Pivot Tables and Charts: Use pivot tables and charts for summarising data. They are powerful tools
for data analysis and can help in spotting trends, inconsistencies and outliers in your data.

Pivot Tables and Charts are subjects in their own right but here is a simple introduction for you.

On the PT worksheet in the chapter_three.xlsx file there is an Excel table called sales_dw and we will
use that to create a pivot table and a pivot chart.

Click anywhere on the Excel Table, then choose Table Design, Summarise with Pivot Table.

In the dialogue box that opens, make sure it says sales_dw as the table name, tell Excel to put your
Pivot Table in cell I3 and click OK.

Your empty Pivot Table is waiting. Set it up this way:

That is a basic but very useful Pivot Table and you can already see its usefulness, can’t you?

Click on the down arrow next to the word All in cell J1 and you can select Hardware or Software, click
Software then Click OK and, as easy as that, you now have a Software report as opposed to the
combined Hardware and Software report that you had initially.

Pivot Charts

Click anywhere on your Pivot Table, then choose Pivot Table Analyse … Pivot Chart, choose the
Clustered Column chart from the dialogue box that opens and click OK.

At first you will see a rather basic chart: edit it to make it look much better!

We will work with Pivot Tables and Pivot Charts throughout this book so that was just your starter!

8 Documenting the Process: Keep a record of the steps taken in data preparation. This includes
formulas used, sources of data and any assumptions or rules applied. This documentation is vital for
future reference and for others who might use the data.

From my experience of my own work and the work of thousands of other people, documentation is
the last thing on most people's minds. Yet it should be the first. Notice that I have documented
chapter_three.xlsx as I have gone along: either in the file itself and/or in this Word file. Either way,
you know what I am doing, step by step. Keep it that way: it should be your unbreakable habit to
create documentation.

9 Regular Audits and Updates: Regularly review and update your data. This includes checking for
[new] errors, updating values and reviewing formulas and validations to ensure they are still relevant.

Again, make this a habit: can we audit our own work? Yes, we can! Should we audit our own work?
No, we shouldn't! It should be your policy for colleagues to validate everything that happens in your
department or section, all of the time. Yes, it is time consuming and sometimes boring too, but if you
don't do it, you could end up on the list of Excel Blunders that Cost Millions!

The second reason why this point is so vital takes us back to the Dynamic Array Functions, the 14
new functions and the Lambda functions … they are new to you but you should be using them now.
Update your files for that reason alone.

10 Security and Sharing: Protect sensitive data using Excel’s security features like password
protection, and be cautious when sharing files. Ensure that the data shared respects privacy and
confidentiality agreements.

This really is a vital part of our work now as we are all so connected and inter connected. Malware,
spyware, ransomware … look after yourself and your work.

11 Training and Continuous Learning: Excel is constantly evolving. Stay up to date with new features
and best practices through continuous learning and training.

You’re reading this book so that’s a really good start!

12 Leveraging External Tools: Sometimes, Excel might not be sufficient for complex data preparation
needs. In such cases, be open to using external tools or add ons that complement Excel's capabilities.

Remember, data quality in Excel is not just concerned with technical aspects but also with attention
to detail, consistency and an ongoing commitment to maintaining the integrity of your data.

- Using Formulas for Outlier Detection

In the previous section, Preparing Data in Excel part 3, Use of Formulas and Functions, we identified
a large range of new functions that have been built into Excel and we can use some or all of them in
the context of full data analysis: dynamic array and other functions.

In this section, we are going to leverage Excel's general formula capabilities to identify data points
that deviate significantly from the norm. Outliers can significantly impact statistical analyses and data
interpretations, so detecting them is crucial for maintaining data quality. Here are key aspects to
consider:

For this section, we will use the Fortune 1000 2023 data from chapter_three.xlsx: on the fortune
worksheet.

1 Understanding Outliers: As we know, an outlier is a data point that differs significantly from other
observations. They can occur due to variability in measurement or experimental error and can also
indicate something significant like a new trend.

2 Basic Statistical Methods: You can use basic statistical methods to detect outliers. One common
method is to calculate the mean and standard deviation of your data set and then find points that fall
outside a certain range, typically beyond 1.5 to 3 standard deviations from the mean.
3 Using Excel Formulas: In Excel, you can use formulas to calculate the mean and standard deviation,
and then create conditional formulas to identify outliers. For example:

Mean: =BYCOL(D$6:I$127,LAMBDA(column,AVERAGE(column)))
Median: =BYCOL(D$6:I$127,LAMBDA(column,MEDIAN(column)))
Standard Deviation: =BYCOL(D$6:I$127,LAMBDA(column,STDEV.S(column)))
Trim mean: =BYCOL(D$6:I$127,LAMBDA(column,TRIMMEAN(column,0.1)))
Quartile 3 = Q3: =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,3)))
Quartile 1 = Q1: =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,1)))

Here are the results of having applied those formulations:

Metric | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Mean | 90,828.66 | 20.79% | 10,238.92 | 143.57% | 315,873.77 | 148,454.89
Median | 59,936.50 | 12.80% | 5,832.00 | 46.25% | 92,556.00 | 79,000.00
SD | 86,590.35 | 25.64% | 15,678.60 | 406.86% | 691,169.81 | 263,712.43
Trimmean | 78,470.65 | 18.39% | 7,747.32 | 88.07% | 182,284.20 | 111,826.16
Q3 | 106,747.50 | 23.83% | 11,967.25 | 112.50% | 237,901.00 | 170,250.00
Q1 | 39,528.00 | 6.28% | 2,096.00 | 10.88% | 50,240.25 | 37,996.00

We can then use the following, as an example, to help us to identify outliers in each column:

=IF(OR(data_point < (mean - k*stdev), data_point > (mean + k*stdev)), "Outlier", "Normal")

where k is typically 1.5 to 3.

We can apply this formula on a company by company basis, to find these examples:

Company | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Walmart | Outlier | Normal | Normal | Normal | Normal | Outlier
Amazon | Outlier | Normal | Normal | Normal | Normal | Outlier
Apple | Outlier | Normal | Outlier | Normal | Normal | Normal
CVS Health | Normal | Normal | Normal | Normal | Normal | Normal
UnitedHealth Group | Normal | Normal | Normal | Normal | Normal | Normal

For Revenues for Walmart, for example, the formula is:

=IF(OR(D6 < (N$6 - 1.5*N$8), D6 > (N$6 + 1.5*N$8)), "Outlier", "Normal")

where k = 1.5: remember, we talked about John Tukey earlier and he suggested that a value of 1.5
was appropriate for this kind of analysis. We also saw that using k = 3 is also widely used, instead of
or as well as k = 1.5.

Copy that right and down to fill the range of data and you have a view on whether each data point
for each company appears to be an outlier.

Outliers for Whole Data Set

Whilst we can do what we just did, on a company by company basis, it is more normal to identify
outliers overall, for the complete range of data: all Revenue, all Revenue Percent Change and so on.
We do this by using the Quartile 3 and Quartile 1 values in this way:

Q3 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,3)))

Q1 =BYCOL(D$6:I$127,LAMBDA(column,QUARTILE.INC(column,1)))
Inter Quartile Range (IQR) = Q3 – Q1

Then we complete our analysis of outliers as follows, by extending our analysis to include Lower
Inner Fences (LIF) and Upper Inner Fences (UIF):

Metric | Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
Mean | 90,828.66 | 20.79% | 10,238.92 | 143.57% | 315,873.77 | 148,454.89
Median | 59,936.50 | 12.80% | 5,832.00 | 46.25% | 92,556.00 | 79,000.00
SD | 86,590.35 | 25.64% | 15,678.60 | 406.86% | 691,169.81 | 263,712.43
Trimmean | 78,470.65 | 18.39% | 7,747.32 | 88.07% | 182,284.20 | 111,826.16
Q3 | 106,747.50 | 23.83% | 11,967.25 | 112.50% | 237,901.00 | 170,250.00
Q1 | 39,528.00 | 6.28% | 2,096.00 | 10.88% | 50,240.25 | 37,996.00
IQR | 67,219.50 | 0.18 | 9,871.25 | 1.02 | 187,660.75 | 132,254.00
Lower Inner Fence (LIF) | -61,301.25 | -0.20 | -12,710.88 | -1.42 | -231,250.88 | -160,385.00
Upper Inner Fence (UIF) | 207,576.75 | 0.50 | 26,774.13 | 2.65 | 519,392.13 | 368,631.00
Number of Outliers < LIF | 0 | 1 | 0 | 1 | 0 | 0
Number of Outliers > UIF | 10 | 17 | 8 | 13 | 14 | 9

Explanations

Lower Inner Fence for Revenues =Q1 – 1.5 * IQR = 39,528.00 – 1.5*67,219.50
Upper Inner Fence for Revenues =Q3 + 1.5 * IQR = 106,747.50 + 1.5*67,219.50

Finally, we can find how many values, column by column, are outliers in this way:

Number of Outliers < LIF: =COUNTIFS(range of data,"<"&LIF)
Number of Outliers > UIF: =COUNTIFS(range of data,">"&UIF)
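
If you prefer the whole calculation in a single cell, the LET() function we met earlier can name each step. This is only a sketch: it assumes the Revenues data sit in D6:D127 and uses the 1.5 multiplier; adjust both to suit:

=LET(data, D$6:D$127,
q1, QUARTILE.INC(data, 1),
q3, QUARTILE.INC(data, 3),
iqr, q3 - q1,
COUNTIFS(data, "<"&(q1 - 1.5*iqr)) + COUNTIFS(data, ">"&(q3 + 1.5*iqr)))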

Can we identify which 10 > UIF Revenues are outliers? Which single outlier is < LIF for Profits Percent
Change? Let’s consider the next section: Identifying Individual Outliers

Combining the FILTER() and BYCOL() Functions with the IQR

The file for this section is filter_outliers.xlsx

The data for the exercise are found on the data worksheet and the fully worked solution is on the
solutions worksheet

Data:

Solutions:

Initially, I simply used the FILTER() function to highlight the results of a Greater Than exercise: see the
formulas shown in cells L7 and L17.

The results of this exercise are shown in the ranges:

• I7:K11 for a single value interrogation: gross salary value
• I17:K7 for a double value interrogation: gross salary value and department number

What I then show is whether there are any outliers in this salary table: is anyone earning significantly
less or more than anyone else? We see that here:

BYCOL()

I used the BYCOL() function to identify the Quartile 3 (Q3) and the Quartile 1 (Q1) values, hence the
Inter Quartile Range, IQR of Q3-Q1

BYCOL() is a LAMBDA() function and such functions can be really easy to evaluate, as is the case here.

We modify the BYCOL() functions according to the column we are evaluating: Gross, PAYE, Net

The Inner Fences

Otherwise known as Tukey’s Fences, the Inner Fences give us boundaries in the data beyond which
we can find possible outliers.

The Inner Fence formulas are very straightforward as you can see in cells N24 and N25

FILTER() and COUNT()

Now that we have done the basic work, we can combine the COUNT() and FILTER() functions, as you
can see in cells N28 and N29, to return the number of values in each column, Gross, PAYE and Net, that
are classified as outliers.

In this example, there are no outliers and we can see that this is so by inspecting the graph of the
Salary data we are reviewing:

The Outer Fences

I show the Outer Fences below and they are very easy to find, given the work we have already done.
For example, copy the formula for Gross Salary in cell J24

=J21-1.5*J22
paste it in cell J31 and change it to
=J21-3*J22

Similarly for the formula in cell J25

=J20+1.5*J22
paste it in cell J32 and change it to
=J20+3*J22

I have included the Minimum and Maximum values of the three columns we are working on to help
us to read and appreciate the outlier results we have found:

Identifying Individual Outliers

4 Conditional Formatting: This is a handy tool in Excel for visually identifying outliers. You can set up
rules using the same statistical criteria to highlight outliers in your data set.

Conditional Formatting

A simple method of finding individual outliers based on the < LIF and > UIF calculations is to use
Conditional Formatting. To achieve this, do the following, for example, for Revenues:

Highlight all of the data in the Revenues ($M) column, click on the down arrow on the Conditional
Formatting icon on the Home Ribbon and choose:

Choose cell N15 as the Greater Than condition and leave it as Light Red Fill or choose your own
colour preference:

Click OK.

Repeat, but for Less Than this time: choose cell N14, choose your colour preference and click OK.

All of the values that Excel has identified as outliers are now highlighted in red in the Revenues ($M)
column and here they are:

The ten largest companies by Revenue are the ten outliers in this column.

Repeat this process for all of the other variables and see what you find: choose the same colour
scheme or choose a different colour for every column, as you wish.

In my case, I will only be analysing these columns:

• Revenues ($M)
• Revenue Percent Change
• Profits ($M)
• Profits Percent Change
• Assets ($M)
• Employees

Revenue Percent Change

There should be 18 companies in the outlier list for Revenue Percent Change and in my case there
are. I formatted them as follows, where I show just the top 31 companies. Note that I have formatted
my results in a colour that contrasts with the Revenues ($M) column formatting.

Note: I have NOT shown all of the highlighted outliers in this screenshot.

Here is the final table of extracts:

Using Z Score Analysis

If we know or believe that our data come from a normal distribution, we can use Z Score analysis
to find and then highlight any outliers in our data.
Excel has a function that we can use to make our life very easy: STANDARDIZE() but we can use a
formula instead of that: Z Score = (X - mean)/Standard Deviation. Working on the fortune_Z
worksheet, let’s explore both of these options.

I have put my column headers to start in cell M5 and copied that right
In cell M6 =STANDARDIZE(D6:D127,fortune!N$6,fortune!N$8)
and note that I am using the mean and standard deviation results from the fortune worksheet
Since I am using Excel 365, that formula SPILLs down automatically
I then fill that formula right, up to column R, the Employees column

That has given me the Z Scores for all data points for all variables from Revenues to Employees,
inclusive.

Highlighting the Outliers

There are two points to notice:

Point 1

Z scores of <-2, >2 or <-3, >3 are considered to be outliers but without conditional formatting, for
example, they might be difficult to identify.

Why do I talk about Z scores of <-2, >2 or <-3, >3 … what does that mean?

A cut off of ±2 is consistent with a 95% confidence interval and might be acceptable to you
A cut off of ±3 is consistent with a 99% confidence interval and might be more acceptable to you

In my case, I have chosen to highlight ±2 in yellow and from ±2 to ±3 in Light Red. Like this:

Now we can see, for example, that there are just seven outliers in the Revenues ($M) column, the
seven largest companies by sales, but only the top three have a Z Score in excess of 3 while the others
have a Z Score of >2 and <3: we can call that fine tuning!

Similarly, for the Revenue Percent Change column, there are now only nine outliers and not the 18
we found when using Tukey’s LIF and UIF. But again, some are highlighted in red and some in yellow,
for greater granularity!

Overall, we find as follows:

Variable | No of Outliers Tukey | No of Outliers Z Score
Revenues ($M) | 10 | 7
Revenue Percent Change | 18 | 9
Profits ($M) | 8 | 5
Profits Percent Change | 14 | 2
Assets ($M) | 14 | 6
Employees | 9 | 2

Point 2

There are many #VALUE! errors in the Profits Percent Change column: that is because of the lack of
data in that column, which shows the character "-" where there is no value.

If we deleted the "-" character from those cells, a value would appear for the Z Score against them,
but that would be an error in itself. The best approach is to wrap all of our STANDARDIZE() functions
in the IFERROR() function, which you should do!
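
For the Revenues column, for example, the wrapped version of the formula we entered in cell M6 might look like this, returning "-" where there is nothing to standardise:

=IFERROR(STANDARDIZE(D6:D127,fortune!N$6,fortune!N$8),"-")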

Note, on the fortune worksheet, those “-“ in column G are all shown as being highlighted in Yellow …
find a way to prevent that happening, since it is misleading.

Alternatively

Instead of the STANDARDIZE() function, we can use this formula,

Z Score = (X - Mean)/Standard Deviation

I have demonstrated this for a few values on the fortune_Z worksheet.
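
As a sketch of that alternative, using the same mean and standard deviation cells as before, a single dynamic array formula would be:

=(D6:D127-fortune!N$6)/fortune!N$8

which should SPILL to the same results as the STANDARDIZE() version.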

What do you notice about these values compared to the values provided by the STANDARDIZE()
function? They should be exactly the same!

5 Box and Whisker Plots: Excel allows you to create Box and Whisker plots, which are excellent for
visualising the distribution of your data and identifying outliers.

Here are the Box & Whisker plots for the fortune data, fortune_BW worksheet

I selected all of the data in the six columns we have been analysing and created a single Box &
Whisker plot from there. It’s OK but not so easy to read. So, I created six individual Box & Whisker
plots.

What conclusions can we draw from this information?
6 Non Parametric Methods: Median Absolute Deviation and the Modified Z Score. For data that
does not follow a normal distribution, non parametric methods like using the Median Absolute
Deviation (MAD) and the Modified Z Score can be more appropriate for outlier detection.

The Median Absolute Deviation (MAD) is a non parametric method for detecting outliers,
particularly useful in datasets that do not follow a normal distribution. Here's a detailed
explanation of how MAD works and how it can be implemented in Excel:

Understanding MAD

Concept: MAD measures the variability of a dataset and is less sensitive to outliers than standard
deviation or variance. It's based on the median, a robust measure of central tendency.

Implementing MAD in Excel on the fortune_MAD worksheet

Find the Median: Use the MEDIAN function to find the median of your data. The median is already
included in the descriptive statistics found on the fortune worksheet.
Calculate Absolute Deviations: In cell M6, calculate the absolute deviation of the data point from the
median. For example, the Revenue median is in cell fortune!N$7, so the formula in cell M6 is
=IFERROR(ABS(D6-fortune!N$7),"-")
Calculate the MAD: In cell M4 =MEDIAN(M6:M127): this is the MAD.
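
Alternatively, as a sketch, the MAD for a single clean column of values (assuming the Revenues data in D6:D127, with no text entries) can be computed in one cell with LET():

=LET(x, D$6:D$127, m, MEDIAN(x), MEDIAN(ABS(x - m)))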

Outlier Detection

Threshold for Outliers: A common approach is to define outliers as those points that are a certain
number of MADs away from the median. A typical threshold is 2 or 3 times the MAD.
Flagging Outliers: Create a formula to flag data points that exceed this threshold. T6 =IF(M6 >2 *
M$4, "Outlier", "Normal"): you can drag that formula down and to the right to complete your
analysis, as follows:

Outlier Analysis, Threshold 2

Variable: Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
No of Outliers: 30 | 29 | 26 | 46 | 32 | 27
Outlier | Normal | Normal | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Normal | Outlier
Outlier | Outlier | Outlier | Outlier | Outlier | Normal
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Normal | Outlier | Normal | Normal
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Outlier | Normal | Normal

In the row under the title Outlier Analysis you will find this formula, T4 =COUNTIFS(T6:T127,"Outlier"),
which counts the number of outliers identified by the MAD method using the threshold of 2.

I have put the threshold level in cell W3 and changing it from 2 to 3 gives us these results:

Outlier Analysis, Threshold 3

Variable: Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
No of Outliers: 23 | 22 | 18 | 40 | 27 | 20
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Normal | Normal | Normal | Normal | Outlier
Outlier | Outlier | Outlier | Outlier | Outlier | Normal
Outlier | Normal | Outlier | Normal | Outlier | Outlier
Outlier | Normal | Normal | Outlier | Normal | Normal
Outlier | Outlier | Outlier | Normal | Outlier | Normal
Outlier | Normal | Normal | Outlier | Normal | Normal

As expected, the number of outliers flagged falls quite a bit!

Considerations

Robustness: MAD is more robust than standard deviation, making it suitable for skewed distributions
or datasets with outliers: the Fortune data we are using here are not normally distributed, as you can
see from the histograms below. We might conclude that the MAD method is the most appropriate of
the methods we have used.

Analysing the entire 1,000 company fortune database might give us completely different insights, of
course.

Context Specific Threshold: The threshold multiplier (eg 2 or 3 times MAD) can be adjusted based on
the specific context and how sensitive you want the outlier detection to be.
Interpretation: As with any statistical method, the interpretation of outliers should consider the
context of the data. Sometimes what is statistically an outlier might be important in the context of
the study.
Limitations: While MAD is useful for outlier detection, it may not be the best choice for all types of
data or analyses. Understanding the nature of your data and the goal of your analysis is crucial in
selecting the right method.

In summary, MAD offers a straightforward and robust way to detect outliers, especially in non-
normally distributed datasets. Its implementation in Excel is relatively simple and can be a valuable
part of your data analysis toolkit.

Understanding the Modified Z Score

A Modified Z Score is an alternative to the traditional Z score and is particularly useful for datasets
that are not normally distributed or when the sample size is small. It is a robust method of
identifying outliers, similar to the Median Absolute Deviation (MAD) method, but it incorporates
elements of the Z score approach.

Definition and Calculation

Traditional Z Score: In a traditional Z score calculation, you subtract the mean from each data point
and then divide by the standard deviation. This measures how many standard deviations away a data
point is from the mean.

Modified Z Score: Instead of using the mean and standard deviation, the modified Z score uses the
median and MAD. The formula for the Modified Z Score is

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{MAD}$$

where

M_i = the Modified Z Score of an individual value of a variable
0.6745 = a constant factor used to make the Modified Z Score more comparable to the traditional Z
score in terms of identifying outliers under a normal distribution
x_i = an individual data value of a variable
x̃ = the median of the variable
MAD = the Median Absolute Deviation of the variable

Advantages of Modified Z-Score

Robustness to Skewness and Small Sample Sizes: Unlike the traditional Z score, which can be heavily
influenced by outliers (especially in a skewed distribution), the modified Z score is more robust
because it uses the median and MAD.

Better for Non Normal Distributions: Since it doesn't rely on the assumption of normality, the
modified Z score is suitable for a wider range of data types.

Useful for Smaller Datasets: Traditional Z scores may not be reliable for small datasets due to the
influence of outliers on the mean and standard deviation but the modified Z score remains effective
even with fewer data points.

Using the Modified Z Score for Outlier Detection

The worksheet is exactly the same as for the MAD except that the results in columns T:Y are
changed to illustrate the Modified Z Score.

This means that the MAD values are the same as before. I have added the median values for each
variable in the range D4:I4:

Median | 59,936.50 | 0.13 | 5,832.00 | 0.46 | 92,556.00 | 79,000.00

Then the results of applying the Modified Z Score analysis to our Fortune data are on the
fortune_mod_Z worksheet:

T6 =IFERROR(0.6745*(D6-D$4)/M$4,"-") is copied down to T127 and across to Y6:Y127

Modified Z Score Analysis, Threshold ±3.5

Variable: Revenues ($M) | Revenue Percent Change | Profits ($M) | Profits Percent Change | Assets ($M) | Employees
No of Outliers: 11 | 13 | 8 | 14 | 16 | 9
14.39 | -0.80 | 1.26 | -0.71 | 1.68 | 28.03
11.51 | 0.68 | 4.42 | 0.16 | 3.62 | 19.30
8.59 | 1.57 | 14.27 | 0.29 | 2.85 | 0.95
6.52 | -0.31 | 0.33 | -0.57 | 1.55 | 2.26
6.39 | -0.08 | 1.84 | -0.53 | 1.32 | 3.42
6.34 | 3.42 | 2.76 | - | 2.72 | -0.20
6.07 | -0.02 | 13.49 | 1.02 | 9.56 | 3.70
5.73 | -0.15 | -0.76 | - | -0.32 | -0.16
5.55 | 2.18 | 11.28 | 0.67 | 2.94 | 0.98
4.32 | -0.01 | -0.69 | - | -0.39 | -0.49
3.82 | 0.36 | -0.13 | -0.33 | -0.37 | 2.64
3.20 | -0.33 | -0.08 | -1.30 | 0.69 | -0.08

In the range T4:Y4, I have entered the formula to count the number of outliers in each column. For
example, in T4 =COUNTIFS(T6:T127,"<"&-3.5)+COUNTIFS(T6:T127,">"&3.5); copy that formula right,
to Y4, to see the results of the Modified Z Score method.

Threshold for Outliers: Similar to the traditional Z score, a common threshold for identifying outliers
with the modified Z score is a value greater than 3.5 or less than -3.5, although this threshold can be
adjusted based on the context and the level of sensitivity desired. I have put a placeholder in the
worksheet that will allow us to modify the threshold from ±3.5, although I have not used it here. You
are free to make the necessary change to cope with that!

Conditional Formatting: rather than programming the range T6:Y127 to show the words Outlier or
Normal, I have used conditional formatting to highlight where a value represents an outlier.

Interpretation: Data points with a modified Z score beyond the chosen threshold are considered
outliers. This method provides a balanced approach, mitigating the impact of extreme values while
still identifying significant deviations from the central tendency.

In summary, the Modified Z Score is a useful and robust statistical tool for outlier detection,
especially in scenarios where the data does not conform to a normal distribution or when dealing
with small sample sizes. Its reliance on the median and MAD makes it less sensitive to extreme
values than the traditional Z score.

If we compare the three outlier methodologies we have used here, this is what we find:

Variable | No of Outliers Tukey | No of Outliers Z Score | No of Outliers Modified Z Score
Revenues ($M) | 10 | 7 | 11
Revenue Percent Change | 18 | 9 | 13
Profits ($M) | 8 | 5 | 8
Profits Percent Change | 14 | 2 | 14
Assets ($M) | 14 | 6 | 16
Employees | 9 | 2 | 9

7 Consider the Context: Always interpret outliers in the context of your specific data. In some cases,
what appears to be an outlier could be a valid data point that is essential for your analysis.

8 Refining Your Analysis: After detecting outliers, decide whether to keep them, adjust them or
remove them, depending on your analysis's purpose and the nature of the outliers.

9 Automation and Advanced Techniques: For large datasets or more complex analyses, consider
using VBA (Visual Basic for Applications) to automate outlier detection or integrate Excel with other
tools like R or Python for more sophisticated statistical analyses.

In summary, using formulas for outlier detection in Excel requires a mix of statistical understanding
and practical skills in Excel. As we have seen in this section, knowing the right formulas and
techniques to apply helps us to interpret the results in the context of our data and analysis goals.

- Case Study: Outlier Analysis in a Business Context with Excel

I have two case studies to share with you in this section:

• a multiple regression exercise, which I use to demonstrate Standardised Residuals
• the Fortune 1000 complete dataset, which I use here to contrast with the work we have done
on just part of that data set.

Multiple Regression: Chinese Wells

Open the file chinese_wells_data.xlsx and carry out the following instructions:

This case study comprises a dataset relating to a company that digs and lines wells. What the
company wants to know is: what should its standard cost of fuel be, given the independent
variables provided?

There is a dependent variable, the Y variable, labelled Fuel Cost, and there are five independent
variables, the X variables, labelled:

• NoWells = number of wells
• Dist = distance
• Wt = weight
• Depth = depth
• TonKm = Tonne kilometers: defined as the value of Distance * Weight

Copy the data tab, to ensure we always have a clean copy of the input data

Set up the correlation matrix for these data. The correlation matrix I have created is a dynamic
matrix which means that if any of the input data were to change, the correlation matrix would
update automatically with such changes.

Enter this formula in cell K4

=CORREL(INDIRECT(K$2&$K$10&":"&K$2&$K$11),INDIRECT($I4&$K$10&":"&$I4&$K$11))

which you then fill right to P4 and down to K9:P9

Now we can set up the descriptive statistics table for the data, using the BYCOL() function, as before:

Enter this formula in cell K14=BYCOL(A$4:F$43,LAMBDA(column,AVERAGE(column))) which will SPILL
automatically across to P14

Fill K14:P14 down to row 26 and then for every line item, change AVERAGE to the appropriate
alternative. For example, in row 16 change AVERAGE to MEDIAN, in row 17 change AVERAGE to
MODE and so on

Notice that we need different formulas for the Standard Error and for the Range: they are given under their own headings below.

Standardised Residuals

Standardised residuals essentially use Z Scores to alert us to the possibility of a residual value being
an outlier. As part of this analysis, we assume that the data we are working on come from the
normal distribution.

A standardised residual beyond ±2 suggests that a residual falls outside the 95% confidence level of
the data set from which it came; a standardised residual beyond ±3 suggests that it falls outside the
99.7% confidence level of the data set from which it came.

We are about to find these values and then illustrate the results at both the ±2 and the ±3 levels of
confidence.

Standard Error

Standard Error = Standard Deviation/SQRT(Count)

K15 =K18/SQRT(K26) and, for consistency, we might use this instead:

K28 =BYCOL(A$4:F$43,LAMBDA(column,STDEV.S(column)))/SQRT(BYCOL(A$4:F$43,LAMBDA(column,COUNT(column))))

Range

K22 =K24-K23 and, again, for consistency, we might use this instead:

K22 =BYCOL(A$4:F$43,LAMBDA(column,MAX(column)))-BYCOL(A$4:F$43,LAMBDA(column,MIN(column)))

In cell I4 set up the LINEST function to create a regression model of the data

The syntax of the LINEST() function is LINEST(known_ys,known_xs,[const],[stats])

=IFERROR(LINEST(F4:F43,A4:E43,1,1),"-")

If we don’t use IFERROR(), we don’t make a mistake but the output we get looks a bit scruffy.

Notice the most important point: in the range A3:F3 we see the variable names NoWells, Dist, Wt and
so on, but in the LINEST() output the coefficients appear in the opposite direction. It is not an error;
that is what Excel does for us. Just remember it and work with it.

In this case, the regression equation we are provided with is this:

Y = a + b1 * NoWells + b2 * Dist + b3 * Wt + b4 * Depth + b5 * TonKm

We apply this model on a row by row basis, as follows, for row 1, by creating this formula in cell G4:

Y = 31,271.8877 + 3,009.7950 * 18 -75.8198 * 1574 -84.4036 * 1077.81 + 2.7906 *31,819 + 0.9610 * 94,713.32
Y = 54,959.22

We then fill that formula down to cell G43 to give our estimates of the standard cost of fuel for
this example.

We have two questions now:

• What do we think of the quality of our estimates? and
• Are there any outliers in this model?

The Residuals column helps us with the first question, see column H where

Actual Value – Predicted Value = Residuals

The Standardised Residuals column helps us with the second question, see column I, where the
Standardised Residual = Residual Value/Standard Error of the Residuals*

For purposes like outlier detection and understanding the relative magnitude of the residuals, using
the standard error of the residuals to calculate standardised residuals is more appropriate. This
approach normalises the residuals in terms of the expected distribution of errors, making it easier to
identify those observations that are significantly deviating from the model's predictions

The formula for finding the Standard Error of the Residuals is in cell L14 and it is this:

*Standard Error of the Residuals =SQRT(SUMXMY2(F4:F43,G4:G43)/(COUNT(G4:G43)-5-1))

We can now highlight the results of the Standardised Residuals using conditional formatting, at ±2
or ±3 … I have programmed both, which is why you can see the following. Where the numbers are
not highlighted, there are no outliers.

Z Score Results: outlier detection

Excel has a built in STANDARDIZE() function to help us to find the Z Scores that give insights into the
outlier situation of each line item. You can find this on the data_ZScore worksheet:

As we can see there is just one line item that has been classified as containing outliers and that is line
item D1, which reads as follows:

I think it is pretty clear why that line is considered to be an outlier row!

From a much larger database, the Fortune 1000 for 2023, here are some of the results of the
application of the STANDARDIZE() function to the values of the variables included below, which is
followed by the histograms that describe these results:

We can see that in all cases, the data are heavily positively skewed.

Grubbs’ Test

If we believe that the maximum value or the minimum value of a variable is an outlier then,
assuming the data are normally distributed, we can apply Grubbs' Test to check whether that is a
valid hypothesis.

In chapter_three.xlsx on the chinese_grubbs worksheet, I have tested whether the NoWells value for
D1, the row 31 value in the table, which is 3, is an outlier, since it is the minimum value and so small.

If you suspect that the maximum value in the dataset is an outlier, the test statistic is calculated as:

$$G = \frac{x_{max} - \bar{x}}{s}$$

If you suspect that the minimum value in the dataset is an outlier, the test statistic is calculated as:

$$G = \frac{\bar{x} - x_{min}}{s}$$
And if you are not sure whether the maximum value or the minimum value in the dataset is an
outlier and you want to perform a two sided test, then the test statistic is calculated as:

$$G = \frac{\max_i |x_i - \bar{x}|}{s}$$

where $\bar{x}$ is the sample mean and s is the sample standard deviation.

Results

Firstly, draw the histogram to check whether the NoWells data are from a normal distribution:

The data are not entirely normal and the skewness coefficient is -0.59378 but we will assume the
data come from a normal distribution so that we can carry on with Grubbs!

All of the work we need is shown in column J in the screenshot that follows:

The test statistics, G, in cells J8:L8 are 1.6893, 2.9950 and 2.9950

The critical value, Gcritical, in cell J15 is 2.867542.
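
If you would like to reproduce Gcritical yourself, here is a sketch of the standard Grubbs critical value formula; I am assuming a one sided test with n = 40 observations and α = 0.05, which reproduces a value close to the 2.867542 above:

$$G_{critical} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2}{n-2+t^2}}, \qquad t = t_{\alpha/n,\,n-2}$$

In Excel, again only as a sketch:

=((40-1)/SQRT(40))*SQRT(T.INV(0.05/40,40-2)^2/(40-2+T.INV(0.05/40,40-2)^2))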

Since test statistic 1 (1.6893) is less than the critical value, that value is not an outlier in this
dataset; since test statistics 2 and 3 (both 2.9950) are greater than the critical value, those values
are outliers in this dataset.

Generalized (Extreme Studentized Deviate) ESD test

The generalized (extreme Studentized deviate) ESD test (Rosner 1983) is used to detect one or more
outliers in a univariate data set that follows an approximately normal distribution.

The primary limitation of the Grubbs test [and the Tietjen-Moore test] is that the suspected number
of outliers, k, must be specified exactly. If k is not specified correctly, this can distort the conclusions
of these tests. On the other hand, the generalized ESD test (Rosner 1983) only requires that an upper
bound for the suspected number of outliers be specified.

Given the upper bound, r, the generalized ESD test essentially performs r separate tests: a test for
one outlier, a test for two outliers, and so on up to r outliers.

The generalized ESD test is defined for the hypothesis:

H0: There are no outliers in the data set

Ha: There are up to r outliers in the data set

Test Statistic: Compute

$$R_i = \frac{\max_i |x_i - \bar{x}|}{s}$$

with $\bar{x}$ and s denoting the sample mean and sample standard deviation, respectively.

Remove the observation that maximises $|x_i - \bar{x}|$ and then recompute the above statistic
with n - 1 observations. Repeat this process until r observations have been removed. This
results in the r test statistics R1, R2, ..., Rr.

Corresponding to the r test statistics, compute the following r critical values:

$$\lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}{\sqrt{(n-i-1+t^2_{p,\,n-i-1})(n-i+1)}}, \qquad i = 1, 2, \ldots, r$$

where $t_{p,\nu}$ is the 100p percentage point from the t distribution with ν degrees of freedom
and

$$p = 1 - \frac{\alpha}{2(n-i+1)}$$

The number of outliers is determined by finding the largest i such that $R_i > \lambda_i$.

Simulation studies by Rosner indicate that this critical value approximation is very
accurate for n ≥ 25 and reasonably accurate for n ≥ 15.

Note that although the generalized ESD is essentially Grubbs' test applied
sequentially, there are a few important distinctions:

• The generalized ESD test makes appropriate adjustments to the critical values based on the
number of outliers being tested for, which the sequential application of Grubbs' test does not.
• If there is significant masking, applying Grubbs' test sequentially may stop too soon. The example
below identifies three outliers at the 5% level when using the generalized ESD test; trying to use
Grubbs' test sequentially would stop at the first iteration and declare no outliers.

1.3.5.17.3. Generalized Extreme Studentized Deviate Test for Outliers (nist.gov)


See the ESD worksheet in the chapter_three.xlsx file.

Following the Rosner paper, we test for up to 10 outliers:

H0: there are no outliers in the data
Ha: there are up to 10 outliers in the data

Significance level: α = 0.05
Critical region: reject H0 if Ri > the critical value
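
If you would like to check the worksheet results outside Excel, here is a minimal R sketch of the generalized ESD test exactly as defined above. The function name esd_outliers and its defaults are my own labels, not part of the workbook, and it assumes x is a numeric vector with no missing values:

# A minimal sketch of the generalized ESD test, following the formulas above
esd_outliers <- function(x, r = 10, alpha = 0.05) {
  n <- length(x)
  x_work <- x
  R <- lambda <- numeric(r)
  for (i in 1:r) {
    xbar <- mean(x_work)
    s <- sd(x_work)
    idx <- which.max(abs(x_work - xbar))
    R[i] <- abs(x_work[idx] - xbar) / s              # test statistic Ri
    p <- 1 - alpha / (2 * (n - i + 1))
    t <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t /                       # critical value lambda_i
      sqrt((n - i - 1 + t^2) * (n - i + 1))
    x_work <- x_work[-idx]                           # remove the most extreme point
  }
  k <- which(R > lambda)
  if (length(k) == 0) 0 else max(k)                  # largest i with Ri > lambda_i
}

# esd_outliers(wells$NoWells) # hypothetical usage, assuming a wells data frame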

Impact Analysis

Impact analysis is an attempt to assess the effect that outliers might have on a dataset. The way we
carry out such an analysis is to calculate the mean, the median and the standard deviation of the
complete dataset and then repeat those calculations with the values we assess to be outliers
removed.

I have included Impact Analysis on the impact worksheet of the file chapter_three.xlsx

The outliers I have assumed for this demonstration are the ones we found on the ESD worksheet
when we used the IQR analysis and found the upper and lower inner fences for the dataset on that
worksheet.

• I have repeated the entire data set in the range A4:A57, complete with outliers
• I have shown the dataset less the outlier values in the range B4:B54.

Firstly, evaluate the mean, median and standard deviation of these two datasets:

We then use the data analysis toolpak to carry out the t-Test: Two-Sample Assuming Unequal
Variances for us, which is already done for you:

To use a t-test and p-values for hypothesis testing to assess the potential impact of
removing outliers from your data, you're essentially comparing the means of two
datasets: one with outliers and one without.
Formulate Hypotheses

First, set up your null and alternative hypotheses.

Null Hypothesis (H0): The mean of the dataset with outliers is equal to the mean of the dataset
without outliers. (ie Outliers do not significantly affect the mean).

Alternative Hypothesis (H1): The mean of the dataset with outliers is not equal to the mean of the
dataset without outliers. (ie Outliers significantly affect the mean).

Prepare Your Data

Dataset with Outliers: This is your original dataset.

Dataset without Outliers: This dataset is created by removing outliers from the original dataset.

Conduct the t-test

Then we can perform an independent samples t-test if your data points are independent of each
other, using Excel's Data Analysis ToolPak
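
As a quick cross check on the ToolPak output, Excel's T.TEST() worksheet function returns the p-value directly. This is only a sketch and it assumes the dataset with outliers sits in A4:A57 and the dataset without outliers in B4:B54, as on the impact worksheet:

=T.TEST(A4:A57,B4:B54,2,3)

where tails = 2 requests a two-tailed test and type = 3 requests the two-sample unequal variance version.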

Interpret the Results


t-Statistic: This tells you how much the means of the two groups differ in terms of standard errors.

P-value: This value helps you determine the significance of your results.

If the p-value is less than your chosen significance level (eg 0.05), you reject the null hypothesis,
suggesting that outliers have a significant impact on the mean.

If the p-value is greater than the significance level, you fail to reject the null hypothesis, suggesting
no significant impact from outliers.

Considerations
Assumptions: Ensure your data meets the assumptions of the t-test, such as normality. Outliers can
impact normality, so it’s important to check this.

Effect Size: Alongside the t-test, consider calculating the effect size (like Cohen's d: see below for the
results of this further test) to understand the magnitude of the difference.

Contextual Interpretation: Always interpret the results in the context of your specific data and
research question. Statistical significance doesn't always equate to practical significance.

Remember, removing outliers can significantly impact your analysis, so the decision to remove them
should not be taken lightly and should be well justified in the context of your specific data and
analysis objectives.

Interpretation of the Example


Hypothesized Mean Difference: 0

This implies that you are testing the hypothesis that there is no difference in the means of the two
groups (with and without outliers).

Degrees of Freedom (df): 98

Degrees of freedom are used to determine the critical value of t. Note that 98 is fewer than the
simple n1 + n2 - 2 = 103: see A Note on Degrees of Freedom at the end of this section.

t Statistic: 0.9432

The t Statistic is a measure of the difference between the two group means in terms of the standard
error. A t Statistic of 0.9432 suggests that the difference between the means is less than one
standard error away from zero.
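
As a check on what the statistic measures, it can be reproduced by hand; t_stat below is a hypothetical helper that works on any two vectors:

# t statistic for the unequal-variances test: the difference in means
# divided by its standard error
t_stat <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  (mean(x) - mean(y)) / se
}
# t_stat(with_outliers, without_outliers) reproduces the t Statistic above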

P(T<=t) one-tail: 0.1740

This is the p-value for a one-tailed test. It indicates the probability of observing a t Statistic at
least as extreme as 0.9432 if the null hypothesis is true. In this case, there is a 17.4% chance of
finding a t Statistic this extreme through random variation alone.
t Critical one-tail: 1.6606

This is the critical value of t for a one-tailed test at the chosen significance level (usually 0.05). Since
your t Statistic (0.9432) is less than the critical value, you do not reject the null hypothesis in a one-
tailed test.

P(T<=t) two-tail: 0.3479

This is the p-value for a two-tailed test. It is the probability of observing a t Statistic at least as
extreme as 0.9432 in either direction if the null hypothesis is true. A p-value of 34.79% is quite high,
indicating that a difference in means of this size could easily arise by chance alone.

t Critical two-tail: 1.9845

This is the critical value of t for a two-tailed test. Again, your t Statistic is less than this critical value.

Interpretation
Since the p-values for both the one-tailed (0.1740) and two-tailed (0.3479) tests are higher than any
commonly used significance level (eg 0.05), you fail to reject the null hypothesis in both cases.

This suggests that there's no statistically significant difference between the means of your two
datasets (with and without outliers). In other words, the outliers do not have a significant impact
on the mean of your dataset.

It's important to remember that "failing to reject the null hypothesis" is not the same as proving the
null hypothesis is true. It simply means there isn't enough evidence in your sample data to conclude
a significant difference due to outliers.

Always consider these results in the context of your specific data and research question. While
statistically, outliers might not impact the mean significantly, they could still be relevant for your
specific case, especially if they represent true but rare events.

A Note on Degrees of Freedom (df)


The basic definition of df is n1 + n2 – 2, which in this case is 54 + 51 – 2 = 103

But the toolpak shows 98 df.

Formally, the ToolPak uses the Welch–Satterthwaite formula to evaluate the number of degrees of
freedom:

$$\mathrm{df} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$

Where:

n1 = number of observations of dataset 1
n2 = number of observations of dataset 2
s1 = standard deviation of dataset 1
s2 = standard deviation of dataset 2

Substituting the worksheet figures, with 4.2 and 3.8 standing in for the variance terms s1^2 and
s2^2, that boils down to this: =((4.2/54 + 3.8/51)^2) / ((4.2/54)^2/(54-1) + (3.8/51)^2/(51-1)) = 102.9939

So, why does the ToolPak return df=98?

The Excel ToolPak applies this formula automatically when you select the option for "t-Test:
Two-Sample Assuming Unequal Variances", working from the exact sample variances in the
worksheet rather than rounded summary figures, and it reports the result as a whole number. That
is why a hand calculation from rounded inputs, like the one above (roughly 103), can differ from the
98 shown in the output.

This calculation adjusts the degrees of freedom downward when there are large discrepancies in
variance or sample size between the two groups, which helps keep the test accurate under these
conditions.
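
If you want to check the figure in R, a small sketch follows; welch_df is a hypothetical helper, and t.test() reports the equivalent value in its own output:

# Welch–Satterthwaite degrees of freedom, computed from two vectors
welch_df <- function(x, y) {
  vx <- var(x) / length(x)   # s1^2 / n1
  vy <- var(y) / length(y)   # s2^2 / n2
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}
# welch_df(with_outliers, without_outliers) matches the "df" line from t.test()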

Cohen’s d
I mentioned Cohen's d above: it is a measure of effect size used to indicate the standardised
difference between two means. It is calculated by subtracting the mean of one group from the mean
of the other and dividing the result by a pooled standard deviation. It is easy to find Cohen's d in
Excel: see the impact worksheet. The same calculation is sketched in R below.
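
For readers working in R rather than Excel, a minimal sketch, assuming the pooled-standard-deviation form of the formula; cohens_d is a hypothetical helper:

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}
# cohens_d(with_outliers, without_outliers) gives the effect size for this example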

Interpreting Cohen's d
The value of Cohen's d reflects the size of the effect. A common interpretation is that

• 0.2 represents a small effect
• 0.5 a medium effect
• 0.8 a large effect

However, these should be interpreted in the context of your field of study.

Remember, Cohen's d is a measure of effect size, not statistical significance. It tells you how big the
difference is, not whether that difference is statistically significant.

About the Author

- Background and experience

I have been working in and around management accounting, financial accounting, financial
reporting, spreadsheets, and spreadsheet and financial modelling for a long time! I first started
working with spreadsheets on a Commodore 64 computer. I then graduated to a Windows-powered
desktop computer on which I used Lotus 1-2-3 spreadsheet software, finally moving on to Microsoft
Excel somewhere around 1993–1994.

I have worked as an accountant, a college lecturer, a business school professor, head of department,
head of projects, self-employed consultant, author, webmaster, blogger, freelance teacher and
trainer.

I love travelling here and there, so the three years that Covid kept me at home were strange and
strained!

I have lived and worked in North America, Asia, Africa, Europe and Oceania, and I currently live in SE
Asia, as I have done for around a decade.

Although I am by no means a professional statistician, I use some aspects of statistics as part of the
work I do: descriptive statistics, regression analysis, numerical and ratio analysis of data, visualisation
and so on. I carry out outlier analysis from time to time in such work, when I feel the need to know if
something is out of kilter with the rest of the data!

I am a lifelong learner and I keep myself busy with my formal work and with the kinds of projects
that reading a variety of business, accounting and analytics blogs and websites can generate. I also
answer questions on www.quora.com, as well as elsewhere, and all of these ideas keep me
interested in switching on my laptop most days.

- Contact information and professional links

I have a smartphone that I tend to use for many things, but phoning is not one of them! The best way
of contacting me is via LinkedIn: just search for me by name. My blog is called Duncan's Diurnal
Diatribe (duncanwil.blogspot.com) and you can contact me through that. It is best, though, if you use
my email address, duncanwil@gmail.com; if you write to me, I will reply, provided it is clear what
you want me to do!

