
JOURNAL OF INFORMATION SYSTEMS
American Accounting Association
Vol. 35, No. 3, Fall 2021
pp. 199–221
DOI: 10.2308/ISYS-2020-025

Applications of Data Analytics: Cluster Analysis of Not-for-Profit Data
Zamil S. Alzamil
Majmaah University

Deniz Appelbaum
Montclair State University

William Glasgall
The Volcker Alliance

Miklos A. Vasarhelyi
Rutgers, The State University of New Jersey
ABSTRACT: Since data analytics enables data exploration and the uncovering of hidden relationships, in this study,
we use cluster analysis to gain more insight into governmental data and financial reports. This research initiative is
performed using the Design Science Research (DSR) methodology, where we develop and apply an appropriate
artifact. We apply our artifact to two different datasets: the first uses the U.S. states’ financial statements, and the
second utilizes survey results from the Volcker Alliance about states’ budgeting performance. In both applications,
we demonstrate how clustering may be used on governmental data to gain new insights about financial statements
and budgeting. This study contributes to the literature in two ways: First, the two applications bring advanced data
mining techniques into the not-for-profit domain; and second, the results provide guidance for auditors, academics,
regulators, and practitioners to use clustering to gain more insights.
Keywords: not-for-profit data; governmental data; data analytics; k-means clustering; hierarchical clustering;
audit analytics.

I. INTRODUCTION

Data mining is the process of gaining insights and identifying interesting patterns and trends from data, where the insights, patterns, and trends are generally unknown yet statistically reliable, meaning that some decisions could be
taken to exploit the knowledge (Elkan 2001). Data mining is also defined as ‘‘a process that uses statistical,
mathematical, artificial intelligence and machine learning techniques to extract and identify useful information and
subsequently gaining knowledge from a large database’’ (Turban, Sharda, and Delen 2011, 21).
Cluster analysis as a data mining approach can help find similar objects in data. Kaufman and Rousseeuw (2009) have
defined cluster analysis as ‘‘the art of finding groups in data.’’ Objects in the same group are similar to each other, and objects in
different groups are as dissimilar as possible (Kaufman and Rousseeuw 2009). Cluster analysis is a type of unsupervised learning in which no predefined classes are required. It can be used alone to gain insight into the data

We thank the participants and reviewers of our paper at the 2020 JISC Conference, and our paper’s anonymous reviewers for all of their help and support.
Zamil S. Alzamil also thanks the Deanship of Scientific Research at Majmaah University for supporting him at the 2020 JISC Conference under Project
No. R-2021-151.
Zamil S. Alzamil, Majmaah University, College of Computer and Information Sciences, Computer Science Department, Al-Majmaah, Saudi Arabia;
Deniz Appelbaum, Montclair State University, Feliciano School of Business, Department of Accounting and Finance, Montclair, NJ, USA; William
Glasgall, The Volcker Alliance, New York, NY, USA; Miklos A. Vasarhelyi, Rutgers, The State University of New Jersey, Rutgers Business School,
Accounting and Information Systems, Newark, NJ, USA.
Editor’s note: Accepted by Tawei Wang, under the Senior Editorship of Alexander Kogan and Patrick Wheeler.
Submitted: June 2020
Accepted: July 2021
Published Online: July 2021

distribution, or it could be used as a preprocessing step where the results of the clustering algorithm are used as an input for
other algorithms.
Cluster analysis is also considered an Exploratory Data Analysis (EDA) tool used for pattern recognition and knowledge discovery. EDA comprises well-established statistical techniques, mostly graphical, that can assist in uncovering underlying structure, exploring important variables, testing hypotheses, detecting anomalies and outliers, and much more. Eventually,
the discovered insights from data could be used to support various decision making, test and evaluate current and past
decisions, or just to conduct different experiments on a specific domain (Hossain 2012).
Classifying similar objects is an essential human activity and part of everyday life. Learning from childhood to distinguish cats from dogs, cars from motorcycles, and green from yellow improves our subconscious classification schemes, which is one reason that cluster analysis is considered part of artificial intelligence and pattern recognition (Kaufman and Rousseeuw 2009).
Clustering algorithms usually work with two different input structures. The first represents n objects by their m attributes as an n-by-m matrix, where the rows correspond to the objects or points and the columns correspond to the attributes or features (Kaufman and Rousseeuw 2009). The second is an n-by-n matrix of proximities between objects, that is, the distance between every pair of objects, expressed either as similarity (how close two objects are to each other) or dissimilarity (how far two objects are from each other).
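A minimal pure-Python sketch (with hypothetical objects, assuming Euclidean distance as the dissimilarity measure) illustrates how the first input structure can be converted into the second:

```python
import math

# First structure: an n-by-m matrix -- n objects (rows) by m attributes (columns).
objects = [
    [2.0, 4.0],   # object 0
    [3.0, 5.0],   # object 1
    [10.0, 1.0],  # object 2
]

def euclidean(a, b):
    """Dissimilarity between two objects: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Second structure: an n-by-n matrix of pairwise dissimilarities.
n = len(objects)
distances = [[euclidean(objects[i], objects[j]) for j in range(n)]
             for i in range(n)]

# The resulting matrix is symmetric with a zero diagonal.
```

Either structure can be fed to a clustering algorithm; some implementations accept only one of the two.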
In this work, we investigate the use of hierarchical and k-means clustering techniques on publicly available
not-for-profit datasets.1 The research approach used in this paper is Design Science Research (DSR). The DSR approach was
introduced by Hevner, March, Park, and Ram (2004) and Peffers, Tuunanen, Rothenberger, and Chatterjee (2007), and
provides a methodology for the design and presentation of artifacts or solutions to research problems. AIS research using the
DSR approach includes Geerts (2011), Nehmer and Srivastava (2016), and Appelbaum and Nehmer (2017, 2020). This paper
follows Geerts’ (2011) application of DSR, which suggests six major processes. The six phases are:
1. Problem identification and motivation
2. Define the objectives of a solution
3. Design and development of an artifact that meets the solution objectives
4. Demonstration of the solution
5. Evaluation of the solution
6. Communication of the problem and the solution (usually an article)
In the first application, we explore the literature for the use of emerging data mining techniques in the audit of not-for-profit
entities. Then we apply k-means and hierarchical clustering on U.S. states’ financial statements data. In particular, we
demonstrate how cluster analysis could be applied as a supportive tool for auditing not-for-profit and governmental bodies. The
second demonstration utilizes the Volcker Alliance’s survey data results. The survey produces extensive information about how
the different U.S. states score annually on budgeting using five measures. In both applications, we demonstrate how cluster
analysis could be used on not-for-profit and governmental data to help gain more audit and business insights about financial
statements and budgeting.

II. PROBLEM IDENTIFICATION AND MOTIVATION: LITERATURE REVIEW


Data analysis is expected to reveal previously unknown and unidentified information about a dataset. The expectation is
that the analytic approach is different from that of humans and that the technology brings another perspective. Before applying
a particular analytical approach, the auditor/accountant/analyst should explore the data and uncover any hidden relationships.
However, the problem is that there appear to be few published guidelines about how to conduct such a data exploration and
gain understanding. What follows is a review of extant research regarding this problem.

Cluster Analysis
Many studies in different domains discuss cluster analysis. In the accounting domain, previous research addresses the
possibility of using unsupervised learning or cluster-based analysis to identify some interesting patterns. In risk analysis, for
instance, cluster analysis is applied to identify groups of accounts, or customers who display unusual behavior. Williams and

1
For the sake of simplicity, in this paper, not-for-profit refers to the broad category of all entities whose goal is not that of profit-making for their owners.
They exist and operate for a collective social good and benefit. This could include not-for-profits, nonprofits, charities, foundations, municipalities,
hospitals, state universities, state and local governments, and federal or country-wide governments. Although sometimes nonprofits, not-for-profits, and
governmental entities are mentioned separately, this paper elects to take a broader view of these categories and consider them as one (see: https://en.
wikipedia.org/wiki/Nonprofit_organization).


Huang (1997) combine k-means clustering and the decision tree supervised algorithm for insurance risk analysis to find
customer groups that highly influence an insurance portfolio. They use an approach named the hot spot methodology, where k-
means clustering is used to group the data, then apply the decision tree algorithm to get sets of rules, and, finally, evaluate the
results to identify the high-risk patients. This method facilitates the exploration of different key groups or clusters so that the insurance premiums for different patients can be reassessed.
In anomaly detection, Thiprungsri and Vasarhelyi (2011) use cluster analysis to detect anomalies. The purpose of the paper
is to examine the use of cluster analysis techniques to automate fraud filtering during the auditing process and to support
auditors when evaluating group life insurance claims. By using a simple k-means clustering approach, they detect possible
fraudulent claims resulting from anomalous clusters, thereby facilitating the work of internal auditors. Byrnes (2019) also
explores the potential of using cluster analysis methods in the area of auditing. He implements an artifact that automatically
processes comprehensive clustering and outlier detection on behalf of a user on credit card customers’ data. Thus, the user does
not need to be an expert in the technical aspects of clustering, such as choosing the right clustering techniques, what model best
represents the data in terms of cluster quality, and what objects are considered as outliers.
Cluster analysis is also widely used in other domains such as market research, particularly in market segmentation
(Thiprungsri and Vasarhelyi 2011). For instance, dividing customers into segments or clusters based on demographic
information results in different clusters that could be targeted with specific services. Kim and Ahn (2008) apply a hybrid k-
means clustering algorithm to segment an online shopping market for customers and operationalize a product recommendation
system.
Cluster analysis is also used in fraud detection. Outlier detection is an often-used technique to detect fraud. For example,
fraud in banking could be considered in two categories: behavioral and applicational fraud (Bolton and Hand 2001). Behavioral
fraud happens when credit card data are compromised. On the other hand, applicational fraud occurs when customers apply for
new credit cards. So, clustering methodologies could identify high-risk customers.
Srivastava, Kundu, Sural, and Majumdar (2008) apply a k-means clustering algorithm to identify fraudulent credit card
transactions using the amount of a transaction attribute. Then they categorize customers or profiles based on their spending
behavior. For instance, high-spenders will be in one group, low-spenders will be assigned to another group, and medium-
spenders will be assigned to a third. Thus, by using this profiling approach, they are able to identify possible fraudulent
transactions.
Cluster analysis is also used in detecting fraud in the telecommunication domain. Hilas and Mastorocostas (2008) use both
supervised and unsupervised machine learning approaches to investigate the usefulness of applying those methods in
telecommunication fraud detection. They use cluster analysis for their unsupervised learning approach; in particular, they use
hierarchical agglomerative clustering. The method successfully groups a distinct cluster of outliers.
In the healthcare domain, cluster analysis is applied to identify or group patients with similar types of cancer (Thiprungsri
and Vasarhelyi 2011). It is also being used to identify adverse drug reactions (ADRs). Harpaz et al. (2011) present a new
pharmacovigilance data mining method based on the Biclustering model, which aims to identify drug groups that share a
common set of ADRs. Biclustering differs from typical clustering, which is applied only to the rows or only to the columns of a matrix separately; Biclustering clusters rows and columns simultaneously. The statistical tests performed show that it is unlikely that the Biclusters’ structure and their contents could
have arisen by chance. The results show that many cases for drugs in the same class were associated with a specific adverse
reaction; this suggests that any member of the class could experience similar adverse events.
Clustering is mainly used for data exploration and pattern recognition using visualizations and other techniques. As
the size of a dataset increases, more methods and tools are required to facilitate interpretation and understanding of the
data. Cluster analysis is one way that enhances understanding of the data structure of large datasets. Fua, Ward, and
Rundensteiner (1999) apply hierarchical clustering methodology for visualization of large multivariate datasets. Their
research mainly focuses on an interactive visualization of large multidimensional datasets using a parallel display
technique. They propose a hierarchical clustering approach using a software called XmdvTool2 that displays
multiresolution views of the data with the ability to navigate and filter large datasets to facilitate the discovery of
hidden trends and patterns.
Eisen, Spellman, Brown, and Botstein (1998) use cluster analysis for genome pattern exploration. They utilize a
hierarchical clustering method to support biologists in the study of gene expression datasets. Pairwise average linkage is used to
calculate the distances between observations. They use the hierarchical tree of genes to represent different degrees of
similarities and relationships between genes, thus ordering the different genes so that the genes with similar expressions are
adjacent to each other. The ordered data are then displayed graphically. The quantitative values are represented using a

2
See: https://davis.wpi.edu/xmdv/


naturalistic color scale rather than numbers. They find that this alternative way of representing a large volume of quantitative data is easier for humans to interpret.
In addition, clustering is not recommended only for large high-dimensional data; multivariate techniques such as cluster analysis can also produce accurate results on small datasets. More often, the high dimensionality of large datasets is reduced to a small number of latent variables that enable the clustering process (Zhao, Zou, and Chen 2014). The accuracy and effectiveness of traditional clustering techniques diminish on larger, hyper-dimensional datasets, which calls for new methodologies, such as topic modeling, an active research field mainly used to reduce high-dimensional data into smaller representations for more accurate and robust results.
To summarize, clustering is used strategically in many fields to explore data and detect outliers. Even though the type and
size of data may vary, research supports the use of clustering for many applications and domains. Since there are few papers
that address the use of clustering in auditing, where data exploration and anomaly detection are paramount tasks, this paper
aims to address this gap. It is also possible that such a gap exists in the not-for-profit domain.

Advanced Data Analytics Techniques for Not-for-Profit Data


Advanced data analytics techniques bring challenges for researchers and practitioners in accounting and auditing of all
entities, whether for profit or not. Challenges include adapting to new and sophisticated data analytics techniques and deciding which technique is suitable for a given task. At the same time, the
modern data analytics techniques provide more opportunities for researchers and practitioners in the accounting domain to be
explored and utilized (Schneider, Dai, Janvrin, Ajayi, and Raschke 2015; Liu and Vasarhelyi 2014).
Data analytics tools, along with the digitization and dissemination of governmental and not-for-profit data (e.g., financial
statements) to the public, allow auditors and any other interested party to monitor and analyze government performance (Liu
and Vasarhelyi 2014). Various data analytics techniques, such as clustering methods, can help auditors to get more insights into
data and to automate the discovery of previously unknown patterns and trends (Liu and Vasarhelyi 2014).
However, few studies have been published discussing the use of sophisticated data analytics techniques using not-for-profit
organizations’ data, such as municipal financial statements and other budgeting-related information. Bloch and Fall (2015)
discuss government debt, finances, and sustainability across countries. They also assess factors such as financial liabilities,
assets, pension liabilities, national accounts, nonfinancial assets, and debt composition for analyzing general government debt.
In another study, Arisandi (2016) applies exploratory data analytics (EDA) techniques such as data clustering to U.S. government and non-profit sector data. Ten years (2004–2013) of comprehensive annual financial report (CAFR)3 data at the state level for the 50 U.S. states are used, together with nonfinancial data, such as crime statistics, transparency awards, leadership changes, and public employee corruption convictions, drawn from many publicly available sources. Ratio analysis and k-means cluster analysis are applied. The clusters group states with similar characteristics based on the selected variables, and trends and potential anomalies are detected by using cluster analysis.
Also, in a paper4 that sheds some light on public sector audit, or armchair audit, the authors note that there is comparatively little recent research on the value of public audit in general and in New Zealand in particular. They also explain
how public sector audit makes a difference to the lives of citizens, and how it enhances the degree of confidence of intended
users in financial statements in particular.
Since governments around the world started to make their data publicly available and accessible, citizens, professionals,
and other interested groups can monitor public spending and increase overall participation (Alzamil and Vasarhelyi 2019). In
this paper, we demonstrate how data analytics, especially cluster analysis or unsupervised learning, could be used on governmental and not-for-profit data to help interested stakeholders gain more insight into financial statements and budgeting and to increase audit efficiency.
We extend this research by suggesting an artifact framework for applying modern data analytics techniques to obtain more
insights into data and to extract meaningful knowledge, which will better inform many interested stakeholders, such as a broad
range of government officials, regulators, analysts, auditors, and interested citizens. It is hoped that this work will result in
increased adoption of these advanced data mining techniques, especially by auditors working with not-for-profit organizations’ data. Because cluster analysis offers insight into the distribution of data and gives researchers directional guidance, helping researchers and practitioners understand the data they investigate, not-for-profit and governmental data are a good fit for such analysis. Clustering is rarely applied in machine learning research on

3
A Comprehensive Annual Financial Report (CAFR) is a set of financial statements for a government body such as a state, municipality, or any other
government entity that complies with the accounting requirements of the Government Accounting Standards Board (GASB). It also should be audited
by an independent auditor using generally accepted government auditing standards (GASB 2010; Hegar 2019).
4
See ‘‘The Value of Public Audit’’ at: https://www.wgtn.ac.nz/cagtr/occasional-papers/documents/introductory-material.pdf/


governmental data, although such data are widely available in many countries thanks to the open government data initiatives.
Such a demonstration of clustering using varied not-for-profit data types would be of interest to many stakeholders, especially
to promote the use of citizen armchair audit.
In addition, presenting this framework could allow citizen armchair auditors (Cesário 2020) and government auditors to
perform surveillance roles to improve the quality of information they need to monitor government entities and to prevent the waste of public resources. Since few studies have been published discussing the use of advanced data analytics techniques on not-
for-profit and governmental data, applying these techniques in this domain will result in more adoption among practitioners,
especially by auditors, and this would be a major contribution of the paper.

III. DEFINE THE OBJECTIVES OF THE SOLUTION

Cluster Analysis Overview


Large amounts of data are generated and collected every day across various domains: satellite imagery, biomedicine, business, marketing, security, internet searches, and, more importantly, government-related data. Extracting knowledge from
these ‘‘big’’ data exceeds human abilities. Cluster analysis is one way of mining these data and is one of the important data
mining techniques for discovering previously unknown patterns and trends (Kassambara 2017). In the literature, it is often referred to as ‘‘unsupervised learning’’ because there is no a priori knowledge of which observations belong in which cluster; the algorithm learns how to cluster the data (Kassambara 2017).
Since our focus on these two applications is in cluster analysis techniques using k-means and hierarchical clustering, the
following two subsections give an overview of these two clustering techniques.

K-Means Clustering
K-means clustering is popular due to its simplicity and efficiency. It is one of the most common unsupervised learning data
mining methods that is used in data clustering (grouping) and pattern recognition. K-means was first introduced by MacQueen
(1967). It partitions a given dataset (i.e., unlabeled) into a set of k clusters (groups), where k denotes the number of groups pre-
selected by the analyst. It uses centroids to form clusters by optimizing the within-clusters’ squared errors. Therefore, it
automatically groups a dataset into k partitions known as clusters. The K-means clustering algorithm divides m points in n
dimensions into k clusters so that the within-cluster sum of squares is minimized, meaning that each observation (a record in a
table) belongs to the cluster with the nearest mean. Therefore, objects within the same cluster are as similar to each other as
possible (i.e., intra-class similarity), whereas objects in different clusters are as dissimilar as possible (i.e., inter-class similarity)
from other clusters (Hartigan and Wong 1979; Jain 2010; Kassambara 2017).
In a k-means clustering algorithm, each cluster is formed around its center, or ‘‘centroid,’’ which represents the mean of the points assigned to the cluster. K-means proceeds by manually choosing the number of clusters (k initial clusters) and then iteratively applying the following steps (Wagstaff, Cardie, Rogers, and Schroedl 2001; Kassambara 2017):

I. Specify the number of clusters ‘‘k.’’
II. Randomly select k objects from the dataset as the initial cluster centers.
III. Scan through the list of m observations, then assign each observation to its nearest cluster center using the Euclidean distance.
IV. Update each cluster’s center to be the average of the newly assigned observations.
V. Repeat Steps III and IV iteratively until there are no more reassignments or the maximum number of iterations is reached.

The fundamental idea of k-means clustering is forming clusters of data so that the total intra-class variation is as low as possible. Many k-means algorithms exist; the standard one is the Hartigan and Wong (1979) algorithm, which uses the Euclidean distance (Kassambara 2017). The main advantages of the k-means algorithm are speed and simplicity; however, it has disadvantages, such as not yielding the same results on each run, since the initial cluster centers are selected randomly (Singh, Malik, and Sharma 2011).
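Steps I–V above can be sketched as a minimal pure-Python implementation (illustrative only, not the authors' code; the data and the fixed random seed are our own assumptions, used so the run is reproducible):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centers, labels)."""
    rng = random.Random(seed)
    # Step II: randomly select k objects from the dataset as initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step III: assign each observation to its nearest center (Euclidean).
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        # Step V: stop when no reassignments occur.
        if new_labels == labels:
            break
        labels = new_labels
        # Step IV: update each center to the mean of its assigned observations.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

# Two well-separated hypothetical groups of observations.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(data, k=2)
```

Because the initial centers are drawn randomly (Step II), different seeds can produce different cluster assignments, which is precisely the instability noted by Singh, Malik, and Sharma (2011).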

Hierarchical Clustering
Hierarchical clustering or hierarchical cluster analysis (HCA) is an unsupervised data mining technique that aims to build a
hierarchy of clusters (see: https://en.wikipedia.org/wiki/Hierarchical_clustering). Hierarchical clustering results in a tree called
a dendrogram. It is of great interest in a broad range of domains. Hierarchical trees provide views of the data at different


abstraction levels, therefore making it ideal for interactive visualization and exploration (Zhao, Karypis, and Fayyad 2005). In
hierarchical clustering, the data are not partitioned into clusters in one step; multiple partitioning steps may move from all objects in one cluster to k clusters, each containing one object. The hierarchical clustering technique is appropriate for
tables with a few hundred records. Also, one can choose the desired number of clusters after the tree is built by cutting it at a
certain height (Bobric, Cartina, and Grigoras 2009).
Hierarchical clustering is subdivided into two methods: agglomerative and divisive. In the agglomerative method (or bottom-up), each observation starts in its own cluster, and pairs of clusters are merged as the hierarchy moves up. The
divisive method (or top-down) is the opposite of agglomerative clustering, where all observations form one big cluster and then
split recursively as one moves down the hierarchy.
The merging in the agglomerative hierarchical clustering method uses heuristic criteria, e.g., sum of squares, single linkage (i.e., nearest neighbor), complete linkage (i.e., furthest neighbor), Ward’s method (i.e., minimizing the increase in within-cluster variance when two clusters are merged), or average linkage (i.e., average distances between objects). The splitting in
the divisive clustering methods uses the following heuristic. It starts with all objects as one cluster of size m. Then, at each
step, clusters are partitioned into pairs of clusters maximizing the distance between the two clusters. Finally, the algorithm
stops when objects are partitioned into m clusters of size 1. The hierarchical clustering resulting from either agglomerative or
divisive methods is usually presented in a two-dimensional diagram called a dendrogram (Everitt, Landau, Leese, and Stahl
2011).
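As an illustration of agglomerative merging, a minimal pure-Python sketch using single linkage (nearest neighbor) as the heuristic criterion, on hypothetical 2-D data:

```python
import math

def linkage_single(ca, cb):
    """Single linkage (nearest neighbor): distance between closest members."""
    return min(math.dist(a, b) for a in ca for b in cb)

def agglomerative(points, k):
    """Bottom-up merging: each observation starts as its own cluster, and the
    closest pair of clusters is merged until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of clusters under the linkage criterion.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_single(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical observations: two tight pairs and one isolated point.
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
result = agglomerative(data, k=3)
# The two points near the origin merge, the two near (5, 5) merge,
# and (9.0, 0.0) remains a singleton cluster.
```

Recording the order and height of each merge, rather than stopping at k clusters, is what yields the dendrogram; cutting that tree at a chosen height recovers any desired number of clusters.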

IV. DESIGN AND DEVELOPMENT OF AN ARTIFACT THAT MEETS THE OBJECTIVES

The Clustering Process Artifact


Since municipal and state-level data vary in type, a process model artifact is developed that could be generalizable for most
clustering applications. A process model artifact instantiates or codifies the steps taken when solving a research problem
(Hevner et al. 2004). By providing an artifact that clarifies the process of applying varied clustering algorithms, we hope to facilitate future research and applications of clustering in the municipal domain. A well-designed process artifact should
reduce the difference between the goal state (successful application of an appropriate clustering technique) and the current state
of the system (minimal application of clustering) (Hevner et al. 2004). The process should not only guide efforts to apply clustering to governmental data in a consistent way, but also facilitate similar undertakings by others. Hence, this
clustering process artifact facilitates the use of clustering and extends the knowledge base of previous research in the
accounting domain (Byrnes 2019). The process artifact is a result of iteratively summarizing the steps for various clustering
methods until a common process between these methods could be identified. These common steps are shown as the process
artifact in Figure 1, which flows in the following steps:
First Step—Not-for-Profit Data: We use not-for-profit and governmental data that can measure government performance; such data vary in type and in the information they provide. The data in the first application are the financial
statements for the U.S. states. For the second application, we use the Volcker Alliance (2018b) survey results. The survey
obtained five variables related to budgeting: budget forecasting, budget maneuvers, legacy costs, reserve funds, and budget
transparency.
Second Step—Data Pre-Processing: This step initially removes all empty fields. Data are standardized by calculating the mean and standard deviation of each variable: from each observation, we subtract the mean and divide by the standard deviation. In addition, in the first study, we use the per capita average to improve results. These first two steps are common to both clustering
approaches.
Third Step—Pattern Recognition: The step of pattern recognition is a pivotal step in this process artifact. During this step,
it is determined whether the data lend themselves to clustering, based on assessment using techniques such as correlation or
principal component analysis (PCA).
Fourth Step—Appropriate Clustering Method: In this step, we determine that two clustering techniques, k-means and
hierarchical clustering, are appropriate for comparison purposes of the two methodologies.
Fifth Step—Assessment of the Results: In this step, we assess the results by observing the different groups of the states in
each cluster. In addition, we apply an assessment methodology to calculate the stability of each cluster.
Although this artifact might appear similar to any standard data analytics process, the inclusion of the pattern
recognition step is essential for a successful application of clustering. The type and orientation of the dataset(s) are important to observe, both to indicate whether clustering is indeed called for and to determine the best clustering method(s).
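As a sketch of how the pattern recognition step might screen variables, pairwise correlations can flag whether variables carry related structure worth clustering (the variables and values below are hypothetical, not the paper's data, and `pearson` is a helper defined here, not a library call):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Hypothetical per capita variables for five states.
revenues = [4.1, 5.0, 3.2, 6.1, 4.8]
expenses = [4.0, 5.2, 3.0, 6.3, 4.7]
pension  = [1.1, 0.4, 2.0, 0.6, 1.2]

# Strong correlations indicate shared structure among variables; weak ones
# suggest a variable contributes an independent dimension to the clustering.
r_rev_exp = pearson(revenues, expenses)   # strongly positive in this toy data
r_rev_pen = pearson(revenues, pension)    # negative in this toy data
```

Principal component analysis serves the same screening purpose more formally, by summarizing how much structure a few components capture.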


FIGURE 1
Process Artifact for the Application of Cluster Analysis

Figure 1 shows our proposed clustering process artifact.

V. DEMONSTRATION OF THE SOLUTION (THE TWO APPLICATIONS)

Demonstration of Clustering: U.S. States Financial Statements


Introduction
Cluster analysis is used in this study as a supporting method to gain more insight into not-for-profit and government data,
specifically the U.S. states’ financial statements. The main objective of using cluster analysis is to group and rank the states.
Unlike linear discriminant analysis (LDA), cluster analysis does not require prior knowledge of the data (Roy, Kar, and Das 2015).
Two different clustering techniques are applied: k-means and hierarchical clustering. The statistical open-source software R is
used in the analysis. Both techniques share the first two steps, data selection and data pre-processing. Finally, we compare the
results of the two sets of clusters and how they differ from each other.
Step One—Government Data: In this study, we use the states’ financial statements for clustering. The dataset contains nine
numeric variables suggested by government data experts.5
1. Per capita total general fund revenues.
2. Per capita excess (deficiency) of revenues over expenditures.
3. Per capita total operating expenses.
4. Per capita education expenses.
5. Per capita net change in fund balance.
6. Per capita general fund total other financing sources.
7. Per capita general fund transfers to other funds.
8. Per capita general fund transfers from other funds.
9. Per capita pension expense.

5
Numbers were normalized to the per capita level for the years 2000–2016.

FIGURE 2
A Two-Dimensional Biplot of the Variables

This biplot shows how the original variables behave relative to our principal components (only in the directions of the first two principal components).
Step Two—Data Pre-Processing: The datasets are standardized, and the variables are expressed on a per capita basis. The goal of
standardizing the data is to make the variables comparable. This is accomplished by rescaling each variable to have a standard
deviation of 1 and a mean of 0.
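The standardization described here can be sketched in a few lines. The paper's analysis was performed in R; the following is an illustrative Python equivalent, with made-up per capita revenue figures:

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score standardization: subtract the mean and divide by the
    standard deviation, so the result has mean 0 and s.d. 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population s.d.; R's scale() uses the sample s.d.
    return [(v - mu) / sigma for v in values]

# Hypothetical per capita general fund revenues for three states
revenues = [2100.0, 3400.0, 2900.0]
z = standardize(revenues)
```

After this step, variables measured on very different dollar scales become directly comparable, which is what distance-based methods such as k-means require.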
Step Three—Pattern Recognition: Figure 2 shows a two-dimensional biplot. This biplot shows how the original variables
behave relative to the principal components (only in the directions of the first two principal components). The pattern
recognition step helps us understand the data and see which variables contribute the most to clustering. Principal component
analysis (PCA) is a statistical technique that is widely used by statisticians and data scientists to find trends and patterns in
data (Smith 2002). PCA attempts to find a combination of attributes that leads to maximum separation of the data points.
Furthermore, the main reason for using cluster analysis is to be able to group the states and, thus, to perform further
investigation using external variables.

K-Means Clustering
Step Three—Cluster Analysis: In using the k-means clustering algorithm, we must first specify the number of clusters, k.
One of the fundamental problems when dealing with k-means is determining this number. Choosing the right k
value has a significant impact on the performance of the algorithm, and the question is which k should be selected when there is
no prior knowledge of the data (Bholowalia and Kumar 2014). There exist many methods to choose a ‘‘good’’ number of


clusters. Some widely used methods are the elbow, silhouette, and gap statistic methods (Kassambara 2017). For this framework, we
choose the elbow method, a direct and widely used approach.
The elbow method is a method for interpreting and validating the consistency within clusters. It computes the
within-cluster sum of squares and is designed to help find the appropriate number of clusters in a dataset. It is
one of the oldest methods used for determining a ‘‘true’’ number of clusters. It looks at the percentage of explained variance
as a function of the number of clusters k, and it rests on the idea that, beyond a certain point, adding clusters no longer
improves the modeling of the data. Therefore, the percentage of variance explained by the clusters is plotted against the total
number of clusters. The first clusters add the most information, or marginal gain, and at a certain point the gain drops
significantly. This significant drop means that adding another cluster would not yield incremental meaningful insights. If
one were to graph these findings, the optimal point resides in the curve of a bent arm, the elbow crease (Bholowalia and Kumar
2014; Kodinariya and Makwana 2013).
The true k, the ‘‘number of clusters,’’ is chosen at the elbow crease point, or the ‘‘elbow criterion.’’ The idea is to start with k = 1
and then keep increasing k by 1, calculating the clusters each time; at a certain value of k, the cost drops significantly
and then reaches a plateau as k increases further. This is the ‘‘right’’ k value to proceed with, because beyond it the
added clusters fall near the existing ones (Bholowalia and Kumar 2014). To
summarize, the elbow method’s algorithm proceeds as follows: it first initializes k = 1 and then measures the cost of the
optimal quality solution; if, at a certain point, the cost of the solution drops significantly, then that is the ‘‘right’’ k value.
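The procedure just described can be illustrated with the Python sketch below (not the authors' R code). It runs a minimal k-means for several values of k on hypothetical two-dimensional standardized scores and records the within-cluster sum of squares for each k:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """A minimal k-means: random initial centers drawn from the data,
    then alternating assignment and center-update steps."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep old center if empty)
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

def wss(centers, clusters):
    """Within-cluster sum of squares: squared distance of every point
    to its own cluster center, summed over all clusters."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cl in zip(centers, clusters) for p in cl)

# Two well-separated hypothetical groups of standardized scores
pts = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (5.1, 5.0)]
curve = []
for k in range(1, 5):
    centers, clusters = kmeans(pts, k)
    curve.append(wss(centers, clusters))
# The WSS should drop sharply from k = 1 to k = 2 and then flatten,
# so the elbow here suggests k = 2.
```

Plotting `curve` against k reproduces the scree plot described above: the bend where the marginal gain collapses is the elbow crease.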
Step Four—K-Means Clustering: Figure 3 shows a scree plot of the elbow method, where the x-axis represents the number
of clusters and the y-axis shows the sum of squared distances of the clusters’ means to the global mean (Bholowalia and
Kumar 2014). The plot indicates that six clusters would be a good fit.
Figure 4 shows a two-dimensional representation of the clusters for the average per capita scores using four clusters. We
use the first two principal components (PCs), which account for 61.26 percent of the points’ variability. The first principal
component explains 35.93 percent of the overall variability; the second principal component accounts for 25.32 percent. As
shown in Figure 4, the states in each cluster are more similar to the ones in their group than they are to the
others. Table 1 shows the states that belong to each cluster.

Hierarchical Clustering
Step Three—Cluster Analysis: Here, the agglomerative hierarchical clustering method is used, in particular Ward.D with
Euclidean distance as the measure of similarity. Ward.D is a method that finds a minimum-variance solution by minimizing the
total within-cluster variance (Everitt et al. 2011).
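The paper uses R's Ward.D option, which works through the Lance–Williams recurrence on a dissimilarity matrix. The hypothetical Python sketch below instead applies the equivalent centroid form of Ward's minimum-variance merge criterion directly to raw points, just to illustrate the agglomerative idea; it is not the authors' implementation:

```python
def centroid(cluster):
    """Component-wise mean of a cluster of points."""
    return tuple(sum(c) / len(c) for c in zip(*cluster))

def ward_cost(a, b):
    """Ward's criterion: the increase in total within-cluster variance
    caused by merging clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def agglomerate(points, n_clusters):
    """Agglomerative clustering: start from singletons and repeatedly
    merge the pair of clusters with the smallest Ward cost."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical standardized scores for six "states"
pts = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (5.1, 5.0)]
groups = agglomerate(pts, 2)
```

Recording the order and cost of the merges instead of stopping at a fixed count is what produces the dendrogram shown in Figure 5; cutting that dendrogram at a height corresponds to choosing `n_clusters` here.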
Step Four—Hierarchical Clustering: The dendrogram in Figure 5 shows the clusters’ hierarchy. Furthermore, Figure 6
shows four clusters, and the light grey lines separate the four clusters. In addition, Table 2 shows the states that belong to a
specific cluster.

Discussion
Step Five—Assessment of the Results: In this application, we experiment with the U.S. states’ financial statements for the
years 2000 through 2016. Visualization, such as the two-dimensional biplot of the variables, shows us how the original
variables behave relative to our first two principal components; exploring those variables at length could thus help us examine
how each feature might affect external variables, such as states’ credit ratings (e.g., Moody’s ratings), municipal bonds, or
net population change in an area. We then use an unsupervised learning approach (k-means and hierarchical clustering) on our
data to investigate further, gain more insight about the states, and compare, for instance, the states in the same group (i.e.,
cluster) with others.
The states that form a cluster are more similar to one another based on the examined variables. By comparing the two
unsupervised methodologies, k-means and hierarchical, the quality of the clusters can be confirmed. For instance, Alaska and
Louisiana are in the same cluster under both methods, confirming that the two states have similarities and suggesting that
additional investigation should be conducted to obtain further insights into them. Also, the third clusters in the
hierarchical and the k-means methods are almost the same, differing by only four states.
This demonstration explores the potential of applying data mining methodologies to gain more insight into governmental
data, especially the states’ financial statements. This paper is one of the few studies that use cluster analysis in the
governmental domain. These methods could bring more data insights to auditors, practitioners, and government officials.
Cluster analysis is used here mainly for ranking and grouping the states. Further analysis of each cluster’s results is required,
for example by comparing and relating states’ financial statements with external variables. Since the same methodology is used
in the two applications, additional insights are obtained in the ensuing application.


FIGURE 3
A Scree Plot That Shows the Number of Clusters and Within-Group Sum of Squares (Elbow Method)

The x-axis represents the number of clusters, and the y-axis shows the sum of squared distances of the clusters’ means to the global mean.

Cluster Analysis of the Volcker Alliance Survey Data


Introduction
This application utilizes the Volcker Alliance’s6 survey data results. The survey produces extensive information about how
the U.S. states score annually on budgeting using five measures: Budget Forecasting, Budget Maneuvers, Legacy
Costs, Reserve Funds, and Transparency. However, readers may have difficulty fully understanding such rich data just by
looking at the tables. This demonstration contributes to the literature by showing how data mining techniques can be put to good
use in the government domain.
First, it illustrates that examining the correlation coefficients among the five variables can reveal patterns
between pairs of variables. Also, Tableau7 is used for data visualization, where the map feature illustrates Moody’s8
different state ratings. Furthermore, a map view of the Volcker Alliance’s scores is presented. Next, cluster analysis techniques

6
The Volcker Alliance is a nonprofit, nonpartisan organization whose mission is to empower the public sector workforce to solve the challenges facing
the U.S. It partners with academic, public, and private organizations that conduct research on governments with the objectives to improve overall
governments’ performance and increase accountability and transparency at the federal, state, and local levels (see: https://www.volckeralliance.org).
7
Tableau is an interactive data visualization tool that helps people discover and understand their data better (see: https://www.tableau.com).
8
Moody’s provides credit ratings and research on areas such as debt instruments and securities as well as analytics in financial risk management (see:
https://www.moodys.com/).


FIGURE 4
Two-Dimensional Representation of K-Means Results

Figure 4 shows the average per capita scores using four clusters. We use the first two principal components (PCs), and they account for 61.26% of the
points variability.

are applied to the Volcker Alliance’s survey scores. In this way, states that are similar in their budgetary practices are identified,
and previously unknown relationships and patterns can be ascertained through non-judgmental cluster analysis.

About the Dataset


Step One—Data: In 2016, the Volcker Alliance began a series of studies exploring the budgetary and financial reporting
practices of all 50 states in the U.S.9 It proposes best practices for state-level policymakers. The goal of the published study is to
contribute to the Volcker Alliance’s main mission of improving the effectiveness of government administrations at
all levels and making processes such as state budgeting more transparent to various stakeholders, especially citizens. Following
the first published report on budgetary and financial reporting practices, two more reports were published in the
following year (Volcker Alliance 2018a, 2018b).
The Volcker Alliance’s (2018a, 2018b) reports contain grades ranging from A to D related to the states’ budgetary
practices for the fiscal years 2015 through 2017. Each state receives a grade in five main categories; a grade is
given based on the state’s commitment to best practices in several budgeting indicators. Each grade reflects the average of the
three years for the state’s budgeting area. Explanations of the five categories follow (Volcker Alliance 2017):
a. Budget forecasting: how states estimate expenditures and revenues for the upcoming fiscal year and for the long term;
b. Budget maneuvers: the degree to which states use one-time actions to cover recurring expenditures and achieve
budgetary balance;
c. Legacy costs: measures how well states fund commitments to cover public employee pension and other
postemployment benefits (OPEB), primarily healthcare;
d. Reserve funds: measures how well the states manage both the general fund reserves and the rainy-day funds;
e. Budget transparency: measures how well the states are disclosing budget information, including tax expenditures, debts,
and the estimated costs of deferred infrastructure maintenance.

9
See: https://www.volckeralliance.org/publications/truth-and-integrity-state-budgeting-preventing-next-fiscal-crisis


TABLE 1
U.S. States Distribution on Each Cluster
Cluster Members
#1 AK, LA
#2 AL, AZ, CA, CO, FL, ID, IL, IN, KS, KY, ME, MA, MI, MO, MT, NE, NV, NH, NJ, NY,
NC, ND, OH, OR, SC, SD, TN, TX, UT, VT, VA
#3 NM
#4 AR, CT, DE, IA, GA, HI, MD, MN, MS, OK, PA, RI, WA, WV, WI, WY

FIGURE 5
A Dendrogram Representation of the Hierarchy—The Clusters Hierarchy

TABLE 2
U.S. States Distribution on Each Cluster
Cluster Members
#1 HI, NM
#2 AK, LA
#3 MN, WA, TX, GA, WI, DE, RI, AR, CT, IA, OK, PA, MS, MD, WV
#4 ID, IN, MO, VT, CO, MA, ND, NY, WY, VA, KY, ME, MT, NE, FL, OR, CA, KS, OH,
TN, AZ, IL, AL, SD, NJ, NC, SC, MI, NH, NV, UT


FIGURE 6
A Dendrogram Representation of the Hierarchy

Figure 6 shows four clusters separated by the light grey lines.

The grades are derived from the surveys for each of the five critical categories for each state. The Alliance assigns a value of
0 or 1 to each indicator within the five categories. For instance, in the budget forecasting category, the measured
indicators are revenue growth projections, multiyear revenue forecasts, multiyear expenditure forecasts, and consensus revenue
forecasts. For budget maneuvers, indicators include deferring recurring expenditures, revenue and cost shifting, funding
recurring expenditures with debt, and using asset sales and up-front revenues. For legacy costs, indicators include public
employee OPEB funding and public employee pension funding. For reserve funds, indicators include a positive reserve or
general fund balance, a reserve funds disbursement policy, and reserves tied to revenue volatility. Finally, for budget
transparency, indicators include a consolidated budget website, provided debt tables, disclosed deferred infrastructure
replacement costs, and disclosed tax expenditures.
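To make the grading mechanics concrete, here is a small hypothetical Python example. The indicator names come from the budget forecasting category above, but the 0/1 values and the three-year pattern are invented, and the Alliance's actual letter-grade cutoffs are not reproduced here; the sketch only shows that a category score is the average of its binary indicators and that the published grade reflects the three-year average:

```python
def category_score(indicators):
    """Average the binary (0/1) indicator values for one category."""
    return sum(indicators) / len(indicators)

# Hypothetical budget forecasting indicators for one state, one per year:
# [revenue growth projections, multiyear revenue forecasts,
#  multiyear expenditure forecasts, consensus revenue forecasts]
years = {
    2015: [1, 1, 0, 1],
    2016: [1, 1, 1, 1],
    2017: [1, 0, 0, 1],
}
three_year_avg = sum(category_score(v) for v in years.values()) / len(years)
```

A letter grade would then be assigned from `three_year_avg` using the Alliance's own thresholds, which this sketch deliberately omits.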
The datasets used are publicly available and freely accessible through the Data Lab on the Volcker Alliance’s website.10 The
Alliance also provides online tools for data exploration with interactive visualizations.
Step Two—Data Pre-Processing: In the data preprocessing step, the data are scaled using the standard deviation for better
calculation of the results.
Step Three—Pattern Recognition: First, correlation coefficient analysis is used, which measures the strength of the association
or relationship between two variables (McGraw and Wong 1996). Correlation analysis is a data exploration step that helps find
patterns and thus forms a basis for defining models or frames for the analysis. This analysis could assist in understanding
and validating the survey design, as well as providing hints about future budget behaviors. Correlation analysis of the five
variables shows a moderate positive correlation between legacy costs and budget maneuvers of approximately 0.512, which
indicates that the scores in these two categories are interrelated, as shown in Figure 7 and Table 3. Clusters are highly affected
by these two variables, legacy costs and budget maneuvers. This correlation coefficient analysis and the pairwise scatterplot
matrix could assist in selecting appropriate variables for building future models.
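The correlation step can be reproduced in a few lines of Python. The scores below are invented stand-ins for two of the five categories; the figure of approximately 0.512 reported above comes from the actual survey data, not from this sketch:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical numeric scores for five states in two categories
legacy_costs = [1.0, 2.0, 2.0, 3.0, 0.0]
budget_maneuvers = [1.0, 3.0, 2.0, 3.0, 1.0]
r = pearson(legacy_costs, budget_maneuvers)
# r is positive for these invented scores, mirroring the moderate
# positive correlation reported between the two categories.
```

Computing `pearson` for every pair of the five variables yields the full correlation matrix shown in Table 3.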

10
The Volcker Alliance data can be found at: https://www.volckeralliance.org/truth-and-integrity-state-budgeting-data-lab


FIGURE 7
A Pairwise Scatter Plot Matrix for Our Five Variables—Assist in Selecting Appropriate Variables to Build Future
Models

Figure 8 shows a map view of the different grades of the Volcker Alliance’s scores (i.e., A, B, C, D, D) for the five
variables for the states. As one can see in Figure 8, the vertical axis represents the grades A to D, and the horizontal axis
represents the budget categories: budget forecasting, budget maneuvers, legacy costs, reserve funds, and transparency. This
map view of the scores allows readers to directly compare the scores of different categories across the states. The
following section presents further visualization of the clusters’ results.

K-Means Clustering
Step Four—K-Means Clustering: Second, we apply the k-means clustering concept of ‘‘birds of a feather flock together’’
(McPherson, Smith-Lovin, and Cook 2001). The k-means clustering algorithm divides m points in n dimensions into k clusters so
that the within-cluster sum of squares is minimized, which means that each observation (a record in a table) belongs to the cluster
with the nearest mean (Hartigan and Wong 1979; Jain 2010). We use the five variables mentioned earlier in the cluster analysis.
In addition, we take the average of the data over the three years 2015, 2016, and 2017. To
determine the number of clusters, the aforementioned elbow method is used. Examining Figure 9, using the elbow method,
seven clusters are found to be a good cut. The figure shows a plot where the x-axis represents the number of clusters, and the
y-axis shows the sum of squared distances of the clusters’ means to the global mean.

TABLE 3
Variables Correlation Coefficients—Measuring the Strength of Association or Relationship Between Two Variables
(A Higher Value Indicates a Higher Correlation Between the Two Variables)

Variables            Budget Forecasting   Budget Maneuvers   Legacy Costs   Reserve Funds   Transparency
Budget Forecasting   1.00000000           0.00791909         0.03613848     0.25110021      0.18377649
Budget Maneuvers     0.00791909           1.00000000         0.51272449     0.22466741      0.11578494
Legacy Costs         0.03613848           0.51272449         1.00000000     0.02784838      0.04485754
Reserve Funds        0.25110021           0.22466741         0.02784838     1.00000000      0.09371242
Transparency         0.18377649           0.11578494         0.04485754     0.09371242      1.00000000

FIGURE 8
A Map View of Volcker Alliance’s Scores for the States

Figure 8 shows the different grades of the Volcker Alliance’s scores (i.e., A, B, C, D, D) of the five variables for the states.
The next cluster plot, Figure 10, shows a two-dimensional representation of the seven clusters for the average scores using
the five variables: budget forecasting, budget maneuvers, legacy costs, reserve funds, and transparency. Before
plotting the clusters, principal component analysis (PCA) is used. By applying PCA to the dataset with five numerical
features, we obtain five principal components (PC 1–5). Each one explains a proportion of the total variation in the dataset.
That is, PC 1 explains 32 percent of the total variance, or around one-third of the information in the dataset. PC 2
explains 27 percent of the variance. So, by knowing PC 1 and PC 2, you can have a fairly accurate view of the variability in your
data, as PC 1 and PC 2 alone explain around 58 percent of the variance. Table 4 shows the importance, or variability, of each of
the five components. As shown in Figure 10, the states in the same cluster are more similar to the ones in their group than they
are to the states in other clusters. Table 5 shows the states that belong to each cluster.
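The proportions in Table 4 follow directly from the reported component standard deviations: with five standardized input variables, the total variance is 5, and each component's share is its variance (the squared standard deviation) divided by that total. A quick Python check using the Table 4 values:

```python
# Component standard deviations as reported in Table 4
sds = [1.2551352, 1.1599979, 0.9719874, 0.8466411, 0.64612677]
variances = [s ** 2 for s in sds]
total = sum(variances)  # should be close to 5, the number of standardized variables
proportions = [v / total for v in variances]  # "Proportion of variance" row
cumulative = []                               # "Cumulative proportion" row
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)
```

Running this recovers the Table 4 rows: the first proportion comes out to about 0.315 and the first two components together to about 0.584, matching the published figures.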

Hierarchical Clustering
Step Four—Hierarchical Clustering: After clustering the data using k-means, the results are also illustrated using hierarchical
clustering. As in the first demonstration, the Ward.D method with Euclidean distance is used as the measure of similarity for
building the hierarchy. Figure 11 shows a dendrogram representation of the hierarchical tree.
In Figure 12, the hierarchical tree is cut at a certain height, yielding seven clusters. The height of the cutting
point controls the number of clusters obtained, similar to the k in k-means clustering. The cutree function in R is used to
cut the hierarchical tree. The resulting clusters are separated with a borderline around each cluster. As shown in Figure 12, the
states are clustered as presented in Table 6.
Table 7 compares the two sets of clustering results (k-means and hierarchical). The states that populate each cluster of the
hierarchical method are moderately different from the clusters obtained from k-means, except for New Jersey, Illinois, and
Kansas, which are identical in both methods. Their similarities affirm that the clusters of both methods are well distributed. It
may be a surprising result that Kansas belongs with New Jersey and Illinois. After all, there is much press about the budgetary
woes of New Jersey and Illinois, not to mention the findings of the Volcker survey, but little publicity in general about the state
of Kansas. However, upon closer examination of the survey results, we find that Kansas scores poorly in the areas of legacy costs
and budget maneuvers, which are the two main contributing factors to the clusters. In fact, Kansas scores poorly on four out of
five of Volcker’s variables. These survey results (Figure 13) are highlighted immediately by cluster analysis (Volcker
Alliance 2017).

FIGURE 9
Within-Clusters Sum of Squares (Elbow Method)

Figure 9 shows a plot where the x-axis represents the number of clusters, and the y-axis shows the sum of squared distances of the clusters’ means to the
global mean.

Assessment of the Results


Step Five—Assessment of the Results: To evaluate the results, we apply a clustering assessment methodology that uses
bootstrap resampling to assess how stable a given cluster is. This methodology is important when using clustering
algorithms such as k-means and hierarchical clustering, especially the latter, where, most of the time, an a priori method must
be used to choose the optimal number of clusters (Hennig 2007, 2010).
These algorithms mostly produce clusters that represent the actual structure and relationships in the data, along with a few
that are miscellaneous. One common way to check whether a cluster represents the true structure of the data is to see how well it
holds up under plausible variations of the dataset. We use the clusterboot() function available in the ‘‘fpc’’11 package of the
statistical software R. The clusterboot() function uses bootstrap resampling to evaluate the stability of each cluster (Hennig
2007). The function can perform clustering and also evaluate the resulting clusters, and it can run many clustering
algorithms, including k-means and hierarchical clustering (Zumel and Mount 2014; Zumel 2015).

11
fpc is a package in R statistical software that has the clusterboot method, among others, that runs the cluster-wise stability assessment for clustering
(Hennig 2007, 2010).
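The core idea behind clusterboot(), resample the data, recluster, and score each original cluster by its best Jaccard match, can be sketched in Python. In this toy version a simple one-dimensional gap-based clusterer stands in for k-means or Ward, and all data values are invented:

```python
import random

def jaccard(a, b):
    """Jaccard similarity between two sets of members."""
    return len(a & b) / len(a | b)

def cluster_1d(points, gap=1.0):
    """Toy clusterer: sort distinct 1-D values and split wherever the
    gap between neighbors exceeds a threshold (a stand-in for k-means
    or hierarchical clustering)."""
    pts = sorted(set(points))
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] > gap:
            clusters.append(set(current))
            current = []
        current.append(p)
    clusters.append(set(current))
    return clusters

def stability(points, n_boot=20, seed=1):
    """clusterboot-style assessment: bootstrap-resample the data,
    recluster, and record each original cluster's best Jaccard match.
    Means near 1 indicate stable clusters; values at or below 0.5
    indicate clusters that tend to dissolve (Hennig 2007)."""
    rnd = random.Random(seed)
    original = cluster_1d(points)
    sums = [0.0] * len(original)
    for _ in range(n_boot):
        sample = [rnd.choice(points) for _ in points]  # resample with replacement
        reclustered = cluster_1d(sample)
        for i, orig in enumerate(original):
            kept = orig & set(sample)  # compare only on resampled members
            sums[i] += max((jaccard(kept, c) for c in reclustered), default=0.0)
    return [s / n_boot for s in sums]

scores = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2]  # two well-separated toy groups
means = stability(scores)
```

Because the two toy groups are well separated, both cluster-wise means come out high, which is the pattern Tables 8 and 9 report for the stable clusters.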


FIGURE 10
A Visualization of k-means Clusters—Seven Clusters

TABLE 4
The Principal Components Distribution—Shows the Importance or Variability of Each of the Five Components

Importance of Components   PC 1        PC 2        PC 3        PC 4        PC 5
Standard deviation         1.2551352   1.1599979   0.9719874   0.8466411   0.64612677
Proportion of variance     0.3150729   0.2691190   0.1889519   0.1433602   0.08349596
Cumulative proportion      0.3150729   0.5841919   0.7731438   0.9165040   1.00000000

TABLE 5
State Distributions Within Each Cluster
Cluster Members
Cluster 1 ID, SD, NE, IA, UT, OR, WI, OK, MS, NV, NC, MT
Cluster 2 NJ, IL, KS
Cluster 3 TX, VT, GA, MO, ND, OH, NH
Cluster 4 TN, MN, DE, CA, HI, SC, IN
Cluster 5 AK, WA, AZ, FL, ME, WV, MI, RI
Cluster 6 CT, NY, PA, MA, VA, MD, LA, KY, CO
Cluster 7 NM, AL, AR, WY

Journal of Information Systems


Volume 35, Number 3, 2021
216 Alzamil, Appelbaum, Glasgall, and Vasarhelyi

FIGURE 11
A Visualization of the Hierarchical Clustering Dendrogram—The Hierarchical Tree

Running the cluster-wise stability assessment algorithm by resampling, using the clusterboot() function on our
dataset for the hierarchical clustering method, yields the results in Table 8. Table 8 shows the cluster-wise means for each
cluster, that is, the average values over the 20 bootstrap iterations. Values close to 1 indicate stable clusters. Any mean value
less than or equal to 0.50 indicates a dissolved cluster. Mean values between 0.6 and 0.75 are considered to indicate clusters that
clearly reflect patterns in the data, while mean values of 0.85 or above indicate highly stable clusters (Hennig 2007).
Table 8 also shows the number of times each cluster was dissolved. The second cluster dissolved only five times and has the
highest mean value, 0.69833, indicating the most stable cluster; it is followed by the fifth cluster
with a mean value of 0.68573, and then the first cluster with a value of 0.63043. The fourth cluster is the lowest, with a
mean value of 0.50435, meaning it cannot be trusted; this low value indicates that it dissolved the most among the
clusters.
Next, running the cluster-wise stability assessment using the clusterboot() function for our k-means clustering
method obtains the results in Table 9 (note that we use 20 resampling runs for each cluster in both the hierarchical and
the k-means clustering methods). Cluster number 5 in Table 9 has a high mean value of 0.86000, which indicates high stability;
this cluster was dissolved only once.
Looking at the cluster-wise stability assessment results of the two methods, k-means and hierarchical, overall,
hierarchical clustering performs better than k-means, with higher mean values and fewer dissolved clusters. However, cluster
stability is not the only important validity criterion, since some clustering methods can produce stable, but not valid,
clusters (Hennig 2007).

VI. EVALUATION OF THE FINAL RESULTS AND THE METHODOLOGY


These two applications of both k-means and hierarchical clustering to two different not-for-profit datasets provide a
valuable illustration of the generalizability of the design artifact. In both applications, each clustering approach performs
differently, so it is critical to assess the viability of both approaches for the analysis of any dataset. Just because k-means
performs ‘‘better’’ in the first application does not mean that it will offer the same performance on the second dataset,
which demonstrates the value of the two-pronged approach offered by this framework artifact. With this artifact, shared and
divergent steps are clarified for the purpose of testing which clustering approach works best for a particular dataset.

FIGURE 12
A Visualization of the Hierarchical Clustering Dendrogram

We cut the hierarchical tree at a certain height where seven clusters are obtained (the height of the dendrogram cutting point controls the number of clusters
obtained).
Hopefully, this artifact and these two demonstrations will encourage the use of cluster analysis by accountants and auditors
for other datasets, such as https://www.data.gov/. Clustering allows the auditor to discover hidden patterns and trends in data
for proactive and better decisions. Finding new dimensions and patterns could assist the auditor in finding misstatements and
suspicious transactions that are subtle in their presentation. Clustering can add another insight or dimension to audit analysis, so
an efficient and effective approach for applying clustering techniques, as outlined in this artifact, should go far toward
promoting the use of clustering in the accounting and auditing professions.

TABLE 6
State Distributions Within Each Cluster
Cluster Members
Cluster 1 NJ, IL, KS
Cluster 2 AK, FL, RI, ME, WV, AZ, MI
Cluster 3 KY, MD, WA, CT, NY, VA, CO, LA, MA, PA
Cluster 4 HI, SC, NM, WY
Cluster 5 OK, IA, MS, IN, UT, MO, ND, AL, AR
Cluster 6 DE, GA, TN, CA, MN, TX, NH, VT
Cluster 7 NE, OR, SD, ID, WI, MT, OH, NC, NV


TABLE 7
K-Means and Hierarchical Clusters Results—Comparison of the Two Clusters’ Results
(K-Means and Hierarchical Clusters)
Cluster K-Means Hierarchical Cluster
Cluster 1 ID, SD, NE, IA, UT, OR, WI, OK, MS, NV, NC, MT NE, OR, SD, ID, WI, MT, OH, NC, NV Cluster 7
OK, IA, MS, IN, UT, MO, ND, AL, AR Cluster 5
Cluster 2 NJ, IL, KS NJ, IL, KS Cluster 1
Cluster 3 TX, VT, GA, MO, ND, OH, NH DE, GA, TN, CA, MN, TX, NH, VT Cluster 6
Cluster 4 TN, MN, DE, CA, HI, SC, IN DE, GA, TN, CA, MN, TX, NH, VT
Cluster 5 AK, WA, AZ, FL, ME, WV, MI, RI AK, FL, RI, ME, WV, AZ, MI Cluster 2
Cluster 6 CT, NY, PA, MA, VA, MD, LA, KY, CO KY, MD, WA, CT, NY, VA, CO, LA, MA, PA Cluster 3
Cluster 7 NM, AL, AR, WY HI, SC, NM, WY Cluster 4

Not-for-profit and governmental data are appropriate for these demonstrations as these datasets are publicly available.
Interested users may replicate these two studies or attempt their own analyses of different government data. For example, a
comparison of Moody’s rankings of the states with that of Volcker’s rankings may be of interest to investors and analysts. Such
datasets could be compared using this clustering design artifact. There are many other potential use cases for clustering within
the numerous ‘‘open data’’ municipal websites, where one may see various checkbook accounts and municipal budgetary data.

VII. CONCLUSIONS AND FUTURE RESEARCH


This study explores the literature for the use of emerging data mining techniques, particularly clustering, in the audit of not-
for-profit entities. However, this application should not be confined to not-for-profit data and their auditors. Public watchdogs

FIGURE 13
Budget Forecasting, Budget Maneuvers, Legacy Costs, Reserve Funds for the Fiscal Years of 2015 through 2017—The
Lowest-Graded States (Volcker Alliance 2017)


TABLE 8
The Mean and the Number of Times a Cluster Is Dissolved (Hierarchical Clusters)—The Cluster-Wise Means
for Each Cluster
(The Average Values for the 20 Bootstrap Iterations)

Cluster (HC)   1         2         3         4         5         6         7
Mean value     0.63043   0.69833   0.62334   0.50435   0.68573   0.60433   0.56000
Dissolved      7         5         8         13        5         10        13

may also be interested stakeholders who can benefit from this study. Both the Volcker Alliance (see: https://www.
volckeralliance.org/) and Open the Books (see: https://www.openthebooks.com/) are public watchdogs of state and federal
spending and budgetary practices. Clustering, since it is exploratory, could be used as an attention-directing analytic technique
in the audit and/or assessment process. For example, clustering would be appropriate for a general risk assessment or an initial
review of a dataset during substantive procedures. Just as clustering revealed states that were similar regarding their budgetary
performance, auditors could likewise cluster unaudited financial statements of a firm with its audited financial statements from
previous years, or perhaps cluster the unaudited statements with audited statements of industry competitors. Clustering tells the
auditor analyst where to direct attention. After the auditor gains a high-level understanding of the data, then a more detailed
analysis can follow. Clustering may also be applied to transaction level data in substantive procedures to see patterns and
anomalies. Such data could be clustered against that of other years to obtain patterns and detect outliers. Generally, the auditor
may apply clustering wherever general data knowledge is desired. We apply two different clustering techniques: k-means and
hierarchical clustering. We find that different clustering techniques can facilitate deeper insights into varied datasets. The
process artifact facilitates the use of multiple clustering techniques to determine which approach works best in a given context.
On the first application, we use the U.S. states’ financial statements data. The second application utilizes the Volcker
Alliance’s survey data results. The survey produces extensive information about how the different U.S. states score on an
annual basis on budgeting using five measures. On both applications, we show how data clustering could be used on not-for-
profit and governmental data and to help gain more information about financial statements and budgeting. The cluster results
also show that there are some similarities between the two methods—k-means and hierarchical—and this could give us an idea
about our data quality. We have also used an assessment methodology to evaluate the stability level of the clusters’ results.
Subsequently, we now have clear and unusual patterns and relationships to explore in greater depth.
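The stability figures in Tables 8 and 9 follow the bootstrap logic of Hennig (2007): each original cluster is matched, by Jaccard similarity, to its most similar cluster in every bootstrap clustering; the best-match values are averaged, and the cluster counts as ‘‘dissolved’’ whenever its best match falls below a threshold (0.5 is a common choice and is assumed here). A minimal sketch of that bookkeeping, using toy state memberships rather than the paper's actual bootstrap output:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of cluster members."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_stability(original, bootstrap_runs, dissolve_below=0.5):
    """For each original cluster, compute the mean best-match Jaccard across
    all bootstrap clusterings and the number of runs in which it dissolved."""
    means, dissolves = [], []
    for cluster in original:
        # Best-matching cluster in each bootstrap run.
        best = [max(jaccard(cluster, c) for c in run) for run in bootstrap_runs]
        means.append(sum(best) / len(best))
        dissolves.append(sum(1 for s in best if s < dissolve_below))
    return means, dissolves

# Toy example: two original clusters checked against two bootstrap clusterings.
original = [{"AK", "FL", "RI"}, {"NM", "WY"}]
runs = [[{"AK", "FL", "RI"}, {"NM", "AL"}],
        [{"AK", "FL"}, {"NM", "WY", "AL"}]]
means, dissolves = cluster_stability(original, runs)
```

A low mean Jaccard paired with a high dissolve count, as for clusters 4 and 7 in Table 8, flags groupings that should not be over-interpreted.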
The contribution of the study is twofold. First, the two applications presented in the paper bring advanced data mining
techniques into the not-for-profit domain, especially using financial statements and budgetary data. Second, the clustering
results suggest opportunities for auditors and practitioners to gain more insights into their data by applying these methods.
There are, however, some limitations. First, the results from the first application should perhaps be examined against
additional external variables. Second, more data are needed for the second study, as only three years of data were available;
the Volcker Alliance dataset is expected to grow each year as new survey data are added.
Future research could apply and compare additional clustering techniques, such as partitioning around medoids (PAM),
within this artifact. Additional incremental research could evaluate the results of cluster analysis using external variables
such as net population change and gross domestic product (GDP) growth. In addition, the states that are grouped together in
the same cluster should be further analyzed, particularly those with unexpected similarities. As mentioned earlier, future
research could also compare these findings to the ratings of Moody's or other similar entities. In sum, this clustering design
artifact applied to not-for-profit and governmental data provides a clear and direct method with which an auditor may apply
clustering to varied datasets.

TABLE 9
The Mean and the Number of Times a Cluster Is Dissolved
(K-Means Clusters)

Cluster (K)     1        2        3        4        5        6        7
Mean value      0.52234  0.57123  0.58506  0.54961  0.86000  0.66667  0.57313
Dissolve        9        9        10       11       1        8        10
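For readers who want to contrast the two techniques in the artifact without statistical software, a naive agglomerative pass can be sketched in a few lines. Single linkage and the toy one-dimensional scores below are simplifying assumptions for brevity; the paper's hierarchical runs may use a different linkage and distance:

```python
def agglomerate(points, target_k, dist):
    """Naive single-linkage agglomerative clustering: start from singleton
    clusters and repeatedly merge the two closest until target_k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

# Toy one-dimensional scores forming two well-separated groups.
groups = agglomerate([0.0, 1.0, 10.0, 11.0],
                     target_k=2, dist=lambda a, b: abs(a - b))
```

Unlike k-means, this bottom-up merging needs no initial centers, which is one reason the two methods can assign borderline observations to different clusters.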

REFERENCES
Alzamil, Z. S., and M. A. Vasarhelyi. 2019. A new model for effective and efficient open government data. International Journal of
Disclosure and Governance 16 (4): 174–187. https://doi.org/10.1057/s41310-019-00066-w
Appelbaum, D., and R. A. Nehmer. 2017. Using drones in internal and external audits: An exploratory framework. Journal of Emerging
Technologies in Accounting 14 (1): 99–113. https://doi.org/10.2308/jeta-51704
Appelbaum, D., and R. A. Nehmer. 2020. Auditing cloud-based blockchain accounting systems. Journal of Information Systems 34 (2):
5–21. https://doi.org/10.2308/isys-52660
Arisandi, D. 2016. The implementation of data analytics in the governmental and not-for-profit sector. Doctoral dissertation, Rutgers
Graduate School Newark.
Bholowalia, P., and A. Kumar. 2014. EBK-means: A clustering technique based on elbow method and k-means in WSN. International
Journal of Computer Applications 105 (9). Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.7337&rep=rep1&type=pdf
Bloch, D., and F. Fall. 2015. Government debt indicators: Understanding the data. Available at: https://www.oecd-ilibrary.org/docserver/5jrxv0ftbff2-en.pdf?expires=1630434565&id=id&accname=guest&checksum=416B47C00E9E21516DCD4F844EEE170E
Bobric, E. C., G. Cartina, and G. Grigoras. 2009. Clustering techniques in load profile analysis for distribution stations. Advances in
Electrical and Computer Engineering 9 (1): 63–66. https://doi.org/10.4316/aece.2009.01011
Bolton, R. J., and D. J. Hand. 2001. Unsupervised profiling methods for fraud detection. In Proceedings of Credit Scoring and Credit
Control VII, 235–255.
Byrnes, P. E. 2019. Automated clustering for data analytics. Journal of Emerging Technologies in Accounting 16 (2): 43–58. https://doi.
org/10.2308/jeta-52474
Cesário, G. 2020. The sousveillance of politically connected entities and individuals in the public sector: An armchair audit approach of
related party transactions. Doctoral dissertation. Available at: http://bibliotecadigital.fgv.br/dspace/bitstream/handle/10438/29659/cesario.pdf?sequence=3&isAllowed=y
Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns.
Proceedings of the National Academy of Sciences of the United States of America 95 (25): 14863–14868. https://doi.org/10.1073/
pnas.95.25.14863
Elkan, C. 2001. Magical thinking in data mining: Lessons from CoIL Challenge 2000. In Proceedings of the Seventh ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 426–431.
Everitt, B. S., S. Landau, M. Leese, and D. Stahl. 2011. Hierarchical clustering. In Cluster Analysis, 5th edition, 71–110. Chichester, U.K.: John Wiley & Sons.
Fua, Y. H., M. O. Ward, and E. A. Rundensteiner. 1999. Hierarchical parallel coordinates for exploration of large datasets. In Proceedings
Visualization ’99 (Cat. No. 99CB37067), 43–508.
Governmental Accounting Standards Board (GASB). 2010. The User's Perspective. Available at: https://gasb.org/cs/ContentServer?c=GASBContent_C&cid=1176158164302&d=&pagename=GASB/GASBContent_C/UsersArticlePage&pf=true
Geerts, G. L. 2011. A design science research methodology and its application to accounting information systems research. International
Journal of Accounting Information Systems 12 (2): 142–151. https://doi.org/10.1016/j.accinf.2011.02.004
Harpaz, R., H. Perez, H. S. Chase, R. Rabadan, G. Hripcsak, and C. Friedman. 2011. Biclustering of adverse drug events in the FDA’s
spontaneous reporting system. Clinical Pharmacology and Therapeutics 89 (2): 243–250. https://doi.org/10.1038/clpt.2010.285
Hartigan, J. A., and M. A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society,
Series C (Applied Statistics) 28 (1): 100–108.
Hegar, G. 2019. Guide to understanding annual comprehensive financial reports (ACFRS). Texas Comptroller of Public Accounts.
Available at: https://comptroller.texas.gov/transparency/budget/acfr-faq.php
Hennig, C. 2007. Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis 52 (1): 258–271. https://doi.
org/10.1016/j.csda.2006.11.025
Hennig, C. 2010. fpc: Flexible procedures for clustering. R package version 2.0-3. Available at: https://CRAN.R-project.org/package=fpc
Hevner, A. R., S. T. March, J. Park, and S. Ram. 2004. Design science in information systems research. Management Information Systems
Quarterly 28 (1): 75–105. https://doi.org/10.2307/25148625
Hilas, C. S., and P. A. Mastorocostas. 2008. An application of supervised and unsupervised learning approaches to telecommunications
fraud detection. Knowledge-Based Systems 21 (7): 721–726. https://doi.org/10.1016/j.knosys.2008.03.026
Hossain, M. S. 2012. Exploratory data analysis using clusters and stories. Doctoral dissertation, Virginia Polytechnic Institute and State
University.
Jain, A. K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8): 651–666. https://doi.org/10.1016/j.
patrec.2009.09.011


Kassambara, A. 2017. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning. 1st Edition. Available at: https://
xsliulab.github.io/Workshop/week10/r-cluster-book.pdf
Kaufman, L., and P. J. Rousseeuw. 2009. Finding Groups in Data: An Introduction to Cluster Analysis. Volume 344. Hoboken, NJ: John
Wiley & Sons.
Kim, K. J., and H. Ahn. 2008. A recommender system using GA k-means clustering in an online shopping market. Expert Systems with
Applications 34 (2): 1200–1209. https://doi.org/10.1016/j.eswa.2006.12.025
Kodinariya, T. M., and P. R. Makwana. 2013. Review on determining number of cluster in k-means clustering. International Journal 1
(6): 90–95.
Liu, Q., and M. A. Vasarhelyi. 2014. Big questions in AIS research: Measurement, information processing, data analysis, and reporting.
Journal of Information Systems 28 (1): 1–17. https://doi.org/10.2308/isys-10395
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability 1 (14): 281–297.
McGraw, K. O., and S. P. Wong. 1996. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1 (1):
30–46. https://doi.org/10.1037/1082-989X.1.1.30
McPherson, M., L. Smith-Lovin, and J. M. Cook. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27
(1): 415–444. https://doi.org/10.1146/annurev.soc.27.1.415
Nehmer, R. A., and R. P. Srivastava. 2016. Using belief functions in software agents to test the strength of application controls: A
conceptual framework. International Journal of Intelligent Information Technologies 12 (3): 1–19. https://doi.org/10.4018/IJIIT.
2016070101
Peffers, K., T. Tuunanen, M. A. Rothenberger, and S. Chatterjee. 2007. A design science research methodology for information systems
research. Journal of Management Information Systems 24 (3): 45–77. https://doi.org/10.2753/MIS0742-1222240302
Roy, K., S. Kar, and R. N. Das. 2015. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk
Assessment. Cambridge, MA: Academic Press.
Schneider, G. P., J. Dai, D. J. Janvrin, K. Ajayi, and R. L. Raschke. 2015. Infer, predict, and assure: Accounting opportunities in data
analytics. Accounting Horizons 29 (3): 719–742. https://doi.org/10.2308/acch-51140
Singh, K., D. Malik, and N. Sharma. 2011. Evolving limitations in k-means algorithm in data mining and their removal. International
Journal of Computational Engineering and Management 12 (1): 105–109.
Smith, L. I. 2002. A tutorial on principal components analysis. Available at: http://www.cs.otago.ac.nz/cosc453/student_tutorials/
principal_components.pdf
Srivastava, A., A. Kundu, S. Sural, and A. Majumdar. 2008. Credit card fraud detection using hidden Markov model. IEEE Transactions
on Dependable and Secure Computing 5 (1): 37–48. https://doi.org/10.1109/TDSC.2007.70228
Thiprungsri, S., and M. A. Vasarhelyi. 2011. Cluster analysis for anomaly detection in accounting data: An audit approach. International
Journal of Digital Accounting Research 11 (17).
Turban, E., R. Sharda, and D. Delen. 2011. Decision Support and Business Intelligence Systems. Upper Saddle River, NJ: Pearson
Education Inc.
Volcker Alliance. 2017. Truth and integrity in state budgeting: What is the reality? Available at: https://www.volckeralliance.org/sites/
default/files/TruthAndIntegrityInStateBudgetingWhatIsTheReality_0.pdf
Volcker Alliance. 2018a. Truth and integrity in state budgeting: Preventing the next fiscal crisis. Available at: https://www.
volckeralliance.org/publications/truth-and-integrity-state-budgeting-preventing-next-fiscal-crisis
Volcker Alliance. 2018b. Truth and integrity in state budgeting: What is the reality? Fifty state report cards. Available at: https://www.
volckeralliance.org/sites/default/files/TruthAndIntegrityInStateBudgetingWhatIsTheRealityFiftyStateReportCards.pdf
Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl. 2001. Constrained k-means clustering with background knowledge. In Proceedings
of the Eighteenth International Conference on Machine Learning, 577–584.
Williams, G. J., and Z. Huang. 1997. Mining the knowledge mine. In Proceedings of the 10th Australian Joint Conference on Artificial
Intelligence, 340–348. Berlin, Heidelberg: Springer.
Zhao, W., W. Zou, and J. J. Chen. 2014. Topic modeling for cluster analysis of large biological and medical datasets. BMC Bioinformatics
15: Article number: S11. https://doi.org/10.1186/1471-2105-15-S11-S11
Zhao, Y., G. Karypis, and U. Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge
Discovery 10 (2): 141–168. https://doi.org/10.1007/s10618-005-0361-3
Zumel, N. 2015. Bootstrap evaluation of clusters. Available at: https://www.r-bloggers.com/bootstrap-evaluation-of-clusters/
Zumel, N., and J. Mount. 2014. Practical Data Science with R, 101–104. Shelter Island, NY: Manning.
