Research Proposal

ANOMALY DETECTION CAPABILITY OF EXISTING MODELS OF CONTINUITY EQUATIONS IN CONTINUOUS ASSURANCE

E.J.F. VAN KEMPEN


ANR: 201386
Pre-master Accounting
Supervisor : Prof. Dr. W.F.J. Buijink

2014

Abstract
Continuous assurance is a methodology to provide assurance on financial data on a near
real-time basis. One of the fundamental elements of continuous assurance is continuous data
auditing in which the integrity of the data provided by the client is tested. Continuity
equations can be used to evidence assertions regarding data integrity. In order to do so, data
is tested by predicting subsequent values based on a fitting model. In total there are three
models: the simultaneous equations model, the vector autoregressive model and the restricted
vector autoregressive model. I propose to test these models and compare them on the aspect
of anomaly detection capability.

I. Introduction

Continuous assurance has been a subject of interest for auditors and financial professionals
for the last three decades. However, this field of research took off only after Vasarhelyi et al.
(2004) published a widely accepted conceptual framework for continuous assurance. In the
following years additional studies were performed in this field, but most of these studies were
focused on refining the theoretical framework and developing new and innovative analysis
methods. Comparison of existing analysis models was not yet in scope. This proposal focuses
on the comparison of the anomaly detection capability of existing models of continuity
equations.
Conventional audit procedures focus on time-consuming manual testing on a fixed number
of randomly selected supporting documents, like invoices or inventory counts. By
introducing more advanced audit procedures from the continuous assurance domain, like
continuity equations, substantive testing can in theory be performed more efficiently and
effectively. The level of assurance can improve, while time consumption is reduced at the
same time.
However, all these audit procedures from the continuous assurance domain are fairly new
and remain mostly untested in the real world. This research intends to investigate one of these
procedures, continuity equations, on a more detailed level. By using continuity equations,
business processes could be tested by detecting anomalies in one or more of the steps within
these processes. The audit procedures or manual testing can then be narrowed down to the
detected anomalies.
Efficient performance of anomaly detection could lead to a paradigm shift in the field of
auditing. Instead of sampling evidence randomly from the population, the level of assurance
can be improved by inspecting exceptions only: audit by exception.

II. Literature review and research question

Continuous assurance
The Canadian Institute of Chartered Accountants (1999) provides a definition of continuous
assurance: "Continuous auditing [or continuous assurance] is a methodology that enables
independent auditors to provide written assurance on a subject matter using a series of
auditors' reports issued simultaneously with, or a short period of time after, the occurrence of
events underlying the subject matter." The emphasis of continuous assurance is on reducing
the lag between preparing a report and subsequently providing assurance on the matters
reported.
In order to be able to provide assurance on a near real-time basis, the auditors have to rely
heavily on automated testing. Vasarhelyi et al. (2004; 2010) have defined three elements of
continuous assurance and continuous monitoring: Continuous Control Monitoring (CCM),
Continuous Data Auditing (CDA) and Continuous Risk Monitoring and Assessment (CRMA).
CCM can be compared to interim testing of procedures in the conventional audit framework
and CDA can be compared to final testing focusing more on data than procedures. These two
elements combined can be used to provide sufficient assurance. CRMA can be used as an
additional part of the control framework, but is not essential for providing assurance. CDA
verifies the integrity of the data flowing through the information system. The data provided
by the client is the basis for all testing procedures, so data assurance forms an essential part of
continuous assurance. Continuity equations can be used as a tool from the CDA sub-domain
to evidence management assertions focusing on data integrity.
Continuity equations
Continuity equations have been a fundamental part of classical physics since the eighteenth
century. These equations describe the transport of a quantity, while simultaneously ensuring
conservation of this quantity (like mass and/or energy). Accordingly, similar relations can be
defined for the transport of quantities within a system in the financial domain. The movement
of reported quantities, e.g. ordered kilograms or invoiced units, between steps in the key
business processes can be described with continuity equations.
The term continuity equations was coined by Vasarhelyi and Halper (1991), who modeled
the flow of billing data at AT&T. Although they proposed continuity equations more than
20 years ago, little research has been performed on their application in practice and on the
implementation of a sound continuity equations model.
In most businesses the flow of goods is the most important basis for revenue recognition.
As such, the flow of goods can be used to provide evidence for the completeness, timeliness
and accuracy of the reported revenue. If the continuity equations hold for a specific business
process, one can assert that there are no leakages from the transaction flow, i.e. the integrity
of the flow of goods can be asserted. Therefore, continuity equations provide a method to
evidence the integrity of the basis for revenue recognition, which makes them a valuable tool
in continuous assurance.
Continuity equations are based on historical data of quantities in the separate steps of
business processes. For example, the sales cycle can be modeled as three separate steps:
receiving the order from the customer, shipping goods to the customer and invoicing for the
ordered and shipped goods. The quantity of goods ordered today will show up in
the invoicing step a certain number of days later. The daily flow of goods between these steps
can therefore be described by a quantity and a lag between the steps. This research will
focus on the sales cycle consisting of the three previously defined process steps.
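The idea that a quantity entering one step reappears in the next step after a lag can be illustrated with a small sketch. It assumes a single known lag and uses made-up daily quantities; the function name `flow_residuals` and all numbers are hypothetical and are not taken from the data set.

```python
def flow_residuals(upstream, downstream, lag):
    """Difference between each downstream quantity and the upstream quantity `lag` days earlier."""
    return [downstream[t] - upstream[t - lag] for t in range(lag, len(downstream))]

# Synthetic daily quantities (illustrative only).
orders = [100, 120, 80, 90, 110]
shipments = [0, 100, 120, 80, 90]        # every order ships exactly one day later
leaky_shipments = [0, 100, 115, 80, 90]  # five units go missing on day 2

print(flow_residuals(orders, shipments, lag=1))        # [0, 0, 0, 0]
print(flow_residuals(orders, leaky_shipments, lag=1))  # [0, -5, 0, 0]
```

A residual of zero on every day means the flow is conserved under the assumed lag; a non-zero residual points at the day on which the continuity relation breaks.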
Previous research by Leitch and Chen (2003), Kogan et al. (2010) and Alles et al. (2005)
has resulted in three models of continuity equations: the simultaneous equations model
(SEM), vector autoregressive model (VAR) and the restricted vector autoregressive model
(RVAR).
Simultaneous Equations Model
Leitch and Chen (2003) proposed a first model of continuity equations in the field of
assurance: the Simultaneous Equations Model (SEM). When applied to the sales cycle this
model can be represented as Equation (1). Each step in the sales cycle is simultaneously
dependent on historic quantities from the previous step. These historic quantities are
represented with a lag in each step. This model simplifies the sales cycle by assuming that
there is only a single fixed lag between each step.

[Equation (1)]

The coefficients of this model are estimated by OLS linear regression, optimizing for the
overall fit of the model.
Leitch and Chen tested the application of SEM on monthly data of financial statements.
They found that SEM outperformed other more conventional models of analytical
procedures.
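For orientation, a system of the kind described for Equation (1) could be written as follows. This is a sketch and not necessarily the exact form of Equation (1): the variable names SO, GS and IS follow the data model used later in this proposal, while the coefficients \alpha and \beta, the fixed lags \delta_1 and \delta_2, and the error terms \varepsilon are notational assumptions.

```latex
\begin{aligned}
GS_t &= \alpha \, SO_{t-\delta_1} + \varepsilon_{1,t} \\
IS_t &= \beta \, GS_{t-\delta_2} + \varepsilon_{2,t}
\end{aligned}
```

Each equation links one step to exactly one lagged value of the preceding step, which is the single-fixed-lag simplification mentioned above.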
Basic Vector Autoregressive model
Alles et al. (2005) introduced another model: the basic Vector Autoregressive (VAR)
model. For the sales cycle this model can be represented as Equation (2). In this model the
quantities ordered, shipped and invoiced at time t are each written as a linear combination
of vectors containing daily aggregates of the quantities for the given dimension, weighted by
transition vectors for a multivariate linear model, over the number of time periods covered in
the model.

[Equation (2)]

Each of these sub-equations models a predictor for the reported quantities in a specific step
in the business process. As previously defined, the quantities are related to quantities in the
other process steps by a time delay (lag). For example, if orders are shipped in exactly one
day, without exception, and invoicing is performed simultaneously with shipping, the
resulting predictors can be defined as Equation (3).

[Equation (3)]

The VAR model is estimated by OLS linear regression, optimizing for the overall fit by
trying different lags for the process steps. Only the maximum expected lag is provided to the
algorithm, which then tries to find the best fitting model by iterating through all lag
possibilities up to the maximum expected lag. The exact lags do not have to be known prior
to modeling, as the best fitting lags are determined while modeling.
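The lag search can be illustrated with a deliberately simplified sketch: one upstream series, one downstream series, and a single lag chosen by the R-squared of a simple regression. This is not the multivariate VAR estimation itself, and using R-squared as the fit criterion is an assumption. The function names and the data are hypothetical.

```python
def r_squared(x, y):
    """R^2 of the simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def best_lag(upstream, downstream, max_lag):
    """Try every lag from 0 to max_lag and return the best one plus all scores."""
    scores = {}
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]  # upstream values that have a lagged partner
        y = downstream[lag:]                 # downstream values `lag` days later
        scores[lag] = r_squared(x, y)
    return max(scores, key=scores.get), scores

# Synthetic example: shipments equal orders delayed by two days.
orders = [10, 40, 20, 50, 30, 60, 25, 45]
shipments = [0, 0, 10, 40, 20, 50, 30, 60]
lag, scores = best_lag(orders, shipments, max_lag=3)
print(lag)  # 2
```

Only the maximum lag is supplied; the sketch recovers the two-day delay on its own. Note that each candidate lag is scored on a slightly different number of observations, which a full implementation would need to account for.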
It is not always trivial to determine lags prior to the modeling process, e.g. lags in the
purchasing cycle are highly dependent on the policies and processes at third parties.
Therefore, the VAR model can be a powerful tool for modeling continuity equations when
exact lags cannot easily be predefined.
Contrary to the SEM model, the VAR model does not assume that there is a single fixed
lag between steps. All lags up to a maximum are considered in the model. This can
result in an extensive estimated model. Therefore, most VAR models are represented
using matrix notation.
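In matrix notation, a VAR model with maximum lag p for the three sales-cycle quantities is commonly written as below. This is the general textbook form, offered as orientation rather than as the exact form of Equation (2); the intercept vector c, the 3x3 coefficient matrices A_i and the error vector \varepsilon_t are notational assumptions.

```latex
\begin{pmatrix} SO_t \\ GS_t \\ IS_t \end{pmatrix}
= c + \sum_{i=1}^{p} A_i
\begin{pmatrix} SO_{t-i} \\ GS_{t-i} \\ IS_{t-i} \end{pmatrix}
+ \varepsilon_t
```

With p lags, each of the three equations carries up to 3p lag coefficients, which is why the estimated model can become extensive.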
Restricted Vector Autoregressive model
Kogan et al. (2010) have shown in their studies that the VAR model is highly accurate.
More importantly, they showed that the Restricted VAR (RVAR) model is more accurate
still. With a MAPE (mean absolute percentage error) of 0.3374 on the test set it
outscored several other models, namely SEM and VAR type models. Only the Bayesian
VAR model performed better when taking only the MAPE into account, but it also resulted in
a larger standard deviation of the absolute percentage error. Therefore, the Bayesian VAR
model is not considered viable for auditing purposes. The RVAR model was found to be one
of the best models for continuity equations.
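The two accuracy measures mentioned above, the MAPE and the standard deviation of the absolute percentage error, can be computed as in the following sketch. It assumes the errors are expressed as fractions of the actual value and that observations with an actual value of zero are skipped; both choices are assumptions, and the numbers are made up.

```python
from statistics import mean, stdev

def absolute_percentage_errors(actual, predicted):
    """Absolute percentage error per observation, as a fraction of the actual value."""
    # Observations with an actual value of zero are skipped (assumption).
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]

def mape(actual, predicted):
    """Mean absolute percentage error."""
    return mean(absolute_percentage_errors(actual, predicted))

# Synthetic example values.
actual = [100, 200, 400]
predicted = [110, 180, 400]
errors = absolute_percentage_errors(actual, predicted)
print(mape(actual, predicted), stdev(errors))
```

A model with a low MAPE but a high standard deviation is accurate on average yet occasionally far off, which is the trade-off described for the Bayesian VAR model.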

The RVAR model translates roughly to optimizing the fit of the predictor by removing
insignificant coefficients from the VAR model. For example, if the mean lag between order
and shipping is less than a month, a shipment a year after ordering is obviously
not significant and is thus excluded from the model. This method iterates the modeling process
per equation by removing all coefficients with |t|-statistics below a predefined threshold, as
explained in Figure 1. Kogan et al. (2010) determine the threshold that yields the model with
the best prediction accuracy.

[Figure 1: flowchart. Inputs: Data, Threshold. Start -> Initial model estimation -> Exclude
parameters with t-statistic below threshold -> Re-estimate model -> All t-statistics above
threshold? No: return to the exclusion step. Yes: Final model.]

Figure 1. RVAR modeling process. The initial VAR model is restricted by excluding parameters with a
t-statistic below a predefined threshold. The model is re-estimated followed by the next exclusion iteration, until
all parameters satisfy the t-statistic requirement.

The RVAR model usually results in less extensive and more accurate estimated models due
to the restriction to significant terms only.
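The loop in Figure 1 can be sketched for a single equation as follows. This is a minimal illustration, not the procedure of Kogan et al. (2010) or of any R package. It assumes OLS without an intercept and a non-zero residual variance, and the threshold of 2.0 is arbitrary. The regressor names `SO_lag1` and `SO_lag5` and all data are hypothetical.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(columns, y):
    """OLS without intercept; returns coefficients and t-statistics keyed by regressor name."""
    names = list(columns)
    n, k = len(y), len(names)
    xtx = [[sum(a * b for a, b in zip(columns[p], columns[q])) for q in names] for p in names]
    xty = [sum(a * b for a, b in zip(columns[p], y)) for p in names]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(beta[i] * columns[names[i]][t] for i in range(k)) for t in range(n)]
    sigma2 = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted)) / (n - k)
    t_stats = [beta[i] / math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return dict(zip(names, beta)), dict(zip(names, t_stats))

def restricted_fit(columns, y, threshold):
    """Drop regressors with |t| below the threshold and re-estimate until none remain below it."""
    active = dict(columns)
    while active:
        coefficients, t_stats = ols(active, y)
        weak = [name for name in active if abs(t_stats[name]) < threshold]
        if not weak:
            return coefficients
        for name in weak:
            del active[name]
    return {}

# Synthetic, centered regressors: the first matters, the second barely does.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [1, -1, 1, -1, 1, -1, 1, -1]
y = [5 * a + 0.1 * b + 0.5 * e for a, b, e in zip(x1, x2, noise)]
final = restricted_fit({"SO_lag1": x1, "SO_lag5": x2}, y, threshold=2.0)
print(final)
```

In this example the second regressor is removed in the first pass and the model is re-estimated with the first regressor only, mirroring the "exclude, re-estimate, check" cycle of the flowchart.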
Research question
In total three different models of continuity equations are used in the field of continuous
assurance. Auditors rely on the accuracy and anomaly detection capability of these models to
provide assurance on the data. This leads to my research question:

Which of the existing models of continuity equations in continuous auditing has the best
anomaly detection capability?

III. Method

Data
The proposed base model for the sales cycle is based on three different quantities: the
ordered quantity, the quantity of goods shipped and the quantity invoiced. These three
variables can be provided by most ERP systems on a daily basis.
Data is provided by a Dutch wholesaler in technical supplies. This company uses an
off-the-shelf solution of Microsoft Dynamics AX 2009. The data was extracted from separately
generated reports containing transaction quantities for each of the process steps by merging
the columns by date, as presented in Figure 2.
[Figure 2: data model. The source tables SalesOrders, Shipments and Invoices each hold a Date
(PK) and a Quantity. They are combined into SalesData with Date (PK, FK1, FK2, FK3) and the
columns SO, GS and IS.]

Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered
quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a
SQL join clause. The date serves as the primary and foreign key of the data sources involved.
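The merge by date can be sketched as follows. The sketch assumes an inner join, so only dates present in all three reports are kept; the type of join is not specified above. The function name `merge_by_date` and the quantities are hypothetical.

```python
from datetime import date

def merge_by_date(sales_orders, shipments, invoices):
    """Inner join of three {date: quantity} mappings into rows with Date, SO, GS and IS."""
    common_dates = sorted(set(sales_orders) & set(shipments) & set(invoices))
    return [
        {"Date": d, "SO": sales_orders[d], "GS": shipments[d], "IS": invoices[d]}
        for d in common_dates
    ]

# Synthetic daily aggregates (illustrative only).
sales_orders = {date(2007, 2, 1): 500, date(2007, 2, 2): 650, date(2007, 2, 3): 120}
shipments = {date(2007, 2, 1): 480, date(2007, 2, 2): 510}
invoices = {date(2007, 2, 1): 480, date(2007, 2, 2): 505, date(2007, 2, 3): 0}

sales_data = merge_by_date(sales_orders, shipments, invoices)
print(len(sales_data))  # 2
```

With an inner join, a day that is missing from one report silently drops out of the merged data; an outer join with zero-filling would be the alternative if such days should be retained.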

The data reflects actual day-to-day transaction quantities of February 2007 up to November
2007, excluding Sundays and holidays during which the company was closed for business.
Saturdays are still included, because sometimes high priority orders are shipped on Saturdays.
The resulting data is exported as a CSV file to be imported by the model implementations
in R. The CSV file consists of four data fields, i.e. date, the quantities ordered, quantities
shipped and quantities invoiced. More detailed information about the data can be found in
Appendix A.
Panel A

Variable
Sales orders (SO)
Goods shipped (GS)
Invoices sent (IS)

n
264
264
264

Mean
Std.Dev. 25th Pct. Median 75th Pct.
66,845
60,676
38,384
62,548
83,122
62,068
46,099
42,295
63,326
40,865
60,211
47,237
78,393
60,745
81,303

Panel B

Pearson correlations
         SO       GS       IS
SO    1.000   0.600*   0.588*
GS            1.000    0.960*
IS                     1.000

*: values significant at the 1% level.


Table 1. A: sample characteristics of the data set consisting of 264 observations of actual day-to-day
transaction quantities in sales orders, goods shipped and invoices sent. B: Pearson correlations between the
quantity variables.

Table 1 and Figure 3 present descriptive statistics for the three quantity fields in the data
set. The Pearson correlations show that the GS and IS variables are strongly related. This is
fully in line with the notion that invoices are, most of the time, generated at the same time
as the goods are shipped. Furthermore, the charts clearly show less activity on Saturdays
compared to weekdays. On Saturdays only priority orders and over-the-counter sales are
handled.
The data is split into two separate parts. The first part will be used as a training set
to estimate the model parameters for all three models. The second part is used as a test set.
After estimation, the models will be tested by generating predictions for the test set.
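A split into a first and a second part can be sketched as below. The sketch assumes the split is chronological, which is natural for time series. The proportion of 0.75 is purely hypothetical and is not the proportion used in this research; the function name is also an assumption.

```python
def chronological_split(rows, train_fraction):
    """Split time-ordered rows into a leading training part and a trailing test part."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# 264 placeholder observations, matching the size of the data set.
observations = list(range(264))
training_set, test_set = chronological_split(observations, train_fraction=0.75)
print(len(training_set), len(test_set))  # 198 66
```

Keeping the order intact matters here: shuffling before splitting would let the models see observations that lie in the future of the test period.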

Figure 3. Plot of daily aggregates for three different stages in the sales cycle: ordered quantity (SO), quantity of
goods shipped to customer (GS) and quantity invoiced (IS) as provided in the data set.

Implementation of the models


The models will be implemented in R, a widely used language for statistical
processing and data analytics. A rudimentary implementation of these models is already
available in the form of R packages.
The SEM model is implemented in four stages: data collection, pre-processing, modeling
and prediction. The code is based on the systemfit package, which has been developed and
published by Arne Henningsen and Jeff D. Hamann and is available via CRAN
(Henningsen & Hamann, 2007).
The VAR and RVAR models are also implemented in four stages: data collection,
pre-processing, modeling and prediction. The code is centered around the vars package, which
has been developed and published by Bernhard Pfaff and Matthieu Stigler and is available via
CRAN (Pfaff & Im Taunus, 2007; Pfaff, 2008; Pfaff, 2008). The package includes several
functions for modeling VARs, testing the VARs and presenting the results.
The modeling implementation in R can be found in Appendix B.


Testing of the models
After the model parameters have been estimated on the training set, the resulting models
are tested. Anomaly detection capability is tested by counting false negatives (Type II
errors) in the model predictions based on a slightly modified test set. False positives
(Type I errors) are not in scope, because they do not lower the level of assurance.
The test set is altered by increasing the quantities in five randomly selected observations by
100%. These altered observations serve as injected anomalies in the test set. The test set,
including the seeded anomalies, is then processed by the model implementation and
anomalies are reported.
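A single seeding run can be sketched as follows. The rule for reporting an anomaly is not defined above, so the sketch assumes a hypothetical one: an observation is flagged when it deviates from the prediction by more than a relative tolerance. The predictions are also held fixed, whereas in a lagged model an altered value would influence later predictions as well. All names and numbers are illustrative.

```python
import random

def seed_anomalies(values, count, rng):
    """Return a copy with `count` randomly chosen values increased by 100%, plus their indices."""
    seeded = sorted(rng.sample(range(len(values)), count))
    altered = list(values)
    for i in seeded:
        altered[i] *= 2  # +100%
    return altered, seeded

def count_type_ii_errors(observed, predicted, seeded, tolerance):
    """Count seeded anomalies that are NOT flagged (false negatives)."""
    missed = 0
    for i in seeded:
        deviation = abs(observed[i] - predicted[i])
        # Hypothetical detection rule: flag when the deviation exceeds tolerance * prediction.
        if deviation <= tolerance * abs(predicted[i]):
            missed += 1
    return missed

rng = random.Random(2014)  # fixed seed so the run is reproducible
test_values = [100, 120, 80, 90, 110, 95, 105, 130, 70, 115]
predictions = [102, 118, 83, 88, 111, 97, 101, 128, 73, 112]

altered, seeded = seed_anomalies(test_values, count=5, rng=rng)
missed_strict = count_type_ii_errors(altered, predictions, seeded, tolerance=0.5)
missed_loose = count_type_ii_errors(altered, predictions, seeded, tolerance=1.5)
print(missed_strict, missed_loose)  # 0 5
```

The two tolerances show how strongly the Type II count depends on the detection rule: a 50% band catches every doubled value here, while a 150% band misses all five.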
To reduce the influence of any single random selection, the testing is repeated 1,000 times,
each time randomly selecting five observations in the original test set to be altered by
100%. The mean number of Type II errors found serves as the
test statistic for comparison purposes. These means are compared using a dependent t-test.
The test procedure, as implemented in R, can be found in Appendix C.
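The dependent (paired) t-test on the Type II error counts of two models can be sketched as below. It assumes both models are evaluated on the same seeded test sets, so that the counts pair up per repetition. The counts shown are made up and cover five repetitions instead of 1,000.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom for paired samples of equal length."""
    differences = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(differences)
    spread = stdev(differences)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(differences) / (spread / math.sqrt(n)), n - 1

# Synthetic Type II error counts per repetition for two hypothetical models.
errors_model_a = [2, 3, 1, 4, 2]
errors_model_b = [1, 1, 0, 2, 1]
t_value, degrees_of_freedom = paired_t_statistic(errors_model_a, errors_model_b)
print(round(t_value, 3), degrees_of_freedom)  # 5.715 4
```

With three models, the comparison would consist of pairwise tests, in which case a correction for multiple comparisons may be worth considering.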

IV. Expected results

After testing I expect the RVAR model to be the superior model in terms of
anomaly detection capability. The SEM model will probably underperform due to the
oversimplification of the sales cycle steps and the accompanying lag terms. I expect most
companies to have two or more lag terms associated with the largest part of the flow of
goods. The data provider for the proposed tests, for example, provides next-day delivery for
some items, which are shipped separately. The ordered quantity can thus be considered as two
or more flows with different lags. The SEM model would oversimplify this cycle.

In theory the RVAR model should also outperform the basic VAR model purely based on
statistical properties. In both the RVAR and VAR model multiple lag terms are considered and
included in the model. This should result in better performance than the SEM model. The
RVAR model can be considered an improved version of the basic VAR model due to the
exclusion of statistically insignificant terms. Even though the algorithm for estimating the
RVAR model on real data is simple and elegant, it could result in a suboptimal estimation.
Estimating anomaly detection performance and accuracy prior to running the estimation
algorithm is even more difficult.

V. Limitations

Type II errors only


The research focuses on Type II errors only, since only false negatives (failing to identify
an anomaly when one exists) influence the level of assurance. The level of assurance is the
most important factor in acceptance of the models used. If the models are considered
unreliable, auditors will not be able to use them. Therefore, actual errors must not pass the
test undiscovered.
However, Type I errors also influence the audit procedure. The detection of false positives
can lead to an increase in audit activities, since all detected anomalies have to be tested
manually. Even though Type I errors are not in scope, the models can only be accepted if the
number of false positives stays below a certain limit.
Data
The data used in this research is provided by a single entity and for a single year only.
Therefore, conclusions and results are only applicable to the data provider and cannot be
generalized. In order to generalize the results and conclusions, the proposed
methods need to be applied to data provided by multiple entities. Reliability will also
be improved by testing data from subsequent years. Furthermore, since the data is provided
by a single entity, selection bias may occur. In addition, the data set contains noise:
pre-existing anomalies might exist in the data set.


REFERENCES
Canadian Institute of Chartered Accountants (CICA). (1999). Continuous Auditing. Toronto, ON, Canada.
Alles, M., Kogan, A., Vasarhelyi, M., & Wu, J. (2005). Continuity Equations in Continuous
Auditing: Detecting Anomalies in Business Processes.
Dzeng, S. (1994). A Comparison of Analytical Procedures Expectation Models Using Both
Aggregate and Disaggregate Data. Auditing: A Journal of Practice & Theory,
13(Fall), 1-24.
Henningsen, A., & Hamann, J. D. (2007). systemfit: A Package for Estimating Systems of
Simultaneous Equations in R. Journal of Statistical Software, 23(4), 1-40.
Kogan, A., Alles, M. G., Vasarhelyi, M. A., & Wu, J. (2010). Analytical Procedures for
Continuous Data Level Auditing: Continuity Equations.
Leitch, R. A., & Chen, Y. (2003). The effectiveness of expectation models in recognizing
error patterns and generating and eliminating hypotheses while conducting analytical
procedures. Auditing: A Journal of Practice & Theory, 22(2), 147-170.
Pfaff, B. (2008). VAR, SVAR and SVEC models: Implementation within R package vars.
Journal of Statistical Software, 27(4), 1-32.
Pfaff, B. (2008). vars: VAR Modelling. R package version, 1-3.
Pfaff, B., & Im Taunus, K. (2007). Using the vars package.
Vasarhelyi, M. A., & Halper, F. B. (1991). The continuous audit of online systems. Auditing:
A Journal of Practice & Theory, 10(1), 110-125.
Vasarhelyi, M. A., Alles, M. G., & Kogan, A. (2004). Principles of analytic monitoring for
continuous assurance. Journal of Emerging Technologies in Accounting, 1(1), 1-21.
Vasarhelyi, M. A., Alles, M., & Williams, K. T. (2010). Continuous assurance for the now
economy. Institute of Chartered Accountants in Australia Sydney, Australia.


Appendix A. Data

The data is provided by a Dutch wholesaler in technical supplies and contains daily
aggregates of the three separate steps in the sales cycle.

[Figure 2: data model. The source tables SalesOrders, Shipments and Invoices each hold a Date
(PK) and a Quantity. They are combined into SalesData with Date (PK, FK1, FK2, FK3) and the
columns SO, GS and IS.]

Figure 2. Data model consisting of daily aggregates for three different stages in the sales cycle: ordered
quantity (SO), quantity of goods shipped to customer (GS) and quantity invoiced (IS) combined by date via a
SQL join clause. The date serves as the primary and foreign key of the data sources involved.

The data is imported by using the following R code:


Appendix B. Implementations of the models in R


Appendix C. Test algorithm
