
Journal of Quantitative Economics

https://doi.org/10.1007/s40953-018-0142-7

ORIGINAL ARTICLE

Imputation of Missing Values in the Fundamental Data: Using MICE Framework

Balasubramaniam Meghanadh1 · Lagesh Aravalath1 · Bhupesh Joshi1 · Raghunathan Sathiamoorthy1 · Manish Kumar1

© The Indian Econometric Society 2018

Abstract
Revolutionary developments in the field of big data analytics and machine learning
algorithms have transformed the business strategies of industries such as banking,
financial services, asset management, and e-commerce. One of the most common problems
these firms face while utilizing data is the presence of missing values in the dataset.
The objective of this study is to impute fundamental data that is missing in finan-
cial statements. The study uses ‘Multiple Imputation by Chained Equations’ (MICE)
framework by utilizing the interdependency among the variables that wholly comply
with accounting rules. The proposed framework has two stages: an initial imputation
based on predictive mean matching in the first stage, and the resolution of financial
constraints in the second. The MICE framework allows us to incorporate accounting
constraints in the imputation process. The performance tests conducted on the imputed
dataset indicate that the imputed values for the 177 line items are good and in line
with the expectations of subject matter experts.

The views and opinions expressed in this article are those of the authors and do not necessarily reflect the
views of Credit Rating Information Services of India Ltd (CRISIL).

Manish Kumar (corresponding author)
manish.kumar@crisil.com
Balasubramaniam Meghanadh
Balasubramaniam.M@crisil.com
Lagesh Aravalath
Lagesh.aravalath@crisil.com
Bhupesh Joshi
Bhupesh.Joshi1@crisil.com
Raghunathan Sathiamoorthy
Raghunathan.Sathiamoorthy@crisil.com

1 CRISIL GR&A, TVH-Beliciaa Towers, 3rd Floor, Tower-II, Block No.94, MRC Nagar,
Chennai 600 028, India


Keywords Multiple imputation · MICE · Fundamental data · Accounting and financial statement

JEL Classification C13 · C32 · C51 · C53 · G20

Introduction

Financial statement analysis plays an important role in investment decision making
for valuation and credit analysis. The investment community uses financial statements
to determine asset values, financing resources, profitability and risk embedded in the
company’s assets. Financial data is fundamental to asset managers and supports key
investment functions such as stock selection, relative valuation, financial modelling
and forecasting, portfolio construction and management, risk management, accounting
and pricing of securities and successful testing and simulation of investment strategies.
Hence, the importance of accurate financial statements cannot be overstated.
Asset managers, banks, hedge funds, and other financial institutions use a variety of
data at regular intervals and usually purchase data from multiple vendors to meet their
requirements. Given the diverse data sources (publicly available and third party data),
data must be tested for its usability. Cleaning, preparing, validating and transforming
raw data to make it usable for investment research process requires considerable time
and effort. To extract meaningful insights and value, the data obtained from vendors
should be clean, accurate, sufficient, complete, consistent, and timely.
Pagano et al. (2012) analyzed balance sheets in the banking sector and noticed
that some data values are either missing or incorrect. They suggested using a forward
search algorithm to identify incorrect observations. The forward search algorithm
progressively adds each data point into the dataset and searches for deviations in the
point estimates to identify anomalies. We can treat these anomalies as a missing data
problem based on the opinion of subject matter experts (SMEs), as incorrect data
cannot be used for any analysis. For example, suppose regulatory reporting standards
begin to require a geographical breakdown of sales in a particular year. Companies
must comply in the upcoming years but are not required to restate previous years.
Under this circumstance, the relevant line item is marked as missing for the previous
years and imputed using the MICE procedure.
Rubin (1987) notes that ignoring missing values in the analysis will introduce
bias if a systematic difference between respondents and non-respondents (observed and
missing data) exists in the dataset. Further, missing values reduce the size of the
data, and when they are ignored, estimates are biased because uncertainty is not taken
into consideration. Reviewing research papers published in the Journal of Banking and
Finance, Journal of Finance, Journal of Financial Economics, Journal of Financial and
Quantitative Analysis and Review of Financial Studies, Kofman and Sharpe (2000) note
that practitioners have largely ignored and under-reported the missing value problem:
only 175 of the 1,057 articles acknowledged the treatment of missing data. Of these
175 articles, 137 used the listwise deletion method, which removes observations
containing missing data. For fundamental data,


imputation would be a better option than the removal of missing values, as the pool of
companies analyzed shrinks when listwise deletion is performed.
If missing data is not handled effectively, it can have serious consequences on quan-
titative research leading to: (1) biased estimates of parameters, (2) loss of information,
(3) a decrease in statistical power, (4) an increase in standard errors, and (5) weak
generalizability of findings. Even though various imputation methods are available,
the suitability of the method chosen is based on the challenges related to the business
problem.
Fundamental data has several unique features and is different from traditional statis-
tical data which is typically based on a marketing survey or a field survey. The line items
in the balance sheet could have relationships with the cash-flow statement and the income
statement. Similarly, each line item has multiple constraints, and a solution
may not be feasible if we intend to solve all of them simultaneously. Hence the imputed values
should respect multiple constraints without being overly complex. For example, oper-
ating cash flow, investing cash flow and financing cash flow should be tallied with the
net-cash difference present in the balance sheets of the previous year and the current
year. Further, the cash flow line items should be derived from the balance sheet and
profit and loss statements in addition to the rules for the cash flow statements. Given
the complexity of accounting constraints, they should be implicitly handled within
the proposed solution and the construction of rule/accounting constraints is based on
the known relationship among line items of the financial statement. The imputation
process is considered robust and accurate for fundamental data, when each of the line
items adhere to their respective accounting constraints.
Further, we have observed in a few cases that the optimal solution to solve con-
straints is not achieved. For example, a line item such as unexpected gains/losses may neither
follow a specific distribution nor be predicted from its predecessor. The impu-
tation, in this case, is used only to adhere to accounting rules. Unless we have a unified
approach for the overall problem, these accounting rules could limit our ability to arrive
at an optimal solution for the line items.
Another complexity is that a few fundamental ratios need to be imputed directly;
that is, the algorithm should solve the given constraint when all the other values are
directly available. For instance, if a line item ‘a’ can be specified as the difference of
two other items ‘b’ and ‘c’, and the values for ‘b’ and ‘c’ are directly available, then
the missing value for ‘a’ can be solved directly by providing the constraint a = b − c.
For the given line item (‘a’ in this case), all the multiple imputed datasets should
have the same value. This condition can be achieved using passive imputation
in MICE.
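As a minimal sketch of what passive imputation looks like in the R mice package (the columns a, b and c below are the hypothetical line items from the example above, with toy values; this is an illustration, not the study's code):

```r
library(mice)

# Hypothetical line items with the identity a = b - c in the observed rows
dat <- data.frame(
  b = c(10, 12, 15, 18, NA, 25),
  c = c(4, 5, 6, NA, 8, 9),
  a = c(6, 7, 9, NA, NA, 16)
)
meth      <- make.method(dat)
meth["a"] <- "~ I(b - c)"   # passive imputation: 'a' is derived, never drawn
imp <- mice(dat, m = 5, method = meth,
            visitSequence = c("b", "c", "a"), printFlag = FALSE, seed = 1)
complete(imp, 1)            # 'a' equals b - c in each of the m imputed datasets
```

Because ‘a’ is computed rather than drawn, it takes the same derived value in every imputed dataset whenever ‘b’ and ‘c’ are observed, which is exactly the condition described above.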
The current literature review suggests that traditional imputation techniques such as
mean/median imputation, list wise deletion, and omission of variables cannot accom-
modate the accounting constraints. For example, if the line items in the financial
statements show a clear time trend, mean/median imputation cannot provide appeal-
ing results. If the imputation is performed using linear regression methodologies, the
imputed values would not adhere to accounting rules (see Table 2).
In our study, we focus on imputing the missing values subjected to accounting
constraints in the balance sheets, income statement and cash flow statement using the
MICE approach. The key objective of this study is to impute the missing value in


fundamental data using existing literature on the MICE framework and also integrate
the interdependency among variables to comply with accounting rules (i.e., the principal
requirement for fundamental data). In the proposed framework, the imposition of
constraints is automated based on the tree structure that adheres to accounting rules.
The automation of constraint selection allows a different set of constraints for each
company, as the pattern of missing values differs across companies.
This paper contributes to the growing literature on solving the missing value problem
in a number of ways. In this study, we have used the MICE framework to solve the
missing value problem in fundamental datasets. Fogarty (2006) and Galler and
Kehral (2012) have used the multiple imputation technique for their reject inference
problem and credit risk problem, respectively. However, our approach focuses on solving
multiple constraints in fundamental data to create multiple datasets, which is the first
phase in Rubin’s (1987) three-phase approach to multiple imputation.
The solution proposes a unified methodology for imputing missing values using
a tree structure and creates multiple datasets for a large number of line items in the
fundamental data which can be used by researchers and practitioners for their ‘analy-
sis’ model. We have utilized a technique prescribed in standard literature (especially
Rubin 1987; Schafer 1997; Raghunathan 2001; Van Buuren et al. 2006) and performed
imputation for 177 line items with multiple constraints for each line item. To achieve
this objective, we have: (a) adopted a tree structure that integrates different line items
and (b) used a two-step approach by duplicating/shadowing line items to resolve all the
constraints. The MICE approach handles the links between different financial items
seamlessly and solves all the constraints (100% in Table 2). Effectively, with only
a couple of top-level line items (total assets and net sales), imputation
can be performed for line items at all levels in the form of a tree structure using this
approach. We have used a performance metric similar to the one used by Galler and
Kehral (2012) to gauge the imputed value for each of the line items. Additionally, we
have used line charts to assess the performance of imputed values.
The remaining sections of this paper are organized as follows. Section 2 provides
a review of the relevant literature. Section 3 describes the data. Section 4 explains
the methodology, including the algorithm, assumptions, limitations and the challenges
faced while using this technique. Section 5 reports the results, and Section 6 concludes
the study with a brief discussion of how our research could be useful for
practitioners.

Literature Review

There are many imputation techniques that are commonly used to deal with missing
values. These methods can be classified as single imputation techniques (SIT) and
multiple imputation techniques (MIT).
In SIT, the missing values are replaced by “predicted values” based on the available
cases/information. Since the missing values are imputed once, these techniques are
called single imputation techniques. Little and Rubin (2002) and De Waal (2011)
detail single-value imputation methods such as mean imputation, cold deck

imputation, hot deck imputation, and regression imputation. The single imputation
method imputes a single value for the missing value and treats it as the real value
(Rubin 1987). As a result, single imputation ignores uncertainty and underestimates
variance.
Rubin (1987 and 1996) proposed the MIT which overcame the limitations of SIT.
Fogarty (2006) has provided a detailed discussion on how multiple imputation retains
the advantages of single imputation and rectifies major disadvantages. Multiple impu-
tations are repeated random draws from the predictive distribution of the missing
values. It can be approached in two ways: parametric and non-parametric. Rubin
(1987) proposed Approximate Bayesian Bootstrap (ABB) as non-parametric MIT
where one creates multiple samples (with replacement) of observed and missing val-
ues. The missing values are then imputed using a standard method. In the parametric
approach, we try to get the multivariate distribution of unknown parameters to impute
missing values.
The multiple imputation technique involves three phases, namely imputation, analysis
and pooling. In the imputation phase, the process is iteratively designed to arrive
at multiple optimal values for a given variable. In the second phase, the desired analyses
are performed on each dataset using standard complete-data methods. In the pooling
phase, all the multiple results are consolidated by calculating the mean, variance, and
confidence interval of the variable of interest. Our study is confined to the imputation
phase.
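For orientation, the three phases map directly onto the standard mice workflow in R. A minimal sketch on toy data (the analysis model lm(y ~ x) is illustrative only, not the paper's analysis model):

```r
library(mice)

# Toy data with missing values in both columns
set.seed(1)
dat <- data.frame(x = c(1:9, NA), y = c(3 * (1:9) + rnorm(9), 31))
dat$y[4] <- NA

imp  <- mice(dat, m = 5, printFlag = FALSE, seed = 1)  # phase 1: create m imputed datasets
fits <- with(imp, lm(y ~ x))                           # phase 2: analyse each dataset
summary(pool(fits))                                    # phase 3: pool estimates by Rubin's rules
```

As the paper notes, this study stops after phase 1; phases 2 and 3 are left to the downstream user of the imputed datasets.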
The role of multiple imputation is to perform the imputation process several times,
so that the practitioners can perform modelling and analysis on multiple datasets
before arriving at a conclusion. The advantages of MIT are its flexibility and ability to
accommodate various scenarios. Instead of the traditional reject inference approaches,
Fogarty (2006) used the multiple imputation approach to enhance the information
inferred from the rejected applications by treating the reject inference problem in
credit scoring models as a missing data problem. Galler and Kehral (2012) performed
multiple imputation to assess the probability of default of a given company using
financial ratios as independent variables. Bouhlila and Sellaouti (2013) performed
missing value imputation for the student background data file for Tunisia using the
multiple imputation approach.
In multivariate missing data, two methods are generally used: joint modelling (JM)
and fully conditional specification (FCS) (Buuren and Groothuis-Oudshoorn 2010).
MICE, also known as FCS as detailed in Van Buuren et al. (2006), is an approach for
imputing missing values based on a set of imputation models. Solving constraints is
relatively more difficult in JM than in MICE, as the latter provides options to perform passive
imputation for each variable that accounts for the constraints. The variable-by-variable
imputation method of MICE provides additional flexibility to deal with the problems
related to constraints.
Rubin’s (1987) work on multiple imputation serves as a bridge between expectation
maximization (EM) and the later simulation techniques that involve a structure similar
to EM (Kennickell 1991). The MICE algorithm uses Gibbs sampler technique to suc-
cessively draw random samples of the imputations from the posterior distribution of
the variable (Van Buuren et al. 2006). While explaining the Gibbs sampling technique,
Kennickell (1991) mentions that the unique parameter of the distribution is estimated


based on the observed data from the “M” stage of EM algorithm. Van Buuren et al.
(2006) observed that MICE allows increased flexibility in model building. MICE can
easily incorporate constraints on the imputed values, work with different transforma-
tions of the same variable and also account for skip patterns, rounding, etc. Another
advantage of MICE is its ability to incorporate datasets with thousands of observations
and hundreds of variables (He et al. 2009; Stuart et al. 2009).
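For concreteness, the chained-equations (Gibbs) iteration referred to above can be sketched compactly, in the spirit of Van Buuren et al. (2006). Here Y_j denotes the j-th incomplete variable and θ_j the parameters of its conditional model; the notation is ours, not reproduced from the paper:

```latex
% One iteration t of the FCS/MICE sampler, cycling over the variables j = 1,...,p
\theta_j^{*(t)} \sim P\!\left(\theta_j \mid Y_j^{\mathrm{obs}},\, Y_{-j}^{(t)}\right),
\qquad
Y_j^{*(t)} \sim P\!\left(Y_j^{\mathrm{mis}} \mid Y_j^{\mathrm{obs}},\, Y_{-j}^{(t)},\, \theta_j^{*(t)}\right),
% where Y_{-j}^{(t)} = (Y_1^{(t)}, \ldots, Y_{j-1}^{(t)}, Y_{j+1}^{(t-1)}, \ldots, Y_p^{(t-1)})
% collects the most recent imputations of all the other variables.
```

Each variable is thus imputed from its own conditional model given the latest imputations of the others, which is what makes variable-by-variable constraints straightforward to attach.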
Rubin (1976) observes three types of missing values: (1) missing completely at
random (MCAR) where the missing values are randomly distributed across the obser-
vations; (2) missing at random (MAR) where the missing values can be explained
by other variables; and (3) missing not at random (MNAR) where the missing val-
ues are not related to the values of observed cases. Schafer (1999) mentions that the
assumption of missing at random (MAR) is not required in multiple imputation.
Another important step that a practitioner should consider while performing miss-
ing value imputation using the MICE algorithm is to identify an appropriate model
specification for each variable. King et al. (1998) suggest that the missing value impu-
tation model need not be the “analysis” model, as the risk of over-specification is not a
concern. However, we have used the analysis model for missing value imputation as
suggested by Schafer (1997) and Raghunathan (2001).
The literature suggests that the multiple imputation technique accounts for uncer-
tainty in the dataset, unlike the single imputation technique. A theoretical understanding
of the missing mechanism is required before imputation of missing values. The miss-
ing mechanism can be either MCAR, MAR or MNAR. The multiple imputation
technique is used by several practitioners. When multiple imputations for several vari-
ables are performed sequentially, the imputations are obtained by fitting a sequence of
regression models and drawing values from the corresponding predictive distributions
(Raghunathan 2001). However, performing missing value imputation where multiple
constraints are involved has unique challenges, which are not completely explored in
other studies. The MICE framework can accommodate multiple constraints given its
flexibility to allow variable-by-variable imputation. By adopting MICE framework,
we utilize several options such as passive imputation, order of the imputation of vari-
ables and predictive mean matching to impute missing values in the fundamental data
containing multiple constraints.

Data

Most asset management strategies are built using fundamental analysis of statements
based on earnings, sales, debt, cash flows and related metrics such as profitability
ratios, liquidity ratios, debt ratios, and efficiency ratios. They are also used to estimate
a company’s liquidity and default risks. Asset managers also use quantitative screeners
to filter a large investible universe and obtain a smaller group of stocks, and then apply
fundamental analysis to shortlist potential investment opportunities.
The study used the fundamental data sample of US-based companies in the health-
care industry sourced from a third-party vendor. The data comprises fundamental
data of about 1000 line items containing a variety of categories, including balance
sheet, income statement, financial ratio, equity and market related information. The


data is collected both quarterly and annually for the period 2010 to 2016. However,
for 2016 only a few financial ratios are available, and none of the top-level items,
such as net sales and total assets, is available. This cannot be treated as
a forecasting problem as the top-level items can be derived directly or indirectly from
other financial ratios (using known relationships in accountancy) and these ratios are
adequate to impute the remaining line items using the framework developed in this
study.
For the current study, we assume that the missing data follows MAR, i.e., the prob-
ability that a value is missing depends only on observed values and not on unobserved
ones. The financial data is strictly governed by accounting rules, and the top-most-level
item is not imputed in this approach; only a passive imputation is required to derive net
sales and total assets. Missing line items have known relationships with non-missing
line items in the financial statements, and these relationships provide a theoretical justi-
fication for not assuming MCAR or MNAR conditions. Further, Galler and Kehral
(2012), in their study on missing data for credit risk, also assume that the mechanism
of missingness in financial statements follows MAR. In addition to this theoretical
justification for rejecting the MCAR assumption, we conducted Little’s MCAR test
(Little 1988) for a few variables. The test rejects the null hypothesis that the missing
data is MCAR. We therefore rule out both MCAR and MNAR and
assume the MAR condition to be appropriate.
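The paper does not name its implementation of Little's (1988) test. As one hedged illustration, the mcar_test function in the R package naniar computes it; the data frame and column names below are hypothetical toy values, not the study's data:

```r
library(naniar)

# Toy slice of fundamental data with hypothetical line-item columns
fin <- data.frame(
  net_sales     = c(120, 135, 150, NA, 170, 185, 200, NA),
  total_assets  = c(300, 320, NA, 360, 380, 400, NA, 440),
  current_ratio = c(1.8, NA, 2.0, 2.1, NA, 2.3, 2.4, 2.5)
)
mcar_test(fin)  # a small p-value rejects H0 that the data are MCAR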
The present study imputes values for a total of 177 line items across categories viz.
balance sheet, income statement, cashflow and financial ratios. Of the 177 line items,
73 items are from the balance sheet, 36 items are from the income statement, 40 items
are from the cashflow statement and 28 items are financial ratios. The values in the
fundamental dataset for 2016 are not available for various reasons and our study utilizes
the SME’s opinion to mark the anomalies as missing data. After passive imputation is
performed using financial ratios, only the top-level line item (total assets) is available
in the balance sheet; hence, a top-down approach is adopted to impute the lower-level
items in the form of a tree structure.

Methodology

In the proposed approach, we use the tree structure to integrate multiple accounting
constraints. Then, the imputation is performed using a two-step approach detailed in
the subsequent sections. We use the MICE framework to generate multiple datasets
for imputation of missing values. The output will have multiple datasets, implying that
there are multiple optimal solutions. The researchers and practitioners can perform
modelling and analysis on multiple datasets instead of a single dataset using a process
called ‘pooling’ to present their final conclusions.
In the MICE algorithm, we obtain the multivariate distribution of θ, a vector of unknown
parameters of the multivariate distribution of Y (the variables with missing values), by
iterative sampling from the conditional distributions (Buuren and Groothuis-
Oudshoorn 2010). The algorithm achieves convergence when the estimated parameter
attains a pre-specified threshold.
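In the R mice package, convergence is commonly judged from trace plots of the chain statistics rather than a single numeric threshold; a small sketch on toy data (illustrative only):

```r
library(mice)

# Toy data with a linear relationship and a few missing values
set.seed(1)
dat <- data.frame(x = c(1:8, NA, NA), y = c(2 * (1:8) + rnorm(8), 19, NA))
imp <- mice(dat, m = 5, maxit = 20, printFlag = FALSE, seed = 1)
plot(imp)  # means/SDs of imputed values per iteration; flat, well-mixed chains suggest convergence
```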


Tree Structure

The dataset contains multiple line items that have interdependencies in the form of
accounting constraints. To incorporate all the accounting constraints, we adopt a
top-down approach, which defines the order of imputation of variables. The order in which
the variables are imputed is very important while performing variable-by-variable
imputation. If the order of imputation of variables is not considered, it may intro-
duce circular constraints that cannot be solved. To avoid this incompatibility, we have
introduced the ‘tree structure’ approach to streamline the order of imputation. This
idea stems from the fact that for financial statements such as balance sheet, income
statement and cash flow statement, we have a top-most-level item like total assets, net
sales, and net cash, respectively. Using the asset turnover ratio that links both total
assets and net sales, we prepared a single tree structure for all the line items based
on their predecessors. For example, we consider the asset turnover ratio as a level 0 item.
Both total assets and net sales are level 1 items. The components that add up to total
assets and net sales are level 2 variables, and so on. With this mapping, we can assign
every item in the balance sheet, income statement and cash flow statement
to a specific level.
The algorithm should respect the relationship between the predecessor and the
successor (see Fig. 1). Any line item has three
relationships:
1. The predecessor, which is typically a top-level item. However, if a few ratios that
involve the specific line item are directly available, the predecessor will be changed
to that specific ratio.
2. The successors, which are typically lower-level items. In other words, these are
other line items that can be imputed only after imputation of a given line item.
3. The neighbours, which share the same constraint linked by a common equation. For
example, total debt is equal to the sum of short-term debt and long-term debt. Both
short-term and long-term debt are neighbours as they share the same relationship
with total debt. With this approach, we are able to preserve the relationships for
all the line items.
In MICE, imputation is performed in a chained manner. We have fixed the ordering
sequence of imputation based on specific levels. For this purpose, we have assigned
a ‘level’ to each financial line item. In our study, the line items take levels
from level 0 to level n. Based on these levels, a specific ordering sequence is provided
in MICE. As we follow a top-down approach, the ordering sequence followed is from
level 0 to level n. The ‘level’ defines the order in which the constraints are solved. The
first level items are imputed in the first stage, then the next level items are imputed
sequentially. Level 0 either means the value is directly available or it can be directly
computed using passive imputation and there is no ambiguity over the predictions.
Level 1 means there is one top-level item that is available and there are multiple lower
level items; hence, in this case, MICE imputation technique has to be performed. For
example, if there is an equation A = B + C + D, and if A is available, B, C and D have


Fig. 1 An example of the relationship of a line item (Gross Income) with other line items in the tree

to be imputed. Here, A is assigned level 0 and B, C and D are assigned level 1. As
there are multiple line items in level 1, we have provided another classification called
‘importance’, which is assigned based on the average of historical values for the line
items B, C and D to resolve the order of imputation of variables. For example, if
B > C > D, then the importance for B, C and D is ranked 1, 2 and 3, respectively.
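As an illustration of this bookkeeping, a minimal R sketch for the example A = B + C + D; the item names, predecessors and historical averages are hypothetical:

```r
# Level/importance bookkeeping for the hypothetical tree A = B + C + D
tree <- data.frame(
  item        = c("A", "B", "C", "D"),
  predecessor = c(NA, "A", "A", "A"),
  level       = c(0, 1, 1, 1),
  avg_history = c(NA, 300, 120, 50)   # average of historical values per item
)
# Rank neighbours within each level by average historical magnitude (largest first)
tree$importance <- ave(tree$avg_history, tree$level,
                       FUN = function(x) rank(-x, ties.method = "first"))
tree[order(tree$level, tree$importance), "item"]   # imputation order: A, B, C, D
```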

Two-Step Imputation Process

In our approach, we have used the predecessor as a predictor variable (either the raw
predecessor, the difference in the predecessor, or the growth of the predecessor), i.e.,

successor = β0 + β1 × predecessor.

For a few line items, this relationship may not hold due to specification error. Under
these circumstances we have used the skip-level predecessor as a predictor variable,
where the correlation with the skip-level variable is higher than the correlation with
the immediate predecessor.
As a few line items (such as unexpected gains/losses) do not follow a specific
distribution, we have adopted a two-step approach to ensure convergence of imputed
values by duplicating or shadowing all the line items. In the first stage, the variable
is predicted based on its relationship with the predecessor. In the second stage, the
duplicated line item is imputed based on the prediction from the first stage while
simultaneously solving the accounting constraints.
For example, Sales = Gross income + Depreciation, Depletion and Amortization
(DDA) + Cost of Goods Sold (COGS). Then, gross income should solve the constraint
below:

Gross income = Sales − DDA − COGS.


The mathematical process of the two-stage approach for the above example is given
below:

Stage 1: Gross income = β0 + β1 × Sales, without any constraints.
Stage 2: Duplicated(Gross income) = β0 + β1 × Gross income, with the constraint
Duplicated(Gross income) = Sales − DDA − COGS.
If the gross income is constrained to follow the relationship with its predecessor, it
might deviate from the accounting constraint. For fundamental data, solving the
constraint is more important than following the distribution, as explained for the
line item unexpected gains/losses. Moreover, β1 in the second stage is expected to be
close to 1. Hence, deviations of β1 in the second stage from this desired value indicate
that the distribution of the imputed value that solves the constraint is deviating from
the desired distribution. By verifying β1 in the second stage, we can gauge the
mis-specification error and improve the relationship of the line items by choosing,
for example, skip-level predecessors.
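A minimal sketch of this two-step scheme in the R mice package, using predictive mean matching for the stage-one draw and passive imputation for the duplicated (shadow) item. The data and column names (gi, gi_dup, etc.) are hypothetical, and this is our reading of the approach rather than the authors' production code:

```r
library(mice)

# Hypothetical line items; gross income (gi) is missing for the last two years
dat <- data.frame(
  sales = c(100, 110, 121, 133, 146, 161),
  dda   = c(10, 11, 12, 13, 15, 16),
  cogs  = c(50, 55, 61, 67, 74, 81),
  gi    = c(40, 44, 48, 53, NA, NA)
)
dat$gi_dup <- dat$gi                          # duplicated/shadow line item

meth <- make.method(dat)
meth["gi"]     <- "pmm"                       # stage 1: draw gi near its fitted value
meth["gi_dup"] <- "~ I(sales - dda - cogs)"   # stage 2: shadow item solves the constraint

pred <- make.predictorMatrix(dat)
pred[, ] <- 0
pred["gi", "sales"] <- 1                      # the predecessor is the only predictor

imp <- mice(dat, m = 5, method = meth, predictorMatrix = pred,
            visitSequence = c("gi", "gi_dup"), printFlag = FALSE, seed = 1)
complete(imp, 1)  # gi_dup equals sales - dda - cogs in every imputed dataset
```

Regressing gi_dup on gi in each completed dataset then yields the stage-two β1 used above to flag mis-specification.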
The algorithm should also keep the tree structure intact even if a few line items can
be directly derived from the financial ratios. This situation can be well captured by the
approach followed in this study. The proposed algorithm works as follows:
1. With the initial tree structure, the line item ‘net income (bottom line)’ takes a
level of 5, derived from the level 0 predecessor (net sales or revenues) through
several intermediary line items.
2. However, for most of the companies, the profit-to-sales margin is available. By virtue
of this relationship, ‘net income (bottom line)’ can take level 1, as it can be derived
directly once net sales is imputed.
3. Since the neighbour of ‘net income (bottom line)’, extraordinary gains/losses
from the sale of assets, shares the same constraint as net income, the tree structure is
still preserved.

Constraints

The process by which the constraints have evolved is explained in this section. Let us
assume that we are looking for imputation of a line item called gross income. The line
item is derived from Net Sales which is a non-missing value. As per accounting rules,
net sales is the aggregation of gross income, depreciation, depletion and amortization
(DDA) and cost of goods sold (COGS). In the tree structure, we are not providing this
constraint to net sales. However, each of the gross income, DDA and COGS will take
this constraint by rewriting the above relationship for each of them as mentioned in
Table 1.
Table 1 represents the implementation of the constraint sheet. Each line item has a
unique field number and a unique predecessor. Further, every line item has two more
fields associated with it, namely level and importance, which decide
the sequence in which the line item is taken up for imputation. For example, all
level 1 items are imputed before any level 2 item. Importance is defined as the
significance of a line item within the same level. For example, line items 29XX and
39XX are both at level 1, but 29XX is imputed before 39XX because it has higher
importance. Every line item also has a unique constraint that it has to solve based on


Table 1 A representation of the constraint sheet for line items

Item name    | Field number | Predecessor | Level | Importance | Constraint
Net sales    | 23XX         | -           | 0     | -          | -
Gross income | 29XX         | 23XX        | 1     | 1          | Px023XX − Px039XX − Px051XX
COGS         | 51XX         | 23XX        | 1     | 2          | Px023XX − Px039XX − Px029XX
DDA          | 39XX         | 23XX        | 1     | 3          | Px023XX − Px029XX − Px051XX

the values (imputed or actuals) of other line items. For example, the line item gross
income (29XX) should follow the identity

LI(29XX) = LI(23XX) − LI(39XX) − LI(51XX).

Our approach solves all the constraints and the line plots indicate that the imputed
values are acceptable to SMEs. Further, all the imputed values are flagged so that all
the data users are aware of the exact location of the imputed value and subjectively
judge its quality on a case-to-case basis.
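As an aside, the constraint strings in Table 1 can be translated mechanically into mice passive-imputation formulas. A hedged sketch, assuming PxNNNN refers to the column holding field NNNN and that the data frame uses hypothetical column names of the form fNNNN:

```r
# From Table 1, row 'Gross income'; fNNNN column names are hypothetical
constraint  <- "Px023XX - Px039XX - Px051XX"
formula_str <- paste0("~ I(", gsub("Px0", "f", constraint), ")")
formula_str  # "~ I(f23XX - f39XX - f51XX)", usable as a passive method string in mice
```

This kind of string-to-formula translation is what makes the per-company, dynamically selected constraints tractable to automate.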

Algorithm

The algorithm takes the following steps:

1. The data for a specific company ID is retrieved from the base file and all the
non-missing values are marked as level 0.
2. A constraint sheet is prepared that has the following information: field ID, field
name, level, predecessor and a constraint equation as represented in Table 1. The
levels, predecessors and constraint equations are initially designed based on the
known relationships. For example, sales and total assets are level 0 items and so
on. We note that the level, predecessor and constraint equations are dynamically
selected within the algorithm.
3. For initial imputation, we use only one variable which is the predecessor. The
assumption is that all the lower level items have a specific relationship with their
predecessor. This is true in most of the cases. However, cash flow items are also
derived from income statements and balance sheets of the previous year and the
current year. For example, an increase or decrease in cash is calculated based on
the difference in cash in the balance sheets of the two consecutive years. For such
cases, we assign the predictor variable as the difference of the cash itself. For a
few other items, we provide growth of predecessor as the predictor variable.
4. For a few cases, the relationship between the predecessor and successor does not
hold true. As MICE will impute the values based on the distribution, the solutions
in a few rare cases are not feasible. To avoid convergence issues, the algorithm
has been modified slightly so that we duplicate each and every line item. In the first
stage, we will perform the imputation based on the predictor and in the next stage,
we will solve the constraint by slightly moving away from the original distribution.

The disadvantage is the risk of misspecification error, which can be minimized by
gauging the beta coefficients in the second stage.
5. The levels, predecessors and constraint equations are updated for each company
based on the data that is available for the specific company. We intend to promote
each and every line item based on the non-missing data i.e., a level 3 variable
will be promoted to level 1 if a financial ratio involving the missing line item is
available directly. Accordingly, the levels, predecessors and equations are updated
simultaneously without breaking the already existing constraints.
6. The MICE package in R performs imputation from a univariate distribution for
every company (as we use one variable, the predecessor, as the predictor). The
functional form is a one-variable linear regression with the predecessor as the
predictor variable, and the imputed value simultaneously solves the given
constraints. The MICE package in R has an inbuilt option of predictive mean
matching (pmm) for choosing a value within the distribution. This is the same as
linear regression, except that instead of imputing the predicted value itself, the
imputation uses an observed value within the distribution that is very close to the
predicted value; in essence, the imputed value is drawn from the distribution itself.
The choice of pmm is intuitive: as it is a primary requirement that the imputed
value follow the distribution of the other non-missing values, pmm imputation
is advantageous. For cases where the data clearly displays a time trend,
modelling the difference of the dependent variable (such as an AR(1) process)
using the difference of the independent variable achieves the desired performance;
in this case, the independent variable is diff(x) and the dependent variable is
diff(y) (see the sketch after this list). Further, providing accounting constraints
is possible within the MICE package.
7. We verify the plots of the line items in an interactive chart and check the
reasonableness of the imputed values, together with out-of-time testing.
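A small sketch of the differencing trick from step 6, with hypothetical column names: impute the first difference of a trending cash-flow item from the difference of its predecessor, then cumulate back to levels.

```r
library(mice)

# Year-on-year differences of a predecessor (sales) and a trending item (operating cash flow)
dat <- data.frame(
  d_sales = c(10, 11, 12, 13, 15),   # diff(x)
  d_ocf   = c(4, 5, NA, 6, NA)       # diff(y), with missing changes
)
imp <- mice(dat, m = 5, method = "pmm", printFlag = FALSE, seed = 1)
d_ocf_hat  <- complete(imp, 1)$d_ocf
ocf_levels <- 50 + cumsum(d_ocf_hat)  # rebuild levels from a hypothetical base value of 50
```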

Assumptions

The assumptions of the methodology are given below:

1. The major assumption in this algorithm is that the functional form represents
the precise relationship between the independent and dependent variables. As the
financial ratios are mostly stable across time, the regressor coefficients will be
stable and the assumption will be valid for fundamental data. A few variables
follow the growth pattern instead of ratios. For example, we use growth in sales,
operating cash flow, current assets, etc.
2. The financial ratios that involve the dependent and independent variables are
assumed to be constant during the imputation process. We assume that there are
no corporate actions such as buyback of shares, merger and acquisitions, etc. that
could impact the financial ratios.
3. The dependent variable can be imputed with one variable. This assumption is valid
as we have very few data points and there is only one predecessor for each line
item.


4. Even if the values are slightly away from the distribution for a few lower-level
items, it is still not a big concern as these values adhere to the financial constraints.
For example, if most of the values are zero (except one or two), then even if the
imputed values are not zero, it does not pose a serious threat to the methodology.
5. The predictors provided, based on the predecessor relationships, will hold for all
the companies in all the sectors.
6. The input data is of high quality.

Limitations

For a few companies, the ratios of the dependent to independent variables are volatile
and cannot be used as a modelling component, as this affects accuracy. Although the
items at lower levels have multiple predecessors, the characteristics of fundamental
data lead the framework to prefer fundamental ratios, growth rates or AR(1)
processes for the imputation. Current forecasting methods also use financial ratios
widely, and hence the severity of this limitation is low. Any unusual trend can be
observed in the visual charts of the time-series plots. Even if only a weak relationship
exists between a line item and its predecessor, MICE's predictive mean matching
imputes a value from the distribution, so the severity of this limitation is also low. The
identification of data anomalies should be a separate exercise before processing the
data within the MICE framework; the anomalies can be flagged by consulting SMEs
and then treated as missing data within the same framework.

Results and Discussion

We have performed imputation for 177 line items using the MICE framework and
adhered to all the accounting rules. Overall, we have 73 constraints for various line
items that are solved using this approach. We have plotted line charts (see Fig. 2) and
visually compared the trend of line items to assess the reasonableness of the imputed
values. The results (see Table 2) indicate that the approach can handle such large
problems efficiently.
Further, we have performed out-of-time testing to validate the imputation per-
formed using the MICE approach. Wherever values are missing for 2016, we
removed the corresponding values for 2015. We also excluded the 2016 data, since
most of its values are themselves imputed. We then used the same process employed
for 2016 to impute values for 2015, compared these with the actual values, and
analyzed the error and percentage error. We have grouped the differences into five
categories and provided the results in Table 3. The results are good and in line with
the expectations of SMEs, given that we have adopted a relatively simple approach
instead of developing a different model methodology for every line item.
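For reference, a sketch of the error bucketing behind Table 3, with hypothetical 2015 actual and imputed values and the cut points taken from the table:

```r
# Percentage-error buckets as in the left panel of Table 3 (toy values)
actual  <- c(100, 250, 40, 75, 500, 310)
imputed <- c(104, 230, 58, 74, 520, 305)
pct_err <- 100 * (imputed - actual) / actual
table(cut(pct_err, breaks = c(-80, -50, -10, 10, 50, 80)))
```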
The error is within the 10% range for a majority of line items imputed. The higher
percentage error for a few financial ratios is due to the small magnitude of the actual


Fig. 2 Sample of imputed line items for 2016 (2010–15 are historical values)

Table 2 Percentage of constraints solved using MICE

Methodology       | Total number of line items | Total constraints | Number of constraints solved | % of constraints solved
MICE              | 177                        | 73                | 73                           | 100
Linear regression | 177                        | 73                | 0                            | 0

Table 3 Out-of-time test performance metric: (left) percentage error; (right) error

Percentage error table:
Range         | Line items
−80 to −50%   | 8
−50 to −10%   | 21
−10 to 10%    | 127
10 to 50%     | 14
50 to 80%     | 7

Error table*:
Range               | Line items
Below −500 million  | 2
−500 to −50 million | 18
−50 to 50 million   | 140
50 to 500 million   | 16
Above 500 million   | 1

*The companies' revenues are in billions

value of the line item. Further, only three line items have an absolute error of more
than $500 million.
Further, we have reported the difference in model estimates in out-of-time testing.
We have compared the results of one of the many imputed datasets with the results of
the models based on actual data for the year 2015, following Galler and Kehral (2012).
The difference of the beta estimate is calculated as:


Table 4 Difference in beta of slope between actual and imputed data sets for stage-one model

Difference in beta estimates of the slopes between actual and imputed data sets | Line items
0       | 8
0–0.4   | 133
0.4–0.8 | 21
0.8–1.5 | 10
> 1.5   | 4

Table 5 Difference in beta of intercept between actual and imputed data sets for stage-one model

Difference in beta estimates of the intercepts between actual and imputed data sets | Line items
0–10     | 9
10–100   | 29
100–1000 | 132
> 1000   | 7

Δβ = β_actual − β_imputed,

where β_actual is the coefficient of the analysis model with the actual values in 2015,
and β_imputed is the coefficient of the analysis model with the imputed values (one of
the multiple datasets).

These results are provided in Table 4. The difference indicates the deviation of our
imputed values from the actual values of the analysis model. For most of the line items,
the difference is below 0.4 for the slope coefficient and this difference is acceptable
to SMEs. The differences in the beta estimates of intercepts are also presented for
completeness (see Table 5), although the beta estimates for the slope are of prime
interest to the SMEs.
Further, the coefficients of point estimates (desired value is β = 1) for the second
stage model are presented in Table 6. For most of the line items, the coefficients lie
between 0.95 and 1.05, indicating only a slight deviation from the actual value after
accounting for the constraints.
We have also presented the line charts of a few imputed line items for the year 2016
in Fig. 2. The imputed values (last point in the line chart) follow the previous trends
closely and these line plots are acceptable to the SMEs for their downstream analysis.
Although there is a small deviation between the actual and predicted values, for all
practical purposes it does not pose any serious issue, as it corresponds to lower-level
line items that follow all accounting constraints.


Table 6 Beta estimates between imputed values and the expected values for second-stage model

Beta estimates in the second stage between imputed values and the expected values | Line items
0–0.30    | 0
0.30–0.60 | 0
0.60–0.80 | 9
0.80–0.95 | 16
0.95–1.05 | 92
1.05–1.20 | 27
1.20–1.40 | 6

27 line items are excluded as actual and imputed values are zero

Conclusion

Our research studies the application of MICE and integrates interdependency of line
items in the financial statements by imposing accounting constraints. Errors are low
as evident from the small deviation between actual and predicted values, indicating
that the results are good and conclusive. Also, the process of identifying the functional
relationship between different line items requires functional knowledge in handling
financial statements. These functional relationships can be further improved by lever-
aging the experience of asset managers and the approach can be customized for various
line items to improve the efficacy of the imputation process.
We tested our approach by imputing 177 line items that contain 73 constraints, where
the missing values are scattered across several line items in the balance sheet, income
statement and cash flow statement. The study reveals that our proposed framework is
able to capture the trend of specific variables, impute values for a few line items in
keeping with their distributions, and solve the accounting constraints.
Application of the multiple optimal solutions in downstream models allows
practitioners to identify the parameters of interest in a scientific manner without elim-
inating the missing values. The approach will be useful for different groups such
as academicians, analysts, investment managers and policy makers for a variety of
purposes, such as studying default behavior, providing trading signals and identifying
investment opportunities. From pair trading based on pattern recognition to the invest-
ment strategies of private equity, wealth funds and pension funds, and the asset
allocation strategies of investment managers, fundamental data is the most important
requirement for devising any strategy. Banks can decide lending strategies for a portfolio
that has limited or missing data. Given that the framework is developed in a gener-
alized manner, rigorous testing is recommended on downstream models, particularly
out-of-time testing as performed in this study.
The study can be extended further to cover several other scenarios. If the top-
most-level items are forecasted separately, other financial items can be derived by
treating the other line items as a missing value problem. Instead of using a single
line item, multiple line items could be used as predictors by combining industry-
wide indicators with market-related information. If the sales can be forecasted for
benchmark companies (bellwether companies) using generic measures such as the
Index of Industrial Production and industry-wise predictors, the sales of mid-sized
and small-sized companies can be forecasted using the tree approach proposed in our
framework. All the problems mentioned above can be solved by extending the proposed
framework. Hence, we conclude that a two-step approach and a tree structure under the
MICE framework enable us to enforce accounting rules involving multiple constraints
for imputing missing values in financial data.

References
Bouhlila, D.S., and F. Sellaouti. 2013. Multiple imputation using chained equations for missing data in
TIMSS: a case study. Large-scale Assessments in Education 1: 1–33.
Buuren, S.V., and K. Groothuis-Oudshoorn. 2010. Mice: Multivariate imputation by chained equations in
R. Journal of Statistical Software 45: 1–68.
Van Buuren, S., J.P.L. Brand, C.G.M. Groothuis-Oudshoorn, and D.B. Rubin. 2006. Fully Conditional Spec-
ification in multivariate imputation. Journal of Statistical Computation and Simulation 76: 1049–1064.
De Waal, T. 2011. Handbook of statistical data editing and imputation. New York: Wiley.
Fogarty, D.J. 2006. Multiple imputation as a missing data approach to reject inference on consumer credit
scoring. Interstat. 41: 1–41.
Galler, B., and U. Kehral. 2012. Missing data methods in credit risk. Kirchberg: 5th European Risk
Conference. (13–14 September 2012).
He, Y., A.M. Zaslavsky, M.B. Landrum, D.P. Harrington, and P. Catalano. 2009. Multiple imputation in a
large-scale complex survey: a practical guide. Statistical Methods in Medical Research 19: 653–670.
Kennickell, Arthur B. 1991. Imputation of the 1989 survey of consumer finances: stochastic relaxation and
multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical
Association 1 (10): 41.
King, Gary, et al. 1998. List-wise deletion is evil: what to do about missing data in political science. Boston:
Annual Meeting of the American Political Science Association.
Kofman, P., and I.G. Sharpe. 2000. Imputation methods for incomplete dependent variables in finance.
School of Finance and Economics. Sydney: University of Technology.
Little, R.J., and D.B. Rubin. 2002. Statistical analysis with missing data. New York: Wiley.
Little, Roderick J.A. 1988. A test of missing completely at random for multivariate data with missing values.
Journal of the American Statistical Association 83 (404): 1198–1202.
Pagano, A., Perrotta, D., and S. Arsenis. 2012. Imputation and outlier detection in banking datasets. Paper
presented at 46th SIS Scientific Meeting of the Italian Statistical Society, Rome.
Raghunathan, T.E. 2001. A multivariate technique for multiply imputing missing values using a sequence
of regression models. Survey Methodology 27: 85–96.
Rubin, D.B. 1976. Inference and missing data. Biometrika 63 (3): 581–592.
Rubin, D.B. 1987. Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D.B. 1996. Multiple imputation after 18 + years. Journal of the American statistical Association 91:
473–489.
Schafer, J.L. 1997. Analysis of incomplete multivariate data. Florida: CRC Press.
Schafer, J.L. 1999. Multiple imputation: a primer. Statistical Methods in Medical Research 8: 3–15.
Stuart, E.A., M. Azur, C. Frangakis, and P. Leaf. 2009. Multiple imputation with large data sets: a case
study of the Children’s Mental Health Initiative. American Journal of Epidemiology 169: 1133–1139.
