
Journal of Business Research 109 (2020) 271–287


Unbalanced data, type II error, and nonlinearity in predicting M&A failure


Kangbok Lee a,⁎, Sunghoon Joo b, Hyeoncheol Baik c, Sumin Han a, Joonhwan In d

a Auburn University, Harbert College of Business, 415 W. Magnolia Ave, Auburn, AL 36849, United States
b California State University Dominguez Hills, College of Business Administration and Public Policy, 1000 E. Victoria Street, Carson, CA 90747, United States
c Stockton University, School of Business, 101 Vera King Farris Drive, Galloway, NJ 08205, United States
d Ulsan National Institute of Science and Technology, Graduate School of Technology and Innovation Management/School of Business Administration, Ulsan 44919, South Korea

ARTICLE INFO

Keywords: Nonlinearity prediction; Unbalanced data; Logit and Probit model; Generalized logit model; Neural network; Merger and acquisition; Machine learning

ABSTRACT

Traditional forecasting methods for M&A data have three limitations: first, the outcome of an M&A deal is an event with a small probability of failure; second, the consequences of misclassifying failure as success are much more severe than those of misclassifying success as failure; and third, the nonlinear and complex nature of the relationship between predictors and M&A outcomes could limit the advantage of logistic regression. To overcome these limitations, we develop a forecasting model that combines two complementary approaches: a generalized logit model framework and a context-specific cost-sensitive function. Our empirical results demonstrate that the proposed approach provides excellent forecasts when compared with traditional forecasting methods.

1. Introduction

Despite decades-long scrutiny from academics and practitioners in the disciplines of finance and management (Krishnan & Masulis, 2013; Wang & Branch, 2009), the accurate forecast of merger and acquisition (M&A) outcomes has been a challenging task for organizations that need to make strategic business decisions (Lin, Lan, & Chuang, 2013). Given that the choice of the right forecasting model is a multi-faceted decision that must consider both the pros and cons of the forecasting models of interest and the data characteristics, we claim that the use of traditional forecasting methods (e.g., logit and probit) in the M&A forecast context could have three critical limitations.

From a probabilistic model standpoint, the failure of M&A deals can be characterized as a rare event, with a small probability of failure (i.e., unbalanced data); in general, successful instances of M&A outnumber the failures (e.g., about 10% of the sample) (Wang & Branch, 2009). Under such conditions, the estimation or training of a model using such unbalanced data can introduce bias towards the success category due to the larger number of data points under that category. This could lead an organization to make a wrong business decision and ultimately endanger itself in the market.

From a managerial standpoint, it is not uncommon in the business environment that the consequences of a type II error (i.e., false negative; misclassifying failure as success) render more detrimental impacts on organizations than those of a type I error (i.e., false positive; misclassifying success as failure). Barnes (1999) noted that in takeover predictions, the penalty of misclassifying a non-target firm as a target (type II error) is significantly higher than that of misclassifying a target as a non-target (type I error), while Hansen, McDonald, Messier, and Bell (1996) explained that in fraud detection, the failure to detect fraud when it occurs (type II error) is significantly costlier to Certified Public Accountant firms than predicting fraud when it does not occur (type I error), since type II errors lead to litigation, while type I errors lead to over-auditing. Similarly, misclassifying M&A failures as successes (type II error) may incur the opportunity cost of not searching for alternative targets, and this opportunity cost may be high when compared to the cost of preparing for alternative deals as a result of misclassifying M&A success as failure (type I error). High levels of misclassification are of vital concern in takeover predictions, especially when costly type II errors occur (Rodrigues & Stevenson, 2013), that is, when M&A failures are predicted to be successes.

From a model selection standpoint, the nonlinear and complex nature of the relationship between predictors and M&A outcome data could limit the advantages of logit and probit classification methods over linear models. Research addressing the prediction of M&A success or failure has largely used the logit and probit models in the finance literature over the last three decades (see Table 2).

Corresponding author.

E-mail addresses: kbl0009@auburn.edu (K. Lee), sjoo@csudh.edu (S. Joo), hyeoncheol.baik@stockton.edu (H. Baik), szh0117@auburn.edu (S. Han),
junani12@gmail.com (J. In).

https://doi.org/10.1016/j.jbusres.2019.11.083
Received 19 March 2019; Received in revised form 27 November 2019; Accepted 28 November 2019
0148-2963/ © 2019 Elsevier Inc. All rights reserved.

This may be because such models are well-established and are, thus, considered to be powerful approaches to the classification of binary responses in the literature (Audrino, Kostrov, & Ortega, 2019). The logit and probit classification methods could overcome the shortcomings of linear regression models.¹ However, the logit and probit models cannot solve nonlinear classification problems because their decision surface is linear, as illustrated in Fig. 1 (see note in Fig. 1). These approaches separate the predictors into two regions (i.e., M&A success and failure) using a linear boundary. As depicted in the left panel of Fig. 1, the logit and probit models assume a linear decision boundary between the two classes and, thus, their prediction power could be lower than that of a neural network model using nonlinear decision boundaries. As an example, Giordani, Jacobson, Von Schedvin, and Villani (2014) were able to improve forecasting power by using a nonlinearity specification² between firm failure and leverage, earnings, and liquidity. Therefore, it can be argued that forecasting the success or failure of M&A deals requires prediction models that (i) can mitigate potential problems arising from unbalanced data, (ii) incorporate a misclassification error, and (iii) assume possible nonlinear structures of association between predictors and response variables.

This research combines two complementary approaches: a generalized logit model framework (i.e., the general form of sigmoid-type functions) and a context-specific cost-sensitive function (the penalty term in an objective function) to account for unbalanced data and type II errors, respectively. To identify the possible nonlinear relationship between predictors and M&A outcomes, we develop a neural network-based M&A forecasting model, as such models gain flexibility in the choice of the nonlinear form of threshold classification, which does not require the same functional form for all data, as with the data points in the right panel of Fig. 1. Neural network models have been proven to have good predictive power in the literature (Hatzakis, Nair, & Pinedo, 2010). The nonparametric nature of neural models makes them particularly well-suited for observational data in social science, where assumptions of linearity and normality cannot be assured³; neural network models have demonstrated the ability to accurately approximate almost any arbitrary nonlinear function.

Our research differs from existing studies in two respects. First, our proposed model employs the general form of sigmoid-type functions, which can add more parameters in order to increase the limits of the output range and control the steepness of its graph. This approach allows the proposed model to flexibly address bias towards the dominant class (i.e., the class with more observations).

Second, in the conventional setting of a model that relates a linear predictor to the response variable via a sigmoid-type link function, the model's classification accuracy is typically evaluated by estimating its error rate on the sample under the assumption that all classification errors have equal costs. As previously mentioned, this assumption does not hold in our context, as the error costs drastically differ between types (e.g., type I errors vs. type II errors). Hence, our research incorporates cost-sensitive modifications into the estimation procedures following the works of Kukar and Kononenko (1998) and Breiman, Friedman, Olshen, and Stone (1984) (we discuss this in more detail in Section 3.3).

This paper is organized as follows. In the following section, we describe a pilot study to explain how the logit function can suffer from unbalanced data and why increasing the limits of the output range and controlling the steepness of its graph can help mitigate potential problems from unbalanced data. After reviewing various forecasting models in the finance and management literature, we develop a neural network-based M&A forecasting model and assess the predictive power of the proposed model against a number of competing models. We conclude the paper with a discussion of the contributions and suggestions for future research. The implementation of our approach is available using the R software (see Appendix A).

¹ In the linear model, the dependent variable is assumed to be continuous and unbounded. By introducing the sigmoid function, the logit and probit can work well with M&A outcome data, especially if we believe in a non-monotone relationship between predictors and the probability of the M&A outcome.
² Giordani et al. (2014) implemented spline functions in a logit model to account for a highly nonlinear structure.
³ Lippmann (1987) showed that neural models are robust and are not easily confounded by the misspecification of underlying data distributions.

2. Experiments using logit functions with unbalanced data

In this section, we highlight the limitations of logit functions in forecasting unbalanced data and succinctly explain our overall modeling approach.

2.1. Unequal contribution of ones (failures) and zeros (successes) to the learning process in unbalanced data

Let us define a single output $y$ as the posterior probability of the failure class. Then, the posterior probability of the success class is given by $1 - y$. Let $t$ denote the target variable, where $t = 1$ if the input vector belongs to the withdrawn class with probability $y$ and $t = 0$ if it belongs to the completed class with probability $1 - y$; equivalently, $\Pr(t = 1) = y$. Suppose we are interested in estimating the logit model that maximizes the likelihood function $L_i$, which describes the likelihood of observing the dyadic outcome $t_i$ for an acquiring firm $i$, considering the total asset difference ($x_i$) between the acquiring firm $i$ and its target firm, given the parameter $\beta$. For our model, we get:

$$L_i = \begin{cases} \mathrm{logit}^{-1}(\beta x_i) & \text{for } t_i = 1 \\ 1 - \mathrm{logit}^{-1}(\beta x_i) & \text{for } t_i = 0 \end{cases} \quad (1)$$

We can summarize this piecewise function as $L_i = [\mathrm{logit}^{-1}(\beta x_i)]^{t_i}\,[1 - \mathrm{logit}^{-1}(\beta x_i)]^{1 - t_i}$. Maximum-likelihood logit analysis then works by finding the value of $\beta$ that provides the maximum value of the log likelihood function $\ln \prod_i^N L_i$, and we label this estimator $\hat\beta$. The asymptotic variance matrix, $V(\hat\beta)$, is also estimated to calculate the standard errors. When observations are selected randomly and there is no perfect discrimination between zeros and ones, $\hat\beta$ is consistent and asymptotically efficient. However, as previously mentioned, very small parts of data sets contain most of the important information, which can be seen by examining the variance matrix, $V(\hat\beta) = [\sum_i^N y_i(1 - y_i)\, x_i' x_i]^{-1}$. The part of this variance matrix affected by the unbalanced data is the component $y_i(1 - y_i)$. In this setting, additional ones will significantly decrease the variance (i.e., uncertainty; the inverse of $y_i(1 - y_i)$) because $y_i(1 - y_i)$ will be larger with additional ones than with additional zeros in the unbalanced data (Imbens, 1992; Lancaster & Imbens, 1996). Thus, the additional ones (the small part; rare events) of data sets contain more information than additional zeros.

2.2. Skewness refers to a lack of symmetry and causes overestimation

The problem that unbalanced data introduce into the logit model is illustrated by Fig. 2, which shows the logit curves for unbalanced data and balanced data (obtained by random under-sampling). For these data sets, Pr(M&A = failure) is shown as a function of the difference in total assets between the acquiring and target firms. The developmental sample for the curves consists of 803 M&A cases (107 failures; 13%) between 2006 and 2015 (more details about this data set are given in Section 4).

To achieve balanced data, we randomly remove majority-class examples from the original data set. For a clearer presentation, we here establish some of the notation used in this section. Considering a given data set $N$ with $m$ observations, we define subsets $N_{majority} \subset N$ and $N_{minority} \subset N$, where $N_{majority}$ is the set of majority-class examples in $N$, and $N_{minority}$ is the set of minority-class examples in $N$. Thus, $N_{majority} \cap N_{minority} = \{\}$ and $N_{majority} \cup N_{minority} = N$.
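The random under-sampling behind this notation can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R implementation from Appendix A; the class sizes mirror the paper's 803-case sample with 107 failures.

```python
import random

def undersample(data, labels, minority_label=1, seed=0):
    """Randomly drop majority-class examples until both classes have the
    same size, i.e., remove D = |N_majority| - |N_minority| examples."""
    rng = random.Random(seed)
    minority = [x for x, t in zip(data, labels) if t == minority_label]
    majority = [x for x, t in zip(data, labels) if t != minority_label]
    kept_majority = rng.sample(majority, len(minority))
    balanced_x = minority + kept_majority
    balanced_t = ([minority_label] * len(minority)
                  + [1 - minority_label] * len(kept_majority))
    return balanced_x, balanced_t

# Mimic the paper's proportions: 803 cases, 107 failures (about 13%).
x = list(range(803))
t = [1] * 107 + [0] * 696
bx, bt = undersample(x, t)
print(len(bx), sum(bt))  # 214 107
```

After under-sampling, N_balanced contains 214 observations, 107 from each class, matching Table 1's balanced panel.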


[Fig. 1 displays two panels: the logit model decision boundary (left) and the neural network decision boundary (right). The overall percentage of correctly classified cases is 51% for the logit model and 97.5% for the neural network.]
Fig. 1. The linear and nonlinear decision boundaries, logit model and neural network, respectively, are displayed, dividing M&A success and failure classes.

Fig. 2. The logit curves for unbalanced (original) and balanced (resampling) data illustrate the problem that unbalanced data introduce into the logit model.

To create the set of balanced data, we randomly select majority examples and remove them from the set $N_{majority}$. In this way, the number of total examples in $N_{majority}$ is decreased by $D = |N_{majority}| - |N_{minority}|$. In particular, we randomly select a set of majority examples in $N_{majority}$ and remove these samples from $N$ so that $N_{balanced} = N_{minority} + N_{majority} - D$. In this way, undersampling readily gives us a simple method for adjusting the balance of the original data set $N$.

We then fit the logit model to the data and compute the slope coefficient ($\beta_1$) and location parameter ($\beta_0$). The probability of M&A failure can be expressed as $\Pr(\text{M\&A} = \text{failure} \mid x) = p(t = 1 \mid x) = (1 + \exp(-(\beta_0 + \beta_1 x)))^{-1}$.⁴ The logit curve has the properties that the value of $x$ at which Pr(M&A = failure) is 50% is given by $-\beta_0/\beta_1$, and the parameter $\beta_1$ is such that the distance between the 95% point and the 50% point is approximately $3/\beta_1$. Thus, the slope and spread parameters can be defined by $|\beta_1|$ and $|3/\beta_1|$, respectively. Therefore, the logit curve is completely determined by the two parameters, where the 50% value locates the midpoint of the curve and the spread provides the shape of the curve.

⁴ If we assume independence across the dyadic relationships between acquiring and target firms, the likelihood of observing success versus failure in our sample is simply the product of the individual likelihoods, $L = \prod_i L_i$. A model provides a good fit to the data if it results in a high value of $L$.
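Given fitted coefficients, the midpoint and spread defined above are simple to compute. The following Python sketch uses the point estimates reported in Section 2.2; the coefficient signs are my reading of the extracted text, and the underlying 2006–2015 data are not reproduced here.

```python
def midpoint(b0, b1):
    """The x at which Pr(M&A = failure) = 50%, i.e., -b0/b1."""
    return -b0 / b1

def spread(b1):
    """Approximate distance between the 95% and 50% points: |3/b1|."""
    return abs(3.0 / b1)

# Estimates as reconstructed from Section 2.2 (signs are assumptions).
b0_unbal, b1_unbal = -1.04, -0.47   # unbalanced (original) data
b0_bal, b1_bal = 1.28, -0.69        # balanced (resampled) data

print(round(midpoint(b0_unbal, b1_unbal), 2), round(spread(b1_unbal), 2))  # -2.21 6.38
print(round(midpoint(b0_bal, b1_bal), 2), round(spread(b1_bal), 2))        # 1.86 4.35
```

The small differences from the reported 1.85 and 4.34 come from rounding; the qualitative conclusion is the same: the unbalanced-data curve has the larger spread, i.e., it is flatter.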

A model provides a good fit to the data if it results in a high value of $L$ (as discussed in the previous section), in other words, a high likelihood of observing the success or failure M&A combinations in the sample given the parameter values of $\beta_0$ and $\beta_1$. Since maximum likelihood allows us to choose the optimal values of $\beta_0$ and $\beta_1$, the estimated values ($\hat\beta_0$ and $\hat\beta_1$) that provide the best fit to the unbalanced data may sharply underestimate the probability of rare events (in this case, M&A failure). With the estimated parameters, we can predict the probability of a failed M&A. Fig. 2 illustrates a logit model that includes the difference in total assets as a single explanatory variable. By plugging the explanatory variable into the logit transformation in both panels, we map from the predictor to a probability. The dashed line shows a prediction for $\beta_0 = 1$ and $\beta_1 = 1$. It is important to note that not only do the 50% values of the curves differ, but so do the slopes and spreads. The logit curve for the unbalanced data is flatter ($-\beta_0/\beta_1 = -2.21$; $\beta_0 = -1.04$, $p$-value $= 4.87 \times 10^{-11}$; $\beta_1 = -0.47$, $p$-value $= 5.8 \times 10^{-8}$) than the curve for the balanced data ($-\beta_0/\beta_1 = 1.85$; $\beta_0 = 1.28$, $p$-value $= 1.2 \times 10^{-6}$; $\beta_1 = -0.69$, $p$-value $= 5.7 \times 10^{-8}$). Therefore, the spread of the curve for the unbalanced data ($|3/\beta_1| = 6.38$) is greater than that for the balanced data ($|3/\beta_1| = 4.34$). Obviously, if the difference in total assets between the acquiring and target firms is transformed into a deviation from the 50% values, then a given deviation would not give the same Pr(M&A = failure) with balanced and unbalanced data. For example, when the total asset difference is close to 1, the predicted probability is about a 70% chance of failure in the balanced data but only a 20% chance of failure in the unbalanced data.

We also report the confusion matrix in Table 1 and find that the errors are not uniformly distributed in the unbalanced data, with 100% false negatives and 0% false positives. Surprisingly, the slope and constant parameters are statistically significant at even the 0.01% level despite the 100% prediction failure for withdrawn M&As. However, the confusion matrix for the balanced data shows more uniform errors than that of the unbalanced data (25% false negatives and 40% false positives), probably because the slope is steeper and therefore the logit model is more flexible; the predicted ranges of the probability of failure for the balanced and unbalanced data are approximately $0.78$ ($= (1 + \exp(-1.28 + 0.69 \times 0.018))^{-1} - 0$) and $0.26$ ($= (1 + \exp(1.04 + 0.47 \times 0.018))^{-1} - 0$), respectively, where 0.018 is the minimum value of the difference in total assets between the acquiring and target firms. It can be argued that the steepness of the slope in the logit model reflects the difference in the balance between M&A failures and successes. In other words, this illustrates that it is possible to considerably mitigate the problem of unbalanced data by adding more parameters that can increase the limits of the output range or control the steepness of its graph. Thus, in this study, we propose using the general form of the sigmoid function. We consider this approach in more detail in the next section on our modeling approach.

Table 1
Confusion matrix.

Unbalanced (original) data ($N_{raw} = 803$):
- Actual: M&A failure (condition negative = 107): predicted failure = 0 (true negative; 0% minority class detection); predicted success = 107 (false negative, type II error; 100% minority class miss rate).
- Actual: M&A success (condition positive = 696): predicted failure = 0 (false positive, type I error; 0% majority class miss rate); predicted success = 696 (true positive, power; 100% majority class detection).

Balanced (resampling) data ($N_{balanced} = 214$):
- Actual: M&A failure (condition negative = 107): predicted failure = 80 (true negative; 75% minority class detection); predicted success = 27 (false negative, type II error; 25% minority class miss rate).
- Actual: M&A success (condition positive = 107): predicted failure = 43 (false positive, type I error; 40% majority class miss rate); predicted success = 64 (true positive, power; 60% majority class detection).

3. A modeling framework for M&A forecasting

M&A forecasting models have been used in both the management and finance literatures. Examples include the logit model (Chakrabarti & Mitchell, 2016; Cudd & Duggal, 2000; Jeon & Ligon, 2011; Lin et al., 2013; Renneboog & Zhao, 2014; Walkling, 1985), the probit model (Baker, Pan, & Wurgler, 2012; Lin et al., 2013; Officer, 2003), the weighted logit model (Wang & Branch, 2009), and the neural network (Branch, Wang, & Yang, 2008; Zhang, Johnson, & Wang, 2012) (see Table 2 for more details). Among these approaches, neural network classifiers have proven to have good predictive power in forecasting future behavior (Hatzakis et al., 2010) and have been well recognized as a promising method in both academia and practice. Although regression-type models that comprise linear combinations of fixed functions have useful analytical and computational properties, their practical applicability is limited by the curse of dimensionality (Bishop, 2006).
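Section 2.2's remedy, adding parameters that raise the output ceiling or sharpen the slope, can be illustrated with a generic generalized logistic function. This is a sketch only: the paper's exact generalized logit specification is developed later, and the parameter names `upper` and `steepness` are mine.

```python
import math

def logistic(x):
    """Standard inverse logit: range (0, 1), fixed slope."""
    return 1.0 / (1.0 + math.exp(-x))

def generalized_logistic(x, upper=1.0, steepness=1.0):
    """Generic sigmoid with an adjustable upper asymptote and slope.
    Illustrative only; the paper's generalized logit may differ."""
    return upper / (1.0 + math.exp(-steepness * x))

# A higher ceiling and a steeper slope let the curve reach larger
# predicted probabilities of the rare class for the same input.
x = 1.0
print(round(logistic(x), 3))                                     # 0.731
print(round(generalized_logistic(x, upper=1.5, steepness=2.0), 3))  # 1.321
```

With `upper = 1` and `steepness = 1` the function reduces to the standard logit link, so the extra parameters strictly generalize it.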


Table 2
M&A success/failure prediction literature.
Study | Journal | DV (Binary) | In-sample (S/F/O) | Out-of-sample (S/F/O) | Sample period | Sample (S/F) | Proportion of failure | Methodology
de Bodt, Cousin, and Roll (2018) | Journal of Financial and Quantitative Analysis | Yes | – | – | 1994–2014 | 5318/462 | 0.20 | Probit
Chakrabarti and Mitchell (2016) | Strategic Management Journal | Yes | – | – | 1980–2004 | 1326/277 | 0.17 | Logit
Betton, Eckbo, Thompson, and Thorburn (2014) | The Journal of Finance | Yes | – | – | 1980–2008 | 5035/1068 | 0.18 | Logit
Renneboog and Zhao (2014) | Journal of Corporate Finance | Yes | – | – | 1995–2012 | 609/134 | 0.18 | Logit
Krishnan and Masulis (2013) | The Journal of Law and Economics | Yes (Failure = 1) | – | – | 1990–2008 | 8474/1086 | 0.11 | Logit
Lin et al. (2013) | Journal of Forecasting | Yes | (95.4/29.0/71.2) | – | 2000–2007 | 66/38 | 0.36 | Option-based approach; Logit; Probit
Baker et al. (2012) | Journal of Financial Economics | Yes | – | – | 1984–2007 | 4853/1609 | 0.24 | Probit
Zhang et al. (2012) | Journal of Business and Economic Research | Yes | – | (97.1/50.4/91.0) | 1991–2004 | 1050/146 | 0.12 | Logit; Neural network; SVM – Linear Kernel; REPTree; Decision tree; Random forest; AdaBoost
Jeon and Ligon (2011) | Journal of Corporate Finance | Yes | – | – | 2001–2007 | 1489/213 | 0.12 | Logit
Wang and Branch (2009) | Journal of Business and Economic Studies | Yes (Failure = 1) | – | – | 1995–2005 | 1170/143 | 0.11 | Logit; Weighted logit
Branch et al. (2008) | International Review of Financial Analysis | Yes | – | (98.2/58.0/–) | 1991–2004 | 1050/146 | 0.12 | Neural network; Logit
Heron and Lie (2006) | Journal of Business | Yes | – | – | 1985–1998 | 111/415a | 0.21 | Logit
Bhagat et al. (2005) | Journal of Financial Economics | Yes | – | – | 1962–2001 | 690/328 | 0.32 | Logit
Bange and Mazzeo (2004) | The Review of Financial Studies | Yes | – | – | 1979–1990 | 297/139 | 0.31 | Logit
Branch and Yang (2003) | Quarterly Journal of Business and Economics | Yes | – | (91.2/62.5/88.9) | 1991–2001 | 86/13 | 0.13 | Logit
Officer (2003) | Journal of Financial Economics | Yes | – | – | 1988–2000 | 2084/427 | 0.17 | Probit
Baker and Savasoglu (2002) | Journal of Financial Economics | Yes | – | – | 1981–1996 | 1470/431 | 0.23 | Probit
Cudd and Duggal (2000) | The Financial Review | Yes | – | (7.7/78.0/76.1) | 1987–1991 | 13/460 | 0.03 | Logit
Duggal and Millar (1994) | Quarterly Review of Economics and Finance | Yes | – | – | 1984–1987 | 51/29 | 0.36 | Logit
Kaplan and Weisbach (1992) | The Journal of Finance | Yes | – | – | 1971–1982 | 179/92 | 0.34 | Linear Probability Model
Walkling (1985) | Journal of Financial and Quantitative Analysis | Yes | (100/9.1/83.9)b (85.7/80.0/82.6)c | (87.1/33.3/82.4)b (14.3/100/62.5)c | 1972–1977 | 72/36 | 0.33 | Logit

In-sample and out-of-sample prediction accuracy rates are reported as percentages (%). When multiple methodologies are used in a study, we italicize and bold the methodology that generates the highest prediction accuracy rates and report the corresponding in-sample and out-of-sample prediction accuracy rates. Note that S, F, and O indicate in-sample and out-of-sample prediction accuracy for success, failure, and overall, respectively.
a This paper uses a sample of 526 hostile takeover attempts.
b Uncontested sample.
c Contested sample.

functions in which the parameter values are adapted during estimation However, a common unrealistic assumption when using the logistic
stage. There is an immense literature supporting this approach in en- distribution at small scales is that it has 1 as an upper asymptote con-
gineering, computer science, statistics, psychology, neuroscience, version and an inflection point at x = 0 . When an inflection point is
medicine, finance, and other disciplines. expected in an isolate curve, the model provides a poor fit to the data.
As mentioned before, one problem with the empirical im- This inherent inflexibility of the parametric nature of the logit model
plementation of the logit function is the underestimation of the prob- largely stems from the linear terms that are used for continuous and
ability of rare events. For example, about 7–25% of M&A withdrawals bounded output structure (e.g., the lower asymptote is 0 and the higher
are reported in previous studies such as those of Brar, Giamouridis, and asymptote is 1) of the model. An effective way to mitigate this major
Liodakis (2009), Wang and Branch (2009), and Baker et al. (2012). This downside is to create an unbounded growth rate in the sigmoid function
necessitates developing a model that can deal with unbalanced data to by adding more logit functions.
accurately predict the ultimate success of takeover attempts. Most ex- Hence, we can also approach the modeling framework of the neural
isting studies rely on the logit model and thus assume balanced data. network in a similar fashion to that of the generalized linear modeling
However, these studies conclude that there is significant empirical by adding an additional link function. The neural network model allows
evidence of M&A predictability without an examination of model fit. more than one logit curve to be used simultaneously to approximate the
Standard neural network classifiers, such as standard classifier learning relationship between the predictors and the target variable. To ap-
algorithms, also assume a balanced class distribution because of the proximate the relationship between y and x with a set of L logit curves
logit activation function (He & Garcia, 2009; Kwak & Choi, 2002), in (this is also known as the number of hidden neurons), the neural net-
that the model is often estimated from a pair-matched sample (Barnes, work model can be expressed in terms of the logit and linear predictors:
1999; Wang & Branch, 2009). Put more simply, a standard neural
t ~ Bernoulli(y ),
network classifier is designed to avoid rare events (i.e., a balanced class
assumption) and instead prefer a balanced frequency. This method is
Pr(t = 1) = y
more appropriate for a controlled or experimental setting rather than
uncontrollable environments such as those of observational studies; as = logit 1 ( 0 + 1 logit
1 ( x 1) + 2 logit
1 (x 2) + + L logit 1(x L )).

the class imbalance increases, a standard neural network classifier is (5)


less likely to capture true data characteristics because most standard This neural network model differs from the logit model in the shape
learning algorithms assume a balanced distribution (He & Garcia, of the sigmoid curve. Since a direct sum of logit transformations is
2009). Therefore, a standard neural network classifier may not effec- projective, the model’s inflexibility increases. When L = 1, the neural
tively forecast the outcome of M&As, given that events labeled “rare” network model becomes a logit transformation of a linear function of a
are prevalent in reality (Denrell & Fang, 2010). logit transformation of a linear function:
y = logit 1 (linear(logit 1 (linear(x )))) . The larger the number of hidden
3.1. Linkage between logit model and neural network neurons of L , the more the logit functions are used in the generalized
model, and thus the larger the variety of shapes the overall fit can
Since the success and failure classes are mutually exclusive and approximate. The neural network models provide more flexibility in the
exhaustive, a Bernoulli distribution fully describes this variable. Thus, functional form of the model, which can approximate any hypothetical
the probability of observing either target value is association between y and x based on the large number of choices for
Pr(t x ) = y t (1 y )1 t , (2) the number of logit functions (see Table 3). Thus, this form of neural
network allows to handle problems for which relationships are non-
which is a case of the binomial distribution. Let a vector of a constant linear, complex, and less known compared with highly structured
term with k inputs be denoted x = {1, x1, x2, , xk } . Then the relation- equation-based approaches such as logistic regression.
ship between y and x can be specified by a linear function, which is Again, it is important to note that the neural network model still
known as the linear probability model: uses the inverse logit function as an activation function and thus suffers
t ~ Bernoulli(y ), from unbalanced data. The majority of practical applications of neural
network classifiers are facilitated by using the back propagation algo-
y=x = 0 + 1 x1 + + k xk , (3) rithm and the delta (i.e., gradient descent) rule (Haykin, 2009). More
where x is a matrix expression for the linear association between y and concretely, the model assumes that an error function is defined as:
E ( ) = 2 (ti yi ) 2, where is a set of weights that requires a training
1
x (i.e., x = linear(x ) ), and there are k coefficients on each of the k
inputs. However, it would not make sense to fit the continuous linear regression model xβ + error to the binary variable t that takes on the values 0 and 1. In other words, since it can generate values of t that are either negative or greater than one, even the predicted values within the correct range near the limits are questionable. Instead, we model the probability that t = 1:

t ~ Bernoulli(y),

Pr(t = 1) = y = logit⁻¹(xβ) = e^{xβ} / (1 + e^{xβ}),   (4)

under the assumption that the outcome t is independent given these probabilities. The function logit⁻¹(·) transforms continuous values into a range between 0 and 1, which is crucial because probabilities must be between 0 and 1. Since we still refer to xβ as the linear predictor and the underlying probability of failure, y, as a logit function of a linear function of x, the function becomes y = logit⁻¹(linear(x)). In other words, within the generalized linear modeling framework, the logit model predicts Pr(t = 1) for binary data from a linear predictor with an inverse-logit transformation.5

set, and tᵢ and yᵢ are the observation of the ith target value and network output value from a given neuron, respectively. In a conventional machine learning or data mining setting, the classifiers usually try to reduce (or minimize) the classification error; the gradient descent rule aims to find the steepest descent to modify the weights at each iteration: Δωₙ = −η∇E(ωₙ), where η is the specified neural network learning rate parameter and ∇ represents the gradient operator with respect to the weights ω. A probabilistic estimate for the output is usually defined by normalizing the output values of all output neurons. Such a setting is valid only when the costs of the different errors are equal (i.e., with balanced class data). Statistically, this implies that the performance of a classification algorithm can be biased due to unbalanced class data, in which one class outnumbers the other by a large proportion (He & Garcia, 2009). This may also cause a deep neural network or other conventional statistical model to perform poorly. In other words, standard classification algorithms may generate

5 The probit model is the same as the logit model but with the inverse-logit function replaced by the normal cumulative distribution.
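As a concrete illustration, the inverse-logit transformation in Eq. (4) and a single gradient-descent weight update can be written in a few lines of Python (our sketch, not the authors' implementation; the coefficients, inputs, and squared-error loss for the single-weight update are invented):

```python
import math

def inv_logit(z):
    """Inverse-logit: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Linear predictor x'beta for one deal (intercept plus two predictors);
# the coefficients and inputs below are hypothetical.
beta = [-2.0, 0.8, -0.5]
x = [1.0, 1.2, 0.3]
z = sum(b * v for b, v in zip(beta, x))
p_fail = inv_logit(z)          # Pr(t = 1): predicted probability of M&A failure

# One gradient-descent step w_new = w - eta * dE/dw for the squared error
# E = 0.5 * (y - t)^2 with y = inv_logit(w * u) and a single scalar input u.
t, w, u, eta = 1.0, 0.2, 0.5, 0.1
y = inv_logit(w * u)
grad = (y - t) * y * (1.0 - y) * u   # chain rule through the sigmoid
w_new = w - eta * grad
```

Because y < t in this example, the gradient is negative and the update moves the weight upward, exactly the steepest-descent behavior described in the text.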


Table 3
Link function, response range, and conditional variance function.

Logit model:
  Link function: y = (1 + e^{−linear(x)})⁻¹
  A linear predictor: logit⁻¹(linear(x))
  Range of y: (0, 1)

Probit model:
  Link function: y = Φ(linear(x))
  A linear predictor: Φ(linear(x))
  Range of y: (0, 1)

Standard NN when L = 1:
  Link function: y = logit⁻¹(y₁) = (1 + e^{−linear(y₁)})⁻¹, where y₁ = logit⁻¹(linear(x))
  A linear predictor: logit⁻¹(linear(logit⁻¹(linear(x))))
  Range of y: (0, 1)

NN with generalized logit model when L = 1:
  Link function: y = glogit⁻¹(y₁) = γ(1 + e^{−linear(y₁)})⁻¹, where y₁ = glogit⁻¹(linear(x))
  A linear predictor: glogit⁻¹(linear(glogit⁻¹(linear(x))))
  Range of y: (0, γ)
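The link functions compared in Table 3 can be evaluated directly. The sketch below (our illustration; the strength value γ = 1.3 and the inner weight 1.5 are arbitrary) shows that only the generalized logit composition can produce outputs above 1, matching the response range (0, γ):

```python
import math

def logit_inv(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit_inv(z):
    """Standard normal CDF, the inverse link of the probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def glogit_inv(z, gamma=1.3):
    """Generalized logit: output range (0, gamma) instead of (0, 1)."""
    return gamma / (1.0 + math.exp(-z))

z = 2.0                                   # an arbitrary linear-predictor value
y_logit = logit_inv(z)                    # in (0, 1)
y_probit = probit_inv(z)                  # in (0, 1)
y_nn = logit_inv(1.5 * logit_inv(z))      # standard NN, L = 1: a logit of a
                                          # linear function of a logit output
y_gnn = glogit_inv(1.5 * glogit_inv(z))   # NN with generalized logit, L = 1
```

With these values, y_gnn exceeds 1 while remaining below γ = 1.3, whereas the other three outputs stay inside (0, 1).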

inaccurate predictions regarding M&A failures, which are typically rare events. Accurate predictions on such important corporate events, despite being rare, can determine organizational success or failure (He & Garcia, 2009; Sun, Wong, & Kamel, 2009).

3.2. Proposed M&A forecasting model

For the purpose of this study, unbalanced class data means that one class (e.g., tᵢ = 0) outnumbers the other (e.g., tᵢ = 1) in the target variable by a large proportion. There are two evident ways to circumvent the problem introduced by unbalanced class data (Daqi & Genxing, 2003). First, the crux of most existing methods balances the dataset through data sampling (illustrated in Section 2.2), a technique that involves over-sampling or under-sampling the underweighted class6; second, the strength and gain parameters of a sigmoid-type activation function can be modified to accelerate the learning speed and improve classification rates, an approach we refer to as a generalized activation function approach. In this study, we focus on the generalized activation function approach because the data-sampling approach requires removing instances from the majority class or adding instances to the minority class, and this may not be appropriate in the social sciences, in which situations typically involve a larger number of variables and relatively smaller data sets.

In neural networks, the strength and gain parameters of a sigmoid-type activation function determine the learning capabilities (with respect to intrinsic data characteristics such as unbalanced classes) of multilayer neural networks with a linear function. Daqi and Genxing (2003) show that the characteristics of a sigmoid-type activation function influence the convergence rate, classification ability, and nonlinear fitting accuracy of multilayer networks; thus, these characteristics ultimately influence forecast accuracy. The change in the activation function becomes very limited after the input information, x, moves beyond a certain range. This is because the first-order derivative of the activation function approaches zero. In that range, the sum of squared errors, E(ω), hardly changes, and the weights and model fall to a local minimum point (Menon, Mehrotra, Mohan, & Ranka, 1996). Thus, the learning capabilities highly depend on the shape of the activation function, which is described by the strength and gain parameters. Furthermore, the type (or shape) of the activation function has a critical effect not only on the network's learning speed but also on its classification rates and nonlinear mapping precision (Daqi & Genxing, 2003; Menon et al., 1996). This problem becomes more severe in the case of unbalanced class data. Despite the importance of the shape of the sigmoid activation function, the standard logit model (i.e., (1 + exp(−x))⁻¹; β₀ = 0 and β₁ = 1) is the default setting; as we illustrated in Section 2.2, the slope and spread parameters of the standard logit models are fixed to 1 (i.e., |β₁| = 1) and 3 (i.e., |3β₁| = 3), respectively.

To address this problem, the following methods have been proposed: (i) the use of the general form of the sigmoid-type activation function (Chen & Chang, 1996), in which the equation (Thimm, Moerland, & Fiesler, 1996) is written as f(x) = γ(1 + exp(−λx))⁻¹, where γ is the strength parameter, which limits the output range from 0 to γ, and λ is the gain parameter, which controls the steepness of the activation function7; γ and λ are both self-adapted; (ii) reinitiating the weights linked with a node that is in a cutoff (i.e., saturated) state (Stäger & Agarwal, 1997); and (iii) the use of locally linearized activation functions (Rubanov, 2000). However, one of the main defects of these methods is their excessive calculating workload (Daqi & Genxing, 2003), which rapidly increases with the size of the networks. Taylor expansion may improve the learning speed and forecasting precision of small networks in this case, but it may yield poor performance for large networks (Buntine & Weigend, 1994).

Therefore, we propose a sampling-based optimization method to overcome the calculating workload while escaping a local minimum point. Sampling-based optimization methods are less dependent on asymptotic theory and are thus able to produce reliable empirical results even with small data sets. The posterior distributions of γ and λ can be estimated by using a large number of observations drawn from the posterior distribution of the unknown parameters through Markov chain Monte Carlo (MCMC) methods. The basic idea is to partition the set of unknown parameters (γ and λ) and then estimate them one at a time, with each parameter estimated conditional on the other. This approach (like Gibbs sampling) is considered effective for a wide range of problems because estimating separate parts of a model is relatively easy, even when it is hard to know how to estimate all the parameters simultaneously (Gelman & Hill, 2007). Thus, we argue that our approach can overcome the excessive calculating workload stemming from the conventional approach. We propose a sampling-based optimization approach to estimate the strength and gain parameters using the Metropolis-Hastings algorithm.

3.3. Cost-sensitive modifications and sampling-based optimization

In the usual setting of the logit model, classification accuracy is typically evaluated by estimating its error rate on the sample. As we discussed earlier, it is assumed that all classification errors have equal costs. However, this assumption usually fails in the M&A data because the error costs drastically differ across types (e.g., type I error vs. type II error). To mitigate this concern about the misclassification costs, in this section, we follow the work of Kukar and Kononenko (1998) and Breiman et al. (1984) and present four approaches to making cost-sensitive modifications to the estimation procedure. A misclassification cost is defined as a function of the actual and predicted dichotomous (class) responses (i.e., cost[actual response, predicted response]). This function is used to reduce misclassification costs and is considered an additional input to the parameter estimation. The cost function is defined as: cost[i, j] = cost of misclassifying an observation from "response i" as "response j", and cost[i, i] = 0, where i = 1, 2, and j = 1, 2. In other words, cost[i, j] is equal to a predefined cost value when the actual response (i) and the output response (j) are different, and cost[i, j] is equal to zero

6 Over-sampling and under-sampling refer to adding instances to the minority class and removing instances of the majority class, respectively.

7 In this study, we consider a network with a single output y = f(x), the value of which is to be interpreted as a probability, and the general logistic activation function f(x) = γ(1 + exp(−λx))⁻¹, which has the property f′(x) = (λ/γ)f(x)[γ − f(x)]. Then, the derivative of the error with respect to xᵢ takes the following form: ∂E/∂xᵢ = (∂E/∂yᵢ)(∂yᵢ/∂xᵢ) = [(yᵢ − tᵢ)/(yᵢ(1 − yᵢ))]·f′(xᵢ) = [(yᵢ − tᵢ)/(yᵢ(1 − yᵢ))]·(λ/γ)yᵢ[γ − yᵢ], since yᵢ = f(xᵢ).
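A numerical check of the derivative property of the generalized activation function, followed by a bare-bones random-walk Metropolis-Hastings loop for the strength and gain parameters (a sketch only: the log-posterior below is a made-up placeholder, not the paper's cost-adjusted likelihood, which is detailed in its Appendix A):

```python
import math
import random

def f(x, gamma, lam):
    """Generalized sigmoid: gamma (strength) caps the output at gamma,
    lam (gain) controls the steepness."""
    return gamma / (1.0 + math.exp(-lam * x))

# Verify f'(x) = (lam/gamma) * f(x) * (gamma - f(x)) by central differences.
gamma, lam, x0, h = 1.4, 2.0, 0.3, 1e-6
numeric = (f(x0 + h, gamma, lam) - f(x0 - h, gamma, lam)) / (2.0 * h)
analytic = (lam / gamma) * f(x0, gamma, lam) * (gamma - f(x0, gamma, lam))

# Random-walk Metropolis-Hastings over (gamma, lam): jitter the current
# values and accept with probability min(1, posterior ratio).
def log_post(g, l):
    # Placeholder log-posterior peaked at (1.2, 1.0); a real implementation
    # would score the training data under the cost-sensitive model.
    return -10.0 * ((g - 1.2) ** 2 + (l - 1.0) ** 2)

random.seed(0)
g, l = 1.0, 0.5                      # arbitrary starting values
for _ in range(2000):
    g_prop = g + random.gauss(0.0, 0.1)
    l_prop = l + random.gauss(0.0, 0.1)
    delta = log_post(g_prop, l_prop) - log_post(g, l)
    if delta >= 0.0 or random.random() < math.exp(delta):
        g, l = g_prop, l_prop        # accept the proposal
```

Because the proposal is a symmetric Gaussian jitter, the acceptance ratio reduces to the posterior ratio, which is the random-walk variant the text invokes.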


when the actual response (i) and the output response (j) are the same. This cost matrix has been widely implemented in cost-sensitive learning algorithms (Lu, Liu, Lu, & Wang, 2012). Furthermore, cost[i, j] (i.e., the misclassification cost) can be either a fixed value or a function and can be associated with either a class or a sample. The fixed cost associated with a class, which is utilized in this study, has been widely employed due to its lower risk of encountering an over-fitting problem than that of the cost associated with a sample (Ma, Song, Hung, Su, & Huang, 2012).

To include the misclassification costs, we must define a cost vector to evaluate the classifiers' performance based on cost[i, j]. The cost vector indicates the expected cost of misclassifying a training sample that belongs to the ith class: costvect[i] = (1/(1 − p(i))) Σ_{j≠i} p(j)·cost[i, j], where cost[i, j] indicates the cost of misclassifying a sample from class i as class j8 and p(i) indicates an estimate of the prior probability that a sample belongs to the ith class.9 Thus, the performance criterion becomes not the error rate (i.e., the number of incorrectly classified samples divided by the number of training samples) but the average cost per sample: avgcost = (1/N) Σ_{k=1}^{N} cost[iₖ, jₖ], where iₖ and jₖ are the actual and predicted responses of the kth sample and N is the number of testing samples.

We denote M&A success and failure as responses 1 and 2, respectively. Then, the 2 × 2 cost matrix is represented by

cost = [ cost[1, 1]  cost[1, 2] ]  =  [ 0.0         cost[1, 2] ]
       [ cost[2, 1]  cost[2, 2] ]     [ cost[2, 1]  0.0        ]

costvect[1], which is the expected cost of misclassifying success as failure, is equal to cost[1, 2]. Similarly, costvect[2], which is the expected cost of misclassifying failure as success, is equal to cost[2, 1]. We may then handle the cost asymmetry in the two types of misclassification errors by assuming that cost[1, 1] = cost[2, 2] = 0.0, cost[1, 2] = 1.0, and cost[2, 1] = 1.1–1.5. Thus, it is clear that the cost of misclassifying the observation of "response 2" as "response 1" (i.e., false negative; misclassifying failure as success; 1.1 ≤ cost[2, 1] ≤ 1.5) is always higher than the cost of misclassifying the observation of "response 1" as "response 2" (i.e., false positive; misclassifying success as failure; cost[1, 2] = 1.0).

Cost-sensitive classification (CS1): To reduce the misclassification costs, we modify the probability estimates of the network in the classification of the testing sample period. Thus, the probability p(i) that a sample belongs to class i is replaced with the altered probability, which considers the expected costs of misclassification: p′(i) = costvect[i]·p(i) / Σⱼ costvect[j]·p(j). This approach increases the probability p(i) for samples with high expected misclassification costs.

Adaptive output (CS2): To correct the output of the error function in the ordinary network in favor of the classes with higher expected misclassification costs, we change and scale the actual output of the network instead of the estimated probabilities in the following way: t′ⱼ = costvect[j]·tⱼ / maxᵢ costvect[i]. This function differs from the above approach in that the observed outputs of the network are modified and appropriately scaled.

Adaptive learning rate (CS3): In this approach, we compensate for the samples that belong to classes with high expected misclassification costs by increasing their prevalence in the classification of the testing sample period. The altered prior can be simulated by giving the high-cost samples higher weights: η(p) = costvect[class(p)] / maxᵢ costvect[i].

Minimization of misclassification costs (CS4): Instead of minimizing the squared error, in this method, the learning algorithm minimizes the misclassification costs. The error function is corrected by embedding the factor m[i, j], where i is the desired class and j is the actual class: E = (1/2) Σ_{p∈examples} Σ_{i∈output} ((yᵢ − oᵢ)·m[class(p), i])². The factor m[i, j] is defined as follows: m[i, j] = costvect[i] if i = j, and m[i, j] = cost[i, j] if i ≠ j (for more detail, see Kukar and Kononenko (1998); a ready-to-use pseudo code can be found in Appendix A).

To compute the estimates of the parameters γ and λ, we must account for these cost functions. Since Gibbs sampling is not possible in this case because of the lack of conjugate priors for logistic models (Hoff, 2009), we use the random-walk Metropolis-Hastings approach (Gelman & Hill, 2007) to simulate values from the posterior distribution (for more detail, see Appendix A).

4. Application

We compile data from two sources: the Securities Data Company's (SDC) U.S. Mergers and Acquisitions database and Compustat. Specifically, we use the SDC database to gather M&A data and Compustat for the bidder and target characteristics.

4.1. Sample data

We use the SDC database to identify proposed takeovers that eventually succeed or are withdrawn. We start with all deals announced between January 1, 2006 and December 31, 2015.

We applied several screens during our sample period to obtain the final sample of takeover offers. Some acquisitions are recorded twice in the SDC database. We deleted 525 repeat acquisitions, keeping only one unique takeover announcement. In addition, the SDC database records a share repurchase as a deal in which the acquirer is identical to the target firm. We removed 8170 such share repurchase deals. We restricted our sample to U.S. public firms since we measured predictors in our withdrawn takeover prediction model using Compustat, which provides financial and accounting data for most publicly held companies in the U.S. and Canada. In this process, a total of 95,051 observations (e.g., from government firms, joint ventures, private firms, and subsidiaries) were eliminated. An additional 1099 acquiring and target firms were not listed on Compustat and were therefore omitted from our sample. Among the acquiring and target firms matched with Compustat, 867 observations had incomplete coverage in the SDC database (e.g., the database listed the deal status as unknown, rumored, or pending) or Compustat (e.g., Compustat had missing values for items that are used to construct predictors). These were therefore eliminated from our sample.

This sample selection process yields a total of 803 successful or withdrawn takeover offers in our final sample. Among the 803 deals, 696 are successful (about 87%), and 107 (about 13%) are announced but not completed. As we noted, the frequency of failed mergers in our final sample follows an unbalanced class distribution, indicating that our final sample is appropriate, typical, and fits our purpose.

Table 4 provides summary attributes of the final sample that are used in our withdrawn takeover prediction model. We separate variables into continuous and categorical variables. For continuous variables that capture acquirers' and targets' firm characteristics, we take the absolute difference to capture the similarity or dissimilarity between the acquiring and target firms (more details about why we take the absolute difference for these variables are given in Section 4.2). Some of the continuous variables and all of the categorical variables capture M&A deal characteristics.

4.2. Variables

Dependent variable. In this paper, we develop a withdrawn takeover prediction model. The target variable, therefore, is binary, equaling one when the announced offer ultimately leads to the withdrawn takeover of a target firm and zero otherwise. Our sample shows that one class of the target variable (the number of completed takeovers, tₖ = 0) outnumbers the other (the number of withdrawn takeovers, tₖ = 1) by a large

8 The cost[i, i] = 0; there is zero cost for correct classification.

9 The normalized output can be viewed as an estimation of the probability p(i) that a sample belongs to the ith class.
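The cost vector and the CS1/CS3 modifications described in Section 3.3 can be sketched as follows (our illustration; the class priors mirror the sample's 87/13 split, and cost[2, 1] = 1.3 is one value from the 1.1–1.5 range):

```python
# Responses: 1 = M&A success, 2 = failure; cost[i][j] = cost of predicting j
# when the actual response is i (zero on the diagonal).
cost = {1: {1: 0.0, 2: 1.0}, 2: {1: 1.3, 2: 0.0}}
p = {1: 0.87, 2: 0.13}   # class priors, mirroring the 87/13 sample split

# costvect[i] = (1/(1 - p(i))) * sum_{j != i} p(j) * cost[i][j]
costvect = {i: sum(p[j] * cost[i][j] for j in p if j != i) / (1.0 - p[i])
            for i in p}

# CS1: altered class probability proportional to costvect[i] * p(i).
norm = sum(costvect[j] * p[j] for j in p)
p_alt = {i: costvect[i] * p[i] / norm for i in p}

# CS3: per-sample weight costvect[class] / max_i costvect[i], so the
# costly failure class gets the largest weight.
max_cv = max(costvect.values())
weight = {i: costvect[i] / max_cv for i in p}
```

In this two-class setting the cost vector collapses to costvect[1] = cost[1, 2] and costvect[2] = cost[2, 1], as stated in the text, and CS1 raises the probability assigned to the rare, costly failure class.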


Table 4
Summary of attributes of final sample of takeover deals.
Continuous variables Mean Std. Dev. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Firm characteristics
1. ROA 0.16 0.38 1.00
2. DPS 0.58 1.36 −0.04 1.00
3. Invntr/TA 0.05 0.07 −0.01 0.06 1.00
4. M/B ratio 0.84 1.16 0.54 −0.03 −0.01 1.00
5. P/E ratio 51.9 202 −0.02 −0.04 −0.06 0.12 1.00
6. Grwth in sales 0.42 2.37 0.15 −0.03 −0.02 0.04 −0.01 1.00
7. CapExp/OprtRev 1.41 36.7 0.17 −0.02 0.04 −0.01 0.00 0.01 1.00
8. Inv cap trnovr 1.33 4.14 0.10 −0.01 0.12 0.03 −0.01 −0.01 −0.01 1.00
9. Div pyt ratio 0.52 2.66 −0.03 0.17 −0.02 −0.03 0.04 −0.01 −0.00 0.02 1.00
10. Div yld 0.02 0.11 0.01 0.37 0.09 −0.02 −0.02 −0.01 −0.01 0.00 0.15 1.00
11. Log (TA) 2.21 1.83 0.27 0.12 0.18 0.26 −0.00 0.01 0.04 0.08 −0.05 −0.01 1.00

Deal characteristics
12. Rel deal size 1.84 26.4 −0.00 0.01 0.14 0.00 0.00 −0.01 −0.00 0.01 0.02 0.00 0.08 1.00
13. TaPrc 1d prior 22.9 24.4 −0.20 0.10 −0.06 −0.02 −0.01 −0.07 −0.03 −0.05 −0.02 −0.03 −0.12 0.05 1.00

14. TaPrc 1w prior 22.7 24.2 −0.20 0.10 −0.06 −0.02 −0.02 −0.07 −0.03 −0.05 −0.02 −0.03 −0.12 0.05 0.99 1.00
15. TaPrc 4w prior 22.1 23.7 −0.20 0.11 −0.06 −0.02 −0.02 −0.07 −0.03 −0.05 −0.02 −0.03 −0.12 0.05 0.99 0.99 1.00
16. AcTrm fee 0.74 0.44 −0.04 0.01 0.00 −0.03 −0.01 −0.02 −0.01 −0.01 −0.01 −0.01 −0.10 0.16 0.12 0.12 0.13 1.00
17. TaTrm fee 45.4 255 −0.09 0.07 −0.01 −0.03 0.00 −0.04 −0.01 −0.04 0.01 −0.01 −0.08 0.28 0.35 0.34 0.34 0.60 1.00
18. Toehold 57.6 145 −0.03 0.04 −0.02 −0.01 −0.03 −0.01 −0.01 0.01 0.01 0.02 −0.07 0.00 −0.02 −0.02 −0.02 −0.03 −0.06

Categorical variables Mean Std. Dev.


19. Trm fee clause 0.74 0.44
20. Hostile 0.03 0.16
21. Tender offer 0.23 0.42
22. All-cash deal 0.51 0.50
23. Competing bid 0.07 0.26

ROA is return on assets. DPS is dividend per share. Invntr/TA is Inventories divided by total assets. M/B ratio is market value of assets divided by book value of assets. Grwth in sales is growth in sales over the past year.
CapExp/OprtRev is capital expenditures divided by sales. Inv cap trnovr is invested capital turnover. Div pyt ratio is dividend payout ratio. Div yld is dividend yield. Log (TA) is natural log of 1 plus total assets. Rel deal size is
the ratio of deal size to acquirer size. TaPrc 1d prior is target share price 1 day prior. TaPrc 1w prior is target share price 1 week prior. TaPrc 4w prior is target share price 4 week prior. AcTrm fee is termination fee for
acquirer. TaTrm fee is termination fee for target. Toehold is the percentage of target stocks held by the acquirer prior to the announcement date. Trm fee clause is termination fee clause dummy. Hostile is a dummy variable
that equals one if a bid is recorded by the SDC database as “hostile” and zero otherwise. Tender offer is a dummy variable that equals one if a bid is a tender offer and zero otherwise. All-cash deal is a dummy variable that
equals one for purely cash-financed deals and zero otherwise. Competing bid is a dummy variable that equals one if there are multiple bidders and zero otherwise.

proportion. More precisely, we possess the unbalanced classification data of 696 successful deals (about 87%) and 107 unsuccessful deals (about 13%).

Predictors. We base our predictors on the literature that has investigated the motivations of M&A deals and the antecedents of the outcome of a takeover offer. In essence, pertinent predictors are categorized into (i) financial predictors10 and (ii) M&A-related predictors. Detailed descriptions of both types of predictors and their rationales for being included as predictors in our withdrawn takeover prediction model appear in Table 5.

Dyads and absolute difference. M&As involve two firms, acquiring and target firms, seeking to strategically combine their resources with those of other organizations. This inter-firm relationship requires the acquiring and target firms to overcome potential sources of dyadic (i.e., combinations of target and acquiring firms taken pairwise) conflict that inevitably arise from the M&A negotiation process. Assessing these dyadic factors is particularly critical to understanding why two firms involved in a takeover process either complete or withdraw from an announced M&A. Extant research, however, has focused on the unilateral perspective of the acquiring firm as opposed to assessing the dyadic nature of M&As (Wang & Zajac, 2007). Such one-sided analysis of the M&A process inadequately examines this clearly dyadic phenomenon (Zajac & Olsen, 1993). Therefore, unlike most previous studies, we use the absolute difference between the acquirer's and target's financial measures to capture the similarity or dissimilarity between the bidding and target firms. In general, the smaller the absolute difference, the more similar the characteristics of the two merging firms.

4.3. Benefits of accurate M&A forecasts for stakeholders

The literature on M&A can be classified into two schools of thought based on dependent variables. One stream of research examines the stock market performance or post-merger operating performance of acquiring, target, or merged firms over short- or long-term periods (Powell & Stark, 2005). In this stream of research, takeovers are deemed successful either when the merged entity achieves positive stock returns or when it generates operating gains from, for example, increases in sales growth or market shares.

The other stream of research simply defines a successful takeover as a takeover attempt that is ultimately completed (Branch & Yang, 2003). The latter stream of research focuses on discussing how various stakeholders, such as merger arbitrageurs, target shareholders/firms, and acquiring shareholders/firms, may benefit by being able to accurately determine the likelihood that a proposed takeover will ultimately occur. Since they all play vital roles throughout the entire M&A process and potentially benefit the most from successful prediction outcomes, we discuss what these benefits are and how the costs of type I and type II errors matter for each of these stakeholders.

Investors, especially merger arbitrageurs. Merger arbitrage is an investment strategy that aims to capture a price differential by purchasing the shares of target firms at lower prices immediately following the merger announcement and by selling these shares to the acquiring firms at higher prices upon the successful completion of the M&A. In finance, investors who implement such an investment strategy are referred to as "merger arbitrageurs." Merger arbitrageurs are extremely active and important traders in the stocks of acquirers and targets around merger announcements (Officer, 2007). On the announcement date, target firms' shares will typically be trading at lower prices than the acquiring firms' offer prices. If a takeover offer is successful, merger arbitrageurs lock in positive gains from this price differential. If the offer is unsuccessful, however, the target share price typically falls, and merger arbitrageurs face selling the shares in the market at lower prices than their original purchase prices, thereby suffering from large losses. Importantly, positive returns from such a price differential are guaranteed only if the merger ultimately occurs. Therefore, the primary objective of merger arbitrageurs is to accurately determine the likelihood that a proposed takeover will ultimately succeed so as to avoid large arbitrage losses. As noted in Baker and Savasoglu (2002), merger arbitrageurs periodically suffer substantial losses, primarily due to unexpected deal failure, which makes merger arbitrage such a risky trading strategy. Thus, misclassifying failure as success (type II error) is much more concerning for merger arbitrageurs than is misclassifying success as failure (type I error).

Target shareholders/firms. Target firms' existing shareholders may also benefit by accurately predicting whether or not a merger will occur. Not surprisingly, when takeover attempts eventually fail because either acquiring firms withdraw or target firms rebuff the offer, target firms' stock prices sharply decline, and target firms' shareholders lose almost all the takeover-related premium (Jandik & Makhija, 2005). 21st Century Fox's unrequited pursuit of Time Warner is a recent example of destroyed value for shareholders of target firms when takeover attempts eventually fail. In 2014, Time Warner's shares fell by 13% the day after 21st Century Fox announced its withdrawal on August 5th. From a target shareholder's standpoint, the consequences of a type II error may also affect his/her wealth more detrimentally than those of a type I error because type II errors are likely to incur unexpected losses while type I errors often achieve unexpected gains.

Failed merger attempts often have lasting consequences beyond the destruction of target shareholder value, and disciplining a target firm's inefficient management is one of those consequences (Chatterjee, Harrison, & Bergh, 2003). Since being a target itself conveys adverse information possessed by acquiring firms regarding a target firm's management, failed takeover attempts are followed by a high rate of management turnover (Denis & Serrano, 1996). For example, Office Depot, an American office supply retailer, announced that CEO Roland Smith would step down from his role by the end of the first quarter of 2017 following a failed merger with Staples, another operator in America. As such, a failed takeover deal often damages the credibility of the target firm's managers. From a target firm's managerial standpoint, it is also of great importance that M&A outcomes be accurately predicted.

Acquiring shareholders/firms. In 2015 alone, takeover bids worth $1.484 trillion were terminated or withdrawn in global markets for corporate control (see Fig. 3); in absolute terms, this number is enormous. As previously noted, given that about 7% to 25% of each year's proposed deals typically end in failure, it also mirrors the recent surge in M&A by emphasizing the ever-increasing importance that acquiring companies identify likely takeover targets and predict ultimate deal success.

From an acquiring firm's perspective, immense explicit and implicit costs are related to unsuccessful takeover attempts. Explicit costs are expenses associated with accounting and due diligence, valuation, and other legal/consulting/advisory activities, while implicit costs are the opportunity costs of having to forego other attractive takeover deals that may ultimately succeed. Implicit costs also arise from the fact that a bid's failure reveals to the market further information about bidders. As noted in Pickering (1983), an ultimately unsuccessful bid reveals the bidder's inherent weakness, and many unsuccessful bidders become targets of takeover bids themselves. Moreover, Bradley, Desai, and Kim (1983) indicate that unsuccessful bidders display significant wealth losses when their deals ultimately fail and their intended targets are subsequently acquired by a rival bidder. As such, failed takeover attempts not only incur enormous explicit and implicit costs but also damage the credibility of the acquirer's managers and their strategic decisions. Therefore, high levels of misclassification are of great concern to the acquiring firms in takeover predictions, especially when costly type II errors occur (Rodrigues & Stevenson, 2013)—that is, when M&A failures are predicted to succeed.

10 A total of 22 financial predictors, 11 each for the acquiring and target firm, are compiled at the firm level.


Table 5
Variable description.

Financial/accounting predictors. We take the absolute difference between acquirer and target financial predictors to capture the similarity or dissimilarity between the bidding and target firms. In general, the smaller the absolute difference measure, the more similar the characteristics of the two merging firms.

Inefficient management
- ROA: income before extraordinary items (item 18)/total assets (item 6).
- Dividend per share: dividend per share (item 26).
- Inventory/Total assets: inventories (item 3)/total assets (item 6).
Rationale: In the market for corporate control, a market mechanism plays a crucial role in transferring resources from inefficiently managed firms to more efficient ones (Palepu, 1986).

Undervaluation
- M/B ratio: market value of assets (item 6 + (item 199 × item 25) − item 60)/book value of assets (item 6).
Rationale: The lower the market-to-book ratio, the more undervalued the target firm, and thus the more attractive it is to bidding firms. Overvalued bidders in particular look for undervalued entities (Rodrigues & Stevenson, 2013).

Price-to-earnings ratio
- P/E ratio: closing price (item 199)/earnings per share (item 58).
Rationale: A firm with a low P/E ratio is more likely to be a target, since the earnings of the firm will be valued at the multiple of the bidder (Rodrigues & Stevenson, 2013).

Growth-resource mismatch
- Growth in sales over the past year: (sales (item 12) at the end of year t − sales at the end of year t − 1)/sales at the end of year t − 1.
- Capital expenditure/operating revenue: capital expenditures (item 128)/sales (item 12).
- Invested capital turnover: sales (item 12)/total invested capital (item 37).
Rationale: Low-growth but resource-rich firms and high-growth but resource-poor firms are natural acquisition targets (Palepu, 1986).

Dividend payout
- Dividend payout ratio: total dividends (item DVT)/income before extraordinary items (item 18).
- Dividend yield: dividend per share (item 26)/closing price (item 199).
Rationale: Firms that have a low dividend payout policy retain more earnings and therefore can invest in future opportunities, leading to higher growth potential and an increased likelihood of becoming a target (Rodrigues & Stevenson, 2013).

Size
- Log (total assets): natural log of 1 plus total assets (item 6).
Rationale: A consequent increase in wages and reputation after successful large acquisitions would make managers prefer larger rather than smaller acquisitions (Barnes, 1999). Alternatively, fewer bidders with sufficient resources to acquire a large target decrease the likelihood of acquisition (Palepu, 1986).

M&A predictors

Relative deal size
- Relative deal size: ratio of deal size to acquirer size.
Rationale: Larger size of a deal or target reduces takeover success (Branch & Yang, 2003).

Target share price
- Target share price 1 day prior: target share price 1 day before the merger announcement.
- Target share price 1 week prior: target share price 1 week before the merger announcement.
- Target share price 4 weeks prior: target share price 4 weeks before the merger announcement.
Rationale: A rise in the target stock price prior to a merger announcement will deter competing bids and lower the probability of bid revision (Jennings & Mazzeo, 1993). In turn, the reduced competition and revision probability increase the likelihood that any one offer will be successful (Walkling, 1985).

Termination fee
- Termination fee clause dummy: dummy variable that equals one if a bid includes a termination fee clause and zero otherwise.
- Termination fee (Acquirer): termination fees payable by the acquirer to the target if the acquirer terminates the deal.
- Termination fee (Target): termination fees payable by the target to the acquirer if the target terminates the deal.
Rationale: Termination clauses would make it costly for the fee-paying party to withdraw from the merger deal, thus increasing the likelihood of the ultimate success of a takeover attempt (Officer, 2003).

Hostile and tender offer
- Hostile: dummy variable that equals one if a bid is recorded by the SDC database as "hostile" and zero otherwise.
- Tender offer: dummy variable that equals one if a bid is a tender offer and zero otherwise.
Rationale: Hostile and tender offer dummies measure target resistance. A hostile takeover or tender offer is often perceived as a takeover deal that is far from completion and has been found to be negatively associated with success rates of proposed mergers (Schwert, 2000).

All-cash deal
- All-cash deal: dummy variable that equals one for purely cash-financed deals and zero otherwise.
Rationale: A cash offer reduces information asymmetry about the value of acquiring and target firms. Therefore, it may signal a greater degree of certainty to the markets about the probability of deal success (Wang & Branch, 2009).

Toehold
- Toehold: percentage of target stocks held by the acquirer prior to the announcement date.
Rationale: A toehold by an acquiring firm discourages competing bids and lowers the probability of target managerial resistance (Betton & Eckbo, 2000).

Competing bidder
- Competing bidder: dummy variable that equals one if there are multiple bidders and zero otherwise.
Rationale: Existing bids reduce the probability of deal success by making competing bidders recognize the increased likelihood that offer revision will occur (Walkling, 1985).

Compustat is the source of variables referred to by item number.

5. Results

5.1. In-sample and out-of-sample prediction accuracy

For the purposes of the present study, multiple accuracy rate measures are used as a score rule to quantify classification model performance. In Table 6, Correctly Classified (CC) lines indicate the ratio of correctly classified takeovers (success/failure) to the total number of M&A predictions, whereas Incorrectly Classified (IC) lines show the ratio of incorrectly classified takeovers (success/failure) to the total number of M&A predictions. Thus, it is always true that CC + IC = 1. Equivalently, the performance of an unbalanced model is expressed in terms of withdrawn takeovers given by Accuracy for Failure (AF) lines, which report the ratio of correctly predicted withdrawn takeovers to the number of actually withdrawn takeovers. Thus, it is always true that AF + Type II error = 1. Moreover, CC and AF ratios are used to estimate the percentage of observations that a model correctly predicts: the higher the ratio, the better the predictive power of the model; conversely, the lower the IC ratio and the Type II error, the better the predictive power of the model.

Fig. 3. Dollar value of takeover bids (in $ billion) terminated or withdrawn in global markets for corporate control for the period 2006–2015.

Table 6 presents both in-sample and out-of-sample prediction accuracies for the logit model, the weighted logit model, the probit model, the neural network model with standard logit activation function (NN), the neural network model with four types of cost-sensitive (CS) modifications (NN CS types 1–4), the proposed neural network model with generalized logit activation function (PNN), and the proposed neural network model with generalized logit activation function and four types of CS modifications (PNN CS types 1–4). For the purpose of comparison, all the neural network models are estimated with one layer containing three neurons. It is important to note that although increasing the flexibility of a model allows for a higher in-sample fit, it does not necessarily result in a higher out-of-sample fit, as higher-complexity models are more likely to overfit the training data set.

In the neural network models and the proposed models with different CS type functions, β and γ refer to the strength and gain parameters of the general logistic activation function, while CV represents the cost vector, which specifies the value of costvect[2]. For example, CV = 1.5 indicates that we use 1.0 of costvect[1] and 1.5 of costvect[2]. The first two columns, M&A Failure and M&A Success, contain the number of actually withdrawn takeovers and the number of actual successful takeovers for the in- and out-of-sample periods. Subsamples were only used for the data recorded after the recent global financial crisis, as it affected the takeover processes in unusual ways, driving down the volume of transactions and causing many companies to postpone or abandon deals. Out-of-sample forecasts drawn from the two previous five-year (2009–2013 and 2010–2014) samples were developed for the subsequent year (2014 and 2015).¹¹

¹¹ The results were similar when using each two-year period to estimate the model and the subsequent one-year period to forecast.

Two conclusions emerge from the summary of model performance displayed in Table 6. First, the proposed NN models (i.e., PNN CS types 1–4) outperform or perform as well as other models for the in-sample prediction of CC, with minimum and maximum ratio values of 0.94 (94%) and 0.98 (98%), respectively. Furthermore, the proposed models are found to outperform or perform as well as other models over all subsample periods for the in-sample prediction of AF, with values ranging from 0.63 (63%) to 0.96 (96%). The proposed models also have a lower type II error rate for the in-sample prediction setting than all other competing models, with values ranging from 0.04 (4%) to 0.37 (37%).

In general, the proposed NN models (i.e., PNN CS types 1–4) outperform other models in terms of out-of-sample prediction of both CC and AF in all out-of-sample periods. Additionally, the proposed models have a lower type II error rate for the out-of-sample prediction setting than all other competing models. It is thus probable that the problem of unbalanced data can be mitigated by implementing the cost function and adding strength (β) and gain (γ) parameters. Therefore, the proposed NN models provide a better fit than other models by demonstrating higher overall out-of-sample prediction accuracy. Furthermore, the forecasting capabilities of the proposed NN models can be effectively enhanced by introducing the misclassification cost function (CS types 1–4).

Second, the proposed NN CS type 4, which is found to be the best performing model for out-of-sample forecasts, does not perform better than all other models in terms of in-sample prediction. In other words, a higher in-sample fit does not lead to better out-of-sample prediction (e.g., NN CS type 3 and type 4). This can be attributed to the influence of specific explanatory variables that can change over time, causing forecasting models with high in-sample fit to suffer from overfitting due to decreased flexibility.

5.2. Variable importance

Identifying the right set of predictor variables not only improves the accuracy of forecasting models but also enhances our understanding of key drivers of M&A success or failure. Accordingly, we extend our analysis to address this concern and calculate the variable importance measure. The variable importance measure refers to the extent to which a given forecasting model uses that variable to make accurate predictions; the more heavily a model relies on a variable to make predictions, the more important it is for the model.

Several measures of variable importance have been proposed in the forecasting literature (Hastie, Tibshirani, Friedman, & Franklin, 2005). We employ the metrics of classification error rate (CER) and relative importance because (i) the M&A outcome is binary rather than continuous, for which the residual sum of squares is commonly used, and (ii) we are interested in the outcomes' accuracy rather than the variance across the outcomes, for which either the Gini index or the entropy is employed. The CER-based metric measures variable importance based on the change in CER when a variable is dropped from all the variables. The change is calculated from the out-of-bag set and represented by

Δ_d = CER_d^i − CER_d,    (6)

where d denotes the index of the out-of-bag set, i denotes the index of the variable that is dropped, CER_d^i is the CER when the ith variable is dropped from all variables, and CER_d is the CER when the 23 total variables are used.
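The drop-one-variable computation behind Eq. (6), averaged over out-of-bag sets as in Eq. (7) below and rescaled to the relative importance described in Section 5.2, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the nearest-centroid stand-in classifier, the bootstrap construction of the out-of-bag sets, and all function names are assumptions made for demonstration.

```python
import random

def fit_centroid(X, y):
    """A stand-in classifier (nearest class centroid); the paper's models
    are neural networks, used here only to make the sketch runnable."""
    classes = sorted(set(y))
    cent = {c: [sum(col) / len(col) for col in
                zip(*[x for x, t in zip(X, y) if t == c])] for c in classes}
    def predict(rows):
        return [min(classes, key=lambda c: sum((a - b) ** 2
                    for a, b in zip(r, cent[c]))) for r in rows]
    return predict

def cer(predict, X, y):
    """Classification error rate (CER)."""
    return sum(p != t for p, t in zip(predict(X), y)) / len(y)

def drop_col(X, i):
    return [r[:i] + r[i + 1:] for r in X]

def cer_importance(X, y, D=50, seed=0):
    """Eq. (6) averaged as in Eq. (7): mean over D out-of-bag sets of
    (CER with variable i dropped) minus (CER with all variables)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(D):
        drawn = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        oob = [j for j in range(n) if j not in set(drawn)]
        Xb, yb = [X[j] for j in drawn], [y[j] for j in drawn]
        Xo, yo = [X[j] for j in oob], [y[j] for j in oob]
        base = cer(fit_centroid(Xb, yb), Xo, yo)          # CER_d
        for i in range(p):                                # CER_d^i - CER_d
            m = fit_centroid(drop_col(Xb, i), yb)
            totals[i] += cer(m, drop_col(Xo, i), yo) - base
    return [t / D for t in totals]

def relative_importance(imps):
    """Four-step rescaling: subtract the minimum, divide by the range,
    multiply by 100, so the top variable always scores 100."""
    lo, hi = min(imps), max(imps)
    return [100.0 * (v - lo) / (hi - lo) for v in imps]
```

Dropping an informative variable raises the out-of-bag CER and yields a positive average change, while dropping a noise variable leaves it roughly unchanged.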


Table 6
In-sample and out-of-sample prediction accuracy of forecast models.

Column order in each row of figures: Logit model; Weighted logit model; Probit model; NN; NN CS Types 1–4; PNN; PNN CS Types 1–4.

In-sample: 2009–2013. M&A Failure: 48 (13%); M&A Success: 336 (88%).
Tuning: NN CS Type 1: CV = 1.3, γ = 1, β = 1; NN CS Type 2: CV = 1.3, γ = 1, β = 1; NN CS Type 3: CV = 1.1, γ = 1, β = 1; NN CS Type 4: CV = 1.2, γ = 1, β = 1; PNN: CV = 1, β = 1.036; PNN CS Type 1: CV = 1.5, γ = 1, β = 1.5; PNN CS Type 2: CV = 1.5, γ = 1, β = 1.5; PNN CS Type 3: CV = 1.2, γ = 1.5, β = 1.3; PNN CS Type 4: CV = 1.5, γ = 1.5, β = 1.

CC             0.93  0.88  0.93  0.97  0.97  0.97  0.97  0.97  0.98  0.95  0.95  0.97  0.95
IC             0.07  0.12  0.07  0.03  0.03  0.03  0.03  0.03  0.02  0.05  0.05  0.03  0.05
AF             0.63  0.42  0.58  0.77  0.77  0.77  0.79  0.83  0.85  0.75  0.75  0.79  0.96
Type II error  0.37  0.58  0.42  0.23  0.23  0.23  0.21  0.17  0.15  0.25  0.25  0.21  0.04

Out-of-sample: 2014. M&A Failure: 8 (10%); M&A Success: 75 (90%).

CC             0.88  0.88  0.89  0.89  0.89  0.89  0.93  0.93  0.90  0.92  0.92  0.93  0.89
IC             0.12  0.12  0.11  0.11  0.11  0.11  0.07  0.07  0.10  0.08  0.08  0.07  0.11
AF             0.38  0.38  0.25  0.38  0.38  0.38  0.63  0.50  0.50  0.63  0.63  0.63  0.75
Type II error  0.62  0.62  0.75  0.62  0.62  0.62  0.37  0.50  0.50  0.37  0.37  0.37  0.25

In-sample: 2010–2014. M&A Failure: 48 (13%); M&A Success: 331 (87%).
Tuning: NN CS Type 1: CV = 1.5, γ = 1, β = 1; NN CS Type 2: CV = 1.5, γ = 1, β = 1; NN CS Type 3: CV = 1.2, γ = 1, β = 1; NN CS Type 4: CV = 1.1, γ = 1, β = 1; PNN: CV = 1, β = 1.123; PNN CS Type 1: CV = 1.4, γ = 1.5, β = 1; PNN CS Type 2: CV = 1.4, γ = 1.5, β = 1; PNN CS Type 3: CV = 1.3, γ = 1, β = 1.9; PNN CS Type 4: CV = 1.1, γ = 1, β = 1.9.

CC             0.94  0.92  0.94  0.97  0.97  0.97  0.98  0.98  0.98  0.94  0.94  0.95  0.94
IC             0.06  0.08  0.06  0.03  0.03  0.03  0.02  0.02  0.02  0.06  0.06  0.05  0.06
AF             0.67  0.40  0.67  0.81  0.81  0.81  0.85  0.83  0.85  0.92  0.92  0.63  0.67
Type II error  0.33  0.60  0.33  0.19  0.19  0.19  0.15  0.17  0.15  0.08  0.08  0.37  0.33

Out-of-sample: 2015. M&A Failure: 12 (13%); M&A Success: 78 (87%).

CC             0.89  0.91  0.88  0.87  0.87  0.87  0.90  0.91  0.86  0.90  0.90  0.93  0.91
IC             0.11  0.09  0.12  0.13  0.13  0.13  0.10  0.09  0.14  0.10  0.10  0.07  0.09
AF             0.50  0.33  0.42  0.50  0.50  0.50  0.50  0.50  0.58  0.58  0.58  0.58  0.58
Type II error  0.50  0.67  0.58  0.50  0.50  0.50  0.50  0.50  0.42  0.42  0.42  0.42  0.42

CC is correctly classified; IC is incorrectly classified; AF indicates accuracy for failure; CC + IC = 1; AF + Type II error = 1; M&A Failure is the number of total examples in M&A failure; M&A Success is the number of total examples in M&A success; the proportions are in the parentheses; CV represents the cost vector; β and γ refer to the strength and gain parameters from the general logistic activation function. Entries in bold indicate the best performance value of any model within a particular data set.
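The four score rules reported in Table 6 follow directly from the actual and predicted labels; a minimal sketch (the function name and the 0/1 coding of withdrawn deals are our own):

```python
def prediction_scores(actual, predicted, failure=1):
    """CC, IC, AF, and Type II error as defined in Section 5.1,
    where `failure` codes a withdrawn takeover (the rare class)."""
    pairs = list(zip(actual, predicted))
    cc = sum(a == p for a, p in pairs) / len(pairs)      # correctly classified
    fails = [(a, p) for a, p in pairs if a == failure]
    af = sum(a == p for a, p in fails) / len(fails)      # accuracy for failure
    return {"CC": cc, "IC": 1 - cc, "AF": af, "Type II": 1 - af}
```

By construction CC + IC = 1 and AF + Type II error = 1, matching the identities stated in the table note.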

Fig. 4. Estimated in-sample coefficients in parentheses and variable importance in the bar chart in a neural network model with four neurons.

If the change Δ_d in CER is positive, the ith variable is important for making accurate predictions. Then, the CER-based importance is defined as

(1/D) Σ_{d=1}^{D} Δ_d,    (7)

where D is the total number of out-of-bag sets. For the value of D, 50 seems reasonable, and more sets are required as the number of classes increases (Breiman, 1996). We use 100 for the value of D.

Next, we calculate the relative importance metric, which ranges between 0 and 100 and is calculated in four steps. Firstly, we find the maximum and minimum CER-based variable importance. Secondly, we compute the range between the maximum and minimum CER-based variable importance. Thirdly, we compute the relative variable importance by subtracting the minimum CER-based importance among all variables from this variable's CER-based importance, then dividing the resulting value by the range. Lastly, we multiply the relative variable importance by 100. Thus, one variable always has a variable importance of 100. A higher variable importance value indicates that the variable is more important for making accurate predictions. Fig. 4 displays both estimated in-sample coefficients¹² in parentheses and variable importance in the bar chart in a neural network model with four neurons.

¹² The coefficients (weights) in this artificial neural network are an approximation of multiple processes and thus refer to the strength of a connection between two neurons. Coefficients can be positive or negative. A weight's magnitude reflects the notion that, if a large signal from the input neurons results in a large signal from the output neurons, then the weight between those neurons will increase. Positive (negative) weights indicate that an increase in the predictor variable would correspond with an increase (decrease) in the outcome variable (i.e., withdrawn takeover).

We find that the four most important variables in predicting the outcomes of takeover bids are (a) the termination fee clause dummy, (b) hostile, (c) the existence of competing bidders, and (d) total assets (absolute difference in size between two merging firms).

Termination fee clause. Among others, the termination fee clause dummy mostly affects the likelihood that proposed takeovers will be ultimately withdrawn. The termination fee clause dummy equals one if a bid includes a termination fee clause for the acquirer or the target, and zero otherwise. Given that the target termination fee's variable importance is ranked last for predicting M&A failure, we infer that the magnitude of the termination fee clause dummy's variable importance primarily originates from the bidder's side. As previously discussed, the estimated coefficient (−5.23) suggests that termination clauses would make it costly for both fee-paying parties to withdraw from the merger deal, thus decreasing the likelihood of deal failure.

Hostile. The predictor variable with the next largest variable importance is the hostile dummy. Hostile takeovers are defined as deals with target board opposition. Not surprisingly, the estimated in-sample coefficient is positive (1.65), indicating that hostile deals markedly increase the likelihood that a proposed takeover will ultimately fail. Unlike friendly bids, hostile takeovers are not recommended by the target management either at the time of the bid announcement or immediately thereafter. In addition, the target management implements a variety of defensive tactics during the course of the hostile bid, making takeover offers more likely to be withdrawn (Sudarsanam, 1995). Consistent with this notion, our empirical result shows that the likelihood of takeover failure is positively associated with hostile bids (Renneboog & Zhao, 2014).

The existence of competing bidders. We find that the existence of competing bidders is the predictor variable that displays the next largest variable importance. The estimated in-sample coefficient is positive (2.40), implying that the more competitive the bids are, the more likely it becomes that the takeover bids will ultimately fail. This finding is consistent with the notion that the presence or arrival of competing bidders increases the expected price the initial bidder would pay should he/she win, thus lowering takeover offers' success rates (Bhagat, Dong, Hirshleifer, & Noah, 2005). In other words, the initial bidder will not value the merger agreement as highly as he/she previously did when a competing bidder makes an offer for the target firm.

Total assets. Lastly and not surprisingly, the total assets (the absolute difference in size between two merging firms) is the factor that presents


the next largest variable importance. Caiazza and Pozzolo (2016) find that the target's size in absolute terms and relative to that of the bidder has a critical impact on takeover offers' success or failure rates. Similarly, Lim and Lee (2016) demonstrate that an acquirer's size positively affects acquisition completion. We determine total assets to be the fourth most important variable for predicting M&A failure. This finding, combined with the negative estimated coefficient on total assets (−2.86), indicates that a smaller difference in absolute size between the target company and the acquiring firm results in a greater number of proposed takeovers that will be ultimately withdrawn. In other words, takeover attempts between two merging firms of similar size are more likely to be withdrawn than those between two firms that differ in total assets.

6. Conclusions

In this study, we proposed a forecasting model that employs a flexible functional form while taking into consideration the cost-sensitive function. Specifically, the proposed model consists of a neural network with a robust generalized logit activation function to address unbalanced class datasets and with strength and gain parameters estimated using the Gibbs sampler and the Metropolis-Hastings algorithm. In addition, a context-specific, cost-sensitive function is employed to minimize the misclassification of M&A failure. The design of this model is motivated by two important characteristics of M&A data: (i) the outcome of M&A data represents an event with a small probability of withdrawn takeovers, and (ii) the impact of misclassifying failure as success (type II error) is more detrimental than that of misclassifying success as failure (type I error).

Takeover deals completed or withdrawn during the years 2009 to 2015 are used to evaluate the proposed model. When compared to benchmark models (e.g., logit model, weighted logit model, probit model, and neural network with a standard logit activation function), the proposed model generates more accurate out-of-sample prediction results. These findings evince that the proposed approach (i.e., using a generalized logit activation function) is both a consistent and accurate method for class-unbalanced data. Our findings provide guidance to organizations on the development of forecasting models for decision-making processes.

Our study also offers managerial insights on the development of forecasting models. Given that a large and varied number of factors, yet a relatively small sample size of rare events, are included in traditional forecasting models, the use of standard classification algorithms designed for balanced data is more likely to generate inaccurate predictions and overfit the data. Thus, when predicting the success or failure of M&A deals, forecasting models that can effectively incorporate distributive characteristics of all possible outcomes, including unbalanced class data, must be employed.

The results of the present study also suggest that the business implications and consequences of inaccurate predictions of an organization's strategic decision should be taken into account when designing forecasting models, as the effect of misclassifying success as failure in business contexts differs significantly from the effect of misclassifying failure as success. From a statistical perspective, the loss functions of type I and type II errors are asymmetrical. Organizations should, therefore, determine scoring rules using different penalties for the two misclassification cases (i.e., type I and type II errors).

Further, the findings suggest that, in comparison to characteristics of targets and acquirers, M&A-related factors (e.g., the existence of a termination fee clause, hostile mergers, and the existence of competing bidders) are good predictors of takeover offer outcomes. In particular, the existence of a termination fee clause is a crucial factor in the prediction of takeover deal outcomes. Regarding target and acquirer characteristics, the absolute difference in size between two merging firms is found to be an important factor in increasing success rates of proposed takeover deals.
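The strength and gain parameters referred to above can be made concrete with one common generalized-logistic parameterization. This is a hedged sketch only: the paper's exact functional form, and its Bayesian estimation via the Gibbs sampler and Metropolis-Hastings, are given in its methodology section and may differ from the form shown here.

```python
import math

def generalized_logistic(x, gain=1.0, strength=1.0):
    """One common generalized-logistic (Richards-type) activation:
    (1 + exp(-gain * x)) ** (-strength). The gain controls the slope
    and the strength skews the curve; with gain = strength = 1 it
    reduces to the standard logistic used by the benchmark NN."""
    return (1.0 + math.exp(-gain * x)) ** (-strength)
```

A strength other than 1 shifts where the activation crosses 0.5, which is one mechanism by which an asymmetric activation can accommodate a rare class.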

Appendix A. Pseudo code

Input:
  x – independent variable data in the training set
  y – dependent variable data in the training set
      (0: class 1 (not withdrawn), 1: class 2 (withdrawn))
  x.test – independent variable data in the test set
Output:
  y.test_pre – predicted dependent variable data in the test set

1:  Normalize the input data
2:  β ← 1
3:  γ ← 1
4:  numepochs ← 100
5:  Initialize weight matrix w, bias matrix b, and learning rate η
6:  Set the cost matrix cm
7:  // Train the neural network
8:  For i = 1 to numepochs do
9:    For j = 1 to count(y) do
10:     // Forward pass
11:     (w, b) ← forward(w, b, x, y, β, γ)
12:     oj ← f(w, b) // Compute the output neuron value oj for the jth data
13:     // Compute the error (for CS Type 1, 2, and 3)
14:     ej ← f(yj, oj)
15:     // Compute the error (for CS Type 4)
16:     ej ← f(yj, oj, cm)
17:     // Backward pass (for CS Type 1, 2, and 4)
18:     (w, b) ← backward(w, b, x, y, β, γ, ej, η)
19:     // Backward pass (for CS Type 3)
20:     (w, b, η) ← backward(w, b, x, y, β, γ, ej, cm, η)
21:   End for
22: End for
23: // Test the neural network
24: o ← f(x.test, w, b) // Compute the output neuron vector o
25: Normalize o between 0 and 1
26: // Predict dependent variables (for CS Type 1)
27: P ← f(o, cm) // Compute the class probability vector P
28: y.test_pre ← arg(max P)
29: // Predict dependent variables (for CS Type 2)
30: o ← f(o, cm) // Update o using the current o and cm
31: y.test_pre ← arg(max o)
32: // Predict dependent variables (for CS Type 3 and 4)
33: y.test_pre ← arg(max o)
34: Return y.test_pre
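The training loop above can be condensed into a small executable sketch. The simplification is ours: a single logistic neuron stands in for the paper's one-hidden-layer networks, the activation is the standard logistic rather than the generalized one, and the class-dependent cost weighting only gestures at the four CS variants.

```python
import math
import random

def train_cost_sensitive(X, y, cost=(1.0, 1.5), lr=0.5, epochs=200, seed=0):
    """Gradient-descent training of one logistic neuron in which each
    example's error is scaled by a class-dependent cost (cf. the cost
    matrix cm / cost vector costvect in the pseudocode above)."""
    rng = random.Random(seed)
    p = len(X[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(p)]
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            o = 1.0 / (1.0 + math.exp(-z))      # forward pass
            e = cost[t] * (o - t)               # cost-weighted error
            g = e * o * (1.0 - o)               # backward pass (sigmoid gradient)
            for i in range(p):
                w[i] -= lr * g * x[i]
            b -= lr * g
    def predict(rows):
        return [int(1.0 / (1.0 + math.exp(-(sum(wi * xi
                    for wi, xi in zip(w, r)) + b))) >= 0.5) for r in rows]
    return predict
```

Raising cost[1] penalizes errors on the rare withdrawn class more heavily, trading some overall accuracy for a lower Type II error, the trade-off visible across CV values in Table 6.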

References

Audrino, F., Kostrov, A., & Ortega, J. (2019). Predicting U.S. bank failures with MIDAS logit models. Journal of Financial and Quantitative Analysis, 54(6), 2575–2603.
Baker, M., Pan, X., & Wurgler, J. (2012). The effect of reference point prices on mergers and acquisitions. Journal of Financial Economics, 106(1), 49–71.
Baker, M., & Savaşoglu, S. (2002). Limited arbitrage in mergers and acquisitions. Journal of Financial Economics, 64(1), 91–115.
Bange, M. M., & Mazzeo, M. A. (2004). Board composition, board effectiveness, and the observed form of takeover bids. Review of Financial Studies, 17(4), 1185–1215.
Barnes, P. (1999). Predicting UK takeover targets: Some methodological issues and an empirical study. Review of Quantitative Finance and Accounting, 12(3), 283–302.
Betton, S., & Eckbo, B. E. (2000). Toeholds, bid jumps, and expected payoffs in takeovers. The Review of Financial Studies, 13(4), 841–882.
Betton, S., Eckbo, B. E., Thompson, R., & Thorburn, K. S. (2014). Merger negotiations with stock market feedback. The Journal of Finance, 69(4), 1705–1745.
Bhagat, S., Dong, M., Hirshleifer, D., & Noah, R. (2005). Do tender offers create value? New methods and evidence. Journal of Financial Economics, 76(1), 3–60.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Bradley, M., Desai, A., & Kim, E. H. (1983). The rationale behind interfirm tender offers: Information or synergy? Journal of Financial Economics, 11(1–4), 183–206.
Branch, B., Wang, J., & Yang, T. (2008). A note on takeover success prediction. International Review of Financial Analysis, 17(5), 1186–1193.
Branch, B., & Yang, T. (2003). Predicting successful takeovers and risk arbitrage. Quarterly Journal of Business and Economics, 42(1), 3–18.
Brar, G., Giamouridis, D., & Liodakis, M. (2009). Predicting European takeover targets. European Financial Management, 15(2), 430–450.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth International Group.
Buntine, W. L., & Weigend, A. S. (1994). Computing second derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 5(3), 480–488.
Caiazza, S., & Pozzolo, A. F. (2016). The determinants of failed takeovers in the banking sector: Deal or country characteristics? Journal of Banking & Finance, 72, S92–S103.
Chakrabarti, A., & Mitchell, W. (2016). The role of geographic distance in completing related acquisitions: Evidence from US chemical manufacturers. Strategic Management Journal, 37(4), 673–694.
Chatterjee, S., Harrison, J. S., & Bergh, D. D. (2003). Failed takeover attempts, corporate governance and refocusing. Strategic Management Journal, 24(1), 87–96.
Chen, C. T., & Chang, W. D. (1996). A feedforward neural network with function shape autotuning. Neural Networks, 9(4), 627–641.
Cudd, M., & Duggal, R. (2000). Industry distributional characteristics of financial ratios: An acquisition theory application. Financial Review, 35(1), 105–120.
Daqi, G., & Genxing, Y. (2003). Influences of variable scales and activation functions on the performances of multilayer feedforward neural networks. Pattern Recognition, 36(4), 869–878.
de Bodt, E., Cousin, J. G., & Roll, R. (2018). Empirical evidence of overbidding in M&A contests. Journal of Financial and Quantitative Analysis, 53(4), 1547–1579.
Denis, D. J., & Serrano, J. M. (1996). Active investors and management turnover following unsuccessful control contests. Journal of Financial Economics, 40(2), 239–266.
Denrell, J., & Fang, C. (2010). Predicting the next big thing: Success as a signal of poor judgment. Management Science, 56(10), 1653–1667.
Duggal, R., & Millar, J. A. (1994). Institutional investors, antitakeover defenses and success of hostile takeover bids. The Quarterly Review of Economics and Finance, 34(4), 387–402.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press.
Giordani, P., Jacobson, T., Von Schedvin, E., & Villani, M. (2014). Taking the twists into account: Predicting firm bankruptcy risk with splines of financial ratios. Journal of Financial and Quantitative Analysis, 49(4), 1071–1099.
Hansen, J. V., McDonald, J. B., Messier, W. F., Jr., & Bell, T. B. (1996). A generalized qualitative-response model and the analysis of management fraud. Management Science, 42(7), 1022–1032.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.
Hatzakis, E. D., Nair, S. K., & Pinedo, M. (2010). Operations in financial services—An overview. Production and Operations Management, 19(6), 633–664.
Haykin, S. S. (2009). Neural networks and learning machines. Upper Saddle River: Prentice Hall.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Heron, R. A., & Lie, E. (2006). On the use of poison pills and defensive payouts by takeover targets. The Journal of Business, 79(4), 1783–1807.
Hoff, P. D. (2009). A first course in Bayesian statistical methods. New York: Springer Science & Business Media.
Imbens, G. W. (1992). An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica, 60(5), 1187–1214.
Jandik, T., & Makhija, A. K. (2005). Debt, debt structure and corporate performance after unsuccessful takeovers: Evidence from targets that remain independent. Journal of Corporate Finance, 11(5), 882–914.
Jennings, R. H., & Mazzeo, M. A. (1993). Competing bids, target management resistance, and the structure of takeover bids. The Review of Financial Studies, 6(4), 883–909.
Jeon, J. Q., & Ligon, J. A. (2011). How much is reasonable? The size of termination fees in mergers and acquisitions. Journal of Corporate Finance, 17(4), 959–981.
Kaplan, S. N., & Weisbach, M. S. (1992). The success of acquisitions: Evidence from divestitures. The Journal of Finance, 47(1), 107–138.
Krishnan, C. N. V., & Masulis, R. W. (2013). Law firm expertise and merger and acquisition outcomes. The Journal of Law and Economics, 56(1), 189–226.
Kukar, M., & Kononenko, I. (1998). Cost-sensitive learning with neural networks. Proceedings of the 13th European conference on artificial intelligence. Brighton: IOS Press.
Kwak, N., & Choi, C. H. (2002). Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis & Machine Intelligence, 24, 1667–1671.
Lancaster, T., & Imbens, G. (1996). Case-control studies with contaminated controls. Journal of Econometrics, 71(1–2), 145–160.
Lim, M. H., & Lee, J. H. (2016). The effects of industry relatedness and takeover motives on cross-border acquisition completion. Journal of Business Research, 69(11), 4787–4792.
Lin, L., Lan, L. H., & Chuang, S. S. (2013). An option-based approach to risk arbitrage in emerging markets: Evidence from Taiwan takeover attempts. Journal of Forecasting, 32(6), 512–521.
Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2), 4–22.
Lu, S., Liu, L., Lu, Y., & Wang, P. S. (2012). Cost-sensitive neural network classifiers for postcode recognition. International Journal of Pattern Recognition and Artificial Intelligence, 26(5), 1263001.
Ma, G. Z., Song, E., Hung, C. C., Su, L., & Huang, D. S. (2012). Multiple costs based decision making with back-propagation neural networks. Decision Support Systems, 52(3), 657–663.
Menon, A., Mehrotra, K., Mohan, C. K., & Ranka, S. (1996). Characterization of a class of sigmoid functions with applications to neural networks. Neural Networks, 9(5), 819–835.
Officer, M. S. (2003). Termination fees in mergers and acquisitions. Journal of Financial Economics, 69(3), 431–467.
Officer, M. S. (2007). Are performance based arbitrage effects detectable? Evidence from merger arbitrage. Journal of Corporate Finance, 13(5), 793–812.
Palepu, K. G. (1986). Predicting takeover targets: A methodological and empirical analysis. Journal of Accounting and Economics, 8(1), 3–35.
Pickering, J. F. (1983). The causes and consequences of abandoned mergers. The Journal of Industrial Economics, 31(3), 267–281.
Powell, R. G., & Stark, A. W. (2005). Does operating performance increase post-takeover for UK takeovers? A comparison of performance measures and benchmarks. Journal of Corporate Finance, 11(1–2), 293–317.
Renneboog, L., & Zhao, Y. (2014). Director networks and takeovers. Journal of Corporate Finance, 28, 218–234.
Rodrigues, B. D., & Stevenson, M. J. (2013). Takeover prediction using forecast combinations. International Journal of Forecasting, 29(4), 628–641.
Rubanov, N. S. (2000). The layer-wise method and the backpropagation hybrid approach to learning a feedforward neural network. IEEE Transactions on Neural Networks, 11(2), 295–305.
Schwert, G. W. (2000). Hostility in takeovers: In the eyes of the beholder? The Journal of Finance, 55(6), 2599–2640.
Stäger, F., & Agarwal, M. (1997). Three methods to speed up the training of feedforward and feedback perceptrons. Neural Networks, 10(8), 1435–1443.
Sudarsanam, P. S. (1995). The role of defensive strategies and ownership structure of target firms: Evidence from UK hostile takeover bids. European Financial Management, 1(3), 223–240.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687–719.
Thimm, G., Moerland, P., & Fiesler, E. (1996). The interchangeability of learning rate and gain in backpropagation neural networks. Neural Computation, 8(2), 451–460.
Walkling, R. A. (1985). Predicting tender offer success: A logistic analysis. Journal of Financial and Quantitative Analysis, 20(4), 461–478.
Wang, J., & Branch, B. (2009). Takeover success prediction and performance of risk arbitrage. The Journal of Business and Economic Studies, 15(2), 10–25.
Wang, L., & Zajac, E. J. (2007). Alliance or acquisition? A dyadic perspective on interfirm resource combinations. Strategic Management Journal, 28(13), 1291–1317.
Zajac, E. J., & Olsen, C. P. (1993). From transaction cost to transactional value analysis: Implications for the study of interorganizational strategies. Journal of Management Studies, 30(1), 131–145.
Zhang, M., Johnson, G., & Wang, J. (2012). Predicting takeover success using machine learning techniques. Journal of Business & Economics Research, 10(10), 547–552.


Kang Bok Lee, PhD, is an assistant professor of business analytics in the Raymond J. Harbert College of Business at Auburn University. His research interests include corporate governance and statistical modeling of organizational phenomena. He has published in journals such as Academy of Management Journal, Production and Operations Management, European Journal of Operational Research, International Journal of Research in Marketing, European Journal of Marketing, Decision Sciences, Journal of Business Logistics, Public Administration Review, Journal of Economics and Finance, and Quantitative Finance and Economics.

Sunghoon Joo is a PhD candidate in the Department of Finance at Auburn University. His research interests include financial markets and institutions, corporate finance, and predictive modeling. He has published in journals such as Journal of Economics and Finance, Journal of Regional Analysis and Policy, and Quantitative Finance and Economics. He has taught Financial Markets and Institutions and Advanced Business Finance at Auburn University.

Hyeoncheol Baik is a PhD candidate in the Department of Industrial and Systems Engineering at Auburn University. His research interests include the optimization of drone operations in the electric power industry and the military, and interdisciplinary decision making in air transportation. He has published several papers in Journal of Intelligent & Robotic Systems.

Sumin Han, PhD, is an assistant professor of business analytics in the Raymond J. Harbert College of Business at Auburn University. She has published in journals such as European Journal of Marketing.

Joonhwan In (PhD, The University of Tennessee) is an assistant professor of supply chain management at California State University Long Beach. His research interests include governance of supply chain information flows, healthcare supply chains, and vehicle routing. His research has been published in journals such as Production and Operations Management, Journal of Business Logistics, International Journal of Physical Distribution & Logistics Management, and International Journal of Production Economics.

