
Government Torture and Terrorism:

Reviewing Daxecker's Model & Binary Classification Using Machine Learning

Xinzhou Ge, Minxuan Xu

3/22/2017

1 Introduction
The paper Dirty Hands: Government Torture and Terrorism [1] by Ursula Daxecker, a political science
professor at the University of Amsterdam, explores how variation in the visibility of different torture
techniques affects the likelihood of terrorist backlash, i.e. increased terrorist activity.

The author focuses on two types of torture techniques: scarring torture and stealth torture. The former
leaves lesions or scars on the body and is thus more visible to the public (e.g. whipping, beating), while the
latter is easier for the government to deny (e.g. sleep deprivation, waterboarding). After analyzing
the data with negative binomial regression, she concludes that scarring torture produces more terrorist
attacks, whereas stealth torture has no statistically significant effect on terrorism.

This is a causal statement, and Daxecker attempts to defend it by addressing reverse causality and
endogeneity in the torture-terrorism relationship both theoretically and empirically. Causation is beyond the
scope of this paper; we examine the association only.

Daxecker also argues that model comparisons support the selection of the negative binomial model. However,
she does not provide any details about the fitting, testing and evaluation process. Whether the unregularized
negative binomial regression (NB2) is the optimal choice is open to doubt. Regularization may help, and
building separate models for zero and non-zero responses might be beneficial.

This paper proceeds as follows. After a brief description of the data, we review Daxecker's proposed model by
inspecting the significance of the torture variables and comparing multiple variants of the negative binomial
regression model. The subsequent section aims to predict the occurrence of terrorist attacks
using several machine learning techniques and to compare their accuracies.

[1] See http://journals.sagepub.com/doi/abs/10.1177/0022002715603766.

2 Data Description
The dataset consists of country-year observations for 116 countries from 1996 to 2006, with 1,140 observations
in total. It is in fact the replication data [2] of Daxecker's paper, gathered by combining information from the
Global Terrorism Database (GTD), Ill-Treatment and Torture Specific Allegations (ITT SA) by Amnesty International (AI),
the World Bank, etc. The only dependent variable is the number of domestic terrorist incidents. Predictors
consist of the two torture variables and other covariates measuring the information, institutional and human
rights environment. As Daxecker puts it, the predictors are taken at time $t-1$ rather than time $t$, so that
simultaneity bias can be avoided. The natural logarithm is taken of several variables because of their right
skewness. Notations and detailed descriptions (extracted from the existing paper) of the variables are shown
below in Table 1.

Table 1: Notations and Descriptions of Variables

| Notation | Description |
| --- | --- |
| Response (time = $t$) | |
| $DV_t$ | No. of domestic terrorist incidents. A count variable. |
| Predictors (time = $t-1$) | |
| $Scarring_{t-1}$ | Logged no. of scarring torture allegations (logarithm is taken after adding one). |
| $Stealth_{t-1}$ | Logged no. of stealth torture allegations (logarithm is taken after adding one). |
| $GDP_{t-1}$ | Logged gross domestic product (GDP) per capita. |
| $Pop_{t-1}$ | Logged population size. |
| $Dem_{t-1}$ | Dummy variable (1 = Democracy). A continuous polity measure is not used here due to multicollinearity concerns between human rights (HR) and democracy measures. |
| $Dura_{t-1}$ | Durability. It is the number of years since a three-point change in polity score. |
| $MavgDV_{t-1}$ | 3-year moving average of $DV$ at $t-1$, $t-2$ and $t-3$. It accounts for serial correlation. |
| $SpDV_{t-1}$ | Spatial lag of $DV$. It captures the spatial diffusion of terrorism. |
| $Restrict_{t-1}$ | Dummy variable (1 = Restricted). It measures whether AI had difficulty gaining access to detainees. This is related to the reporting of allegations. |
| $MediaExpo_{t-1}$ | Logged no. of news reports about a country in Reuters Global News Service. It measures the media reporting environment. |
| $Judi_{t-1}$ | An indicator of judicial independence. It is continuous, ranging from 0 to 1. |
| $MediaFree_{t-1}$ | A measure of freedom of speech and press, taking values 0, 1, or 2. |
| $Other_{t-1}$ | A measure of physical integrity rights violations other than torture. |
| $Unstated_{t-1}$ | Logged no. of torture allegations with the techniques used unknown. |

[2] The data are accessible online from the Harvard Dataverse. See https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YKGTMD.
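The lagging and log-plus-one transformations described in this section can be sketched as follows. This is a minimal Python illustration rather than the code used for the replication data: the record layout, the key names (`dv`, `scarring`, `stealth`) and the function name `build_lagged_rows` are hypothetical, and the toy counts are made up.

```python
import math

def build_lagged_rows(records):
    """Pair each response DV_t with predictors measured in year t-1.

    `records` maps (country, year) -> dict of raw counts. The keys
    "dv", "scarring" and "stealth" are hypothetical names.
    """
    rows = []
    for (country, year), current in sorted(records.items()):
        previous = records.get((country, year - 1))
        if previous is None:
            continue  # no lagged predictors available for this country-year
        rows.append({
            "country": country,
            "year": year,
            "dv": current["dv"],  # response at time t
            # predictors at time t-1, logged after adding one
            "scarring_lag": math.log(previous["scarring"] + 1),
            "stealth_lag": math.log(previous["stealth"] + 1),
        })
    return rows

# Made-up toy data for a single country.
toy = {
    ("A", 1996): {"dv": 3, "scarring": 0, "stealth": 2},
    ("A", 1997): {"dv": 5, "scarring": 4, "stealth": 0},
    ("A", 1998): {"dv": 0, "scarring": 1, "stealth": 1},
}
rows = build_lagged_rows(toy)
```

The first year of each country drops out because it has no predecessor, which is one reason a lagged panel has fewer usable rows than country-years.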

The correlation matrix of the 14 predictors shows no strong correlation between the two torture
variables and the other covariates, except for $Unstated$. This supports the selection of the torture variables as
predictors. Scarring torture and stealth torture are themselves correlated (their
correlation is around 0.7), but Daxecker addresses this issue by attributing it to the reporting bias of AI,
the organization responsible for recording the torture allegations, rather than to any theoretical logic.
In fact, $Restrict$ and $MediaExpo$ are included as variables precisely to account for this issue.

3 Reviewing Daxecker's Model
Daxecker's paper defines four models with varying numbers of predictors. For simplicity, we limit our
study to the one with the fewest variables, the base model, and the one with all 14 variables, the
full model. They are outlined as follows:

Base Model

$$
DV_t \sim MavgDV_{t-1} + SpDV_{t-1} + Scarring_{t-1} + Stealth_{t-1} + GDP_{t-1} + Pop_{t-1} + Dem_{t-1} + Dura_{t-1}
$$

Full Model

$$
\begin{aligned}
DV_t \sim{} & MavgDV_{t-1} + SpDV_{t-1} + Scarring_{t-1} + Stealth_{t-1} + Restrict_{t-1} + MediaExpo_{t-1} + Judi_{t-1} \\
& + MediaFree_{t-1} + Other_{t-1} + Unstated_{t-1} + GDP_{t-1} + Pop_{t-1} + Dem_{t-1} + Dura_{t-1}
\end{aligned}
$$

3.1 Significance of the Torture Variables

We start by fitting negative binomial regression (NB2) to all the data and inspecting the significance of the
torture variables. NB2 is a natural choice since the response is a count variable without an upper limit, and it
allows for over-dispersion, imposing fewer constraints than Poisson regression.
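For reference, the standard NB2 parameterization can be written as below. The symbols $\mu_t$ (conditional mean), $x_{t-1}$ (vector of lagged predictors), $\beta$ (coefficients) and $\alpha$ (dispersion parameter) are notation introduced here for illustration and are not taken from Daxecker's paper.

$$
\log \mu_t = x_{t-1}^\top \beta, \qquad
E[DV_t \mid x_{t-1}] = \mu_t, \qquad
\mathrm{Var}[DV_t \mid x_{t-1}] = \mu_t + \alpha \mu_t^2, \quad \alpha \ge 0 .
$$

Poisson regression corresponds to the limiting case $\alpha \to 0$, in which the variance is forced to equal the mean. This is the constraint that NB2 relaxes.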

As observations of the same country in different years are not independent, the standard errors reported
directly by the statistical package are biased (the "Before Adj" columns of Table 2, in parentheses). We therefore
adjust them to obtain clustered standard errors (the "Clustered" columns of Table 2, in parentheses). As one can see,
the standard errors tend to grow after the adjustment, which reduces the significance level of the coefficient
estimates.
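Why clustering tends to inflate standard errors can be seen in the simplest possible setting, the standard error of a sample mean. The sketch below is not the adjustment applied to the NB2 fits. It is an intercept-only illustration of the same sandwich idea, without small-sample corrections, and the function name and toy numbers are made up.

```python
import math
from collections import defaultdict

def unclustered_and_clustered_se(values, clusters):
    """Robust standard errors of the sample mean, without and with clustering.

    Unclustered: each residual is treated as its own independent unit.
    Clustered: residuals are summed within each cluster before squaring,
    so within-cluster correlation is allowed.
    """
    n = len(values)
    mean = sum(values) / n
    residuals = [v - mean for v in values]

    unclustered_var = sum(e * e for e in residuals) / n ** 2

    cluster_sums = defaultdict(float)
    for e, g in zip(residuals, clusters):
        cluster_sums[g] += e
    clustered_var = sum(s * s for s in cluster_sums.values()) / n ** 2

    return math.sqrt(unclustered_var), math.sqrt(clustered_var)

# Toy data: three "countries", each observed three times with identical
# values, i.e. perfectly correlated within a country.
values = [1, 1, 1, 5, 5, 5, 9, 9, 9]
clusters = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
se_unclustered, se_clustered = unclustered_and_clustered_se(values, clusters)
```

With perfectly correlated observations inside each cluster, the nine rows carry only as much information as three independent ones, so the clustered standard error is larger by a factor of $\sqrt{3}$ in this toy case.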

Although it is reassuring that stealth torture remains non-significant in all cases, the fact that the
significance of scarring torture (almost) vanishes in both the base and the full model after the adjustment
makes Daxecker's conclusion questionable. In her paper she states that the reported standard errors are
clustered, yet there is a sizable gap between her standard errors for the scarring variable (Base 0.081,
Full 0.095) and ours (Base 0.136, Full 0.138). This is far from enough to refute
her findings, but at least it means that her result is not fully replicable.

Table 2: Coefficients of Negative Binomial Regression (All Observations, Base & Full)

| Variable | Base: Daxecker's | Base: Ours (Before Adj) | Base: Ours (Clustered) | Full: Daxecker's | Full: Ours (Before Adj) | Full: Ours (Clustered) |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -1.089 (1.062) | -0.622 (0.474) | -0.622 (0.863) | -2.569* (1.097) | -2.697*** (0.576) | -2.697* (1.073) |
| $MavgDV_{t-1}$ | 0.039** (0.010) | 0.040*** (0.002) | 0.040*** (0.002) | 0.020** (0.006) | 0.022*** (0.002) | 0.022*** (0.002) |
| $SpDV_{t-1}$ | 0.047* (0.022) | 0.033*** (0.010) | 0.033*** (0.009) | 0.053** (0.019) | 0.045*** (0.009) | 0.045*** (0.009) |
| $Scarring_{t-1}$ | 0.271** (0.081) | 0.259*** (0.075) | 0.259 (0.136) | 0.220* (0.095) | 0.156* (0.076) | 0.156 (0.138) |
| $Stealth_{t-1}$ | -0.004 (0.101) | 0.037 (0.090) | 0.037 (0.113) | 0.081 (0.087) | 0.103 (0.092) | 0.103 (0.092) |
| $Restrict_{t-1}$ | | | | -0.703** (0.227) | -0.818*** (0.224) | -0.818*** (0.226) |
| $MediaExpo_{t-1}$ | | | | 0.429** (0.109) | 0.262*** (0.060) | 0.262*** (0.066) |
| $Judi_{t-1}$ | | | | 1.359* (0.646) | 1.645*** (0.407) | 1.645* (0.812) |
| $MediaFree_{t-1}$ | | | | 0.324* (0.147) | 0.288** (0.099) | 0.288* (0.131) |
| $Other_{t-1}$ | | | | 0.327** (0.068) | 0.333*** (0.042) | 0.333*** (0.071) |
| $Unstated_{t-1}$ | | | | -0.049 (0.096) | 0.008 (0.079) | 0.008 (0.103) |
| $GDP_{t-1}$ | 0.005 (0.129) | -0.020 (0.060) | -0.020 (0.111) | -0.320* (0.145) | -0.228** (0.077) | -0.228 (0.118) |
| $Pop_{t-1}$ | 0.333** (0.102) | 0.318*** (0.045) | 0.318** (0.114) | 0.012 (0.128) | 0.139* (0.061) | 0.139 (0.132) |
| $Dem_{t-1}$ | -0.054 (0.266) | -0.013 (0.133) | -0.013 (0.310) | -0.259 (0.347) | -0.320 (0.179) | -0.320 (0.546) |
| $Dura_{t-1}$ | 0.002 (0.004) | 0.001 (0.002) | 0.001 (0.004) | -0.002 (0.003) | -0.002 (0.002) | -0.002 (0.003) |

Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1. Standard errors are reported in parentheses.

3.2 Comparisons of Various Regression Models

In this section, we focus on five variants of negative binomial regression:

Unregularized Negative Binomial Regression (NB2)

Negative Binomial Regression with L1 Regularization (NB2 LASSO)
Tuning parameter ($\lambda$) is chosen by 10-fold cross-validation.

Negative Binomial Regression with L2 Regularization (NB2 L2)
Tuning parameter ($\lambda$) is chosen by 10-fold cross-validation.

Hurdle Model
Zero-model: binomial with logit link; Count-model: truncated negative binomial with log link.

Zero-Inflated Model
Zero-model: binomial with logit link; Count-model: negative binomial with log link.
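The difference between the hurdle and zero-inflated variants listed above is easiest to see in their probability mass functions. The sketch below is a stdlib-only Python illustration under assumed parameter names (`mu` for the NB2 mean, `alpha` for the dispersion, `p_nonzero` and `p_extra_zero` for the zero-model probabilities). It is not the fitting code behind the reported results.

```python
import math

def nb2_pmf(y, mu, alpha):
    """NB2 probability of count y, with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha              # "size" parameter
    p = r / (r + mu)             # success probability
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

def hurdle_pmf(y, p_nonzero, mu, alpha):
    """Hurdle model: all zeros come from the zero-model; positive counts
    follow a zero-truncated NB2."""
    if y == 0:
        return 1.0 - p_nonzero
    return p_nonzero * nb2_pmf(y, mu, alpha) / (1.0 - nb2_pmf(0, mu, alpha))

def zero_inflated_pmf(y, p_extra_zero, mu, alpha):
    """Zero-inflated model: zeros come either from the extra-zero
    component or from the (untruncated) NB2 count component."""
    base = (1.0 - p_extra_zero) * nb2_pmf(y, mu, alpha)
    return p_extra_zero + base if y == 0 else base

# Illustrative parameter values only.
nb2_total = sum(nb2_pmf(y, 3.0, 0.5) for y in range(500))
hurdle_total = sum(hurdle_pmf(y, 0.3, 3.0, 0.5) for y in range(500))
zi_total = sum(zero_inflated_pmf(y, 0.3, 3.0, 0.5) for y in range(500))
```

Under the common convention in which the hurdle zero-model describes the probability of a non-zero count while the zero-inflated zero-model describes the probability of an extra zero, the same predictor would be expected to enter the two zero-models with opposite signs. Whether a given software package follows this convention is an assumption to check against its documentation.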

To make comparisons, we need to partition the data into training and test sets. Models are fitted
on the training set, and their performance is evaluated on the test set. Before doing so, we
noticed some unreasonably large fitted values in the previous negative binomial regression (up to $8.58 \times 10^6$
in the base model and $4.15 \times 10^4$ in the full model). Since the response only ranges from 0 to 576, this
indicates the existence of outliers and the need to remove them.

3.2.1 Data Cleaning

For outlier detection, we only consider 7 (nearly) continuous variables: $MavgDV$, $SpDV$, $Scarring$, $Stealth$,
$MediaExpo$, $GDP$ and $Pop$. In each iteration, the Mahalanobis distance of all remaining
points is computed and the data point with the largest distance is deleted. In total, 57 outliers are removed
this way, which accounts for 5% of the data and leaves 1,083 observations.

Recall that if data points are multivariate normal, their squared Mahalanobis distances approximately follow a
chi-squared distribution with degrees of freedom equal to the number of variables. The chi-squared QQ plots before
and after the cleaning show improved multivariate normality of the continuous variables.
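The iterative removal procedure can be sketched as follows. To stay self-contained, this Python illustration is restricted to two variables so that the covariance matrix can be inverted in closed form, whereas the cleaning described above uses seven. The function names and toy points are hypothetical.

```python
def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean,
    using the sample covariance matrix of the same points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^T S^{-1} d with the closed-form inverse of a 2x2 matrix
        distances.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def remove_outliers(points, n_remove):
    """Repeatedly delete the point with the largest distance, recomputing
    the mean and covariance after every deletion."""
    remaining = list(points)
    removed = []
    for _ in range(n_remove):
        d2 = squared_mahalanobis_2d(remaining)
        worst = max(range(len(remaining)), key=d2.__getitem__)
        removed.append(remaining.pop(worst))
    return remaining, removed

# Made-up toy data: five points in the unit square plus one distant point.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
kept, removed = remove_outliers(toy_points, 1)
```

Recomputing the mean and covariance after each deletion matters, because a single extreme point inflates the covariance estimate and can mask other outliers.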

3.2.2 Performance Evaluation

Since different observations for the same country tend to be dependent, we randomize at the cluster (country)
level when splitting the data. The ratio of training data to test data is 739 : 344, approximately 2 : 1. Performance
on the test data is evaluated with two measures: the mean squared error (MSE) and the log-likelihood.
MSE is a natural choice for linear regression, which is fitted by minimizing the residual sum of squares, but
it might not be suitable for our case. Generalized linear models such as negative binomial regression are
fitted by maximizing the likelihood, so we also use the log-likelihood of the test set as a measure.
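A cluster-level split of this kind can be sketched as below. The exact randomization used for the 739 : 344 split is not described, so the function name, the `test_fraction` argument and the seed are assumptions made for illustration.

```python
import random

def cluster_split(rows, cluster_key, test_fraction, seed=0):
    """Assign whole clusters (e.g. countries) to either the training or
    the test set, so that no cluster contributes rows to both."""
    clusters = sorted({row[cluster_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [row for row in rows if row[cluster_key] not in test_clusters]
    test = [row for row in rows if row[cluster_key] in test_clusters]
    return train, test

# Made-up toy panel: six countries observed in three years each.
toy_rows = [{"country": c, "year": y} for c in "ABCDEF" for y in (1996, 1997, 1998)]
train, test = cluster_split(toy_rows, "country", test_fraction=1 / 3)
```

Because countries differ in how many years they contribute, splitting clusters in a fixed proportion yields only an approximate ratio of observations, which is consistent with a split that is close to, but not exactly, 2 : 1.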

Table 3: Coefficients of Various Regression Models (Training Set, Base)

| Variable | NB2 | NB2 LASSO ($\lambda = 0.010$) | NB2 L2 ($\lambda = 0.021$) | Hurdle: Zero | Hurdle: Count | Zero-Inflated: Zero | Zero-Inflated: Count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.133 (0.903) | -0.401 | -0.135 | -2.190** (0.817) | 0.944 (1.125) | 4.036* (1.757) | 1.306 (0.931) |
| $MavgDV_{t-1}$ | 0.083*** (0.007) | 0.081 | 0.076 | 0.221*** (0.046) | 0.073*** (0.011) | -1.028*** (0.283) | 0.066*** (0.007) |
| $SpDV_{t-1}$ | 0.038*** (0.011) | 0.032 | 0.035 | 0.024 (0.018) | 0.029* (0.014) | -0.011 (0.051) | 0.023* (0.011) |
| $Scarring_{t-1}$ | 0.196 (0.134) | 0.137 | 0.183 | 0.139 (0.131) | 0.158 (0.152) | 0.055 (0.300) | 0.163 (0.122) |
| $Stealth_{t-1}$ | -0.058 (0.115) | - | -0.021 | 0.147 (0.151) | -0.107 (0.113) | -0.560 (0.361) | -0.117 (0.096) |
| $GDP_{t-1}$ | -0.096 (0.116) | -0.042 | -0.086 | 0.080 (0.098) | -0.210 (0.149) | -0.451 (0.246) | -0.190 (0.129) |
| $Pop_{t-1}$ | 0.310*** (0.076) | 0.300 | 0.301 | 0.187* (0.087) | 0.304* (0.148) | 0.059 (0.175) | 0.273* (0.110) |
| $Dem_{t-1}$ | -0.324 (0.301) | -0.311 | -0.309 | -0.222 (0.238) | -0.169 (0.325) | 0.368 (0.501) | -0.123 (0.304) |
| $Dura_{t-1}$ | 0.002 (0.003) | - | 0.002 | 0.002 (0.003) | 0.002 (0.004) | 0.006 (0.007) | 0.003 (0.003) |

Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1. Clustered SEs are reported in parentheses.

Table 3 and Table 4 show the estimated coefficients and their significance. For technical reasons we cannot
obtain standard errors and p-values for the regularized models: the R package only reports the coefficient
estimates, and bootstrapping fails due to a potential memory leak in the model-fitting function.

Table 4: Coefficients of Various Regression Models (Training Set, Full)

| Variable | NB2 | NB2 LASSO ($\lambda = 0.013$) | NB2 L2 ($\lambda = 0.041$) | Hurdle: Zero | Hurdle: Count | Zero-Inflated: Zero | Zero-Inflated: Count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | -0.823 (1.082) | -1.256 | -1.272 | -2.215* (1.005) | 0.248 (1.329) | 5.820* (2.565) | 0.862 (1.048) |
| $MavgDV_{t-1}$ | 0.073*** (0.008) | 0.073 | 0.065 | 0.188*** (0.044) | 0.065*** (0.011) | -1.248** (0.469) | 0.064*** (0.010) |
| $SpDV_{t-1}$ | 0.049*** (0.012) | 0.039 | 0.042 | 0.035* (0.017) | 0.040** (0.014) | -0.031 (0.062) | 0.033** (0.012) |
| $Scarring_{t-1}$ | 0.082 (0.124) | 0.090 | 0.110 | 0.116 (0.143) | 0.098 (0.162) | 0.592 (0.824) | 0.183 (0.134) |
| $Stealth_{t-1}$ | 0.046 (0.105) | - | 0.030 | 0.212 (0.143) | -0.014 (0.116) | -0.823 (0.748) | -0.033 (0.090) |
| $Restrict_{t-1}$ | -0.761** (0.245) | -0.506 | -0.663 | -0.474 (0.548) | -0.689 (0.373) | -16.38*** (1.878) | -0.804** (0.293) |
| $MediaExpo_{t-1}$ | 0.287*** (0.072) | 0.224 | 0.191 | 0.315** (0.116) | 0.125 (0.120) | -0.428 (0.506) | 0.092 (0.104) |
| $Judi_{t-1}$ | 0.857 (0.675) | - | 0.429 | 0.041 (0.720) | 1.026 (0.801) | -0.371 (1.785) | 0.712 (0.713) |
| $MediaFree_{t-1}$ | -0.037 (0.139) | - | 0.009 | -0.045 (0.156) | -0.040 (0.191) | 0.430 (0.524) | -0.011 (0.158) |
| $Other_{t-1}$ | 0.158* (0.068) | 0.134 | 0.146 | 0.127 (0.068) | 0.112 (0.070) | -0.452 (0.302) | 0.077 (0.061) |
| $Unstated_{t-1}$ | 0.021 (0.118) | - | 0.017 | -0.092 (0.180) | 0.013 (0.114) | -0.391 (1.042) | -0.075 (0.130) |
| $GDP_{t-1}$ | -0.300* (0.137) | -0.146 | -0.161 | -0.159 (0.160) | -0.306 (0.160) | -0.346 (0.465) | -0.303* (0.144) |
| $Pop_{t-1}$ | 0.061 (0.107) | 0.090 | 0.140 | -0.079 (0.123) | 0.203 (0.188) | 0.876 (0.885) | 0.265 (0.167) |
| $Dem_{t-1}$ | -0.329 (0.464) | -0.120 | -0.273 | -0.010 (0.353) | -0.351 (0.423) | -0.291 (0.654) | -0.255 (0.384) |
| $Dura_{t-1}$ | -0.001 (0.003) | - | - | 0.000 (0.003) | 0.001 (0.004) | 0.005 (0.011) | 0.001 (0.003) |

Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1. Clustered SEs are reported in parentheses.

Note that scarring torture is significant in none of the fitted models, which again casts doubt on the validity
of the proposed significant association between scarring torture and terrorism.

The performance of the fitted models is shown in Table 5, from which one can tell that the zero-inflated
model and the hurdle model stand out from the other three. Moreover, the plots of the
distribution of true and expected counts on the test set convince us that the zero-inflated model slightly
outperforms the hurdle model. A brief summary is that Daxecker should have considered fitting the data with a
zero-inflated model instead of NB2. She may have chosen NB2 because it is parsimonious, though.

Table 5: Performance of Various Regression Models on Test Set

| Measure | NB2 | NB2 LASSO | NB2 L2 | Hurdle | Zero-Inflated |
| --- | --- | --- | --- | --- | --- |
| MSE, Base | 1244.210 | 963.461 | 664.399 | 576.351 | 465.510 |
| MSE, Full | 1651.515 | 1175.452 | 580.492 | 502.261 | 590.999 |
| Log-likelihood, Base | -751.442 | -747.138 | -750.219 | -724.504 | -727.535 |
| Log-likelihood, Full | -738.266 | -736.624 | -736.948 | -713.698 | -725.249 |

3.3 Principal Component Analysis

Since most variables are non-significant in the regression models of the previous section, we want to see if the
dimension of the predictors in the full model can be reduced through principal component analysis (PCA).

As the plot of the component variances illustrates, the first two principal components (PCs) make up a large
share of the total variance. In fact, the sum of their variances accounts for 54% of the overall variance. This
indicates that they contain a considerable amount of the information in the whole dataset.
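The 54% figure is a proportion of variance explained (PVE). Writing $\sigma_k^2$ for the variance of the $k$-th principal component (notation introduced here, with 14 components because the full model has 14 predictors), it reads:

$$
\mathrm{PVE}_k = \frac{\sigma_k^2}{\sum_{j=1}^{14} \sigma_j^2}, \qquad
\mathrm{PVE}_1 + \mathrm{PVE}_2 \approx 0.54 .
$$

For comparison, if all 14 components carried equal variance, any two of them would account for only $2/14 \approx 14\%$ of the total.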

Table 6: Factor Loadings of the First Two Principal Components

| Variable | PC1 | PC2 |
| --- | --- | --- |
| $MavgDV$ | -0.0397471 | -0.3195717 |
| $SpDV$ | -0.0719273 | -0.0604280 |
| $Scarring$ | -0.1818698 | -0.3676354 |
| $Stealth$ | -0.1215768 | -0.4043683 |
| $Restrict$ | -0.0678978 | -0.1650857 |
| $MediaExpo$ | 0.1918648 | -0.4075648 |
| $Judi$ | 0.4553516 | -0.1161575 |
| $MediaFree$ | 0.3388678 | -0.0325676 |
| $Other$ | -0.3766586 | -0.1506809 |
| $Unstated$ | -0.2489377 | -0.3486599 |
| $GDP$ | 0.3805943 | -0.2110718 |
| $Pop$ | -0.0576056 | -0.3928653 |
| $Dem$ | 0.3886246 | -0.0880870 |
| $Dura$ | 0.2806228 | -0.1916120 |

To interpret the PCs, we take a careful look at their factor loadings. From Table 6, it can be seen that
the first PC puts emphasis mainly on variables like $Judi$, $MediaFree$, $Dem$ and $GDP$, which together summarize the
overall development level of a country. The second PC puts much of its emphasis on $Scarring$ and $Stealth$,
which describe the overall torture level of a country.

4 Binary Classification Using Machine Learning
In this section, we aim to make predictions about the occurrence of terrorist attacks using the variables from
the replication data.

Since whether terrorist attacks occur at all is of higher practical importance than the exact number of
terrorist incidents, we simplify our problem to binary classification. The response variable (label) is an
indicator defined by $Y_t = I\{DV_t > 0\}$. $Y_t = 1$ means that terrorist attacks occur within the country in
year $t$, while $Y_t = 0$ means no terrorist attacks in that year. As in the regression problem, $Y_t$ is predicted
from the independent variables at time $t-1$.

4.1 Techniques and Variables

To classify the data we consider 5 different machine learning techniques and 3 different classes of variables.
The classification methods are:

SVM (Linear Kernel): hyperparameters are chosen by cross-validation

SVM (Gaussian Kernel): hyperparameters are chosen by cross-validation

Unregularized Logistic Regression

Quadratic Discriminant Analysis (QDA)

Linear Discriminant Analysis (LDA)

Among them there are both discriminative models (SVM, Logistic) and generative models (LDA, QDA), which
makes it possible for us to compare the two philosophies.

The classes of variables are:

Class 1: Variables of the Base Model selected by LASSO logit
$MavgDV$, $SpDV$, $Scarring$, $Stealth$, $GDP$ and $Pop$

Class 2: Variables of the Full Model selected by LASSO logit
$MavgDV$, $SpDV$, $Scarring$, $Stealth$, $Restrict$, $MediaExpo$, $Judi$, $Other$, $GDP$ and $Pop$

Class 3: All 14 Variables
$MavgDV$, $SpDV$, $Scarring$, $Stealth$, $Restrict$, $MediaExpo$, $Judi$, $MediaFree$, $Other$, $Unstated$,
$GDP$, $Pop$, $Dem$ and $Dura$

Following the same data splitting as Section 3.2.2, we fit the models on the training set and evaluate their
performances on the test set. The base rate of the two classes in test set is 172:172, or 1:1. The training and
test accuracies are displayed in the coming section.

4.2 Performance Evaluation

From Table 7, we can see that all five classification methods perform quite well in prediction, with a test
accuracy of at least 66.9%. The highest test accuracy, 74.7%, is achieved by logistic regression with Class 1 variables.
In addition, SVM with a radial kernel outperforms SVM with a linear kernel on the test set, and QDA is consistently
better than LDA, which suggests that the assumption of equal covariance matrices for the two classes does not hold.

Note that the test accuracies for the 3 variable classes are quite similar, which means that the variables
within Class 1 contain most of the information, and thus it is good enough to classify using
them only.

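Table 7: Training and Test Accuracies of Various Methods

| Method | Training: Class 1 | Training: Class 2 | Training: Class 3 | Test: Class 1 | Test: Class 2 | Test: Class 3 |
| --- | --- | --- | --- | --- | --- | --- |
| SVM (Linear) | 0.664 | 0.713 | 0.717 | 0.674 | 0.674 | 0.674 |
| SVM (Radial) | 0.744 | 0.748 | 0.770 | 0.732 | 0.718 | 0.721 |
| Logistic | 0.739 | 0.731 | 0.731 | 0.747 | 0.706 | 0.709 |
| QDA | 0.681 | 0.700 | 0.721 | 0.718 | 0.724 | 0.724 |
| LDA | 0.686 | 0.709 | 0.714 | 0.689 | 0.669 | 0.674 |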

We further check the effectiveness of the classification methods by looking at the Receiver Operating
Characteristic (ROC) curve and the area under the curve (AUC). From the ROC plots, one can see that the five
classification methods achieve a good trade-off between sensitivity and specificity, as the curves bend
towards the left and top borders.
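The AUC has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it from that definition in plain Python. The function name and the toy scores are illustrative and are not the code behind the reported values.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative examples (equivalent to the Mann-Whitney U statistic)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up toy scores: three of the four positive-negative pairs are ordered correctly.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Unlike accuracy, it does not depend on a particular classification threshold.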

Taking the AUC as a measure, we can see that SVM with a radial kernel is now the best method (tied with
logistic regression for Class 2) in all three variable classes, while logistic regression is the second best. This
provides a slightly different perspective from the test accuracies. Moreover, Table 8 still supports the view that
variables not included in Class 1 add little predictive accuracy.

Table 8: Area under the Curve (AUC)

| Method | Class 1 | Class 2 | Class 3 |
| --- | --- | --- | --- |
| SVM (Linear) | 0.7361 | 0.7549 | 0.7361 |
| SVM (Radial) | 0.7925 | 0.7944 | 0.7981 |
| Logistic | 0.7871 | 0.7944 | 0.7926 |
| QDA | 0.7695 | 0.7817 | 0.7712 |
| LDA | 0.7635 | 0.7732 | 0.7735 |

5 Discussion
In this paper we attempt to replicate the significant relationship between scarring torture and terrorism
proposed by Daxecker. It turns out that in negative binomial regression the significance of the scarring variable
is achieved only when non-clustered standard errors are used. This undermines the validity of her
conclusion. Furthermore, comparisons between different regression models show that Daxecker's selection of
unregularized negative binomial regression (NB2) might be sub-optimal in terms of MSE and log-likelihood.
The zero-inflated model could be a better choice.

The second part of the paper concentrates on predicting whether there will be terrorist attacks
given several variables from the previous year, a binary classification problem. Logistic
regression and SVM with a radial kernel stand out according to test accuracy and AUC, respectively.

For further improvement, we may consider adding more variables (e.g. Polity score) to the model and
inspect whether they contribute to the explanation of the response, as well as their relationship with the
torture variables.
