Professional Documents
Culture Documents
than by data-driven or algorithmic approaches. These that the set of accurate classifiers is often large enough
manually-created risk assessment tools are used in pos- so that it contains interpretable classifiers (see Holte,
sibly every hospital; e.g., the TIMI scores, CHADS2 1993). Because the monotonicity properties we enforce
score, Apache scores, and the Ranson score, to name a are much stronger than those of Ben-David (1995);
few (Antman et al. , 2000; Morrow et al. , 2000; Gage Feelders & Pardoel (2003); Altendorf et al. (2005)
et al. , 2001; Knaus et al. , 1981, 1985, 1991; Ranson (we are looking at monotonicity along the whole list
et al. , 1974). These models can be computed without rather than for individual features), we do find that
a calculator, making them very practical as decision accuracy is sometimes sacrificed, but not always, and
aids. Of course, we aim for this level of interpretabil- generally not by much. On the other hand, it is pos-
ity in purely data-driven classifiers, with no manual sible that our method gains a level of practicality and
feature selection or rounding coefficients. interpretability that other methods simply cannot.
Algorithms that discretize the input space have gained Interpretability is very context dependent (see Ko-
in popularity purely because they yield interpretable dratoff, 1994; Pazzani, 2000; Freitas, 2014; Huysmans
models. Decision trees (Breiman et al. , 1984; Quinlan, et al. , 2011; Allahyari & Lavesson, 2011; Martens &
1986, 1993), as well as decision lists (Rivest, 1987), or- Baesens, 2010; Rüping, 2006; Verbeke et al. , 2011;
ganize a collection of simple rules into a larger logical Martens et al. , 2011), and no matter how one mea-
structure, and are popular despite being greedy. In- sures it in one domain, it can be different in the next
ductive logic programming (Muggleton & De Raedt, domain. A falling rule list used in medical practice has
1994) returns an unstructured set of conjunctive rules the benefit that it can, in practice, be as sparse as de-
such that an example is classified as positive if it sat- sired. Since it automatically stratifies patients by risk
isfies any of the rules in that set. An extremely sim- in the order used for decision making, physicians can
ple way to induce a probabilistic model from the un- choose to look at as much of the list as they need to
ordered set of rules given by an ILP method is to place make a decision; the list is as sparse as one requires it
them into a decision list (e.g., see Fawcett, 2008), or- to be. If physicians only care about the most high risk
dering rules by empirical risk. This is also done in as- patients, they look only at the top few rules, and check
sociative classification (e.g., see Thabtah, 2007). How- whether the patient obeys any of the top clauses.
ever, the resulting model cannot be expected to exhibit
The algorithm we provide for falling rule lists aims to
good predictive performance, as its constituent rules
have the best of all worlds: accuracy, interpretability,
were chosen with a different objective.
and computation. The algorithm starts with a statisti-
Since it is possible that decision tree methods can pro- cal assumption, which is that we can build an accurate
duce results that are inconsistent with monotonicity model from pre-mined itemsets. This helps tremen-
properties of the data, there is a subfield dedicated to dously with computation, and restricts us to build-
altering these greedy decision tree algorithms to obey ing models with only interpretable building blocks (see
monotonicity properties (Ben-David, 1995; Feelders & also Letham et al. , 2014; Wang et al. , 2014). Once the
Pardoel, 2003; Altendorf et al. , 2005). Studies showed itemsets are discovered, a Bayesian modeling approach
that in many cases, no accuracy is lost in enforcing chooses a subset and permutation of the rules to form
monotonicity constraints, and that medical experts the decision list. The user determines the desired size
were more willing to use the models with the mono- of the rule list through a Bayesian prior. Our gener-
tonicity constraints (Pazzani et al. , 2001). ative model is constructed so that the monotonicity
property is fully enforced (no “soft” monotonicity).
Even with (what seem like) rather severe constraints
on the hypothesis space such as monotonicity or spar- The code for fitting falling rule lists is available online1 .
sity in the number of leaves and nodes, it still seems 1
http://web.mit.edu/rudin/www/falling_rule_list
Fulton Wang, Cynthia Rudin
Thus, after reparameterizing, the parameters are list parameters θ = {L, c0,...,L−1 (·), K, γ0,...,L−1 } as de-
scribed in Equation (13),
θ = {L, {cl (·)}L−1 L−1
l=0 , {γl }l=0 , K}. (13)
ppost (L, c0,...,L−1 (·), K, γ0,...,L−1 |y1,...,N ; x1,...,N ).
(18)
2.3.2 Prior Specifics
∝ wj if cj (·) ∈
/ Θ and 0 otherwise. (16) L∗ , c0,...,L∗ −1 (·)∗ (20)
∗ ∗
Update Θ ← Θ ∪ {cl (·).} (17) ∈ argmaxL,{cl (·)}L−1 L(L, {cl (·)}L−1
l=0 , K , γ0,...,L−1 )
l=0
5. For l = 0, . . . , L − 1 draw γl ∼ Gamma1 (αl , βl ),
where
which is a Gamma distribution truncated to have
support only above 1. K ∗ , γ0,...,L−1
∗
(21)
6. Draw K ∼ Gamma(αK , βK ).
∈ argmaxK,γ0,...,L−1 L(L, {cl (·)}L−1
l=0 , K, γ0,...,L−1 ).
We now elaborate on our choice for each involved dis-
tribution. We let L ∼ Poisson(λ), where λ reflects the Note that K ∗ and γ0,...,L−1
∗
depend on L, {cl (·)}L−1
l=0 .
prior decision length desired by the user. We let cl (·) Furthermore, the solution to the subproblem of find-
be the result of a draw from a discrete distribution ing K ∗ and γ0,...,L−1
∗
can be approximated closely us-
over the yet unchosen rules, B \ {cl̀ (·)}l−1
l̀=0
, where the ing a simple procedure, as it involves maximizing the
l-th rule is drawn with probability proportional to a posterior probability of a decision list given the rules
user designed weight wl . For example, a rule might {cl (·)}L−1
l=0 . Optimizing Equation (20) lends itself bet-
be chosen with probability proportional to the num- ter to simulated annealing than Equation (19); opti-
ber of clauses in it. This allows the user to express mizing (20) involves optimization over a discrete space,
preferences over different types of clauses in the list. namely the set and order of rules cl (·). In this formula-
Given L, only {c(·)l }L−1
l=1 are observed, though note tion, at each simulated annealing iteration, we need to
this process specifies a joint distribution over all of evaluate the objective function for the current rule list
|B|−1
{c(·)l }l=1 . Letting {γl }L−1
l=0 to be independently dis- {cl (·)}L−1
l=0 , which involves solving the continuous sub-
tributed truncated gamma variables permits posterior problem of finding the corresponding K ∗ and γ0,...,L−1
∗
.
Gibbs sampling while enforcing the monotonicity con-
straints and still permitting diversity over prior dis- Given a objective function E(s) over discrete search
tributions. For example, one could encourage some of space S, a function specifying the set of neighbors of a
the γ’s near the middle (of L) of the list to be large, state N (s), and a temperature schedule function over
in which case the risks would be widely spaced in the time steps, T (t), a simulated annealing procedure is a
middle of the list (but this would force closer spacing discrete time, discrete state Markov Chain {st } where
at the top of the list where the risks concentrate near at time t, given the current state st , the next state st+1
1). Finally, K, which models the risk of patients not is chosen by first randomly selecting a proposal s̀ from
satisfying any rules, is Gamma distributed. the set N (s), and setting st+1 = s̀ with probability
min(1, exp(− E(s̀)−E(s)
T (t) )), and st+1 = st otherwise.
3 Fitting the Model The search space of the optimization problem of Equa-
L−1
tion (20) is L, {cl (·)}l=0 , the set of ordered lists of
First we describe our approach to finding the deci- rules drawing from the finite pre-mined set of rules B.
sion list with the maximum a posteriori probability. Based on Equation (20), we let
Then we discuss our approach to perform Monte Carlo L−1 ∗ ∗
sampling from the posterior distribution over decision E({cl (·)}l=0 ) = −L(L, {cl (·)}L−1
l=0 , K , γ0,...,L−1 ). (22)
Fulton Wang, Cynthia Rudin
We simultaneously define the set of neighbors and model, yn ∼ Bernoulli(logistic(rzn )), as before:
the process by which to randomly choose a neighbor
through the following random procedure that alters p(Yn = 1) = p(Un > 0) (26)
the current rule list {cl (·)}L−1
Z ∞
l=0 to produce a new rule
L̀−1 =1− p(Un = 0|ζn )p(ζn )dζn (27)
list {c̀l (·)}l=0 (The new list’s length may be different):
Z0 ∞
Choose uniformly at random one of the following 4 =1− exp(−ζn vzn ) exp(−ζn )dζn (28)
operations to apply to the current rule list, {cl (·)}L−1
l=0 :
0
= 1 − (1 + vzn )−1 (29)
1. SWAP: Select i 6= j uniformly from 0, . . . , L − 1,
and swap the rules at those 2 positions, letting exp(rzn )
= . (30)
c`i (·) ← cj (·) and c`j (·) ← ci (·). 1 + exp(rzn )
2. REPLACE: Select i uniformly from 0, . . . , L − 1,
draw c(·) from the the distribution 3.2.1 Schedule of Updates
|B|−1
pc(·) (·|Θ; B, {wl }l=0 ) defined in Equation (16), Given the augmented model, we cycle through the
where Θ = {cl (·)}l=0,...,i−1,i+1,...,L−1 and set following steps in the following deterministic or-
c̀i (·) ← c(·). der. These will each be discussed in detail shortly.
3. ADD: Choose one of the L + 1 possible in- Regarding notation, we will use use θaug to re-
sertion points uniformly at random, draw a fer to the parameters of the augmented model:
L−1
|B|−1
rule c(·) from pc(·) (·|Θ; B, {wl }l=0 ), where (L, {cl (·)}l=0 , K, {γl }L−1 N N
l=0 , {Un }n=1 , {ζn }n=1 ), so that
Θ = {cl (·)}l=0,...,L−1 , and insert it at the chosen Gibbs updates can be described more succinctly.
insertion point, so that L̀ ← L + 1. Step 1 (Gibbs steps for each γl ):
4. REMOVE: Choose i uniformly at random from Sample γ̀l ∼ pγl (·|(θaug \ γl ), {yn }N N
n=1 ; {xn }n=1 )
0, . . . , L − 1, and remove ci (·) from the current for l = 0, . . . , L − 1.
rule list, so that L̀ ← L − 1. Step 2 (Gibbs step for K):
Note that this approach optimizes over the full set of Sample K̀ ∼ pK (·|(θaug \ K), {yn }N N
n=1 ; {xn }n=1 ).
rule lists from itemsets, and does not rely on greedy
splitting. Even Bayesian tree methods that aim to Step 3 (Collapsed Metropolis Hastings Step):
traverse a wider search space use greedy splitting and Perform Metropolis-Hastings update over
local solutions, e.g. Chipman et al. (1998). (L, {cl (·)}L−1
l=0 ), under the original model from
Equation (18). This can be viewed as a collapsed
3.2 Obtaining the posterior Metropolis-Hastings step, where the collapsed param-
eters are the augmenting variables {Un }N N
n=1 , {ζn }z=1 .
To perform posterior sampling, we use Gibbs sampling Step 4 (Gibbs step for ({ζn }N N
n=1 , {Un }n=1 )):
steps over {γl }L−1
l=0 and K made possible by variable Jointly sample
augmentation, and Metropolis-Hastings steps over L
and {cl (·)}. We describe the variable augmentation ({ζ̀n }N N
n=1 , {Ùn }n=1 )
step, the schedule of updates we employ, and finally ∼ p{ζn }n ,{Un } (·|θaug \ ({ζn }n , {Un }n ), {yn }n ; {xn }n ).
the details of each individual update step.
Augmenting the model with two additional variables Mixing Gibbs and collapsed Metropolis-Hastings sam-
Un , ζn for each n = 1, . . . , N preserves the marginal pling steps requires special care to ensure the Markov
distribution over the original variables, and enables chain is proper in that it possesses the desired sta-
Gibbs sampling over K and each γl (see Dunson, 2004): tionary distribution. We refer to van Dyk & Park
(2011) for details, but do note that after the collapsed
ζn ∼ Exponential(1) for n = 1, . . . , N (23) Metropolis-Hastings step, first performing a Gibbs up-
Un ∼ Poisson(ζn vzn ) for n = 1, . . . , N (24) date of {ζn }N
n=1 , and then a separate Gibbs update for
Yn = 1(Un > 0) for n = 1, . . . , N. (25) {Un }N
n=1 (or in reverse) would not have been proper.
Table 2: Falling rule list for patients with no multiple readmissions history.
Rule Miner Radm Spam Mamm Breast Cars Ben-David, Arie. 1995. Monotonicity maintenance in
nFoil 7 24 8 5 25 information-theoretic machine learning algorithms.
FPGrowth 103 550 43 147 190 Machine Learning, 19(1), 29–43.
Borgelt, Christian. 2005. An Implementation of the
Table 5: Number of rules mined for each dataset.
FP-growth Algorithm. Pages 1–5 of: Proceedings of
the 1st international workshop on open source data
Dataset n p running time (sec)
mining: frequent pattern mining implementations.
Radm 7944 33 35 ACM.
Spam 4601 57 102
Mamm 961 13 35 Breiman, Leo, Friedman, Jerome H., Olshen,
Breast 683 26 42 Richard A., & Stone, Charles J. 1984. Classifica-
Cars 1728 21 39 tion and Regression Trees. Wadsworth.
Chipman, Hugh A, George, Edward I, & McCulloch,
Table 6: Running time for 5000 simulated annealing Robert E. 1998. Bayesian CART Model Search.
steps, for each dataset. Journal of the American Statistical Association,
93(443), 935–948.
6 Conclusion Cronin, Patrick, Ustun, Berk, Rudin, Cynthia, &
Greenwald, Jeffrey. 2014. Work In Progress.
We present a new class of interpretable predictive Dunson, David B. 2004. Bayesian isotonic regression
models that could potentially have a major benefit in for discrete outcomes. Tech. rept. Citeseer.
decision-making for some domains. As nicely stated
Elter, M, Schulz-Wendtland, R, & Wittenberg, T.
by the director of the U.S. National Institute of Jus-
2007. The prediction of breast cancer biopsy out-
tice (Ridgeway, 2013), an interpretable model that is
comes using two CAD approaches that both empha-
actually used is better than one that is more accurate
size an intelligible decision process. Medical physics,
that sits on a shelf. We envision that models pro-
34(11), 4164–4172.
duced by FRL can be used, for instance, by physicians
in third world countries who require models printed on Fawcett, Tom. 2008. PRIE: a system for generating
laminated cards. In high stakes decisions (like medical rulelists to maximize ROC performance. Data Min-
decisions), it is important we know whether to trust ing and Knowledge Discovery, 17(2), 207–224.
the model we are using to make decisions; models like Feelders, Ad, & Pardoel, Martijn. 2003. Pruning for
FRL help tell us when (and when not) to trust. monotone classification trees. Pages 1–12 of: Ad-
Acknowledgment: We gratefully acknowledge fund- vances in intelligent data analysis V. Springer.
ing from Wistron, Siemens, and NSF-CAREER IIS- Freitas, Alex A. 2014. Comprehensible classification
1053407. models: a position paper. ACM SIGKDD Explo-
rations Newsletter, 15(1), 1–10.
References Gage, Brian F, Waterman, Amy D, Shannon, William,
Boechler, Michael, Rich, Michael W, & Radford,
Allahyari, Hiva, & Lavesson, Niklas. 2011. User- Martha J. 2001. Validation of clinical classification
oriented Assessment of Classification Model Under- schemes for predicting stroke. The Journal of the
standability. Pages 11–19 of: SCAI. American Medical Association, 285(22), 2864–2870.
Altendorf, Eric E, Restificar, Angelo C, & Dietterich, Holte, Robert C. 1993. Very simple classification rules
Thomas G. 2005. Learning from sparse data by ex- perform well on most commonly used datasets. Ma-
ploiting monotonicity constraints. Pages 18–26 of: chine learning, 11(1), 63–90.
Conf. Uncertainty in Artificial Intelligence. AUAI
Press. Huysmans, Johan, Dejaeger, Karel, Mues, Christophe,
Vanthienen, Jan, & Baesens, Bart. 2011. An empir-
Antman, Elliott M, Cohen, Marc, Bernink, Peter JLM,
ical evaluation of the comprehensibility of decision
McCabe, Carolyn H, Horacek, Thomas, Papuchis,
table, tree and rule based predictive models. Deci-
Gary, Mautner, Branco, Corbalan, Ramon, Radley,
sion Support Systems, 51(1), 141–154.
David, & Braunwald, Eugene. 2000. The TIMI
risk score for unstable angina/non–ST elevation MI. Knaus, William A, Zimmerman, Jack E, Wagner,
The Journal of the American Medical Association, Douglas P, Draper, Elizabeth A, & Lawrence, Di-
284(7), 835–842. ane E. 1981. APACHE-acute physiology and chronic
health evaluation: a physiologically based classifica-
Bache, K., & Lichman, M. 2013. UCI Machine Learn-
tion system. Critical Care Medicine, 9(8), 591–597.
ing Repository.
Knaus, William A, Draper, Elizabeth A, Wagner, Dou- the role of operative management in acute pancre-
glas P, & Zimmerman, Jack E. 1985. APACHE II: atitis. Surgery, gynecology & obstetrics, 139(1), 69.
a severity of disease classification system. Critical Ridgeway, Greg. 2013. The Pitfalls of Prediction. NIJ
Care Medicine, 13(10), 818–829. Journal, National Institute of Justice, 271, 34–40.
Knaus, William A, Wagner, DP, Draper, EA, Zim- Rivest, Ronald L. 1987. Learning decision lists. Ma-
merman, JE, Bergner, Marilyn, Bastos, PG, Sirio, chine learning, 2(3), 229–246.
CA, Murphy, DJ, Lotring, T, & Damiano, A. 1991.
The APACHE III prognostic system. Risk predic- Rüping, Stefan. 2006. Learning interpretable models.
tion of hospital mortality for critically ill hospital- Ph.D. thesis, Universität Dortmund.
ized adults. Chest Journal, 100(6), 1619–1636. Thabtah, Fadi. 2007. A review of associative classifi-
Kodratoff, Y. 1994. The comprehensibility manifesto. cation mining. The Knowledge Engineering Review,
KDD Nugget Newsletter, 94(9). 22(March), 37–65.
Landwehr, Niels, Kersting, Kristian, & De Raedt, Luc. van Dyk, David A, & Park, Taeyoung. 2011. Par-
2005. nFOIL: Integrating naıve bayes and FOIL. tially collapsed Gibbs sampling and path-adaptive
Metropolis-Hastings in high-energy astrophysics.
Letham, Benjamin, Rudin, Cynthia, McCormick, Handbook of Markov Chain Monte Carlo, 383–397.
Tyler H., & Madigan, David. 2014. An interpretable
model for stroke prediction using rules and Bayesian Verbeke, Wouter, Martens, David, Mues, Christophe,
analysis. In: Proceedings of 2014 KDD Workshop on & Baesens, Bart. 2011. Building comprehensible
Data Science for Social Good. customer churn prediction models with advanced
rule induction techniques. Expert Systems with Ap-
Martens, David, & Baesens, Bart. 2010. Building ac- plications, 38(3), 2354–2364.
ceptable classification models. Pages 53–74 of: Data
Mining. Springer. Wang, Tong, Rudin, Cynthia, Doshi, Finale, Liu,
Yimin, Klampfl, Erica, & MacNeille, Perry. 2014.
Martens, David, Vanthienen, Jan, Verbeke, Wouter, & Bayesian Ors of Ands for Interpretable Classifi-
Baesens, Bart. 2011. Performance of classification cation with Application to Context Aware Recom-
models from a user perspective. Decision Support mender Systems. Tech. rept.
Systems, 51(4), 782–793.
Morrow, David A, Antman, Elliott M, Charlesworth,
Andrew, Cairns, Richard, Murphy, Sabina A,
de Lemos, James A, Giugliano, Robert P, McCabe,
Carolyn H, & Braunwald, Eugene. 2000. TIMI risk
score for ST-elevation myocardial infarction: a con-
venient, bedside, clinical score for risk assessment
at presentation an intravenous nPA for treatment of
infarcting myocardium early II trial substudy. Cir-
culation, 102(17), 2031–2037.
Muggleton, Stephen, & De Raedt, Luc. 1994. Induc-
tive logic programming: Theory and methods. The
Journal of Logic Programming, 19, 629–679.
Pazzani, Michael J. 2000. Knowledge discovery from
data? Intelligent systems and their applications,
IEEE, 15(2), 10–12.
Pazzani, MJ, Mani, S, & Shankle, WR. 2001. Ac-
ceptance of rules generated by machine learning
among medical experts. Methods of information in
medicine, 40(5), 380–385.
Quinlan, J. Ross. 1986. Induction of decision trees.
Machine learning, 1(1), 81–106.
Quinlan, J. Ross. 1993. C4. 5: programs for machine
learning. Vol. 1. Morgan kaufmann.
Ranson, JH, Rifkind, KM, Roses, DF, Fink, SD, Eng,
K, Spencer, FC, et al. . 1974. Prognostic signs and