Cybernetics and Systems
An International Journal
ISSN: 0196-9722 (Print) 1087-6553 (Online) Journal homepage: https://www.tandfonline.com/loi/ucbs20
Flexible Distribution-Based Regression Models for Count Data: Application to Medical Diagnosis
Pantea Koochemeshkian, Nuha Zamzami & Nizar Bouguila
To cite this article: Pantea Koochemeshkian, Nuha Zamzami & Nizar Bouguila (2020): Flexible
Distribution-Based Regression Models for Count Data: Application to Medical Diagnosis,
Cybernetics and Systems, DOI: 10.1080/01969722.2020.1758464
To link to this article: https://doi.org/10.1080/01969722.2020.1758464
Published online: 19 May 2020.
Pantea Koochemeshkian (a), Nuha Zamzami (b, c), and Nizar Bouguila (b)
(a) Electrical and Computer Engineering Department, Concordia University, Montreal, Canada
(b) Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Canada
(c) Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
ABSTRACT
Data mining techniques have been successfully utilized in different applications of significant fields, including medical research. Despite the wealth of data available within healthcare systems, there is a lack of practical analysis tools to discover hidden relationships and trends in data. The complexity of medical data, which is unfavorable for most models, is a considerable challenge in prediction. The ability of a model to perform accurately and efficiently in disease diagnosis is extremely significant. Thus, the model must be selected to fit the data well, such that learning from previous data is most efficient and the diagnosis of the disease is highly accurate. This work is motivated by the limited number of regression analysis tools for multivariate counts in the literature. We propose two regression models for count data based on flexible distributions, namely, the multinomial Beta-Liouville and the multinomial scaled Dirichlet, and evaluate the proposed models on the problem of disease diagnosis. The performance is evaluated based on the accuracy of the prediction, which depends on the nature and complexity of the dataset. Our results show the efficiency of the two proposed regression models: the prediction performance of both models is competitive with other previously used regression models for count data and with the best results in the literature.
KEYWORDS
Distribution-based regression; multinomial; Beta-Liouville; scaled Dirichlet; count data; bio-medical data mining; Maximum Likelihood Estimation (MLE)
1. Introduction
Data mining techniques have attracted increasing attention from scholars due to their successful implementation in numerous applications in different fields such as genomics, sports, medicine, image analysis, epidemiology, marketing, criminology, industrial statistics, and text mining (Zhang et al. 2017; Nikoloulopoulos and Karlis 2009; Bouguila and Amayri 2009). In the medical field, for instance, the majority of doctors do not possess expertise in every sub-specialty. Thus, the automation of disease diagnosis would be extremely advantageous. Data mining and machine learning have become formidable tools within the field of medical research. Using data mining techniques, researchers were able to uncover new information and insights into specific significant medical areas such as the diagnosis of different diseases (Gavin III et al. 1997; Soni et al. 2011).
CONTACT Pantea Koochemeshkian, p_kooche@encs.concordia.ca, Electrical and Computer Engineering Department, Concordia University, Montreal, Canada.
Present address: Department of Computer Science and Artificial Intelligence, University of Jeddah, Jeddah, Saudi Arabia.
© 2020 Taylor & Francis Group, LLC
Classical data mining techniques can be grouped into three major categories: regression, classification, and clustering. For instance, classification
models that describe and distinguish data classes or concepts have been
used to analyze the information of patients with different diseases (Karstoft
et al. 2017). The classification models are derived based on the analysis of a
set of training data where the class labels of the data objects are known,
and the model is then used to predict the class label of unseen objects
(Rahman and Afroz 2013). Regression, on the other hand, focuses on finding dependencies between objects and predicting target values given training samples of objects and their related target values. This method is called
“induction” (Herbrich, Graepel, and Obermayer 1999), and it involves
assertions that provide only a finite set of observations. It is commonly rec-
ognized that any induction involves some limitations on the presumed
dependencies (Herbrich, Graepel, and Obermayer 1999). Over the past
years, data mining and machine learning researchers have been primarily
working on classification and regression estimation issues, as the issue of
ordinal regression shares the features of both of these functions (Herbrich,
Graepel, and Obermayer 1999). The majority of the proposed methods
make their previous understanding explicit by limiting the range of the pre-
sumed dependencies without creating any distributional claims (Vapnik
and Vapnik 1998; Karstoft et al. 2017).
In this paper, we propose distribution-based regression approaches using
efficient generative models for multivariate discrete data to overcome some
classification issues. Regression models are used to capture the relationship between response variables and a few essential independent variables. The objective of a regression model is to learn the effect of one or more variables on a dependent variable, so that it can be used to predict one or more response variables (Powers and Xie 2008). In regression analysis, the model predicts the values of the dependent variables from a simple function of the independent variables as closely as possible (Powers and Xie 2008). Notably, regression models have been widely used in the literature as powerful tools to
tackle several scientific issues (Ferrari and Cribari-Neto 2004). Examples of
successful regression models include multivariate linear regression
(Maronna 2011), least-squares regression (Andrews 1974), and distribution-based regression for compositional and count data (Bayes, Bazan, and García 2012; Zhang et al. 2017; Ankam and Bouguila 2019). For instance,
Zhang et al. (2017) have examined regression models for multivariate count
data with efficient distributions for analyzing complex genomic data. The
authors proposed regression models based on Dirichlet Multinomial and
Generalized Dirichlet Multinomial that overcome some limitations of the
multinomial model (Bouguila 2009). In this work, we further investigate
the problem of analyzing multivariate count responses with other flexible dis-
tributions that overcome both specific mean-variance structure and the nega-
tive correlation requirement of the Dirichlet distribution as a prior to the
Multinomial. More precisely, we propose two regression models based on flexible distributions for count data, namely, the Multinomial Beta-Liouville and the Multinomial scaled Dirichlet. First, we introduce the response distributions, propose the link functions, derive the score and information matrices for estimating the parameters, and give the complete regression algorithm. Then, we show the efficiency of the proposed models in analyzing high-throughput data in genomics. Furthermore, we use the proposed models to investigate the diagnosis of three different diseases, namely, heart attack, breast cancer, and diabetes.
The rest of this article is organized as follows. Section 2 discusses previous
works in distribution-based regression models for count data. In Section 3,
we propose two distribution-based regression models where we first discuss
the properties of the considered distributions, then propose the link func-
tions and provide all the details about the models’ parameters estimation.
Section 4 is devoted to the application of the proposed models to real genomics and medical data and to the discussion of the results. Section 5 gives concluding remarks.
2. Related Work
Count data frequently appear in instances where incidences of several asso-
ciated occurrences are measured by counting them. If multivariate counts
are accessible, there is often an interest in investigating the dependencies
among them. However, the applications and techniques for analyzing
multivariate count data are comparatively uncommon (Wedel, Böckenholt,
and Kamakura 2003). In this section, we review the related works in count
data regression. In all the models reviewed here, we denote our dataset by $\mathcal{X} = \{W_1, \dots, W_n\}$, which consists of $n$ independent vectors $W_j = (X_j, Y_j)$, where $X_j = (x_{j1}, x_{j2}, \dots, x_{jd})^T$ is a $d$-dimensional response vector, and $Y_j = (y_{j1}, y_{j2}, \dots, y_{jp})^T$ is a $p$-dimensional covariate vector.
2.1. Dirichlet-Multinomial (DM) Regression

The Dirichlet distribution (Mosimann 1962) is a generalization of the Beta distribution, offering significant flexibility and ease of use. The Dirichlet
distribution has the advantage that by varying its parameters (Bouguila and
Ziou 2005b) it permits multiple modes and asymmetries and can thus
approximate a wide variety of shapes (Bouguila, Ziou, and Vaillancourt 2004,
Bouguila and Ziou 2005a). The Dirichlet distribution is commonly used given
its flexibility and its several interesting properties, such as the consistency of
its estimates, and its ease of use as well as the fact that it is conjugate to the
multinomial distribution. Considering the Dirichlet as a prior distribution to
the multinomial results in the Dirichlet Multinomial (DM) Distribution
(Wang and Zhao 2017; Bouguila, Ziou, and Vaillancourt 2003).
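The probability of a $d$-dimensional count vector $X = (x_1, \dots, x_d)$, with $m = \sum_{i=1}^{d} x_i$, that follows a multinomial distribution with parameters $\rho = (\rho_1, \dots, \rho_d)$, is given by:
$$\mathcal{M}(X \mid \rho) = \binom{m}{X} \prod_{i=1}^{d} \rho_i^{x_i} \tag{1}$$
where $\binom{m}{X} = m! / \prod_{i=1}^{d} x_i!$ is the multinomial coefficient.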
The most popular multinomial-logit model uses the joint distribution based on the multinomial and the Dirichlet, which is called the Dirichlet-Multinomial (DM) distribution (Madsen, Kauchak, and Elkan 2005). The probability of a vector $X$ over $m$ possible trials following the DM distribution, with parameters $\alpha = (\alpha_1, \dots, \alpha_d)$, is given by (Zhang et al. 2017):
$$DM(X \mid \alpha) = \binom{m}{X} \frac{\Gamma(|\alpha|)}{\Gamma(|\alpha| + m)} \prod_{i=1}^{d} \frac{\Gamma(x_i + \alpha_i)}{\Gamma(\alpha_i)} = \binom{m}{X} \frac{\prod_{i=1}^{d} (\alpha_i)_{x_i}}{(|\alpha|)_{m}} \tag{2}$$
where $(|\alpha|)_{m} = |\alpha| (|\alpha| + 1) \cdots (|\alpha| + m - 1)$ denotes the rising factorial, and $|\alpha| = \sum_{i=1}^{d} \alpha_i$.
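The two forms of Eq. (2) can be checked against each other numerically. The following is a minimal Python sketch, not the authors' MATLAB implementation; the function names and the sample values of the count vector and parameters are hypothetical and chosen only for illustration.

```python
from math import exp, lgamma, log

def log_multinomial_coeff(x):
    """Log of the multinomial coefficient m! / (x_1! ... x_d!)."""
    return lgamma(sum(x) + 1) - sum(lgamma(xi + 1) for xi in x)

def log_rising(a, n):
    """Log of the rising factorial a (a+1) ... (a+n-1); 0 when n == 0."""
    return sum(log(a + k) for k in range(n))

def dm_logpmf_gamma(x, alpha):
    """Log of Eq. (2), first form (ratios of Gamma functions)."""
    m, a = sum(x), sum(alpha)
    return (log_multinomial_coeff(x) + lgamma(a) - lgamma(a + m)
            + sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(x, alpha)))

def dm_logpmf_rising(x, alpha):
    """Log of Eq. (2), second form (rising factorials)."""
    m, a = sum(x), sum(alpha)
    return (log_multinomial_coeff(x)
            + sum(log_rising(ai, xi) for xi, ai in zip(x, alpha))
            - log_rising(a, m))

# Hypothetical example values.
x, alpha = [2, 0, 3], [0.5, 1.5, 2.0]
print(dm_logpmf_gamma(x, alpha), dm_logpmf_rising(x, alpha))
```

Working on the log scale with `lgamma` avoids overflow for large counts; the rising-factorial form is the one that leads directly to the sums over $k$ in the regression log-likelihood.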
Even though the DM regression enables the parameterization of the
multi-class correlation coefficient for unit-specific covariates, it may dis-
close additional information that may not be identified by the grouped
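conditional logit model (Guimaraes and Lindrooth 2005). The inverse link function $\alpha_i = e^{\boldsymbol{\alpha}_i^T Y}$ relates the parameters $\alpha = (\alpha_1, \dots, \alpha_d)$ of the DM distribution to the covariates $Y$, where $\boldsymbol{\alpha}_i$ is the vector of regression coefficients for the $i$-th category. The complete log-likelihood for $n$ independent data points in this case is given by (Zhang et al. 2017; Guimaraes and Lindrooth 2005):
$$L_n(X \mid \alpha) = \sum_{j=1}^{n} \ln \binom{m_j}{X_j} + \sum_{j=1}^{n} \sum_{i=1}^{d} \sum_{k=0}^{x_{ji} - 1} \ln\left(e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) - \sum_{j=1}^{n} \sum_{k=0}^{m_j - 1} \ln\left(\sum_{i=1}^{d} e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) \tag{3}$$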
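As an illustration of how the exponential inverse link feeds the DM log-likelihood of Eq. (3), the sketch below evaluates it for a small dataset. It is written in Python rather than the MATLAB used by the authors, and the names `dm_regression_loglik` and `coef` are hypothetical. The Gamma-function form of Eq. (2) is used, which is equivalent to the sums over $k$.

```python
from math import exp, lgamma, log

def dm_regression_loglik(X, Y, coef):
    """DM regression log-likelihood with the exponential inverse link.

    X    : list of n count vectors (each of length d)
    Y    : list of n covariate vectors (each of length p)
    coef : list of d coefficient vectors (each of length p), one per category
    """
    total = 0.0
    for x, y in zip(X, Y):
        # Inverse link: alpha_i = exp(coef_i . y), always positive.
        alpha = [exp(sum(c * v for c, v in zip(ci, y))) for ci in coef]
        m, a = sum(x), sum(alpha)
        total += lgamma(m + 1) - sum(lgamma(xi + 1) for xi in x)
        total += lgamma(a) - lgamma(a + m)
        total += sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(x, alpha))
    return total

# Hypothetical data: one observation, d = 2 categories, p = 2 covariates.
# Zero coefficients give alpha = (1, 1), i.e. a uniform DM over compositions.
print(dm_regression_loglik([[1, 2]], [[1.0, 0.5]], [[0.0, 0.0], [0.0, 0.0]]))
```

With all coefficients set to zero every $\alpha_i$ equals 1, so for $d = 2$ and $m = 3$ each of the four possible count vectors has probability 1/4, which gives a quick sanity check of the implementation.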
Estimating the Dirichlet multinomial regression model does not present any specific challenge, and its numerical optimization based on the Newton-Raphson algorithm makes quick progress toward the maximum (Guimaraes and Lindrooth 2005). However, the Dirichlet has some disadvantages, such as its very restrictive negative covariance structure and the fact that variables with the same mean must have the same variance, which limits its applicability to many data sets (Bouguila 2008; Wong 2009). To handle these disadvantages, Zhang et al. (2017) proposed a regression model with a more flexible mean-covariance and correlation structure based on the generalized Dirichlet multinomial distribution (Bouguila 2008).
2.2. Generalized Dirichlet Multinomial (GDM) Regression

The Generalized Dirichlet (GD) distribution was introduced by Connor and Mosimann (1969), and it has a more general covariance structure than the Dirichlet distribution. The generalized Dirichlet distribution, in fact, relaxes both constraints of the Dirichlet distribution, namely the negative-correlation and the equal-confidence requirements. Thus, it has been shown to be a more appropriate prior in Bayesian learning situations (Bouguila 2008; Bouguila and Ziou 2006). Similar to the Dirichlet, the generalized Dirichlet is conjugate to the multinomial distribution, but it is more practical for several real-life applications (Bouguila 2008; Zamzami and Bouguila 2018a). The composition of the generalized Dirichlet and the multinomial gives the generalized Dirichlet Multinomial (GDM) distribution. The probability mass of a GDM for a count vector $X = (x_1, \dots, x_d)$ with a parameter set $\xi = (\alpha_1, \dots, \alpha_{d-1}, \beta_1, \dots, \beta_{d-1})$ and $\alpha_i, \beta_i > 0$, is given by (Bouguila 2008; Zhang et al. 2017):
$$GDM(X \mid \xi) = \binom{m}{X} \prod_{i=1}^{d-1} \frac{\Gamma(\alpha_i + x_i)}{\Gamma(\alpha_i)} \frac{\Gamma(\beta_i + z_{i+1})}{\Gamma(\beta_i)} \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i + \beta_i + z_i)} = \binom{m}{X} \prod_{i=1}^{d-1} \frac{(\alpha_i)_{x_i} (\beta_i)_{z_{i+1}}}{(\alpha_i + \beta_i)_{z_i}} \tag{4}$$
where $z_i = \sum_{l=i}^{d} x_l$ is the cumulative sum.
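The rising-factorial form of Eq. (4) translates almost line by line into code. The following Python sketch is an illustration under assumed names (`gdm_logpmf` and its helpers are hypothetical, and Python is used instead of the authors' MATLAB); it uses 0-based indices, so `z[i]` holds $x_{i+1} + \dots + x_d$ in the paper's 1-based notation.

```python
from math import exp, lgamma, log

def log_multinomial_coeff(x):
    """Log of the multinomial coefficient m! / (x_1! ... x_d!)."""
    return lgamma(sum(x) + 1) - sum(lgamma(xi + 1) for xi in x)

def log_rising(a, n):
    """Log of the rising factorial a (a+1) ... (a+n-1); 0 when n == 0."""
    return sum(log(a + k) for k in range(n))

def gdm_logpmf(x, alpha, beta):
    """Log of Eq. (4); alpha and beta each have d - 1 entries."""
    d = len(x)
    # Tail sums: z[i] = x[i] + x[i+1] + ... + x[d-1] (0-based).
    z = [sum(x[i:]) for i in range(d)]
    out = log_multinomial_coeff(x)
    for i in range(d - 1):
        out += log_rising(alpha[i], x[i])
        out += log_rising(beta[i], z[i + 1])
        out -= log_rising(alpha[i] + beta[i], z[i])
    return out

# Hypothetical example: d = 3 categories, so two alpha and two beta values.
print(exp(gdm_logpmf([1, 0, 2], [0.7, 1.2], [2.0, 0.4])))
```

Summing `exp(gdm_logpmf(...))` over every count vector with a fixed total $m$ should give 1, which is a convenient check that the indices of the tail sums are aligned correctly.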
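To relate the covariates $Y$ to the parameters, for $1 \le i \le d - 1$, the following link functions have been used by Zhang et al. (2017): $\alpha_i = e^{\boldsymbol{\alpha}_i^T Y}$ and $\beta_i = e^{\boldsymbol{\beta}_i^T Y}$. Now, letting the parameter set $\xi = \{\boldsymbol{\alpha}, \boldsymbol{\beta}\}$ represent all regression coefficients, the log-likelihood is given by (Zhang et al. 2017):
$$L_n(X \mid \xi) = \sum_{j=1}^{n} \ln \binom{m_j}{X_j} + \sum_{j=1}^{n} \sum_{i=1}^{d-1} \left[ \sum_{k=0}^{x_{ji} - 1} \ln\left(e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) + \sum_{k=0}^{z_{j,i+1} - 1} \ln\left(e^{\boldsymbol{\beta}_i^T Y_j} + k\right) - \sum_{k=0}^{z_{ji} - 1} \ln\left(e^{\boldsymbol{\alpha}_i^T Y_j} + e^{\boldsymbol{\beta}_i^T Y_j} + k\right) \right] \tag{5}$$
where $z_{ji} = \sum_{l=i}^{d} x_{jl}$.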
Indeed, the generalized Dirichlet Multinomial is a more suitable distribution for modeling count data than the widely used Dirichlet Multinomial. It acquires its flexibility from the fact that the generalized Dirichlet has a more flexible covariance structure, and it has one more set of parameters that grants it $d - 1$ extra degrees of freedom to better fit real data. Both DM and GDM have been well studied in the literature (see, for instance, Bouguila 2008; Zhang et al. 2017; Guimaraes and Lindrooth 2005, for more details). In the following, we introduce two novel regression models based on alternative distributions that have shown superior performance in modeling count data, namely, the Multinomial Beta-Liouville (MBL) distribution (Bouguila 2011) and the Multinomial scaled Dirichlet (MSD) distribution (Zamzami and Bouguila 2018b, 2019b).
3. The Proposed Regression Models

In this section, we give the details of the proposed models for multivariate
count responses. For each proposed model, we first discuss the properties
of the fitting distribution, then we propose the link functions and discuss
the maximum likelihood estimation procedure. Finally, we give the com-
plete learning algorithm.
3.1. The Considered Distributions

3.1.1. The Multinomial Beta-Liouville (MBL) Distribution
The Liouville family of the second kind includes the Dirichlet distribution
as a special case if all variables in the Liouville random vector have the
same normalized variance, and the density generator variate has a Beta distribution (Wong 2009). Choosing the Beta distribution as the generating density results in what is commonly called the Beta-Liouville distribution (Aitchison 1982). Like the Dirichlet, the Beta-Liouville is a conjugate prior to the multinomial distribution, and it can overcome the main restrictions of the Dirichlet distribution. Moreover, the two additional parameters in the Beta-Liouville can be used to adjust the spread of the distribution,
which makes it more practical and provides better modeling capabilities.
Considering the Beta-Liouville as a prior to the multinomial results in a
flexible joint distribution called the Multinomial Beta-Liouville (MBL)
(Bouguila 2011).
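The probability of a count vector $X = (x_1, \dots, x_d)$ over $m = \sum_{i=1}^{d} x_i$ trials following the MBL model with a parameter set $\theta = (\alpha_1, \dots, \alpha_{d-1}, \alpha, \beta)$ is given by (Bouguila 2011):
$$MBL(X \mid \theta) = \binom{m}{X} \frac{\Gamma\left(\sum_{i=1}^{d-1} \alpha_i\right) \Gamma(\alpha + \beta)\, \Gamma(\alpha')\, \Gamma(\beta') \prod_{i=1}^{d-1} \Gamma(\alpha_i')}{\Gamma\left(\sum_{i=1}^{d-1} \alpha_i'\right) \Gamma(\alpha' + \beta')\, \Gamma(\alpha)\, \Gamma(\beta) \prod_{i=1}^{d-1} \Gamma(\alpha_i)} = \binom{m}{X} \frac{(\alpha)_{z_{d-1}} (\beta)_{x_d} \prod_{i=1}^{d-1} (\alpha_i)_{x_i}}{(|\alpha|)_{z_{d-1}} (\alpha + \beta)_{m}} \tag{6}$$
where $z_i = \sum_{k=1}^{i} x_k$, $|\alpha| = \sum_{i=1}^{d-1} \alpha_i$, $\alpha_i' = \alpha_i + x_i$, $\alpha' = \alpha + z_{d-1}$, and $\beta' = \beta + x_d$. Note that by substituting $\alpha = \sum_{i=1}^{d-1} \alpha_i$ and $\beta = \alpha_d$, the MBL reduces to the Dirichlet Multinomial (Eq. 2). Indeed, MBL is an attractive distribution to fit count data, given that it has fewer parameters than the GDM with comparable performance (Bouguila 2011).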
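The reduction of the MBL to the Dirichlet Multinomial noted for Eq. (6) can be verified numerically. The Python sketch below is illustrative only (hypothetical function names, Python instead of the authors' MATLAB) and assumes that $\alpha'$ is $\alpha$ plus the sum of the first $d - 1$ counts and that $\beta' = \beta + x_d$.

```python
from math import exp, lgamma

def log_multinomial_coeff(x):
    """Log of the multinomial coefficient m! / (x_1! ... x_d!)."""
    return lgamma(sum(x) + 1) - sum(lgamma(xi + 1) for xi in x)

def mbl_logpmf(x, alphas, alpha, beta):
    """Log of the Gamma-function form of Eq. (6); alphas has d - 1 entries.

    Assumption: alpha' = alpha + (x_1 + ... + x_{d-1}), beta' = beta + x_d.
    """
    head, m, s = sum(x[:-1]), sum(x), sum(alphas)
    out = log_multinomial_coeff(x)
    out += lgamma(s) - lgamma(s + head)
    out += lgamma(alpha + beta) - lgamma(alpha + beta + m)
    out += lgamma(alpha + head) - lgamma(alpha)
    out += lgamma(beta + x[-1]) - lgamma(beta)
    out += sum(lgamma(a + xi) - lgamma(a) for a, xi in zip(alphas, x[:-1]))
    return out

def dm_logpmf(x, alpha):
    """Log of the Dirichlet-Multinomial pmf of Eq. (2), for comparison."""
    m, a = sum(x), sum(alpha)
    return (log_multinomial_coeff(x) + lgamma(a) - lgamma(a + m)
            + sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(x, alpha)))

# Hypothetical values: with alpha = alpha_1 + alpha_2 and beta = alpha_3,
# the MBL should coincide with the DM with parameters (alpha_1, alpha_2, alpha_3).
x, a = [2, 1, 3], [0.8, 1.5, 2.2]
print(mbl_logpmf(x, a[:2], a[0] + a[1], a[2]), dm_logpmf(x, a))
```

The two printed values should agree up to floating-point error, since the $\Gamma(\alpha')/\Gamma(\alpha)$ factor cancels against the factor involving the sum of the $\alpha_i$.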
3.1.2. Multinomial Scaled Dirichlet (MSD) Distribution

The scaled Dirichlet (Monti, Mateu-Figueras, and Pawlowsky-Glahn 2011) is
another generalization of the Dirichlet distribution, which has been proposed to
overcome the Dirichlet limitation of not considering the similar positions between
categories or multinomial cells. Besides, it has a general and more flexible variance
and covariance structure, given the fact that it has one more parameter to model
the variance of each dimension independently. Furthermore, the scaled Dirichlet has been shown to be an interesting prior to the multinomial, resulting in an efficient hierarchical Bayesian model called the Multinomial Scaled Dirichlet (MSD), proposed by Zamzami and Bouguila (2018b). Indeed, MSD has been shown to have high flexibility in count data modeling, with superior performance in many challenging applications (Zamzami and Bouguila 2018b, 2019b, 2019a).
The scaled Dirichlet has two parameter vectors: $\alpha = (\alpha_1, \dots, \alpha_d)$ is the shape parameter and $\beta = (\beta_1, \dots, \beta_d)$ is the scale parameter (Monti, Mateu-Figueras, and Pawlowsky-Glahn 2011). It is noteworthy that when all elements of the vector $\beta$ are equal to some constant, the Dirichlet distribution becomes a special case of the scaled Dirichlet. Therefore, the scaled Dirichlet with $d$ extra parameters is more flexible than the Dirichlet distribution (Hankin 2010; Oboh and Bouguila 2017; Zamzami et al. 2020). The probability of a count vector $X = (x_1, \dots, x_d)$, with $m = \sum_{i=1}^{d} x_i$, following a multinomial scaled Dirichlet with a set of parameters $\vartheta = \{\alpha, \beta\}$ and $|\alpha| = \sum_{i=1}^{d} \alpha_i$, is given by (Zamzami and Bouguila 2018b):
$$MSD(X \mid \vartheta) = \binom{m}{X} \frac{\Gamma(|\alpha|)}{\Gamma(m + |\alpha|) \prod_{i=1}^{d} \beta_i^{x_i}} \prod_{i=1}^{d} \frac{\Gamma(x_i + \alpha_i)}{\Gamma(\alpha_i)} = \binom{m}{X} \frac{\prod_{i=1}^{d} (\alpha_i)_{x_i}}{(|\alpha|)_{m} \prod_{i=1}^{d} \beta_i^{x_i}} \tag{7}$$
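Eq. (7) differs from the DM only by the factor $\prod_i \beta_i^{x_i}$ in the denominator, which the following Python sketch makes explicit. The function name `msd_logpmf` and the example values are hypothetical, and Python is used here instead of the authors' MATLAB.

```python
from math import lgamma, log

def log_multinomial_coeff(x):
    """Log of the multinomial coefficient m! / (x_1! ... x_d!)."""
    return lgamma(sum(x) + 1) - sum(lgamma(xi + 1) for xi in x)

def msd_logpmf(x, alpha, beta):
    """Log of Eq. (7): the DM terms minus sum_i x_i * log(beta_i)."""
    m, a = sum(x), sum(alpha)
    dm_part = (log_multinomial_coeff(x) + lgamma(a) - lgamma(a + m)
               + sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(x, alpha)))
    scale_part = sum(xi * log(bi) for xi, bi in zip(x, beta))
    return dm_part - scale_part

# Hypothetical example: with every beta_i = 1 the scale term vanishes and
# the value equals the DM log-probability of Eq. (2).
x, alpha = [3, 1, 2], [1.2, 0.7, 2.0]
print(msd_logpmf(x, alpha, [1.0, 1.0, 1.0]))
print(msd_logpmf(x, alpha, [2.0, 2.0, 2.0]))
```

When all $\beta_i$ share a common value $c$, the scale term contributes $-m \log c$, so the second printed value is the first minus $6 \log 2$ for this example.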
3.2. The Proposed Link Functions

The link function (Owen 2008) is the inverse of a cumulative distribution function belonging to a distribution. Such a function provides the relation between the linear predictor and the mean of the distribution function (Mallick and Gelfand 1994). When considering a distribution
function with the canonical parameter, there is always a well-defined
canonical link function obtained from the exponential density function of
the response (Ter Braak 1988; Howar et al. 2012).
3.2.1. Proposed Link Functions for MBL Regression

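For Multinomial Beta-Liouville distribution-based regression, we can write the relation between the parameters and the $p$-dimensional covariate vector $Y = (y_1, \dots, y_p)$ in the following forms:
$$\begin{aligned} \alpha_i &= g_1(\alpha_{i1} y_1 + \alpha_{i2} y_2 + \dots + \alpha_{ip} y_p), \quad i = 1, \dots, d - 1 \\ \alpha &= g_2(\alpha_{1} y_1 + \alpha_{2} y_2 + \dots + \alpha_{p} y_p) \\ \beta &= g_3(\beta_{1} y_1 + \beta_{2} y_2 + \dots + \beta_{p} y_p) \end{aligned} \tag{8}$$
where the arguments of $g_1$, $g_2$, and $g_3$ are linear combinations of the covariates with regression coefficient vectors $\boldsymbol{\alpha}_i = (\alpha_{i1}, \dots, \alpha_{ip})$, $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_p)$, and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)$.
To find $g(\mu_j)$, we consider the following procedure:
$$g(\mu_j) = Y_j^T \theta, \quad j = 1, \dots, n \tag{9}$$
where $\mu_j$ is the mean of $Y_j$ and $\theta$ is a vector of regression parameters. Thus:
$$\mathrm{logit}(\mu_j) = \log\left(\frac{\mu_j}{1 - \mu_j}\right) \tag{10}$$
and for the logit link function we have the following:
$$P_j(y) = \frac{\exp\left(Y_j^T \theta\right)}{1 + \sum_{l=1}^{n-1} \exp\left(Y_l^T \theta\right)} \tag{11}$$
Thus, for the Multinomial Beta-Liouville model, we have the following:
$$\begin{aligned} g_1(\mu_j) &= Y_j^T \boldsymbol{\alpha}_i \\ g_2(\mu_j) &= Y_j^T \boldsymbol{\alpha} \\ g_3(\mu_j) &= Y_j^T \boldsymbol{\beta} \end{aligned} \tag{12}$$
We consider the final regression equation as a linear regression equation:
$$Y = g_0 + g_1 x_1 + \dots + g_d x_d \tag{13}$$
where $g_i = \beta \alpha \alpha_i$, $i = 1, \dots, d$, and $d$ is the dimension of the response vector.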
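Considering the parameter set $\theta = (\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_{d-1}, \boldsymbol{\alpha}, \boldsymbol{\beta})$ as all the regression coefficients, with the exponential inverse links $\alpha_i = e^{\boldsymbol{\alpha}_i^T Y_j}$, $\alpha = e^{\boldsymbol{\alpha}^T Y_j}$, and $\beta = e^{\boldsymbol{\beta}^T Y_j}$, the complete log-likelihood is given by:
$$\begin{aligned} L_n(X \mid \theta) = \sum_{j=1}^{n} \Bigg[ & \ln \binom{m_j}{X_j} + \sum_{i=1}^{d-1} \sum_{k=0}^{x_{ji} - 1} \ln\left(e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) + \sum_{k=0}^{z_j - 1} \ln\left(e^{\boldsymbol{\alpha}^T Y_j} + k\right) + \sum_{k=0}^{x_{jd} - 1} \ln\left(e^{\boldsymbol{\beta}^T Y_j} + k\right) \\ & - \sum_{k=0}^{z_j - 1} \ln\left(\sum_{i=1}^{d-1} e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) - \sum_{k=0}^{m_j - 1} \ln\left(e^{\boldsymbol{\alpha}^T Y_j} + e^{\boldsymbol{\beta}^T Y_j} + k\right) \Bigg] \end{aligned} \tag{14}$$
where $z_j = \sum_{i=1}^{d-1} x_{ji}$.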
3.2.2. Proposed Link Functions for MSD Regression

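For the multinomial scaled Dirichlet, we can link the parameters $\vartheta = \{\alpha, \beta\}$ to the $p$-dimensional covariate vector $Y$ as:
$$\alpha_i = k_1(\alpha_{i1} y_1 + \alpha_{i2} y_2 + \dots + \alpha_{ip} y_p) \tag{15}$$
$$\beta_i = k_2(\beta_{i1} y_1 + \beta_{i2} y_2 + \dots + \beta_{ip} y_p), \quad i = 1, \dots, d \tag{16}$$
To find $k(\mu_j)$, we use the following procedure:
$$k(\mu_j) = Y_j^T \vartheta, \quad j = 1, \dots, n \tag{17}$$
then we have:
$$k_1(\mu_j) = Y_j^T \boldsymbol{\alpha}_i \tag{18}$$
$$k_2(\mu_j) = Y_j^T \boldsymbol{\beta}_i \tag{19}$$
where $\boldsymbol{\alpha}_i = (\alpha_{i1}, \dots, \alpha_{ip})$ and $\boldsymbol{\beta}_i = (\beta_{i1}, \dots, \beta_{ip})$ are the regression coefficient vectors.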
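Considering the final regression equation to be similar to the linear regression equation in Eq. (13), $g_i$ in the case of the MSD model is given by $g_i = \beta_i \alpha_i$, $i = 1, \dots, d$. With the exponential inverse links $\alpha_i = e^{\boldsymbol{\alpha}_i^T Y_j}$ and $\beta_i = e^{\boldsymbol{\beta}_i^T Y_j}$, the complete log-likelihood of MSD for $n$ independent data points is thus computed as follows:
$$L_n(X \mid \vartheta) = \sum_{j=1}^{n} \Bigg[ \ln \binom{m_j}{X_j} - \sum_{i=1}^{d} x_{ji}\, \boldsymbol{\beta}_i^T Y_j + \sum_{i=1}^{d} \sum_{k=0}^{x_{ji} - 1} \ln\left(e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) - \sum_{k=0}^{m_j - 1} \ln\left(\sum_{i=1}^{d} e^{\boldsymbol{\alpha}_i^T Y_j} + k\right) \Bigg] \tag{20}$$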
3.3. Parameters Estimation

To estimate the parameters and find the best coefficients for our regression models, we use the Maximum Likelihood Estimation (MLE) technique (Dempster, Laird, and Rubin 1977). Maximum likelihood estimation (Vasconcellos and Cribari-Neto 2005; Paolino 2001) is a method that attempts to discover the most probable model that generated the observed result. We can obtain the maximum likelihood parameter estimates for the Multinomial Beta-Liouville and Multinomial scaled Dirichlet models by taking the derivative of the complete log-likelihood function and finding the $\Theta$ at which the derivative is equal to zero. In this technique, the estimation of the parameters that maximize the log-likelihood is based on the following:
$$\Theta^{(t+1)} = \arg\max_{\Theta} \sum_{j=1}^{n} \log p(X_j \mid \Theta) \tag{21}$$
For both models, closed-form solutions do not exist. Thus, the process requires a Newton-Raphson optimization that iterates between scoring steps based on the present values and an update of the parameters, such that:
$$\Theta^{(t+1)} = \Theta^{(t)} - H_{\Theta}^{-1} G_{\Theta} \tag{22}$$
where $G$ is the gradient and $H$ is the Hessian matrix, based on the first- and second-order derivatives of the log-likelihood function, respectively. The complete derivation needed for estimating the parameters of the two proposed models is given in Appendix A.
To achieve an optimal performance of our proposed models, the initial val-
ues of the parameters were calculated using the method of moments
(Taboada et al. 2011), which depends on the mean and variance of each dis-
tribution. Then, using the maximum likelihood approach, the parameters are
updated to get their natural values with respect to the given dataset. Finally,
the regression model is applied to predict the multivariate count response.
The complete learning algorithm is summarized in Algorithm 1.
Algorithm 1: Learning algorithm for predicting multivariate count response.
Input: Dataset $\mathcal{X} = \{W_1, \dots, W_n\}$ with $n$ independent data points $W_j = (X_j, Y_j)$, where $X_j$ is the count response vector and $Y_j$ is the covariate vector.
Output: The final parameters $\Theta$, the log-likelihood, and the predicted $Y$.
1 Split the data by ratio 60:40 for training and testing;
2 Initialize the parameters $\Theta^{(0)}$ for each model;
3 repeat
4     Update the parameters $\Theta^{(t)}$ using Eq. (22);
5     Update the link functions;
6     Calculate the complete log-likelihood using Eq. (14) for MBL or Eq. (20) for MSD;
7 until the log-likelihood converges;
8 Predict the covariate values of $Y$ using Eq. (13).
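The repeat-until loop of Algorithm 1 around the update of Eq. (22) can be illustrated on a much simpler model. The Python sketch below is a toy stand-in, not the MBL or MSD estimation: it fits the single parameter of an intercept-only Poisson model with a log link, for which the gradient and Hessian are scalars, so the matrix inverse in Eq. (22) becomes a division. The function name `newton_raphson`, the data, and the stopping tolerance are all hypothetical.

```python
from math import exp, log

def newton_raphson(gradient, hessian, theta0, tol=1e-10, max_iter=100):
    """Scalar version of Eq. (22): theta <- theta - G(theta) / H(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = gradient(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return theta

# Toy data: counts modeled as Poisson with mean exp(theta).
counts = [2, 5, 3, 6]

def gradient(theta):
    # First derivative of sum_j (x_j * theta - exp(theta)) with respect to theta.
    return sum(x - exp(theta) for x in counts)

def hessian(theta):
    # Second derivative of the same log-likelihood (always negative).
    return -len(counts) * exp(theta)

theta_hat = newton_raphson(gradient, hessian, theta0=0.0)
print(theta_hat, log(sum(counts) / len(counts)))
```

For this toy model the maximizer is known in closed form (the log of the sample mean), so the two printed numbers should agree; for MBL and MSD no closed form exists, which is why the iterative update is needed.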
4. Experimental Results
Our aim in this section is to apply the proposed regression models on real
datasets. We evaluate both Multinomial Beta-Liouville and Multinomial
scaled Dirichlet regression models to show their effectiveness compared to
the previously proposed distribution-based regression models for count
data. All the models were implemented in MATLAB.
4.1. Data and Performance Measures

We apply the models to four different applications from the medical research domain, as follows:
Analysis of genomics data: RNA-seq (Zhang et al. 2017).
Impact of stress on heart attack (Krivokapich et al. 1999).
Breast cancer diagnosis (Wolberg, Street, and Mangasarian 1995).
Diabetes diagnosis (Smith et al. 1988).
The evaluation of each model is based on the Akaike information criterion (AIC) (Burnham, Anderson, and Huyvaert 2011) and the Bayesian information criterion (BIC) (Burnham and Anderson 2004), where the model with the smallest AIC and BIC values has the best performance. Furthermore, we considered the prediction accuracy and the log-likelihood values, where the highest accuracy and log-likelihood indicate the better performance of the model. The considered performance metrics are defined as follows:
Akaike Information Criterion (AIC): It is a measure that can be used to evaluate the model capabilities by showing a link between the Kullback-Leibler information (Burnham and Anderson 2001) and the maximized log-likelihood (Burnham, Anderson, and Huyvaert 2011). AIC selects the model that minimizes the mean squared error of prediction or estimation (Bishop 2006). We can calculate the AIC of each model using the following formula, where $N_X$ is the number of data points:
$$AIC = -2 L_n + 2 N_X \tag{23}$$
Bayesian Information Criterion (BIC): This measure can be obtained from a large-sample approximation (Burnham and Anderson 2004). The BIC criterion selects the model with the smallest value. We use the following formula to calculate it, where $N_X$ is the number of data points and $D_X$ is the dimension of the data:
$$BIC = -2 L_n + N_X \log(D_X) \tag{24}$$
Accuracy: Our concern is to predict accurate covariate values of $Y$, which consists of one or more attributes. To find the accuracy of the prediction, we compare $Y_{predict}$ to the actual data in the test split $Y_{test}$ of a given dataset. Since we work on multivariate data where each $Y$ is a vector, the average accuracy was calculated. That is, the average of the differences between $Y_{predict} = (y'_1, \dots, y'_p)$ and $Y_{test} = (y_1, \dots, y_p)$ should be calculated. We use the following equation to calculate the accuracy for each model, where $\mu(\cdot)$ denotes the mean:
$$ACC = \left(1 - \frac{\mu(Y_{predict} - Y_{test})}{\mu(|Y_{test}|)}\right) \times 100 \tag{25}$$
Table 1. Comparing different distribution-based regression models for RNA-seq dataset.
         Performance metrics
MODEL    Log likelihood    AIC            BIC            Accuracy
DM       1.2634e+03        2.5748e+03     2.6417e+03     94.00%
GDM      1.1432e+03        2.8721e+03     2.8617e+03     95.00%
MBL      7.0738e+19        1.4148e+20     1.4148e+20     98.00%
MSD      9.5334e+04        -1.9045e+05    -1.9043e+05    95.75%
The prediction results in the following subsection are shown by figures,
and in each figure, the X axis shows the observed data points, and Y axis
shows the value of each Y that we predict.
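The three numerical criteria of Eqs. (23) to (25) amount to a few lines of arithmetic, sketched below in Python (function names are hypothetical; the authors' implementation is in MATLAB). Two assumptions are made and should be read as such. First, the worked numbers take the DM log-likelihood in Table 1 to be negative, i.e. -1.2634e+03, on the assumption that the sign was lost in typesetting. Second, the values $N_X = 24$ and $D_X = 120$ are inferred from the table rather than stated in the text; they happen to match a 6 by 4 coefficient matrix and the 60% training share of the 200 RNA-seq observations. The accuracy function uses absolute differences in the numerator, which is also an assumption, since Eq. (25) shows the absolute value only in the denominator.

```python
from math import log

def aic(loglik, n_x):
    """Eq. (23): AIC = -2 * L_n + 2 * N_X."""
    return -2.0 * loglik + 2.0 * n_x

def bic(loglik, n_x, d_x):
    """Eq. (24): BIC = -2 * L_n + N_X * log(D_X), natural logarithm."""
    return -2.0 * loglik + n_x * log(d_x)

def accuracy(y_pred, y_test):
    """Eq. (25) in percent, with absolute differences (an assumption)."""
    n = len(y_test)
    mean_abs_diff = sum(abs(p - t) for p, t in zip(y_pred, y_test)) / n
    mean_abs_test = sum(abs(t) for t in y_test) / n
    return (1.0 - mean_abs_diff / mean_abs_test) * 100.0

# Assumed inputs for the DM row of Table 1 (sign and N_X, D_X are inferred).
loglik_dm = -1.2634e3
print(aic(loglik_dm, 24))        # compare with the AIC column of Table 1
print(bic(loglik_dm, 24, 120))   # compare with the BIC column of Table 1

# Hypothetical predictions for a single test vector.
print(accuracy([9.0, 21.0], [10.0, 20.0]))
```

Under these assumptions the AIC evaluates to 2574.8 and the BIC to about 2641.7, which agree with the DM entries 2.5748e+03 and 2.6417e+03 reported in Table 1.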
4.2. Analyzing Genomics Data: RNA-Seq

In this application, we study the problem of high-throughput data analysis
in genomics. Quantifying the genomic features depends on sequencing
technology, where the data obtained from sequencing technologies are
often summarized by the counts of DNA or RNA fragments within a genomic interval. We considered the RNA-seq (RS) dataset¹ (Montgomery et al.
2010). The data consists of six exons that present the gene, and these six
exons in our regression model are exploratory variables where each obser-
vation has the expression level with four covariates: total reads, treatment,
gender, and age. The total number of observations is 200. Table 1 presents
the results of the four tested models, where we compared them based on
AIC, BIC, and accuracy. As we can see, the MSD based model has the
smallest AIC, BIC, and the highest likelihood. In terms of accuracy, the
MBL based model outperforms all the tested models with an accuracy of
98% compared to 94–95% for the other models.
Figures 1 and 2 show the predicted values $Y'$, i.e., the values of each of the four attributes (total reads, treatment, gender, and age) that we predict for each observation using the MBL- and MSD-based regression models, respectively. From these figures, we can see that the predicted $Y$ follows the same behavior as $Y_{test}$. Note that small predicted values are shown as approximately zero. In general, the predicted values are close to the actual test values, as shown in the figures, with an accuracy of 98% when using the MBL-based regression model and 95.75% when using the MSD-based regression model.
¹ https://github.com/Yiwen-Zhang/MGLM/tree/master/MGLM/data
Figure 1. Comparing the actual test values and the predicted values of Y using the proposed MBL-based regression model for RNA-seq dataset.
Figure 2. Comparing the actual test values and the predicted values of Y using the proposed MSD-based regression model for RNA-seq dataset.
4.3. Predicting Heart Attack Risk

This application is based on a publicly available dataset named “Stress Echocardiography” (SEG).² The dataset comes from a study conducted to determine the impact of the drug dobutamine on the risk of heart attack or cardiac event. The observations in the dataset were based on a test in which the patient's heart rate is raised by running on a treadmill while the needed information is gathered. In our experiments, we focus on predicting cardiac death, i.e., the risk of heart attack, to identify predictors of subsequent cardiac events from clinical and demographic information for each patient. The independent variables evaluated were: history of hypertension, diabetes mellitus, MI, CABG, or PTCA; age; gender; peak dose of dobutamine; rest and peak dobutamine heart rate, blood pressure, and rate pressure product (RPP); percent of achieved maximum predicted heart rate; rest and peak dobutamine EF; presence of induced chest pain; negative, equivocal, or ischemic electrocardiogram (ECG); rest wall-motion abnormality (WMA); and a positive stress echocardiogram (SE) (Krivokapich et al. 1999). The cardiac events that we aimed to predict were broken down into four categories (values), representing myocardial infarction (MI), revascularization by percutaneous transluminal coronary angioplasty (PTCA), coronary artery bypass grafting surgery (CABG), and cardiac death. The prediction results for the considered dataset using the four tested regression models are given in Table 2, reported using the above-mentioned performance metrics. According to the results, one may notice that the DM-based regression model has the smallest likelihood and the lowest accuracy compared to the other tested models. On the other hand, GDM, MBL, and MSD have approximately similar performance in terms of prediction accuracy, yet MSD has a larger log-likelihood and smaller AIC and BIC. Thus, we can say that MSD-based regression has the best prediction results on this dataset.
² http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/stressEcho.html
Table 2. Comparing different distribution-based regression models for Stress Echocardiography dataset.
         Performance metrics
MODEL    Log-likelihood    AIC            BIC            Accuracy
DM       6.1365e+03        1.2333e+04     1.2447e+04     95.00%
GDM      5.7684e+03        1.1633e+04     1.1816e+04     98.00%
MBL      5.6532e+06        1.1307e+07     1.1307e+07     98.90%
MSD      2.1068e+05        -4.2070e+05    -4.2077e+05    97.80%
In Figures 3 and 4, we display the four predicted values of Y, i.e., MI,
PTCA, CABG, and cardiac death, for each observation in the Stress
Echocardiography dataset using the MBL- and MSD-based regression models,
respectively. These figures show that the MBL-based regression model, which
achieves the highest accuracy of 98.90%, is slightly better at predicting
large values but is not suitable for predicting small values. In contrast,
Figure 4 illustrates that the MSD-based regression model performs well on
both large and small values.
Figure 3. MBL-based regression model for Stress Echocardiography dataset.
Figure 4. MSD-based regression model for Stress Echocardiography dataset.
4.4. Breast Cancer Diagnosis

In this application, we used the Breast Cancer Wisconsin dataset (BCD),3
which has a total of 569 observations, each computed from a digitized image
of a fine needle aspirate (FNA) of a breast mass. The task is to diagnose
each case as malignant or benign, based on the symmetry and the fractal
dimension. Figure 5 shows sample images from this dataset.

3 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

Figure 5. Sample images from the Breast Cancer dataset.

Table 3. Comparing different distribution-based regression models for breast cancer dataset.
Model   Log-likelihood   AIC          BIC          Accuracy
DM      3.1532e+03       6.3383e+03   6.4111e+03   91.00%
GDM     3.5300e+03       5.1000e+03   5.1910e+03   93.00%
MBL     2.4137e+05       4.8414e+05   4.8387e+05   98.00%
MSD     1.7277e+05       3.4637e+05   3.4621e+05   98.00%

Figure 6. MBL-based regression model for Breast Cancer dataset.
Figure 7. MSD-based regression model for Breast Cancer dataset.

After extracting the features, we have eight values that have been
discretized to be used in our models. The eight real-valued features
computed for each cell nucleus are: 1-radius (mean of distances from the
center to points on the perimeter), 2-texture (standard deviation of
gray-scale values), 3-perimeter, 4-area, 5-smoothness (local variation in
radius lengths), 6-compactness, 7-concavity (severity of concave portions of
the contour), 8-concave points (number of concave portions of the contour).
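The text above does not specify how the real-valued features were discretized. One simple possibility, shown here purely as an assumption, is equal-width binning of each feature into non-negative integer values. The function name `discretize`, the number of bins, and the sample measurements are hypothetical.

```python
from typing import List


def discretize(values: List[float], num_bins: int = 10) -> List[int]:
    """Map real values to integer bins 0..num_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant feature: everything in one bin
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical radius measurements for five cell nuclei.
radius = [6.0, 10.0, 14.0, 20.0, 26.0]
counts = discretize(radius, num_bins=4)
```

Other schemes, such as quantile-based bins, would serve equally well as long as every feature ends up as a non-negative integer.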
The prediction results for this dataset are shown in Table 3. The DM- and
GDM-based regression models are not the best-fitting models for predicting
breast cancer, as shown by their relatively lower accuracy. On the other
hand, the MBL- and MSD-based regression models achieve the same accuracy.
Furthermore, MSD has lower AIC and BIC values; thus, we can conclude that
the MSD-based regression model is better suited to the breast cancer
diagnosis dataset.
Figures 6 and 7 illustrate the predicted values for the Breast Cancer
dataset, including the three predicted values for each observation, using
the proposed MBL- and MSD-based regression models, respectively. As shown in
the figures, both proposed models predict the Y values well, which is
illustrated by the similar behavior of the actual and predicted values for
all tested observations.
4.5. Diagnosis of Diabetes

In this application, we used the Pima Indians Diabetes dataset (DD) (Smith
et al. 1988), which is publicly available to download.4 The objective of
this application is to evaluate the efficiency of the proposed models in the
problem of diagnosing diabetes. The dataset contains 2,000 observations and
nine variables, with no missing values reported. The variables in the
considered dataset are based on personal data, such as age and the number of
pregnancies, and on the results of medical examinations, e.g., blood
pressure, body mass index, the result of a glucose tolerance test, etc.

4 https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/pima-indians-diabetes-database.zip/1.

Table 4. Comparing different regression models for Pima Indians Diabetes dataset.
Model   Log-likelihood   AIC           BIC           Accuracy
DM      622.6466         1.2653e+03    1.3066e+03    92.00%
GDM     31612.78         6.3664e+04    6.3358e+04    94.50%
MBL     1.7917e+06       3.5843e+06    3.5842e+06    99.00%
MSD     2.8160e+04       -5.5400e+04   -5.5425e+04   97.75%

Figure 8. Comparing the actual test values and the predicted values of Y with MBL regression
model for Pima Indians Diabetes dataset.
Figure 9. Comparing the actual test values and the predicted values of Y with MSD regression
model for Pima Indians Diabetes dataset.
Table 5. Comparing the proposed regression models performance to the State-of-the-Art.

Dataset  Algorithm                                                               Accuracy
RS       CASI (Richard et al. 2010)                                              80.00%
RS       Dirichlet-Multinomial Regression (DM) (Zhang et al. 2017)               94.00%
RS       Generalized Dirichlet-Multinomial Regression (GDM) (Zhang et al. 2017)  95.00%
RS       Proposed model 1: MBL-based regression model                            98.00%
RS       Proposed model 2: MSD-based regression model                            95.75%
SEG      CART (Krivokapich et al. 1999)                                          95.00%
SEG      Hidden Markov Model (HMM) (Chykeyuk, Clifton, and Noble 2011)           93.20%
SEG      Proposed model 1: MBL-based regression model                            98.90%
SEG      Proposed model 2: MSD-based regression model                            97.80%
BCD      Logistic Regression (Schein and Ungar 2004)                             92.10%
BCD      Artificial Neural Networks (ANNs) (Abbass 2002)                         96.00%
BCD      Artificial Neural Net Input Gain Measurement Approximation              90.00%
BCD      Proposed model 1: MBL-based regression model                            98.00%
BCD      Proposed model 2: MSD-based regression model                            98.00%
DD       Artificial Neural Net Input Gain Measurement Approximation              71.00%
         (Hsu, Schuschel, and Yang 1999)
DD       An Early Neural Network Model (ADAP) (Smith et al. 1988)                76.00%
DD       Decision Tree (Han, Rodriguez, and Beheshti 2008)                       72.00%
DD       ID3 Decision Tree (Han, Rodriguez, and Beheshti 2008)                   80.00%
DD       General Regression Neural Network (Kayaer and Yıldırım 2003)            80.21%
DD       KNN (Kayaer and Yıldırım 2003)                                          77.00%
DD       Proposed model 1: MBL-based regression model                            99.00%
DD       Proposed model 2: MSD-based regression model                            97.75%
The analysis aims to predict whether a patient was diabetes positive or not
(represented in our experiments by positive count values of 1 and 2,
respectively). The feature values span a wide variety of ranges across
individuals. The prediction results for this dataset are shown in Table 4.
According to the results, the DM- and GDM-based regression models have a
smaller log-likelihood and relatively lower accuracy. On the other hand,
both MSD and MBL outperform the other models, with the MBL regression model
having the highest accuracy on this dataset, 99%.
As Figures 8 and 9 illustrate, the values predicted by the proposed
MBL-based regression model are similar to the actual ones (i.e., Y Test).
However, the MSD-based regression model yields predictions that lie between
1 and 2. Thus, to decide whether a patient has diabetes, these values were
rounded to the closest integer. That is, if the predicted value is greater
than 1.5, the diagnosis is negative (actual value is 2); otherwise, the
patient has diabetes (actual value is 1).
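The decision rule described above can be sketched as follows. This is a minimal illustration and not the code used in our experiments; the function names `to_label` and `accuracy` and the sample values are hypothetical.

```python
from typing import List


def to_label(predicted: float, threshold: float = 1.5) -> int:
    """Return 2 (diabetes negative) above the threshold, else 1 (diabetes positive)."""
    return 2 if predicted > threshold else 1


def accuracy(actual: List[int], predicted: List[float]) -> float:
    """Fraction of observations whose thresholded prediction equals the label."""
    hits = sum(1 for y, p in zip(actual, predicted) if to_label(p) == y)
    return hits / len(actual)


# Hypothetical test labels and real-valued model outputs.
y_test = [1, 2, 2, 1]
y_pred = [1.2, 1.7, 1.4, 1.5]
acc = accuracy(y_test, y_pred)
```

A prediction of exactly 1.5 is assigned to the positive class here, which matches the "greater than 1.5" wording of the rule.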
4.6. Comparison with Other Methods from the Literature

Recently, a large number of models have been proposed in the literature to
perform medical diagnosis efficiently and accurately. In this section, we
review the published results of other methods that considered the same
datasets used in our experiments. A comparative study between the proposed
models and other state-of-the-art approaches is given in Table 5. The
results in this table show that our proposed approach is competitive with
the most successful approaches.
For instance, three different algorithms have previously been applied to the
RNA-seq dataset: two that follow an approach similar to ours (i.e., the DM-
and GDM-based regression models (Zhang et al. 2017)) and CASI (Richard et
al. 2010). The SEG dataset has been analyzed using Classification and
Regression Trees (CART) (Krivokapich et al. 1999) and the Hidden Markov
Model (HMM) (Chykeyuk, Clifton, and Noble 2011), which are well-known
algorithms; however, our proposed models give the highest prediction
accuracy. Similarly, compared with the results of previous algorithms
applied to the BCD dataset, such as logistic regression (Schein and Ungar
2004) and two neural network models (Abbass 2002), our proposed models have
the highest accuracy.
Furthermore, the accuracy of diabetes diagnosis on the DD dataset obtained
using previous methods, such as logistic regression (Schein and Ungar 2004),
different neural network models (Kayaer and Yıldırım 2003; Smith et al.
1988), decision trees (Han, Rodriguez, and Beheshti 2008), and KNN (Kayaer
and Yıldırım 2003), ranges between 71% and about 80%, whereas our proposed
MSD- and MBL-based regression models achieve superior accuracies of 97.75%
and 99%, respectively.
5. Conclusion
This article introduced two novel regression models for count data based on
the multinomial Beta-Liouville and multinomial scaled Dirichlet
distributions. This work is mainly motivated by the fact that these
distributions offer high flexibility and considerable potential to describe
multivariate count data more accurately than the previously used models, so
that regression models built on them can fit such data better. To validate
the performance of the proposed models, we applied them to the analysis of
connections and patterns in medical data. The evaluation considered measures
that are commonly used to assess regression models, including model
selection criteria such as AIC and BIC, as well as the prediction accuracy.
According to the obtained results, the proposed models achieved high
accuracy in predicting diseases. These results suggest that the new
distribution-based regression models can yield better results than the
state-of-the-art methods considered. Future work can be devoted to the
application of the proposed regression models to different problems in
computer vision and image processing.
References
Abbass, H. A. 2002. An evolutionary artificial neural networks approach for breast cancer
diagnosis. Artificial Intelligence in Medicine 25 (3):265–81.
Aitchison, J. 1982. The statistical analysis of compositional data. Journal of the Royal
Statistical Society: Series B (Methodological) 44 (2):139–60.
Andrews, D. F. 1974. A robust method for multiple linear regression. Technometrics 16 (4):
523–31.
Ankam, D., and N. Bouguila. 2019. Generalized dirichlet regression and other compos-
itional models with application to market-share data mining of information technology
companies. In Proceedings of the 21st International Conference on Enterprise Information
Systems, ICEIS 2019, Heraklion, Crete, Greece, May 3–5, 2019, vol. 1, 158–166. 10.5220/
0007708201580166.
Bayes, C. L., J. L. Bazan, and C. García. 2012. A new robust regression model for
proportions. Bayesian Analysis 7 (4):841–66.
Bishop, C. M. 2006. Pattern recognition and machine learning. New York, NY: Springer.
Bouguila, N. 2008. Clustering of count data using generalized dirichlet multinomial distri-
butions. IEEE Transactions on Knowledge and Data Engineering 20 (4):462–74.
Bouguila, N. 2009. A model-based approach for discrete data clustering and feature weight-
ing using map and stochastic complexity. IEEE Transactions on Knowledge and Data
Engineering 21 (12):1649–64.
Bouguila, N. 2011. Count data modeling and classification using finite mixtures of distribu-
tions. IEEE Transactions on Neural Networks 22 (2):186–98.
Bouguila, N., and D. Ziou. 2006. A hybrid sem algorithm for high-dimensional unsuper-
vised learning using a finite generalized dirichlet mixture. IEEE Transactions on Image
Processing 15 (9):2657–68.
Bouguila, N., and O. Amayri. 2009. A discrete mixture-based kernel for svms: application
to spam and image categorization. Information Processing & Management 45 (6):631–42.
Bouguila, N., and D. Ziou. 2005a. Mml-based approach for finite dirichlet mixture estima-
tion and selection. In International Workshop on Machine Learning and Data Mining in
Pattern Recognition, 42–51. Berlin, Heidelberg: Springer.
Bouguila, N., and D. Ziou. 2005b. On fitting finite dirichlet mixture using ecm and mml.
In International Conference on Pattern Recognition and Image Analysis, 172–82. Berlin,
Heidelberg: Springer.
Bouguila, N., D. Ziou, and J. Vaillancourt. 2003. Novel mixtures based on the dirichlet dis-
tribution: application to data and image classification. In International Workshop on
Machine Learning and Data Mining in Pattern Recognition, 172–81. Berlin, Heidelberg:
Springer.
Bouguila, N., D. Ziou, and J. Vaillancourt. 2004. Unsupervised learning of a finite mixture
model based on the dirichlet distribution and its application. IEEE Transactions on
Image Processing 13 (11):1533–43.
Burnham, K. P., and D. R. Anderson. 2001. Kullback-leibler information as a basis for
strong inference in ecological studies. Wildlife Research 28 (2):111–9.
Burnham, K. P., and D. R. Anderson. 2004. Multimodel inference: understanding aic and
bic in model selection. Sociological Methods & Research 33 (2):261–304.
Burnham, K. P., D. R. Anderson, and K. P. Huyvaert. 2011. Aic model selection and multi-
model inference in behavioral ecology: some background, observations, and comparisons.
Behavioral Ecology and Sociobiology 65 (1):23–35.
Chykeyuk, K., D. A. Clifton, and J. A. Noble. 2011. Feature extraction and wall motion
classification of 2D stress echocardiography with relevance vector machines. In 2011
IEEE international symposium on biomedical imaging: From nano to macro, 677–680.
IEEE.
Connor, R. J., and J. E. Mosimann. 1969. Concepts of independence for proportions with a
generalization of the dirichlet distribution. Journal of the American Statistical
Association 64 (325):194–206.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incom-
plete data via the em algorithm. Journal of the Royal Statistical Society: Series B
(Methodological) 39 (1):1–22.
Ferrari, S., and F. Cribari-Neto. 2004. Beta regression for modelling rates and proportions.
Journal of Applied Statistics 31 (7):799–815.
Gavin III, J. R., K. Alberti, M. B. Davidson, and R. A. DeFronzo. 1997. Report of the expert
committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 20 (7):
1183.
Guimaraes, P., and R. Lindrooth. 2005. Dirichlet-multinomial regression. Economics
Working Paper Archive at WUSTL, Econometrics (0509001).
Han, J., J. C. Rodriguez, and M. Beheshti. 2008. Diabetes data analysis and prediction
model discovery using rapidminer. In 2008 Second International Conference on Future
Generation Communication and Networking, vol. 3, 96–99.
Hankin, R. K. 2010. A generalization of the dirichlet distribution. Journal of Statistical
Software 33 (11):1–18.
Herbrich, R., T. Graepel, and K. Obermayer. 1999. Regression models for ordinal data: A
machine learning approach. Berlin: Technische Universität Berlin.
Howar, F., B. Steffen, B. Jonsson, and S. Cassel. 2012. Inferring canonical register automata.
In International Workshop on Verification, Model Checking, and Abstract Interpretation,
251–66. Berlin, Heidelberg: Springer.
Hsu, C-n, D. Schuschel, and Y-t Yang. 1999. The annigma-wrapper approach to neural nets
feature selection for knowledge discovery and data mining. Taipei, Taiwan: Institute of
Information Science.
Karstoft, K., C. F. Brinkløv, I. K. Thorsen, J. S. Nielsen, and M. Ried-Larsen. 2017. Resting
metabolic rate does not change in response to different types of training in subjects with
type 2 diabetes. Frontiers in Endocrinology 8:132.
Kayaer, K., and T. Yıldırım. 2003. Medical diagnosis on pima indian diabetes using general
regression neural networks. In Proceedings of the International Conference on Artificial
Neural Networks and Neural Information Processing (ICANN/ICONIP), vol. 181, 184.
Krivokapich, J., J. S. Child, D. O. Walter, and A. Garfinkel. 1999. Prognostic value of
dobutamine stress echocardiography in predicting cardiac events in patients with known
or suspected coronary artery disease. Journal of the American College of Cardiology 33
(3):708–16.
Madsen, R. E., D. Kauchak, and C. Elkan. 2005. Modeling word burstiness using the dirich-
let distribution. In Proceedings of the 22nd International Conference on Machine
Learning, 545–552.
Mallick, B. K., and A. E. Gelfand. 1994. Generalized linear models with unknown link func-
tions. Biometrika 81 (2):237–45.
Maronna, R. 2011. Alan Julian Izenman (2008): modern multivariate statistical techniques:
regression, classification and manifold learning. Statistical Papers 52 (3):733–4.
Montgomery, S. B., M. Sammeth, M. Gutierrez-Arcelus, R. P. Lach, C. Ingle, J. Nisbett, R.
Guigo, and E. T. Dermitzakis. 2010. Transcriptome genetics using second generation
sequencing in a caucasian population. Nature 464 (7289):773–7.
Monti, G. S., G. Mateu-Figueras, and V. Pawlowsky-Glahn. 2011. Notes on the scaled
dirichlet distribution. Compositional data analysis, 128–38. UK: Wiley.
Mosimann, J. E. 1962. On the compound multinomial distribution, the multivariate b-dis-
tribution, and correlations among proportions. Biometrika 49 (1/2):65–82.
Nikoloulopoulos, A. K., and D. Karlis. 2009. Modeling multivariate count data using copu-
las. Communications in Statistics—Simulation and Computation 39 (1):172–87.
Oboh, B. S., and N. Bouguila. 2017. Unsupervised learning of finite mixtures using scaled
dirichlet distribution and its application to software modules categorization. In 2017
IEEE international conference on industrial technology (ICIT), 1085–1090.
Owen, C. E. B. 2008. Parameter estimation for the beta distribution. Brigham Young
University.
Paolino, P. 2001. Maximum likelihood estimation of models with beta-distributed depend-
ent variables. Political Analysis 9 (4):325–46.
Powers, D., and Y. Xie. 2008. Statistical methods for categorical data analysis. Bingley, UK:
Emerald.
Rahman, R. M., and F. Afroz. 2013. Comparison of various classification techniques using
different data mining tools for diabetes diagnosis. Journal of Software Engineering and
Applications 06 (03):85–97.
Richard, H., M. H. Schulz, M. Sultan, A. Nürnberger, S. Schrinner, D. Balzereit, E.
Dagand, A. Rasche, H. Lehrach, M. Vingron, et al. 2010. Prediction of alternative iso-
forms from exon expression levels in rna-seq experiments. Nucleic Acids Research 38
(10):e112–e112.
Schein, A. I., and L. Ungar. 2004. A-optimality for active learning of logistic regression
classifiers. University of Pennsylvania.
Smith, J. W., J. Everhart, W. Dickson, W. Knowler, and R. Johannes. 1988. Using the adap
learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the
Annual Symposium on Computer Application in Medical Care, 261. American Medical
Informatics Association.
Soni, J., U. Ansari, D. Sharma, and S. Soni. 2011. Predictive data mining for medical diag-
nosis: An overview of heart disease prediction. International Journal of Computer
Applications 17 (8):43–8.
Taboada, J. M., J. Rivero, F. Obelleiro, M. G. Araújo, and L. Landesa. 2011.
Method-of-moments formulation for the analysis of plasmonic nano-optical antennas. Journal of the
Optical Society of America A 28 (7):1341–8.
Ter Braak, C. 1988. Partial canonical correspondence analysis. In Classification and related
methods of data analysis: Proceedings of the First Conference of the International
Federation of Classification Societies (IFCS), Technical University of Aachen, FRG, 29
June-1 July 1987, 551–558.
Vapnik, V. 1998. Statistical learning theory, 156–60. New York: Wiley.
Vasconcellos, K. L., and F. Cribari-Neto. 2005. Improved maximum likelihood estimation
in a new class of beta regression models. Brazilian Journal of Probability and
Statistics 19 (1):13–31.
Wang, T., and H. Zhao. 2017. A dirichlet-tree multinomial regression model for associating
dietary nutrients with gut microorganisms. Biometrics 73 (3):792–801.
Wedel, M., U. Böckenholt, and W. A. Kamakura. 2003. Factor models for multivariate
count data. Journal of Multivariate Analysis 87 (2):356–69.
Wolberg, W. H., W. N. Street, and O. L. Mangasarian. 1995. Image analysis and machine
learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative
Cytology and Histology 17 (2):77–87.
Wong, T.-T. 2009. Alternative prior assumptions for improving the performance of naïve
bayesian classifiers. Data Mining and Knowledge Discovery 18 (2):183–213.
Zamzami, N., R. Alsuroji, O. Eromonsele, and N. Bouguila. 2020. Proportional data model-
ing via selection and estimation of a finite mixture of scaled dirichlet distributions.
Computational Intelligence 36 (2):459–85.
Zamzami, N., and N. Bouguila. 2018a. Consumption behavior prediction using hierarchical
Bayesian frameworks. In 2018 First International Conference on Artificial Intelligence for
Industries (AI4I), 31–34. IEEE.
Zamzami, N., and N. Bouguila. 2018b. Text modeling using multinomial scaled dirichlet
distributions. In International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems, 69–80. Cham: Springer.
Zamzami, N., and N. Bouguila. 2019a. An accurate evaluation of msd log-likelihood and its
application in human action recognition. In 7th IEEE Global Conference on Signal and
Information Processing (GlobalSIP). IEEE.
Zamzami, N., and N. Bouguila. 2019b. A novel scaled dirichlet-based statistical framework
for count data modeling: Unsupervised learning and exponential approximation. Pattern
Recognition 95:36–47.
Zhang, Y., H. Zhou, J. Zhou, and W. Sun. 2017. Regression models for multivariate count
data. Journal of Computational and Graphical Statistics 26 (1):1–13.
Appendix A. MLE for the Proposed Models

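The derivatives to estimate the MBL-based model parameters. The first derivatives of the
MBL log-likelihood with respect to the regression coefficients are given by:

\[
\frac{\partial\, \mathrm{Ln}(X \mid \theta)}{\partial \alpha_i}
= \sum_{j=1}^{n} g_1'(y_j) \left[ \psi\Big(\sum_{l=1}^{d} \alpha_l\Big) + \psi(\alpha_i') - \psi(\alpha_i) - \psi\Big(\sum_{l=1}^{d} \alpha_l'\Big) \right]
\tag{A1}
\]

\[
\frac{\partial\, \mathrm{Ln}(X \mid \theta)}{\partial \alpha}
= \sum_{j=1}^{n} g_2'(y_j) \left[ \psi(\alpha + \beta) + \psi(\alpha') - \psi(\alpha' + \beta') - \psi(\alpha) \right]
\tag{A2}
\]

\[
\frac{\partial\, \mathrm{Ln}(X \mid \theta)}{\partial \beta}
= \sum_{j=1}^{n} g_3'(y_j) \left[ \psi(\alpha + \beta) + \psi(\beta') - \psi(\alpha' + \beta') - \psi(\beta) \right]
\tag{A3}
\]

where \(\alpha_i' = \alpha_i + x_i\), \(\alpha' = \alpha + \sum_{i=1}^{d} x_i\), and
\(\beta' = \beta + x_{d+1}\). According to the Newton-Raphson method, we should also
calculate the second-order derivatives:

\[
\frac{\partial^2 \mathrm{Ln}(X \mid \theta)}{\partial \alpha_{i_1} \partial \alpha_{i_2}}
= \sum_{j=1}^{n} g_1''(y_j) \left[ \psi'\Big(\sum_{l=1}^{d} \alpha_l\Big) + \psi'(\alpha_i') - \psi'\Big(\sum_{l=1}^{d} \alpha_l'\Big) - \psi'(\alpha_i) \right]
\tag{A4}
\]

\[
\frac{\partial^2 \mathrm{Ln}(X \mid \theta)}{\partial \alpha^2}
= \sum_{j=1}^{n} g_2''(y_j) \left[ \psi'(\alpha + \beta) + \psi'(\alpha') - \psi'(\alpha' + \beta') - \psi'(\alpha) \right]
\tag{A5}
\]

\[
\frac{\partial^2 \mathrm{Ln}(X \mid \theta)}{\partial \beta^2}
= \sum_{j=1}^{n} g_3''(y_j) \left[ \psi'(\alpha + \beta) + \psi'(\beta') - \psi'(\alpha' + \beta') - \psi'(\beta) \right]
\tag{A6}
\]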
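To illustrate how first and second derivatives such as (A1)-(A6) enter a Newton-Raphson update, the sketch below maximizes a one-parameter toy log-likelihood, l(theta) = 3 ln(theta) - 2 theta, chosen only because its maximizer theta = 1.5 is known in closed form. It is not the MBL log-likelihood, and the function name `newton_raphson` and its arguments are hypothetical.

```python
from typing import Callable


def newton_raphson(
    grad: Callable[[float], float],
    hess: Callable[[float], float],
    theta0: float,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> float:
    """Iterate theta <- theta - grad(theta) / hess(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta


# Toy log-likelihood l(theta) = 3 ln(theta) - 2 theta (an assumption for illustration).
theta_hat = newton_raphson(
    grad=lambda t: 3.0 / t - 2.0,  # first derivative
    hess=lambda t: -3.0 / t ** 2,  # second derivative (negative, so l is concave)
    theta0=1.0,
)
```

In the multi-parameter case the scalar division becomes a linear solve with the Hessian matrix, but the structure of the update is the same.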
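The derivatives to estimate the MSD-based model parameters. The first derivatives of the
MSD log-likelihood function with respect to \(\alpha_i\) and \(\beta_i\), \(i = 1, \ldots, d\), are:

\[
\frac{\partial\, \mathrm{Ln}(X \mid \vartheta)}{\partial \alpha_i}
= \sum_{j=1}^{n} k_1'(y_j) \left[ \psi(|\alpha|) - \psi(m_j + |\alpha|) + \psi(x_i + \alpha_i) - \psi(\alpha_i) \right]
\tag{A7}
\]

\[
\frac{\partial\, \mathrm{Ln}(X \mid \vartheta)}{\partial \beta_i}
= - \sum_{j=1}^{n} k_2'(y_j)\, \frac{x_i}{\beta_i}
\tag{A8}
\]

By computing the second derivatives with respect to \(\alpha_i\) and \(\beta_i\), we obtain:

\[
\frac{\partial^2 \mathrm{Ln}(X \mid \vartheta)}{\partial \alpha_{i_1} \partial \alpha_{i_2}} =
\begin{cases}
\sum_{j=1}^{n} k_1''(y_j) \left[ \psi'(|\alpha|) - \psi'(m_j + |\alpha|) + \psi'(x_i + \alpha_i) - \psi'(\alpha_i) \right] & \text{if } i_1 = i_2 = i, \\
\sum_{j=1}^{n} k_1''(y_j) \left[ \psi'(|\alpha|) - \psi'(m_j + |\alpha|) \right] & \text{otherwise,}
\end{cases}
\tag{A9}
\]

\[
\frac{\partial^2 \mathrm{Ln}(X \mid \vartheta)}{\partial \beta_{i_1} \partial \beta_{i_2}} =
\begin{cases}
\sum_{j=1}^{n} k_2''(y_j)\, \dfrac{x_i}{\beta_i^2} & \text{if } i_1 = i_2 = i, \\
0 & \text{otherwise.}
\end{cases}
\tag{A10}
\]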
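As a numerical sanity check of the derivative with respect to the scale parameters in (A8), the sketch below evaluates an MSD log-probability and compares a finite-difference derivative with respect to each beta_i against the analytic value -x_i / beta_i obtained by differentiating that form. The closed form used in `msd_log_pmf` is an assumption inferred from the derivatives in this appendix, the link-function factor k_2'(y_j) is omitted, and all names and numbers are hypothetical.

```python
import math
from typing import List


def msd_log_pmf(x: List[int], alpha: List[float], beta: List[float]) -> float:
    """Assumed MSD log-probability of a count vector x (see the text above)."""
    n, a_sum = sum(x), sum(alpha)
    value = math.lgamma(n + 1) + math.lgamma(a_sum) - math.lgamma(a_sum + n)
    for x_i, a_i, b_i in zip(x, alpha, beta):
        value += math.lgamma(x_i + a_i) - math.lgamma(a_i)
        value -= math.lgamma(x_i + 1) + x_i * math.log(b_i)
    return value


def numeric_grad_beta(x, alpha, beta, i: int, eps: float = 1e-6) -> float:
    """Central finite difference of the log-probability with respect to beta[i]."""
    up, down = list(beta), list(beta)
    up[i] += eps
    down[i] -= eps
    return (msd_log_pmf(x, alpha, up) - msd_log_pmf(x, alpha, down)) / (2 * eps)


# Hypothetical counts and parameters for a three-dimensional example.
x, alpha, beta = [3, 0, 5], [1.2, 0.7, 2.0], [1.0, 2.0, 0.5]
analytic = [-x_i / b_i for x_i, b_i in zip(x, beta)]
numeric = [numeric_grad_beta(x, alpha, beta, i) for i in range(len(x))]
```

The same finite-difference check can be applied to the shape parameters once a digamma implementation is available.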
