
Communications in Statistics - Theory and Methods

ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: https://www.tandfonline.com/loi/lsta20

Phase I monitoring of social network with baseline periods using poisson regression

Ebrahim Mazrae Farahani & Reza Baradaran Kazemzadeh

To cite this article: Ebrahim Mazrae Farahani & Reza Baradaran Kazemzadeh (2019) Phase I
monitoring of social network with baseline periods using poisson regression, Communications in
Statistics - Theory and Methods, 48:2, 311-331, DOI: 10.1080/03610926.2017.1408836

To link to this article: https://doi.org/10.1080/03610926.2017.1408836

Published online: 22 Dec 2017.


Phase I monitoring of social network with baseline periods using poisson regression
Ebrahim Mazrae Farahani and Reza Baradaran Kazemzadeh
Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran

ABSTRACT
Social network analysis is an important analytic tool to forecast social trends by modeling and monitoring the interactions between network members. This paper proposes an extension of a statistical process control method to monitor social networks by determining the baseline periods when the reference network set is collected. We consider probability density profile (PDP) to identify baseline periods using Poisson regression to model the communications between members. Also, Hotelling $T^2$ and likelihood ratio test (LRT) statistics are developed to monitor the network in Phase I. The results based on signal probability indicate a satisfactory performance for the proposed method.

ARTICLE HISTORY
Received May
Accepted November

KEYWORDS
Social networks; Probability density profile; Signal probability; Statistical process monitoring; Change point detection.

1. Introduction
In recent years, social network monitoring has been gaining the attention of researchers and practitioners, especially those interested in statistical process monitoring (SPM). Monitoring suspicious communications within networks is an essential subject for online analysis of social issues such as terrorism and crime (McCulloh and Carley 2011; McCulloh et al. 2007). Hence, governments are interested in methods that could help them gain information about terrorist attacks and anomalies in certain groups' communications. Social network monitoring can also be used to manage crises such as fires in populated areas or for early diagnosis of diseases (McCulloh et al. 2007; Abbasi et al. 2010; Christakis and Fowler 2010). As mentioned above, analyzing and monitoring social network communications is important in security applications for early detection of threats, such as terrorist attacks and threats that can harm sites and facilities (McCulloh and Carley 2011; McCulloh et al. 2007). This helps security agencies focus their resources on important points. Heavy communication and an increasing flow of information between different groups make these networks complex. Despite this complexity, an important goal in this area is to find a simple and accurate method to monitor communications in social networks.
Social network monitoring is a significant analytic tool for analyzing communication in a network where the process consists of a collection of graph data. A social network can therefore be modeled as a graph in which nodes represent individuals such as members, groups, and companies, and edges between the nodes represent connections such as friendship, affiliation, and interactions (McCulloh and Carley 2011; McCulloh et al. 2007; Christakis and Fowler 2010; Azarnoush et al. 2016; Wasserman and Faust 1994; Woodall et al. 2016).

CONTACT Reza Baradaran Kazemzadeh rkazem@modares.ac.ir Faculty of Industrial and Systems Engineering, Tarbiat
Modares University, Tehran, Iran.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lsta.
©  Taylor & Francis Group, LLC
Cook and Whitmeyer also proposed two general conceptions of structure in social network analysis (Cook and Whitmeyer 1992). The more common view conceives of structure as a composition of particular edges between actors, where variation in the existence or strength of edges in the network is significant and meaningful. The other view conceives of structure as a general deviation from randomness at the edges for specific groups, or perhaps the entire network. In this research, the second view is used to evaluate the performance of the proposed method using simulated network streams.
Methods for detecting changes in social networks exist in the literature on static and dynamic networks. To monitor the behavior of a social network, a reference period of the network is extracted and compared to each incoming network over time using different monitoring methods. Social network change detection is a statistical approach to identifying shifts in network behavior, which potentially increases the ability of an analyst to detect changes (Wald 1950; Neyman and Pearson 1993). A good example of social network change detection was discussed by McCulloh et al., who combined social network analysis techniques with statistical process control (McCulloh et al. 2011). This approach was presented to detect small changes using three control schemes, namely the cumulative sum (CUSUM) control chart, the exponentially weighted moving average (EWMA) control chart, and the scan statistic, to monitor network measures such as closeness centrality, betweenness centrality, and density (McCulloh and Carley 2011; McCulloh et al. 2007). Network measures were first developed by Freeman, who provided more reasonable information for detecting anomalies that arise from different network structures (Freeman 1997, 1978).
Savage et al. presented four anomaly detection characteristics in online social networks
namely static labeled anomalies, static unlabeled anomalies, dynamic labeled anomalies, and
dynamic unlabeled anomalies (Savage et al. 2014). They also provided an overview of the
existing methods for detecting the aforementioned anomalies.
In social networks, the formation of relationships between people can be affected by external attributes, which can be used for detecting network changes (Miller et al. 2013; Azarnoush et al. 2016; Mazrae Farahani et al. 2016). For instance, Mazrae Farahani et al. modeled the social network with a Poisson regression model and then used MEWMA and MCUSUM control charts to simultaneously monitor the average degree, average betweenness, and average closeness measures, which are affected by the number of communications between nodes (Mazrae Farahani et al. 2016). Noorossana et al. also presented a new monitoring method based on the Exponential Random Graph Model (ERGM) and control charts. They applied a monitoring statistic based on the Generalized Likelihood Ratio Test (GLRT) to detect anomalies in networks. The results show that the GLR chart has better average run length (ARL) performance than the CUSUM chart in detecting large shifts, while the CUSUM chart is better for detecting small shifts (Noorossana et al. 2017).
Al Hasan et al. also presented an overview of the aforementioned attributes to monitor the email communication network of a university, which shows that the age difference of two users, having a common major, and the number of courses in which both users are enrolled are effective attributes (Al Hasan et al. 2006).
Sparks and Wilson focused on an approach for monitoring known and unknown groups using generalizations of the exponentially weighted moving average (EWMA) statistic, and they noted that communication counts between members in networks follow a Poisson distribution (Sparks and Wilson 2016). Sparks also used EWMA and CUSUM, respectively, to monitor the departure of the smoothed communication levels from their expected mean and expected median (Sparks 2015).
Wilson et al. applied a dynamic version of the degree-corrected stochastic block model (DCSBM) to model and monitor networks. Using Shewhart control charts, they monitored three parameters of this model: the probability of nodes belonging to a particular community, the probability of a link between and within different communities, and the propensity of nodes to form relationships (Wilson et al. 2016).
Azarnoush et al. proposed a statistical method to monitor social networks via attributes by
considering three different types of reference network sets which are collected during normal
conditions to identify sudden changes which might have occurred (Azarnoush et al. 2016).
We therefore need a method to monitor the performance of a process by comparing the current state of the system against other states. Such methods are presented in the statistical process control (SPC) literature, and there are different approaches to defining reference sets that can help monitor a system in different states (Bersimis 2005; Chakravarti and Laha 1967; Montgomery 2007; Sullivan and Woodall 2000, 2014; Sullivan 2002). For instance, Sullivan used a method to cluster individual observations to detect multiple change points and compared it with the X-chart and CUSUM chart, which failed to detect multiple shifts (Sullivan 2002). Hawkins and Maboudou-Tchao and Capizzi and Masarotto applied self-starting charts in conditions such as when parameters are unknown and when samples cannot be used to estimate the exact values of the control limits (Hawkins and Maboudou-Tchao 2007; Capizzi and Masarotto 2010). In these methods, reference sets were extracted from successful conditions to control the process and eliminate causes of deviation in order to bring the process back to the in-control state. Guh and Han and Baker first presented the idea of using PDPs and clustering PDPs, while Zhang and Albin and Zhang et al. improved the PDP clustering method in the SPC context so that it could cluster PDPs in historical data to determine periods generated by different distributions over time (Guh 2005; Han and Baker 1995; Zhang and Albin 2007; Zhang et al. 2010).
However, these studies have not focused on considering time periods associated with stable
condition of the network. Therefore, regarding the importance of monitoring social networks,
a method based on Poisson regression is developed to monitor counts of communications
instead of existence of communications in social networks and also PDP clustering method
is used as an alternative to the existing methods such as static, dynamic, and moving window
reference time period methods.
The remainder of this paper is organized as follows. In Section 2, social networks are discussed and our modeling procedure is described. Poisson regression is discussed in Section 3. In Section 4, the PDP clustering methodology to extract baseline periods is discussed. In Section 5, two methods, namely the Hotelling $T^2$ and LRT statistics, are applied to monitor social networks. The simulation analysis used to evaluate the proposed method is described in Section 6. We summarize our concluding remarks in the final section.

2. Problem definition and model assumptions


In this section, we describe a method for change detection based on statistical process control. We first present the notation and definitions used to model the method in Table 1.
As mentioned, a social network can be represented in the form of a network relationship matrix. The notation used to characterize a social network is given by the following equations:

$$G(t) = (V(t), Y(t)); \quad t = 1, \ldots, m \tag{1}$$

$$V(t) \in \{v_1, v_2, \ldots, v_i, \ldots, v_s\} \tag{2}$$

$$Y(t) = \{y_{12t}, \ldots, y_{ijt}, \ldots, y_{(s-1)st}\} \tag{3}$$
Table 1. Notations and Definitions.

| Notation | Description |
|---|---|
| $V(t)$ | Set of nodes |
| $Y(t)$ | Set of edges |
| $k$ | Set of attributes |
| $s$ | Number of nodes |
| $m$ | Number of time periods |
| $Q$ | Number of quality variables |
| $m_R$ | Number of time periods in the reference set |
| $T_R$ | Time periods in the reference set |
| $\tau$ | Change point |
| $p$ | Number of attributes |
| $a_{ik}$ | The value of the $k$th attribute at the $i$th node |
| $x_{ijp}$ | The value of the explanatory variable between $i$ and $j$ for the $p$th attribute |
| $\beta_t = (\beta_{0t}, \beta_{1t}, \ldots, \beta_{pt})$ | The vector of Poisson regression parameters when the process is in-control |
| $\beta$ | The vector of Poisson regression parameters when the process is out-of-control |
| $\delta$ | The magnitude of the step change in the regression parameters |
| $y_{ijt}$ | The number of contacts between nodes $i$ and $j$ at the $t$th time |
| $\lambda_{ij}$ | The expected value of the count of communications between nodes $i$ and $j$ |
| $\hat{\lambda}_{ij}$ | The estimated value of $\lambda_{ij}$ |

where $V(t)$ and $Y(t)$ represent the nodes and edges in time period $t$, respectively. For instance, a relationship may be defined as any possible communication such as email, phone calls, SMS, and Telegram messages.
We define a matrix whose diagonal elements are equal to zero, because a node cannot communicate with itself. Let $y_{ijt}$ be the number of communications between nodes $i$ and $j$ during period $t$. The parameter $\lambda_{ij}$ is defined as the expected value of the count of communications between nodes $i$ and $j$. It is assumed that the number of communications between nodes in different time periods follows a Poisson distribution, as in Equation 4.

$$y_{ijt} \sim \text{Poisson}(\lambda_{ij}); \quad i = 1, \ldots, s; \; j = 1, \ldots, s; \; t = 1, \ldots, m. \tag{4}$$

To model the social network as a Poisson regression, the counts of communications between nodes at time period $t = 1, \ldots, m$ are considered as the response variables. Consequently, the Poisson regression profile can be used to model the relationship between response and explanatory variables at any time period. The relationship between the response variable and the explanatory variables is modeled as

$$\log(\lambda_{ij}) = \beta_0 + \beta_1 x_{ij1} + \cdots + \beta_p x_{ijp} = \sum_{k=0}^{p} \beta_k x_{ijk}; \quad i = 1, \ldots, s; \; j = 1, \ldots, s \tag{5}$$

$$\lambda_{ij} = e^{\sum_{k=0}^{p} \beta_k x_{ijk}}; \quad i = 1, \ldots, s; \; j = 1, \ldots, s; \; k = 0, \ldots, p \tag{6}$$

Once the vertex attributes are determined, the proposed method must consider the relationship between the vertex attributes of each pair of linked vertices. Similarly to the work by Mazrae Farahani et al., the following function is used to modify the vertex attributes (Mazrae Farahani et al. 2016).

$$x_{ijk} = \frac{\min\{a_{ik}, a_{jk}\}}{\max\{a_{ik}, a_{jk}\}}; \quad i = 1, \ldots, s; \; j = 1, \ldots, s; \; k = 1, \ldots, p \tag{7}$$

where $a_{ik}$ denotes the value of the $k$th attribute of node $i$. In this equation, if the denominator is equal to zero, we assume that $x_{ijk}$ is zero.
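As a minimal sketch of Equations (5) to (7), assuming each node carries a list of non-negative attribute values, the pairwise covariates and the expected communication count could be computed as follows. The function names, the sample attribute values, and the coefficients are illustrative assumptions and are not taken from the paper.

```python
import math


def pair_covariate(a_i, a_j):
    """Equation (7): min/max ratio of one attribute for a pair of nodes.

    Returns 0.0 when the denominator is zero, as stated in the text.
    """
    hi = max(a_i, a_j)
    return 0.0 if hi == 0 else min(a_i, a_j) / hi


def expected_count(beta, attrs_i, attrs_j):
    """Equations (5)-(6): lambda_ij = exp(beta_0 + sum_k beta_k * x_ijk)."""
    x = [pair_covariate(a, b) for a, b in zip(attrs_i, attrs_j)]
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return math.exp(eta)


# Hypothetical attribute values for two nodes (two attributes each).
node_a = [30.0, 5.0]
node_b = [45.0, 0.0]
beta = [0.5, 1.0, 0.2]  # illustrative coefficients, not estimates from the paper
lam = expected_count(beta, node_a, node_b)
```

The ratio keeps every covariate in the interval [0, 1] for non-negative attributes, so attributes measured on very different scales enter the linear predictor on a comparable footing.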
3. Poisson regression model


In this section, the Poisson regression model used to characterize a social network is presented (Herberts and Jensen 2004). Let the vector of explanatory variables corresponding to the relationship between the $i$th and $j$th nodes be defined as

$$X_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijp})^T; \quad i = 1, \ldots, s; \; j = 1, \ldots, s \tag{8}$$

Recall that $y_{ijt}$ is the count of communications between the $i$th and $j$th nodes in the $t$th time period, which is assumed to follow a Poisson distribution with the following probability distribution function (PDF):

$$f_{y_{ijt}}(y_{ijt}) = \frac{e^{-\lambda_{ij}} \lambda_{ij}^{y_{ijt}}}{y_{ijt}!}; \quad y_{ijt} = 0, 1, 2, \ldots; \; \lambda_{ij} > 0 \tag{9}$$

The value of the parameter $\lambda_{ij}$ depends on the vectors $\beta$ and $X_{ij}$ through $\lambda_{ij} = \exp(X_{ij}^T \beta)$. As a result, $E(y_{ij}) = \mathrm{var}(y_{ij}) = \lambda_{ij}$. Then, the joint likelihood function for $y_{ij}$ at time $t$ can be written as

$$L(\lambda; y) = \prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} \frac{e^{-\lambda_{ij}} (\lambda_{ij})^{y_{ij}}}{y_{ij}!} = \frac{\prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} (\lambda_{ij})^{y_{ij}}}{\prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} y_{ij}!} \, e^{-\sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \lambda_{ij}} \tag{10}$$

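Taking the natural logarithm of both sides of Equation (10) leads to the following equation:

$$\ln[L(\lambda; y)] = \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} y_{ij} \ln(\lambda_{ij}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \lambda_{ij} - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \ln(y_{ij}!) \tag{11}$$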

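Substituting $\exp(X_{ij}^T \beta)$ for $\lambda_{ij}$, we can rewrite Equation 11 at time $t$ as

$$\log[L(\beta; y)] = \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} y_{ij} \log\left[\exp\left(X_{ij}^T \beta\right)\right] - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \exp\left(X_{ij}^T \beta\right) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ij}!) \tag{12}$$

then

$$\frac{\partial \log[L(\beta; y)]}{\partial \beta} = X_{ij}^T (y - \lambda) \tag{13}$$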
So, the maximum likelihood estimate (MLE) of $\beta$ is the solution of $X_{ij}^T (y - \lambda) = 0$, where $0$ is a $p$-dimensional zero vector. We can estimate the vector of regression parameters using the iteratively weighted least squares (IWLS) method (Amiri et al. 2015). For the $t$th time period, the estimated model parameters obtained by the IWLS method are denoted by $\hat{\beta}_t = (\hat{\beta}_{1t}, \ldots, \hat{\beta}_{pt})^T$ (Sharafi et al. 2013).
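The estimation step can be illustrated with a small pure-Python sketch. It uses Fisher scoring, which for a log-link Poisson model is equivalent to IWLS: each iteration solves $(X^T W X)\,\Delta = X^T (y - \lambda)$ with $W = \mathrm{diag}(\lambda)$. The helper names, the starting value, and the stopping rule are assumptions made for illustration; the paper does not specify them.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_poisson(X, y, iters=25, tol=1e-10):
    """Fisher scoring for a log-link Poisson regression.

    X: list of rows, each starting with 1.0 for the intercept.
    y: list of non-negative counts, one per row (one per node pair).
    """
    p = len(X[0])
    beta = [0.0] * p
    # Assumed starting value: intercept at log(mean count), slopes at zero.
    beta[0] = math.log(max(sum(y) / len(y), 1e-8))
    for _ in range(iters):
        lam = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
        # Score vector X^T (y - lambda), cf. Equation (13).
        score = [sum(row[a] * (yi - li) for row, yi, li in zip(X, y, lam))
                 for a in range(p)]
        # Fisher information X^T W X with W = diag(lambda).
        info = [[sum(row[a] * row[b] * li for row, li in zip(X, lam))
                 for b in range(p)] for a in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

With an intercept-only design the estimate reduces to the logarithm of the sample mean, which gives a quick sanity check on the routine.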

4. Probability density profile clustering method


Statistical process control (SPC) is a quality control method used by engineers to monitor industrial processes. In social networks, when we use SPC to monitor immediate changes, it is better to model the network communication from baseline periods in which the quality variables are stable over time. Often the baseline data have to be extracted from a long stream of historical data that includes both normal and unusual conditions. During online monitoring, the current observation is compared with the baseline data; if they are inconsistent, this indicates that an unusual condition has occurred which requires attention. To ensure that the model is correct and that the resulting online SPC monitoring performs effectively, it is critical to choose the baseline data with care. The method for determining the number of operational modes has been addressed by Zhang and Albin (Zhang and Albin 2007).
In this paper, the PDP clustering method is applied to identify the baseline period in a historical dataset. This period, in which the interaction between people is normal, the time interval is sufficiently long, and the condition is stable, is extracted as follows.
Step 1: Network centrality measures, which include closeness, betweenness, and degree, can be calculated over time as network quality variables. In this paper, we use the average degree over time as the quality variable, which gives the following vector:
$$QV = \{qv_1, qv_2, \ldots, qv_t\} \tag{14}$$
Step 2: Segment $QV$ into subsequences using a moving window of size $w$, subject to the following constraint:
$$S_i = \{qv_j \mid i \le j \le i + w\} \tag{15}$$
Step 3: Determine the PDP of each subsequence, which could be a probability density function, a frequency histogram, etc. In this study, we use the probability density to determine the profile of each subsequence.
Step 4: Cluster the profiles using a clustering method. In this approach, we use the k-means method to segment the network variables into clusters (Han and Kamber 2001; Fraley and Raftery 1998).
Step 5: Assign the quality variable of each time period to the cluster from the previous step in which it appears with the highest frequency.
Step 6: Determine the baseline clusters by analyzing network model characteristics such as the standard deviation and the coefficient of variation. Use the coefficient of variation to select the satisfactory clusters.
Step 7: Set up a hypothesis to determine other baseline clusters. Use a t-test to conduct the hypothesis test as follows:
H0: The mean value of the kth cluster is equal to the mean of the baseline clusters.
H1: The mean value of the kth cluster is not equal to the mean of the baseline clusters.
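Steps 2 to 5 above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the profile is a relative-frequency histogram (one of the options named in Step 3), windows have exactly $w$ points as in the worked example of Section 6.2, k-means is initialised with evenly spaced profiles, and all function names and the toy series are hypothetical.

```python
from collections import Counter


def windows(qv, w):
    """Step 2: overlapping subsequences of w consecutive points."""
    return [qv[i:i + w] for i in range(len(qv) - w + 1)]


def histogram_profile(seq, lo, hi, bins):
    """Step 3: relative-frequency histogram used as the density profile."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in seq:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(seq) for c in counts]


def kmeans(points, k, iters=50):
    """Step 4: plain k-means; evenly spaced initial centers are an assumption."""
    step = (len(points) - 1) / (k - 1) if k > 1 else 0
    centers = [list(points[round(i * step)]) for i in range(k)]
    labels = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
               for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def period_labels(window_labels, n, w):
    """Step 5: each period takes the most frequent label among its windows."""
    votes = [Counter() for _ in range(n)]
    for i, lab in enumerate(window_labels):
        for t in range(i, i + w):
            votes[t][lab] += 1
    return [v.most_common(1)[0][0] for v in votes]


# Toy quality variable with two regimes (hypothetical data).
qv = [1.0, 1.2] * 10 + [9.0, 9.2] * 10
w = 5
profiles = [histogram_profile(s, 0.0, 10.0, 4) for s in windows(qv, w)]
labels = period_labels(kmeans(profiles, 2), len(qv), w)
```

In this toy series the first 20 periods fall into one cluster and the last 20 into the other; Steps 6 and 7 would then compare the clusters to pick the baseline.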

5. Proposed method
To monitor a social network based on Poisson regression, two statistics, namely the LRT and the extended Hotelling $T^2$, are utilized in this section (Azarnoush et al. 2016; Mason and Young 2000, 2001; Yeh et al. 2009). Recall that we have $m$ time periods, where each time period has at most $s(s - 1)$ pairwise relationships between nodes.

5.1. LRT Method


Suppose that a change occurs in one or more generalized linear model (GLM) parameters. Hence we have

$$g(\lambda_{ij1}) = \beta_{01} + \beta_{11} x_{ij1} + \beta_{21} x_{ij2} + \cdots + \beta_{p1} x_{ijp}; \quad i = 1, 2, \ldots, s; \; j = 1, 2, \ldots, s; \; t = 1, 2, \ldots, m_1 \tag{16}$$

$$g(\lambda_{ij2}) = \beta_{02} + \beta_{12} x_{ij1} + \beta_{22} x_{ij2} + \cdots + \beta_{p2} x_{ijp}; \quad i = 1, 2, \ldots, s; \; j = 1, 2, \ldots, s; \; t = m_1 + 1, \ldots, m \tag{17}$$
The equality of the parameters $\lambda_{ij1}$ and $\lambda_{ij2}$ is tested by the following hypothesis test:

$H_0: \lambda_{ij1} = \lambda_{ij2} = \lambda_{ij}$
$H_1: H_0$ is not true

The MLE method is used, with $L(y; \lambda) = \prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} f(y_{ij})$, to obtain $l_1$ and $l_2$. Under the null hypothesis, the maximized log-likelihood is $l_0$. The maximized log-likelihoods for the observations before and after the change point are $l_1$ and $l_2$, respectively. Under $H_1$, the maximized log-likelihood, which is obtained as the sum of $l_1$ and $l_2$, is called $l_a$. The likelihood ratio statistic is then calculated as follows.

$$lrt = -2(l_0 - l_a) \tag{18}$$

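The likelihood function of the reference data set is as follows.

$$L(\lambda; y) = \prod_{t \in T_R} \prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} \frac{e^{-\lambda_{ij}} (\lambda_{ij})^{y_{ijt}}}{y_{ijt}!} = \frac{\prod_{t \in T_R} \left[ \prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} (\lambda_{ij})^{y_{ijt}} \, e^{-\sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \lambda_{ij}} \right]}{\prod_{t \in T_R} \prod_{j \neq i = 1}^{s} \prod_{i=1}^{s} y_{ijt}!} \tag{19}$$

$$\log[L(\lambda; y)] = \sum_{t \in T_R} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} y_{ijt} \log(\lambda_{ij}) - \sum_{t \in T_R} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \lambda_{ij} - \sum_{t \in T_R} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \tag{20}$$

$$\frac{\partial \log[L(\lambda; y)]}{\partial \lambda_{ij}} = \frac{\sum_{t \in T_R} y_{ijt}}{\lambda_{ij}} - m_R = 0 \tag{21}$$

where

$$\hat{\lambda}_{ij} = \frac{\sum_{t \in T_R} y_{ijt}}{m_R} \tag{22}$$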
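and $l_0$, $l_1$ and $l_2$ are

$$l_0 = \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_R \hat{\lambda}_{ij} \log(\hat{\lambda}_{ij}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_R \hat{\lambda}_{ij} - \sum_{t \in T_R} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \tag{23}$$

$$l_1 = \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} \log(\hat{\lambda}_{ij1}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} - \sum_{t=1}^{m_1} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \tag{24}$$

$$l_2 = \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} \log(\hat{\lambda}_{ij2}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} - \sum_{t=m_1+1}^{m} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \tag{25}$$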

where

$$\hat{\lambda}_{ij1} = \frac{\sum_{t=1}^{m_1} y_{ijt}}{m_1} \tag{26}$$

$$\hat{\lambda}_{ij2} = \frac{\sum_{t=m_1+1}^{m} y_{ijt}}{m - m_1} \tag{27}$$
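$$\begin{aligned}
l_a = l_1 + l_2 &= \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} \log(\hat{\lambda}_{ij1}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} - \sum_{t=1}^{m_1} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \\
&\quad + \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} \log(\hat{\lambda}_{ij2}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} - \sum_{t=m_1+1}^{m} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!) \\
&= \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} \log(\hat{\lambda}_{ij1}) + \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} \log(\hat{\lambda}_{ij2}) \\
&\quad - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} - \sum_{t=1}^{m} \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} \log(y_{ijt}!)
\end{aligned} \tag{28}$$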

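Finally, the LRT value is obtained for each change point:

$$lrt = -2(l_0 - l_a) \tag{29}$$

$$\begin{aligned}
LRT = -2 \Bigg[ &\sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_R \hat{\lambda}_{ij} \log(\hat{\lambda}_{ij}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} \log(\hat{\lambda}_{ij1}) \\
&- \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} \log(\hat{\lambda}_{ij2}) - \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_R \hat{\lambda}_{ij} \\
&+ \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} m_1 \hat{\lambda}_{ij1} + \sum_{j \neq i = 1}^{s} \sum_{i=1}^{s} (m - m_1) \hat{\lambda}_{ij2} \Bigg]
\end{aligned} \tag{30}$$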
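The statistic in Equation (30) can be computed directly from the edge counts. The sketch below assumes that the reference set covers all $m$ periods, so that $m_R = m$, the $\log(y_{ijt}!)$ terms cancel, and the linear terms sum to zero; it also uses the convention $0 \log 0 = 0$. The data layout (`counts[t][e]` for period `t` and node pair `e`) and the function names are hypothetical.

```python
import math


def xlogx(total, n):
    """n * lam * log(lam) with lam = total / n, using 0 * log(0) = 0."""
    return 0.0 if total == 0 else total * math.log(total / n)


def lrt_statistic(counts, m1):
    """LRT for a candidate change after period m1 (1 <= m1 < m).

    counts[t][e] is the number of communications on node pair e in period t.
    Assumes the reference set is the full sequence of m periods.
    """
    m = len(counts)
    stat = 0.0
    for e in range(len(counts[0])):
        s1 = sum(counts[t][e] for t in range(m1))
        s2 = sum(counts[t][e] for t in range(m1, m))
        # The -m*lam terms cancel because s1 + s2 equals m * lam_hat.
        l0 = xlogx(s1 + s2, m)
        la = xlogx(s1, m1) + xlogx(s2, m - m1)
        stat += -2.0 * (l0 - la)
    return stat


# Hypothetical single-edge stream whose rate jumps after the second period.
stream = [[1], [1], [9], [9]]
best_m1 = max(range(1, len(stream)), key=lambda k: lrt_statistic(stream, k))
```

Scanning all candidate values of `m1` and taking the one with the largest statistic gives a simple change-point estimate for the toy stream.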

5.2. Extended Hotelling $T^2$


Yeh et al. proposed five Hotelling $T^2$ charts for monitoring logistic regression profiles (Yeh et al. 2009). In that study it was concluded that the $T_I^2$ method had the best performance. This statistic was then used by Amiri et al. to monitor Poisson regression profiles (Amiri et al. 2015). Due to the satisfactory performance of the $T_I^2$ statistic, we also use this method to monitor the social network. For the $t$th network this statistic is computed as

$$T_{I,t}^2 = (\hat{\beta}_t - \bar{\beta})^T S_I^{-1} (\hat{\beta}_t - \bar{\beta}); \quad t = 1, 2, \ldots, m \tag{31}$$

where

$$\bar{\beta} = \frac{1}{m_R} \sum_{t \in T_R} \hat{\beta}_t \tag{32}$$

$$S_I = \frac{1}{m_R} \sum_{t \in T_R} \mathrm{var}(\hat{\beta}_t) = \frac{1}{m_R} \sum_{t \in T_R} (X^T \hat{W}_t X)^{-1} \tag{33}$$

and

$$\hat{W}_t = \begin{bmatrix} \hat{\lambda}_{12t} & 0 & \cdots & 0 \\ 0 & \hat{\lambda}_{13t} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\lambda}_{(s-1)st} \end{bmatrix} \tag{34}$$

where

$$\hat{\lambda}_{ijt} = \exp(x_{ij} \hat{\beta}_t) \tag{35}$$
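Equations (31) to (35) translate into a short routine once the per-period estimates are available. The sketch below is illustrative only: it assumes a common design matrix `X` (one row per node pair, first column equal to 1) and a list `betas` of estimated coefficient vectors, and the names `cov_beta`, `t2_statistics`, and `ref` are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def cov_beta(X, beta):
    """(X^T W X)^{-1} with W = diag(exp(x beta)), cf. Equations (33)-(35)."""
    p = len(beta)
    lam = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
    info = [[sum(r[a] * r[b] * l for r, l in zip(X, lam)) for b in range(p)]
            for a in range(p)]
    cols = [solve(info, [1.0 if i == j else 0.0 for i in range(p)])
            for j in range(p)]
    return [[cols[j][i] for j in range(p)] for i in range(p)]


def t2_statistics(X, betas, ref):
    """T^2 for every period; ref lists the indices of the reference periods."""
    p = len(betas[0])
    m_r = len(ref)
    beta_bar = [sum(betas[t][a] for t in ref) / m_r for a in range(p)]
    covs = [cov_beta(X, betas[t]) for t in ref]
    S = [[sum(c[a][b] for c in covs) / m_r for b in range(p)] for a in range(p)]
    out = []
    for bt in betas:
        d = [x - y for x, y in zip(bt, beta_bar)]
        # d^T S^{-1} d computed by solving S z = d instead of inverting S.
        out.append(sum(di * zi for di, zi in zip(d, solve(S, d))))
    return out


# Hypothetical design matrix and coefficient estimates for four periods.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.5]]
betas = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.0], [2.0, 1.0]]
t2 = t2_statistics(X, betas, ref=[0, 1, 2])
```

In the toy input the last coefficient vector lies far from the reference mean, so its statistic is the largest of the four.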
6. Case study
Social network analysis (SNA) is the mapping and measuring of relationships and flows
between people, organizations or other information to determine the network changes. In the
current section, we describe two case studies to explain the proposed method. We consider
Enron corporation email communication network and a simulated communication network
datasets.

6.1. Enron email communication network


The Enron dataset is a large-scale email collection that was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) over a period of 3.5 years (1998 to 2002) from an American energy company based in Houston, Texas (Priebe et al. 2005). It contains data from about 150 users, mostly senior management of Enron, with a total of about 0.5 million messages.
In this case, we time-slice the Enron email corpus on a monthly basis from January 1999 to February 2002. The network of communications between Enron employees induces a graph. The vertices represent employees, and the email communications among them correspond to edges; the data are modeled as a stream of directed networks in weekly time intervals.
In this section, we describe the PDP clustering method to determine baseline periods from
Enron email corpus data. So, we use average degree centrality which refers to the average
number of edges a vertex has to other vertices to investigate baseline data over time. Actors
who have more edges may have multiple alternative ways and resources to reach goals and thus
be relatively advantaged. The standard centrality measures capture a wide range of importance
and also identify the most important people in the network.
Therefore, to extract the baseline periods, it suffices to calculate the weekly average degree centrality as a quality variable. After plotting the average degree centrality, shown in Figure 1, the whole time period of the Enron email dataset, from November 1998 to February 2002, is divided into periods with different cluster labels.
Figure 1 also shows that there are three important periods. The first cluster is from November 1998 to December 1999 (labeled t1 in Figure 1), the second cluster is from December 1999 to April 2001 (labeled t2 in Figure 1), and the third is from April 2001 to February 2002.

Figure 1. Plot of the weekly average degree centrality over time for emails of Enron's employees.

Figure 2. Plot of the LRT statistic versus time for monitoring weekly emails of Enron's employees.

Thereafter, we have to determine which clusters have the best quality by comparing their coefficient of variation values. The periods corresponding to the smallest changes in the average degree centrality of the 150 Enron employees are selected as the baseline periods. So, the first cluster, which is among the important clusters, is selected as the baseline period. After the PDP clustering, we estimate the regression parameters as described in the previous section based on snapshots of Enron's communication networks. The LRT statistic is also calculated to monitor the network changes. Figure 2 depicts the LRT statistic over time using the PDP method. To determine the UCL of the LRT statistic in this method, we choose α = 0.0027 with ten degrees of freedom.
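The text does not state which reference distribution yields the UCL. Assuming the usual large-sample chi-square approximation for the LRT statistic, the limit would be the $(1 - \alpha)$ quantile of a chi-square distribution with ten degrees of freedom. The sketch below computes it with the closed-form upper tail for even degrees of freedom and a bisection search; the function names and the search bounds are illustrative choices.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def chi2_ucl(alpha, df, lo=0.0, hi=200.0):
    """Bisection for the (1 - alpha) quantile; the bounds are assumptions."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


ucl = chi2_ucl(0.0027, 10)
```

Any LRT value above `ucl` would then be flagged as a signal under this assumption.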
The results show that the Enron crisis broke out in late 2001. As can be observed, the highest peak occurred in October 2001, when the bankruptcy was filed. The detected anomalies reflect the excessive communications during the Enron scandal.

6.2. Simulated networks


In this subsection, we use signal probability to compare the performance of the proposed method with existing methods. We simulate a stream of 120 networks consisting of 150 fixed nodes, in which the number of connections between two nodes follows a Poisson distribution with parameter $\lambda_{ij} = 9$ over the time periods $t = 1, 2, \ldots, 120$.
We calculate the aforementioned quality variable for each time period. Figure 3 shows a simulated dataset consisting of 120 snapshots. From Figure 3 it is clear that the process experiences a period of normal conditions from t1 to t2, corresponding to snapshots 27 to 100. Based on data logs analyzed by the process engineer, there are three important periods. The first, from snapshot 1 to 26, was a transient period. The second period, snapshots 27 to 100, had a good and stable quality variable. The third period, snapshots 101 to 120, was a transient period as well. Thus, according to the process engineer, the baseline would be snapshots 27 to 100.

Figure 3. Plot of quality variable from simulation data.
When applying the PDP clustering method, we set the moving window size to $w = 10$ time units. Then, to determine the baseline period, the quality variables are segmented into subsequences $S_1, S_2, \ldots, S_{111}$. The subsequences are generated as follows:

$$S_1 = \begin{bmatrix} qv_1 \\ qv_2 \\ qv_3 \\ qv_4 \\ qv_5 \\ qv_6 \\ qv_7 \\ qv_8 \\ qv_9 \\ qv_{10} \end{bmatrix}, \quad S_2 = \begin{bmatrix} qv_2 \\ qv_3 \\ qv_4 \\ qv_5 \\ qv_6 \\ qv_7 \\ qv_8 \\ qv_9 \\ qv_{10} \\ qv_{11} \end{bmatrix}, \quad \ldots, \quad S_{111} = \begin{bmatrix} qv_{111} \\ qv_{112} \\ qv_{113} \\ qv_{114} \\ qv_{115} \\ qv_{116} \\ qv_{117} \\ qv_{118} \\ qv_{119} \\ qv_{120} \end{bmatrix}$$

To determine the profile of each subsequence, we fit the normal, Poisson, and exponential distributions to each subsequence using the Kolmogorov–Smirnov test, which identifies the normal distribution as the distribution of the subsequences (Ramanayake and Gupta 2002; Kim et al. 2003; Mahmoud and Woodall 2004; Daniel 1990). The PDP clustering method segments the PDPs into three clusters using the k-means method, as shown in Figure 4. These three periods are very close to the periods identified by the process engineers. The first period, the transient period, is from 1 to 21 (labeled t1 in Figure 3). The second period is from 22 to 102 (labeled t2 in Figure 3).

Figure 4. PDPs in three clusters.

Table 2. Descriptive statistics for clustering.

| | Cluster 1 | Cluster 2* | Cluster 3 |
|---|---|---|---|
| Number of each cluster | | | |
| Mean | | | |
| Standard deviation | | | |
| Coefficient of variation | | | |

As mentioned in the previous section, we select the baseline clusters with the best coefficient of variation values. Table 2 shows the calculated mean, standard deviation, and coefficient of variation for each cluster. By comparing the coefficient of variation values, we choose cluster 2 as the baseline cluster. The third period is from 103 to 120 and corresponds to the third, transient period.
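The selection rule can be sketched as follows, assuming that the "best" coefficient of variation means the smallest one (the most stable cluster). The cluster values below are hypothetical placeholders and are not the figures reported in Table 2.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def pick_baseline(clusters):
    """Return the label of the cluster with the smallest coefficient of variation.

    clusters: dict mapping a cluster label to its quality-variable values.
    """
    return min(clusters, key=lambda c: coefficient_of_variation(clusters[c]))


# Hypothetical quality-variable values per cluster.
clusters = {
    1: [4.0, 9.0, 2.0, 7.0],
    2: [9.0, 9.1, 8.9, 9.0],
    3: [12.0, 5.0, 9.0, 3.0],
}
baseline = pick_baseline(clusters)
```

Because the coefficient of variation is scale-free, clusters with different mean levels can be compared directly.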
Tables 3 and 4 also show the results of the t-test used to determine the other baseline clusters. Levene's test is used to test whether two clusters have equal variances (Schultz 1985). As shown in Tables 3 and 4, we calculate the test statistics from the simulated data; they indicate that the equality of the two clusters' variances is accepted and that clusters 1 and 3 cannot be selected as baseline clusters. Also, based on the significance value (Sig.) at a type I error probability of 0.01, we can reject the null hypothesis. If H0 is rejected, we conclude that H1 is true, so there is a significant difference between the two groups (Hawkins and Zamba 2005).
In this study, we simulate a social network modeled as a graph in which members are nodes and connections between two nodes are edges. Three attributes, namely information diversity, age, and work experience, are defined as the external variables of the network that affect the formation of connections between network members. We assume that information diversity is sampled from a uniform distribution between 0 and 10, age is distributed uniformly between 20 and 55, and work experience follows a uniform distribution between 0 and 30. Descriptive statistics and correlations, obtained using MINITAB software, are provided in Tables 5 and 6.
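A small simulation along these lines can be written with the standard library alone. The sketch below draws the three attributes from the stated uniform ranges and generates directed pair counts with a constant rate of 9, as in the in-control setting described earlier; the function names, the dictionary layout, the seed, and the reduced network size in the usage line are assumptions made for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for small rates such as 9."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n_nodes, n_periods, lam=9.0, seed=0):
    """Return node attributes and a list of snapshots of directed pair counts."""
    rng = random.Random(seed)
    attrs = [
        {
            "information_diversity": rng.uniform(0, 10),
            "age": rng.uniform(20, 55),
            "work_experience": rng.uniform(0, 30),
        }
        for _ in range(n_nodes)
    ]
    # Ordered pairs without self-loops: s(s - 1) possible relationships.
    pairs = [(i, j) for i in range(n_nodes) for j in range(n_nodes) if i != j]
    snapshots = [{pair: poisson(rng, lam) for pair in pairs}
                 for _ in range(n_periods)]
    return attrs, snapshots


# A reduced example; the paper uses 150 nodes and 120 periods.
attrs, snapshots = simulate(n_nodes=6, n_periods=20)
```

Calling `simulate(150, 120)` would match the dimensions in the text, at the cost of a longer run time in pure Python.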
Now we use a Poisson regression model to determine the impact of external variables in
the formation of communications between nodes. Equation 36 shows the average number of
connections which are influenced by the external features of the network over time.

E(ŷi j |Xij ) = 0.022 + 0.031xi j1 + 0.057xi j2 + 0.063xi j3 (36)

If we fit the Poisson regression equation to the 81 networks in cluster 2, the baseline
cluster, using the MLE approach, Equation 37 is obtained for the baseline networks:

$$E_{\text{baseline}}(\hat{y}_{ij} \mid X_{ij}) = 0.02 + 0.043x_{ij1} + 0.061x_{ij2} + 0.067x_{ij3} \qquad (37)$$
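For readers unfamiliar with the MLE step, the following is a minimal sketch of fitting a Poisson regression by Newton-Raphson. It assumes the usual log link, whereas Equations 36 and 37 are printed as linear in the covariates and this section does not state the link used. The data are synthetic, with covariates on [0, 1] and hypothetical true coefficients; none of the names or numbers come from the paper.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_poisson_mle(xs, ys, iters=25):
    """Maximize the Poisson log-likelihood with log link by Newton-Raphson.

    Each row of xs must start with 1.0 for the intercept.
    """
    p = len(xs[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(ys) / len(ys))  # start at the intercept-only fit
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, y in zip(xs, ys):
            mu = math.exp(sum(b * v for b, v in zip(beta, x)))
            for a in range(p):
                grad[a] += (y - mu) * x[a]          # score vector
                for c in range(p):
                    hess[a][c] += mu * x[a] * x[c]  # Fisher information
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Synthetic check with hypothetical coefficients (not the paper's values).
TRUE_BETA = (0.2, 0.5, 0.3, 0.4)
_rng = random.Random(7)
XS = [[1.0, _rng.random(), _rng.random(), _rng.random()] for _ in range(3000)]
YS = [poisson_draw(_rng, math.exp(sum(b * v for b, v in zip(TRUE_BETA, x))))
      for x in XS]
BETA_HAT = fit_poisson_mle(XS, YS)
```

In the paper's setting, the rows of `XS` and `YS` would be the pooled observations from the baseline networks, so that `BETA_HAT` plays the role of the baseline coefficients.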

The performance of the proposed method is therefore evaluated by applying the following
changes and is compared with existing techniques in terms of the signal probability
criterion. For each method, the shifted models can be written as

$$E(\hat{y}_{ij} \mid X_{ij}) = (0.022 + \delta_1) + (0.031 + \delta_2)x_{ij1} + (0.057 + \delta_3)x_{ij2} + (0.063 + \delta_4)x_{ij3} \qquad (38)$$

$$E_{\text{baseline}}(\hat{y}_{ij} \mid X_{ij}) = (0.02 + \delta_1) + (0.043 + \delta_2)x_{ij1} + (0.061 + \delta_3)x_{ij2} + (0.067 + \delta_4)x_{ij3} \qquad (39)$$
Table 3. Output of independent sample test for cluster  and cluster .
Levene’s Test for Equality of Variances (F, Sig.); t-test for Equality of Means (t, df, Sig. (-tailed), Mean Difference, Standard Error Difference, % Confidence Interval of the Difference: Lower, Upper)

Response Equal . . .  . . . . .
Variances
Assumed

Table 4. Output of independent sample test for cluster  and cluster .
Levene’s Test for Equality of Variances (F, Sig.); t-test for Equality of Means (t, df, Sig. (-tailed), Mean Difference, Standard Error Difference, % Confidence Interval of the Difference: Lower, Upper)

Response Equal . . .  . . . . .
Variances
Assumed

Table 5. Descriptive statistics for attributes.


Descriptive statistics

N Minimum Maximum Mean Standard deviation

Information Diversity    . .


Age    . .
Work experience    . .

Table 6. Pairwise correlations between independent attributes.


Information Diversity Age Work experience

Information Diversity  — —
Age .  —
Work experience . . 

In step shifts, we generate three kinds of change: (1) shifts in the second half of the data,
i.e., the 60th to 120th periods (τ = 60); (2) shifts in the 20th to 120th periods (τ = 20); and
(3) shifts in the 10th to 120th periods (τ = 10).
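The step-shift scenarios can be sketched as a stream of coefficient vectors over the 120 periods, with the shift vector (δ1, δ2, δ3, δ4) of Equation 38 added from period τ onward. This is a minimal illustration; the function names are hypothetical, and periods are assumed to be numbered from 1 with the shift taking effect at period τ itself, as the description of the three scenarios suggests.

```python
# Coefficients as printed in Equation 36 (intercept, then three attributes).
BETA_EQ36 = (0.022, 0.031, 0.057, 0.063)
N_PERIODS = 120


def coefficients_at(period, tau, delta, beta=BETA_EQ36):
    """Coefficients in force at a 1-based period under a step shift starting at tau."""
    if period >= tau:
        return tuple(b + d for b, d in zip(beta, delta))
    return tuple(beta)


def coefficient_stream(tau, delta, beta=BETA_EQ36, n_periods=N_PERIODS):
    """One coefficient vector per period, for periods 1..n_periods."""
    return [coefficients_at(t, tau, delta, beta) for t in range(1, n_periods + 1)]


# Example: a shift of 0.01 in the intercept only, starting at period 60.
stream_tau60 = coefficient_stream(tau=60, delta=(0.01, 0.0, 0.0, 0.0))
```

A smaller τ leaves fewer in-control periods before the change, which is the factor the three scenarios are designed to vary.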
To compare the performance of the proposed methods through simulation, the upper control
limit (UCL) of each method is set so that the Type I error is approximately 0.05. The
resulting UCL values for the T2 and LRT methods are 18.11 and 9.49, respectively. The
simulations are conducted in MATLAB with 10000 simulation experiments, and the results
are summarized in Tables 7–9.
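The calibration idea, choosing the UCL as an upper quantile of the in-control statistic and then estimating the signal probability as the fraction of simulated statistics above it, can be sketched as follows. The in-control statistic here is a stand-in: a chi-square variate with four degrees of freedom, chosen only because its 0.95 quantile (about 9.488) is close to the reported LRT limit of 9.49. The source does not state how its limits were derived, and all function names are hypothetical.

```python
import math
import random


def chi_square_draw(rng, df):
    """Chi-square variate as a sum of df squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


def empirical_ucl(in_control_stats, alpha=0.05):
    """Empirical (1 - alpha) quantile of simulated in-control statistics."""
    ordered = sorted(in_control_stats)
    n = len(ordered)
    k = math.ceil((1 - alpha) * n)
    return ordered[min(max(k, 1), n) - 1]


def signal_probability(stats, ucl):
    """Fraction of simulated statistics that exceed the control limit."""
    return sum(s > ucl for s in stats) / len(stats)


# Stand-in in-control distribution (an assumption, not the paper's statistic).
rng = random.Random(1)
in_control = [chi_square_draw(rng, 4) for _ in range(20000)]
ucl = empirical_ucl(in_control, alpha=0.05)
false_alarm_rate = signal_probability(in_control, ucl)
```

Applying `signal_probability` to statistics simulated under a shifted model, with the same `ucl`, would give an entry of the kind tabulated for each shift size.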
A graphical analysis of the results in Tables 7–9 is given in Figures 5–8. The
horizontal and vertical axes of each figure show the magnitude of the shift in the corresponding
parameter and the signal probability, respectively. Figures 5–8 show that as the
magnitude of the shift increases, the capability of both methods to detect out-of-control
scenarios increases. In general, the figures indicate that the LRT method with baseline periods
performs better than the previous methods. As we see, when

Table 7. Comparison of the signal probability values of the proposed methods for shifts in β  , β  , β  , β 
in the simulation where the shift occurs at τ = 10.
δ1 . . . . . . . . . 
T . . .       
T2 Base line . . .       
LRT . .        
LRTBase line .         
δ2 . . . . . . . . . 
T . . . .      
T2 Base line . . .       
LRT . .        
LRTBase line .         
δ3 . . . . . . . . . 
T . . . . . .    
T2 Base line . . . . .     
LRT . . .       
LRTBase line . .        
δ4 . . . . . . . . . 
T . . . .      
T2 Base line . . .       
LRT . . .       
LRTBase line . .        
Figure 5. Performance of the proposed methods under various shifts and change points in β  .

Figure 6. Performance of the proposed methods under various shifts and change points in β  .

Figure 7. Performance of the proposed methods under various shifts and change points in β  .

Figure 8. Performance of the proposed methods under various shifts and change points in β  .

Table 8. Comparison of the signal probability values of the proposed methods for shifts in β  , β  , β  , β 
in the simulation where the shift occurs at τ = 20.
δ1 . . . . . . . . . 
T . . .       
T2 Base line . .        
LRT .         
LRTBase line .         
δ2 . . . . . . . . . 
T . . . .      
T2 Base line . . .       
LRT . .        
LRTBase line .         
δ3 . . . . . . . . . 
T . . . . . .    
T2 Base line . . . . .     
LRT . . .       
LRTBase line .         
δ4 . . . . . . . . . 
T . . . .      
T2 Base line . . .       
LRT . .        
LRTBase line .         

τ = 10, the T2 method with baseline periods performs slightly better than the T2 method,
but the difference between the two is negligible. For τ = 20 and τ = 60, the T2 chart with
baseline periods outperforms the T2 chart. Hence, as τ increases, the superiority of the T2
chart with baseline periods over the T2 chart increases. In other words, the longer the
network stays in the in-control state, the better the T2 chart with baseline periods detects
changes in the communication counts.

Table 9. Comparison of the signal probability values of the proposed methods for shifts in β  , β  , β  , β 
in the simulation where the shift occurs at τ = 60.
δ1 . . . . . . . . . 
T . . .       
T2 Baseline . .        
LRT . .        
LRTBase line .         
δ2 . . . . . . . . . 
T . . . .      
T2 Base line . .        
LRT . . .       
LRTBase line .         
δ3 . . . . . . . . . 
T . . . . . . .   
T2 Base line . . . .      
LRT . . .       
LRTBase line . .        
δ4 . . . . . . . . . 
T . . . . .     
T2 Base line . . .       
LRT . .        
LRTBase line .         

7. Conclusion
In this paper, the relationship between the counts of communications and the attributes
of network nodes was first modeled using Poisson regression. The PDP clustering method was
then applied to identify baseline time periods. Within this model, two control chart schemes
for detecting changes in the network streams were considered: Hotelling T2 and LRT. Simulation
studies using a given network were conducted, and the performance of the proposed methods in
detecting out-of-control scenarios under different shifts was compared in terms of the signal
probability criterion.
The simulation results indicate that the proposed PDP clustering method, combined with
Hotelling T2 control charts and the LRT method, detects out-of-control states in social
networks better than the previous methods. We also demonstrate that the overall performance
of the proposed method is much better than that of the existing methods in detecting
a wide range of shift sizes in social networks. Future research could shed more
light on social network analysis by considering correlation between different time periods, and
could develop models for online monitoring when relations between individuals are dependent
and nonparametric models are applied to construct control charts.

References
Abbasi, A., L. Hossain, J. Hamra, and C. Owen. 2010. Social networks perspective of firefighters’
adaptive behaviour and coordination among them. In Green Computing and Communications
(GreenCom), 2010 IEEE/ACM Int’l Conference on & Int’l Conference on Cyber, Physical and Social
Computing (CPSCom) (pp. 819–824). IEEE.
Al Hasan, M., V. Chaoji, S. Salem, and M. Zaki. 2006. Link prediction using supervised learning. In
Society of Industrial and Applied Math Data mining Conference.
Amiri, A., M. Koosha, A. Azhdari, and G. Wang. 2015. Phase I monitoring of generalized linear model-
based regression profiles. Journal of Statistical Computation and Simulation 85 (14):2839–2859.
Azarnoush, B., K. Paynabar, J. Bekki, and G. Runger. 2016. Monitoring temporal homogeneity in
attributed network streams. Journal of Quality Technology 48 (1):28–43.
Bersimis, S., J. Panaretos, and S. Psarakis. 2005. Multivariate statistical process control charts and the
problem of interpretation: a short overview and some applications in industry. In Proceedings of
the 7th Hellenic European Conference on Computer Mathematics and its Applications, Greece,
Athens, 1–6.
Capizzi, G., and G. Masarotto. 2010. Combined Shewhart–EWMA control charts with estimated
parameters. Journal of Statistical Computation and Simulation 80 (7):793–807.
Chakravarty, I. M., J. D. Roy, and R. G. Laha. 1967. Handbook of methods of applied statistics, 392–394.
John Wiley and Sons.
Christakis, N. A., and J. H. Fowler. 2010. Social network sensors for early detection of contagious out-
breaks. PloS one 5 (9):e12948.
Cook, K. S., and J. M. Whitmeyer. 1992. Two approaches to social structure: Exchange theory and net-
work analysis. Annual review of Sociology 18 (1):109–127.
Daniel, W. W. 1990. Kolmogorov–Smirnov one-sample test. Applied Nonparametric Statistics, 319–330.
Fraley, C., and A. E. Raftery. 1998. How many clusters? Which clustering method? Answers via model-
based cluster analysis. The computer journal 41 (8):578–588.
Freeman, L. C. 1977. A set of measures of centrality based on betweenness. Sociometry, 35–41.
Freeman, L. C. 1978. Centrality in social networks conceptual clarification. Social networks 1 (3):215–
239.
Guh, R. S. 2005. A hybrid learning-based model for on-line detection and analysis of control chart
patterns. Computers & Industrial Engineering 49 (1):35–62.
Han, K. F., and D. Baker. 1995. Recurring local sequence motifs in proteins. Journal of molecular biology
251 (1):176–187.
Han, J., and M. Kamber. 2001. Data mining: concepts and technologies. Data Mining Concepts Models
Methods & Algorithms 5 (4):1–18.
Hawkins, D. M., and E. M. Maboudou-Tchao. 2007. Self-starting multivariate exponentially weighted
moving average control charting. Technometrics 49 (2):199–209.
Hawkins, D. M., and K. D. Zamba. 2005. A change-point model for a shift in variance. Journal of Quality
Technology 37 (1):21–31.
Herberts, T., and U. Jensen. 2004. Optimal detection of a change point in a Poisson process for different
observation schemes. Scandinavian Journal of Statistics 31 (3):347–366.
Kim, K., M. A. Mahmoud, and W. H. Woodall. 2003. On the monitoring of linear profiles. Journal of
Quality Technology 35 (3):317–328.
Mahmoud, M. A., and W. H. Woodall. 2004. Phase I analysis of linear profiles with calibration applica-
tions. Technometrics 46 (4):380–391.
Mason, R. L., and J. C. Young. 2000. Interpretive features of a T2 chart in multivariate SPC. Quality
Progress 33 (4):84–89.
Mason, R. L., and J. C. Young. 2001. Implementing multivariate statistical process control using
Hotelling’s T2 statistics. Quality Progress 34 (4):71–73.
Mazrae Farahani, E., R. Baradaran Kazemzadeh, R. Noorossana, and G. Rahimian. 2016. A Statistical
Approach to Social Network Monitoring. Communications in Statistics-Theory and Methods.
McCulloh, I., J. Lospinoso, and K. M. Carley. 2007. Social Network Probability Mechanics. Proceedings
of the World Scientific Engineering Academy and Society 12th International Conference on Applied
Mathematics, Egypt, Cairo, 29–31.
McCulloh, I., and K. M. Carley. 2011. Detecting Change in Longitudinal Social Networks. Journal of
Social Structure 12 (3):1–37.
Miller, B. A., N. Arcolano, and N. T. Bliss. 2013. Efficient anomaly detection in dynamic, attributed
graphs: Emerging phenomena and big data. In Intelligence and Security Informatics (ISI), IEEE Inter-
national Conference on, 179–184.
Montgomery, D. C. 2007. Introduction to statistical quality control. John Wiley & Sons.
Neyman, J., and E. S. Pearson. 1992. On the problem of the most efficient tests of statistical hypotheses.
In Breakthroughs in statistics, 73–108. Springer New York.
Noorossana, R., G. Rahimian, M. R. Nayebpour, and E. Mazrae Farahani. 2017. A statistical process
control method for monitoring social networks using generalized likelihood ratio test. International
Journal of Operations and Quantitative Management 23:229–240.
Ramanayake, A., and A. Gupta. 2002. Change points with linear trend followed by abrupt change for
the exponential distribution. Journal of Statistical Computation and Simulation 72 (4):263–278.
Savage, D., X. Zhang, X. Yu, P. Chou, and Q. Wang. 2014. Anomaly detection in online social
networks. Social Networks 39:62–70.
Schultz, B. B. 1985. Levene’s test for relative variation. Systematic Biology 34 (4):449–456.
Sharafi, A., M. Aminnayeri, and A. Amiri. 2013. An MLE approach for estimating the time of step
changes in Poisson regression profiles. Scientia Iranica 20 (3):855–860.
Sparks, R., and J. D. Wilson. 2016. Monitoring communication outbreaks among an unknown team of
actors in dynamic networks. arXiv preprint arXiv:1606.09308.
Sparks, R. 2015. Social network monitoring: aiming to identify periods of unusually increased commu-
nications between parties of interest. In Frontiers in Statistical Quality Control, 11, 3–13. Springer
International Publishing.
Sparks, R. 2016. Detecting Periods of Significant Increased Communication Levels for Subgroups of
Targeted Individuals. Quality and Reliability Engineering International 32 (5):1871–1888.
Sullivan, J. H. 2002. Detection of multiple change points from clustering individual observations. Jour-
nal of Quality Technology 34 (4):371–383.
Sullivan, J. H., and W. H. Woodall. 1996. A control chart for preliminary analysis of individual obser-
vations. Journal of Quality Technology 28 (3):265–278.
Sullivan, J. H., and W. H. Woodall. 2000. Change-point detection of mean vector or covariance matrix
shifts using multivariate individual observations. IIE transactions 32 (6):537–549.
Wald, A. 1992. Statistical decision functions. In Breakthroughs in Statistics, 342–357. Springer New York.
Wasserman, S., and K. Faust. 1994. Social network analysis: Methods and applications, Vol. 8. Cambridge
university press.
Wilson, J. D., N. T. Stevens, and W. H. Woodall. 2016. Modeling and estimating change in temporal
networks via a dynamic degree corrected stochastic block model. arXiv preprint arXiv:1605.04049.
Woodall, W. H., M. J. Zhao, K. Paynabar, R. Sparks, and J. D. Wilson. 2017. An overview and perspective
on social network monitoring. IISE Transactions 49 (3):354–365.
Yeh, A. B., L. Huwang, and Y. M. Li. 2009. Profile monitoring for a binary response. IIE Transactions 41
(11):931–941.
Zhang, H., and S. Albin. 2007. Determining the number of operational modes in baseline multivariate
SPC data. IIE Transactions 39 (12):1103–1110.
Zhang, H., S. L. Albin, S. R. Wagner, D. A. Nolet, and S. Gupta. 2010. Determining statistical process
control baseline periods in long historical data streams. Journal of Quality Technology 42 (1):21–35.
