
Received July 29, 2019, accepted August 29, 2019, date of publication September 5, 2019, date of current version September 26, 2019.


Digital Object Identifier 10.1109/ACCESS.2019.2939736

Novelty Detection in Social Media by Fusing Text and Image Into a Single Structure
MARTA AMORIM1, FREDERICO D. BORTOLOTI1, PATRICK M. CIARELLI1, EVANDRO O. T. SALLES1, AND DANIEL C. CAVALIERI2


1 Electrical Engineering Department, Federal University of Espírito Santo, Vitória 29075-910, Brazil
2 Automation and Control Engineering Department, Federal Institute of Espírito Santo, Serra 29173-087, Brazil
Corresponding author: Marta Amorim (marta.amorim@ifes.edu.br)
The work of M. Amorim was supported by the IFES-Instituto Federal do Espírito Santo.

ABSTRACT This work aims to propose an approach for detecting novelties, taking into account the temporal
flow of data streams in social media. To this end, we present a completely new architecture for novelty
detection. This new architecture entails three new contributions. First, we propose a new concept for novelty
definition based on temporal windows. Second, we formulate an expression to determine the quality of a
novelty. Third, we introduce a new approach to the fusion of heterogeneous data (image + text), using
the COCO dataset and the MASK-RCNN convolutional neural network, which transforms image and text
from social media into a single data format ready to be identified by machine learning algorithms. Since
novelty detection is a task in which labeled samples are scarce or nonexistent, unsupervised algorithms are
used, and thus, the following baseline and state-of-the-art algorithms have been chosen: kNN, HBOS,
FBagging, IForesting, and autoencoders. The new fusion approach is also compared to a state-of-the-art
approach to outlier detection named AOM. Because of temporal particularities and the data types being
fused, a new dataset was created, containing 27,494 tweets collected from Twitter. Our experiments show
that data classification of social media using data fusion is superior to using only text or only images as input
data.

INDEX TERMS Detection algorithms, knowledge based systems, learning systems, machine learning,
neural networks, classification algorithms, streaming media.

I. INTRODUCTION
Novelty detection is also generally referred to in the literature as event detection [1]. In fact, event detection is a general term used to refer to a variety of detection types, including that of outliers and anomalies, among others. Additionally, many other synonymous terms, such as anomaly or outlier detection, are often used to refer to novelty detection. However, there is a difference between meanings. The Merriam-Webster dictionary [2] defines novelty as an event that is "new or different from anything familiar". Barnett and Lewis [3] define an outlier as an event that is inconsistent with a training dataset. Therefore, outlier detection aims to handle strictly abnormal observations in datasets [4]. In novelty detection, if a novelty is inconsistent with a dataset, the novelty is included in the dataset, whereas an outlier is usually discarded. Nevertheless, both definitions share the same drawback: the scarcity of data labeled as abnormal or novel and available for training. For this reason, novelty detection is often performed using unsupervised or one-class methods. In cases where data labeled as novelties are available, novelty detection should be approached as a class imbalance problem.

Samples labeled as novelties are expensive to obtain or are naturally rare [5]. Some examples of this problem occur in the following contexts: a particular defect of a machine, cyber attacks attempting to destroy a system, or news of significant prominence, among others. These systems may operate normally for an extended period of time. Data recorded during normal operation are usually accessible and easy to measure. On the other hand, observing failures or rare events depends on the occurrence of all possible scenarios of improper operation, which is not always possible. This may be even more difficult if significantly varied events may occur.

In the case of social media news, thousands of news articles are generated daily on the Internet, covering a significant variety of subjects [6]. Automatic identification (or detection) of news items that are novel and of interest to large audiences, such as a natural disaster occurring in a particular location or the award of a Nobel Prize, poses a singular challenge to a priori prediction. Consequently, trained models are most likely to present a high bias while maintaining low variance [5].


In addition to the scarcity of samples available for training, another issue that has emerged in the novelty detection field regards handling heterogeneous data, as heterogeneity is an inherent characteristic of social media. The news posted on social networks can contain not only text but also images, videos, temporal information, social connections, user preferences, and other metadata. Gao et al. [7] indicate that 30% of blogs post content with images, and this number continues to increase. Therefore, visual content becomes significant in novelty detection analysis. The proliferation of applications that share photos on the web, such as Flickr, Instagram, Pinterest, and Vero, results in a large number of photos being publicly available. Flickr presently hosts more than 6 billion photos, of which 100 million include geolocation data [8]. Facebook is currently the most popular social network for photo sharing, with over 100 million photos shared daily [9]. Another social network that is known for publishing short posts (tweets) is Twitter, with a total of 1.3 billion accounts. There are 500 million tweets sent every day, or approximately 6,000 tweets every second. According to [10], tweets with images result in 18% larger click-through rates, 89% more likes, and 150% more retweets.

The literature discusses several problems encountered when classifying images from social media, including low quality, ambiguous meaning, and noise, among others. As a consequence, most current novelty detection techniques for web content deal only with textual content and social connections, whereas visual content is ignored. However, texts posted on social networks have many elements, including paraphrases, metaphors, high expressiveness, and so forth, that make classification a difficult task [11], [12]. The field of computer vision has recently advanced significantly due to deep learning techniques, more specifically, convolutional neural networks. Recent advances in computer vision enable extraction of information from images with excellent precision. Examples include identifying different objects in a scene, including people, animals, vehicles, and so forth. This kind of information may be used as a complement to the information present in the text to develop better classifiers of social network posts. In this context, we hypothesize that the fusion of text and images will add more relevant information to novelty detection in cases where the isolated structures (text or images) may be deficient.

Another peculiarity is that social media generates data in a data stream format. The quantity of data produced in social media is not only enormous but continues to grow, and data are regularly updated [13], [14]. Due to the considerable quantity of data and the specific goal of most applications, most papers attempt to identify novelties within a specific context, sometimes filtering data by keywords, as done in [15]–[18].

To generalize the novelty concept to a context-free one, we redefine novelty as a unique news item that stands out relative to the present and the past, considering a time interval that spans predefined temporal windows. Moreover, as we look at a temporal window in the past, we apply a memory sense. For example, if the temporal window size is 4 hours and the news item is "American dollar currency quotation", how would the novelty detector behave for this short temporal window? If we change the temporal window from 4 hours to 10 days, would the detector identify the same news as a novelty? The hypothesis is that a change in the size of temporal windows changes novelty labels and, therefore, implies a change in the underlying distribution of the problem, which is known in the literature as concept drift [19].

User interactions in social media tell us a great deal about the real world, and these insights are not limited to any particular temporal aspect [20]. There are many ways to identify events of interest to social media users. Perhaps the most straightforward way is based on the content of the tweets [21]. This means that the identification would be tied to a context, ignoring events of importance to other contexts. Due to the peculiar nature of novelties, they can be of great interest to users. Popularity also indicates the degree of interest in the event. As [22] pointed out, the popularity of tweets is affected by multiple factors aside from newsworthiness. Therefore, in this study we seek to identify a novelty together with its degree of repercussion, which defines the quality of the novelty.

For our study, we break down novelty detection in social media into three aspects: heterogeneity, volatility, and audience. The new data fusion tackles the heterogeneity aspect. Temporal windows address the volatility aspect. The quality of a novelty addresses the audience aspect. In this paper, we propose two new novelty detection architectures for heterogeneous social media data. The first applies text and image fusion techniques to the input of unsupervised algorithms, while the second applies fusion techniques to the output of the unsupervised algorithms. We perform experiments on a dataset obtained from Twitter.

The contributions of this work are as follows: (i) a new dataset containing text and images from Twitter that are used for novelty detection, (ii) a new proposal for fusing heterogeneous data that improves the ROC AUC metric (receiver operating characteristic area under the curve) relative to the other techniques applied in this work, and (iii) a proposal for novelty detection considering temporal window and audience.

This paper is organized as follows. In Section II, we present novelty detection-related studies. The introduction of the proposed methods is divided into two subsections of Section III: III-A and III-B. In Section III-A, we describe the procedures for labeling a dataset for training purposes and other concepts pertaining to the novelty detection task. In Section III-B, we explain the pipelines of the proposed architectures: data encoding, fusion, and classification. Section IV outlines the experimental setup, the dataset and results. Discussion is presented in Section V. Finally, the general conclusions are stated in Section VI.


II. RELATED WORK
Related studies in the literature use techniques for detecting novelty only from unstructured data (raw text, images, user data, and videos, among other metadata) from social media.

Alqhtani et al. [23] developed a method for detecting novelty on Twitter using unstructured data. The goal is to evaluate whether optimal event detection results from using only text, only images, or a combination of both. The analysis starts with text-based event detection. Performing detection over text entails the application of several preprocessing techniques, such as filtering characters, tokenization, removal of stopwords, and stemming. After the text has been preprocessed, the TF-IDF (term frequency-inverse document frequency) technique is used to generate a high-dimensional vector that represents the text. Based on the weight assigned by TF-IDF, a threshold is used to determine which tweets are novelties. If a tweet is deemed a non-novelty by the text analysis technique, then the image-based event detection method is used. For image-based event detection, visual characteristics such as HOG (histogram of oriented gradients), GLCM (gray-level co-occurrence matrix), and color histograms are used to generate the vector representing the image. Finally, the researchers use a supervised classifier to classify images into novelties and non-novelties. They analyzed one million tweets that contained text and photos posted about Indonesia AirAsia Flight 8501, collected from the Twitter stream on 28 December 2014. The result of the experiment showed that the proposed method reached an accuracy of 89% with text only, 86% with image only and 94% with text and image.

Zhang et al. [24] proposed a new probabilistic model that identifies an event by computing the correlation of several data types, such as visual and textual content, user profile, and temporal information. The textual content includes tags, titles, descriptions, and comments about images. The textual vector of text t is denoted by xt = <xt1, xt2, ..., xti>, where xti is a value based on the frequency of the i-th element. The visual content is represented in the form of visual words. To this end, each image is evenly divided into 16 × 16 pieces. The visual features are extracted by the SIFT (scale invariant feature transform) algorithm and through color histograms of 32 and 128 dimensions. Afterwards, a k-means algorithm groups and converts features into visual words. Thus, the visual content of image v can be represented by vector xv = <xv1, xv2, ..., xvi>, where each element is associated with a visual word. The user features are related to the image, that is, users who uploaded, shared or marked the image as a favorite. Thus, the user features obtained for user u are denoted by vector xu = <xu1, xu2, ..., xui>. Each event Ci can be considered a "virtual object" that consists of the union of features Ci = <∪xtj, ∪xvj, ∪xuj> (j = 1, 2, ..., |Ci|). The correlation of features is structured in an undirected graph model. The features are the vertices, and the edges are the relations between the features. A probabilistic Markov model connects the features with the event. The experiments analyze two million images extracted from Flickr.

Kaneko and Yanai [25] presented a system for discovering events related to Twitter photos. Selected photos must contain a geographical location (latitude/longitude). The method consists of two main steps: detection of keywords based on textual analysis and a clustering algorithm. First, the textual analysis identifies the words in the text. Afterwards, visual features are extracted from the photos using BoF (bag-of-features), SURF (speeded up robust features) and 64-color RGB (red-green-blue) color histograms. Photos are grouped by the corresponding keywords using Ward's clustering algorithm, which is a hierarchical clustering method. The most representative photo of each group is selected by the VisualRank algorithm. The maps of the United States and Japan are used to geolocate events. In the experiments, 17 million geolocated tweets posted in the United States and 3 million tweets posted in Japan are analyzed.

Mei et al. [26] proposed a new technique for joining heterogeneous data (images, text, and users) from social networks in the process of event summarization. First, the SIFT algorithm extracts the image features into a high-dimensional vector. A quantization hierarchy algorithm transforms the high-dimensional vector into a visual vocabulary. In the second step, a high-dimensional vector is computed by applying the bag-of-words approach to the text. The similarities between text vectors and between image vectors are calculated based on the cosine distance. Finally, a correlation matrix between images and texts is calculated using the similarities, computing a relevance score between text and image. The experiments use a dataset extracted from the Weibo social network containing 2,776,623 texts and 4,432,514 images.

Reflecting upon all related studies, several conclusions can be drawn: (i) most approaches use sparse and high-dimensional vectors to represent unstructured data; (ii) most approaches do not make it clear how they obtain the ground truth for a dataset; (iii) some approaches detect novelty from context; (iv) unstructured data are combined rather than fused, whereby combining data entails using both types of structured data to perform novelty detection, while fusion does not; and (v) approaches only handle detection with no qualification of novelty based on the importance of the event to social media users. With so many research challenges, we seek to answer the following questions:
• Which state-of-the-art approach is most effective in detecting novelties in data streams?
• Does fusion of data (image+text) improve novelty detection?
• How does conceptualizing novelty as a function of time and audience facilitate novelty detection?

III. METHODS
This section is divided into two parts: data preparation and novelty detection. The former describes the procedures for labeling the dataset to be used in the process of training the architectures. The latter describes the pipeline of the proposed architectures: data encoding, fusion, and classification.


A. DATA PREPARATION
In this section, we define our novelty concept, the novelty quality and a heuristic responsible for generating the ground truth for the dataset.

1) NOVELTY DEFINITION
In this paper, we define a novelty as a news item that stands out from other news over present and past time windows. Time can be freely adjusted in units of time, including hours, days, months, etc. Equation (1) presents our mathematical model for classifying an event as a novelty or a non-novelty. We redefine an event as a tweet.

f(wpr, wpa, tac) = Novelty, if (tac ∉ wpr) ∧ (tac ∉ wpa); NonNovelty, otherwise. (1)

Equation (1) is a function with three input variables: tac, wpr and wpa. Variable tac refers to a new incoming event within the present temporal window. Variable wpr refers to the present temporal window. Variable wpa refers to the past temporal window. Temporal windows can contain any time frame, such as days, weeks, or months. For an event to be a novelty, a new incoming tweet tac from a data stream should not belong to the present window wpr or the past window wpa. Otherwise, the event will be a non-novelty.

An event does not belong to a temporal window if it has no similarity to any event of the present and past temporal windows. In the semiautomatic labeling heuristics (presented in Algorithm 1), we calculate the cosine similarity between the events within a temporal window, and a threshold is applied to determine whether the similarity of an event indicates a novelty or a non-novelty (operator ∉ of Equation (1)); more details are available in Section III-A.3.

In the architecture for detecting novelty proposed in this work, we use unsupervised algorithms to detect novelty. In the training phase, the algorithms learn score thresholds that will reveal a novelty label (operator ∉ of Equation (1)). Therefore, the proposed architecture and the semiautomatic labeling heuristics use the novelty concept described by Equation (1).

2) NOVELTY QUALITY
Equation (2) characterizes the novelty quality of one event in a temporal window, as measured by its audience score on a scale from 1 to 4. The choice of range (1-4) is due to the use of the quartile measure. An event with quality equal to 4 indicates significant repercussions, whereas an event with quality equal to 1 represents a weak reaction in social media. Variable st is defined by the similarity between events, that is, we calculate the cosine similarity between events within a temporal window. Variable rt is measured by the number of shares (e.g., "retweets" in the case of Twitter). Variable fc corresponds to the number of interactions (e.g., "likes" in the case of Twitter). Variable n is the number of tweets posted within the present temporal window. The norm function performs the normalization of the rt and fc values to the range [0, 1].

qu = Quartile((1/n) Σ_{i=0}^{n−1} st_i + norm(rt) + norm(fc)), (2)

where

norm(x) = (x − min(x))/(max(x) − min(x)). (3)

The Quartile function splits all the tweets within the same temporal window into four groups, each containing 25% of the tweets. The first group contains tweets with the lowest audience, while the fourth group comprises tweets with the highest audience among all groups. The novelty quality value varies from 1 (for the smallest audience) to 4 (for the largest audience).

We chose the Quartile function because it is essentially a location specifier. We want to know where the most important events are located within the distribution of all detected novelties and non-novelties. However, the quartile function is inappropriate if there are many replicas of some distinct data values [27]. The dataset proposed in this work has few repetitions of values, so the quartile function is suitable for our objectives. Furthermore, quartiles are especially useful if data are not symmetrically distributed or a dataset contains outliers.

The novelty quality levels aid our strategy by
• Eliminating noisy novelty items (memes, daily news, and so forth), highlighting novelties that are rare for journalistic purposes, and pointing out suspicious actions in social media to detect crimes, among others.
• Identifying news from alternative sources. A novelty will not always originate from major news media (Reuters and BBC, among others). We live in the age of user-generated content, which has become so popular that some news usually appears first in other media before being noticed by the newswires. Prominent examples include the notifications of Osama bin Laden's death [28] and Michael Jackson's death [29] posted an hour before announcements in any other news media.
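As a concrete illustration of Equations (2) and (3), the sketch below computes the audience score of every tweet in a temporal window and maps it to a quartile from 1 to 4. It assumes the per-tweet mean cosine similarity, retweet count and like count are already available; the function and variable names are ours and are not taken from the released code.

```python
import numpy as np

def norm(x):
    """Min-max normalization of Equation (3); returns zeros if all values are equal."""
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return np.zeros_like(x) if rng == 0 else (x - x.min()) / rng

def novelty_quality(mean_sim, retweets, likes):
    """Audience score of Equation (2): mean similarity plus normalized retweet
    and like counts, mapped to quartiles 1 (lowest) to 4 (highest)."""
    audience = np.asarray(mean_sim, dtype=float) + norm(retweets) + norm(likes)
    # Quartile boundaries are computed over all tweets in the same temporal window.
    q1, q2, q3 = np.percentile(audience, [25, 50, 75])
    return 1 + (audience > q1).astype(int) + (audience > q2) + (audience > q3)

# Example: five tweets in one window.
qu = novelty_quality(mean_sim=[0.1, 0.3, 0.2, 0.8, 0.5],
                     retweets=[2, 50, 10, 900, 120],
                     likes=[5, 80, 20, 3000, 400])
print(qu)  # quartile (1-4) per tweet
```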
3) SEMIAUTOMATIC LABELING HEURISTICS
The ground truth labels and the novelty quality for the dataset are generated by a heuristic detailed in Algorithm 1. The first step is to receive events from the AllTweets data stream, which are ordered by the event creation date dt. The current event received by the algorithm is denoted by tac. Variables dtBegin and dtEnd store the start and end times, respectively, of a temporal window. The start date dtBegin stores the current event date dt. Variable dtEnd stores the start date dtBegin plus a custom time Windays (e.g., 10, 20 or 40 days). The events in the dataset are tweets.


FIGURE 1. Two architectures proposed for novelty detection. The architecture in (a) fuses text and an image using our new approach, the MASK-RCNN
network, encodes the fusion result, and finally classifies the tweet using an unsupervised algorithm; the architecture in (b) performs text encoding and
image feature extraction separately, uses an unsupervised algorithm for classification and finally performs data fusion with the AOM algorithm.

For each step of the loop, the algorithm checks if the current tweet is outside the temporal window, generates the ground truth using the Similar function (Equation (4)) and calculates the novelty quality using the Audience function (Equation (2)). Algorithm 1 is considered semiautomatic because a manual verification of the ground truth labels is needed. Manual verification employed an inspection methodology: three people inspected each tweet and evaluated whether the semiautomatic labels were correct.

h(s, qu) = Novelty, if (s < 0.4) ∨ (s ≥ 0.4 ∧ s < 0.5 ∧ qu = 4); NonNovelty, otherwise, (4)

where h is the label of event tac, s is the mean cosine similarity between tac and the events in the temporal window, and qu is the novelty quality of tac.

Algorithm 1 Pseudocode of the Labeling Process
1: procedure LabelingTweet
2:   flag ← 0
3:   while Not AllTweets is EOF do
4:     dt ← tac.dtCreated
5:     if flag == 0 then  ▷ Initializing the variables
6:       flag ← 1
7:       dtBegin ← dt
8:       dtEnd ← dtBegin + Windays
9:     if Not (dtBegin ≤ dt ≤ dtEnd) then  ▷ current tweet is not inside the temporal window
10:      dtBegin ← dt
11:      dtEnd ← dtBegin + Windays
12:      lstWinpas ← lstWinpre
13:      lstWinpre ← empty
14:    lstWinpre ← Add(tac)
15:    lbl ← Similar(tac, lstWinpre, lstWinpas)
16:    aud ← Audience(tac, lstWinpre)
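The labeling loop of Algorithm 1 and the decision rule of Equation (4) can be sketched as follows. The sketch assumes tweets arrive ordered by creation date and already carry a vector representation for the cosine similarity; the audience computation is simplified to a quartile over retweets plus likes, and all names are illustrative rather than taken from the released implementation.

```python
from datetime import timedelta
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def audience(tweet, window):
    """Simplified quartile score (1-4) over retweets plus likes in the current window."""
    vals = np.array([t["rt"] + t["fc"] for t in window], dtype=float)
    q1, q2, q3 = np.percentile(vals, [25, 50, 75])
    v = tweet["rt"] + tweet["fc"]
    return 1 + int(v > q1) + int(v > q2) + int(v > q3)

def similar_label(tweet, win_pre, win_pas, quality):
    """Equation (4): label from the mean cosine similarity s and the quality qu."""
    others = [t["vec"] for t in win_pre + win_pas if t is not tweet]
    s = float(np.mean([cosine(tweet["vec"], v) for v in others])) if others else 0.0
    is_novel = (s < 0.4) or (0.4 <= s < 0.5 and quality == 4)
    return "Novelty" if is_novel else "NonNovelty"

def label_stream(tweets, window_days=10):
    """Algorithm 1: slide a temporal window over the stream and label each tweet."""
    win_pre, win_pas, labels = [], [], []
    dt_begin = tweets[0]["created"]
    dt_end = dt_begin + timedelta(days=window_days)
    for tw in tweets:                            # ordered by creation date
        if not (dt_begin <= tw["created"] <= dt_end):
            dt_begin = tw["created"]             # open a new window
            dt_end = dt_begin + timedelta(days=window_days)
            win_pas, win_pre = win_pre, []       # present window becomes the past one
        win_pre.append(tw)
        qu = audience(tw, win_pre)
        labels.append(similar_label(tw, win_pre, win_pas, qu))
    return labels
```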
B. NOVELTY DETECTION
In this section, we describe the proposed architectures that perform novelty detection in social media. The two architectures, shown in Fig. 1, have three main pipelines: text encoding and image feature extraction, data fusion, and unsupervised algorithms.

First, data will arrive as a data stream; i.e., the architecture will receive it in the chronological order of creation. The first architecture (part (a) of Fig. 1) performs the following steps: (i) fusing the tweet image and text into a single textual structure using the MASK-RCNN network [30], (ii) transforming the textual structure into a vector representation using an autoencoder, and (iii) detecting novelty using an unsupervised algorithm. The unsupervised algorithms receive a vector of text as input. The second architecture (part (b) of Fig. 1) performs the following steps: (i) transforming the tweet image and text into vector representations using autoencoders, (ii) detecting novelty using unsupervised algorithms for the vector of the image and the vector of the text, and (iii) fusing the scores of the unsupervised algorithms using the AOM fusion method. The details of the proposed methods are presented in the following subsections.

1) ENCODING TEXT AND AN IMAGE
An autoencoder can transform text and an image into a real-valued weight vector, which is a neural network hidden layer representation [31]. A suitable encoder is essential for classifier or regressor performance [32]. Therefore, the encoders used in this work are based on [33] and [34]. There are several approaches to transforming text input to a neural network, including vector mean, one-hot, and so on. The approach chosen for text encoding in this work is known as concatenation.


This technique generates a matrix as the input to an encoding network so that each matrix represents a short tweet text where the words appear in the same sequence as in the text but are replaced by numeric vectors, more specifically, by word embeddings. Fig. 2 illustrates an example of an input matrix. Our model may be improved by learning word embeddings from larger corpora. We chose to initialize our word embeddings with Word2Vec [35] trained on the Google News corpus containing approximately 100 billion words.

FIGURE 2. Sentence model for mapping input sentences to their intermediate feature representations [34].

The encoding network takes as input texts comprising at most 20 words and produces an output of dimension 1 × 150 or 1 × 300. The text encoding network used in this work has the following architecture (Fig. 3): two 1-D convolutional layers with 64 filters of size 1 × 7, the ReLU activation function, and 1-D max-pooling, followed by dropout of 0.25 and one fully connected layer with 1 × 150 or 1 × 300 neurons, depending on the size chosen to represent the text.

For image feature extraction (Fig. 4), we use a network with the following architecture: a 2-D convolutional layer with 16 filters of size 90 × 90, the ReLU activation function, and 2-D max-pooling; another convolutional layer with 16 filters of size 45 × 45, the ReLU activation function, and 2-D max-pooling; a convolutional layer with 8 filters of size 22 × 22, the ReLU activation function, and 2-D max-pooling; a fourth convolutional layer with 4 filters of size 11 × 11, the ReLU activation function, and 2-D max-pooling; and one last convolutional layer with 4 filters of size 5 × 5, the ReLU activation function, and 2-D max-pooling, followed by dropout of 0.25 and a fully connected layer with 1 × 150 or 1 × 300 neurons. The input of the image feature extraction network is a tweet image in its raw form.

We apply the same text encoding network configuration in architectures (a) and (b) shown in Fig. 1. The image feature extraction network is used only in architecture (b) shown in Fig. 1.
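Under the stated configuration (20 words of 300-dimensional Word2Vec embeddings in, a 1 × 150 or 1 × 300 representation out), the text encoder of Fig. 3 can be approximated in Keras as below. Kernel, pooling and padding details not fully specified above are assumptions, so this is a sketch of the architecture rather than the published model.

```python
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

def build_text_encoder(seq_len=20, emb_dim=300, out_dim=150):
    """Approximate text encoding network of Fig. 3: two 1-D conv layers
    (64 filters, kernel 7, ReLU) with max-pooling, dropout 0.25 and a
    fully connected layer producing the 1 x 150 (or 1 x 300) vector."""
    return Sequential([
        Conv1D(64, 7, activation="relu", padding="same",
               input_shape=(seq_len, emb_dim)),
        MaxPooling1D(pool_size=2),
        Conv1D(64, 7, activation="relu", padding="same"),
        MaxPooling1D(pool_size=2),
        Dropout(0.25),
        Flatten(),
        Dense(out_dim),
    ])

encoder = build_text_encoder()
encoder.summary()  # input (None, 20, 300) word embeddings, output (None, 150)
```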
2) DATA FUSION
Fusion is a process that combines the output or the input of multiple classifiers to achieve a more reliable and complete decision. In the machine learning literature, two data fusion approaches can be found: fusing the inputs or fusing the outputs of classifiers. The first approach involves fusing heterogeneous input data (image, text, video, signals, and so forth) into a common representation format. This approach has been widely used with data obtained from sensors, such as in [36] and [37].

The second approach merges probabilities returned by classifier outputs. Examples of this approach include random forest, boosting, bagging, bragging, wagging, subagging, rotated bagging, and so forth. Consistent proposals have been presented in the field of output fusion, and there have been many relevant contributions. However, there is a lack of studies performing heterogeneous input data fusion on social media.

We develop a new method for fusing inputs using a convolutional neural network called MASK-RCNN [30] pretrained with the COCO dataset [38] to recognize objects in scenes, as shown in Fig. 5. The COCO dataset has 2.5 million object instances and therefore represents a large labeled object dataset. Hence, it is crucial to select an appropriate classifier with good published results. The MASK-RCNN network is one of the best performing networks in the literature. The network outputs are labels, confidence probabilities and bounding boxes. MASK-RCNN receives the images from the tweets as input. If the output labels of an image have probabilities above 0.7, they are added to the tweet text to aggregate and supplement information. The labels are ordered in ascending order of probability and added to the end of the tweet text. Then the text is encoded, and an unsupervised algorithm ultimately determines whether the information represents a novelty or a non-novelty. This new fusion approach (Fig. 1(a)) is performed at the data structure level (image+text).
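The input-level fusion of Fig. 1(a) can be illustrated with the short function below: the object labels returned by the Mask R-CNN detector for a tweet image are filtered by the 0.7 confidence threshold, sorted in ascending order of probability and appended to the tweet text before encoding. The detector call itself is omitted; `detections` stands for whatever (label, score) pairs the chosen implementation returns.

```python
def fuse_image_labels(tweet_text, detections, threshold=0.7):
    """Append confident object labels to the tweet text (input-level fusion)."""
    kept = [(label, score) for label, score in detections if score > threshold]
    if not kept:
        return tweet_text
    kept.sort(key=lambda pair: pair[1])          # ascending order of probability
    return tweet_text + " " + " ".join(label for label, _ in kept)

# Example with the detections of Fig. 5:
text = "strengthening and enriching our communities"
dets = [("person", 0.999), ("tie", 0.998), ("clock", 0.952),
        ("book", 0.346), ("laptop", 0.917)]
print(fuse_image_labels(text, dets))
# -> "... communities laptop clock tie person" (low-confidence "book" is dropped)
```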
The second fusion approach is illustrated in Fig. 1(b) and uses the AOM (average-of-maximum) fusion [39]. This fusion method distributes the novelty detection task to m estimators, and the estimators are divided into m/q buckets (subgroups) of q components each. First, maximization is performed over each of the buckets of q components, and the scores are subsequently averaged over the m/q buckets. For example, in our experiment there are m = 20 image estimators and m = 20 text estimators, so the estimators are divided into 10 buckets of 2 components each. First, maximization is performed in each bucket of 2 components, the scores are subsequently averaged over all buckets, and the average of the estimator outputs is finally obtained. This technique has led to good results in outlier detection methods that use unsupervised algorithms. In this work, all estimators used by AOM are unsupervised learning algorithms.
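The AOM combination itself takes only a few lines of NumPy, as sketched below for the 20-estimator, 10-bucket configuration; PyOD also ships an aom utility in its combination module. Variable names are illustrative.

```python
import numpy as np

def aom(scores, n_buckets=10):
    """Average-of-maximum fusion: split the estimator scores into buckets,
    take the maximum inside each bucket and average the bucket maxima.
    `scores` has shape (n_samples, n_estimators)."""
    scores = np.asarray(scores, dtype=float)
    buckets = np.array_split(np.arange(scores.shape[1]), n_buckets)
    bucket_max = np.column_stack([scores[:, idx].max(axis=1) for idx in buckets])
    return bucket_max.mean(axis=1)

# Scores from 20 estimators for 5 tweets, combined into one score per tweet:
rng = np.random.default_rng(0)
combined = aom(rng.random((5, 20)), n_buckets=10)
print(combined.shape)  # (5,)
```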
3) UNSUPERVISED LEARNING
According to [5], novelty detection is an unsupervised problem in which test data differ in some respect from the data available during training. There are many trade-offs associated with the choice of a model; for example, a highly complex model (with many parameters) may suffer from overfitting. A simpler model built upon a good and intuitive understanding of the data may lead to better results. However, simple models in general tend to fit data poorly. Accordingly, we select different baseline and state-of-the-art unsupervised learning algorithms to evaluate the performance of our method, as shown in Table 1.


FIGURE 3. Text encoding network.

FIGURE 4. Image feature extraction network.

FIGURE 5. Example of a tweet containing an image, in which the pretrained MASK-RCNN network recognized the following objects in the scene: a person (0.999), a tie (0.998), a clock (0.952), a book (0.346) and a laptop (0.917).

TABLE 1. Baseline and state-of-the-art unsupervised algorithms.

Unsupervised algorithms return scores for each instance. In the training phase, each algorithm also learns the score threshold for a novelty label. An instance is considered a novelty if its score surpasses the threshold. The k-nearest neighbors (kNN) algorithm being used is an unsupervised version. Equation (5) generates the score, where x is the instance and k is the number of neighbors. The dist function calculates the Euclidean distance. Instances with larger distances are more likely to be considered novelties.

s(x, k) = max(Σ_{i=1}^{k} dist(x_i, x_i')) (5)

The histogram-based outlier detection (HBOS) algorithm segments the data space into histograms called bins. The calculation of the score of instance x uses the inverse of the estimated density hist of each dimension d according to Equation (6). Instances associated with high s values are more likely to be considered novelties.

s(x) = Σ_{i=0}^{d} log(1/hist_i(x)) (6)

The isolation foresting (IForesting) algorithm involves a two-stage process of training and test phases. Binary tree structures are built in the training phase. In the test phase, instances are passed through the tree to obtain their scores according to Equation (7). Function φ is the length of all paths from the root node to an external node (a leaf) that pass through the instance. The average path value E(φ(x)) of the instance is returned. Function c(n) returns a normalization or adjustment factor, common in binary tree structures. Instances with longer path lengths are more likely to be considered novelties.

s(x, n) = 2^(−E(φ(x))/c(n)) (7)

FIGURE 6. First and second principal components (PC) of the hidden autoencoder layer for a 10-day temporal window. The threshold is calculated for each temporal window.

The feature bagging (FBagging) algorithm uses an algorithm ensemble according to Equation (8). First, it randomly chooses a subset of features Q without replacement. The local outlier factor (LOF) algorithm is applied using the Q features. The output of the algorithm is the scoring vector. LOF is executed n times, the results are combined, and the maximum score is returned. The higher the s value is, the more likely the instance is to be a novelty.

s(Q, n) = max(Σ_{i=1}^{n} LOF(Q)) (8)


The autoencoder algorithm calculates the score as the instance reconstruction error according to Equation (9). The error is calculated for each dimension d. The goal is to train the output to reconstruct the input as closely as possible. The input is reconstructed if the network can learn the weights of the intermediate neurons that represent the reduction of dimensionality. The autoencoder architecture used consists of 8 layers. As to the choice of activation functions, we use the sigmoid and rectified linear units (ReLUs). Specifically, in the first hidden layer and the output layer, the sigmoid activation function is used. In all other layers, we use the ReLU activation function [43].

s(x) = Σ_{i=1}^{d} (x_i − o_i)^2 (9)
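A minimal Keras version of the autoencoder scorer of Equation (9) is given below. It follows the stated design only loosely (eight dense layers, sigmoid on the first hidden and output layers, ReLU elsewhere); the layer widths, optimizer and number of epochs are illustrative choices, not the published configuration.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_autoencoder(dim=150):
    """Eight-layer dense autoencoder: sigmoid on the first hidden and output
    layers, ReLU on the remaining hidden layers."""
    widths = [128, 64, 32, 16, 32, 64, 128]
    model = Sequential()
    model.add(Dense(widths[0], activation="sigmoid", input_shape=(dim,)))
    for w in widths[1:]:
        model.add(Dense(w, activation="relu"))
    model.add(Dense(dim, activation="sigmoid"))
    model.compile(optimizer="adam", loss="mse")
    return model

X = np.random.rand(500, 150)                # encoded tweets scaled to [0, 1]
ae = build_autoencoder(dim=150)
ae.fit(X, X, epochs=20, batch_size=32, verbose=0)
recon = ae.predict(X)
scores = np.sum((X - recon) ** 2, axis=1)   # Equation (9): reconstruction error
threshold = np.percentile(scores, 95)       # e.g., flag the top 5% as novelties
```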

TABLE 2. Several examples of normalized scores generated by the autoencoder algorithm of Fig. 6 (a). The autoencoder applies the threshold of −0.6762 and predicts the labels Novelty and Non-Novelty.

Fig. 6 shows the behavior of the hidden layer of the autoencoder for a dataset of 10-day temporal windows. The autoencoder input uses the MASK-RCNN fusion. We apply PCA (principal component analysis) to the hidden layer output of the autoencoder. For each temporal window, the autoencoder calculates a threshold and scores. Table 2 shows several sample scores generated by the autoencoder algorithm for Fig. 6 (a).

IV. EXPERIMENTS
In this section, we detail the experiments and analyze the results.

A. EXPERIMENTAL SETUP
Our architectures were developed using Python 3.6.6, Keras 2.2.2, and PyOD 0.5.9 [45] tools for the unsupervised algorithms and Anaconda 4.5.11 as the virtual environment. All experiments were performed on a computer equipped with 32 GB of RAM, an NVIDIA 1080 GPU with 8 GB of video RAM, and an Intel i7-2600K CPU operating at 3.4 GHz, running Ubuntu 18.04 LTS. The data split was as follows: training used samples from the past temporal window, and testing used samples from the present temporal window.

The configured temporal windows were 1, 5, 10, and 20 days. Slices of consecutive 1-day temporal windows were concatenated due to the small number of tweets. The experiments were performed for the new fusion technique using MASK-RCNN as well as for the AOM fusion, using separate text and image data.

TABLE 3. Selected values of the kNN parameter.

TABLE 4. Selected values of the iForest parameter.

Tables 3, 4, 5, 6 and 7 show the configured parameter values for the unsupervised algorithms. The parameter column displays the parameter that was evaluated in the respective experiments. The window column shows the time frame in days. The test column shows the grid search range assigned to each algorithm parameter. The "Best" column shows the best average parameter value for the temporal windows selected in the experiments. However, for each new temporal window, the algorithms' parameters were adjusted using the previous window.

Parameter k of the kNN algorithm is related to the number of neighbors. Parameter nestimators of the IForest algorithm is related to the number of trees. Parameter nbin in the HBOS algorithm is related to the number of histograms. Parameter nestimators in the FBagging algorithm is related to the number of algorithm runs. Parameter epochs in the autoencoder algorithm is related to the number of iterations performed to complete training.


TABLE 5. Selected values of the HBOS parameter.

TABLE 6. Selected values of the FBagging parameter.

TABLE 7. Selected values of the autoencoder parameter.

The choice of parameter ranges for the grid search is based on overfitting/underfitting using the ROC AUC metric and the number of novelties per temporal window. Therefore, the larger the temporal window is, the greater the number of instances and the wider the parameter range. For instance, distance-based algorithms have to compute the distance between the tweet instances within the temporal window, so there is a limit on parameters k and nbins.

As each temporal window has different distributions, the tendency is for parameters to behave differently. Therefore, parameters are not fixed for each algorithm execution. A grid search over parameter ranges is run separately in each window. The best parameter value is the average of the values calculated for all temporal windows.
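The per-window tuning just described amounts to the sketch below: for each temporal window a grid of parameter values is evaluated (here the kNN neighborhood size, scored with ROC AUC against the heuristic labels), and the "Best" value reported in the tables is the average of the per-window winners. Names and ranges are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.knn import KNN

def tune_knn(windows, k_grid=(3, 5, 10, 15, 20)):
    """windows: list of (X_train, X_test, y_test) triples, one per temporal window.
    Returns the best k for each window and their average ('Best' column)."""
    best_per_window = []
    for X_train, X_test, y_test in windows:
        aucs = []
        for k in k_grid:
            det = KNN(n_neighbors=k).fit(X_train)
            aucs.append(roc_auc_score(y_test, det.decision_function(X_test)))
        best_per_window.append(k_grid[int(np.argmax(aucs))])
    return best_per_window, float(np.mean(best_per_window))
```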
B. DATASETS
The Twitter API (application programming interface) returns a wide variety of data and metadata. In our work, the most important indicators are the tweet ID, user name, image, text, creation date of a tweet, like count and retweet count [20]. By having a large quantity of free data available for consumption, Twitter has become a social network widely used for data analysis. Despite the large number of tweet datasets available on the Internet, the following shortcomings were observed in this research context:
• Most of the databases are presented in a particular context or pertain to a specific theme, that is, tweets are filtered using keywords.
• Our technique is based on temporal windows, and the labels change as a function of a set timespan. Labels are essential for measuring the effectiveness of the technique.
• Most databases contain many more instances of text than of images. To effectively evaluate our method, we require that each tweet must have text and an image. However, our method can be used if there is only text or only an image.

For the experiments, we collected 27,494 tweets (text+image) created in the 2017-2019 period. The tweets were posted by various news sources: BBC, BBC Breaking News, CNN, Reuters, NYTimes, etc. The heuristics (described by Algorithm 1) were used to label all tweets within temporal windows of four different sizes: 1, 5, 10 and 20 days. The peculiar characteristic of our dataset is that each text must have an associated image. Despite the ample period of tweet data collection, the number of posts containing text with images in social media is smaller than the number of posts containing text without an image.


FIGURE 7. Tweet text size distribution.

The database and the source code are freely available for download at https://github.com/martatcfa/novelty.

The choice of temporal window size should be made considering the information volatility (amount of news generated per day). For example, news regarding the "dollar quotation" will be identified as a novelty more times in a 1-day window than in 30-day windows. Therefore, smaller windows tend to identify more novelties. Larger windows tend to identify fewer novelties than do smaller windows. Information volatility may be linked to the type of news. The economic and political domains have a higher volume of news generated per day than the number of news items about natural disasters and Nobel Prize awards. Therefore, smaller windows would be more suitable for the former example. The choice of different temporal window sizes in this work (1, 5, 10 and 20 days) allows evaluating the volatility of information in various contexts.

Figs. 7, 8, 9, 10 and 11 show some statistics of the dataset. Fig. 7 illustrates the number of words versus the number of tweets. The x-axis is the number of words. The y-axis is the number of tweets. Most tweets have an average of 10 words, and the dictionary contains 19,259 unique tokens (words). The limited number of words per tweet and the large number of unique tokens indicate a hard task. We apply a state-of-the-art tweet normalization tool [46] to tokenize and transform each tweet into a sequence of words. This is done to mitigate the noise due to the colloquial nature of the data. The process involves, for example, spelling correction, elongated word normalization ("soooo" becomes "so"), word segmentation on hashtags ("#economysustainable" becomes "economy sustainable"), and unpacking contractions (e.g., "don't" becomes "do not") [47].
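The normalization steps listed above can be approximated with a few regular expressions, as in the toy sketch below. This is not the tool of [46]: real spelling correction and hashtag segmentation require dictionary-based models, so only the simpler transformations are shown.

```python
import re

CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}

def normalize_tweet(text):
    """Toy normalization: unpack a few contractions, collapse elongated words
    and strip the '#' from hashtags (full segmentation needs a word model)."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # "soooo" -> "so"
    text = re.sub(r"#(\w+)", r"\1", text)      # "#economysustainable" -> "economysustainable"
    return text.split()

print(normalize_tweet("Soooo happy, don't miss #economysustainable"))
```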
FIGURE 8. Tweet image width (pixels) distribution.

FIGURE 9. Tweet image height (pixels) distribution.

FIGURE 10. Distribution of the number of pixels in a tweet image.

Figs. 8, 9, 10 and 11 aid the evaluation of the quality of the tweeted images [48]. Fig. 8 shows the number of images versus image width. The average width of an image is approximately 971.313 pixels, but there is a considerable number of images that are less than 800 pixels wide. Excessively small images might not be very useful because they contain few pixels and therefore are of poor quality.

Fig. 9 shows the number of images versus image height. The average height of an image is approximately 617.048 pixels.

Fig. 10 shows the number of images versus the average number of pixels. Some images may contain no pixel variation and are entirely uniform. The average number of pixels is a measure that indicates the number of edges present in the image. If this number is very low, then the image is most likely a uniform image and may not represent the right content [48]. The values were obtained using code from [48]. The average number of pixels of a tweet image is approximately 2.560.


TABLE 8. Resulting ROC AUC metric values for the temporal window of 1 day.

TABLE 9. Resulting ROC AUC metric values for the temporal window of 5 days.

FIGURE 11. Distribution of tweet image whiteness.

Fig. 11 shows the number of images versus image whiteness. Some images can be too white or too bright, which might not be good for several purposes [48]. These values were also obtained using code from [48]. The average whiteness of a tweet image is approximately 10.790.

C. RESULTS
The experiments were evaluated with the ROC AUC metric; for each algorithm combination, the program was executed 10 times, and the final results are the means and standard deviations, as shown in Tables 8, 9, 10, 11, and 12.
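The evaluation protocol of this subsection can be sketched as follows: fit the detector on the past window, score the present window, compute the ROC AUC against the heuristic ground truth and repeat for ten runs to report the mean and standard deviation. The detector factory is a hypothetical helper, not part of the released code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(build_detector, X_train, X_test, y_test, runs=10):
    """Fit the detector `runs` times and report the mean and std of the ROC AUC."""
    aucs = []
    for seed in range(runs):
        det = build_detector(seed)          # e.g., a fresh autoencoder per run
        det.fit(X_train)
        aucs.append(roc_auc_score(y_test, det.decision_function(X_test)))
    return float(np.mean(aucs)), float(np.std(aucs))
```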
In addition to the state-of-the-art unsupervised algorithms, we evaluated the influence of the input vector size, with sizes of 150 and 300 considered for each temporal window. To verify the robustness of the fusion method, we analyzed the algorithms in relation to using only text, only images, and the fusion of text and image.

In Tables 8, 9, 10, and 11, the "Text" column shows the result of text classification. The "Image" column shows the result of image classification. The AOM column displays the result of using the AOM fusion proposed by Aggarwal and Sathe [39]. The MASK column displays the result of using the new approach of text and image fusion obtained with the trained MASK-RCNN network [30]. The best result for each unsupervised algorithm is highlighted.

In the experiments, the MASK-RCNN fusion achieved better results than did the AOM fusion using any of the selected unsupervised algorithms. The best results were obtained by the autoencoder algorithm with an ROC AUC of 0.7768 for a window size of 10 days.

The AOM method fuses the probabilities, combining the averaging function and a rank-based variant of the maximization function, and seeking the best of both techniques. The fusion that uses only the average is particularly desirable for small datasets because of its robustness. On the other hand, the maximizing function often emphasizes different outliers that are hidden by the averaging function [39].


TABLE 10. Resulting ROC AUC metric values for the temporal window of 10 days.

TABLE 11. Resulting ROC AUC metric values for the temporal window of 20 days.

TABLE 12. Resulting ROC AUC metric values for the temporal window of 10 days obtained for various proportions.

However, the AOM fusion is observed to be unsatisfactory in this experiment because, even though there is a small dataset in each window, the data are heterogeneous and with much higher noise in images than in texts, which impacts the fusion. Taking advantage of the fact that it is easier to classify text than an image, the MASK fusion extracts information from the image to augment the text. Since the MASK fusion is performed on the pipeline input, we do not observe the impact of distinct probabilities on the algorithms' output. Additionally, the input vector size and the algorithm being of an unsupervised type impact the novelty identification results, as reported in other studies [39], [49] and [50].

According to statistics, the proportion of texts is larger than that of images in social media [51]. Therefore, we experiment by varying the proportion of tweets including text and an image as they are fed to the unsupervised algorithm, with the best result obtained by the autoencoder algorithm, as shown in Tables 8 to 11. For example, 40/60% means that 40% of the data include text and an image, and 60% include text only for novelty detection. The selection of the data that contain only text and an image was random. Table 12 shows the results for four different proportions ranging from 10% to 40% of tweets with text and an image. This experiment showed that if the proportion of images is higher than or equal to 30%, images add relevant information to texts that improves novelty classification. Gao et al. [7] indicate that 30% of blogs post content with images, therefore showing that the proposed method is relevant. In comparison, the ROC AUC obtained using only text under the same conditions is 0.7003.

Other results can be analyzed visually, considering Fig. 12, which illustrates the time span between 2017 and 2019. The x-axis indicates the creation time of each tweet, and the y-axis shows the scores of the algorithm that obtains the best ROC AUC (namely, the autoencoder with the MASK fusion). The blue dots are tweets considered novelties, the green dots are tweets considered non-novelties, and the red dots are tweets with an audience score of 4. Tweets with an audience score of 4 are mostly detected as novelties, as shown in the caption of Fig. 12. A traced line indicates the algorithm threshold.


FIGURE 12. Results for temporal windows of 1, 5, 10 and 20 days.

Novelty qualification allows filtering tweets that stand out according to their audience. Therefore, we observe that the filter influences the elimination of "noisy" novelties, such as tweets with advertising: "strengthening and enriching our communities is one of the core principles at Bloomberg". Other tweets have been highlighted with the audience filter, for example, record-breaking moves in the stock market: "rt @business: breaking: the dow hits 20,000 for the first time".


TABLE 13. Sample of tweets from the dataset with the labels and audience scores.

FIGURE 13. Posted tweet with the timestamp of 2018-06-12 01:08:35.

The first tweet example containing advertising is indeed a novelty, since it satisfies the definition proposed by this work, and we cannot restrict novelties by subject, since even an advertisement can be considered an attractive novelty, depending on the context where it is placed. Table 13 shows a sample from the dataset on 2018-06-12. This sample shows that a filter requiring an audience score of 4 highlights the most important event of the day. Fig. 13 shows the posted novelty with the audience score of 4 and the timestamp of 2018-06-12 01:08:35.

Another interesting feature of temporal windows is the change in novelty labels. That implies a change in the underlying distribution of the problem, a phenomenon that is known in the literature as concept drift. We can illustrate it with the example tweet "rt @theterminal: market close today. #Dow20k", which in a 10-day window is considered a novelty in the context of the present and past time frames. However, it is not a novelty in a 20-day window because Bloomberg first reported it with the tweet "rt @business: breaking: the dow hits 20,000 for the first time". This example is attractive, since it entails identifying first-hand news, known as breaking news.

V. DISCUSSION
The MASK-RCNN fusion identifies labels whose addition to the text can improve the classifier. For example, the difficulty in classifying Fig. 14(a) is due to the poor quality of the image. Analyzing the texts of Fig. 14(a), "10-year-old labrador and all-round good boy fred has become dad to nine ducklings", and Fig. 14(b), "illegal puppy farming needs to stop", the first text mentions the Labrador breed, and the second refers to a young animal, making classification more difficult. The MASK-RCNN network can recognize the "dog" label in the scene with high confidence, thus assigning a new common keyword to the two tweets, which will make the learning algorithm more assertive. An image carries information that can complement the information present in the text.

FIGURE 14. Examples of tweets with an image, for which fusion is performed using COCO-aggregated information. The following objects are detected in the scenes: (a) dogs (0.970, 0.994 and 0.980) and a chair (0.887); (b) a dog (0.993), birds (0.987, 0.933 and 0.917), a chair (0.610) and a person (0.500); (c) a person (1.000), a TV (0.968) and a tie (0.540); and (d) a person (0.980), a TV (0.989) and a tie (0.803).

In another example, shown in Figs. 14(c) and 14(d), the images have few similarities, as the text of Fig. 14(c) is "how I stopped worrying and learned to love the link : @blackrock jeff rosenberg at fi14 in ny #bbfi", and that of Fig. 14(d) is "head of global credit strategy @goldmansachs discusses credit market trends in event fi14 in ny #bbgfi". In this case, the texts contain distinct semantics, and although they share only a single hashtag, the tweets refer to the same event. Shared hashtags do not result in a textual gain since the network is pretrained with the Google embeddings, so it does not recognize the word in the hashtag and returns a null vector for it. The MASK-RCNN network can recognize the labels "tv", "person" and "tie" in the scene. Thus, adding keywords to the tweet increases the assertiveness of the learning algorithm.

does not recognize the word in the hashtag and returns a null [10] (2019). 58 Incredible and Interesting Twitter Stats and Statistics.
vector for it. The MASK-RCNN network can recognize the Accessed: Jul. 15, 2019. [Online]. Available: https://www.brandwatch.
com/blog/twitter-stats-and-statistics/
labels ‘‘tv’’, ‘‘person’’ and ‘‘tie’’ in the scene. Thus, adding [11] G. Miner, J. Elder, T. Hill, R. Nisbet, D. Delen, and A. Fast, Practical Text
keywords to the tweet increases assertiveness of the learning Mining and Statistical Analysis for Non-Structured Text Data Applications,
algorithm. 1st ed. Orlando, FL, USA: Academic, 2012.
[12] S. Dodge and L. Karam, ‘‘Understanding how image quality affects
Another point of analysis is that the ROC AUC decreases as deep neural networks,’’ in Proc. 8th Int. Conf. Qual. Multimedia Exper.
the input vector size changes from 150 to 300 features, except (QoMEX), Jun. 2016, pp. 1–6.
for the autoencoder algorithm. The literature highlights that [13] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, ‘‘Deep learning
distance-based classification algorithms are at a disadvantage for IoT big data and streaming analytics: A survey,’’ IEEE Commun.
Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, 4th Quart., 2018.
in analyzing high-dimensional data [49]. Another reason is [14] L.-L. Shi, L. Liu, Y. Wu, L. Jiang, and J. Hardy, ‘‘Event detection and user
that none of the unsupervised algorithms have a powerful fea- interest discovering in social media data streams,’’ IEEE Access, vol. 5,
ture selector, extractor or ensemble, except for the FBagging pp. 20953–20964, 2017.
[15] W. Zhong. (Jul. 2, 2016). Novelty Detection. Accessed: Jul. 15, 2019.
and autoencoder algorithms. [Online]. Available: https://github.com/CrowdTruth/Novelty_Detection
VI. CONCLUSION
We proposed two architectures for novelty detection in social media that use information from text and images. To this end, we presented two methods of fusion: the fusion of text and an image using a trained MASK-RCNN network, and the output fusion of unsupervised algorithms with AOM. The method of fusion using MASK-RCNN showed good results, surpassing those of the AOM fusion.
The text and image fusion methods perform well if the image proportion is greater than or equal to 30% of the dataset. Therefore, the MASK-RCNN fusion adds more relevant information to novelty detection in cases where the isolated structures (text or an image) may be deficient.
The novelty definition involving temporal windows and the novelty quality proved to be a powerful media analysis tool. As a result, the novelty definition became context independent. The novelty quality allowed filtering tweets with a higher audience, thus eliminating advertisements and other irrelevant tweets. We also note that the smaller the temporal window is, the more difficult the classification becomes.
In future research, we intend to compare our proposed architecture with LSTM (long short-term memory) networks and with deep learning networks that generate text from an image.
REFERENCES
[1] F. Atefeh and W. Khreich, ‘‘A survey of techniques for event detection in Twitter,’’ Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.
[2] Merriam-Webster Online Dictionary. Accessed: Jul. 15, 2019. [Online]. Available: http://www.merriam-webster.com
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, vol. 37, 2nd ed. Hoboken, NJ, USA: Wiley, 1994.
[4] J. Xi, ‘‘Outlier detection algorithms in data mining,’’ in Proc. 2nd Int. Symp. Intell. Inf. Technol. Appl., vol. 1, Dec. 2008, pp. 94–97.
[5] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, ‘‘A review of novelty detection,’’ Signal Process., vol. 99, pp. 215–249, Jun. 2014.
[6] A. Aldhaheri and J. Lee, ‘‘Event detection on large social media using temporal analysis,’’ in Proc. IEEE 7th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2017, pp. 1–6.
[7] Y. Gao, S. Zhao, Y. Yang, and T.-S. Chua, ‘‘Multimedia social event detection in microblog,’’ in MultiMedia Modeling. Cham, Switzerland: Springer, 2015, pp. 269–281.
[8] M. Ruocco and H. Ramampiaro, ‘‘A scalable algorithm for extraction and clustering of event-related pictures,’’ Multimedia Tools Appl., vol. 70, no. 1, pp. 55–88, May 2014.
[9] Flickr Reaches 6 Billion Photo Uploads. Accessed: Jul. 18, 2019. [Online]. Available: https://latimesblogs.latimes.com/technology/2011/08/flickr-reaches-6-billion-photos-uploaded.html
[10] (2019). 58 Incredible and Interesting Twitter Stats and Statistics. Accessed: Jul. 15, 2019. [Online]. Available: https://www.brandwatch.com/blog/twitter-stats-and-statistics/
[11] G. Miner, J. Elder, T. Hill, R. Nisbet, D. Delen, and A. Fast, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, 1st ed. Orlando, FL, USA: Academic, 2012.
[12] S. Dodge and L. Karam, ‘‘Understanding how image quality affects deep neural networks,’’ in Proc. 8th Int. Conf. Qual. Multimedia Exper. (QoMEX), Jun. 2016, pp. 1–6.
[13] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, ‘‘Deep learning for IoT big data and streaming analytics: A survey,’’ IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, 4th Quart., 2018.
[14] L.-L. Shi, L. Liu, Y. Wu, L. Jiang, and J. Hardy, ‘‘Event detection and user interest discovering in social media data streams,’’ IEEE Access, vol. 5, pp. 20953–20964, 2017.
[15] W. Zhong. (Jul. 2, 2016). Novelty Detection. Accessed: Jul. 15, 2019. [Online]. Available: https://github.com/CrowdTruth/Novelty_Detection
[16] N. Ghelani, S. Mohammed, S. Wang, and J. Lin, ‘‘Event detection on curated tweet streams,’’ in Proc. 40th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., New York, NY, USA, Aug. 2017, pp. 1325–1328.
[17] C. C. Aggarwal and K. Subbian, ‘‘Event detection in social streams,’’ in Proc. SIAM Int. Conf. Data Mining, 2012, pp. 624–635.
[18] E. Schinas, S. Papadopoulos, Y. Kompatsiaris, and P. A. Mitkas, ‘‘Event detection and retrieval on social media,’’ 2018, arXiv:1807.03675. [Online]. Available: https://arxiv.org/abs/1807.03675
[19] M. A. Masud, J. Z. Huang, M. Zhong, and X. Fu, ‘‘Cluster survival model of concept drift in load profile data,’’ IEEE Access, vol. 6, pp. 51269–51285, 2018.
[20] S. Hendrickson, J. Montague, J. Kolb, and B. Lehman. (2019). Trend Detection in Social Data. Accessed: Jul. 15, 2019. [Online]. Available: https://www.developer.twitter.com
[21] S. Khater, D. Gračanin, and H. G. Elmongui, ‘‘Personalized recommendation for online social networks information: Personal preferences and location-based community trends,’’ IEEE Trans. Comput. Social Syst., vol. 4, no. 3, pp. 104–120, Sep. 2017.
[22] L. Hong, O. Dan, and B. D. Davison, ‘‘Predicting popular messages in Twitter,’’ in Proc. 20th Int. Conf. Companion World Wide Web, New York, NY, USA, Apr. 2011, pp. 57–58.
[23] S. M. Alqhtani, S. Luo, and B. Regan, ‘‘Fusing text and image for event detection in Twitter,’’ Int. J. Multimedia Appl., vol. 7, pp. 27–35, Feb. 2015.
[24] X. Zhang, Z. Li, X. Lv, and X. Chen, ‘‘Integrating multiple types of features for event identification in social images,’’ Multimedia Tools Appl., vol. 75, no. 6, pp. 3301–3322, Mar. 2016.
[25] T. Kaneko and K. Yanai, ‘‘Event photo mining from Twitter using keyword bursts and image clustering,’’ Neurocomputing, vol. 172, pp. 143–158, Jan. 2016.
[26] M. Mei, X. Guo, B. C. Williams, S. Doboli, J. B. Kenworthy, P. B. Paulus, and A. A. Minai, ‘‘Using semantic clustering and autoencoders for detecting novelty in corpora of short texts,’’ in Proc. Int. Joint Conf. Neural Netw., Jul. 2018, pp. 1–8.
[27] E. Langford. Quartiles in Elementary Statistics. Accessed: Jul. 15, 2019. [Online]. Available: http://jse.amstat.org/v14n3/langford.html
[28] (2011). Twitter First With News of Osama Bin Laden’s Death Via Ex-Bush Staffer in the Guardian. Accessed: Jul. 15, 2019. [Online]. Available: https://www.theguardian.com/technology/blog/2011/may/02/twitter-osama-bin-laden-death-leaked
[29] K. Hodge. (2010). 10 News Stories That Broke on Twitter First. Accessed: Jul. 15, 2019. [Online]. Available: https://www.techradar.com/news/world-of-tech/internet/10-news-stories-that-broke-on-twitter-first-719532
[30] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, ‘‘Mask R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[31] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[32] Y. Bengio, A. Courville, and P. Vincent, ‘‘Representation learning: A review and new perspectives,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[33] V. Badrinarayanan, A. Kendall, and R. Cipolla, ‘‘SegNet: A deep convolutional encoder-decoder architecture for image segmentation,’’ 2015, arXiv:1511.00561. [Online]. Available: https://arxiv.org/abs/1511.00561
[34] A. Severyn and A. Moschitti, ‘‘Learning to rank short text pairs with convolutional deep neural networks,’’ in Proc. 38th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Aug. 2015, pp. 373–382.

[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of word representations in vector space,’’ 2013, arXiv:1301.3781. [Online]. Available: https://arxiv.org/abs/1301.3781
[36] N. Audebert, B. Le Saux, and S. Lefèvrey, ‘‘Fusion of heterogeneous data in convolutional networks for urban semantic labeling,’’ in Proc. Joint Urban Remote Sens. Event (JURSE), Mar. 2017, pp. 1–4.
[37] S. Sun, L. Yang, W. Liu, and R. Li, ‘‘Feature fusion through multitask CNN for large-scale remote sensing image segmentation,’’ in Proc. 10th IAPR Workshop Pattern Recognit. Remote Sens. (PRRS), Aug. 2018, pp. 1–4.
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in Proc. ECCV, 2014, pp. 740–755.
[39] C. C. Aggarwal and S. Sathe, ‘‘Theoretical foundations and algorithms for outlier ensembles,’’ ACM SIGKDD Explor. Newslett., vol. 17, no. 1, pp. 24–47, Jun. 2015.
[40] F. Angiulli and C. Pizzuti, ‘‘Fast outlier detection in high dimensional spaces,’’ in Proc. 6th Eur. Conf. Princ. Data Mining Knowl. Discovery, 2002, pp. 15–26.
[41] M. Goldstein and A. Dengel, ‘‘Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm,’’ in Proc. Poster Demo Track 35th German Conf. Artif. Intell., 2012, pp. 59–63.
[42] A. Lazarevic and V. Kumar, ‘‘Feature bagging for outlier detection,’’ in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2005, pp. 157–166.
[43] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, ‘‘Outlier detection with autoencoder ensembles,’’ in Proc. SIAM Int. Conf. Data Mining, Jun. 2017, pp. 90–98.
[44] F. T. Liu, K. M. Ting, and Z.-H. Zhou, ‘‘Isolation-based anomaly detection,’’ ACM Trans. Knowl. Discovery Data, vol. 6, no. 1, p. 3, 2012.
[45] Y. Zhao, Z. Nasrullah, and Z. Li, ‘‘PyOD: A Python toolbox for scalable outlier detection,’’ J. Mach. Learn. Res., vol. 20, no. 96, pp. 1–7, Jan. 2019.
[46] C. Baziotis, N. Pelekis, and C. Doulkeridis, ‘‘DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis,’’ in Proc. 11th Int. Workshop Semantic Eval. (SemEval), Vancouver, BC, Canada, Aug. 2017, pp. 747–754.
[47] Z. Zhang and L. Luo, ‘‘Hate speech detection: A solved problem? The challenging case of long tail on Twitter,’’ Semantic Web, vol. 1, pp. 1–5, 2018.
[48] S. Bansal. (2010). Ideas for Generating Image Features and Measuring Image Quality. Accessed: Jul. 15, 2019. [Online]. Available: https://www.kaggle.com/shivamb/ideas-for-image-features-and-image-quality
[49] C. C. Aggarwal, Outlier Analysis, 2nd ed. Cham, Switzerland: Springer, 2016.
[50] H. Rosa, D. Matos, R. Ribeiro, L. Coheur, and J. P. Carvalho, ‘‘A ‘Deeper’ look at detecting cyberbullying in social networks,’’ in Proc. Int. Joint Conf. Neural Netw., Jul. 2018, pp. 1–8.
[51] (2019). The Top 20 Valuable Facebook Statistics. Accessed: Jul. 15, 2019. [Online]. Available: https://zephoria.com/top-15-valuable-facebook-statistics/

MARTA AMORIM received the degree in information systems, in 2005, and the master's degree in informatics from the Federal University of Espírito Santo, in 2012, where she is currently pursuing the Ph.D. degree in electrical engineering (pattern recognition). She is currently a Professor of basic, technical, and technological education with the Federal Institute of Education of Espírito Santo.

FREDERICO D. BORTOLOTI received the degree in computer science and the master's degree in informatics from the Federal University of Espírito Santo, in 1999 and 2012, respectively, where he is currently pursuing the Ph.D. degree in computer science (machine learning). He is currently an Assistant Professor with the Federal University of Espírito Santo.

PATRICK M. CIARELLI received the degree, master's, and Ph.D. degrees in electrical engineering from the Federal University of Espírito Santo, in 2006, 2008 and 2012, respectively, where he is currently an Assistant Professor. His research interests include electrical engineering, especially on artificial intelligence, information retrieval, pattern recognition, and image processing.

EVANDRO O. T. SALLES received the B.Sc. and master's degrees in electrical engineering from the Federal University of Espírito Santo, Vitoria, Brazil, in 1987 and 1994, respectively, and the Ph.D. degree in electrical engineering from the State University of Campinas, Campinas, Brazil, in 2001. He is currently an Associate Professor with the Department of Electrical Engineering, Federal University of Espírito Santo. His current research interests include digital signal and image processing, pattern recognition, and video processing.

DANIEL C. CAVALIERI received the degree in electrical engineering from the Federal University of Viçosa, in 2006, and the master's and Ph.D. degrees in electrical engineering from the Federal University of Espírito Santo, in 2007, where he is currently a Professor of basic, technical, and technological education. His research interests include electrical engineering, especially on artificial intelligence, deep learning, pattern recognition, signal processing, and image processing.
