ABSTRACT This work aims to propose an approach for detecting novelties, taking into account the temporal
flow of data streams in social media. To this end, we present a completely new architecture for novelty
detection. This new architecture entails three new contributions. First, we propose a new concept for novelty
definition based on temporal windows. Second, we formulate an expression to determine the quality of a
novelty. Third, we introduce a new approach to the fusion of heterogeneous data (image + text), using
the COCO dataset and the MASK-RCNN convolutional neural network, which transforms image and text
from social media into a single data format ready to be identified by machine learning algorithms. Because
novelty detection is a task in which labeled samples are scarce or nonexistent, unsupervised algorithms are
used; the following baseline and state-of-the-art algorithms were chosen: kNN, HBOS,
FBagging, IForesting, and autoencoders. The new fusion approach is also compared to a state-of-the-art
approach to outlier detection named AOM. Because of temporal particularities and the data types being
fused, a new dataset was created, containing 27,494 tweets collected from Twitter. Our experiments show
that data classification of social media using data fusion is superior to using only text or only images as input
data.
INDEX TERMS Detection algorithms, knowledge based systems, learning systems, machine learning,
neural networks, classification algorithms, streaming media.
I. INTRODUCTION
Novelty detection is also generally referred to in the literature as event detection [1]. In fact, event detection is a general term used to refer to a variety of detection types, including that of outliers and anomalies, among others. Additionally, many other synonymous terms, such as anomaly or outlier detection, are often used to refer to novelty detection. However, there is a difference between meanings. The Merriam-Webster dictionary [2] defines novelty as an event that is ‘‘new or different from anything familiar.’’ Barnett and Lewis [3] define an outlier as an event that is inconsistent with a training dataset. Therefore, outlier detection aims to handle strictly abnormal observations in datasets [4]. In novelty detection, if a novelty is inconsistent with a dataset, the novelty is included in the dataset, whereas an outlier is usually discarded. Nevertheless, both definitions share the same drawback of the scarcity of data labeled as abnormal or novelty and available for training. For this reason, novelty detection is often performed using unsupervised or one-class methods. In cases where data labeled as novelties are available, novelty detection should be approached as a class imbalance problem.
Samples labeled as novelties are expensive to obtain or are naturally rare [5]. Some examples of this problem may occur in the following contexts: a particular defect of a machine, cyber attacks attempting to destroy a system, or news of significant prominence, among others. These systems may operate normally for an extended period of time. Data recorded during normal operation are usually accessible and easy to measure. On the other hand, observing failures or rare events depends on the occurrence of all possible scenarios of improper operation, which is not always possible. This may even be more difficult if significantly varied events may occur.
In the case of social media news, thousands of news articles are generated daily on the Internet, covering a significant variety of subjects [6]. Automatic identification (or detection)
The associate editor coordinating the review of this manuscript and approving it for publication was Shirui Pan.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
132786 VOLUME 7, 2019
M. Amorim et al.: Novelty Detection in Social Media by Fusing Text and Image Into a Single Structure
of news items that are novel and of interest to large audiences, such as a natural disaster occurring in a particular location or the award of a Nobel Prize, poses a singular challenge to a priori prediction. Consequently, trained models are most likely to present a high bias while maintaining low variance [5].
In addition to the scarcity of samples available for training, another issue that has emerged in the novelty detection field regards handling heterogeneous data, as heterogeneity is an inherent characteristic of social media. The news posted on social networks can contain not only text but also images, videos, temporal information, social connections, user preferences, and other metadata. Gao et al. [7] indicate that 30% of blogs post content with images, and this number continues to increase. Therefore, visual content becomes significant in novelty detection analysis. The proliferation of applications that share photos on the web, such as Flickr, Instagram, Pinterest, and Vero, results in a large number of photos being available publicly. Flickr presently hosts more than 6 billion photos, of which 100 million include geolocation data [8]. Facebook is currently the most popular social network in photo sharing, with over 100 million photos shared daily [9]. Another social network that is known for publishing short posts (tweets) is Twitter, with a total of 1.3 billion accounts. There are 500 million tweets sent every day, or approximately 6,000 tweets every second. According to [10], tweets with images result in 18% larger click-through rates, 89% more likes, and 150% more retweets.
The literature discusses several problems encountered when classifying images from social media, including low quality, ambiguous meaning, and noise, among others. As a consequence, most of the current novelty detection techniques for web content deal only with textual content and social connections, whereas visual content is ignored. However, texts posted on social networks have many elements, including paraphrases, metaphors, high expressiveness, and so forth, that make classification a difficult task [11], [12].
The field of computer vision has recently advanced significantly due to deep learning techniques, more specifically, convolutional neural networks. Recent advances in computer vision enable extraction of information from images with excellent precision. Examples include identifying different objects in a scene, including people, animals, vehicles, and so forth. This kind of information may be used as a complement to the information present in the text to develop better classifiers of social network posts. In this context, we hypothesize that the fusion of text and images will add more relevant information to novelty detection in cases where the isolated structures (text or images) may be deficient.
Another peculiarity is that social media generates data in a data stream format. The quantity of data produced in social media is not only enormous but continues to grow, and data are regularly updated [13], [14]. Due to the considerable quantity of data and the specific goal of most applications, most papers attempt to identify novelties within a specific context, sometimes filtering data by keywords, as done in [15]–[18].
To generalize the novelty concept to a context-free one, we redefine novelty as a unique news item that stands out relative to the present and the past, considering a time interval that spans predefined temporal windows. Moreover, as we look at a temporal window in the past, we apply a memory sense. For example, if the temporal window size is 4 hours and the news item is ‘‘American dollar currency quotation’’, how would the novelty detector behave for this short temporal window? If we change the temporal window from 4 hours to 10 days, would the detector identify the same news as a novelty? The hypothesis is that a change in the size of temporal windows changes novelty labels and, therefore, implies a change in the underlying distribution of the problem, which is known in the literature as concept drift [19].
User interactions in social media tell us a great deal about the real world, and these insights are not limited to any particular temporal aspect [20]. There are many ways to identify events of interest to social media users. Perhaps the most straightforward way is based on the content of the tweets [21]. This means that the identification would be tied to a context, ignoring events of importance to other contexts. Due to the peculiar nature of novelties, they can be of great interest to users. Popularity also indicates the degree of interest in the event. As [22] pointed out, the popularity of tweets is affected by multiple factors aside from newsworthiness. Therefore, in this study we seek to identify a novelty together with its degree of repercussion, which defines the quality of the novelty.
For our study, we break down novelty detection in social media into three aspects: heterogeneity, volatility, and audience. The new data fusion approach tackles the data heterogeneity aspect. Temporal windows address the data volatility aspect. The quality of a novelty explains the data audience aspect. In this paper, we propose two new novelty detection architectures for heterogeneous social media data. The first applies text and image fusion techniques to the input of unsupervised algorithms, while the second applies fusion techniques to the output of the unsupervised algorithms. We perform experiments on a dataset obtained from Twitter.
The contributions of this work are as follows: (i) a new dataset containing text and images from Twitter that are used for novelty detection, (ii) a new proposal for fusing heterogeneous data that improves the ROC AUC metric (receiver operating characteristic area under the curve) relative to the other techniques applied in this work, and (iii) a proposal for novelty detection considering temporal window and audience.
This paper is organized as follows. In Section II, we present novelty detection-related studies. The introduction of proposed methods is divided into two subsections of Section III: III-A and III-B. In Section III-A, we describe the procedures for labeling a dataset for training purposes and other concepts pertaining to the novelty detection task. In Section III-B, we explain the pipelines of the proposed architectures: data encoding, fusion, and classification. Section IV outlines the experimental setup, the dataset and
results. Discussion is presented in Section V. Finally, the general conclusions are stated in Section VI.

II. RELATED WORK
Related studies in the literature use techniques of detecting novelty only from unstructured data (raw text, images, user data, and videos, among other metadata) from social media.
Alqhtani et al. [23] developed a method of detecting novelty on Twitter using unstructured data. The goal is to evaluate whether optimal event detection results from using only text, or only image, or a combination of both. The analysis starts with text-based event detection. Performing detection over text entails the application of several preprocessing techniques, such as filtering characters, tokenization, removal of stopwords, and stemming. After the text has been preprocessed, the TF-IDF (term frequency–inverse document frequency) technique is used to generate a high-dimensional vector that represents the text. Based on the weight assigned by TF-IDF, a threshold is used to determine which tweets are novelties. If a tweet is deemed a non-novelty by the text analysis technique, then the image-based event detection method is used. For image-based event detection, visual characteristics such as HOG (histogram of oriented gradients), GLCM (gray-level co-occurrence matrix), and color histogram are used to generate the vector representing the image. Finally, the researchers use a supervised classifier to classify images into novelties and non-novelties. They analyzed one million tweets that contained text and photos posted about Indonesia Air Asia Flight 8501 that were collected from the Twitter stream on 28 December 2014. The result of the experiment showed that the proposed method reached an accuracy of 89% with text only, 86% with image only and 94% with text and image.
Zhang et al. [24] proposed a new probabilistic model that identifies an event by computing the correlation of several data types, such as visual and textual content, user profile, and temporal information. The textual content includes tags, titles, descriptions, and comments about images. The textual vector of text $t$ is denoted by $x_t = \langle x_{t1}, x_{t2}, \ldots, x_{ti} \rangle$, where $x_{ti}$ is a value based on the frequency of the $i$-th element. The visual content is represented in the form of visual words. To this end, each image is evenly divided into 16 × 16 pieces. The visual features are extracted by the SIFT (scale invariant feature transform) algorithm and through color histograms of 32 and 128 dimensions. Afterwards, a k-means algorithm groups and converts features into visual words. Thus, the visual content from image $v$ can be represented by vector $x_v = \langle x_{v1}, x_{v2}, \ldots, x_{vi} \rangle$, where each element is associated with a visual word. The user features are related to the image, that is, users who uploaded, shared or marked the image as favorite. Thus, user features obtained for user $u$ are denoted by vector $x_u = \langle x_{u1}, x_{u2}, \ldots, x_{ui} \rangle$. Each event $C_i$ can be considered as a ‘‘virtual object’’ that consists of the union of features $C_i = \langle \cup x_{tj}, \cup x_{vj}, \cup x_{uj} \rangle$ $(j = 1, 2, \ldots, |C_i|)$. The correlation of features is structured in an undirected graph model. The features are the vertices, and the edges are the relations between the features. A probabilistic Markov model connects the features with the event. The experiments analyze two million images extracted from Flickr.
Kaneko and Yanai [25] presented a system for discovering events related to Twitter photos. Selected photos must contain a geographical location (latitude/longitude). The method consists of two main steps: detection of keywords based on textual analysis and a clustering algorithm. First, the textual analysis identifies the words in the text. Afterwards, visual features are extracted from the photos using BoF (bag-of-features), SURF (speeded up robust features) and 64-color RGB (red-green-blue) color histograms. Photos are grouped by the corresponding keywords, using Ward’s clustering algorithm, which is a hierarchical clustering method. The most representative photo of each group is selected by the VisualRank algorithm. The maps of the United States and Japan are used to geolocate events. In the experiments, 17 million geolocated tweets posted in the United States and 3 million tweets posted in Japan are analyzed.
Mei et al. [26] proposed a new technique for joining heterogeneous data (images, text, and users) from social networks in the process of event summarization. First, the SIFT algorithm extracts the image features into a high-dimensional vector. The quantization hierarchy algorithm transforms the high-dimensional vector into a visual vocabulary. In the second step, a high-dimensional vector is computed by applying the bag-of-words approach to the text. The similarities between text vectors and between image vectors are calculated based on the cosine distance. Finally, a correlation matrix between images and texts is calculated using the similarities, computing a relevance score between text and image. The experiments use a dataset extracted from the Weibo social network containing 2,776,623 texts and 4,432,514 images.
Reflecting upon all related studies, several conclusions can be drawn: (i) most approaches use sparse and high-dimensional vectors to represent unstructured data; (ii) most approaches do not make it clear how they obtain the ground truth for a dataset; (iii) some approaches detect novelty from context; (iv) unstructured data are combined rather than fused, whereby combining data entails using both types of structured data to perform novelty detection, while fusion does not; and (v) approaches only handle detection with no qualification of novelty based on the importance of the event to social media users. With so many research challenges, we seek to answer the following questions:
• Which state-of-the-art approach is most effective in detecting novelties in data streams?
• Does fusion of data (image+text) improve novelty detection?
• How does conceptualizing novelty as a function of time and audience facilitate novelty detection?

III. METHODS
This section is divided into two parts: data preparation and novelty detection. The former describes the procedures for
labeling the dataset to be used in the process of training the architectures. The latter describes the pipeline of the proposed architectures: data encoding, fusion, and classification.

A. DATA PREPARATION
In this section, we define our novelty concept, novelty quality and a heuristic responsible for generating the ground truth for the dataset.

1) NOVELTY DEFINITION
In this paper, we define a novelty as a news item that stands out from other news over present and past time windows. The window length can be freely set in any unit of time, including hours, days, months, etc. Equation (1) presents our mathematical model for classifying an event as a novelty or a non-novelty. We redefine an event as a tweet.
\[ f(w_{pr}, w_{pa}, t_{ac}) = \begin{cases} \text{Novelty} & \text{if } (t_{ac} \notin w_{pr}) \wedge (t_{ac} \notin w_{pa}) \\ \text{NonNovelty} & \text{otherwise} \end{cases} \tag{1} \]
Equation (1) is a function with three input variables: $t_{ac}$, $w_{pr}$ and $w_{pa}$. Variable $t_{ac}$ refers to a new incoming event within the present temporal window. Variable $w_{pr}$ refers to the present temporal window. Variable $w_{pa}$ refers to the past temporal window. Temporal windows can contain any time frame, such as days, weeks, or months. For an event to be a novelty, a new incoming tweet $t_{ac}$ from a data stream should belong to neither the present window $w_{pr}$ nor the past window $w_{pa}$. Otherwise, the event will be a non-novelty.
An event does not belong to a temporal window if it has no similarity to any event of the present and past temporal windows. In semiautomatic labeling heuristics (presented in Algorithm 1), we calculate the cosine similarity between the events within a temporal window, and a threshold is applied to determine whether the similarity of an event indicates a novelty or non-novelty (operator $\notin$ of Equation (1)); more details are available in Section III-A.3.
In the architecture for detecting novelty proposed in this work, we use unsupervised algorithms to detect novelty. In the training phase, the algorithms learn score thresholds that will reveal a novelty label (operator $\notin$ of Equation (1)). Therefore, the proposed architecture and the semiautomatic labeling heuristics use this novelty concept described by Equation (1).

2) NOVELTY QUALITY
Equation (2) characterizes the novelty quality of one event in a temporal window, as measured by its audience score on a scale from 1 to 4. The choice of range (1-4) is due to the use of the quartile measure. An event with quality equal to 4 indicates significant repercussions, whereas an event with quality equal to 1 represents a weak reaction in social media. Variable $st$ is defined by the similarity between events, that is, we calculate the cosine similarity between events within a temporal window. Variable $rt$ is measured by the number of shares (e.g., ‘‘retweets’’ in the case of Twitter). Variable $fc$ corresponds to the number of interactions (e.g., ‘‘likes’’ in the case of Twitter). Variable $n$ is the number of tweets posted within the present temporal window. The norm function performs the normalization of $rt$ and $fc$ values to the range [0, 1].
\[ qu = \mathrm{Quartile}\left( \frac{1}{n} \sum_{i=0}^{n-1} st_i + \mathrm{norm}(rt) + \mathrm{norm}(fc) \right) \tag{2} \]
where,
\[ \mathrm{norm}(x) = \frac{x - \min(x)}{\max(x) - \min(x)}. \tag{3} \]
The Quartile function splits all the tweets within the same temporal window into four groups, each containing 25% of tweets. The first group contains tweets with the lowest audience, while the fourth group comprises tweets with the highest audience among all groups. The novelty quality value varies from 1 (for the smallest audience) to 4 (for the largest audience).
We chose the Quartile function because it is essentially a location specifier. We want to know where the most important events are located within the distribution of all detected novelties and non-novelties. However, the quartile function is inappropriate if there are many replicas of some distinct data values [27]. The dataset proposed in this work has few repetitions of values, so the quartile function is suitable for our objectives. Furthermore, quartiles are especially useful if data are not symmetrically distributed, or a dataset contains outliers.
The novelty quality levels aid our strategy by
• Eliminating noisy novelty items (memes, daily news, and so forth), highlighting novelties that are rare for journalistic purposes, and pointing out suspicious actions in social media to detect crimes, among others.
• Identifying news from alternative sources. A novelty will not always originate from major news media (Reuters and BBC, among others). We live in the age of user-generated content that has become so popular that some news items usually appear first in other media before being noticed by the newswires. More prominent examples include Osama bin Laden’s death [28] and Michael Jackson’s death [29] notifications posted an hour before announcements in any other news media.

3) SEMIAUTOMATIC LABELING HEURISTICS
The ground truth labels and the novelty quality for the dataset are generated by a heuristic detailed in Algorithm 1. The first step is to receive events from the AllTweets data stream, which are ordered by event creation date dt. The current event received by the algorithm is denoted by $t_{ac}$. Variables dtBegin and dtEnd store the start and end times, respectively, of a temporal window. The start date dtBegin stores the current event date dt. Variable dtEnd stores the start date dtBegin plus a custom time Windays (e.g., 10, 20 or 40 days). The events in the dataset are tweets.
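The window bookkeeping described above and the quality score of Equations (2) and (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Tweet`, `cosine`, `roll_windows` and `novelty_quality` are hypothetical, the bag-of-words cosine similarity is a simple stand-in for the encoding actually used, and the rank-based split into four equal groups is one assumed reading of the Quartile function. The mean similarity of a tweet is also assumed to include its similarity to itself.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Tweet:                      # hypothetical record type
    text: str
    created: datetime
    retweets: int = 0             # rt in Equation (2)
    likes: int = 0                # fc in Equation (2)


def cosine(a: str, b: str) -> float:
    """Cosine similarity of word counts (stand-in for the real encoding)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def roll_windows(stream, win_days):
    """Yield (tweet, present, past), updating windows as the text describes.

    `stream` is assumed to be sorted by creation date.
    """
    present, past = [], []
    dt_begin = dt_end = None
    for t in stream:
        if dt_begin is None:                       # first tweet: initialize
            dt_begin = t.created
            dt_end = dt_begin + timedelta(days=win_days)
        elif not (dt_begin <= t.created <= dt_end):
            # Tweet falls outside the window: start a new present window
            # and keep the old one as the past window.
            dt_begin = t.created
            dt_end = dt_begin + timedelta(days=win_days)
            past, present = present, []
        present.append(t)
        yield t, list(present), list(past)


def min_max(values):
    """Equation (3); returns zeros when all values are equal (assumed guard)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def novelty_quality(window):
    """Quality level (1..4) of every tweet in one window, after Equation (2)."""
    n = len(window)
    mean_sim = [sum(cosine(t.text, u.text) for u in window) / n for t in window]
    rt = min_max([t.retweets for t in window])
    fc = min_max([t.likes for t in window])
    raw = [s + r + f for s, r, f in zip(mean_sim, rt, fc)]
    # Rank-based quartile groups: lowest 25% -> 1, highest 25% -> 4.
    order = sorted(range(n), key=lambda i: raw[i])
    levels = [0] * n
    for rank, i in enumerate(order):
        levels[i] = 1 + (4 * rank) // n
    return levels
```

With `win_days=10`, tweets created on days 0, 1 and 12 yield present windows of sizes 1, 2 and 1, and the third tweet sees the first two in its past window. The labels themselves would then come from comparing similarities against a threshold, which this sketch leaves out.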
FIGURE 1. Two architectures proposed for novelty detection. The architecture in (a) fuses text and an image using our new approach, the MASK-RCNN
network, encodes the fusion result, and finally classifies the tweet using an unsupervised algorithm; the architecture in (b) performs text encoding and
image feature extraction separately, uses an unsupervised algorithm for classification and finally performs data fusion with the AOM algorithm.
For each step of the loop, the algorithm checks if the current tweet is outside the temporal window, generates the ground truth using the Similar function (Equation 4) and calculates the novelty quality using the Audience function (Equation 2). Algorithm 1 is considered semiautomatic because a manual verification of ground truth labels is needed. Manual verification employed an inspection methodology. Three people inspected each tweet and evaluated whether the semiautomatic labels were correct.
\[ h(s, qu) = \begin{cases} \text{Novelty} & \text{if } (s < 0.4) \vee (s \geq 0.4 \wedge s < 0.5 \wedge qu = 4) \\ \text{NonNovelty} & \text{otherwise} \end{cases} \tag{4} \]
where $h$ is the label of event $t_{ac}$, $s$ is the mean cosine similarity between $t_{ac}$ and events in the temporal window, and $qu$ is the novelty quality of $t_{ac}$.

Algorithm 1 Pseudocode of the Labeling Process
procedure LabelingTweet
    flag ← 0
    while Not AllTweets is EOF do
        dt ← tac.dtCreated
        if flag == 0 then    ▷ Initializing the variables
            flag ← 1
            dtBegin ← dt
            dtEnd ← dtBegin + Windays
        if Not (dtBegin ≤ dt ≤ dtEnd) then    ▷ current tweet is not inside the temporal window
            dtBegin ← dt
            dtEnd ← dtBegin + Windays
            lstWinpas ← lstWinpre
            lstWinpre ← empty
        lstWinpre ← Add(tac)
        lbl ← Similar(tac, lstWinpre, lstWinpas)
        aud ← Audience(tac, lstWinpre)

B. NOVELTY DETECTION
In this section, we describe the proposed architectures that perform novelty detection in social media. The two architectures, shown in Fig. 1, have three main pipelines: text encoding and image feature extraction, data fusion, and unsupervised algorithms.
First, data will arrive as a data stream; i.e., the architecture will receive it in the chronological order of creation. The first architecture (part (a) of Fig. 1) performs the following steps: (i) fusing the tweet image and text into a single textual structure using the MASK-RCNN network [30], (ii) transforming the textual structure into a vector representation using an autoencoder, and (iii) detecting novelty using an unsupervised algorithm. The unsupervised algorithms receive a vector of text as input. The second architecture (part (b) of Fig. 1) performs the following steps: (i) transforming the tweet image and text into vector representations using autoencoders, (ii) detecting novelty using unsupervised algorithms for the vector of the image and the vector of text, and (iii) fusing the scores of unsupervised algorithms using the AOM fusion method. The details of the proposed methods are presented in the following subsections.

1) ENCODING TEXT AND AN IMAGE
An autoencoder can transform text and an image into a real-valued weight vector, which is a neural network hidden layer representation [31]. A suitable encoder is essential for the performance of a classifier or regressor [32]. Therefore, the encoders used in this work are based on [33] and [34].
There are several approaches to transforming text input to a neural network, including vector mean, one-hot, and so on. The approach chosen for text encoding in this work is known as concatenation. This technique generates a matrix as an
FIGURE 6. First and second principal component (PC) of the hidden autoencoder layer for a 10-day temporal window. The threshold is calculated for each
temporal window.
TABLE 2. Several examples of normalized scores generated by the autoencoder algorithm of Fig. 6 (a). The autoencoder applies the threshold of −0.6762 and predicts the labels Novelty and Non-Novelty.
TABLE 3. Selected values of the kNN parameter.
The feature bagging (FBagging) algorithm uses an algorithm ensemble according to Equation (8). First, it randomly chooses a subset of features Q without replacement. The local outlier factor (LOF) algorithm is applied using the Q features. The output of the algorithm is the scoring vector. LOF is executed n times, the results are combined, and the maximum score is returned. The higher the s value is, the more likely the instance is to be a novelty.
\[ s(Q, n) = \max\left( \sum_{i=1}^{n} \mathrm{LOF}(Q) \right) \tag{8} \]
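As a rough illustration of Equation (8), the sketch below runs a brute-force LOF on random feature subsets and keeps, for every instance, the maximum score over the n runs. It is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the names `lof_scores` and `feature_bagging` are hypothetical, the subset size is drawn between d/2 and d−1 features (an assumption borrowed from the usual description of feature bagging), the neighborhood size k=5 is arbitrary, and the sum sign in Equation (8) is read as collecting the n LOF runs before the maximum is taken.

```python
import math
import random


def lof_scores(points, k):
    """Local outlier factor of every point (brute force, Euclidean distance).

    Assumes no duplicate points, so that densities stay finite.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def local_reach_density(i):
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        mean_reach = sum(reach) / len(reach)
        return 1.0 / mean_reach if mean_reach > 0 else math.inf

    lrd = [local_reach_density(i) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i])
            for i in range(n)]


def feature_bagging(points, n_rounds, k=5, seed=0):
    """Maximum LOF score per instance over n_rounds random feature subsets."""
    rng = random.Random(seed)
    d = len(points[0])
    best = [-math.inf] * len(points)
    for _ in range(n_rounds):
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        q = rng.sample(range(d), size)            # subset Q, no replacement
        projected = [[p[j] for j in q] for p in points]
        scores = lof_scores(projected, k)
        best = [max(b, s) for b, s in zip(best, scores)]
    return best


# Toy data: 30 points in the unit cube plus one point far away from them.
_rng = random.Random(42)
data = [[_rng.uniform(0, 1) for _ in range(4)] for _ in range(30)]
data.append([50.0, 50.0, 50.0, 50.0])
scores = feature_bagging(data, n_rounds=10)
```

In this toy setting the distant point receives the largest score, so it would be the first instance flagged once a threshold is applied.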
TABLE 5. Selected values of the HBOS parameter.
TABLE 7. Selected values of the autoencoder parameter.
algorithm is related to the number of histograms. Parameter nestimators in the FBagging algorithm is related to the number of algorithm runs. Parameter epochs in the autoencoder algorithm is related to the number of iterations performed to complete training.
The choice of parameter ranges for grid search is based on overfitting/underfitting using the ROC AUC metric and the number of novelties per temporal window. Therefore, the larger the temporal window is, the greater the number of instances and the wider the parameter range. For instance, distance-based algorithms have to compute the distance between the tweet instances within the temporal window, so there is a limit on parameters k and nbins.
As each temporal window has different distributions, the tendency is for parameters to behave differently. Therefore, parameters are not fixed for each algorithm execution. A grid search by parameter ranges is run separately in

B. DATASETS
The Twitter API (application programming interface) returns a wide variety of data and metadata. In our work, the most important indicators are tweet ID, user name, image, text, creation date of a tweet, like count and retweet count [20]. By having a large quantity of free data available for consumption, Twitter has become a social network widely used for data analysis. Despite a large number of tweet datasets available on the Internet, the following shortcomings were observed in this research context:
• Most of the databases are presented in a particular context or pertain to a specific theme, that is, tweets are filtered using keywords.
• Our technique is based on temporal windows, and the labels change as a function of a set timespan. Labels are essential for measuring the effectiveness of the technique.
• Most databases contain many more instances of text than of images. To effectively evaluate our method, we require that each tweet must have text and an image. However, our method can be used if there is only text or only an image.
For the experiments, we collected 27,494 tweets (text+image) created in the 2017-2019 period. The tweets were posted by various news sources: BBC, BBC Breaking News, CNN, Reuters, NYTimes, etc. The heuristics (described by Algorithm 1) were used to label all tweets within temporal windows of four different sizes: 1, 5, 10 and 20 days. The peculiar characteristic of our dataset is that each text must have an associated image. Despite the ample period of tweet data collection, the number of posts containing text with images in social media is smaller than the number of posts containing text without an image.
FIGURE 7. Tweet text size distribution.
FIGURE 8. Tweet image width (pixels) distribution.
The database and the source code are freely available for
download.1
The choice of temporal window size should take into account the information volatility (amount of news generated per day). For example, the news regarding ‘‘dollar quotation’’ will be identified as a novelty more times in a 1-day window than in 30-day windows. Smaller windows therefore tend to identify more novelties than larger windows. Information volatility may be linked to the type of news. The economic and political domains generate a higher volume of news per day than natural disasters and Nobel Prize awards. Therefore, smaller windows would be more suitable for the former example. The choice of different temporal window sizes in this work (1, 5, 10 and 20 days) allows evaluating the volatility of information in various contexts.
FIGURE 9. Tweet image height (pixels) distribution.
Figs. 7, 8, 9, 10 and 11 show some statistics of the dataset. Fig. 7 illustrates the number of words versus the number of tweets: the x-axis is the number of words, and the y-axis is the number of tweets. Most tweets have approximately 10 words, and the dictionary contains 19,259 unique tokens (words). The limited number of words per tweet and the large number of unique tokens make the task difficult. We apply a state-of-the-art tweet normalization tool [46] to tokenize and transform each tweet into a sequence of words. This is done to mitigate the noise due to the colloquial nature of the data. The process involves, for example, spelling correction, elon-
FIGURE 10. Distribution of the number of pixels in a tweet image.
TABLE 8. Resulting ROC AUC metric values for the temporal window of 1 day.
TABLE 9. Resulting ROC AUC metric values for the temporal window of 5 days.
TABLE 10. Resulting ROC AUC metric values for the temporal window of 10 days.
TABLE 11. Resulting ROC AUC metric values for the temporal window of 20 days.
TABLE 12. Resulting ROC AUC metric values for the temporal window of 10 days obtained for various proportions.
However, the AOM fusion is observed to be unsatisfactory
in this experiment because, even though there is a small
dataset in each window, the data are heterogeneous, with
much higher noise in images than in texts, which impacts
fusion. Taking advantage of the fact that it is easier to classify
text than an image, the MASK fusion extracts information
from the image to augment the text. Since the MASK fusion
is performed on the pipeline input, we do not observe the
impact of distinct probabilities on the algorithms’ output.
Additionally, the input vector size and the unsupervised
nature of the algorithms impact novelty identification results,
as reported in other studies [39], [49], [50].
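The contrast between the two fusion points can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `fuse_at_input` and `aom` are hypothetical helper names, the tweet tokens and score values are invented, and the random bucket assignment usually used by AOM (Average of Maximum) is replaced by a fixed one so that the example is reproducible.

```python
from statistics import mean


def fuse_at_input(text_tokens, image_labels):
    """Input-level fusion: append the object labels detected in the
    image to the tweet tokens before any detector is applied."""
    return list(text_tokens) + list(image_labels)


def aom(score_rows, n_buckets):
    """Output-level fusion: Average of Maximum over detector buckets.

    score_rows[i][j] is the (already standardized) outlier score that
    detector j assigns to sample i. Detector j is placed in bucket
    j % n_buckets (fixed assignment, an assumption of this sketch).
    """
    n_detectors = len(score_rows[0])
    if not 1 <= n_buckets <= n_detectors:
        raise ValueError("n_buckets must be between 1 and the number of detectors")
    buckets = [range(b, n_detectors, n_buckets) for b in range(n_buckets)]
    return [
        mean(max(row[j] for j in bucket) for bucket in buckets)
        for row in score_rows
    ]


# Input fusion: labels such as those a Mask R-CNN model could return.
fused_tokens = fuse_at_input(["watching", "the", "interview"], ["tv", "person", "tie"])

# Output fusion: two samples scored by four hypothetical detectors.
scores = [
    [0.1, 0.9, 0.2, 0.4],
    [0.5, 0.5, 0.5, 0.5],
]
fused_scores = aom(scores, n_buckets=2)
```

In the first case a single detector sees one enriched token sequence; in the second, every detector contributes a score, so noisy image-based scores can dominate the per-bucket maximum.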
According to statistics, the proportion of texts is larger than
that of images in social media [51]. Therefore, we experiment
by varying the proportion of tweets that include text and an
image as they are fed to the unsupervised algorithm, using the
algorithm that obtained the best result, the autoencoder, as
shown in Tables 8 to 11. For example, 40/60% means that 40% of
the data include text and an image, and 60% include text only
for novelty detection. The tweets in each group (text only, or
text and an image) were selected at random. Table 12 shows the
results for four different proportions ranging from 10% to 40%
of tweets with text and an image. This experiment showed that
if the proportion of images is higher than or equal to 30%,
images add relevant information to texts that improves novelty
classification. Gao et al. [7] indicate that 30% of blogs post
content with images, therefore showing that the proposed method
is relevant. In comparison, the ROC AUC obtained using only
text under the same conditions is 0.7003.
Other results can be analyzed visually in Fig. 12, which
covers the time span between 2017 and 2019. The
x-axis indicates the creation time of each tweet, and the
y-axis shows the scores of the algorithm that obtains the best
ROC AUC (namely, autoencoder with the MASK fusion).
The blue dots are tweets considered novelties, the green dots
are tweets considered non-novelties, and the red dots are
tweets with an audience score of 4. Tweets with an audience
score of 4 are mostly detected as novelties, as shown in
the caption of Fig. 12. A dashed line indicates the algorithm
threshold.
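The proportion experiment described above (for example, 40/60%) can be sketched as a random selection of the tweets that keep their image-derived labels. The function `mix_proportion`, the dictionary layout of a tweet, and the fixed seed are assumptions introduced for this illustration only.

```python
import random


def mix_proportion(tweets, image_fraction, seed=0):
    """Keep image labels for a random `image_fraction` of the tweets;
    the remaining tweets are reduced to text only.

    Each tweet is a dict with "tokens" and "image_labels" keys
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    n_with_image = round(len(tweets) * image_fraction)
    chosen = set(rng.sample(range(len(tweets)), n_with_image))
    mixed = []
    for i, tweet in enumerate(tweets):
        labels = list(tweet["image_labels"]) if i in chosen else []
        mixed.append({"tokens": list(tweet["tokens"]), "image_labels": labels})
    return mixed


# Ten placeholder tweets, each with one image label.
tweets = [{"tokens": ["t%d" % i], "image_labels": ["person"]} for i in range(10)]

# 40/60%: 40% of the tweets keep text and image, 60% keep text only.
mixed_40_60 = mix_proportion(tweets, 0.40)
n_with_image = sum(1 for t in mixed_40_60 if t["image_labels"])
```

The mixed set would then be vectorized and passed to the unsupervised detector in the same way as the fully fused data.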
Novelty qualification allows filtering tweets that stand out
according to their audience. We observe that the
filter helps eliminate “noisy” novelties, such as
tweets with advertising: “strengthening and enriching our
communities is one of the core principles at Bloomberg”.
Other tweets have been highlighted by the audience filter,
for example, record-breaking moves in the stock market: “rt
@business: breaking: the dow hits 20,000 for the first time”.
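A minimal sketch of such an audience filter is given below. It assumes that each tweet carries a detector score and an integer audience score, and that larger audience scores mean a larger audience; the function name `filter_by_audience`, the threshold, and all numeric values are hypothetical and are not taken from the experiments.

```python
def filter_by_audience(tweets, threshold, min_audience):
    """Return the texts of tweets that are flagged as novelties
    (score above `threshold`) and reach at least `min_audience`."""
    return [
        tweet["text"]
        for tweet in tweets
        if tweet["score"] > threshold and tweet["audience"] >= min_audience
    ]


# Placeholder tweets with invented scores (assumption, for illustration).
tweets = [
    {"text": "advertisement tweet", "score": 0.91, "audience": 1},
    {"text": "breaking-news tweet", "score": 0.88, "audience": 4},
    {"text": "ordinary tweet", "score": 0.12, "audience": 4},
]

kept = filter_by_audience(tweets, threshold=0.5, min_audience=4)
```

Here the advertisement is detected as a novelty but discarded because of its low audience, while the ordinary tweet is discarded because its score stays below the threshold.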
TABLE 13. Sample of tweets from the dataset with the labels and audience scores.
does not recognize the word in the hashtag and returns a null
vector for it. The MASK-RCNN network can recognize the
labels “tv”, “person” and “tie” in the scene. Thus, adding
keywords to the tweet increases the assertiveness of the learning
algorithm.
Another point of analysis is that the ROC AUC decreases as
the input vector size increases from 150 to 300 features, except
for the autoencoder algorithm. The literature highlights that
distance-based classification algorithms are at a disadvantage
in analyzing high-dimensional data [49]. Another reason is
that none of the unsupervised algorithms have a powerful fea-
ture selector, extractor or ensemble, except for the FBagging
and autoencoder algorithms.
VI. CONCLUSION
We proposed two architectures for novelty detection in
social media that use information from text and images.
To this end, we presented two methods of fusion: the fusion of
text and an image using a trained MASK-RCNN network and
the output fusion of unsupervised algorithms with AOM. The
method of fusion using MASK-RCNN showed good results,
surpassing those of the AOM fusion.
The text and image fusion methods perform well if the
image proportion is greater than or equal to 30% of the
dataset. Therefore, the MASK-RCNN fusion adds more rel-
evant information to novelty detection in cases where the
isolated structures (text or an image) may be deficient.
The novelty definition involving temporal windows and
novelty quality proved to be a powerful media analysis tool.
As a result, the novelty definition became context indepen-
dent. The novelty quality allowed filtering tweets with a
higher audience, thus eliminating advertisements and other
irrelevant tweets. We also note that the smaller the temporal
window is, the more difficult the classification.
In future research, we intend to compare our proposed
architecture with LSTM (long short-term memory) net-
works and with deep learning networks for generating text
from an image.
REFERENCES
[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection in Twitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.
[2] Merriam-Webster Online Dictionary. Accessed: Jul. 15, 2019. [Online]. Available: http://www.merriam-webster.com
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, vol. 37, 2nd ed. Hoboken, NJ, USA: Wiley, 1994.
[4] J. Xi, “Outlier detection algorithms in data mining,” in Proc. 2nd Int. Symp. Intell. Inf. Technol. Appl., vol. 1, Dec. 2008, pp. 94–97.
[5] M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Process., vol. 99, pp. 215–249, Jun. 2014.
[6] A. Aldhaheri and J. Lee, “Event detection on large social media using temporal analysis,” in Proc. IEEE 7th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2017, pp. 1–6.
[7] Y. Gao, S. Zhao, Y. Yang, and T.-S. Chua, “Multimedia social event detection in microblog,” in MultiMedia Modeling. Cham, Switzerland: Springer, 2015, pp. 269–281.
[8] M. Ruocco and H. Ramampiaro, “A scalable algorithm for extraction and clustering of event-related pictures,” Multimedia Tools Appl., vol. 70, no. 1, pp. 55–88, May 2014.
[9] Flickr Reaches 6 Billion Photo Uploads. Accessed: Jul. 18, 2019. [Online]. Available: https://latimesblogs.latimes.com/technology/2011/08/flickr-reaches-6-billion-photos-uploaded.html
[10] (2019). 58 Incredible and Interesting Twitter Stats and Statistics. Accessed: Jul. 15, 2019. [Online]. Available: https://www.brandwatch.com/blog/twitter-stats-and-statistics/
[11] G. Miner, J. Elder, T. Hill, R. Nisbet, D. Delen, and A. Fast, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, 1st ed. Orlando, FL, USA: Academic, 2012.
[12] S. Dodge and L. Karam, “Understanding how image quality affects deep neural networks,” in Proc. 8th Int. Conf. Qual. Multimedia Exper. (QoMEX), Jun. 2016, pp. 1–6.
[13] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep learning for IoT big data and streaming analytics: A survey,” IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, 4th Quart., 2018.
[14] L.-L. Shi, L. Liu, Y. Wu, L. Jiang, and J. Hardy, “Event detection and user interest discovering in social media data streams,” IEEE Access, vol. 5, pp. 20953–20964, 2017.
[15] W. Zhong. (Jul. 2, 2016). Novelty Detection. Accessed: Jul. 15, 2019. [Online]. Available: https://github.com/CrowdTruth/Novelty_Detection
[16] N. Ghelani, S. Mohammed, S. Wang, and J. Lin, “Event detection on curated tweet streams,” in Proc. 40th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., New York, NY, USA, Aug. 2017, pp. 1325–1328.
[17] C. C. Aggarwal and K. Subbian, “Event detection in social streams,” in Proc. SIAM Int. Conf. Data Mining, 2012, pp. 624–635.
[18] E. Schinas, S. Papadopoulos, Y. Kompatsiaris, and P. A. Mitkas, “Event detection and retrieval on social media,” 2018, arXiv:1807.03675. [Online]. Available: https://arxiv.org/abs/1807.03675
[19] M. A. Masud, J. Z. Huang, M. Zhong, and X. Fu, “Cluster survival model of concept drift in load profile data,” IEEE Access, vol. 6, pp. 51269–51285, 2018.
[20] S. Hendrickson, J. Montague, J. Kolb, and B. Lehman. (2019). Trend Detection in Social Data. Accessed: Jul. 15, 2019. [Online]. Available: https://www.developer.twitter.com
[21] S. Khater, D. Gračanin, and H. G. Elmongui, “Personalized recommendation for online social networks information: Personal preferences and location-based community trends,” IEEE Trans. Comput. Social Syst., vol. 4, no. 3, pp. 104–120, Sep. 2017.
[22] L. Hong, O. Dan, and B. D. Davison, “Predicting popular messages in Twitter,” in Proc. 20th Int. Conf. Companion World Wide Web, New York, NY, USA, Apr. 2011, pp. 57–58.
[23] S. M. Alqhtani, S. Luo, and B. Regan, “Fusing text and image for event detection in Twitter,” Int. J. Multimedia Appl., vol. 7, pp. 27–35, Feb. 2015.
[24] X. Zhang, Z. Li, X. Lv, and X. Chen, “Integrating multiple types of features for event identification in social images,” Multimedia Tools Appl., vol. 75, no. 6, pp. 3301–3322, Mar. 2016.
[25] T. Kaneko and K. Yanai, “Event photo mining from Twitter using keyword bursts and image clustering,” Neurocomputing, vol. 172, pp. 143–158, Jan. 2016.
[26] M. Mei, X. Guo, B. C. Williams, S. Doboli, J. B. Kenworthy, P. B. Paulus, and A. A. Minai, “Using semantic clustering and autoencoders for detecting novelty in corpora of short texts,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2018, pp. 1–8.
[27] E. Langford. Quartiles in Elementary Statistics. Accessed: Jul. 15, 2019. [Online]. Available: http://jse.amstat.org/v14n3/langford.html
[28] (2011). Twitter First With News of Osama Bin Laden’s Death Via Ex-Bush Staffer in the Guardian. Accessed: Jul. 15, 2019. [Online]. Available: https://www.theguardian.com/technology/blog/2011/may/02/twitter-osama-bin-laden-death-leaked
[29] K. Hodge. (2010). 10 News Stories That Broke on Twitter First. Accessed: Jul. 15, 2019. [Online]. Available: https://www.techradar.com/news/world-of-tech/internet/10-news-stories-that-broke-on-twitter-first-719532
[30] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[31] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[32] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[33] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” 2015, arXiv:1511.00561. [Online]. Available: https://arxiv.org/abs/1511.00561
[34] A. Severyn and A. Moschitti, “Learning to rank short text pairs with convolutional deep neural networks,” in Proc. 38th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Aug. 2015, pp. 373–382.
VOLUME 7, 2019 132801
M. Amorim et al.: Novelty Detection in Social Media by Fusing Text and Image Into a Single Structure
[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, arXiv:1301.3781. [Online]. Available: https://arxiv.org/abs/1301.3781
[36] N. Audebert, B. Le Saux, and S. Lefèvre, “Fusion of heterogeneous data in convolutional networks for urban semantic labeling,” in Proc. Joint Urban Remote Sens. Event (JURSE), Mar. 2017, pp. 1–4.
[37] S. Sun, L. Yang, W. Liu, and R. Li, “Feature fusion through multitask CNN for large-scale remote sensing image segmentation,” in Proc. 10th IAPR Workshop Pattern Recognit. Remote Sens. (PRRS), Aug. 2018, pp. 1–4.
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
[39] C. C. Aggarwal and S. Sathe, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explor. Newslett., vol. 17, no. 1, pp. 24–47, Jun. 2015.
[40] F. Angiulli and C. Pizzuti, “Fast outlier detection in high dimensional spaces,” in Proc. 6th Eur. Conf. Princ. Data Mining Knowl. Discovery, 2002, pp. 15–26.
[41] M. Goldstein and A. Dengel, “Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm,” in Proc. Poster Demo Track 35th German Conf. Artif. Intell., 2012, pp. 59–63.
[42] A. Lazarevic and V. Kumar, “Feature bagging for outlier detection,” in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2005, pp. 157–166.
[43] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, “Outlier detection with autoencoder ensembles,” in Proc. SIAM Int. Conf. Data Mining, Jun. 2017, pp. 90–98.
[44] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation-based anomaly detection,” ACM Trans. Knowl. Discovery Data, vol. 6, no. 1, p. 3, 2012.
[45] Y. Zhao, Z. Nasrullah, and Z. Li, “PyOD: A Python toolbox for scalable outlier detection,” J. Mach. Learn. Res., vol. 20, no. 96, pp. 1–7, Jan. 2019.
[46] C. Baziotis, N. Pelekis, and C. Doulkeridis, “Datastories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis,” in Proc. 11th Int. Workshop Semantic Eval. (SemEval), Vancouver, BC, Canada, Aug. 2017, pp. 747–754.
[47] Z. Zhang and L. Luo, “Hate speech detection: A solved problem? The challenging case of long tail on Twitter,” Semantic Web, vol. 1, pp. 1–5, 2018.
[48] S. Bansal. (2010). Ideas for Generating Image Features and Measuring Image Quality. Accessed: Jul. 15, 2019. [Online]. Available: https://www.kaggle.com/shivamb/ideas-for-image-features-and-image-quality
[49] C. C. Aggarwal, Outlier Analysis, 2nd ed. Cham, Switzerland: Springer, 2016.
[50] H. Rosa, D. Matos, R. Ribeiro, L. Coheur, and J. P. Carvalho, “A ‘Deeper’ look at detecting cyberbullying in social networks,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2018, pp. 1–8.
[51] (2019). The Top 20 Valuable Facebook Statistics. Accessed: Jul. 15, 2019. [Online]. Available: https://zephoria.com/top-15-valuable-facebook-statistics/
FREDERICO D. BORTOLOTI received the degree in computer science and the master’s degree in informatics from the Federal University of Espírito Santo, in 1999 and 2012, respectively, where he is currently pursuing the Ph.D. degree in computer science (machine learning). He is currently an Assistant Professor with the Federal University of Espírito Santo.
PATRICK M. CIARELLI received the degree, master’s, and Ph.D. degrees in electrical engineering from the Federal University of Espírito Santo, in 2006, 2008 and 2012, respectively, where he is currently an Assistant Professor. His research interests include electrical engineering, especially on artificial intelligence, information retrieval, pattern recognition, and image processing.
EVANDRO O. T. SALLES received the B.Sc. and master’s degrees in electrical engineering from the Federal University of Espírito Santo, Vitoria, Brazil, in 1987 and 1994, respectively, and the Ph.D. degree in electrical engineering from the State University of Campinas, Campinas, Brazil, in 2001. He is currently an Associate Professor with the Department of Electrical Engineering, Federal University of Espírito Santo. His current research interests include digital signal and image processing, pattern recognition, and video processing.