Abstract—Document sensitivity classification is essential to prevent potential sensitive data leakage for individuals and organizations. As most existing methods use regular expressions or data fingerprinting to classify sensitive documents, they may not fully exploit the semantics and content of a document, especially with informal messages and files. This motivates the authors to propose a novel method to classify document sensitivity in real time with better semantic and content analysis.

Taking advantage of deep learning in natural language processing, we use our pre-trained Twitter-based document embedding TD2V to encode a document or a text fragment into a fixed-length vector of 300 dimensions. Then we use retrieval and automatic query expansion to retrieve a re-ranked list of semantically similar known documents, and determine the sensitivity score for a new document from those of the retrieved documents in this list. Experimental results show that our method achieves classification accuracy of more than 99.9% for four datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% for the Enron dataset. Furthermore, our method can predict a sensitive document early from a short text fragment with accuracy higher than 98.84%.

Keywords-Sensitive document detection, document embedding, Doc2Vec, automatic query expansion, data leakage prevention

I. INTRODUCTION

Data leakage prevention (DLP [1]) is one of the essential problems in protecting personal and organizational sensitive information from being disclosed without official consent. The number of confidential data leaks is increasing every year [2]. According to the Global Data Leakage Report of the InfoWatch Analytical Center, 1,556 confidential data leaks were registered in 2016, 3.4% more than in 2015 [3]. Notably, more than 3.1 billion personal data records were compromised, up to three times more than in 2015 [3].

With the increasing amount of data created every day, it is not an easy task for people to classify and manage documents based on their sensitivity levels. Currently, most existing DLP methods use regular expressions or data fingerprinting [4]. Thus these methods only analyze the formal representation, i.e., the data format, not the semantic content of a document [5]. This approach is appropriate for known patterns and templates of a classified document, such as contracts or agreements, but may not efficiently help people become aware of potential privacy leaks in informal documents, such as emails or personal notes.

In recent years, natural language processing (NLP) techniques, such as N-grams [4] or Named Entity Recognition [6], have been applied to document classification in DLP. In this paper, we follow this trend to better understand and exploit document content to classify document sensitivity. Inspired by ParagraphVector [7], one of the state-of-the-art methods for document representation, we apply our Twitter-based Doc2Vec model (TD2V [8]) to vectorize an arbitrary document or text fragment. Collecting more than one million tweets (from 2010 to 2017) on Twitter, we train TD2V from 422,351 English articles with 297,298,525 tokenized words [8]. Thus, our word embedding model is expected to be general and efficient enough to represent English documents and text fragments in various domains. Then, we propose a novel method to classify the sensitivity level of a document d using retrieval and automatic query expansion (AQE [9]). From the initial rank list containing the k nearest neighbors of d, we use Modified Distance [8] to re-rank documents in a labeled sensitivity corpus S. The sensitivity label for d is determined by a majority voting scheme over the top l documents in this re-ranked list.

Following the work by Hart et al. [10], we gather four different datasets, namely Dyncorp, Transcendental Meditation (TM), Mormon, and Enron. We also create a fifth dataset, Snowden. Our experiment on full document classification shows that our proposed method achieves accuracy of more than 99.9% for four datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% for the Enron dataset. Besides, we also conduct sensitivity classification for short text segments (512 bytes, 1 KB, 2 KB, and 4 KB), and our method achieves accuracy of more than 99.7% for four datasets (Snowden, Mormon, Dyncorp, TM) and 98.84% for one dataset (Enron).

The main contributions of our work are as follows.
• we propose a novel method to classify the sensitivity of a document or a text fragment with two phases: text fragment vectorization with our pre-trained document
1612869@student.hcmus.edu.vn
{thtrieu, npnguyen}@apcs.vn
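The retrieve-then-vote scheme above (embed the query, retrieve the k nearest labeled documents, vote over the top l) can be sketched as follows. This is a minimal illustration only: the corpus vectors and labels are hypothetical, plain cosine similarity stands in for the paper's Modified Distance, and the AQE re-ranking step is omitted.

```python
import numpy as np
from collections import Counter

def classify_sensitivity(query_vec, corpus_vecs, labels, k=10, top_l=5):
    """k-NN retrieval over document vectors, then majority vote on the top-l."""
    # Cosine similarity between the query and every labeled document.
    corpus = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = corpus @ q
    # Initial rank list: the k nearest neighbors of the query.
    ranked = np.argsort(-sims)[:k]
    # Majority vote over the labels of the top-l retrieved documents.
    votes = Counter(labels[i] for i in ranked[:top_l])
    return votes.most_common(1)[0][0]
```

In the paper's pipeline the query vector would come from TD2V inference and the similarity would be the Modified Distance after AQE re-ranking; the voting logic is unchanged.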
Abstract – In this paper, we propose smartcam, a new surveillance solution based on an object-detection technique to achieve richer descriptiveness, higher storage efficiency, and a true anti-theft feature, while keeping the bill of materials low and retaining flexibility for the user. Early calculations show that, in the best case, an event recorded by our solution uses only 0.0279% of the storage space normally used by traditional solutions. An estimation shows that over 5,100,000 events, recorded continuously over roughly 3 years, can fit in a storage space of 8 Gigabytes with our solution.
Keywords – Object detection, surveillance, security
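As a sanity check, the per-event footprint implied by the figures in the abstract can be worked out directly. This assumes 8 Gigabytes means 8 × 2^30 bytes; the abstract does not state the unit convention.

```python
GIB = 2 ** 30
budget_bytes = 8 * GIB            # claimed total storage space
events = 5_100_000                # claimed number of events in roughly 3 years
bytes_per_event = budget_bytes / events
print(round(bytes_per_event))     # ≈ 1684 bytes: a short text record, not video

# The 0.0279% figure then implies a traditional recording of the same event
# would occupy bytes_per_event / 0.000279 bytes:
traditional_mb = bytes_per_event / 0.000279 / 1e6
print(round(traditional_mb, 1))   # ≈ 6.0 MB, i.e. a brief video clip
```

Under these assumptions the two claims are mutually consistent: a roughly 1.7 KB text description replacing a roughly 6 MB clip is a reduction to about 0.028% of the original size.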
Featured Application: Our work can be applied as an IoT system to capture important events in daily life for later storage. From wearable devices with cameras, such as smart glasses, photos of events can be periodically taken and processed into descriptions in text format. The descriptions are then stored in a database on a server and can be retrieved via another smart device, such as a smartphone. This lets users easily retrieve the information they want for sharing or reminiscence. The descriptions of photos taken each day can also be gathered as a diary. Furthermore, the database is also a huge resource for analyzing user behavior.
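The capture, describe, store, and retrieve flow described above can be sketched as a toy server-side store. The class and method names here are illustrative, not part of the actual system, and a real captioning model would produce the descriptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    timestamp: str
    description: str

class LifelogStore:
    """Toy event store standing in for the server-side database."""
    def __init__(self):
        self._events = []

    def add(self, description, timestamp=None):
        # In the real system the description would come from the captioner.
        ts = timestamp or datetime.now().isoformat(timespec="seconds")
        self._events.append(Event(ts, description))

    def search(self, keyword):
        # Retrieval from a smart device reduces to a text query here.
        return [e for e in self._events if keyword.lower() in e.description.lower()]

    def diary(self, day):
        # Descriptions taken on one day, gathered as a diary.
        return [e.description for e in self._events if e.timestamp.startswith(day)]
```

A smartphone client would call `search` for sharing or reminiscence, while `diary` collects one day's descriptions.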
Abstract: During a lifetime, a person can have many wonderful and memorable moments that he/she wants to keep. With the development of technology, people can now store a massive amount of lifelog information via images, videos or texts. Inspired by this, we develop a system to automatically generate captions for lifelog pictures taken from wearable cameras. Following up on our previous method introduced at the SoICT 2018 conference, we propose two improvements to our captioning method. We trained and tested the model on the baseline MSCOCO dataset and evaluated it on different metrics. The results show better performance compared to our previous model and to some other image captioning methods. Our system also shows effectiveness in retrieving relevant data from captions and achieves a high rank in the ImageCLEF 2018 retrieval challenge.
1. Introduction
People usually want to keep footage of the events that happen around them for many purposes, such as reminiscence [1], retrieval [2] or verification [3]. However, it is not always convenient for them to record those events because they do not have the time or the tools at that moment. People can also miss some events because they do not consider those events important or worth keeping until later. With the development of technology, especially IoT systems, smart environments such as smart homes and smart offices can be established to give people easy access to ubiquitous services. In a
Wenhui Li1† , Anan Liu 1† , Weizhi Nie 1† , Dan Song1† , Yuqian Li1† , Weijie Wang1† , Shu Xiang1† , Heyu Zhou1†
Ngoc-Minh Bui2 , Yunchi Cen3 , Zenian Chen3 , Huy-Hoang Chung-Nguyen2 , Gia-Han Diep2 , Trong-Le Do2 , Eugeni L. Doubrovski4 ,
Anh-Duc Duong5 , Jo M.P. Geraedts4 , Haobin Guo6 , Trung-Hieu Hoang2 , Yichen Li7 , Xing Liu9 , Zishun Liu4 , Duc-Tuan Luu2 , Yunsheng
Ma10 , Vinh-Tiep Nguyen5 , Jie Nie11 , Tongwei Ren6 , Mai-Khiem Tran2 , Son-Thanh Tran-Nguyen2 , Minh-Triet Tran2 , The-Anh Vu-Le2 ,
Charlie C.L. Wang8 , Shijie Wang9 , Gangshan Wu6 , Caifei Yang9 , Meng Yuan11 , Hao Zhai7 , Ao Zhang6 , Fan Zhang3 , Sicheng Zhao10
1 School of Electrical and Information Engineering, Tianjin University, China.
2 University of Science, VNU-HCM, Vietnam.
3 State Key Laboratory of Virtual Reality Technology and System, Beihang University, China.
4 Delft University of Technology, Netherlands.
5 University of Information Technology, VNU-HCM, Vietnam.
6 Nanjing University, China.
7 SuoAo Technology Center, SAEE, University of Science and Technology Beijing, China.
8 Chinese University of Hong Kong, China.
9 School of Software, Dalian University of Technology, China.
10 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA.
11 Ocean University of China, China.
Abstract
Monocular image based 3D object retrieval is a novel and challenging research topic in the field of 3D object retrieval. Given an RGB image captured in the real world, the task aims to search for relevant 3D objects in a dataset. To advance this promising research direction, we organize this SHREC track and build the first monocular image based 3D object retrieval benchmark by collecting 2D images from ImageNet and 3D objects from popular 3D datasets such as NTU, PSB, ModelNet40 and ShapeNet. The benchmark contains 21,000 classified 2D images and 7,690 3D objects in 21 categories. The track attracted 9 groups from 4 countries and the submission of 20 runs. For a comprehensive comparison, 7 commonly used retrieval performance metrics have been used to evaluate the submitted runs. We hope that this publicly available benchmark, the comparative evaluation results, and the corresponding evaluation code will further enrich and boost research on monocular image based 3D object retrieval and its applications.
Categories and Subject Descriptors (according to ACM CCS): H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
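To make the evaluation concrete: SHREC tracks commonly report metrics such as nearest neighbor, first and second tier, E-measure, DCG, and mean average precision. Average precision for one query's ranked list can be sketched as follows; this is a generic illustration, not the track's evaluation code.

```python
def average_precision(ranked_labels, query_label):
    """AP for one query: ranked_labels are the categories of the retrieved objects,
    best first; an object is relevant if its category matches the query's."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant rank
    return precision_sum / hits if hits else 0.0
```

Mean average precision for a run is then the mean of this value over all query images.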