Article · December 2010


DOI: 10.1109/ICIECS.2010.5677652

The Application of Lucene in Information Leakage
Monitoring and Querying System

Qi Wang, Weiming Wu, Yonghao Gu


School of Computer Science and Technology
Beijing University of Posts and Telecommunications
Beijing, China, 100876
e-mail: hellowangqi77@gmail.com, wuwming@bupt.edu.cn, guyonghao@bupt.edu.cn

Abstract—With its open source nature and its excellent index structure and system architecture,
Lucene has greatly promoted the application of full-text retrieval technology in various
industries and fields. This paper first discusses the system architecture and operational
mechanism of Lucene. Then, according to the requirements of an external network information
leakage monitoring and querying system, an illegal file retrieval subsystem based on Lucene is
designed and implemented, and its system framework and each functional module are described in
detail, providing a comprehensive reference for similar industrial applications.

Keywords—full-text retrieval; index updated regularly; lucene; search engine

978-1-4244-7941-2/10/$26.00 ©2010 IEEE

I. INTRODUCTION
Today, business activities depend heavily on information systems, and the Internet has become
the main channel through which companies communicate with others and exchange business data and
information. Although communicating over the Internet is convenient and efficient, it also makes
it easy for confidential business information to leak to competitors. To prevent heavy losses
caused by such leakage and to ensure the security of systems and information, companies need to
strictly monitor their staff’s online behavior.

The external network information leakage monitoring and querying system is aimed at solving this
problem. The system monitors data packets sent from the internal network to the Internet, and
captures and stores packets that may contain confidential information, based on certain
monitoring rules. In addition, the system includes a full-text search subsystem for all captured
files, which is the focus of this paper.

Lucene is a high-performance full-text search toolkit written in Java that can easily be
embedded in a variety of applications to give them full-text search capabilities. With its open
source nature and its excellent index structure and system architecture, Lucene has been widely
used.

In this paper, we design and implement an illegal file retrieval subsystem based on Lucene. As a
subsystem of the external network information leakage monitoring and querying system, it
searches the contents and related information of the captured files to find files containing
confidential information.

II. LUCENE

A. Lucene Introduction
Lucene is an open source, high-performance text search engine library written entirely in Java.
It is suitable for nearly any application that requires full-text search. Lucene's main job is
to build an index over the words of each file, which makes searching far more efficient than the
traditional approach of word-by-word comparison. Lucene provides a full-featured index engine, a
search engine and several text analysis engines. With its open source nature and its excellent
index structure and system architecture, Lucene has been widely used. With Lucene, programmers
can not only build dedicated full-text search applications, but also integrate it into a variety
of system software and build Web applications. Furthermore, some commercial software also uses
Lucene as the core of its full-text search subsystem [5].

B. System Architecture of Lucene
The system architecture of Lucene is shown in Fig. 1.

Package org.apache.lucene.document provides the logical representation of a document for
indexing and searching. Class Document encapsulates data into a virtual document, which is the
unit of indexing and search. A Document is a set of fields. Class Field encapsulates a section
of a Document, such as the title of an article.

Package org.apache.lucene.analysis provides classes that convert text into indexable/searchable
tokens, excluding information of no interest.

Package org.apache.lucene.index provides classes to maintain and access indexes. IndexReader is
an abstract class that provides an interface for accessing an index. Class IndexWriter creates
and maintains an index.

Package org.apache.lucene.queryParser provides classes to convert a query string into a query
object.

Package org.apache.lucene.search provides classes to search indexes, such as Searcher and
TopDocs. Class Searcher
is an abstract base class for search implementations. Class TopDocs represents the search
results.

[Figure 1: block diagram. Labels: User; files to be indexed; query string; query result;
org.apache.lucene.document; org.apache.lucene.queryParser; org.apache.lucene.analysis;
org.apache.lucene.index; org.apache.lucene.search; org.apache.lucene.store; Index]

Figure 1. System Architecture of Lucene

Package org.apache.lucene.store provides binary I/O APIs, used for all index data.

C. Operational Mechanism of Lucene
Lucene mainly provides two functions: index creation and index search [1]. The index creation
module reorganizes the data into a specified format that supports efficient, fast search. The
index search module queries the index based on query parameters and returns the results to
users [1].

1) Index creation module: The steps to create an index are as follows [6]:

First, convert the data into a format that Lucene can handle, i.e. a plain text character
stream.

Second, design and create a document object with a number of field objects. Then call the
addDocument method of class IndexWriter to pass the data to Lucene for storage.

Third, tokenize the text using the specified analyzer to make the data more suitable for
indexing.

Finally, write the result into the index file.

2) Index search module: The steps to search an index are as follows [6]:

First, call the QueryParser class to parse the query string entered by the user. This not only
parses the query string into the corresponding query object, but also tokenizes the query text
using the specified analyzer.

Finally, search the index files for query results, which are encapsulated in a TopDocs object
and returned to users.

III. SYSTEM DESIGN
The external network information leakage monitoring and querying system includes two
subsystems: the monitoring subsystem and the illegal file retrieval subsystem.

The former is mainly in charge of monitoring data packets sent from the internal network to the
Internet, capturing email messages, files uploaded to and downloaded from FTP servers, chat
content and so on, and then saving the captured data as files.

The latter is mainly responsible for creating indexes for the content and other related
information of the saved files. It also provides search capabilities so that users can find
files that contain the specified confidential information, together with information for
tracking the source, such as the source IP and the protocol type.

In this paper, we focus on the design and implementation of the illegal file retrieval
subsystem, whose system architecture is shown in Fig. 2.

[Figure 2: block diagram. Labels: User; Files captured (file type: PDF / TXT / Word / …);
Document preprocessing; Create index; Index; Get query string; Search index; Show search result]

Figure 2. System Architecture of Illegal File Retrieval Subsystem

IV. SYSTEM IMPLEMENTATION

A. Document preprocessing
The function of the document preprocessing module is to extract a plain text character stream
from files of different types, and then analyze it to output characteristic information.

The data source of the illegal file retrieval subsystem is the files captured and stored by the
monitoring subsystem, which may contain classified information. Considering the requirements of
users, we need to obtain the following information from each file: date, IP address where the
file was sent from, IP
address where the file was sent to, transfer protocol, and the contents of the file.

By designing an appropriate file structure, we can easily obtain all of this information except
the file content from the file path.

While the files captured by the monitoring subsystem are of various types, such as PPT and Word,
Lucene can only process plain text streams. Before creating an index with Lucene, we need to
design handlers for non-plain-text files to extract plain text streams.

B. Index creation
To create an index, we first create document objects based on the characteristic information
from the document preprocessing module. Then we call the addDocument method of class IndexWriter
to accept and analyze the document objects. Finally, the index is created and stored. The key
code is as follows:

    //create Document object
    Document doc = new Document();
    doc.add(new Field("filePath", filePath, Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS,
            Field.TermVector.WITH_POSITIONS_OFFSETS));
    //accept and analyze document objects (create = false: append to the existing index)
    IndexWriter iw = new IndexWriter(
            FSDirectory.open(this.indexdir),
            new StandardAnalyzer(Version.LUCENE_30),
            false, IndexWriter.MaxFieldLength.LIMITED);
    iw.addDocument(doc);
    iw.optimize();
    //create and store index
    iw.commit();
    iw.close();

C. Index updating
Since the data source of the illegal file retrieval subsystem is constantly changing, it is
necessary to update the index file regularly to ensure the accuracy of search results.

1) Creating new index: There are two ways to add to an index in Lucene, selected by a
constructor parameter of IndexWriter:

    //Directory d, Analyzer a, boolean create, MaxFieldLength mfl
    IndexWriter iw = new IndexWriter(d, a, create, mfl);

Parameter “create” can be assigned two values: true or false. Setting it to true re-creates the
whole index, while setting it to false only adds new entries to the existing index.

Considering the large size of the data source, and that only part of it is updated, we take the
second way to update the index.

2) Deleting index: There are two ways to delete from an index in Lucene. One is to delete a
single document based on its document ID; the other is based on a term, that is, to find all
documents matching the term and then delete them in bulk. Considering that the data source
deletes data on a monthly basis, we take the second way to delete from the index. The key code
is as follows:

    //open the index for modification (readOnly = false)
    IndexReader ir = IndexReader.open(FSDirectory.open(indexdir), false);
    //delete every document whose "date" field contains the given term
    ir.deleteDocuments(new Term("date", date));
    ir.close();

Lucene does not provide APIs to update an index on a schedule. Quartz is a powerful
enterprise-level scheduling framework that allows developers to schedule tasks according to time
intervals. With Quartz, we can regularly execute the task of adding to the index and the task of
deleting from it. First, we design a class that implements interface org.quartz.Job, implementing
its execute method with the code for updating the index. Then we create a task trigger,
configuring the details of scheduling such as the time interval. Finally, we launch task
scheduling.

D. Index search
The illegal file retrieval subsystem provides users with a query interface similar to Google and
other commercial search engines. The user inputs a classified keyword, and the system returns all
the results that satisfy the query conditions. The results include information for tracking the
source of the files that contain the classified information.

    //open index directory
    IndexSearcher indexSearch = new IndexSearcher(
            new SimpleFSDirectory(new File(indexDir)));
    //parse query string of users
    QueryParser queryParser = new QueryParser(
            Version.LUCENE_30, "contents",
            new StandardAnalyzer(Version.LUCENE_30));
    Query query = queryParser.parse(searchWord);
    //search index and return results
    TopDocs topDocs = indexSearch.search(query, Integer.MAX_VALUE);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        ScoreDoc sdoc = topDocs.scoreDocs[i];
        Document doc = indexSearch.doc(sdoc.doc);
        //get the stored filePath field
        String filePath = doc.get("filePath");
    }

E. Hit highlight
In order to make the display of search results friendlier, we highlight the hits in each search
result. The key code is as follows:

    //set display format for hit
    SimpleHTMLFormatter sHtmlF = new SimpleHTMLFormatter(
            "<b><font color='red'>", "</font></b>");
    Highlighter highlighter = new Highlighter(sHtmlF, new QueryScorer(query));
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        ScoreDoc sdoc = topDocs.scoreDocs[i];
        Document doc = indexSearch.doc(sdoc.doc);
        //hit highlight for the contents field
        //(analyzer is assumed to be the StandardAnalyzer used for query parsing)
        String summary = highlighter.getBestFragment(analyzer,
                "contents", doc.get("contents"));
    }
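
To make the effect of the formatter tags concrete, the following is a minimal plain-Java sketch of what hit highlighting produces: each occurrence of a query term in the text is wrapped in a pre tag and a post tag. It does not use Lucene's Highlighter and does not select a best fragment; the class name HitHighlightSketch and the method highlight are hypothetical, introduced here only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is NOT Lucene's Highlighter.
// Class and method names are hypothetical.
class HitHighlightSketch {

    // Wraps every case-insensitive occurrence of `term` in `text`
    // with preTag and postTag, leaving the rest of the text unchanged.
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return text; // nothing to highlight
        }
        // Pattern.quote treats the term as a literal, not as a regex.
        Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the matched text in its original case.
            String replacement = preTag + m.group() + postTag;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String summary = highlight("The budget report is confidential.",
                "confidential", "<b><font color='red'>", "</font></b>");
        System.out.println(summary);
    }
}
```

Lucene's Highlighter additionally uses the analyzer and the query to score fragments of the field text and return the best one, which this sketch leaves out.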
F. Operation result
The operation result of the illegal file retrieval subsystem is
shown in Fig. 3.

Figure 3. Operation result of illegal file retrieval subsystem

V. CONCLUSION
The emergence of Lucene has greatly promoted the application
of full-text retrieval technology in various industries and fields.
This paper first discussed the system architecture and operational
mechanism of Lucene. On this basis, a Lucene-based illegal
file retrieval subsystem was designed and implemented,
providing a comprehensive reference of practical significance
for similar industrial applications.

REFERENCES
[1] Yuehua Ding, Kui Yi, and RiHua Xiang, “Design of paper duplicate
detection system based on Lucene,” 2010 Asia-Pacific Conference on
Wearable Computing Systems, 2010.
[2] Bing Pan, Liangliang Xu, “Study of Chinese Blog Search Engine,”
Computer Engineering and Design, vol. 8, pp. 1719-1720, April 2010.
(in Chinese)
[3] Hongguang Suo, Xin Sun, “Research and development of Chinese full
text search engine based on Lucene,” Computer Engineering and
Design, vol. 19, pp. 5083-5086, October 2008. (in Chinese)
[4] Xiaowei Lang, Shenkang Wang, “Research and Development of Full
Text Search Engine Based on Lucene,” Computer Engineering, vol. 4,
pp. 96-99, February 2006. (in Chinese)
[5] http://www.lucene.com.cn/about.htm.
[6] Otis Gospodnetic, Erik Hatcher, “Lucene in Action,” Transl. Tan Hong.
China, Publishing House of Electronics Industry, 2007. (in Chinese)
