3 authors, including:
Y.H. Gu
Beijing University of Posts and Telecommunications
All content following this page was uploaded by Y.H. Gu on 26 April 2016.
Abstract—With its open source nature and its excellent index structure and system architecture, Lucene has greatly promoted the application of full-text retrieval technology in various industries and fields. In this paper, the system architecture and operational mechanism of Lucene are discussed. Further, according to the requirements of an external network information leakage monitoring and querying system, an illegal file retrieval subsystem based on Lucene is designed and implemented, and its system framework as well as each functional module is described in detail, providing a comprehensive reference for similar industrial applications.

Keywords—full-text retrieval; index updated regularly; lucene; search engine

I. INTRODUCTION

Today, business activities are highly dependent on information systems, and the Internet has become the main channel for companies to communicate with others and to exchange business data and information. Although it is convenient and efficient to use the Internet to communicate, it is also easy for confidential business information to be leaked to competitors through the open Internet. To prevent heavy losses caused by leakage of confidential business information through the network, and to ensure system and information security, it is necessary for companies to strictly monitor their staff’s online behavior.

The external network information leakage monitoring and querying system is aimed at solving the problem above. This system monitors data packets sent from the internal network to the Internet, and captures and stores the data packets that may contain confidential information, based on certain monitoring rules. In addition, this system includes a full-text search subsystem for all captured files, which is the focus of this paper.

Lucene is a high-performance full-text search toolkit written in Java, which can be easily embedded in a variety of applications to give them full-text search capabilities. With its open source nature, excellent index structure and system architecture, Lucene has been widely used.

In this paper, we design and implement an illegal file retrieval subsystem based on Lucene. As a subsystem of the external network information leakage monitoring and querying system, it searches the contents and related information of the captured files for files containing confidential information.

II. LUCENE

A. Lucene Introduction

Lucene is an open source, high-performance text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search. The main work of Lucene is to create an index over the words of each file, which greatly improves search efficiency compared with the traditional way of word-by-word comparison. Lucene provides a full-featured index engine, a search engine and some text analysis engines. With its open source nature, excellent index structure and system architecture, Lucene has been widely used. With Lucene, programmers can not only build dedicated full-text search applications, but also integrate it into a variety of system software and Web applications. Furthermore, some commercial software also takes Lucene as the core of its full-text search subsystem [5].

B. System Architecture of Lucene

The system architecture of Lucene is shown in Fig. 1.

Package org.apache.lucene.document provides the logical representation of a document for indexing and searching. Class Document encapsulates data into a virtual document, which is the unit of indexing and search. A Document is a set of fields. Class Field encapsulates a section of a Document, such as the title of an article.

Package org.apache.lucene.analysis provides classes that convert text into indexable/searchable tokens, discarding information that is of no interest for search.

Package org.apache.lucene.index provides classes to maintain and access indexes. IndexReader is an abstract class providing an interface for accessing an index. Class IndexWriter creates and maintains an index.

Package org.apache.lucene.queryParser provides classes to parse query strings.

Package org.apache.lucene.search provides classes to search indexes, such as Searcher and TopDocs. Class Searcher
[Figure omitted; surviving diagram labels: User, Index]
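To make the indexing idea of Section II concrete, the following is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation or API: the class TinyIndex and its methods are hypothetical names chosen for illustration, and the tokenizer is a simple lowercase split that only stands in for an Analyzer. It shows why a term lookup avoids the word-by-word comparison over every file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only: not Lucene's implementation or API.
final class TinyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // document id -> stored file path (plays the role of a stored field)
    private final List<String> storedPaths = new ArrayList<>();

    // Stand-in for an Analyzer: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one document and return its id.
    int add(String filePath, String content) {
        int id = storedPaths.size();
        storedPaths.add(filePath);
        for (String term : tokenize(content)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Resolve a single term with one map lookup instead of scanning all files.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        String key = term.toLowerCase(Locale.ROOT);
        for (int id : postings.getOrDefault(key, new TreeSet<>())) {
            hits.add(storedPaths.get(id));
        }
        return hits;
    }
}
```

In this sketch the cost of indexing is paid once per document in add, after which search touches only the posting list of the queried term. Scoring, multi-term queries and on-disk storage, which Lucene provides, are deliberately left out.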
// d: Directory that holds the index, a: Analyzer, create: boolean flag
IndexWriter iw = new IndexWriter(d, a, create);

Parameter “create” can be assigned two values: true or false. Setting it to true re-creates the whole index, while setting it to false only adds new entries to the existing index.

Because the data source is large and only part of it is updated at a time, we take the second way to update the index.

2) Deleting index: There are two ways to delete index entries in Lucene. One way is to delete a single document based on its document ID; the other way is based on a term, that is, a search operation is performed first and all of its results are then deleted in bulk. Considering the data source deleting data on a monthly basis,

// get the stored file path of the hit document
String filePath = doc.get("filePath");

E. Hit highlight

In order to make the display of search results friendlier, we highlight the hits in each search result. The key code is as follows:

// SimpleHTMLFormatter, Highlighter and QueryScorer come from
// package org.apache.lucene.search.highlight
// set the display format for hits
SimpleHTMLFormatter sHtmlF =
    new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
Highlighter highlighter = new Highlighter(sHtmlF, new QueryScorer(query));

for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc sdoc = topDocs.scoreDocs[i];
    Document doc = indexSearch.doc(sdoc.doc);
    // highlight the hits in the content field
    String summary =
        highlighter.getBestFragment(analyzer, "content", doc.get("contents"));
}
F. Operation result
The operation result of the illegal file retrieval subsystem is shown in Fig. 3.
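As a rough illustration of the kind of marked-up text that the hit highlighting step in Section E produces for display, the sketch below wraps each occurrence of a query term in a pair of tags using only the JDK. It is an assumption-laden stand-in, not Lucene's Highlighter: the class TinyHighlighter is a hypothetical name, it handles a single term with whole-word, case-insensitive matching, and it does no fragment selection or scoring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not Lucene's Highlighter.
final class TinyHighlighter {
    private final String preTag;
    private final String postTag;

    TinyHighlighter(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    // Wrap every whole-word, case-insensitive occurrence of term in the tags.
    String highlight(String text, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the tags or text literal
            m.appendReplacement(out,
                    Matcher.quoteReplacement(preTag + m.group() + postTag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Constructing it with the same tag pair as in Section E, new TinyHighlighter("<b><font color='red'>", "</font></b>"), yields text in which each hit appears in bold red when rendered as HTML, while the original casing of the matched word is preserved.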
V. CONCLUSION
The emergence of Lucene has greatly promoted the application of full-text retrieval technology in various industries and fields. In this paper, the system architecture and operational mechanism of Lucene were first discussed. On this basis, a Lucene-based illegal file retrieval subsystem was designed and implemented, providing a comprehensive reference of practical significance for similar industrial applications.
REFERENCES
[1] Yuehua Ding, Kui Yi, and RiHua Xiang, “Design of paper duplicate detection system based on Lucene,” 2010 Asia-Pacific Conference on Wearable Computing Systems, 2010.
[2] Bing Pan and Liangliang Xu, “Study of Chinese Blog Search Engine,” Computer Engineering and Design, vol. 8, pp. 1719-1720, April 2010. (in Chinese)
[3] Hongguang Suo and Xin Sun, “Research and development of Chinese full text search engine based on Lucene,” Computer Engineering and Design, vol. 19, pp. 5083-5086, October 2008. (in Chinese)
[4] Xiaowei Lang and Shenkang Wang, “Research and Development of Full Text Search Engine Based on Lucene,” Computer Engineering, vol. 4, pp. 96-99, February 2006. (in Chinese)
[5] http://www.lucene.com.cn/about.htm.
[6] Otis Gospodnetic and Erik Hatcher, “Lucene in Action,” Transl. Tan Hong. China: Publishing House of Electronics Industry, 2007. (in Chinese)