
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE SYSTEMS JOURNAL 1

SE-PPFM: A Searchable Encryption Scheme Supporting Privacy-Preserving Fuzzy Multikeyword in Cloud Systems

Mingwu Zhang, Yu Chen, and Jiajun Huang

Abstract—Cloud computing provides an emerging application paradigm with a compelling vision for managing big-data files and responding to queries over a distributed cloud platform. To overcome privacy-revealing risks, sensitive documents and private data are usually stored in the clouds in encrypted form. However, it is inefficient to search the data in traditional encryption systems. Searchable encryption is a useful cryptographic primitive that enables users to retrieve data in ciphertexts. However, traditional searchable encryption schemes provide lower search efficiency and cannot carry out fuzzy multikeyword queries. To solve this issue, in this article, we propose a searchable encryption scheme that supports privacy-preserving fuzzy multikeyword search (SE-PPFM) in cloud systems, which is built from asymmetric scalar-product-preserving encryption and Hadamard product operations. To realize efficient fuzzy searches, we employ Word2vec as the machine-learning primitive to obtain a fuzzy correlation score between encrypted data and query predicates. We analyze and evaluate the performance in terms of the multikeyword token, retrieval and match time, file retrieval time, and matching accuracy. The experimental results show that our scheme achieves a higher efficiency in fuzzy multikeyword ciphertext search and provides a higher accuracy in the retrieving and matching procedure.

Index Terms—Asymmetric scalar-product-preserving encryption (ASPE), fuzzy search, multikeyword, privacy protection, searchable encryption.

NOMENCLATURE
Notations
$\omega$    Keyword.
$t$    Number of the files.
$tf$-$idf$    Correlation score.
$\ast$    Hadamard product operation.
$E_Q$    Encryption of query vector.
$E_I$    Encryption of index vector.
$sc_i$    Similarity score.
$K$    Symmetric key.
$W$    Keyword matrix.
$w_v$    Index matrix.
$Q$    Search matrix.

Manuscript received January 30, 2020; revised May 20, 2020; accepted May 20, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61672010, in part by the Open Research Project of State Key Laboratory of Cryptology of China, and in part by the Key Projects of Guangxi Natural Science Foundation under Grant 2019JJD170020. (Corresponding author: Mingwu Zhang.)
Mingwu Zhang is with the School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China, with the State Key Laboratory of Cryptology, Beijing 100878, China, and also with the School of Computers, Hubei University of Technology, Wuhan 430068, China (e-mail: mzhang@mail.hbut.edu.cn).
Yu Chen and Jiajun Huang are with the School of Computers, Hubei University of Technology, Wuhan 430068, China (e-mail: ychen@hbut.edu.cn; 369109873@qq.com).
Digital Object Identifier 10.1109/JSYST.2020.2997932

I. INTRODUCTION

IN RECENT years, with the rapid development of big data and cloud computing, a great number of electronic data and documents are outsourced into the clouds for secure storing and processing. Cloud servers, which provide powerful storage and computational ability, give a potential solution for the processing of large-scale electronic data and documents, and users can delegate the clouds to manage and search the data stored in the cloud servers. Also, commercial companies and hospitals store large amounts of electronic documents and medical data in cloud servers to reduce their own storage cost, such as mobile e-commerce systems and public medical systems [1]. However, sensitive electronic files stored in the clouds might be threatened by outsider or insider attackers such as network hackers, dishonest cloud servers, and malicious adversaries. Therefore, the privacy protection of sensitive electronic documents is a crucial issue in outsourced cloud services.

Encrypting and uploading electronic documents, and downloading and deciphering them when users need them, seems an effective approach to address this potential threat. However, once documents are transformed into ciphertext, they do not retain their original characteristics. For example, neither the user nor the cloud server can quickly distinguish which documents the user requires in the ciphertext form. Some new approaches such as searchable encryption (SE) and differential privacy provide useful solutions to perform the document search while guaranteeing privacy and security.

SE can not only perform the search on ciphertexts but also protect the security and privacy of data in a semihonest cloud model. Golle et al. [2], the first proposers of symmetric searchable encryption, suggested binding the files to their corresponding keywords and finding the corresponding ciphertext files through the search of the keywords to realize the ciphertext search. After that, many symmetric-based SE schemes [3], [4] have been proposed. The authors in [5] and [6] put forward a keyword-ranking search scheme, which mainly achieves the accurate ranking of search results by encrypting the correlation score. To achieve
1937-9234 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Imperial College London. Downloaded on June 20,2020 at 06:44:56 UTC from IEEE Xplore. Restrictions apply.

higher security, some schemes [7]–[9] were proposed to build searchable encryption by using public-key cryptography techniques.

However, many existing schemes support only exact keyword search, which matches the trapdoor and the keyword in a one-by-one manner. Meanwhile, several fuzzy search schemes [10]–[13] only consider the fuzziness of keyword characters. In practical applications, there exist a great number of synonyms, and users will not be able to enter all the synonyms of keywords to query the required files, which leads to inaccurate search results and fails to meet the user's higher accuracy and security requirements.

The authors in [10]–[13] employ the one-hot model to represent the keyword vector, which means that each element in the vector is associated with only one word in the lexicon: the element corresponding to the specified word is set to 1, and the other elements are set to 0. However, as the total number of keywords increases, the length of the bag-of-words (BOW) vector grows dramatically. Furthermore, it is impossible to make synonym comparisons of word vectors with this representation. In the calculation of text similarity, the BOW model is therefore impractical, so we employ Word2vec instead of the one-hot model to construct keyword vectors.

Based on the above-mentioned issues, in this article, we propose a multikeyword fuzzy searchable scheme (SE-PPFM) with privacy protection in cloud servers. Concretely, we employ TF-IDF in the Huffman tree as the correlation weight of the word in the document, and create the keyword vectors by Word2vec to realize the fuzzy search of synonyms. Meanwhile, we combine asymmetric scalar-product-preserving encryption (ASPE) [14] and the Hadamard product to build a secret index and then achieve privacy-preserving searchable encryption. The contributions of this article are summarized as follows.

1) We design and build a fast SE scheme by employing techniques such as asymmetric scalar-product-preserving encryption and the Hadamard product primitive. Compared with previous schemes, we generalize the ASPE scheme from one-dimensional index vectors and query vectors to multidimensional matrices, resulting in a large reduction in the key matrix dimension needed in the construction of the secret index and search token, which effectively enhances the search efficiency.
2) We employ term frequency-inverse document frequency (TF-IDF) to calculate the correlation scores of keywords in files, and the scores are used to build the Huffman tree. Thus, we can train a set of keywords by using Word2vec and obtain the similarity score, which provides the fuzzy match ability. We compute the Tanimoto coefficients between files and trapdoors to sort the correlation of matching files and retrieve the top-k correlated documents for the data user.
3) We provide the concrete construction of the proposed SE-PPFM scheme, and analyze its performance in terms of computation cost, communication overhead, and match accuracy. The experimental results indicate that our proposed scheme is efficient and practical.

The rest of the article is organized as follows. Section II overviews the related works, and Section III gives the approaches and tools that will be used in this article. Section IV describes the system framework, such as system model, entity roles, and threat model. Section V presents our concrete protocol, and Section VI gives the security proof and correctness analysis. In Section VII, we give the experimental result evaluations, and Section VIII concludes this article.

II. RELATED WORK

Traditional SE schemes [2]–[4] allow a data user to securely search the dataset over ciphertexts through the search token. However, these schemes can only search exact keywords and cannot perform fuzzy search. For example, data users may produce query requests that differ from the preset keywords, although the meaning of the query is similar to some keywords. On this precondition, Li et al. [10] proposed a fuzzy SE scheme, which returns the closest files to data users and employs edit distance to quantify keyword similarity.

Wong et al. [14] proposed a secure kNN computation on encrypted databases for efficient searchable processing, which uses an ad hoc security approach that employs the ASPE technology to implement the fuzzy process. A key characteristic of ASPE is the conservation of the inner product between the data vector and the search predicate vector, which makes it especially suitable for designing a sublinear search based on distance and similarity queries. Recently, more and more works have been proposed and enhanced for ASPE to sustain a variety of important searches in cloud computing, including nearest neighbor searches [15], range searches [16], similarity-based ranking searches [17]–[19], fuzzy multikeyword match searches [12], [13], secure image search [20]–[22], and so on.

Li et al. proposed a symmetric searchable scheme in medical clouds that supports dynamic search [24]. To provide access control in spatial data, the authors in [28] gave a range query with access control over encrypted geometric data. Recently, Shen et al. [26] proposed a light and privacy-preserving decision protocol that supports fair location determination [27]–[29]. In [30], Fan and Huang proposed a controllable and privacy-preserving search scheme that is based on symmetric predicate encryption. In [31], Zhang et al. proposed a secure and privacy-preserving optimal algorithm to support distributed fractional knapsack.

III. PRELIMINARIES, METHOD, AND TOOLS

A. Notations

If $x$ is a binary string, then $|x|$ denotes its bit length. If $S$ is a finite set, then $|S|$ denotes its size and $x \leftarrow S$ denotes sampling a member uniformly from $S$ and assigning it to $x$. We use $\lambda \in \mathbb{N}$ to denote the security parameter and $1^{\lambda}$ to denote its unary representation. The terms and notations used in this article are listed in the Nomenclature section.

B. Word2vec

Word2vec was created by a team of researchers led by Tomas Mikolov at Google [23]; it is a group of related models that are used to produce word embeddings. Each word can be represented by the distributive weight of elements, which represents the meaning of a word in an abstract way.


Fig. 1. Example of word vector generated by Word2vec.

Fig. 2. Instance of Huffman tree.

In Fig. 1, we use the words in the vocabulary to represent three-dimensional data, i.e., Royalty, Masculinity, and Ability. For example, the word vector of "king" is set as (0.99, 0.94, 0.78). We can easily find that the relationship between the word vectors is evaluated as follows:

\[ \overrightarrow{\text{King}} - \overrightarrow{\text{Man}} + \overrightarrow{\text{Woman}} = \overrightarrow{\text{Queen}}. \]

Concretely, Word2vec usually employs a Huffman tree, shown in Fig. 2, instead of the mapping from the hidden layer to the output SoftMax layer; the Huffman tree is a binary tree constructed from the weights. Each leaf node in the tree represents a word, and for each leaf there is a unique path from the root to the leaf node, which is used to estimate the probability of the word represented by the leaf node.

C. Tanimoto Coefficient

Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance. More concretely, a Tanimoto distance is often stated as being a proper distance metric, probably because of its confusion with Jaccard distance. Let $A$ and $B$ be two vectors. If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be rewritten as

\[ \mathrm{TanSim}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} \tag{1} \]

where the same calculation is expressed in terms of a vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1),

\[ A \cdot B = \sum_i [a_i b_i], \qquad |A|^2 = \sum_i a_i^2. \tag{2} \]

D. Term Frequency-Inverse Document Frequency

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. We write $tf_{\omega,f}$ for the frequency of keyword $\omega$ appearing in a file $f$; it is defined as follows:

\[ tf_{\omega,f} = \frac{n_{\omega,f}}{n_{\omega,F}} \tag{3} \]

where $n_{\omega,f}$ denotes the number of occurrences of keyword $\omega$ in a file $f$, $n_{\omega,F}$ denotes the number of occurrences of keyword $\omega$ in the file set $F = (f_1, \ldots, f_t)$, and $t$ is the number of files.

We use the term $idf$ to represent how rare a keyword is across all files, which is defined as follows:

\[ idf_{\omega} = \log \frac{t}{1 + df_{\omega}} \tag{4} \]

where $df_{\omega}$ denotes the number of files containing keyword $\omega$. For each keyword $\omega$ in each file $f$, the correlation score can be calculated as

\[ tf\text{-}idf = tf_{\omega,f} \times idf_{\omega}. \tag{5} \]

E. Hadamard Product

For two matrices $A = (a_{i,j})_{i,j \in \mathbb{N}}$ and $B = (b_{i,j})_{i,j \in \mathbb{N}}$ with the same dimension $m \times n$, the Hadamard product $A \ast B$ is a matrix with the same dimension as $A$ and $B$, whose elements are given by

\[ A \ast B = (a_{i,j} b_{i,j})_{i,j \in \mathbb{N}} \]

i.e.,

\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\ast
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}
=
\begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & a_{13} b_{13} \\ a_{21} b_{21} & a_{22} b_{22} & a_{23} b_{23} \\ a_{31} b_{31} & a_{32} b_{32} & a_{33} b_{33} \end{bmatrix}.
\]

Each element of the Hadamard product equals the product of the corresponding elements of $A$ and $B$.

F. Asymmetric Scalar-Product-Preserving Encryption

ASPE is an encryption mechanism for vectors. Let $E_Q$ be the encryption algorithm of the Query vector and $E_I$ be the encryption algorithm of the Index vector. The ciphertext $I'_i$ of an index vector $I_i$ and the ciphertext $Q'$ of a query vector $Q$ are created as follows:

\[
\begin{aligned}
I'_i &= E_I(I_i, sk) = I_i M \\
Q' &= E_Q(Q, sk) = M^{-1} Q^T
\end{aligned}
\tag{6}
\]

where $M$ is an invertible $n \times n$ matrix that acts as the secret key. The ASPE scheme preserves the inner product of the $I$ vector and the $Q$ vector as

\[ I'_i \cdot Q' = I_i M \cdot M^{-1} Q^T = I_i \cdot Q. \tag{7} \]
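The inner-product preservation in (7) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a toy key made of small random integers, exact rational arithmetic so that the equality holds without rounding error, and hypothetical helper names (`matmul`, `inverse`, `random_invertible`) that do not appear in the article.

```python
from fractions import Fraction
import random

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if singular."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def random_invertible(n, rng):
    """Toy key generation (an assumption): redraw until the matrix is invertible."""
    while True:
        M = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return M, inverse(M)
        except ValueError:
            continue

rng = random.Random(7)
n = 4
M, M_inv = random_invertible(n, rng)                        # secret key M and its inverse

I = [[Fraction(1, 2), 0, Fraction(6, 5), Fraction(3, 10)]]  # index vector, 1 x n
Q = [[1, 0, 1, 0]]                                          # query vector, 1 x n

I_enc = matmul(I, M)                                   # E_I(I, sk) = I M
Q_enc = matmul(M_inv, [list(c) for c in zip(*Q)])      # E_Q(Q, sk) = M^{-1} Q^T

plain = sum(a * b for a, b in zip(I[0], Q[0]))         # I . Q on plaintexts
cipher = matmul(I_enc, Q_enc)[0][0]                    # I' . Q' on ciphertexts
assert plain == cipher
```

Exact fractions are used here only to make the identity visible; a practical implementation would typically use floating-point matrices and compare results up to a tolerance.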


Fig. 3. System model of SE-PPFM.

IV. SYSTEM MODEL

A. Overview

Our system model is illustrated in Fig. 3. We consider a storage platform that offers encrypted file storage and keyword search services at low cost. From a high-level point of view, our design first adopts the SE techniques in the cloud model. Data owners choose a symmetric encryption scheme for all files to ensure file security. At the same time, in order to make the files searchable, for each file, the data owner extracts $m \times n$ keywords in the file by using the TF-IDF technique as the index of the file. In a more general situation, the keywords that the data users want to search will not be exactly the same as those index keywords. Therefore, we need to consider fuzzy search on the meaning of the word. To solve this problem, the data owner uses the Word2vec technology to realize the fuzzy search. In particular, the data owner uses all the files as the input of Word2vec to train the semantic model and generates the keyword vectors for the index keywords. In this case, the system can return the correct file to the data user even if the data user uses a word with the same meaning as the index keyword in the search request.

The above discussion shows that our scheme supports fuzzy search over keywords. However, external attackers can still obtain relevant content information of files from the file index. To address this issue, the data owner employs the ASPE technology to encrypt the file index, making it impossible for an external attacker to attain any information about the file even if the attacker has obtained the file index. To ensure that the encrypted file index in the cloud server can be correctly retrieved, the data owner builds a search token connected to the search request. The search token can match the encrypted file index without revealing any information about the user's search request. After receiving the search token and the symmetric key, the data user sends the search token to the cloud server to retrieve the encrypted files, and the cloud server calls the match algorithm to output the results (matched files) in ciphertext form. Finally, the data user decrypts the matched files using the symmetric keys.

B. SE-PPFM Model

In this section, we design the SE-PPFM system model for multikeyword-based fuzzy searchable encryption, as shown in Fig. 3. The encryption scheme involves three roles: data owner, cloud server, and data user.

Data owner: This entity owns the plaintext of all files. To guarantee security and reduce the storage price, the data owner encrypts all the files and stores them in the cloud server. To realize the fuzzy search for encrypted files, the data owner extracts the keywords, uses the documents as the input of the Word2vec algorithm, and then creates the keyword vectors. Finally, it generates a secret index for each encrypted file and creates a search token that is based on the query request of the data user.

Cloud server: The cloud server, as an outsourced server, is responsible for storing a large number of encrypted files from all data owners and performs the encrypted search. In our system model, the cloud server is honest but curious, i.e., semihonest. It honestly performs the search operation and returns the results, while it snoops into the stored information and the query predicates from data users.

Data user: The data users send query requests to the cloud server and decrypt the matching ciphertexts to obtain the searched plain data.

In Fig. 3, we give the system model and framework of our protocol. The SE-PPFM system consists of six algorithms, which are described as follows.

1) Train: In the train phase, the data owner trains the Word2vec model and creates the keyword vectors for each file.
2) Encryption: The data owner chooses a symmetric encryption scheme (e.g., AES) to encrypt all files and sends all encrypted files to the cloud server.
3) IndexGen: In the index generation phase, the data owner builds a keyword matrix and generates the secret file index.
4) Trapdoor: In the trapdoor algorithm, the data owner takes as input the search request from the data user, generates the search vectors, and builds the search token. Subsequently, the data owner sends the token and the symmetric key to the data user.
5) Match: In the match and search phase, the cloud server calls this algorithm to search the encrypted files and sends the top-k most similar ciphertexts to the data user.
6) Decryption: In the decryption algorithm, the data user uses the symmetric keys to decrypt the received ciphertexts and obtains the matched results.

C. Threat Model

In our model, we assume that the cloud server is semihonest: the cloud server is interested in both the data owner's private files and the data user's query keywords, and it tries to learn the content and abuse it for its own benefit. Meanwhile, external attackers could also hack into the cloud and download those files. Therefore, the data owners should encrypt files before they store the data in the cloud server.


Meanwhile, in the search phase, the secret file index cannot divulge any information of the file, that is, neither an external attacker nor the semihonest cloud server can learn any information about the search request from a secure search token.

V. DESIGN OF THE SCHEME

In this section, we present the concrete construction of the fuzzy SE scheme. In particular, we show how to use Word2vec and ASPE to build the secret index and the search token. This protocol is transparent to the users and can protect the confidentiality of files and query keywords.

A. Setup Algorithm

To enable the cloud to provide the encrypted keyword search functionality for data users, the data owner first builds an encrypted index from the file warehouses, and then uploads this index to the cloud server.

In this phase, the data owner generates the system parameters: an $m \times n$ random matrix $P = [p_{ij}]_{m \times n}$ ($p_{ij} \in \{0, 1\}$) and two $n \times n$ random matrices $M_1$, $M_2$ for encrypting the index of the file, where $m \times n$ is the number of all keywords, and $n$ is much larger than $m$.

At the same time, the data owner chooses a symmetric encryption algorithm $Enc_k / Dec_k$. Therefore, the secret key of the data owner is set as $sk = (P, M_1, M_2)$.

Furthermore, the data owner extracts keywords for each file and calculates the TF-IDF score $tf \times idf$ as follows:

\[ tf_{\omega,f} = \frac{n_{\omega,f}}{n_{\omega,F}} \tag{8} \]

\[ idf_{\omega} = \log \frac{t}{1 + df_{\omega}}. \tag{9} \]

Finally, the data owner uses the extracted keywords to build a keyword matrix $W$, i.e.,

\[ W = [\omega_{ij}]_{m \times n}. \tag{10} \]

B. Train Algorithm

To achieve the fuzzy search, the data owner takes the files as input to the Word2vec algorithm, and uses the keyword $tf \times idf$ score as the weight of the Huffman tree to train the keyword model. At this point, if the query request of the data user contains a keyword $\omega'$ that is not in $W$, the data owner uses the trained keyword model to find a keyword $\omega_i$ that is most similar to $\omega'$. The similarity score $sc_i$ of word $\omega'$ to $\omega_i$ is calculated as follows:

\[ sc_i = \mathrm{sim}(\omega', \omega_i). \]

Then, the data owner uses $\omega_i \cdot sc_i$ instead of $\omega'$ to build a search token.

C. Encryption Algorithm

The data owner first generates the symmetric key $K = (k_1, k_2, \ldots, k_t)$ for the files $F = (f_1, f_2, \ldots, f_t)$ and encrypts the files to the ciphertext $CT = (ct_1, ct_2, \ldots, ct_t)$.

D. IndexGen Algorithm

For a more accurate description, the data owner uses the $tf \times idf$ score to indicate the correlation between the keywords and the files, and uses these keywords and the keyword matrix $W$ to build the file index as follows.

1) If the keyword $\omega_i$ appears in the file $f$, the element value of the keyword matrix is set to $tf_{\omega_i,f} \times idf_{\omega_i}$.
2) If the keyword $\omega_i$ does not appear in the file $f$, the element value of the keyword matrix is set to 0.

Subsequently, to ensure the security and searchability of the IndexGen algorithm, the data owner uses matrix $P$ to divide the keyword matrix $w_v$ of file $v$ into matrix $I_{v,a} = [a_{ij}]_{m \times n}$ and matrix $I_{v,b} = [b_{ij}]_{m \times n}$, and calls the ASPE scheme to encrypt them as in Algorithm 1. Thus, the secret index matrix of file $v$ is encrypted as

\[ Index_v = \{I'_{v,a}, I'_{v,b}\}. \]

Finally, the data owner computes $A_v = \sum \omega_{ij}^2$. For file $v$, the data owner sends $(Index_v \,\|\, A_v \,\|\, ct_v)$ to the cloud server.

Algorithm 1: IndexGen.
INPUT: index matrix $w_v = [\omega_{ij}]_{m \times n}$ of file $v$
       secret key $sk = (P, M_1, M_2)$
BEGIN:
  For each element $p_{ij}$ of $P$ do:
    If $p_{ij} = 1$:
      $a_{ij} \leftarrow$ Random
      Let $b_{ij} = \omega_{ij} - a_{ij}$;
    Else ($p_{ij} = 0$):
      Let $a_{ij} = b_{ij} = \omega_{ij}$;
  End For;
  Let the secret index be:
    $I'_a = E_I(I_a, M_1) = I_a M_1$
    $I'_b = E_I(I_b, M_2) = I_b M_2$
    // $E_I$ is the encryption algorithm of Index
    $Index_v = \{I'_a, I'_b\}$
END;
OUTPUT: secret index $Index_v = \{I'_a, I'_b\}$

E. Trapdoor Algorithm

According to the search requirement, data users send query requests to data owners through secure channels. The data owner verifies the identity and privileges of the data user and builds a search token based on the query request.

When the query request of the data user is accepted, the data owner transforms the query request vector of the data user into the query matrix according to the keyword matrix $W$. Similar to building a file index, if the keyword $\omega_i$ of the query request appears in the keyword matrix $W$, the corresponding element value is set to 1. Otherwise, the element value is set to 0. There is another special case in this process: the query request of the data user may contain a keyword $\omega'$ that does not appear in $W$. By using the trained keyword model,


the data owner finds a keyword $\omega_i$ that is most similar to $\omega'$. The similarity score $sc_i$ between the words $\omega'$ and $\omega_i$ is calculated by the data owner as follows:

\[ sc_i = \mathrm{sim}(\omega', \omega_i) \]

where sim denotes the similarity of two strings.

Then, the data owner uses $\omega_i \cdot sc_i$ instead of $\omega'$ to build a search matrix $Q = [q_{ij}]_{m \times n}$.

As in IndexGen, the data owner uses the matrix $P$ to divide the search matrix $Q$ into matrix $Q_a = [x_{ij}]_{m \times n}$ and matrix $Q_b = [y_{ij}]_{m \times n}$, and uses the ASPE scheme to encrypt them as in Algorithm 2. Thus, the secret search matrix is encrypted as $TD_Q = \{Q'_a, Q'_b\}$.

Finally, the data owner evaluates $B = \sum q_{ij}^2$, and then sends $(TD_Q \,\|\, B \,\|\, K)$ to the data user.

The Trapdoor algorithm is described as follows.

Algorithm 2: Trapdoor.
INPUT: keyword matrix $Q = [q_{ij}]_{m \times n}$ for search
       secret key $sk = (P, M_1, M_2)$
BEGIN:
  For each element $p_{ij}$ of $P$ do:
    If $p_{ij} = 1$:
      Let $x_{ij} = y_{ij} = q_{ij}$;
    Else ($p_{ij} = 0$):
      $x_{ij} \leftarrow$ Random
      Let $y_{ij} = q_{ij} - x_{ij}$;
  End For;
  Let the search token be:
    $Q'_a = E_Q(Q_a, M_1) = (M_1^{-1} Q_a^T)^T$
    $Q'_b = E_Q(Q_b, M_2) = (M_2^{-1} Q_b^T)^T$
    // $E_Q$ is the encryption algorithm of Query
    $TD_Q = \{Q'_a, Q'_b\}$
END;
OUTPUT: search token $TD_Q = \{Q'_a, Q'_b\}$

F. Match Algorithm

To search the files that the data user requests, the data user first sends a search token to the cloud server. As both the search indexes and the search tokens are encrypted, the cloud server cannot learn the files or the query information in the matching procedure. The cloud server obtains the matching files as follows:

\[
\begin{aligned}
Index_v \ast TD_Q &= \{I_a M_1, I_b M_2\} \ast \{(M_1^{-1} Q_a^T)^T, (M_2^{-1} Q_b^T)^T\} \\
&= (I_a M_1) \ast (M_1^{-1} Q_a^T)^T + (I_b M_2) \ast (M_2^{-1} Q_b^T)^T \\
&= I_a \ast Q_a + I_b \ast Q_b.
\end{aligned}
\tag{11}
\]

The last equality should be read in terms of the sum of all matrix elements, as derived in (19); the matrices on the two sides are in general not equal element by element.

Let the elements of $w \ast Q$ be $w \ast Q = [\omega_{ij} q_{ij}]_{m \times n}$. The cloud server calculates the sum of all the elements of the matrix $w \ast Q$ as follows:

\[ G = \sum_{i,j} [\omega_{ij} q_{ij}]_{m \times n} = \sum_{i,j} [I_a \ast Q_a + I_b \ast Q_b]_{ij} = \sum_{i,j} [tf \times idf_{ij} \cdot sc_{ij}]. \tag{12} \]

Then, it can use the Tanimoto coefficient to calculate the similarity between the secret file index and the search token as follows:

\[ \mathrm{TanSim}(w, Q) = \frac{G}{|w|^2 + |Q|^2 - G} = \frac{G}{A + B - G}. \tag{13} \]

Finally, the cloud server ranks the files in descending order of their Tanimoto coefficient values, and returns the top-k most correlated files to the data user.

G. Decryption Algorithm

After receiving the ciphertexts, the data user employs the symmetric key $K$ to decrypt the matched ciphertexts as $f_v = Dec_{k_v}(ct_v)$.

VI. ANALYSIS

A. Correctness

In this section, we show the correctness of the combination of the Hadamard product and the ASPE scheme. It is easy to see that ASPE preserves the inner product of two vectors $Q$ and $I$ as follows:

\[ I'_v \cdot Q' = I_v M \cdot M^{-1} Q^T = I_v \cdot Q. \]

Let $I_v = [c_i]_{1 \times n}$ and $Q = [d_i]_{1 \times n}$ be two $1 \times n$ vectors, and let $M = [x_1, x_2, \ldots, x_n]_{n \times n}$ and $M^{-1} = [y_1, y_2, \ldots, y_n]_{n \times n}$ be the secret key, in which the $x_i$ are $n \times 1$ column vectors and the $y_i$ are $1 \times n$ row vectors. Then, $I_v M$ and $M^{-1} Q^T$ are constructed as

\[
\begin{aligned}
I_v M &= [I_v \cdot x_1, \ldots, I_v \cdot x_n]_{1 \times n} \\
M^{-1} Q^T &= [y_1 \cdot Q^T, \ldots, y_n \cdot Q^T]_{n \times 1}.
\end{aligned}
\tag{14}
\]

Thus, the equation $I_v M \cdot M^{-1} Q^T = I_v \cdot Q^T$ is represented as

\[ \sum_{i=1}^{n} (I_v \cdot x_i) \cdot (y_i \cdot Q^T) = \sum_{i=1}^{n} c_i d_i. \tag{15} \]

The above-mentioned equation relies on $\sum_{i=1}^{n} x_i \cdot y_i = 1$ (that is, $M M^{-1}$ is the identity), which is the key point in extending the ASPE scheme from vectors to matrices.

In our scheme, the index and the query are encoded into two $m \times n$ matrices, that is

\[ I_v = [c_1, \ldots, c_m]_{m \times n}, \qquad Q = [d_1, \ldots, d_m]_{m \times n}. \]


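We use the Hadamard product and ASPE scheme to build the equations as follows:

\[
I_v M =
\begin{bmatrix}
c_1 \cdot x_1 & \cdots & c_1 \cdot x_n \\
\vdots & \ddots & \vdots \\
c_m \cdot x_1 & \cdots & c_m \cdot x_n
\end{bmatrix}_{m \times n}
\tag{16}
\]

\[
M^{-1} Q^T =
\begin{bmatrix}
y_1 \cdot d_1 & \cdots & y_1 \cdot d_m \\
\vdots & \ddots & \vdots \\
y_n \cdot d_1 & \cdots & y_n \cdot d_m
\end{bmatrix}_{n \times m}.
\tag{17}
\]

The evaluation of the Hadamard product $(I_v M) \ast (M^{-1} Q^T)^T$ is described as

\[
(I_v M) \ast (M^{-1} Q^T)^T =
\begin{bmatrix}
c_1 \cdot x_1 \cdot y_1 \cdot d_1 & \cdots & c_1 \cdot x_n \cdot y_n \cdot d_1 \\
\vdots & \ddots & \vdots \\
c_m \cdot x_1 \cdot y_1 \cdot d_m & \cdots & c_m \cdot x_n \cdot y_n \cdot d_m
\end{bmatrix}_{m \times n}.
\tag{18}
\]

Thus, the sum of all elements of the matrix $(I_v M) \ast (M^{-1} Q^T)^T$ is calculated as

\[ \sum_{i=1}^{m} \sum_{j=1}^{n} (c_i \cdot x_j) \cdot (y_j \cdot d_i) = \sum_{i=1}^{m} c_i \cdot d_i. \tag{19} \]

Fig. 4. Matching time on different number of keywords.

Fig. 5. Search token time on different number of keywords.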
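The sum-preservation property in (19), together with the splitting steps of Algorithms 1 and 2 and the Tanimoto score in (13), can be exercised end to end on a toy instance. The sketch below is illustrative only and rests on several assumptions: the keys are small random integer matrices, arithmetic is exact (rational) rather than floating point, the matrices $W$ and $Q$ are made-up values, and all function names (`split`, `hadamard_sum`, `random_key`, and so on) are hypothetical.

```python
from fractions import Fraction
import random

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if singular."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def random_key(n, rng):
    """Toy key generation (an assumption): redraw until the matrix is invertible."""
    while True:
        M = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return M, inverse(M)
        except ValueError:
            continue

def split(X, P, split_on, rng):
    """Where P equals split_on, the two shares are random and sum to X;
    elsewhere both shares equal X (Algorithm 1 uses 1, Algorithm 2 uses 0)."""
    A, B = [], []
    for x_row, p_row in zip(X, P):
        a_row, b_row = [], []
        for x, p in zip(x_row, p_row):
            if p == split_on:
                a = Fraction(rng.randint(-9, 9))
                a_row.append(a)
                b_row.append(x - a)
            else:
                a_row.append(x)
                b_row.append(x)
        A.append(a_row)
        B.append(b_row)
    return A, B

def hadamard_sum(A, B):
    """Sum of all elements of the Hadamard product A * B."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

rng = random.Random(1)
m, n = 2, 3
P = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]   # split matrix
(M1, M1_inv), (M2, M2_inv) = random_key(n, rng), random_key(n, rng)

W = [[Fraction(2, 5), 0, Fraction(11, 10)], [0, Fraction(7, 10), 0]]  # made-up tf-idf scores
Q = [[1, 0, 1], [0, 0, 0]]                                            # made-up query matrix

# IndexGen (Algorithm 1): split on p = 1, then multiply by the key matrices.
Ia, Ib = split(W, P, 1, rng)
index = (matmul(Ia, M1), matmul(Ib, M2))

# Trapdoor (Algorithm 2): split on p = 0, then apply the inverse keys.
Qa, Qb = split(Q, P, 0, rng)
token = (transpose(matmul(M1_inv, transpose(Qa))),
         transpose(matmul(M2_inv, transpose(Qb))))

# Match: the server sums the Hadamard products of the encrypted shares.
G = hadamard_sum(index[0], token[0]) + hadamard_sum(index[1], token[1])
A = hadamard_sum(W, W)            # A_v, sent alongside the index
B = hadamard_sum(Q, Q)            # B, sent alongside the token
score = G / (A + B - G)           # Tanimoto coefficient as in (13)

assert G == hadamard_sum(W, Q)    # same value as on the plaintext matrices
```

In this toy run the server-side value $G$ equals the plaintext sum $\sum_{i,j} \omega_{ij} q_{ij}$ even though the encrypted shares themselves differ from the plaintext matrices, which is the behavior that (19) describes.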

We have shown that our scheme still preserves the sum of the products of the elements of two matrices, while the transformation from a one-dimensional vector to a multidimensional matrix allows our scheme to use a smaller key matrix for the same number of keywords. This approach can therefore significantly improve the computational efficiency and reduce the communication cost.

B. Security Analysis

We consider a semihonest attacker, i.e., an untrusted outsourced cloud server that executes the protocol but also tries to obtain secret information, such as the secret index corresponding to a search token. The cloud server is also curious: it uses the encrypted trapdoors to collect information from which it can deduce the contents of user queries and possible information about the encrypted data.

Our scheme resists known-plaintext attacks. Assume that an attacker inserts a set of plaintexts I, which is known to the attacker, into the cloud server and obtains the corresponding trapdoor. For any vector I, without the division matrix P, the attacker cannot obtain any information from the matrices I_a and I_b. Thus, the attacker can only treat I_a and I_b as random m × n matrices and try to solve for the secret key matrices M_1 and M_2 from Index = {I′_a, I′_b}. However, M_1 and M_2 have 2n^2 unknowns, while the attacker can construct only 2m × n equations. Since m < n, the attacker does not have enough information to solve the equations, which ensures the security of the file index.

VII. EXPERIMENTAL RESULTS

In this section, we give the performance evaluation of our scheme, the cost comparison between our scheme and the original ASPE scheme, and the cost for different values of m in our scheme. We implemented the fuzzy keyword SE protocol on a PC with an Intel(R) Pentium(R) G3240 CPU at 3.10 GHz and 4 GB of RAM. The index building and search functions are programmed in Java (version) and run on a 64-bit Windows 7 operating system. For the simulation, we use the Jama library for Java to build the index and search token matrices, and the Gensim (Word2vec) library for Python to train on those files for fuzzy search.

Fig. 4 shows the matching time for different numbers of keywords in the original ASPE scheme and in our scheme with m = 2 and m = 5. For the original ASPE scheme with n keywords, matching requires O(n^2) point multiplication operations and O(n − 1) point addition operations. In our scheme, however, we need only O(n) point multiplication operations and O(n − 1) point addition operations.

Fig. 4 also shows that, regardless of the value of m, the time in the matching stage depends only on the number of keywords, because the Hadamard product only needs to compute the element-wise products of the matrices.

Fig. 5 shows the cost of building the search token; the larger m is, the lower the cost of building the token.

Fig. 6 shows the running time of matching against different numbers of files for ASPE and our scheme. The query and matching times of both our scheme and ASPE are linear in the number of files.
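To make the effect of m on the key size and the token-building cost concrete, the following Python sketch gives a back-of-the-envelope count. It is not taken from the paper's experiments. It assumes that N keywords are arranged as an m × n matrix with N = m · n, that the key matrix is n × n as in (16) and (17), that matrix products use schoolbook multiplication, and that m = 1 stands in for the original vector form. The function names are hypothetical.

```python
def key_matrix_entries(total_keywords, m):
    """Entries in one n x n key matrix when N keywords form an m x n matrix."""
    if total_keywords % m != 0:
        raise ValueError("total_keywords must be divisible by m")
    n = total_keywords // m
    return n * n


def token_multiplications(total_keywords, m):
    """Scalar multiplications for the (n x n) by (n x m) product M^{-1} Q^T."""
    if total_keywords % m != 0:
        raise ValueError("total_keywords must be divisible by m")
    n = total_keywords // m
    return n * n * m


if __name__ == "__main__":
    total = 100  # hypothetical dictionary size
    for m in (1, 2, 5):
        print(m, key_matrix_entries(total, m), token_multiplications(total, m))
```

Under these assumptions both counts shrink as m grows, which is consistent with the trend reported for Fig. 5; the measured times also depend on factors that this count ignores.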


Fig. 6. Matching time on different number of files.

Fig. 7. Accuracy on different word similarity schemes.

In Fig. 7, we compare the accuracy of several common word similarity computation schemes. Let U be the number of documents in the returned result, and V be the number of returned results. The similarity accuracy is defined as follows:

$$
\text{Accuracy} = \frac{U}{V} \times 100\%. \tag{20}
$$

The approach of calculating lexical similarity according to the WordNet semantic relationship is relatively simple: it uses only information such as synonym relations, semantic explanations, and example sentences. However, some words, such as entities and neologisms, are not included in the semantic dictionary, and the approach depends on the hierarchical relationship between superordinate and subordinate terms, which is limited to nouns and is not well suited to adjectives and verbs.

There are several approaches to calculating word similarity based on corpus statistics; in this article, we use two methods: Word2vec and Google News [23], [30], [32]. As shown in Fig. 7, the experimental results of the corpus-based methods are clearly higher than those of the semantic dictionary-based method [32]. The correlation coefficients of Word2vec and Google News are close to 60% and 65%–70%, respectively. The difference arises because Word2vec selects only 1.6G of training data in view of the computational complexity, while Google News uses more corpus information. The highest LDA result was 65%.

VIII. CONCLUSION

We presented a multikeyword fuzzy searchable encryption scheme with privacy protection for cloud servers. First, to address the low computational efficiency of the ASPE scheme, we constructed a new secure and efficient SE scheme based on the Hadamard product and ASPE. Then, to solve the problem that the keywords in a data user's query may be inconsistent with the file index keywords, we employed the Word2vec deep-learning technique to calculate the semantic relevance score. In this way, our scheme achieves semantic fuzzy search of keywords. The experimental results indicate that our scheme provides a significant improvement in terms of search efficiency and match accuracy.

REFERENCES

[1] M. Zhang et al., “Accountable mobile e-commerce scheme in intelligent cloud system transactions,” J. Ambient Intell. Humanized Comput., vol. 9, no. 6, pp. 1889–1899, 2018.
[2] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keyword search over encrypted data,” in Proc. Int. Conf. Appl. Cryptography Netw. Secur., 2004, pp. 31–45.
[3] F. Hahn and F. Kerschbaum, “Searchable encryption with secure and efficient updates,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2014, pp. 310–320.
[4] K. S. Kim, M. Kim, D. Lee, J. H. Park, and W.-H. Kim, “Forward secure dynamic searchable symmetric encryption with efficient updates,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 1449–1463.
[5] C. Wang, N. Cao, K. Ren, and W. Lou, “Enabling secure and efficient ranked keyword search over outsourced cloud data,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug. 2012.
[6] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, “Secure ranked keyword search over encrypted cloud data,” in Proc. IEEE Int. Conf. Distrib. Comput. Syst., Genova, Italy, 2010, pp. 253–262.
[7] M. Zhang, S. Zhang, and L. Harn, “An efficient and adaptive data-hiding scheme based on secure random matrix,” PLOS One, vol. 14, no. 10, 2020, Art. no. e0222892.
[8] Q. Wang, M. He, M. Du, S. S. M. Chow, R. W. F. Lai, and Q. Zou, “Searchable encryption over feature-rich data,” IEEE Trans. Depend. Secure Comput., vol. 15, no. 3, pp. 496–510, May/Jun. 2018.
[9] E. Stefanov, C. Papamanthou, and E. Shi, “Practical dynamic searchable encryption with small leakage,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2014. [Online]. Available: https://www.ndss-symposium.org/wp-content/uploads/2017/09/07_2_1.pdf
[10] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,” in Proc. IEEE INFOCOM, 2010, pp. 441–445.
[11] J. Li and X. Chen, “Efficient multi-user keyword search over encrypted data in cloud computing,” Comput. Informat., vol. 32, pp. 723–738, 2013.
[12] B. Wang, S. Yu, W. Lou, and Y. T. Hou, “Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud,” in Proc. IEEE Conf. Comput. Commun., 2014, pp. 2112–2120.
[13] Z. Fu, X. Wu, C. Guan, X. Sun, and K. Ren, “Toward efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement,” IEEE Trans. Inf. Forensics Secur., vol. 11, no. 12, pp. 2706–2716, Dec. 2016.
[14] W. K. Wong et al., “Secure KNN computation on encrypted databases,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2009, pp. 139–152.
[15] P. Wang and C. V. Ravishankar, “Secure and efficient range queries on outsourced databases using R̂-tree,” in Proc. IEEE 29th Int. Conf. Data Eng., Brisbane, QLD, Australia, 2013, pp. 314–325.
[16] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 1, pp. 222–233, Jan. 2014.
[17] W. Sun et al., “Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” in Proc. ACM SIGSAC Symp. Inf., Comput. Commun. Secur., 2013, pp. 71–82.
[18] J. Yu, P. Lu, Y. Zhu, G. Xue, and M. Li, “Toward secure multi-keyword top-k retrieval over encrypted cloud data,” IEEE Trans. Depend. Secure Comput., vol. 10, no. 4, pp. 239–250, Jul./Aug. 2013.
[19] H. Li, Y. Yang, T. H. Luan, X. Liang, L. Zhou, and X. S. Shen, “Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data,” IEEE Trans. Depend. Secure Comput., vol. 13, no. 3, pp. 312–325, May/Jun. 2016.


[20] J. Yuan, S. Yu, and L. Guo, “Secure and efficient encrypted image search with access control,” in Proc. IEEE Conf. Comput. Commun., 2015, pp. 2083–2091.
[21] S. Kamara and T. Moataz, “Boolean searchable symmetric encryption with worst-case sub-linear complexity,” in Proc. Annu. Int. Conf. Theory Appl. Cryptographic Techn., 2017, pp. 94–124.
[22] S. Hu, Q. Wang, J. Wang, Z. Qin, and K. Ren, “Securing sift: Privacy preserving outsourcing computation of feature extractions over encrypted image data,” IEEE Trans. Image Process., vol. 25, no. 7, pp. 3411–3425, Jul. 2016.
[23] T. Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in Proc. Adv. Neural Inf. Process. Syst. 26: 27th Annu. Conf. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, 2013, pp. 3111–3119.
[24] H. Li, Y. Yang, Y. Dai, S. Yu, and Y. Xiang, “Achieving secure and efficient dynamic searchable symmetric encryption over medical cloud data,” IEEE Trans. Cloud Comput., vol. 8, no. 2, pp. 484–494, Apr.–Jun. 2020.
[25] G. Xu, H. Li, Y. Dai, K. Yang, and X. Lin, “Enabling efficient and geometric range query with access control over encrypted spatial data,” IEEE Trans. Inf. Forensics Secur., vol. 14, no. 4, pp. 870–885, Apr. 2019.
[26] H. Shen, M. Zhang, H. Wang, F. Guo, and S. Willy, “A lightweight privacy-preserving fair meeting location determination scheme,” IEEE Internet Things J., vol. 7, no. 4, pp. 3083–3093, Apr. 2020.
[27] J. Baek, R. Safavi-Naini, and W. Susilo, “On the integration of public key data encryption and public key encryption with keyword search,” in Proc. Int. Conf. Inf. Secur., 2006, pp. 217–232.
[28] C. Ge, W. Susilo, Z. Liu, J. Xia, P. Szalachowski, and L. Fang, “Secure keyword search and data sharing mechanism for cloud computing,” IEEE Trans. Depend. Secure Comput., to be published, doi: 10.1109/TDSC.2020.2963978.
[29] W. Li, C. Zhang, and Y. Tanaka, “Privacy-aware sensing-quality based budget feasible incentive mechanism for crowdsourcing fingerprint collection,” IEEE Access, vol. 4, pp. 49775–49784, 2020.
[30] C. I. Fan and S. Y. Huang, “Controllable privacy preserving search based on symmetric predicate encryption in cloud storage,” Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1716–1724, 2013.
[31] M. Zhang, Y. Chen, Z. Xia, J. Du, and W. Susilo, “PPO-DFK: A privacy-preserving optimization of distributed fractional knapsack with application in secure footballer configurations,” IEEE Syst. J., to be published, doi: 10.1109/JSYST.2020.2991928.
[32] A. Das, M. Datar, and A. Garg, “Google news personalization: Scalable online collaborative filtering,” in Proc. 16th Int. Conf. World Wide Web, 2007, pp. 271–280.

Mingwu Zhang received the Ph.D. degree from South China Agricultural University, Guangzhou, China, in 2009.
He is currently a Professor with the School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China, and also a Professor with the School of Computer Science, Hubei University of Technology, Wuhan, China. From 2015 to 2016, he was a Senior Visiting Scholar with the School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia. His research interests include cryptography technology for network and data security, secure computation, and privacy preservation in big-data and clouds.
Dr. Zhang was a JSPS Postdoctoral Fellow of the Japan Society of Promotion Sciences, Institute of Mathematics for Industry, Kyushu University, Fukuoka, Japan, from 2010 to 2012.

Yu Chen is currently working toward the master’s degree with the School of Computer Science, Hubei University of Technology, Wuhan, China.
His current research interests include information security and privacy preservation in decentralized networks and blockchains.

Jiajun Huang received the M.S. degree from the School of Computer Science, Hubei University of Technology, Wuhan, China, in 2019.
He is a Senior Programmer and Engineer with Hewlett Packard Enterprise, Beijing, China. His research interests focus on the security and privacy in the outsourced storage and computation in the medical clouds and wireless application systems.
