This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2020.3001728, IEEE Transactions on Information Forensics and Security
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. X, NO. X, X 2020 1
Abstract—Semantic searching over encrypted data is a crucial task for secure information retrieval in the public cloud. It aims to provide retrieval service for arbitrary words, so that queries and search results are flexible. Existing semantic searching schemes do not support verifiable searching, since verification depends on results forecast from predefined keywords; moreover, queries are expanded on the plaintext and exact matching is performed between the semantically extended words and predefined keywords, which limits their accuracy. In this paper, we propose a secure verifiable semantic searching scheme. For semantic optimal matching on the ciphertext, we formulate the word transportation (WT) problem to calculate the minimum word transportation cost (MWTC) as the similarity between queries and documents, and propose a secure transformation that converts WT problems into random linear programming (LP) problems from which the encrypted MWTC can be obtained. For verifiability, we explore the duality theorem of LP to design a verification mechanism that uses the intermediate data produced in the matching process to verify the correctness of search results. Security analysis demonstrates that our scheme guarantees verifiability and confidentiality. Experimental results on two datasets show that our scheme achieves higher accuracy than other schemes.

Index Terms—public cloud, results verifiable searching, secure semantic searching, word transportation.

I. INTRODUCTION

Inherent scalability and flexibility of cloud computing make cloud services popular and attract cloud customers to outsource their storage and computation into the public cloud. Although the cloud computing technique has developed magnificently in both academia and industry, cloud security is becoming one of the critical factors restricting its development. Data-breach events in cloud computing, such as the Apple Fappening and the Uber data breaches, are increasingly attracting public attention. In principle, cloud services are trusted and honest and should ensure data confidentiality and integrity according to predefined protocols. Unfortunately, as the cloud service providers take full control of the data and execute the protocols, they may behave dishonestly in the real world, for example by sniffing sensitive data or performing incorrect calculations. Therefore, cloud customers should encrypt their data and establish a result verification mechanism before outsourcing storage and computation to the cloud.

Since Song et al. [1] proposed the pioneering work on searchable encryption, searchable encryption has attracted significant attention. However, traditional searchable encryption schemes require that query words be predefined keywords of the outsourced documents, which leads to an obvious limitation of these schemes: similarity measurement is based solely on exact matching between keywords in the queries and documents. Therefore, some works proposed semantic searching schemes that provide retrieval service for arbitrary words, making the query words and search results flexible and uncertain. However, verifiable searching schemes depend on forecasting the fixed results of predefined keywords to verify the correctness of the search result returned by the cloud. The flexibility of semantic schemes and the fixity of verifiable schemes therefore enlarge the gap between semantic searching and verifiable searching over encrypted data. Although Fu et al. [2] proposed a verifiable semantic searching scheme that extends the query words to obtain predefined keywords related to them, and then uses the extended keywords to search a symbol-based trie index, their scheme only verifies whether all documents containing the extended keywords are returned to users, and it requires users to rank all the documents to obtain the top-k related documents. Therefore, it is challenging to design a secure semantic searching scheme that supports verifiable searching.

Most of the existing secure semantic searching schemes consider the semantic relationship among words to perform query expansion on the plaintext, and then still use the query words and the semantically extended words to perform exact matching with the specific keywords in the outsourced documents. We can roughly divide these schemes into three categories: secure semantic searching based on synonyms [3], [4]; based on the mutual information model [5], [6]; and based on concept hierarchies [2], [7], [8]. We can see that these schemes use only elementary semantic information among words. For example, synonym schemes only use synonym attributes; mutual information models only use co-occurrence information. Although Liu et al. [9] introduced the Word2vec technique to utilize the semantic information of word embeddings, their approach damages the semantic information by directly aggregating all the word vectors. We believe that secure semantic searching schemes should further utilize the wealth of semantic information among words and perform optimal matching on the ciphertext for high search accuracy.

The authors are with the Communication and Information Security Laboratory, Shenzhen Graduate School, School of Electronics Engineering and Computer Science, Peking University, Shenzhen 518055, China. (Corresponding author: Yuesheng Zhu.) (e-mail: wyyang; zhuys@pku.edu.cn).
1556-6013 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Murdoch University. Downloaded on June 16,2020 at 04:46:08 UTC from IEEE Xplore. Restrictions apply.
In this paper, we propose a secure verifiable semantic searching scheme that treats the matching between queries and documents as an optimal matching task. We treat the document words as "suppliers," the query words as "consumers," and the semantic information as "product," and design the minimum word transportation cost (MWTC) as the similarity metric between queries and documents. We introduce word embeddings to represent words and compute the Euclidean distance as the similarity distance between words, then formulate word transportation (WT) problems based on the word-embedding representation. However, the cloud server could learn sensitive information from the WT problems, such as the similarity between words. For semantic optimal matching on the ciphertext, we further propose a secure transformation that converts WT problems into random linear programming (RLP) problems. In this way, the cloud can leverage any ready-made optimizer to solve the RLP problems and obtain the encrypted MWTC as measurements without learning sensitive information. Considering that the cloud server may dishonestly return wrong or forged search results, we explore the duality theorem of linear programming (LP) and derive a set of necessary and sufficient conditions that the intermediate data produced in the matching process must satisfy. Thus, we can verify whether the cloud solves the RLP problems correctly and further confirm the correctness of the search results. Our new ideas are summarized as follows:

1. Treating the matching between queries and documents as an optimal matching task, we explore the fundamental theorems of linear programming (LP) to propose a secure verifiable semantic searching scheme that performs semantic optimal matching on the ciphertext.
2. For secure semantic optimal matching on the ciphertext, we formulate the word transportation (WT) problem and propose a secure transformation technique that converts WT problems into random linear programming (RLP) problems, obtaining the encrypted minimum word transportation cost as the measurement between queries and documents.
3. To support verifiable searching, we explore the duality theorem of LP and present a novel insight: the intermediate data produced in the matching process can serve as proof to verify the correctness of search results.

II. RELATED WORK

Since Song et al. [1] proposed the notion of searching over encrypted cloud data, searchable encryption has received significant attention for its practicability over the past 20 years. Many works have made efforts on both security and functionality in the searchable encryption field.

Along the research line on security, many works formulate definitions of security as well as novel attack patterns against existing schemes. Goh et al. [10] formulated a security model for document indexes known as semantic security against adaptive chosen keyword attack (IND-CKA), which requires that document indexes not reveal the contents of documents. However, we note that the definition of IND-CKA does not guarantee that the queries are secure. Curtmola et al. [11] further improved the security definitions for symmetric searchable encryption and put forth chosen-keyword attacks and adaptive chosen-keyword attacks. Besides, Islam et al. [12] first introduced the access-pattern disclosure, which can be used to learn sensitive information about the encrypted documents; Liu et al. [13] then presented a novel attack based on search-pattern leakage. Stefanov et al. [14] introduced the notions of forward security and backward security for dynamic searchable encryption schemes that support data addition and deletion. Along another research line, on functionality, many works introduced practical functions to meet the demands of practice, such as ranked search and semantic searching for improving search accuracy. Additionally, some works proposed verifiable searching schemes to verify the correctness of search results.

Ranked Search over Encrypted Data. Ranked search means that the cloud server can calculate the relevance scores between the query and each document, then rank the documents without leaking sensitive information. The notion of single-keyword ranked search was proposed in [15], which used a modified one-to-many order-preserving encryption (OPE) to encrypt relevance scores and rank the encrypted documents. Cao et al. [16] first proposed a privacy-preserving multi-keyword ranked search scheme (MRSE), which represents documents and queries as binary vectors, uses the secure kNN algorithm (SeckNN) [17] to encrypt the vectors, and then uses the inner product of the encrypted vectors as the similarity measure. Besides, Yu et al. [18] introduced homomorphic encryption to encrypt relevance scores and realized a multi-keyword ranked search scheme under the vector space model. Recently, Kermanshahi et al. [19] used various homomorphic encryption techniques to propose a generic solution for supporting multi-keyword ranked searching schemes that can resist several attacks affecting OPE-based schemes.

Secure Semantic Searching. A general limitation of traditional searchable encryption schemes is that they fail to utilize semantic information among words to evaluate the relevance between queries and documents. Fu et al. [3] proposed the first synonym searchable encryption scheme under the vector space model to bridge the gap between semantically related words and given keywords. They first extended the keyword set using the synonym thesaurus built on the New American Roget's College Thesaurus (NARCT), then used the extended keyword set to build secure indexes with SeckNN. Using the order-preserving encryption algorithm, [5] and [6] presented secure semantic searching schemes based on the mutual information model. Xia et al. [6] proposed a scheme that requires the cloud to construct a semantic relationship library based on the mutual information used in [20]; however, any scheme based on an inverted index can calculate the mutual information model. Using the SeckNN algorithm, [2], [7], [8] proposed secure semantic searching schemes based on concept hierarchies. For example, Fu et al. [8] proposed a central-keyword semantic extension searching scheme that calculates the weights of query words based on grammatical relations, then extends the central word based on the concept hierarchy tree from WordNet. Inspired by word embeddings used in plaintext information retrieval [21], [22], Liu et al. [9] introduced Word2vec to represent both queries and documents as compact vectors. However, their approach damages
proofs, let F(Λ) denote the proofs returned from the cloud. In this attack, V(C, q, F(Λ)) = 0 and F(Λ) ≠ C(Λ).

As for confidentiality, we follow the widely accepted Real/Ideal simulation [11], [24], [29] to analyze the confidentiality of symmetric searchable encryption schemes. Below we give the definition of confidentiality with respect to the verifiable semantic searching scheme we are going to propose.

Definition 3 (Confidentiality). Our verifiable secure semantic searching scheme is secure against adaptively chosen query attack if, for any PPT stateful adversary A, there exists a PPT stateful simulator S, where L is a stateful leakage algorithm; consider the following probabilistic experiments:

Real_A(ε): The adversary A chooses a dataset F for a challenger. The challenger runs {K, I, C} ← Initialize(1^ε, F), where ε is our security parameter. A makes a polynomial number of adaptive queries q. For any query q, the challenger acts as a data user and calls (Ω, ∆) ← BuildRLP(q, I, 1^ε, CV). A acts as the cloud server and runs SeaPro(). Finally, A returns a bit b as the output of the experiment.

Ideal_{A,S}(ε): The adversary A chooses a document dataset F and makes a polynomial number of adaptive queries q for a simulator S. Given L, S generates and sends C to A, then acts as a data user to generate the trapdoor, namely Ω and ∆. Finally, A acts as the cloud server and returns a bit b, which is the output of the experiment.

A semantic searching scheme is L-confidential if for any PPT adversary A there exists a PPT simulator S such that:

|Pr[Real_A(ε) = 1] − Pr[Ideal_{A,S}(ε) = 1]| ≤ negl(ε)

where negl(ε) is a negligible function.

C. Notations

The main notations used in this paper are as follows:
• q: The query input by a data user.
• d: The number of documents in the dataset.
• m: The number of keywords in a document.
• n: The number of query words in the query.
• F: Plaintext document dataset F = {f1, f2, . . . , fd}, where fi denotes a document in F.
• C: Encrypted documents C = {c1, c2, . . . , cd}, where ci denotes a document in C.
• Ψ: WT problems for q and the documents, Ψ = {ψ1, ψ2, . . . , ψd}, where ψi denotes the WT problem for q and fi.
• Ω: RLP problems for q and the documents, Ω = {ω1, ω2, . . . , ωd}, where ωi denotes the RLP problem for q and fi.
• θ: The dual problem of the RLP problem ω.
• ∆: Constant terms of the RLP problems, ∆ = {δ1, δ2, . . . , δd}, where δi denotes the constant term of the RLP problem ωi.
• Λ: Proofs for the RLP problems, Λ = {λ1, λ2, . . . , λd}, where λi denotes the proof for ωi.
• β: The minimum word transportation cost value of a WT problem.
• Π: Optimal values of the RLP problems, Π = {π1, π2, . . . , πd}, where πi denotes the optimal value of the RLP problem ωi.
• Ξ: The encrypted minimum word transportation cost values serving as measurements between q and the documents, Ξ = {ξ1, ξ2, . . . , ξd}, where ξi denotes the measurement between q and fi.

IV. PRELIMINARIES

A. Word Embedding

Word embedding is a representative method for words in vector space, through which we can preserve the fundamental properties of words and the semantic relations between them. Neural language models [30], [31], [32] are trained to minimize prediction error in order to learn vector representations of words. Consequently, we can perform algebraic operations on word embeddings to probe semantic information between words. As illustrated in Table I, taking "university, college, professor, and office" as an example, the Euclidean distance values are in line with our intuition: the more relevant two words are, the smaller their Euclidean distance. Word embedding has been studied in plaintext information retrieval tasks such as query expansion [21], zero-shot retrieval [22], and cross-modal retrieval [33]. In this paper, we use word embeddings to capture semantic information between words without revealing that information to the cloud server.

Table I
THE EUCLIDEAN DISTANCE VALUES BETWEEN WORDS

             university  college  professor  office
university      0         4.94      5.25      6.82
college         4.94      0         5.11      5.18
professor       5.25      5.11      0         5.48
office          6.82      5.18      5.48      0

B. Earth Mover's Distance

Earth Mover's Distance (EMD) was introduced in [34], [35] as a metric in computer vision to capture the differences between the signature distributions of images. The name EMD comes from its intuitive interpretation: given two distributions, we regard one as a mass of earth spread properly in space and the other as a collection of holes in that same space; EMD is then the minimum amount of work required to fill the holes with earth. Because EMD has advantages in representing problems involving multi-featured signatures, it has been applied in practical scenarios such as gesture recognition [36], music genre classification [37], document classification [38], plaintext retrieval [39], and gene identification [40]. We observe that EMD is a particular case of linear programming. Therefore, in this paper we explore the fundamental theorems of linear programming together with security algorithms to design a scheme that realizes secure semantic optimal matching on the ciphertext.
[Figure 2 graphic omitted.] Figure 2. An example of the word transportation optimal matching. The relative area of the shadow represents the weight of a word; the length of a line segment represents the relative Euclidean distance between two connected words; as for the value M-N on a line segment, M represents the Euclidean distance between the two words and N represents the amount of transportation between them. In this example, the MWTC between document-1 and the query is 4.794 and the MWTC between document-2 and the query is 6.003, so document-1 is more relevant to the query than document-2.

[Figure 3 graphic omitted.] Figure 3. An example of the forward indexes of documents. Forward indexes are the data structure storing the mapping from each document to its keywords. In our scheme, each keyword carries a normalized weight representing the relevance score between the keyword and a specific document.

Figure 4. An example of the matrix V when m = 3, n = 2:

    1 1 0 0 0 0
    0 0 1 1 0 0
    0 0 0 0 1 1
    1 0 1 0 1 0
    0 1 0 1 0 1

The matrix V is used to build the constraint Vx = W in our word transportation problem.

the number of documents in the dataset. We adopt the same method to represent the query and define the weights of the query words to be equal. In this work, we normalize the total weight of each document/query to 1.

Given the forward indexes of documents and the query, we treat the document words as "suppliers," the query words as "consumers," and the semantic information as "product." Therefore, given the forward index of a document f and the query q, we can formulate the WT problem as follows:

$$
\begin{aligned}
\mathrm{WT}(f,q) = \min\ & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j}\, d_{i,j} \\
\text{subject to}\ & \sum_{j=1}^{n} f_{i,j} = e_i^f, \quad i = 1, 2, \dots, m \\
& \sum_{i=1}^{m} f_{i,j} = e_j^q, \quad j = 1, 2, \dots, n \qquad (2)\\
& f_{i,j} \ge 0 \\
& \sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j} = 1
\end{aligned}
$$
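To make the formulation concrete, the sketch below solves a toy instance of problem (2) with SciPy's off-the-shelf LP solver, using two document words (college, professor) and two query words (university, office), uniform weights of 0.5, and the Euclidean distances from Table I. The constraint matrix is built row-by-row in the same pattern as the matrix V of Figure 4. This is an illustrative sketch under those assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy WT instance: document words are "suppliers", query words "consumers",
# each carrying weight 0.5; distances d_ij taken from Table I.
supply = [0.5, 0.5]            # e^f for {college, professor}
demand = [0.5, 0.5]            # e^q for {university, office}
m, n = len(supply), len(demand)
# Cost vector over flattened flows f_ij, row-major order:
# (college->university, college->office, professor->university, professor->office)
c = [4.94, 5.18, 5.25, 5.48]

# Build the constraint matrix V (pattern of Figure 4): one row per supply
# constraint sum_j f_ij = e^f_i, one row per demand constraint sum_i f_ij = e^q_j.
# The total-mass constraint sum f_ij = 1 is implied by these rows.
V = np.zeros((m + n, m * n))
for i in range(m):
    V[i, i * n:(i + 1) * n] = 1.0      # supply rows
for j in range(n):
    V[m + j, j::n] = 1.0               # demand rows
W = np.array(supply + demand)

res = linprog(c, A_eq=V, b_eq=W, bounds=(0, None), method="highs")
mwtc = res.fun                          # minimum word transportation cost
# Optimal plan ships college->university and professor->office:
# MWTC = 0.5*4.94 + 0.5*5.48 = 5.21
```

The optimizer routes each document word to its semantically nearest query word, which is exactly the "optimal matching" intuition of Figure 2.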
B. BuildRLP

In this phase, data users perform BuildRLP() to generate the trapdoor for the search query q. To describe this algorithm in detail, we split it into three algorithms, as follows:

Ψ ← BuildWT(q, I, E) is a deterministic algorithm, corresponding to the "WT Builder" in Fig. 6. The users take the query q, forward indexes I, and word embedding library E as input, then generate a word transportation problem in Ψ for each pair of the query and a document.

KT ← TranKeyGen(1^ε) is a probabilistic transformation key generation algorithm, corresponding to the "One-Secret Key Generator" in Fig. 6. The user takes the security parameter ε as input, then generates the one-time transformation secret key KT = (A, Q, γ, r, R) for encrypting Ψ.

(Ω, ∆) ← SecTran(Ψ, KT) is a deterministic algorithm, corresponding to the "Secure Transformation" in Fig. 6. The users take the WT problems Ψ and transformation key KT as input, then generate random linear programming problems Ω and the corresponding constant terms ∆.

The users first call BuildWT() to build the WT problems Ψ for the query and the forward index of each document. Specifically, the users use word embeddings to represent all words and calculate the Euclidean distance values between word embeddings, then build the word transportation problems Ψ according to the proposed approach. After building the WT problems Ψ, the data users call TranKeyGen() to generate a one-time secret key KT for encrypting Ψ. Then, the users call SecTran() to encrypt each ψi and get the corresponding RLP problem ωi with its constant term δi, where ψi ∈ Ψ, ωi ∈ Ω, δi ∈ ∆, and i = 1, 2, . . . , d. Finally, the users send all RLP problems Ω and the corresponding constant terms ∆ to the cloud server.

C. Search&Prove

In this phase, the cloud server performs SeaPro() to search documents and generate proofs. To describe this algorithm in detail, we split SeaPro() into two algorithms, namely SolveRLP() and Rank(), as follows:

(Π, Λ) ← SolveRLP(Ω) is a deterministic algorithm, corresponding to the "Any Ready-made Optimizer" in Fig. 6. The cloud server takes the RLP problems Ω as input, then generates the optimal values Π and proofs Λ for the RLP problems.

(Γ, Ξ) ← Rank(Π, ∆, C, k) is a deterministic ranking algorithm, corresponding to the "Subtractor" and "Ranker" in Fig. 6. The cloud server takes the optimal values Π, the constant terms ∆, the ciphertext dataset C, and the number k as input, first calculates all the measurements Ξ = {ξ1, ξ2, . . . , ξd}, then generates the top-k related encrypted documents Γ.

The cloud server calls SolveRLP() to solve the RLP problems. The cloud can leverage any ready-made optimizer to solve each RLP problem ωi and get the corresponding optimal value πi and proof λi, where ωi ∈ Ω, πi ∈ Π, λi ∈ Λ, and i = 1, 2, . . . , d. The cloud then calls Rank() to calculate each encrypted minimum word transportation cost ξi = πi − δi as a measurement, where i = 1, 2, . . . , d. Then, the cloud ranks the measurements Ξ in ascending order and obtains the top-k related encrypted documents Γ. Finally, the cloud returns the top-k related encrypted documents Γ and the proofs Λ to the users.

D. Verification&Decryption

In this phase, data users perform VerDec() to verify the correctness of the search results and decrypt the top-k encrypted documents. To describe this algorithm in detail, we split it into Verify() and DecDoc(), as follows:

(0 or α) ← Verify(Λ) is a deterministic verification algorithm, corresponding to the "Verify ?" in Fig. 6. Data users take the proofs Λ as input, then generate the result of verification, 0 or α, where α ∈ N∗ and N∗ denotes the set of positive integers.

Υ ← DecDoc(K, Γ) is a deterministic decryption algorithm, corresponding to the "Documents Decryption" in Fig. 6. The users take the top-k related encrypted documents Γ and the secret key K as input, then generate the top-k related plaintext documents Υ for the query q.

The users first call Verify() to verify the correctness of the search results. The users verify the correctness of each proof λi according to (9), thus verifying whether the cloud performs the correct calculations for each RLP problem and determining the correctness of the search result, where λi ∈ Λ and i = 1, 2, . . . , d. Verify() outputs 0 when the verification passes; otherwise, the algorithm calls the "Annunciator" to output α as a warning denoting the number of failing proofs. If the proofs Λ pass our result verification mechanism, the users call DecDoc() to decrypt the top-k encrypted documents Γ with the secret key K and obtain the top-k related documents Υ.

VII. THEORETICAL ANALYSIS

In this section, we give theoretical analyses of the correctness, rationality, and security of our scheme.

A. Correctness Analysis

We analyze why the encrypted minimum word transportation cost ξ can be used as a measurement.

Theorem 1: When y is the optimal decision vector of an RLP problem ω, the corresponding x is the optimal decision vector of the original WT problem ψ, where x = Ay − r.

Proof: We prove Theorem 1 by reductio ad absurdum. Suppose that x and y′ are the optimal decision vectors of ψ and ω respectively, and that the relationship x = Ay′ − r does not hold. Then c^T x and γc^T Ay′ are the optimal values of ψ and ω respectively. According to x = Ay − r and x′ = Ay′ − r, we can deduce γc^T x′ = γc^T Ay′ − γc^T r, γc^T x = γc^T Ay − γc^T r, and γc^T Ay′ − γc^T r < γc^T Ay − γc^T r, namely γc^T x′ < γc^T x. Thus we obtain the corollary that x′ is the optimal decision vector of ψ, namely x = x′ = Ay′ − r. However, this corollary contradicts the precondition, so Theorem 1 is established.

Theorem 2: The encrypted minimum word transportation cost ξ is γ times the minimum word transportation cost β.

Proof: According to Theorem 1, the optimal decision vectors x and y satisfy x = Ay − r. Therefore, the final measurement value is ξ = π − δ = γc^T Ay − γc^T r = γ(c^T Ay − c^T r) = γc^T x = γβ. Therefore, Theorem 2 is established.
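The relationship established by Theorems 1 and 2 can be checked numerically. The sketch below applies a transformation of the form described above (substitute x = Ay − r, scale the objective by γ, and mix the equality constraints with an invertible Q) to a small LP, and confirms that the transformed optimum minus the constant term δ = γc^T r equals γ times the original optimum. The concrete matrices A, Q, r, the scalar γ, and the tiny LP are arbitrary illustrative choices, not the paper's key-generation procedure.

```python
import numpy as np
from scipy.optimize import linprog

# A tiny "WT-like" LP: min c^T x  s.t.  V x = W, x >= 0.
c = np.array([1.0, 2.0, 3.0])
V = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
W = np.array([1.0, 1.0])

beta = linprog(c, A_eq=V, b_eq=W, bounds=(0, None), method="highs").fun
# Optimum is x = (0, 1, 0), so beta = 2.

# Illustrative one-time transformation key: gamma > 0, A and Q invertible, r a shift.
gamma = 3.0
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
Q = np.array([[1.0, 1.0],
              [0.0, 1.0]])
r = np.array([1.0, 2.0, 3.0])

# Transformed problem in variable y, where x = A y - r:
#   min (gamma c^T A) y   s.t.  (Q V A) y = Q(W + V r),  A y - r >= 0.
c_rlp = gamma * c @ A
res = linprog(c_rlp,
              A_eq=Q @ V @ A, b_eq=Q @ (W + V @ r),
              A_ub=-A, b_ub=-r,       # encodes A y >= r, i.e. x >= 0
              bounds=(None, None), method="highs")
pi = res.fun                           # optimal value of the transformed problem
delta = gamma * c @ r                  # constant term held by the user
xi = pi - delta
# Theorem 2: xi == gamma * beta.
```

Because Q is invertible, the transformed equality constraints define exactly the feasible set {y : V(Ay − r) = W}, so the solver's optimum differs from γβ only by the constant δ, which the user subtracts.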
As the ξ is γ times of the β, we define the encrypted minimum word transportation cost ξ as the measurement. Therefore, the cloud can calculate and rank the measurements without learning any sensitive information about the documents and queries in our scheme.

B. Rationality Analysis

In this subsection, we analyze the rationality of our scheme via complexity analysis.

We still define that d denotes the number of documents in the dataset, m denotes the number of keywords in a document, and n denotes the number of query words. For the data users, the most time-consuming operations in our secure transformation technique are the matrix-matrix operations. The complexity of each step is analyzed as follows. Encrypting the cost vector c in the word transportation problem ψ to obtain c' = (γc^T A)^T has complexity O(dm^2 n^2). Then, encrypting the constraints in ψ to get V' = QVA, W' = Q(W + Vr), and I' = RA has complexity O(max{d(m + n)^2 mn, d(mn)^3}). Finally, obtaining the constant term δ = γc^T r has complexity O(dmn). To summarize, the proposed secure transformation technique requires a computational complexity of O(max{d(m + n)^2 mn, d(mn)^3}) at the user side. We adopt O(dm^3 n^3) as the computational complexity of the user side since usually m > 2 and n > 2 in our scheme.

For the cloud, the time-consuming operations are solving the RLP problems and their dual problems, and the ranking task. Most ready-made optimizers solve both the primal and dual problems at the same time using the interior point method. Therefore, the cloud server solves the RLP problems and dual problems with computation of O(dm^3.5 n^3.5 L), where L denotes the input length of the problem, i.e., the total binary length of the numerical data specifying the problem instance. The ranking task includes calculating all of the minimum word transportation cost values Ξ and ranking these values, resulting in a computational complexity of O(d + d log2 d). We can get the following conclusions and further demonstrate them with experimental results in section VIII.

1. It is rational that data users outsource the retrieval task to the cloud server, due to the complexity O(dm^3 n^3) at the user side being far smaller than O(dm^3.5 n^3.5 L).
2. A rational cloud server would perform the ranking task honestly if it ran the complex calculation honestly for solving the RLP problems to obtain the encrypted minimum word transportation cost, due to the complexity O(dm^3.5 n^3.5 L) being far larger than O(d + d log2 d) at the cloud side.

C. Security Analysis

In this subsection, we elaborate on the security analysis of the proposed verifiable semantic searching scheme in two aspects, i.e., verifiability and confidentiality.

1) Verifiability: The verifiability means that the proposed scheme can verify the correctness of the search results and proofs returned from the cloud. We adopt the game-based security definition to prove the verifiability of our scheme.

Definition 4 (Verifiability). Let the proposed scheme be verifiable and consider the following probabilistic experiment on our scheme, where A is a stateful adversary: Vrf_A(ε):
1. The challenger calls KeyGen(1^ε) to generate the symmetric secret key K.
2. The adversary A chooses a document dataset F for the challenger.
3. The challenger calls EncDoc(K, F) to generate the ciphertext dataset C and calls BuildIndex(F) to generate the forward indexes I.
4. Given {C, I} and oracle access to BuildWT(q, I, E), TranKeyGen(1^ε), and SecTran(Ψ, K_T), the adversary A generates a set of RLP problems Ω_q and the corresponding constant terms ∆_q according to a query q, then outputs a sequence of documents C'_q and the proofs Λ'_q, where C'_q ≠ C_q and C_q denotes the real search result of the query q.
5. The challenger calls Verify(Λ'_q) to generate a bit η.
6. Finally, this probabilistic experiment outputs the bit η.

Our scheme is verifiable if for any probabilistic polynomial-time adversary A, Pr[Vrf_A(ε) = 1] ≤ negl(ε). Therefore, the verifiability of our scheme means that it prevents the Result Forgeries Attack and the Proof Forgeries Attack. We prove it via Theorem 3.

Theorem 3: Our secure semantic searching scheme guarantees verifiability if the probability Pr[Misbehavior] that any probabilistic polynomial-time (PPT) adversary A successfully forges a search result R(C, q) and proofs F(Λ) is negligible.

Proof: According to the definitions of the Result Forgeries Attack and the Proof Forgeries Attack, we can deduce the success probability of A as follows: Pr[Misbehavior] = Pr[(R(C, q) ≠ T(C, q)) ∩ (V(C, q, F(Λ)) = 0)], where R(C, q) denotes the search result from the cloud server, T(C, q) denotes the correct search result, F(Λ) denotes the proofs returned from the cloud, and V(C, q, F(Λ)) = 0 denotes that the proofs F(Λ) pass the verification. To prove Pr[Misbehavior] ≤ negl(ε), we just need to prove that A forges each proof F(λ) in F(Λ) with success probability Pr[Mis] ≤ neg(ε), since negl(ε) = Σ_{i=1}^{d} neg(ε) based on applying the union bound over all proofs.

We prove Pr[Mis] ≤ neg(ε) according to the duality theorem [42]. We first suppose that F(λ) ≠ C(λ) and that F(λ) could pass our verification mechanism, namely, F(λ) = {F(y), F(s), F(t)} meets (9), where C(λ) denotes the real proof for a specific problem ω. We split (9) into (10), (11) and (12), as follows:

v1 = c'^T F(y),  V'F(y) = W',  I'F(y) ≥ 1,   (10)

v2 = W'^T F(s) + L^T F(t),  V'^T F(s) + I'^T F(t) = c',  F(t) ≥ 0,   (11)

v1 = v2.   (12)
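Conditions (10)-(12) translate directly into a verification check: accept the proof {F(y), F(s), F(t)} only if F(y) is feasible for the RLP, {F(s), F(t)} is feasible for its dual, and both objective values coincide. The sketch below is ours, not the paper's code: it assumes dense list-based matrices, writes primes as `_p`, and takes the right-hand side 1 in (10) entrywise.

```python
# Verify a returned proof against conditions (10)-(12): primal feasibility,
# dual feasibility, and equal objective values (strong duality).
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def verify(c_p, V_p, W_p, I_p, L, y, s, t, tol=1e-9):
    v1 = dot(c_p, y)                                  # primal objective in (10)
    v2 = dot(W_p, s) + dot(L, t)                      # dual objective in (11)
    primal_ok = (all(abs(a - w) <= tol for a, w in zip(matvec(V_p, y), W_p))
                 and all(a >= 1 - tol for a in matvec(I_p, y)))          # (10)
    VT_s = matvec([list(col) for col in zip(*V_p)], s)                   # V'^T F(s)
    IT_t = matvec([list(col) for col in zip(*I_p)], t)                   # I'^T F(t)
    dual_ok = (all(abs(a + b - c) <= tol for a, b, c in zip(VT_s, IT_t, c_p))
               and all(x >= -tol for x in t))                            # (11)
    return primal_ok and dual_ok and abs(v1 - v2) <= tol                 # (12)

# One-dimensional toy RLP: min 2y s.t. y = 1, y >= 1. The dual optimum is
# met by s = 1, t = 1, so the honest proof passes and a forged y fails.
assert verify([2.0], [[1.0]], [1.0], [[1.0]], [1.0], y=[1.0], s=[1.0], t=[1.0])
assert not verify([2.0], [[1.0]], [1.0], [[1.0]], [1.0], y=[1.5], s=[1.0], t=[1.0])
```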
According to (10) and (11), we can deduce that F(y) and {F(s), F(t)} are feasible decision vectors for an RLP problem ω and its dual problem θ, respectively. We also get that v1 and v2 are the objective values of ω and θ, respectively. Therefore, v1 ≥ v2 according to the weak duality lemma [42]. Next, we can deduce that F(y) and {F(s), F(t)} make the problems obtain the same objective values v1 = v2 according to (12). The strong duality theorem [42] claims that if x* and y* are feasible for a linear programming problem and its dual problem, respectively, and if x* and y* make the problems obtain the same objective values, then x* and y* are optimal for their respective problems. Therefore, we can get a corollary that F(λ) = {F(y), F(s), F(t)} meets the strong duality theorem of the LP problem, and thus F(λ) = C(λ), which contradicts the supposition. Therefore, for any PPT adversary A, there exists Pr[Mis] = 0 ≤ neg(ε). In a word, Pr[Misbehavior] ≤ negl(ε). Therefore, the proposed scheme is verifiable.

2) Confidentiality: The confidentiality means that the proposed scheme can guarantee that any adversary A cannot learn any useful sensitive information about the documents F and query q through the trapdoor (the corresponding RLP problems Ω and constant terms ∆) and the proofs Λ. We demonstrate that our scheme guarantees the confidentiality defined by Definition 4. To elaborate the confidentiality of our scheme, we first define the leakage function L(F) = (Ω_q, ∆_q, Λ_q), which refers to the maximum information that the adversary A is allowed to learn, where F denotes the documents, Ω_q and ∆_q denote the RLP problems and constant terms which are adaptively generated from q as the trapdoor, and Λ_q denotes the proofs.

Theorem 4: Our verifiable semantic searching scheme is L-confidential if the proposed secure transformation technique is secure.

Proof: We prove it by describing a polynomial-time simulator S such that for any PPT adversary A, the outputs of Real_A(ε) and Ideal_{A,S}(ε) are computationally indistinguishable:

|Pr[Real_A(ε) = 1] − Pr[Ideal_{A,S}(ε) = 1]| ≤ negl(ε).

Given the leakage function L(F) = (Ω_q, ∆_q, Λ_q), the simulator S can simulate the search trapdoor (random linear programming problems and constant terms) and proofs. For every query q, S randomly simulates and generates the trapdoor (problems Ω̃ and terms ∆̃) and proofs Λ̃. Our scheme is secure if any PPT adversary A cannot differentiate the real trapdoor and proofs from the simulated trapdoor and proofs. Therefore, we just need to prove that any PPT A can deduce each ψ_i in Ψ through the ω_i in Ω, the corresponding ∆_i in ∆, and the proof λ_i in Λ only with negligible success probability, where ∀i ∈ [1, d], and d is the number of documents in the dataset. To prove it, we formally analyze that our secure transformation technique guarantees the input/output privacy of the original word transportation problem ψ = (c, V, W, I).

First, we prove that the proposed technique guarantees the output privacy of ψ, namely, provides security to the optimal decision vector x and the real semantic difference value β of ψ. As the random positive number γ hides the real semantic difference value β, we only need to prove the security of x. In our scheme, x is protected by the random invertible matrix A and the random vector r. As x = Ay − r, we can use y = A^{-1}(x + r) to indicate y. Values of elements in x belong to [0, 1]. Besides, the elements of the vector r are uniformly sampled from a numerical interval (0, 2^ε], where ε is the security parameter. We prove the privacy of x by showing that x_i + r_i and r'_i are computationally indistinguishable for any A, where i ∈ [1, mn]. From the view of A, the best strategy is to guess η from {0, 1} with equal probability if x_i + r_i ∈ (0, 2^ε], and to output 1 if x_i + r_i comes from the specific range (2^ε, 2^ε + 1]. Therefore, the success probability of the distinguisher A is:

Pr[A(x_i + r_i) = 1] = (1/2)·Pr[0 < x_i + r_i ≤ 2^ε] + Pr[2^ε < x_i + r_i ≤ 2^ε + 1]
                     ≤ 1/2 + 1/2^ε
                     = 1/2 + negl(ε).

Meanwhile, when the input is r'_i, A obviously has the success probability:

Pr[A(r'_i) = 1] = 1/2.

Therefore, we can deduce that:

|Pr[A(x_i + r_i) = 1] − Pr[A(r'_i) = 1]| ≤ neg(ε).

In the end, we can apply the union bound to deduce the conclusion that any probabilistic polynomial-time A distinguishes x + r from r' with negligible success probability:

|Pr[A(x + r) = 1] − Pr[A(r') = 1]| ≤ Σ_{i=1}^{mn} neg(ε) = negl(ε).

Second, we prove that the proposed technique guarantees the input privacy of ψ, namely, provides security to c, V, W, and I in the word transportation problem ψ. For example, W is encrypted into W' = Q(W + Vr) in our scheme. We have W' = QV(x + r) according to Vx = W. We have proved that any A distinguishes x + r from r' with negligible success probability. Therefore, we can deduce that QVr' and QV(x + r) are statistically indistinguishable. Any A distinguishes QVr' from QV(x + r) with negligible success probability:

|Pr[A(QV(x + r)) = 1] − Pr[A(QVr') = 1]| ≤ Σ_{i=1}^{mn} neg(·) = neg(ε),

where m and n are the numbers of keywords in a document and in the query, respectively. The matrices c, V and I in ψ are encrypted into c', V' and I' with random invertible matrices. For example, the element values and the structure pattern of c are hidden effectively via multiplying by the random matrix A and the random positive number γ, as proved in [43]. We can deduce that the proof λ is unable to reveal information about the original WT problem ψ since A is unable to learn any sensitive information of ψ = (c, V, W, I) from ω = (c', V', W', I').

In a word, we can claim that the proposed secure transformation technique is secure and our verifiable semantic searching scheme is L-confidential.
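The masking argument above can be illustrated numerically (with made-up 2×2 values; the real scheme samples A, r, and γ at random per query): the cloud only ever works with y = A^{-1}(x + r), while the user recovers x as Ay − r.

```python
# Illustration (made-up numbers) of the masking used for output privacy:
# the decision vector x is never exposed; the cloud sees y = A^{-1}(x + r)
# for a secret random invertible A and random r, and only the user can
# undo the masking via x = A y - r. Exact arithmetic via Fraction.
from fractions import Fraction as F

def inv2(M):
    """Exact inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    assert det != 0
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[F(2), F(1)], [F(1), F(1)]]   # secret invertible matrix
r = [F(7), F(5)]                   # secret random vector, r_i in (0, 2^eps]
x = [F(1, 2), F(1, 4)]             # true decision vector, entries in [0, 1]

y = matvec(inv2(A), [xi + ri for xi, ri in zip(x, r)])  # what the cloud optimizes over
x_rec = [ai - ri for ai, ri in zip(matvec(A, y), r)]    # user-side unmasking: x = A y - r
assert x_rec == x
```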
Table III
ACCURACY COMPARISON OF DIFFERENT SECURE SEMANTIC SEARCHING SCHEMES

                          Robust04                           ClueWeb-09-Cat-B
                   Robust04-Title   Robust04-Desc    ClueWeb-Title    ClueWeb-Desc
                   P@20  NDCG@20    P@20  NDCG@20    P@20  NDCG@20    P@20  NDCG@20
SSERS [3]          0.052  0.058     0.027  0.029     0.049  0.046     0.057  0.070
SSSMIM-Single [6]  0.128  0.142     0.025  0.023     0.041  0.046     0.028  0.041
SSSMIM-Multi [6]   0.123  0.134     0.028  0.030     0.043  0.049     0.032  0.046
CKSER-1 [8]        0.051  0.117     0.032  0.086     0.049  0.077     0.026  0.081
CKSER-2 [8]        0.050  0.092     0.031  0.083     0.033  0.076     0.019  0.053
VKSS [2]           0.049  0.087     0.030  0.068     0.018  0.019     0.022  0.024
SSSW-1 [9]         0.107  0.192     0.070  0.128     0.060  0.086     0.027  0.060
SSSW-2 [9]         0.106  0.186     0.067  0.122     0.037  0.082     0.021  0.054
Ours               0.148  0.271     0.136  0.255     0.061  0.103     0.041  0.102
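The relative improvements quoted in the accuracy comparison (about 15.62%, 41.14%, 94.28%, and 99.21% on Robust04) can be re-derived from the Table III values; small last-digit differences come from rounding.

```python
# Relative improvement of "Ours" over the second-highest baseline,
# computed from the Robust04 columns of Table III.
def rel_improvement(ours, second_best):
    return 100 * (ours - second_best) / second_best

# (ours, second-highest) pairs: Title P@20, Title NDCG@20, Desc P@20, Desc NDCG@20
pairs = [(0.148, 0.128), (0.271, 0.192), (0.136, 0.070), (0.255, 0.128)]
gains = [rel_improvement(o, b) for o, b in pairs]
# gains ≈ [15.6, 41.1, 94.3, 99.2] (percent)
```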
is computed as follows:

NDCG@k = DCG@k / IDCG@k,
DCG@k = Σ_{i=1}^{|real|} rel_i / log2(i + 1),    (14)
IDCG@k = Σ_{j=1}^{|ideal|} rel_j / log2(j + 1),

where DCG@k indicates the truth accumulated from the real ranking permutation at a particular rank position k, and IDCG@k represents the ideal DCG at k. rel_i and rel_j denote relevance assessments between the query and documents; these relevance assessments can be obtained from the relevance judgment file in our datasets. |real| represents the top-k documents in the result of the real ranking order for a query, and |ideal| represents the top-k documents in the result of the ideal ranking order.

4) Baselines: Secure synonym extension ranked searching scheme (SSERS): SSERS is a semantic searching scheme extending the query words from a synonym thesaurus. We implemented the SSERS proposed in [3].

Secure semantic searching based on the mutual information model (SSSMIM): SSSMIM is a ciphertext extension scheme, which extends the query words from the mutual information model. We implemented the single-keyword searching (SSSMIM-Single) proposed in [6]. We also extended it to a multi-keyword searching scheme (SSSMIM-Multi).

Central keyword semantic extension ranked searching (CKSER): CKSER is a secure semantic searching scheme based on concept hierarchy. CKSER selects a central query word and extends it to get semantically related words from WordNet. We implemented CKSER-1 and CKSER-2 proposed in [8].

Verifiable Keyword-based Semantic Searching (VKSS): VKSS extends the query words according to WordNet and searches the words on a symbol-based trie index for supporting verifiability. We implemented the VKSS proposed in [2].

Secure searching scheme based on Word2vec (SSSW): SSSW is a secure searching scheme that uses the word vectors trained on Word2vec. We implemented both SSSW-1 and SSSW-2 proposed in [9].

B. Performance Evaluation of Accuracy

In this subsection, we compare the proposed scheme with other secure semantic searching schemes over the two benchmark datasets and analyze the effectiveness of different word embeddings on our scheme.

We used title topics and description topics as the queries in our experiments. We pre-processed as follows: both documents and query words were white-space tokenized and lowercased, and stopwords were removed. We adopted a re-ranking strategy for efficient computation. We used Indri to perform initial retrievals for obtaining the top 1,000 ranked documents of each query. We applied the schemes to re-rank these documents and used the re-ranked results to evaluate accuracy.

1) Compared with baselines: The experimental results show that the accuracy of our proposed scheme is better than that of the other schemes in terms of all the evaluation measures on both the Robust04 and ClueWeb-09-Cat-B datasets, as illustrated in Table III. Taking the Robust04 dataset as an example, the relative improvements of our scheme over the second-highest ones in other schemes are about 15.62%, 41.14%, 94.28%, and 99.21% when using Robust04-Title and Robust04-Desc as queries under P@20 and NDCG@20. The results demonstrate the effectiveness of our secure verifiable semantic searching scheme based on word transportation optimal matching.

Overall, the SSERS scheme is inferior to the other schemes except when using description topics as queries to search on ClueWeb-09-Cat-B. The accuracy of the SSSMIM schemes when using title topics as queries on both datasets is higher than when using description topics. As for schemes based on concept hierarchy, the CKSER-1, CKSER-2, and VKSS schemes are usually not competitive with the other schemes. In particular, VKSS is inferior to CKSER-1 and CKSER-2. The reason is that the CKSER-1 and CKSER-2 schemes consider the grammatical relationship among query keywords and introduce a word weighting algorithm to show the importance of the distinction among them. From Table III, we can see that CKSER-2 is inferior to CKSER-1 under different evaluation measures on both datasets, which can also be observed between SSSW-1 and SSSW-2. This finding is not surprising since CKSER-2 and SSSW-2 adopt the enhanced Secknn algorithm introducing more random numbers, leading to limited accuracy. We take a look at the SSSW-1 and SSSW-2 schemes using word embeddings and the Secknn algorithm to build secure indexes. Overall, we can see that the SSSW schemes obtain very limited improvement compared with the query expansion schemes. The results demonstrate that using a compact vector representation
of documents/queries may damage the semantic information of word embeddings.

Table IV
ACCURACY COMPARISON OF OUR SCHEME USING DIFFERENT WORD EMBEDDINGS OVER ROBUST04

                   Robust04-Title    Robust04-Desc
                   P@20  NDCG@20     P@20  NDCG@20
Ours-GloVe100      0.137  0.239      0.126  0.244
Ours-GloVe200      0.146  0.259      0.127  0.254
Ours-GloVe300      0.147  0.264      0.128  0.252
Ours-Word2vec300   0.145  0.254      0.122  0.176
Ours-Fasttext300   0.148  0.271      0.136  0.255

We can see that the accuracy of the other schemes using the title topics as queries is usually higher than that using the description topics across different datasets, which is consistent with previous findings in plaintext information retrieval [49]. A reason is that the description topics are usually one or two sentences containing about 16 words, so these semantic searching schemes find it challenging to analyze the semantic relationship among the query words. For example, the SSSMIM-Single, CKSER-1 and CKSER-2 schemes face the challenge of selecting an accurate central keyword from a long-text description topic. The accuracy of the proposed scheme using the description topics is still higher than that of the other schemes. A reason is that the word transportation optimal matching is beneficial for analyzing the semantic relationship between the words and the importance of the distinction among words in long-text queries.

2) Effect of word embeddings: As word embedding is an essential component in our scheme, we used two groups of word embeddings to conduct experiments for studying the effect of word embeddings on our scheme. Table IV lists the experimental results of our scheme on the Robust04 dataset. In the first experiment, we used word embeddings with 100, 200, and 300 dimensions trained by GloVe over the same corpora, namely, GloVe100, GloVe200, and GloVe300. From Table IV, we can see that the proposed scheme using word embeddings with 300 dimensions usually obtains the best accuracy except under NDCG@20 using the description topics. The result of the first experiment demonstrates that higher dimensionality may help our scheme capture accurate semantic information. In the second experiment, we used two types of word embeddings with 300 dimensions trained by Word2vec and Fasttext over the same corpus, namely, Word2vec300 and Fasttext300. We can see that our scheme gets higher accuracy when using the word embeddings trained by Fasttext compared with using Word2vec300.

The data owner is the initiator who initializes the secure searching scheme. Unlike in other schemes, the data owner in our scheme does not need to perform massive cryptographic operations, such as order-preserving encryption and homomorphic encryption. For the owner, the main steps include (1) generating the symmetric encryption secret key with tO_Key; (2) encrypting documents with tO_Enc; (3) building forward indexes with tO_Index. We define the total time of the owner as tOwner = tO_Key + tO_Enc + tO_Index. From Table V, we can see that it takes the data owner tOwner = 6.289s to initialize the proposed secure verifiable semantic searching scheme.

The data users are the searching requesters that send the trapdoor of a query for acquiring the top-k related documents. In our scheme, data users need to perform the main steps including (1) building word transportation problems with tU_WTP; (2) generating the one-time secret key with tU_Key; (3) transforming the WT problems with tU_Tran. We define the total time of a user as tUser = tU_WTP + tU_Key + tU_Tran. From Table V, we can see that the total time of generating the trapdoor in our scheme is tU_Trapdoor = tUser = 1.106s. It is acceptable in many real-world scenarios in which the users want to search for essential materials with high accuracy. In addition, once receiving the proofs and search results, the users need to spend tU_Verify = 0.033s using the proposed verification mechanism to check the proofs and tU_Dec = 0.005s decrypting the top-k related documents.

The cloud server is an intermediate service provider that performs the retrieval process. In our scheme, the cloud needs to perform the main steps including (1) performing optimal matching on ciphertext and generating proofs in the matching process with tC_Match; (2) calculating and ranking the measurements with tC_Rank. We define tCloud = tC_Match + tC_Rank as the total time of the cloud. From Table V, the cloud in our scheme needs tCloud = 14.036s to perform the retrieval process. At first glance, such time may seem a little long for ordinary users. However, it is worth spending more time to get higher search accuracy for applications in practical scenarios, such as secure searching for medical and financial information. In addition, cloud service providers usually have enormous computing resources to reduce retrieval time in the real world.

According to the experimental results, we further demonstrate the Rationality Analysis in section VII. We define tOutsource = tU_Key + tU_Tran as the time consumed for outsourcing the retrieval tasks to the cloud, and ρ = tC_Match / tOutsource as the savings in computational costs when the user outsources retrieval tasks to the cloud. We obtain ρ = 93.546 from the experimental results in Table V. It demonstrates the rationality that data users want to outsource the retrieval task to the cloud server. We define µ = (100 · tC_Rank / tC_Match)% to denote the proportion of the ranking cost to the matching cost at the cloud side.
Table V
TIME COST STATISTICS OF OUR SCHEME

       tO_Key  tO_Enc  tO_Index  tU_WTP  tU_Key  tU_Tran  tU_Verify  tU_Dec  tC_Match  tC_Rank
Ours   0.001s  0.335s  5.953s    0.956s  0.008s  0.142s   0.033s     0.005s  14.032s   0.004s
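The aggregate costs discussed in the text (tOwner = 6.289s, tUser = 1.106s, tCloud = 14.036s, ρ ≈ 93.546) follow directly from Table V; the dictionary below simply transcribes the table entries.

```python
# Cross-checking the aggregate costs implied by Table V (all values in seconds).
t = {"O_Key": 0.001, "O_Enc": 0.335, "O_Index": 5.953,
     "U_WTP": 0.956, "U_Key": 0.008, "U_Tran": 0.142,
     "C_Match": 14.032, "C_Rank": 0.004}

t_owner = t["O_Key"] + t["O_Enc"] + t["O_Index"]   # 6.289s to initialize the scheme
t_user = t["U_WTP"] + t["U_Key"] + t["U_Tran"]     # 1.106s to build a trapdoor
t_cloud = t["C_Match"] + t["C_Rank"]               # 14.036s per retrieval
rho = t["C_Match"] / (t["U_Key"] + t["U_Tran"])    # outsourcing saving factor, ~93.5
mu = 100 * t["C_Rank"] / t["C_Match"]              # ranking share of cloud time (%)
```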
Table VI
PERFORMANCE COMPARISONS BETWEEN OUR SCHEME AND VKSS

Time Cost   tO_Pre   tO_BuildIndex  tU_WordTree  tU_Trapdoor  tU_Verify  tU_Rank  tC_Match  tC_Rank
VKSS        4.879s   17.778s        614.312s     0.034s       0.011s     0.035s   1.416ms   \
Ours        4.879s   1.074s         \            1.106s       0.033s     \        14.032s   0.004s

Accuracy      Robust04-Title    Robust04-Desc    ClueWeb-Title    ClueWeb-Desc
              P@20  NDCG@20     P@20  NDCG@20    P@20  NDCG@20    P@20  NDCG@20
VKSS          0.049  0.087      0.030  0.068     0.018  0.019     0.022  0.024
Ours          0.148  0.271      0.136  0.255     0.061  0.103     0.041  0.102
Growth Rate   202%   211%       353%   275%      238%   442%      86%    325%
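The NDCG@20 values reported in Tables III, IV and VI follow Eq. (14); a direct transcription (with a made-up relevance list, not the TREC judgment files) is:

```python
# Direct transcription of Eq. (14): DCG@k over the real ranking, IDCG@k
# over the ideal (sorted) ranking, NDCG@k as their ratio.
import math

def dcg(rels, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg(rels, k):
    ideal = sorted(rels, reverse=True)
    return dcg(rels, k) / dcg(ideal, k)

# e.g. graded relevance of the top-5 returned documents (illustrative values)
score = ndcg([3, 2, 3, 0, 1], 5)  # ≈ 0.9724
```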
We conducted extensive experiments to evaluate the performance in terms of time cost and accuracy between our scheme and the VKSS scheme, which is the only prior work we found that supports secure verifiable semantic searching. We report the results of the time cost experiments over the Robust04 dataset using title topics as queries due to limited space. We do not consider the time cost of encrypting and decrypting documents since this time cost is the same in both schemes. Table VI lists the results of the comparison experiments.

From the performance evaluation of the time cost, we can see that the owner and users in our scheme can ease the computational burden by paying the time cost in the cloud compared with VKSS. Specifically, the owner in our scheme spends tO_Pre = 4.879s and tO_BuildIndex = 1.074s preprocessing documents and building the forward indexes when generating the indexes for the documents, where tO_Pre + tO_BuildIndex = tO_Index. However, the owner in VKSS spends much more time, tO_BuildIndex = 17.778s, on building the trie index since the owner needs to forecast the results from predefined document keywords for verifiable searching. Moreover, the users in VKSS take less time than in our scheme to generate a trapdoor with tU_Trapdoor, but they need to spend 614.312s on generating the word similarity tree to expand query words from document keywords as the query. Although part of the work of constructing the word similarity tree can be performed in an off-line way, it is unfriendly to resource-constrained users. In other words, the owner and users in VKSS need to perform burdensome tasks to realize secure verifiable semantic searching. In our scheme, by contrast, the owner does not need to forecast the results since verifiable searching can be realized by using the intermediate data produced in the optimal matching process, and the users build the encrypted word transportation problems and outsource them to the cloud for secure semantic searching rather than performing query expansion on plaintext. Moreover, after receiving the proofs and search results, the users in VKSS spend less time on verifying the correctness of the results than in our scheme but need to spend an additional tU_Rank = 0.035s on calculating the relevance scores between the query and documents and ranking the scores for getting the top-k related documents. The time cost in the cloud in VKSS is small, which benefits from the efficiency of the trie index. Another reason is that VKSS is unable to support verifying the ranked results returned from the cloud. To obtain high accuracy in our scheme, the optimal matching on ciphertext is performed and the documents are ranked in the cloud.

As for the performance evaluation of accuracy, the accuracy of our scheme is much higher than that of VKSS on both the Robust04 and ClueWeb-09-Cat-B datasets. Taking the Robust04 dataset as an example, the relative improvements of our scheme over the VKSS scheme under NDCG@20 are about 211% and 275% using title and description topics, respectively. Overall, our scheme spends more time on performing matching in the cloud to obtain a much larger accuracy improvement than VKSS. It is worth spending more time to get higher search accuracy for applications in practical scenarios. Take the medical profile as an example: a correct personal medical profile of a patient is essential and useful to help the doctor make a precise disease diagnosis and health evaluation.

In summary, the proposed scheme outsources the complex computation of performing the proof generation task and the semantic matching task to the cloud. Therefore, our scheme is more in line with the outsourcing computation characteristics of the cloud computing paradigm. The main reason why our scheme spends more time in the cloud is that the computation of optimal matching on ciphertext is related to the size of the linear programming problem. Therefore, in the future, we plan to design other schemes to reduce the time cost of optimal matching on ciphertext according to this finding.

IX. CONCLUSIONS

We propose a secure verifiable semantic searching scheme that treats matching between queries and documents as a word transportation optimal matching task. Therefore, we investigate the fundamental theorems of linear programming (LP) to design the word transportation (WT) problem and a result verification mechanism. We formulate the WT problem to calculate the minimum word transportation cost (MWTC)
as the similarity metric between queries and documents, and further propose a secure transformation technique to transform WT problems into random LP (RLP) problems. Therefore, our scheme is simple to deploy in practice, as any ready-made optimizer can solve the RLP problems to obtain the encrypted MWTC without learning sensitive information in the WT problems. Meanwhile, we believe that the proposed secure transformation technique can be used to design other privacy-preserving linear programming applications. We bridge the semantic-verifiable searching gap with the insight that the intermediate data produced in the optimal matching process can be used to verify the correctness of search results. Specifically, we investigate the duality theorem of LP and derive a set of necessary and sufficient conditions that the intermediate data must meet. The experimental results on two TREC collections show that our scheme achieves higher accuracy than other schemes.
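The WT formulation summarized above can be sketched on a toy instance. The following is a minimal illustration, not the paper's construction: it assumes the special case where the n query words and n document words all carry equal mass 1/n, so the vertices of the transportation polytope are permutation matchings and the exact MWTC can be found by enumeration instead of a full LP solver; the cost values are invented for illustration.

```python
# Toy sketch of the word transportation (WT) idea: move the probability
# mass of query words onto document words at minimum total semantic cost.
# Special case: n query words and n document words with equal mass 1/n,
# where the LP optimum is attained at a permutation matching
# (Birkhoff-von Neumann), so small instances can be solved exactly
# by enumerating permutations.
from itertools import permutations

def mwtc_equal_mass(cost):
    """cost[i][j]: semantic distance between query word i and document word j."""
    n = len(cost)
    mass = 1.0 / n
    best = min(
        sum(cost[i][p[i]] for i in range(n))
        for p in permutations(range(n))
    )
    return mass * best

# Invented embedding distances: a document whose words lie close to the
# query words yields a smaller MWTC, i.e., a higher similarity.
cost_close = [[0.1, 0.9],
              [0.8, 0.2]]
cost_far   = [[0.9, 0.9],
              [0.8, 0.7]]
print(mwtc_equal_mass(cost_close))  # 0.15 -> more similar document
print(mwtc_equal_mass(cost_far))    # 0.8  -> less similar document
```

Unequal word masses require a general transportation LP solver, which is exactly why the scheme outsources the (transformed) LP to the cloud.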
In the future, we plan to apply the principles of secure semantic searching to the design of secure cross-language searching schemes.
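The duality-based verification idea can be illustrated on a toy transportation LP. This is only the textbook strong-duality condition on invented data, not the paper's full verification mechanism: the dual of the transportation problem assigns a potential u_i to each query word and v_j to each document word subject to u_i + v_j <= c_ij, and a claimed optimal flow is correct exactly when a feasible dual certificate attains the same objective value.

```python
# Verifying a claimed MWTC flow with a dual certificate instead of
# re-solving the LP. For the transportation problem
#   min sum_ij c_ij f_ij  s.t.  sum_j f_ij = a_i, sum_i f_ij = b_j, f >= 0
# the dual is  max sum_i u_i a_i + sum_j v_j b_j  s.t.  u_i + v_j <= c_ij.
# By strong duality, a feasible flow is optimal iff some feasible (u, v)
# reaches the same objective value.

def verify_flow(cost, a, b, flow, u, v, tol=1e-9):
    m, n = len(a), len(b)
    # primal feasibility: row/column sums must match the word masses
    if any(abs(sum(flow[i]) - a[i]) > tol for i in range(m)):
        return False
    if any(abs(sum(flow[i][j] for i in range(m)) - b[j]) > tol for j in range(n)):
        return False
    if any(flow[i][j] < -tol for i in range(m) for j in range(n)):
        return False
    # dual feasibility of the certificate
    if any(u[i] + v[j] > cost[i][j] + tol for i in range(m) for j in range(n)):
        return False
    # strong duality: primal and dual objectives must coincide
    primal = sum(cost[i][j] * flow[i][j] for i in range(m) for j in range(n))
    dual = sum(u[i] * a[i] for i in range(m)) + sum(v[j] * b[j] for j in range(n))
    return abs(primal - dual) <= tol

cost = [[0.1, 0.9], [0.8, 0.2]]
a = b = [0.5, 0.5]
good_flow = [[0.5, 0.0], [0.0, 0.5]]   # cost 0.15, the true optimum
bad_flow  = [[0.0, 0.5], [0.5, 0.0]]   # feasible but suboptimal (cost 0.85)
u, v = [0.1, 0.2], [0.0, 0.0]          # dual certificate with value 0.15
print(verify_flow(cost, a, b, good_flow, u, v))  # True
print(verify_flow(cost, a, b, bad_flow, u, v))   # False
```

A verifier holding such conditions can reject any suboptimal or infeasible answer from the cloud without knowing the optimum in advance, which is the role the intermediate data plays in the scheme.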
ACKNOWLEDGMENT

We thank Shaocong Wu for valuable discussions and his help with the experiments. This work was supported in part by the Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing), in part by Shenzhen international cooperative research project GJHZ20170313150021171, and in part by the NSFC-Shenzhen Robot Joint Funding (U1613215).