You are on page 1of 22

Post-quantum key exchange – a new hope∗

Erdem Alkim
Department of Mathemathics, Ege University, Turkey
Léo Ducas
Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands
Thomas Pöppelmann
Infineon Technologies AG, Munich, Germany
Peter Schwabe
Digital Security Group, Radboud University, The Netherlands
Abstract 1 Introduction

In 2015, Bos, Costello, Naehrig, and Stebila (IEEE The last decade in cryptography has seen the birth of
Security & Privacy 2015) proposed an instantiation numerous constructions of cryptosystems based on lat-
of Ding’s1 ring-learning-with-errors (Ring-LWE) based tice problems, achieving functionalities that were previ-
key-exchange protocol (also including the tweaks pro- ously unreachable (e.g., fully homomorphic cryptogra-
posed by Peikert from PQCrypto 2014), together with phy [43]). But even for the simplest tasks in asymmetric
an implementation integrated into OpenSSL, with the af- cryptography, namely public-key encryption, signatures,
firmed goal of providing post-quantum security for TLS. and key exchange, lattice-based cryptography offers an
important feature: resistance to all known quantum algo-
In this work we revisit their instantiation and stand-
rithms. In those times of quantum nervousness [77, 78],
alone implementation. Specifically, we propose new pa-
the time has come for the community to deliver and op-
rameters and a better suited error distribution, analyze
timize concrete schemes, and to get involved in the stan-
the scheme’s hardness against attacks by quantum com-
dardization of a lattice-based cipher-suite via an open
puters in a conservative way, introduce a new and more
process.
efficient error-reconciliation mechanism, and propose a
defense against backdoors and all-for-the-price-of-one For encryption and signatures, several competitive
attacks. By these measures and for the same lattice di- schemes have been proposed; examples are NTRU en-
mension, we more than double the security parameter, cryption [55, 88], Ring-LWE encryption [71] as well as
halve the communication overhead, and speed up com- the signature schemes BLISS [36], PASS [53] or the
putation by more than a factor of 8 in a portable C proposal by Bai and Galbraith presented in [8]. To
implementation and by more than a factor of 27 in an complete the lattice-based cipher-suite, Bos, Costello,
optimized implementation targeting current Intel CPUs. Naehrig, and Stebila [22] recently proposed a concrete
These speedups are achieved with comprehensive protec- instantiation of Peikert’s tweaked version [81] of the key-
tion against timing attacks. exchange scheme originally proposed by Ding, Xie, and
Lin [34,35]. Bos et al. proved its practicality by integrat-
ing their implementation as additional cipher-suite into
∗ This work was initiated while Thomas Pöppelmann was a Ph.D. the transport layer security (TLS) protocol in OpenSSL.
student at Ruhr-University Bochum with support from the European In the following we will refer to this proposal as BCNS.
Union H2020 SAFEcrypto project (grant no. 644729). This work has Unfortunately, the performance of BCNS seemed
furthermore been supported by TÜBITAK under 2214-A Doctoral Re-
search Program Grant, by the European Commission through the ICT rather disappointing. We identify two main sources for
program under contract ICT-645622 (PQCRYPTO), and by the Nether- this inefficiency. First the analysis of the failure probabil-
lands Organisation for Scientific Research (NWO) through Veni 2013 ity was far from tight, resulting in a very large modulus
project 13114 and through a Free Competition Grant. Permanent ID
q ≈ 232 . As a side effect, the security is also significantly
of this document: 0462d84a3d34b12b75e8f5e4ca032869. Date:
2019-07-10. lower than what one could achieve with Ring-LWE for
1 Previous versions of this paper inappropriately attributed the pro- a ring of rank n = 1024. Second the Gaussian sampler,
tocol to Peikert only. To the best of our knowledge, the protocol (with used to generate the secret parameters, is fairly inefficient
a slightly different reconciliation mechanism) was first described in a
draft by Ding [http://eprint.iacr.org/2012/387] and again in
and hard to protect against timing attacks. This second
a preprint by Ding, Xie and Lin [http://eprint.iacr.org/2012/ source of inefficiency stems from the fundamental mis-
688]. conception that high-quality Gaussian noise is crucial for

1
encryption based on LWE2 , which has also made various of Montgomery reductions and short Barrett reduc-
other implementations [32,83] slower and more complex tions.
than they would have to be.
Availability of software. We place all software de-
scribed in this paper into the public domain and
1.1 Contributions make it available online at https://cryptojedi.
In this work, we propose solutions to the performance org/crypto/#newhope and https://github.com/
and security issues of the aforementioned BCNS pro- tpoeppelmann/newhope.
posal [22]. Our improvements are possible through a Acknowledgments. We are thankful to Mike Hamburg
combination of multiple contributions: and to Paul Crowley for pointing out mistakes in a pre-
vious version of this paper, and we are thankful to Isis
• Our first contribution is an improved analysis of Lovecruft for thoroughly proofreading the paper and for
the failure probability of the protocol. To push suggesting the name JAR JAR for the low-security variant
the scheme even further, inspired by analog error- of our proposal. We are thankful to Patrick Mertens and
correcting codes, we make use of the lattice D4 an anonymous reviewer of a follow-up paper for mak-
to allow error reconciliation beyond the original ing us aware of the off-by-one error in the definition of
bounds of [22]. This drastically decreases the mod- the binomial distribution. We are thankful to Christian
ulus to q = 12289 < 214 , which improves both effi- Berghoff for thoroughly reporting and helping us correct
ciency and security. several mistakes in the failure analysis section.
• Our second contribution is a more detailed secu-
rity analysis against quantum attacks. We pro- 2 Lattice-based key exchange
vide a lower bound on all known (or even pre-
supposed) quantum algorithms solving the shortest- Let Z be the ring of rational integers. We define for an
vector problem (SVP), and deduce the potential per- x ∈ R the rounding function bxe = bx + 12 c ∈ Z. Let Zq ,
formance of a quantum BKZ algorithm. Accord- for an integer q ≥ 1, denote the quotient ring Z/qZ. We
ing to this analysis, our improved proposal provides define R = Z[X]/(X n + 1) as the ring of integer polyno-
128 bits of post-quantum security with a comfort- mials modulo X n + 1. By Rq = Zq [X]/(X n + 1) we mean
able margin. the ring of integer polynomials modulo X n + 1 where
• We furthermore propose to replace the almost- each coefficient is reduced modulo q. In case χ is a prob-
$
perfect discrete Gaussian distribution by some- ability distribution over R, then x ← χ means the sam-
$
thing relatively close, but much easier to sample, pling of x ∈ R according to χ. When we write a ← Rq
and prove that this can only affect the security this means that all coefficients of a are chosen uniformly
marginally. at random from Zq . For a probabilistic algorithm A we
$
• We replace the fixed parameter a of the original denote by y ← A that the output of A is assigned to y
scheme by a freshly chosen random one in each key and that A is running with randomly chosen coins. We
exchange. This incurs an acceptable overhead but recall the discrete Gaussian distribution DZ,σ which is
prevents backdoors embedded in the choice of this parametrized by the Gaussian parameter σ ∈ R and de-
2
parameter and all-for-the-price-of-one attacks. fined by assigning a weight proportional to exp( −x
2σ 2
) to
• We specify an encoding of polynomials in the all integers x.
number-theoretic transform (NTT) domain which
allows us to eliminate some of the NTT transfor- 2.1 Ding’s protocol with Peikert’s tweak
mations inside the protocol computation.
In this section we briefly revisit the passively secure key-
• To demonstrate the applicability and performance exchange protocol by Ding, Xie, and Lin [34, 35] with
of our design we provide a portable reference im- the tweak proposed by Peikert [81], which has been in-
plementation written in C and a highly optimized stantiated and implemented in [22] (BCNS). We follow
vectorized implementation that targets recent Intel Peikert’s version which ensures uniform distribution of
CPUs and is compatible with recent AMD CPUs. the agreed key and, like Peikert, use key-encapsulation
We describe an efficient approach to lazy reduction (KEM) notation. The KEM is defined by the algorithms
inside the NTT, which is based on a combination (Setup, Gen, Encaps, Decaps); after a successful proto-
2 This is very different for lattice-based signatures or trapdoors, col run both parties share an ephemeral secret key that
where distributions need to be meticulously crafted to prevent any leak can be used to protect further communication (see Proto-
of information on a secret basis. col 1).

2
This KEM closely resembles previously introduced well as error estimation are based on asymptotics. The
(Ring-)LWE encryption schemes [66, 70] but due to a authors of [22] chose a dimension n = 1024, a modu-
new error-reconciliation mechanism, one Rq component lus q =√232 − 1, χ = DZ,σ and the Gaussian parameter
of the ciphertext can be replaced by a more compact ele- σ = 8/ 2π ≈ 3.192. It is claimed that these parameters
ment in R2 . provide a classical security level of at least 128 bits con-
This efficiency gain is possible due to the observa- sidering the distinguishing attack [66] with distinguish-
tion that it is not necessary to transmit an explicitly cho- ing advantage less than 2−128 and 281.9 bits of security
sen key to establish a secure ephemeral session key. In against an optimistic instantiation of a quantum adver-
Ding’s scheme, the reconciliation allows both parties to sary. The probability of a wrong key being established is
17
derive the session key from an approximately agreed less than 2−2 = 2−131072 . The message b sent by Alice
pseudorandom ring element. For Alice, this ring ele- is a ring element and thus requires at least log2 (q)n = 32
ment is us = ass0 + e0 s and for Bob it is v = bs0 + e00 = kbits while Bob’s response (u, r) is a ring element Rq
ass0 + es0 + e00 . For a full explanation of the reconcil- and an element from R2 and thus requires at least 33
iation we refer to the original papers [34, 35, 81] but kbits. As the polynomial a ∈ Rq is shared between all
briefly recall the cross-rounding function h·i2 from [81] parties this ring element has to be stored or generated
defined as hvi2 := b 4q · ve mod 2 and the randomized on-the-fly. For timings of their implementation we refer
function dbl(v) := 2v − ē for some random ē, where to Table 2. We would also like to note that besides its
ē = 0 with probability 12 , ē = 1 with probability 14 , and aim for securing classical TLS, the BCNS protocol has
ē = −1 with probability 14 . Let I0 = {0, 1, . . . , b q2 e − 1}, already been proposed as a building block for Tor [89]
I1 = {−b 2q c, . . . , −1}, and E = [− 4q , 4q ) then the reconcil- on top of existing elliptic-curve infrastructure [46].
iation function rec(w, b) is defined as
(
0, if w ∈ Ib + E ( mod q) 2.3 Our proposal: N EW H OPE
rec(w, b) =
1, otherwise. In this section we detail our proposal and modifications
of the above protocol3 . For the same reasons as described
If these functions are applied to polynomials this means in [22] we opt for an unauthenticated key-exchange pro-
they are applied to each of the coefficients separately. tocol; the protection of stored transcripts against future
decryption using quantum computers is much more ur-
Parameters: q, n, χ gent than post-quantum authentication. Authenticity will
KEM.Setup() : most likely be achievable in the foreseeable future us-
$
a ← Rq ing proven pre-quantum signatures and attacks on the
Alice (server) Bob (client) signature will not compromise previous communication.
KEM.Gen(a) : KEM.Encaps(a, b) : Additionally, by not designing or instantiating a lattice-
$ $ based authenticated key-exchange protocol (see [38, 91])
s, e ← χ s0 , e0 , e00 ← χ
b we reduce the complexity of the key-exchange protocol
b←as + e →
− u←as0 + e0
and simplify the choice of parameters. We actually see it
v←bs0 + e00
$
as an advantage to decouple key exchange and authen-
v̄ ← dbl(v) tication as it allows a protocol designer to choose the
u,v0
KEM.Decaps(s, (u, v0 )) : ←−− v0 = hv̄i2 optimal algorithm for both tasks (e.g., an ideal-lattice-
µ←rec(2us, v0 ) µ←bv̄e2 based key exchange and a hash-based signature like [18]
for authentication). Moreover, this way the design, se-
curity level, and parameters of the key-exchange scheme
Protocol 1: Peikert’s version of the KEM mechanism. are not constrained by requirements introduced by the
authentication part.
Parameter choices. A high-level description of our pro-
2.2 The BCNS proposal posal is given in Protocol 2 and as in [22, 34, 35, 81] all
In a work by Bos, Costello, Naehrig, and Stebila [22] polynomials except for r ∈ R4 are defined in the ring
(BCNS), the above KEM [34, 35, 81] was phrased as Rq = Zq [X]/(X n + 1) with n = 1024 and q = 12289. We
a key-exchange protocol (see again Protocol 1), instan- decided to keep the dimension n = 1024 as in [22] to be
tiated for a concrete parameter set, and integrated into able to achieve appropriate long-term security. As poly-
OpenSSL (see Section 8 for a performance comparison). 3 For the TLS use-case and for compatibility with BNCS [22] the
Selection of parameters was necessary as previous work key exchange is initiated by the server. However, in different scenarios
do not contain concrete parameters and the security as the roles of the server and client can be exchanged.

3
nomial arithmetic is fast and also scales better (doubling problem (see Section 3) in the flavor of the “Logjam”
n roughly doubles the time required for a polynomial DLP attack [1].
multiplication), our choice of n appears to be acceptable
No key caching. For ephemeral Diffie-Hellman key-
from a performance point of view. We chose the modulus
exchange in TLS it is common for servers to cache
q = 12289 as it is the smallest prime for which it holds
a key pair for a short time to increase performance.
that q ≡ 1 mod 2n so that the number-theoretic trans-
For example, according to [26], Microsoft’s SChannel
form (NTT) can be realized efficiently and that we can
library caches ephemeral keys for 2 hours. We re-
transfer polynomials in NTT encoding (see Section 7).
mark that for the lattice-based key exchange described
As the security level grows with the noise-to-modulus
in [22,34,35,81], and also for the key exchange described
ratio, it makes sense to choose the modulus as small as
in this paper, such short-term caching would be disas-
possible, improving compactness and efficiency together
trous for security. Indeed, it is crucial that both parties
with security. The choice is also appealing as the prime is
use fresh secrets for each instantiation (thus the perfor-
already used by some implementations of Ring-LWE en-
mance of the noise sampling is crucial). As short-term
cryption [32, 67, 86] and BLISS signatures [36, 82]; thus
key caching typically happens on higher layers of TLS li-
sharing of some code (or hardware modules) between our
braries than the key-exchange implementation itself, we
proposal and an implementation of BLISS would be pos-
stress that particular care needs to be taken to eliminate
sible.
such caching when switching from ephemeral (elliptic-
Noise distribution and reconciliation. Notably, we also curve) Diffie-Hellman key exchange to post-quantum
change the distribution of the LWE secret and error and lattice-based key exchange. This issue is discussed in
replace discrete Gaussians by the centered binomial dis- more detail in [37].
tribution ψk of parameter k = 16 (see Section 4). The One could enable key caching with a transformation
reason is that it turned out to be challenging to imple- from the CPA-secure key exchange to a CCA-secure key
ment a discrete Gaussian sampler efficiently and pro- exchange as outlined by Peikert in [81, Section 5]. Note
tected against timing attacks (see [22] and Section 5). that such a transform would furthermore require changes
On the other hand, sampling from the centered binomial to the noise distribution to obtain a failure probability
distribution is easy and does not require high-precision that is negligible in the cryptographic sense.
computations or large tables as one may sample from ψk
by computing ∑k−1 0 0
i=0 bi − bi , where the bi , bi ∈ {0, 1} are 3 Preventing backdoors and all-for-the-
uniform independent bits4 . The distribution ψk is cen-
price-of-one attacks
tered (its mean is 0), has variance k/2 pand for k = 16
this gives a standard deviation of ς = 16/2. Contrary One serious concern about the original design [22] is the
to [22,34,35,81] we hash the output of the reconciliation presence of the polynomial a as a fixed system parameter.
mechanism, which makes a distinguishing attack irrele- As described in Protocol 2, our proposal includes pseu-
vant and allows us to argue security for the modified error dorandom generation of this parameter for every key ex-
distribution. change. In the following we discuss the reasons for this
Moreover, we generalize previous reconciliation decision.
mechanisms using an analog error-correction approach
(see Section 5). The design rationale is that we only want Backdoor. In the worst scenario, the fixed parameter a
to transmit a 256-bit key but have n = 1024 coefficients could be backdoored. For example, inspired by NTRU
to encode data into. Thus we encode one key bit into trapdoors [55, 88], a dishonest authority may choose
four coefficients; by doing so we achieve increased error mildly small f, g such that f = g = 1 mod p for some
resilience which in turn allows us to use larger noise for prime p ≥ 4 · 16 + 1 and set a = gf−1 mod q. Then,
better security. given (a, b = as + e), the attacker can compute bf =
afs + fe = gs + fe mod q, and, because g, s, f, e are small
Short-term public parameters. N EW H OPE does not enough, compute gs + fe in Z. From this he can compute
rely on a globally chosen public parameter a as the ef- t = s + e mod p and, because the coefficients of s and
ficiency increase in doing so is not worth the measures e are smaller than 16, their sums are in [−2 · 16, 2 · 16]:
that have to be taken to allow trusted generation of this knowing them modulo p ≥ 4 · 16 + 1 is knowing them in
value and the defense against backdoors [15]. Moreover, Z. It now only remains to compute (b − t) · (a − 1)−1 =
this approach avoids the rather uncomfortable situation (as − s) · (a − 1)−1 = s mod q to recover the secret s.
that all connections rely on a single instance of a lattice One countermeasure against such backdoors is the
4 In an earlier version of this paper the definition of the binomial dis- “nothing-up-my-sleeve” process, which would, for ex-
tribution contained an off-by-one error and was wrongly defined from ample, choose a as the output of a hash function on a
0 to k. common universal string like the digits of π. Yet, even

4
Parameters: q = 12289 < 214 , n = 1024
Error distribution: ψ16
Alice (server) Bob (client)
$
seed ← {0, 1}256
a←Parse(SHAKE-128(seed))
$ $
n
s, e ← ψ16 s0 , e0 , e00 ← ψ16
n
(b,seed)
b←as + e −−−−→ a←Parse(SHAKE-128(seed))
u←as0 + e0
v←bs0 + e00
(u,r) $
v0 ←us ←−− r ← HelpRec(v)
ν←Rec(v0 , r) ν←Rec(v, r)
µ←SHA3-256(ν) µ←SHA3-256(ν)

Protocol 2: Our Scheme. For the definitions of HelpRec and Rec see Section 5. For the definition of encodings and
the definition of Parse see Section 7.

this process may be partially abused [15], and when not is a one-to-one map), and by transferring only a short
strictly required it seems preferable to avoid it. 256-bit seed for a (see Section 7).
A subtle question is to choose an appropriate prim-
All-for-the-price-of-one attacks. Even if this common itive to generate a “random-looking” polynomial a out
parameter has been honestly generated, it is still rather of a short seed. For a security reduction, it seems
uncomfortable to have the security of all connections to the authors that there is no way around the (non-
rely on a single instance of a lattice problem. The sce- programmable) random oracle model (ROM). It is ar-
nario is an entity that discovers an unforeseen cryptan- gued in [39] that such a requirement is in practice an
alytic algorithm, making the required lattice reduction overkill, and that any pseudorandom generator (PRG)
still very costly, but say, not impossible in a year of should also work. And while it is an interesting question
computation, given its outstanding computational power. how such a reasonable pseudo-random generator would
By finding once a good enough basis of the lattice Λ = interact with our lattice assumption, the cryptographic
{(a, 1)x + (q, 0)y|x, y ∈ R}, this entity could then com- notion of a PRG is not helpful to argue security. Indeed,
promise all communications, using for example Babai’s it is an easy exercise6 to build (under the NTRU assump-
decoding algorithm [7]. tion) a “backdoored” PRG that is, formally, a legitimate
This idea of massive precomputation that is only de- PRG, but that makes our scheme insecure.
pendent on a fixed parameter a and then afterwards can Instead, we prefer to base ourselves on a standard
be used to break all key exchanges is similar in fla- cryptographic hash-function, which is the typical choice
vor to the 512-bit “Logjam” DLP attack [1]. This at- of an “instantiation” of the ROM. As a suitable op-
tack was only possible in the required time limit because tion we see Keccak [21], which has recently been stan-
most TLS implementations use fixed primes for Diffie- dardized as SHA3 in FIPS-202 [76], and which offers
Hellman. One of the recommended mitigations by the extendable-output functions (XOF) named SHAKE. This
authors of [1] is to avoid fixed primes. avoids costly external iteration of a regular hash function
Against all authority. Fortunately, all those pitfalls can and directly fits our needs.
be avoided by having the communicating parties gen- We use SHAKE-128 for the generation of a, which
erate a fresh a at each instance of the protocol (as we offers 128-bits of (post-quantum) security against colli-
propose). If in practice it turns out to be too expen- sions and preimage attacks. With only a small perfor-
sive to generate a for every connection, it is also possi- mance penalty we could have also chosen SHAKE-256,
ble to cache a on the server side5 for, say a few hours but we do not see any reason for such a choice, in partic-
without significantly weakening the protection against ular because neither collisions nor preimages lead to an
all-for-the-price-of-one attacks. Additionally, the perfor- attack against the proposed scheme.
mance impact of generating a is reduced by sampling a 6 Consider a secure PRG p, and parse its output p(seed) as two
uniformly directly in NTT format (recalling that the NTT small polynomial (f, g): an NTRU secret-key. Define p0 (seed) = gf−1
mod q: under the decisional NTRU assumption, p0 is still a secure PRG.
5 Butrecall that the secrets s, e, s0 , s0 , e00 have to be sampled fresh for Yet revealing the seed does reveal (f, g) and provides a backdoor as de-
every connection. tailed above.

5
4 Choice of the error distribution
(0, 1) (1, 1)
On non-Gaussian errors. In works like [22, 32, 86], a
significant algorithmic effort is devoted to sample from a
discrete Gaussian distribution to a rather high precision. ( 12 , 12 )
In the following we argue that such effort is not neces-
sary and motivate our choice of a centered binomial ψk (0, 0) (1, 0)
as error distribution.
Indeed, we recall that the original worst-case to
average-case reductions for LWE [84] and Ring-
LWE [71] state hardness for continuous Gaussian dis- Figure 1: The lattice D̃2 with Voronoi cells
tributions (and therefore also trivially apply to rounded
Gaussian, which differ from discrete Gaussians). This
also extends to discrete Gaussians [23] but such proofs 5 Improved error-recovery mechanism
are not necessarily intended for direct implementations.
We recall that the use of discrete Gaussians (or other dis- In most of the literature, Ring-LWE encryption allows to
tributions with very high-precision sampling) is only cru- encrypt one bit per coordinate of the ciphertext. It is also
cial for signatures [69] and lattice trapdoors [44], to pro- well known how to encrypt multiple bits per coordinate
vide zero-knowledgeness. by using a larger modulus-to-error ratio (and therefore
The following Theorem states that choosing ψk as er- decreasing the security for a fixed dimension n). How-
ror distribution in Protocol 2 does not significantly de- ever, in the context of exchanging a symmetric key (of,
crease security compared to a rounded Gaussian say, 256 bits), we end up having a message space larger
p distri-
bution with the same standard deviation σ = 16/2. than necessary and thus want to encrypt one bit in multi-
ple coordinates.
Theorem 4.1 Let ξ be the √ rounded Gaussian distribu- In [83] Pöppelmann and Güneysu introduced a tech-
√ of parameter σ = 8, that is, the distribution of
tion nique to encode one bit into two coordinates, and verified
b 8 · xe where x follows the standard normal distribu- experimentally that it led to a better error tolerance. This
tion. Let P be the idealized version of Protocol 2, where allows to either increase the error and therefore improve
the distribution ψ16 is replaced by ξ . If an (unbounded) the security of the resulting scheme or to decrease the
algorithm, given as input the transcript of an instance probability of decryption failures. In this section we pro-
of Protocol 2 succeeds in recovering the pre-hash key ν pose a generalization of this technique in dimension 4.
with probability p, then it would also succeed against P We start with an intuitive description of the approach in
with probability at least 2 dimensions and then explain what changes in 4 dimen-
sions. Appendices C and D give a thorough mathemati-
q ≥ p9/8 /26. cal description together with a rigorous analysis.
Let us first assume that both client and server have the
Proof See Appendix B. same vector x ∈ [0, 1)2 ⊂ R2 and want to map this vector
to a single bit. Mapping polynomial coefficients from
As explained in Section 6, our choice of parameters
{0, . . . , q − 1} to [0, 1) is easily accomplished through a
leaves a comfortable margin to the targeted 128 bits
division by q.
of post-quantum security, which accommodates for the
Now consider the lattice D̃2 with basis {(0, 1), ( 21 , 12 )}.
slight loss in security indicated by Theorem 4.1. Even
This lattice is a scaled version of the root lattice D2 ,
more important from a practical point of view is that no
specifically, D̃2 = 21 · D2 . Part of D̃2 is depicted in Fig-
known attack makes use of the difference in error distri-
ure 1; lattice points are shown together with their Voronoi
bution; what matters for attacks are entropy and standard
cells and the possible range of the vector x is marked
deviation.
with dashed lines. Mapping x to one bit is done by
Simple implementation. We remark that sampling from finding the closest-vector v ∈ D̃2 . If v = ( 12 , 12 ) (i.e.,
the centered binomial distribution ψ16 is rather trivial x is in the grey Voronoi cell), then the output bit is 1;
in hardware and software, given the availability of a if v ∈ {(0, 0), (0, 1), (1, 0), (1, 1)} (i.e., x is in a white
uniform binary source. Additionally, the implementa- Voronoi cell) then the output bit is 0.
tion of this sampling algorithm is much easier to pro- This map may seem like a fairly complex way to map
tect against timing attacks as no large tables or data- from a vector to a bit. However, recall that client and
dependent branches are required (cf. to the issues caused server only have a noisy version of x, i.e., the client has
by the table-based approach used in [22]). a vector xc and the server has a vector xs . Those two

6
Lemma C.2.
From 2 to 4 dimensions. When moving from the 2-
dimensional case considered above to the 4-dimensional
case used in our protocol, not very much needs to change.
( 12 , 21 )
The lattice D̃2 becomes the lattice D̃4 with basis B =
(u0 , u1 , u2 , g), where uiare the canonical basis vectors of
Z4 and gt = 12 , 12 , 12 , 21 . The lattice D̃4 is a rotated and
scaled version of the root lattice D4 . The Voronoi cells
of this lattice are no longer 2-dimensional “diamonds”,
but 4-dimensional objects called icositetrachoron or 24-
cells [65]. Determining in which cell a target point lies
Figure 2: Splitting of the Voronoi cell of ( 21 , 12 ) into 2rd = in is done using the closest vector algorithm CVPD̃4 , and
16 sub-cells, some with their corresponding difference a simplified version of it, which we call Decode, gives
vector to the center the result modulo Z4 .
As in the 2-dimensional illustration in Figure 2, we are
using 2-bit discretization; we are thus sending r · d = 8
vectors are close, but they are not the same and can be on
bits of reconciliation information per key bit.
different sides of a Voronoi cell border.
Putting all of this together, we obtain the HelpRec
Error reconciliation. The approach described above function to compute the r-bit reconciliation information
now allows for an efficient solution to solve this as
agreement-from-noisy-data problem. The idea is that  r
2

one of the two participants (in our case the client) sends HelpRec(x; b) = CVPD̃4 (x + bg) mod 2r ,
q
as a reconciliation vector the difference of his vector xc
and the center of its Voronoi cell (i.e., the point in the where b ∈ {0, 1} is a uniformly chosen random bit. The
lattice). The server adds this difference vector to xs and corresponding function Rec(x, r) = Decode( 1q x − 21r Br)
thus moves away from the border towards the center of computes one key bit from a vector x with 4 coefficients
the correct Voronoi cell. Note that an eavesdropper does in Zq and a reconciliation vector r ∈ {0, 1, 2, 3}4 . The
not learn anything from the reconciliation information: algorithms CVPD̃4 and Decode are listed as Algorithm 1
the client tells the difference to a lattice point, but not and Algorithm 2, respectively.
whether this is a lattice point producing a zero bit or a
one bit.
Algorithm 1 CVPD̃4 (x ∈ R4 )
This approach would require sending a full additional
vector; we can reduce the amount of reconciliation in- Ensure: An integer vector z such that Bz is a closest
formation through r-bit discretization. The idea is to vector to x: x − Bz ∈ V
split each Voronoi cell into 2dr sub-cells and only send 1: v0 ←bxe
in which of those sub-cells the vector xc is. Both partici- 2: v1 ←bx − ge
pants then add the difference of the center of the sub-cell 3: k←(kx − v0 k1 < 1) ? 0 : 1
and the lattice point. This is illustrated for r = 2 and 4: (v0 , v1 , v2 , v3 )t ←vk
d = 2 in Figure 2. 5: return (v0 , v1 , v2 , k)t + v3 · (−1, −1, −1, 2)t

Blurring the edges. Figure 1 may suggest that the prob-


ability of x being in a white Voronoi cell is the same as
for x being in the grey Voronoi cell. This would be the Algorithm 2 Decode(x ∈ R4 /Z4 )
case if x actually followed a continuous uniform distri- Ensure: A bit k such that kg is a closest vector to x+Z4 :
bution. However, the coefficients of x are discrete values x − kg ∈ V + Z4
in {0, q1 , . . . , q−1
q } and with the protocol described so far, 1: v = x − bxe
the bits of ν would have a small bias. The solution is to 2: return 0 if kvk1 ≤ 1 and 1 otherwise
add, with probability 21 , the vector ( 2q 1 1
, 2q ) to x before
running the error reconciliation. This has close to no ef-
fect for most values of x, but, with probability 12 moves Finally it remains to remark that even with this rec-
x to another Voronoi cell if it is very close to one side onciliation mechanism client and server do not always
of a border. Appendix E gives a graphical intuition for agree on the same key. Appendix D provides a detailed
this trick in two dimensions and with q = 9. The proof analysis of the failure probability of the key agreement
that it indeed removes all biases in the key is given in and shows that it is smaller than 2−60 .

7
6 Post-quantum security analysis known [52] that the number of calls to that oracle re-
mains polynomial, yet concretely evaluating the number
In [22] the authors chose Ring-LWE for a ring of rank of calls is rather painful, and this is subject to new heuris-
n = 1024, while most previous instantiations of the Ring- tic ideas [27, 28]. We choose to ignore this polynomial
LWE encryption scheme, like the ones in [32, 47, 67, 83], factor, and rather evaluate only the core SVP hardness,
chose substantially smaller rank n = 256 or n = 512. It that is the cost of one call to an SVP oracle in dimension
is argued that it is unclear if dimension 512 can offer b, which is clearly a pessimistic estimation (from the de-
post-quantum security. Yet, the concrete post-quantum fender’s point of view).
security of LWE-based schemes has not been thoroughly
studied, as far as we know. In this section we propose 6.2 Enumeration versus quantum sieve
such a (very pessimistic) concrete analysis. In particu-
lar, our analysis reminds us that the security depends as Typical implementations [25, 28, 40] use an enumeration
much on q and its ratio with the error standard deviation algorithm as this SVP oracle, yet this algorithm runs in
ς as it does on the dimension n. That means that our ef- super-exponential time. On the other hand, the sieve al-
fort of optimizing the error recovery and its analysis not gorithms are known to run in exponential time, but are so
only improves efficiency but also offers superior security. far slower in practice for accessible dimensions b ≈ 130.
We choose the latter to predict the core hardness and will
Security level over-shoot? With all our improvements, argue that for the targeted dimension, enumerations are
it would be possible to build a scheme with n = 512 expected to be greatly slower than sieving.
(and k = 24, q = 12289) and to obtain security some-
Quantum sieve. A lot of recent work has pushed the ef-
what similar to the one of [22, 47], and therefore further
ficiency of the original lattice sieve algorithms [73, 79],
improve efficiency. We call this variant JAR JAR and de-
improving the heuristic complexity from (4/3)b+o(b) ≈
tails are provided in Appendix A. Nevertheless, as his- p b+o(b)
tory showed us with RSA-512 [31], the standardization 20.415b down to 3/2 ≈ 20.292b (see [12, 59]).
and deployment of a scheme awakens further cryptan- The hidden sub-exponential factor is known to be much
alytic effort. In particular, N EW H OPE could withstand greater than one in practice, so again, estimating the
a dimension-halving attack in the line of [41, Sec 8.8.1] cost ignoring this factor leaves us with a significant pes-
based on the Gentry-Szydlo algorithm [45,64] or the sub- simistic margin.
field approach of [2]. Note that so far, such attacks are Most of those algorithms have been shown [58, 60] to
only known for principal ideal lattices or NTRU lattices, benefit from Grover’s quantum search algorithm, bring-
and there are serious obstructions to extend them to Ring- ing the complexity down to 20.265b . It is unclear if fur-
LWE, but such precaution seems reasonable until lattice ther improvements are to be expected, yet, because all
cryptanalysis stabilizes. those algorithms require classically building lists of size
p b+o(b)
We provide the security and performance analysis of 4/3 ≈ 20.2075b , it is very plausible that the best
JAR JAR in Appendix A mostly for comparison with quantum SVP algorithm would run in time greater than
other lower-security proposals. We strongly recommend 20.2075b .
N EW H OPE for any immediate applications, and advise Irrelevance of enumeration for our analysis. In [28],
against using JAR JAR until concrete cryptanalysis of predictions of the cost of solving SVP classically us-
lattice-based cryptography is better understood. ing the most sophisticated heuristic enumeration algo-
rithms are given. For example, solving SVP in dimension
6.1 Methodology: the core SVP hardness 100 requires visiting about 239 nodes, and 2134 nodes
in dimension 250. Because this enumeration is a back-
We analyze the hardness of Ring-LWE as an LWE prob- tracking algorithm, it does benefit from the recent quasi-
lem, since, so far, the best known attacks do not make use quadratic speedup [74], decreasing the quantum cost to
of the ring structure. There are many algorithms to con- about at least 220 to 267 operations as the dimension in-
sider in general (see the survey [3]), yet many of those creases from 100 to 250.
are irrelevant for our parameter set. In particular, because On the other hand, our best-known attack bound
there are only m = n samples available one may rule out 20.265b gives a cost of 266 in dimension 250, and the best
BKW types of attacks [57] and linearization attacks [6]. plausible attack bound 20.2075b ≈ 239 . Because enumera-
This essentially leaves us with two BKZ [28, 87] attacks, tion is super-exponential (both in theory and practice), its
usually referred to as primal and dual attacks that we will cost will be worse than our bounds in dimension larger
briefly recall below. than 250 and we may safely ignore this kind of algo-
The algorithm BKZ proceeds by reducing a lattice ba- rithm.7
sis using an SVP oracle in a smaller dimension b. It is 7 The numbers are taken from the latest full version of [28] available

8
6.3 Primal attack Known Known Best
Attack m b Classical Quantum Plausible
The primal attack consists of constructing a unique-SVP BCNS proposal [22]: q = 232 − 1, n = 1024, ς = 3.192
instance from the LWE problem and solving it using Primal 1062 296 86 78 61
BKZ. We examine how large the block dimension b is Dual 1055 296 86 78 61
required to be for BKZ to find the unique solution. Given p
NTRU ENCRYPT [54]: q = 212 , n = 743, ς ≈ 2/3
the matrix LWE instance (A, b = As + e) one builds the Primal 613 603 176 159 125
lattice Λ = {x ∈ Zm+n+1 : (A| − Im | − b)x = 0 mod q} of Dual 635 600 175 159 124

dimension d = m + n + 1, volume qm , and √ with a unique- JAR JAR: q = 12289, n = 512, ς = 12
SVP solution v = (s, e, 1) of norm λ ≈ ς n + m. Note Primal 623 449 131 119 93
that the number of used samples m may be chosen be- Dual 602 448 131 118√ 92
tween 0 and 2n in our case and we numerically optimize N EW H OPE: q = 12289, n = 1024, ς = 8
this choice. Primal 1100 967 282 256 200
Dual 1099 962 281 255 199
Success condition. We model the behavior of BKZ us-
ing the geometric series assumption (which is known
to be optimistic from the attacker’s point of view), that Table 1: Core hardness of N EW H OPE and JAR JAR and se-
finds a basis whose Gram-Schmidt norms are given lected other proposals from the literature. The value b denotes
by kb?i k = δ d−2i−1 · Vol(Λ)1/d where δ = ((πb)1/b · the block dimension of BKZ, and m the number of used sam-
b/2πe)1/2(b−1) [3, 27]. The unique short vector v will ples. Cost is given in log2 and is the smallest cost for all pos-
sible choices of m and b. Note that our estimation is very op-
be detected if the projection of v onto the vector space
timistic about the abilities of the attacker so that our result for
spanned by the last b Gram-Schmidt vectors is shorter√ the parameter set from [22] does not indicate that it can be bro-
than b?d−b . Its projected norm is expected to be ς b, ken with ≈ 280 bit operations, given today’s state-of-the-art in
that is the attack is successful if and only if cryptanalysis.

ς b ≤ δ 2b−d−1 · qm/d . (1)
probability by building about 1/ε 2 many such short vec-
tors. Because the sieve algorithms provide 20.2075b vec-
6.4 Dual attack
tors, the attack must be repeated at least R times where
The dual attack consists of finding a short vector in the
R = max(1, 1/(20.2075b ε 2 )).
dual lattice w ∈ Λ0 = {(x, y) ∈ Zm ×Zn : At x = y mod q}.
Assume we have found a vector (x, y) of length ` and This makes the conservative assumption that all the vec-
compute z = vt · b = vt As + vt e = wt s + vt e mod q which tors provided by the Sieve algorithm are as short as the
is distributed as a Gaussian of standard deviation `ς if shortest one.
(A, b) is indeed an LWE sample (otherwise it is uniform
mod q). Those two distributions have maximal vari-
6.5 Security claims
ation distance bounded by8 ε = 4 exp(−2π 2 τ 2 ) where
τ = `ς /q, that is, given such a vector of length ` one According to our analysis, we claim that our proposed
has an advantage ε against decision-LWE. parameters offer at least (and quite likely with a large
The length ` of a vector given by the BKZ algorithm is margin) a post-quantum security of 128 bits. The cost
given by ` = kb0 k. Knowing that Λ0 has dimension d = of the primal attack and dual attacks (estimated by our
m + n and volume qn we get ` = δ d−1 qn/d . Therefore, script scripts/PQsecurity.py) are given in Table 1.
obtaining an ε-distinguisher requires running BKZ with For comparison we also give a lower bound on the secu-
block dimension b where rity of [22] and do notice a significantly improved se-
curity in our proposal. Yet, because of the numerous
− 2π 2 τ 2 ≥ ln(ε/4). (2) pessimistic assumption made in our analysis, we do not
claim any quantum attacks reaching those bounds.
Note that small advantages ε are not relevant since the Most other RLWE proposals achieve considerably
agreed key is hashed: an attacker needs an advantage of lower security than N EW H OPE; for example, the highest-
at least 1/2 to significantly decrease the search space of security parameter set used for RLWE encryption in [47]
the agreed key. He must therefore amplify his success is very similar to the parameters of JAR JAR. The situ-
at http://www.di.ens.fr/~ychen/research/Full_BKZ.pdf. ation is different for NTRU ENCRYPT, which has been
8 A preliminary version of this paper contained a bogus formula for instantiated with parameters that achieve about 128 bits
ε leading to under-estimating the cost of the dual attack. Correcting of security according to our analysis9 .
this formula leads to better security claim, and almost similar cost for
the primal and dual attacks. 9 For comparison we view the NTRU key-recovery as an homoge-

9
Specifically, we refer to NTRU ENCRYPT with n = the NTT computation and after the reverse transforma-
743 as suggested in [54]. A possible advantage of tion by powers of γ −1 , no zero padding is required and
NTRU ENCRYPT compared to N EW H OPE is somewhat an n-point NTT can be used to transform a polynomial
smaller message sizes, however, this advantage becomes with n coefficients.
i=0 gi X ∈ Rq we define
For a polynomial g = ∑1023
very small when scaling parameters to achieve a sim- i

ilar security margin as N EW H OPE. The large down-


side of using NTRU ENCRYPT for ephemeral key ex- 1023

change is the cost for key generation. The implemen- NTT(g) = ĝ = ∑ ĝi X i , with
i=0
tation of NTRU ENCRYPT with n = 743 in eBACS [19] 1023
takes about an order of magnitude longer for key gener-
ation alone than N EW H OPE takes in total. Also, unlike
ĝi = ∑ γ jg jωi j,
j=0
our N EW H OPE software, this NTRU ENCRYPT software
is not protected against timing attacks; adding such pro- where we fix √ the n-th primitive root of unity to ω = 49
tection would presumably incur a significant overhead. and thus γ = ω = 7. Note that in our implementation
we use an in-place NTT algorithm which requires bit-
reversal operations. As an optimization, our implemen-
7 Implementation tations skips these bit-reversals for the forward transfor-
mation as all inputs are only random noise. This opti-
In this section we provide details on the encodings of
mization is transparent to the protocol and for simplicity
messages and describe our portable reference implemen-
omitted in the description here.
tation written in C, as well as an optimized implementa-
The function NTT−1 is the inverse of the function
tion targeting architectures with AVX vector instructions.
NTT. The computation of NTT−1 is essentially the
same as the computation of NTT, except that it uses ω −1
7.1 Encodings and generation of a mod q = 1254, multiplies by powers of γ −1 mod q =
The key-exchange protocol described in Protocol 1 and 8778 after the summation, and also multiplies each coef-
also our protocol as described in Protocol 2 exchange ficient by the scalar n−1 mod q = 12277 so that
messages that contain mathematical objects (in particu- 1023
lar, polynomials in Rq ). Implementations of these proto- NTT−1 (ĝ) = g = ∑gi X i
, with
cols need to exchange messages in terms of byte arrays. i=0
As we will describe in the following, the choice of en- 1023
−1 −i
codings of polynomials to byte arrays has a serious im- gi = n γ ∑ ĝ j ω −i j .
j=0
pact on performance. We use an encoding of messages
that is particularly well-suited for implementations that
The inputs to NTT−1 are not just random noise, so in-
make use of quasi-linear NTT-based polynomial multi-
side NTT−1 our software has to perform the initial bit
plication.
reversal, making NTT−1 slightly more costly than NTT.
Definition of NTT and NTT−1 . The NTT is a tool
commonly used in implementations of ideal lattice-based Definition of Parse. The public parameter a is generated
cryptography [32, 47, 67, 83]. For some background from a 256-bit seed through the extendable-output func-
on the NTT and the description of fast software im- tion SHAKE-128 [76, Sec. 6.2]. The output of SHAKE-
plementations we refer to [51, 72]. In general, fast 128 is considered as an array of 16-bit, unsigned, little-
quasi-logarithmic algorithms exist for the computation endian integers. Each of those integers is used as a coef-
of the NTT and a polynomial multiplication can be per- ficient of a if it is smaller than 5q and rejected otherwise.
formed by computing c = NTT−1 (NTT(a) ◦ NTT(b)) The first such 16-bit integer is used as the coefficient of
for a, b, c ∈ R. An NTT targeting ideal lattices defined X 0 , the next one as coefficient of X 1 and so on. Earlier
in Rq = Zq [X]/(X n + 1) can be implemented very effi- versions of this paper described a slightly different way
ciently if n is a power of two and q is a prime for which of rejection sampling for coefficients of a. The more ef-
it holds that q ≡ 1 mod 2n. This way a primitive n-th ficient approach adopted in this final version was sug-
root of unity ω and its square root γ exist. By multiply- gested independently by Gueron and Schlieker in [50]
√ and by Yawning Angel in [5]. However, note that a re-
ing coefficient-wise by powers of γ = ω mod q before
duction modulo q of the coefficients of a as described
neous Ring-LWE instance. We do not take into account the combinato- in [50] and [5] is not necessary; both our implementa-
rial vulnerabilities [56] induced by the fact that secrets are ternary. We
tions can handle coefficients of a in {0, . . . , 5q − 1}.
note that NTRU is a potentially a weaker problem than Ring-LWE: it
is in principle subject to a subfield-lattice attack [2], but the parameters Due to a small probability of rejections, the amount of
proposed for NTRU ENCRYPT are immune. output required from SHAKE-128 depends on the seed –

10
what is required is n = 1024 coefficients that are smaller represented as unsigned 16-bit integers. Our in-place
than 5q. The minimal amount of output is thus 2 KB; the NTT implementation transforms from bit-reversed to
average amount is ≈ 2184.5 bytes. The resulting poly- natural order using Gentleman-Sande butterfly oper-
nomial a (denoted as â) is considered to be in NTT do- ations [29, 42]. One would usually expect that each
main. This is possible because the NTT transforms uni- NTT is preceded by a bit-reversal, but all inputs to
form noise to uniform noise. NTT are noise polynomials that we can simply consider
Using a variable amount of output from SHAKE-128 as being already bit-reversed; as explained earlier, the
leaks information about a through timing information. NTT−1 operation still involves a bit-reversal. The core
This is not a problem for most applications, since a is of the NTT and NTT−1 operation consists of 10 layers
public. As pointed out by Burdges in [24], such a tim- of transformations, each consisting of 512 butterfly
ing leak of public information can be a problem when operations of the form described in Listing 2.
deploying N EW H OPE in anonymity networks like Tor.
Appendix F describes an alternative approach for Parse, Montgomery arithmetic and lazy reductions. The per-
which is slightly more complex and slightly slower, but formance of operations on polynomials is largely deter-
does not leak any timing information about a. mined by the performance of NTT and NTT−1 . The
The message format of (b, seed) and (u, r). With the main computational bottleneck of those operations are
definition of the NTT, we can now define the format 5120 butterfly operations, each consisting of one addi-
of the exchanged messages. In both (b, seed) and (u, r) tion, one subtraction and one multiplication by a precom-
the polynomial is transmitted in the NTT domain (as in puted constant. Those operations are in Zq ; recall that q
works like [83,86]). Polynomials are encoded as an array is a 14-bit prime. To speed up the modular-arithmetic
of 1792 bytes, in a compressed little-endian format. The operations, we store all precomputed constants in Mont-
encoding of seed is straight-forward as an array of 32 gomery representation [75] with R = 218 , i.e., instead
bytes, which is simply concatenated with the encoding of storing ω i , we store 218 ω i (mod q). After a multi-
of b. Also the encoding of r is fairly straight-forward: plication of a coefficient g by some constant 218 ω i , we
it packs four 2-bit coefficients into one byte for a to- can then reduce the result r to gω i (mod q) with the fast
tal of 256 bytes, which are again simply concatenated Montgomery reduction approach. In fact, we do not al-
with the encoding of u. We denote these encodings to ways fully reduce modulo q, it is sufficient if the result of
byte arrays as encodeA and encodeB and their inverses the reduction has at most 14 bits. The fast Montgomery
as decodeA and decodeB. For a description of our key- reduction routine given in Listing 1a computes such a re-
exchange protocol including encodings and with explicit duction to a 14-bit integer for any unsigned 32-bit integer
NTT and NTT−1 transformations, see Protocol 3. in {0, . . . , 232 − q(R − 1) − 1}. Note that the specific im-
plementation does not work for any 32-bit integer; for
example, for the input 232 − q(R − 1) = 1073491969 the
7.2 Portable C implementation addition a=a+u causes an overflow and the function re-
This paper is accompanied by a C reference implemen- turns 0 instead of the correct result 4095. In the following
tation described in this section and an optimized imple- we establish that this is not a problem for our software.
mentation for Intel and AMD CPUs described in the next Aside from reductions after multiplication, we also
section. The main emphasis in the C reference imple- need modular reductions after addition. For this task we
mentation is on simplicity and portability. It does not use the “short Barrett reduction” [10] detailed in List-
use any floating-point arithmetic and outside the Kec- ing 1b. Again, this routine does not fully reduce modulo
cak (SHA3-256 and SHAKE-128) implementation only q, but reduces any 16-bit unsigned integer to an integer
needs 16-bit and 32-bit integer arithmetic. In particular, of at most 14 bits which is congruent modulo q.
the error-recovery mechanism described in Section 5 is In the context of the NTT and NTT−1 , we make sure
implemented with fixed-point (i.e., integer-) arithmetic. that inputs have coefficients of at most 14 bits. This al-
Furthermore, the C reference implementation does not lows us to avoid Barrett reductions after addition on ev-
make use of the division operator (/) and the modulo op- ery second level, because coefficients grow by at most
erator (%). The focus on simplicity and portability does one bit per level and the short Barrett reduction can
not mean that the implementation is not optimized at all. handle 16-bit inputs. Let us turn our focus to the in-
On the contrary, we use it to illustrate various optimiza- put of the Montgomery reduction (see Listing 2). Be-
tion techniques that are helpful to speed up the key ex- fore subtracting a[j+d] from t we need to add a mul-
change and are also of independent interest for imple- tiple of q to avoid unsigned underflow. Coefficients
menters of other ideal-lattice-based schemes. never grow larger than 15 bits and 3 · q = 36867 > 215 ,
so adding 3 · q is sufficient. An upper bound on the
NTT optimizations. All polynomial coefficients are expression ((uint32_t)t + 3*12289 - a[j+d]) is

11
Parameters: q = 12289 < 214 , n = 1024
Error distribution: ψ16n

Alice (server) Bob (client)


$
seed ← {0, . . . , 255}32
â←Parse(SHAKE-128(seed))
$ $
n
s, e ← ψ16 s0 , e0 , e00 ← ψ16
n

ŝ←NTT(s)
ma =encodeA(seed,b̂)
b̂←â ◦ ŝ + NTT(e) −−−−−−−−−−−−→ (b̂, seed)←decodeA(ma )
1824 Bytes
â←Parse(SHAKE-128(seed))
t̂←NTT(s0 )
û←â ◦ t̂ + NTT(e0 )
v←NTT−1 (b̂ ◦ t̂) + e00
mb =encodeB(û,r) $
(û, r)←decodeB(mb ) ←−−−−−−−−−− r ← HelpRec(v)
2048 Bytes
v0 ←NTT−1 (û ◦ ŝ) ν←Rec(v, r)
ν←Rec(v0 , r) µ←SHA3-256(ν)
µ←SHA3-256(ν)

Protocol 3: Our proposed protocol including NTT and NTT−1 computations and sizes of exchanged messages; ◦
denotes pointwise multiplication; elements in NTT domain are denoted with a hat (ˆ)

obtained if t is 215 − 1 and a[j+d] is zero; we Listing 2 The Gentleman-Sande butterfly inside odd lev-
thus obtain 215 + 3 · q = 69634. All precomputed els of our NTT computation. All a[j] and W are of type
constants are in {0, . . . , q − 1}, so the expression uint16_t.
(W * ((uint32_t)t + 3*12289 - a[j+d]), the in-
put to the Montgomery reduction, is at most 69634 · (q − W = omega[jTwiddle++];
t = a[j];
1) = 855662592 and thus safely below the maximum in- a[j] = bred(t + a[j+d]);
put that the Montgomery reduction can handle. a[j+d] = mred(W * ((uint32_t)t + 3*12289 - a[j+d]));

Listing 1 Reduction routines used in the reference im-


plementation. the “TweetFIPS202” implementation [20] for everything
(a) Montgomery reduction (R = 218 ). else.
uint16_t mred(uint32_t a) { The sampling of centered binomial noise polynomi-
uint32_t u; als is based on a fast PRG with a random seed from
u = (a * 12287);
u &= ((1 << 18) - 1); /dev/urandom followed by a quick summation of 16-
a += u * 12289; bit chunks of the PRG output. Note that the choice of
return a >> 18;
}
the PRG is a purely local choice that every user can
(b) Short Barrett reduction.
pick independently based on the target hardware archi-
tecture and based on routines that are available anyway
uint16_t bred(uint16_t a) {
uint32_t u; (for example, for symmetric encryption following the
u = ((uint32_t) a * 5) >> 16; key exchange). Our C reference implementation uses
a -= u * 12289; ChaCha20 [14], which is fast, trivially protected against
return a;
} timing attacks, and is already in use by many TLS clients
and servers [61, 62].

Fast random sampling. As a first step before perform- 7.3 Optimized AVX2 implementation
ing any operations on polynomials, both Alice and Bob
need to expand the seed to the polynomial a using Intel processors since the “Sandy Bridge” generation
SHAKE-128. The implementation we use is based on support Advanced Vector Extensions (AVX) that oper-
the “simple” implementation by Van Keer for the Kec- ate on vectors of 8 single-precision or 4 double-precision
cak permutation and slightly modified code taken from floating-point values in parallel. With the introduction

12
of the “Haswell” generation of CPUs, this support was With this vectorization, the cost for HelpRec is only 3 404
extended also to 256-bit vectors of integers of various cycles, the cost for Rec is only 2 804 cycles.
sizes (AVX2). It is not surprising that the enormous com-
putational power of these vector instructions has been
used before to implement very high-speed crypto (see, 8 Benchmarks and comparison
for example, [16, 18, 48]) and also our optimized refer-
ence implementation targeting Intel Haswell processors In the following we present benchmark results of our
uses those instructions to speed up multiple components software. All benchmark results reported in Table 2 were
of the key exchange. obtained on an Intel Core i7-4770K (Haswell) running
at 3491.953 MHz with Turbo Boost and Hyperthreading
NTT optimizations. The AVX instruction set has been disabled. We compiled our C reference implementation
used before to speed up the computation of lattice-based with gcc-4.9.2 and flags -O3 -fomit-frame-pointer
cryptography, and in particular the number-theoretic -march=corei7-avx -msse2avx. We compiled our
transform. Most notably, Güneysu, Oder, Pöppelmann optimized AVX implementation with clang-3.5 and flags
and Schwabe achieve a performance of only 4 480 cycles -O3 -fomit-frame-pointer -march=native.
for a dimension-512 NTT on Intel Sandy Bridge [51]. As described in Section 7, the sampling of a is not
For arithmetic modulo a 23-bit prime, they represent co- running in constant time; we report the median run-
efficients as double-precision integers. ning time and (in parentheses) the average running time
We experimented with multiple different approaches for this generation, the server-side key-pair generation
to speed up the NTT in AVX. For example, we vector- and client-side shared-key computation; both over 1000
ized the Montgomery arithmetic approach of our C ref- runs. For all other routines we report the median of
erence implementation and also adapted it to a 32-bit- 1000 runs. We built the software from [22] on the
signed-integer approach. In the end it turned out that same machine as ours and—like the authors of [22]—
floating-point arithmetic beats all of those more sophisti- used openssl speed for benchmarking their software
cated approaches, so we are now using an approach that and converted the reported results to approximate cycle
is very similar to the approach in [51]. One computation counts as given in Table 2.
of a dimension-1024 NTT takes 8 448 cycles, unlike the
numbers in [51] this does include multiplication by the Comparison with BCNS and RSA/ECDH. As previ-
powers of γ and unlike the numbers in [51], this excludes ously mentioned, the BCNS implementation [22] also
a bit-reversal. uses the dimension n = 1024 but the larger modulus
q = 232 − 1 and the Gaussian √ error distribution with
Fast sampling. Intel Haswell processors support the
Gaussian parameter σ = 8/ 2π = 3.192. When the au-
AES-NI instruction set and for the local choice of noise
thors of BCNS integrated their implementation into SSL
sampling it is obvious to use those. More specifically,
it only incurred a slowdown by a factor of 1.27 compared
we use the public-domain implementation of AES-256 in
to ECDH when using ECDSA signatures and a factor of
counter mode written by Dolbeau, which is included in
1.08 when using RSA signatures with respect to the num-
the SUPERCOP benchmarking framework [19]. Trans-
ber of connections that could be handled by the server.
formation from uniform noise to the centered binomial
As a reference, the reported cycle counts in [22] for a
is optimized in AVX2 vector instructions operating on
nistp256 ECDH on the client side are 2 160 000 cycles
vectors of bytes and 16-bit integers.
(0.8 ms @2.77 GHz) and on the server side 3 221 288
For the computation of SHAKE-128 we use the same
cycles (1.4 ms @2.33 GHz). These numbers are obvi-
code as in the C reference implementation. One might
ously not state of the art for ECDH software. Even on
expect that architecture-specific optimizations (for exam-
the nistp256 curve, which is known to be a far-from-
ple, using AVX instructions) are able to offer significant
optimal choice, it is possible to achieve cycle counts of
speedups, but the benchmarks of the eBACS project [19]
less than 300 000 cycles for a variable-basepoint scalar
indicate that on Intel Haswell, the fastest implementation
multiplication on an Intel Haswell [49]. Also OpenSSL
is the “simple” implementation by Van Keer that our C
optionally includes fast software for nistp256 ECDH
reference implementation is based on. The reasons that
by Käsper and Langley and we assume that the authors
vector instructions are not very helpful for speeding up
of [22] omitted enabling it. Compared to BCNS, our
SHAKE (or, more generally, Keccak) are the inherently
C implementation is more than 8 times faster and our
sequential nature and the 5 × 5 dimension of the state
AVX implementation even achieves a speedup factor of
matrix that makes internal vectorization hard.
more than 27. At this performance it is in the same ball-
Error recovery. The 32-bit integer arithmetic used by park as state-of-the-art ECDH software, even when TLS
the C reference implementation for HelpRec and Rec switches to faster 128-bit secure ECDH key exchange
is trivially 8-way parallelized with AVX2 instructions. based on Curve25519 [13], as recently specified in RFC

13
Table 2: Intel Haswell cycle counts of our proposal as compared to the BCNS proposal from [22].

BCNS [22] Ours (C ref) Ours (AVX2)


Generation of a 43 440a 37 470a
(43 607) a (36 863)a
NTT 55 360 8 448
−1 b
NTT 59 864 9 464b
Sampling of a noise polynomial 32 684 c 5 900c
HelpRec 14 608 3 404
Rec 10 092 2 804
Key generation (server) ≈ 2 477 958 258 246 88 920
(258 965) (89 079)
Key gen + shared key (client) ≈ 3 995 977 384 994 110 986
(385 146) (111 169)
Shared key (server) ≈ 481 937 86 280 19 422
a Includes reading a seed from /dev/urandom
b Includes one bit reversal
c Excludes reading a seed from /dev/urandom, which is shared across multiple calls to the noise generation

7748 [63]. tion of a lattice-based public-key encryption scheme,


In comparison to the BCNS proposal we see a large which could also be used for key exchange, is given by
performance advantage from switching to the binomial Bernstein, Chuengsatiansup, Lange, and van Vredendaal
error distribution. The BCNS software uses a large pre- in [17], but we leave a detailed comparison to future
computed table to sample from a discrete Gaussian dis- work. An efficient authenticated lattice-based key ex-
tribution with a high precision. This approach takes change scheme has recently been proposed by del Pino,
1 042 700 cycles to samples one polynomial in constant Lyubashevsky, and Pointcheval in [33].
time. Our C implementation requires only 32 684 cy-
cles to sample from the binomial distribution. Another
factor is that we use the NTT in combination with a
smaller modulus. Polynomial multiplication in [22] is
using Nussbaumer’s symbolic approach based on re- References
cursive negacyclic convolutions [80]. The implemen- [1] A DRIAN , D., B HARGAVAN , K., D URUMERIC , Z., G AUDRY,
tation in [22] only achieves a performance of 342 800 P., G REEN , M., H ALDERMAN , J. A., H ENINGER , N.,
cycles for a constant-time multiplication. Additionally, S PRINGALL , D., T HOMÉ , E., VALENTA , L., VANDER S LOOT,
B., W USTROW, E., B ÉGUELIN , S. Z., AND Z IMMERMANN , P.
the authors of [22] did not perform pre-transformation Imperfect forward secrecy: How Diffie-Hellman fails in practice.
of constants (e.g., a) or transmission of coefficients in In CCS ’15 Proceedings of the 22nd ACM SIGSAC Conference on
FFT/Nussbaumer representation. Computer and Communications Security (2015), ACM, pp. 5–17.
https://weakdh.org/. 4, 5

Follow-Up Work. We would like to refer the reader to [2] A LBRECHT, M., BAI , S., AND D UCAS , L. A subfield lattice
attack on overstretched NTRU assumptions. IACR Cryptology
follow-up work in which improvements to N EW H OPE
ePrint Archive report 2016/127, 2016. http://eprint.iacr.
and its implementation were proposed based on a org/2016/127. 8, 10
preprint version of this work [4]. In [50] Gueron and
[3] A LBRECHT, M. R., P LAYER , R., AND S COTT, S. On the con-
Schlieker introduce faster pseudorandom bytes gener- crete hardness of learning with errors. IACR Cryptology ePrint
ation by changing the underlying functions, a method Archive report 2015/046, 2015. http://eprint.iacr.org/
to decrease the rejection rate during sampling (see Sec- 2015/046/. 8, 9
tion 7.1), and a vectorization of the sampling step. Longa [4] A LKIM , E., D UCAS , L., P ÖPPELMANN , T., AND S CHWABE ,
and Naehrig [68] optimize the NTT and present new P. Post-quantum key exchange – a new hope. IACR Cryptol-
ogy ePrint Archive report 2015/1092, 2015. https://eprint.
modular reduction techniques and are able to achieve a iacr.org/2015/1092/20160329:201913. 14
speedup of factor-1.90 for the C implementation and a
[5] A NGEL , Y. Post-quantum secure hybrid handshake based
factor-1.25 for the AVX implementation compared to the on NewHope. Posting to the tor-dev mailing list, 2016.
preprint [4] (note that this version has updated numbers). https://lists.torproject.org/pipermail/tor-dev/
An alternative NTRU-based proposal and implementa- 2016-May/010896.html. 10

14
[6] A RORA , S., AND G E , R. New algorithms for learning in [19] B ERNSTEIN , D. J., AND L ANGE , T. eBACS: ECRYPT bench-
presence of errors. In Automata, Languages and Programming marking of cryptographic systems. http://bench.cr.yp.to
(2011), L. Aceto, M. Henzingeri, and J. Sgall, Eds., vol. 6755 of (accessed 2015-10-07). 10, 13
LNCS, Springer, pp. 403–415. https://www.cs.duke.edu/ [20] B ERNSTEIN , D. J., S CHWABE , P., AND A SSCHE , G. V.
~rongge/LPSN.pdf. 8, 18 Tweetable FIPS 202, 2015. http://keccak).noekeon.org/
[7] BABAI , L. On Lovász’ lattice reduction and the nearest tweetfips202.html (accessed 2016-03-21). 12
lattice point problem. Combinatorica 6, 1 (1986), 1–13. [21] B ERTONI , G., DAEMEN , J., P EETERS , M., AND A SSCHE ,
http://www.csie.nuk.edu.tw/~cychen/Lattices/ G. V. Keccak. In Advances in Cryptology – EUROCRYPT 2013
On%20lovasz%20lattice%20reduction%20and%20the% (2013), T. Johansson and P. Q. Nguyen, Eds., vol. 7881 of LNCS,
20nearest%20lattice%20point%20problem.pdf. 5 Springer, pp. 313–314. 5
[8] BAI , S., AND G ALBRAITH , S. D. An improved compression [22] B OS , J. W., C OSTELLO , C., NAEHRIG , M., AND S TEBILA , D.
technique for signatures based on learning with errors. In Topics Post-quantum key exchange for the TLS protocol from the ring
in Cryptology – CT-RSA 2014 (2014), J. Benaloh, Ed., vol. 8366 learning with errors problem. In 2015 IEEE Symposium on Secu-
of LNCS, Springer, pp. 28–47. https://eprint.iacr.org/ rity and Privacy (2015), pp. 553–570. http://eprint.iacr.
2013/838/. 1 org/2014/599. 1, 2, 3, 4, 6, 8, 9, 13, 14
[9] BAI , S., L ANGLOIS , A., L EPOINT, T., S TEHLÉ , D., AND S TE - [23] B RAKERSKI , Z., L ANGLOIS , A., P EIKERT, C., R EGEV, O.,
INFELD , R. Improved security proofs in lattice-based cryp-
AND S TEHLÉ , D. Classical hardness of learning with errors. In
tography: using the Rényi divergence rather than the statisti- Proceedings of the forty-fifth annual ACM symposium on Theory
cal distance. In Advances in Cryptology – ASIACRYPT 2015 of computing (2013), ACM, pp. 575–584. http://arxiv.org/
(2015), T. Iwata and J. H. Cheon, Eds., LNCS, Springer. http: pdf/1306.0281. 6
//eprint.iacr.org/2015/483/. 18
[24] B URDGES , J. Post-quantum secure hybrid handshake
[10] BARRETT, P. Implementing the Rivest Shamir and Adleman pub-
based on NewHope. Posting to the tor-dev mailing
lic key encryption algorithm on a standard digital signal proces-
list, 2016. https://lists.torproject.org/pipermail/
sor. In Advances in Cryptology – CRYPTO ’86 (1987), A. M.
tor-dev/2016-May/010886.html. 11
Odlyzko, Ed., vol. 263 of Lecture Notes in Computer Science,
Springer-Verlag Berlin Heidelberg, pp. 311–323. 11 [25] C ADÉ , D., P UJOL , X., AND S TEHLÉ , D. fplll 4.0.4, 2013.
https://github.com/dstehle/fplll (accessed 2015-10-
[11] BATCHER , K. E. Sorting networks and their applications.
13). 8
In AFIPS conference proceedings, volume 32: 1968 Spring
Joint Computer Conference (1968), Thompson Book Company, [26] C HECKOWAY, S., F REDRIKSON , M., N IEDERHAGEN , R., E V-
pp. 307–314. http://www.cs.kent.edu/~batcher/conf. ERSPAUGH , A., G REEN , M., L ANGE , T., R ISTENPART, T.,
html. 22 B ERNSTEIN , D. J., M ASKIEWICZ , J., AND S HACHAM , H. On
the practical exploitability of Dual EC in TLS implementations.
[12] B ECKER , A., D UCAS , L., G AMA , N., AND L AARHOVEN , T.
In Proceedings of the 23rd USENIX security symposium (2014).
New directions in nearest neighbor searching with applications to
https://projectbullrun.org/dual-ec/index.html. 4
lattice sieving. In SODA ’16 Proceedings of the twenty-seventh
annual ACM-SIAM symposium on Discrete Algorithms (2016 (to [27] C HEN , Y. Lattice reduction and concrete security of fully homo-
appear)), SIAM. 8 morphic encryption. PhD thesis, l’Université Paris Diderot, 2013.
http://www.di.ens.fr/~ychen/research/these.pdf. 8,
[13] B ERNSTEIN , D. J. Curve25519: new Diffie-Hellman speed
9
records. In Public Key Cryptography – PKC 2006 (2006),
M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, Eds., vol. 3958 [28] C HEN , Y., AND N GUYEN , P. Q. BKZ 2.0: Better lattice
of LNCS, Springer, pp. 207–228. http://cr.yp.to/papers. security estimates. In Advances in Cryptology – ASIACRYPT
html#curve25519. 13 2011, D. H. Lee and X. Wang, Eds., vol. 7073 of LNCS.
Springer, 2011, pp. 1–20. http://www.iacr.org/archive/
[14] B ERNSTEIN , D. J. ChaCha, a variant of Salsa20. In Workshop
asiacrypt2011/70730001/70730001.pdf. 8
Record of SASC 2008: The State of the Art of Stream Ciphers
(2008). http://cr.yp.to/papers.html#chacha. 12 [29] C HU , E., AND G EORGE , A. Inside the FFT Black Box Serial and
[15] B ERNSTEIN , D. J., C HOU , T., C HUENGSATIANSUP, C., H ÜLS - Parallel Fast Fourier Transform Algorithms. CRC Press, Boca
ING , A., L ANGE , T., N IEDERHAGEN , R., AND VAN V REDEN -
Raton, FL, USA, 2000. 11
DAAL , C. How to manipulate curve standards: a white paper for [30] C ONWAY, J., AND S LOANE , N. J. A. Sphere packings, lattices
the black hat. IACR Cryptology ePrint Archive report 2014/571, and groups, 3rd ed., vol. 290 of Grundlehren der mathematischen
2014. http://eprint.iacr.org/2014/571/. 4, 5 Wissenschaften. Springer Science & Business Media, 1999. 19
[16] B ERNSTEIN , D. J., C HUENGSATIANSUP, C., L ANGE , T., AND [31] C OWIE , J., D ODSON , B., E LKENBRACHT-H UIZING , R. M.,
S CHWABE , P. Kummer strikes back: new DH speed records. In L ENSTRA , A. K., M ONTGOMERY, P. L., AND Z AYER , J. A
Advances in Cryptology – EUROCRYPT 2015 (2014), T. Iwata world wide number field sieve factoring record: on to 512 bits.
and P. Sarkar, Eds., vol. 8873 of LNCS, Springer, pp. 317–337. In Advances in Cryptology – ASIACRYPT’96 (1996), K. Kim and
full version: http://cryptojedi.org/papers/#kummer. 13 T. Matsumoto, Eds., vol. 1163 of LNCS, Springer, pp. 382–394.
[17] B ERNSTEIN , D. J., C HUENGSATIANSUP, C., L ANGE , T., AND http://oai.cwi.nl/oai/asset/1940/1940A.pdf. 8
VAN V REDENDAAL , C. NTRU Prime. IACR Cryptology ePrint [32] DE C LERCQ , R., ROY, S. S., V ERCAUTEREN , F., AND V ER -
Archive report 2016/461, 2016. https://eprint.iacr.org/ BAUWHEDE , I. Efficient software implementation of ring-LWE
2016/461. 14 encryption. In Design, Automation & Test in Europe Conference
[18] B ERNSTEIN , D. J., H OPWOOD , D., H ÜLSING , A., L ANGE , T., & Exhibition, DATE 2015 (2015), EDA Consortium, pp. 339–
N IEDERHAGEN , R., PAPACHRISTODOULOU , L., S CHNEIDER , 344. http://eprint.iacr.org/2014/725. 2, 4, 6, 8, 10
M., S CHWABE , P., AND W ILCOX -O’H EARN , Z. SPHINCS: [33] DEL P INO , R., LYUBASHEVSKY, V., AND P OINTCHEVAL , D.
practical stateless hash-based signatures. In Advances in Cryp- The whole is less than the sum of its parts: Constructing more ef-
tology – EUROCRYPT 2015 (2015), E. Oswald and M. Fis- ficient lattice-based AKEs. IACR Cryptology ePrint Archive re-
chlin, Eds., vol. 9056 of LNCS, Springer, pp. 368–397. https: port 2016/435, 2016. https://eprint.iacr.org/2016/435.
//cryptojedi.org/papers/#sphincs. 3, 13 14

15
[34] D ING , J. New cryptographic constructions using generalized [48] G UERON , S. Parallelized hashing via j-lanes and j-pointers tree
learning with errors problem. IACR Cryptology ePrint Archive modes, with applications to SHA-256. IACR Cryptology ePrint
report 2012/387, 2012. http://eprint.iacr.org/2012/ Archive report 2014/170, 2014. https://eprint.iacr.org/
387. 1, 2, 3, 4 2014/170. 13
[35] D ING , J., X IE , X., AND L IN , X. A simple provably secure [49] G UERON , S., AND K RASNOV, V. Fast prime field elliptic-curve
key exchange scheme based on the learning with errors prob- cryptography with 256-bit primes. Journal of Cryptographic En-
lem. IACR Cryptology ePrint Archive report 2012/688, 2012. gineering 5, 2 (2014), 141–151. https://eprint.iacr.org/
http://eprint.iacr.org/2012/688. 1, 2, 3, 4 2013/816/. 13
[36] D UCAS , L., D URMUS , A., L EPOINT, T., AND LYUBASHEVSKY,
[50] G UERON , S., AND S CHLIEKER , F. Speeding up R-LWE post-
V. Lattice signatures and bimodal Gaussians. In Advances
quantum key exchange. IACR Cryptology ePrint Archive report
in Cryptology – CRYPTO 2013 (2013), R. Canetti and J. A.
2016/467, 2016. https://eprint.iacr.org/2016/467. 10,
Garay, Eds., vol. 8042 of LNCS, Springer, pp. 40–56. https:
14
//eprint.iacr.org/2013/383/. 1, 4
[37] F LUHRER , S. Cryptanalysis of ring-LWE based key exchange [51] G ÜNEYSU , T., O DER , T., P ÖPPELMANN , T., AND S CHWABE ,
with key share reuse. IACR Cryptology ePrint Archive report P. Software speed records for lattice-based signatures. In
2016/085, 2016. http://eprint.iacr.org/2016/085. 4 Post-Quantum Cryptography (2013), P. Gaborit, Ed., vol. 7932
of LNCS, Springer, pp. 67–82. http://cryptojedi.org/
[38] F UJIOKA , A., S UZUKI , K., X AGAWA , K., AND YONEYAMA , papers/#lattisigns. 10, 13
K. Practical and post-quantum authenticated key exchange from
one-way secure key encapsulation mechanism. In Symposium on [52] H ANROT, G., P UJOL , X., AND S TEHLÉÉ , D. Terminating BKZ.
Information, Computer and Communications Security, ASIA CCS IACR Cryptology ePrint Archive report 2011/198, 2011. http:
2013 (2013), K. Chen, Q. Xie, W. Qiu, N. Li, and W. Tzeng, Eds., //eprint.iacr.org/2011/198/. 8
ACM, pp. 83–94. 3
[53] H OFFSTEIN , J., P IPHER , J., S CHANCK , J. M., S ILVERMAN ,
[39] G ALBRAITH , S. D. Space-efficient variants of cryptosystems J. H., AND W HYTE , W. Practical signatures from the partial
based on learning with errors, 2013. https://www.math. Fourier recovery problem. In Applied Cryptography and Net-
auckland.ac.nz/~sgal018/compact-LWE.pdf. 5 work Security (2014), I. Boureanu, P. Owesarski, and S. Vaude-
[40] G AMA , N., N GUYEN , P. Q., AND R EGEV, O. Lattice enu- nay, Eds., vol. 8479 of LNCS, Springer, pp. 476–493. https:
meration using extreme pruning. In Advances in Cryptology – //eprint.iacr.org/2013/757/. 1
EUROCRYPT 2010 (2010), H. Gilbert, Ed., vol. 6110 of LNCS, [54] H OFFSTEIN , J., P IPHER , J., S CHANCK , J. M., S ILVERMAN ,
Springer, pp. 257–278. http://www.iacr.org/archive/ J. H., W HYTE , W., AND Z HANG , Z. Choosing parameters
eurocrypt2010/66320257/66320257.pdf. 8 for NTRUEncrypt. IACR Cryptology ePrint Archive report
[41] G ARG , S., G ENTRY, C., AND H ALEVI , S. Candidate multilin- 2015/708, 2015. http://eprint.iacr.org/2015/708. 9, 10
ear maps from ideal lattices. In Advances in Cryptology – EU-
ROCRYPT 2013 (2013), vol. 7881 of LNCS, Springer, pp. 1–17. [55] H OFFSTEIN , J., P IPHER , J., AND S ILVERMAN , J. H. NTRU:
https://eprint.iacr.org/2012/610. 8 a ring-based public key cryptosystem. In Algorithmic num-
ber theory (1998), J. P. Buhler, Ed., vol. 1423 of LNCS,
[42] G ENTLEMAN , W. M., AND S ANDE , G. Fast Fourier transforms: Springer, pp. 267–288. https://www.securityinnovation.
for fun and profit. In Fall Joint Computer Conference (1966), com/uploads/Crypto/ANTS97.ps.gz. 1, 4
vol. 29 of AFIPS Proceedings, pp. 563–578. http://cis.rit.
edu/class/simg716/FFT_Fun_Profit.pdf. 11 [56] H OWGRAVE -G RAHAM , N. A hybrid lattice-reduction and
meet-in-the-middle attack against NTRU. In Advances in
[43] G ENTRY, C. Fully homomorphic encryption using ideal
Cryptology – CRYPTO 2007, A. Menezes, Ed., vol. 4622 of
lattices. In STOC ’09 Proceedings of the forty-first an-
LNCS. Springer, 2007, pp. 150–169. http://www.iacr.org/
nual ACM symposium on Theory of computing (2009),
archive/crypto2007/46220150/46220150.pdf. 10
ACM, pp. 169–178. https://www.cs.cmu.edu/~odonnell/
hits09/gentry-homomorphic-encryption.pdf. 1 [57] K IRCHNER , P., AND F OUQUE , P.-A. An improved BKW al-
[44] G ENTRY, C., P EIKERT, C., AND VAIKUNTANATHAN , V. Trap- gorithm for LWE with applications to cryptography and lattices.
doors for hard lattices and new cryptographic constructions. In In Advances in Cryptology – CRYPTO 2015 (2015), R. Gennaro
STOC ’08 Proceedings of the fortieth annual ACM symposium and M. Robshaw, Eds., vol. 9215 of LNCS, Springer, pp. 43–62.
on Theory of computing (2008), ACM, pp. 197–206. https: https://eprint.iacr.org/2015/552/. 8, 18
//eprint.iacr.org/2007/432/. 6 [58] L AARHOVEN , T. Search problems in cryptography. PhD the-
[45] G ENTRY, C., AND S ZYDLO , M. Cryptanalysis of the revised sis, Eindhoven University of Technology, 2015. http://www.
NTRU signature scheme. In Advances in Cryptology – EURO- thijs.com/docs/phd-final.pdf. 8
CRYPT 2002 (2002), L. R. Knudsen, Ed., vol. 2332 of LNCS,
[59] L AARHOVEN , T. Sieving for shortest vectors in lattices using
Springer, pp. 299–320. https://www.iacr.org/archive/
angular locality-sensitive hashing. In Advances in Cryptology
eurocrypt2002/23320295/nsssign_short3.pdf. 8
– CRYPTO 2015 (2015), R. Gennaro and M. Robshaw, Eds.,
[46] G HOSH , S., AND K ATE , A. Post-quantum secure onion routing vol. 9216 of LNCS, Springer, pp. 3–22. https://eprint.
(future anonymity in today’s budget). IACR Cryptology ePrint iacr.org/2014/744/. 8
Archive report 2015/008, 2015. http://eprint.iacr.org/
2015/008. 3 [60] L AARHOVEN , T., M OSCA , M., AND VAN DE P OL , J.
Finding shortest lattice vectors faster using quantum search.
[47] G ÖTTERT, N., F ELLER , T., S CHNEIDER , M., B UCHMANN ,
Designs, Codes and Cryptography 77, 2 (2015), 375–400.
J. A., AND H USS , S. A. On the design of hardware building
https://eprint.iacr.org/2014/907/. 8
blocks for modern lattice-based encryption schemes. In Crypto-
graphic Hardware and Embedded Systems - CHES 2012 (2012), [61] L ANGLEY, A. TLS symmetric crypto. Blog post on im-
E. Prouff and P. Schaumont, Eds., vol. 7428 of LNCS, Springer, perialviolet.org, 2014. https://www.imperialviolet.org/
pp. 512–529. http://www.iacr.org/archive/ches2012/ 2014/02/27/tlssymmetriccrypto.html (accessed 2015-10-
74280511/74280511.pdf. 8, 9, 10 07). 12

16
[62] L ANGLEY, A., AND C HANG , W.-T. ChaCha20 and Poly1305 [77] NATIONAL I NSTITUTE OF S TANDARDS AND T ECHNOL -
based cipher suites for TLS: Internet draft. https://tools. OGY . Workshop on cybersecurity in a post-quantum
ietf.org/html/draft-agl-tls-chacha20poly1305-04 world, 2015. http://www.nist.gov/itl/csd/ct/
(accessed 2015-02-01). 12 post-quantum-crypto-workshop-2015.cfm. 1
[63] L ANGLEY, A., H AMBURG , M., AND T URNER , S. RFC 7748: [78] NATIONAL S ECURITY AGENCY. NSA suite B cryp-
Elliptic curves for security, 2016. https://www.rfc-editor. tography. https://www.nsa.gov/ia/programs/suiteb_
org/rfc/rfc7748.txt. 14 cryptography/, Updated on August 19, 2015. 1
[64] L ENSTRA , H. W., AND S ILVERBERG , A. Revisiting the Gentry- [79] N GUYEN , P. Q., AND V IDICK , T. Sieve algorithms for the short-
Szydlo algorithm. In Advances in Cryptology – CRYPTO 2014, est vector problem are practical. Journal of Mathematical Cryp-
J. A. Garay and R. Gennaro, Eds., vol. 8616 of LNCS. Springer, tology 2, 2 (2008), 181–207. ftp://ftp.di.ens.fr/pub/
2014, pp. 280–296. https://eprint.iacr.org/2014/430. users/pnguyen/JoMC08.pdf. 8
8
[80] N USSBAUMER , H. J. Fast polynomial transform algorithms for
[65] L EYS , J., G HYS , E., AND A LVAREZ , A. Dimensions,
digital convolution. IEEE Transactions on Acoustics, Speech and
2010. http://www.dimensions-math.org/ (accessed 2015-
Signal Processing 28, 2 (1980), 205–215. 14
10-19). 7, 19
[66] L INDNER , R., AND P EIKERT, C. Better key sizes (and at- [81] P EIKERT, C. Lattice cryptography for the Internet. In Post-
tacks) for LWE-based encryption. In Topics in Cryptology - CT- Quantum Cryptography (2014), M. Mosca, Ed., vol. 8772 of
RSA 2011 (2011), A. Kiayias, Ed., vol. 6558 of LNCS, Springer, LNCS, Springer, pp. 197–219. http://web.eecs.umich.edu/
pp. 319–339. https://eprint.iacr.org/2010/613/. 3 ~cpeikert/pubs/suite.pdf. 1, 2, 3, 4, 19, 21
[67] L IU , Z., S EO , H., ROY, S. S., G ROSSSCHÄDL , J., K IM , H., [82] P ÖPPELMANN , T., D UCAS , L., AND G ÜNEYSU , T. Enhanced
AND V ERBAUWHEDE , I. Efficient Ring-LWE encryption on 8- lattice-based signatures on reconfigurable hardware. In Crypto-
bit AVR processors. In Cryptographic Hardware and Embed- graphic Hardware and Embedded Systems – CHES 2014 (2014),
ded Systems - CHES 2015 (2015), T. Güneysu and H. Hand- L. Batina and M. Robshaw, Eds., vol. 8731 of LNCS, Springer,
schuh, Eds., vol. 9293 of LNCS, Springer, pp. 663–682. https: pp. 353–370. https://eprint.iacr.org/2014/254/. 4
//eprint.iacr.org/2015/410/. 4, 8, 10 [83] P ÖPPELMANN , T., AND G ÜNEYSU , T. Towards practi-
[68] L ONGA , P., AND NAEHRIG , M. Speeding up the number theo- cal lattice-based public-key encryption on reconfigurable hard-
retic transform for faster ideal lattice-based cryptography. IACR ware. In Selected Areas in Cryptography – SAC 2013 (2013),
Cryptology ePrint Archive report 2016/504, 2016. https:// T. Lange, K. Lauter, and P. Lisoněk, Eds., vol. 8282 of LNCS,
eprint.iacr.org/2016/504. 14 Springer, pp. 68–85. https://www.ei.rub.de/media/sh/
[69] LYUBASHEVSKY, V. Lattice signatures without trapdoors. veroeffentlichungen/2013/08/14/lwe_encrypt.pdf. 2,
In Advances in Cryptology – EUROCRYPT 2012 (2012), 6, 8, 10, 11
D. Pointcheval and T. Johansson, Eds., vol. 7237 of LNCS, [84] R EGEV, O. On lattices, learning with errors, random linear codes,
Springer, pp. 738–755. https://eprint.iacr.org/2011/ and cryptography. Journal of the ACM 56, 6 (2009), 34. http:
537/. 6 //www.cims.nyu.edu/~regev/papers/qcrypto.pdf. 6
[70] LYUBASHEVSKY, V., P EIKERT, C., AND R EGEV, O. On
[85] R ÉNYI , A. On measures of entropy and information. In Fourth
ideal lattices and learning with errors over rings. In Advances
Berkeley symposium on mathematical statistics and probability
in Cryptology – EUROCRYPT 2010 (2010), H. Gilbert, Ed.,
(1961), vol. 1, pp. 547–561. http://projecteuclid.org/
vol. 6110 of LNCS, Springer, pp. 1–23. http://www.di.ens.
euclid.bsmsp/1200512181. 18
fr/~lyubash/papers/ringLWE.pdf. 3
[71] LYUBASHEVSKY, V., P EIKERT, C., AND R EGEV, O. On ideal [86] ROY, S. S., V ERCAUTEREN , F., M ENTENS , N., C HEN , D. D.,
AND V ERBAUWHEDE , I. Compact Ring-LWE cryptoproces-
lattices and learning with errors over rings. Journal of the ACM
(JACM) 60, 6 (2013), 43:1–43:35. http://www.cims.nyu. sor. In Cryptographic Hardware and Embedded Systems – CHES
edu/~regev/papers/ideal-lwe.pdf. 1, 6 2014 (2014), L. Batina and M. Robshaw, Eds., vol. 8731 of
LNCS, Springer, pp. 371–391. https://eprint.iacr.org/
[72] M ELCHOR , C. A., BARRIER , J., F OUSSE , L., AND K ILLI - 2013/866/. 4, 6, 11
JIAN , M. XPIRe: Private information retrieval for everyone.
IACR Cryptology ePrint Archive report 2014/1025, 2014. http: [87] S CHNORR , C.-P., AND E UCHNER , M. Lattice basis re-
//eprint.iacr.org/2014/1025. 10 duction: improved practical algorithms and solving sub-
set sum problems. Mathematical programming 66, 1-
[73] M ICCIANCIO , D., AND VOULGARIS , P. Faster exponential
3 (1994), 181–199. http://www.csie.nuk.edu.tw/
time algorithms for the shortest vector problem. In SODA ’10
~cychen/Lattices/Lattice%20Basis%20Reduction_
Proceedings of the twenty-first annual ACM-SIAM symposium
%20Improved%20Practical%20Algorithms%20and%
on Discrete Algorithms (2010), SIAM, pp. 1468–1480. http:
20Solving%20Subset%20Sum%20Problems.pdf. 8
//dl.acm.org/citation.cfm?id=1873720. 8
[74] M ONTANARO , A. Quantum walk speedup of backtracking al- [88] S TEHLÉ , D., AND S TEINFELD , R. Making NTRU as secure as
gorithms. arXiv preprint arXiv:1509.02374, 2015. http:// worst-case problems over ideal lattices. In Advances in Cryptol-
arxiv.org/pdf/1509.02374v2. 8 ogy – EUROCRYPT 2011 (2011), K. G. Paterson, Ed., vol. 6632
of LNCS, Springer, pp. 27–47. http://www.iacr.org/
[75] M ONTGOMERY, P. L. Modular multiplication without archive/eurocrypt2011/66320027/66320027.pdf. 1, 4
trial division. Mathematics of Computation 44, 170
(1985), 519–521. http://www.ams.org/journals/ [89] Tor project: Anonymity online. https://www.torproject.
mcom/1985-44-170/S0025-5718-1985-0777282-X/ org/. 3
S0025-5718-1985-0777282-X.pdf. 11 [90] V ERSHYNIN , R. Introduction to the non-asymptotic analysis of
[76] NATIONAL I NSTITUTE OF S TANDARDS AND T ECHNOLOGY. random matrices. In Compressed sensing. Cambridge University
FIPS PUB 202 – SHA-3 standard: Permutation-based hash and Press, 2012, pp. 210–268. http://www-personal.umich.
extendable-output functions, 2015. http://nvlpubs.nist. edu/~romanv/papers/non-asymptotic-rmt-plain.pdf.
gov/nistpubs/FIPS/NIST.FIPS.202.pdf. 5, 10 20

17
[91] Z HANG , J., Z HANG , Z., D ING , J., S NOOK , M., AND DAGDE - by a real a > 1, and defined for two distributions P, Q by:
LEN , Ö. Authenticated key exchange from ideal lattices. In Ad-
vances in Cryptology – EUROCRYPT 2015 (2015), E. Oswald
! 1
a−1
and M. Fischlin, Eds., vol. 9057 of LNCS, Springer, pp. 719–751. P(x)a
Ra (PkQ) = ∑ Q(x)a−1 .
https://eprint.iacr.org/2014/589/. 3
x∈Supp(P)

It is multiplicative: if P, P0 are independents, and


A JAR JAR Q, Q0 are also independents, then Ra (P × P0 kQ × Q0 ) ≤
Ra (PkQ) · Ra (P0 kQ0 ). Finally, Rényi divergence relates
As discussed in Section 6, it is also possible to instantiate the probabilities of the same event E under two different
our scheme with dimension n = 512, modulus q = 12289 distributions P and Q:
and noise parameter k = 24, to obtain a decent security Q(E) ≥ P(E)a/(a−1) /Ra (P||Q).
level against currently known attacks: our analysis leads
to a claim of 118-bits of post quantum security. Most For our argument, recall that because the final shared
of the scheme is similar, except that one should use D̃2 key µ is obtained through hashing as µ←SHA3-256(ν)
as reconciliation lattice, and the provable failure bound before being used, then, in the random oracle model
drops to 2−55 . It is quite likely that a more aggressive (ROM), any successful attacker must recover ν exactly.
analysis would lead to a claim of 128 bits of security We call this event E. We also define ξ topbe the √rounded
against the best known attacks. Gaussian distribution of√parameter σ = k/2 = 8, that
The performance of almost all computations in is the distribution of b 8 · xe where x follows the stan-
N EW H OPE scales linearly in n (except for the NTT, dard normal distribution.
which scales quasi-linear and a small constant overhead, A simple script (scripts/Renyi.py) computes
for example, for reading a seed from /dev/urandom). R9 (ψ16 kξ ) ≈ 1.00063. Yet because 5n = 5120 sam-
Also, noise sampling scales linearly in n and k, so JAR - ples are used per instance of the protocol, we need to
JAR with k = 24 is expected to take 3/4 of the time for consider the divergence R9 (PkQ) = R9 (ψ16 , ξ )5n ≤ 26
where P = ψ16 5n and Q = ξ 5n . We conclude as follows.
sampling of a noise polynomial compared to N EW H OPE.
Message sizes scale linearly in n (except for the constant- The choice a = 9 is rather arbitrary but seemed a good
size 32-byte seed). The performance of JAR JAR is thus trade-off between the coefficient 1/Ra (ψ16 kξ ) and the
expected to be about a factor of 2 better in terms of size exponent a/(a − 1). This reduction is provided as a
and slightly less than a factor of 2 in terms of speed. safeguard: switching from Gaussian to binomial distri-
We confirmed this by adapting our implementations of butions can not dramatically decrease the security of the
N EW H OPE to the JAR JAR parameters. The C reference scheme. With practicality in mind, we will simply ignore
implementation of JAR JAR takes 167 102 Haswell cycles the loss factor induced by the above reduction, since the
for the key generation on the server, 234 268 Haswell best-known attacks against LWE do not exploit the struc-
cycles for the key generation and joint-key computation ture of the error distribution, and seem to depend only
on the client, and 45 712 Haswell cycles for the joint- on the standard deviation of the error (except in extreme
key computation on the server. The AVX2 optimized cases [6, 57]).
implementation of JAR JAR takes 71 608 Haswell cycles
for the key generation on the server, 84 316 Haswell cy- C Details on Reconciliation
cles for the key generation and joint-key computation on
the client, and 12 944 Haswell cycles for the joint-key In this section, we provide more technical details on
computation on the server. As stated before, we do not the recovery mechanism involving the lattice D̃4 . For
recommend to use JAR JAR, but only provide these num- reader’s convenience, some details are repeated from
bers as a target for comparison to other more aggressive Section 5.
RLWE cryptosystems in the literature.
Splitting for recovery. By S = Z[Y ]/(Y 4 + 1) we de-
note the 8-th cyclotomic ring, having rank 4 over Z. Re-
call that our full ring is R = Z[X]/(X n + 1) for n = 1024,
B Proof of Theorem 4.1
we can therefore view S as a subring of R by identi-
fying Y = X 256 . An element of the full ring R can be
0 ) ∈ S n/4 such that
identified to a vector ( f00 , . . . , f255
A simple security reduction to rounded Gaus-
sians. In [9], Bai et al. identify Rényi divergence as a f (X) = f00 (X 256 ) + X f10 (X 256 ) + · · · + X 255 f255
0
(X 256 ).
powerful tool to improve or generalize security reduc-
tions in lattice-based cryptography. We review the key In other words, the coefficients of fi0 are the coefficients
properties. The Rényi divergence [9, 85] is parametrized fi , fi+256 , fi+512 , fi+768 .

18
The lattice D̃4 . One may construct the lattice D̃4 as two When we want to decode to D̃4 /Z4 rather than to D̃4 ,
shifted copies of Z4 using a glue vector g: the 8 inequalities given by type-A vectors are irrelevant
  since those vectors belong to Z4 . It follows that:
1 1 1 1
D̃4 = Z4 ∪ g + Z4 where gt = , , , . Lemma C.1 For any k ∈ {0, 1} and any e ∈ R4 such that
2 2 2 2
kek1 < 1, we have Decode(kg + e) = k.
We recall that D̃4 provides the densest lattice sphere
packing in dimension 4 [30], this suggest it is opti- C.1 Reconciliation
mal for error correction purposes. The Voronoi cell
V of D̃4 is the icositetrachoron [65] (a.k.a. the 24- We define the following r-bit reconciliation function:
cell, the convex regular 4-polytope with 24 octahedral  r
2

cells) and the Voronoi relevant vectors of two types: HelpRec(x; b) = CVPD̃4 (x + bg) mod 2r ,
q
8 type-A vectors (±1, 0, 0, 0), (0, ±1, 0, 0), (0, 0, ±1, 0),
(0, 0, 0, ±1), and 16 type-B vectors (± 21 , ± 12 , ± 21 , ± 12 ). where b ∈ {0, 1} is a uniformly chosen random bit. This
The natural condition to correct decoding in D̃4 should random vector is equivalent to the “doubling” trick of
therefore be e ∈ V , and this can be expressed as he, vi ≤ Peikert [81]. Indeed, because q is odd, it is not possible
1/2 for all Voronoi relevant vectors v. Interestingly, to map deterministically the uniform distribution from
those 24-linear inequalities can be split as kek1 ≤ 1 (pro- Z4q to Z2 , which necessary results in a final bias.
viding the 16 inequalities for the type-B vectors) and
kek∞ ≤ 1/2 (providing the 8 inequalities for the type-A Lemma C.2 Assume r ≥ 1 and q ≥ 9. For any x ∈ Z4q ,
vectors). In other words, the 24-cell V is the intersec- set r := HelpRec(x) ∈ Z42r . Then, 1q x − 21r Br mod 1 is
tion of an `1 -ball (an hexadecachoron) and an `∞ -ball (a close to a point of D̃4 /Z4 , precisely, for α = 1/2r + 2/q:
tessaract).
As our basis for D̃4 , we will choose B = (u0 , u1 , u2 , g) 1 1 1 1
x− r Br ∈ αV +Z4 or x− r Br ∈ g+αV +Z4 .
where ui are the canonical basis vectors of Z4 . The q 2 q 2
construction of D̃4 with a glue vector gives a simple
and efficient algorithm for finding closest vectors in Additionally, for x uniformly chosen in Z4q we have
D̃4 . Note that we have u3 = −u0 − u1 − u2 + 2g = Decode( q1 x− 21r Br) is uniform in {0, 1} and independent
B · (−1, −1, −1, 2)t . of r.

Proof One can easily check the correctness of the


Algorithm 1 (restated) CVPD̃4 (x ∈ R4 ) CVPD̃4 algorithm: for any y, y − B · CVPD̃4 (y) ∈ V . To
Ensure: An integer vector z such that Bz is a closest conclude with the first property, it remains to note that
vector to x: x − Bz ∈ V g ∈ 2V , and that V is convex.
1: v0 ←bxe For the second property, we show that there is a per-
2: v1 ←bx − ge mutation π : (x, b) 7→ (x0 , b0 ) of Z4q × Z2 , such that, forall
3: k←(kx − v0 k1 < 1) ? 0 : 1 (x, b) it holds that:
4: (v0 , v1 , v2 , v3 )t ←vk
HelpRec(x; b) = HelpRec(x0 ; b0 ) (= r) (3)
5: return (v0 , v1 , v2 , k)t + v3 · (−1, −1, −1, 2)t    
1 1 1 0 1
Decode x − r Br = Decode x − r Br ⊕ 1
q 2 q 2
Decoding in D̃4 /Z4 . Because u0 , u1 , u2 and 2g belong (4)
to Z4 , a vector in D̃4 /Z4 is simply given by the parity where ⊕ denotes the xor operation. We construct π
of its last coordinate in base B. This gives an even sim- as follows: set b0 := b ⊕ 1 and x0 = x + (b − b0 + q)g
pler algorithm to encode and decode a bit in D̃4 /Z4 . The mod q. Note that b − b0 + q is always even so (b −
encoding is given by Encode(k ∈ {0, 1}) = kg and the b0 + q)g is always well defined in Zq (recall that g =
decoding is given below. (1/2, 1/2, 1/2, 1/2)t ). It follows straightforwardly that
2r 2r 0 0 r r
q (x + bg) − q (x + b g) = −2 g mod 2 . Since g ∈ D̃4 ,
Algorithm 2 (restated) Decode(x ∈ R4 /Z4 ) condition (3) holds. For condition (4), notice that:
Ensure: A bit k such that kg is a closest vector to x+Z4 : 1 1
x − kg ∈ V + Z4 x − r Br ∈ kg + αV mod 1
q 2
1: v = x − bxe 1 1
2: return 0 if kvk1 ≤ 1 and 1 otherwise =⇒ x0 − r Br ∈ (k ⊕ 1)g + α 0 V mod 1
q 2

19
for α 0 = α + 2/q. Because r ≥ 1 and q ≥ 9, we have Definition A centered distribution φ over R is said to be
α 0 = 1/2r + 4/q < 1, and remembering that e ∈ V ⇒ σ -subgaussian if its moment-generating function verifies
kek1 ≤ 1 (inequalities for type-B vectors), one concludes EX ←φ [exp(tX )] ≤ exp(t 2 σ 2 /2).
by Lemma C.1.
A special case of Chernoff-Cramer bound follows by
It remains to define Rec(x, r) = Decode( q1 x − 21r Br) to choosing t appropriately.
describe a 1-bit-out-of-4-dimensions reconciliation pro-
tocol (Protocol 4). Those functions are extended to 256- Lemma D.2 (Adapted from [90]) If x has indepen-
bits out of 1024-dimensions by the splitting described at dently chosen coordinates from φ , a σ -subgaussian dis-
the beginning of this section. tribution, then, for any vector v ∈ Rn , except with prob-
ability less than exp(−τ 2 /2), we have:
Alice Bob
hx, vi ≤ kvkσ τ.
x0 ∈ Z4q x0≈x x ∈ Z4q
r
←−−−− r←HelpRec(x) ∈ Z42r The
pcentered binomial distribution ψk of parameter
k0 ←Rec(x0 , r) k←Rec(x, r) k is k/2-subgaussian. p This is established from the
0
fact that b0 − b0 is 1/2-subgaussian (which is easy
Protocol 4: Reconciliation protocol in qD̃4 /qZ4 . to check), and by Euclidean additivity of k independent
subgaussian variables. Therefore, √the binomial distribu-
Lemma C.3 If kx − x0 k1 < (1 − 1/2r ) · q − 2, then by tion used (of parameter k = 16) is 8-subgaussian.
the above protocol 4 k = k0 . Additionally, if x is uniform, We here propose a rather tight tail-bound on the error
then k is uniform independently of r. term. We recall that the difference d in the agreed key
before key reconciliation is d = es0 − e0 s + e00 . We wish
Fixed-point implementation. One remarks that, while to bound kdi0 k1 for all i ≤ 255, where the di0 ∈ S form
we described our algorithm in R for readability, floating- the decomposition of d described in Section 5. Note that,
point-arithmetic is not required in practice. Indeed, all seen as a vector over R, we have
computation can be performed using integer arithmetic
kxk1 = maxhx, yi,
modulo 2r q. Our parameters (r = 2, q = 12289) are such y
that 2r q < 216 , which offers a good setting also for small
embedded devices. where y ranges over {±1}4 . We also remark that for
a, b ∈ R, one may rewrite (ab)i ∈ S
D Analysis of the failure probability 255
(ab)i = ∑ Y pi a j b(i− j) mod 256 ,
To proceed with the task of bounding—as tightly as j=0
possible—the failure probability, we rely on the notion
where the powers pi are integers. This allows to rewrite
of moments of a distribution and of subgaussian random
variables [90]. We recall that the moment-generating 255
function of a real random variable X is defined as fol- k(es0 − e0 s)i k1 = max ∑ ± he j , s̄0i− j yi + he0j , s̄i− j yi

y
lows: j=0
MX (t) := E[exp(t(X − E[X ]))]. (5)
where y ∈ S ranges over all polynomials with ±1 coef-
We extend the definition to distributions over R: Mφ := ficients and ¯· denotes the complex conjugation Y 7→ −Y 3
MX where X ←φ . Note that Mφ (t) is not necessary over S . Note that the distribution of si , s0i are invariant
finite for all t, but it is the case if the support of φ is by ¯· and by multiplication by −1.
bounded. We also recall that if X and Y are indepen-
dent, then the moment-generating functions verify the Lemma D.3 Let s, s0 ∈ S be drawn with independent
identity MX +Y (t) = MX (t) · MY (t). coefficients from ψ16 , then, except with probability 2−64 ,
the vectors vy = (s0 y, . . . , s255 y, s00 y, . . . , s0255 y) ∈ Z2048
Theorem D.1 (Chernoff-Cramer inequality) Let φ be verify simultaneously kvy k22 ≤ 102500 for all y ∈ S with
a distribution over R and let X1 . . . Xn be i.i.d. random ±1 coefficients.
variables of law φ , with average µ. Then, for any t such
that Mφ (t) < ∞ it holds that Proof Let us first remark that kvk22 = ∑511 2
i=0 ksi yk2 where
" # the si ’ss are i.i.d. random variables following distribu-
n 4 . Because the support of ψ 4 is reasonably small
P ∑ Xi ≥ nµ + β ≤ exp −βt + n ln(Mφ (t)) .
 tion ψ16 16
i=1 (of size (2 · 16 + 1)4 ≈ 220 ), we numerically compute

20
the distribution ϕy : ksyk22 where s ← ψ16 4 for each y Consider the two-dimensional example with q = 9 de-
(see scripts/failure.py). Note that such numerical picted in Figure 3. All possible values of a vector are
computation of the probability density function does not depicted with red or black dots. All red dots (in the grey
raise numerical stability concern, because it involves a Voronoi cell) are mapped to 1 in the key bit; the black
depth-2 circuit of multiplication and addition of positive dots are mapped to a zero. There is a total of 9 · 9 = 81
reals: the relative error growth remain at most quadratic dots, 40 red ones, and 41 black ones. This obviously
in the length of the computation. leads to a biased key. To address this we use one ran-
1 1
From there, one may compute µ = EX ←ϕy [X ] and dom bit to determine whether we add the vector ( 2q , 2q )
Mϕy (t) (similarly this computation has polynomial rela- before running error reconciliation. For most dots this
tive error growth assuming exp is computed with con- makes no difference; they stay in the same Voronoi cell
stant relative error growth). and the “directed noise” is negligible for larger values
We apply Chernoff-Cramer inequality with parameters of q. However, some dots right at the border are moved
n = 512, nµ +β = 102500 and t = 0.0055, and obtain the to another Voronoi cell with probability 12 (depending on
desired inequality for each given y, except with probabil- the value of the random bit). In the example in Figure 3
ity at most 2−68.56 . We conclude by union-bound over we see that 72 dots (36 red and 36 black ones) remain
the 24 choices of y. in their Voronoi cell; the other 9 dots (4 red and 5 black
ones) change Voronoi cell with probability 21 which pre-
Corollary D.4 For se, e0 , e00 , s, s0 ∈ R drawn with inde- cisely eliminates the bias in the key.
pendent coefficients according to ψ16 , except with proba-
bility at most 2−61 we have simultaneously for all i ≤ 255
that b b b b b b b b b

k(e0 s + es0 + e00 )i k1 ≤ 9210 < b3q/4c − 2 = 9214. b b b b b b b b b

Proof First, we know that k(e00 )i k1 ≤ 4 · 16 = 64 for all b b b b b b b b b

i’s, because the support of ψ is [−16, 16]. Now, for each


i, write b b b b b b b b b

k(e0 s + es0 )i k1 = maxhvi,y , e+ i


y b b b b b b b b b

according to Equation 5, where vy depends on s, s0 ,and


e+ is a permutation of the 2048 coefficients of e and b b b b b b b b b

e0 , each of them drawn independently from ψ16 . By


b b b b b b b b b
Lemma D.3 kv0,y k22 ≤ 102500 for all y with probability
less that 2−64 . Because vi,y is equal to v0,y up to a signed
b b b b b b b b b

permutation of the coefficient, we have kvi,y k22 ≤ 102500


for all i, y. b b b b b b b b b

Now, for each i, y, we apply the tail bound of


Lemma D.2 with τ p = 10.1 knowing that ψ16 is σ - Figure 3: Possible values of a vector in Z9 × Z9 , their
subgaussian for σ = 16/2, and we obtain, except with mapping to a zero or one bit, and the effect of blurring.
probability at most 2 · 2−73.5 that

|hvi,y , ei| ≤ 9146.


F Constant-time sampling of a
By union bound, we conclude that the claimed in-
For most applications, variable-time sampling of the
equality holds except with probability less than 2−64 +
public parameter a is not a problem. This is simply be-
2 · 8 · 256 · 2−73.5 ≤ 2−61 .
cause timing only leaks information that is part of the
public messages. However, in anonymity networks like
E A small example of edge blurring Tor, the situation is slightly different. In Tor, as part
of the circuit establishment, a client performs key ex-
We illustrate what it means to “blur the edges” in the rec- changes with three nodes. The first of these three nodes
onciliation mechanism (cf. Section 5). This is an adap- (the entry or guard node) knows the identity of the client,
tation of the 1-dimensional randomized doubling trick of but already the second node does not; all communica-
Peikert [81]. Formally, the proof that this process results tion, including the key exchange, with the second node
in an unbiased key agreement is provided in Lemma C.2. is performed through the guard node. If the attacker has

21
control over the second (or third) node of a Tor circuit
and can run a timing attack on the client machine, then
the attacker can check whether the timing information
from the sampling of a “matches” the public parame-
ter received from that client and thus gain information
about whether this is the client currently establishing a
circuit. In other words, learning timing information does
not break security of the key exchange, but it may break
(or at least reduce) anonymity of the client.
For such scenarios we propose an efficient constant-
time method to generate the public parameter a. This
constant-time approach is incompatible with the straight-
forward approach described in Section 7; applications
have to decide what approach to use and then deploy it
on all servers and clients. The constant-time approach
works as follows:

1. Sample 16 blocks (2688 bytes) of SHAKE-128 out-


put from the 32-byte seed. Consider this output an
84 × 16-matrix of 16-bit unsigned little-endian inte-
gers ai, j .
2. In each of 16 the columns of this matrix, move all
entries that are > 5q to the end without changing the
order of the other elements (see below).
3. Check whether a63, j is smaller than 5q for j =
0, . . . , 15. In the extremely unlikely case that this
does not hold, “squeeze” another 16 blocks of of
output from SHAKE-128 and go back to step 2.
4. Write the entries of the matrix as coefficients to the
polynomial a, where a0,0 is the constant term, a0,1
is the coefficient of X, a0,2 is the coefficient of X 2
and so on, a1,0 is the coefficient of X 15 and so on
and finally a63,15 is the coefficient of X 1023 .

The interesting part of this algorithm for constant-time


implementations is the “move entries that are > 5q to the
end” step. This can efficiently be achieved by using a
sorting network like Batcher sort [11], where the usual
compare-and-swap operator is replaced by an operator
that swaps two elements if the one at the lower posi-
tion is ≥ 5q. As the columns are “sorted” independently,
the sorting network can be computed 16 times in paral-
lel, which is particularly efficient using the AVX2 vector
instructions. We implemented this approach in AVX2
and noticed only a very small slowdown compared to the
simpler approach described in Section 7: the median of
10 000 000 times sampling a is 37 292 cycles (average:
37 310) on the same Haswell CPU that we used for all
other benchmarks. For the reference implementation the
slowdown is more noticeable: the median of 10 000 000
times sampling a is 96 552 cycles (average: 96 562).

22

You might also like