
Crypto Corner

Editors: Peter Gutmann, pgut001@cs.auckland.ac.nz | David Naccache, david.naccache@ens.fr | Charles C. Palmer, ccpalmer@us.ibm.com

Analysis of a Hardware Security Module's High-Availability Setting
Benedikt Köppel and Stephan Neuhaus | ETH Zurich

Hardware security modules (HSMs), also called cryptographic accelerators, enable secure
cryptographic-key management and fast cryptographic operations.
They do the former by keeping
cryptographic-key material in special nonvolatile memory designed
to erase its content when it's tampered with. They do the latter with
circuits that facilitate arithmetic
with long integers, on which much
public-key cryptography is based.
Consequently, HSMs are used to
generate keys and certificates, store
keys, and sign documents.
Whereas having one HSM is
good, having more of them in coordination would be even better. If we
replicate the keys across HSMs, we
can use them for load balancing at
high-traffic times, and the system
can continue to operate even if several HSMs fail. We call a group of
cooperating HSMs a high-availability (HA) group.
At the same time, HA groups
introduce risks because coordinating separate HSMs requires
exchanging state that the HSM previously managed itself. In particular, HSMs must be synchronized
to ensure they have the same key
material. Without a trusted out-of-band channel between HSMs, this
situation will necessarily expose
that key material to potential eavesdroppers. Therefore, that material
must be protected.
Thus, we have the following security requirements, derived from the principle that the HA group should have the same security properties as a single HSM:

- Material that was originally confidential should remain confidential in the high-availability setup, even when it's shared among HSMs.
- When an HSM joins or leaves an HA group, a synchronization protocol should ensure that the group functions at least as well as before.

Here, we look at a specific HSM. (We don't disclose vendor and product details because we worked under a nondisclosure agreement.) This HSM is certified at the highest security level of FIPS (Federal Information Processing Standard) PUB 140-2, Security Requirements for Cryptographic Modules [1]. It has also undergone validation by the Cryptographic Module Validation Program (www.nist.gov/cmvp). Validation is guided by a security policy stating the security guarantees that the HSM aims to fulfill. Nevertheless, problems occur when the HSM is part of an HA group, that is, when it's in a high-availability setting.

1540-7993/13/$31.00 © 2013 IEEE
Copublished by the IEEE Computer and Reliability Societies

The HSM

The HSM is a one-rack-unit, 19-inch rack-mount appliance. It supports Public-Key Cryptography Standard #11 [2], Microsoft's CryptoAPI [3] and CryptoAPI Next Generation [4], and Java Cryptography Architecture [5] and Java Cryptography Extension [6]. A number of cryptographic tokens control the HSM's setup, operation, and backup.
To limit the effects of a compromise or theft of any of the tokens
and to discourage fraud and prevent error, the HSM is set up such
that any security-sensitive operation
needs at least two of these tokens.
The idea is to enable separation of
duties, or the four-eyes principle.
But because the tokens act as capabilities, they can, of course, be given
to the same person.

High-Availability Operation

The HSM's designers decided to handle HA group creation and management exclusively in software. The HSMs themselves operate independently, unaware they're part of an HA group. A client library handles state management and replication. As long as the library is loaded, state replication is automatic, even across multiple programs that reference the library. Unlike true distributed systems, which don't have a global view of the system's state, this mode of operation allows centralized control.
However, this potentially simpler operation comes at a cost. First,
it partially negates the high availability that comes from redundant
operation of several HSMs, creating
a single point of failure. If the client
library crashes, the entire HA group
is lost. Second, although operations
are potentially simpler than in a true
distributed setting, they still must be
carefully designed and implemented.
They must take into account all likely
failure scenarios and guarantee safe
operation even in unlikely scenarios.

Undeleting Keys
Normally, the library will correctly handle an HSM that drops out and later rejoins the group. It replays the operations that have been done on the group but not on the dropped-out HSM. Consider this sequence of events, with HSMs A and B:
1. A, B, and the library are online
and working.
2. B fails and drops out.
3. The application initiates a key
deletion, and the client library
instructs A to delete a key.
4. B rejoins the HA group.
5. The client tells B to delete the
same key that A deleted.
However, if the library crashes
while deleting a key, it won't handle
the case correctly. Consider this
sequence of events:
1. A, B, and the library are online
and working.
2. B fails and drops out.
3. The client instructs A to delete
a key.
4. The client crashes.
5. The client comes back up.
6. B rejoins the cluster.
7. The client extracts the same key
it deleted in step 3 from B and
reimports it into A.
So, the recovery protocol reinstates a key that should have been
deleted. This failure occurs because
the library doesn't keep a shared,
persistent transaction log. So, it
can't distinguish the case in which
it crashed before cloning a key from
B to A from the case in which it
crashed before deleting a key from
B that A had already deleted.
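The failing sequence above can be reproduced with a small simulation. This is only a sketch: the `Hsm` class and the `naive_resync` function are hypothetical stand-ins invented for illustration, not the vendor's client library, and the recovery rule (copy every key that any member holds to all members) is an assumption about what stateless recovery amounts to.

```python
class Hsm:
    """Toy stand-in for one HSM: a named key store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.keys = {}
        self.online = True


def naive_resync(members):
    """Stateless recovery: give every member the union of all keys.

    With no persistent log, "key not yet cloned to A" and "key already
    deleted on A" look identical, so both are treated as the former.
    """
    union = {}
    for hsm in members:
        union.update(hsm.keys)
    for hsm in members:
        hsm.keys = dict(union)


a, b = Hsm("A"), Hsm("B")
for hsm in (a, b):
    hsm.keys["signing-key"] = b"secret"   # step 1: both online and in sync

b.online = False                 # step 2: B fails and drops out
del a.keys["signing-key"]        # step 3: the client deletes the key on A
# steps 4-5: the client crashes and restarts, losing its in-memory state
b.online = True                  # step 6: B rejoins
naive_resync([a, b])             # step 7: B's copy is reimported into A

print("signing-key" in a.keys)   # the key that was deleted is back on A
```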
The simplest fix for the undeleted-key problem would be for
the library to implement a shared
transaction protocol that ensures
that actions are completed either
across the cluster or not at all. Such
protocols have been around for
a long time; one of the simplest
implementations would be for the
library to log each action when it
completes. Once an action has
successfully completed across all
HSMs in an HA group, the library logs a special "commit" message. If the library restarts after a crash, it looks at the log and simply repeats the latest actions that haven't been committed.
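A minimal sketch of such a log follows, under stated assumptions. The `TxLog` class, the JSON record layout, and the extra "begin" record (which tells recovery what the interrupted action was meant to do) are all inventions for illustration; the HSMs are modeled as plain dictionaries. For simplicity, recovery replays an uncommitted deletion on every HSM rather than skipping the steps already logged as completed.

```python
import json
import os
import tempfile


class TxLog:
    """Append-only log on disk, one JSON record per line (hypothetical)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record survive a crash

    def records(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def delete_key(log, hsms, tx_id, key_id, crash_after=None):
    """Delete key_id on every HSM, logging each completed step."""
    log.append({"tx": tx_id, "op": "delete", "key": key_id, "state": "begin"})
    for i, (name, store) in enumerate(sorted(hsms.items())):
        if crash_after is not None and i >= crash_after:
            raise RuntimeError("simulated library crash")
        store.pop(key_id, None)
        log.append({"tx": tx_id, "hsm": name, "state": "completed"})
    log.append({"tx": tx_id, "state": "commit"})


def recover(log, hsms):
    """Replay every deletion that began but was never committed."""
    begun, committed = {}, set()
    for rec in log.records():
        if rec["state"] == "begin":
            begun[rec["tx"]] = rec
        elif rec["state"] == "commit":
            committed.add(rec["tx"])
    for tx_id, rec in begun.items():
        if tx_id not in committed:
            for store in hsms.values():
                store.pop(rec["key"], None)
            log.append({"tx": tx_id, "state": "commit"})


hsms = {"A": {"k1": b"x"}, "B": {"k1": b"x"}}
log = TxLog(os.path.join(tempfile.mkdtemp(), "ha.log"))
try:
    delete_key(log, hsms, tx_id=1, key_id="k1", crash_after=1)
except RuntimeError:
    pass  # the library "restarts" here
recover(log, hsms)
```

Because the log is on disk, the restarted library can see that transaction 1 never committed and finishes the deletion instead of reimporting the key.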
If the library crashes during key deletion, this log would be truncated. Operations that have locally completed (operations for which there's a local step on some HSM for which the log reads "completed") don't need to be replayed. Operations that have started but not completed and operations that are missing must be replayed. For key deletion, this process is simple because it's idempotent; that is, it can execute several times without ill effects.
Recovery isn't necessarily as
simple for key creation, which
takes up some space on the HSM.
For example, if the library crashes
after key creation on an HSM but
before the log file can be updated,
the HSM will still have the key. It's
simple to check for this case by asking the HSM whether it has a key
or certificate containing the identifying information before trying to
reinitiate the operation.
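The check-before-retry idea can be sketched as follows. The `Hsm` class, its `has_key` and `create_key` methods, and the use of a label as the "identifying information" are hypothetical; a real HSM would be queried through its own interface.

```python
import secrets


class Hsm:
    """Toy HSM mapping a key label to key bytes (hypothetical)."""

    def __init__(self):
        self.keys = {}

    def has_key(self, label):
        return label in self.keys

    def create_key(self, label):
        if label in self.keys:
            raise ValueError(f"duplicate label {label!r}")
        self.keys[label] = secrets.token_bytes(32)


def replay_create(hsm, label):
    """Replay a logged key creation without creating a duplicate.

    Returns True if the key had to be created, False if it already existed.
    """
    if hsm.has_key(label):
        return False
    hsm.create_key(label)
    return True


hsm = Hsm()
hsm.create_key("tls-2013")   # creation succeeded, then the library crashed
created_again = replay_create(hsm, "tls-2013")   # replay after restart
```

The existence check turns key creation into an operation that is safe to replay, which is the property key deletion has for free.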

Stealing Secrets
Backups involve using two tokens to derive an encryption key. Secrets on one HSM are encrypted with this key and then sent to the second HSM, where they're decrypted using the same key. Secrets are thus always encrypted in transit.
Furthermore, the HSM's designers aimed to implement separation of duties by requiring at least two tokens to set up a successful backup. We'll call one the owner token, KO, and the other the backup token, KB. And it's true: you can't back up and restore using only one of the tokens; you need both. This was also supposed to hold true for network replication in an HA group. But fatally, only the backup token must be identical. In other words, the following scenario is possible:
1. An HA group with A and B is configured and operational. Both A and B are set up using KO and KB.
2. KB's legitimate owner buys HSM E and sets it up using KB and KE, the key that came with E.
3. KB's owner adds E to the HA group. (This requires either access to the HSM, that is, the HSM's administrative password, or access to an existing client application, that is, the application user password on the client machine.)
4. Synchronization of the HA group (now consisting of A, B, and E) occurs.
5. A and B wrap their secrets using a key derived from KB and export them to the client library, which forwards them to E.
6. E can decrypt the wrapped secrets because they're encrypted using key material from KB but not from KO. So, it doesn't matter that E was set up using KE instead of KO.
7. The owner of E and KB can now use the secrets on E because KE allows access to them. The secrets are no longer protected by the legitimate KO.
8. E's owner can now also make a backup of E using KB and KE.
In this scenario, E's owner won't have access to the plaintext versions of the secrets that were stored on A and B. However, E's owner now has access to a machine that will let him or her encrypt and, more crucially, sign any message using those secrets. If the illegally cloned HSM was, for example, VeriSign's, E's owner would now be able to sign certificates using VeriSign's root certificate authority key.
The workaround for this problem is to use another feature of the HSM, which can split KB into two parts such that HA group setup requires both shares. One part of the split token would go to the backup operator; the other part could go to KO's owner. This would require both token shares' owners to be physically present at backup and HA group setup and would effectively reinstate separation of duties.

Lessons Learned
From our experience with this HSM, we learned four lessons.

Risk Analysis Can't Anticipate Failures
Risk assessment methodologies make rational decisions about investment: the cost to prevent a certain risk must not exceed the cost when that risk occurs. To be effective, risk assessments must be fairly accurate, but of course, this is an exercise in making predictions about the future. (The dangers of which have been attributed variously to Niels Bohr and Yogi Berra.) We challenge any risk assessment methodology to come up with risk numbers before and after learning about these two flaws that are even close to being equal. This is because the HSM was apparently designed to anticipate the threat of unauthorized backups: its expected mode of operation should have made the attack impossible. However, such attacks are possible, and a close reading of the HSM's documentation reveals that owner tokens are never constrained to be equal for all HSMs in an HA group. However, this becomes apparent only after the flaw is uncovered. So, risk assessment that considers only the product's documentation without conducting tests of its own will simply never consider the Case of the Unauthorized HA Group Member.
The best anyone can do in these circumstances is guess. However, we'd be very surprised if those guesses weren't off by orders of magnitude, even if made by equally competent people in the field. We'd love to conduct a systematic study on this topic; if you're interested in participating, drop us a line.

Effective Workarounds Are Easy to Find
If you know you need them. The workarounds for the two flaws discussed here are hardly rocket science. They do increase operational costs somewhat, especially for setting up an HA group, but by and large, both solutions are fairly cheap. We therefore argue that it's worthwhile to subject security solutions to various tests before using them operationally, to uncover potential flaws and find effective workarounds.
We anticipate a seemingly reasonable response to this, which would be simply to use every available security mechanism in the hope that this shotgun approach will mitigate some unknown risks. However, turning all security features up to 11 in this way might come at the price of losing those features' usability. This is because those features' use is no longer motivated by a comprehensive analysis of security threats but by a blind, unjustifiable hope that some unknown threat might be mitigated.

Composing Systems Might Lead to Surprises
High availability can lead to the seeming paradox in which each device is secure individually but the high-availability group exhibits security problems. This is a corollary of the uncomposability of security: the composition of two secure systems might not be secure.
This means that you can't add
features such as high availability in
a slapdash fashion, relying on the
components' security to make the
entire system secure. Rather, you
must design, implement, and test
the system with the same care you'd
use for its components.
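The Case of the Unauthorized HA Group Member illustrates this composition failure, and a toy model makes it concrete. Everything below is an assumption made for illustration: the article does not describe the vendor's key derivation or wrapping format, so `derive_key`, `wrap`, and `unwrap` are hypothetical helpers built from a hash-based keystream and an HMAC tag, and they are not real-world cryptography. The second variant, which mixes KO into the wrapping key, is only a contrast case and is not the workaround described in this article.

```python
import hashlib
import hmac


def derive_key(*tokens):
    """Derive a 32-byte wrapping key from one or more token secrets."""
    h = hashlib.sha256()
    for t in tokens:
        h.update(len(t).to_bytes(4, "big") + t)
    return h.digest()


def _keystream(key, n):
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def wrap(key, secret):
    """Toy wrap: XOR with a hash keystream, then append an HMAC tag."""
    ct = bytes(a ^ b for a, b in zip(secret, _keystream(key, len(secret))))
    return ct + hmac.new(key, ct, hashlib.sha256).digest()


def unwrap(key, blob):
    ct, tag = blob[:-32], blob[-32:]
    expected = hmac.new(key, ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong wrapping key")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, len(ct))))


K_O, K_B, K_E = b"owner-token", b"backup-token", b"token-shipped-with-E"
root_key = b"CA root signing key"

# Replication as described in the scenario: the wrapping key uses K_B only.
blob = wrap(derive_key(K_B), root_key)
inside_e = unwrap(derive_key(K_B), blob)   # E was set up with K_B and K_E

# Contrast case: a wrapping key that also depends on the owner token.
blob2 = wrap(derive_key(K_O, K_B), root_key)
try:
    unwrap(derive_key(K_E, K_B), blob2)
    e_can_unwrap = True
except ValueError:
    e_can_unwrap = False
```

In the first variant the secret ends up inside E even though E never saw KO; in the contrast case the unwrap fails because E's owner token differs.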
What's so sad about our Case of
the Undeleted Key is that protocols
with the desired atomicity properties have been around for decades.

Systems Must Be Tested after Development
It isn't difficult to anticipate that a
client library might crash, so the
initial design should have taken this
into account. But even though that
didn't happen, library crash reports
from production use should have
found their way back to development. This should in turn have led
to the development and deployment of a more secure client protocol or at least a letter advising
customers of a workaround.
This shows again that security devices must be tested after development. Of course, such production tests have very different objectives than development tests have. Development tests try to assure developers that the device they're building works according to some (written or unwritten) specification. Production tests assure the company that the product still has all the features it was designed to have.

Neither of the two flaws we uncovered is probably fatal, if organizations develop workarounds for them.
For example, to solve the key
deletion problem in production,
when the client library lacks atomicity properties, have a second program keep a list of deleted keys and
then periodically check whether
they're truly gone from the cluster.
If a key turns out to have been successfully deleted, remove it from
the list. This solution is a hack, but it
gets most of the job done and significantly lowers the risk of using keys
that should have been deleted.
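One pass of such a checker might look like the sketch below. The `reconcile` function and the dictionary model of the cluster are hypothetical, and re-deleting a key that has reappeared (rather than only reporting it) is an assumption about what the second program should do.

```python
def reconcile(pending_deletions, cluster):
    """Run one pass of the deleted-key checker.

    pending_deletions: set of key ids the application asked to delete.
    cluster: dict mapping HSM name -> dict of key id -> key material.
    Returns the key ids that were still present and had to be removed.
    """
    resurrected = set()
    for key_id in sorted(pending_deletions):
        holders = [name for name, store in cluster.items() if key_id in store]
        for name in holders:
            del cluster[name][key_id]
        if holders:
            resurrected.add(key_id)
    # Keys that were absent everywhere are confirmed gone: drop them from
    # the list. Keys removed in this pass stay listed for one more check.
    pending_deletions.difference_update(pending_deletions - resurrected)
    return resurrected


cluster = {"A": {}, "B": {"old-key": b"x"}}
pending = {"old-key", "gone-key"}
first = reconcile(pending, cluster)    # removes old-key from B
second = reconcile(pending, cluster)   # confirms old-key is gone
```

A key leaves the list only after a pass finds it absent from every HSM, so a key that reappears after a faulty recovery is caught on the next pass.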

To solve the unauthorized HA group member problem, split the
backup key so that separation of
duties is restored when devices are
added to an HA group, as we mentioned before.
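The article does not say how the HSM splits a token, so the following is only a generic illustration of the idea: a two-out-of-two XOR split in which neither share alone reveals anything about the backup key. The function names and the 32-byte token length are assumptions.

```python
import secrets


def split_token(token):
    """Split token into two shares; each share alone is uniformly random."""
    share1 = secrets.token_bytes(len(token))
    share2 = bytes(a ^ b for a, b in zip(token, share1))
    return share1, share2


def join_shares(share1, share2):
    """Recombine both shares into the original token."""
    if len(share1) != len(share2):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(share1, share2))


k_b = secrets.token_bytes(32)                      # stand-in for the backup token
operator_share, owner_share = split_token(k_b)     # one share per person
rebuilt = join_shares(operator_share, owner_share)
```

With one share held by the backup operator and the other by the owner of the owner token, adding a device to the HA group again needs two people.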
Of course, for organizations to
apply these solutions, they must be
aware of these flaws.
References
1. A. Lee, M.E. Smid, and S.R. Snouffer, Security Requirements for Cryptographic Modules, FIPS (Federal Information Processing Standard) PUB 140-2, US Nat'l Inst. of Standards and Technology, May 2001; www.nist.gov/manuscript-publication-search.cfm?pub_id=902003.
2. PKCS #11 v2.30: Cryptographic Token Interface Standard, RSA Laboratories, Apr. 2009; www.rsa.com/rsalabs/node.asp?id=2133.
3. Cryptography Reference, Microsoft, 2013; http://msdn.microsoft.com/en-us/library/aa380256.aspx.
4. Cryptography API: Next Generation, Microsoft, 2013; http://msdn.microsoft.com/de-de/library/windows/desktop/aa376210%28v=vs.85%29.aspx.
5. Java Cryptography Architecture (JCA) Reference Guide for Java Platform Standard Edition 6, Oracle, 2011; http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html.
6. Archive: Java Cryptography Extension 1.2.2, Oracle, 2013; www.oracle.com/technetwork/java/jce-140292.html.

Benedikt Köppel is a master's student in electrical engineering at ETH Zurich. Contact him at webcontact@benediktkoeppel.ch.

Stephan Neuhaus is a senior researcher in the Communication Systems Group at ETH Zurich. Contact him at neuhaust@tik.ee.ethz.ch.

IEEE Security & Privacy

May/June 2013