ESnet CA Cloning

Michael Helm, Dhivakaran Muruganantham
Energy Sciences Network, Lawrence Berkeley National Laboratory
10 May 2009

Networking for the Future of Science

Provide High Availability CA/PKI Service

Why?
• ESnet requirement for all services
• Risk – Berkeley site
• Risk – Resource availability
– Personnel
– Replacement parts
• Prepare for additional services

5/10/2009

EUGridPMA


Provide High Availability CA/PKI Service
• What are some essential problems that we need to address?
• Replicate essential features
– CA signing key
– CA database of events
– CA web portal servers
– CRL publishing points
• Train additional operator staff
• Disperse all the above geographically
– Yes, the operators too
• Know what is going on in this dispersed infrastructure
• Address implicit security concerns
• Trust


5 Essentials
• Replication & geographic dispersion of key management
– netHSM components
• Replication & geographic dispersion of CA (signing initiator, UI, database)
– Red Hat CS cloning
• Replication & geographic dispersion of CRL publishing
– ANYCAST
– CRL distribution from the cloud, e.g. Amazon CloudFront
• Remote operator & geographic dispersion of operator
– Remote operator service
• Nagios monitoring system
– A new development for this project

DOEGrids CA with High Availability
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

DOEGrids CA with High Availability
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, Nagios, netHSM (×2), Remote Operators 1 and 2]

HSM Deployment Strategies
• Current model: HSM + host (PCI nShield device)
• Proposed model: netHSM only (netHSM device)
• Maybe later: hybrid HSM strategy
• Impath (intermodule path) protocol


Where Will DOEGrids CA Be?
[Map of US cities: Seattle, Boise, Chicago, Denver, Berkeley, LA, San Diego, Kansas City, Cleveland, Boston, Wash. DC, Albuquerque, El Paso, New York, Atlanta, Houston, with a "?" marker]

Stage 1 - netHSM enabled Local deployment
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

Stage 1a - netHSM enabled Local deployment
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

Stage 2 - Replication of Key Management
[Diagram. Labels: Berkeley (all sites), DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

Stage 3 - Replication of CA Instance
[Diagram. Labels: Berkeley (all sites), DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

Target Stage - Geographic dispersion
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2]

VLAN Architecture
[Diagram. Labels: Berkeley, Remote Locations 1–3, DOEGrids CA, DOEGrids CA clone, LDAP (×2), CRL delivery, netHSM (×2), Remote Operators 1 and 2, VLANs & ACL]

Replication of keys & management issues

• VLAN and ACL
– netHSM, CA machine & Remote Operator are in different VLANs
– ACL definitions for each VLAN, change control on ACLs
• nCipher firmware or driver upgrade depends on the Red Hat Certificate System package version
• Linux version of nCipher driver must be compiled with kernel header files, which means the Linux OS upgrade has
• Key Material Backup
– Backup per CA instance (DOEGrids, FusionGrid & ESnetSSL)
• Key Material Management
– Do not allow auto push or auto sync between client and netHSM (the effect is similar to Windows doing automatic download and install)
• Mostly we simply must document the correct process for future reference
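
Sketch (not ESnet's configuration): the VLAN separation and ACL change control can be modeled as a default-deny rule set plus a diff that is reviewed before rollout. The VLAN names, the port number, and the allowed flows are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    src_vlan: str
    dst_vlan: str
    port: int


# Hypothetical flows and placeholder port; real values come from site policy.
CURRENT_ACL: FrozenSet[Rule] = frozenset({
    Rule("ca", "nethsm", 9004),
    Rule("operator", "nethsm", 9004),
})


def permitted(src_vlan: str, dst_vlan: str, port: int,
              acl: FrozenSet[Rule] = CURRENT_ACL) -> bool:
    """Default deny: a flow passes only if an explicit rule allows it."""
    return Rule(src_vlan, dst_vlan, port) in acl


def acl_diff(old: FrozenSet[Rule],
             new: FrozenSet[Rule]) -> Tuple[List[Rule], List[Rule]]:
    """Return (added, removed) rules so a change can be reviewed and logged."""
    def key(rule: Rule):
        return (rule.src_vlan, rule.dst_vlan, rule.port)
    return sorted(new - old, key=key), sorted(old - new, key=key)
```

Reviewing the output of acl_diff before applying a new rule set is one simple way to put ACL changes under change control.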

CRL Publishing and CRL management

• Internal publishing
– Move publishing to crl.doegrids.org/crl.es.net
– Create multiple replicas
– Use ANYCAST to find nearest
• External publishing – put them in the cloud
– Amazon CloudFront – US, Asia, EU locations
– Probably works similarly to the ESnet internal strategy
• Other possibilities
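
Sketch (illustrative only): with multiple CRL replicas, one basic check is that every replica serves the same bytes as the CA's publishing point. The function below compares SHA-256 digests. Fetching the CRL bytes from each replica is left out, and the replica names in the usage note are hypothetical.

```python
import hashlib
from typing import Dict, List


def crl_digest(crl_bytes: bytes) -> str:
    """SHA-256 fingerprint of a CRL file as served."""
    return hashlib.sha256(crl_bytes).hexdigest()


def stale_replicas(reference: bytes, replicas: Dict[str, bytes]) -> List[str]:
    """Names of replicas whose CRL differs from the reference copy."""
    expected = crl_digest(reference)
    return sorted(
        name for name, body in replicas.items()
        if crl_digest(body) != expected
    )
```

Usage: stale_replicas(latest_crl, {"east": east_crl, "west": west_crl}) returns the names of replicas that have not yet picked up the latest CRL.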

A Brief Digression on ANYCAST
• What is it?
– Use the routing infrastructure to advertise the same IP address in different parts of the network
– Not usually (EVER) used on same network
– Routers route traffic to the topologically best destination
– When a particular node fails, router deletes route, and traffic moves elsewhere (~60 sec latency)
– Variety of routing protocols: typically BGP
• Why?
– Better performance and transparency than other HA-like solutions, for certain problems
– Contrast HA; round robin DNS
• ESnet usage
– zebra running on Linux maps ANYCAST address to a BGP peering for the interface = port + VLAN
– 2 config files maintain zebra and bgp on the host
– Watchdog can be set on host: when application fails, zebra tears down route
– Host also has UNICAST address for control/monitoring
– No appliances!
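
Sketch (an assumption, not the ESnet watchdog): the watchdog idea can be expressed as a small loop that withdraws the anycast route after several consecutive failed application checks and re-announces it on recovery. The check, announce, and withdraw callables are hypothetical hooks. How they actually talk to zebra is not shown.

```python
from typing import Callable


class AnycastWatchdog:
    """Withdraw the route after max_failures consecutive failed checks."""

    def __init__(self, check: Callable[[], bool],
                 announce: Callable[[], None],
                 withdraw: Callable[[], None],
                 max_failures: int = 3) -> None:
        self.check = check
        self.announce = announce
        self.withdraw = withdraw
        self.max_failures = max_failures
        self.failures = 0
        self.announced = False

    def tick(self) -> bool:
        """Run one health check; return True while the route is announced."""
        if self.check():
            self.failures = 0
            if not self.announced:
                self.announce()
                self.announced = True
        else:
            self.failures += 1
            if self.announced and self.failures >= self.max_failures:
                self.withdraw()
                self.announced = False
        return self.announced
```

Requiring several consecutive failures avoids flapping the route on a single missed check. The threshold of 3 is an arbitrary default.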

ANYCAST uses

• Typical application: root DNS name servers
• Suitable for stateless protocols and applications
• May be suitable for certain session-based protocols, where the client or server manages the session appropriately (e.g. sendmail)
• Suitable for serving web pages (e.g. CRLs)
• Perhaps we can get smarter about where and how many replicas we really need
• Doubtful for other CA applications – experimentation in future

Replication of CA instances
• Red Hat Certificate System Ver. 7.3 upgrade is required
– Vendor recommendation
– Upgrades are a serious undertaking
• Replication (Clone) process needs to be tested and the limitations should be studied
• ANYCAST usage
– Probably not appropriate for sessions
– May be useful for serving passive web pages and CSR submission
• Management of the backend database

Remote Operator
• Remote Operator Card and Workstation security
• Operator training
• List of personnel authorized to request this feature
• How to verify the authorized personnel?
– May require a protocol such as call back on a known phone number
• Audit trail for Remote Operator Card usage
– Retrieve the card – use – replace
• "Operator" does not necessarily mean a network operator (a combination of technician & help desk support staff)
– ESnet ATF/PKI/Security/system administration teams
– ESnet remote engineers (ESnet has 3 non-California-based remote locations)
– Trusted non-ESnet operations/engineering staff
• This is a key component of our disaster recovery strategy and will require updating the ESnet disaster recovery plan
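
Sketch (hypothetical, not an existing ESnet tool): the "retrieve – use – replace" cycle can be enforced as a two-state machine that logs every attempt, including rejected ones. State names, field layout, and the timestamp format are assumptions.

```python
from typing import List, Tuple

# Allowed (state, action) -> next state for the card custody cycle.
TRANSITIONS = {
    ("stored", "retrieve"): "out",
    ("out", "use"): "out",
    ("out", "replace"): "stored",
}


class CardAuditLog:
    """Tracks custody of one Remote Operator Card and records each action."""

    def __init__(self) -> None:
        self.state = "stored"
        self.entries: List[Tuple[str, str, str, str]] = []

    def record(self, operator: str, action: str, when: str) -> None:
        key = (self.state, action)
        if key not in TRANSITIONS:
            # Rejected attempts are logged too, then reported to the caller.
            self.entries.append((when, operator, action, "rejected"))
            raise ValueError(f"{action!r} not allowed while card is {self.state}")
        self.state = TRANSITIONS[key]
        self.entries.append((when, operator, action, "ok"))
```

Logging rejected attempts means that, for example, a "use" recorded while the card is supposedly stored shows up in the audit trail.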

NAGIOS
• Nagios is a new development for ESnet
– Considerable effort is being invested in remaking the network management / reporting / status services
– Replacement for various proprietary / commercial products
• Nagios has promise for CA cloning project
– Organize the status information of our systems and services
– Add specialized reporting about CA/PKI components
– Delivery of status information to wider range of personnel, including various remote ESnet operators
• Some obvious risks
– Nagios control features

Key Management - Progress
• VLAN Definition – Defined and Operational
– Networks and VLAN configurations for local/remote networks
• Development netHSM – Operational
• Development Test CA Transition – In progress
– Test netHSM usage configurations
• Development Remote Operator Card Machine – Operational

Replication of CA instance - Progress
• None
• Program:
– Set up a Development CA and clone with software tokens
– Test cases
  • Failover patterns
  • Effects on active sessions
  • Consistency of database
  • Can we lose a pending request?
– Set up a Development CA and clone using netHSM
– Repeat …
• We don't currently plan to replicate the CA database separately from the CA
• Other effects on publishing of certificates and CRLs
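
Sketch (illustrative only): the "consistency of database" and "can we lose a pending request?" test cases could compare the request tables of the master and the clone after a failover. The dictionaries mapping request ID to status are a hypothetical export format, not a Certificate System API.

```python
from typing import Dict, List


def compare_requests(master: Dict[str, str],
                     clone: Dict[str, str]) -> Dict[str, List[str]]:
    """Report request IDs that are lost, unexpected, or differ in status."""
    shared = set(master) & set(clone)
    return {
        "missing_on_clone": sorted(set(master) - set(clone)),
        "extra_on_clone": sorted(set(clone) - set(master)),
        "status_mismatch": sorted(k for k in shared if master[k] != clone[k]),
    }
```

An empty "missing_on_clone" list after a forced failover would be evidence that no pending request was lost in that test run.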

Replication of CRL - Progress

• crl.doegrids.org is in use
• 80% of the CRL download traffic has been moved
• crl.doegrids.org will get 'ANYCAST' make-over soon
• Interior structure – follow ESnet website replicas at ESnet co-location/peering points around the US
• Looking at CloudFront
• What about the rest of IGTF?

Nagios progress

• Teams are building initial configurations and modules
• Considerable argument about security features
• Our expectation:
– System administration team will provide one or several passive reporting modules for various servers or classes of servers
– We will develop modules for our services where appropriate
• Experimentation needed
• Integration with operators
– We will rely on other ESnet services leading the way
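
Sketch (an assumption about what a service-specific module might look like): Nagios plugins report status through exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) plus one line of text. The CRL freshness check and its thresholds below are hypothetical.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_crl_age(hours_until_next_update: Optional[float],
                  warn_hours: float = 24.0,
                  crit_hours: float = 6.0) -> Tuple[int, str]:
    """Map the time left before a CRL's nextUpdate to a Nagios status."""
    if hours_until_next_update is None:
        return UNKNOWN, "CRL UNKNOWN - could not read nextUpdate"
    if hours_until_next_update <= crit_hours:
        return CRITICAL, f"CRL CRITICAL - {hours_until_next_update:.1f}h left"
    if hours_until_next_update <= warn_hours:
        return WARNING, f"CRL WARNING - {hours_until_next_update:.1f}h left"
    return OK, f"CRL OK - {hours_until_next_update:.1f}h left"
```

A wrapper script would print the message and exit with the returned code. Parsing nextUpdate out of the CRL is left to that wrapper.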

Risks & Mitigations
Risk: Network failure between the CA machine and the netHSM
Mitigation: Use netHSM as primary and internal card (nShield) as backup (or vice versa)

Risk: Security setup at Remote Locations may not be desirable, i.e. not suitable for a Certificate Authority
Mitigation: Deploy locally, using a virtual remote location. Identify and correct remote site issues, then deploy remotely
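
Sketch (illustrative only): the primary/backup mitigation amounts to trying one signing path and falling back to the other on a network error. The two signer callables are hypothetical stand-ins and do not represent the nCipher API.

```python
from typing import Callable, Tuple

Signer = Callable[[bytes], bytes]


def sign_with_failover(data: bytes, primary: Signer,
                       backup: Signer) -> Tuple[str, bytes]:
    """Try the primary HSM path; fall back to the backup on a network error."""
    try:
        return "primary", primary(data)
    except ConnectionError:
        # Only network-type failures trigger the fallback; other errors
        # propagate so they are not silently masked.
        return "backup", backup(data)
```

Returning which path was used lets the caller log every fallback, which matters for the audit trail of a CA signing key.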

What do we need to do to ensure that this is trustable?
