IBM InfoSphere BigInsights®

Data security best practices
A practical guide to implementing data encryption for InfoSphere BigInsights

Walid Rjaibi, Chief Security Architect for Information Management
Nisanth Simon, InfoSphere BigInsights Software Developer
Monty Wright, Senior Solutions Architect, Vormetric Data Security

Issued: June 2013

1. Introduction
2. Requirements for a data encryption solution
   2.1 Run-time component requirements
   2.2 Key management component requirements
3. Guardium data encryption architecture
4. Installing Guardium data encryption
5. Configuring encryption policies
   5.1 Creating a policy for encrypting existing data
   5.2 Creating a policy for encrypting new data
7. Conclusion
Further reading
Reviewers


1. Introduction
Encryption is the process of storing and transmitting data in a form that only its intended recipients can read and process. It is an effective way of protecting sensitive information as it is stored on media or transmitted through untrusted communication channels. Encryption is mandatory for complying with many government regulations and industry standards such as the Payment Card Industry Data Security Standard (PCI DSS).

In an encryption scheme, the data requiring protection (referred to as plaintext) is transformed into an unreadable form (referred to as ciphertext) by applying an encryption algorithm and an encryption key. Encryption keys are randomly generated using a key-generation algorithm. There are two main encryption schemes: symmetric encryption and asymmetric encryption. In symmetric encryption, the same key is used to encrypt and decrypt a given piece of data. The Advanced Encryption Standard (AES) is an example of a symmetric encryption scheme. In asymmetric encryption, data is encrypted using one key (usually referred to as the public key) and is decrypted using another key (usually referred to as the private key). The Rivest, Shamir, Adleman (RSA) algorithm is an example of an asymmetric encryption scheme. In practice, asymmetric and symmetric encryption schemes are often combined to offer an encryption solution. Generally, a symmetric algorithm is used to protect the actual data using some encryption key, and an asymmetric algorithm is used to protect that encryption key.

While Transport Layer Security (TLS) is widely accepted as the solution for protecting data in transit, no single solution has achieved similar status for protecting data at rest, although some solutions, such as the one described in this paper, are emerging as leaders in this area. This paper focuses on encryption for data at rest, specifically for data stored within IBM InfoSphere BigInsights Hadoop.

The rest of this paper is organized as follows. Section 2 reviews the requirements for a sound data encryption solution. Section 3 introduces IBM InfoSphere Guardium Data Encryption (GDE). Sections 4 and 5 describe how to install and configure GDE to protect data stored within IBM InfoSphere BigInsights Hadoop. Lastly, we present our concluding thoughts in section 7.

2. Requirements for a data encryption solution
An encryption solution for data at rest consists of two main components: a run-time component and a key management component. The run-time component is responsible for the efficient encryption and decryption of data blocks for which an encryption policy exists. Data blocks are typically protected with data encryption keys (DEKs) that are stored locally within the run-time component. For example, in a file system, the DEK may be stored together with the file metadata. A DEK is typically protected with a master key that is stored in the key management component. In some encryption solutions such as GDE, both data encryption keys and master keys are stored in the key management component and provided to the run-time component as needed, following a well-defined secure protocol.
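The key hierarchy described above can be sketched in a few lines of Python. This is an illustration only and not how GDE is implemented: the keystream_xor function is a toy cipher built from SHA-256 that merely stands in for a certified AES module, and all names (wrap, unwrap, master_key, dek) are hypothetical. The sketch shows a DEK protecting the data, a master key protecting the DEK, and a master key rotation that re-wraps the DEK without touching the encrypted data.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (NOT secure, stand-in for AES): XOR the data with
    SHA-256(key || nonce || counter) blocks. Applying it twice decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def wrap(master_key: bytes, dek: bytes):
    """Protect a DEK under a master key; returns (nonce, wrapped DEK)."""
    nonce = secrets.token_bytes(16)
    return nonce, keystream_xor(master_key, nonce, dek)


def unwrap(master_key: bytes, wrapped) -> bytes:
    """Recover the DEK from its wrapped form."""
    nonce, wrapped_dek = wrapped
    return keystream_xor(master_key, nonce, wrapped_dek)


# The master key lives in the key management component, the DEK protects data.
master_key = secrets.token_bytes(32)
dek = secrets.token_bytes(32)

plaintext = b"sensitive record stored in HDFS"
data_nonce = secrets.token_bytes(16)
ciphertext = keystream_xor(dek, data_nonce, plaintext)
wrapped_dek = wrap(master_key, dek)

# Master key rotation: only the wrapped DEK is rewritten, not the data.
new_master_key = secrets.token_bytes(32)
rewrapped_dek = wrap(new_master_key, unwrap(master_key, wrapped_dek))

recovered = keystream_xor(unwrap(new_master_key, rewrapped_dek),
                          data_nonce, ciphertext)
```

In this model, rotating the master key costs the same regardless of how much data the DEK protects, because only the small wrapped key is re-encrypted.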

2.1 Run-time component requirements
To comply with industry standards for encryption, a run-time encryption component should adhere to the following requirements:
• Use FIPS 140-2 level 1 certified encryption modules.
• Use NIST SP 800-131 compliant algorithms for encryption, hashing, and random number generation.
• Exchange data with the key management component over TLS after mutual authentication has been established.
• Provide a means for key rotation.
• Provide a means for encrypting database backups (for a database system).

Although not mandatory, the following are highly desirable properties of the run-time component:
• The ability to exploit recent innovations in hardware acceleration for cryptography, such as the AES-NI instructions on Intel processors.
• The ability to perform in-place encryption so that existing data can be handled in a non-intrusive way.
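On Linux, one way to check whether a node could benefit from hardware acceleration is to look for the aes CPU flag, which x86 processors with AES-NI report in /proc/cpuinfo. The following is a minimal sketch; the function names are ours, and a positive result says nothing about whether a given encryption product actually uses the instructions.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the line looks like: "flags\t\t: fpu vme ... aes ..."
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def has_aes_ni(cpuinfo_path="/proc/cpuinfo"):
    """True if the first 'flags' line in cpuinfo lists the 'aes' feature."""
    path = Path(cpuinfo_path)
    if not path.is_file():
        return False  # not Linux, or /proc is not available
    return "aes" in cpu_flags(path.read_text())


if __name__ == "__main__":
    print("AES-NI available:", has_aes_ni())
```

Running this on each node of the cluster gives a quick inventory of which hosts expose the instruction set.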

2.2 Key management component requirements
To comply with industry standards for encryption, a key management component should adhere to the following requirements:
• Support high availability so that access to data is not lost when the primary key management component becomes unavailable.
• Provide a means for key backup and recovery so that keys can be recovered after a crash or a major disruption.
• Enforce authentication and access control before returning the keys to the requester.
• Achieve FIPS 140-2 level 2 certification in order to meet the requirements of high-assurance environments such as those within government agencies.

Although not mandatory, the following is a highly desirable property of the key management component:
• Allow flexibility in authoring encryption policies (time of day, day of week, digital signature of executables, etc.).

3. Guardium data encryption architecture
InfoSphere Guardium Data Encryption is a comprehensive data protection solution that meets all the requirements outlined in section 2. It manages access control to files, directories, and executables, and provides strong encryption of file content. It consists of two main components (figure 1):

• Security server: The central point of administration for encryption, access control, and audit policies.
• File system agent: Provides encryption and access control services for data in online storage accessed by file systems.

When InfoSphere Guardium Data Encryption is used to protect a database system such as DB2 or Informix IDS, a backup agent is also provided. The backup agent integrates with the database system backup command to allow the generation of encrypted database backups. This ensures that the same data is consistently protected whether it is online or offline.

Figure 1: InfoSphere Guardium Data Encryption Architecture

An important distinction between InfoSphere Guardium Data Encryption and other solutions that offer encryption is how the encryption is performed. InfoSphere Guardium Data Encryption employs a technique in which the file metadata is left in clear text (unencrypted) while the file content is encrypted. This technique provides a level of file access control beyond what the file system offers: access without viewability. Effectively, an application can be granted access to a file for the purpose of management without decrypting its contents. Privileged super users can continue to manage their environments and access the file, but are restricted from having clear-text access to the file content. This capability helps mitigate risks from internal malicious activity targeted at sensitive data.


4. Installing Guardium data encryption
Installing the InfoSphere Guardium Data Encryption solution requires installing the security server component and the file system agent component. The security server needs to be installed once, on the server of your choice. The file system agent needs to be installed on every server where you need protection. For example, if you have a BigInsights Hadoop cluster of three nodes, then you need to install the file system agent on each of those three nodes.

The installation procedure itself is well documented in the InfoSphere Guardium Data Encryption product documentation and is beyond the scope of this paper. Please check the Further reading section at the end of this paper for product installation documentation and hardware/software requirements.

The rest of this paper assumes that you have installed InfoSphere Guardium Data Encryption on a supported environment. For the testing conducted as part of this paper, the environment was InfoSphere BigInsights HDFS on Red Hat Linux. This environment did not include GPFS.

5. Configuring encryption policies
When configuring encryption policies you need to consider whether you have existing data that needs to be encrypted. If so, you need to create an encryption policy that allows you to encrypt that data in place. If you don’t have any existing data in the files or directories you are going to protect, then you can skip this step of encrypting existing data. Your new data will be automatically encrypted by the encryption policies in place as it is ingested into the files or directories you have protected.

5.1 Creating a policy for encrypting existing data
This section describes the steps required to encrypt existing data. In a nutshell, the process associates an encryption policy with a directory located on a particular node or host. In the screen shots given below, the directory containing the data to encrypt is called “/hadoop”. Our BigInsights Hadoop cluster consists of three nodes called “hdtest021.svl.ibm.com”, “hdtest022.svl.ibm.com”, and “hdtest036.svl.ibm.com” respectively.

__1. Creating the policy

__a. Log in to the security server administration console as secadmin.

__b. Under “Manage Policies”, click “Add Online Policy”.


__c. Click the “Action” button, add “key_op – Key operations”, and click “OK”.

__d. Click the “Effect” button, add the effects “permit” and “apply_key”, and click “OK”.


__e. Click the “Add” button to add the rule.

__f. Open the “Key Selection Rules” tab, select “clear_key” as the key, and click “Add”.


__g. Open the “Data Transformation Rules” tab, select “test-aes256-key” as the key, and click “Add”.


__h. Open the “Security Rules” tab and click the “Reset” button.

__i. Click “Effects” and add “deny” and “audit” as effects.

__j. Click the “Add” button to add the rule.


__k. Save the policy as “NewDataEncryptionPolicy1”.
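The policy built in step __1 can be read as an ordered list of security rules. The sketch below models it in Python, assuming that rules are evaluated top-down and that the first matching rule decides the outcome; the paper suggests this ordering (a later step moves a rule to the top of the list) but does not state it. The class and function names are hypothetical, the action name "read" is only a placeholder for any operation other than key_op, and this is not the GDE policy engine.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class SecurityRule:
    action: Optional[str]       # None means "match any action"
    effects: FrozenSet[str]


def evaluate(rules: Sequence[SecurityRule],
             action: str) -> Optional[FrozenSet[str]]:
    """Return the effects of the first rule that matches the action.

    Returns None when no rule matches; what the product does in that
    case is not described in the paper, so the sketch does not guess.
    """
    for rule in rules:
        if rule.action is None or rule.action == action:
            return rule.effects
    return None


# Rule 1 (steps c-e): key operations are permitted and the key is applied.
# Rule 2 (steps h-j): everything else is denied and audited.
EXISTING_DATA_POLICY = [
    SecurityRule("key_op", frozenset({"permit", "apply_key"})),
    SecurityRule(None, frozenset({"deny", "audit"})),
]

print(sorted(evaluate(EXISTING_DATA_POLICY, "key_op")))
print(sorted(evaluate(EXISTING_DATA_POLICY, "read")))
```

Read this way, the policy lets the data transformation run while keeping every other access to the directory blocked and logged until the transformation is complete.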

__2. Linking the policy to the host.


__a. Go to the “Hosts” tab.

__b. Click on the host name (hdtest021.svl.ibm.com).

__c. Click the “Guard FS” tab and click the “Guard” button. Add the policy and the folder in which the data has to be encrypted. All the Hadoop data is stored under the /hadoop folder.

__d. After adding the guard, refresh to ensure that the status is green.


__3. Performing the encryption on existing HDFS data

__a. Open a terminal and log in as the root user.
__b. Run “secfsd -status guard”

__c. Run “dataxform --rekey --gp /hadoop”

__d. Run “dataxform --cleanup --gp /hadoop”
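Because the commands in step __3 have to be run on every node, it can help to write the sequence down once. The sketch below builds the per-host command list from the three commands shown above and prints it in dry-run mode. Running the commands over ssh as root is our assumption (the paper logs in to a terminal on each host), the function names are hypothetical, and the policy must already be guarded on a host (step __2) before the commands are run there.

```python
import subprocess

GUARD_POINT = "/hadoop"
HOSTS = [
    "hdtest021.svl.ibm.com",
    "hdtest022.svl.ibm.com",
    "hdtest036.svl.ibm.com",
]

# The three commands from step __3, in the order given in the paper.
STEPS = [
    ["secfsd", "-status", "guard"],
    ["dataxform", "--rekey", "--gp", GUARD_POINT],
    ["dataxform", "--cleanup", "--gp", GUARD_POINT],
]


def build_plan(hosts, steps):
    """Return one ssh command per (host, step); all steps of a host first."""
    return [["ssh", f"root@{host}"] + step for host in hosts for step in steps]


def run_plan(plan, dry_run=True):
    """Print the plan, or execute it and stop at the first failing command."""
    for command in plan:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


plan = build_plan(HOSTS, STEPS)
run_plan(plan)  # dry run: prints the nine commands without executing them
```

Stopping at the first failure matters here: the cleanup step should not run on a host where the rekey step did not complete.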


Perform the same operations (steps __2 and __3) on the other host machines. This ensures that existing data is encrypted across all nodes.

__4. Removing (un-guarding) the policy from the host

__a. Open the “Hosts” tab and click on the host name (hdtest021.svl.ibm.com).

__b. Select the policy and click the “unguard” button.

__c. Click the “Refresh” button to ensure that the policy has been removed.


Perform the same operation (step __4) on the other host machines. This ensures that the policy is removed from all the nodes.

At this point, all existing data is encrypted and we are ready to create a permanent policy for encrypting any new data that will be ingested going forward.

5.2 Creating a policy for encrypting new data
In a nutshell, this process associates an encryption policy with a directory located on a particular node or host. In the screen shots given below, the directory containing the data to encrypt is called “/hadoop”. Our BigInsights Hadoop cluster consists of three nodes called “hdtest022.svl.ibm.com”, “hdtest036.svl.ibm.com”, and “hdtest021.svl.ibm.com” respectively.

__1. Creating the policy

__a. Select “Manage Policies” and click “Add Online Policy”.

__b. Click the “Effect” button, add the effects “permit”, “apply_key”, and “audit”, and click “OK”.


__c. Open “Key Selection Rules”, select “test-aes256-key” as the key, and click the “Add” button.

__d. Click the “Add” button in the “Security Rules” tab.


__e. Click the “Reset” button.

__f. Click “Effect”, add the effects “deny” and “audit”, and click “OK”.

__g. Click the “Add” button so that the effects are added to the security rules.


__h. Save the policy.

__i. The policy is now added to the security server.


__2. Linking the policy to the host

__a. Go to the “Hosts” tab.

__b. Click on the host name (hdtest021.svl.ibm.com).

__c. Click the “Guard FS” tab and click the “Guard” button. Add the policy and the folder in which the data has to be encrypted. All the HDFS data is stored under the /hadoop folder.


__d. After adding the guard, refresh to ensure that the status is green.

__e. Repeat step __2 on the other host machines so that the policy is linked to all the nodes.

__3. Changing the log info and host settings on all host machines

__a. Go to the “Hosts” tab.

__b. Click on the host name (hdtest021.svl.ibm.com)


__c. Click the “FS Agent Log” tab and set the “Policy Evaluation” level to “INFO”.

__d. Click the “Ok” button.

__e. Click “Host Settings” and add “|trust|*”.

__f. Click the “Ok” button.

__g. Repeat step __3 on the other host machines so that the log info and host settings are changed on all the nodes.

__4. Adding more rules to the existing policy

Here we add one more rule to the policy. Note that this new rule does not audit the BIADMIN user, which is typically a trusted user ID. This is fine for a test environment, but for a production environment it is recommended that this user also be audited. This is particularly important because many breaches are due to compromised privileged user credentials or to a privileged user gone rogue.


__a. Select “Manage Policies” and click on the policy “newlyCreatedData1”.

__b. Click the “Effects” button, add the effects “permit” and “apply_key”, and click “OK”.

__c. Click the “User” button and select “BIADMIN” as the user.


__d. Press "Add" button in "Security Rules" tab. The new rule will be added to the policy.

__e. Click the “Up” button to move the new rule to the top of the list.

At this point, we have added a permanent policy to ensure that all newly ingested data across all nodes is encrypted going forward. Now simply start BigInsights. Data will be encrypted and decrypted transparently to your BigInsights applications from now on.

7. Conclusion
More and more customers from all sectors would like to take Hadoop to the next level by integrating big data with mission-critical systems and sensitive data. In order for this to happen, big data solutions need to integrate enterprise security solutions such as encryption, access control, and auditing. In this regard, the InfoSphere Guardium activity monitoring and the InfoSphere Guardium data encryption solutions clearly emerge as leaders. They seamlessly allow you to integrate your InfoSphere BigInsights Hadoop data protection into your existing enterprise data security strategy and meet your regulatory compliance needs.


Further reading
• IBM InfoSphere Guardium Data Encryption V2.0 secures data through encryption to help you meet rigorous data governance and compliance requirements, http://www-01.ibm.com/common/ssi/cgibin/ssialias?htmlfid=897/ENUS212-224&infotype=AN&subtype=CA
• Big data security and auditing with IBM InfoSphere Guardium, http://www.ibm.com/developerworks/data/library/techarticle/dm1210bigdatasecurity/
• Install IBM InfoSphere Guardium Data Encryption on the IBM PureApplication System, http://www.ibm.com/developerworks/cloud/library/cl-installguardium/


Reviewers
Ron Ben Natan, IBM Distinguished Engineer, VP and CTO, Data Security, Compliance and Optimization
James Giles, IBM Distinguished Engineer, Senior Manager, Big Data Development
Ashvin Kamaraju, VP of Product Development, Vormetric Data Security
Hui Liao, Senior Development Manager, BigInsights Development
Kan Zhang, Senior Technical Staff Member, BigInsights Development
Paul Zikopoulos, Director, World Wide Big Data Tiger Team


Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. 
The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment do so at their own risk. This document and the information contained herein may be used solely in connection with the IBM products discussed in this document. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. 
Users of this document should verify the applicable data for their specific environment.


Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: © Copyright IBM Corporation 2013. All Rights Reserved. This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml Windows is a trademark of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
