
Dell EMC Ready Bundle for Cloudera Hadoop


Deployment Guide
Version 5.10

Dell EMC Converged Platforms and Solutions



Contents
List of Figures....................................................................................................................vi

List of Tables.................................................................................................................... vii

Trademarks.........................................................................................................................9
Glossary............................................................................................................................10
Notes, Cautions, and Warnings....................................................................................... 15

Chapter 1: Overview......................................................................................................... 16
Summary..................................................................................................................17
Deployment Workflow............................................................................................. 17

Chapter 2: Installation Prerequisites.................................................................................19


Software Requirements........................................................................................... 20
Cloudera Software Requirements.................................................................. 20
Red Hat Software Requirements................................................................... 20
VMware Hypervisor........................................................................................ 20
Dell EMC Ready Bundle for Cloudera Hadoop Architecture Guide............... 20
Downloading the Installation Packages......................................................... 21
Equipment Requirements........................................................................................ 22
Site Planning........................................................................................................... 22
Environmental Planning Checklists................................................................22
Network Integration Checklists.......................................................................22

Chapter 3: Hardware Setup..............................................................................................25


Unpacking and Installing the Equipment................................................................ 26
Powering Up the Equipment................................................................................... 26
Verifying the Equipment.......................................................................................... 26
Tested BIOS and Firmware.................................................................................... 26
Dell EMC PowerEdge FX2 Setup..........................................................................28
Chassis Identification..................................................................................... 28
Changing the FD332 Storage Controller Mode............................................. 28
Flex Addressing..............................................................................................29

Chapter 4: Dell EMC Ready Bundle for Cloudera Hadoop Nodes.................................. 30


Node Definitions...................................................................................................... 31

Chapter 5: Network Configuration.................................................................................... 33


High-level Network Architecture..............................................................................34
IP Addressing.......................................................................................................... 34
Sample Naming and IP Addressing...............................................................35


Cluster Networks and VLANs................................................................................. 36


Node Interface Bonds............................................................................................. 37
Active/Standby Name Nodes & HA Nodes....................................................37
Edge Node..................................................................................................... 37
Worker Node.................................................................................................. 38
Domain Name System............................................................................................ 38
Network Time Protocol............................................................................................38
Gathering Network Information............................................................................... 38

Chapter 6: Network Switches Configuration.................................................................... 40


Switch Configuration Overview............................................................................... 41
Cabling the Network Switches................................................................................ 41
Server Node Connections....................................................................................... 43
Configuring the Network Switches.......................................................................... 45
First Time Setup.............................................................................................45
Switch Configuration...................................................................................... 46

Chapter 7: Server Configuration and OS Installation.......................................................48


Installing and Configuring the Kickstart Server.......................................................49
Configuring the Kickstart VM Image.............................................................. 49
Configuring the Kickstart Server.................................................................... 50
DTK Configurator.................................................................................................... 52
Using the DTK Configurator...........................................................................52

Chapter 8: Additional Packages....................................................................................... 56


Checking and Installing Packages.......................................................................... 57

Chapter 9: Operating System Software Updates............................................................. 58


Software Update Recommendations...................................................................... 59

Chapter 10: Installing Cloudera Manager........................................................................ 60


Configuring the Metadata Database....................................................................... 61
Installing Cloudera Manager Software....................................................................62

Chapter 11: Cloudera Configuration.................................................................................64


Cloudera and Network Interfaces........................................................................... 65
Using Spark 1 and Spark 2.................................................................................... 65
Service Assignments............................................................................................... 65
Hadoop Rack Awareness........................................................................................67
Dell EMC PowerEdge FX2 Rack Awareness................................................ 67
Cloudera Update Recommendations...................................................................... 68

Chapter 12: Installing Syncsort DMX-h............................................................................ 69


Syncsort DMX-h Prerequisites................................................................................ 70


Syncsort DMX-h Software Packages and Versions................................................70


Installation Procedure..............................................................................................70
Acquire Syncsort Files................................................................................... 70
Install the DMX-h IDE.................................................................................... 71
Configure the Syncsort Parcel for Cloudera.................................................. 71
Install DMX-h on the Edge Node...................................................................71

Chapter 13: YARN Performance Optimization................................................................. 73


YARN Applications.................................................................................................. 74
Determining the Reserved Memory........................................................................ 74
Hadoop Configuration Settings............................................................................... 75

Chapter 14: Cluster Testing..............................................................................................78


Before Hadoop Cluster Deployment....................................................................... 79
After Hadoop Cluster Deployment.......................................................................... 79

Chapter 15: QuickStart Configuration Differences........................................................... 80


QuickStart Node Configuration Differences............................................................ 81
QuickStart Network Configuration Differences........................................................82
QuickStart Service Assignments.............................................................................82

Appendix A: BIOS Configuration...................................................................................... 84


IPMI Configuration...................................................................................................85
Primary BIOS Settings............................................................................................ 85
Infrastructure Node Settings................................................................................... 85
Worker Node Settings............................................................................................. 86

Appendix B: RAID Configuration...................................................................................... 88


PERC-H730-Specific Infrastructure Nodes RAID Settings..................................... 89
PERC-H730-Specific Worker Node RAID Settings.................................................89

Appendix C: File System Layout...................................................................................... 90


Infrastructure Nodes................................................................................................ 91
Worker Nodes......................................................................................................... 93
File Systems and Parameters.................................................................................95

Appendix D: Operating System Settings..........................................................................96


CPU Settings...........................................................................................................97
IRQ Balancer..................................................................................................97
CPU Frequency Governor..............................................................................97
Network Settings..................................................................................................... 98
Advanced NIC Features..........................................................................................98
TCP Segmentation Offload............................................................................ 99
Generic Segmentation Offload.......................................................................99


Scatter-Gather................................................................................................ 99
Display Offload Features................................................................................99
Interrupt Moderation and Coalescing...........................................................100
Process Limits....................................................................................................... 100
Memory Management Settings............................................................................. 100
Transparent Huge Page (THP) Compaction................................................100
Swap Settings.............................................................................................. 101
Secure Linux Settings........................................................................................... 101
Services................................................................................................................. 101
Firewall Settings.................................................................................................... 102
Ports Listing...........................................................................................................102
Disable Network Manager.....................................................................................103
Secure Shell Keys.................................................................................................103
User Accounts and Groups...................................................................................103

Appendix E: Example node-config.json File...................................................................104


node-config.json Example..................................................................................... 105

Appendix F: Support....................................................................................................... 106


Software Support...................................................................................................107
Java Compatibility................................................................................................. 107

Appendix G: Related Documentation............................................................................. 108


Cloudera Manager 5.10 and Cloudera Enterprise 5.10 Documentation............... 109
Apache Hadoop Documentation........................................................................... 109
Red Hat Documentation........................................................................................109
Syncsort DMX-h Documentation...........................................................................109

Appendix H: References.................................................................................................110
About Cloudera..................................................................................................... 111
About Syncsort...................................................................................................... 111
To Learn More...................................................................................................... 111


List of Figures
Figure 1: Dell EMC PowerEdge FX2 Chassis Identification - Front View........................ 28

Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster Networking................. 34

Figure 3: Single Pod Networking Equipment...................................................................42

Figure 4: Dell Networking S6000-ON Multi-pod Networking Equipment..........................43

Figure 5: PowerEdge R730xd Node Network Ports........................................................ 44

Figure 6: Dell EMC PowerEdge FX2 Infrastructure Chassis Network Ports.................... 44

Figure 7: Dell EMC PowerEdge FX2 Worker Chassis Network Ports............................. 45


List of Tables
Table 1: Deployment Workflow........................................................................................17

Table 2: Power and Cooling Checklist............................................................................ 22

Table 3: Physical Networking Checklist...........................................................................22

Table 4: Logical Networking Checklist.............................................................................23

Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions............... 27

Table 6: Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions..........27

Table 7: Dell Networking S3048-ON Tested Firmware Versions.....................................27

Table 8: Dell Networking S4048-ON Tested Firmware Versions.....................................27

Table 9: Dell Networking S6000-ON Tested Firmware Versions.....................................27

Table 10: Service Locations............................................................................................ 31

Table 11: Network IP Addressing Scheme......................................................................35

Table 12: IP Addressing Scheme.................................................................................... 35

Table 13: Cluster Networks............................................................................................. 36

Table 14: Name Nodes and HA Nodes Network Connections........................................ 37

Table 15: Edge Node Network Connections................................................................... 37

Table 16: Worker Nodes Network Connections.............................................................. 38

Table 17: Switch Configuration Files............................................................................... 41

Table 18: Bond / Interface Cross Reference................................................................... 45

Table 19: Service Role Assignments...............................................................................65

Table 20: Syncsort Installation Files................................................................................70

Table 21: Reserved Memory Recommendations............................................................ 75

Table 22: YARN and MapReduce RAM Settings............................................................ 75

Table 23: QuickStart Node Roles.................................................................................... 81

Table 24: QuickStart Service Role Assignments.............................................................82

Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630
Infrastructure Node Settings........................................................................................ 85


Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker
Node Settings...............................................................................................................86

Table 27: PERC-H730 BIOS Settings for Infrastructure Nodes.......................................89

Table 28: PERC-H730 BIOS Settings for Worker Nodes................................................ 89

Table 29: Dell EMC PowerEdge R730xd Infrastructure Node Volumes.......................... 91

Table 30: Dell EMC PowerEdge R730xd Infrastructure Node Partitions......................... 91

Table 31: Dell EMC PowerEdge FC630 Infrastructure Node Volumes............................92

Table 32: Dell EMC PowerEdge FC630 Infrastructure Node Partitions...........................92

Table 33: Dell EMC PowerEdge R730xd Worker Node Volumes................................... 93

Table 34: Dell EMC PowerEdge R730xd Worker Node Partitions.................................. 93

Table 35: Dell EMC PowerEdge FC630 Worker Node Volumes.....................................94

Table 36: Dell EMC PowerEdge FC630 Worker Node Partitions.................................... 94

Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix..................... 107


Trademarks
Copyright © 2011-2017 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks
are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective
owners.
This document is for informational purposes only and may contain typographical errors and technical
inaccuracies. The content is provided as is, without express or implied warranties of any kind.


Glossary

ASCII

American Standard Code for Information Interchange, a binary code for alphanumeric characters
developed by ANSI®.

BMC

Baseboard Management Controller

BMP

Bare Metal Provisioning

CDH

Cloudera Distribution for Apache Hadoop

Clos

A multi-stage, non-blocking network switch architecture. It reduces the number of required ports within a
network switch fabric.

CMC

Chassis Management Controller

DBMS

Database Management System

DTK

Dell OpenManage Deployment Toolkit


EBCDIC

Extended Binary Coded Decimal Interchange Code, a binary code for alphanumeric characters developed
by IBM®.

ECMP

Equal Cost Multi-Path

EDW

Enterprise Data Warehouse

EoR

End-of-Row Switch/Router

ETL

Extract, Transform, Load: a process for extracting data from various data sources, transforming the data
into the proper structure for storage, and then loading the data into a data store.

HBA

Host Bus Adapter

HDFS

Hadoop Distributed File System

HVE

Hadoop Virtualization Extensions


IPMI

Intelligent Platform Management Interface

JBOD

Just a Bunch of Disks

LACP

Link Aggregation Control Protocol

LAG

Link Aggregation Group

LOM

Local Area Network on Motherboard

NIC

Network Interface Card

NTP

Network Time Protocol

OS

Operating System

PAM

Pluggable Authentication Modules, a centralized authentication method for Linux systems.


RPM

Red Hat Package Manager

RSTP

Rapid Spanning Tree Protocol

RTO

Recovery Time Objective

SIEM

Security Information and Event Management

SLA

Service Level Agreement

THP

Transparent Huge Pages

ToR

Top-of-Rack Switch/Router

VLT

Virtual Link Trunking

VRRP

Virtual Router Redundancy Protocol


YARN

Yet Another Resource Negotiator


Notes, Cautions, and Warnings


Note: A Note indicates important information that helps you make better use of your system.

Caution: A Caution indicates potential damage to hardware or loss of data if instructions are not
followed.
Warning: A Warning indicates a potential for property damage, personal injury, or death.



Chapter 1
Overview

Topics:
• Summary
• Deployment Workflow

This guide describes the prerequisites to install the Dell EMC Ready Bundle for Cloudera Hadoop on a
predefined hardware and network configuration, as specified in the current Dell EMC Ready Bundle for
Cloudera Hadoop Architecture Guide. It also covers requirements for preparing the hardware platform and
provisioning the operating system for Cloudera Enterprise 5.10 deployment.


Summary

This guide describes deploying the Dell EMC Ready Bundle for Cloudera Hadoop using either of two
server architectures:
• Dell EMC PowerEdge R730xd - A 2U rack server platform
• Dell EMC PowerEdge FX2 - A high-density 2U converged infrastructure platform
Both architectures use similar server configurations and cluster layouts. In the converged infrastructure
architecture, each Dell EMC PowerEdge FX2 chassis is the equivalent of two Dell EMC PowerEdge
R730xd servers in the design.
Both architectures share the same networking architecture, which consists of:
• A leaf-and-spine topology for the cluster production network
• A flat daisy chain of switches for a dedicated iDRAC network

Deployment Workflow

Table 1: Deployment Workflow on page 17 describes the basic Dell EMC Ready Bundle for Cloudera
Hadoop deployment sequence:

Table 1: Deployment Workflow

Deployment Step
    Information Reference

1. Complete Installation Prerequisites
    • Installation Prerequisites on page 19
2. Hardware Setup
    • Hardware Setup on page 25
3. Network Setup and Switch Configuration
    • Network Setup - Network Configuration on page 33
    • Switch Configuration - Network Switches Configuration on page 40
4. Server Configuration and Operating System Installation
    • Server Configuration and OS Installation on page 48
    • Install Operating System and Enable Services - Installing and Configuring the Kickstart Server on
      page 49
    • Boot the Servers, and Configure with the DTK - DTK Configurator on page 52
5. Configure Software updates, install additional packages
    • Operating System Software Updates on page 58
6. Install Cloudera Manager and configure the Cloudera Manager Database
    • Installing Cloudera Manager Software on page 62
    • Configuring the Metadata Database on page 61
7. Install and Configure Cloudera Enterprise
    • Cloudera Configuration on page 64
8. Reference Material
    • BIOS Configuration on page 84
    • RAID Configuration on page 88
    • File System Layout on page 90
    • Operating System Settings on page 96
    • Example node-config.json File on page 104

Refer to QuickStart Configuration Differences on page 80 for details on deploying a QuickStart
configuration.


Chapter 2
Installation Prerequisites

Topics:
• Software Requirements
• Equipment Requirements
• Site Planning

Several prerequisites must be satisfied before you install the components of the Dell EMC Ready
Bundle for Cloudera Hadoop.
This guide assumes that you are familiar with:
• Cloudera Enterprise 5.10
• RAID and BIOS configuration of Dell EMC PowerEdge R730xd or Dell EMC PowerEdge FX2 servers
• Red Hat Enterprise Linux® (RHEL) 7.3
• Network installation


Software Requirements

Required software includes:


• Cloudera software (see Cloudera Software Requirements on page 20)
• Red Hat software (see Red Hat Software Requirements on page 20)
• VMware Hypervisor software (see VMware Hypervisor on page 20)
• The Dell EMC Ready Bundle for Cloudera Hadoop Architecture Guide (see Dell EMC Ready Bundle for
Cloudera Hadoop Architecture Guide on page 20)
• Switch configuration files (see Table 17: Switch Configuration Files on page 41)
• The Dell EMC Ready Bundle for Cloudera Hadoop installation packages (see Downloading the
Installation Packages on page 21)
• Firewall rules for the kickstart VM DNS server (see Configuring the Kickstart VM Image on page 49)
Optional software includes:
• Syncsort software (see Installing Syncsort DMX-h on page 69)
• Rufus (see Writing the ISO to a USB Key in Windows on page 52)

Cloudera Software Requirements


Licensed Cloudera software must be obtained via one of the following means prior to installation:
• Directly from Cloudera’s repository. This requires outbound public Internet access to
archive.cloudera.com from the node where Cloudera Manager is installed.
• A local staging repository, which is copied or mirrored from Cloudera’s master repository.
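
If you plan to install directly from Cloudera's repository, it can help to confirm outbound connectivity from the Cloudera Manager node beforehand. The following Python sketch shows one possible check. It is not part of the Ready Bundle tooling; the function name `can_reach`, the port (443), and the timeout are assumptions chosen for illustration, and a successful TCP connection does not by itself prove that a proxy or firewall will permit repository downloads.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and opens the socket in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, running `can_reach("archive.cloudera.com")` on the node where Cloudera Manager will be installed returns `False` if the host cannot be resolved or reached on the assumed port.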

Red Hat Software Requirements


Licensed Red Hat Enterprise Linux Server 7.3 must be obtained via one of the following means prior to
installation:
• Local media access
• Satellite server
• Outbound public Internet connectivity
Note: Alternatively, you can use CentOS 7.3; however, support for CentOS is limited to Dell EMC
hardware support only.
See Software Support on page 107 for a list of support options for Dell EMC Ready Bundle for Cloudera
Hadoop components.
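
After the operating system is installed, the release can be checked against the versions named above. The following Python sketch is an illustration only, not a procedure from this guide. It assumes the release string is read from a file such as /etc/redhat-release, and the function names `parse_release` and `is_supported_release` are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_release(text: str) -> Optional[Tuple[str, int, int]]:
    """Extract (distribution, major, minor) from a redhat-release style string."""
    match = re.search(r"^(.*?)\s+release\s+(\d+)\.(\d+)", text.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def is_supported_release(text: str) -> bool:
    """True for Red Hat Enterprise Linux 7.3 or CentOS 7.3, the releases this guide names."""
    parsed = parse_release(text)
    if parsed is None:
        return False
    name, major, minor = parsed
    known = name.startswith("Red Hat Enterprise Linux") or name.startswith("CentOS")
    return known and (major, minor) == (7, 3)
```

A typical call passes the file contents directly, for example `is_supported_release(open("/etc/redhat-release").read())`. The check is deliberately strict about the minor version because this guide names 7.3 specifically.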

VMware Hypervisor
The Kickstart Server is a virtual machine that you run on your laptop or another machine using any of
the following VMware hypervisor products:
• VMware ESXi™ 5.5 or above
• VMware Fusion® 6.0 or above
• VMware Workstation Pro™ 10 or above
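
The minimum versions above can be compared numerically rather than as text, since a plain string comparison would rank "10" below "6". The following Python sketch illustrates this under stated assumptions: the product names are used as dictionary keys exactly as listed, version strings are assumed to contain only dot-separated integers, and the names `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical.

```python
from typing import Dict, Tuple

# Minimum versions as listed in this guide.
MINIMUM_VERSIONS: Dict[str, Tuple[int, ...]] = {
    "VMware ESXi": (5, 5),
    "VMware Fusion": (6, 0),
    "VMware Workstation Pro": (10,),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.0.2' into (6, 0, 2) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(product: str, version: str) -> bool:
    """True if `version` of `product` is at or above the listed minimum."""
    minimum = MINIMUM_VERSIONS.get(product)
    if minimum is None:
        return False  # Not one of the three listed products.
    return parse_version(version) >= minimum
```

Version strings that carry build or update suffixes would need to be trimmed before being passed to `parse_version`.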

Dell EMC Ready Bundle for Cloudera Hadoop Architecture Guide


You must obtain, and have a thorough understanding of, the Dell EMC Ready Bundle for Cloudera Hadoop
Architecture Guide.
The architecture guide is a companion to this deployment guide, and provides detailed descriptions of the
solution, its hardware and software components, and deployment methodologies that you should consider.


Please contact your Dell EMC sales representative to obtain a copy of the Dell EMC Ready Bundle for
Cloudera Hadoop Architecture Guide.

Downloading the Installation Packages


Dell EMC channel partners, Dell EMC deployers, and Red Hat partners can download the following archive
packages, which are used to install the Dell EMC Ready Bundle for Cloudera Hadoop. The packages are
divided into release-specific and non-release-specific groups.

Installation Packages
Release-specific packages include:
• DTK .iso file and MD5 checksum for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FX2
servers
• Kickstart VM
• Configuration files for Dell Networking S3048-ON, S4048-ON, and S6000-ON switches
• Cut sheets for Dell Networking S3048-ON, S4048-ON, and S6000-ON switches
Non-release-specific packages include:
• Network connectivity tool
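
Because the DTK .iso file is supplied with an MD5 checksum, the download can be verified before it is written to a USB key. The following Python sketch shows one way to do this with the standard `hashlib` module. The function names are hypothetical, and the format of the supplied checksum file is not described here, so the sketch assumes you pass the expected value as a hexadecimal string.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so a large .iso never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's MD5 digest with the published value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be downloaded again.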

Dell Digital Locker


All installation material is available for download from the Dell Digital Locker. To gain access to the Dell
Digital Locker:
1. Order your product using the appropriate SKU.
2. Dell processes the order, and sends you an email with the subject, "Dell Digital Locker Order".
3. Follow the instructions in the email to access your product in the Dell Digital Locker.
a. If you already have a Dell MyAccount support account, you can use it to sign in.
b. Otherwise, click on the Create an Account button to create a Dell Digital Locker account.

Download Procedure
To download the installation packages and prepare them for use:
1
1. Using a web browser , sign into your Dell Digital Locker account.
2. Click on the Digital Products heading in the left-hand pane to display a list of products to which you
have access.
3. Click on the product you wish to download to display a Product Management page.
4. Click on the Download link to display an End User License Agreement (EULA).
a. Scroll to read the entire EULA in order to activate its agree/disagree buttons.
5. Click on the Yes, I Agree button to display a download method dialog window.
a. Or, click on the No, I Do Not Agree button to return to the Product Management page.
6. Select one of the following download methods:
• Download manager — A Windows program that enables multiple downloads, pause/resume
downloads, etc.
• If the download manager is not present on your system, you are offered a choice to either
download and run it, or download your product using your web browser.
• Web browser — Uses your web browser to download your product, and your system's file manager
to save or run it.
7. Click on the Download Now button to begin the download process.
a. Or, click on the Cancel button to abort the operation and return to the Product Management page.
8. Repeat Steps 2-7 for any additional downloads.
9. When finished, click on the Sign Out link atop the page.
Note: Dell EMC recommends that you use a current version of Firefox®, Chrome™, or Internet Explorer® as
your web browser.

Equipment Requirements

Some miscellaneous equipment is required during the installation:


• A 1 GB or larger USB memory stick is required for the DTK boot image.
• A serial cable and a USB serial adapter are required for initial switch programming.
• A laptop or other machine is required for running the kickstart server.
• A KVM or console is required for initial access to server consoles.
• A spare 1GbE network cable is required to connect the kickstart server machine to port 48 of the
S3048-ON management switch for initial booting.

Site Planning

Several site planning tasks should be completed before installation begins.
These tasks fall outside the scope of the architecture itself, so this section provides checklists of
questions that should be reviewed and answered beforehand. Some of these questions are intended to raise
additional questions.

Environmental Planning Checklists


Table 2: Power and Cooling Checklist

Typical Question Answer


What is the available site power – voltage, phase?
What type of power plugs are required?
How many power drops are required?
Will power drops be at floor level or above?
What type of PDUs are being used?
Have ESSA power and cooling calculations been
completed for the actual rack layouts?

Network Integration Checklists


Refer to Network Configuration on page 33 for the details of the cluster networking architecture.

Table 3: Physical Networking Checklist

Typical Question Answer


Will network drops come from above or below
racks?

Will the main connection to the site network be
10GbE or 40GbE?
Are transceivers required?
What type of transceivers?
Who is providing transceivers?
Are the site network connections optical or copper?
Have cables between the cluster and site network
been accounted for?

Table 4: Logical Networking Checklist

Typical Questions Answer


Does the site network support IEEE 802.1Q VLAN
traffic and port tagging?
Does the site network support using one untagged
and multiple tagged VLANs on the same port?
Will the cluster data network be connected to the
main site network? (Dell EMC normally does not
recommend this.)
What is the DNS sub domain for the cluster? (Dell
EMC recommends a dedicated sub-domain, such
as cluster1.example.com)
What is the IP address range for the data network?
What is the data network VLAN?
What is the gateway IP?
What is the IP address range for the edge network?
What is the edge network VLAN?
What is the IP address range for the iDRAC
network?
What is the iDRAC network VLAN?
Will the iDRAC network be connected to an existing
management network?
What are the IP addresses of the site DNS
Server(s)?
Is synchronization with an existing NTP server
needed?
What is the NTP Server IP address?
Will outbound (internet) access be available to the
cluster?
Will outbound (internet) access be available at
installation and set up time?

Are there any site firewalls that need to be updated
to allow cluster access?
Does the site DNS server need to be updated in
advance? How long in advance?
What is the naming convention used for
hostnames?


Chapter 3: Hardware Setup

These procedures ensure that your hardware is installed correctly prior to installing the Dell EMC
Ready Bundle for Cloudera Hadoop.

Topics:
• Unpacking and Installing the Equipment
• Powering Up the Equipment
• Verifying the Equipment
• Tested BIOS and Firmware
• Dell EMC PowerEdge FX2 Setup


Unpacking and Installing the Equipment

Before you proceed, you must perform the following procedures, observing all standard industry safety
practices:
1. Unpack and install the racks.
2. Unpack and install the server hardware.
3. Unpack and install the switch hardware.
4. Unpack and install the network cabling. See:
a. Server Node Connections on page 43
b. Cabling the Network Switches on page 41
5. Connect each individual machine to both power bus installations.
6. Apply power to the racks.
Note: This is usually performed by the Dell EMC EDT Team.

Powering Up the Equipment

To perform the power on test:


Note: This is usually performed by the Dell EMC EDT Team.

1. Power on each server node, individually.
2. Wait for internal system diagnostic procedures to complete.
3. Power on the network switches.
4. Wait for the switches' internal system diagnostic procedures to complete.

Verifying the Equipment

The cluster hardware should be verified before physical installation begins. After installation, the final
functional tests should be run.
Recommended validation steps:
1. All power on tests complete successfully.
2. All drives should be powered on. Verify that the hardware diagnostic LEDs and system console do not
report any errors.
3. All nodes should be checked for correct memory size.
4. All network ports and cables should be checked for connections.

Tested BIOS and Firmware

Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions on page 27 and Table 6:
Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions on page 27 list the server BIOS
and firmware versions that were tested for the Dell EMC Ready Bundle for Cloudera Hadoop.
Table 7: Dell Networking S3048-ON Tested Firmware Versions on page 27, Table 8: Dell Networking
S4048-ON Tested Firmware Versions on page 27, and Table 9: Dell Networking S6000-ON Tested
Firmware Versions on page 27 list the switch firmware versions that were tested for the Dell EMC
Ready Bundle for Cloudera Hadoop.
Caution: You must ensure that the firmware on all servers and switches is up to date. Otherwise,
unexpected results may occur.

Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions

Product Version
BIOS 2.3.4
RAID 25.5.0.0018_A08
NIC 17.5.10_A00
Backplane Expander 3.31_A00-01
Non-storage Backplane 2.23_A00-00
iDRAC 2.41.40.40_A00

Table 6: Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions

Product Version
CMC 1.32.200.201601210012_A00
BIOS 2.3.5
RAID 25.5.0.0018_A08
NIC 17.5.12_A00
Backplane Expander 3.31_A00-00
Non-storage Backplane 2.23_A00-00
iDRAC 2.41.40.40_A00

Table 7: Dell Networking S3048-ON Tested Firmware Versions

Product Version
Firmware SG-9.10.0.1p13
Boot Selector 3.21.0.4 or higher

Table 8: Dell Networking S4048-ON Tested Firmware Versions

Product Version
Firmware SK-9.10.0.1p13
Boot Selector 3.21.0.4 or higher

Table 9: Dell Networking S6000-ON Tested Firmware Versions

Product Version
Firmware SI-9.10.0.1p13
Boot Selector 3.21.0.4 or higher
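When an inventory of installed versions is available, it can be compared against the tested versions programmatically. The sketch below uses the values from Table 5 for the Dell EMC PowerEdge R730xd. How the installed versions are collected is outside its scope, and the `installed` dictionary and function name are hypothetical.

```python
# Tested versions copied from Table 5 (Dell EMC PowerEdge R730xd).
TESTED_R730XD = {
    "BIOS": "2.3.4",
    "RAID": "25.5.0.0018_A08",
    "NIC": "17.5.10_A00",
    "Backplane Expander": "3.31_A00-01",
    "Non-storage Backplane": "2.23_A00-00",
    "iDRAC": "2.41.40.40_A00",
}


def firmware_differences(installed, tested=TESTED_R730XD):
    """Return {component: (installed, tested)} for every component that differs.

    A component missing from `installed` is reported with None.
    """
    return {
        name: (installed.get(name), version)
        for name, version in tested.items()
        if installed.get(name) != version
    }
```

An empty result means every listed component matches the tested version exactly. The comparison is a plain string match, so it does not judge whether a differing version is newer or older.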


Dell EMC PowerEdge FX2 Setup

The Dell EMC PowerEdge FX2 requires some additional hardware setup and verification.

Chassis Identification

Figure 1: Dell EMC PowerEdge FX2 Chassis Identification - Front View

There are two chassis configurations for the Dell EMC PowerEdge FX2 - Infrastructure and Worker. These
chassis configurations appear physically identical, and the infrastructure nodes may have to be identified
from the actual orders, or by checking the drive quantity in the storage module.
The cabling details in Server Node Connections on page 43 are based on the sled configuration shown
in Figure 1: Dell EMC PowerEdge FX2 Chassis Identification - Front View on page 28. It may be
necessary to re-arrange the sleds to match this configuration.

Changing the FD332 Storage Controller Mode


Follow these procedures to change the FD332 Storage Controller mode:
1. Configuring the Dell EMC PowerEdge FX2 CMC IP Address on page 28
2. Logging into the CMC and Accessing the Blades on page 29
3. Configuring the FD332 Storage Blade for Use by a Worker Node on page 29

Configuring the Dell EMC PowerEdge FX2 CMC IP Address


To provision the Dell EMC PowerEdge FX2 Chassis Management Controller (CMC) with an IP address:
1. Connect a keyboard and monitor to the chassis.
2. Power on one of the compute blades in the chassis. The monitor should display the server's boot
screen.

a. If this is the first time the system has been powered on, the system will boot into Life Cycle
Controller for configuration.
b. If it does not, press [F2] to go into the system setup screens.
3. From the Life Cycle Controller, click on the Hardware Configuration link on the left hand side.
4. Select the Configuration Wizards, and then select iDRAC Settings.
5. Scroll to the bottom of the iDRAC Settings page, and click on CMC Network.
6. Under the IPv4 Settings, make sure Enable IPv4 is set to Enabled.
7. Apply a Static IP Address, Subnet Mask and Gateway to the CMC.
8. Press Back, and then Finish.
9. Exit the Life Cycle Controller and reboot the server.

Logging into the CMC and Accessing the Blades


From a system with access to the iDRAC network:
1. Open a web browser, and navigate to the address given to the CMC.
2. If a certificate warning is presented by the CMC, allow the exception.
3. Proceed to the login page, using the default credentials:
a. Username — root
b. Password — calvin

Configuring the FD332 Storage Blade for Use by a Worker Node


The FD332 storage blades have three operating modes for which they can be configured:
• Split Dual Host
• Split Single Host
• Joined
The FD332 for a Cloudera Hadoop Worker Node must be in Split Single Host mode. To set the mode via
the CMC:
1. Select the server blade that is paired with the storage blade from the tree Chassis Overview > Server
Overview > 1 localhost.localdomain (Compute).
2. Click on the Power tab.
3. If the Power Status is On, choose the Power Off Server radio button.
4. Click on the Apply button.
Once the system has been powered off:
5. Select the associated storage blade from the tree Chassis Overview > Server Overview > 3 SLOT-03
(Storage).
6. Click on the Setup tab.
7. Select the Split Single Host Storage Mode.
8. Click on the Apply button.
9. Follow the instructions in USB Boot on page 53 to configure the Compute blade as a Hadoop Worker
Node.

Flex Addressing
The FlexAddress feature in the Dell EMC PowerEdge FX2 allows the replacement of the factory-assigned
iDRAC MAC with a chassis-assigned MAC for individual slots. The use of Flex Addressing is a customer
choice. However, if it is enabled, remember that iDRAC MAC addresses will not follow sleds when they are
moved.


Chapter 4: Dell EMC Ready Bundle for Cloudera Hadoop Nodes

Several node types, each with specific functions, are included in the Dell EMC Ready Bundle for
Cloudera Hadoop. This topic provides detailed definitions of those nodes.

Topics:
• Node Definitions


Node Definitions

• Administration Node: provides cluster deployment and management capabilities. The
Administration Node is optional in cluster deployments, depending on whether existing provisioning,
monitoring, and management infrastructure will be used.
• Active Name Node: runs all the services needed to manage the HDFS data storage and YARN
resource management. This is sometimes called the “master name node.” There are four primary
services running on the Active Name Node:
  • Resource Manager (to support cluster resource management, including MapReduce jobs)
  • NameNode (to support HDFS data storage)
  • Journal Manager (to support high availability)
  • ZooKeeper (to support coordination)
• Standby Name Node: when quorum-based HA mode is used, this node runs the standby NameNode
process, a second journal manager, and an optional standby resource manager. This node also runs
the Spark History Server and a second ZooKeeper service.
• High Availability (HA) Node: provides the third journal node for HA. The Active Name
Node and the Standby Name Node provide the first and second journal nodes. The HA Node also runs a
third ZooKeeper service. The operational databases required for Cloudera Manager and additional
metastores are on the HA Node.
• Edge Node: provides an interface between the data and processing capacity available in the Hadoop
cluster and a user of that capacity. An Edge Node has an additional connection to the Edge Network,
and is sometimes called a “gateway node.” At least one Edge Node is required.
• Worker Node: runs all the services required to store blocks of data on the local hard drives and
execute processing tasks against that data. A minimum of five Worker Nodes is required, and larger
clusters are scaled primarily by adding additional Worker Nodes. There are three types of services
running on the Worker Nodes:
  • DataNode daemon (to support HDFS data storage)
  • NodeManager daemon (to support YARN job execution)
  • Services managed with Cloudera Manager service pools instead of YARN, such as Impala and
  HBase
Spark jobs also run on the Worker Nodes. However, there is no persistent service associated with
Spark jobs.
Table 10: Service Locations on page 31 describes the node locations and functions of the cluster
services.
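The journal nodes and the ZooKeeper services are each spread across three machines (Active Name Node, Standby Name Node, and HA Node) because both rely on a strict majority of members being available. The arithmetic can be sketched as follows; the function name is illustrative.

```python
def tolerated_failures(members):
    """Number of members that can fail while a strict majority remains."""
    majority = members // 2 + 1
    return members - majority
```

With three members the majority is two, so one node can be lost without losing quorum. A fourth member would not raise that tolerance, which is one reason odd member counts are the usual choice.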

Table 10: Service Locations

Physical Node         Software Function
Administration Node   Systems Management Services
First Edge Node       Hadoop Clients
                      Cloudera Manager
                      DMX-h
                      DMExpress Service (dmxd)
Active Name Node      NameNode
                      Resource Manager
                      ZooKeeper
                      Quorum Journal Node
                      HMaster
                      Impala State Store and Catalog Daemons
Standby Name Node     Yum Repositories
                      Standby NameNode
                      Standby Resource Manager (optional)
                      Spark History Server
                      Spark2 History Server
                      ZooKeeper
                      Quorum Journal Node
HA Node               ZooKeeper
                      Quorum Journal Node
                      Operational Databases (PostgreSQL)
Worker Node(N)        DataNode
                      NodeManager
                      HBase RegionServer
                      ImpalaDaemon


Chapter 5: Network Configuration

This section describes how to configure the network for the Dell EMC Ready Bundle for Cloudera Hadoop.

Topics:
• High-level Network Architecture
• IP Addressing
• Cluster Networks and VLANs
• Node Interface Bonds
• Domain Name System
• Network Time Protocol
• Gathering Network Information


High-level Network Architecture

All servers in the cluster are tied together using TCP/IP networks. These networks form a data interconnect
across which individual servers pass data back and forth, return query results, and load/unload data.
These networks are also used for management and interfaces to an existing corporate network.
A combination of network switches and Layer 2 VLANs is used to segregate traffic in the cluster. Network
interface bonding is used to provide higher performance for selected networks. A high-level overview of
the network organization is provided in Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster
Networking on page 34.
The Standby Name Node will usually provide the following network services:
• NTP server (Network Time Protocol server) — makes sure all nodes are keeping the same time
• DHCP server — can be used to assign and manage IP addresses for the compute and storage nodes.
This guide uses static addressing for the cluster nodes.
Note: If the Standby Name Node does not exist in your environment, then these services must be
placed on another node.

Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster Networking

IP Addressing

The IP addressing uses large subnets to support many machines on the cluster network. The cluster and
BMC/IPMI networks are Class B networks, with 65,536 IP addresses.
In these example networks, the first 10 IP addresses are reserved for switches, routers, and firewalls.
The Edge network is a Class C network, with 256 IP addresses. The first 10 IP addresses are reserved for
switches, routers, and firewalls.
Note: Each network's ".1" address is reserved for the network gateway.


Table 11: Network IP Addressing Scheme

LAN Class Network Subnet Mask Gateway Broadcast


Cluster B 172.16.0.0 255.255.0.0 172.16.0.1 172.16.255.255
iDRAC/BMC B 172.18.0.0 255.255.0.0 172.18.0.1 172.18.255.255
Edge C 90.80.70.0 255.255.255.0 90.80.70.1 90.80.70.255
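The sizes, gateways, and broadcast addresses in Table 11 follow directly from the subnet masks. As a sketch, the values can be reproduced with Python's standard `ipaddress` module. The CIDR prefixes are the equivalents of the listed subnet masks, and the `describe` helper is illustrative.

```python
import ipaddress

# Networks from Table 11 in CIDR form (255.255.0.0 = /16, 255.255.255.0 = /24).
NETWORKS = {
    "Cluster": "172.16.0.0/16",
    "iDRAC/BMC": "172.18.0.0/16",
    "Edge": "90.80.70.0/24",
}


def describe(cidr):
    """Return size, netmask, gateway (.1), and broadcast address for a network."""
    net = ipaddress.ip_network(cidr)
    return {
        "addresses": net.num_addresses,
        "netmask": str(net.netmask),
        "gateway": str(net.network_address + 1),
        "broadcast": str(net.broadcast_address),
    }


for name, cidr in NETWORKS.items():
    print(name, describe(cidr))
```

The address counts include the network and broadcast addresses, which cannot be assigned to hosts.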

Sample Naming and IP Addressing


Table 12: IP Addressing Scheme on page 35 presents an example of a three-rack cluster, with 12
nodes per rack (24 RU). Network switches and administrative servers are atop each rack.

Table 12: IP Addressing Scheme

Hostname iDRAC/BMC IP Cluster Data IP


Rack 1
namenode1-r1 172.18.0.11 172.16.0.11
data1-r1 172.18.0.15 172.16.0.15
data2-r1 172.18.0.16 172.16.0.16
data3-r1 172.18.0.17 172.16.0.17
data4-r1 172.18.0.18 172.16.0.18
data5-r1 172.18.0.19 172.16.0.19
data6-r1 172.18.0.20 172.16.0.20
data7-r1 172.18.0.21 172.16.0.21
data8-r1 172.18.0.22 172.16.0.22
data9-r1 172.18.0.23 172.16.0.23
data10-r1 172.18.0.24 172.16.0.24
Rack 2
namenode2-r2 172.18.0.12 172.16.0.12
edge-r2 172.18.0.14 172.16.0.14
data11-r2 172.18.0.25 172.16.0.25
data12-r2 172.18.0.26 172.16.0.26
data13-r2 172.18.0.27 172.16.0.27
data14-r2 172.18.0.28 172.16.0.28
data15-r2 172.18.0.29 172.16.0.29
data16-r2 172.18.0.30 172.16.0.30
data17-r2 172.18.0.31 172.16.0.31
data18-r2 172.18.0.32 172.16.0.32
data19-r2 172.18.0.33 172.16.0.33

data20-r2 172.18.0.34 172.16.0.34
Rack 3
ha-r3 172.18.0.13 172.16.0.13
data21-r3 172.18.0.35 172.16.0.35
data22-r3 172.18.0.36 172.16.0.36
data23-r3 172.18.0.37 172.16.0.37
data24-r3 172.18.0.38 172.16.0.38
data25-r3 172.18.0.39 172.16.0.39
data26-r3 172.18.0.40 172.16.0.40
data27-r3 172.18.0.41 172.16.0.41
data28-r3 172.18.0.42 172.16.0.42
data29-r3 172.18.0.43 172.16.0.43
data30-r3 172.18.0.44 172.16.0.44
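In Table 12 the data nodes are numbered consecutively, and each node uses the same host part on the iDRAC/BMC network and the cluster data network, starting at .15 for data1. A minimal sketch of that pattern follows. The function name is hypothetical, and the scheme is only the example from the table, not a requirement.

```python
import ipaddress

CLUSTER_NET = ipaddress.ip_network("172.16.0.0/16")
IDRAC_NET = ipaddress.ip_network("172.18.0.0/16")
FIRST_DATA_OFFSET = 15  # data1 uses .15 in Table 12


def data_node_addresses(index):
    """Return (iDRAC/BMC IP, cluster data IP) for data node number `index` (1-based)."""
    if index < 1:
        raise ValueError("data node numbers start at 1")
    offset = FIRST_DATA_OFFSET + index - 1
    return (
        str(IDRAC_NET.network_address + offset),
        str(CLUSTER_NET.network_address + offset),
    )
```

Keeping the host part identical on both networks makes it easy to find a node's iDRAC interface from its data address.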

Cluster Networks and VLANs

The Dell EMC Ready Bundle for Cloudera Hadoop implements three distinct VLANs for cluster functions.
The networks are described in Table 13: Cluster Networks on page 36.

Table 13: Cluster Networks

Network               Recommended VLAN Tag  Tagged    Description
Cluster Data Network  300                   802.1q    The Cluster Data Network is the primary network in the
                                                      cluster, and provides a high speed interconnect that
                                                      carries the bulk of the traffic within the cluster.
                                                      Cloudera Services are accessed on this network.
iDRAC / BMC Network   100                   Untagged  This network is used for access to all of the
                                                      BMC/IPMI/iDRAC interfaces on each node. This provides
                                                      console access to each node at the BIOS/boot-level. It
                                                      also provides access to the management ports of the
                                                      cluster switches.
Edge Network          400                   802.1q    This is an optional network to allow access to the
                                                      cluster through the Edge Node(s). This network may have
                                                      a firewall configured to selectively protect the cluster
                                                      from outside access.

Node Interface Bonds

Layer 2 interface bonding is used on the core cluster network to increase performance, bandwidth, and
reliability. The recommended configuration is 802.3ad (LACP) bonding. Bonding can also be used on the
Edge network for the same reasons, depending on the interfaces required to connect to existing networks. See:
• Active/Standby Name Nodes & HA Nodes on page 37
• Edge Node on page 37
• Worker Node on page 38

Active/Standby Name Nodes & HA Nodes


Table 14: Name Nodes and HA Nodes Network Connections

Interface Interface Type Network Bonding


iDRAC 1GbE RJ45 iDRAC / BMC no bond
TenGig 1 10 GbE SFP Cluster Data bond0 802.3ad (LACP)
TenGig 2 10 GbE SFP Cluster Data bond0 802.3ad (LACP)

Note: The Active/Standby Name Nodes & HA Nodes hardware configurations include additional
10GbE ports, but these ports are not used.

Edge Node
Table 15: Edge Node Network Connections

Interface Interface Type Network Bonding


iDRAC 1GbE RJ45 iDRAC / BMC no bond
TenGig 1 10 GbE SFP Cluster Data bond0 802.3ad (LACP)
TenGig 2 10 GbE SFP Cluster Data bond0 802.3ad (LACP)
TenGig 3 10 GbE SFP Edge bond1 (802.3ad (LACP) optional)
TenGig 4 10 GbE SFP Edge bond1 (802.3ad (LACP) optional)


Worker Node
Table 16: Worker Nodes Network Connections

Interface Interface Type Network Type Teaming Type


iDRAC 1GbE RJ45 iDRAC / BMC no bond
TenGig 1 10 GbE SFP Cluster Data bond0 802.3ad (LACP)
TenGig 2 10 GbE SFP Cluster Data bond0 802.3ad (LACP)

Domain Name System

The installation programs and methodologies provided in this document will result in static IP assignments,
listed in /etc/hosts, on all machines. Any updates should be applied to /etc/hosts on one machine, and
then copied to all other nodes. You must update /etc/resolv.conf to point to your DNS server of choice. Dell
EMC has defaulted to using a public DNS server (8.8.8.8) for your initial use.
Note: DNScache is installed on all hosts.
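Because the same /etc/hosts content is copied to every node, it can help to generate the entries from a single list rather than edit them by hand. The following sketch renders lines in the usual "address, fully qualified name, short name" order. The function name is illustrative, the hostnames and addresses come from Table 12, and the domain is the example sub-domain from the Logical Networking Checklist.

```python
def hosts_lines(entries, domain):
    """Render /etc/hosts lines as 'IP, FQDN, short name' for each (name, ip) pair."""
    return [f"{ip}\t{name}.{domain}\t{name}" for name, ip in entries]


# Sample values only; substitute the names and addresses used at your site.
sample = [("namenode1-r1", "172.16.0.11"), ("data1-r1", "172.16.0.15")]
print("\n".join(hosts_lines(sample, "cluster1.example.com")))
```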

Dell EMC recommends that the optional administration node attached to the data network be configured
with an authoritative DNS server. This server must have authoritative forward and reverse DNS records for
each and every host that is a member of the cluster.
Note: If you are using Cloudera BDR or DISTCP, then external access and DNS resolution are
required for all nodes in both clusters.
Information on how to configure DNS can be obtained at:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/
Deployment_Guide/ch-DNS_Servers.html

Network Time Protocol

All nodes in an Apache Hadoop cluster require closely synchronized time. If the time between machines
is not synchronized, undefined errors will occur. Cloudera Manager will also flag nodes that have
unsynchronized time. To maintain clock synchronization, the OS configuration steps set up the Network
Time Protocol (NTP) on the nodes in the cluster, with an NTP server on the Standby Name Node.
This configuration synchronizes all nodes with the Standby Name Node. To synchronize the Standby
Name Node with an external clock source, the NTP server configuration should be updated.
Note: See http://www.ntp.org/ for more information.

To check the NTP server settings, execute the following commands:

# grep server /etc/ntp.conf
# ntpq -p
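The grep command matches every line that contains the word "server", including commented-out entries. Where only the active entries matter, a small parser can be used instead. This is a sketch that assumes the contents of /etc/ntp.conf have already been read into a string; the function name is illustrative.

```python
def ntp_servers(conf_text):
    """Return the hosts named on 'server' lines, skipping comments and blank lines."""
    servers = []
    for line in conf_text.splitlines():
        # Drop anything after '#', then split the remainder on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "server":
            servers.append(fields[1])
    return servers
```

On a cluster node, the result would normally be expected to contain the address of the Standby Name Node.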

Gathering Network Information

You must gather several pieces of customer network environment information, including:


• IP addresses for:
• Kickstart Server
• bond0 interface on each node
• bond1 interface on edge nodes
• iDRAC interfaces
• Each node's service tag (case-insensitive)
• Each node's name
• Whether or not updates are to be installed
• If so, you must gather their source (directly from RHN, or from a RHN Satellite Server)
• Rack location, if racked in a non-standard manner
The IP address recommendations in IP Addressing on page 34 can be used as a starting point.
The Hadoop cluster network can be implemented such that only the edge network has access to the
Internet, while the cluster data network is private. In this configuration, only bond1 interfaces need to have
IP addresses that are routed externally. Cloudera Manager will access the Cloudera packages via bond1
and then distribute them over bond0, which is on the cluster-only network.
Optionally, all nodes can have the ability to connect with the Internet. In all cases, you will need to know
the gateway address for bond1 as well as the network mask. For example:

Gateway Bond 0: 172.16.0.1
Netmask Bond 0: 255.255.0.0
Gateway Bond 1: 10.152.248.1
Netmask Bond 1: 255.255.255.0
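A gateway is only usable if it lies in the same subnet as the interface address. Before the values are entered during installation, that can be checked with a short sketch such as the following. The function name and the node address 10.152.248.20 in the usage note are hypothetical.

```python
import ipaddress


def gateway_on_link(address, netmask, gateway):
    """Return True when `gateway` lies in the same subnet as `address`/`netmask`."""
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    return ipaddress.ip_address(gateway) in interface.network
```

For example, `gateway_on_link("10.152.248.20", "255.255.255.0", "10.152.248.1")` checks a bond1 address against the sample gateway and netmask above.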

Service tags for each node are available in multiple places. Dell EMC PowerEdge R730xd servers have a
slide-out tag that contains this information. The information can be written down or scanned from the tag
via a smartphone app. They usually have the format of the following example:

D120R22
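Because service tags are case-insensitive, it is useful to normalize them before recording them. The sketch below assumes a tag is seven alphanumeric characters, as in the example above. That pattern is an assumption drawn from the example rather than a documented rule, and the function name is illustrative.

```python
import re

# Assumption: seven alphanumeric characters, as in the example D120R22.
SERVICE_TAG = re.compile(r"[A-Z0-9]{7}")


def normalize_service_tag(raw):
    """Upper-case and validate a service tag; raise ValueError if it looks wrong."""
    tag = raw.strip().upper()
    if not SERVICE_TAG.fullmatch(tag):
        raise ValueError(f"unexpected service tag format: {raw!r}")
    return tag
```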

Once all required information is gathered you can proceed to Server Configuration and OS Installation on
page 48.


Chapter 6: Network Switches Configuration

The Dell EMC Ready Bundle for Cloudera Hadoop is based on the network switches documented in the
Dell EMC Ready Bundle for Cloudera Hadoop Architecture Guide. This guide assumes the use of those
switches. Configuring the Network Switches on page 45 provides the necessary switch configurations
as starting points.

Topics:
• Switch Configuration Overview
• Cabling the Network Switches
• Server Node Connections
• Configuring the Network Switches


Switch Configuration Overview

This section describes the connection and setup of the switches used in the Dell EMC Ready Bundle for
Cloudera Hadoop.
The network must be cabled and the switches configured before software installation can begin. The
network configuration is divided into three phases:
• Setting up the S3048-ON, required for each rack in the cluster.
• Setting up the S4048-ON, required for each pod in the cluster.
• Setting up the S6000-ON, required for clusters larger than a single pod.
For each phase, we provide 'cut sheets' for the cabling details, and switch configuration files for the switch
programming. Refer to Table 17: Switch Configuration Files on page 41 to identify the correct cut sheet
and configuration file for each switch.

Table 17: Switch Configuration Files

Function             Switch Model  Physical Location  Configuration Template  Cut Sheet
Cluster Management   S3048-ON      One per rack       s3048-1                 Cutsheets.xlsx
Pod Switch           S4048-ON      Two per pod        s4048-1, s4048-2        Cutsheets.xlsx
Cluster Aggregation  S6000-ON      Two per cluster    s6000-1, s6000-2        s6000-cutsheet.xlsx

Cabling the Network Switches

The Dell EMC PowerEdge FX2 architecture uses a converged iDRAC or CMC connection on the back
of the chassis. All of the units in a Dell EMC PowerEdge FX2 chassis use the same physical connector
on the back of the unit for the physical network connection, and have separate IP addresses for each
sub-unit. Figure 7: Dell EMC PowerEdge FX2 Worker Chassis Network Ports on page 45 shows
the connection for a single chassis iDRAC connection. The port next to it can be used to daisy chain the
CMCs.
The management network for all of the nodes in the cluster is simple, whether the cluster uses Dell EMC
PowerEdge R730xd servers or Dell EMC PowerEdge FX2 chassis. The S3048-ON cut sheet, Cutsheets.xlsx,
shows that each Dell EMC PowerEdge R730xd host has a single connection from the dedicated iDRAC port
to one of the 1 GbE ports on the S3048-ON listed in the cut sheet for host management access. The Dell
EMC PowerEdge FX2 architecture is similar: each Dell EMC PowerEdge FX2 CMC port is connected to the
host ports in the cut sheet, in host order.
The listed interconnect ports, s3048-left and s3048-right, are for connecting multiple top-of-rack
S3048-ON switches together. The switches are connected as a simple bus. The cut sheet also shows a port
marked admin node, which carries both the Production network and the iDRAC network in tagged form.
This port allows 1 GbE access for kickstarting the machines using the Kickstart VM running in either
ESX or VMware Workstation. After the initial installation, this port can be used for a customer
administration node if desired.
Follow the cut sheets, and the following diagrams, to cable each switch:


• Figure 3: Single Pod Networking Equipment on page 42


• Figure 4: Dell Networking S6000-ON Multi-pod Networking Equipment on page 43

Figure 3: Single Pod Networking Equipment


Figure 4: Dell Networking S6000-ON Multi-pod Networking Equipment

Server Node Connections

Server connections to the network switches for the data network are bonded, and use an Active-Active
link aggregation group (LAG) in a load-balance configuration using the IEEE 802.3ad Link Aggregation
Control Protocol (LACP). (Under Linux®, this is referred to as 802.3ad or mode 4 bonding.)
The connections are made to a pair of Pod switches, to provide redundancy in the case of port, cable,
or switch failure. The switch ports are configured as a LAG, and the switches are configured as a high
availability pair using VLT.
Connections to the BMC network use a single connection from the iDRAC port to an S3048-ON
management switch in each rack.
Edge Nodes have an additional pair of 10GbE connections available. These connections facilitate high-
performance cluster access between applications running on those nodes, and the optional edge network.
The mapping of bonds to individual interfaces is shown in Table 18: Bond / Interface Cross Reference on
page 45.
The Dell EMC PowerEdge FX2 architecture uses a converged iDRAC or CMC connection on the back of
the chassis. All of the units in a Dell EMC PowerEdge FX2 chassis use the same physical connector on the
back of the unit for the physical network connection, and have separate IP addresses for each sub-unit.
Figure 7: Dell EMC PowerEdge FX2 Worker Chassis Network Ports on page 45 displays the connection
for a single chassis iDRAC connection. The port next to it can be used to daisy chain the CMCs.


Figure 5: PowerEdge R730xd Node Network Ports

Figure 6: Dell EMC PowerEdge FX2 Infrastructure Chassis Network Ports


Figure 7: Dell EMC PowerEdge FX2 Worker Chassis Network Ports

Note: The Dell EMC PowerEdge FX2 has two iDRAC ports per chassis - an uplink port and a
stacking port (STK). The uplink port is the main iDRAC port. The stacking port is only used when
chassis are daisy-chained.

Table 18: Bond / Interface Cross Reference

Server Platform Interface Bond Network


Dell EMC PowerEdge R730xd em1 bond0 Cluster Data
Dell EMC PowerEdge R730xd em2 bond0 Cluster Data
Dell EMC PowerEdge R730xd p4p1 bond1 Edge
Dell EMC PowerEdge R730xd p4p2 bond1 Edge
Dell EMC PowerEdge FX2 em1 bond0 Cluster Data
Dell EMC PowerEdge FX2 em2 bond0 Cluster Data
Dell EMC PowerEdge FX2 em3 bond1 Edge
Dell EMC PowerEdge FX2 em4 bond1 Edge
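Once a node is installed, the mapping in Table 18 can be checked against the status that the Linux bonding driver reports for each bond. The sketch below parses text in the format of /proc/net/bonding/bond0 and confirms that bond0 runs 802.3ad over em1 and em2. It assumes the file has already been read into a string, and the function names are illustrative.

```python
def bond_summary(proc_text):
    """Extract the bonding mode and slave interface names from the status text."""
    mode = None
    slaves = []
    for line in proc_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Bonding Mode":
            mode = value.strip()
        elif key.strip() == "Slave Interface":
            slaves.append(value.strip())
    return mode, slaves


def is_expected_bond0(proc_text, expected_slaves=("em1", "em2")):
    """True when the bond runs 802.3ad (mode 4) over exactly the expected interfaces."""
    mode, slaves = bond_summary(proc_text)
    return (
        mode is not None
        and "802.3ad" in mode
        and sorted(slaves) == sorted(expected_slaves)
    )
```

For bond1 on an Edge Node, the same check can be run with `expected_slaves` set to the interfaces listed for the server platform in Table 18.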

Configuring the Network Switches

Configuring the network switches consists of two separate procedures:


1. First Time Setup on page 45
2. Switch Configuration on page 46

First Time Setup


The following steps are necessary for first time setup of a new Dell Networking switch. The switch is
shipped in Bare Metal Provisioning (BMP) mode and needs to be placed into normal running mode.


Perform the following steps to change its mode only if necessary; otherwise, skip to Switch Configuration
on page 46.
To run the first time setup on each switch:
1. Connect to the switch using a serial cable and laptop. The required serial port settings are:
a. 115200 baud rate
b. No parity
c. 8 data bits
d. 1 stop bit
e. No flow control
2. Bring up a HyperTerminal window to connect to the switch.
3. Power on the switch, and wait for the following menu to appear:
To continue with the standard manual interactive mode, it is necessary to
abort BMP.
Press A to abort BMP now.
Press C to continue with BMP.
Press L to toggle BMP syslog and console messages.
Press S to display the BMP status.
4. Choose A to abort Bare Metal Provisioning.
5. Wait for the switch to finish its current activities. You may need to press the [Enter] key to see the
prompt.
6. Type enable, and then press the [Enter] key, to enter privileged mode.
7. Type configure, and then press the [Enter] key, to enter configuration mode.
8. Type reload-type, and then press the [Enter] key, to change the boot mode for the machine.
9. Type boot-type normal-reload, and then press the [Enter] key.
10. Type exit, and then press the [Enter] key, to exit the boot-type submenu.
11. Type do wr, and then press the [Enter] key, to write the new configuration to the switch.
12. Type exit, and then press the [Enter] key, to exit the configure mode.
13. Type reload, and then press the [Enter] key, to cause the switch to reboot into the newly chosen mode.
14. When you are asked to confirm saving the configuration, and to confirm reloading the system, type yes,
and then press the [Enter] key.

Switch Configuration
The configuration procedure is nearly identical for each switch. The only difference is the configuration file
that is copied and pasted into the switch console window. Switch configurations are plain text files.
For each switch, you will need to update the template to specify the actual IP address for the management
interface on the switch. You will also need to update the configuration templates to reflect the correct VLAN
IDs.
To configure each switch:
1. Connect to the switch using a serial cable and laptop. The required serial port settings are:
a. 115200 baud rate
b. No parity
c. 8 data bits
d. 1 stop bit
e. No flow control
2. Bring up a HyperTerminal window to connect to the switch.
3. Press the [Enter] key to display a console prompt.
4. Type enable, and then press the [Enter] key, to enter privileged mode.
5. Type configure, and then press the [Enter] key, to enter configuration mode.

6. Copy the configuration from the appropriate text file, and then paste it into the console window. The files
are named according to the conventions in the cut sheets provided in the download packages.
7. After the configuration finishes copying, press the [Enter] key.
8. Press [Ctrl-z].
9. Type exit, and then press the [Enter] key, to leave configuration mode.
10. Type copy running-config startup-config, and then press the [Enter] key.
11. Type reload, and then press the [Enter] key.


Chapter 7: Server Configuration and OS Installation

Dell EMC PowerEdge servers can be configured with the Dell EMC OpenManage Deployment Toolkit (DTK).
We have developed a simplified tool to enable the DTK to configure Dell EMC servers specifically for
Dell EMC Ready Bundle for Cloudera Hadoop workloads: the DTK Configurator.
The Dell EMC Ready Bundle for Cloudera Hadoop Kickstart Server is used to automate the operating system
installation on all the nodes in a Hadoop stamp. It is comprised of a VMware virtual machine image that
can be run at the customer site on either of:
• Your laptop
• A customer-supplied system in the data center
The kickstart image must be configured with a correct IP address within the customer's networking
environment.

Topics:
• Installing and Configuring the Kickstart Server
• DTK Configurator


Installing and Configuring the Kickstart Server

• Downloading the Installation Packages on page 21


• Configuring the Kickstart VM Image on page 49
• Configuring the Kickstart Server on page 50
• Editing the node-config.json File on page 51

Configuring the Kickstart VM Image


Note: You must install VMware Workstation™ onto your laptop before performing this procedure.

To configure the kickstart VM image:


1. Configure the laptop firewall to allow traffic through to VMware.
a. Navigate to Start > Control Panel > Windows Firewall.
b. Select Allow a program or feature through Windows Firewall.
c. Scroll to VMware Workstation Server, and enable firewall traffic by selecting both:
• Home/Work (Private)
• Public
2. Plug your laptop's Ethernet cable into a port on the management network.
a. Ensure that your laptop's physical port IP addressing matches the customer network (e.g., DHCP or
a static IP address for the laptop wired network).
Note: Two IP addresses are required in order for the kickstart to proceed correctly: one for
the laptop Ethernet port; and one for the VM.
3. Start VMware Workstation by right-clicking on its desktop icon, and then selecting Run as administrator.
4. Navigate to Edit > Virtual Network Editor.
5. Select the Bridged device, usually vmnet0.
a. If the Bridged device is set to automatic, change it to the physical Ethernet port device.
6. Close the Virtual Network Editor.
7. Select File > Open to load the Kickstart VM ovf file into VMware Workstation.
8. Choose the VM from the list.
9. Select Edit virtual machine settings.
10. Ensure that the network adapter is set to Bridged mode.
11. Click on the Advanced button, and make a note of the device's MAC address.
Note: You will need this address after powering on the VM, because the import process may
change it.
12. Power on the VM.
13. Click on Dell EMC Hadoop Kickstart to log into the VM as user dell.
Note: The dell and root users share the same password, Ignition01. Dell EMC recommends that
you perform all actions as the dell user via sudo.
14. Start a terminal session.
15. Determine the physical Ethernet device and its assigned DHCP address, if any:

$ sudo ifconfig

Note: The following steps configure a network interface (eth2 in our examples) over which the
Kickstart Server can PXE boot the cluster nodes. Our examples assume that both eth1 and
eth2 appear after ifconfig is run; however, your environment may be configured differently.
Substitute your interface names as desired.


16. Change to the network-scripts directory:

$ cd /etc/sysconfig/network-scripts
17. Rename the existing ifcfg-eno16777736 file to match the device found in step 15:

$ sudo mv ifcfg-eno16777736 ifcfg-eth2

18. Shut down the interface:

$ sudo ifdown eth2

19. Edit the ifcfg-eth2 file in the text editor of your choice.
a. Ensure that the BOOTPROTO entry is set to none.
b. Change the name to eth2.
c. Change the MAC address to match that reported by VMware Workstation.
d. Add an entry for the IP address:

IPADDR=
e. Add an entry for the network mask:

NETMASK=
f. Add an entry for the gateway:

GATEWAY=
g. Add an entry for the Domain Name Service (DNS) server:

DNS1=
h. Save the file.
20. Restart the network service to bring the interface back up:

$ sudo service network restart
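
As an illustration of what step 19 produces, the following sketch prints a complete ifcfg-eth2 file from the values you collected. It is a minimal example, not the file shipped in the Kickstart VM: the render_ifcfg function name is hypothetical, the DEVICE and ONBOOT entries are assumptions that the steps above do not mention, and the addresses in the usage comment are documentation placeholders.

```shell
# Sketch: print an ifcfg file for a statically addressed interface.
# Arguments: device name, MAC address, IP address, netmask, gateway, DNS server.
# DEVICE and ONBOOT are assumptions; the procedure above does not mention them.
render_ifcfg() {
    local device="$1" mac="$2" ip="$3" netmask="$4" gateway="$5" dns="$6"
    cat <<EOF
DEVICE=${device}
NAME=${device}
BOOTPROTO=none
ONBOOT=yes
HWADDR=${mac}
IPADDR=${ip}
NETMASK=${netmask}
GATEWAY=${gateway}
DNS1=${dns}
EOF
}

# Usage (placeholder values; substitute your own):
#   render_ifcfg eth2 00:0c:29:aa:bb:cc 192.0.2.10 255.255.255.0 192.0.2.1 192.0.2.53
```

Review the printed text against the existing file before saving it; entries already present in ifcfg-eth2 that are not listed here may still be needed in your environment.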

Configuring the Kickstart Server


The machine used as the Kickstart Server should have an IP address that is reachable from the internal
network. You pass this address on the command line to the configuration script, which then configures
the IP address for the Kickstart Server.
To configure the Kickstart Server's IP address:
1. Log onto the Kickstart Server as the dell user.
2. Change to the HTML master directory:

$ cd /var/www/html/master
3. Execute the following command, passing the IP address specified by the customer, or the DHCP
address found earlier:

$ sudo bash ./configure-pxe.sh <ip_address>


4. At the prompt, enter and verify the root password as directed.
Note: This is the root user password for every node in the stamp, not the Kickstart Server itself.

You can now proceed to Editing the node-config.json File on page 51.
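
Because configure-pxe.sh takes the address as a command-line argument, it can help to check the value for typing errors first. The sketch below is one way to do that; the is_ipv4 function is hypothetical and not part of the Kickstart Server, and it only checks that the argument is four dot-separated numbers between 0 and 255, not that the address is reachable.

```shell
# Return success if $1 is a dotted-quad IPv4 address (four octets, each 0-255).
is_ipv4() {
    local a b c d octet
    # Split the argument on dots; anything after the fourth field lands in d.
    IFS=. read -r a b c d <<EOF
$1
EOF
    for octet in "$a" "$b" "$c" "$d"; do
        case "$octet" in
            ''|*[!0-9]*) return 1 ;;   # empty field or non-digit characters
        esac
        [ "${#octet}" -le 3 ] || return 1
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Usage (placeholder address):
#   is_ipv4 "192.0.2.20" && sudo bash ./configure-pxe.sh "192.0.2.20"
```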


Editing the node-config.json File


To edit the node-config.json file:
1. Open the /var/www/html/node-config.json file in a text editor of your choice.
2. Edit the file to ensure that it reflects the customer environment:
• Cluster name
• Cluster domain name
• Gateway and network masks for both bonded interfaces
• Time zone
• Descriptions for each node, including:
• Service tag (case-insensitive)
• Node type:
• Active Name Node
• Standby Name Node
• HA Node
• Edge Node
• Worker Node
• Whether or not the bonds will be configured via DHCP (true/false)
• If true, use the string dhcp as the value for that parameter
• If false, use the static IP address as the value for that parameter
3. From the /var/www/html directory, run the read-json.py script to make sure the node-config.json file
is correct:

$ sudo python dell/read-json.py --file=node-config.json


4. If errors are returned:
a. Fix the issues. For service tag issues, see Troubleshooting Service Tag Errors on page 55.
b. Rerun the read-json.py script.
c. Repeat until all errors are corrected.
Note: A common error is putting a comma at the end of a stanza's last line. The last line
must not end with a comma.
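
Trailing commas and similar syntax slips can also be caught with a generic JSON parser before running read-json.py. The sketch below uses the json.tool module from the Python standard library; the check_json_syntax function name is hypothetical. It only confirms that the file is well-formed JSON and is not a substitute for read-json.py, which checks the file contents.

```shell
# Check that a file parses as JSON. On failure the parser's message is
# printed to stderr and the function returns non-zero.
check_json_syntax() {
    local file="$1" py
    # Use whichever Python interpreter is installed.
    py="$(command -v python3 || command -v python)" || return 2
    "$py" -m json.tool "$file" > /dev/null
}

# Usage, from the /var/www/html directory:
#   check_json_syntax node-config.json && echo "syntax OK"
```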
Any number of Worker Nodes, up to the cluster maximum, can be configured in the node-config.json file. At
a minimum, there should be:
• Two Name Nodes
• One Edge Node
• One HA Node
• Five Worker Nodes
Caution: At this point you are running a set of boot services that will potentially network boot any
node on the connected network that requests such services. You should either limit access to
the console/kickstart network during this procedure, or use another method to prevent unwanted
network installations. Once the installation process is completed for the cluster, the kickstart VM is
shut down, and this issue will no longer be a consideration.
See Example node-config.json File on page 104 for a sample node-config.json file.


DTK Configurator

The DTK Configurator is a USB key bootable image. It enables you to boot any of our architecture-
compliant machines. Once booted, you can select the type of Hadoop machine you wish to build from a
menu. The DTK Configurator will automatically set up all of the following settings, as necessary:
• BIOS
• Firmware
• RAID Controller
• Disks/Volumes
• iDRAC

Using the DTK Configurator


To use the DTK Configurator, you must first write the bootable ISO image to a USB key in either a
Windows® or Linux® environment. Once the DTK Configurator has completed its work on the machine, it
will cause the host to reboot and begin the kickstart procedure. Before booting a machine from the key,
verify that the Kickstart Server, set up in a previous step, is configured and running.
Topics discussed in this section include:
• Writing the ISO to a USB Key in Windows on page 52
• Writing the ISO to a USB Key in Linux on page 53
• USB Boot on page 53

Writing the ISO to a USB Key in Windows


Several software packages are available for Windows® that enable you to copy the bootimage.iso file
onto a USB key; some are free, some are not. The following instructions are for using the Rufus freeware
package. You can use different software if you wish.
Note: USB keys created this way will not work properly if booted in UEFI mode; the system will
appear to boot, but CentOS will kernel-panic halfway through the bootstrap process. If you create a
key using this method, always boot it in BIOS mode.
To write bootimage.iso to a USB key in Windows:
1. Download Rufus from http://rufus.akeo.ie/.
2. Run Rufus.
3. Insert the key into the system.
a. Rufus should detect the key and show it in the Device dropdown. If it does not, manually select the
USB key from the Device dropdown.
4. Under Partition scheme and target system type, select MBR partition scheme for BIOS or UEFI
computers from the drop-down.
5. Under Format Options, ensure that there is a check next to Create a bootable disk using.
a. Select ISO Image from the adjacent drop-down.
6. Click on the CD-ROM icon and browse to the bootimage.iso file.
7. Press the Start button, and then click on the OK button in the subsequent warning dialog box.
a. Rufus will then:
i. Format the USB key
ii. Make it bootable
iii. Copy the contents of the ISO file over to it
8. Once Rufus displays READY at the bottom of the window, close the program and then remove the USB
key.


You can now proceed to USB Boot on page 53.

Writing the ISO to a USB Key in Linux


To write bootimage.iso to a USB key in Linux®:
1. Download the bootimage.iso bootable key image.
2. Download the associated MD5 file.
3. Verify the file against the MD5 checksum by executing the following command, and comparing its
output with the contents of the MD5 file:

# md5sum bootimage.iso
4. List all attached block devices, including USB mass storage devices, by executing the following
command:

# blkid
5. Insert the USB key.
6. Rerun the blkid command. The newly-listed device will be the USB key you just inserted. For example:
[root@data2 ~]# blkid > before
[root@data2 ~]# echo insert key now
insert key now
[root@data2 ~]# blkid > after
[root@data2 ~]# diff before after
23a24
> /dev/sdr1: LABEL="BOOTIMG" UUID="20B4-D909" TYPE="vfat"
7. Create the bootable USB key by executing the following command:

[root@edge ~]# dd if=bootimage.iso of=/dev/sdr1 bs=2048 && sync


8. Once the command completes execution, remove the USB key.
You can now proceed to USB Boot on page 53.
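
The checksum comparison in step 3 can be scripted rather than done by eye. The sketch below is one possible approach; the verify_md5 function name and the bootimage.iso.md5 file name are hypothetical, and the code assumes that the first whitespace-separated field on the first line of the MD5 file is the digest.

```shell
# Compare the MD5 digest of an image with the digest stored in a checksum file.
# Returns 0 on a match, non-zero otherwise.
verify_md5() {
    local image="$1" md5_file="$2" expected actual
    # Assumption: the digest is the first field on the first line of the file.
    expected="$(awk 'NR == 1 { print tolower($1) }' "$md5_file")" || return 2
    actual="$(md5sum "$image" | awk '{ print $1 }')" || return 2
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (file names are examples):
#   verify_md5 bootimage.iso bootimage.iso.md5 && echo "checksum matches"
```

If the function reports a mismatch, download the image again before writing it to the USB key.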

USB Boot
1. Ensure that the target machine is in BIOS boot mode. If it is in UEFI mode:
a. Press [F2] to enter System Setup.
b. Navigate to System BIOS > Boot Settings > Boot Mode > BIOS.
c. Save, and then exit the BIOS.
2. Insert the USB key into one of the USB ports on the target machine.
3. When the machine reboots, and the BIOS boot menu appears, press [F11] to enter BIOS Boot
Manager.
4. Select the One-shot BIOS Boot menu.
5. Select the USB port into which the key is inserted.
6. Select Finish, and exit BIOS Boot Manager to boot the machine.
At this point the machine will boot from the USB key, and display the standard CentOS boot messages.
7. The DTK then checks the machine's hardware model and boot sequence.
Dell EMC PowerEdge R730xd example:

Determining hardware model:


Hardware model is R730xd.
Checking Boot Sequence for defined HardDisk.List:
Found HardDisk in BootSeq BootSeq=HardDisk.List.1-1

Dell EMC PowerEdge FX2 example:

Determining hardware model:

Hardware model is FC630.


Checking Boot Sequence for defined HardDisk.List:
Found HardDisk in BootSeq BootSeq=HardDisk.List.1-1
Checking if CSIOR is enabled:

8. The DTK then checks the system profile configurations.


a. If no prior system profile configuration exists on the machine, a message similar to the following is
displayed:

Checking for an existing configuration on the server:


No existing configuration, continuing.
b. If a prior configuration exists on the machine, a message similar to the following is displayed:

Checking for an existing configuration on the server:


This system appears to have an existing configuration.
Do you want to remove the configuration (y/n)

In this case, respond to the prompt in one of two ways:
• Select n to keep the existing configuration. The DTK aborts the operation, and displays a reboot
message.
• Select y to remove the existing configuration. The DTK responds with a confirmation prompt before
continuing.
Caution: Removing configurations is a destructive operation. Please be sure of your
selection before confirming.
9. The DTK then checks the machine's network interface boot protocols.
a. If the network interfaces are configured correctly, a message similar to the following is displayed:

Checking boot protocol on network devices:


Network devices are configured correctly.
Detected RAID controller 0
Name: PERC H730 Mini
Virtual Disk Count: 0

Detected RAID controller 31


Name: PERC FD33xD
Virtual Disk Count: 0

The DTK then prompts you to select a system profile. See step 10 below.
b. If the network interfaces are configured incorrectly, a message similar to the following is displayed:
Note: In this case, you must allow the machine to reboot in order to continue to Step 10
below.

Checking boot protocol on network devices:


One or more network boot devices do not have the proper setting.
Setting NIC.Integrated.1-1-1 to a boot protocol of PXE.
Boot protocols have been configured, rebooting to process the change.

After the machine reboots, the DTK prompts you to select a system profile. See step 10 below.
10. Follow the prompts to select the system profile that you wish to install:

If you need a command prompt, press Alt+F2.


Choose a system profile:

1. Hadoop Infrastructure

2. Hadoop Worker
3. OpenStack Infrastructure
4. OpenStack Compute
5. OpenStack Storage
6. OpenStack SAH

a. When you are prompted for the IPv4 address and network mask, enter the machine's iDRAC IP
address and mask.
11. When the process is complete, follow the prompt to remove the USB key and reboot the machine.
Note: Certain update packages during this procedure may require that the machine being
updated be rebooted immediately, prior to finishing all updates.
12. If the machine reboots on its own without user intervention, or you do not see the DTK finish message
asking you to press [Enter] to reboot the machine:
a. Rerun the DTK updater on the same machine to finish all available updates.
13. While rebooting, the machine contacts the Kickstart Server, and then performs the operating system
installation based upon the service tag, and the node-config.json file.
14. Perform the cluster test in Before Hadoop Cluster Deployment on page 79.
Note: Once the operating system is installed, the root password for each machine will be the
password that you entered in Configuring the Kickstart Server on page 50.
Troubleshooting Service Tag Errors
If a node's service tag cannot be found in the node-config.json file, you can either:
• Select the appropriate node type from the menu option that is displayed, or
• Add the correct service tag to the node-config.json file
Note: Dell EMC recommends that you add the correct service tag to the node-config.json file, in
order to save time and effort.
If you choose to select the node type from the menu:
1. Select the node type. Available types include:
• Name
• Standby Name
• High Availability
• Edge
• Data
2. The operating system will be installed without customizations typically performed by the kickstart
automation.
3. Manually configure the:
• /etc/hosts file with hostnames and IP addresses of all Hadoop nodes
• bond0 interface
• Domain name
• NTP server configuration
• Optional bond1 interface on Infrastructure nodes
• Operating system tuning parameters
• Local RHEL 7.3 repositories, based upon the installation ISO
• Additional mount points
If you choose to add the service tag to the node-config.json file:
1. Rerun the read-json.py script as in Editing the node-config.json File on page 51. The
customizations will be performed automatically.
2. Reboot the problematic node.


Chapter 8: Additional Packages

The kickstart process installs all necessary OS packages. If you need additional packages, they should
be installed manually.

Topics:
• Checking and Installing Packages


Checking and Installing Packages

Packages must be preinstalled if you plan to use them.


The Kickstart Server virtual machine contains a complete distribution of Red Hat Enterprise Linux Server
7.3. This distribution is used to install the OS onto each of the nodes in the cluster. The RHEL installation
packages are also copied onto the Standby Name Node for use as a remote repository.
All of the nodes in the cluster are configured to use the Standby Name Node as a remote repository for
any software installation. If you need to add any software packages to the cluster, you can use normal
software distribution practices, such as adding the package to the existing repository or manually
installing the package using standard tools.


Chapter 9: Operating System Software Updates

Dell EMC recommends that you perform software updates on a regular basis, for all installed packages.

Topics:
• Software Update Recommendations


Software Update Recommendations

All of the nodes should be configured for either:


• Automatic updates using standard software update mechanisms (e.g., Red Hat Satellite Server)
• Manual updates on an ongoing basis.
These procedures are beyond the scope of this document, and should be managed by local administrators.
Note: It is particularly important that your operating system software be up to date prior to installing
Cloudera Manager.


Chapter 10: Installing Cloudera Manager

After the base operating system has been imaged on all cluster nodes, the next step is to install
Cloudera Manager to complete the deployment. Management of HDFS and other Hadoop services is performed
by Cloudera Manager. The Cloudera Manager software should be installed on the Edge Node.
Note: Before continuing to Configuring the Metadata Database on page 61, best practice is to
perform the cluster test in Before Hadoop Cluster Deployment on page 79.

Topics:
• Configuring the Metadata Database
• Installing Cloudera Manager Software


Configuring the Metadata Database

Refer to the following documents for instructions to configure the PostgreSQL metadata database:
• Cloudera — http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_extrnl_pstgrs.html
• PostgreSQL — https://www.postgresql.org/docs/9.4/static/index.html
Note: The PostgreSQL database should be configured on the HA Node.

Since the Dell EMC Ready Bundle for Cloudera Hadoop installs the PostgreSQL database software on
the appropriate host, you can skip the Installing the External PostgreSQL Server section and refer to these
sections instead:
• Configuring and Starting the PostgreSQL Server
• Creating Databases for Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server,
Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server
• Configuring PostgreSQL for Oozie
To configure the metadata database:
1. Log onto the HA Node as root.
2. Set the correct software localization variables by executing the following commands:

# export LANGUAGE=en_US.UTF-8
# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
3. Initialize the database service, which will copy default configuration files into the appropriate locations:

# mkdir /var/lib/pgsql/9.4
# /usr/pgsql-9.4/bin/postgresql94-setup initdb
# systemctl start postgresql-9.4.service
# systemctl stop postgresql-9.4.service
4. To enable client machines in the local subnet to access the database:
a. Open the /var/lib/pgsql/9.4/data/pg_hba.conf file in a text editor.
b. Add the following lines before all other local and host lines, substituting your local environment's
subnet:

host all all 127.0.0.1/32 md5
host all all 192.168.102.0/24 md5
c. Save and close the file.
5. To make the database listen on all network interfaces of the HA Node:
a. Open the /var/lib/pgsql/9.4/data/postgresql.conf file in a text editor.
b. Change the #listen_addresses='localhost' line to read:

listen_addresses = '*'
c. In this same file, change the settings as listed in step 3 of the Cloudera link given above. These
settings relate to the size of the cluster being installed.
d. Save and close the file.
6. To start the database, and enable it to start automatically after each reboot, execute the following
commands:

# systemctl enable postgresql-9.4.service
# systemctl start postgresql-9.4.service

7. Start the postgres psql client as the postgres user:

# sudo -u postgres psql


8. Execute the following SQL commands:

CREATE ROLE scm LOGIN PASSWORD 'scm';
CREATE DATABASE scm OWNER scm ENCODING 'UTF8';
CREATE ROLE amon LOGIN PASSWORD 'amon_password';
CREATE DATABASE amon OWNER amon ENCODING 'UTF8';
CREATE ROLE rman LOGIN PASSWORD 'rman_password';
CREATE DATABASE rman OWNER rman ENCODING 'UTF8';
CREATE ROLE hive LOGIN PASSWORD 'hive_password';
CREATE DATABASE metastore OWNER hive ENCODING 'UTF8';
ALTER DATABASE metastore SET standard_conforming_strings = off;
CREATE ROLE sentry LOGIN PASSWORD 'sentry_password';
CREATE DATABASE sentry OWNER sentry ENCODING 'UTF8';
CREATE ROLE nav LOGIN PASSWORD 'nav_password';
CREATE DATABASE nav OWNER nav ENCODING 'UTF8';
CREATE ROLE navms LOGIN PASSWORD 'navms_password';
CREATE DATABASE navms OWNER navms ENCODING 'UTF8';
CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD 'oozie' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE DATABASE "oozie" WITH OWNER = oozie
    ENCODING = 'UTF8'
    TABLESPACE = pg_default
    LC_COLLATE = 'en_US.UTF-8'
    LC_CTYPE = 'en_US.UTF-8'
    CONNECTION LIMIT = -1;
CREATE DATABASE hue;
\c hue;
CREATE USER hue WITH PASSWORD 'secretpassword';
GRANT ALL PRIVILEGES ON DATABASE hue TO hue;
\q
9. Exit the postgres psql client.
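
Step 4 asks you to substitute your own subnet in the pg_hba.conf entry. The PostgreSQL documentation writes subnets as a network address with the host bits set to zero, followed by the prefix length. The sketch below derives that form from any address in the subnet; the network_cidr and pg_hba_host_line function names are hypothetical, and the code assumes a well-formed IPv4 address without leading zeros.

```shell
# Print the network address in CIDR form for an IPv4 address and prefix length.
network_cidr() {
    local ip="$1" prefix="$2" a b c d addr mask net
    IFS=. read -r a b c d <<EOF
$ip
EOF
    # Pack the four octets into one integer, then clear the host bits.
    addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( addr & mask ))
    printf '%d.%d.%d.%d/%d\n' \
        $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
        $(( (net >> 8) & 255 )) $(( net & 255 )) "$prefix"
}

# Print a pg_hba.conf "host" line that allows md5 logins from that subnet.
pg_hba_host_line() {
    printf 'host    all    all    %s    md5\n' "$(network_cidr "$1" "$2")"
}
```

For example, `network_cidr 192.168.102.1 24` prints `192.168.102.0/24`. Paste the resulting line into pg_hba.conf by hand; the sketch does not edit the file.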

Installing Cloudera Manager Software

These instructions summarize the overall installation process and call out specific recommendations for the
Dell EMC Ready Bundle for Cloudera Hadoop. For additional details, refer to the Cloudera documentation
at: http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html
You will download the “seed” portion of Cloudera Manager software from Cloudera, and then install the
Cloudera Hadoop environment using their Internet-accessible repositories.
Cloudera Manager is installed upon the Edge Node. To install Cloudera Manager:
1. Log into the Edge Node:
a. Username: root
b. Password: the password that you entered in Configuring the Kickstart Server on page 50
2. Update the package repository information:

# yum clean all


# yum makecache

3. Add the Cloudera repository for the selected release:

# wget -P /etc/yum.repos.d https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo
4. Install the Cloudera Manager server and daemon packages, along with Java, on the Edge Node:

# yum install cloudera-manager-daemons cloudera-manager-server java

a. Accept the GPG keys for the Cloudera repository.


b. Type yes, and then press the [Enter] key, to confirm the installation.
5. Prepare the PostgreSQL database for use by Cloudera Manager:
Note: The PostgreSQL database should be configured on the HA Node.

# /usr/share/cmf/schema/scm_prepare_database.sh -h <hostname_of_HA_node> postgresql scm scm

a. You are prompted for the SCM password. Enter the password to continue.
6. Start the Cloudera server processes:

# service cloudera-scm-server start

Cloudera Manager is now installed. Its HTTP management interface should be reachable on port 7180,
using the admin/admin username and password credentials.
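
The server may take some time to begin answering on port 7180 after the service is started. The following sketch polls the management URL until it responds; the cm_url and wait_for_cm function names, the host name, and the default retry counts are hypothetical, and the code assumes that curl is installed on the machine you run it from.

```shell
# Build the Cloudera Manager URL for an Edge Node host name (port 7180 per this guide).
cm_url() {
    printf 'http://%s:7180/\n' "$1"
}

# Poll a URL until the server answers or the attempts run out.
# $1 = URL, $2 = number of attempts (default 30), $3 = seconds between attempts (default 10)
wait_for_cm() {
    local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
    while [ "$i" -le "$attempts" ]; do
        # curl exits 0 as soon as the server returns any HTTP response.
        if curl --silent --output /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# Usage (placeholder host name):
#   wait_for_cm "$(cm_url edge1.example.com)" && echo "Cloudera Manager is answering"
```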
You can now follow the install wizard steps for a custom deployment, or proceed to Cloudera Configuration
on page 64.
Note: If allowed in your jurisdiction, you should install the Java Cryptography Extension (JCE)
Unlimited Strength Jurisdiction Policy File on all cluster and Hadoop user machines. For JCE Policy
File installation instructions, see the README.txt file included in the jce_policy-x.zip file. You
will be given an option to do this when using Cloudera Manager to deploy the Hadoop Environment.


Chapter 11: Cloudera Configuration

This section describes Cloudera-specific configuration settings that Dell EMC recommends you set. These
changes are not automatically applied by the DTK/Kickstart process, and must be applied manually.
Note: Once you have finished configuring Cloudera, best practice is to perform the cluster test in
After Hadoop Cluster Deployment on page 79.

Topics:
• Cloudera and Network Interfaces
• Using Spark 1 and Spark 2
• Service Assignments
• Hadoop Rack Awareness
• Cloudera Update Recommendations


Cloudera and Network Interfaces

The Cloudera services are not multi-homed, and only function on a single network interface.
The network interface used for the Cloudera services is the interface that corresponds to the fully qualified
node name. For clusters built according to the Dell EMC Ready Bundle for Cloudera Hadoop Architecture
Guide and this Deployment Guide, this is the bond0 interface, and the Cloudera services are available on
the Cluster Data network.
If the network interface names are changed, or an alternative deployment method is used, the Cloudera
services must be explicitly configured to run on the desired network interface.

Using Spark 1 and Spark 2

Cloudera Enterprise 5.10 supports the simultaneous installation and use of Spark 1.x and Spark 2.x.
Spark 2 contains significant API changes and functional improvements over Spark 1. However, it is not
backwards compatible with Spark 1. Cloudera Enterprise supports both versions by treating Spark 2 as an
additional service in Cloudera Manager.
Spark 2 is a separate download, not included in the base installation. Complete instructions are available
at: http://www.cloudera.com/downloads/spark2/2-0.html.
To install and configure Spark 2:
1. Follow the instructions on the Cloudera Spark 2 page to download and install the Spark 2 parcel. The
most direct way is to configure the Spark 2 parcel repository in Cloudera Manager.
2. Follow the guidelines in Service Assignments on page 65 to add the Spark 2 service to the cluster.
The Service Assignments on page 65 include recommendations for both services. You can configure
either Spark 1 or Spark 2, or configure both depending on your requirements.

Service Assignments

These are the recommended service role to node assignments for the cluster configuration.
As part of Cloudera installation, the mapping of service roles to nodes must be specified. We recommend
the service role assignments in Table 19: Service Role Assignments on page 65 below as a starting
point.

Table 19: Service Role Assignments

Role Physical Nodes


HDFS
NameNode Active Name Node
Secondary NameNode Standby Name Node
Balancer Standby Name Node
HttpFS Edge Node, Active Name Node
NFSGateway Active Name Node

Dell EMC Ready Bundle for Cloudera Hadoop


66 | Cloudera Configuration

Role Physical Nodes


DataNode Worker Node 1, Worker Node 2, ... Worker Node N
Hive
Gateway all nodes
Hive Metastore Server Standby Name Node
WebHCat Server Standby Name Node
HiveServer2 Standby Name Node
Hue
Hue Server Standby Name Node
Impala
Impala Catalog Server Active Name Node
Impala StateStore Active Name Node
Impala Daemon same servers as DataNode role
Cloudera Management Service
Service Monitor Standby Name Node
Activity Monitor Standby Name Node
Host Monitor Standby Name Node
Reports Manager Standby Name Node
Event Server Standby Name Node
Alert Publisher Standby Name Node
Navigator Audit Standby Name Node
Navigator Metadata Server Standby Name Node
Oozie
Oozie Server Standby Name Node
Spark
Gateway all nodes
History Server Standby Name Node
Spark 2
Spark 2 Gateway all nodes
Spark 2 History Server Standby Name Node
YARN (MR2 Included)
Resource Manager Active Name Node
Job History Server Active Name Node
Node Manager same servers as DataNode role
Gateway all nodes

ZooKeeper
ZooKeeper Server Active Name Node, Standby Name Node, HA Node

Hadoop Rack Awareness

Hadoop rack awareness takes a node's network location into account when scheduling tasks and
allocating storage. Cloudera Manager allows the specification of the rack/switch location for each node in
the cluster. You must configure rack awareness to achieve optimal performance and high availability.
HDFS, MapReduce, and YARN will automatically use the location information (topology) that you specify to
optimize reliability and performance. The default installation of Cloudera places all nodes in the same rack.
If your cluster contains more than one rack, you should specify the topology for each node based on the
rack and pod location for each host. We recommend specifying the topology for all clusters, even if they
are a single rack.
The location of a node is specified using a hierarchical path, such as:
• /pod1/rack1
• /pod1/rack2
• /pod2/rack4
Note: It is important to specify both the pod and rack level information, and the rack component
should be unique within the cluster.
The rack location for hosts is specified in Cloudera Manager, under the hosts tab. For more information,
please see:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_specify_rack.html
You must restart the affected services after making these changes.
We provide the set_rackId.py utility to assist in configuring the correct rack awareness values for a
cluster. set_rackId.py can set rack identifiers based on the hostname, chassis serial number, or a
supplied list of hosts and identifiers. Refer to the included README file for details on how to run this utility.
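The uniqueness rule in the note above can be checked before the values are entered in Cloudera Manager. The following is a minimal sketch, not part of set_rackId.py: the function names and the host-to-location mapping are hypothetical, and it assumes two-level /pod/rack paths as described in this section.

```python
from collections import defaultdict


def parse_location(path):
    """Split a topology path such as '/pod1/rack2' into (pod, rack)."""
    parts = path.strip("/").split("/")
    if not path.startswith("/") or len(parts) != 2 or not all(parts):
        raise ValueError(f"expected /pod/rack, got {path!r}")
    return parts[0], parts[1]


def find_duplicate_racks(host_locations):
    """Return rack names that appear under more than one pod."""
    pods_by_rack = defaultdict(set)
    for path in host_locations.values():
        pod, rack = parse_location(path)
        pods_by_rack[rack].add(pod)
    return sorted(rack for rack, pods in pods_by_rack.items() if len(pods) > 1)


# Hypothetical host-to-location mapping, for illustration only.
locations = {
    "worker01.example.com": "/pod1/rack1",
    "worker02.example.com": "/pod1/rack2",
    "worker03.example.com": "/pod2/rack4",
}
print(find_duplicate_racks(locations))
```

An empty result means every rack name is used in only one pod, which is the condition the note asks for.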

Dell EMC PowerEdge FX2 Rack Awareness


The Dell EMC PowerEdge FX2 platform requires a slightly different configuration for rack awareness.
For this platform, multiple nodes share a single chassis, which creates a fault zone at a lower level than
a rack. This scenario is similar to the one that exists when running Hadoop in virtualized environments,
where multiple virtual machines can exist on the same physical host. To inform Hadoop of this scenario,
we enable the Hadoop Virtualization Extensions (HVE) in addition to specifying the node topology.
For more information on the Hadoop Virtualization Extensions, see:
https://issues.apache.org/jira/browse/HADOOP-8468 and https://issues.apache.org/jira/browse/
HDFS-6261.
The rack location for hosts is specified in Cloudera Manager, under the hosts tab. For more information,
please see:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_specify_rack.html
To configure rack awareness and enable HVE:
1. Specify the location of each node in Cloudera Manager, under the hosts tab, using a path of the form
/pod/rack/chassis, e.g., /pod1/rack2/chassis3. The rack and chassis information should be unique within
the cluster.
Note: The Dell EMC PowerEdge FX2 chassis serial number is a good unique identifier.
2. Change the Replica Placement Policy in Cloudera Manager by adding the following to the HDFS
core-site.xml safety valve:

<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>

You must restart the affected services after making these changes.
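The three-level paths from step 1 can be generated from an inventory rather than typed by hand. The sketch below is illustrative only: the function names, host names, and serial numbers are hypothetical, and it assumes the chassis serial number is used as the chassis component, as the note in step 1 suggests. Grouping hosts by path shows which nodes share a chassis and therefore a fault zone.

```python
from collections import defaultdict


def fx2_location(pod, rack, chassis_serial):
    """Build a /pod/rack/chassis path, using the chassis serial as the chassis ID."""
    for part in (pod, rack, chassis_serial):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"/{pod}/{rack}/{chassis_serial}"


def hosts_by_node_group(host_locations):
    """Group hosts by their full /pod/rack/chassis path (one group per chassis)."""
    groups = defaultdict(list)
    for host, path in sorted(host_locations.items()):
        groups[path].append(host)
    return dict(groups)


# Hypothetical inventory: (pod, rack, chassis serial) per host.
inventory = {
    "fx2-a1.example.com": ("pod1", "rack1", "CH1AAAA"),
    "fx2-a2.example.com": ("pod1", "rack1", "CH1AAAA"),
    "fx2-b1.example.com": ("pod1", "rack1", "CH2BBBB"),
}
locations = {host: fx2_location(*loc) for host, loc in inventory.items()}
for path, hosts in hosts_by_node_group(locations).items():
    print(path, hosts)
```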

Cloudera Update Recommendations

Dell EMC recommends installing the latest Cloudera maintenance updates during initial installation and as
part of normal administration processes.
For parcel deployment, updates are managed in the Settings section of Cloudera Manager, under Parcels.
The Cloudera Manager repositories are normally accessed via HTTP. Some environments may require the
use of an HTTP proxy server, which can be specified under Settings/Network.


Chapter 12
Installing Syncsort DMX-h

Topics:
• Syncsort DMX-h Prerequisites
• Syncsort DMX-h Software Packages and Versions
• Installation Procedure

Syncsort® DMX-h® is an Extract, Transform, Load (ETL) product for Hadoop, and is an optional installation.
For more information about Syncsort, see the Syncsort website. Access to the Syncsort support portal
requires a valid site login account.
This topic briefly describes installing DMX-h on a Dell EMC Ready Bundle for Cloudera Hadoop
architecture-compliant cluster, and configuring it to extract data from a PostgreSQL database using that
database's ODBC driver. The detailed directions for installing DMX-h are in the Syncsort DMX-h
Installation Guide.
Note: For information about configuring other data sources, such as Oracle, DB2, and Sybase, see the
Syncsort website.


Syncsort DMX-h Prerequisites

The following prerequisites must be met:


• The Cloudera software must be installed and configured on the cluster.
• You must have downloaded all of the necessary software packages from Syncsort to install DMX-h.
• You must have identified a Windows-based computer for running the DMX-h client tools that has
access to both the Cloudera cluster and the data sources you wish to use.
• You must have the appropriate access permissions on both the Cloudera cluster and the Windows
computer.

Syncsort DMX-h Software Packages and Versions

These instructions assume that you will be installing and using the following versions of software:
• Red Hat Enterprise Linux Server 7.3 - installed and configured on all required Cloudera nodes
• Cloudera Distribution for Apache Hadoop 5.10 - installed with all nodes running their proper roles
• Syncsort DMX-h 9.2
• Syncsort DMX-h license key file

Installation Procedure

To install and configure Syncsort DMX-h on a Dell EMC Ready Bundle for Cloudera Hadoop architecture-
compliant cluster:
1. Acquire Syncsort Files on page 70
2. Install the DMX-h IDE on page 71
3. Configure the Syncsort Parcel for Cloudera on page 71
4. Install DMX-h on the Edge Node on page 71

Acquire Syncsort Files


To download the required Syncsort DMX-h installation files:
1. Register for a support account at http://www.syncsort.com/en/SupportandServices/SupportandServices.
2. Log into the Syncsort website using your licensed Syncsort account.
3. Download the following files from Syncsort:

Table 20: Syncsort Installation Files

File | Description | Location
dmexpress_9-2_windows_x64.exe | Windows workstation installer | User account's Downloads page
dmexpress-9.2-1.x86_64_en.bin | Red Hat RPM | User account's Downloads page
dmexpress-9.2-el7.parcel_en.bin | Red Hat/Cloudera parcel | User account's Downloads page
Syncsort DMX-h ETL 9.2 Installation Guide | DMExpress Installation Guide | User account's Downloads page
DMExpressLicense.txt | License text file | User account's Home page

You can now proceed to Install the DMX-h IDE on page 71.

Install the DMX-h IDE


You must install the DMX-h IDE onto the Windows computer identified earlier. This machine will be used to
copy and orchestrate the ETL connections between the database and the Cloudera cluster.
To install the DMX-h IDE:
1. Copy the dmexpress_9-2_windows_x64.exe installer program and DMExpressLicense.txt file onto the
Windows computer.
2. Run the installer program with Administrator privileges, and accept all defaults.
3. When prompted to provide a license key:
• If you have purchased a license key, select Provide license key in the license dialog window, and
browse to the DMExpressLicense.txt file.
• If you have not purchased a license key, select Start free trial to evaluate DMExpress.
When the installation is complete, a new sub-menu called DMExpress will be available in Windows. You
can now proceed to Configure the Syncsort Parcel for Cloudera on page 71.

Configure the Syncsort Parcel for Cloudera


To extract, place, distribute and activate the Syncsort parcel for Cloudera:
1. Copy the dmexpress-9.2-el7.parcel_en.bin file into a writable directory on the HA Node.
2. Change the file permissions so that it is executable:

# chmod a+x dmexpress-9.2-el7.parcel_en.bin


3. Run the executable:

# ./dmexpress-9.2-el7.parcel_en.bin
4. Specify the extraction directory as /opt/cloudera/parcel-repo/.
5. Log into the Cloudera Management Console as the administrator user.
6. Navigate to Hosts > Parcels to display the parcels administration page.
7. If the new Syncsort parcel is not displayed, perform a scan for newly-available parcels.
8. Select Automatically Distribute Available Parcels to distribute the Syncsort parcel to all nodes.
9. Click on the Save Changes button.
10. Once the operation is complete, activate the parcel on all nodes.
You can now proceed to Install DMX-h on the Edge Node on page 71.

Install DMX-h on the Edge Node


The dmxd service should reside on the Edge Node. To install DMX-h:
1. Copy the DMX-h RPM to the Edge Node.
2. Change the file permissions so that it is executable:

# chmod a+x dmexpress-9.2-1.x86_64_en.bin


3. Extract the contents to an installation directory, located under the current directory, by executing the
following command:

# ./dmexpress-9.2-1.x86_64_en.bin

4. Change to the newly-created directory:

# cd <new_directory>

a. Ensure that the directory contains a dmexpress-9.2-1.x86_64.rpm file.


Note: The language descriptor (_en) does not appear in the extracted file name.
5. Install the RPM, which creates a dmexpress folder under /usr, by executing the following command:

# rpm -i dmexpress-9.2-1.x86_64.rpm

a. To install to a different location, use the --prefix option as described in the rpm man page.
6. Install and configure the dmxd service by issuing the following commands as the root user:

# cd /usr/dmexpress
# ./install
7. Select the option to install and run the dmxd daemon on the Edge Node.
a. Select the following when prompted:
• Select [2] to configure the DMExpress Service.
• Select [y] or [n] to choose whether or not to use PAM for authentication.
• Select [y] or [n] to choose whether or not to start the DMExpress Service automatically.
• Select [y] or [n] to choose whether or not to start the DMExpress Service now.
Syncsort DMX-h and the ODBC connectors are now installed and configured to allow ETL between the
PostgreSQL database and the Dell EMC Ready Bundle for Cloudera Hadoop architecture-compliant
cluster.


Chapter 13
YARN Performance Optimization

Topics:
• YARN Applications
• Determining the Reserved Memory
• Hadoop Configuration Settings

This topic describes how to configure YARN and MapReduce memory allocation settings for the Dell EMC
Ready Bundle for Cloudera Hadoop, based upon the node hardware specifications. These guidelines were
developed using several documents publicly available from Cloudera:
• http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/
• http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
Note: These guidelines have been tested on Dell EMC Ready Bundle for Cloudera Hadoop cluster
configurations.


YARN Applications

The performance of YARN applications can be tuned based upon the hardware resources of the
cluster, especially the physical cores and memory. YARN takes into account all of the available compute
resources on each machine in the cluster. Based on the available resources, YARN:
1. Negotiates resource requests from applications (such as MapReduce) running in the cluster
2. Provides processing capacity to each application by allocating Containers
Note: A Container is the basic unit of processing capacity in YARN, and is an encapsulation of
resource elements (memory, CPU, etc.).
In a Hadoop cluster, it is vital to balance the usage of memory (RAM), processors (CPU cores), and disks
so that processing is not constrained by any one of these cluster resources. As a general recommendation,
allowing for two Containers per disk and per core provides the best balance for cluster utilization.
When determining the appropriate YARN and MapReduce memory configurations for a cluster node, start
with the available hardware resources. Specifically, note the following values on each node:
• RAM - Amount of memory
• Cores - Number of CPU cores
• Disks - Number of disks
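One possible reading of the two-Containers-per-disk-and-per-core guideline is sketched below. This is an assumption for illustration, not a formula given by Dell EMC or Cloudera: it takes the smaller of the two limits so that neither cores nor disks carry more than two Containers each. The function name and the sample hardware values are hypothetical.

```python
def balanced_container_count(cores, disks, containers_per_unit=2):
    """Largest container count that keeps both cores and disks at or below
    containers_per_unit containers each."""
    if cores < 1 or disks < 1:
        raise ValueError("cores and disks must be positive")
    return min(containers_per_unit * cores, containers_per_unit * disks)


# Hypothetical worker node: 20 cores and 12 data disks.
print(balanced_container_count(cores=20, disks=12))
```

The available RAM then has to be divided among that many Containers, which is why the Reserved Memory is determined next.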

Determining the Reserved Memory

The total available RAM for YARN and MapReduce should take into account the Reserved Memory.
Reserved Memory is the RAM needed by system processes and other Hadoop processes (such as
HBase).
To determine Reserved Memory per node:
1. Use the Search facility in Cloudera Manager to find the values for the following Role Instance Memory
parameters:
a. Memory Overcommit Threshold — Navigate to Cloudera Manager (CM) > Hosts > [select a
DataNode Host] > Configuration
b. Java Heap Size of Worker Node — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
DataNode > Configuration
c. Java Heap Size of NFS Gateway — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
NFS Gateway > Configuration
d. Java Heap Size of NodeManager — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
NodeManager > Configuration
2. Sum those values to determine the Role Instance Memory.
3. Then, use the following formula:

Reserved Memory = System Memory + Role Instance Memory

Table 21: Reserved Memory Recommendations on page 75 provides Dell EMC's recommended
Reserved Memory values.


Table 21: Reserved Memory Recommendations

Memory | Description | Suggested Value
Worker Node Memory | | # grep MemTotal /proc/meminfo
Memory Overcommit Threshold | Threshold used when validating the allocation of RAM on a host. Values can range from 0 to 1. | 0.8 (default)
System Memory | (1 - Memory Overcommit Threshold) x Worker Node Memory | 0.2 x Worker Node Memory
Role Instance Memory: Worker Node | Java Heap Size of Worker Node in Bytes (+ 30% Overhead) | 1GB (default) + 30%
Role Instance Memory: Worker Node | Maximum Memory used for caching (dfs.datanode.max.locked.memory) | 4GB (default)
Role Instance Memory: NFS Gateway | Java Heap Size of NFS Gateway in Bytes (+ 30% Overhead) | 256MB (default) + 30%
Role Instance Memory: Node Manager | Java Heap Size of Node Manager in Bytes (+ 30% Overhead) | 1GB (default) + 30%
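The Reserved Memory formula can be worked through with the default values from Table 21. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sample MemTotal value (a 256 GB node) is invented for illustration, and the 30% overhead is applied to the three Java heap sizes but not to the DataNode cache, following the table.

```python
OVERHEAD = 1.3  # Java heap size plus the 30% overhead from Table 21


def parse_memtotal_mb(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo content, in MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / 1024
    raise ValueError("MemTotal not found")


def reserved_memory_mb(node_mb, overcommit_threshold=0.8,
                       datanode_heap_mb=1024, datanode_cache_mb=4096,
                       nfs_gateway_heap_mb=256, nodemanager_heap_mb=1024):
    """Reserved Memory = System Memory + Role Instance Memory (all in MB)."""
    system_mb = (1 - overcommit_threshold) * node_mb
    role_mb = (datanode_heap_mb * OVERHEAD + datanode_cache_mb
               + nfs_gateway_heap_mb * OVERHEAD
               + nodemanager_heap_mb * OVERHEAD)
    return system_mb + role_mb


# Hypothetical /proc/meminfo excerpt for a node reporting 256 GB.
sample = "MemTotal:       268435456 kB\nMemFree:        1000 kB\n"
node_mb = parse_memtotal_mb(sample)
reserved = reserved_memory_mb(node_mb)
print("Reserved Memory (MB):", round(reserved))
print("Memory left for YARN (MB):", round(node_mb - reserved))
```

On a real node, MemTotal is somewhat lower than the installed RAM, so read the value from the node rather than from the hardware specification.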

Hadoop Configuration Settings

The YARN and MapReduce configurations should be set as per Table 22: YARN and MapReduce RAM
Settings on page 75.

Table 22: YARN and MapReduce RAM Settings

Configuration Setting | Description | Suggested Value
yarn.scheduler.maximum-allocation-vcores | The largest number of virtual CPU cores (vCPU) that can be requested for a container. | num cores in a Worker Node - 1
yarn.scheduler.increment-allocation-vcores | vCPU allocation must be a multiple of this value. | 1
yarn.scheduler.minimum-allocation-vcores | The smallest number of virtual CPU cores that can be requested for a container. | 1
yarn.nodemanager.resource.cpu-vcores | Number of virtual CPU cores that can be allocated for containers. | num cores in a Worker Node - 1
mapreduce.map.cpu.vcores | The number of virtual CPU cores allocated for each map task of a job. | 1
mapreduce.reduce.cpu.vcores | The number of virtual CPU cores allocated for each reduce task of a job. | 1
yarn.scheduler.maximum-allocation-mb | The largest amount of physical memory, in MB, that can be requested for a container. | Worker Node Memory - Reserved Memory
yarn.scheduler.increment-allocation-mb | Memory allocation must be a multiple of this value. | 512
yarn.scheduler.minimum-allocation-mb | The smallest amount of physical memory, in MB, that can be requested for a container. | 1024
yarn.nodemanager.resource.memory-mb | The amount of physical memory, in MB, that can be allocated for containers. | Worker Node Memory - Reserved Memory
mapreduce.map.memory.mb | The amount of physical memory, in MB, allocated for each map task of a job. | 1024
mapreduce.reduce.memory.mb | The amount of physical memory, in MB, allocated for each reduce task of a job. | 2048
yarn.app.mapreduce.am.resource.mb | The amount of memory required to run the ApplicationMaster. | 2048
yarn.app.mapreduce.am.command-opts | Java command line arguments passed to the MapReduce ApplicationMaster. | -Djava.net.preferIPv4Stack=true -Xmx1717986918
ApplicationMaster Java Maximum Heap Size | The maximum heap size, in bytes, of the Java MapReduce ApplicationMaster. This number will be formatted and concatenated with 'ApplicationMaster Java Opts Base' to pass to Hadoop. | 1717986918
mapreduce.map.java.opts | Java opts for the map processes. | -Djava.net.preferIPv4Stack=true -Xmx858993459
Map Task Maximum Heap Size | The maximum Java heap size, in bytes, of the map processes. This number will be formatted and concatenated with 'Map Task Java Opts Base' to pass to Hadoop. | 858993459
mapreduce.reduce.java.opts | Java opts for the reduce processes. | -Djava.net.preferIPv4Stack=true -Xmx1717986918
Reduce Task Maximum Heap Size | The maximum Java heap size, in bytes, of the reduce processes. This number will be formatted and concatenated with 'Reduce Task Java Opts Base' to pass to Hadoop. | 1717986918
mapreduce.task.io.sort.mb | The total amount of memory buffer, in MB, to use while sorting files. | Default=256
mapreduce.map.sort.spill.percent | The soft limit in either the buffer or record collection buffers. When this limit is reached, a thread will begin to spill the contents to disk in the background. | Default=0.8, Recommended (> 0.5)
mapreduce.job.reduce.slowstart.completedmaps | Fraction of the number of map tasks in the job which should be completed before reduce tasks are scheduled for the job. | Default=0.8, Depending on workload and Configuration (valid range: 0 – 1)
mapreduce.job.maps | The default number of map tasks per job. | num Worker Node cores x num Worker Nodes
mapreduce.job.reduces | The default number of reduce tasks per job. | (Valid range: 1/3 – 1) x mapreduce.job.maps
dfs.blocksize | The default block size in bytes for new HDFS files. | Valid range: 256MB-1GB
dfs.replication | The number of replications to make when the file is created. | 3
dfs.namenode.handler.count | The number of server threads for the Name Node. | 30
dfs.datanode.handler.count | The number of server threads for the Worker Node. | 10

Note: After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf
folder. If using Cloudera Manager, these settings should be entered via the YARN configuration
tool.
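The hardware-dependent values in Table 22 can be derived with a few lines of arithmetic. The sketch below is illustrative only: the function names and the sample cluster (10 Worker Nodes, 20 cores, 262144 MB of RAM, 59520 MB reserved) are hypothetical. The 80% heap-to-container ratio is inferred from the table's figures (1717986918 bytes is 80% of 2048 MB, and 858993459 bytes is 80% of 1024 MB); it is not stated explicitly in this guide.

```python
MIB = 1024 * 1024


def heap_bytes(container_mb, heap_fraction_pct=80):
    """Java heap size in bytes for a container of container_mb MB (integer math)."""
    return container_mb * MIB * heap_fraction_pct // 100


def java_opts(container_mb):
    """Format the Java options string in the style used in Table 22."""
    return f"-Djava.net.preferIPv4Stack=true -Xmx{heap_bytes(container_mb)}"


def suggested_settings(cores_per_node, node_mb, reserved_mb, worker_nodes):
    """Compute the hardware-dependent suggestions from Table 22."""
    yarn_mb = node_mb - reserved_mb
    maps = cores_per_node * worker_nodes
    return {
        "yarn.nodemanager.resource.cpu-vcores": cores_per_node - 1,
        "yarn.scheduler.maximum-allocation-vcores": cores_per_node - 1,
        "yarn.nodemanager.resource.memory-mb": yarn_mb,
        "yarn.scheduler.maximum-allocation-mb": yarn_mb,
        "mapreduce.map.java.opts": java_opts(1024),
        "mapreduce.reduce.java.opts": java_opts(2048),
        "mapreduce.job.maps": maps,
        # Table 22 allows 1/3 to 1 times mapreduce.job.maps; the lower bound is used here.
        "mapreduce.job.reduces": max(1, maps // 3),
    }


# Hypothetical cluster, for illustration only.
settings = suggested_settings(cores_per_node=20, node_mb=262144,
                              reserved_mb=59520, worker_nodes=10)
for name, value in settings.items():
    print(f"{name} = {value}")
```

Treat the results as starting points; the reduce count and slow-start fraction in particular depend on the workload.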


Chapter 14
Cluster Testing

Topics:
• Before Hadoop Cluster Deployment
• After Hadoop Cluster Deployment

You should test your Hadoop cluster both before and after Cloudera Manager has deployed the cluster.
The tests you perform will vary depending upon the deployment status.


Before Hadoop Cluster Deployment

Before the Hadoop cluster has been deployed by Cloudera Manager:


1. Verify access to archive.cloudera.com by running these commands:

# curl -I archive.cloudera.com
# dig @<dns_server> archive.cloudera.com
# yum repolist
# more /etc/yum.repos.d/*
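When many nodes need checking, the repository definitions shown by the last command can be summarized programmatically. The following is a minimal sketch: the function name and the sample repository content are hypothetical, and it assumes the .repo files use the usual INI-style layout, which Python's configparser can read.

```python
import configparser


def list_repos(repo_file_text):
    """Return {repo_id: (enabled, baseurl)} from the text of a yum .repo file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_file_text)
    repos = {}
    for repo_id in parser.sections():
        section = parser[repo_id]
        # yum treats a repository as enabled when the key is absent.
        enabled = section.get("enabled", "1").strip() == "1"
        repos[repo_id] = (enabled, section.get("baseurl", "").strip())
    return repos


# Hypothetical .repo content; the URLs are placeholders, not real repository paths.
sample = """\
[cloudera-manager]
name=Cloudera Manager
baseurl=http://archive.cloudera.com/example/
gpgcheck=1
enabled=1

[local-mirror]
name=Local mirror
baseurl=http://mirror.example.com/rhel7/
enabled=0
"""
for repo_id, (enabled, url) in list_repos(sample).items():
    print(repo_id, "enabled" if enabled else "disabled", url)
```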

After Hadoop Cluster Deployment

After the Hadoop cluster has been deployed by Cloudera Manager:


1. Run the Host Inspector from the Cloudera Manager user interface.
2. Monitor Cloudera Manager health checks on a regular basis.
3. You may find it useful to run the teragen, terasort, and teravalidate MapReduce jobs,
utilizing all cluster nodes for a period of time. For more information on terasort, teragen, and
teravalidate, see the following link:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
Note: Currently there is a known issue when running teravalidate using MapReduce1. Dell
EMC and Cloudera recommend that you run teravalidate using YARN (MapReduce2) instead.
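The three jobs are run in sequence, each reading the previous job's output. The sketch below only builds the command lines; it does not run them. The jar path is an assumption about where a CDH parcel installation places the MapReduce examples jar, so verify it on your cluster. The function name, row count, and HDFS directory are hypothetical.

```python
import shlex

# Assumed location of the MapReduce examples jar in a CDH parcel install;
# verify this path on your cluster before use.
EXAMPLES_JAR = "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"


def terasort_commands(rows, base_dir, jar=EXAMPLES_JAR):
    """Build the teragen, terasort, and teravalidate command lines, in run order."""
    gen, sort, report = (f"{base_dir}/{name}"
                         for name in ("teragen", "terasort", "teravalidate"))
    return [
        ["hadoop", "jar", jar, "teragen", str(rows), gen],
        ["hadoop", "jar", jar, "terasort", gen, sort],
        ["hadoop", "jar", jar, "teravalidate", sort, report],
    ]


# teragen writes 100-byte rows, so 10,000,000 rows is roughly 1 GB of input.
for cmd in terasort_commands(rows=10_000_000, base_dir="/tmp/terasort-test"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Scale the row count so that the job occupies all cluster nodes for a meaningful period of time.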


Chapter 15
QuickStart Configuration Differences

Topics:
• QuickStart Node Configuration Differences
• QuickStart Network Configuration Differences
• QuickStart Service Assignments

There are differences between the full cluster and QuickStart configurations.


QuickStart Node Configuration Differences

The QuickStart configuration is intended for proof of concept installations, and is not a full cluster
configuration. The QuickStart uses the same node configurations as a full cluster, but includes only 5
nodes and does not include a high availability network.
The recommended QuickStart node usage is shown in Table 23: QuickStart Node Roles on page 81.

Table 23: QuickStart Node Roles

Physical Node | Software Function
Active Name Node | NameNode, Resource Manager, ZooKeeper, Quorum Journal Node, HMaster, Impala State Store and Catalog Daemons
Standby Name Node | Yum Repositories, Hadoop Clients, Cloudera Manager, Spark History Server, Spark2 History Server, Standby NameNode, Standby Resource Manager (optional), ZooKeeper, Quorum Journal Node
Worker Node 1 | ZooKeeper, Quorum Journal Node, DataNode, NodeManager, HBase RegionServer, ImpalaDaemon
Worker Node 2 and 3 | DataNode, NodeManager, HBase RegionServer, ImpalaDaemon


QuickStart Network Configuration Differences

The QuickStart configuration uses the same switches and switch configurations as a full cluster. However,
the dual switches that provide high availability are not included.
To configure networking for the QuickStart configuration:
1. Configure switches and cabling as in a full cluster deployment, using only switch S4048-1.
2. Each node will have a single connection to the cluster data network instead of dual connections.
3. Configure hosts and IP addressing using the same method as a full cluster deployment.

QuickStart Service Assignments

Table 24: QuickStart Service Role Assignments on page 82 shows the recommended service role to
node assignments for the QuickStart configuration.

Table 24: QuickStart Service Role Assignments

Role Nodes
HDFS
NameNode Active Name Node
Secondary NameNode Standby Name Node
Balancer Standby Name Node
HttpFS Active Name Node
NFSGateway Active Name Node
DataNode Worker Node 1, Worker Node 2,... Worker Node N
Hive
Gateway all nodes
Hive Metastore Server Standby Name Node
WebHCat Server Standby Name Node
HiveServer2 Standby Name Node
Hue
Hue Server Standby Name Node
Impala
Impala Catalog Server Active Name Node
Impala StateStore Active Name Node
Impala Daemon same servers as DataNode role
Cloudera Management Service
Service Monitor Standby Name Node
Activity Monitor Standby Name Node

Host Monitor Standby Name Node
Reports Manager Standby Name Node
Event Server Standby Name Node
Alert Publisher Standby Name Node
Navigator Audit Standby Name Node
Navigator Metadata Server Standby Name Node
Oozie
Oozie Server Standby Name Node
Spark
Gateway all nodes
History Server Active Name Node
Spark 2
Spark 2 Gateway all nodes
Spark 2 History Server Standby Name Node
YARN (MR2 Included)
Resource Manager Active Name Node
Job History Server Active Name Node
Node Manager same servers as DataNode role
Gateway all nodes
ZooKeeper
ZooKeeper Server Active Name Node, Standby Name Node, Worker Node 1


Appendix A
BIOS Configuration

Topics:
• IPMI Configuration
• Primary BIOS Settings
• Infrastructure Node Settings
• Worker Node Settings

This appendix describes BIOS configurations on Dell EMC PowerEdge R730xd and Dell EMC PowerEdge
FC630 server hardware for the Dell EMC Ready Bundle for Cloudera Hadoop with Red Hat Enterprise
Linux Server 7.3.
Note: The Dell EMC-provided DTK tool updates all of the necessary IPMI/BIOS/iDRAC settings for you.
Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Infrastructure Node Settings on
page 85 and Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker Node
Settings on page 86 contain all of the settings performed by the DTK, and are provided here for your
reference.


IPMI Configuration

You must configure the iDRAC on supported systems. Dell EMC recommends that you configure these
settings from the iDRAC web interface, or directly on the node console:
• User Information
• Network Configuration
• IPMI Validation

Primary BIOS Settings

The primary BIOS configurations for the Dell EMC Ready Bundle for Cloudera Hadoop are for
Infrastructure Nodes and Worker Nodes.
• For more information about Dell EMC PowerEdge R730xd BIOS settings, please see the Dell EMC
PowerEdge R730xd Owner's Manual.
Note: Dell EMC recommends that you perform BIOS updates on a regular basis. It is particularly
important that your operating system firmware be up to date prior to installing Cloudera Manager.

Infrastructure Node Settings

This section describes required settings for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge
FC630 Infrastructure nodes (Cloudera Manager node, optional Administration Node, HDFS Active and
Standby Name Nodes, etc.).

Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Infrastructure Node
Settings

Type | Setting | State
Boot Settings | Boot Mode | BIOS
Boot Option Settings | BIOS Boot Sequence | Dell EMC PowerEdge R730xd: Integrated RAID first; Dell EMC PowerEdge FC630: Modular RAID first
Memory Settings | System Memory Testing | Disabled
Memory Settings | Memory Operating Mode | Optimizer Mode
Memory Settings | Node Interleaving | Disabled
Memory Settings | Snoop Mode | Early Snoop
Memory Settings | Memory Speed | Dell EMC PowerEdge R730xd: Maximum; Dell EMC PowerEdge FC630: Setting does not exist
Processor Settings | Logical Processor (HT) | Enabled
Processor Settings | QPI Speed | Maximum Data Rate
Processor Settings | Alternate RTID Setting | Disabled
Processor Settings | Virtualization Technology | Disabled
Processor Settings | Adjacent Cache Line Prefetch | Enabled
Processor Settings | Hardware Prefetcher | Enabled
Processor Settings | DCU Streamer Prefetcher | Enabled
Processor Settings | DCU IP Prefetcher | Enabled
Processor Settings | Logical Processor Idling | Disabled
Processor Settings | Number of cores per Processor | All
Integrated Devices | Integrated RAID Controller | Dell EMC PowerEdge R730xd: Enabled; Dell EMC PowerEdge FC630: Setting does not exist
Integrated Devices | I/OAT DMA Engine | Enabled
Integrated Devices | SR-IOV Global Enable | Enabled
Integrated Devices | OS Watchdog Timer | Disabled
Integrated Devices | Memory Mapped I/O above 4GB | Enabled
System Profile Settings | System Profile | Performance
System Profile Settings | CPU Power Management | Maximum Performance
System Profile Settings | C States | Disabled
System Profile Settings | Turbo Boost | Enabled
System Profile Settings | Memory Frequency | Maximum Performance

Worker Node Settings

This section describes required settings for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge
FC630 Worker Nodes.

Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker Node Settings

Type | Setting | State
Boot Settings | Boot Mode | BIOS
Boot Option Settings | BIOS Boot Sequence | Dell EMC PowerEdge R730xd: Integrated RAID first; Dell EMC PowerEdge FC630: Embedded SATA Port Disk A
Memory Settings | System Memory Testing | Disabled
Memory Settings | Memory Operating Mode | Optimizer Mode
Memory Settings | Node Interleaving | Disabled
Memory Settings | Snoop Mode | Early Snoop
Memory Settings | Memory Speed | Dell EMC PowerEdge R730xd: Maximum; Dell EMC PowerEdge FC630: Setting does not exist
Processor Settings | Logical Processor (HT) | Enabled
Processor Settings | QPI Speed | Maximum Data Rate
Processor Settings | Alternate RTID Setting | Disabled
Processor Settings | Virtualization Technology | Disabled
Processor Settings | Adjacent Cache Line Prefetch | Enabled
Processor Settings | Hardware Prefetcher | Enabled
Processor Settings | DCU Streamer Prefetcher | Enabled
Processor Settings | DCU IP Prefetcher | Enabled
Processor Settings | Logical Processor Idling | Disabled
Processor Settings | Dell Controlled Turbo | Disabled
Processor Settings | Number of cores per Processor | All
Integrated Devices | Integrated RAID Controller | Dell EMC PowerEdge R730xd: Enabled; Dell EMC PowerEdge FC630: Setting does not exist
Integrated Devices | I/OAT DMA Engine | Enabled
Integrated Devices | SR-IOV Global Enable | Disabled
Integrated Devices | OS Watchdog Timer | Disabled
Integrated Devices | Memory Mapped I/O above 4GB | Enabled
System Profile Settings | System Profile | Performance
System Profile Settings | CPU Power Management | Maximum Performance
System Profile Settings | C States | Disabled
System Profile Settings | Turbo Boost | Enabled
System Profile Settings | Memory Frequency | Maximum Performance


Appendix B
RAID Configuration

Topics:
• PERC-H730-Specific Infrastructure Nodes RAID Settings
• PERC-H730-Specific Worker Node RAID Settings

This appendix describes Infrastructure Nodes and Worker Nodes RAID settings for the PERC-H730 RAID
Controller.
Note: The Dell EMC-provided DTK tool automatically configures the RAID controller, and creates all
necessary RAID sets on each machine. Table 27: PERC-H730 BIOS Settings for Infrastructure Nodes on
page 89 and Table 28: PERC-H730 BIOS Settings for Worker Nodes on page 89 contain all of the RAID
settings performed by the DTK, and are provided here for your reference.
For more information on configuring your controller please see the Dell EMC PowerEdge RAID Controller
(PERC) 9 User’s Guide.


PERC-H730-Specific Infrastructure Nodes RAID Settings

Note that:
• Rear flex-bay drives are a single RAID 1 set.

Table 27: PERC-H730 BIOS Settings for Infrastructure Nodes

Screen | Setting | Parameter
Controller Management | Personality Mode | RAID Mode
Controller Management | Enable Controller BIOS | Enabled
Virtual Disk Management | Virtual Disk 0 | Include the Two Flex-Bay Drives, RAID 1
Virtual Disk Management | Virtual Disk 1 | Include Two of the Front Drives, RAID 1
Virtual Disk Management | Virtual Disk 2 | Include Four of the Front Drives, RAID 10
Virtual Disk Management | Read Policy | Read Ahead
Virtual Disk Management | Write Policy | Write Back
Configuration Management | Remaining Drives | Convert to Non-RAID Disk

Note: We do not use more than six front drives directly. Any remaining front drives are available for
customer use.

PERC-H730-Specific Worker Node RAID Settings

Table 28: PERC-H730 BIOS Settings for Worker Nodes

Screen | Setting | Parameter
Controller Management | Personality Mode | RAID Mode
Controller Management | Enable Controller BIOS | Enabled
Virtual Disk Management | Virtual Disk 0 | Include the Two Flex-Bay Drives, RAID 1
Virtual Disk Management | Read Policy | Read Ahead
Virtual Disk Management | Write Policy | Write Back
Configuration Management | Remaining Drives | Convert to Non-RAID Disk

Note: Worker Nodes are set as a single RAID 1 set for the two Flex Bay Drives, and HBA pass-
through (JBOD) for the data drives.


Appendix C: File System Layout

Topics:
• Infrastructure Nodes
• Worker Nodes
• File Systems and Parameters

This appendix describes filesystem layout deployment parameters.

When a cluster is deployed using the procedures described in Server Configuration and OS Installation on page 48, the hardware and filesystems are configured as described in this appendix. This information is provided for reference in case an alternate deployment method is used.

The Dell EMC-provided DTK tool automatically configures the RAID sets on each machine. The following tables contain all of the filesystem layout configurations performed by the DTK and kickstart, and are provided here for your reference:

Infrastructure Nodes
• Table 29: Dell EMC PowerEdge R730xd Infrastructure Node Volumes on page 91
• Table 30: Dell EMC PowerEdge R730xd Infrastructure Node Partitions on page 91
• Table 31: Dell EMC PowerEdge FC630 Infrastructure Node Volumes on page 92
• Table 32: Dell EMC PowerEdge FC630 Infrastructure Node Partitions on page 92

Worker Nodes
• Table 33: Dell EMC PowerEdge R730xd Worker Node Volumes on page 93
• Table 34: Dell EMC PowerEdge R730xd Worker Node Partitions on page 93
• Table 35: Dell EMC PowerEdge FC630 Worker Node Volumes on page 94
• Table 36: Dell EMC PowerEdge FC630 Worker Node Partitions on page 94


Infrastructure Nodes

The Infrastructure nodes (Active Name Node, Standby Name Node, HA Node, and Edge Node) are
configured as multiple partitions and filesystems using all available drives. Each partition is optimized for
both performance and reliability.
Dell EMC recommends the following disk and partition layout for this set of machines.

Table 29: Dell EMC PowerEdge R730xd Infrastructure Node Volumes

Physical Disks | Usage | Volume Type
12-13 or 24-25 | Operating System | RAID1
0 | ZooKeeper Journal | Passthrough
1 | NameNode Journal | Passthrough
2-3 | HDFS Metadata | RAID1
4-7 | Database Storage | RAID10

Table 30: Dell EMC PowerEdge R730xd Infrastructure Node Partitions

Disk | Partition | Mount Point | Size | Filesystem Type | Description
Virtual 1 | Primary | /boot | 1024 MB | ext4 | Contains BIOS boot files that must be within the first 2 GB of the disk
Virtual 1 | LVM | / | 100 GB | ext4 | Root filesystem
Virtual 1 | LVM | swap | 4 GB | swap | Operating system swap space partition
Virtual 1 | LVM | /home | 1 GB | ext4 | User home directories
Virtual 3 | Primary | /var/lib/pgsql | 2 TB | ext4 | Operational data directory for databases. This primarily contains the Cloudera Manager databases, since the Postgres Data Directory (PGDATA) is typically /var/lib/pgsql. Alternatives to Postgres should be configured to store their data files here.
Virtual 2 | Primary | /metadata | 1 TB | ext4 | HDFS Metadata, ZooKeeper Data, NameNode data. NameNode Data Directories (dfs.name.dir, dfs.namenode.name.dir): location of fsimage (typically /data/1/dfs/nn, now /metadata/dfs/nn). ZooKeeper Data Directory (dataDir): typically /var/lib/zookeeper, now /metadata/zookeeper.
Physical 1 | Primary | /journal/zookeeper | 1 TB | ext4 | ZooKeeper Data Log Directory (dataLogDir): typically /var/lib/zookeeper, now /journal/zookeeper
Physical 2 | Primary | /journal/dfs | 1 TB | ext4 | NameNode Edits Directories (dfs.namenode.edits.dir): typically /data/1/dfs/nn, now /journal/dfs/nn. This setting defaults to the same value as dfs.name.dir and must be changed.
Virtual 1 | LVM | /var | All available space | ext4 | Contains variable data like system logging files, databases, mail and printer spool directories, transient and temporary files

Table 31: Dell EMC PowerEdge FC630 Infrastructure Node Volumes

Physical Disks | Usage | Volume Type
0, 1 | Operating System | RAID1
2 | ZooKeeper Journal | RAID0
3 | DFS Journal | RAID0
4, 5 | HDFS Metadata | RAID1
6-9 | Database Storage | RAID10

Table 32: Dell EMC PowerEdge FC630 Infrastructure Node Partitions

Disk | Partition | Mount Point | Size | Filesystem Type | Description
Virtual 1 | Primary | /boot | 1024 MB | ext4 | Contains BIOS boot files that must be within the first 2 GB of the disk
Virtual 1 | LVM | / | 100 GB | ext4 | Root filesystem
Virtual 1 | LVM | swap | 4 GB | swap | Operating system swap space partition
Virtual 1 | LVM | /home | 1 GB | ext4 | User home directories
Virtual 2 | Primary | /metadata | 917 GB | ext4 | HDFS Metadata, ZooKeeper Data, NameNode data
Virtual 3 | Primary | /journal/zookeeper | 917 GB | ext4 | ZooKeeper Data Log Directory (dataLogDir): typically /var/lib/zookeeper, now /journal/zookeeper
Virtual 4 | Primary | /journal/dfs | 917 GB | ext4 | NameNode Edits Directories (dfs.namenode.edits.dir): typically /data/1/dfs/nn, now /journal/dfs/nn. This setting defaults to the same value as dfs.name.dir and must be changed.
Virtual 5 | Primary | /var/lib/pgsql | 1.8 TB | ext4 | Operational data directory for databases. This primarily contains the Cloudera Manager databases, since the Postgres Data Directory (PGDATA) is typically /var/lib/pgsql. Alternatives to Postgres should be configured to store their data files here.
Virtual 1 | LVM | /var | All available space | ext4 | Contains variable data like system logging files, databases, mail and printer spool directories, transient and temporary files

Note: Dell EMC does not recommend that a large swap space be configured. Swapping in a
Hadoop cluster should be avoided, due to the large and random performance degradation that can
result. See Swap Settings on page 101.
Note: The settings for dfs.name.dir, dfs.namenode.name.dir, ZooKeeper dataDir, ZooKeeper
dataLogDir, and dfs.namenode.edits.dir must be updated in Cloudera Manager to reflect the
locations in this partition layout.

Worker Nodes

The Worker Nodes in the cluster are the processing and data storage nodes. When using Dell EMC
PowerEdge R730xd servers we recommend that the two Flex Bay drives in the back of the chassis be
configured as a mirrored pair, and used for the operating system. All of the other disks attached to the
system should be configured as HBA or JBOD.
Dell EMC recommends the following disk and partition layout for this set of machines.

Table 33: Dell EMC PowerEdge R730xd Worker Node Volumes

Virtual Disk | Usage | Physical Disks | Volume Type
1 | Operating System | 12-13 or 24-25 | RAID1
2-15, or 2-25 | HDFS Data | 0-11 or 0-23 | Passthrough

Table 34: Dell EMC PowerEdge R730xd Worker Node Partitions

Virtual Disk | Partition | Mount Point | Size | Filesystem Type | Description
1 |  | /boot | 1024 MB | ext4 | Contains BIOS boot files that must be within the first 2 GB of the disk
1 | c (/dev/mapper/VG-LV_ROOT) | / | 100 GB | ext4 | Root filesystem
1 | d | swap | 4 GB | swap | Operating system swap space partition
1 | e (/dev/mapper/VG-LV_HOME) | /home | 1 GB | ext4 | User home directories
1 | f (/dev/mapper/VG-LV_VAR) | /var | 170 GB | ext4 | Contains variable data like system logging files, databases, mail and printer spool directories, transient and temporary files
2 | a | /data/1 | All available space (e.g. 4 TB) | ext4 | Contains HDFS data
3 | a | /data/2 | All available space (e.g. 4 TB) | ext4 | Contains HDFS data
n | a | /data/n | All available space (e.g. 4 TB) | ext4 | Contains HDFS data

Table 35: Dell EMC PowerEdge FC630 Worker Node Volumes

Physical Disk | Usage | Volume Type
SATA 1 | Operating System | Passthrough
SATA 2 | Additional Storage | Passthrough
FD332 0-15 | HDFS Data | Passthrough

Table 36: Dell EMC PowerEdge FC630 Worker Node Partitions

Virtual Disk | Partition | Mount Point | Size | Filesystem Type | Description
SATA 1 | Primary | /boot | 1024 MB | ext4 | Contains BIOS boot files that must be within the first 2 GB of the disk
SATA 1 | LVM | / | 100 GB | ext4 | Root filesystem
SATA 1 | LVM | swap | 4 GB | swap | Operating system swap space partition
SATA 1 | LVM | /home | 1 GB | ext4 | User home directories
SATA 1 | LVM | /var | 271 GB | ext4 | Contains variable data like system logging files, databases, mail and printer spool directories, transient and temporary files
SATA 2 | Primary | /var2 | 400 GB | ext4 | Additional storage
FD332 0 | Primary | /data/1 | 917 GB | ext4 | Contains HDFS data
FD332 1 | Primary | /data/2 | 917 GB | ext4 | Contains HDFS data
FD332 n | Primary | /data/n | 917 GB | ext4 | Contains HDFS data

Note: Dell EMC does not recommend that a large swap space be configured. Swapping in a
Hadoop cluster should be avoided, due to the large and random performance degradation that can
result. See Swap Settings on page 101.


Note: The partition layout in Table 34: Dell EMC PowerEdge R730xd Worker Node Partitions
on page 93 and Table 36: Dell EMC PowerEdge FC630 Worker Node Partitions on page
94 applies to all the data drives in all the Worker Nodes. Depending on the Worker Node drive
configuration, the Dell EMC PowerEdge R730xd will have either 12 or 24 data drives. The Dell EMC
PowerEdge FC630 will have 16 data drives.
Note: Operating system partitions are configured with the Logical Volume Manager enabled.

File Systems and Parameters

Note the following:


• All file systems should be formatted using a Cloudera recommended file system type (i.e., ext4).
• For administration purposes, Cloudera recommends that you mount all HDFS disks on the Worker
Nodes with a naming pattern (e.g., /data/1, /data/2, /data/3, etc.).
• All file systems should be mounted by UUID numbers. This ensures that physical drives always use the
same file system mount point in case a drive is removed.
• All file systems should have noatime and nodiratime set. This results in a significant performance
increase because file and directory access times are not forced to be updated on read operations.
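
The mount recommendations above can be combined into a single /etc/fstab entry per data drive. The following is a minimal sketch rather than the configuration the DTK or kickstart actually writes: the function name fstab_line, the "defaults" option, and the "0 0" dump and fsck fields are assumptions made for illustration.

```shell
# Print an /etc/fstab entry that mounts a filesystem by UUID with
# noatime and nodiratime set.
# Arguments: <uuid> <mount point> [filesystem type, default ext4]
# The "defaults" option and the trailing "0 0" fields are assumptions.
fstab_line() {
    uuid="$1"
    mount_point="$2"
    fs_type="${3:-ext4}"
    printf 'UUID=%s %s %s defaults,noatime,nodiratime 0 0\n' \
        "$uuid" "$mount_point" "$fs_type"
}

# Example (hypothetical device name /dev/sdb1), run as root:
#   fstab_line "$(blkid -s UUID -o value /dev/sdb1)" /data/1 >> /etc/fstab
```

Review the generated line before appending it to /etc/fstab, since a malformed entry can prevent the node from booting cleanly.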


Appendix D: Operating System Settings

Topics:
• CPU Settings
• Network Settings
• Advanced NIC Features
• Process Limits
• Memory Management Settings
• Secure Linux Settings
• Services
• Firewall Settings
• Ports Listing
• Disable Network Manager
• Secure Shell Keys
• User Accounts and Groups

This appendix describes how to configure the operating system for the Dell EMC Ready Bundle for Cloudera Hadoop.

Note: The Dell EMC-provided DTK tool automatically configures the operating system settings on each machine. The information in this appendix is provided here for your reference.


CPU Settings

You can configure the following Linux® operating system settings to increase Dell EMC Ready Bundle for
Cloudera Hadoop performance:
• IRQ Balancer on page 97
• CPU Frequency Governor on page 97

IRQ Balancer
To prevent the IRQ balancer from interfering with the interrupt affinity scheme, the IRQ balancer service
needs to be disabled.
1. Disable the IRQ balancer service by executing the following commands:

# chkconfig irqbalance off


# service irqbalance stop

CPU Frequency Governor


The cpufreq_performance module forces the CPU to use the highest possible clock frequency. It is meant
for sustained heavy workloads in which the CPU is rarely idle, and it provides no power savings.
Note: This feature is dependent upon the OS release, and its use may be different across different
versions of the OS. The example below assumes Red Hat Enterprise Linux Server release 6.7, with
Kernel version 2.6.32-573.el6.x86_64.
To install and activate the CPU frequency governor:
1. Find appropriate kernel modules available on the System under Test.
2. Use the modprobe utility to add the required driver:

# modprobe cpufreq_performance
3. Enable the governor by executing the following command:

# cpupower frequency-set --governor performance


4. The available drivers can be found in the /lib/modules/<kernel version>/kernel/arch/<architecture>/kernel/cpu/cpufreq/ directory. For example:

# cd /lib/modules/2.6.32-573.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq
# ls
acpi-cpufreq.ko mperf.ko p4-clockmod.ko pcc-cpufreq.ko
powernow-k8.ko speedstep-lib.ko
5. If the necessary cpufreq drivers are not available, you can get them from the /lib/modules/<kernel version>/kernel/drivers/cpufreq directory. For example:

# cd /lib/modules/2.6.32-573.el6.x86_64/kernel/drivers/cpufreq
# ls
cpufreq_conservative.ko cpufreq_ondemand.ko cpufreq_powersave.ko
cpufreq_stats.ko freq_table.ko

Note: The uname -r command will give you the kernel version.
The cpupower utility is provided by the cpupowerutils package. If you do not have it installed,
you can set the tunables in /sys/devices/system/cpu/<cpu id>/cpufreq/.
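
Where cpupower is not installed, the sysfs route mentioned above can be sketched as follows. This assumes the standard scaling_governor file under each CPU's cpufreq directory. The function name set_cpu_governor and its optional second argument are illustrative and are not part of any Dell EMC tooling.

```shell
# Write a governor name to every CPU's scaling_governor file.
# Arguments: <governor> [cpu sysfs root, default /sys/devices/system/cpu]
# The second argument exists only so the function can be tried against
# a scratch directory before it is pointed at the real sysfs tree.
set_cpu_governor() {
    governor="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$file" ] || continue   # skip CPUs without cpufreq support
        echo "$governor" > "$file"
    done
}

# Example (as root): set_cpu_governor performance
```

A value written this way does not persist across reboots, so it would need to be reapplied at boot time.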


Network Settings

Dell EMC recommends that you tune certain network settings to increase Dell EMC Ready Bundle for
Cloudera Hadoop performance.
To tune the network settings:
1. Add the following parameters to the /etc/sysctl.conf file:

# Disable TCP timestamps
net.ipv4.tcp_timestamps=0

# Enable TCP selective acknowledgments (SACK)
net.ipv4.tcp_sack=1

# Increase the TCP max and default buffer sizes
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304

# Increase memory thresholds
net.ipv4.tcp_rmem=4096 87380 4194304
net.ipv4.tcp_wmem=4096 65536 4194304

# Turn off IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
2. Set the MTU size from end to end (i.e., switch to NIC interface).
Note: This example uses eth0 as the interface. Your interface name may be different.

a. Temporarily change the MTU size of an interface by executing the following command:

# ifconfig eth0 mtu 9000


b. To persistently change the MTU size of an interface on a RHEL-based system, edit the configuration
script for the relevant interface in /etc/sysconfig/network-scripts/. If named directly after the interface
eth0, then this would be called ifcfg-eth0:

MTU=9000
c. Activate the new MTU by taking the interface down, and then bringing it back up:

# ifdown eth0
# ifup eth0
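
Because the MTU must match from end to end, it can help to confirm that full-size frames actually cross the path once the interface is back up. The sketch below is one possible check, not part of the documented procedure: it sends pings with fragmentation prohibited and a payload sized to the MTU. The function names and the peer address in the example are hypothetical.

```shell
# ICMP payload size that exactly fills one packet at the given MTU:
# the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
mtu_ping_payload() {
    echo $(( $1 - 28 ))
}

# Send pings with the "do not fragment" flag set and a payload sized to
# the MTU. They fail if any hop on the path uses a smaller MTU.
# Arguments: <host> [mtu, default 9000]
check_path_mtu() {
    host="$1"
    mtu="${2:-9000}"
    ping -M do -c 3 -s "$(mtu_ping_payload "$mtu")" "$host"
}

# Example (hypothetical peer address): check_path_mtu 172.16.30.94 9000
```

If the pings fail at 9000 but succeed at 1500, a switch port or peer interface on the path is probably still using the smaller MTU.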

Advanced NIC Features

Modern NICs have various offload capabilities, such as:


• TSO — tcp-segmentation-offload (see TCP Segmentation Offload on page 99)
• GSO — generic-segmentation-offload (see Generic Segmentation Offload on page 99)
• SG — scatter-gather (see Scatter-Gather on page 99)


• IC — interrupt-coalescing (see Interrupt Moderation and Coalescing on page 100)


Although they are optional, Dell EMC recommends that you always enable them, post-deployment. These
are advanced NIC features, and are enabled by running the ethtool commands. These commands can
be scripted for ease of use.
Note: These examples use eth0 as the interface. Your interface name may be different.

TCP Segmentation Offload


To enable tcp-segmentation-offload:
1. Execute the following command:

# sudo ethtool --offload eth0 tso on

Generic Segmentation Offload


To enable generic-segmentation-offload:
1. Execute the following command:

# sudo ethtool --offload eth0 gso on

Scatter-Gather
NICs with scatter-gather enabled are able to read from, and write to, many memory buffers for Direct
Memory Access (DMA). Depending upon the NIC, scatter-gather can be turned on with ethtool.
To enable scatter-gather:
1. Execute the following command:

# sudo ethtool --offload eth0 sg on

Display Offload Features


After enabling the offload features on the NIC, you can display them to ensure that the results are as you
expect.
1. Display the offload features by entering the following command:

# sudo ethtool --show-offload eth0

The output will appear similar to this example:

Features for eth0:

rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off
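
The three offload commands shown earlier in this section can be scripted. The following is a minimal sketch, assuming the same ethtool --offload syntax used above. The function name offload_commands is illustrative, and the function only prints the commands so that they can be reviewed before anything is changed.

```shell
# Print the ethtool commands that enable TSO, GSO, and scatter-gather
# on one interface. Nothing is executed by this function.
# Argument: <interface name>
offload_commands() {
    iface="$1"
    for feature in tso gso sg; do
        echo "ethtool --offload $iface $feature on"
    done
}

# Review the commands:      offload_commands eth0
# Apply them (as root):     offload_commands eth0 | sh
```

Settings applied with ethtool do not persist across reboots, so a script like this would need to run at boot or whenever the interface is brought up.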


Interrupt Moderation and Coalescing


Depending on the NIC, it might be possible to reduce the frequency of sending interrupts to the CPU.
Using the ethtool command, features like adaptive-rx can be disabled. Interrupt coalescing (IC) will
combine several packets before issuing an interrupt.
To enable interrupt coalescing:
1. Coalesce NIC packets within 10-microsecond IC windows, and generate an interrupt at the end of the
window:

# ethtool -C eth0 rx-usecs 10

Process Limits

The Linux® operating system needs to be configured with several process and open-file limit settings. The
lines below should be added to the /etc/security/limits.conf file.

hdfs - nofile 32768
mapred - nofile 32768
hbase - nofile 32768
hdfs - nproc 32768
mapred - nproc 32768
hbase - nproc 32768

Memory Management Settings

The following memory management settings must be configured:


• Transparent Huge Page (THP) Compaction on page 100
• Swap Settings on page 101

Transparent Huge Page (THP) Compaction


Red Hat Enterprise Linux Server attempts to reduce the number of huge pages in use by defragmenting
the used memory blocks. There is a performance cost to this operation.
Dell EMC recommends that this functionality be turned off on each node in a Hadoop cluster at boot time
by following these steps:
1. Append or change the transparent_hugepage kernel parameter on the GRUB_CMDLINE_LINUX
option in the /etc/sysconfig/grub file, and save the file. For example:

GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap vconsole.font=latarcyrheb-sun16 vconsole.keymap=us transparent_hugepage=never"
2. Run the grub2-mkconfig command to regenerate the grub.cfg file. For example:

# grub2-mkconfig -o /boot/grub2/grub.cfg
3. Reboot the system and ensure that the parameter is set correctly. This can be confirmed by running this
command:

# cat /proc/cmdline


Refer to https://access.redhat.com/solutions/1320153 for additional details.
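
Besides inspecting /proc/cmdline, the running state can be read from sysfs after the reboot. The sketch below assumes the Red Hat Enterprise Linux 7 path /sys/kernel/mm/transparent_hugepage/enabled, where the active setting is shown in square brackets. The function names are illustrative.

```shell
# Extract the active setting (the bracketed word) from the contents of
# the transparent_hugepage "enabled" file, e.g. "always madvise [never]".
thp_active_setting() {
    echo "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Succeed only when the running kernel reports THP as "never".
# The sysfs path is the RHEL 7 location and is an assumption here.
thp_is_disabled() {
    current="$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
    [ "$(thp_active_setting "$current")" = "never" ]
}

# Example: thp_is_disabled && echo "THP disabled" || echo "THP still active"
```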

Swap Settings
The vm.swappiness Linux® kernel parameter controls how aggressively memory pages are swapped to
disk. It can be set to a value from 0 to 100. The higher the value, the more aggressively the kernel seeks
out inactive memory pages and swaps them to disk.
On most systems this parameter is set to 60 by default. This is not always suitable for Hadoop cluster
nodes because it can cause processes to swap out, even when there is free memory available. This can
affect stability and performance, and may cause problems such as lengthy garbage collection pauses
for important system daemons. Cloudera recommends that vm.swappiness be set based on the Linux
kernel version. Red Hat Enterprise Linux Server 7.3 uses Linux kernel version 3.10.x.
• To check the kernel version, run:

# uname -a

• To check the vm.swappiness parameter setting, run:

# sysctl vm.swappiness
• To set the vm.swappiness parameter for kernel versions earlier than 2.6.32-303:

# sysctl -w vm.swappiness=0
• To set the vm.swappiness parameter for later kernel versions:

# sysctl -w vm.swappiness=1
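
The choice between the two values can be automated from the kernel release string. The following is a minimal sketch of that decision, assuming release strings of the form X.Y.Z-N followed by an optional suffix (as printed by uname -r on Red Hat Enterprise Linux) and a sort command that supports version ordering with -V. The function name recommended_swappiness is illustrative.

```shell
# Print the vm.swappiness value for a kernel release string:
# 0 for kernels earlier than 2.6.32-303, otherwise 1.
recommended_swappiness() {
    threshold="2.6.32-303"
    # Keep only the leading "X.Y.Z-N" part,
    # e.g. 3.10.0-514.el7.x86_64 becomes 3.10.0-514.
    release="$(echo "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')"
    # Version-sort the two strings and take the lower one.
    lowest="$(printf '%s\n%s\n' "$release" "$threshold" | sort -V | head -n 1)"
    if [ "$release" != "$threshold" ] && [ "$lowest" = "$release" ]; then
        echo 0
    else
        echo 1
    fi
}

# Example (as root):
#   sysctl -w vm.swappiness="$(recommended_swappiness "$(uname -r)")"
```

A value set with sysctl -w lasts only until the next reboot. Adding the same setting to /etc/sysctl.conf makes it persistent.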

Secure Linux Settings

Security Enhanced Linux (SELinux) is a kernel module and toolset to allow greater security control. The
feature is not compatible with Cloudera Manager 5 and should not be installed, or should be disabled.
1. To indicate if the feature is active, execute the following command:

# selinuxenabled || echo "disabled"


2. To disable SELinux, change the following line in the /etc/selinux/config file:

#From this:
SELINUX=enforcing

#To this:
SELINUX=disabled
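
The edit above can also be made non-interactively. This is a minimal sketch using sed, with a backup copy kept alongside the original file. The function name disable_selinux_in_config and its optional path argument are illustrative.

```shell
# Set SELINUX=disabled in an SELinux config file, keeping a .bak copy.
# Argument: [config path, default /etc/selinux/config]
disable_selinux_in_config() {
    config="${1:-/etc/selinux/config}"
    # Only the SELINUX= line is rewritten; SELINUXTYPE= is left as-is.
    sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$config"
}

# Example (as root): disable_selinux_in_config
```

The change in /etc/selinux/config takes effect at the next reboot.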

Services

All unnecessary daemons and services, such as the CUPS printing service, should be disabled on all
cluster nodes. This reduces maintenance requirements and resource usage.
In addition, all hosts in the cluster should have the same time, date and zone settings. Dell EMC highly
recommends that you run the ntpd service.
To disable or stop any unnecessary daemons:


1. Use the chkconfig command to disable any unwanted services. For example:

# chkconfig iptables off


# chkconfig ip6tables off
# chkconfig cups off
# chkconfig ntpd on
# chkconfig ntpdate off
2. Stop any unnecessary services. For example:

# service iptables stop


# service ip6tables stop
# service cups stop
# service ntpdate stop
3. Start the ntpd service:

# service ntpd start

Firewall Settings

Cloudera suggests that all firewall software on and between nodes in the cluster be disabled.
1. Check the firewall status by running the following commands:

# chkconfig --list iptables


# chkconfig --list ip6tables
2. Disable the firewall by running the following commands:

# chkconfig iptables off


# chkconfig ip6tables off

Caution: You must ensure that you provide suitable network security for the cluster, including
but not limited to external firewalls. Please consult with your local site security administrator to
determine the proper solution.
When iptables is disabled, the Linux kernel still implements a limited amount of IP connection tracking
using a fixed-size table. If there are indications of packet loss (e.g., errors of the form nf_conntrack:
table full, dropping packet), increase the size of the connection tracking table using sysctl to
change the parameter net.netfilter.nf_conntrack_max. Refer to https://access.redhat.com/solutions/8721
for additional details.
Note: Registration is required to view this solution content.

Ports Listing

See the following link for information about all ports that are used within a Cloudera Hadoop cluster:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_ports.html
This information can be used to program a firewall to protect the entire cluster.


Disable Network Manager

The Red Hat Network Manager should be disabled, or not installed. Interfaces should be configured to use
the normal Red Hat network service.
Disable the Network Manager by following the instructions at:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-NetworkManager_and_the_Network_Scripts.html

Secure Shell Keys

We normally configure password-less SSH access (using keys) for the root user, from the node running
Cloudera Manager, to simplify access to all nodes in the cluster. This configuration is not required. If
password-less SSH is not configured, the root password is required by the Cloudera Manager installation
process.
To allow this access:
1. Create the public and private keys by running the following command on all nodes as the root user:

# ssh-keygen

The public keys for each machine will reside on those machines in the ~/.ssh/ directory, and are named
according to the type of encryption that is chosen (e.g., id_rsa.pub).
2. Copy the public key from the High Availability node to all nodes in the cluster.
3. Append the key to the ~/.ssh/authorized_keys file on each of the nodes.
4. Secure the authorized_keys file to ensure that the system is secure. For more information, please see:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/ch-OpenSSH.html

User Accounts and Groups

Cloudera Manager and Cloudera Enterprise use several user accounts and groups to complete their tasks.
These accounts and groups are set up automatically by Cloudera Manager during the cluster install process.
The set of user accounts and groups varies according to which components you choose to install.
Caution: Do not delete these accounts or groups, and do not modify their permissions and rights.

For specific details, see Permission at:


http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/installation_reqts.html


Appendix E: Example node-config.json File

Topics:
• node-config.json Example

This appendix provides an example node-config.json file.


node-config.json Example

{
    "ClusterName" : "Silver Stamp",
    "DomainName" : "ignition.dell.com",
    "GatewayBond0" : "172.16.30.1",
    "NetMaskBond0" : "255.255.255.0",
    "GatewayBond1" : "10.152.248.1",
    "NetMaskBond1" : "255.255.255.0",
    "EthsBond0" : "em1,em2",
    "EthsBond1" : "p4p1,p4p2",
    "TimeZone" : "UTC",
    "NTPSubnet" : "172.16.30.0",
    "Nodes" : [
        {
            "ServiceTag": "D120R22",
            "NodeType" : "namenode",
            "NodeName" : "r1s10-namenode1",
            "bond0IP" : "172.16.30.93",
            "bond1IP" : "10.152.247.93"
        },
        {
            "ServiceTag": "D100R32",
            "NodeType" : "edge",
            "NodeName" : "r1s12-edge",
            "bond0IP" : "172.16.30.94",
            "bond1IP" : "10.152.247.94"
        },
        {
            "ServiceTag": "D115D56",
            "NodeType" : "workernode",
            "NodeName" : "r1s14-workernode1",
            "bond0IP" : "172.16.30.95"
        },
        .
        .
        .
    ]
}


Appendix F: Support

Topics:
• Software Support
• Java Compatibility

Note: Cloudera and Red Hat technical support are paid services, and require support contract agreements with each respective vendor. Please contact your Dell EMC sales representative for more details.


Software Support

Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix on page 107 describes where
you can obtain technical support for the various components of the Dell EMC Ready Bundle for Cloudera
Hadoop.

Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix

Category | Component | Version | Available Support
Operating System | Red Hat Enterprise Linux Server | 7.3 | Red Hat Linux support
Operating System | CentOS | 7.3 | Dell EMC Hardware support
Java Virtual Machine | Sun Oracle JVM | Java 7 (1.7.0_67), Java 8 (1.8.0_60) | N/A
Hadoop | Cloudera Enterprise | 5.10 | Cloudera support
Hadoop | Cloudera Manager | 5.10 | Cloudera support
Hadoop | Cloudera Navigator | 2.9 | Cloudera support
ETL Engine | Syncsort DMX-h | 9.2 | Syncsort support

Java Compatibility

The Cloudera Enterprise software supports either Java 7 or Java 8.


1. Verify that a supported version of Java is installed by running the following commands:

# java -version
# javac -version
# update-java-alternatives --list
# alternatives --display java
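
The version check can be reduced to a single major-version comparison. The sketch below assumes the legacy 1.x version strings listed in Table 37 (for example 1.7.0_67 and 1.8.0_60) and relies on java -version printing its version string, in double quotes, on standard error. The function names are illustrative.

```shell
# Convert a Java version string to its major version:
# "1.7.0_67" gives 7 and "1.8.0_60" gives 8.
java_major_version() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Read the quoted version string that "java -version" prints on stderr.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}

# Example:
#   case "$(java_major_version "$(installed_java_version)")" in
#       7|8) echo "supported Java version" ;;
#       *)   echo "unsupported Java version" ;;
#   esac
```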


Appendix G: Related Documentation

Topics:
• Cloudera Manager 5.10 and Cloudera Enterprise 5.10 Documentation
• Apache Hadoop Documentation
• Red Hat Documentation
• Syncsort DMX-h Documentation

This topic provides links to the latest related documentation.


Cloudera Manager 5.10 and Cloudera Enterprise 5.10 Documentation

For the latest Cloudera Manager and Cloudera Enterprise documentation, please see:
http://www.cloudera.com/documentation/enterprise/latest.html
Note: In particular, see the Cloudera Manager Installation Guide.

Apache Hadoop Documentation

For the latest Apache Hadoop documentation, please see:


http://hadoop.apache.org/

Red Hat Documentation

For Red Hat Enterprise Linux Server installation and deployment information, please see:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/index.html

Syncsort DMX-h Documentation

For the latest Syncsort DMX-h documentation, please see:


http://www.syncsort.com/en/Resource-Center


Appendix H: References

Topics:
• About Cloudera
• About Syncsort
• To Learn More

Additional information can be obtained at http://www.dell.com/en-us/work/learn/software-platforms-hadoop.

If you need additional services or implementation help, please contact your Dell EMC sales representative.


About Cloudera

Cloudera is a key contributor to the Apache Hadoop project. The Cloudera Distribution for Apache Hadoop
(CDH) is a highly-scalable open source platform for high-volume data management and analytics. CDH
integrates with existing enterprise IT infrastructure, enabling data engineers and data scientists to quickly
and easily develop and deploy Hadoop applications in a cost-efficient manner.
The Dell EMC servers in this Architecture Guide are Cloudera Certified.

About Syncsort

Syncsort creates software that allows enterprises to collect, integrate, sort, and distribute large amounts of
data quickly, with reduced resource usage, in a cost-effective manner.
Dell EMC is a Syncsort-certified Technology Alliance Partner.

To Learn More

For more information on the Dell EMC Ready Bundle for Cloudera Hadoop, visit http://www.dell.com/en-us/work/learn/software-platforms-hadoop.
Copyright © 2011-2017 Dell Inc. or its subsidiaries. All rights reserved. Trademarks and trade names may
be used in this document to refer to either the entities claiming the marks and names or their products.
Specifications are correct at date of publication but are subject to availability or change without notice
at any time. Dell Inc. and its affiliates cannot be responsible for errors or omissions in typography or
photography. Dell Inc.’s Terms and Conditions of Sales and Service apply and are available on request.
Dell Inc. service offerings do not affect consumer’s statutory rights.
Dell EMC, the DELL EMC logo, the DELL EMC badge, and PowerEdge are trademarks of Dell Inc.
