
NOTICE

As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.
Please visit recommended tutorials:

 How to Create a CDP Private Cloud Base Development Cluster


 All Cloudera Data Platform (CDP) related tutorials

Introduction
Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by
getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use case involves vehicles, devices and people moving across a map or similar surface. Your analysis is targeted at linking location information with your analytic data.
For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed
back to a datacenter where we will be processing the data. The company wants to use this data to better understand risk.
Here is the video of Analyzing Geolocation Data to show you what you’ll be doing in this tutorial.

Prerequisites
 Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
 Go through Learning the Ropes of the HDP Sandbox to become familiar with the Sandbox.

Outline
 Concepts to strengthen your foundation in the Hortonworks Data Platform (HDP)
 Loading Sensor Data into HDFS
 Hive - Data ETL
 Spark - Risk Factor
 Data Reporting with Zeppelin

Introduction
In this tutorial, we will explore important concepts that will strengthen your foundation in the Hortonworks Data Platform (HDP). Apache Hadoop is a layered framework for processing and storing massive amounts of data. In our case, Apache Hadoop will be recognized as an enterprise solution in the form of HDP. At the base of HDP exists our data storage environment known as the Hadoop Distributed File System. When data files are accessed by Hive, Pig or other tools, YARN is the Data Operating System that enables them to analyze, manipulate or process that data. HDP includes various components that open new opportunities and efficiencies in healthcare, finance, insurance and other industries that impact people.

Prerequisites
 Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
 Learning the Ropes of the HDP Sandbox

Outline
 Hadoop & HDP
 HDFS
 MapReduce & YARN
 Hive and Pig
 Further Reading

Concept: Hadoop & HDP


In this module you will learn about Apache Hadoop and what makes it scale to large data sets. We will also talk about various components of the Hadoop ecosystem that make Apache Hadoop enterprise ready in the form of the Hortonworks Data Platform (HDP) distribution. This module discusses Apache Hadoop and its capabilities as a data platform. The core of Hadoop and its surrounding ecosystem of solution vendors provides the enterprise requirements to integrate alongside data warehouses and other enterprise data systems. These are steps towards the implementation of a modern data architecture, and towards delivering an enterprise ‘Data Lake’.

Goals of this module


 Understanding Hadoop
 Understanding the five pillars of HDP
 Understanding HDP components and their purpose

Apache Hadoop
Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to
quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an
enterprise to deploy, integrate and work with Hadoop. Refer to the blog reference below for more information on Hadoop.

 Hortonworks Blog:
o How Apache Hadoop 3 Adds Value Over Apache Hadoop 2.0

The base Apache Hadoop framework is composed of the following modules:

 Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
 Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
 Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
 Hadoop MapReduce – a programming model for large scale data processing.

Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles. There are five pillars to
Hadoop that make it enterprise ready:
 Data Management – Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core
technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the pre-requisite for
Enterprise Hadoop as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data
stored in Hadoop with predictable performance and service levels.
o Apache Hadoop YARN – Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by
supporting non-MapReduce workloads associated with other programming models.
o HDFS – Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of
commodity servers.

 Data Access – Interact with your data in a wide variety of ways – from batch to real-time. Apache Hive is the most widely adopted data access technology,
though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase
offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources
thanks to YARN and intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running applications. YARN also provides
flexibility for new and emerging data access methods, such as Apache Solr for search and programming frameworks such as Cascading.
o Apache Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for
large datasets stored in HDFS.
o Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired
with the MapReduce framework for processing these programs.
o MapReduce – MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.
o Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering
and classification of datasets.
o Apache Storm – Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to
Apache Hadoop 2.x
o Apache HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
o Apache Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-
time big data processing.
o Apache Kafka – Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher
throughput, replication, and fault tolerance.
o Apache HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of
the data stored within Apache Hadoop.
o Apache Slider – A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to
deploy those applications, to manage their lifecycles and scale them up or down.
o Apache Solr – Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the
world’s largest Internet sites.
o Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based
collaborative filtering.
o Apache Accumulo – Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big
Table design that works on top of Apache Hadoop and Apache ZooKeeper.

 Data Governance and Integration – Quickly and easily load data, and manage according to policy. Workflow Manager provides workflows for data governance,
while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
o Workflow Management – Workflow Manager allows you to easily create and schedule workflows and monitor workflow jobs. It is based on the Apache Oozie workflow
engine that allows users to connect and automate the execution of big data processing tasks into a defined workflow.
o Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
o Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data
sources.

 Security – Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from
HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
o Apache Knox – The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to
simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
o Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core
enterprise security requirements of authorization, accounting and data protection.

 Operations – Provision, manage, monitor and operate Hadoop clusters at scale.


o Apache Ambari – An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
o Apache Oozie – Oozie Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
o Apache ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to
important configuration information.

Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process,
and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive
modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing. To learn more about Apache Hadoop, watch the following
introduction:

Hortonworks Data Platform (HDP)


Hortonworks Data Platform (HDP) is a packaged software Hadoop distribution that aims to ease deployment and management of Hadoop clusters. Compared with
simply downloading the various Apache code bases and trying to run them together in a system, HDP greatly simplifies the use of Hadoop. Architected, developed,
and built completely in the open, HDP provides an enterprise ready data platform that enables organizations to adopt a Modern Data Architecture.
With YARN as its architectural center it provides a data platform for multi-workload data processing across an array of processing methods – from batch through
interactive to real-time, supported by key capabilities required of an enterprise data platform — spanning Governance, Security and Operations.
The Hortonworks Sandbox is a single node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and
easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business big data problems. The Sandbox tutorials
will walk you through how to bring some sample data into HDP and how to manipulate it using the tools built into HDP. The idea is to show you how you can get
started and show you how to accomplish tasks in HDP. HDP is free to download and use in your enterprise and you can get it here: Hortonworks Data Platform

HDFS
A single physical machine's storage capacity becomes saturated as data grows. With this growth comes the need to partition your data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform (HDP), HDFS is now expanded to support heterogeneous storage media within the HDFS cluster.

Goals of this module


 Understanding HDFS architecture
 Understanding the Hortonworks Sandbox Ambari Files User View

Hadoop Distributed File System


HDFS is a distributed file system that is designed for storing large data files. HDFS is a Java-based file system that provides scalable and reliable data storage, and it
was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500
servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent
data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation
across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes that store the data. Files and directories are represented on the
NameNode by inodes. Inodes record attributes like permissions, modification and access times, or namespace and disk space quotas.
The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored
on the local file system on the DataNodes.
The Namenode actively monitors the number of replicas of a block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates
another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The
instructions include commands to:

 replicate blocks to other nodes


 remove local block replicas
 re-register and send an immediate block report, or
 shut down the node
 For more details on HDFS: https://hortonworks.com/hadoop/hdfs/
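
The block layout and replication described above can also be inspected from the command line. This is only a hedged sketch: the file path is an example and assumes data has already been uploaded, and both commands need HDFS superuser rights on the Sandbox.

# Report overall cluster capacity and the DataNodes known to the NameNode.
sudo -u hdfs hdfs dfsadmin -report

# Show how an example file (path is illustrative) is split into blocks, how many
# replicas each block has, and which DataNodes hold them.
sudo -u hdfs hdfs fsck /tmp/data/geolocation.csv -files -blocks -locations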

With the next generation HDFS data architecture that comes with HDP, HDFS has evolved to provide automated failover with a hot standby, with full stack resiliency.
The video provides more clarity on HDFS.
Ambari Files User View on Hortonworks Sandbox
Ambari Files User View
Ambari Files User View provides a user friendly interface to upload, store and move data. Underlying all components in Hadoop is the Hadoop Distributed File System
(HDFS). This is the foundation of the Hadoop cluster. The HDFS file system manages how the datasets are stored in the Hadoop cluster. It is responsible for
distributing the data across the datanodes, managing replication for redundancy and administrative tasks like adding, removing and recovery of data nodes.

MapReduce & YARN


Cluster computing faces several challenges, such as how to store data persistently and keep it available if nodes fail, and how to deal with node failures during a long running computation. The network can also become a bottleneck that delays data processing. MapReduce offers a solution by bringing computation close to the data, thereby minimizing data movement. It is a simple programming model designed to process large volumes of data in parallel by dividing the job into a set of independent tasks.
A key limitation of MapReduce programming is that reduce jobs must wait for all map jobs to complete first. This limits maximum parallelism, and YARN was therefore born as a generic resource management and distributed application framework.

Goals of the Module


 Understanding Map and Reduce jobs
 Understanding YARN

Apache MapReduce
MapReduce is the key algorithm that the Hadoop data processing engine uses to distribute work around a cluster. A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly.
The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce.

 map(key1,value) -> list<key2,value2>


The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the
relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps
for the keys and combine them to solve the problem.

 reduce(key2, list<value2>) -> list<value3>


The current Apache Hadoop MapReduce System is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers. The JobTracker
is responsible for resource management (managing the worker nodes i.e. TaskTrackers), tracking resource consumption/availability and also job life-cycle
management (scheduling individual tasks of the job, tracking progress, providing fault-tolerance for tasks etc).
The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.
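
To make the map and reduce signatures above concrete, Hadoop Streaming lets ordinary shell commands play the role of the map and reduce functions. The sketch below is a hedged example of a minimal grouping-and-counting job: the location of the hadoop-streaming jar varies by installation, so it is left as a placeholder, and the input path is illustrative.

# HADOOP_STREAMING_JAR is a placeholder; point it at the hadoop-streaming jar of your installation.
# The mapper emits each input line as a key; the framework sorts and groups the keys,
# and the reducer counts how many times each distinct key occurs.
# Note: the output directory must not already exist.
hadoop jar "$HADOOP_STREAMING_JAR" \
  -input /tmp/data/geolocation.csv \
  -output /tmp/data/streaming_out \
  -mapper cat \
  -reducer 'uniq -c'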

The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using
inexpensive commodity hardware. The cluster is partitioned across many servers to provide a near linear scalability. The philosophy of the cluster design is to bring the
computing to the data. So each datanode will hold part of the overall data and be able to process the data that it holds. The overall framework for the processing
software is called MapReduce. Here’s a short video introduction to MapReduce:
Apache YARN (Yet Another Resource Negotiator)
Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer in Hadoop 1.x. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use cases we see Hadoop being employed to solve. YARN was introduced in Hadoop 2.0 as a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).
The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, system for managing applications in a distributed manner.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect,
a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the
component tasks.
ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of
capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on
restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the
applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, CPU, disk, network etc.
NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network)
and reporting the same to the ResourceManager.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for
progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
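
These daemons can be observed directly from the Sandbox shell with the yarn command line client. A brief, hedged sketch (output formats vary by version):

# List the NodeManagers registered with the ResourceManager and their state.
yarn node -list -all

# List applications known to the ResourceManager; each one runs its own ApplicationMaster
# plus a set of containers negotiated from the Scheduler.
yarn application -list -appStates ALL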
Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce within the new YARN system that should be mentioned is that we have reused the existing
MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. Here is a short
video introduction for YARN.

Concept: Hive and Pig


Introduction: Apache Hive
Hive provides a SQL-like query language that enables analysts familiar with SQL to run queries on large volumes of data. Hive has three main functions: data summarization, query and analysis. Hive provides tools that enable easy data extraction, transformation and loading (ETL).
Goals of the module
 Understanding Apache Hive
 Understanding Apache Tez
 Understanding Ambari Hive User Views on Hortonworks Sandbox

Apache Hive
Data analysts use Hive to explore, structure and analyze that data, then turn it into business insights. Hive implements a dialect of SQL (HiveQL) that focuses on
analytics and presents a rich set of SQL semantics including OLAP functions, sub-queries, common table expressions and more. Hive allows SQL developers or users
with SQL tools to easily query, analyze and process data stored in Hadoop. Hive also allows programmers familiar with the MapReduce framework to plug in their
custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Hive users have a choice of 3 runtimes when executing SQL queries. Users can choose between Apache Hadoop MapReduce, Apache Tez or Apache Spark
frameworks as their execution backend.
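
The execution backend can be switched per session with a standard Hive property. A minimal, hedged sketch (which engines are actually available depends on how the cluster is configured):

-- Run subsequent queries in this session on Tez; 'mr' and 'spark' are the other
-- recognized values, if those engines are installed and configured.
SET hive.execution.engine=tez;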
Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop:

 Familiar – Query data with a SQL-based language.
 Fast – Interactive response times, even over huge datasets.
 Scalable and Extensible – As data variety and volume grows, more commodity machines can be added without a corresponding reduction in performance.

How Hive Works


The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are
comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.
Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be
sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into
buckets.
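
As a hedged illustration of that hierarchy, the DDL below defines a hypothetical table (not part of this tutorial's dataset) that is both partitioned and bucketed:

-- Each distinct event_date value becomes its own sub-directory (partition) under the
-- table directory; rows within a partition are hashed on truckid into 8 bucket files.
CREATE TABLE truck_events_by_day (
  truckid STRING,
  driverid STRING,
  event STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (truckid) INTO 8 BUCKETS
STORED AS ORC;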

Components of Hive
 HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools, including Pig and MapReduce, to more easily read and write data on the grid. HCatalog holds a set of file paths and metadata about data in a Hadoop cluster. This allows scripts and MapReduce or Tez jobs to be decoupled from the data location and metadata such as the schema. Additionally, since HCatalog also supports tools like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog, external tools that want to integrate, such as Teradata Aster, can also leverage the file path location and metadata in HCatalog.

Note: At one point HCatalog was its own Apache project. However, in March 2013, the HCatalog project merged with Hive. HCatalog is currently released as part of Hive.

 WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style)
interface.
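
For example, WebHCat's REST endpoints can be called from any HTTP client. A hedged sketch (host, port and user are assumptions based on common defaults; adjust for your cluster):

# Check that the WebHCat server is up (50111 is the common default port).
curl -s "http://sandbox-hdp.hortonworks.com:50111/templeton/v1/status?user.name=maria_dev"

# List databases through the Hive metadata resource.
curl -s "http://sandbox-hdp.hortonworks.com:50111/templeton/v1/ddl/database?user.name=maria_dev"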

Here is a short video introduction on Hive:


Apache Tez
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Tez improves on the MapReduce paradigm by dramatically increasing its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop
ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader
Hadoop ecosystem.
Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those
data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications
that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.
Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage
over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex
computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.
Here is a short video introduction on Tez.

Stinger and Stinger.next


The Stinger Initiative was started to enable Hive to support an even broader range of use cases at truly Big Data scale: bringing it beyond its Batch roots to support
interactive queries – all with a common SQL access layer.
Stinger.next is a continuation of this initiative focused on even further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive
while also bringing support for transactional capabilities. And just as the original Stinger initiative did, this will be addressed through a familiar three-phase delivery
schedule and developed completely in the open Apache Hive community.
Data Analytics Studio on Hortonworks Sandbox
To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Data Analytics Studio (DAS). It provides an interactive interface to Hive 3. We
can create, edit, save and run queries, and have Hive evaluate them for us using a series of Tez jobs.
Let’s now open Data Analytics Studio and get introduced to the environment. Navigate to http://sandbox-hdp.hortonworks.com:30800
Data Analytics Studio
There are 4 tabs to interact with DAS:
1. Queries: This view allows you to search previously executed SQL queries. You can also see commands each user issues.
2. Compose: From this view you can execute SQL queries and observe their output. Additionally, visually inspect the results of your queries and download them as csv
files.
3. Database: Database allows you to add new Databases and Tables. Furthermore, this view grants you access to advanced information about your databases.
4. Reports: This view allows you to keep track of Read and Write operations, and shows you a Join Report of your tables.
The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL dialect, HiveQL (HQL), Hive lets you create summarizations of your data
and perform ad-hoc queries and analysis of large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure on the dataset and then
manipulate it with SQL. The notion of projecting a table structure on a file is often referred to as Schema-On-Read. Since you are using data in HDFS, your operations
can be scaled across all the datanodes and you can manipulate huge datasets.
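
Schema-on-read can be seen directly in HiveQL: an external table just projects a schema onto files that already sit in HDFS, without moving or rewriting them. The sketch below is illustrative only; it reuses the /tmp/data path from this tutorial but a reduced, hypothetical column list, so adjust both to your files.

-- Project a table structure onto delimited text files already stored in HDFS.
-- Columns are illustrative; in practice, point LOCATION at a directory that holds
-- only the files belonging to this table.
CREATE EXTERNAL TABLE geolocation_raw (
  truckid STRING,
  driverid STRING,
  event STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/data';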

Introduction: Apache Pig


MapReduce allows you to specify map and reduce functions, but working out how to fit your data processing into this pattern may sometimes require you to
write multiple MapReduce stages. With Pig, data structures are much richer and the transformations you can apply to data are much more powerful.

Goals of this Module


 Understanding Apache Pig
 Understanding Apache Pig on Tez
 Understanding Ambari Pig User Views on Hortonworks Sandbox

Apache Pig
Apache Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin
script into MapReduce so that it can be executed within YARN for access to a single dataset stored in the Hadoop Distributed File System (HDFS).
Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data jobs:

 Extract-transform-load (ETL) data pipelines,


 Research on raw data, and
 Iterative data processing.

Whatever the use case, Pig will be:

 Extensible – Pig users can create custom functions to meet their particular processing requirements.
 Easily Programmed – Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain.
 Self-Optimizing – Because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.

Please refer to the following video on Pig for more clarity:

How Pig Works


Pig runs on Apache Hadoop YARN and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin,
which abstracts from the Java MapReduce idiom into a form similar to SQL. While SQL is designed to query the data, Pig Latin allows you to write a data flow that
describes how your data will be transformed (such as aggregate, join and sort).
Since Pig Latin scripts can be graphs (instead of requiring a single output) it is possible to build complex data flows involving multiple inputs, transforms, and outputs.
Users can extend Pig Latin by writing their own functions, using Java, Python, Ruby, or other scripting languages. Pig Latin is sometimes extended using UDFs (User
Defined Functions), which the user can write in any of those languages and then call directly from the Pig Latin.
The user can run Pig in two modes, using either the “pig” command or the “java” command:

 MapReduce Mode. This is the default mode, which requires access to a Hadoop cluster. The cluster may be a pseudo- or fully distributed one.
 Local Mode. With access to a single machine, all files are installed and run using a local host and file system.
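
A hedged sketch of launching a script in each mode from the shell (the script name is hypothetical):

# Local mode: run against the local file system on a single machine.
pig -x local myscript.pig

# MapReduce mode (the default): submit the script to the Hadoop cluster.
pig -x mapreduce myscript.pig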

Further Reading
 HDFS is one of the four components of Apache Hadoop; the other three are Hadoop Common, Hadoop YARN and Hadoop MapReduce
 Announcing HDP 3.0
 To learn more about YARN watch the following YARN introduction video
 Hadoop 3 Blogs
 Apache Hadoop 3.1 A Giant Leap For Big Data
 Apache Ambari is an open source, community-based web tool for Hadoop operations, which has been extended via Ambari User Views to provide a growing list of developer tools as User Views.
 Follow this link to learn more about the Ambari features included in HDP.
Hive Blogs:
 Hive 3 Overview
 Cost-Based Optimizer Makes Apache Hive 0.14 More Than 2.5X Faster
 Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
 HIVE 0.14 Cost Based Optimizer (CBO) Technical Overview
 5 Ways to Make Your Hive Queries Run Faster
 Secure JDBC and ODBC Clients’ Access to HiveServer2
 Speed, Scale and SQL: The Stinger Initiative, Apache Hive 12 & Apache Tez
 Hive/HCatalog – Data Geeks & Big Data Glue

Tez Blogs

 Apache Tez: A New Chapter in Hadoop Data Processing


 Data Processing API in Apache Tez

ORC Blogs

 Apache ORC Launches as a Top-Level Project


 ORCFile in HDP 2: Better Compression, Better Performance

HDFS Blogs:

 Heterogeneous Storage Policies in HDP 2.2


 HDFS Metadata Directories Explained
 Heterogeneous Storages in HDFS
 HDFS 2.0 Next Generation Architecture
 NameNode High Availability in HDP 2.0
 Introducing… Tez: Accelerating processing of data stored in HDFS
 To learn more about HDFS watch the following HDFS introduction video.

YARN Blogs:

 YARN series-1
 YARN series-2

Capacity Scheduler Blogs:

 Understanding Apache Hadoop’s Capacity Scheduler


 Configuring YARN Capacity Scheduler with Ambari
 Multi-Tenancy in HDP 2.0: Capacity Scheduler and YARN
 Better SLAs via Resource-preemption in YARN’s Capacity Scheduler

Introduction
In this section, you will download the sensor data and load that into HDFS using Ambari User Views. You will get introduced to the Ambari Files User View to manage
files. You can perform tasks like create directories, navigate file systems and upload files to HDFS. In addition, you’ll perform a few other file-related tasks as
well. Once you get the basics, you will create two directories and then load two files into HDFS using the Ambari Files User View.

Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

 Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox


 Learning the Ropes of the HDP Sandbox

Outline
 HDFS backdrop
 Download and Extract Sensor Data Files
 Load the Sensor Data into HDFS
 Summary
 Further Reading
HDFS backdrop
A single physical machine's storage capacity becomes saturated as data grows, driving the need to partition data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform (HDP), HDFS is now expanded to support heterogeneous storage media within the HDFS cluster.

Download and Extract Sensor Data Files


1. Download the sample sensor data contained in a compressed (.zip) folder here: Geolocation.zip
2. Save the Geolocation.zip file to your computer, then extract the files. You should see a Geolocation folder that contains the following files:
o geolocation.csv – This is the collected geolocation data from the trucks. It contains records showing truck location, date, time, type of event, speed, etc.

o trucks.csv – This data was exported from a relational database and shows information on truck models, driverid, truckid, and aggregated mileage info.

Load the Sensor Data into HDFS


1. Logon to Ambari using: maria_dev/maria_dev
2. Go to Ambari Dashboard and open Files View.

3. Starting from the top root of the HDFS file system, you will see all the files the logged-in user (maria_dev in this case) has access to:
4. Navigate to /tmp/ directory by clicking on the directory links.

5. Create the directory data. Click the button to create that directory, then navigate into it. The directory path you should see is: /tmp/data

Upload Geolocation and Trucks CSV Files to data Folder


1. If you're not already in your newly created directory path /tmp/data, go to the data folder. Then click on the button to upload the
corresponding geolocation.csv and trucks.csv files into it.
2. An Upload file window will appear, click on the cloud symbol.
3. Another window will appear; navigate to the folder where the two csv files were downloaded. Click on one at a time, press Open to complete the upload. Repeat the
process until both files are uploaded.
Both files are uploaded to HDFS as shown in the Files View UI:
You can also perform the following operations on a file or folder by clicking on the entity's
row: Open, Rename, Permissions, Delete, Copy, Move, Download and Concatenate.
Set Write Permissions to Write to data Folder
1. Click on the data folder's row, which is contained within the directory path /tmp/.
2. Click Permissions.
3. Make sure that all of the Write checkboxes are checked (shown with a blue background).

Refer to image for a visual explanation.
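
If you prefer the command line, the same directory creation, upload and permission steps can be performed from the Sandbox's SSH web client. This is a hedged sketch: it assumes the two CSV files were first copied onto the Sandbox (for example with scp) and that the commands are run from the directory containing them.

# Create the target directory in HDFS.
hdfs dfs -mkdir -p /tmp/data

# Upload the two CSV files from the local file system into HDFS.
hdfs dfs -put geolocation.csv trucks.csv /tmp/data/

# Open up write permissions on the directory, mirroring the Permissions step above.
hdfs dfs -chmod -R 777 /tmp/data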

Summary
Congratulations! Let’s summarize the skills and knowledge we acquired from this tutorial. We learned that the Hadoop Distributed File System (HDFS) was built to manage storing data across multiple machines. Now we can upload data into HDFS using Ambari’s Files View.

Further Reading
 HDFS
 HDFS User Guide
 HDFS Architecture Guide
 HDP OPERATIONS: HADOOP ADMINISTRATION

Introduction
In this section, you will be introduced to Apache Hive. In the earlier section, we covered how to load data into HDFS. So now you have geolocation and trucks files
stored in HDFS as csv files. In order to use this data in Hive, we will guide you on how to create a table and how to move data into a Hive warehouse, from where it
can be queried. We will analyze this data using SQL queries in Data Analytics Studio (DAS) and store it as ORC. We will also walk through Apache Tez and how a DAG is created when you specify Tez as the execution engine for Hive. Let's begin...

Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

 Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox


 Learning the Ropes of the HDP Sandbox
 Loading Sensor Data into HDFS

Outline
 Apache Hive Basics
 Become Familiar with Data Analytics Studio
 Create Hive Tables
 Explore Hive Settings on Ambari Dashboard
 Analyze the Trucks Data
 Summary
 Further Reading

Apache Hive Basics


Apache Hive provides a SQL interface to query data stored in various databases and file systems that integrate with Hadoop. Hive enables analysts familiar with SQL to run queries on large volumes of data. Hive has three main functions: data summarization, query and analysis. Hive provides tools that enable easy data extraction, transformation and loading (ETL).

Become Familiar with Data Analytics Studio


Apache Hive presents a relational view of data in HDFS. Hive can represent data in a tabular format managed by Hive or just stored in HDFS, irrespective of the file format the data is in. Hive can query data from RCFile format, text files, ORC, JSON, Parquet, sequence files and many other formats in a tabular view. Through the use of SQL you can view your data as a table and create queries like you would in an RDBMS.
To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Data Analytics Studio. DAS provides an interactive interface to Hive. We can
create, edit, save and run queries, and have Hive evaluate them for us using a series of Tez jobs.
Let’s now open DAS and get introduced to the environment. From the Ambari Dashboard, select Data Analytics Studio and click on Data Analytics Studio UI.
Alternatively, use your favorite browser to navigate to http://sandbox-hdp.hortonworks.com:30800/ while your sandbox is running.
Now let’s take a closer look at the SQL editing capabilities of Data Analytics Studio:

There are 4 tabs to interact with Data Analytics Studio:


1. Queries: This view allows you to search previously executed SQL queries. You can also see commands each user issues.
2. Compose: From this view you can execute SQL queries and observe their output. Additionally, visually inspect the results of your queries and download them as csv
files.
3. Database: Database allows you to add new Databases and Tables. Furthermore, this view grants you access to advanced information about your databases.
4. Reports: This view allows you to keep track of Read and Write operations, and shows you a Join Report of your tables.
Take a few minutes to explore the various DAS sub-features.

Create Hive Tables


Now that you are familiar with DAS UI, let’s create and load tables for the geolocation and trucks data. We will create two tables: geolocation and trucks using DAS's
Upload Table tab.

Create and Load Geolocation Table


Starting from DAS Main Menu:
1. Select Database
2. Select + next to Tables to add a new Table
3. Select Upload Table
Complete form as follows:

 Select checkbox: Is first row Header: True


 Select Upload from HDFS
 Set Enter HDFS Path to /tmp/data/geolocation.csv
 Click Preview
You should see a similar screen:

Note: the first row contains the names of the columns.
Click Create button to complete table creation.

Create and Load Trucks Table


Repeat the steps above with the trucks.csv file to create and load the trucks table.

Behind the Scenes


Before reviewing what happened behind the scenes during the Upload Table Process, let’s learn a little more about Hive file formats.
Apache ORC is a fast columnar storage file format for Hadoop workloads.
The Optimized Row Columnar (Apache ORC project) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other
Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
To create a table using the ORC file format, use the STORED AS ORC option. For example:
CREATE TABLE <tablename> ... STORED AS ORC ...

NOTE: For details on these clauses consult the Apache Hive Language Manual.
Following is a visual representation of the Upload table creation process:

1. The target table is created using ORC file format (i.e. Geolocation)
2. A temporary table is created using TEXTFILE file format to store data from the CSV file
3. Data is copied from temporary table to the target (ORC) table
4. Finally, the temporary table is dropped
You can review the SQL statements issued by selecting the Queries tab and reviewing the four most recent jobs, which were the result of using Upload Table.
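As a hedged sketch (the exact statements DAS generates may differ, and the column lists here are abbreviated for illustration), the four steps typically look like this:

-- 1. Target table in ORC format.
CREATE TABLE geolocation (truckid STRING, driverid STRING, event STRING) STORED AS ORC;

-- 2. Temporary staging table in TEXTFILE format matching the CSV layout
--    (the CSV data is then loaded into it, e.g. with LOAD DATA INPATH).
CREATE TABLE geolocation_stage (truckid STRING, driverid STRING, event STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- 3. Copy the rows from the staging table into the ORC table.
INSERT INTO TABLE geolocation SELECT * FROM geolocation_stage;

-- 4. Drop the temporary staging table.
DROP TABLE geolocation_stage;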
Verify New Tables Exist
To verify the tables were defined successfully:

1. Click on the Database tab.


2. Click on the refresh icon in the TABLES explorer.
3. Select table you want to verify. Definition of columns will be displayed.
Sample Data from the trucks table
Click on the Compose tab, type the following query into the query editor and click on Execute:
select * from trucks limit 10;

The results should look similar to:


A few additional commands to explore tables:
 show tables; - List the tables created in the database by looking up the list of tables from the metadata stored in HCatalog
 describe {table_name}; - Provides a list of columns for a particular table
describe geolocation;

 show create table {table_name}; - Provides the DDL to recreate a table


show create table geolocation;

 describe formatted {table_name}; - Explore additional metadata about the table. For example, to verify that geolocation is an ORC table, execute the following query:
describe formatted geolocation;

By default, when you create a table in Hive, a directory with the same name gets created in the /warehouse/tablespace/managed/hive folder in HDFS. Using the
Ambari Files View, navigate to that folder. You should see both a geolocation and trucks directory:
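
The same check can be made from the web shell client. A one-line sketch using the warehouse path quoted above (you may need to run it as the hdfs or hive user, depending on directory permissions):

# List the table directories Hive created for the managed tables.
hdfs dfs -ls /warehouse/tablespace/managed/hive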

NOTE: The definition of a Hive table and its associated metadata (i.e., the directory the data is stored in, the file format, what Hive
properties are set, etc.) are stored in the Hive metastore, which on the Sandbox is a MySQL database.
Rename Query Editor Worksheet
Click on the SAVE AS button in the Compose section, enter the name of your query and save it.
Beeline - Command Shell
Try running commands using the command line interface - Beeline. Beeline uses a JDBC connection to connect to HiveServer2. Use the built-in SSH Web Client (aka
Shell-In-A-Box):
1. Connect to Beeline as the hive user:
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000 -n hive

2. Enter the following Beeline commands to grant all permissions on the foodmart and default databases to the maria_dev user:
grant all on database foodmart to user maria_dev;
grant all on database default to user maria_dev;
!quit

3. Connect to Beeline using maria_dev.


beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000 -n maria_dev

4. Enter the following Beeline commands to view 10 rows from the foodmart database's customer and account tables, and try a few other commands:
select * from foodmart.customer limit 10;
select * from foodmart.account limit 10;
select * from trucks;
show tables;
!help
!tables
!describe trucks

5. Exit the Beeline shell:


!quit

What did you notice about performance after running Hive queries from the shell?

 Queries run from the shell are faster because Hive runs the query directly in Hadoop, whereas in DAS the query must be accepted by a REST server before it can be submitted to Hadoop.
 You can get more information on the Beeline from the Hive Wiki.
 Beeline is based on SQLLine.

Explore Hive Settings on Ambari Dashboard


Open Ambari Dashboard in New Tab
Click on the Dashboard tab to start exploring the Ambari Dashboard.

Become Familiar with Hive Settings


Go to the Hive page, then select the Configs tab, then click on the Settings tab:
Once you click on the Hive page you should see a page similar to the one above:

1. Hive Page
2. Hive Configs Tab
3. Hive Settings Tab
4. Version History of Configuration

Scroll down to the Optimization Settings:


In the above screenshot we can see:
 Tez is set as the optimization engine
 The HDP Ambari Smart Configurations, which simplify setting configurations

 Hadoop is configured by a collection of XML files.


 In early versions of Hadoop, operators would need to do XML editing to change settings. There was no default versioning.
 Early Ambari interfaces made it easier to change values by showing the settings page with dialog boxes for the various settings and allowing you to edit them. However, you
needed to know what needed to go into the field and understand the range of values.
 Now with Smart Configurations you can toggle binary features and use the slider bars with settings that have ranges.

By default the key configurations are displayed on the first page. If the setting you are looking for is not on this page you can find additional settings in
the Advanced tab:
For example, if we want to improve SQL performance, we can use the Hive vectorization features. These settings can be found and enabled by following these steps:

1. Click on the Advanced tab and scroll to find the property


2. Or, start typing the property name into the property search field; the list will filter to the settings that match.

As you can see from the green circle above, the Enable Vectorization and Map Vectorization setting is already turned on.
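
The same switches can also be set per session from the query editor. A minimal sketch using the standard Hive property names:

-- Enable vectorized execution for the map side and, where supported, the reduce side.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;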
Some key resources to learn more about vectorization and some of the key settings in Hive tuning:

 Apache Hive docs on Vectorized Query Execution


 HDP Docs Vectorization docs
 Hive Blogs
 5 Ways to Make Your Hive Queries Run Faster
 Evaluating Hive with Tez as a Fast Query Engine

Analyze the Trucks Data


Next we will be using Hive, and Zeppelin to analyze derived data from the geolocation and trucks tables. The business objective is to better understand the risk the
company is under from fatigue of drivers, over-used trucks, and the impact of various trucking events on risk. In order to accomplish this, we will apply a series of
transformations to the source data, mostly through SQL, and use Spark to calculate risk. In the last lab on Data Visualization, we will be using Zeppelin to generate a
series of charts to better understand risk.
Let’s get started with the first transformation. We want to calculate the miles per gallon for each truck. We will start with our trucks data table. We need to sum up all the miles and gas columns on a per-truck basis. Hive has a series of functions that can be used to reshape a table. The keyword LATERAL VIEW is how we invoke a table-generating function such as stack(). The stack function lets us restructure the data into 3 columns labeled rdate, miles and gas (for example: 'jun13', jun13_miles, jun13_gas), producing a maximum of 54 rows per truck. We pick truckid, driverid, rdate, miles, gas from our original table, add a calculated column for mpg (miles/gas), and then calculate the average mileage.
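
Before looking at the full 54-entry query below, here is a minimal sketch of what stack() does, using just two of the monthly (miles, gas) column pairs from the trucks table:

-- Each group of (label, miles, gas) passed to stack() becomes one output row per truck;
-- here, 2 column pairs are reshaped into 2 rows.
SELECT truckid, driverid, rdate, miles, gas, miles / gas mpg
FROM trucks
LATERAL VIEW stack(2,
  'jun13', jun13_miles, jun13_gas,
  'may13', may13_miles, may13_gas
) dummyalias AS rdate, miles, gas;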

Create Table truckmileage From Existing Trucking Data


Using DAS, execute the following query:
CREATE TABLE truckmileage STORED AS ORC AS
SELECT truckid, driverid, rdate, miles, gas, miles / gas mpg
FROM trucks
LATERAL VIEW stack(54,
  'jun13',jun13_miles,jun13_gas,'may13',may13_miles,may13_gas,'apr13',apr13_miles,apr13_gas,
  'mar13',mar13_miles,mar13_gas,'feb13',feb13_miles,feb13_gas,'jan13',jan13_miles,jan13_gas,
  'dec12',dec12_miles,dec12_gas,'nov12',nov12_miles,nov12_gas,'oct12',oct12_miles,oct12_gas,
  'sep12',sep12_miles,sep12_gas,'aug12',aug12_miles,aug12_gas,'jul12',jul12_miles,jul12_gas,
  'jun12',jun12_miles,jun12_gas,'may12',may12_miles,may12_gas,'apr12',apr12_miles,apr12_gas,
  'mar12',mar12_miles,mar12_gas,'feb12',feb12_miles,feb12_gas,'jan12',jan12_miles,jan12_gas,
  'dec11',dec11_miles,dec11_gas,'nov11',nov11_miles,nov11_gas,'oct11',oct11_miles,oct11_gas,
  'sep11',sep11_miles,sep11_gas,'aug11',aug11_miles,aug11_gas,'jul11',jul11_miles,jul11_gas,
  'jun11',jun11_miles,jun11_gas,'may11',may11_miles,may11_gas,'apr11',apr11_miles,apr11_gas,
  'mar11',mar11_miles,mar11_gas,'feb11',feb11_miles,feb11_gas,'jan11',jan11_miles,jan11_gas,
  'dec10',dec10_miles,dec10_gas,'nov10',nov10_miles,nov10_gas,'oct10',oct10_miles,oct10_gas,
  'sep10',sep10_miles,sep10_gas,'aug10',aug10_miles,aug10_gas,'jul10',jul10_miles,jul10_gas,
  'jun10',jun10_miles,jun10_gas,'may10',may10_miles,may10_gas,'apr10',apr10_miles,apr10_gas,
  'mar10',mar10_miles,mar10_gas,'feb10',feb10_miles,feb10_gas,'jan10',jan10_miles,jan10_gas,
  'dec09',dec09_miles,dec09_gas,'nov09',nov09_miles,nov09_gas,'oct09',oct09_miles,oct09_gas,
  'sep09',sep09_miles,sep09_gas,'aug09',aug09_miles,aug09_gas,'jul09',jul09_miles,jul09_gas,
  'jun09',jun09_miles,jun09_gas,'may09',may09_miles,may09_gas,'apr09',apr09_miles,apr09_gas,
  'mar09',mar09_miles,mar09_gas,'feb09',feb09_miles,feb09_gas,'jan09',jan09_miles,jan09_gas
) dummyalias AS rdate, miles, gas;
Explore a sampling of the data in the truckmileage table
To view the data generated by the script, execute the following query in the query editor:
select * from truckmileage limit 100;

You should see a table that lists each trip made by a truck and driver:
Use the Content Assist to build a query
1. Create a new SQL Worksheet.
2. Start typing in the SELECT SQL command, but only enter the first two letters:
SE

3. Note that suggestions automatically begin to appear:

NOTE: Notice content assist shows you some options that start with an “SE”. These shortcuts will be great for when you write a lot of
custom query code.
4. Type in the following query
SELECT truckid, avg(mpg) avgmpg FROM truckmileage GROUP BY truckid;
5. Click the “Save As” button to save the query as “average-mpg”:
6. Notice your query now shows up in the list of “Saved Queries”, which is one of the tabs at the top of the Hive User View.
7. Execute the “average-mpg” query and view its results.

Explore Explain Features of the Hive Query Editor


Let's explore the various explain features to better understand the execution of a query: Visual Explain, Text Explain, and Tez Explain. Click on the Visual
Explain button:
This visual explain provides a visual summary of the query execution plan. You can see more detailed information by clicking on each plan phase.
If you want to see the explain result in text, select RESULTS. You should see something like:
Explore TEZ
Click on Queries and select the last SELECT query we issued:
From this view you can observe critically important information, such as:
USER, STATUS, DURATION, TABLES READ, TABLES WRITTEN, APPLICATION ID, DAG ID
There are seven tabs at the top, please take a few minutes to explore the various tabs.

Create Table avgmileage From Existing truckmileage Data


It is common to save results of query into a table so the result set becomes persistent. This is known as Create Table As Select (CTAS). Copy the following DDL into
the query editor, then click Execute:
CREATE TABLE avgmileage
STORED AS ORC
AS
SELECT truckid, avg(mpg) avgmpg
FROM truckmileage
GROUP BY truckid;
View Sample Data of avgmileage
To view the data generated by CTAS above, execute the following query:
SELECT * FROM avgmileage LIMIT 100;

Table avgmileage provides a list of average miles per gallon for each truck.
Create Table DriverMileage from Existing truckmileage data
The following CTAS groups the records by driverid and sums the miles. Copy the following DDL into the query editor, then click Execute:
CREATE TABLE DriverMileage
STORED AS ORC
AS
SELECT driverid, sum(miles) totmiles
FROM truckmileage
GROUP BY driverid;
View Data of DriverMileage
To view the data generated by CTAS above, execute the following query:
SELECT * FROM drivermileage;
We will use these results to calculate all truck drivers' risk factors in the next section, so let's store them in HDFS: download the results of the drivermileage query as a CSV file and, using the Files View, store it at /tmp/data/drivermileage.csv.
Then open your web shell client:
sudo -u hdfs hdfs dfs -chown maria_dev:hdfs /tmp/data/drivermileage.csv

Next, navigate to HDFS as maria_dev and give permission to other users to use this file:

Summary
Congratulations! Let’s summarize some of the Hive commands we learned to process, filter and manipulate the geolocation and trucks data. We can now create Hive tables with CREATE TABLE and Upload Table. We learned how to change the file format of the tables to ORC, so Hive is more efficient at reading, writing and processing this data. We learned to retrieve data using the SELECT statement and to create new filtered tables (CTAS).

Further Reading
Augment your hive foundation with the following resources:
 Apache Hive
 Hive LLAP enables sub second SQL on Hadoop
 Programming Hive
 Hive Language Manual
 HDP DEVELOPER: APACHE PIG AND HIVE

Introduction
In this tutorial you will be introduced to Apache Zeppelin and learn how to visualize data using Zeppelin.

Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

 Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox


 Learning the Ropes of the HDP Sandbox
 Loading Sensor Data into HDFS
 Hive - Data ETL
 Spark - Risk Factor

Outline
 Apache Zeppelin
 Create a Zeppelin Notebook
 Download the Data
 Execute a Hive Query
 Build Charts Using Zeppelin
 Summary
 Further Reading

Apache Zeppelin
Apache Zeppelin provides a powerful web-based notebook platform for data analysis and discovery. Behind the scenes it supports Spark distributed contexts as well
as other language bindings on top of Spark.
In this tutorial we will be using Apache Zeppelin to run SQL queries on our geolocation, trucks, and riskfactor data that we've collected earlier and visualize the result
through graphs and charts.

Create a Zeppelin Notebook


Navigate to Zeppelin Notebook
Open Zeppelin interface using browser URL:
http://sandbox-hdp.hortonworks.com:9995/
Click on a Notebook tab at the top left and select Create new note. Name your notebook Driver Risk Factor.
Download the Data
If you had trouble completing the previous tutorial or lost the risk factor data, click here to download it and upload it to HDFS under /tmp/data/.

Note: if you completed the Spark - Risk Factor section successfully, advance to Execute a Hive Query. Otherwise, after uploading the downloaded CSV, enter the following command to create a Spark temporary view:
%spark2
// Get the existing SparkSession (Zeppelin's %spark2 interpreter already provides one).
val sparkSession = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// Load the risk factor CSV from HDFS into a DataFrame, using the first row as the header.
val riskFactorDataFrame = sparkSession.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor.csv")
// Register the DataFrame as a temporary view so it can be queried with SQL.
riskFactorDataFrame.createOrReplaceTempView("riskfactor")
sparkSession.sql("SELECT * FROM riskfactor LIMIT 15").show()

Execute a Hive Query


Visualize finalresults Data in Tabular Format
In the previous Spark tutorial you already created a table finalresults or riskfactor which gives the risk factor associated with every driver. We will use the data we generated in this table to visualize which drivers have the highest risk factor. We will use the JDBC Hive interpreter to write queries in Zeppelin.
1. Copy and paste the code below into your Zeppelin note.
%sql
SELECT * FROM riskfactor
2. Click the play button next to "ready" or "finished" to run the query in the Zeppelin notebook.
An alternative way to run the query is to press Shift+Enter.
Initially, the query will produce the data in tabular format as shown in the screenshot.
Build Charts using Zeppelin
Visualize finalresults Data in Chart Format
1. Iterate through each of the tabs that appear underneath the query. Each one will display a different type of chart depending on the data that is returned in the query.

2. After clicking on a chart, we can view extra advanced settings to tailor the view of the data we want.
3. Click settings to open the advanced chart features.
4. To make a chart with riskfactor.driverid and riskfactor.riskfactor SUM, drag the table relations into the boxes as shown in the image below.
5. You should now see an image like the one below.
6. If you hover on the peaks, each will give the driverid and riskfactor.
7. Try experimenting with the different types of charts as well as dragging and dropping the different table fields to see what kind of results you can obtain.
8. Let's try a different query to find which cities and states contain the drivers with the highest risk factors.
%sql
SELECT a.driverid, a.riskfactor, b.city, b.state
FROM riskfactor a, geolocation b where a.driverid=b.driverid
9. After changing a few of the settings we can figure out which of the cities have the highest risk factors. Try changing the chart settings by clicking the scatterplot icon. Then make sure that a.driverid is in the xAxis field, a.riskfactor is in the yAxis field, and b.city is in the group field. The chart should look similar to the following.
You can hover over the highest point to determine which driver has the highest risk factor and in which cities.

Summary
Great, now we know how to query and visualize data using Apache Zeppelin. We can leverage Zeppelin, along with our newly gained knowledge of Hive and Spark, to solve real-world problems in new and creative ways.

Further Reading
 Zeppelin on HDP
 Apache Zeppelin Docs
 Zeppelin Homepage
