
Operational Database Partner Workshop Kudu Lab


This lab walks through setting up and running Kudu workloads on a CDH 5.12+
Quickstart VM.

Author: Dave Fowler
Date: October 2017
Cloudera Version: 5.12+
Kudu: 1.4+

Prerequisites for the Lab


You will need a 64-bit PC with at least 9 GB of memory and 2 vCPUs available for a VM, and
either VMware (Player/Fusion 4.x+) or VirtualBox installed and working. You need to be able
to download and run a Cloudera Quickstart VM in your environment. Quickstart VMs are
provided as zip archives (7-Zip or an equivalent is required). Allow 6 GB of disk space for the
zipped image and 15 GB for the uncompressed VM and Kudu lab.

Attendees should be comfortable with the Linux command line, relational concepts, and SQL.

Not supported in this lab doc
We are using the Cloudera Quickstart VM Express configuration for this lab. Other
Quickstart VM environments such as Docker or KVM should work but have not been tested
for this lab document. You could also install Kudu and run the labs on your own cluster or
create your own VM image.


Overview
This Kudu hands-on lab walks through the install, configuration, ingest, processing, and
analysis of streaming-type data. The data used in the lab is actual streaming data from the
San Francisco Municipal Transportation Agency (SFMTA) and profiles each vehicle for
IoT-type behaviors and events.

Part of the Cloudera Partner Workshop Series

The lab consists of the following steps:

1. Download the CDH Quickstart VM for VMware or VirtualBox
2. Change the memory and vCPU settings for a CDH Express install
3. Install Kudu as packages into the CDH Express configuration included in the VM
4. Configure Kudu for a non-supported single-node VM
5. Load sample streaming data into HDFS
6. Create an external table in Impala mapping to the HDFS data
7. Create tables in Kudu and ingest the sample streaming data
8. Explore data through Hue/Search (or a BI tool of your choosing)
9. Explore data via the Impala shell
10. Use the Java client API to create tables, load data, and query
11. Monitor Kudu

Kudu is a column-oriented relational storage layer whose tables are partitioned by range, by
hash, or by a combination of the two. It is a mutable alternative to HDFS, which is still an
excellent choice for many analytic workloads. Some examples of where Kudu excels over HDFS
as a storage layer are:

● Time Series – insert time series data in real-time.
● IoT and Machine Data Analytics – handle large volumes and velocities of
machine-generated data.
● Online Reporting – quickly add/update data to your data store.
● Lambda type architectures that involve both real-time pipelines and batch data.
● Machine Learning with scoring models.
● Integration with SQL, Spark, and Spark Streaming processing engines.



Lab 1 Install and Configure Kudu

Download the Quickstart VM
1. Download the pre-installed VMware or VirtualBox CDH 5.12 Quickstart VM from the
download page here. Later versions of the CDH Quickstart VM had not been tested
when this lab was written but should work with it. The VM needs 2 or more vCPUs
and at least 9 GB of memory allocated to it to run with minimal or no paging.

Caution: If your system does not have enough physical memory for the amount allocated to
the VM, either the VM or your base OS will start swapping. Either case leads to timeouts
that cause the VM or your base OS to stop responding properly. You can check the free
memory inside the VM with Linux commands like “free -m”. On your PC, use a tool such as
Activity Monitor on a Mac, or the equivalent on Windows, to see how much memory is in use
and available, and to keep the VM's allocation within what your laptop/desktop can spare.
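
As a rough illustration of the memory check described above, the following Python 3 sketch
parses the text of /proc/meminfo and reports how much memory is available. The function
names are hypothetical, and the fallback formula (MemFree + Buffers + Cached) is an
assumption for older kernels that do not report MemAvailable; "free -m" remains the simpler
tool for a quick look.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Values are the leading integer on each line (kB for most fields).
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_mib(meminfo):
    """Return the memory available to new processes, in MiB.

    Assumption: when MemAvailable is missing, approximate it as
    MemFree + Buffers + Cached.
    """
    if "MemAvailable" in meminfo:
        kib = meminfo["MemAvailable"]
    else:
        kib = (meminfo.get("MemFree", 0)
               + meminfo.get("Buffers", 0)
               + meminfo.get("Cached", 0))
    return kib // 1024
```

Inside the VM, passing the contents of /proc/meminfo to parse_meminfo() and then calling
available_mib() on the result gives a number comparable to the output of "free -m".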

Note: The CDH Enterprise configuration with the trial license available in the Quickstart VM,
and parcel-based configurations, should work but are not covered in this lab document.
You would need approximately 1 GB of additional memory, or more depending on the other
CDH services you have running.

Change the Memory and VCPU settings for the VM
1. Before starting the VM, open the VMware or VirtualBox settings menu for the VM.
Change the number of vCPUs to 2 or higher and the memory to 9 GB (9216 MiB) or higher.

For more information on the Quickstart VM, see the Quickstart VM Guide referenced in the
Appendix, which has further instructions on using the VM with VMware and VirtualBox
environments.

Start the Quickstart VM
1. Start the Quickstart VM from VMware or VirtualBox.
2. After the initial desktop boots, click on the shortcut icon “Launch Cloudera
Express…”. This changes the Quickstart VM configuration to use the Express Edition
of CDH, which we will use for the remainder of these labs. It takes a few minutes
for the Quickstart VM to switch to the Express version of CDH from the default init
scripts configuration. The “Launch Cloudera Express…” step only needs to be run
once; do not repeat it on subsequent restarts of the VM.

3. Bring up Cloudera Manager when it is available and log in as user/password
cloudera/cloudera. It will take several minutes for the VM, CM, and CDH cluster to
start or restart. The Cloudera Manager link is on the browser links bar for easy
access (or use http://quickstart.cloudera:7180/cmf/home).
4. With the Express install, no CDH services other than the Host Monitor and Cloudera
Manager run by default after boot. Verify in Cloudera Manager (CM) that no other
services are running. If the CM Service and Host Monitor do not show as running,
restart the Cloudera Manager service now (Cloudera Manager Service->Restart).



Kudu Installation by Packages
To load Kudu from Linux packages we will add the Kudu repository definition for yum to our
VM Linux install.
1. Download the Kudu repository file for RH/OL/CentOS 6, the OS family used in the
Quickstart VM, and save it in “/etc/yum.repos.d”. This can be done in a terminal as shown here:

$ cd /etc/yum.repos.d
$ sudo curl -L -O http://archive.cloudera.com/kudu/redhat/6/x86_64/kudu/cloudera-kudu.repo

2. Use yum in the Linux terminal to install Kudu via packages.

$ sudo yum -y install kudu

3. Optional. If you want to try C++ development libraries separate from this lab, install
kudu-client and kudu-client-devel packages.

Add the Kudu Service and Configure Kudu in Cloudera Manager
1. Create a /data directory from a Linux shell. We will configure Kudu to use /data for
the Kudu master and tserver sub-directories which will be created automatically by
Kudu during startup.

$ sudo mkdir /data; sudo chown kudu:kudu /data

In the next step we add Kudu as a service to the cluster. Cloudera Manager attempts to start
the Kudu service at the end of the initial configuration process. That first start may fail,
because Kudu still needs further configuration changes to work on a Quickstart VM. We will
restart the Kudu service later, after we make those changes.
2. Open Cloudera Manager. Click the Cloudera Quick… drop-down menu > Add Service.
Select Kudu from the list. If you don’t see Kudu in the list, your install in the previous
section was incomplete. The Kudu setup wizard will then start.

1. Select your VM host name “quickstart.cloudera” from the list presented for both the
Master and Tablet Server roles (the Quickstart VM runs on a single host
“quickstart.cloudera” but a production cluster would have multiple hosts).



2. Finish running through the Kudu setup wizard in Cloudera Manager and fill out the
required fields. Use the sample settings below to set the data and WAL directories for
the Kudu master and tablet server. Cloudera Manager will create the tserver and master
sub-directories specified. (Make sure to set all four directories listed below.)

Sample Settings for Kudu on a Quickstart VM

Kudu Master WAL Directory = /data/master
Kudu Master Data Directories = /data/master
Kudu Tablet Server WAL Directory = /data/tserver
Kudu Tablet Server Data Directories = /data/tserver

Note: In production you would normally place masters and tablet servers on different hosts
for best performance; this single-host Quickstart VM configuration does not.



The Kudu service will attempt to start and may fail, since we haven’t yet made the
configuration changes it needs to run on a Quickstart VM. We will restart the Kudu service
later, after we make those changes.

3. Enable Kudu in the Impala Service. From Cloudera Manager, select the Impala
Service->Configuration and put in kudu for the keyword search. Change the setting
to turn on access to the “Kudu Service” from Impala (turned off by default) and save.
Don’t start Impala just yet.

We will start all the required services in a later step and can start or restart Impala
whenever needed. (Cloudera Manager makes it easy to stop or start CDH services as desired.)



4. Change the Kudu Service setting for the default number of replicas. From the main
CM screen, click on the Kudu Service>Configuration. Look for the setting Default
Number of Replicas, which defaults to 3. Change it to 1, since the Quickstart VM has
only one host and one tablet server, and save. Any production installation would
normally use at least 3 replicas.



Note: Use a reliable clock in Linux. Normally you need to install and configure NTP so that
Kudu has a reliable clock. Without one you may get errors starting and running Kudu that
NTP would resolve. For development only, it is possible to avoid setting up ntpd with a
reliable time server by running Kudu with the setting --use-hybrid-clock=false. However,
that setting has a serious effect on transactional consistency, so it is not something we
recommend for production or load testing. Furthermore, without a reliable clock provided
by NTP, Kudu skips cleanup work it would normally do: disk usage grows beyond what is
normal and performance suffers.

5. Because this short lab environment is non-production and we do not care about
performance here, we will set the not-recommended --use-hybrid-clock=false parameter.
Never forget to set up NTP for Kudu in production!

Click on the Kudu Service>Configuration. Look for the setting “Kudu Service Advanced
Configuration Snippet (Safety Valve) for gflagfile” and add the setting:

--use-hybrid-clock=false


Note: If you don’t set this clock setting correctly (or set up NTP), Kudu will intermittently
fail to start, and the full log file for the Kudu service will contain time sync errors
such as:
Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized

Start the Cluster Services We Need for the Labs
1. We will be using HDFS as our data source and Impala (which uses the Hive metastore) as
our SQL engine for Kudu. Hue (Hadoop User Experience) has a graphical interface
for running SQL on Kudu in an Impala Query Editor. There are also command line
shells and API interfaces you could use with Kudu and Impala.
2. Start or restart the required services in CM for our Kudu labs, one at a time and in
this order: HDFS, Hive, Kudu, Impala, and Hue.

You may get some initial warnings (yellow) on the service monitors as we are running below
thresholds for recommended memory. Those are ok for the Quickstart VM configuration.
All the services listed above should be running successfully at this point and before
proceeding to the next lab.



Lab 2 Tables and Data

Download the Sample Time Series Data
For this step, we will be using streaming data from the San Francisco transit system (SFMTA),
served from a mirror maintained by the Apache Kudu project. The SFMTA’s site is often a bit
slow, so the Apache project has mirrored a sample CSV file from the dataset we will be using.
The original dataset uses DOS-style line endings, so we will also convert it to UNIX-style
during the upload process using tr. We will then load the sample CSV file into HDFS and
ingest it from there into Kudu.

Open a Linux terminal in the Quickstart VM and log in as cloudera/cloudera.


$ cd /tmp
$ wget http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz
$ hdfs dfs -mkdir /user/cloudera/sfmta
$ zcat sfmtaAVLRawData01012013.csv.gz | tr -d '\r' | hadoop fs -put - /user/cloudera/sfmta/data.csv
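
For readers who want to see what the zcat and tr stages of the pipeline above do, here is a
minimal Python 3 sketch of the same transformation: decompress gzip data and delete every
carriage return byte. The function name and chunk size are illustrative choices, not part of
the lab, and the sketch does not perform the HDFS upload.

```python
import gzip
import io


def gunzip_strip_cr(src, dst, chunk_size=1 << 20):
    """Decompress gzip data from src and write it to dst with CR bytes removed.

    src and dst are binary file-like objects. Like `tr -d '\\r'`, this deletes
    every carriage return, not only those that precede a newline.
    Returns the number of bytes written.
    """
    written = 0
    with gzip.GzipFile(fileobj=src, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            cleaned = chunk.replace(b"\r", b"")
            dst.write(cleaned)
            written += len(cleaned)
    return written
```

Processing in chunks keeps memory use flat regardless of file size, which is the same
reason the shell version streams through a pipe rather than writing an intermediate file.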



Create a Source Table Definition in Impala for the Sample Data
1. Open Hue and log in with cloudera/cloudera. There is a link to Hue on the browser bar
and in the Hue service in CM.
2. Go to the Impala Query Editor in Hue.
3. Run this command to create the table.

CREATE EXTERNAL TABLE sfmta_raw (
  revision int,
  report_time string,
  vehicle_tag int,
  longitude float,
  latitude float,
  speed float,
  heading float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sfmta/'
TBLPROPERTIES ('skip.header.line.count'='1');

Food for Thought: Did we just create a Kudu or Impala table?




4. Query the data to validate that the table was created and the data loaded successfully.
The query should return a row count with no errors.

SELECT count(*) FROM sfmta_raw;



Create a Kudu Table and Load Data
5. Create a Kudu table in the Impala Query Editor for time series data with a composite primary
key partitioned with a hash of the report time.

CREATE TABLE sfmta
PRIMARY KEY (report_time, vehicle_tag)
PARTITION BY HASH(report_time) PARTITIONS 8
STORED AS KUDU
AS SELECT
  UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss') AS report_time,
  vehicle_tag,
  longitude,
  latitude,
  speed,
  heading
FROM sfmta_raw;
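
The statement above does two things to report_time: it converts the text to Unix seconds,
and it hashes the result into one of 8 partitions. The Python 3 sketch below illustrates
both ideas outside of Impala. It assumes the timestamp text is read as UTC, and it uses
CRC32 purely as a stand-in hash; Kudu's real hash function is different, so the bucket
numbers here will not match actual tablet assignments. Both function names are hypothetical.

```python
import calendar
import time
import zlib


def report_time_to_epoch(text):
    """Convert 'MM/dd/yyyy HH:mm:ss' text to Unix seconds, reading it as UTC."""
    parsed = time.strptime(text, "%m/%d/%Y %H:%M:%S")
    return calendar.timegm(parsed)


def illustrative_bucket(epoch_seconds, num_buckets=8):
    """Map a key to one of num_buckets partitions.

    CRC32 is an assumption made for this sketch only; it is NOT the hash
    Kudu uses internally.
    """
    key_bytes = epoch_seconds.to_bytes(8, "big", signed=True)
    return zlib.crc32(key_bytes) % num_buckets
```

The point of hashing a time column is that rows arriving with adjacent timestamps tend to
land in different partitions, so inserts are spread over several tablets instead of all
hitting the one that holds the latest time range.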



Read and Modify the Data
1. Run a query to find the highest recorded vehicle speed in Kudu.

SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1;


2. With a quick Google search we can see that this bus was traveling east on 16th
Street at 68 mph. At first glance, this seems unlikely to be true. Perhaps we do some
research, find that this bus’s sensor equipment was broken, and decide to remove the
data. If this data were stored in HDFS and accessed through Hive or Impala, updates
would be problematic, since HDFS is not designed to be easily mutable. With Kudu this
is very easy to correct using standard SQL:

DELETE FROM sfmta WHERE vehicle_tag = 5411;



Impala Shell Access
We could also have used the Impala shell for command line SQL access. Let’s run a
query against Kudu from the Impala shell.
1. Open a Linux shell with the default user cloudera (password cloudera) and run:

$ ssh cloudera@quickstart.cloudera -t impala-shell
(enter cloudera at the password prompt for the user cloudera)

Then execute the following SQL:

> SELECT count(*) FROM sfmta;

The result should be 842279 with the current data example set.

> exit;



Note: To keep this lab short, the above example showed how to load, query, and mutate a
static dataset with Impala and Kudu. A real strength of Kudu is its ability to ingest and
mutate data directly in a streaming fashion.

Lab 3 Using the Java API Client

There are several API clients you can use to create and access objects in Kudu. Clients
include: SQL (Impala & Spark SQL), Java, JDBC/ODBC (Impala), C++, and Python (a
Python-friendly interface to the C++ client API). A number of other CDH subsystems and
third-party applications use these APIs to provide direct integration with Kudu. For
example, Spark has a Kudu DataFrame as a built-in capability. Flume and Kafka can be used
as part of an ingest cycle with Kudu. Cloudera Data Science Workbench has the Kudu
libraries pre-loaded in release 5.13+. Numerous third-party applications, such as StreamSets
pipelines and Business Intelligence tools, support Kudu as a core data source.

In this lab we will download a set of Kudu examples and run a Java client that creates a
table, writes some data, scans it, and then deletes the table.


Download the Kudu examples
1. In your VM, open a browser tab. Go to the link:
https://github.com/cloudera/kudu-examples
2. Click on the “Clone or download” button and save the zip file in your VM. The
default save location is to the Downloads directory which is fine.



3. In your VM, open a terminal session as cloudera/cloudera (default user/pass) and
run these commands to download the needed components and build our jar file for
the java-sample application:

$ cd ~/Downloads
$ unzip kudu-examples-master.zip
$ cd kudu-examples-master/java/java-sample
$ mvn package

4. View the java source file we just built and will run shortly:

$ less /home/cloudera/Downloads/kudu-examples-master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java

5. Run the java-sample client that creates a table, writes some data, scans it, and then
deletes the table.

$ java -jar target/kudu-java-sample-1.0-SNAPSHOT.jar

You should have received back “value 0, value 1, value 2” with no errors.



Using Kudu Tables Created with the API and Impala

You can add tables that were created by one of the Kudu APIs to Impala’s metadata and use
them from both the API and Impala. In this step we will modify and build the Java source to
create a table. We will look up the table definition in the Kudu Master Web UI and run
that command in the Impala Query Editor. That adds the table definition to the Impala
metadata definitions of internal and external tables.

1. In your VM, open a terminal and log in as cloudera/cloudera. Then run this
command:

$ vi /home/cloudera/Downloads/kudu-examples-master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java

Change the line that sets the table name so the name contains only ASCII letters. The
default name contains a hyphen, which Impala table names do not accept. Also comment out
the line that deletes the table after the query.

Before:
String tableName = "java_sample-" + System.currentTimeMillis();
After:
/*String tableName = "java_sample-" + System.currentTimeMillis();*/
String tableName = "javasample";

Before:
client.deleteTable(tableName);
After:
/* client.deleteTable(tableName); */

2. Rebuild the project and run the Java command to build and populate the Kudu table:

$ cd ~/Downloads/kudu-examples-master/java/java-sample
$ mvn package
$ java -jar target/kudu-java-sample-1.0-SNAPSHOT.jar

3. Open the Kudu Master Web UI from CM to obtain the table definition SQL for
Impala. From the main Cloudera Manager screen Kudu->Kudu Master Web UI->
Tables.



4. Click on the Table Id link for the “javasample” table we just created



5. Scroll down and copy the command listed to create the Impala table definition.

CREATE EXTERNAL TABLE `javasample` STORED AS KUDU
TBLPROPERTIES(
'kudu.table_name' = 'javasample',
'kudu.master_addresses' = 'quickstart.cloudera:7051');
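
The statement shown by the Web UI follows a fixed template, so it can also be generated for
other tables. The Python 3 sketch below builds the same text from three inputs. The function
name is hypothetical, the values are interpolated without any escaping, and the name check
(ASCII letters, digits, and underscores only) is a simplifying assumption for this sketch
rather than a full statement of Impala's identifier rules.

```python
def impala_external_table_ddl(impala_name, kudu_name, master_addresses):
    """Build the Impala statement that maps an existing Kudu table.

    Raises ValueError if impala_name contains anything other than ASCII
    letters, digits, or underscores (a deliberately strict check).
    """
    simple = impala_name.replace("_", "")
    if not (simple.isalnum() and impala_name.isascii()):
        raise ValueError(
            "Impala table name must use ASCII letters, digits, or underscores")
    return (
        "CREATE EXTERNAL TABLE `{0}` STORED AS KUDU\n"
        "TBLPROPERTIES(\n"
        "  'kudu.table_name' = '{1}',\n"
        "  'kudu.master_addresses' = '{2}');"
    ).format(impala_name, kudu_name, master_addresses)
```

Calling it with "javasample", "javasample", and "quickstart.cloudera:7051" reproduces the
statement above, while a name such as the original "java_sample-" plus a timestamp is
rejected because of the hyphen.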

6. In the VM browser, open Hue->Impala Query Editor and run the command we just
captured to create the table definition in the Impala metadata.



7. Run a “select * from javasample” to query the table that is now available from the
Impala Query Editor.



Lab 4 Monitor and Manage Kudu from Cloudera Manager

Monitor the Memory Usage for Kudu in CM
You can review the CPU and memory settings that the services in the cluster would use if
they were started.
1. Click on “Hosts → All Hosts” on the CM home page, then click on the
quickstart.cloudera host, and click on the Resources tab. Scroll through the services,
including Kudu, to see what base resources they would use when started.




2. Review the Kudu memory settings. In a normal production environment, you would
raise these from the defaults. For the Quickstart lab we will leave the default
settings unchanged. From the Cloudera Manager main menu, click on the
Kudu Service>Configuration. Review the setting ‘memory_limit_hard_bytes’.



Note: The ‘memory_limit_hard_bytes’ flag (Kudu Tablet Server Hard Memory Limit in CM)
determines the amount of RAM that a Kudu tablet server may use. The amount of memory
required for a workload scales with a number of factors including data size, write workload,
and read concurrency.

The following gives a baseline starting point for a configuration:

● Baseline: 1.5 GB of memory per 1 TB stored on disk, which is enough to at least start
up a tserver with that amount of data. NOTE: this ratio may be schema- and
workload-dependent, so it is meant as a starting point, not a hard rule.
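
The baseline ratio is simple arithmetic, sketched here in Python 3. The function names are
hypothetical, and treating the lab's "GB" and "TB" as binary units (GiB and TiB) is an
assumption made for the byte conversion; the result is only the starting point described
above, not a tuned value.

```python
GIB_PER_TIB_BASELINE = 1.5  # from the lab text: 1.5 GB of memory per 1 TB on disk


def baseline_tserver_memory_gib(data_on_disk_tib):
    """Return the starting-point memory (GiB) for a tserver holding this much data."""
    if data_on_disk_tib < 0:
        raise ValueError("data size cannot be negative")
    return GIB_PER_TIB_BASELINE * data_on_disk_tib


def gib_to_bytes(gib):
    """Convert GiB to whole bytes, the unit used by memory_limit_hard_bytes."""
    return int(gib * 1024 ** 3)
```

For example, a tablet server expected to hold 4 TB on disk would start from a baseline of
6 GB, which gib_to_bytes() turns into a byte count to compare against the configured limit.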


Monitoring Master and Tablet Servers for Appropriate Memory Limit
After configuring the memory limit appropriately, and while a workload is running, monitor
the tablet server process RAM usage using CM.
1. CM->Kudu Service-> Tablet Server->Chart Library (some base default charts also on
the Status page). Scroll down to the Resident Memory Chart and view your memory
usage.
2. Do the same for the Master Server (you can select the Master Server from the top
level Kudu Service screen).

In a real production or load test environment you will see more relevant numbers to monitor
on this chart. For example, consider a server configured with a 6 GB limit that usually
stays between 50% and 75% of its limit and occasionally tops 80%. Such a server may benefit
from an increased memory limit, but it appears generally healthy. If a server is
consistently utilizing more than 75% of its memory limit, the limit should be increased.
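
The 75% rule of thumb can be expressed as a small check over resident memory samples, as in
this Python 3 sketch. The function name is hypothetical, and the definition of
"consistently" (at least 90% of samples above the threshold) is an assumption chosen for
illustration; the lab text does not define it.

```python
def should_raise_limit(samples, limit_bytes, threshold=0.75, min_fraction=0.9):
    """Return True when memory use is consistently above threshold * limit.

    samples: resident memory readings in bytes.
    min_fraction: share of samples that must exceed the threshold for usage
    to count as "consistent" (an assumption made for this sketch).
    """
    if not samples:
        return False
    over = sum(1 for s in samples if s / limit_bytes > threshold)
    return over / len(samples) >= min_fraction
```

With a 6 GB limit, a series that mostly sits between 50% and 75% with a single reading at
80% returns False, matching the "generally healthy" case above, while a series that stays
at 80% returns True.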



Next Steps

HBase and Kudu White Papers and Engineering Blogs
https://www.cloudera.com/products/open-source/apache-hadoop/apache-hbase.html
https://www.cloudera.com/products/open-source/apache-hadoop/apache-kudu.html


Architecture Series Webinar Recordings
The Kudu Architecture Series of recorded webinars is helpful for more information on
Kudu use cases and for additional architecture discussion.
http://app.go.cloudera.com/e/er?s=1465054361&lid=27415&elqTrackId=c5a1564d87f6433091b8036d58f8a18d&elq=0d2673ac630b43cdba267eec5d2271c0&elqaid=4139&elqat=1
https://www.cloudera.com/resources/resources-library.html


Cloudera Connect Recordings for Partners - Kudu Deep Dive
As a Cloudera partner you have access to live and recorded webinars with content created
just for partners. The Cloudera Connect - Kudu Deep Dive provides an overview of the
positioning and architecture used by Kudu.
https://www.cloudera.com/partner-portal/training/cloudera-showcase/cloudera-showcase-on-demand.html
https://clouderaconnect.mindtickle.com/#/917883691545578001


Cloudera Kudu Documentation
https://www.cloudera.com/documentation/enterprise/latest/topics/hbase.html
https://www.cloudera.com/documentation/kudu/latest.html


Cloudera University Training
Cloudera University provides in-person, web-based, and on-demand classes on both HBase
and Kudu.
https://www.cloudera.com/more/training.html?src=GoogleAdWords&gclid=CjwKEAjwgvfOBRD7_IDSuP3znTwSJAB4_t6GSlWLK3nw8VTGvekTGvNMrSy7UJZkj_CX8MD4sdRjQBoCddrw_wcB




Appendix

Cloudera Quickstart VM documentation.

Install Kudu using Cloudera Manager documentation (packages or parcels).

Apache Kudu site





Troubleshooting

Kudu Master or Tablet server fails to start
If the Kudu service fails to start, check the role log file for the tablet server or master server in the Kudu Service:
CM->Kudu->{master or tablet server}->Log Files->Role Log File

1. If you see a network error in the log file similar to this one, you may have to restart the VM to clear it. This is
an intermittent artifact of running in a single-VM environment and of the network setup in that VM.
Check failed: _s.ok() Bad status: Network error: error binding socket to 0.0.0.0:7051: Address already in use (error 98)
2. If you see a time sync error, or an error about no reliable clock, similar to this one, you did not set the clock
parameter correctly in Lab 1.
Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized
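
When a role log is long, a short script can point at the two failures described above. This
Python 3 sketch matches only on fragments of the two messages quoted in this section; the
function name and the hint wording are illustrative, and it assumes each log message sits on
a single line in the actual log file.

```python
def diagnose_kudu_log(lines):
    """Return (line_number, hint) pairs for the two startup failures above."""
    hints = []
    for number, line in enumerate(lines, start=1):
        if "Address already in use" in line:
            hints.append((number, "port already bound: restart the VM"))
        elif "Cannot initialize clock" in line:
            hints.append(
                (number, "clock not synchronized: check the clock setting from Lab 1"))
    return hints
```

Feeding it the lines of a role log downloaded from CM returns an empty list when neither
message is present, which suggests looking elsewhere in the log for the cause.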


CM or Host Monitor not showing a healthy status
After booting the VM, if the Host Monitor or CM monitor is not showing a good status in CM, try restarting
the Cloudera Manager service.


Impala Service is not showing a healthy status and failed to start
You need to start the HDFS and Hive services before the Impala service. The full role log from the Impala service will
show why Impala won't start. In this case it will show the HDFS and Hive services are not running.

