Overview
This Kudu hands-on lab walks through installation, configuration, ingest, processing, and analysis
of streaming data. The data used in the lab is actual streaming data from the San
2. Finish running through the Kudu setup wizard in Cloudera Manager and fill out the
required fields. Use the sample settings below to set the Kudu master and tablet
server data and WAL directories. Cloudera Manager will create the tserver and master
sub-directories specified. (Make sure to set all four directories listed below.)
Sample Settings for Kudu on a Quickstart VM
Kudu Master WAL Directory = /data/master
Kudu Master Data Directories = /data/master
Kudu Tablet Server WAL Directory = /data/tserver
Kudu Tablet Server Data Directories = /data/tserver
Note: For best performance you would normally run the Kudu master and the tablet servers on
different hosts. This Quickstart VM based configuration runs them on a single host.
4. Change the Kudu service setting for the default number of replicas. From the main
CM screen, click Kudu Service>Configuration. Look for the setting Default
Number of Replicas, which defaults to 3. Change it to 1, since the Quickstart VM
has only one host and one tablet server, and save. For any production installation,
the number of replicas would normally be at least 3.
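One way to see why 3 replicas is the usual production minimum: Kudu replicates each tablet with a majority-based consensus protocol, so a tablet stays available only while a majority of its replicas are up. The sketch below is just that arithmetic. The class and method names are hypothetical and are not part of any Kudu API.

```java
// Illustrative arithmetic only; these names are hypothetical, not a Kudu API.
final class ReplicaSketch {
    // Smallest number of replicas that forms a majority.
    static int majority(int replicas) {
        return replicas / 2 + 1;
    }

    // Replicas that can be lost while a majority is still available.
    static int tolerableFailures(int replicas) {
        return replicas - majority(replicas);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 5}) {
            System.out.println(n + " replica(s): majority=" + majority(n)
                    + ", tolerable failures=" + tolerableFailures(n));
        }
    }
}
```

With 1 replica, as on the Quickstart VM, no failure can be tolerated. With 3 replicas, one tablet server can be lost.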
Note: Use a reliable clock in Linux. Normally you need to install and configure NTP so that Kudu
has a reliable clock. Without one, you may get errors starting and running Kudu
that NTP would resolve. For development only, you can avoid setting up and running
ntpd with a reliable time server by running Kudu with the setting --use-hybrid-clock=false.
Note: If you do not apply this clock setting correctly (or set up NTP), Kudu will intermittently
fail to start, and the full log file for the Kudu service will contain errors indicating time
sync problems such as:
Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized
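If you want to check a captured Kudu log for this condition, a minimal sketch follows. It assumes the log lines have already been read into memory and that the error text appears on a single line in the real log. The class and method names are hypothetical, and the marker string is taken from the error message quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for spotting the clock error quoted above in log lines.
final class ClockLogSketch {
    // Marker text copied from the error message shown in this lab.
    static final String MARKER = "Clock considered unsynchronized";

    // Returns true if any line contains the clock synchronization marker.
    static boolean hasClockSyncError(List<String> logLines) {
        for (String line : logLines) {
            if (line.contains(MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Check failed: _s.ok() Bad status: Service unavailable: "
                        + "Cannot initialize clock: Error reading clock. "
                        + "Clock considered unsynchronized");
        System.out.println(hasClockSyncError(lines));
    }
}
```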
Start the Cluster Services We Need for the Labs
1. We will be using HDFS for our data source and Impala (which uses the Hive metastore) as
our SQL engine for Kudu. Hue (Hadoop User Experience) provides a graphical interface
for running SQL on Kudu in an Impala Query Editor. There are also command line
shells and APIs you could use with Kudu and Impala.
2. Start/restart the required services in CM for our Kudu labs. One at a time, in this
order, start/restart: HDFS, Hive, Kudu, Impala, and Hue.
You may get some initial warnings (yellow) on the service monitors because we are running below
the recommended memory thresholds. These are acceptable for the Quickstart VM configuration.
All the services listed above should be running successfully at this point. Confirm this before
proceeding to the next lab.
Create a Kudu Table and Load Data
5. Create a Kudu table in the Impala Query Editor for time series data, with a composite primary
key and hash partitioning on the report time.
CREATE TABLE sfmta
PRIMARY KEY (report_time, vehicle_tag)
PARTITION BY HASH(report_time) PARTITIONS 8
STORED AS KUDU
AS SELECT
UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss') AS report_time,
vehicle_tag,
longitude,
latitude,
speed,
heading
FROM sfmta_raw;
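To make the statement above more concrete, here is a minimal sketch of the two ideas it combines: converting a report_time string with the pattern 'MM/dd/yyyy HH:mm:ss' into epoch seconds, and spreading rows across 8 hash buckets. It rests on two assumptions. First, the string is interpreted as UTC. Second, the bucket function is a stand-in that only illustrates hashing a key into a fixed number of partitions; it is not Kudu's internal hash function. The class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only; these names are hypothetical and not part of Kudu or Impala.
final class SfmtaKeySketch {
    // Same pattern string as the UNIX_TIMESTAMP call in the CREATE TABLE statement.
    private static final DateTimeFormatter REPORT_TIME =
            DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    // Assumption: the text is interpreted as UTC.
    static long toEpochSeconds(String reportTime) {
        return LocalDateTime.parse(reportTime, REPORT_TIME)
                .toEpochSecond(ZoneOffset.UTC);
    }

    // Stand-in hash: maps a key to a bucket in [0, partitions).
    // This is NOT the hash function Kudu uses internally.
    static int bucketFor(long epochSeconds, int partitions) {
        return Math.floorMod(Long.hashCode(epochSeconds), partitions);
    }

    public static void main(String[] args) {
        // Synthetic input chosen for easy checking, not a row from the lab data.
        long t = toEpochSeconds("01/02/1970 00:00:00");
        System.out.println(t + " -> bucket " + bucketFor(t, 8));
    }
}
```

Hashing on report_time spreads rows with nearby timestamps across all 8 partitions, so writes do not all land on the same tablet.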
2. With a quick Google search we can see that this bus was traveling east on 16th
Street at 68 mph. At first glance, this seems unlikely to be true. Suppose we do some
research, find that this bus's sensor equipment was broken, and decide to
remove the data. If this data were stored in HDFS and queried through Hive or Impala,
updates would be problematic, since HDFS is not designed to be easily mutable. With Kudu
this is very easy to correct using standard SQL:
DELETE FROM sfmta WHERE vehicle_tag = 5411;
Impala Shell Access
We could also have used the Impala shell for command line SQL access. Let's run a
query against Kudu from the Impala shell.
1. Open a Linux shell with the default user cloudera (password cloudera) and run:
$ ssh cloudera@quickstart.cloudera -t impala-shell
(At the password prompt for the user cloudera, enter cloudera.)
Then execute the following SQL:
> SELECT count(*) FROM sfmta;
The result should be 842279 with the current example data set.
> exit;
3. In your VM, open a terminal session as cloudera/cloudera (default user/pass) and
run these commands to download the needed components and build our jar file for
the java-sample application:
$ cd ~/Downloads
$ unzip kudu-examples-master.zip
$ cd kudu-examples-master/java/java-sample
$ mvn package
Before:
client.deleteTable(tableName);
After:
/* client.deleteTable(tableName); */
2. Rebuild the project and run the Java command to create and populate the Kudu table:
$ cd kudu-examples-master/java/java-sample
$ mvn package
$ java -jar target/kudu-java-sample-1.0-SNAPSHOT.jar
3. Open the Kudu Master Web UI from CM to obtain the table definition SQL for
Impala. From the main Cloudera Manager screen, go to Kudu->Kudu Master Web UI->
Tables.
5. Scroll down and copy the command listed to create the Impala table definition.
CREATE EXTERNAL TABLE `javasample` STORED AS KUDU
TBLPROPERTIES(
'kudu.table_name' = 'javasample',
'kudu.master_addresses' = 'quickstart.cloudera:7051');
6. In the VM browser, open Hue->Impala Query Editor and run the command we just
captured to create the table definition in the Impala metadata.
7. Run “select * from javasample” to query the table, which is now available from the
Impala Query Editor.
2. Review the Kudu memory settings. In a normal production environment, you would
raise these from the defaults. For the Quickstart lab we will leave
the default settings unchanged. From the Cloudera Manager main menu, click
Kudu Service>Configuration. Review the setting ‘memory_limit_hard_bytes’.
Note: The ‘memory_limit_hard_bytes’ flag (Kudu Tablet Server Hard Memory Limit in CM)
determines the amount of RAM that a Kudu tablet server may use. The amount of memory
required for a workload scales with a number of factors including data size, write workload,
and read concurrency.
The following determines a baseline/starting point for a configuration:
Note that this server usually stays between 50% and 75% of its limit and occasionally tops 80%.
This server may benefit from an increased memory limit, but it appears generally healthy. If
a server is consistently using more than 75% of its memory limit, the limit should be
increased.
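The 75% rule of thumb above can be sketched as a small check over sampled memory usage. This is a minimal illustration, not a Kudu tool. It assumes you have collected tablet server memory usage samples by some other means, and it interprets "consistently" as every sample exceeding 75% of the limit. The class name, method names, and sample values are hypothetical.

```java
// Illustrative check for the 75% guideline; names and values are hypothetical.
final class MemoryLimitSketch {
    static final double THRESHOLD = 0.75;

    // Fraction of the hard memory limit currently in use.
    static double utilization(long usedBytes, long limitBytes) {
        return (double) usedBytes / limitBytes;
    }

    // Assumption: "consistently" means every sample is above the threshold.
    static boolean shouldIncreaseLimit(long[] usedBytesSamples, long limitBytes) {
        if (usedBytesSamples.length == 0) {
            return false; // no evidence either way
        }
        for (long used : usedBytesSamples) {
            if (utilization(used, limitBytes) <= THRESHOLD) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long limit = 4L * 1024 * 1024 * 1024; // hypothetical 4 GiB limit
        // Hypothetical samples at 60%, 72%, and 81% of the limit.
        long[] samples = {limit * 60 / 100, limit * 72 / 100, limit * 81 / 100};
        System.out.println(shouldIncreaseLimit(samples, limit));
    }
}
```

A server like the one described above, which only occasionally tops 80%, would not trigger this check under the stated interpretation.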