
Big Data Workshop

Lab Guide

http://www.oracle-developer-days.com
Copyright 2012, Oracle and/or its affiliates. All rights reserved


TABLE OF CONTENTS
Big Data Workshop Lab Guide......................................................................................................... i
1. Introduction................................................................................................................................... 4
2. Hadoop Hello World...................................................................................................................... 7
2.1 Introduction to Hadoop............................................................................................................ 7
2.2 Overview of Hands on Exercise..............................................................................................8
2.3 Word Count............................................................................................................................. 8
2.4 Summary............................................................................................................................... 22
3. Pig Exercise................................................................................................................................ 23
3.1 Introduction to Pig................................................................................................................. 23
3.2 Overview Of Hands On Exercise........................................................................................... 23
3.3 Working with PIG.................................................................................................................. 23
3.4 Summary............................................................................................................................... 43
4. Hive Coding................................................................................................................................ 44
4.1 Introduction to Hive............................................................................................................... 44
4.2 Overview Of Hands On Exercise........................................................................................... 44
4.3 Queries with Hive.................................................................................................................. 44
4.4 Summary............................................................................................................................... 55
5. Oracle ODI and Hadoop............................................................................................................. 56
5.1 Introduction To Oracle Connectors........................................................................................ 56
5.2 Overview of Hands on Exercise............................................................................................ 57
5.3 Setup and Reverse Engineering in ODI................................................................................57
5.4 Using ODI to import text file into Hive...................................................................................64
5.5 Using ODI to import Hive Table into Oracle...........................................................................77
5.6 Using ODI to import Hive Table into Hive..............................................................................93
5.7 Summary............................................................................................................................. 109
6. Working with External Tables.................................................................................................... 110
6.1 Introduction to External Tables............................................................................................ 110
6.2 Overview of Hands on Exercise..........................................................................................110
6.3 Configuring External Tables.................................................................................................110
6.4 Summary............................................................................................................................. 120
7. Working with Mahout................................................................................................................ 121
7.1 Introduction to Mahout........................................................................................................ 121
7.2 Overview of Hands on Exercise.......................................................................................... 121
7.3 Clustering with K-means..................................................................................................... 121


7.4 Summary............................................................................................................................. 131


8. Programming with R................................................................................................................. 132
8.1 Introduction to Enterprise R................................................................................................ 132
8.2 Overview of Hands on Exercise.......................................................................................... 132
8.3 Taking data from R and inserting it into the database...........................................133
8.4 Taking data from database and using it in R and clustering................................................144
8.5 Summary............................................................................................................................. 154
9. Oracle NoSQL Database.......................................................................................................... 155
9.1 Introduction To NoSQL........................................................................................................ 155
9.2 Overview of Hands on Exercise.......................................................................................... 155
9.3 Insert and retrieve Key Value pairs...................................................................155
9.4 Summary............................................................................................................................. 175
Appendix A................................................................................................................................... 176
A.1 Setup of a Hive Data Store................................................................................................. 176


1. INTRODUCTION
Big data is not just about managing petabytes of data. It is also about managing large numbers of
complex unstructured data streams which contain valuable data points. However, which data
points are the most valuable depends on who is doing the analysis and when they are doing the
analysis. Typical big data applications include: smart grid meters that monitor electricity usage in
homes, sensors that track and manage the progress of goods in transit, analysis of medical
treatments and drugs that are used, analysis of CT scans etc. What links these big data
applications is the need to track millions of events per second, and to respond in real time. Utility
companies will need to detect an uptick in consumption as soon as possible, so they can bring
supplementary energy sources online quickly. Probably the fastest growing area relates to location
data being collected from mobile always-on devices. If retailers are to capitalise on their customers' location data, they must be able to respond as soon as the customer steps through the door.
In the conventional model of business intelligence and analytics, data is cleaned, cross-checked and processed before it is analysed, and often only a sample of the data is used in the actual analysis. This is possible because the kind of data that is being analysed - sales figures or stock counts, for example - can easily be arranged in a pre-ordained database schema, and because BI tools are often used simply to create periodic reports.

At the center of the big data movement is an open source software framework called Hadoop.
Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes. The Hadoop system consists of three
projects: Hadoop Common, a utility layer that provides access to the Hadoop Distributed File
System and Hadoop subprojects. HDFS acts as the data storage platform for the Hadoop
framework and can scale to massive size when distributed over numerous computing nodes.
Hadoop MapReduce is a framework for processing data sets across clusters of Hadoop nodes.
The Map and Reduce process splits the work by first mapping the input across the control nodes
of the cluster, then splitting the workload into even smaller data sets and distributing it further
throughout the computing cluster. This allows it to leverage massively parallel processing, a
computing advantage that technology has introduced to modern system architectures. With MPP,
Hadoop can run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system. As the nodes "return" their answers, the Reduce function collects and combines the information to deliver a final result.
To extend the basic Hadoop ecosystem capabilities a number of new open source projects have
added functionality to the environment. A typical Hadoop ecosystem will look something like this:

Avro is a data serialization system that converts data into a fast, compact binary data format. When Avro data is stored in a file, its schema is stored with it.

Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce.

HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis.

Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs.

Pig is a high-level programming language and execution framework for parallel computation. Pig works within the Hadoop and MapReduce frameworks.

ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack.

Data exploration of Big Data result sets requires displaying millions or billions of data points to
uncover hidden patterns or records of interest as shown below:


Many vendors are talking about Big Data in terms of managing petabytes of data. For example, EMC has a number of Big Data storage platforms such as its new Isilon storage platform. In reality the issue of big data is much bigger, and Oracle's aim is to focus on providing a big data platform which provides the following:

Deep Analytics: a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities

High Agility: the ability to create temporary analytics environments in an end-user driven, yet secure and scalable, environment to deliver new and novel insights to the operational business

Massive Scalability: the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential

Low Latency: the ability to instantly act based on these advanced analytics in your operational, production environment


2. HADOOP HELLO WORLD


2.1 Introduction to Hadoop
Map/Reduce is a programming paradigm that expresses a large distributed computation as a
sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce
framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across
the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce
phase. The input to the computation is a data set of key/value pairs.
In the map phase, the framework splits the input data set into a large number of fragments and
assigns each fragment to a map task. The framework also distributes the many map tasks across
the cluster of nodes on which it operates. Each map task consumes key/value pairs from its
assigned fragment and produces a set of intermediate key/value pairs. For each input key/value
pair (K,V), the map task invokes a user defined map function that transmutes the input into a
different key/value pair (K',V').
Following the map phase the framework sorts the intermediate data set by key and produces a set
of (K',V'*) tuples so that all the values associated with a particular key appear together. It also
partitions the set of tuples into a number of fragments equal to the number of reduce tasks.
In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For
each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output
key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the
cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each
reduce task.
Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having
many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with
small runtime overhead.
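To make the (K,V) to (K',V') flow concrete, below is a minimal Java sketch of the map and reduce functions for a word count job, written against the classic Hadoop mapred API. It is only an illustration of the pattern; the WordCount.java used later in this lab may differ in its details.

// Sketch only: a word-count Mapper and Reducer using the classic
// org.apache.hadoop.mapred API. The lab's own WordCount.java may differ.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {

  // Map phase: for each input record (K = byte offset, V = line of text),
  // emit an intermediate pair (K' = word, V' = 1) for every word.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups the intermediate pairs by word, so each
  // call receives (K' = word, V'* = all of its 1s) and emits the total count.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}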
Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server
or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker
is the point of interaction between users and the framework. Users submit map/reduce jobs to the
jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the
tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle
data motion between the map and reduce phases.
Hadoop DFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a
large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. Files in
HDFS are "write once" and have strictly one writer at any time.
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation
consists of a single Namenode, a master server that manages the filesystem namespace and
regulates access to files by clients. In addition, there are a number of Datanodes, one per node in
the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations, like the opening, closing and renaming of files and directories, available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
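Although this lab drives HDFS through the hadoop command line utility, the same filesystem operations are exposed to programs through Hadoop's Java FileSystem API. The short sketch below, whose local and HDFS paths are hypothetical placeholders chosen purely for illustration, shows a client copying a file into HDFS and reading it back; under the covers the Namenode resolves the paths and the Datanodes serve the blocks.

// Sketch only: basic HDFS client calls via the Hadoop FileSystem API.
// The paths below are placeholders, not files created by this lab.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (e.g. fs.default.name) from the configuration on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop dfs -copyFromLocal /tmp/sample.txt /user/oracle/sample.txt
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                         new Path("/user/oracle/sample.txt"));

    // Equivalent of: hadoop dfs -cat /user/oracle/sample.txt
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/oracle/sample.txt"))));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}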

2.2 Overview of Hands on Exercise


To get an understanding of what is involved in running a Hadoop job, and of all the steps one must undertake, we will set up and run a hello world type exercise on our Hadoop cluster.
In this exercise you will:
1) Compile a Java word count program written to run on a Hadoop cluster
2) Create some files to run word count on
3) Upload the files into HDFS
4) Run word count
5) View the results

NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts type cat scriptName and the contents of the script will be displayed in
the terminal

2.3 Word Count


1. All of the setup and execution for the Word Count exercise can be done from the terminal, hence to start out this first exercise please open the terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the first exercise are, type in the terminal:
cd /home/oracle/exercises/wordCount
Then press Enter

3. Let's look at the Java code which will run word count on a Hadoop cluster. Type in the terminal:


gedit WordCount.java
Then press Enter

4. A new window will open with the Java code for word count. We would like you to look at lines 14 and 28 of the code. There you can see the Mapper and Reducer interfaces being implemented.

5. When you are done evaluating the code you can click on the X in the right upper corner of
the screen to close the window.


6. We can now go ahead and compile the word count code. We need to run the compile.sh script, which will set the correct classpath and output directory while compiling WordCount.java. Type in the terminal:
./compile.sh
Then press Enter

7. We can now create a jar file from the compile directory of word count. This jar file is required as the code for word count will be sent to all of the nodes in the cluster and the code will be run simultaneously on all nodes that have appropriate data. To create the jar file, in the terminal type:
./createJar.sh
Then press Enter


8. For the exercise to be more interesting we need to create some files on which word count will be executed. To create the files go to the terminal and type:
./createFiles.sh
Then press Enter

9. To see the contents of the files type in the terminal:


cat file01 file02
Then press Enter


In the terminal window you will see the contents of the two files, each file having 4 words in it. Although these are quite small files, the code would run identically with more than 2 files and with files that are several gigabytes or terabytes in size.

10. Now that we have the files ready we must move them into the Hadoop File System (HDFS). Hadoop cannot work with files on other file systems; they must be within HDFS for them to be usable. It is also important to note that files which are within HDFS are split into multiple chunks and stored on separate nodes for parallel parsing. To upload our two files into HDFS you need to use the copyFromLocal command in Hadoop. Run the command by typing at the terminal:


hadoop dfs -copyFromLocal file01 /user/oracle/wordcount/input/file01
Then press Enter

For convenience you can also run the script copyFiles.sh and it will upload the files for you, so you do not need to type in this command or the next one.

11. We should now upload the second file. Go to the terminal and type:
hadoop dfs -copyFromLocal file02 /user/oracle/wordcount/input/file02
Then press Enter


12. We can now run our MapReduce job to do a word count on the files we just uploaded. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


A lot of text will roll by in the terminal window. This is informational data coming from the Hadoop infrastructure to help track the status of the job. Wait for the job to finish; this is signaled by the command prompt coming back.
13. Once you have your command prompt back your MapReduce task is complete. It is now time to look at the results. We can display the results file directly from HDFS by using the cat command from Hadoop. Go to the terminal and type the following command:
hadoop dfs -cat /user/oracle/wordcount/output/part-00000
Then press Enter
For your convenience you can also run the script viewResults.sh and it will run the Hadoop command for you to see the results.


In the terminal the word count results are displayed. You will see that the job counted the number of times each word appears.


14. As an experiment let's try to run the Hadoop job again. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


15. You will notice an error message appears and no MapReduce task is executed. This is easily explained by the immutability of data. Since Hadoop does not allow an update of data files (just read and write) you cannot update the data in the results directory, hence the execution has no place to put its output. For you to re-run the MapReduce job you must either point it to another output directory or clean out the current output directory. Let's go ahead and clean out the previous output directory. Go to the terminal and type:
hadoop dfs -rmr /user/oracle/wordcount/output
Then press Enter
For convenience you can also run the script deleteOutput.sh and it will delete the files for you, so you do not need to type in this command.


16. Now we have cleared the output directory and can re-run the MapReduce task. Let's just go ahead and make sure it works again. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


The MapReduce job now ran fine again, as shown by the output on the screen.


17. This completes the word count example. You can now close the terminal window; go to the
terminal window and type:
exit
Then press Enter


2.4 Summary
In this exercise you were able to see the basic steps required in setting up and running a very simple MapReduce job. You saw which interfaces must be implemented when creating a MapReduce task, how to upload data into HDFS and how to run the MapReduce task. It is worth commenting on the execution time for the exercise: the amount of time required to count 8 words is quite high in absolute terms. It is important to understand that Hadoop needs to start a separate Java Virtual Machine to process each file or chunk of a file on each node of the cluster. As such, even a trivial job has some processing time, which limits the possible applications of Hadoop as it can only handle batch jobs. Real-time applications, where immediate answers are required, can't be run on a Hadoop cluster. At the same time, as the data volumes increase the processing time does not increase that much, as long as there are enough processing nodes. A recent benchmark of a Hadoop cluster saw the complete sorting of 1 terabyte of data in just over 3 minutes on 910 nodes.


3. PIG EXERCISE
3.1 Introduction to Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to optimize
their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.

3.2 Overview Of Hands On Exercise


In this exercise we will be analyzing data coming from the New York Stock Exchange; specifically, we would like to evaluate the dividends given by different companies. We have a tab delimited file with four columns: exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.
In this exercise we will:
1. Load our stock exchange data into our HDFS
2. Run a PIG script which will find the company with the highest dividends
3. View the results
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts type cat scriptName and the contents of the script will be displayed in
the terminal
NOTE2: This exercise and its dataset were inspired by the following website:
http://ofps.oreilly.com/titles/9781449302641/running_pig.html

3.3 Working with PIG


1. All of the setup and execution for this exercise can be done from the terminal, hence open the terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for this exercise are, type in the terminal:
cd /home/oracle/exercises/pig
Then press Enter

3. To get an idea of what our dividends file looks like, let's look at the first couple of rows. In the terminal type:
head NYSE_dividends
Then press Enter

The first 10 rows of the data file will be displayed on the screen


4. Now that we have an idea what our data file looks like, let's load it into HDFS for processing. To load the data we use the copyFromLocal function of Hadoop; go to the terminal and type:
hadoop dfs -copyFromLocal NYSE_dividends /user/oracle/NYSE_dividends
Then press Enter

For convenience you can also run the script loadData.sh and it will upload the file for you, so you do not need to type in the command above.

5. We will be running our Pig script in interactive mode so we can see each step of the process. For this we will need to open the Pig interpreter, called grunt. Go to the terminal and type:
pig


Then press Enter

6. Once at the grunt shell we can start typing Pig script. The first thing we need to do is load the data file from HDFS into Pig for processing. The data is not actually copied; a handler is created for the file so Pig knows how to interpret the data. Go to the grunt shell and type:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Then press Enter


7. Now that the data is loaded as a four-column table, let's see what the data looks like in Pig. Go to the grunt shell and type:
dump dividends;
Then press Enter


You will see output on the screen similar to the first exercise. This is normal, as Pig is merely a high-level language: all commands which process data simply run MapReduce tasks in the background, so the dump command becomes a MapReduce task that is run. This applies to all of the commands you will run in Pig. The output on the screen will show you all of the rows of the file in tuple form.


8. The first step in analyzing the data will be grouping the data by stock symbol so we have all of the dividends of one company grouped together. Go to the grunt shell and type:
grouped = group dividends by symbol;
Then press Enter


9. Let's go ahead and dump this grouped variable to the screen to see what its contents look like. Go to the grunt shell and type:
dump grouped;
Then press Enter


On the screen you will see all of the groups displayed in tuple-of-tuples form. As the output might look a bit confusing, only one tuple is highlighted in the screenshot below to help with clarity. The highlighted region shows all of the rows of the table for the CATO stock symbol.


10. In the next step we will go through each group tuple and get the group name and the
average dividend. Go to the grunt shell and type:
avg = foreach grouped generate group, AVG(dividends.dividend);
Then press Enter


11. Let's go ahead and see what this output looks like. Go to the grunt shell and type:
dump avg;
Then press Enter


Now you can see on the screen a dump of all stock symbols with their respective average dividend. A couple of them are highlighted in the image below.


12. Now that we have the dividends for each company it would be ideal if we had them in order from highest to lowest dividend. Let's get that list; go to the grunt shell and type:
sorted = order avg by $1 DESC;
Then press Enter


We can now see what the sorted list looks like. Go to the grunt terminal and type:
dump sorted;
Then press Enter


On the screen you now see the list sorted in descending order. The lowest dividends are shown, but you can scroll up to see the rest of the values.


13. We now have the final results we want. It might be worth writing these results out to HDFS. Let's do that. Go to the grunt shell and type:
store sorted into 'average_dividend';
Then press Enter


14. The new calculated data is now permanently stored in HDFS. We can now exit the grunt
shell. Go to the grunt shell and type:
quit;
Then press Enter


15. Now back at the terminal, let's view the top 10 companies by average dividend directly from HDFS. Go to the terminal and type:
hadoop dfs -cat /user/oracle/average_dividend/part-r-00000 | head
Then press Enter
For convenience you can also run the script viewResults.sh and it will display the results for you, so you do not need to type in the command above.


This command simply ran cat on the results file available in HDFS. The results are seen on the screen.


16. That concludes the Pig exercise; you can now close the terminal window. Go to the terminal and type:
exit
Then press Enter


3.4 Summary
In this exercise you saw what a Pig script looks like and how to run it. It is important to understand that Pig is a scripting language which ultimately runs MapReduce jobs on a Hadoop cluster; hence all of the power of a distributed system, and the high data volumes which HDFS can accommodate, are exploitable through Pig. Pig provides an easier interface to the MapReduce infrastructure, allowing scripting paradigms to be used rather than direct Java coding.
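As a side note, the same pipeline does not have to be typed into the Grunt shell; Pig can also be embedded in a Java program through its PigServer API. The sketch below is an illustration of that idea rather than part of the lab, and it assumes the NYSE_dividends file is already in HDFS and that the Pig and Hadoop libraries are on the classpath.

// Sketch only: running the dividend analysis from Java via Pig's PigServer API.
// Assumes NYSE_dividends is already in HDFS and Pig/Hadoop jars are on the classpath.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class DividendPigSketch {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // The same Pig Latin statements used interactively in the Grunt shell
    pig.registerQuery("dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);");
    pig.registerQuery("grouped = group dividends by symbol;");
    pig.registerQuery("avg = foreach grouped generate group, AVG(dividends.dividend);");
    pig.registerQuery("sorted = order avg by $1 DESC;");

    // Writes the result to HDFS, like: store sorted into 'average_dividend';
    pig.store("sorted", "average_dividend");
  }
}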


4. HIVE CODING
4.1 Introduction to Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive
provides a mechanism to project structure onto this data and query the data using a SQL-like
language called HiveQL. At the same time this language also allows traditional map/reduce
programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to
express this logic in HiveQL.
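HiveQL can also be submitted from applications rather than from the interactive shell. For example, when a Hive server is running (as it will be in the ODI exercise later), a Java program can issue queries over JDBC. The driver class, connection URL and port in the sketch below are typical defaults for Hive releases of this era and should be treated as assumptions for illustration only, not as part of this lab.

// Sketch only: issuing a HiveQL query over JDBC against a running Hive server.
// Driver class, URL, port and table are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();
    // Hive rewrites this query as MapReduce jobs, just as the hive shell does
    ResultSet rs = stmt.executeQuery(
        "select symbol, avg(dividend) avg_dividend from dividends "
        + "group by symbol order by avg_dividend desc limit 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}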

4.2 Overview Of Hands On Exercise


In this exercise you will use the Hive Query Language to create tables, insert data into those tables and run queries on that data. For this exercise we will use the same data file as in the Pig exercise above: a tab delimited file with four columns; exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.
In this exercise you will:
1. Upload the dividends file into HDFS
2. Create a table in Hive
3. Load the dividend data into the Hive table
4. Run queries on the Hive table

4.3 Queries with Hive


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the Hive exercise are, type in the terminal:
cd /home/oracle/exercises/hive
Then press Enter


3. We already have an idea what our data file looks like, so let's load it into HDFS for processing. This is done identically to the way it was done in the first two exercises. We will see a better way to load data in the next exercise. To load the data we use the copyFromLocal function of Hadoop. Go to the terminal and type:
hadoop dfs -copyFromLocal NYSE_dividend /user/oracle/NYSE_dividend
Then press Enter
For convenience you can also run the script loadData.sh and it will load the file for you, so you do not need to type in the command above.

4. Let's now enter the Hive interactive shell environment to create tables and run queries against those tables. To give an analogy, this is similar to SQL*Plus, but this environment is specifically for the HiveQL language. To enter the environment go to the terminal and type:
hive
Then press Enter

5. The first thing we need to do in Hive is create a table. We will create a table named dividends with four fields called exchange, symbol, dates and dividend, something that looks very natural based on the data set. Go to the terminal and type:
create table dividends(exchange string, symbol string, dates string, dividend float);
Then press Enter


An OK should be printed on the screen indicating the success of the operation. This OK message will be printed for all operations, but we will only mention it this time. It is left up to the user to check for this message on future HiveQL commands.

6. We can now run a command to see all of the tables available to this OS user. Go to the
hive terminal and type:
show tables;
Then press Enter


You can see the only table currently available is the one we just created.

7. As with normal SQL you also have a describe command available to see the columns in the table. Go to the terminal and type:
describe dividends;
Then press Enter

As you can see, the dividends table has the 4 fields, each with its own Hive-specific data type. This is to be expected, as this is the way we created the table.


8. Let's go ahead and load some data into this table. Data is loaded into Hive from flat files available in the HDFS file system. Go to the terminal and type:
load data inpath '/user/oracle/NYSE_dividend' into table dividends;
Then press Enter


The data is now loaded into the table.


9. We can now see the data that has been loaded into the table. Go to the terminal and type:
select * from dividends limit 5;
Then press Enter


Five lines from the table are printed to the screen; only 3 of the lines are highlighted in the
image below.


10. Now that we have all of the data loaded into a Hive table we can run SQL-like queries on the data. As we have the same data set as in the Pig exercise, let's try to extract the same information. We will look for the top 10 companies by average dividend. Go to the terminal and type:
select symbol, avg(dividend) avg_dividend from dividends group by
symbol order by avg_dividend desc limit 10;
Then press Enter


On the screen you will see a lot of log information scroll through. Most of this is generated by Hadoop, as Hive (just like Pig) takes the queries you write, rewrites them as MapReduce jobs and then executes them. The query we wrote can take full advantage of the distributed computational power of Hadoop as well as the striping and parallelism that HDFS enables.
When the query is done you should see on the screen the top 10 companies in descending order. This output shows the exact same information as we got in the previous exercise. As the old idiom goes, there is more than one way to skin a cat; with Hadoop there is always more than one way to achieve any task.


11. This is the end of the Hive exercise. You can now exit the hive interpreter. Go to the
terminal and type:
exit;
Then press Enter


12. Then close the terminal. Go to the terminal and type:
exit
Then press Enter


4.4 Summary
In this exercise you were introduced to the Hive Query Language. You saw how to create and view tables using HiveQL. Once tables were created you were introduced to loading data from HDFS as well as some of the standard SQL constructs which HiveQL has available. It is important to understand that Hive is an abstraction layer for Hadoop and MapReduce jobs. All queries written in HiveQL get transformed into a DAG (Directed Acyclic Graph) of MapReduce tasks which are then run on the Hadoop cluster, hence taking advantage of all the performance and scalability capabilities, but also retaining all of the limitations, of Hadoop.
HiveQL has most of the functionality available in standard SQL, with a series of DDL and DML functions implemented, but it does not strictly adhere to the SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only offers basic support for indexing. Also, HiveQL lacks support for transactions and materialized views, and has only limited subquery support. It is intended for long-running queries of a data warehousing type rather than a transactional OLTP type of workload.


5. ORACLE ODI AND HADOOP


5.1 Introduction To Oracle Connectors
Apache Hadoop is designed to handle and process data from data sources that are typically non-RDBMS and data volumes that are typically beyond what is handled by relational databases.
The Oracle Data Integrator Application Adapter for Hadoop enables data integration developers to
integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar
and easy-to-use tools and preconfigured knowledge modules, the adapter provides the following
capabilities:

Loading data into Hadoop from the local file system and HDFS.

Performing validation and transformation of data within Hadoop.

Loading processed data from Hadoop to Oracle Database for further processing and
generating reports.

Typical processing in Hadoop includes data validation and transformations that are programmed
as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming
knowledge. However, using Oracle Data Integrator and the Oracle Data Integrator Application
Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive
and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs.
The Oracle Data Integrator graphical user interface enhances the developer's experience and productivity while enabling them to create Hadoop integrations.
When implementing a big data processing scenario, the first step is to load the data into Hadoop.
The data source is typically in the local file system, HDFS, Hive tables, or external Hive tables.
After the data is loaded, you can validate and transform the data using HiveQL like you do in SQL.
You can perform data validation such as checking for NULLS and primary keys, and
transformations such as filtering, aggregations, set operations, and derived tables. You can also
include customized procedural snippets (scripts) for processing the data.
When the data has been aggregated, condensed, or crunched down, you can load it into Oracle
Database for further processing and analysis. Oracle Loader for Hadoop is recommended for
optimal loading into Oracle Database.
Knowledge Modules:
IKM File To Hive (Load Data)
Description: Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements.
Source: File System    Target: Hive

IKM Hive Control Append
Description: Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled.
Source: Hive    Target: Hive

IKM Hive Transform
Description: Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python.
Source: Hive    Target: Hive

IKM File-Hive to Oracle (OLH)
Description: Integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop.
Source: File System or Hive    Target: Oracle Database

CKM Hive
Description: Validates data against constraints.
Source: NA    Target: Hive

RKM Hive
Description: Reverse engineers Hive tables.
Source: Hive Metadata    Target: NA

5.2 Overview of Hands on Exercise


In this workshop we have been loading data into HDFS using a cumbersome command line utility, one file at a time. We viewed results from within HDFS also using a command line utility. Although this is fine for smaller jobs, it would be a good idea to integrate the moving of data with an ETL tool such as Oracle Data Integrator (ODI). In this exercise we will see how Oracle Data Integrator integrates seamlessly with Hadoop and more specifically Hive.
In this exercise you will:
1. Use ODI to reverse engineer a Hive table
2. Use ODI to import a text file into a Hive table
3. Use ODI to move data from a Hive table into the Oracle Database
4. Use ODI to move data from a Hive table to another table with check constraints

5.3 Setup and Reverse Engineering in ODI


1. All of the setup for this exercise can be done from the terminal, hence open a terminal by
double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for this exercise are, type in the terminal:
cd /home/oracle/exercises/odi
Then press Enter


3. Next we need to run a script to set up the environment for this exercise. We will be loading the same data as in the previous exercise (the dividends table), only this time we will be using ODI to perform this task. For this we need to drop that table and recreate it so it is empty for the import. Also, we need to start the Hive server to enable ODI to communicate with Hive. We have a script which will perform both tasks. Go to the terminal and type:
./setup.sh
Then press Enter

4. Next we need to start Oracle Data Integrator. Go to the terminal and type
./startODI
Then press Enter


5. Once ODI opens we need to connect to the repository. In the right upper corner of the screen click on Connect To Repository.

6. In the dialog that pops up all of the connection details should already be configured.
Login Name: DEFAULT_LOGIN1
User: SUPERVISOR
Password: Welcome1
If all of the data is entered correctly you can simply click OK


7. Once you login make sure you are on the Designer Tab. Near the top of the screen on the
left side click on Designer.

8. Near the bottom of the screen on the left side there is a Models tab click on it.

You will notice that we have already created a File, Hive, and Oracle model for you. These were pre-created to reduce the number of steps in the exercise. For details on how to use flat files and the Oracle database with ODI please see the excellent Oracle by Example tutorials found at http://www.oracle.com/technetwork/tutorials/index.html.

9. The first feature of ODI we would like to show involves reverse engineering a data store. The reverse engineering function takes a data store and finds all of its tables and their structure automatically. In the Models tab on the left side of the screen there is a Hive model. Let's click on the + to expand it out.

10. You will notice there is no information about the data that is stored in that particular
location. Right click on the Hive folder and select Reverse Engineer


11. You will see two items appear in the Hive folder, called dividends and dividends2. These are the tables we created in Hive. You can click on the + beside dividends to see some more information.


12. You can also expand the Columns folder to see all of the columns in this particular table.

You will see the columns created in step 3 displayed.


This is the power of the Hive Reverse Engineering Knowledge Module (RKM) integrated in Oracle Data Integrator. Once you define a data store (in our case a Hive source) the RKM will automatically discover all tables and their corresponding columns available at that source. Once a data model is created there is no need to rewrite it in ODI. ODI will automatically discover that model for you, so you can get straight to the development of the data movement.

5.4 Using ODI to import text file into Hive


1. Now that we have configured our models we can begin creating interfaces to perform ETL tasks. Near the top left corner of the screen in the Projects tab there is an icon with 3 squares. Click on it and select New Project.


2. In the new window that opened up on the right side of the screen enter the following
information:
Name: Hadoop
Code: HADOOP
Then click on the Save All in the right upper corner of the screen.

3. In the left hand menu in the Projects section a new item appeared called Hadoop. Click on
the + to expand it out


4. Next to the folder called First Folder there is another +; expand out that folder as well by clicking the +

5. Right click on the Item Interfaces and select New Interface


6. We can now start to define the new interface. In this interface we will map out the columns in the text file and move the data into the Hive table. To start out let's give the interface a name. In the new tab that opened on the right side of the screen type in the following information.
Name: File_To_Hive

7. Next we need to move to the mapping tab of the File_To_Hive interface. Click on Mapping
at the bottom of the screen.


8. We now need to define the sources and target data stores. On the left bottom of the
screen in the Models Section expand the File folder by clicking on the + beside it.

9. Now we can drag and drop the Dividends table from the File model into the source
section of the interface.


10. Next we will drag and drop the dividends Hive table into the target section of the interface.


11. A pop up window will appear which will ask if you would like to create automatic mapping.
This will try to automatically match source columns with target columns based on column
name. Click on Yes to see what happens.

12. By name it was able to match all of the columns. The mapping is now complete. Let's go back to the Overview tab to set up one last thing. Click on the Overview tab on the left side of the screen.


In the definitions tab, tick the box Staging Area Different From Target.

13. A drop down menu below the tick now gets activated. Select File: Comments File


14. We can now click on the flow tab at the bottom of the screen to see what the interface will
look like.

15. On the screen in the top right box you will see a diagram of the data flow. Lets see all of
the options for the integration. Click on the Target(Hive Server) header.


16. At the bottom of the screen a new window appeared: a Property Inspector. There you can inspect and modify the configuration of the integration process. Let's change one of the properties. We don't need a staging table so let's disable it. Set the following options:
USE_STAGING_TABLE: false

Let's now execute this interface. Click on the Execute button at the top of the screen.


17. You will be asked to save your interface before running it. Click Yes

18. Next you will be asked for the Execution options. Here you can choose agents contexts
and other elements. You can just accept the default options and click OK

19. An informational screen will pop up telling you the session has started. Click OK


20. We will now check if the execution of the interface was successful. In the left menu click
on the Operator Tab

21. In the menu on the left side make sure the Date tab is expanded. Then expand the Today
folder


You will see a green checkmark beside the File_To_Hive execution.
This means the integration process was successful.

You have now successfully moved data from a flat file into a Hive table without touching the terminal. All of the data was moved without a cumbersome command line interface, allowing for the use of all of the functionality of a powerful ETL tool.
22. You can now move back to the Designer tab in the left menu and close all of the open tabs on the right side menu.

http://www.oracle-developer-days.com
Copyright 2012, Oracle and/or its affiliates. All rights reserved

76

Big Data Workshop

This was to prepare for the next exercise.

5.5 Using ODI to import Hive Table into Oracle


1. Another useful process ODI can perform is moving data from a Hive table into the Oracle
database. Once data processing has occurred in the Hive table, you might want to move
the data back into your Oracle database for integration into your data warehouse. Let's
move the data we just loaded into our Hive table into an Oracle database. The first step is to
create a new interface.
In the Projects tab in the left hand menu right click on Interfaces and select New Interface


2. On the right side of the screen a new window pops up. Enter the following name for the
Interface
Name: Hive_To_Oracle

3. Next we will need to move to the Mapping tab to set up the interface. At the bottom of the
screen click on Mapping


4. To free up more viewing space let's clean up the Models tab in the bottom left part of the
screen. Minimize the File tab and the dividends table.

5. In the same tab (the Models tab) we now see the Oracle folder. Let's expand it, as
we will need the Oracle tables as our target. Click on the + beside the Oracle folder


6. We can now drag and drop the Hive dividends table into the sources window


7. Similarly, you can drag the Oracle DIVIDENDS table into the destination window.


8. As before you will be asked if you would like to perform automatic mapping. Click on Yes

9. Unfortunately, due to a capitalization difference no mapping could be done automatically.

We will need to map the columns manually. Drag and drop each source column (from the
source table window in the upper left part of the right tab) to its corresponding target
column (in the upper right part of the right tab) using the following mapping:
exchange -> STOCK
dates -> DATES
dividend -> DIVIDEND


10. One of the advantages of an ETL tool can be seen when doing transformations during the
data movement. Let's concatenate the exchange and the symbol into one string and load that
into the STOCK column in the database. Go to the Property Inspector of the
STOCK column by clicking on it in the target window

11. The property inspector window should open at the bottom of the screen. In the
implementation edit box type the following
concat(DIVIDENDS.exchange, DIVIDENDS.symbol)


12. The transformation is now set up. Let's now go back to the Overview tab to configure the
staging area. Click on the Overview tab

13. In the Definition tab, tick the box Staging Area Different From Target.


14. A drop-down menu below the tick box now becomes active. Select Hive: Hive Store

15. We are now ready to run the interface. Go to the upper left corner of the screen and click
the Execute button


16. A window will pop up telling you a save is required before the execution can continue. Just
click Yes

17. Another window will pop up asking you to configure the Execution Context and Agent.
The default options are fine, just click OK

18. A final window will pop up telling you the session has started. Click OK


19. Let's now go to the Operator tab to check if the execution was successful. In the top left
corner of the screen click on the Operator tab

20. When you get to the Operator tab you might see a lightning bolt beside the
Hive_To_Oracle execution. This means the integration is still executing; wait a little until the
checkmark appears.


The movement of data from Hive to Oracle has completed successfully.

21. One of the great features of ODI is that it allows you to look at the code that was executed as
part of the ETL process. Let's drill down and see some of the code that was executed. In the
Operator tab click on the + next to the latest execution.


22. Continue to drill down by clicking on the + next to 1 - Hive_to_Oracle <date>

23. You can now see all of the steps taken to perform this particular mapping. Let's investigate
further the fourth step in the process. Double click on 4 Integration Hive_To_Oracle
Create Hive staging table to open up a window with its details.


24. In the window that opened up click on the Code tab


In this window you will see exactly what code was run. If an error occurs, this information
becomes quite useful in debugging your transformations.

25. To check the data that is now in the Oracle database, go back to the Designer tab by
going to the upper left corner of the screen and clicking on Designer


26. Then in the Models section at the bottom left of the screen right click on the DIVIDENDS table
in the Oracle folder and select View Data

On the right side of the screen a new window will pop up with all of the data inside that table
of the Oracle database.


27. We can now go ahead and close all of the open windows on the right side of the screen to
prepare for the next exercise.

5.6 Using ODI to import Hive Table into Hive


1. The last ETL process we are going to show involves moving data from a Hive table into
another Hive table. Although this might sound a bit odd, there are many circumstances
where you might want to move data from one table to another while verifying the data for
constraint violations or transforming it. Let's go ahead and create an interface for this type
of transaction. In the Projects tab right click on Interfaces and select New Interface

2. As before, let's give the interface a name. In the new tab that opened on the right side of
the screen type in the following information.
Name: Hive_To_Hive

3. Next, at the bottom of the screen, let's go to the Mapping tab


4. We will first drag the Hive dividends table into the source window on the right


5. Next we will drag the dividends2 table into the target window on the right


6. You will be asked if you would like to perform auto mapping. Just click Yes

7. All of the mappings auto-complete without a problem. We now need to specify the
Integration Knowledge Module (IKM) which will be used to perform the integration. In ODI
an IKM is the engine which has all of the code templates for the integration process; hence
without an appropriate IKM the integration is not possible. In the previous section there
was only one appropriate IKM, hence it was chosen automatically. In this case there are
multiple possible IKMs, so we need to select one. In the upper left corner of the screen in the
Designer window right click on Knowledge Modules and select Import Knowledge
Modules.


8. A window will pop up which will allow you to import Knowledge Modules. First we need to
specify the folder in which the Knowledge Modules are stored. Fill in the following
information.
File import directory: /u01/ODI/oracledi/xml-reference
Then Press Enter


9. A list of different Knowledge Modules should appear in the space below. Scroll down until
you find the file(s) to import:
IKM Hive Control Append
Then press OK

10. An import report will pop up. Just click Close


11. Let's now add a constraint to the target table to see what happens during the data
movement. In the bottom left part of the screen in the Models window expand the dividends2
store by pressing the + beside it.

12. In the subsections that appear under dividends2 you will see a section called Constraints.
Right click on it and select New Condition.


13. On the right side a new window will open allowing you to define the properties of this
condition. We will set a check condition which will check whether the dividend value is too
low. Enter the following information.
Name: Low_Dividend
Type: Oracle Data Integrator Condition
Where: dividends2.dividend>0.01
Message: Dividend Too Low

14. We now need to save our constraint. In the top right corner of the screen click on the Save
button


15. We are now ready to run the interface. In the top right section of the screen click back to
our interface by clicking on the Hive_To_Hive tab.


16. Now at the top of the screen we can click the Play button to run the interface


17. A new window pops up saying you need to save all of the changes before the interface can
be run. Just click Yes

18. A new window will pop up asking for the execution context, just click OK


19. An informational pop-up will show up telling you the execution has started. Simply click OK

20. It is now time to check our constraint. In the bottom left part of the screen (in the Models
section) right click on the dividends2 model, then go to the Control section and click on
Check

21. This check is its own job that must be run; hence a window will pop up asking you to select
a context for the execution. The default options are good, so just click OK


22. An informational window pops up telling you the execution has started. Just click OK

23. We can now see all of the rows that failed our check. Again, in the bottom left part of the
screen (in the Models section) right click on the dividends2 model, go to the Control menu
and select Errors

A new tab will pop up on the right side of the screen. You will see all of the rows which did
not pass the constraint.


24. This concludes the ODI section of the workshop. Go to the right upper corner of the screen
and click the X to close ODI.

25. Then in the terminal type exit to close it as well.


5.7 Summary
In this exercise you were introduced to Oracle's integration of ODI with Hadoop. It is worth
noting that this integration is only available for the Oracle database and only available from Oracle.
It is a custom extension for ODI, developed by Oracle, that allows users who already have
ETL as part of their data warehousing methodology to continue using the same tools and
procedures with the new Hadoop technologies.
It is quite important to note that ODI is a very powerful ETL tool which offers all of the
functionality typically found in an enterprise-quality ETL product. Although the examples given in this
exercise are quite simple, this does not mean the integration of ODI and Hadoop is. All of the power
and functionality of ODI is available when working with Hadoop: workflow definition, complex
transformations, flow control and multi-source integration, to name just a few of the ODI features
that can also be used with Hadoop.
Through this exercise you were introduced to three Knowledge Modules of ODI: reverse
engineering for Hive, integration into Hive, and integration from Hive to Oracle. These are not the
only Knowledge Modules available, and we encourage you to review the table available in section
5.2 of this document to get a better idea of all the functionality currently available.


6. WORKING WITH EXTERNAL TABLES


6.1 Introduction to External Tables
Oracle Direct Connector runs on the system where Oracle Database runs. It provides read access
to HDFS from Oracle Database by using external tables.
An external table is an Oracle Database object that identifies the location of data outside of the
database. Oracle Database accesses the data by using the metadata provided when the external
table was created. By querying the external tables, users can access data stored in HDFS as if
that data were stored in tables in the database. External tables are often used to stage data to be
transformed during a database load.
These are a few ways that you can use Oracle Direct Connector:

Access any data stored in HDFS files

Access CSV files and Data Pump files generated by Oracle Loader for Hadoop

Load data extracted and transformed by Oracle Data Integrator

Oracle Direct Connector uses the ORACLE_LOADER access driver.

6.2 Overview of Hands on Exercise


This exercise will involve working with Oracle external tables. We will create 3 text files with some
data in each. We will upload these files into HDFS and connect them to the Oracle database using
external tables. The data within these files will then be accessible from within the Oracle database.
In this exercise you will:
1. Create and query external tables over files stored in HDFS
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts, type cat scriptName and the contents of the script will be displayed in
the terminal

6.3 Configuring External Tables


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the external tables exercise are, type in the terminal:
cd /home/oracle/exercises/external
Then press Enter

3. The first step in this exercise is to create some random files. This is just so we have some
data in Hadoop to access through an external table. We will create three files called sales1, sales2
and sales3, each with a single row comprised of 3 numbers. To create the files go to
the terminal and type:
./createFiles.sh
Then press Enter

4. Next we will load these files into HDFS. We have a script for that process as well. Go to
the terminal and type:
./loadFiles.sh
Then press Enter

5. Next we will need to create the external table in Oracle. As the SQL code is quite long, we
have written a script with that code. Since this step is quite important, let's look at what the
code looks like. In the terminal type:
gedit createTable.sh
Then press Enter


Looking at the code for creating the table, you will notice the syntax is very similar to that of other
types of external tables, except for 2 lines: the preprocessor and the type, highlighted in the image
below

6. When you are done evaluating the code you can close the window by clicking the X in the
right upper corner of the window


7. Let's go ahead now and run that piece of code. In the terminal type:
./createTable.sh
Then press Enter

8. Now that the table is created we need to connect that table with the files we loaded into
HDFS. To make this connection we must run a Hadoop job which calls the Oracle loader code.
Go to the terminal and type:
./connectTable.sh
Then press Enter


9. You will be asked to enter a password so the code can log in to the database user.
Enter the following information
[Enter Database Password:]: tiger
Then Press Enter
NOTE: No text will appear while you type


10. We can now use SQL from Oracle to read those files in HDFS. Let's experiment with that.
First we connect to the database using SQL*Plus. Go to the terminal and type:
sqlplus scott/tiger
Then press Enter


11. Now let's query that data. Go to the terminal and type:
select * from sales_hdfs_ext_tab;
Then press Enter


The query returns the data that is in all three files.


12. This concludes this exercise. You can now exit SQL*Plus. Go to the terminal and type:
exit;
Then press Enter


13. Then close the terminal. Go to the terminal and type:
exit
Then press Enter


6.4 Summary
In this chapter we showed how data in HDFS can be queried using standard SQL right from the
Oracle database. With the data stored in HDFS, all of the parallelism and striping that naturally
occur there are taken full advantage of, while at the same time you can use all of the power and
functionality of the Oracle Database.
When implementing this method, parallel processing is extremely important when working with
large volumes of data. When using external tables, consider enabling parallel query with this SQL
command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading data into Oracle Database from the external files created by Oracle Direct
Connector, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when inserting data.


7. WORKING WITH MAHOUT


7.1 Introduction to Mahout
Apache Mahout is an Apache project to produce free implementations of distributed or otherwise
scalable machine learning algorithms on the Hadoop platform. Mahout is a work in progress; the
number of implemented algorithms has grown quickly, but there are still various algorithms
missing.
While Mahout's core algorithms for clustering, classification and batch based collaborative filtering
are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict
contributions to Hadoop based implementations. Contributions that run on a single node or on a
non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering
recommender component of Mahout was originally a separate project, and can run stand-alone
without Hadoop.
Currently Mahout supports mainly four use cases: Recommendation mining takes users' behaviour
and from that tries to find items users might like. Clustering takes e.g. text documents and groups
them into groups of topically related documents. Classification learns from existing categorized
documents what documents of a specific category look like and is able to assign unlabelled
documents to the (hopefully) correct category. Frequent item set mining takes a set of item groups
(terms in a query session, shopping cart content) and identifies which individual items usually
appear together.

7.2 Overview of Hands on Exercise


In this exercise you will be using the K-means algorithm to cluster data using Mahout's
implementation of K-means. To give a bit of background, K-means is an algorithm which
clusters data, and despite its simplistic nature it can be quite powerful. The algorithm takes two
inputs: a series of input values (v) and the number of groups those values need to be split into (k).
The algorithm first randomly picks k centres to represent the centre of each group, then
continuously moves those centres so that the distance from each centre to every point in its group
is as small as possible. Once the centres reach a point where any movement would only increase
the distance to the points, the algorithm stops. This is a great algorithm for finding patterns in data
when you have no prior information about what patterns it contains. For all its power, the K-means
algorithm is quite expensive computationally; hence using a massively distributed computation
cluster such as Hadoop offers a great advantage when dealing with very large data sets. This
is exactly what we will be experimenting with in this exercise.
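In symbols (a compact restatement of the description above using standard K-means notation; this formula is not part of the original lab text), the algorithm looks for centres mu_1, ..., mu_k that minimize the within-cluster sum of squared distances:

\min_{\mu_1,\ldots,\mu_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points currently assigned to centre mu_i. Each iteration reassigns every point to its nearest centre and then recomputes each centre as the mean of its assigned points, which never increases this sum.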
In this exercise you will:
1. Use Mahout to cluster a large data set
2. Use the graphics library in Java to visualize a Mahout k-means cluster

7.3 Clustering with K-means


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the Mahout exercise are, type in the terminal:
cd /home/oracle/exercises/mahout
Then press Enter

3. To get an idea of what our data file looks like, let's look at the first row. In the
terminal type:
head -n 1 synthetic_control.data
Then press Enter

As you can see on the screen, all there is in the file are random data points. It is within this
data that we would like to find patterns.


4. The first step in analyzing this data is loading it into HDFS. Let's go ahead and do that.
Go to the terminal and type:
./loadData.sh
Then press Enter

5. Now that the data is loaded we can run Mahout against the data. In this example
the data is already in vector form and a distance function has already
been compiled into the example. When clustering your own data, the command line for
running the clustering should include the distance function you have written and compiled in Java.
Go to the terminal and type:


This would be an excellent time to get a cup of coffee. The clustering is quite
computationally intensive and should take a couple of minutes to complete.
6. Once you get the command prompt back the clustering is done, but the results are stored
in binary format inside Hadoop. We first need to bring all of the results out of Hadoop and
then convert the data from binary format to text format. We have a script which will
perform both tasks. Let's run that script; go to the terminal and type:
./extractData.sh
Then press Enter


7. We can now go ahead and look at the results of the clustering. We will look at the text
output of the results. Go to the terminal and type:
gedit Clusters
Then press Enter


The output is not very user friendly but there are several indicators to look for, as follows:
n = the number of points in the cluster
c = the centre of the cluster
r = the radius of the circle which defines the cluster
Points = the data points in each cluster


8. Once you are done evaluating the results you can click the X in the right upper corner of
the screen to close the window.


9. Even with the highlighted indicators, the text output is not very representative of the data.
Mahout also has some graphing functions for simple data points. We will run a much simpler
clustering, with points that can be displayed on an X,Y plane, to see the results visually. Go to
the terminal and type:
./displayClusters
Then press Enter


A new window will pop up with a visual display of a K-means clustering. The black squares
represent data points and the red circles define the clusters. The yellow and green lines
represent the error margin for each cluster.
10. Once you are done evaluating the image you can click the X in the upper right corner of
the window to close it.


11. This concludes our Mahout exercise. You can now close the terminal window. Go to the
terminal and type:
exit
Then press Enter


7.4 Summary
In this exercise you were introduced to the K-means clustering algorithm and how to run the
algorithm using Mahout, and hence on a Hadoop cluster. It is important to note that Mahout
does not only focus on K-means but also has many different algorithms in the categories of
Clustering, Classification, Pattern Mining, Regression, Dimension Reduction, Evolutionary
Algorithms, Recommendation/Collaborative Filtering and Vector Similarity. Most of these
algorithms have special variants which are optimized to run on a massively distributed
infrastructure (Hadoop) to allow for rapid results on very large data sets.


8. PROGRAMMING WITH R
8.1 Introduction to Enterprise R
R is a language and environment for statistical computing and graphics. It is a GNU project which
is similar to the S language and environment which was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a
different implementation of S. There are some important differences, but much code written for S
runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.
Oracle R Enterprise integrates the open-source R statistical environment and language with
Oracle Database 11g, Exadata, Big Data Appliance, and Hadoop massively scalable computing.
Oracle R Enterprise delivers enterprise-level advanced analytics based on the R environment.
Oracle R Enterprise allows analysts and statisticians to leverage existing R applications and use
the R client directly against data stored in Oracle Database 11g, vastly increasing scalability,
performance and security. The combination of Oracle Database 11g and R delivers an
enterprise-ready, deeply integrated environment for advanced analytics. Data analysts can also take
advantage of analytical sandboxes, where they can analyze data and develop R scripts for
deployment while results stay managed inside Oracle Database.
As an embedded component of the RDBMS, Oracle R Enterprise eliminates R's memory
constraints since it can work on data directly in the database. Oracle R Enterprise leverages
Oracle's in-database analytics and scales R for high performance on Exadata and the Big Data
Appliance. Being part of the Oracle ecosystem, ORE enables execution of R scripts in the
database to support enterprise production applications and OBIEE dashboards, both for structured
results and graphics. Since it is R, users are able to leverage the latest R algorithms and contributed
packages.
Oracle R Enterprise users not only can build models using any of the data mining algorithms in the
CRAN task view for machine learning, but also leverage in-database implementations for
predictions (e.g., stepwise regression, GLM, SVM), attribute selection, clustering, feature
extraction via non-negative matrix factorization, association rules, and anomaly detection.

8.2 Overview of Hands on Exercise


In this exercise you will be introduced to the R programming language as well as the enhancements
Oracle has brought to it. Limitations where all data must be kept in system memory are now gone,
as you can save and load data to and from both the Oracle database and HDFS. To exemplify the
uses of R, we will be doing K-means clustering again, as in Exercise 7, this time using the R
programming language. If you would like a review of K-means please see the introduction to
section 7.
In this exercise you will:


1. Generate a set of random data points
2. Save the data in both the Oracle database and HDFS
3. View the data in Oracle and HDFS
4. Load the data from Oracle back into R
5. Perform K-means clustering on the data points
6. View the results

8.3 Taking data from R and inserting it into the database


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the R exercises are, type in the terminal:
cd /home/oracle/exercises/R
Then press Enter

3. To work with R you can write scripts for the interpreter to execute or you can use the
interactive shell environment. To get a more hands on experience with R we will use the
interactive shell. To start the interactive shell go to the terminal and type:
R
Then press Enter

4. During the login process many different libraries load which extend the functionality of R. If a
particular library is not loaded automatically, you can load it manually after login. We will
need to load a library to interface with HDFS, so let's load that now.
Go to the R shell and type:
library(ORHC)
Then press Enter

5. Now let's go ahead and generate some pseudo-random data points so we have some data to
play with. We will generate 2D data points so we can easily visualize the data. Go to the R
terminal and type:
myDataPoints=rbind(matrix(rnorm(100, mean=0, sd=0.3),ncol=2)
,matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
Then press Enter


Now the variable myDataPoints will have some data points in it.
6. To be able to save data into the database or HDFS you need to have the data in columns
(as we already do) and you also need to have each of the columns labeled. This is
because column names are required within a database to be able to identify the columns.
Let's go ahead and label the columns x and y. Go to the R terminal and type:
colnames(myDataPoints) <- c("x", "y")
Then press Enter


7. We can now create a data frame which will load the data into the Oracle Database. Go to
the terminal and type:
ore.create(as.data.frame(myDataPoints, optional = TRUE), table="DATA_POINTS")
Then press Enter


8. If required, we can even load this data into HDFS. Let's go ahead and do that. Go to the R
terminal and type:
hdfs.put(DATA_POINTS, dfs.name="data_points")
Then press Enter


9. Now that we have loaded the data into both the database and HDFS, let's exit R and look at
that data. Go to the R shell and type:
q()
Then press Enter


10. You will be asked if you want to save workspace image. Type:
n
Then press Enter
Note: when typing n the information typed does not appear on the screen.


11. At this point, in classic R, all data and calculated results would be wiped from memory and
hence lost. With R Enterprise Edition we saved our data in the database, so let's go and
query that data. Go to the terminal and type:
./queryDB.sh
Then press Enter


On the screen you will see the table displayed which contains our data points.


12. We can also look at the data we stored inside HDFS. Go to the terminal and type:
./queryHDFS.sh
Then press Enter


Again on the screen you will see all of the data points displayed.


As you can see all of the work done in R can now be exported to the database or HDFS
for further processing based on business needs.

8.4 Taking data from the database, using it in R, and clustering

1. Data can not only be pushed out to the database, it can also be retrieved from the
database or HDFS to be used within R. Let's see how that is done. First let's go back into
the R environment. Go to the terminal and type:
R
Then press Enter


2. Let's now go ahead and load the data from the Oracle database. Go to the R shell and type:
myData=ore.pull(DATA_POINTS)
Then press Enter


3. Now that we have our data back inside R we can manipulate it. Let's do k-means clustering on the data. Go to the R shell and type:
cl <- kmeans(myData, 2)
Then Press Enter


4. The clustering is now done, but displaying the data in text format is not very interesting.
Let's graph the data. Go to the R terminal and type:
plot(myData, col = cl$cluster)
Then press Enter


5. A new window pops up with the data. The two colors (red and black) differentiate the two
clusters we asked the algorithm to find. We can even see where the cluster centers are.
Go back to the R shell. The terminal might be hidden behind the graph; move the windows
around until you find the terminal, then type:
points(cl$centers, col=1:2, pch = 8, cex=2)
Then press Enter


When you go back to the graph you will see the centers marked with a * and the points
marked with circles: raw random data clustered using the K-means algorithm.


6. When you are done evaluating the image you can click on the X in the right upper corner
of the window.


7. You can also close the R terminal by going to the R shell and typing:
q()
Then press Enter


8. When asked if you want to save workspace image go to the terminal and type:
n
Then Press Enter


9. This concludes this exercise. You can now go ahead and close the terminal. Go to the
terminal and type:
exit
Then press Enter


8.5 Summary
In this exercise you were introduced to the R programming language and how to do clustering
using it. You also saw one of the advantages of Oracle R Enterprise Edition, where you can save
your results into the Oracle database as well as extract data from the database for further
calculations. Oracle R Enterprise Edition also has a small set of functions which can be run on data
directly in the database. This enables the user to work with very large data sets which would not fit
into the normal memory of R.
Oracle R Enterprise provides these collections of functions:

ore.corr
ore.crosstab
ore.extend
ore.freq
ore.rank
ore.sort
ore.summary
ore.univariate


9. ORACLE NOSQL DATABASE


9.1 Introduction To NoSQL
Oracle NoSQL Database provides multi-terabyte distributed key/value pair storage that offers
scalable throughput and performance. That is, it services network requests to store and retrieve
data which is organized into key-value pairs. Oracle NoSQL Database services these types of data
requests with a latency, throughput, and data consistency that is predictable based on how the
store is configured.
Oracle NoSQL Database offers full Create, Read, Update and Delete (CRUD) operations with
adjustable durability guarantees. Oracle NoSQL Database is designed to be highly available, with
excellent throughput and latency, while requiring minimal administrative interaction.
Oracle NoSQL Database provides performance scalability. If you require better performance, you
use more hardware. If your performance requirements are not very steep, you can purchase and
manage fewer hardware resources.
Oracle NoSQL Database is meant for any application that requires network-accessible key-value
data with user-definable read/write performance levels. The typical application is a web application
which is servicing requests across the traditional three-tier architecture: web server, application
server, and back-end database. In this configuration, Oracle NoSQL Database is meant to be
installed behind the application server, causing it to either take the place of the back-end database,
or work alongside it. To make use of Oracle NoSQL Database, code must be written (using Java)
that runs on the application server.

9.2 Overview of Hands on Exercise


In this exercise you will be experimenting with the Oracle NoSQL Database. Most of the exercises
will have you look at pre-written Java code and then compile and run that code. Ensure you
understand the code and all of its nuances, as it is what makes up the NoSQL database interface.
If you would like to understand all of the functions that are available, there is a Javadoc available on
the Oracle web site.
In this exercise you will:
1. Insert and retrieve a simple key-value pair from the NoSQL database
2. Experiment with the multiGet functionality to retrieve multiple values at the same time
3. Integrate NoSQL with Hadoop code to do a word count on data in the NoSQL database

9.3 Insert and retrieve key-value pairs


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the NoSQL exercise are, type in the terminal:
cd /home/oracle/exercises/noSQL
Then press Enter

3. Before we do anything with the NoSQL database we must first start it. So let's go ahead and
do that. Go to the terminal and type:
./startNoSQL.sh
Then press Enter

4. To check if the database is up and running we can ping the database. Let's do that.
Go to the terminal and type:
./pingNoSQL.sh
Then press Enter


You will see Status: RUNNING displayed within the text. This shows the database is
running.

5. Oracle NoSQL Database uses a Java interface to interact with the data. This is a dedicated
Java API which lets you insert, update, delete and query data in the key-value store
that is the NoSQL database. Let's look at a very simple example of Java code where we
insert a key-value pair into the database and then retrieve it. Go to the terminal and type:
gedit Hello.java
Then press Enter


A new window will pop up with the code. In this code there are a couple of things to note.
We see the config variable, which holds our connection string, and the store variable,
which is our connection factory to the database. They are the initialization variables for the
key-value store and are highlighted in yellow. Next we define 2 variables of type Key and
Value; they will serve as our payload to be inserted. These are highlighted in green. Next
we have, highlighted in purple, the actual insert command. Highlighted in blue is the retrieve
command for getting data out of the database.
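If the screenshot is hard to read, the following is a minimal sketch of the same flow using the Oracle NoSQL Database Java API. The store name and helper host:port below are assumptions (the kvlite defaults), and the workshop's actual Hello.java may differ in its details.

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class HelloSketch {
    public static void main(String[] args) {
        // Store name and helper host are assumptions (kvlite defaults); adjust to your environment.
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // The payload: a key and a value.
        Key key = Key.createKey("Hello");
        Value value = Value.createValue("Big Data World".getBytes());

        // Insert the key-value pair.
        store.put(key, value);

        // Retrieve it again and print the key together with the value.
        ValueVersion vv = store.get(key);
        System.out.println("Hello " + new String(vv.getValue().getValue()));

        store.close();
    }
}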


6. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


7. Let's go ahead and compile that code. Go to the terminal and type:
javac Hello.java
Then press Enter


8. Now that the code is compiled, let's run it. Go to the terminal and type:
java Hello
Then press Enter

You will see printed on the screen Hello Big Data World, which is the key and the value we
inserted into the database.


9. Oracle NoSQL Database keys can have a major and a minor component. This feature can
be very useful when trying to group and retrieve multiple items at the same time from the
database. In the next piece of code we have 2 major components to the key (Mike and
Dave) and each major component has minor components (Question and Answer). We will
insert a value for each key, but we will use a multiGet function to retrieve all of the values
under Mike regardless of the minor component of the key, and completely ignore Dave.
Let's see what that code looks like. Go to the terminal and type:
gedit Keys.java
Then press Enter


10. A new window will pop up with the code. If you scroll to the bottom you will see the
following pieces of code. Highlighted in purple are the insertion calls which add the data
to the database. The retrieval of multiple records is highlighted in blue, and the green
shows the display of the retrieved data. Do note that 4 key-value pairs were inserted into
the database.
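As a hedged sketch of the same idea (not the workshop's exact Keys.java; the store name, helper host and the stored values are illustrative), the major/minor keys and the multiGet call look roughly like this:

import java.util.Map;
import java.util.SortedMap;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class KeysSketch {
    public static void main(String[] args) {
        // Store name and helper host are assumptions (kvlite defaults).
        KVStore store = KVStoreFactory.getStore(new KVStoreConfig("kvstore", "localhost:5000"));

        // Four records: major components Mike and Dave, each with Question and Answer
        // minor components. The value strings are illustrative only.
        store.put(Key.createKey("Mike", "Question"), Value.createValue("question text".getBytes()));
        store.put(Key.createKey("Mike", "Answer"), Value.createValue("answer text".getBytes()));
        store.put(Key.createKey("Dave", "Question"), Value.createValue("question text".getBytes()));
        store.put(Key.createKey("Dave", "Answer"), Value.createValue("answer text".getBytes()));

        // multiGet on the major component alone returns every record stored under Mike
        // (both minor components) and nothing stored under Dave.
        // The null KeyRange and Depth arguments use the defaults.
        SortedMap<Key, ValueVersion> mikeRecords = store.multiGet(Key.createKey("Mike"), null, null);
        for (Map.Entry<Key, ValueVersion> entry : mikeRecords.entrySet()) {
            System.out.println(entry.getKey() + " -> " + new String(entry.getValue().getValue().getValue()));
        }

        store.close();
    }
}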


11. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


12. Let's go ahead and compile that code. Go to the terminal and type:
javac Keys.java
Then press Enter


13. Now that the code is compiled, let's run it. Go to the terminal and type:
java Keys
Then press Enter


You will see the 2 values that are stored under the Mike major key displayed on the
screen, and no data points for the Dave major key. Major and minor parts of the key can
actually be composed of multiple strings, and further filtering can be done. This is left up to
the participants to experiment with.


14. The potential of a key-value store grows significantly when integrated with the power of
Hadoop and distributed computing. Oracle NoSQL Database can be used as a source and
target for the data used by and produced by Hadoop. Let's look at a modified example of
word count run in Hadoop, only this time we will count the number of values under the
major component of the key in the NoSQL database. To see the code go to the terminal
and type:
gedit Hadoop.java
Then press Enter


The code you see is very similar to the word count seen in the first section of the workshop.
There are only 2 differences. The first (highlighted in yellow) is the retrieval of
data from the NoSQL database rather than from a flat file.


The second difference can be seen if you scroll down into the run function: the
InputFormatClass is now KVInputFormat

15. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


16. Let's go ahead and run that code. We will need to go through the same procedure as in the
first exercise, where we compile the code, create a jar and then execute it on the Hadoop
cluster. We have written a script which will do all of that for us. Let's run that script; go to
the terminal and type
./runHadoop.sh
Then press Enter


17. You will see a Hadoop job being executed, with all of the terminal output that comes with it.
Once the execution is done it is time to see the results. We will just cat the results directly from
HDFS. Go to the terminal and type
./viewHadoop.sh
Then press Enter


You will see, displayed on the screen, a word count based on the major component of the keys
in the NoSQL database. In the previous exercise we inserted 2 pieces of data under each of the
major keys Dave and Mike. We also inserted a Hello key in the first exercise. This is
exactly the data the word count displays.


18. That concludes our exercises on the NoSQL database. It is time to shut down our NoSQL
database. Go to the terminal and type:
./stopNoSQL
Then press Enter


19. We can now close our terminal window. Go to the terminal and type:
exit
Then press Enter


9.4 Summary
In this exercise you were introduced to Oracle's NoSQL database. You saw how to insert and
retrieve key-value pairs, as well as the multiGet function where multiple values can be retrieved
under the same major component of a key. The last example showed how a NoSQL database can
be used as a source for a Hadoop job and how the two technologies can be integrated.
It is important to note here the differences between the NoSQL database and a traditional RDBMS.
With relational data the queries performed are much more powerful and more complex, while
NoSQL simply stores and retrieves values for a specific key. Given the simplicity of the NoSQL
storage model, it has a significant performance and scaling advantage. A NoSQL database can store
petabytes worth of information in a distributed cluster and still maintain very good performance for
data interaction, at a much lower cost per megabyte of data. NoSQL has many uses and has been
implemented successfully in many different circumstances, but at the same time it does not mimic
or replace the use of a traditional RDBMS.


APPENDIX A
A.1 Setup of a Hive Data Store

1. Once we are connected to ODI we need to set up our models; the logical and physical
definitions of our data sources and targets. To start off, at the top of the screen click on
Topology.

2. Next, in the left menu, make sure you are on the Physical Architecture tab and expand the
Technologies list


3. In the expanded list find the folder Hive and expand it


4. In this folder we need to create a new Data Server. Right click on the Hive technology and
select New Data Server

5. A new tab will open on the right side of the screen. Here you can define all of the
properties of this data server. Enter the following details:
Name: Hive Server
Then click on the JDBC tab in the left menu


6. On the right of the JDBC Driver field click on the magnifying glass to select the JDBC
driver

7. A new window will pop up which will allow you to select from a list of drivers. Click on the
down arrow to see the list


8. From the list that appears select Apache Hive JDBC Driver.

9. Now click OK to close the window


10. Back at the main window enter the following information


JDBC Url: jdbc:hive://bigdatalite.us.oracle.com:10000/default
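As an aside (this is not a step in the lab), the URL above is the same one any HiveServer1-era JDBC client would use. A minimal sketch of connecting to it directly from Java, assuming the Hive JDBC driver jars are on the classpath, could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1 driver class, matching the jdbc:hive:// URL used by ODI above.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://bigdatalite.us.oracle.com:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables in the default database, just to prove the connection works.
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}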

11. We need to set some Hive-specific variables. In the menu on the left, go now to the
Flexfields tab


12. In the Flexfields tab uncheck the Default check box and enter the following information:
Value: thrift://localhost:10000
Don't forget to press Enter when done typing to set the variable

13. It is now time to test to ensure we set everything up correctly. In the upper left corner of
the right window click on Test Connection


14. A window will pop up asking if you would like to save your data before testing. Click OK

15. An informational message will pop up asking you to register a physical schema. We can ignore
this message as that will be our next step. Just click OK

16. You need to select an agent to use for the test. Leave the default
Physical Agent: Local(No Agent)
Then click Test


17. A window should pop up saying Successful Connection. Click OK

If any other message is displayed please ask for assistance to debug. It is critical for the
rest of this exercise that this connection is fully functional.
18. Now, in the menu on the left side of the screen, in the Hive folder, there should be a
physical server created called Hive Server. Right click on it and select New Physical
Schema.


19. A new tab will again open on the right side of the screen to enable you to define the details of the Physical Schema. Enter the following details:
Schema (Schema): default
Schema (Work Schema): default
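(Optional) If you are unsure which schema names are valid here, a small JDBC query can list them. This is only a sketch under the same assumptions as the earlier JDBC example (driver class, URL, empty credentials); SHOW DATABASES is standard HiveQL.

    // HiveSchemaCheck.java - an optional sketch that lists the Hive databases so you
    // can confirm that the schema name entered here ("default") really exists.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSchemaCheck {
        public static void main(String[] args) throws Exception {
            // Same assumed driver and URL as in the earlier connectivity sketch.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://bigdatalite.us.oracle.com:10000/default", "", "");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW DATABASES");
            System.out.println("Available Hive schemas:");
            while (rs.next()) {
                System.out.println("  " + rs.getString(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }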


20. Then click Save All in the upper left part of the screen.

21. A warning will appear about No Context specified. Again, this will be our next step. Just click OK.

22. We now need to expand the Logical Architecture tab in the left menu. Toward the bottom left of the screen you will see the Logical Architecture tab; click on it.


23. In the Logical Architecture tab you will need to again find the Hive folder and click on the +
to expand it.


24. Now, to create the logical store, right-click on the Hive folder and select New Logical Schema.


25. In the new window that opens on the right of the screen, enter the following information:
Name: Hive Store
Context: Global
Physical Schemas: Hive Server.default


26. This should set up the Hive data store, enabling us to move data into and out of Hive with ODI. We now need to save all of the changes we made. In the upper left corner of the screen, click on the Save All button.

27. We can close all of the tabs we have opened on the right side of the screen to reduce the clutter. Click on the X for each of them.


We would theoretically need to repeat steps 7-29 for each of the different types of data stores. As the procedure is almost the same, a flat file source and an Oracle database target have already been set up for you, to reduce the number of steps in this exercise. For details on how to use flat files and Oracle databases with ODI, please see the excellent Oracle by Example ODI tutorials found at
http://www.oracle.com/technetwork/tutorials/index.html.
28. We now need to go to the Designer Tab in the left menu to perform the rest of our
exercise. Near the top of the screen on the left side click on the Designer tab.

29. Near the bottom of the screen on the left side there is a Models tab; click on it.


30. You will notice there is already a File and an Oracle model created for you. These were pre-created as per the note above. Let's now create a model for the Hive data store we just created. In the middle of the screen in the right panel there is a folder icon next to the word Models. Click on the folder icon and select New Model.


31. In the new tab that appears on the right side enter the following information:
Name: Hive
Code: HIVE
Technology: Hive
Logical Schema: Hive Store


32. We can now go to the upper left corner of the screen and save this model by clicking on the Save All icon.
