
Big Data Workshop

Lab Guide

http://www.oracle-developer-days.com
Copyright © 2012, Oracle and/or its affiliates. All rights reserved

TABLE OF CONTENTS

1. Introduction
2. Hadoop Hello World
2.1 Introduction to Hadoop
2.2 Overview of Hands on Exercise
2.3 Word Count
2.4 Summary
3. Pig Exercise
3.1 Introduction to Pig
3.2 Overview Of Hands On Exercise
3.3 Working with PIG
3.4 Summary
4. Hive Coding
4.1 Introduction to Hive
4.2 Overview Of Hands On Exercise
4.3 Queries with Hive
4.4 Summary
5. Oracle ODI and Hadoop
5.1 Introduction To Oracle Connectors
5.2 Overview of Hands on Exercise
5.3 Setup and Reverse Engineering in ODI
5.4 Using ODI to import text file into Hive
5.5 Using ODI to import Hive Table into Oracle
5.6 Using ODI to import Hive Table into Hive
5.7 Summary
6. Working with External Tables
6.1 Introduction to External Tables
6.2 Overview of Hands on Exercise
6.3 Configuring External Tables
6.4 Summary
7. Working with Mahout
7.1 Introduction to Mahout
7.2 Overview of Hands on Exercise
7.3 Clustering with K-means
7.4 Summary
8. Programming with R
8.1 Introduction to Enterprise R
8.2 Overview of Hands on Exercise
8.3 Taking data from R and inserting it into database
8.4 Taking data from database and using it in R and clustering
8.5 Summary
9. Oracle NoSQL Database
9.1 Introduction To NoSQL
9.2 Overview of Hands on Exercise
9.3 Insert and retrieve Key-Value pairs
9.4 Summary
Appendix A
A.1 Setup of a Hive Data Store


1. INTRODUCTION

Big data is not just about managing petabytes of data. It is also about managing large numbers of complex
unstructured data streams which contain valuable data points. However, which data points are the most
valuable depends on who is doing the analysis and when they are doing the analysis. Typical big data
applications include: smart grid meters that monitor electricity usage in homes, sensors that track and
manage the progress of goods in transit, analysis of medical treatments and drugs that are used, analysis
of CT scans etc. What links these big data applications is the need to track millions of events per second,
and to respond in real time. Utility companies will need to detect an uptick in consumption as soon as
possible, so they can bring supplementary energy sources online quickly. Probably the fastest growing
area relates to location data being collected from always-on mobile devices. If retailers are to capitalise on
their customers’ location data, they must be able to respond as soon as those customers step through the door.
In the conventional model of business intelligence and analytics, data is cleaned, cross-checked and
processed before it is analysed, and often only a sample of the data is used in the actual analysis. This is
possible because the kind of data that is being analysed - sales figures or stock counts, for example – can
easily be arranged in a pre-ordained database schema, and because BI tools are often used simply to
create periodic reports.

At the center of the big data movement is an open source software framework called Hadoop. Hadoop has
become the technology of choice to support applications that in turn support petabyte-sized analytics
utilizing large numbers of computing nodes. The Hadoop system consists of three main projects: Hadoop Common, a utility layer that supports the other Hadoop subprojects; the Hadoop Distributed File System (HDFS); and Hadoop MapReduce. HDFS acts as the data storage platform for the Hadoop framework and can scale to massive size when distributed over numerous computing nodes.
Hadoop MapReduce is a framework for processing data sets across clusters of Hadoop nodes. The Map and Reduce process splits the work by first mapping the input across the nodes of the cluster, then splitting the workload into even smaller data sets and distributing them further throughout the computing cluster. This allows Hadoop to leverage massively parallel processing (MPP). With MPP, Hadoop can run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system.


As the nodes "return" their answers, the Reduce function collects and combines the information to deliver a
final result.
To extend the basic Hadoop ecosystem capabilities a number of new open source projects have added
functionality to the environment. A typical Hadoop ecosystem will look something like this:

 Avro is a data serialization system that converts data into a fast, compact binary data format. When
Avro data is stored in a file, its schema is stored with it

 Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file
system and MapReduce

 HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis

 Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs

 Pig is a high-level programming language and execution framework for parallel computation. Pig
works within the Hadoop and MapReduce frameworks

 ZooKeeper provides coordination, configuration and group services for distributed applications
working over the Hadoop stack

Data exploration of Big Data result sets requires displaying millions or billions of data points to uncover hidden patterns or records of interest.


Many vendors are talking about Big Data in terms of managing petabytes of data. For example, EMC has a number of Big Data storage platforms such as its new Isilon storage platform. In reality the issue of big data is much bigger, and Oracle's aim is to focus on providing a big data platform which provides the following:

 Deep Analytics – a fully parallel, extensive and extensible toolbox full of advanced and novel
statistical and data mining capabilities

 High Agility – the ability to create temporary analytics environments in an end-user driven, yet
secure and scalable environment to deliver new and novel insights to the operational business

 Massive Scalability – the ability to scale analytics and sandboxes to previously unknown scales
while leveraging previously untapped data potential

 Low Latency – the ability to instantly act based on these advanced analytics in your operational,
production environment


2. HADOOP HELLO WORLD

2.1 Introduction to Hadoop


Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of
distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a
cluster of machines and executes user defined Map/Reduce jobs across the nodes in the cluster. A
Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation
is a data set of key/value pairs.
In the map phase, the framework splits the input data set into a large number of fragments and assigns
each fragment to a map task. The framework also distributes the many map tasks across the cluster of
nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and
produces a set of intermediate key/value pairs. For each input key/value pair (K,V), the map task invokes a
user defined map function that transmutes the input into a different key/value pair (K',V').
Following the map phase the framework sorts the intermediate data set by key and produces a set of
(K',V'*) tuples so that all the values associated with a particular key appear together. It also partitions the
set of tuples into a number of fragments equal to the number of reduce tasks.
In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each
such tuple it invokes a user-defined reduce function that transmutes the tuple into an output key/value pair
(K,V). Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals
with shipping the appropriate fragment of intermediate data to each reduce task.
Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the middle of a computation, the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
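To make the key/value flow concrete, here is a small hand-worked trace of a word count job like the one you will run in the next exercise (the actual input files there are generated by a script, so the words below are purely illustrative):

Input value (one line of text):   "Hello World Hello Hadoop"
Map output (K',V') pairs:         (Hello,1) (World,1) (Hello,1) (Hadoop,1)
After the sort, (K',V'*) tuples:  (Hadoop,[1]) (Hello,[1,1]) (World,[1])
Reduce output (K,V) pairs:        (Hadoop,1) (Hello,2) (World,1)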
Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or
jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point
of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which
puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker
manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks
upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
Hadoop DFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large
cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks, all
blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault
tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and
have strictly one writer at any time.
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a
single Namenode, a master server that manages the filesystem namespace and regulates access to files
by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage
storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations
like opening, closing, renaming etc. of files and directories available via an RPC interface. It also
determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.

2.2 Overview of Hands on Exercise


To get an understanding of what is involved in running a Hadoop job and all of the steps one must undertake, we will set up and run a “hello world” type exercise on our Hadoop cluster.
In this exercise you will:
1) Compile a Java Word Count written to run on a Hadoop Cluster
2) Create some files to run word count on
3) Upload the files into HDFS
4) Run Word Count
5) View the Results
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the content of
these scripts type cat scriptName and the contents of the script will be displayed in the terminal

2.3 Word Count


1. All of the setup and execution for the Word Count exercise can be done from the terminal, hence to
start out this first exercise please open the terminal by double clicking on the Terminal icon on the
desktop.

2. To get into the folder where the scripts for the first exercise are, type in the terminal:

cd /home/oracle/exercises/wordCount
Then press Enter

3. Let’s look at the Java code which will run word count on a Hadoop cluster. Type in the terminal:

gedit WordCount.java
Then press Enter


4. A new window will open with the Java code for word count. We would like you to look at lines 14 and
28 of the code. You can see there that the Mapper and Reducer interfaces are being implemented.
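For reference, the Mapper and Reducer implementations in this exercise follow the classic Hadoop word count pattern built on the old org.apache.hadoop.mapred API. The sketch below shows the general shape of such a class; the WordCount.java shipped on the workshop image may differ in small details, so treat it as illustrative rather than the exact file contents.

package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Mapper: for each input line, emit a (word, 1) pair for every word found
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sum the 1s emitted for each word and emit (word, total)
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}

The key point is the two interface implementations: the Mapper emits a (word, 1) pair for every word it sees, and the Reducer sums those 1s per word.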

5. When you are done evaluating the code you can click on the X in the right upper corner of the
screen to close the window.


6. We can now go ahead and compile the Word Count code. We need to run the compile.sh script, which
will set the correct classpath and output directory while compiling WordCount.java. Type in the
terminal:

./compile.sh
Then press Enter

7. We can now create a jar file from the compile directory of Word Count. This jar file is required as
the code for word count will be sent to all of the nodes in the cluster and the code will be run
simultaneously on all nodes that have appropriate data. To create the jar file, in the terminal type:

./createJar.sh
Then press Enter
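If you are curious what these two helper scripts do (remember you can always cat them), they boil down to a standard javac/jar sequence. The sketch below assumes a typical Hadoop core jar location and a wordcount_classes output directory; the actual paths inside compile.sh and createJar.sh on the workshop image may differ.

# compile.sh (sketch): compile WordCount.java against the Hadoop core classes
mkdir -p wordcount_classes
javac -classpath /usr/lib/hadoop/hadoop-core.jar -d wordcount_classes WordCount.java

# createJar.sh (sketch): package the compiled classes into WordCount.jar
jar -cvf WordCount.jar -C wordcount_classes/ .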


8. For the exercise to be more interesting we need to create some files on which word count will be
executed. To create the files go to the terminal and type:

./createFiles.sh
Then press Enter

9. To see the contents of the files type in the terminal:

cat file01 file02


Then press Enter


In the terminal window you will see the contents of the two files, each file having 4 words in it.
Although these are quite small files, the code would run identically with more than 2 files and with files
that are several gigabytes or terabytes in size.

10. Now that we have the files ready we must move them into the Hadoop Distributed File System (HDFS).
Hadoop cannot work with files on other file systems; they must be within HDFS for them to be
usable. It is also important to note that files which are within HDFS are split into multiple chunks
and stored on separate nodes for parallel parsing. To upload our two files into HDFS you need
to use the copyFromLocal command in Hadoop. Run the command by typing at the terminal:


hadoop dfs -copyFromLocal file01 /user/oracle/wordcount/input/file01


Then press Enter

For convenience you can also run the script copyFiles.sh and it will upload the files for you, so you do
not need to type in this and the next command.

11. We should now upload the second file. Go to the terminal and type:

hadoop dfs -copyFromLocal file02 /user/oracle/wordcount/input/file02


Then press Enter
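If you would like to confirm that both files arrived before running the job, you can list the input directory (an optional check, not part of the scripted exercise):

hadoop dfs -ls /user/oracle/wordcount/input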


12. We can now run our MapReduce job to do a word count on the files we just uploaded. Go to the
terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output

Then press Enter

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job
for you, so you do not need to type in the above command.


A lot of text will roll by in the terminal window. This is informational data coming from the Hadoop
infrastructure to help track the status of the job. Wait for the job to finish; this is signaled by the
command prompt coming back.

13. Once you have your command prompt back your MapReduce task is complete. It is now time to
look at the results. We can display the results file right from HDFS by using the cat
command from Hadoop. Go to the terminal and type the following command:

hadoop dfs -cat /user/oracle/wordcount/output/part-00000


Then press Enter

For your convenience you can also run the script viewResults.sh and it will run the Hadoop
command for you to see the results.


In the terminal the word count results are displayed. You will see that the job counted the number of
times each word appears.


14. As an experiment let’s try to run the Hadoop job again. Go to the terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press enter

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job
for you, so you do not need to type in the above command.


15. You will notice an error message appears and no MapReduce task is executed. This is easily
explained by the immutability of data in HDFS. Since Hadoop does not allow an update of data files (just
read and write), you cannot update the data in the results directory, hence the execution has
nowhere to place its output. For you to re-run the MapReduce job you must either point it to
another output directory or clean out the current output directory. Let’s go ahead and clean out the
previous output directory. Go to the terminal and type:

hadoop dfs -rmr /user/oracle/wordcount/output


Then press Enter

For convenience you can also run the script deleteOutput.sh and it will delete the files for you, so you
do not need to type in this command.
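Alternatively, instead of deleting the previous output, you could point the job at a fresh output directory, for example (the directory name output2 is only an illustration, not something the workshop scripts expect):

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output2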


16. Now that we have cleared the output directory we can re-run the MapReduce task. Let’s just go ahead
and make sure it works again. Go to the terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press enter

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job
for you, so you do not need to type in the above command.


Now the MapReduce job ran fine again, as shown by the output on the screen.


17. This completes the word count example. You can now close the terminal window; go to the
terminal window and type:

exit
Then press Enter


2.4 Summary
In this exercise you were able to see the basic steps required in setting up and running a very simple
MapReduce job. You saw which interfaces must be implemented when creating a MapReduce task, you saw
how to upload data into HDFS, and how to run the MapReduce task. It is important to talk about execution
time for the exercise: the amount of time required to count 8 words is quite high in absolute terms. It is
important to understand that Hadoop needs to start a separate Java Virtual Machine to process each file or
chunk of a file on each node of the cluster. As such even a trivial job has some processing time, which limits
the possible applications of Hadoop, as it can only handle batch jobs. Real-time applications where immediate answers
are required can’t be run on a Hadoop cluster. At the same time, as the data volumes increase, processing
time does not increase that much as long as there are enough processing nodes. A recent benchmark of a
Hadoop cluster saw the complete sorting of 1 terabyte of data in just over 3 minutes on 910 nodes.


3. PIG EXERCISE

3.1 Introduction to Pig


Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing
data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of
Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them
to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-
Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop
subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the
following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data
analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to optimize their
execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.

3.2 Overview Of Hands On Exercise


In this exercise we will be analyzing data coming from the New York Stock Exchange; specifically, we would
like to evaluate the dividends given by different companies. We have a tab-delimited file with four columns:
exchange name, stock name, date, and dividend. For our analysis we want to find the companies which
had the highest average dividend.
In this exercise we will:
1. Load our stock exchange data into our HDFS
2. Run a PIG script which will find the company with the highest dividends
3. View the results
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the content of
these scripts type cat scriptName and the contents of the script will be displayed in the terminal

NOTE 2: This exercise and its dataset were inspired by the following website:
http://ofps.oreilly.com/titles/9781449302641/running_pig.html

3.3 Working with PIG


1. All of the setup and execution for this exercise can be done from the terminal, hence open the terminal
by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for this exercise are, type in the terminal:

cd /home/oracle/exercises/pig
Then press Enter

3. To get an idea of what our dividends file looks like let’s look at the first couple of rows. In the
terminal type:

head NYSE_dividends
Then press Enter

The first 10 rows of the data file will be displayed on the screen


4. Now that we have an idea what our data file looks like, let’s load it into HDFS for processing.
To load the data we use the copyFromLocal function of Hadoop; go to the terminal and type:

hadoop dfs -copyFromLocal NYSE_dividends /user/oracle/NYSE_dividends


Then press Enter

For convenience you can also run the script loadData.sh and it will upload the file for you, so you
do not need to type in the command above.

5. We will be running our Pig script in interactive mode so we can see each step of the process. For
this we will need to open the Pig interpreter, called grunt. Go to the terminal and type:

pig
Then press Enter


6. Once at the grunt shell we can start typing Pig script. The first thing we need to do is load the
data file from HDFS into Pig for processing. The data is not actually copied, but a handler is created
for the file so Pig knows how to interpret the data. Go to the grunt shell and type:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);


Then Press Enter
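The workshop statement above leaves the fields untyped, which Pig allows (it will cast the dividend to a number when AVG is applied). If you prefer to be explicit, the same load could declare a schema with types; this variant is shown only as an illustration:

dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);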


7. Now that the data is loaded as a four-column table, let’s see what the data looks like in Pig. Go to the
grunt shell and type:

dump dividends;
Then press Enter


You will see output similar to the first exercise on the screen. This is normal, as Pig is merely a
high-level language; all commands which process data simply run MapReduce tasks in the
background, so the dump command simply becomes a MapReduce task that is run. This will apply to
all of the commands you will run in Pig. The output on the screen will show you all of the rows of the
file in tuple form.


8. The first step in analyzing the data will be grouping the data by stock symbol so we have all of the
dividends of one company grouped together. Go to the grunt shell and type:

grouped = group dividends by symbol;


Then press Enter


9. Let’s go ahead and dump this grouped variable to the screen to see what its contents look like. Go
to the grunt shell and type:

dump grouped;
Then press Enter


On the screen you will see all of the groups displayed in tuple-of-tuples form. As the output might
look a bit confusing, only one tuple is highlighted in the screenshot below to help clarity. The highlighted
region shows all of the rows of the table for the CATO stock symbol.


10. In the next step we will go through each group tuple and get the group name and the average
dividend. Go to the grunt shell and type:

avg = foreach grouped generate group, AVG(dividends.dividend);


Then press Enter


11. Let’s go ahead and see what this output looks like. Go to the grunt shell and type:

dump avg;
Then press Enter


Now you can see on the screen a dump of all stock symbols with their respective average dividends.
A couple of them are highlighted in the image below.


12. Now that we have the dividends for each company it would be ideal if we had them in order from
highest to lowest dividend. Let’s get that list by ordering on the second field of each tuple ($1, the
average dividend; Pig numbers fields starting from $0). Go to the grunt shell and type:

sorted = order avg by $1 DESC;


Then press Enter


We can now see what the sorted list looks like. Go to the grunt terminal and type:

dump sorted;
Then press Enter


On the screen you now see the list sorted in descending order. Shown on the screen are the lowest
dividends, but you can scroll up to see the rest of the values.


13. We now have the final results we want. It might be worth writing these results out to HDFS. Let’s
do that. Go to the grunt shell and type:

store sorted into 'average_dividend';


Then press Enter


14. The new calculated data is now permanently stored in HDFS. We can now exit the grunt shell. Go
to the grunt shell and type:

quit;
Then press Enter


15. Now back at the terminal, let’s view the top 10 companies by average dividend directly from HDFS.
Go to the terminal and type:

hadoop dfs -cat /user/oracle/average_dividend/part-r-00000 | head


Then press Enter

For convenience you can also run the script viewResults.sh and it will display the file for you, so you
do not need to type in the command above.


This command simply ran cat on the results file available in HDFS. The results are seen on the
screen.


16. That concludes the Pig exercise; you can now close the terminal window. Go to the terminal
and type:

exit
Then press Enter


3.4 Summary
In this exercise you saw what a Pig script looks like and how to run it. It is important to understand that Pig
is a scripting language which ultimately runs MapReduce jobs on a Hadoop cluster, hence all of the power
of a distributed system and the high data volume / size which HDFS can accommodate are exploitable
through Pig. Pig provides an easier interface to the MapReduce infrastructure, allowing for scripting paradigms
to be used rather than direct Java coding.
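For reference, the statements you typed interactively in grunt could equally be collected into a single script file (for example average_dividend.pig, a file name chosen here purely for illustration) and run non-interactively with pig average_dividend.pig:

-- load the tab-delimited NYSE dividends file from HDFS
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- group all rows for the same stock symbol together
grouped = group dividends by symbol;
-- compute the average dividend per symbol
avg = foreach grouped generate group, AVG(dividends.dividend);
-- sort from highest to lowest average dividend
sorted = order avg by $1 DESC;
-- write the result back into HDFS
store sorted into 'average_dividend';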


4. HIVE CODING

4.1 Introduction to Hive


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and
the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to
project structure onto this data and query the data using a SQL-like language called HiveQL. At the same
time this language also allows traditional map/reduce programmers to plug in their custom mappers and
reducers when it is inconvenient or inefficient to express this logic in HiveQL.

4.2 Overview Of Hands On Exercise


In this exercise you will use Hive Query Language to create tables, insert data into those tables and run
queries on that data. For this exercise we will use the same data as in the Pig exercise above: a
tab-delimited file with four columns (exchange name, stock name, date, and dividend). For our
analysis we want to find the companies which had the highest average dividend.
In this exercise you will:
1. Upload the dividends file into HDFS
2. Create a table in Hive
3. Load the dividends data into the Hive table
4. Run queries on the Hive table

4.3 Queries with Hive


1. All of the setup and execution for this exercise can be done from the terminal, hence open a
terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the Hive exercise are, type in the terminal:

cd /home/oracle/exercises/hive
Then press Enter


3. We already have an idea what our data file looks like, so let’s load it into HDFS for processing.
This is done identically to the way it was done in the first two exercises. We will see a better way to
load data in the next exercise. To load the data we use the copyFromLocal function of Hadoop. Go
to the terminal and type:

hadoop dfs -copyFromLocal NYSE_dividend /user/oracle/NYSE_dividend


Then press Enter

For convenience you can also run the script loadData.sh and it will load the file for you, so you
do not need to type in the command above.

4. Let’s now enter the Hive interactive shell environment to create tables and run queries against
those tables. To give an analogy, this is similar to SQL*Plus, but this environment is specifically
for the HiveQL language. To enter the environment go to the terminal and type:

hive
Then press Enter

5. The first thing we need to do in Hive is create a table. We will create a table named dividends with
four fields called exchange, symbol, dates and dividend, which maps naturally onto the data set.
Go to the terminal and type:

create table dividends(exchange string, symbol string, dates string, dividend float);
Then press Enter


An OK should be printed on the screen indicating the success of the operation. This OK message
will be printed for all operations, but we will only mention it this time. It is left up to the user to check
for this message on future HiveQL commands.
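A side note on delimiters: the table above relies on Hive's default field delimiter. If the text file you are loading is delimited differently (for example, tab-delimited), you would normally state the delimiter explicitly when creating the table; a sketch of what that looks like (the table name dividends_tsv is only an illustration, not part of this exercise):

create table dividends_tsv(exchange string, symbol string, dates string, dividend float)
row format delimited
fields terminated by '\t'
stored as textfile;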

6. We can now run a command to see all of the tables available to this OS user. Go to the hive
terminal and type:

show tables;
Then press Enter


You can see the only table currently available is the one we just created.

7. As with normal SQL you also have a describe command available to see the columns in the table.
Go to the terminal and type:

describe dividends;
Then press Enter


As you can see, the dividends table has the 4 fields, each with its own Hive-specific data type.
This is to be expected, as this is the way we created the table.

8. Let’s go ahead and load some data into this table. Data is loaded into Hive from flat files available
in the HDFS file system. Go to the terminal and type:

load data inpath '/user/oracle/NYSE_dividend' into table dividends;


Then press Enter

The data is now loaded into the table.

9. We can now see the data that has been loaded into the table. Go to the terminal and type:

select * from dividends limit 5;


Then press Enter


Five lines from the table are printed to the screen; only 3 of the lines are highlighted in the image
below.


10. Now that we have all of the data loaded into a Hive table, we can run SQL-like queries on it. As
we have the same data set as in the Pig exercise, let’s try to extract the same result. We will look for
the top 10 companies by average dividend. Go to the terminal and type:

select symbol, avg(dividend) avg_dividend from dividends group by symbol order by avg_dividend desc limit 10;
Then press Enter


On the screen you will see a lot of log information scroll through. Most of this is generated by
Hadoop, as Hive (just like Pig) takes the queries you write, rewrites them as MapReduce jobs
and then executes them. The query we wrote can take full advantage of the distributed computational
power of Hadoop as well as the striping and parallelism that HDFS enables.

When the query is done you should see on the screen the top 10 companies in descending order.
This output shows the exact same information as we got in the previous exercise. As the old
idiom goes, there is more than one way to skin a cat; with Hadoop there is always more than one
way to achieve a task.
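If you wanted to keep this result in Hive rather than just display it (much as the Pig exercise stored its output with store), HiveQL's create table as select, mentioned in the summary below, can do that. A sketch, with top_dividends as a purely illustrative table name:

create table top_dividends as
select symbol, avg(dividend) avg_dividend
from dividends
group by symbol
order by avg_dividend desc
limit 10;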


11. This is the end of the Hive exercise. You can now exit the Hive interpreter. Go to the terminal and
type:

exit;
Then press Enter


12. Then close the terminal. Go to the terminal and type:

exit
Then press Enter


4.4 Summary
In this exercise you were introduced to the Hive Query Language. You saw how to create and view
tables using HiveQL. Once tables were created, you were introduced to some of the standard SQL
constructs which HiveQL has available. It is important to understand that Hive is an abstraction layer for
Hadoop and MapReduce jobs. All queries written in HiveQL get transformed into a DAG (Directed Acyclic
Graph) of MapReduce tasks which are then run on the Hadoop cluster, hence taking advantage of all of its
performance and scalability capabilities, but also retaining all of the limitations of Hadoop.
HiveQL has most of the functionality available in standard SQL and has a series of DDL and DML functions
implemented, but it does not strictly adhere to the SQL-92 standard. HiveQL offers extensions not in SQL,
including multi-table inserts and create table as select, but only offers basic support for indexing.
Also, HiveQL lacks support for transactions and materialized views, and has only limited subquery support. It is
intended for long-running queries of a data warehousing type rather than a transactional OLTP type of
data load.


5. ORACLE ODI AND HADOOP

5.1 Introduction To Oracle Connectors


Apache Hadoop is designed to handle and process data from data sources that are typically non-RDBMS
and data volumes that are typically beyond what is handled by relational databases.
The Oracle Data Integrator Application Adapter for Hadoop enables data integration developers to integrate
and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar and easy-to-use
tools and preconfigured knowledge modules, the adapter provides the following capabilities:

 Loading data into Hadoop from the local file system and HDFS.

 Performing validation and transformation of data within Hadoop.

 Loading processed data from Hadoop to Oracle Database for further processing and generating
reports.
Typical processing in Hadoop includes data validation and transformations that are programmed as
MapReduce jobs. Designing and implementing a MapReduce job requires expert programming knowledge.
However, using Oracle Data Integrator and the Oracle Data Integrator Application Adapter for Hadoop, you
do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language
(HiveQL), a SQL-like language for implementing MapReduce jobs. The Oracle Data Integrator graphical
user interface enhances the developer's experience and productivity while enabling them to create
Hadoop integrations.
When implementing a big data processing scenario, the first step is to load the data into Hadoop. The data
source is typically in the local file system, HDFS, Hive tables, or external Hive tables.
After the data is loaded, you can validate and transform the data using HiveQL like you do in SQL. You can
perform data validation such as checking for NULLS and primary keys, and transformations such as
filtering, aggregations, set operations, and derived tables. You can also include customized procedural
snippets (scripts) for processing the data.
When the data has been aggregated, condensed, or crunched down, you can load it into Oracle Database
for further processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into
Oracle Database.
Knowledge Modules:

IKM File To Hive (Load Data) - Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements. (Source: File System; Target: Hive)

IKM Hive Control Append - Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled. (Source: Hive; Target: Hive)

IKM Hive Transform - Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python. (Source: Hive; Target: Hive)

IKM File-Hive to Oracle (OLH) - Integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop. (Source: File System or Hive; Target: Oracle Database)

CKM Hive - Validates data against constraints. (Source: NA; Target: Hive)

RKM Hive - Reverse engineers Hive tables. (Source: Hive Metadata; Target: NA)

5.2 Overview of Hands on Exercise


So far in this workshop we have been loading data into HDFS using a cumbersome command line utility, one file at a
time. We also viewed results from within HDFS using a command line utility. Although this is fine for smaller
jobs, it would be a good idea to integrate the moving of data with an ETL tool such as Oracle Data
Integrator (ODI). In this exercise we will see how Oracle Data Integrator integrates seamlessly with
Hadoop, and more specifically Hive.
In this exercise you will:
1. Use ODI to reverse engineer a Hive table
2. Use ODI to import a text file into a Hive table
3. Use ODI to move data from a Hive table into the Oracle Database
4. Use ODI to move data from a Hive table to another table with check constraints

5.3 Setup and Reverse Engineering in ODI


1. All of the setup for this exercise can be done from the terminal, hence open a terminal by double
clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for this exercise are, type in the terminal:

cd /home/oracle/exercises/odi
Then press Enter


3. Next we need to run a script to set up the environment for the next exercise. We will be loading the
same data as in the previous exercise (the dividends table), only this time we will be using ODI to perform
this task. For this we need to drop that table and recreate it so it is empty for the import. Also we
need to start the Hive server to enable ODI to communicate with Hive. We have a script which will
perform both tasks. Go to the terminal and type:

./setup.sh
Then press Enter

4. Next we need to start Oracle Data Integrator. Go to the terminal and type

./startODI
Then press Enter


5. Once ODI opens we need to connect to the repository. In the right upper corner of the screen
click on Connect To Repository…

6. In the dialog that pops up all of the connection details should already be configured.

Login Name: DEFAULT_LOGIN1


User: SUPERVISOR
Password: Welcome1

If all of the data is entered correctly you can simply click OK.


7. Once you login make sure you are on the Designer Tab. Near the top of the screen on the left side
click on Designer.

8. Near the bottom of the screen on the left side there is a Models tab; click on it.

You will notice that we have already created a File, Hive, and Oracle model for you. These were
pre-created to reduce the number of steps in the exercise. For details on how to use flat files and the
Oracle database with ODI, please see the excellent Oracle by Example tutorials found at
http://www.oracle.com/technetwork/tutorials/index.html.

9. The first feature of ODI we would like to show involves reverse engineering a data store. The
reverse engineering function takes a data store and finds all of its tables and their structure
automatically. In the Models tab on the left side of the screen there is a Hive model. Let's click on
the + to expand it out.

10. You will notice there is no information about the data that is stored in that particular location. Right
click on the Hive folder and select Reverse Engineer


11. You will see two items appear in the Hive folder, called dividends and dividends2. These are the tables
we created in Hive. You can click on the + beside dividends to see some more information.


12. You can also expand the Columns folder to see all of the columns in this particular table.

You will see the columns created in step 3 displayed.


This is the power of the Hive Reverse Engineering Knowledge Module (RKM) integrated in Oracle
Data Integrator. Once you define a data store (in our case a Hive source) the RKM will
automatically discover all tables and their corresponding columns available at that source. Once a
data model is created there is no need to rewrite it in ODI; ODI will automatically discover that
model for you, so you can get straight to the development of the data movement.

5.4 Using ODI to import text file into Hive


1. Now that we have configured our models we can begin creating interfaces to perform ETL tasks.
Near the top left corner of the screen in the Projects tab there is an icon with 3 squares. Click on
it and select New Project.


2. In the new window that opened up on the right side of the screen enter the following information:

Name: Hadoop
Code: HADOOP

Then click on the Save All in the right upper corner of the screen.

3. In the left hand menu in the Projects section a new item appeared called Hadoop. Click on the + to
expand it out


4. Next to the folder called First Folder there is another +; expand out that folder as well by clicking
the +.

5. Right click on the Interfaces item and select New Interface.


6. We can now start to define the new interface. In this interface we will map out the columns in the
text file and move the data into the hive table. To start out let’s give the interface a name. In the
new tab that opened on the right side of the screen type in the following information.

Name: File_To_Hive

7. Next we need to move to the mapping tab of the File_To_Hive interface. Click on Mapping at the
bottom of the screen.


8. We now need to define the sources and target data stores. On the left bottom of the screen in the
Models Section expand the File folder by clicking on the + beside it.

9. Now we can drag and drop the Dividends table from the File model into the source section of the
interface.


10. Next we will drag and drop the dividends Hive table into the target section of the interface.


11. A pop up window will appear which will ask if you would like to create automatic mapping. This will
try to automatically match source columns with target columns based on column name. Click on
Yes to see what happens.

12. By name it was able to match all of the columns. The mapping is now complete. Let's go back to the
Overview tab to set up one last thing. Click on the Overview tab on the left side of the screen.


In the definitions tab Tick the box Staging Area Different From Target.

13. A drop down menu below the tick now gets activated. Select File: Comments File


14. We can now click on the flow tab at the bottom of the screen to see what the interface will look like.

15. On the screen in the top right box you will see a diagram of the data flow. Let’s see all of the
options for the integration. Click on the Target(Hive Server) header.


16. At the bottom of the screen a new window appears, a Property Inspector. There you can inspect
and modify the configuration of the integration process. Let's change one of the properties. We don't
need a staging table, so let's disable it. Set the following option:

USE_STAGING_TABLE: false


Let’s now execute this Interface. Click on the Execute button at the top of the screen.

17. You will be asked to save your interface before running it. Click Yes

18. Next you will be asked for the Execution options. Here you can choose agents, contexts and other
elements. You can just accept the default options and click OK


19. An informational screen will pop up telling you the session has started. Click OK

20. We will now check if the execution of the interface was successful. In the left menu click on the
Operator Tab

21. In the menu on the left side make sure the Date tab is expanded. Then expand the Today folder


You will see a green checkmark beside the File_To_Hive execution. This means the integration
process was successful.

You have now successfully moved data from a flat file into a Hive table without touching the
terminal. All of the data was moved without a cumbersome command line interface, while allowing you
to use the full functionality of a powerful ETL tool.


22. You can now move back to the Designer tab in the left menu and close all of the open tabs on the
right side of the screen.

This prepares for the next exercise.

5.5 Using ODI to import Hive Table into Oracle


1. Another useful process ODI can perform is moving data from a Hive table into the Oracle database.
Once data processing has occurred with the Hive table you might want to move the data back into
your Oracle database for integration into your data warehouse. Let's move the data we just loaded
into our Hive table into an Oracle database. The first step is to create a new interface.
In the Projects tab in the left hand menu right click on Interfaces and select New Interface


2. On the right side of the screen a new window pops up. Enter the following name for the Interface

Name: Hive_To_Oracle

3. Next we will need to move to the Mapping tab to set up the interface. At
the bottom of the screen click on Mapping


4. To free up more viewing space let's clean up the Models tab in the bottom left part of the screen.
Minimize the File tab and the dividends table.

5. In the same tab (the Models tab) we now see the Oracle folder. Let's expand that out as we will
need the Oracle tables as our target. Click on the + beside the Oracle folder


6. We can now drag and drop the Hive dividends table into the sources window


7. Similarly you can drag the Oracle DIVIDENDS table into the destination window.


8. As before you will be asked if you would like to perform automatic mapping. Click on Yes

9. Unfortunately, due to a capitalization difference no mapping could be done automatically. We will
need to map the columns manually. Drag and drop each source column (from the source
window in the upper left part of the right tab) onto its corresponding target column (in the upper right
part of the right tab) using the following mapping:

exchange -> STOCK
dates -> DATES
dividend -> DIVIDEND


10. One of the advantages of an ETL tool can be seen when doing transformations during the data
movement. Let's concatenate the exchange and symbol into one string and load that into the
STOCK column in the database. Go to the Property Inspector screen of the STOCK column by
clicking on it in the targets window

11. The property inspector window should open at the bottom of the screen. In the implementation edit
box type the following

concat(DIVIDENDS.exchange, DIVIDENDS.symbol)


12. The transformation is now set up. Let's now go back to the Overview tab to configure the staging
area. Click on the Overview tab


13. In the Definition tab, tick the box Staging Area Different From Target.

14. A drop down menu below the tick now gets activated. Select Hive: Hive Store

15. We are now ready to run the interface. To run the interface go to the left upper corner of the screen
and click the execute button


16. A window will pop up telling you a save is required before the execution can continue. Just click
Yes

17. Another window will pop up asking you to configure the Execution Context and Agent. The default
options are fine, just click OK

18. A final window will pop up telling you the session has started. Click OK


19. Let’s now go to the Operator tab to check if the execution was successful. In the top left corner of
the screen click on the Operator tab

20. When you get to the Operator tab you might see a lightning bolt beside the Hive_To_Oracle
execution. This means the integration is still executing; wait a little bit until the checkmark appears.


The movement of data from Hive to Oracle has completed successfully.

21. One of the great features of ODI is that it allows you to look at the code that was executed as part of the
ETL process. Let's drill down and see some of the code executed. In the Operator tab click on the
+ next to the latest execution.


22. Continue to drill down by clicking on the + next to the 1- Hive_to_Oracle – <date>

23. You can now see all of the steps taken to perform this particular mapping. Let's investigate further
the fourth step in the process. Double click on 4 – Integration – Hive_To_Oracle – Create Hive
staging table to open up a window with its details.


24. In the window that opened up click on the Code tab


In this window you will see exactly what code was run. If an error occurs this information becomes
quite useful in debugging your transformations.

25. To check the data that is now in the Oracle database go back to the Designer tab, by going to the
left upper corner of the screen and clicking on Designer


26. Then in the Models section at the left bottom of the screen right click on the DIVIDENDS table in the
Oracle folder and select View Data

On the right side of the screen a new window will pop up with all of the data inside that table of the
Oracle database.


27. We can now go ahead and close all of the open windows in the right side of the screen to prepare
for the next exercise.

5.6 Using ODI to import Hive Table into Hive


1. The last ETL process we are going to show involves moving data from a Hive table into another
Hive table. Although this might sound a bit odd, there are many circumstances where you might
want to move data from one table to another while verifying the data for constraint violations or
transforming the data. Let's go ahead and create an interface for this type of transaction. In the
Project tab right click on Interfaces and select New Interface

2. As before let’s give the interface a name. In the new tab that opened on the right side of the screen
type in the following information.

Name: Hive_To_Hive


3. Next, at the bottom of the screen, let's go to the Mapping tab

4. We will first drag the Hive dividends table into the source window on the right


5. Next we will drag the dividends2 table into the target window on the right


6. You will be asked if you would like to perform auto mapping. Just click Yes

7. All of the mappings auto-complete without a problem. We now need to specify the Integration
Knowledge Module (IKM) which will be used to perform the integration. In ODI an IKM is the
engine which has all of the code templates for the integration process; hence without an
appropriate IKM the integration is not possible. In the previous section there was only one
appropriate IKM hence it was chosen automatically. In this case there are multiple possible IKMs so
we need to select one. In the left upper corner of the screen in the Designer window right click on
Knowledge Modules and select Import Knowledge Modules.


8. A window will pop up which will allow you to import Knowledge Modules. First we need to specify
the folder in which the Knowledge Modules are stored. Fill in the following information.

File import directory: /u01/ODI/oracledi/xml-reference


Then Press Enter


9. A list of different Knowledge Modules should appear in the space below. Scroll down until you find
the file(s) to import:

IKM Hive Control Append


Then press OK

10. An import report will pop up. Just click Close


11. Let's now add a constraint to the target table to see what happens during the data movement. In
the left bottom part of the screen in the Models window expand out the dividends2 store by
pressing the + beside it.


12. In the subsections that appear under dividends2 you will see a section called Constraints. Right
click on it and select New Condition.


13. On the right side a new window will open allowing you to define the properties of this condition. We
will set a check condition which will check if the dividends value is too low. Enter the following
information.

Name: Low_Dividend
Type: Oracle Data Integrator Condition
Where: dividends2.dividend>0.01
Message: Dividend Too Low

14. We now need to save our constraint. In the top left corner of the screen click on the Save button


15. We are now ready to run the Interface. In the top right section of the screen click back to our
interface by clicking on the Hive_To_Hive tab.


16. Now at the top of the screen we can click the Play button to run the interface


17. A new window pops up saying you need to save all of the changes before the interface can be run.
Just click Yes

18. A new window will pop up asking for the execution context, just click OK


19. An informational pop-up will show up telling you the execution has started. Simply click OK

20. It is now time to check our constraint. In the left bottom part of the screen (in the Models section)
right click on the dividends2 model, then go to the Control section and click on Check

21. This check is its own job that must be run; hence a window will pop up asking you to select a
context for the execution. The default options are good so just click OK


22. An informational window pops up telling you the execution has started. Just click OK

23. We can now see all of the rows that failed our check. Again in the left bottom part of the screen
(in the Models section) right click on the dividends2 model, go to the Control menu and select
Errors…


A new tab will pop up on the right side of the screen. You will see all of the rows which did not
pass the constraint.

24. This concludes the ODI section of the workshop. Go to the right upper corner of the screen and
click the X to close ODI.

25. Then in the terminal type exit to close it as well.


5.7 Summary

In this exercise you were introduced to Oracle's integration of ODI with Hadoop. It is worth noting that
this integration is only available for the Oracle database and only available from Oracle. It is a
custom extension for ODI developed by Oracle to allow users who already have ETL as part of their data
warehousing methodology to continue using the same tools and procedures with the new Hadoop
technologies.
It is quite important to note that ODI is a very powerful ETL tool which offers all of the
functionality typically found in an enterprise-quality ETL product. Although the examples given in this exercise are
quite simple, this does not mean the integration of ODI and Hadoop is. All of the power and functionality of
ODI is available when working with Hadoop: workflow definition, complex transformations, flow control and
multiple sources, to name just a few of ODI's capabilities, and they can all be used with Hadoop.
Through this exercise you were introduced to three Knowledge Modules of ODI: reverse
engineering for Hive, integration into Hive and integration from Hive to Oracle. These are not the only
knowledge modules available, and we encourage you to review the table available in section 5.2 of this
document to get a better idea of all the functionality currently available.


6. WORKING WITH EXTERNAL TABLES

6.1 Introduction to External Tables


Oracle Direct Connector runs on the system where Oracle Database runs. It provides read access to HDFS
from Oracle Database by using external tables.
An external table is an Oracle Database object that identifies the location of data outside of the database.
Oracle Database accesses the data by using the metadata provided when the external table was created.
By querying the external tables, users can access data stored in HDFS as if that data were stored in tables
in the database. External tables are often used to stage data to be transformed during a database load.
These are a few ways that you can use Oracle Direct Connector:

 Access any data stored in HDFS files

 Access CSV files and Data Pump files generated by Oracle Loader for Hadoop

 Load data extracted and transformed by Oracle Data Integrator


Oracle Direct Connector uses the ORACLE_LOADER access driver.

6.2 Overview of Hands on Exercise


This exercise will involve working with Oracle external tables. We will create 3 text files with some data in
each. We will upload these files into HDFS and connect them to the Oracle database using external tables.
The data within these files will then be accessible from within the Oracle database.
In this exercise you will:
1. Create and query external tables stored in HDFS
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the content of
these scripts type cat scriptName and the contents of the script will be displayed in the terminal

6.3 Configuring External Tables


1. All of the setup and execution for this exercise can be done from the terminal, hence open a
terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the external tables exercise are, type in the terminal:


cd /home/oracle/exercises/external
Then press Enter

3. The first step in this exercise is to create some random files. This is just so we have some data in
Hadoop to load as an external file. We will create three files called sales1, sales2 and sales3 with a
single row comprised of 3 numbers in each file. To create the files go to the terminal and type:

./createFiles.sh
Then press Enter

4. Next we will load these files into HDFS. We have a script for that process as well. Go to the
terminal and type:

./loadFiles.sh
Then press Enter

5. Next we will need to create the external table in Oracle. As the SQL code is quite long we have
written a script with that code. Since this is quite important, let's look at what that code looks like. In
the terminal type:

gedit createTable.sh
Then press Enter


Looking at the code for creating the table you will notice syntax very similar to other types of
external tables except for 2 lines: the preprocessor and the type, highlighted in the image below

6. When you are done evaluating the code you can close the window by clicking the X in the right
upper corner of the window


7. Let’s go ahead now and run that piece of code. In the terminal type:

./createTable.sh
Then press Enter

8. Now that the table is created we need to connect that table with the files we loaded into HDFS. To
make this connection we must run a Hadoop job which calls Oracle loader code. Go to the terminal
and type:

./connectTable.sh
Then press Enter


9. You will be asked to enter a password for the code to be able to log in to the database user. Enter
the following information

[Enter Database Password:]: tiger


Then Press Enter

NOTE: No text will appear while you type


10. We can now use SQL from Oracle to read those files in HDFS. Let's experiment with that. First we
connect to the database using SQL*Plus. Go to the terminal and type:

sqlplus scott/tiger
Then press Enter


11. Now let’s query that data. Go to the terminal and type:

select * from sales_hdfs_ext_tab;


Then press Enter


The query returns the data that is in all three files.
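
If you prefer to run the same query from application code instead of SQL*Plus, a minimal JDBC sketch is shown below. It assumes the Oracle JDBC driver is on the classpath, a thin connect string of jdbc:oracle:thin:@localhost:1521:orcl (adjust the SID for your VM) and the three-column layout produced by createTable.sh; none of this is part of the original lab scripts.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Runs the same SELECT against the external table from Java.
// The connect string (SID orcl) and scott/tiger credentials are assumptions
// mirroring the workshop VM defaults used above.
public class QueryExternalTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:orcl", "scott", "tiger");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from sales_hdfs_ext_tab")) {
            while (rs.next()) {
                // Assumes the three numeric columns created by createTable.sh.
                System.out.println(rs.getString(1) + " "
                        + rs.getString(2) + " "
                        + rs.getString(3));
            }
        }
    }
}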


12. This concludes this exercise. You can now exit SQL*Plus. Go to the terminal and type:

exit;
Then press Enter


13. Then close the terminal. Go to the terminal and type:

exit
Then press Enter


6.4 Summary
In this chapter we showed how data in HDFS can be queried using standard SQL right from the Oracle
database. With the data stored in HDFS, all of the parallelism and striping that naturally occur are taken
full advantage of, while at the same time you can use all of the power and functionality of the Oracle
Database.
When implementing this method, parallel processing of the data is extremely important when
working with large volumes of data. When using external tables, consider enabling parallel query with this
SQL command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading data into Oracle Database from the external files created by Oracle Direct Connector,
enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when inserting data.


7. WORKING WITH MAHOUT

7.1 Introduction to Mahout


Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable
machine learning algorithms on the Hadoop platform. Mahout is a work in progress; the number of
implemented algorithms has grown quickly, but there are still various algorithms missing.
While Mahout's core algorithms for clustering, classification and batch based collaborative filtering are
implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict contributions to
Hadoop based implementations. Contributions that run on a single node or on a non-Hadoop cluster are
also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was
originally a separate project, and can run stand-alone without Hadoop.
Currently Mahout supports mainly four use cases: Recommendation mining takes users' behaviour and
from that tries to find items users might like. Clustering takes e.g. text documents and groups them into
groups of topically related documents. Classification learns from existing categorized documents what
documents of a specific category look like and is able to assign unlabelled documents to the (hopefully)
correct category. Frequent item set mining takes a set of item groups (terms in a query session, shopping
cart content) and identifies which individual items usually appear together.

7.2 Overview of Hands on Exercise


In this exercise you will be using the K-means algorithm to cluster data using Mahout's implementation of
K-means. To give a bit of background, K-means is an algorithm which clusters data, and despite its
simplistic nature it can be quite powerful. The algorithm takes two inputs: a series of input values (v) and
the number of groups those values need to be split into (k). The algorithm first randomly picks k centres to
represent the centre of each group, then continuously moves those centres so that the distance from the
centre to every point in that group is as small as possible. Once the centres reach a point where any
movement would just increase the distance to all of the points, the algorithm stops. This is a great algorithm
for finding patterns in data where you have no prior information about what patterns are present. Given its
power, the K-means algorithm is quite expensive computationally; hence using a massively distributed
computation cluster such as Hadoop offers a great advantage when dealing with very large data sets. This is
exactly what we will be experimenting with in this exercise.
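
Stated a little more formally (a standard restatement of the idea above, not part of the original lab text): given points v_1, ..., v_n and a number of groups k, K-means looks for groups C_1, ..., C_k with centres mu_j that minimize

\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{v_i \in C_j} \lVert v_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{v_i \in C_j} v_i

It does so by alternating between assigning every point to its nearest centre and recomputing each centre as the mean of its group, stopping once the assignments no longer change.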
In this exercise you will
1. Use mahout to cluster a large data set
2. Use the graphic library in Java to visualize a mahout k-means cluster

7.3 Clustering with K-means


1. All of the setup and execution for this exercise can be done from the terminal, hence open a
terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the Mahout exercise are, type in the terminal:

cd /home/oracle/exercises/mahout
Then press Enter

3. To get an idea of what our data file looks like let's look at the first row. In the terminal
type:

head -n 1 synthetic_control.data
Then press Enter

As you can see on the screen, all there is in the file are random data points. It is within this field of
data points that we would like to find patterns.


4. The first step in analyzing this data is loading it into the HDFS. Let’s go ahead and do that. Go to
the terminal and type:

./loadData.sh
Then press Enter

5. Now that the data is loaded we can run Mahout against the data. In the example we are
running, the data is already in vector form and a distance function has already been compiled
into the example. When clustering your own data, the command line for running the clustering
should include the distance function written and compiled in Java. Go to the terminal and type:


This would be an excellent time to get a cup of coffee. The clustering is quite computationally
intensive and should take a couple of minutes to complete.

6. Once you get back to the command prompt the clustering is done, but the results are stored in binary
format inside Hadoop. We need to first bring all of the results out of Hadoop and then convert the
data from binary format to text format. We have a script which will perform both tasks. Let's run that
script; go to the terminal and type:

./extractData.sh
Then press Enter


7. We can now go ahead and look at the results of the clustering. We will look at the text output of the
results. Go to the terminal and type:

gedit Clusters
Then press Enter


The output is not very user friendly but there are several indicators to look for, as follows:

n= the number of points in the cluster
c= the centre of the cluster
r= the radius of the circle which defines the cluster
Points= the data points in each cluster


8. Once you are done evaluating the results you can click the X in the right upper corner of the screen
to close the window.


9. Despite the highlighting, the data points are not very easy to interpret. Mahout also has some graphing
functions for simple data points. We will run a much simpler clustering with points that can be
displayed on an X,Y plane to visually see the results. Go to the terminal and type:

./displayClusters
Then press Enter


A new window will pop up with a visual display of a K-means cluster. The black squares represent
data points and the red circles define the clusters. The yellow and green lines represent the error
margin for each cluster.

10. Once you are done evaluating the image you can click the X in the right upper corner of the
window to close it.


11. This concludes our mahout exercise. You can now close the terminal window. Go to the terminal
and type:

exit
Then press Enter


7.4 Summary
In this exercise you were introduced to the K-means clustering algorithm and how to run the algorithm using
Mahout, and hence on a Hadoop cluster. It is important to note that Mahout does not only focus on K-
means but also has many different algorithms in the categories of Clustering, Classification, Pattern Mining,
Regression, Dimension reduction, Evolutionary Algorithms, Recommendation/Collaboration Filtering and
Vector Similarity. Most of these algorithms have special variants which are optimized to run on a massively
distributed infrastructure (Hadoop) to allow for rapid results on very large data sets.


8. PROGRAMMING WITH R

8.1 Introduction to Enterprise R


R is a language and environment for statistical computing and graphics. It is a GNU project which is similar
to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S.
There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series
analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The S language
is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route
to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced,
including mathematical symbols and formulae where needed. Great care has been taken over the defaults
for the minor design choices in graphics, but the user retains full control.
Oracle R Enterprise integrates the open-source R statistical environment and language with Oracle
Database 11g, Exadata, Big Data Appliance, and Hadoop massively scalable computing. Oracle R
Enterprise delivers enterprise-level advanced analytics based on the R environment.
Oracle R Enterprise allows analysts and statisticians to leverage existing R applications and use the R
client directly against data stored in Oracle Database 11g—vastly increasing scalability, performance and
security. The combination of Oracle Database 11g and R delivers an enterprise-ready, deeply integrated
environment for advanced analytics. Data analysts can also take advantage of analytical sandboxes, where
they can analyze data and develop R scripts for deployment while results stay managed inside Oracle
Database.
As an embedded component of the RDBMS, Oracle R Enterprise eliminates R’s memory constraints since
it can work on data directly in the database. Oracle R Enterprise leverages Oracle’s in-database analytics
and scales R for high-performance in Exadata and the Big Data Appliance. Being part of the Oracle
ecosystem, ORE enables execution of R scripts in the database to support enterprise production
applications and OBIEE dashboards, both for structured results and graphics. Since it’s R, we’re able to
leverage the latest R algorithms and contributed packages.
Oracle R Enterprise users not only can build models using any of the data mining algorithms in the CRAN
task view for machine learning, but also leverage in-database implementations for predictions (e.g.,
stepwise regression, GLM, SVM), attribute selection, clustering, feature extraction via non-negative matrix
factorization, association rules, and anomaly detection.

8.2 Overview of Hands on Exercise


In this exercise you will be introduced to the R programming language as well as the enhancements Oracle
has brought to the programming language. Limitations where all data must be kept in system memory are
now gone, as you can save and load data to and from both the Oracle database and HDFS. To exemplify
the uses of R we will be doing K-means clustering again, as in Exercise 7, this time using the R programming
language. If you would like a review of K-means please see the introduction to section 7.
In this exercise you will:
1. Generate a set of random data points
2. Save the data in both the Oracle database and HDFS


3. View the data in Oracle and HDFS


4. Load the data from Oracle back into R
5. Perform K-means clustering on the data points
6. View the results

8.3 Taking data from R and inserting it into the database


1. All of the setup and execution for this exercise can be done from the terminal, hence open a
terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the R exercises are, type in the terminal:

cd /home/oracle/exercises/R
Then press Enter

3. To work with R you can write scripts for the interpreter to execute or you can use the interactive
shell environment. To get a more hands on experience with R we will use the interactive shell. To
start the interactive shell go to the terminal and type:

R
Then press Enter

4. During the login process many different libraries load which extend the functionality of R. If a particular
library is not loaded automatically one can load it manually after login. We will need to load a library
to interface with HDFS so let's load that now. Go to the R shell and type:


library(ORCH)
Then press Enter

5. Now let's go ahead and generate some pseudo-random data points so we have some data to play
with. We will generate 2D data points so we can easily visualize the data. Go to the R terminal and
type:

myDataPoints = rbind(matrix(rnorm(100, mean=0, sd=0.3), ncol=2),
                     matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
Then press Enter


Now the variable myDataPoints will have some data points in it.

6. To be able to save data into the database or HDFS you need to have the data in columns (as we
already do) and you also need to have each of the columns labeled. This is because column
names are required within a database to be able to identify the columns. Let's go ahead and label
the columns x and y. Go to the R terminal and type:

colnames(myDataPoints) <- c("x", "y")


Then press Enter


7. We can now create a data frame which will load the data into the Oracle Database. Go to the
terminal and type:

ore.create(as.data.frame(myDataPoints, optional = TRUE), table="DATA_POINTS")
Then press Enter


8. If required we can even load this data into HDFS. Let’s go ahead and do that. Go to the R terminal
and type:

hdfs.put(DATA_POINTS, dfs.name='data_points')
Then press Enter


9. Now that we have loaded the data into both the database and HDFS let's exit R and look at that
data. Go to the R shell and type:

q()
Then press Enter


10. You will be asked if you want to save workspace image. Type:

n
Then press Enter

Note: when typing n the information typed does not appear on the screen.


11. At this point, in classic R, all data and calculated results would be wiped from memory and hence
lost. With Oracle R Enterprise we saved our data in the database, so let's go and query that data. Go
to the terminal and type:

./queryDB.sh
Then press Enter


On the screen you will see the table displayed which contains our data points.


12. We can also look at the data we stored inside HDFS. Go to the terminal and type:

./queryHDFS.sh
Then press Enter


Again on the screen you will see all of the data points displayed.


As you can see all of the work done in R can now be exported to the database or HDFS for further
processing based on business needs.

8.4 Taking data from the database and using it in R for clustering


1. Data can not only be pushed out to the database but it can also be retrieved from the database or
HDFS to be used within R. Let’s see how that is done. First let’s go back into the R environment.
Go to the terminal and type:

R
Then press Enter


2. Let’s now go ahead and load the data from the Oracle database. Go to the R shell and type:

myData=ore.pull(DATA_POINTS)
Then press Enter


3. Now that we have our data inside R we can manipulate the data. Let's do K-means
clustering on the data. Go to the R shell and type:

cl <- kmeans(myData, 2)
Then Press Enter


4. The clustering is now done, but displaying the data in text format is not very interesting. Let's graph
the data. Go to the R terminal and type:

plot(myData, col = cl$cluster)


Then press Enter


5. A new window pops up with the data. The two colors (red and black) differentiate the two clusters
we asked the algorithm to find. We can even see where the cluster centers are. Go back to the R
shell. The terminal might be hidden behind the graph; move the windows around until you find the
terminal, then type:

points(cl$centers, col=1:2, pch = 8, cex=2)


Then press Enter


When you go back to the graph you will see the centers marked with a * and the points marked with
circles: raw random data clustered using the K-means algorithm.


6. When you are done evaluating the image you can click on the X in the right upper corner of the
window.


7. You can also close the R terminal by going to the R shell and typing:

q()
Then press Enter


8. When asked if you want to save workspace image go to the terminal and type:

n
Then Press Enter


9. This concludes this exercise. You can now go ahead and close the terminal. Go to the terminal and
type:

exit
Then press Enter


8.5 Summary
In this exercise you were introduced to the R programming language and how to do clustering using the
programming language. You also saw one of the advantages of Oracle R Enterprise, where you can
save your results into the Oracle database as well as extract data from the database for further
calculations. Oracle R Enterprise also has a small set of functions which can be run on data directly
in the database. This enables the user to use very large data sets which would not fit into
the normal memory of R.
Oracle R Enterprise provides these collections of functions:

 ore.corr
 ore.crosstab
 ore.extend
 ore.freq
 ore.rank
 ore.sort
 ore.summary
 ore.univariate


9. ORACLE NOSQL DATABASE

9.1 Introduction To NoSQL


Oracle NoSQL Database provides multi-terabyte distributed key/value pair storage that offers scalable
throughput and performance. That is, it services network requests to store and retrieve data which is
organized into key-value pairs. Oracle NoSQL Database services these types of data requests with a
latency, throughput, and data consistency that is predictable based on how the store is configured.
Oracle NoSQL Database offers full Create, Read, Update and Delete (CRUD) operations with adjustable
durability guarantees. Oracle NoSQL Database is designed to be highly available, with excellent throughput
and latency, while requiring minimal administrative interaction.
Oracle NoSQL Database provides performance scalability. If you require better performance, you use more
hardware. If your performance requirements are not very steep, you can purchase and manage fewer
hardware resources.
Oracle NoSQL Database is meant for any application that requires network-accessible key-value data with
user-definable read/write performance levels. The typical application is a web application which is servicing
requests across the traditional three-tier architecture: web server, application server, and back-end
database. In this configuration, Oracle NoSQL Database is meant to be installed behind the application
server, causing it to either take the place of the back-end database, or work alongside it. To make use of
Oracle NoSQL Database, code must be written (using Java) that runs on the application server.

9.2 Overview of Hands on Exercise


In this exercise you will be experimenting with the Oracle NoSQL database. Most of the exercises will have
you look at pre-written Java code and then compile and run that code. Ensure you understand the
code and all of its nuances as it is what makes up the NoSQL database interface. If you would like to
understand all of the functions that are available, there is a javadoc available on the Oracle web site.
In this exercise you will
1. Insert and retrieve a simple key value pair from the NoSQL database
2. Experiment with the multiget functionality to retrieve multiple values at the same time
3. Integrate NoSQL with Hadoop code to do word count on data in the NoSQL database

9.3 Insert and retrieve Key-Value pairs


1. All of the setup and execution for this exercise can be done from the terminal, hence open a
terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the NoSQL exercise are, type in the terminal:

cd /home/oracle/exercises/noSQL
Then press Enter

3. Before we do anything with the NoSQL database we must first start it. So let's go ahead and do that.
Go to the terminal and type:

./startNoSQL.sh
Then press Enter

4. To check if the database is up and running we can do a ping on the database. Let’s do that. Go to
the terminal and type:

./pingNoSQL.sh
Then press Enter

You will see Status: RUNNING displayed within the text. This shows the database is running.


5. Oracle NoSQL database uses a Java interface to interact with the data. This is a dedicated Java
API which will let you insert, update, delete and query data in the Key-Value store that is the
NoSQL database. Let's look at a very simple example of Java code where we insert a Key-Value
pair into the database and then retrieve it. Go to the terminal and type:

gedit Hello.java
Then press Enter


A new window will pop up with the code. In this code there are a couple of things to be noted. We
see the config variable, which holds our connection string, and the store variable, which is our
connection handle to the database. They are the initialization variables for the Key-Value store and are
highlighted in yellow. Next we define 2 variables of type Key and Value, which will serve as our
payload to be inserted. These are highlighted in green. Next we have, highlighted in purple, the
actual insert command. Highlighted in blue is the retrieve command for getting data out of the
database.
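
For reference, a minimal sketch of the same put/get pattern written against the oracle.kv API is shown below. It is not the workshop's exact Hello.java: the class name HelloSketch is made up, and the store name kvstore and helper host localhost:5000 are assumptions based on typical KVLite defaults, so adjust them to match your environment.

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

// Minimal put/get round trip against Oracle NoSQL Database.
// Store name and helper host are assumed KVLite defaults.
public class HelloSketch {
    public static void main(String[] args) {
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        Key key = Key.createKey("Hello");
        Value value = Value.createValue("Big Data World".getBytes());

        store.put(key, value);            // insert the key-value pair
        ValueVersion vv = store.get(key); // read it back

        System.out.println(key.getMajorPath().get(0) + " "
                + new String(vv.getValue().getValue()));
        store.close();
    }
}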

6. When you are done evaluating the code press the X in the right upper corner of the window to
close it.


7. Let’s go ahead and compile that code. Go to the terminal and type:

javac Hello.java
Then press Enter


8. Now that the code is compiled let's run it. Go to the terminal and type:

java Hello
Then press Enter


You will see printed on the screen Hello Big Data World, which is the key and the value we inserted
into the database.

9. Oracle NoSQL database has the possibility of having major and minor components to the key.
This feature can be very useful when trying to group and retrieve multiple items at the same time
from the database. In the next code we have 2 major components to the key (Mike and Dave) and
each major component has minor components (Question and Answer). We will insert a value for
each key, but we will use a multiGet function to retrieve all of the values for Mike regardless of the
minor component of the key, and completely ignore Dave. Let's see what that code looks like. Go
to the terminal and type:

gedit Keys.java
Then press Enter


10. A new window will pop up with the code. If you scroll to the bottom you will see the following
piece of code. Highlighted in purple are the insertion calls which add the data to the database. The
retrieval of multiple records is highlighted in blue, and the green shows the display of the
retrieved data. Do note that 4 Key-Value pairs were inserted into the database.
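
As a rough sketch of what such major/minor key handling and a multiGet call can look like with the oracle.kv API, consider the following. The class name, helper method, store name, host and value strings are illustrative assumptions and not the workshop's exact Keys.java, but the calls to Key.createKey, put and multiGet show the pattern being described.

import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

import oracle.kv.Depth;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class KeysSketch {
    public static void main(String[] args) {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));

        // Four key-value pairs: two minor components under each major component.
        put(store, "Mike", "Question", "What is a key-value store?");
        put(store, "Mike", "Answer", "A distributed map of keys to values.");
        put(store, "Dave", "Question", "Ignored by the multiGet below.");
        put(store, "Dave", "Answer", "Also ignored.");

        // Retrieve every record whose major component is "Mike",
        // regardless of its minor component.
        SortedMap<Key, ValueVersion> rows = store.multiGet(
                Key.createKey("Mike"), null, Depth.PARENT_AND_DESCENDANTS);
        for (Map.Entry<Key, ValueVersion> e : rows.entrySet()) {
            System.out.println(e.getKey().getMinorPath() + " -> "
                    + new String(e.getValue().getValue().getValue()));
        }
        store.close();
    }

    private static void put(KVStore store, String major, String minor, String text) {
        Key key = Key.createKey(Arrays.asList(major), Arrays.asList(minor));
        store.put(key, Value.createValue(text.getBytes()));
    }
}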


11. When you are done evaluating the code press the X in the right upper corner of the window to
close it.


12. Let's go ahead and compile that code. Go to the terminal and type:

javac Keys.java
Then press Enter


13. Now that the code is compiled let's run it. Go to the terminal and type:

java Keys
Then press Enter


You will see the 2 values that are stored under the Mike major key displayed on the screen, and no
data points for the Dave major key. Major and minor parts of the key can actually be composed of
multiple strings and further filtering can be done. This is left up to the participants to experiment with.


14. The potential of a Key-Value store grows significantly when integrated with the power of Hadoop
and distributed computing. Oracle NoSQL database can be used as a source and target for the
data used by and produced by Hadoop. Let's look at a modified example of word count run in
Hadoop; only this time we will count the number of values under the major component of the key in
the NoSQL database. To see the code go to the terminal and type:

gedit Hadoop.java
Then press Enter


The code you see is very similar to the Word Count seen in the first section of the workshop. There
are only 2 differences. The first (highlighted in yellow) is the retrieval of data from the
NoSQL database rather than a flat file.


The second can be seen if you scroll down into the run function and notice that the InputFormatClass is
now KVInputFormat
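
A heavily abridged sketch of the corresponding job setup is shown below. The input format class comes straight from the point above; the setKVStoreName and setKVHelperHosts calls follow the Oracle NoSQL Database Hadoop sample of this era and, together with the store name and helper host, are assumptions to verify against the javadoc for your release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import oracle.kv.hadoop.KVInputFormat;

public class KeyCountJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "NoSQL key count");
        job.setJarByClass(KeyCountJobSketch.class);

        // Read input records from the KV store instead of HDFS files.
        job.setInputFormatClass(KVInputFormat.class);
        KVInputFormat.setKVStoreName("kvstore");                            // assumed store name
        KVInputFormat.setKVHelperHosts(new String[] { "localhost:5000" });  // assumed helper host

        // Mapper and Reducer classes are omitted in this sketch; they would
        // count occurrences of the major key component, like the classic word count.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}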

15. When you are done evaluating the code press the X in the right upper corner of the window to
close it.


16. Let's go ahead and run that code. We will need to go through the entire procedure of the first
exercise where we compile the code, create a jar and then execute it on the Hadoop cluster. We have
written a script which will do all of that for us. Let's run that script; go to the terminal and type:

./runHadoop.sh
Then press Enter


17. You will see a Hadoop job being executed with all of the terminal output it comes with. Once the
execution is done it is time to see the results. We will just cat the results directly from HDFS. Go to the
terminal and type:

./viewHadoop.sh
Then press Enter


You will see, displayed on the screen, a count based on the major component of the keys in the
NoSQL database. In the previous exercises we inserted 2 pieces of data under each of the major keys
Dave and Mike. We also inserted a Hello key in the first exercise. This is exactly the data the count
displays.


18. That concludes our exercises on the NoSQL database. It is time to shut down our NoSQL database. Go
to the terminal and type:

./stopNoSQL
Then press Enter


19. We can now close our terminal window. Go to the terminal and type:

exit
Then press Enter


9.4 Summary

In this exercise you were introduced to Oracle's NoSQL database. You saw how to insert and retrieve key-
value pairs as well as the multiGet function, where multiple values can be retrieved under the same major
component of a key. The last example showed how a NoSQL database can be used as a source for a
Hadoop job and how the two technologies can be integrated.
It is important to note here the differences between the NoSQL database and a traditional RDBMS. With
relational data the queries performed are much more powerful and more complex, while NoSQL simply
stores and retrieves values for a specific key. Given that simplicity in storage model, NoSQL has a
significant performance and scaling advantage. A NoSQL database can store petabytes worth of
information in a distributed cluster and still maintain very good performance on data interaction at a much
lower cost per megabyte of data. NoSQL has many uses and has been implemented successfully in many
different circumstances, but at the same time it does not mimic or replace the use of a traditional RDBMS.


APPENDIX A

A.1 Setup of a Hive Data Store

1. Once we are connected to ODI we need to set up our models: the logical and physical definitions of
our data sources and targets. To start off, at the top of the screen click on Topology.

2. Next in the left menu make sure you are on the Physical Architecture tab and expand the
Technologies list


3. In the expanded list find the folder Hive and expand it


4. In this folder we need to create a new Data Server. Right click on the Hive Technology and select
New Data Server

5. A new tab will open on the right side of the screen. Here You can define all of the properties of this
data server. Enter the following details:

Name: Hive Server

Then click on the JDBC tab in the left menu


6. On the right of the JDBC Driver field click on the Magnifying Glass to select the JDBC Driver

7. A new window will pop up which will allow you to select from a list of drivers. Click on the Down
Arrow to see the list


8. From the list that appears select Apache Hive JDBC Driver.

9. Now click OK to close the window


10. Back at the main window enter the following information

JDBC Url: jdbc:hive://bigdatalite.us.oracle.com:10000/default
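
As an optional sanity check outside ODI, the same URL can be exercised with a small JDBC program. The sketch below assumes the HiveServer1-style driver class org.apache.hadoop.hive.jdbc.HiveDriver (which matches the jdbc:hive:// URL above) and that the Hive JDBC jars are on the classpath; it simply lists the tables in the default database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to Hive with the same URL ODI will use and lists the tables.
// Driver class and empty credentials are assumptions matching a
// HiveServer1-style jdbc:hive:// endpoint.
public class HiveConnectionCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive://bigdatalite.us.oracle.com:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}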

11. We need to set some Hive-specific variables. In the menu on the left now go to the Flexfields tab


12. In the Flexfields tab uncheck the Default check box and write the following information:

Value: thrift://localhost:10000

Don’t forget to press Enter when done typing to set the variable

13. It is now time to test to ensure we set everything up correctly. In the left upper corner of the right
window click on Test Connection


14. A window will pop up asking if you would like to save your data before testing. Click OK

15. An informational message will pop up asking to register a physical schema. We can ignore this
message as that will be our next step. Just click OK

16. You need to select an agent to use for the test. Leave the default

Physical Agent: Local(No Agent)

Then click Test


17. A window should pop up saying Successful Connection. Click OK

If any other message is displayed, please ask for assistance to debug. It is critical for the remainder of this exercise that this connection is fully functional.

18. Now, in the menu on the left side of the screen, in the Hive folder, there should be a physical server called Hive Server. Right click on it and select New Physical Schema.

19. A new tab will again open on the right side of the screen to enable you to define the details of the
Physical Schema. Enter the following details.

Schema (Schema): default


Schema (Work Schema): default

20. Then click Save All in the upper left part of the screen

21. A warning will appear about no Context being specified. Again, this will be addressed in the next step. Just click OK

22. We now need to go to the Logical Architecture tab in the left menu. Toward the bottom left of the screen you will see the Logical Architecture tab; click on it.

23. In the Logical Architecture tab you will need to again find the Hive folder and click on the + to
expand it.

24. Now to create the logical store, right click on the Hive Folder and select New Logical Schema.

25. In the new window that opens on the right of the screen, enter the following information:

Name: Hive Store


Context: Global
Physical Schemas: Hive Server.default

26. This should set up the Hive data store to enable us to move data into and out of Hive with ODI. We now need to save all of the changes we made. In the upper left corner of the screen, click on the Save All button.

27. We can close all of the tabs we have opened on the right side of the screen. This will help in
reducing the clutter. Click on the X for all of the windows.

In principle we would need to repeat the preceding steps for each different type of data store. As the procedure is almost the same, a flat file source and an Oracle database target have already been set up for you; this reduces the number of steps in this exercise. For details on how to use flat files and Oracle databases with ODI, please see the excellent Oracle by Example tutorials found at http://www.oracle.com/technetwork/tutorials/index.html.

28. We now need to go to the Designer Tab in the left menu to perform the rest of our exercise. Near
the top of the screen on the left side click on the Designer tab.

29. Near the bottom of the screen on the left side there is a Models tab; click on it.

30. You will notice there are already File and Oracle models created for you. These were pre-created as per the note above. Let's now create a model for the Hive data store we just created. In the middle of the screen in the right panel there is a folder icon next to the word Models. Click on the Folder icon and select New Model…

31. In the new tab that appears on the right side enter the following information:

Name: Hive
Code: HIVE
Technology: Hive
Logical Schema: Hive Store

32. We can now go to the upper left corner of the screen and save this Model by clicking on the Save All icon.
