
Big Data Workshop

Lab Guide

http://www.oracle-developer-days.com
Copyright 2012, Oracle and/or its affiliates. All rights reserved


TABLE OF CONTENTS
Big Data Workshop Lab Guide......................................................................................................... i
1. Introduction................................................................................................................................... 4
2. Hadoop Hello World...................................................................................................................... 7
2.1 Introduction to Hadoop............................................................................................................ 7
2.2 Overview of Hands on Exercise..............................................................................................8
2.3 Word Count............................................................................................................................. 8
2.4 Summary............................................................................................................................... 22
3. Pig Exercise................................................................................................................................ 23
3.1 Introduction to Pig................................................................................................................. 23
3.2 Overview Of Hands On Exercise........................................................................................... 23
3.3 Working with PIG.................................................................................................................. 23
3.4 Summary............................................................................................................................... 43
4. Hive Coding................................................................................................................................ 44
4.1 Introduction to Hive............................................................................................................... 44
4.2 Overview Of Hands On Exercise........................................................................................... 44
4.3 Queries with Hive.................................................................................................................. 44
4.4 Summary............................................................................................................................... 55
5. Oracle ODI and Hadoop............................................................................................................. 56
5.1 Introduction To Oracle Connectors........................................................................................ 56
5.2 Overview of Hands on Exercise............................................................................................ 57
5.3 Setup and Reverse Engineering in ODI................................................................................57
5.4 Using ODI to import text file into Hive...................................................................................64
5.5 Using ODI to import Hive Table into Oracle...........................................................................77
5.6 Using ODI to import Hive Table into Hive..............................................................................93
5.7 Summary............................................................................................................................. 109
6. Working with External Tables.................................................................................................... 110
6.1 Introduction to External Tables............................................................................................ 110
6.2 Overview of Hands on Exercise..........................................................................................110
6.3 Configuring External Tables.................................................................................................110
6.4 Summary............................................................................................................................. 120
7. Working with Mahout................................................................................................................ 121
7.1 Introduction to Mahout........................................................................................................ 121
7.2 Overview of Hands on Exercise.......................................................................................... 121
7.3 Clustering with K-means..................................................................................................... 121


7.4 Summary............................................................................................................................. 131


8. Programming with R................................................................................................................. 132
8.1 Introduction to Enterprise R................................................................................................ 132
8.2 Overview of Hands on Exercise.......................................................................................... 132
8.3 Taking data from R and inserting it into the database...........................................133
8.4 Taking data from database and using it in R and clustering................................................144
8.5 Summary............................................................................................................................. 154
9. Oracle NoSQL Database.......................................................................................................... 155
9.1 Introduction To NoSQL........................................................................................................ 155
9.2 Overview of Hands on Exercise.......................................................................................... 155
9.3 Insert and retrieve Key Value pairs...................................................................155
9.4 Summary............................................................................................................................. 175
Appendix A................................................................................................................................... 176
A.1 Setup of a Hive Data Store................................................................................................. 176


1. INTRODUCTION
Big data is not just about managing petabytes of data. It is also about managing large numbers of
complex unstructured data streams which contain valuable data points. However, which data
points are the most valuable depends on who is doing the analysis and when they are doing the
analysis. Typical big data applications include: smart grid meters that monitor electricity usage in
homes, sensors that track and manage the progress of goods in transit, analysis of medical
treatments and drugs that are used, analysis of CT scans etc. What links these big data
applications is the need to track millions of events per second, and to respond in real time. Utility
companies will need to detect an uptick in consumption as soon as possible, so they can bring
supplementary energy sources online quickly. Probably the fastest growing area relates to location
data being collected from mobile always-on devices. If retailers are to capitalise on their customers' location data, they must be able to respond as soon as the customer steps through the door.
In the conventional model of business intelligence and analytics, data is cleaned, cross-checked and processed before it is analysed, and often only a sample of the data is used in the actual analysis. This is possible because the kind of data that is being analysed - sales figures or stock counts, for example - can easily be arranged in a pre-ordained database schema, and because BI tools are often used simply to create periodic reports.

At the center of the big data movement is an open source software framework called Hadoop.
Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes. The Hadoop system consists of three
projects: Hadoop Common, a utility layer that provides access to the Hadoop Distributed File
System and Hadoop subprojects. HDFS acts as the data storage platform for the Hadoop
framework and can scale to massive size when distributed over numerous computing nodes.
Hadoop MapReduce is a framework for processing data sets across clusters of Hadoop nodes.
The Map and Reduce process splits the work by first mapping the input across the control nodes
of the cluster, then splitting the workload into even smaller data sets and distributing it further
throughout the computing cluster. This allows it to leverage massively parallel processing, a
computing advantage that technology has introduced to modern system architectures. With MPP,
Hadoop can run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system. As the nodes "return" their answers, the Reduce function collects and combines the information to deliver a final result.
To extend the basic Hadoop ecosystem capabilities a number of new open source projects have
added functionality to the environment. A typical Hadoop ecosystem will look something like this:

Avro is a data serialization system that converts data into a fast, compact binary data format. When Avro data is stored in a file, its schema is stored with it.

Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce.

HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis.

Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs.

Pig is a high-level programming language and execution framework for parallel computation. Pig works within the Hadoop and MapReduce frameworks.

ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack.

Data exploration of Big Data result sets requires displaying millions or billions of data points to
uncover hidden patterns or records of interest as shown below:


Many vendors are talking about Big Data in terms of managing petabytes of data. For example, EMC has a number of Big Data storage platforms such as its new Isilon storage platform. In reality the issue of big data is much bigger, and Oracle's aim is to focus on providing a big data platform which provides the following:

Deep Analytics: a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities

High Agility: the ability to create temporary analytics environments in an end-user driven, yet secure and scalable, environment to deliver new and novel insights to the operational business

Massive Scalability: the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential

Low Latency: the ability to instantly act based on these advanced analytics in your operational, production environment


2. HADOOP HELLO WORLD


2.1 Introduction to Hadoop
Map/Reduce is a programming paradigm that expresses a large distributed computation as a
sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce
framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across
the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce
phase. The input to the computation is a data set of key/value pairs.
In the map phase, the framework splits the input data set into a large number of fragments and
assigns each fragment to a map task. The framework also distributes the many map tasks across
the cluster of nodes on which it operates. Each map task consumes key/value pairs from its
assigned fragment and produces a set of intermediate key/value pairs. For each input key/value
pair (K,V), the map task invokes a user defined map function that transmutes the input into a
different key/value pair (K',V').
Following the map phase the framework sorts the intermediate data set by key and produces a set
of (K',V'*) tuples so that all the values associated with a particular key appear together. It also
partitions the set of tuples into a number of fragments equal to the number of reduce tasks.
In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For
each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output
key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the
cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each
reduce task.
Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having
many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with
small runtime overhead.
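To make the (K,V) to (K',V') flow concrete, below is a minimal Java sketch of the map and reduce functions for a word count job, written against the classic Hadoop mapred API. It is only an illustration of the pattern; the WordCount.java used later in this lab may differ in its details.

// Sketch only: a word-count Mapper and Reducer using the classic
// org.apache.hadoop.mapred API. The lab's own WordCount.java may differ.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {

  // Map phase: for each input record (K = byte offset, V = line of text),
  // emit an intermediate pair (K' = word, V' = 1) for every word.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups the intermediate pairs by word, so each
  // call receives (K' = word, V'* = all of its 1s) and emits the total count.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}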
Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server
or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker
is the point of interaction between users and the framework. Users submit map/reduce jobs to the
jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the
tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle
data motion between the map and reduce phases.
Hadoop DFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a
large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. Files in
HDFS are "write once" and have strictly one writer at any time.
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation
consists of a single Namenode, a master server that manages the filesystem namespace and
regulates access to files by clients. In addition, there are a number of Datanodes, one per node in
the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations, like the opening, closing and renaming of files and directories, available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
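Although this lab drives HDFS through the hadoop command line utility, the same filesystem operations are exposed to programs through Hadoop's Java FileSystem API. The short sketch below, whose local and HDFS paths are hypothetical placeholders chosen purely for illustration, shows a client copying a file into HDFS and reading it back; under the covers the Namenode resolves the paths and the Datanodes serve the blocks.

// Sketch only: basic HDFS client calls via the Hadoop FileSystem API.
// The paths below are placeholders, not files created by this lab.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (e.g. fs.default.name) from the configuration on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop dfs -copyFromLocal /tmp/sample.txt /user/oracle/sample.txt
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                         new Path("/user/oracle/sample.txt"));

    // Equivalent of: hadoop dfs -cat /user/oracle/sample.txt
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/oracle/sample.txt"))));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}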

2.2 Overview of Hands on Exercise


To get an understanding of what is involved in running a Hadoop job, and of all the steps one must undertake, we will set up and run a hello world type exercise on our Hadoop cluster.
In this exercise you will:
1) Compile a Java word count program written to run on a Hadoop cluster
2) Create some files to run word count on
3) Upload the files into HDFS
4) Run word count
5) View the results

NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts type cat scriptName and the contents of the script will be displayed in
the terminal

2.3 Word Count


1. All of the setup and execution for the Word Count exercise can be done from the terminal, hence to start out this first exercise please open the terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the first exercise are, type in the terminal:
cd /home/oracle/exercises/wordCount
Then press Enter

3. Let's look at the Java code which will run word count on a Hadoop cluster. Type in the terminal:


gedit WordCount.java
Then press Enter

4. A new window will open with the Java code for word count. We would like you to look at lines 14 and 28 of the code. There you can see the Mapper and Reducer interfaces being implemented.

5. When you are done evaluating the code you can click on the X in the right upper corner of
the screen to close the window.


6. We can now go ahead and compile the word count code. We need to run the compile.sh script, which will set the correct classpath and output directory while compiling WordCount.java. Type in the terminal:
./compile.sh
Then press Enter

7. We can now create a jar file from the compile directory of word count. This jar file is required as the code for word count will be sent to all of the nodes in the cluster and the code will be run simultaneously on all nodes that have appropriate data. To create the jar file, in the terminal type:
./createJar.sh
Then press Enter


8. For the exercise to be more interesting we need to create some files on which word count will be executed. To create the files go to the terminal and type:
./createFiles.sh
Then press Enter

9. To see the contents of the files type in the terminal:


cat file01 file02
Then press Enter


In the terminal window you will see the contents of the two files, each file having 4 words in it. Although these are quite small files, the code would run identically with more than 2 files and with files that are several gigabytes or terabytes in size.

10. Now that we have the files ready we must move them into the Hadoop File System (HDFS). Hadoop cannot work with files on other file systems; they must be within HDFS for them to be usable. It is also important to note that files which are within HDFS are split into multiple chunks and stored on separate nodes for parallel parsing. To upload our two files into HDFS you need to use the copyFromLocal command in Hadoop. Run the command by typing at the terminal:


hadoop dfs -copyFromLocal file01 /user/oracle/wordcount/input/file01
Then press Enter

For convenience you can also run the script copyFiles.sh and it will upload the files for you, so you do not need to type in this command or the next one.

11. We should now upload the second file. Go to the terminal and type:
hadoop dfs -copyFromLocal file02 /user/oracle/wordcount/input/file02
Then press Enter


12. We can now run our MapReduce job to do a word count on the files we just uploaded. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


A lot of text will roll by in the terminal window. This is informational data coming from the Hadoop infrastructure to help track the status of the job. Wait for the job to finish; this is signaled by the command prompt coming back.
13. Once you have your command prompt back your MapReduce task is complete. It is now time to look at the results. We can display the results file directly from HDFS by using the cat command from Hadoop. Go to the terminal and type the following command:
hadoop dfs -cat /user/oracle/wordcount/output/part-00000
Then press Enter
For your convenience you can also run the script viewResults.sh and it will run the Hadoop command for you to see the results.


In the terminal the word count results are displayed. You will see that the job counted the number of times each word appears.


14. As an experiment let's try to run the Hadoop job again. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


15. You will notice an error message appears and no MapReduce task is executed. This is easily explained by the immutability of data. Since Hadoop does not allow an update of data files (just read and write) you cannot update the data in the results directory, hence the execution has no place to put its output. For you to re-run the MapReduce job you must either point it to another output directory or clean out the current output directory. Let's go ahead and clean out the previous output directory. Go to the terminal and type:
hadoop dfs -rmr /user/oracle/wordcount/output
Then press Enter
For convenience you can also run the script deleteOutput.sh and it will delete the files for you, so you do not need to type in this command.


16. Now we have cleared the output directory and can re-run the MapReduce task. Let's just go ahead and make sure it works again. Go to the terminal and type:
hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output
Then press Enter
For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.


The MapReduce job now ran fine again, as shown by the output on the screen.


17. This completes the word count example. You can now close the terminal window; go to the
terminal window and type:
exit
Then press Enter


2.4 Summary
In this exercise you were able to see the basic steps required in setting up and running a very simple MapReduce job. You saw which interfaces must be implemented when creating a MapReduce task, how to upload data into HDFS and how to run the MapReduce task. It is worth commenting on the execution time for the exercise: the amount of time required to count 8 words is quite high in absolute terms. It is important to understand that Hadoop needs to start a separate Java Virtual Machine to process each file or chunk of a file on each node of the cluster. As such, even a trivial job has some processing time, which limits the possible applications of Hadoop as it can only handle batch jobs. Real-time applications, where immediate answers are required, can't be run on a Hadoop cluster. At the same time, as the data volumes increase the processing time does not increase that much, as long as there are enough processing nodes. A recent benchmark of a Hadoop cluster saw the complete sorting of 1 terabyte of data in just over 3 minutes on 910 nodes.


3. PIG EXERCISE
3.1 Introduction to Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to optimize
their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.

3.2 Overview Of Hands On Exercise


In this exercise we will be analyzing data coming from the New York Stock Exchange; specifically, we would like to evaluate the dividends given by different companies. We have a tab delimited file with four columns: exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.
In this exercise we will:
1. Load our stock exchange data into our HDFS
2. Run a PIG script which will find the company with the highest dividends
3. View the results
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts type cat scriptName and the contents of the script will be displayed in
the terminal
NOTE2: This exercise and its dataset were inspired by the following website:
http://ofps.oreilly.com/titles/9781449302641/running_pig.html

3.3 Working with PIG


1. All of the setup and execution for this exercise can be done from the terminal, hence open the terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for this exercise are, type in the terminal:
cd /home/oracle/exercises/pig
Then press Enter

3. To get an idea of what our dividends file looks like, let's look at the first couple of rows. In the terminal type:
head NYSE_dividends
Then press Enter

The first 10 rows of the data file will be displayed on the screen


4. Now that we have an idea what our data file looks like, let's load it into HDFS for processing. To load the data we use the copyFromLocal function of Hadoop; go to the terminal and type:
hadoop dfs -copyFromLocal NYSE_dividends /user/oracle/NYSE_dividends
Then press Enter

For convenience you can also run the script loadData.sh and it will upload the file for you, so you do not need to type in the command above.

5. We will be running our Pig script in interactive mode so we can see each step of the process. For this we will need to open the Pig interpreter, called grunt. Go to the terminal and type:
pig


Then press Enter

6. Once at the grunt shell we can start typing Pig script. The first thing we need to do is load the data file from HDFS into Pig for processing. The data is not actually copied; a handler is created for the file so Pig knows how to interpret the data. Go to the grunt shell and type:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Then press Enter


7. Now that the data is loaded as a four-column table, let's see what the data looks like in Pig. Go to the grunt shell and type:
dump dividends;
Then press Enter


You will see output on the screen similar to the first exercise. This is normal, as Pig is merely a high-level language: all commands which process data simply run MapReduce tasks in the background, so the dump command becomes a MapReduce task that is run. This applies to all of the commands you will run in Pig. The output on the screen will show you all of the rows of the file in tuple form.


8. The first step in analyzing the data will be grouping the data by stock symbol so we have all of the dividends of one company grouped together. Go to the grunt shell and type:
grouped = group dividends by symbol;
Then press Enter


9. Let's go ahead and dump this grouped variable to the screen to see what its contents look like. Go to the grunt shell and type:
dump grouped;
Then press Enter


On the screen you will see all of the groups displayed in tuple-of-tuples form. As the output might look a bit confusing, only one tuple is highlighted in the screenshot below to help with clarity. The highlighted region shows all of the rows of the table for the CATO stock symbol.


10. In the next step we will go through each group tuple and get the group name and the
average dividend. Go to the grunt shell and type:
avg = foreach grouped generate group, AVG(dividends.dividend);
Then press Enter


11. Let's go ahead and see what this output looks like. Go to the grunt shell and type:
dump avg;
Then press Enter


Now you can see on the screen a dump of all stock symbols with their respective average dividend. A couple of them are highlighted in the image below.


12. Now that we have the dividends for each company it would be ideal if we had them in order from highest to lowest dividend. Let's get that list; go to the grunt shell and type:
sorted = order avg by $1 DESC;
Then press Enter


We can now see what the sorted list looks like. Go to the grunt terminal and type:
dump sorted;
Then press Enter


On the screen you now see the list sorted in descending order. The lowest dividends are shown, but you can scroll up to see the rest of the values.


13. We now have the final results we want. It might be worth writing these results out to HDFS. Let's do that. Go to the grunt shell and type:
store sorted into 'average_dividend';
Then press Enter


14. The new calculated data is now permanently stored in HDFS. We can now exit the grunt
shell. Go to the grunt shell and type:
quit;
Then press Enter


15. Now back at the terminal, let's view the top 10 companies by average dividend directly from HDFS. Go to the terminal and type:
hadoop dfs -cat /user/oracle/average_dividend/part-r-00000 | head
Then press Enter
For convenience you can also run the script viewResults.sh and it will display the results for you, so you do not need to type in the command above.


This command simply ran cat on the results file available in HDFS. The results are seen on the screen.


16. That concludes the Pig exercise; you can now close the terminal window. Go to the terminal and type:
exit
Then press Enter


3.4 Summary
In this exercise you saw what a Pig script looks like and how to run it. It is important to understand that Pig is a scripting language which ultimately runs MapReduce jobs on a Hadoop cluster; hence all of the power of a distributed system, and the high data volumes which HDFS can accommodate, are exploitable through Pig. Pig provides an easier interface to the MapReduce infrastructure, allowing scripting paradigms to be used rather than direct Java coding.
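As a side note, the same pipeline does not have to be typed into the Grunt shell; Pig can also be embedded in a Java program through its PigServer API. The sketch below is an illustration of that idea rather than part of the lab, and it assumes the NYSE_dividends file is already in HDFS and that the Pig and Hadoop libraries are on the classpath.

// Sketch only: running the dividend analysis from Java via Pig's PigServer API.
// Assumes NYSE_dividends is already in HDFS and Pig/Hadoop jars are on the classpath.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class DividendPigSketch {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // The same Pig Latin statements used interactively in the Grunt shell
    pig.registerQuery("dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);");
    pig.registerQuery("grouped = group dividends by symbol;");
    pig.registerQuery("avg = foreach grouped generate group, AVG(dividends.dividend);");
    pig.registerQuery("sorted = order avg by $1 DESC;");

    // Writes the result to HDFS, like: store sorted into 'average_dividend';
    pig.store("sorted", "average_dividend");
  }
}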


4. HIVE CODING
4.1 Introduction to Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive
provides a mechanism to project structure onto this data and query the data using a SQL-like
language called HiveQL. At the same time this language also allows traditional map/reduce
programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to
express this logic in HiveQL.
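HiveQL can also be submitted from applications rather than from the interactive shell. For example, when a Hive server is running (as it will be in the ODI exercise later), a Java program can issue queries over JDBC. The driver class, connection URL and port in the sketch below are typical defaults for Hive releases of this era and should be treated as assumptions for illustration only, not as part of this lab.

// Sketch only: issuing a HiveQL query over JDBC against a running Hive server.
// Driver class, URL, port and table are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();
    // Hive rewrites this query as MapReduce jobs, just as the hive shell does
    ResultSet rs = stmt.executeQuery(
        "select symbol, avg(dividend) avg_dividend from dividends "
        + "group by symbol order by avg_dividend desc limit 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}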

4.2 Overview Of Hands On Exercise


In this exercise you will use the Hive Query Language to create tables, insert data into those tables and run queries on that data. For this exercise we will use the same data file as in the Pig exercise above: a tab delimited file with four columns; exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.
In this exercise you will:
1. Upload the dividends file into HDFS
2. Create a table in Hive
3. Load the dividend data into the Hive table
4. Run queries on the Hive table

4.3 Queries with Hive


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the Hive exercise are, type in the terminal:
cd /home/oracle/exercises/hive
Then press Enter


3. We already have an idea what our data file looks like, so let's load it into HDFS for processing. This is done identically to the way it was done in the first two exercises. We will see a better way to load data in the next exercise. To load the data we use the copyFromLocal function of Hadoop. Go to the terminal and type:
hadoop dfs -copyFromLocal NYSE_dividend /user/oracle/NYSE_dividend
Then press Enter
For convenience you can also run the script loadData.sh and it will load the file for you, so you do not need to type in the command above.

4. Let's now enter the Hive interactive shell environment to create tables and run queries against those tables. To give an analogy, this is similar to SQL*Plus, but this environment is specifically for the HiveQL language. To enter the environment go to the terminal and type:
hive
Then press Enter

5. The first thing we need to do in Hive is create a table. We will create a table named dividends with four fields called exchange, symbol, dates and dividend, something that looks very natural based on the data set. Go to the terminal and type:
create table dividends(exchange string, symbol string, dates string, dividend float);
Then press Enter


An OK should be printed on the screen indicating the success of the operation. This OK message will be printed for all operations, but we will only mention it this time. It is left up to the user to check for this message on future HiveQL commands.

6. We can now run a command to see all of the tables available to this OS user. Go to the
hive terminal and type:
show tables;
Then press Enter


You can see the only table currently available is the one we just created.

7. As with normal SQL you also have a describe command available to see the columns in the table. Go to the terminal and type:
describe dividends;
Then press Enter

As you can see, the dividends table has the 4 fields, each with its own Hive-specific data type. This is to be expected, as this is the way we created the table.


8. Let's go ahead and load some data into this table. Data is loaded into Hive from flat files available in the HDFS file system. Go to the terminal and type:
load data inpath '/user/oracle/NYSE_dividend' into table dividends;
Then press Enter


The data is now loaded into the table.


9. We can now see the data that has been loaded into the table. Go to the terminal and type:
select * from dividends limit 5;
Then press Enter


Five lines from the table are printed to the screen; only 3 of the lines are highlighted in the
image below.


10. Now that we have all of the data loaded into a Hive table we can run SQL-like queries on the data. As we have the same data set as in the Pig exercise, let's try to extract the same information. We will look for the top 10 companies by average dividend. Go to the terminal and type:
select symbol, avg(dividend) avg_dividend from dividends group by
symbol order by avg_dividend desc limit 10;
Then press Enter


On the screen you will see a lot of log information scroll through. Most of this is generated by Hadoop, as Hive (just like Pig) takes the queries you write, rewrites them as MapReduce jobs and then executes them. The query we wrote can take full advantage of the distributed computational power of Hadoop as well as the striping and parallelism that HDFS enables.
When the query is done you should see on the screen the top 10 companies in descending order. This output shows the exact same information as we got in the previous exercise. As the old idiom goes, there is more than one way to skin a cat; with Hadoop there is always more than one way to achieve any task.


11. This is the end of the Hive exercise. You can now exit the hive interpreter. Go to the
terminal and type:
exit;
Then press Enter


12. Then close the terminal. Go to the terminal and type:
exit
Then press Enter


4.4 Summary
In this exercise you were introduced to the Hive Query Language. You saw how to create and view tables using HiveQL. Once tables were created you were introduced to loading data from HDFS as well as some of the standard SQL constructs which HiveQL has available. It is important to understand that Hive is an abstraction layer for Hadoop and MapReduce jobs. All queries written in HiveQL get transformed into a DAG (Directed Acyclic Graph) of MapReduce tasks which are then run on the Hadoop cluster, hence taking advantage of all the performance and scalability capabilities, but also retaining all of the limitations, of Hadoop.
HiveQL has most of the functionality available in standard SQL, with a series of DDL and DML functions implemented, but it does not strictly adhere to the SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only offers basic support for indexing. Also, HiveQL lacks support for transactions and materialized views, and has only limited subquery support. It is intended for long-running queries of a data warehousing type rather than a transactional OLTP type of workload.


5. ORACLE ODI AND HADOOP


5.1 Introduction To Oracle Connectors
Apache Hadoop is designed to handle and process data from data sources that are typically non-RDBMS and data volumes that are typically beyond what is handled by relational databases.
The Oracle Data Integrator Application Adapter for Hadoop enables data integration developers to
integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar
and easy-to-use tools and preconfigured knowledge modules, the adapter provides the following
capabilities:

Loading data into Hadoop from the local file system and HDFS.

Performing validation and transformation of data within Hadoop.

Loading processed data from Hadoop to Oracle Database for further processing and
generating reports.

Typical processing in Hadoop includes data validation and transformations that are programmed
as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming
knowledge. However, using Oracle Data Integrator and the Oracle Data Integrator Application
Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive
and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs.
The Oracle Data Integrator graphical user interface enhances the developer's experience and productivity while enabling them to create Hadoop integrations.
When implementing a big data processing scenario, the first step is to load the data into Hadoop.
The data source is typically in the local file system, HDFS, Hive tables, or external Hive tables.
After the data is loaded, you can validate and transform the data using HiveQL like you do in SQL.
You can perform data validation such as checking for NULLS and primary keys, and
transformations such as filtering, aggregations, set operations, and derived tables. You can also
include customized procedural snippets (scripts) for processing the data.
When the data has been aggregated, condensed, or crunched down, you can load it into Oracle
Database for further processing and analysis. Oracle Loader for Hadoop is recommended for
optimal loading into Oracle Database.
Knowledge Modules:
IKM File To Hive (Load Data)
Description: Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements.
Source: File System    Target: Hive

IKM Hive Control Append
Description: Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled.
Source: Hive    Target: Hive

IKM Hive Transform
Description: Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python.
Source: Hive    Target: Hive

IKM File-Hive to Oracle (OLH)
Description: Integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop.
Source: File System or Hive    Target: Oracle Database

CKM Hive
Description: Validates data against constraints.
Source: NA    Target: Hive

RKM Hive
Description: Reverse engineers Hive tables.
Source: Hive Metadata    Target: NA

5.2 Overview of Hands on Exercise


In this workshop we have been loading data into HDFS using a cumbersome command line utility, one file at a time. We viewed results from within HDFS also using a command line utility. Although this is fine for smaller jobs, it would be a good idea to integrate the moving of data with an ETL tool such as Oracle Data Integrator (ODI). In this exercise we will see how Oracle Data Integrator integrates seamlessly with Hadoop and more specifically Hive.
In this exercise you will:
1. Use ODI to reverse engineer a Hive table
2. Use ODI to import a text file into a Hive table
3. Use ODI to move data from a Hive table into the Oracle Database
4. Use ODI to move data from a Hive table to another table with check constraints

5.3 Setup and Reverse Engineering in ODI


1. All of the setup for this exercise can be done from the terminal, hence open a terminal by
double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for this exercise are, type in the terminal:
cd /home/oracle/exercises/odi
Then press Enter


3. Next we need to run a script to set up the environment for this exercise. We will be loading the same data as in the previous exercise (the dividends table), only this time we will be using ODI to perform this task. For this we need to drop that table and recreate it so it is empty for the import. Also, we need to start the Hive server to enable ODI to communicate with Hive. We have a script which will perform both tasks. Go to the terminal and type:
./setup.sh
Then press Enter

4. Next we need to start Oracle Data Integrator. Go to the terminal and type
./startODI
Then press Enter


5. Once ODI opens we need to connect to the repository. In the right upper corner of the screen click on Connect To Repository.

6. In the dialog that pops up all of the connection details should already be configured.
Login Name: DEFAULT_LOGIN1
User: SUPERVISOR
Password: Welcome1
If all of the data is entered correctly you can simply click OK


7. Once you login make sure you are on the Designer Tab. Near the top of the screen on the
left side click on Designer.

8. Near the bottom of the screen on the left side there is a Models tab click on it.

You will notice that we have already created a File, Hive, and Oracle model for you. These were pre-created to reduce the number of steps in the exercise. For details on how to use flat files and the Oracle database with ODI please see the excellent Oracle by Example tutorials found at http://www.oracle.com/technetwork/tutorials/index.html.

9. The first feature of ODI we would like to show involves reverse engineering a data store. The reverse engineering function takes a data store and finds all of its tables and their structure automatically. In the Models tab on the left side of the screen there is a Hive model. Let's click on the + to expand it out.

10. You will notice there is no information about the data that is stored in that particular
location. Right click on the Hive folder and select Reverse Engineer


11. You will see two items appear in the Hive folder, called dividends and dividends2. These are the tables we created in Hive. You can click on the + beside dividends to see some more information.


12. You can also expand the Columns folder to see all of the columns in this particular table.

You will see the columns created in step 3 displayed.


This is the power of the Hive Reverse Engineering Knowledge Module (RKM) integrated in Oracle Data Integrator. Once you define a data store (in our case a Hive source) the RKM will automatically discover all tables and their corresponding columns available at that source. Once a data model is created there is no need to rewrite it in ODI. ODI will automatically discover that model for you, so you can get straight to the development of the data movement.

5.4 Using ODI to import text file into Hive


1. Now that we have configured our models we can begin creating interfaces to perform ETL tasks. Near the top left corner of the screen in the Projects tab there is an icon with 3 squares. Click on it and select New Project.


2. In the new window that opened up on the right side of the screen enter the following
information:
Name: Hadoop
Code: HADOOP
Then click on the Save All in the right upper corner of the screen.

3. In the left hand menu in the Projects section a new item appeared called Hadoop. Click on
the + to expand it out


4. Next to the folder called First Folder there is another +; expand out that folder as well by clicking the +

5. Right click on the Item Interfaces and select New Interface


6. We can now start to define the new interface. In this interface we will map out the columns in the text file and move the data into the Hive table. To start out let's give the interface a name. In the new tab that opened on the right side of the screen type in the following information.
Name: File_To_Hive

7. Next we need to move to the mapping tab of the File_To_Hive interface. Click on Mapping
at the bottom of the screen.


8. We now need to define the sources and target data stores. On the left bottom of the
screen in the Models Section expand the File folder by clicking on the + beside it.

9. Now we can drag and drop the Dividends table from the File model into the source
section of the interface.


10. Next we will drag and drop the dividends Hive table into the target section of the interface.


11. A pop up window will appear which will ask if you would like to create automatic mapping.
This will try to automatically match source columns with target columns based on column
name. Click on Yes to see what happens.

12. By name it was able to match all of the columns. The mapping is now complete. Let's go back to the Overview tab to set up one last thing. Click on the Overview tab on the left side of the screen.


In the definitions tab, tick the box Staging Area Different From Target.

13. A drop down menu below the tick now gets activated. Select File: Comments File


14. We can now click on the flow tab at the bottom of the screen to see what the interface will
look like.

15. On the screen in the top right box you will see a diagram of the data flow. Lets see all of
the options for the integration. Click on the Target(Hive Server) header.


16. At the bottom of the screen a new window appeared: a Property Inspector. There you can inspect and modify the configuration of the integration process. Let's change one of the properties. We don't need a staging table so let's disable it. Set the following options:
USE_STAGING_TABLE: false

Let's now execute this interface. Click on the Execute button at the top of the screen.


17. You will be asked to save your interface before running it. Click Yes

18. Next you will be asked for the Execution options. Here you can choose agents contexts
and other elements. You can just accept the default options and click OK

19. An informational screen will pop up telling you the session has started. Click OK


20. We will now check if the execution of the interface was successful. In the left menu click
on the Operator Tab

21. In the menu on the left side make sure the Date tab is expanded. Then expand the Today
folder


You will see a green checkmark beside the File_To_Hive execution.
This means the integration process was successful.

You have now successfully moved data from a flat file into a Hive table without touching the terminal. All of the data was moved without a cumbersome command line interface, allowing for the use of all of the functionality of a powerful ETL tool.
22. You can now move back to the Designer tab in the left menu and close all of the open tabs on the right side menu.

http://www.oracle-developer-days.com
Copyright 2012, Oracle and/or its affiliates. All rights reserved

76

Big Data Workshop

This was to prepare for the next exercise.

5.5 Using ODI to import Hive Table into Oracle


1. Another useful process ODI can perform is moving data from a Hive table into the Oracle
database. Once data processing has occurred in the Hive table, you might want to move
the data back into your Oracle database for integration into your data warehouse. Let's
move the data we just loaded into our Hive table into an Oracle database. The first step is to
create a new interface.
In the Projects tab in the left hand menu right click on Interfaces and select New Interface


2. On the right side of the screen a new window pops up. Enter the following name for the
Interface
Name: Hive_To_Oracle

3. Next we will need to move to the Mapping tab to set up the interface. At the bottom of the
screen click on Mapping


4. To free up more viewing space let's clean up the Models tab in the bottom left part of the
screen. Minimize the File tab and the dividends table.

5. In the same tab (the Models tab) we now see the Oracle folder. Let's expand it, as
we will need the Oracle tables as our target. Click on the + beside the Oracle folder


6. We can now drag and drop the Hive dividends table into the sources window


7. Similarly, you can drag the Oracle DIVIDENDS table into the destination window.


8. As before you will be asked if you would like to perform automatic mapping. Click on Yes

9. Unfortunately, due to a capitalization difference no mapping could be done automatically.

We will need to map the columns manually. Drag and drop each source column (from the
source table window in the upper left part of the right tab) to its corresponding target
column (in the upper right part of the right tab) using the following mapping:
exchange -> STOCK
dates -> DATES
dividend -> DIVIDEND


10. One of the advantages of an ETL tool can be seen when doing transformations during the
data movement. Let's concatenate the exchange and the symbol into one string and load that
into the STOCK column in the database. Go to the Property Inspector of the
STOCK column by clicking on it in the target window

11. The property inspector window should open at the bottom of the screen. In the
implementation edit box type the following
concat(DIVIDENDS.exchange, DIVIDENDS.symbol)


12. The transformation is now set up. Let's now go back to the Overview tab to configure the
staging area. Click on the Overview tab

13. In the Definition tab, tick the box Staging Area Different From Target.


14. A drop-down menu below the tick box now becomes active. Select Hive: Hive Store

15. We are now ready to run the interface. Go to the upper left corner of the screen and click
the Execute button


16. A window will pop up telling you a save is required before the execution can continue. Just
click Yes

17. Another window will pop up asking you to configure the Execution Context and Agent.
The default options are fine, just click OK

18. A final window will pop up telling you the session has started. Click OK


19. Let's now go to the Operator tab to check if the execution was successful. In the top left
corner of the screen click on the Operator tab

20. When you get to the Operator tab you might see a lightning bolt beside the
Hive_To_Oracle execution. This means the integration is still executing; wait a little until the
checkmark appears.


The movement of data from Hive to Oracle has completed successfully.

21. One of the great features of ODI is that it allows you to look at the code that was executed as
part of the ETL process. Let's drill down and see some of the code that was executed. In the
Operator tab click on the + next to the latest execution.


22. Continue to drill down by clicking on the + next to 1 - Hive_to_Oracle <date>

23. You can now see all of the steps taken to perform this particular mapping. Let's investigate
further the fourth step in the process. Double click on 4 Integration Hive_To_Oracle
Create Hive staging table to open up a window with its details.


24. In the window that opened up click on the Code tab


In this window you will see exactly what code was run. If an error occurs, this information
becomes quite useful in debugging your transformations.

25. To check the data that is now in the Oracle database, go back to the Designer tab by
going to the upper left corner of the screen and clicking on Designer


26. Then in the Models section at the bottom left of the screen right click on the DIVIDENDS table
in the Oracle folder and select View Data

On the right side of the screen a new window will pop up with all of the data inside that table
of the Oracle database.


27. We can now go ahead and close all of the open windows on the right side of the screen to
prepare for the next exercise.

5.6 Using ODI to import Hive Table into Hive


1. The last ETL process we are going to show involves moving data from a Hive table into
another Hive table. Although this might sound a bit odd, there are many circumstances
where you might want to move data from one table to another while verifying the data for
constraint violations or transforming it. Let's go ahead and create an interface for this type
of transaction. In the Projects tab right click on Interfaces and select New Interface

2. As before, let's give the interface a name. In the new tab that opened on the right side of
the screen type in the following information.
Name: Hive_To_Hive

3. Next, at the bottom of the screen, let's go to the Mapping tab


4. We will first drag the Hive dividends table into the source window on the right


5. Next we will drag the dividends2 table into the target window on the right


6. You will be asked if you would like to perform auto mapping. Just click Yes

7. All of the mappings auto-complete without a problem. We now need to specify the
Integration Knowledge Module (IKM) which will be used to perform the integration. In ODI
an IKM is the engine which has all of the code templates for the integration process; hence
without an appropriate IKM the integration is not possible. In the previous section there
was only one appropriate IKM, hence it was chosen automatically. In this case there are
multiple possible IKMs, so we need to select one. In the upper left corner of the screen in the
Designer window right click on Knowledge Modules and select Import Knowledge
Modules.


8. A window will pop up which will allow you to import Knowledge Modules. First we need to
specify the folder in which the Knowledge Modules are stored. Fill in the following
information.
File import directory: /u01/ODI/oracledi/xml-reference
Then Press Enter


9. A list of different Knowledge Modules should appear in the space below. Scroll down until
you find the file(s) to import:
IKM Hive Control Append
Then press OK

10. An import report will pop up. Just click Close


11. Let's now add a constraint to the target table to see what happens during the data
movement. In the bottom left part of the screen in the Models window expand the dividends2
store by pressing the + beside it.

12. In the subsections that appear under dividends2 you will see a section called Constraints.
Right click on it and select New Condition.


13. On the right side a new window will open allowing you to define the properties of this
condition. We will set a check condition which will check whether the dividend value is too
low. Enter the following information.
Name: Low_Dividend
Type: Oracle Data Integrator Condition
Where: dividends2.dividend>0.01
Message: Dividend Too Low

14. We now need to save our constraint. In the top right corner of the screen click on the Save
button


15. We are now ready to run the interface. In the top right section of the screen click back to
our interface by clicking on the Hive_To_Hive tab.


16. Now at the top of the screen we can click the Play button to run the interface


17. A new window pops up saying you need to save all of the changes before the interface can
be run. Just click Yes

18. A new window will pop up asking for the execution context, just click OK


19. An informational pop-up will show up telling you the execution has started. Simply click OK

20. It is now time to check our constraint. In the bottom left part of the screen (in the Models
section) right click on the dividends2 model, then go to the Control section and click on
Check

21. This check is its own job that must be run; hence a window will pop up asking you to select
a context for the execution. The default options are good, so just click OK


22. An informational window pops up telling you the execution has started. Just click OK

23. We can now see all of the rows that failed our check. Again, in the bottom left part of the
screen (in the Models section) right click on the dividends2 model, go to the Control menu
and select Errors

A new tab will pop up on the right side of the screen. You will see all of the rows which did
not pass the constraint.


24. This concludes the ODI section of the workshop. Go to the right upper corner of the screen
and click the X to close ODI.

25. Then in the terminal type exit to close it as well.


5.7 Summary
In this exercise you were introduced to Oracle's integration of ODI with Hadoop. It is worth
noting that this integration is only available for the Oracle database and only available from Oracle.
It is a custom extension for ODI, developed by Oracle, that allows users who already have
ETL as part of their data warehousing methodology to continue using the same tools and
procedures with the new Hadoop technologies.
It is quite important to note that ODI is a very powerful ETL tool which offers all of the
functionality typically found in an enterprise-quality ETL product. Although the examples given in this
exercise are quite simple, this does not mean the integration of ODI and Hadoop is. All of the power
and functionality of ODI is available when working with Hadoop: workflow definition, complex
transformations, flow control and multi-source integration, to name just a few of the ODI features
that can also be used with Hadoop.
Through this exercise you were introduced to three Knowledge Modules of ODI: reverse
engineering for Hive, integration into Hive, and integration from Hive to Oracle. These are not the
only Knowledge Modules available, and we encourage you to review the table available in section
5.2 of this document to get a better idea of all the functionality currently available.


6. WORKING WITH EXTERNAL TABLES


6.1 Introduction to External Tables
Oracle Direct Connector runs on the system where Oracle Database runs. It provides read access
to HDFS from Oracle Database by using external tables.
An external table is an Oracle Database object that identifies the location of data outside of the
database. Oracle Database accesses the data by using the metadata provided when the external
table was created. By querying the external tables, users can access data stored in HDFS as if
that data were stored in tables in the database. External tables are often used to stage data to be
transformed during a database load.
These are a few ways that you can use Oracle Direct Connector:

Access any data stored in HDFS files

Access CSV files and Data Pump files generated by Oracle Loader for Hadoop

Load data extracted and transformed by Oracle Data Integrator

Oracle Direct Connector uses the ORACLE_LOADER access driver.

6.2 Overview of Hands on Exercise


This exercise will involve working with Oracle external tables. We will create 3 text files with some
data in each. We will upload these files into HDFS and connect them to the Oracle database using
external tables. The data within these files will then be accessible from within the Oracle database.
In this exercise you will:
1. Create and query external tables over files stored in HDFS
NOTE: During this exercise you will be asked to run several scripts. If you would like to see the
content of these scripts, type cat scriptName and the contents of the script will be displayed in
the terminal

6.3 Configuring External Tables


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the external tables exercise are, type in the terminal:
cd /home/oracle/exercises/external
Then press Enter

3. The first step in this exercise is to create some random files. This is just so we have some
data in Hadoop to access through an external table. We will create three files called sales1, sales2
and sales3, each with a single row comprised of 3 numbers. To create the files go to
the terminal and type:
./createFiles.sh
Then press Enter

4. Next we will load these files into HDFS. We have a script for that process as well. Go to
the terminal and type:
./loadFiles.sh
Then press Enter

5. Next we will need to create the external table in Oracle. As the SQL code is quite long, we
have written a script with that code. Since this step is quite important, let's look at what the
code looks like. In the terminal type:
gedit createTable.sh
Then press Enter


Looking at the code for creating the table, you will notice the syntax is very similar to that of other
types of external tables, except for 2 lines: the preprocessor and the type, highlighted in the image
below

6. When you are done evaluating the code you can close the window by clicking the X in the
right upper corner of the window


7. Let's go ahead now and run that piece of code. In the terminal type:
./createTable.sh
Then press Enter

8. Now that the table is created we need to connect that table with the files we loaded into
HDFS. To make this connection we must run a Hadoop job which calls the Oracle loader code.
Go to the terminal and type:
./connectTable.sh
Then press Enter


9. You will be asked to enter a password so the code can log in to the database user.
Enter the following information
[Enter Database Password:]: tiger
Then Press Enter
NOTE: No text will appear while you type


10. We can now use SQL from Oracle to read those files in HDFS. Let's experiment with that.
First we connect to the database using SQL*Plus. Go to the terminal and type:
sqlplus scott/tiger
Then press Enter


11. Now let's query that data. Go to the terminal and type:
select * from sales_hdfs_ext_tab;
Then press Enter


The query returns the data that is in all three files.


12. This concludes this exercise. You can now exit SQL*Plus. Go to the terminal and type:
exit;
Then press Enter


13. Then close the terminal. Go to the terminal and type:
exit
Then press Enter


6.4 Summary
In this chapter we showed how data in HDFS can be queried using standard SQL right from the
Oracle database. With the data stored in HDFS, all of the parallelism and striping that naturally
occur there are taken full advantage of, while at the same time you can use all of the power and
functionality of the Oracle Database.
When implementing this method, parallel processing is extremely important when working with
large volumes of data. When using external tables, consider enabling parallel query with this SQL
command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading data into Oracle Database from the external files created by Oracle Direct
Connector, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when inserting data.


7. WORKING WITH MAHOUT


7.1 Introduction to Mahout
Apache Mahout is an Apache project to produce free implementations of distributed or otherwise
scalable machine learning algorithms on the Hadoop platform. Mahout is a work in progress; the
number of implemented algorithms has grown quickly, but there are still various algorithms
missing.
While Mahout's core algorithms for clustering, classification and batch based collaborative filtering
are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict
contributions to Hadoop based implementations. Contributions that run on a single node or on a
non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering
recommender component of Mahout was originally a separate project, and can run stand-alone
without Hadoop.
Currently Mahout supports mainly four use cases: Recommendation mining takes users' behaviour
and from that tries to find items users might like. Clustering takes e.g. text documents and groups
them into groups of topically related documents. Classification learns from existing categorized
documents what documents of a specific category look like and is able to assign unlabelled
documents to the (hopefully) correct category. Frequent item set mining takes a set of item groups
(terms in a query session, shopping cart content) and identifies which individual items usually
appear together.

7.2 Overview of Hands on Exercise


In this exercise you will be using the K-means algorithm to cluster data using Mahout's
implementation of K-means. To give a bit of background, K-means is an algorithm which
clusters data, and despite its simplistic nature it can be quite powerful. The algorithm takes two
inputs: a series of input values (v) and the number of groups those values need to be split into (k).
The algorithm first randomly picks k centres to represent the centre of each group, then
continuously moves those centres so that the distance from each centre to every point in its group
is as small as possible. Once the centres reach a point where any movement would only increase
the distance to the points, the algorithm stops. This is a great algorithm for finding patterns in data
when you have no prior information about what patterns it contains. For all its power, the K-means
algorithm is quite expensive computationally; hence using a massively distributed computation
cluster such as Hadoop offers a great advantage when dealing with very large data sets. This
is exactly what we will be experimenting with in this exercise.
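In symbols (a compact restatement of the description above using standard K-means notation; this formula is not part of the original lab text), the algorithm looks for centres mu_1, ..., mu_k that minimize the within-cluster sum of squared distances:

\min_{\mu_1,\ldots,\mu_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points currently assigned to centre mu_i. Each iteration reassigns every point to its nearest centre and then recomputes each centre as the mean of its assigned points, which never increases this sum.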
In this exercise you will:
1. Use Mahout to cluster a large data set
2. Use the graphics library in Java to visualize a Mahout k-means cluster

7.3 Clustering with K-means


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the Mahout exercise are, type in the terminal:
cd /home/oracle/exercises/mahout
Then press Enter

3. To get an idea of what our data file looks like, let's look at the first row. In the
terminal type:
head -n 1 synthetic_control.data
Then press Enter

As you can see on the screen, all there is in the file are random data points. It is within this
data that we would like to find patterns.


4. The first step in analyzing this data is loading it into HDFS. Let's go ahead and do that.
Go to the terminal and type:
./loadData.sh
Then press Enter

5. Now that the data is loaded we can run Mahout against the data. In this example
the data is already in vector form and a distance function has already
been compiled into the example. When clustering your own data, the command line for
running the clustering should include the distance function you have written and compiled in Java.
Go to the terminal and type:


This would be an excellent time to get a cup of coffee. The clustering is quite
computationally intensive and should take a couple of minutes to complete.
6. Once you get the command prompt back the clustering is done, but the results are stored
in binary format inside Hadoop. We first need to bring all of the results out of Hadoop and
then convert the data from binary format to text format. We have a script which will
perform both tasks. Let's run that script; go to the terminal and type:
./extractData.sh
Then press Enter


7. We can now go ahead and look at the results of the clustering. We will look at the text
output of the results. Go to the terminal and type:
gedit Clusters
Then press Enter


The output is not very user friendly but there are several indicators to look for, as follows:
n = the number of points in the cluster
c = the centre of the cluster
r = the radius of the circle which defines the cluster
Points = the data points in each cluster


8. Once you are done evaluating the results you can click the X in the right upper corner of
the screen to close the window.


9. Even with the highlighted indicators, the text output is not very representative of the data.
Mahout also has some graphing functions for simple data points. We will run a much simpler
clustering, with points that can be displayed on an X,Y plane, to see the results visually. Go to
the terminal and type:
./displayClusters
Then press Enter


A new window will pop up with a visual display of a K-means clustering. The black squares
represent data points and the red circles define the clusters. The yellow and green lines
represent the error margin for each cluster.
10. Once you are done evaluating the image you can click the X in the upper right corner of
the window to close it.


11. This concludes our Mahout exercise. You can now close the terminal window. Go to the
terminal and type:
exit
Then press Enter


7.4 Summary
In this exercise you were introduced to the K-means clustering algorithm and how to run the
algorithm using Mahout, and hence on a Hadoop cluster. It is important to note that Mahout
does not only focus on K-means but also has many different algorithms in the categories of
Clustering, Classification, Pattern Mining, Regression, Dimension Reduction, Evolutionary
Algorithms, Recommendation/Collaborative Filtering and Vector Similarity. Most of these
algorithms have special variants which are optimized to run on a massively distributed
infrastructure (Hadoop) to allow for rapid results on very large data sets.


8. PROGRAMMING WITH R
8.1 Introduction to Enterprise R
R is a language and environment for statistical computing and graphics. It is a GNU project which
is similar to the S language and environment which was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a
different implementation of S. There are some important differences, but much code written for S
runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.
Oracle R Enterprise integrates the open-source R statistical environment and language with
Oracle Database 11g, Exadata, Big Data Appliance, and Hadoop massively scalable computing.
Oracle R Enterprise delivers enterprise-level advanced analytics based on the R environment.
Oracle R Enterprise allows analysts and statisticians to leverage existing R applications and use
the R client directly against data stored in Oracle Database 11g, vastly increasing scalability,
performance and security. The combination of Oracle Database 11g and R delivers an
enterprise-ready, deeply integrated environment for advanced analytics. Data analysts can also take
advantage of analytical sandboxes, where they can analyze data and develop R scripts for
deployment while results stay managed inside Oracle Database.
As an embedded component of the RDBMS, Oracle R Enterprise eliminates R's memory
constraints since it can work on data directly in the database. Oracle R Enterprise leverages
Oracle's in-database analytics and scales R for high performance on Exadata and the Big Data
Appliance. Being part of the Oracle ecosystem, ORE enables execution of R scripts in the
database to support enterprise production applications and OBIEE dashboards, both for structured
results and graphics. Since it is R, users are able to leverage the latest R algorithms and contributed
packages.
Oracle R Enterprise users not only can build models using any of the data mining algorithms in the
CRAN task view for machine learning, but also leverage in-database implementations for
predictions (e.g., stepwise regression, GLM, SVM), attribute selection, clustering, feature
extraction via non-negative matrix factorization, association rules, and anomaly detection.

8.2 Overview of Hands on Exercise


In this exercise you will be introduced to the R programming language as well as the enhancements
Oracle has brought to it. Limitations where all data must be kept in system memory are now gone,
as you can save and load data to and from both the Oracle database and HDFS. To exemplify the
uses of R, we will be doing K-means clustering again, as in Exercise 7, this time using the R
programming language. If you would like a review of K-means please see the introduction to
section 7.
In this exercise you will:


1. Generate a set of random data points
2. Save the data in both the Oracle database and HDFS
3. View the data in Oracle and HDFS
4. Load the data from Oracle back into R
5. Perform K-means clustering on the data points
6. View the results

8.3 Taking data from R and inserting it into the database


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the R exercises are, type in the terminal:
cd /home/oracle/exercises/R
Then press Enter

3. To work with R you can write scripts for the interpreter to execute or you can use the
interactive shell environment. To get a more hands on experience with R we will use the
interactive shell. To start the interactive shell go to the terminal and type:
R
Then press Enter

4. During the login process many different libraries load which extend the functionality of R. If a
particular library is not loaded automatically, you can load it manually after login. We will
need to load a library to interface with HDFS, so let's load that now.
Go to the R shell and type:
library(ORHC)
Then press Enter

5. Now let's go ahead and generate some pseudo-random data points so we have some data to
play with. We will generate 2D data points so we can easily visualize the data. Go to the R
terminal and type:
myDataPoints=rbind(matrix(rnorm(100, mean=0, sd=0.3),ncol=2)
,matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
Then press Enter


Now the variable myDataPoints will have some data points in it.
6. To be able to save data into the database or HDFS you need to have the data in columns
(as we already do) and you also need to have each of the columns labeled. This is
because column names are required within a database to be able to identify the columns.
Let's go ahead and label the columns x and y. Go to the R terminal and type:
colnames(myDataPoints) <- c("x", "y")
Then press Enter


7. We can now create a data frame which will load the data into the Oracle Database. Go to
the terminal and type:
ore.create(as.data.frame(myDataPoints, optional = TRUE), table="DATA_POINTS")
Then press Enter


8. If required, we can even load this data into HDFS. Let's go ahead and do that. Go to the R
terminal and type:
hdfs.put(DATA_POINTS, dfs.name="data_points")
Then press Enter


9. Now that we have loaded the data into both the database and HDFS, let's exit R and look at
that data. Go to the R shell and type:
q()
Then press Enter


10. You will be asked if you want to save workspace image. Type:
n
Then press Enter
Note: when typing n the information typed does not appear on the screen.


11. At this point, in classic R, all data and calculated results would be wiped from memory and
hence lost. With R Enterprise Edition we saved our data in the database, so let's go and
query that data. Go to the terminal and type:
./queryDB.sh
Then press Enter


On the screen you will see the table displayed which contains our data points.


12. We can also look at the data we stored inside HDFS. Go to the terminal and type:
./queryHDFS.sh
Then press Enter


Again on the screen you will see all of the data points displayed.


As you can see all of the work done in R can now be exported to the database or HDFS
for further processing based on business needs.

8.4 Taking data from the database, using it in R, and clustering

1. Data can not only be pushed out to the database, it can also be retrieved from the
database or HDFS to be used within R. Let's see how that is done. First let's go back into
the R environment. Go to the terminal and type:
R
Then press Enter


2. Let's now go ahead and load the data from the Oracle database. Go to the R shell and type:
myData=ore.pull(DATA_POINTS)
Then press Enter


3. Now that we have our data back inside R we can manipulate it. Let's do k-means clustering on the data. Go to the R shell and type:
cl <- kmeans(myData, 2)
Then Press Enter


4. The clustering is now done, but displaying the data in text format is not very interesting.
Let's graph the data. Go to the R terminal and type:
plot(myData, col = cl$cluster)
Then press Enter


5. A new window pops up with the data. The two colors (red and black) differentiate the two
clusters we asked the algorithm to find. We can even see where the cluster centers are.
Go back to the R shell. The terminal might be hidden behind the graph; move the windows
around until you find the terminal, then type:
points(cl$centers, col=1:2, pch = 8, cex=2)
Then press Enter


When you go back to the graph you will see the centers marked with a * and the points
marked with circles: raw random data clustered using the K-means algorithm.


6. When you are done evaluating the image you can click on the X in the right upper corner
of the window.


7. You can also close the R terminal by going to the R shell and typing:
q()
Then press Enter


8. When asked if you want to save workspace image go to the terminal and type:
n
Then Press Enter


9. This concludes this exercise. You can now go ahead and close the terminal. Go to the
terminal and type:
exit
Then press Enter


8.5 Summary
In this exercise you were introduced to the R programming language and how to do clustering
using it. You also saw one of the advantages of Oracle R Enterprise Edition, where you can save
your results into the Oracle database as well as extract data from the database for further
calculations. Oracle R Enterprise Edition also has a small set of functions which can be run on data
directly in the database. This enables the user to work with very large data sets which would not fit
into the normal memory of R.
Oracle R Enterprise provides these collections of functions:

ore.corr
ore.crosstab
ore.extend
ore.freq
ore.rank
ore.sort
ore.summary
ore.univariate


9. ORACLE NOSQL DATABASE


9.1 Introduction To NoSQL
Oracle NoSQL Database provides multi-terabyte distributed key/value pair storage that offers
scalable throughput and performance. That is, it services network requests to store and retrieve
data which is organized into key-value pairs. Oracle NoSQL Database services these types of data
requests with a latency, throughput, and data consistency that is predictable based on how the
store is configured.
Oracle NoSQL Database offers full Create, Read, Update and Delete (CRUD) operations with
adjustable durability guarantees. Oracle NoSQL Database is designed to be highly available, with
excellent throughput and latency, while requiring minimal administrative interaction.
Oracle NoSQL Database provides performance scalability. If you require better performance, you
use more hardware. If your performance requirements are not very steep, you can purchase and
manage fewer hardware resources.
Oracle NoSQL Database is meant for any application that requires network-accessible key-value
data with user-definable read/write performance levels. The typical application is a web application
which is servicing requests across the traditional three-tier architecture: web server, application
server, and back-end database. In this configuration, Oracle NoSQL Database is meant to be
installed behind the application server, causing it to either take the place of the back-end database,
or work alongside it. To make use of Oracle NoSQL Database, code must be written (using Java)
that runs on the application server.

9.2 Overview of Hands on Exercise


In this exercise you will be experimenting with the Oracle NoSQL Database. Most of the exercises
will have you look at pre-written Java code and then compile and run that code. Ensure you
understand the code and all of its nuances, as it is what makes up the NoSQL database interface.
If you would like to understand all of the functions that are available, there is a Javadoc available on
the Oracle web site.
In this exercise you will:
1. Insert and retrieve a simple key-value pair from the NoSQL database
2. Experiment with the multiGet functionality to retrieve multiple values at the same time
3. Integrate NoSQL with Hadoop code to do a word count on data in the NoSQL database

9.3 Insert and retrieve key-value pairs


1. All of the setup and execution for this exercise can be done from the terminal, hence open
a terminal by double clicking on the Terminal icon on the desktop.


2. To get into the folder where the scripts for the NoSQL exercise are, type in the terminal:
cd /home/oracle/exercises/noSQL
Then press Enter

3. Before we do anything with the NoSQL database we must first start it. So let's go ahead and
do that. Go to the terminal and type:
./startNoSQL.sh
Then press Enter

4. To check if the database is up and running we can ping the database. Let's do that.
Go to the terminal and type:
./pingNoSQL.sh
Then press Enter


You will see Status: RUNNING displayed within the text. This shows the database is
running.

5. Oracle NoSQL Database uses a Java interface to interact with the data. This is a dedicated
Java API which lets you insert, update, delete and query data in the key-value store
that is the NoSQL database. Let's look at a very simple example of Java code where we
insert a key-value pair into the database and then retrieve it. Go to the terminal and type:
gedit Hello.java
Then press Enter


A new window will pop up with the code. In this code there are a couple of things to note.
We see the config variable, which holds our connection string, and the store variable,
which is our connection factory to the database. They are the initialization variables for the
key-value store and are highlighted in yellow. Next we define 2 variables of type Key and
Value; they will serve as our payload to be inserted. These are highlighted in green. Next
we have, highlighted in purple, the actual insert command. Highlighted in blue is the retrieve
command for getting data out of the database.
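If the screenshot is hard to read, the following is a minimal sketch of the same flow using the Oracle NoSQL Database Java API. The store name and helper host:port below are assumptions (the kvlite defaults), and the workshop's actual Hello.java may differ in its details.

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class HelloSketch {
    public static void main(String[] args) {
        // Store name and helper host are assumptions (kvlite defaults); adjust to your environment.
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // The payload: a key and a value.
        Key key = Key.createKey("Hello");
        Value value = Value.createValue("Big Data World".getBytes());

        // Insert the key-value pair.
        store.put(key, value);

        // Retrieve it again and print the key together with the value.
        ValueVersion vv = store.get(key);
        System.out.println("Hello " + new String(vv.getValue().getValue()));

        store.close();
    }
}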


6. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


7. Let's go ahead and compile that code. Go to the terminal and type:
javac Hello.java
Then press Enter


8. Now that the code is compiled, let's run it. Go to the terminal and type:
java Hello
Then press Enter

You will see printed on the screen Hello Big Data World, which is the key and the value we
inserted into the database.


9. Oracle NoSQL Database keys can have a major and a minor component. This feature can
be very useful when trying to group and retrieve multiple items at the same time from the
database. In the next piece of code we have 2 major components to the key (Mike and
Dave) and each major component has minor components (Question and Answer). We will
insert a value for each key, but we will use a multiGet function to retrieve all of the values
under Mike regardless of the minor component of the key, and completely ignore Dave.
Let's see what that code looks like. Go to the terminal and type:
gedit Keys.java
Then press Enter


10. A new window will pop up with the code. If you scroll to the bottom you will see the
following pieces of code. Highlighted in purple are the insertion calls which add the data
to the database. The retrieval of multiple records is highlighted in blue, and the green
shows the display of the retrieved data. Do note that 4 key-value pairs were inserted into
the database.
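As a hedged sketch of the same idea (not the workshop's exact Keys.java; the store name, helper host and the stored values are illustrative), the major/minor keys and the multiGet call look roughly like this:

import java.util.Map;
import java.util.SortedMap;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class KeysSketch {
    public static void main(String[] args) {
        // Store name and helper host are assumptions (kvlite defaults).
        KVStore store = KVStoreFactory.getStore(new KVStoreConfig("kvstore", "localhost:5000"));

        // Four records: major components Mike and Dave, each with Question and Answer
        // minor components. The value strings are illustrative only.
        store.put(Key.createKey("Mike", "Question"), Value.createValue("question text".getBytes()));
        store.put(Key.createKey("Mike", "Answer"), Value.createValue("answer text".getBytes()));
        store.put(Key.createKey("Dave", "Question"), Value.createValue("question text".getBytes()));
        store.put(Key.createKey("Dave", "Answer"), Value.createValue("answer text".getBytes()));

        // multiGet on the major component alone returns every record stored under Mike
        // (both minor components) and nothing stored under Dave.
        // The null KeyRange and Depth arguments use the defaults.
        SortedMap<Key, ValueVersion> mikeRecords = store.multiGet(Key.createKey("Mike"), null, null);
        for (Map.Entry<Key, ValueVersion> entry : mikeRecords.entrySet()) {
            System.out.println(entry.getKey() + " -> " + new String(entry.getValue().getValue().getValue()));
        }

        store.close();
    }
}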


11. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


12. Let's go ahead and compile that code. Go to the terminal and type:
javac Keys.java
Then press Enter


13. Now that the code is compiled, let's run it. Go to the terminal and type:
java Keys
Then press Enter


You will see the 2 values that are stored under the Mike major key displayed on the
screen, and no data points for the Dave major key. Major and minor parts of the key can
actually be composed of multiple strings, and further filtering can be done. This is left up to
the participants to experiment with.


14. The potential of a key-value store grows significantly when integrated with the power of
Hadoop and distributed computing. Oracle NoSQL Database can be used as a source and
target for the data used by and produced by Hadoop. Let's look at a modified example of
word count run in Hadoop, only this time we will count the number of values under the
major component of the key in the NoSQL database. To see the code go to the terminal
and type:
gedit Hadoop.java
Then press Enter


The code you see is very similar to the word count seen in the first section of the workshop.
There are only 2 differences. The first (highlighted in yellow) is the retrieval of
data from the NoSQL database rather than from a flat file.


The second difference can be seen if you scroll down into the run function: the
InputFormatClass is now KVInputFormat

15. When you are done evaluating the code press the X in the right upper corner of the
window to close it.


16. Let's go ahead and run that code. We will need to go through the same procedure as in the
first exercise, where we compile the code, create a jar and then execute it on the Hadoop
cluster. We have written a script which will do all of that for us. Let's run that script; go to
the terminal and type
./runHadoop.sh
Then press Enter


17. You will see a Hadoop job being executed, with all of the terminal output that comes with it.
Once the execution is done it is time to see the results. We will just cat the results directly from
HDFS. Go to the terminal and type
./viewHadoop.sh
Then press Enter


You will see, displayed on the screen, a word count based on the major component of the keys
in the NoSQL database. In the previous exercise we inserted 2 pieces of data under each of the
major keys Dave and Mike. We also inserted a Hello key in the first exercise. This is
exactly the data the word count displays.


18. That concludes our exercises on the NoSQL database. It is time to shut down our NoSQL
database. Go to the terminal and type:
./stopNoSQL
Then press Enter


19. We can now close our terminal window. Go to the terminal and type:
exit
Then press Enter


9.4 Summary
In this exercise you were introduced to Oracle's NoSQL database. You saw how to insert and
retrieve key-value pairs, as well as the multiGet function where multiple values can be retrieved
under the same major component of a key. The last example showed how a NoSQL database can
be used as a source for a Hadoop job and how the two technologies can be integrated.
It is important to note here the differences between the NoSQL database and a traditional RDBMS.
With relational data the queries performed are much more powerful and more complex, while
NoSQL simply stores and retrieves values for a specific key. Given the simplicity of the NoSQL
storage model, it has a significant performance and scaling advantage. A NoSQL database can store
petabytes worth of information in a distributed cluster and still maintain very good performance for
data interaction, at a much lower cost per megabyte of data. NoSQL has many uses and has been
implemented successfully in many different circumstances, but at the same time it does not mimic
or replace the use of a traditional RDBMS.


APPENDIX A
A.1 Setup of a Hive Data Store

1. Once we are connected to ODI we need to set up our models; the logical and physical
definitions of our data sources and targets. To start off, at the top of the screen click on
Topology.

2. Next, in the left menu, make sure you are on the Physical Architecture tab and expand the
Technologies list


3. In the expanded list find the folder Hive and expand it


4. In this folder we need to create a new Data Server. Right click on the Hive technology and
select New Data Server

5. A new tab will open on the right side of the screen. Here you can define all of the
properties of this data server. Enter the following details:
Name: Hive Server
Then click on the JDBC tab in the left menu


6. On the right of the JDBC Driver field click on the magnifying glass to select the JDBC
driver

7. A new window will pop up which will allow you to select from a list of drivers. Click on the
down arrow to see the list


8. From the list that appears select Apache Hive JDBC Driver.

9. Now click OK to close the window


10. Back at the main window enter the following information


JDBC Url: jdbc:hive://bigdatalite.us.oracle.com:10000/default
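As an aside (this is not a step in the lab), the URL above is the same one any HiveServer1-era JDBC client would use. A minimal sketch of connecting to it directly from Java, assuming the Hive JDBC driver jars are on the classpath, could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1 driver class, matching the jdbc:hive:// URL used by ODI above.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://bigdatalite.us.oracle.com:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables in the default database, just to prove the connection works.
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}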

11. We need to set some Hive-specific variables. In the menu on the left, go now to the
Flexfields tab


12. In the Flexfields tab uncheck the Default check box and enter the following information:
Value: thrift://localhost:10000
Don't forget to press Enter when done typing to set the variable

13. It is now time to test to ensure we set everything up correctly. In the upper left corner of
the right window click on Test Connection


14. A window will pop up asking if you would like to save your data before testing. Click OK

15. An informational message will pop up asking you to register a physical schema. We can ignore
this message as that will be our next step. Just click OK

16. You need to select an agent to use for the test. Leave the default
Physical Agent: Local(No Agent)
Then click Test


17. A window should pop up saying Successful Connection. Click OK

If any other message is displayed please ask for assistance to debug. It is critical for the
rest of this exercise that this connection is fully functional.
18. Now, in the menu on the left side of the screen, in the Hive folder, there should be a
physical server created called Hive Server. Right click on it and select New Physical
Schema.


19. A new tab will again open on the right side of the screen to enable you to define the details of the Physical Schema. Enter the following details:
Schema (Schema): default
Schema (Work Schema): default
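(Optional) If you are unsure which schema names are valid here, a small JDBC query can list them. This is only a sketch under the same assumptions as the earlier JDBC example (driver class, URL, empty credentials); SHOW DATABASES is standard HiveQL.

    // HiveSchemaCheck.java - an optional sketch that lists the Hive databases so you
    // can confirm that the schema name entered here ("default") really exists.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSchemaCheck {
        public static void main(String[] args) throws Exception {
            // Same assumed driver and URL as in the earlier connectivity sketch.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://bigdatalite.us.oracle.com:10000/default", "", "");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW DATABASES");
            System.out.println("Available Hive schemas:");
            while (rs.next()) {
                System.out.println("  " + rs.getString(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }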


20. Then click Save All in the upper left part of the screen.

21. A warning will appear about No Context specified. Again, this will be our next step. Just click OK.

22. We now need to expand the Logical Architecture tab in the left menu. Toward the bottom left of the screen you will see the Logical Architecture tab; click on it.


23. In the Logical Architecture tab you will need to again find the Hive folder and click on the +
to expand it.


24. Now, to create the logical store, right-click on the Hive folder and select New Logical Schema.


25. In the new window that opens on the right of the screen, enter the following information:
Name: Hive Store
Context: Global
Physical Schemas: Hive Server.default


26. This should set up the Hive data store, enabling us to move data into and out of Hive with ODI. We now need to save all of the changes we made. In the upper left corner of the screen, click on the Save All button.

27. We can close all of the tabs we have opened on the right side of the screen to reduce the clutter. Click on the X for each of them.


We would theoretically need to repeat steps 7-29 for each of the different types of data stores. As the procedure is almost the same, a flat file source and an Oracle database target have already been set up for you, to reduce the number of steps in this exercise. For details on how to use flat files and Oracle databases with ODI, please see the excellent Oracle by Example ODI tutorials found at
http://www.oracle.com/technetwork/tutorials/index.html.
28. We now need to go to the Designer Tab in the left menu to perform the rest of our
exercise. Near the top of the screen on the left side click on the Designer tab.

29. Near the bottom of the screen on the left side there is a Models tab; click on it.


30. You will notice there is already a File and an Oracle model created for you. These were pre-created as per the note above. Let's now create a model for the Hive data store we just created. In the middle of the screen in the right panel there is a folder icon next to the word Models. Click on the folder icon and select New Model.


31. In the new tab that appears on the right side enter the following information:
Name: Hive
Code: HIVE
Technology: Hive
Logical Schema: Hive Store


32. We can now go to the upper left corner of the screen and save this model by clicking on the Save All icon.
