
Tame Big Data using Oracle Data Integration

Purpose

This demo illustrates how you can move and transform all your data using Oracle Data Integration - whether that data resides in Oracle Database, Hadoop, third-party databases, applications,
files, or a combination of these sources. The "Design once, run anywhere" paradigm allows you to focus on the logical rules of data transformation and movement while the choice of
implementation is separate and can evolve as use cases change and new technologies become available.

This demo is included in the virtual machine Big Data Lite 4.2, which is downloadable from http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

Time to Complete

Approximately one hour

Introduction

This tutorial is divided into the following sections:

1. Review the Scenario
2. Ingesting Data Using Sqoop and Oracle GoldenGate
3. Transforming Data Using Hive, Spark, or Pig
4. Loading Data to Oracle DB using Oracle Loader for Hadoop
5. Accessing Hadoop Data from Oracle using Big Data SQL
6. Data Sessionization using Pig
7. Execute all Steps Using an ODI Package

Scenario

Oracle MoviePlex is an online movie streaming company. Its web site collects every customer interaction in massive JSON formatted log files. It also maintains movie data in an Oracle source
database. By unlocking the information contained in these sources and combining it with enterprise data in its data warehouse, the company will be able to enrich its understanding of customer
behavior, the effectiveness of product offers, the organization of web site content, and more.

The company is using Oracle's Big Data Management System to unify its data platform and facilitate these analyses. This is achieved by implementing a Data Reservoir pattern, where both structured and unstructured data are collected and staged in a Big Data instance for further analysis and loading into target databases.

Oracle Data Integration provides a unified tool-driven approach to declaratively define integration of all data. The concept of "Design once, run anywhere" means that users can define integration processes regardless of the implementation language and run them in different environments. A transformation that is executed today on an RDBMS can be reused to run on a Hadoop cluster and take advantage of new Hadoop languages and optimizations as they become available.

For Oracle MoviePlex, a variety of mechanisms is showcased. Data is loaded from a source database into Hive tables, both in bulk using Sqoop through Oracle Data Integrator (ODI) and through change data capture using Oracle GoldenGate (OGG). Data is transformed by joining, filtering, and aggregating with Hive, Spark, or Pig in ODI, and the resulting data can be unloaded into a target Oracle DB using the optimized Oracle Loader for Hadoop (OLH) or Oracle SQL Connector for HDFS (OSCH). Hadoop data can also be used in Oracle DB through Big Data SQL, where ODI transparently generates the necessary external tables to expose Hive tables in the Oracle DB for use in SQL queries.

Let's begin the tutorial by reviewing how data is moved and transformed using Oracle Data Integration for Big Data.

Resetting the demo

You can reset the demo environment from a previously run demo or hands-on lab by executing the script:
/home/oracle/movie/moviework/odi/reset_ogg_odi.sh
Please note that this script will erase any changes to the ODI repository or target tables.

Downloading and Installing Big Data Lite Virtual Machine

Please follow the instructions at the Big Data Lite Deployment Guide for details on downloading, installing, and starting Big Data Lite 4.2.

Starting required services

1. Double-click Start/Stop Services on the desktop

2. In the Start/Stop Services window, make sure the following services are started:

◾ ORCL Oracle Database 12c
◾ Zookeeper

Select any services that are not started and press OK.

Prepare Oracle Database for Oracle GoldenGate

1. Open a terminal window by single-clicking the Terminal icon in the task bar:

2. Execute the setup script by entering the following commands in the Terminal window:
cd /home/oracle/movie/moviework/ogg
./enable_ogg_on_orcl.sh

This only has to be done once; the configuration stays valid after reboots of the VM. To undo the changes that this script makes, you can execute disable_ogg_on_orcl.sh in the same directory.

In this section, you will learn how to ingest data from external sources into Hadoop, using Sqoop for bulk load and Oracle GoldenGate for change data capture.

Ingest Bulk Data Using Sqoop in Oracle Data Integrator


Oracle Data Integrator allows the creation of mappings to move and transform data. In this step we will review and execute an existing mapping to load data from an Oracle DB source to a
Hive target table.

Open and review the Sqoop mapping

1. Launch ODI from the Desktop toolbar.

2. Connect to the ODI repository.


3. Use the pre-selected login for ODI Movie Demo.

4. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > A - Load Movie (Sqoop). Double-click to open the mapping.

5. The Sqoop mapping is displayed in the mapping editor. You can see the source datastore MOVIE from Oracle being mapped into the target datastore movie_updates from Hive. Datastore is an ODI term for tables and similar entities of data. This is the logical view of the mapping, where transformation rules are defined independently of the physical implementation.
Please note that movie_updates has additional fields op for the operation type and ts for a timestamp. These columns are used to prepare the data for later reconciliation with updates that will be appended by Oracle GoldenGate. The field op is initialized with the string 'i', while ts is populated with SYSDATE evaluated on the Oracle source. A sketch of such a target table definition is shown after this step list.

Click on the tab Physical under the mapping to see the physical design.

6. The physical design shows the source and target execution units, such as databases or Hadoop clusters. It is now visible that the two tables are in different systems, and
an access point MOVIE_AP controls the load.

Select the access point MOVIE_AP to review the configuration of the data load.

7. Go to the Properties window, which is typically below the mapping editor. Scroll down to the Load Knowledge Module section and expand it. It shows that the selected Load Knowledge Module (LKM) is LKM SQL to Hive SQOOP, which performs a load into Hadoop using Sqoop. Under the Knowledge Module there is a list of options to configure and tune the Sqoop load, for example by setting up parallelism. Please review the Advanced and Parallelism tabs for additional options. All options have been left at their defaults for simplicity; in a real use case these options would be used to tune the Sqoop load.
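For illustration, here is a minimal HiveQL sketch of what a staging table such as movie_updates could look like. This is an assumption for this tutorial text, not the definition stored in the demo repository; the movie columns follow the MOVIE table used later in this tutorial, and only op and ts are taken from the description above.

-- Hypothetical sketch of the Hive staging table used for reconciliation.
CREATE TABLE movie_updates (
  movie_id     INT,
  title        STRING,
  year         INT,
  budget       INT,
  gross        INT,
  plot_summary STRING,
  op           STRING,   -- operation type, 'i' for rows from the initial Sqoop load
  ts           STRING    -- change timestamp, SYSDATE evaluated on the Oracle source
);
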
Execute the Sqoop mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started
dialog will appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this
case the generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution.
If execution has not finished it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh
automatically every 5 seconds.

4. Once the load is complete, the warning icon will be displayed. A warning icon is ok for this run and still means the load was successful. You can expand the execution tree
to see the individual tasks of the execution.
5. Go to Designer navigator and Models and right-click HiveMovie.movie. Select View Data from the menu to see the loaded rows.

6. A Data editor appears with all rows of the movie table in Hive.

Ingest Change Data using Oracle GoldenGate for Big Data

Oracle GoldenGate allows the capture of committed transactions from a source database and the replication of these changes to a target system. Oracle GoldenGate is non-invasive, highly performant, and reliable in capturing and applying these changes. Oracle GoldenGate for Big Data provides components to replicate changes captured by GoldenGate into different Hadoop target technologies, with adapters for Hive, HDFS, HBase, Flume, and Kafka. In this tutorial we will replicate changes to the MOVIE table in Oracle to the corresponding movie_updates table in Hive, first using the Hive delivery and later also demonstrating delivery to Kafka.

The GoldenGate processes for the Hive example are set up in the following steps:

Start Oracle GoldenGate and Set Up Replication from Oracle to Hive

1. Start a terminal window from the menu bar by single-clicking on the Terminal icon.

2. First, start the Extract processes for GoldenGate. In the terminal window, execute the commands:
cd /u01/ogg
./ggsci

3. Add and start the Capture (EMOV) process by executing:


obey dirprm/bigdata.oby

4. See the status of the newly added processes by executing


info all

5. Close the ggsci client by entering


exit

6. Second, start the Replicat processes for GoldenGate for Big Data. In the terminal window, execute the commands:
cd /u01/ogg-bd
./ggsci

7. Add and start the Replicat (RMOV) process by executing:


obey dirprm/bigdata.oby

8. See the status of the newly added process by executing


info all
The RKAFKA process can be ignored here; it will show as STOPPED if Kafka is not started.

9. Start a second terminal window from the menu bar and enter the command:
sqlplus system/welcome1@orcl

10. Insert a new row into the Oracle table MOVIE by executing the following commands:
INSERT INTO "MOVIEDEMO"."MOVIE" (MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY) VALUES ('1', 'Sharknado', '2014', '500000', '20000000',
'Flying sharks attack city');
commit;

Note: Alternatively you can execute the following command:


@ /home/oracle/movie/moviework/ogg/oracle_insert_movie.sql;

11. Go to the ODI Studio and open the Designer navigator and Models accordion. Right-click on datastore HiveMovie.movie_updates and select View Data.

12. In the View Data window choose the Move to last row toolbar button. The inserted row with movie_id 1 should be in the last row. You might have to scroll all the way
down to see it. Refresh the screen if you don’t see the entry.

13. Update an existing row in the Oracle table MOVIE by executing the following commands:
UPDATE "MOVIEDEMO"."MOVIE" SET BUDGET=BUDGET+'10000' where MOVIE_ID='1';
commit;

Note: Alternatively you can execute the following command:


@ /home/oracle/movie/moviework/ogg/oracle_update_movie.sql;

14. Go back to the Data Viewer in ODI Studio. You will have to refresh the screen and jump to the end again. You will notice a new entry with operation type "U" and the updated budget for movie_id 1.

15. Delete the row from the Oracle table MOVIE by executing the following commands:
DELETE FROM "MOVIEDEMO"."MOVIE" WHERE MOVIE_ID='1';
commit;

Note: Alternatively you can execute the following command:


@ /home/oracle/movie/moviework/ogg/oracle_delete_movie.sql;

16. Go back to the Data Viewer in ODI Studio. You will have to refresh the screen and jump to the end again. You will notice another new entry with operation type "D" for movie_id 1. This marks a delete from the table.

17. Start another terminal window from the menu bar and enter the command:
hive
This will open the Hive CLI to query the result table.

18. Enter the command:


show create table movie_view;

This shows a view you can use to query a reconciled version of the movie table in Hive. Please note that in this tutorial we will also use Oracle Data Integrator to create a reconciled table. Reconciliation by view provides a real-time status, but is more resource-intensive for frequent queries. A sketch of how such a reconciliation view can be written appears after the query below.

You can query this view by entering:


select * from movie_view;
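
To make the idea concrete, here is a minimal HiveQL sketch of how a reconciliation view over movie_updates can be written. It is an illustrative assumption, not the actual movie_view definition, which you can inspect with show create table movie_view.

-- Illustrative reconciliation view: keep the latest change per movie and
-- drop movies whose latest operation is a delete.
CREATE VIEW movie_view_sketch AS
SELECT movie_id, title, year, budget, gross, plot_summary
FROM (
  SELECT movie_id, title, year, budget, gross, plot_summary, op,
         ROW_NUMBER() OVER (PARTITION BY movie_id ORDER BY ts DESC) AS rn
  FROM movie_updates
) latest
WHERE rn = 1 AND upper(op) <> 'D';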

Replicate Data from Oracle to Kafka


Oracle GoldenGate for Big Data provides a built-in component to replicate changes captured by GoldenGate into Kafka. It can also deliver the data in various formats such as JSON, XML, and Avro, in addition to the CSV format. The GoldenGate processes for the Kafka handler example are set up in the following steps:

Set up Kafka

1. Double-click Start/Stop Services on the desktop


Oracle Big Data Management System

2. In the Start/Stop Services window, make sure the following services are started:
◾ ORCL Oracle Database 12c
◾ Zookeeper
◾ Kafka
3. Start the Kafka server

Kafka uses Zookeeper, so you need to first start a Zookeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node Zookeeper instance in case it is not running.

Check whether Zookeeper and Kafka are running by executing:

ps -eaf | grep -i kafka

If Kafka and Zookeeper are not running, start them using the Start/Stop Services window from the steps above.

4. Create a topic

We will use a Kafka topic named "oggtopic". Start a console consumer subscribed to this topic; this terminal will display the messages produced in the following steps:

cd /usr/lib/kafka
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic oggtopic --from-beginning

5. Send some test messages

Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default each line will be
sent as a separate message.

Run the producer in another terminal and then type a few messages to send to the server:

/usr/lib/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic oggtopic
Test message 1
This is another test message 2

If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in
the consumer terminal.

All of the command line tools have additional options; running the command with no arguments will display usage information documenting them in more detail.

Working with Kafka handler

1. Set Up Replication from Oracle:


Start a terminal window from the menu bar by single-clicking on the Terminal icon.

2. In the terminal window, execute the commands:


cd /u01/ogg-bd
./ggsci

3. Run the bigdata.oby file in the directory /u01/ogg-bd/dirprm to start the OGG Manager, add Kafka Replicat and start Kafka Replicat
ggsci> obey dirprm/bigdata.oby
4. Check whether the big data replicat process is running

ggsci> info all

5. Start the Oracle Database Extract process by executing:

cd /u01/ogg
./ggsci
ggsci> obey dirprm/bigdata.oby

6. Check whether the Oracle extract process is running

ggsci> info all

7. Open another terminal window from the menu bar and enter the command:
sqlplus system/welcome1@orcl
8. Insert a new row into the Oracle table MOVIE by executing the following commands:
INSERT INTO "MOVIEDEMO"."MOVIE" (MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY) VALUES ('1', 'Sharknado', '2014', '500000', '20000000',
'Flying sharks attack city');
commit;

The Kafka Replicat produces the Kafka messages in Avro format, and they are captured by the Kafka consumer console started earlier.

Merge Updates using Hive in Oracle Data Integrator


In our demo we have used Oracle Data Integrator to load initial data using Sqoop and Oracle GoldenGate to replicate changes in real time. We have also used a view to reconcile the updates on the fly. We can also merge the data in bulk using an ODI mapping to provide the same data that the original table in Oracle holds.

Open and review the Hive merge mapping

1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio.

2. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > B - Merge Movies (Hive). Double-click to open the mapping.

3. The mapping logical view shows the transformations used for the mapping:

◾ Source table movie_updates provides information about each movie, with op providing operation types I, U, and D, while ts provides a timestamp value.
◾ AGGREGATE is used to group all movies based on movie_id and calculate the latest timestamp as max_ts.
◾ JOIN is used to get all movie update entries with the latest timestamp for each movie_id, as calculated in the aggregate.
◾ FILTER is used to filter out any rows of operation type D, which means that the latest change was a delete. A HiveQL sketch of this merge logic is shown after this list.
4. Switch to the Physical View of the mapping. Since the transformation is within Hive, both source and target datastores are in the same execution unit.
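
For reference, here is a minimal HiveQL sketch of the merge that mapping B expresses declaratively. It is an illustrative assumption; the code that ODI actually generates from the Knowledge Modules will differ in detail.

-- Keep only the latest change per movie, drop movies whose latest
-- operation is a delete, and overwrite the reconciled movie table.
INSERT OVERWRITE TABLE movie
SELECT u.movie_id, u.title, u.year, u.budget, u.gross, u.plot_summary
FROM movie_updates u
JOIN (
  SELECT movie_id, MAX(ts) AS max_ts              -- AGGREGATE component
  FROM movie_updates
  GROUP BY movie_id
) agg ON u.movie_id = agg.movie_id
     AND u.ts = agg.max_ts                        -- JOIN component
WHERE upper(u.op) <> 'D';                         -- FILTER component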

Execute the Hive merge mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started
dialog will appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this
case the generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 40 - Load MOVIE to review the generated code in
a task editor.

5. Go to Designer navigator and Models and right-click HiveMovie.movie. Select View Data from the menu to see the loaded rows.

6. A Data editor appears with all rows of the movie table in Hive.

In this section, you will use Oracle Data Integrator to transform the movie data previously loaded with Sqoop and GoldenGate. The use case is to join the table movie with customer activity event data that has been previously loaded into an HDFS file using Flume and is now exposed as a Hive table movieapp_log_odistage. The activity data contains rating actions; we will calculate an average rating for every movie and store the result in a Hive table movie_rating. You can create one logical mapping and choose whether to use Hive, Spark, or Pig as the execution engine of the staging location. ODI will generate either Hive SQL, Spark-Python, or Pig Latin and execute it in the appropriate server engine.

Also as part of this chapter we will show a mapping that takes a nested JSON HDFS file as input and flattens it to calculate movie ratings on its contents. The implementation engine used for this mapping is Spark.

Transform Movie Data using Hive

Open and review the Hive mapping

1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio.

2. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > C - Calc Ratings (Hive - Pig - Spark). Double-click to open the mapping.

3. The mapping logical view shows the transformations used for the mapping:

◾ Source table movie provides information about each movie, while source table movieapp_log_avro contains raw customer activities.
◾ FILTER is being used to filter down to activity = 1 events, which are rating events.
◾ AGGREGATE is used to group all ratings based on movieid and calculate an average of the movie ratings.
◾ JOIN is used to join base movie information from table movie with the aggregated events to write to the target table movie_rating.
◾ The target table movie_rating stores the result from the join. It uses a user-defined function XROUND that provides consistent rounding across all implementation languages. A HiveQL sketch of this calculation is shown after this list.

4. Switch to the Physical View of the mapping and make sure the tab Hive is selected on the bottom. There is a physical design for each of the 3 implementation engines. Since the transformation is within Hive, both source and target datastores are in the same execution unit.
5. Select the target MOVIE_RATING and review the Properties window, Integration Knowledge Module tab. The IKM Hive Append has been selected with default settings.
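
As an illustration, the transformation this mapping describes could be written in HiveQL roughly as follows. This is a sketch under assumptions: the rating column name in the activity data and the column list of movie_rating are illustrative, and XROUND is approximated here with the built-in round() function.

-- Illustrative HiveQL for mapping C; the generated code differs per engine.
INSERT INTO TABLE movie_rating
SELECT m.movie_id, m.title, agg.avg_rating
FROM movie m
JOIN (
  SELECT movieid, round(AVG(rating), 1) AS avg_rating   -- AGGREGATE
  FROM movieapp_log_avro
  WHERE activity = 1                                     -- FILTER: rating events only
  GROUP BY movieid
) agg ON m.movie_id = agg.movieid;                       -- JOIN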

Execute the Hive mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will
appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case
the generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 40 - Load MOVIE_RATING to review the generated
code in a task editor.

5. Go to Designer navigator and Models and right-click HiveMovie.movie_rating. Select View Data from the menu to see the loaded rows.
6. A Data editor appears with all rows of the movie_rating table in Hive.

Transform Movie Data using Spark

Open and review the mapping

1. Open ODI Studio and open mapping Big Data Demo > Demo > Mappings > C - Calc Ratings (Hive - Pig - Spark) . For details please see previous chapter "Transform Movie
Data Using Hive".

2. Please note that we are using the same logical mapping for Hive, Spark, and Pig. There is no separate logical design for these implementation engines. The property Staging
Location Hint of the logical diagram is used to select an implementation engine prior to generation of a physical design.

3. Switch to the Physical View of the mapping and make sure the tab Spark is selected on the bottom. There is a physical design for each of the 3 implementation engines. Since
the transformation is using Spark and the sources and targets are defined using Hive Catalog metadata, source, staging (Spark), and target datastore are shown in separate
execution units.
4. Select the component JOIN and review the Properties window, Extract Options tab. If you switch to the tab Advanced, you will see that an Extract Knowledge Module XKM
Spark Join has been selected for you by the system. XKMs such as this one organize the Spark-Python code generation of each component for you. You can set advanced
options such as CACHE_DATA or NUMBER_OF_TASKS in the Options tab to tune Spark memory handling and parallelism.

Execute the Spark mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. In the Run dialog change the Physical Mapping Design drop-down menu from Hive to Spark. Click OK for the run dialog. We will use all defaults and run this mapping on the
local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case
the generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished it will show the Run icon
for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 30 - Load JOIN_AP to review the generated Spark-
Python code in a task editor.

5. You can review the result data in the datastore movie_rating, please see previous chapter "Transform Movie Data Using Hive" for details.

Transform Movie Data using Pig

Open and review the mapping

1. Open ODI Studio and open mapping Big Data Demo > Demo > Mappings > C - Calc Ratings (Hive - Pig - Spark) . For details please see previous chapter "Transform Movie
Data Using Hive".

2. Please note that we are using the same logical mapping for Hive, Spark, and Pig. There is no separate logical design for these implementation engines.
The property Staging Location Hint of the logical diagram is used to select an implementation engine for the next generated physical design.
3. Switch to the Physical View of the mapping and make sure the tab Pig is selected on the bottom. There is a physical design for each of the 3 implementation engines. Since the
transformation is using Pig and the sources and targets are defined using Hive Catalog metadata, source, staging (Pig), and target datastore are shown in separate execution
units.

Execute the Pig mapping

1. Prior to running the Pig mapping it is necessary to truncate the target table. Go to the Procedures node in the Projects tree and run the procedure Truncate movie_rating by
right-clicking on it and selecting Run.... Use default settings on the Run dialog and select OK for all following dialogs.

Note: You will see in Part 6 how to automate this step in a package.

2. Go back to the Mapping editor and press the Run button on the taskbar above the mapping editor.

3. In the Run dialog change the Physical Mapping Design drop-down menu from Hive to Pig. Click OK for the run dialog. We will use all defaults and run this mapping on the local
agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will appear, press OK there as well.
Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case
the generated session is displayed in a dialog window and no execution is shown in the operator.

4. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

5. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 30 - Load JOIN_AP to review the generated Pig Latin code in a task editor.

6. You can review the result data in the datastore movie_rating, please see previous chapter "Transform Movie Data Using Hive" for details.

Transform Nested JSON Data Using Spark

Prepare and review source HDFS JSON data

1. Start a terminal window from the menu bar by single-clicking on the Terminal icon.

2. First we need to put test JSON data into HDFS for further transformation.
In the terminal window, execute the commands:
hdfs dfs -put ~/movie/moviework/odi/movie_ratings/movie_ratings.json /user/odi/demo/movie_ratings.json

Note: You will see in Part 6 how to automate this copy step in a package.

3. In the terminal window, execute the commands:


hdfs dfs -cat /user/odi/demo/movie_ratings.json

You can see a JSON-formatted file with records of movies and nested rating records between 1-5. This will be the input for our mapping; you can see the ODI Model HDFSMovie with datastore movie_ratings containing the metadata for the JSON file. Please note that the datastore does not keep metadata about complex attributes like the ratings sub-array; it is displayed as a flat String type attribute.

Open and review the mapping

1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio.
2. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > D - Calc Ratings (JSON Flatten). Double-click to open the mapping.

3. The mapping logical view shows the transformations used for the mapping:

◾ Source HDFS file movie_ratings provides nested JSON information about each movie with a sub-array of ratings for each movie.
◾ FLATTEN is being used to un-nest the rating information. It will do a cross-product of each movie with its nested ratings.
◾ AGGREGATE is used to group all ratings based on movie_id and calculate an average of the movie ratings.
◾ The target table movie_rating stores the aggregated result. It uses a user-defined function XROUND that provides consistent rounding across all implementation languages. A sketch of the flatten-and-aggregate logic is shown after this list.

4. Switch to the Physical View of the mapping.
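
To make the flatten step concrete, here is a rough HiveQL sketch of the same logic. It assumes, purely for illustration, that the JSON file is exposed through a table whose ratings field is an array of structs with a rating element; the demo mapping itself runs on Spark and its generated code will look different.

-- Illustrative flatten-and-aggregate over nested ratings.
SELECT m.movie_id,
       round(AVG(r.rating), 1) AS avg_rating      -- XROUND approximated by round()
FROM movie_ratings m
LATERAL VIEW explode(m.ratings) flat AS r         -- FLATTEN: one row per nested rating
GROUP BY m.movie_id;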

Execute the mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will
appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case
the generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 40 - Load MOVIE_RATING to review the generated
code in a task editor.

5. Go to Designer navigator and Models and right-click HiveMovie.movie_rating. Select View Data from the menu to see the loaded rows.
6. A Data editor appears with all rows of the movie_rating table in Hive. Look for the entries with movie_id 10, 11, 12, 13, and 14; these were added through this mapping.

In this task we load the results of the prior Hive transformation from the resulting Hive table into the Oracle DB data warehouse. We are using the Oracle Loader for Hadoop (OLH) bulk data loader, which uses load mechanisms specifically optimized for Oracle DB.

Load Movie data to Oracle DB using ODI Mapping and Oracle Loader for Hadoop
1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio.

2. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > E - Load Oracle (OLH). Double-click to open the mapping.

3. The mapping logical view shows a direct map from the Hive table movie_rating to the Oracle table ODI_MOVIE_RATING. No transformations are done on the attributes.

4. Switch to the Physical View of the mapping. It shows the move from the Hive source to the Oracle target.
5. Select the access point MOVIE_RATING_AP (some of the label might be invisible) and review the Properties window, Loading Knowledge Module tab. The LKM Hive to Oracle OLH-OSCH Direct has been selected with default settings, except for TRUNCATE=True, which allows repeated execution of the mapping, and USE_HIVE_STAGING_TABLE=False, which avoids an additional staging step. The output mode OLH_OUTPUT_MODE is set to JDBC by default, which is a good setting for debugging and simple use cases. Other settings perform an OLH load through OCI or data pump files, or perform a load through OSCH.

Execute the OLH mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will
appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case the
generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the different tasks to review how the Loader Config and Map files
are generated and the loader is executed.

5. Go to Designer navigator and Models and right-click OracleMovie.ODI_MOVIE_RATING. Select View Data from the menu to see the loaded rows.
6. A Data editor appears with all rows of the ODI_MOVIE_RATING table in Oracle.

In the next section ODI will transform data from both Oracle DB as well as Hive in a single mapping using Oracle Big Data SQL. Big Data SQL enables Oracle Exadata to seamlessly query data
on the Oracle Big Data Appliance using Oracle's rich SQL dialect. Data stored in Hadoop is queried in exactly the same way as all other data in Oracle Database.

In this use case we are combining the previously used activity data with customer data that is stored in Oracle DB. We are summarizing all purchase activities for each customer and joining them with the core customer data, as sketched below.
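
The overall query pattern looks roughly like the following Oracle SQL; the external table, column, and grouping names are illustrative assumptions, and the actual external table is generated by the LKM as shown in the next steps.

-- Illustrative Big Data SQL query pattern for mapping F: filter the Hive
-- activity data (exposed as an external table) to sales events and join
-- it with the CUSTOMER table in Oracle.
SELECT c.country, SUM(a.sales) AS total_sales
FROM movieapp_log_ext a                 -- hypothetical Big Data SQL external table
JOIN customer c ON c.cust_id = a.custid
WHERE a.activity = 11                   -- sales events
GROUP BY c.country;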

Calculate Sales from Hive and Oracle Tables using Big Data SQL

Open and review the Big Data SQL mapping

1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio

2. Select the left-hand Designer navigator, open the Projects accordion and navigate to mapping Big Data Demo > Demo > Mappings > F - Calc Sales (Big Data SQL). Double-click to open the mapping.

3. The mapping logical view shows the transformations used for the mapping:

◾ Hive source table movieapp_log_odistage provides the customer activity data, while Oracle source table CUSTOMER contains information about each customer.
◾ FILTER is being used to filter down to activity = 11 events, which are sales events.
4. Switch to the Physical View of the mapping. It is visible that table movieapp_log_odistage is within Hive, while table CUSTOMER and all transformations are in the target Oracle DB.

5. Select the access point MOVIEAPP and review the Properties window, Loading Knowledge Module tab.The LKM Hive to Oracle (Big Data SQL) has been selected to access
the Hive table from Oracle remotely through an external table definition. All LKM options are default.

Note: LKM Hive to Oracle (Big Data SQL) is a custom example KM that is available for download at Java.net.

The IKM chosen for this mapping is the standard IKM Oracle Insert.

6. You can review the code of the custom LKM at Designer > Global Objects > Global Knowledge Modules > Loading > LKM Hive to Oracle (Big Data SQL). The main part of the LKM is a template to generate an external table using the Big Data SQL syntax. This external table is used in the integration query to insert rows into the target table. A sketch of such an external table definition is shown below.
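
For orientation, here is a rough sketch of what a Big Data SQL external table over the Hive activity table can look like. The column list and directory are illustrative assumptions, and the exact DDL generated by the LKM can be reviewed later in the Operator.

-- Illustrative ORACLE_HIVE external table exposing a Hive table to Oracle.
CREATE TABLE movieapp_log_ext (
  custid   NUMBER,
  movieid  NUMBER,
  activity NUMBER,
  sales    NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=default.movieapp_log_odistage
  )
)
REJECT LIMIT UNLIMITED;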

Execute the Big Data SQL mapping

1. Press the Run button on the taskbar above the mapping editor. When asked to save your changes, press Yes.

2. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will
appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case the
generated session is displayed in a dialog window and no execution is shown in the operator.

3. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

4. Once the load is complete, the operator will show all tasks of the session as successful.

5. Double-click on task 40 - Create external table to review the generated code for Big Data SQL in a task editor. Switch to the Code tab. You can see the CREATE TABLE expression that defines the external table over the Hive data.

6. Go to Designer navigator and Models and right-click OracleMovie.ODI_COUNTRY_SALES. Select View Data from the menu to see the loaded rows.

7. A Data editor appears with all rows of the ODI_COUNTRY_SALES table in Oracle.

This demo shows how to execute a complex Pig mapping using ODI, with the ability to use user-defined functions and table functions. The mapping uses the Pig function Sessionize from the Apache DataFu library, which is included in this Hadoop distribution.

The objective of this mapping is to order the activities in the Hive table movieapp_log_odistage into separate sessions for different users and to calculate statistics for minimum, maximum, and average session duration based on geography.

Load, Sessionize, and Analyze User Activities using Pig in ODI

Open and review the mapping

1. Open ODI Studio and open mapping Big Data Demo > Demo > Mappings > G - Sessionize Data (Pig).

2. The mapping logical view shows the transformations used for the mapping:

◾ Hive source table movieapp_log_odistage provides the customer activity data, while Hive table cust contains information about each customer.
◾ SESSIONIZE is a table function component that contains custom Pig code calling the DataFu Sessionize UDF. This functionality combines activities into sessions based on the time window they happen in. A sketch of the sessionization idea is shown after this step list.
◾ AGGREGATE is used to aggregate session statistics based on location (province, country).
◾ EXPRESSION is used to normalize and convert the session statistics.
◾ SORT orders the aggregate records based on average session length.
◾ The result is stored into the target Hive table session_stats.

3. Switch to the Physical View of the mapping. Since the transformation is using Pig and the sources and targets are defined using Hive Catalog metadata, source, staging (Pig), and
target datastore are shown in separate execution units.

4. Highlight the physical component SESSIONIZE and go to the Properties window. Select the Extract Options tab and double-click on the Value column of the PIG_SCRIPT_CONTENT option.

5. A dialog is shown with the custom Pig code used to implement this table function. The function Sessionize from DataFu is at the core of this implementation.
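
Although the mapping itself uses the DataFu Sessionize UDF through Pig, the underlying idea can be sketched in HiveQL with window functions: start a new session whenever the gap to a user's previous activity exceeds a timeout. The column names below (custid, activity_time) and the 30-minute timeout are illustrative assumptions, not the demo's actual code.

-- Illustrative sessionization with window functions (not the demo's Pig code).
SELECT custid, activity_time,
       SUM(new_session) OVER (PARTITION BY custid ORDER BY activity_time
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM (
  SELECT custid, activity_time,
         CASE WHEN LAG(activity_time) OVER (PARTITION BY custid ORDER BY activity_time) IS NULL
                OR unix_timestamp(activity_time)
                   - unix_timestamp(LAG(activity_time) OVER (PARTITION BY custid
                                                             ORDER BY activity_time)) > 1800
              THEN 1 ELSE 0 END AS new_session     -- 1 marks the start of a new session
  FROM movieapp_log_odistage
) marked;
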
Execute the Pig mapping

1. Prior to running the Pig mapping it might be necessary to truncate the target table. Go to the Procedures node in the Projects tree and run the procedure Truncate session_stats by
right-clicking on it and selecting Run.... Use default settings on the Run dialog and select OK for all following dialogs.

Note: You will see in Part 6 how to automate this step in a package.

2. Go back to the Mapping editor and press the Run button on the taskbar above the mapping editor.

3. Click OK for the run dialog. We will use all defaults and run this mapping on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will
appear, press OK there as well.

Note: The execution can take several minutes depending on the environment. In order to view the generated code only, you can check the checkbox Simulation. In this case the
generated session is displayed in a dialog window and no execution is shown in the operator.

4. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.

5. Once the load is complete, the operator will show all tasks of the session as successful. You can double-click on the task 30 - Load JOIN_AP to review the generated Pig code in a
task editor.

6. You can review the result data in the datastore session_stats.


Review Package

1. Open ODI Studio. See Part 1 for information how to start and log into ODI Studio.

2. Select the left-hand Designer navigator, open the Projects accordion and navigate to package Big Data Demo > Demo > Packages > Big Data Load. Double-click to open the
package.

3. Review the package. All mappings, procedures, and tool calls are connected in a workflow to execute in one operation.

4. Select the Copy To HDFS step and review the properties. Please note that the parameter Source Logical Schema is empty and Target Logical Schema is set to HadoopLocal.
This defines the source as a local filesystem path and target as HDFS path.

5. Select the C - Calc Ratings (Hive - Pig - Spark) step and review the properties. Please notice that the Physical Mapping Design is set to Hive, you could change it to the other
options Pig or Spark to execute in another engine.

6. Press the Run button on the taskbar above the package editor. When asked to save your changes, press Yes.
7. Click OK for the run dialog. We will use all defaults and run this package on the local agent that is embedded in the ODI Studio UI. After a moment a Session started dialog will appear, press OK there as well.

Note: The execution can take several minutes depending on the environment.

8. To review execution go to the Operator navigator and expand the All Executions node to see the current execution. If the execution has not finished, it will show the Run icon for an ongoing task. You can refresh the view by pressing the blue Refresh icons to refresh once or to refresh automatically every 5 seconds.
Once the load is complete, the operator will show all tasks of the session as successful.

In this tutorial, you have learned how to:

◦ Ingest Data Using Sqoop and Oracle GoldenGate


◦ Transform Data Using Hive, Pig, and Spark
◦ Load Data to Oracle DB using Oracle Loader for Hadoop
◦ Access Hadoop Data from Oracle using Big Data SQL
◦ Sessionize Data using Pig
◦ Use ODI Packages to orchestrate multiple jobs

Resources

Go to the Oracle Technology Network for more information on Oracle Data Integration, Oracle Big Data SQL, Oracle Data Warehousing, and Oracle Analytical SQL.

Credits

◦ Author: Alex Kotopoulis
