• Email : mark.rittman@rittmanmead.com
• Twitter : @markrittman
[Diagram: customer data sources - CRM data, transactions, chat logs, call center logs, iBeacon logs, website logs, social feeds, demographics and voice + chat transcripts - arriving as real-time feeds, batch and API, and analysed with business analytics, SQL-on-Hadoop and predictive models]
• Offload archive data into Hadoop but federate it with DW data in user queries
[Diagram: Oracle Information Management reference architecture on the Hadoop platform - operational data sources (COTS data, master & reference data, streaming & BAM) feed a Data Factory and Data Reservoir through file-based, ETL-based and stream-based integration; data lands raw in its original format (usually files, e.g. SS7, ASN.1, JSON), is mapped and transformed into data streams and modelled business data, and surfaces as data sets, samples, models and programs for machine learning; Discovery & Development Labs provide a safe and secure environment for discovery and development; data transfer, data access, query federation, virtualization and ad-hoc access serve customer data intelligence tools and marketing / sales applications; the Foundation Data Layer holds immutable modelled data in business-process-neutral form, abstracted from business process changes]
• Start with Oracle Big Data Appliance Starter Rack - expand up to 18 nodes per rack
‣Map : select the columns and values you’re interested in, pass through as key/value pairs
‣Reduce : aggregate the mapped key/value pairs in the Reducer step
‣Run the job on the node where the data is
• MapReduce jobs are typically written in Java, but Hive can make this simpler
• Data integration tools such as Oracle Data Integrator can load and process Hadoop data
• BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
• Most Oracle DBAs and developers know about Hadoop, but assume…
‣Hadoop is just for batch (because of the MapReduce JVM spin-up issue)
‣Hadoop is just for large datasets, not ad-hoc work or micro batches
‣Hadoop will always be slow because it stages everything to disk
‣All Hadoop can do is Map (select, filter) and Reduce (aggregate)
‣Hadoop == MapReduce
Hadoop 1.0 and MapReduce
• MapReduce’s great innovation was to break processing down into distributed jobs
• Jobs that have no functional dependency on each other, only on upstream tasks
‣All MapReduce code had to do was provide the “map” and “reduce” functions (see the sketch after this list)
• A typical Hive or Pig script compiles down into multiple MapReduce jobs
• Safe, but slow - write to disk, spin-up separate JVMs for each job
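To make that map/reduce contract concrete, here is a minimal word-count job written in Scala against the Hadoop MapReduce API - a sketch of the classic example (normally shown in Java), with class names and paths of our own choosing rather than anything from this deck:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// "map": emit each word as a key/value pair of (word, 1)
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// "reduce": aggregate the values emitted for each key
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}

// Driver: everything else - scheduling, shuffle, retries - is left to Hadoop
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    job.waitForCompletion(true)
  }
}

The Pig script below is the higher-level alternative the previous bullet refers to: the same style of log processing expressed in Pig Latin, which compiles down into several chained MapReduce jobs.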
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
GENERATE FLATTEN
 (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\]
 "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
) AS
 (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
  request: chararray, status: chararray, bytes_string: chararray, referrer: chararray,
  browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp).*');
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day,
 FLATTEN(STRSPLIT(request,' ',5))
 AS (method:chararray, request_page:chararray, protocol:chararray), remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) ==
'/wp' or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/'
or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group)
as request_page, COUNT(logs_base_page_cleaned) as hits;
…
store pages_and_post_top_10 into 'top_10s/pages';

JobId  Maps  Reduces  Alias  Feature  Outputs
job_1417127396023_0145  12  2  logs_base,logs_base_nobots,logs_base_page,logs_base_page_cleaned,
  logs_base_page_cleaned_by_page,page_count,raw_logs  GROUP_BY,COMBINER
job_1417127396023_0146  2  1  pages_and_post_details,pages_and_posts_trim,posts,posts_cleaned  HASH_JOIN
job_1417127396023_0147  1  1  pages_and_posts_sorted  SAMPLER
job_1417127396023_0148  1  1  pages_and_posts_sorted  ORDER_BY,COMBINER
job_1417127396023_0149  1  1  pages_and_posts_sorted
• Introduced with CDH5+
[Diagram: YARN - clients submit applications to the Resource Manager, which allocates work to Node Managers across the cluster]
• Apache Tez runs on top of YARN, and provides a faster execution engine than MapReduce for Hive, Pig etc
• Models processing as an entire data flow graph (DAG), rather than separate job steps
‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems
‣Dataflow steps pass data between them as streams, rather than writing/reading from disk
• Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez
[Diagram (Hortonworks): the same workload as chained MapReduce jobs - each Map() / Reduce() stage writing intermediate results between Input Data and Output Data - versus a single Tez DAG of Map() and Reduce() vertices]
set hive.execution.engine=mr  : 4m 17s
set hive.execution.engine=tez : 2m 25s
• Apache Spark is more mature than Tez, with a richer API and more vendor support
‣Spark SQL
‣Spark Streaming
• Use of closures, iterations, and other common language constructs to minimize code
• Functional programming
• Unified API for batch and streaming (see the streaming sketch after this example)

scala> val logfile = sc.textFile("logs/access_log").cache
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
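As a rough illustration of that unified batch/streaming API (our own sketch, not from the deck - the input directory and batch interval are made up), the same filter can be applied to a live stream of log lines with Spark Streaming:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Re-use the shell's SparkContext, micro-batching every 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))

// Watch an HDFS directory for newly-arrived log files (illustrative path)
val logStream = ssc.textFileStream("hdfs:///user/mrittman/incoming_logs")

// Same logic as the batch example above, applied per micro-batch
val biapps11gHits = logStream.filter(line => line.contains("/biapps11g/"))
biapps11gHits.count().print()

ssc.start()
ssc.awaitTermination()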
‣Batch loading then added for initial data load into the system
[Diagram: real-time feeds delivering raw data to Hadoop nodes]
• Apache Flume is the standard way to transport log files from source through to target
‣Initial use-case was webserver log files, but can transport any file from A>B
‣Does not do data transformation, but can send to multiple targets / target types
• Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
• Similar SQL dialect to Hive - not as rich though, and no support for Hive SerDes, storage handlers etc
[Diagram: Impala daemons running on each Hadoop node, reading directly from HDFS etc]
• Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list
Simple two-table join against Hive data only - Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds)
• Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata
• Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
• Big Data SQL accesses Hive tables through external table mechanism
• Access parameters cluster and tablename specify Hive table source and BDA cluster
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
‣Allows users to analyze data without any ETL or up-front schema definitions, vs formal modelling in Hive etc
‣Improved agility and flexibility

0: jdbc:drill:zk=local> select state, city, count(*) totalreviews
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by state, city order by count(*) desc limit 10;
+------------+------------+--------------+
| state | city | totalreviews |
+------------+------------+--------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+--------------+
• Has the advantage of making use of all existing Hive scripts, infrastructure
• Spark SQL and DataFrames allow RDDs in Spark to be processed using SQL queries
• Load, read and save data in Hive, Parquet and other structured tabular formats
val accessLogsFilteredDF = accessLogs
  .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
  .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*"))
  .toDF()
accessLogsFilteredDF.registerTempTable("accessLogsFiltered")

// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
  "parquet", SaveMode.Overwrite)
• Beginners usually store data in HDFS using text file formats (CSV), but these have limitations
• Column-oriented formats such as Parquet only return (project) the columns you require across a wide table (see the sketch after this list)
• But Parquet (and HDFS) have significant limitations for real-time analytics applications
‣Real-time analytics-optimised
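A small sketch of that column projection idea, assuming the Spark 1.4+ DataFrame API and re-using the accessLogsFilteredDF DataFrame from the earlier example (the output path and selected columns are ours):

// sqlContext is the SQLContext already available in the Spark shell

// Write the filtered access logs as Parquet - columnar, compressed storage on HDFS
accessLogsFilteredDF.write.parquet("/user/mrittman/rm_logs_parquet")

// On read, only the referenced columns are scanned, not every field of every row
val pages = sqlContext.read.parquet("/user/mrittman/rm_logs_parquet")
  .select("endpoint", "agent")
pages.show(10)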
• Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled
[Diagram: a redacted subset of the customer data exposed to an Oracle database over SQL and JSON]
• Provides a higher-level, logical abstraction for data (i.e. tables or views)
‣Can be used with Spark & Spark SQL, with predicate pushdown and projection
• Returns schema-aware objects (instead of paths and bytes), in a similar way to HCatalog
‣Binary classification (a fuller sketch follows below)
‣Regression
‣Clustering

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
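For context, here is a minimal sketch of how those metrics lines might sit inside a complete MLlib binary-classification job (Spark 1.x RDD API; the training-data path and choice of logistic regression are our own illustration, not from the deck):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load labelled examples in LIBSVM format, split into train / test sets (path is illustrative)
val data = MLUtils.loadLibSVMFile(sc, "/user/mrittman/training_data.libsvm")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

// Train a simple logistic regression classifier (100 iterations)
val model = LogisticRegressionWithSGD.train(training, 100)
model.clearThreshold()   // return raw scores rather than 0/1 predictions

// Score the held-out data, then evaluate
val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println(s"Area under ROC = $auROC")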
• Automatically profile, parse and classify incoming datasets using Spark MLlib Word2Vec (see the sketch below)
• Spot and obfuscate sensitive data, and automatically suggest column names
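A very rough sketch of the Word2Vec piece of that idea (the tokenisation, input path and example query are ours; the surrounding profiling and obfuscation logic is not shown):

import org.apache.spark.mllib.feature.Word2Vec

// Treat each incoming record as a "sentence" of field values (illustrative path and format)
val records = sc.textFile("/user/mrittman/incoming/customers.csv")
  .map(line => line.toLowerCase.split(",").toSeq)

// Train a Word2Vec model over the field values
val model = new Word2Vec().fit(records)

// Values that appear in similar contexts get similar vectors - e.g. other city names
// cluster near a known city, which helps classify what a column contains
val similar = model.findSynonyms("london", 5)
similar.foreach { case (value, score) => println(s"$value  $score") }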
• Hadoop is evolving
‣Oracle Big Data SQL can access Hadoop data loaded in real-time