
Big Data for Oracle Developers & DBAs -
Towards Spark, Real-Time and Predictive Analytics


Mark Rittman, CTO, Rittman Mead
Riga Dev Day 2016, Riga, March 2016



About the Speaker

• Mark Rittman, Co-Founder of Rittman Mead

‣Oracle ACE Director, specialising in Oracle BI&DW

‣14 years' experience with Oracle technology

‣Regular columnist for Oracle Magazine

• Author of two Oracle Press Oracle BI books

‣Oracle Business Intelligence Developers Guide

‣Oracle Exalytics Revealed

‣Writer for the Rittman Mead blog : http://www.rittmanmead.com/blog

• Email : mark.rittman@rittmanmead.com

• Twitter : @markrittman



Hadoop is the Big Hot Topic In IT / Analytics

• Everyone’s talking about Hadoop and “Big Data”



Why is Hadoop of Interest to Us?

• Gives us an ability to store more data, at more detail, for longer

• Provides a cost-effective way to analyse vast amounts of data

• Hadoop & NoSQL technologies can give us “schema-on-read” capabilities

• There’s vast amounts of innovation in this area we can harness

• And it’s very complementary to Oracle BI & DW



Flexible Cheap Storage for Logs, Feeds + Social Data

[Diagram: raw data from CRM, transactions, chat logs, call center logs, iBeacon logs, website logs, social feeds, demographics and voice + chat transcripts lands on Hadoop nodes via real-time feeds, batch loads and APIs, feeding Customer 360 apps, business analytics, SQL-on-Hadoop and predictive models]



Deploy Alongside Traditional DW as “Data Reservoir”

• Extend the DW with new data sources, datatypes, detail-level data

• Offload archive data into Hadoop but federate it with DW data in user queries

• Use Hadoop, Hive and MapReduce for low-cost ETL staging

[Diagram: Hadoop platform "data reservoir" deployed alongside the DW - raw operational data (segments, transactions, customer data, voice + chat transcripts) arrives via file-based, ETL-based and data-stream-based integration and is stored in its original format (usually files, such as SS7, ASN.1, JSON etc.); a "data factory" maps and transforms that raw data into business models and a customer master; curated customer data then feeds customer intelligence tools, machine learning and marketing / sales applications, with a safe and secure Discovery & Development Labs environment for data sets, samples, models and programs]



Incorporate Hadoop Data Reservoirs into DW Design
[Diagram: Oracle Information Management reference architecture -
• Data sources: structured sources (operational data, COTS data, master & reference data, streaming & BAM) and poly-structured sources (content, docs, web & social media, SMS)
• Data ingestion into a Raw Data Reservoir: immutable raw data at rest, not yet interpreted
• Foundation Data Layer: immutable modelled data in a business-process-neutral form, abstracted from business process changes
• Access & Performance Layer: past, current and future interpretations of enterprise data, structured to support agile access & navigation, with pre-built and ad-hoc BI assets, query federation and virtualization
• Information services and information interpretation feeding Enterprise Performance Management
• Discovery Lab Sandboxes: project-based data stores to support specific discovery objectives (data science)
• Rapid Development Sandboxes: project-based data stores to facilitate rapid content / presentation delivery]



Deployed on Oracle Big Data Appliance

• Oracle Engineered system for big data processing and analysis

• Start with Oracle Big Data Appliance Starter Rack - expand up to 18 nodes per rack

• Cluster racks together for horizontal scale-out using enterprise-quality infrastructure

Oracle Big Data Appliance Starter Rack + Expansion
• Cloudera CDH + Oracle software
• 18 high-spec Hadoop nodes, with InfiniBand switches for internal Hadoop traffic, optimised for network throughput
• 1 Cisco management switch
• Single place for support for H/W + S/W




Hadoop Tenets : Simplified Distributed Processing

• Hadoop, through MapReduce, breaks processing down into simple stages

‣Map : select the columns and values you're interested in, pass through as key/value pairs

‣Reduce : aggregate the results

• Most ETL jobs can be broken down into filtering, projecting and aggregating (see the HiveQL sketch below)

• Hadoop then automatically runs the job on the cluster

‣Shared-nothing small chunks of work

‣Run the job on the node where the data is

‣Handle faults etc

‣Gather the results back in

[Diagram: Mappers (filter, project) feed Reducers (aggregate); the output is one HDFS file per reducer, written into a directory]
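
As an illustration of that filter-project-aggregate pattern, here is a minimal sketch in HiveQL (Hive is covered on the following slides; the apache_access_log table and its columns are hypothetical, defined later in the Hive example). Hive compiles a query like this into a single MapReduce job: the WHERE filter and the column projection run in the mappers, and the GROUP BY aggregation runs in the reducers.

-- Map phase: filter (status = '200') and project the columns we need
-- Reduce phase: aggregate the count per request_page
SELECT request_page, COUNT(*) AS hits
FROM apache_access_log
WHERE status = '200'
GROUP BY request_page;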



Hive as the Hadoop SQL Access Layer

• MapReduce jobs are typically written in Java, but Hive can make this simpler

• Hive is a query environment over Hadoop/MapReduce to support SQL-like queries

• Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables (see the sketch below)

• Approach used by ODI and OBIEE to gain access to Hadoop data

• Allows Hadoop data to be accessed just like any other data source (sort of...)
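
A minimal HiveQL sketch of that idea, using hypothetical table and path names: files already sitting on HDFS are declared as an external table, and an ordinary-looking SQL query against it is compiled into MapReduce behind the scenes.

-- expose tab-delimited log files already on HDFS as a Hive table
CREATE EXTERNAL TABLE apache_access_log (
  host         STRING,
  request_date STRING,
  request_page STRING,
  status       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/flume/incoming/apache_logs';

-- queried like any other table; Hive generates the MapReduce job
SELECT status, COUNT(*) AS hits
FROM apache_access_log
GROUP BY status;

The same table is then visible to anything speaking HiveODBC or HiveJDBC, which is exactly how ODI and OBIEE get at it.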



Hive Provides a SQL Interface for BI + ETL Tools

• Data integration tools such as Oracle Data Integrator can load and process Hadoop data

• BI tools such as Oracle Business Intelligence 12c can report on Hadoop data

• Generally use MapReduce and Hive to access data

‣ODBC and JDBC access to Hive tabular data


‣Allows Hadoop unstructured/semi-structured data on HDFS to be accessed like an RDBMS

[Diagram: access Hive directly, or extract using ODI12c for structured OBIEE dashboard analysis - answering questions such as "What pages are people visiting?", "Who is referring to us on Twitter?", "What content has the most reach?"]



Common Developer Understanding of Hadoop Today

• Most Oracle DBAs and developers know about Hadoop, but assume…
‣Hadoop is just for batch (because of the MapReduce JVM spin-up issue)
‣Hadoop is just for large datasets, not ad-hoc work or micro batches
‣Hadoop will always be slow because it stages everything to disk
‣All Hadoop can do is Map (select, filter) and Reduce (aggregate)
‣Hadoop == MapReduce



but … Hadoop is slow
and only for batch jobs
…isn’t it?
Hadoop is Now Real-Time, In-Memory and Analytics-Optimised

Hadoop 1.0 and MapReduce

• MapReduce’s great innovation was to break processing down into distributed jobs

• Jobs that have no functional dependency on each other, only upstream tasks

• Provides a framework that is infinitely scalable and very fault tolerant

• Hadoop handled job scheduling and resource management

‣All MapReduce code had to do was provide the “map” and “reduce” functions

‣Automatic distributed processing

‣Slow but extremely powerful



Compiling Hive/Pig Scripts into MapReduce

• A typical Hive or Pig script compiles down into multiple MapReduce jobs

• Each job stages its intermediate results to disk

• Safe, but slow - write to disk, spin-up separate JVMs for each job
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
  GENERATE FLATTEN
    (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
    ) AS
    (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
     request: chararray, status: chararray, bytes_string: chararray, referrer: chararray, browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp).*');
-- split the HTTP request field into method, page and protocol
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day,
  FLATTEN(STRSPLIT(request,' ',3)) AS (method:chararray, request_page:chararray, protocol:chararray),
  remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) == '/wp'
  or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/'
  or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group) as request_page,
  COUNT(logs_base_page_cleaned) as hits;
-- (joins against post details and the ordering/limit steps are elided from this excerpt)
store pages_and_post_top_10 into 'top_10s/pages';

JobId                   Maps  Reduces  Alias                                                               Feature
job_1417127396023_0145  12    2        logs_base,logs_base_nobots,logs_base_page,logs_base_page_cleaned,
                                       logs_base_page_cleaned_by_page,page_count,raw_logs                  GROUP_BY,COMBINER
job_1417127396023_0146  2     1        pages_and_post_details,pages_and_posts_trim,posts,posts_cleaned     HASH_JOIN
job_1417127396023_0147  1     1        pages_and_posts_sorted                                              SAMPLER
job_1417127396023_0148  1     1        pages_and_posts_sorted                                              ORDER_BY,COMBINER
job_1417127396023_0149  1     1        pages_and_posts_sorted



MapReduce 2 and YARN

• MapReduce 2 (MR2) splits the functionality of the JobTracker by separating resource management from job scheduling/monitoring

• Introduces YARN (Yet Another Resource Negotiator)

• Permits processing frameworks other than MapReduce to run on the cluster

‣For example, Apache Spark

• Maintains backwards compatibility with MR1

• Introduced with CDH5+

[Diagram: clients submit applications to the YARN Resource Manager, which schedules and monitors work across the Node Managers on the cluster nodes]



Apache Tez

• Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc

• Models processing as an entire data flow graph (DAG), rather than separate job steps

‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems

‣Dataflow steps pass data between them as streams, rather than writing/reading from disk

• Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez

• Favoured in-memory / Hive v2 route by Hortonworks

[Diagram: a Tez DAG - Map() tasks read the input data and feed chains of Reduce() tasks directly, streaming intermediate results between steps until the output data is produced]



Tez Advantage - Drop-In Replacement for MR with Hive, Pig

set hive.execution.engine=mr

4m 17s

set hive.execution.engine=tez

2m 25s



Apache Spark

• Another DAG execution engine running on YARN

• More mature than Tez, with a richer API and more vendor support

• Uses concept of an RDD (Resilient Distributed Dataset)

‣RDDs like tables or Pig relations, but can be cached in-memory

‣Great for in-memory transformations, or iterative/cyclic processes

• Spark jobs comprise a DAG of tasks operating on RDDs

• Access through Scala, Python or Java APIs

• Related projects include

‣Spark SQL

‣Spark Streaming



Rich Developer Support + Wide Developer Ecosystem

• Native support for multiple languages with identical APIs

‣Python - prototyping, data wrangling

‣Scala - functional programming features

‣Java - lower-level, application integration

• Use of closures, iterations, and other common language constructs to minimize code

• Integrated support for distributed + functional programming

• Unified API for batch and streaming

scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15

scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
res7: Long = 154563

scala> val logfile = sc.textFile("logs/access_log").cache
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403



Accompanied by Innovations in Underlying Platform

• In-Memory Distributed Storage, to accompany In-Memory Distributed Processing
• Cluster Resource Management, to support multi-tenant distributed services



Combining Real-Time Processing with Real-Time Loading

• Most Oracle DWs process data in batches (or at best, micro-batches)

• Tools like ODI typically work in this way, often linking up with database CDC

• Hadoop systems are usually real-time, from the start

‣In the past, via Hadoop streaming, Flume etc

‣Batch loading then added for initial data loads into the system

[Diagram: real-time feeds from chat logs, call center logs, iBeacon logs, website logs and voice + chat transcripts landing as raw data on Hadoop nodes]



Apache Flume : Distributed Transport for Log Activity

• Apache Flume is the standard way to transport log files from source through to target

‣Initial use-case was webserver log files, but can transport any file from A to B

‣Does not do data transformation, but can send to multiple targets / target types

‣Mechanisms and checks to ensure successful transport of entries

• Has a concept of “agents”, “sinks” and “channels”

‣Agents collect and forward log data

‣Sinks store it in final destination

‣Channels store log data en-route

• Simple configuration through INI files (see the example agent configuration below)

‣Handled outside of ODI12c
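
A minimal single-agent sketch of that configuration style; the agent, source, channel and sink names, the log file path and the HDFS directory are illustrative assumptions:

# tail a webserver log, buffer it in a file channel, write it to HDFS
agent1.sources  = weblog-source
agent1.channels = file-channel
agent1.sinks    = hdfs-sink

agent1.sources.weblog-source.type     = exec
agent1.sources.weblog-source.command  = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = file-channel

agent1.channels.file-channel.type = file

agent1.sinks.hdfs-sink.type      = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /user/flume/incoming/apache_logs
agent1.sinks.hdfs-sink.channel   = file-channel

The agent is then started with something like flume-ng agent -n agent1 -f <config file>, and Flume itself handles batching, retries and file rolling on HDFS.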



GoldenGate for Continuous Streaming to Hadoop

• Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop

• Leverages GoldenGate & HDFS / Hive Java APIs

• Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive)

• Likely to be a formal part of GoldenGate in a future release - but usable now

• Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1



Apache Kafka : Reliable, Message-Based

• Developed by LinkedIn, designed to address Flume issues around reliability, throughput

‣(though many of those issues have been addressed since)

• Designed for persistent messages as the common use case

‣Website messages, events etc vs. log file entries

• Consumer (pull) rather than Producer (push) model

• Supports multiple consumers per message queue

• More complex to set up than Flume, and can use Flume as a consumer of messages

‣But gaining popularity, especially alongside Spark Streaming



Adding Real-Time Processing to Loading : Spark Streaming

• Add mid-stream processing to the ingestion process (see the sketch below)

• Sessionization, classification, more complex transformation and reference data lookup

• Access to machine learning algorithms using MLlib

‣Example implementation at: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
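
A rough Scala sketch of what that mid-stream processing can look like; the hostname, port, batch interval and log field positions are illustrative assumptions, not from the original example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// count page hits per 60-second micro-batch from an incoming stream of log lines
val conf = new SparkConf().setAppName("LogStreamHitCount")
val ssc  = new StreamingContext(conf, Seconds(60))

val lines    = ssc.socketTextStream("loghost", 9999)
val pageHits = lines
  .map(_.split(" "))
  .filter(_.length > 6)              // drop malformed lines
  .map(fields => (fields(6), 1))     // request URL position in combined log format
  .reduceByKey(_ + _)                // aggregate hits per page within the batch

pageHits.print()                     // or write each batch out to HDFS / a database
ssc.start()
ssc.awaitTermination()

The same DStream could equally be fed from Flume or Kafka receivers, and enriched with reference-data lookups or MLlib model scoring before being written out.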



but … Hadoop development
is only for Java programmers
…isn’t it?
SQL Increasingly Used in Hadoop for Data Access

Cloudera Impala - Fast, MPP-style Access to Hadoop Data

• Cloudera’s answer to Hive query response time issues

• MPP SQL query engine running on Hadoop that bypasses MapReduce for direct data access

• Mostly in-memory, but spills to disk if required

• Uses Hive metastore to access Hive table metadata

• Similar SQL dialect to Hive - not as rich though and no support for
Hive SerDes, storage handlers etc



How Impala Works

• A replacement for Hive, but uses Hive concepts and data dictionary (metastore)

• MPP (Massively Parallel Processing) query engine that runs within Hadoop

‣Uses same file formats, security and resource management as Hadoop

• Processes queries in-memory

• Accesses standard HDFS file data

• Option to use Apache Avro, RCFile, LZO or Parquet (column-store)

• Designed for interactive, real-time SQL-like access to Hadoop

[Diagram: the BI Server / Presentation Server connects through the Cloudera Impala ODBC driver to Impala daemons running alongside HDFS on each node of a multi-node Hadoop cluster]



Enabling Hive Tables for Impala

• Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list

• Run SHOW TABLES Impala SQL command to view tables available

• Run COUNT(*) on main ACCESS_PER_POST table to see typical response time


[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
[bigdatalite.localdomain:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 2.18s
[bigdatalite.localdomain:21000] > show tables;
Query: show tables
+-----------------------------------+
| name                              |
+-----------------------------------+
| access_per_post                   |
| access_per_post_cat_author        |
| …                                 |
| posts                             |
+-----------------------------------+
Fetched 45 row(s) in 0.15s

[bigdatalite.localdomain:21000] > select count(*) from access_per_post;
Query: select count(*) from access_per_post
+----------+
| count(*) |
+----------+
| 343      |
+----------+
Fetched 1 row(s) in 2.76s



Significantly-Improved Ad-Hoc Query Response Time vs Hive

• Significant improvement over Hive response time

• Now makes Hadoop suitable for ad-hoc querying

Simple Two-Table Join against Impala Data Only


Logical Query Summary Stats: Elapsed time 2, Response time 1, Compilation time 0 (seconds)

vs
Simple Two-Table Join against Hive Data Only
Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds)



Oracle Big Data SQL

• Part of Oracle Big Data 4.0 (BDA-only)

‣Also requires Oracle Database 12c, Oracle Exadata Database Machine

• Extends Oracle Data Dictionary to cover Hive

• Extends Oracle SQL and SmartScan to Hadoop

• Extends Oracle Security Model over Hadoop

‣Fine-grained access control

‣Data redaction, data masking

‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)

‣Maps Hadoop parallelism to Oracle PQ

‣Big Data SQL engine works on top of YARN, like Spark, Tez, MR2

[Diagram: SQL queries are SmartScanned both on the Exadata Database Server / storage servers and, via Oracle Big Data SQL, on the Hadoop cluster]



View Hive Table Metadata in the Oracle Data Dictionary

• Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata

‣Linked by Exadata configuration steps to one or more BDA clusters

• DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata

• Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;

DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log



Hive Access through Oracle External Tables + Hive Driver

• Big Data SQL accesses Hive tables through external table mechanism

‣ORACLE_HIVE external table type imports Hive metastore metadata

‣ORACLE_HDFS requires metadata to be specified

• Access parameters cluster and tablename specify Hive table source and BDA cluster
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));



Extending SmartScan, and Oracle SQL, Across All Data

• Brings query-offloading features of Exadata to Oracle Big Data Appliance

• Query across both Oracle and Hadoop sources - for example, the query below

• Intelligent query optimisation applies SmartScan close to ALL data

• Use the same SQL dialect across both sources

• Apply the same security rules, policies and user access rights across both sources
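
For example, a single query can join the Hive-backed external table defined on the previous slide to an ordinary Oracle table; the customers table and the ip_integer join key used here are hypothetical:

SELECT c.cust_segment, COUNT(*) AS page_views
FROM   customers c
JOIN   access_per_post_categories a
ON     c.ip_integer = a.ip_integer
GROUP  BY c.cust_segment
ORDER  BY page_views DESC;

SmartScan-style filtering happens on the Hadoop side before rows are returned to the database, so only the needed data crosses over.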



Apache Drill

• SQL query engine that doesn’t require a formal (HCatalog) schema

• Infers the schema from the semi-structured dataset (JSON etc)

‣Allows users to analyze data without any ETL or up-front schema definitions.

‣Data can be in any file format such as text, JSON, or Parquet

‣Improved agility and flexibility vs formal modelling in Hive etc

0: jdbc:drill:zk=local> select state, city, count(*) totalreviews
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by state, city order by count(*) desc limit 10;

+------------+------------+--------------+
| state | city | totalreviews |
+------------+------------+--------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+--------------+



Hive-on-Spark (and Pig-on-Spark)

• Addition of Spark as a back-end execution engine for Hive (and Pig), enabled per-session as shown below

• Has the advantage of making use of all existing Hive scripts and infrastructure

• But … probably is even more of a dead-end than Tez

‣Is still faster than Hive on MR

‣But Hive with column/in-memory optimized storage is now typically CPU-bound

‣Spark consumes more CPU, disk & network IO than Tez

‣Additional translation overhead from RDDs to Hive's "Row Containers"
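
As with Tez earlier, switching an existing Hive session over is a single setting, assuming Hive on Spark is installed and configured on the cluster:

set hive.execution.engine=spark;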



Spark SQL - Adding SQL Processing to Apache Spark

• Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries

• Bring in and federate additional data from JDBC sources

• Load, read and save data in Hive, Parquet and other structured tabular formats
val accessLogsFilteredDF = accessLogs
  .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
  .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
  .registerTempTable("accessLogsFiltered")

val topTenPostsLast24Hour = sqlContext.sql(
  "SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total " +
  "FROM accessLogsFiltered a " +
  "JOIN posts p ON a.endpoint = p.POST_SLUG " +
  "GROUP BY p.POST_TITLE, p.POST_AUTHOR " +
  "ORDER BY total DESC LIMIT 10")

// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
  "parquet", SaveMode.Overwrite)



Choosing the Appropriate SQL Engine to Add to Hadoop



Apache Parquet - Column-Orientated Storage for Analytics

• Beginners usually store data in HDFS using text file formats (CSV) but these have limitations

• Apache AVRO often used for general-purpose processing

‣Splitability, schema evolution, in-built metadata, support for block compression

• Parquet now commonly used with Impala due to column-orientated storage (see the example below)

‣Mirrors work in RDBMS world around column-store

‣Only return (project) the columns you require across a wide table
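
A minimal sketch of that in practice: create a Parquet-backed copy of an existing Hive text-format table (the source table here is the access_per_post table seen earlier), which both Hive and Impala can then scan column-by-column:

CREATE TABLE access_per_post_parquet
STORED AS PARQUET
AS SELECT * FROM access_per_post;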



Cloudera Kudu - Combining Best of HBase and Column-Store

• But Parquet (and HDFS) have significant limitations for real-time analytics applications

‣Append-only orientation and focus on column-store make streaming ingestion harder

• Cloudera Kudu aims to combine the best of HDFS + HBase

‣Real-time analytics-optimised

‣Supports updates to data

‣Fast ingestion of data

‣Accessed using SQL-style tables and a get/put/update/delete API



but … Hadoop is insecure
and has fragmented security
…doesn’t it?
Consistent Security and Audit Now Emerging on Platform



Hadoop Security Initially Was a Mess

• Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled

• Data access controlled by POSIX-style permissions on HDFS files

• Hive and Impala can use Apache Sentry RBAC (example grants below)

‣Result is data duplication and complexity

‣No consistent API or abstracted security model


/user/mrittman/scratchpad
/user/ryeardley/scratchpad
/user/mpatel/scratchpad
/user/mrittman/scratchpad
/user/mrittman/scratchpad
/data/rm_website_analysis/logfiles/incoming
/data/rm_website_analysis/logfiles/archive
/data/rm_website_analysis/tweets/incoming
/data/rm_website_analysis/tweets/archive
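
A sketch of what those Sentry-style grants look like when issued through Hive or Impala; the role and group names are hypothetical:

CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
GRANT SELECT ON DATABASE default TO ROLE analyst_role;

HDFS file permissions still have to be managed separately underneath, which is where the duplication and complexity called out above come from.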



Oracle Big Data SQL : Extend Oracle Security to Hadoop

• Use standard Oracle Security over Hadoop & NoSQL

‣Grant & Revoke Privileges

‣Redact Data

‣Apply Virtual Private Database

‣Provides Fine-grained Access Control

• Great solution to extend the existing Oracle security model over Hadoop datasets

DBMS_REDACT.ADD_POLICY(
  object_schema => 'txadp_hive_01',
  object_name   => 'customer_address_ext',
  column_name   => 'ca_street_name',
  policy_name   => 'customer_address_redaction',
  function_type => DBMS_REDACT.RANDOM,
  expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'',''REDACTION_TESTER'')=''TRUE'''
);

[Diagram: SQL and JSON sources on Hadoop exposed as a redacted customer data subset in the Oracle DB]



Cloudera RecordService

• Provides a higher level, logical abstraction for data (ie Tables or Views)

‣Can be used with Spark & Spark SQL, with Predicate pushdown, projection

• Returns schemed objects (instead of paths and bytes) in a similar way to HCatalog

• Unified data access path allows platform-wide performance improvements

• Secure service that does not execute arbitrary user code

‣Central location for all authorization checks using Sentry metadata.



but … any predictive modelling
has to be done outside Hadoop, in R
…doesn’t it?
Spark MLLib : Adding Machine Learning Capabilities to Spark

• Part of Spark, extends the Scala, Java & Python APIs

• Integrated workflow including ML pipelines

• Currently supports the following algorithms:

‣Binary classification

‣Regression

‣Clustering

‣Collaborative filtering

‣Dimensionality reduction

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// model and test come from an earlier training step (e.g. SVMWithSGD.train on a training RDD)
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")



Example Usage : Oracle Big Data Preparation Cloud Service

• Data enrichment tool aimed at domain experts, not programmers

• Uses machine-learning to automate data classification + profiling steps

• Automatically highlight sensitive data, and offer to redact or obfuscate

• Dramatically reduce the time required to onboard new data sources

• Hosted in Oracle Cloud for zero-install

‣File upload and download from browser

‣Automate for production data loads

[Diagram: raw data - voice + chat transcripts and files stored in their original format (SS7, ASN.1, JSON etc.) - is mapped and transformed into curated, mapped data sets]



Identifying Schemas in Semi-/Unstructured Data



Use of Machine Learning to Identify Data Patterns

• Automatically profile, parse and classify incoming datasets using Spark MLLib Word2Vec
• Spot and obfuscate sensitive data, and automatically suggest column names



Summary

• Hadoop is evolving

‣Hadoop 2.0 breaks the dependency on MapReduce

‣Spark, Tez etc allow us to create execution plans that run in-memory, faster than before

‣New streaming models allow us to process data via sockets, micro-batches or continuously

• And Oracle developers can make use of these new capabilities

‣Oracle Big Data SQL can access Hadoop data loaded in real-time

‣OBIEE, particularly in 11.1.1.9, can access Impala

‣ODI is likely to support Hive on Tez and Hive on Spark shortly, and will have support for Spark in the future



Big Data for Oracle Devs -
Towards Spark, Real-Time and Predictive Analytics
Mark Rittman, CTO, Rittman Mead
Riga Dev Day 2016, Riga, March 2016

