• Email : mark.rittman@rittmanmead.com
• Twitter : @markrittman
[Diagram: customer data sources - CRM data, transactions, chat logs, call center logs, iBeacon logs, website logs, social feeds, demographics and voice + chat transcripts - arriving as real-time feeds, batch and API, and analysed with business analytics, SQL-on-Hadoop and predictive models]
• Offload archive data into Hadoop but federate it with DW data in user queries
[Diagram: Oracle Information Management reference architecture on the Hadoop platform - operational data sources (COTS data, master & reference data, streaming & BAM) feed a Data Factory and Data Reservoir through file-based, ETL-based and stream-based integration; data lands raw in its original format (usually files, e.g. SS7, ASN.1, JSON), is mapped and transformed into data streams and modelled business data, and surfaces as data sets, samples, models and programs for machine learning; Discovery & Development Labs provide a safe and secure environment for discovery and development; data transfer, data access, query federation, virtualization and ad-hoc access serve customer data intelligence tools and marketing / sales applications; the Foundation Data Layer holds immutable modelled data in business-process-neutral form, abstracted from business process changes]
• Start with Oracle Big Data Appliance Starter Rack - expand up to 18 nodes per rack
‣Map : select the columns and values you’re interested in, pass through as key/value pairs
‣Reduce : aggregate the mapped key/value pairs in the Reducer step
‣Run the job on the node where the data is
• MapReduce jobs are typically written in Java, but Hive can make this simpler
• Data integration tools such as Oracle Data Integrator can load and process Hadoop data
• BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
• Most Oracle DBAs and developers know about Hadoop, but assume…
‣Hadoop is just for batch (because of the MapReduce JVM spin-up issue)
‣Hadoop is just for large datasets, not ad-hoc work or micro batches
‣Hadoop will always be slow because it stages everything to disk
‣All Hadoop can do is Map (select, filter) and Reduce (aggregate)
‣Hadoop == MapReduce
Hadoop 1.0 and MapReduce
• MapReduce’s great innovation was to break processing down into distributed jobs
• Jobs that have no functional dependency on each other, only on upstream tasks
‣All MapReduce code had to do was provide the “map” and “reduce” functions (see the sketch after this list)
• A typical Hive or Pig script compiles down into multiple MapReduce jobs
• Safe, but slow - write to disk, spin-up separate JVMs for each job
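To make that map/reduce contract concrete, here is a minimal word-count job written in Scala against the Hadoop MapReduce API - a sketch of the classic example (normally shown in Java), with class names and paths of our own choosing rather than anything from this deck:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// "map": emit each word as a key/value pair of (word, 1)
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// "reduce": aggregate the values emitted for each key
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}

// Driver: everything else - scheduling, shuffle, retries - is left to Hadoop
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    job.waitForCompletion(true)
  }
}

The Pig script below is the higher-level alternative the previous bullet refers to: the same style of log processing expressed in Pig Latin, which compiles down into several chained MapReduce jobs.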
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
GENERATE FLATTEN
 (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\]
 "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
) AS
 (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
  request: chararray, status: chararray, bytes_string: chararray, referrer: chararray,
  browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp).*');
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day,
 FLATTEN(STRSPLIT(request,' ',5))
 AS (method:chararray, request_page:chararray, protocol:chararray), remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) ==
'/wp' or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/'
or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group)
as request_page, COUNT(logs_base_page_cleaned) as hits;
…
store pages_and_post_top_10 into 'top_10s/pages';

JobId  Maps  Reduces  Alias  Feature  Outputs
job_1417127396023_0145  12  2  logs_base,logs_base_nobots,logs_base_page,logs_base_page_cleaned,
  logs_base_page_cleaned_by_page,page_count,raw_logs  GROUP_BY,COMBINER
job_1417127396023_0146  2  1  pages_and_post_details,pages_and_posts_trim,posts,posts_cleaned  HASH_JOIN
job_1417127396023_0147  1  1  pages_and_posts_sorted  SAMPLER
job_1417127396023_0148  1  1  pages_and_posts_sorted  ORDER_BY,COMBINER
job_1417127396023_0149  1  1  pages_and_posts_sorted
• Introduced with CDH5+
[Diagram: YARN - clients submit applications to the Resource Manager, which allocates work to Node Managers across the cluster]
• Apache Tez runs on top of YARN, and provides a faster execution engine than MapReduce for Hive, Pig etc
• Models processing as an entire data flow graph (DAG), rather than separate job steps
‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems
‣Dataflow steps pass data between them as streams, rather than writing/reading from disk
• Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez
[Diagram (Hortonworks): the same workload as chained MapReduce jobs - each Map() / Reduce() stage writing intermediate results between Input Data and Output Data - versus a single Tez DAG of Map() and Reduce() vertices]
set hive.execution.engine=mr  : 4m 17s
set hive.execution.engine=tez : 2m 25s
• Apache Spark is more mature than Tez, with a richer API and more vendor support
‣Spark SQL
‣Spark Streaming
• Use of closures, iterations, and other common language constructs to minimize code
• Functional programming
• Unified API for batch and streaming (see the streaming sketch after this example)

scala> val logfile = sc.textFile("logs/access_log").cache
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
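As a rough illustration of that unified batch/streaming API (our own sketch, not from the deck - the input directory and batch interval are made up), the same filter can be applied to a live stream of log lines with Spark Streaming:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Re-use the shell's SparkContext, micro-batching every 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))

// Watch an HDFS directory for newly-arrived log files (illustrative path)
val logStream = ssc.textFileStream("hdfs:///user/mrittman/incoming_logs")

// Same logic as the batch example above, applied per micro-batch
val biapps11gHits = logStream.filter(line => line.contains("/biapps11g/"))
biapps11gHits.count().print()

ssc.start()
ssc.awaitTermination()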
‣Batch loading then added for initial data load into the system
[Diagram: real-time feeds delivering raw data to Hadoop nodes]
• Apache Flume is the standard way to transport log files from source through to target
‣Initial use-case was webserver log files, but can transport any file from A>B
‣Does not do data transformation, but can send to multiple targets / target types
• Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
• Similar SQL dialect to Hive - not as rich though, and no support for Hive SerDes, storage handlers etc
[Diagram: Impala daemons running on each Hadoop node, reading directly from HDFS etc]
• Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list
Simple two-table join against Hive data only - Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds)
• Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata
• Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
• Big Data SQL accesses Hive tables through external table mechanism
• Access parameters cluster and tablename specify Hive table source and BDA cluster
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
‣Allows users to analyze data without any ETL or up-front schema definitions, vs formal modelling in Hive etc
‣Improved agility and flexibility

0: jdbc:drill:zk=local> select state, city, count(*) totalreviews
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by state, city order by count(*) desc limit 10;
+------------+------------+--------------+
| state | city | totalreviews |
+------------+------------+--------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+--------------+
• Has the advantage of making use of all existing Hive scripts, infrastructure
• Spark SQL and DataFrames allow RDDs in Spark to be processed using SQL queries
• Load, read and save data in Hive, Parquet and other structured tabular formats
val accessLogsFilteredDF = accessLogs
  .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
  .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*"))
  .toDF()
accessLogsFilteredDF.registerTempTable("accessLogsFiltered")

// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
  "parquet", SaveMode.Overwrite)
• Beginners usually store data in HDFS using text file formats (CSV), but these have limitations
• Column-oriented formats such as Parquet only return (project) the columns you require across a wide table (see the sketch after this list)
• But Parquet (and HDFS) have significant limitations for real-time analytics applications
‣Real-time analytics-optimised
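A small sketch of that column projection idea, assuming the Spark 1.4+ DataFrame API and re-using the accessLogsFilteredDF DataFrame from the earlier example (the output path and selected columns are ours):

// sqlContext is the SQLContext already available in the Spark shell

// Write the filtered access logs as Parquet - columnar, compressed storage on HDFS
accessLogsFilteredDF.write.parquet("/user/mrittman/rm_logs_parquet")

// On read, only the referenced columns are scanned, not every field of every row
val pages = sqlContext.read.parquet("/user/mrittman/rm_logs_parquet")
  .select("endpoint", "agent")
pages.show(10)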
• Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled
[Diagram: a redacted subset of the customer data exposed to an Oracle database over SQL and JSON]
• Provides a higher-level, logical abstraction for data (i.e. tables or views)
‣Can be used with Spark & Spark SQL, with predicate pushdown and projection
• Returns schema-aware objects (instead of paths and bytes), in a similar way to HCatalog
‣Binary classification (a fuller sketch follows below)
‣Regression
‣Clustering

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
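For context, here is a minimal sketch of how those metrics lines might sit inside a complete MLlib binary-classification job (Spark 1.x RDD API; the training-data path and choice of logistic regression are our own illustration, not from the deck):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load labelled examples in LIBSVM format, split into train / test sets (path is illustrative)
val data = MLUtils.loadLibSVMFile(sc, "/user/mrittman/training_data.libsvm")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

// Train a simple logistic regression classifier (100 iterations)
val model = LogisticRegressionWithSGD.train(training, 100)
model.clearThreshold()   // return raw scores rather than 0/1 predictions

// Score the held-out data, then evaluate
val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println(s"Area under ROC = $auROC")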
• Automatically profile, parse and classify incoming datasets using Spark MLlib Word2Vec (see the sketch below)
• Spot and obfuscate sensitive data, and automatically suggest column names
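A very rough sketch of the Word2Vec piece of that idea (the tokenisation, input path and example query are ours; the surrounding profiling and obfuscation logic is not shown):

import org.apache.spark.mllib.feature.Word2Vec

// Treat each incoming record as a "sentence" of field values (illustrative path and format)
val records = sc.textFile("/user/mrittman/incoming/customers.csv")
  .map(line => line.toLowerCase.split(",").toSeq)

// Train a Word2Vec model over the field values
val model = new Word2Vec().fit(records)

// Values that appear in similar contexts get similar vectors - e.g. other city names
// cluster near a known city, which helps classify what a column contains
val similar = model.findSynonyms("london", 5)
similar.foreach { case (value, score) => println(s"$value  $score") }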
• Hadoop is evolving
‣Oracle Big Data SQL can access Hadoop data loaded in real-time