
Module 2

Learning Objectives
You will be able to:

▪ Understand the problem BigDL solves
▪ Understand how BigDL scales to large datasets
▪ Understand BigDL features

What is BigDL

 A distributed deep learning library for Apache Spark*
 Feature parity with popular deep learning frameworks:
Caffe*, Torch*, TensorFlow*
 High performance
Powered by Intel® Math Kernel Library (Intel® MKL) and multi-threaded programming
 Can scale to huge datasets
Using Apache Spark* for scale
 Open sourced in December 2016
*Other names and brands may be claimed as the property of others.
BigDL Version History
Date Version Description

2016 Dec 0.1 • First open source release

2017 July 0.2 • BigDL documentation website online
• Added support for TreeLSTM, TimeDistributed, 3D conv/pooling, and bi-recurrent layers
• Support for pip install
• Support for the Windows* platform
• Support for loading/saving Caffe*/TensorFlow* models

2017 Nov 0.3 • New model storage format
• Support for model quantization
• Support for sparse tensors and models
• Better TensorFlow* model support
• Support for Apache Spark* 2.2

2018 Jan 0.4 • Support for Keras* 1.2.2 model loading
• Python* 3.6 support
• OpenCV™ support, plus a dozen image transformers based on OpenCV

2018 Mar 0.5 • Keras-like API (Scala* and Python*)
• Support for loading TensorFlow* dynamic models
• Support for combining data preprocessing and neural network layers in the same model
• Speedups in various BigDL modules
• DataFrame-based image reader and transformer

2018 June 0.6 • Integration of Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to speed up CNN models
• Advanced model training support
• Support for Apache Spark* 2.3
BigDL & Apache Spark*
 Run BigDL applications as Apache Spark* applications
 Scala* and Python* support
 Use other Apache Spark* features
- In-memory compute
- Integration with the Apache Spark* ML library (MLlib) and streaming
 Easy development with Jupyter Notebook*
BigDL: Big Compute Plus Big Data
 BigDL helps us balance our needs:
- Big Compute: fast linear algebra via the Intel® MKL library
- Big Data: I/O parallelized to run on many CPUs
 BigDL allows massive scalability
- Runs on Apache Spark*
- Works with the Apache Hadoop* ecosystem (via Apache Spark*)
 Plays nicely with other Deep Learning frameworks
- Use existing TensorFlow* or Caffe* models to do inference in a distributed fashion (with massive memory)
- Train new models based on existing TensorFlow* / Caffe* models
 Overall, BigDL helps us take advantage of existing Big Data infrastructure
GPUs (Graphics Processing Units)

 GPU usage has caused a lot of excitement in the DL community
- Example: TensorFlow* is optimized to run well on GPUs
 CPUs in the past were not vectorized for parallel compute
- This meant that GPUs were much faster for deep learning
 Modern Intel® Xeon® CPUs have vectorized linear algebra
- Properly optimized, they approach the speed of GPUs
- They also offer faster I/O performance for Big Data

Intel® Math Kernel Library (Intel® MKL)

 Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
 Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions; no code changes required
 Dispatches optimized code for each processor automatically, without the need to branch code
 Provides priority support, connecting you directly to Intel engineers for confidential answers to technical questions

Intel® MKL Performance

GPU vs. CPU for Big Data

Feature     GPU                                              CPU
Data size   Works well on small to medium data with          Scales very well for large-scale data
            intensive compute
Scaling     Vertical scaling with few nodes (10s of nodes)   Horizontal scaling with 100s of nodes

Production ML/DL Systems are Complex!

The actual ML/DL code is only a small portion of a massive production system

BigDL enables us to build complex production systems by leveraging the Big Data ecosystem

BigDL Fills the 'gap' in Big Data + Deep Learning

 Follows proven design patterns for dealing with Big Data
 Sends 'compute to data' rather than reading massive data over the network
- Uses the 'data locality' of HDFS (Hadoop* Distributed File System)
 Utilizes 'cluster managers' like YARN / Mesos
 Automatically handles hardware/software failures
 Elasticity and resource sharing in a cluster

BigDL Use Cases

 Image processing at JD.com
 Personalized recommendations at MasterCard*
 MLSListings*
 Fraud detection at UnionPay

Use Case: BigDL @ JD – Image Feature Extraction

About
- JD.com is China’s largest online retailer and its biggest overall retailer
- They have 300M+ customers, with annual revenue of US $55 billion (2017)
Project
- Image feature extraction is used in image-similarity search
- The JD team was trying to do feature extraction on hundreds of millions of images from the product catalog
(Figure: a red handbag and the similar items found for it)
Use Case: BigDL @ JD – Image Feature Extraction

Issues
- The JD.com team first tried to build feature extraction using GPUs and clusters
- Resource allocation across GPUs and clusters proved to be error prone, resulting in frequent out-of-memory errors, etc.
- The dependencies of GPU applications made production deployment more complex
- Reading images took a long time, which could not be easily optimized in GPU-based solutions

Use Case: BigDL @ JD – Image Feature Extraction

BigDL-based solution

Use Case: BigDL @ JD – Image Feature Extraction
BigDL-based solution
- Read hundreds of millions of pictures from a distributed database into Apache Spark* as resilient distributed datasets (RDDs)
- Preprocess the RDD of images (including resizing, normalization, and batching) in Apache Spark*
- Use BigDL to load the Single Shot Detector (SSD) model for large-scale, distributed object detection on Apache Spark*
- Use the model for distributed feature extraction of the target images on Apache Spark*, which generates the corresponding features
- Store the results (an RDD of extracted object features) in the Hadoop* Distributed File System (HDFS)
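The five steps above can be sketched as a pure-Python skeleton. Every helper below (`read_images`, `preprocess`, `detect_objects`, `extract_features`, `store`) is a hypothetical stand-in for the corresponding Spark/BigDL call, not BigDL's real API; only the shape of the pipeline is taken from the slide.

```python
# Skeleton of the JD pipeline: read -> preprocess -> detect -> extract -> store.
# All functions are illustrative stubs, not Spark/BigDL APIs.

def read_images():            # stand-in for reading the image RDD from the database
    return ["img-bytes-1", "img-bytes-2"]

def preprocess(img):          # stand-in for resize / normalize / batch
    return ("resized", img)

def detect_objects(img):      # stand-in for the BigDL SSD object-detection model
    return ("object-crop", img)

def extract_features(obj):    # stand-in for the distributed feature-extraction model
    return ("features", obj)

def store(features):          # stand-in for writing the result RDD to HDFS
    return list(features)

images = read_images()
features = (extract_features(detect_objects(preprocess(i))) for i in images)
stored = store(features)      # one feature record per input image
```

In the real deployment each stage runs over RDD partitions in parallel; the stub chain only shows the order of the steps.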
Use Case: BigDL @ JD – Image Feature Extraction

WINS
- Take advantage of existing Big Data infrastructure (Apache Spark*, HDFS)
- Take advantage of the scalability of Apache Spark*
- Do all image processing at massive scale
- Much better performance

Use Case: Personalized Recommendations at MasterCard*
- MasterCard* has 2.4 billion cards and 56 billion transactions
- Goal: offer personalized recommendations to customers
- Dataset:
- 3 years of data
- 675K consumers
- 1.4 billion transactions (50 GB+ of raw data)
- ~2,000 target merchants

Use Case: Personalized Recommendations at MasterCard*
Architecture:

Use Case: Personalized Recommendations at MasterCard*
WINS:
- Scale out on an Apache Hadoop* & Apache Spark* cluster
- 6-node cluster: 24 hyper-threaded cores, 384 GB memory, 20 TB disk
- Cloudera Hadoop* with Spark* 2.2
- Run the Spark* ALS algorithm + BigDL NCF (neural collaborative filtering)

Apache Spark* Execution Model

Scaling to Large Datasets

When large files are uploaded into Apache Hadoop*, they are partitioned across the cluster

Scaling to Large Datasets

Apache Spark* can process each partition of data in parallel

Scaling to Large Datasets

 Apache Spark* will use 'location hints' provided by HDFS to read data efficiently
 The best I/O throughput is achieved when we read 'local data'
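The idea behind location hints can be sketched in plain Python. This is a conceptual illustration, not Spark's scheduler: the names `schedule`, `block_locations`, and `nodes` are hypothetical, but the preference order (local replica first, remote read as fallback) is the one described above.

```python
# Minimal sketch of locality-aware scheduling using HDFS-style location hints.

def schedule(block_locations, nodes):
    """Assign each data block to a node that holds a replica when possible."""
    assignments = {}
    for block, replicas in block_locations.items():
        # Prefer a node that already stores a replica (a fast 'local' read)...
        local = [n for n in nodes if n in replicas]
        # ...and fall back to any node (a slower remote read) otherwise.
        assignments[block] = local[0] if local else nodes[0]
    return assignments

# Hints say block-1 is replicated on node-a/node-b, block-2 only on node-c.
hints = {"block-1": {"node-a", "node-b"}, "block-2": {"node-c"}}
plan = schedule(hints, ["node-a", "node-c"])
# Both blocks get a local read: block-1 on node-a, block-2 on node-c.
```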

A Simple Example of Distributed Computing: Distributed Count
 Here we are doing a distributed count
 The data is distributed across many worker machines:
- Many partitions
 Each worker performs a count on its local data (partitions)
 Then the intermediate counts are sent to an aggregator to sum them all up
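The steps above can be sketched in pure Python, with lists standing in for the partitions that Spark would spread across worker machines (this mirrors what `rdd.count()` does across executors, but nothing here is Spark's actual API):

```python
# A pure-Python sketch of the distributed count: each 'worker' counts its
# own partition, and an aggregator sums the partial counts.

data = list(range(10))                            # the full dataset
partitions = [data[0:4], data[4:7], data[7:10]]   # split across 3 'workers'

# Step 1: each worker counts its local partition only.
partial_counts = [len(p) for p in partitions]     # [4, 3, 3]

# Step 2: the aggregator sums the intermediate counts.
total = sum(partial_counts)                       # equals len(data)
```

No worker ever needs to see the whole dataset, which is what lets the count scale out.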

BigDL on Apache Spark*
 BigDL runs as standard
Apache Spark* jobs
- No changes to Apache
Spark* required
 Each iteration of training
runs as an Apache Spark*
job
 Apache Spark* jobs
process data in parallel
(bigger diagram on next slide)
BigDL on Apache Spark*

BigDL Features (Detailed in Next Slides)

 Scales to large datasets using Apache Spark* (already discussed)
 Provides standard Deep Learning algorithms out of the box
 Supports Python* and Scala*
 Compatible with Apache Spark* ML Library
 TensorBoard* compatible visualizations
 Runs on cloud platforms

BigDL Features: Provides Popular DL Algorithms
Out of the Box
Application Algorithm
Image classification ResNet50, Inception, VGG
Object detection Single Shot Detector (SSD)
Recommendation NCF, Wide and Deep
Text classification CNN, LSTM-based models
Data generation Auto-encoder, VAE

BigDL Features: Python* and Scala* Support
 BigDL is written in Scala*
 Python* is supported via a 'language wrapper' layer
 We can write our programs in Scala* or Python*
 Python* has become the language of choice for Machine Learning and Deep Learning (see next slide for usage)
 This architecture gives the flexibility of Python* with the speed provided by Intel® MKL
 Similar to how Apache Spark* supports multiple languages (Scala* core + language wrappers)
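The 'language wrapper' architecture can be illustrated with a tiny sketch. `FastCore` below is a plain-Python stand-in for the optimized Scala/MKL core, and `PyWrapper` for the Python layer; neither name is from BigDL, only the delegation pattern is:

```python
# Conceptual sketch of the language-wrapper pattern: a thin Python-facing
# class forwards every call to a fast core engine and does no math itself.

class FastCore:
    """Stand-in for the optimized Scala/Intel MKL core."""
    def dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

class PyWrapper:
    """Thin Python API that only delegates heavy work to the core."""
    def __init__(self):
        self._core = FastCore()

    def dot(self, a, b):
        # The wrapper adds Python ergonomics; the core supplies the speed.
        return self._core.dot(a, b)

result = PyWrapper().dot([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6 = 32
```

In BigDL the boundary crosses from Python into the JVM, but the division of labor is the same: the wrapper stays thin so performance is not lost.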
Popular Languages for Machine Learning

BigDL Features: Apache Spark* ML Compatibility

 Analytics Zoo provides high-level APIs to simplify programming
- Supports the Apache Spark* DataFrame-based API
 Starting with version 2, Apache Spark* provides newer data structures: DataFrame and Dataset
 These new data models are highly optimized for speed and memory usage
 Plus, they are very easy to use
 Going forward, the newer data structures (DataFrame, Dataset) will replace Apache Spark's* older data model, the RDD
BigDL Features: Apache Spark* ML Compatibility

 Natively integrates with the Apache Spark* Machine Learning Pipeline
 An ML Pipeline is an easy-to-use workflow that executes multiple steps together and in sequence
 Here is an example of text analytics steps chained into a pipeline
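The fit/transform chaining that Spark's ML Pipeline implements can be sketched in plain Python. The toy stages below (`Tokenizer`, `StopWordRemover`) are illustrative stand-ins, not Spark's actual `Tokenizer`/`HashingTF` classes; the point is only that each stage's output feeds the next stage in sequence:

```python
# Minimal sketch of the pipeline pattern: stages run in order, each
# transforming the output of the previous one.

class Tokenizer:
    def transform(self, docs):
        return [d.lower().split() for d in docs]

class StopWordRemover:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def transform(self, token_lists):
        return [[t for t in toks if t not in self.stop_words]
                for toks in token_lists]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:   # execute the steps together, in sequence
            data = stage.transform(data)
        return data

pipeline = Pipeline([Tokenizer(), StopWordRemover(["the", "a"])])
out = pipeline.transform(["The quick fox", "A lazy dog"])
# out == [["quick", "fox"], ["lazy", "dog"]]
```

In Spark the same pattern operates on DataFrames and includes trainable stages (estimators), but the chaining idea is identical.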

BigDL Features: Visualizations Using TensorBoard*

 TensorBoard* is a popular visualization tool for Deep Learning
- TensorBoard* is part of the TensorFlow* project
 TensorBoard* gives insight into the learning process by displaying metrics like accuracy, loss, throughput, etc.
 BigDL can write logs in TensorBoard* format, which can then be visualized using TensorBoard*
 See next slide for diagram
 See next slide for diagram

BigDL and TensorBoard*

BigDL Features: Import Models From Other NN
Frameworks
Framework     BigDL can Import From             BigDL can Export To
TensorFlow*   Yes (supported operation list)    Yes
Caffe*        Yes (supported layer list)        Yes
Keras*        Yes                               No

BigDL Features: Cloud Platform Support

 BigDL supports the following platforms:
- Amazon Web Services (AWS)*
- Google Cloud*
- Microsoft Azure*
- IBM Cloud*
- Others

Running BigDL on AWS*

 Use the BigDL AMI (Amazon Machine Image) from the AWS Marketplace
 Includes:
- BigDL
- Python*
- Apache Spark*
 BigDL is free to use – pay only standard AWS charges

Running BigDL in Google Cloud* Platform

 BigDL runs on Google Cloud* via Dataproc
 Use this script to spin up BigDL easily
 The script will download the latest BigDL (customizable) and spin up an Apache Spark* cluster

BigDL on Kubernetes*

 Kubernetes*, with Docker*, has become a very popular deployment platform
 BigDL can leverage Apache Spark* on Kubernetes*
 Once an Apache Spark* cluster is provisioned on Kubernetes*, BigDL jobs can be submitted to it

Noteworthy Upcoming Features

BigDL is a very fast-moving project – new features are added with each release
Upcoming features include:
- Keras2* API support
- Model serving
- More NLP (natural language processing) support

Summary

We have learned about:
- Large-scale learning using BigDL and Apache Spark*
- BigDL features
- BigDL use cases

Further Reading

 Official BigDL home page
