
Module 2

Learning Objectives
You will be able to:

▪ Understand the problem BigDL solves
▪ Understand how BigDL scales to large datasets
▪ Understand BigDL features

What is BigDL

 A distributed deep learning library for Apache Spark*
 Feature parity with popular deep learning frameworks:
Caffe*, Torch*, TensorFlow*
 High performance
Powered by Intel® Math Kernel Library (Intel® MKL) and multi-threaded programming
 Can scale to huge datasets
Using Apache Spark* for scale
 Open sourced in December 2016
*Other names and brands may be claimed as the property of others.
BigDL Version History
Date Version Description

2016 Dec 0.1 • First open source release

2017 July 0.2 • BigDL documentation website online
• Added support for TreeLSTM, TimeDistributed, 3D conv/pooling, and bi-recurrent layers
• Support for pip install
• Support for the Windows* platform
• Support for loading/saving Caffe*/TensorFlow* models

2017 Nov 0.3 • New model storage format
• Support for model quantization
• Support for sparse tensors and models
• Better TensorFlow* model support
• Support for Apache Spark* 2.2

2018 Jan 0.4 • Support for Keras* 1.2.2 model loading
• Python* 3.6 support
• OpenCV™ support, plus a dozen image transformers based on OpenCV

2018 Mar 0.5 • Keras-like API (Scala* and Python*)
• Support for loading TensorFlow* dynamic models
• Support for combining data preprocessing and neural network layers in the same model
• Speedups in various BigDL modules
• DataFrame-based image reader and transformer

2018 June 0.6 • Integration of Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to speed up CNN models
• Advanced model training support
• Support for Apache Spark* 2.3
BigDL & Apache Spark*
 Run BigDL applications as Apache Spark* applications
 Scala* and Python* support
 Use other Apache Spark* features
- In-memory compute
- Integration with the Apache Spark* ML library (MLlib) and streaming
 Easy development with Jupyter Notebook*
BigDL: Big Compute Plus Big Data
 BigDL helps us balance our needs:
- Big Compute: fast linear algebra via the Intel® MKL library
- Big Data: I/O parallelized to run on many CPUs
 BigDL allows massive scalability
- Runs on Apache Spark*
- Works with the Apache Hadoop* ecosystem (via Apache Spark*)
 Plays nicely with other Deep Learning frameworks
- Use existing TensorFlow* or Caffe* models to do inference in a distributed fashion (with massive memory)
- Train new models based on existing TensorFlow* / Caffe* models
 Overall, BigDL helps us take advantage of existing Big Data infrastructure
GPUs (Graphics Processing Units)

 GPU usage has caused a lot of excitement in the DL community
- Example: TensorFlow* is optimized to run well on GPUs
 CPUs in the past were not vectorized for parallel compute
- This meant that GPUs were much faster for deep learning
 Modern Intel® Xeon® CPUs have vectorized linear algebra
- Properly optimized, they approach the speed of GPUs
- They also offer faster I/O performance for Big Data

Intel® Math Kernel Library (Intel® MKL)

 Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
 Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions; no code changes required
 Dispatches optimized code for each processor automatically, without the need to branch code
 Provides priority support, connecting you directly to Intel engineers for confidential answers to technical questions

Intel® MKL Performance

GPU vs. CPU for Big Data

Feature     GPU                                              CPU
Data size   Works well on small to medium data with          Scales very well for large-scale data
            intensive compute
Scaling     Vertical scaling with few nodes (10s of nodes)   Horizontal scaling with 100s of nodes

Production ML/DL Systems are Complex!

The actual ML/DL code is only a small portion of a massive production system

BigDL enables us to build complex production systems by leveraging the Big Data ecosystem

BigDL Fills the 'gap' in Big Data + Deep Learning

 Follows proven design patterns for dealing with Big Data
 Sends 'compute to data' rather than reading massive data over the network
- Uses the 'data locality' of HDFS (Hadoop* Distributed File System)
 Utilizes 'cluster managers' like YARN / Mesos
 Automatically handles hardware/software failures
 Elasticity and resource sharing in a cluster

BigDL Use Cases

 Image processing at JD.com
 Personalized recommendations at MasterCard*
 MLSListings*
 Fraud detection at UnionPay

Use Case: BigDL @ JD – Image Feature Extraction

About
- JD.com is China’s largest online retailer and its biggest overall retailer
- They have 300M+ customers, with annual revenue of US $55 billion (2017)
Project
- Image feature extraction is used in image-similarity search
- The JD team was trying to do feature extraction on hundreds of millions of images from the product catalog
(Figure: a red handbag and the similar items found for it)
Use Case: BigDL @ JD – Image Feature Extraction

Issues
- The JD.com team first tried to build feature extraction using GPUs and clusters
- Resource allocation across GPUs and clusters proved to be error prone, resulting in frequent out-of-memory errors, etc.
- The dependencies of GPU applications made production deployment more complex
- Reading images took a long time, which could not be easily optimized in GPU-based solutions

Use Case: BigDL @ JD – Image Feature Extraction

BigDL-based solution

Use Case: BigDL @ JD – Image Feature Extraction
BigDL-based solution
- Read hundreds of millions of pictures from a distributed database into Apache Spark* as resilient distributed datasets (RDDs)
- Preprocess the RDD of images (including resizing, normalization, and batching) in Apache Spark*
- Use BigDL to load the Single Shot Detector (SSD) model for large-scale, distributed object detection on Apache Spark*
- Use the model for distributed feature extraction of the target images on Apache Spark*, which generates the corresponding features
- Store the results (an RDD of extracted object features) in the Hadoop* Distributed File System (HDFS)
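The five steps above can be sketched as a pure-Python skeleton. Every helper below (`read_images`, `preprocess`, `detect_objects`, `extract_features`, `store`) is a hypothetical stand-in for the corresponding Spark/BigDL call, not BigDL's real API; only the shape of the pipeline is taken from the slide.

```python
# Skeleton of the JD pipeline: read -> preprocess -> detect -> extract -> store.
# All functions are illustrative stubs, not Spark/BigDL APIs.

def read_images():            # stand-in for reading the image RDD from the database
    return ["img-bytes-1", "img-bytes-2"]

def preprocess(img):          # stand-in for resize / normalize / batch
    return ("resized", img)

def detect_objects(img):      # stand-in for the BigDL SSD object-detection model
    return ("object-crop", img)

def extract_features(obj):    # stand-in for the distributed feature-extraction model
    return ("features", obj)

def store(features):          # stand-in for writing the result RDD to HDFS
    return list(features)

images = read_images()
features = (extract_features(detect_objects(preprocess(i))) for i in images)
stored = store(features)      # one feature record per input image
```

In the real deployment each stage runs over RDD partitions in parallel; the stub chain only shows the order of the steps.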
Use Case: BigDL @ JD – Image Feature Extraction

WINS
- Take advantage of existing Big Data infrastructure (Apache Spark*, HDFS)
- Take advantage of the scalability of Apache Spark*
- Do all image processing at massive scale
- Much better performance

Use Case: Personalized Recommendations at MasterCard*
- MasterCard* has 2.4 billion cards and 56 billion transactions
- Goal: offer personalized recommendations to customers
- Dataset:
- 3 years of data
- 675K consumers
- 1.4 billion transactions (50 GB+ of raw data)
- ~2,000 target merchants

Use Case: Personalized Recommendations at MasterCard*
Architecture:

Use Case: Personalized Recommendations at MasterCard*
WINS:
- Scale out on an Apache Hadoop* & Apache Spark* cluster
- 6-node cluster: 24 hyper-threaded cores, 384 GB memory, 20 TB disk
- Cloudera Hadoop* with Spark* 2.2
- Run the Spark* ALS algorithm + BigDL NCF (neural collaborative filtering)

Apache Spark* Execution Model

Scaling to Large Datasets

When large files are uploaded into Apache Hadoop*, they are partitioned across the cluster

Scaling to Large Datasets

Apache Spark* can process each partition of data in parallel

Scaling to Large Datasets

 Apache Spark* will use 'location hints' provided by HDFS to read data efficiently
 The best I/O throughput is achieved when we read 'local data'
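The idea behind location hints can be sketched in plain Python. This is a conceptual illustration, not Spark's scheduler: the names `schedule`, `block_locations`, and `nodes` are hypothetical, but the preference order (local replica first, remote read as fallback) is the one described above.

```python
# Minimal sketch of locality-aware scheduling using HDFS-style location hints.

def schedule(block_locations, nodes):
    """Assign each data block to a node that holds a replica when possible."""
    assignments = {}
    for block, replicas in block_locations.items():
        # Prefer a node that already stores a replica (a fast 'local' read)...
        local = [n for n in nodes if n in replicas]
        # ...and fall back to any node (a slower remote read) otherwise.
        assignments[block] = local[0] if local else nodes[0]
    return assignments

# Hints say block-1 is replicated on node-a/node-b, block-2 only on node-c.
hints = {"block-1": {"node-a", "node-b"}, "block-2": {"node-c"}}
plan = schedule(hints, ["node-a", "node-c"])
# Both blocks get a local read: block-1 on node-a, block-2 on node-c.
```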

A Simple Example of Distributed Computing: Distributed Count
 Here we are doing a distributed count
 The data is distributed across many worker machines:
- Many partitions
 Each worker performs a count on its local data (partitions)
 Then the intermediate counts are sent to an aggregator to sum them all up
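The steps above can be sketched in pure Python, with lists standing in for the partitions that Spark would spread across worker machines (this mirrors what `rdd.count()` does across executors, but nothing here is Spark's actual API):

```python
# A pure-Python sketch of the distributed count: each 'worker' counts its
# own partition, and an aggregator sums the partial counts.

data = list(range(10))                            # the full dataset
partitions = [data[0:4], data[4:7], data[7:10]]   # split across 3 'workers'

# Step 1: each worker counts its local partition only.
partial_counts = [len(p) for p in partitions]     # [4, 3, 3]

# Step 2: the aggregator sums the intermediate counts.
total = sum(partial_counts)                       # equals len(data)
```

No worker ever needs to see the whole dataset, which is what lets the count scale out.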

BigDL on Apache Spark*
 BigDL runs as standard
Apache Spark* jobs
- No changes to Apache
Spark* required
 Each iteration of training
runs as an Apache Spark*
job
 Apache Spark* jobs
process data in parallel
(bigger diagram on next slide)
BigDL on Apache Spark*

BigDL Features (Detailed in Next Slides)

 Scales to large datasets using Apache Spark* (already discussed)
 Provides standard Deep Learning algorithms out of the box
 Supports Python* and Scala*
 Compatible with Apache Spark* ML Library
 TensorBoard* compatible visualizations
 Runs on cloud platforms

BigDL Features: Provides Popular DL Algorithms
Out of the Box
Application Algorithm
Image classification ResNet50, Inception, VGG
Object detection Single Shot Detector (SSD)
Recommendation NCF, Wide and Deep
Text classification CNN, LSTM-based models
Data generation Auto-encoder, VAE

BigDL Features: Python* and Scala* Support
 BigDL is written in Scala*
 Python* is supported via a 'language wrapper' layer
 We can write our programs in Scala* or Python*
 Python* has become the language of choice for Machine Learning and Deep Learning (see next slide for usage)
 This architecture gives the flexibility of Python* with the speed provided by Intel® MKL
 Similar to how Apache Spark* supports multiple languages (Scala* core + language wrappers)
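The 'language wrapper' architecture can be illustrated with a tiny sketch. `FastCore` below is a plain-Python stand-in for the optimized Scala/MKL core, and `PyWrapper` for the Python layer; neither name is from BigDL, only the delegation pattern is:

```python
# Conceptual sketch of the language-wrapper pattern: a thin Python-facing
# class forwards every call to a fast core engine and does no math itself.

class FastCore:
    """Stand-in for the optimized Scala/Intel MKL core."""
    def dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

class PyWrapper:
    """Thin Python API that only delegates heavy work to the core."""
    def __init__(self):
        self._core = FastCore()

    def dot(self, a, b):
        # The wrapper adds Python ergonomics; the core supplies the speed.
        return self._core.dot(a, b)

result = PyWrapper().dot([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6 = 32
```

In BigDL the boundary crosses from Python into the JVM, but the division of labor is the same: the wrapper stays thin so performance is not lost.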
Popular Languages for Machine Learning

BigDL Features: Apache Spark* ML Compatibility

 Analytics Zoo provides high-level APIs to simplify programming
- Supports the Apache Spark* DataFrame-based API
 Starting with version 2, Apache Spark* provides newer data structures: DataFrame and Dataset
 These new data models are highly optimized for speed and memory usage
 Plus, they are very easy to use
 Going forward, the newer data structures (DataFrame, Dataset) will replace Apache Spark's* older data model, the RDD
BigDL Features: Apache Spark* ML Compatibility

 Natively integrates with the Apache Spark* Machine Learning Pipeline
 An ML Pipeline is an easy-to-use workflow that executes multiple steps together and in sequence
 Here is an example of text analytics steps chained into a pipeline
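The fit/transform chaining that Spark's ML Pipeline implements can be sketched in plain Python. The toy stages below (`Tokenizer`, `StopWordRemover`) are illustrative stand-ins, not Spark's actual `Tokenizer`/`HashingTF` classes; the point is only that each stage's output feeds the next stage in sequence:

```python
# Minimal sketch of the pipeline pattern: stages run in order, each
# transforming the output of the previous one.

class Tokenizer:
    def transform(self, docs):
        return [d.lower().split() for d in docs]

class StopWordRemover:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def transform(self, token_lists):
        return [[t for t in toks if t not in self.stop_words]
                for toks in token_lists]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:   # execute the steps together, in sequence
            data = stage.transform(data)
        return data

pipeline = Pipeline([Tokenizer(), StopWordRemover(["the", "a"])])
out = pipeline.transform(["The quick fox", "A lazy dog"])
# out == [["quick", "fox"], ["lazy", "dog"]]
```

In Spark the same pattern operates on DataFrames and includes trainable stages (estimators), but the chaining idea is identical.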

BigDL Features: Visualizations Using TensorBoard*

 TensorBoard* is a popular visualization tool for Deep Learning
- TensorBoard* is part of the TensorFlow* project
 TensorBoard* gives insight into the learning process by displaying metrics like accuracy, loss, throughput, etc.
 BigDL can write logs in TensorBoard* format, which can then be visualized using TensorBoard*
 See next slide for diagram
 See next slide for diagram

BigDL and TensorBoard*

BigDL Features: Import Models From Other NN
Frameworks
Framework     BigDL can Import From             BigDL can Export To
TensorFlow*   Yes (supported operation list)    Yes
Caffe*        Yes (supported layer list)        Yes
Keras*        Yes                               No

BigDL Features: Cloud Platform Support

 BigDL supports the following platforms:
- Amazon Web Services (AWS)*
- Google Cloud*
- Microsoft Azure*
- IBM Cloud*
- Others

Running BigDL on AWS*

 Use the BigDL AMI (Amazon Machine Image) from the AWS Marketplace
 Includes:
- BigDL
- Python*
- Apache Spark*
 BigDL is free to use – pay only standard AWS charges

Running BigDL in Google Cloud* Platform

 BigDL runs on Google Cloud* via Dataproc
 Use this script to spin up BigDL easily
 The script will download the latest BigDL (customizable) and spin up an Apache Spark* cluster

BigDL on Kubernetes*

 Kubernetes*, with Docker*, has become a very popular deployment platform
 BigDL can leverage Apache Spark* on Kubernetes*
 Once an Apache Spark* cluster is provisioned on Kubernetes*, BigDL jobs can be submitted to it

Noteworthy Upcoming Features

BigDL is a very fast-moving project – new features are added with each release
Upcoming features include:
- Keras2* API support
- Model serving
- More NLP (natural language processing) support

Summary

We have learned about:
- Large-scale learning using BigDL and Apache Spark*
- BigDL features
- BigDL use cases

Further Reading

 Official BigDL home page
