
Data Science and Data Scientists

Course Overview

Asst. Prof. Dr. Santitham Prom-on


Department of Computer Engineering, Faculty of Engineering
King Mongkut's University of Technology Thonburi


Data science
Exploratory data analysis



R Tableau

Schedule
Morning sessions: 9am-12pm. Afternoon sessions: 1pm-3pm.

Day 1
Morning: Module 1 Introduction to data science
Afternoon: Module 2 Data modeling and visualization tool

Day 2
Morning: Module 3 Exploratory data analysis
Afternoon: Module 4 Lab: visualization using Tableau

Day 3
Morning: Module 5 Statistical analysis and DOE
Afternoon: Module 6 Lab: Data preparation using R

Day 4
Morning: Module 7 Predictive modeling
Afternoon: Module 8 Fitting a model to data

Day 5
Morning: Module 9 Similarities and clustering
Afternoon: Module 10 Data analytic thinking

Santitham Prom-on, Ph.D.


Experience

Current Positions
Assistant Professor
Department of Computer Engineering
King Mongkut's University of Technology Thonburi

Data Scientist
Big Data Experience Center

Senior Consultant
InsightEra Co. Ltd.

Honorary Research Associate
Department of Speech, Hearing and Phonetic Sciences,
University College London, UK

7 years working at KMUTT


9 research articles in international journals
20+ international conference papers
13 research projects (only as PI)
8 industrial consultant projects
14+ workshops/short courses for industry
2 start-up companies

Data Science and Data Scientists


Module 1: Introduction to data science
28 November 2016, 9:00-12:00


Module I Overview
Learning Outcome
Understand the relationships
between big data and data science
Explain the basic concepts of data
science and the roles of data
scientists
Identify the common data science
tasks within a given problem
Recognize the problems that can be
solved by the data science process

Agenda
Big data and data science
Data scientists
Case studies in data science

Gartner's hype cycle 2011
Highlighted: Social Analytics; Big Data and Extreme Information; Predictive Analytics

Gartner's hype cycle 2015
Highlighted: Self-Service Analytics; Citizen Data Science

The rise of data

A course in data science

DIKW (D) Pyramid

From descriptive to prescriptive

Analytics Capabilities Framework (Business Analytics)

Descriptive (Reactive)
Questions: What happened? What is happening?
Enablers: Business reporting, dashboards, scorecards, data warehousing
Outcomes: Well-defined business problems and opportunities

Diagnostic (Reactive)
Questions: Why did it happen?
Enablers: Behavior analysis, cause and effect analysis, correlation
Outcomes: Cause and effect of changes in business activities

Predictive (Proactive)
Questions: What will happen?
Enablers: Data mining, text mining, Internet mining, forecasting
Outcomes: Accurate projections of the future states and conditions

Prescriptive (Proactive)
Questions: What should I do? Why should I do it?
Enablers: Optimization, simulation, decision modeling, expert systems
Outcomes: Best possible business decision and transaction

The Synopsis
A set of fundamental concepts/principles that underlie techniques
for extracting useful knowledge from data.
How data science fits in the organization
General ways of thinking about data analytically
General concepts for actually extracting knowledge from data.

This is not an algorithms course and does not presume a
sophisticated mathematical background.
We will
Discuss a set of concepts in analytic-thinking
Develop frameworks to structure the analysis so that it is systematic.
Understand data science for business strategy

The Synopsis
The science
Extracting useful knowledge from data to solve business
problems can be treated systematically by following a
process with reasonably well-defined stages.

The technology
From a large mass of data, IT can be used to find
informative descriptive attributes of entities of interest

Grow Business with Big Data

Anticipate demand and competitor activity


Availability at right location (inventory)
Dynamic pricing
Relevant promotions and timely offers
Accelerate customer acquisition
Improve customer loyalty
Develop more innovative products

Data-Analytic Thinking
Every aspect of business is open to data
collection
Operations/Manufacturing
Supply-chain management
Customer behavior
Marketing campaign performance
Marketing trends
Industry news
Competitors' movements

Applications of Data Science

Targeted marketing
Campaign combinations for effective up-selling
Recommendations for cross-selling
Customer behavior analysis: the key to marketing in this digital
era is contacting customers just
When they wish to be reached
When they are in the right location
Then engaging them with personalized real-time offers.

Predicting customer churn


Fraud detection
Workforce management

"Predictive analytics utilizes a variety of statistical, modeling, data
mining, and machine learning techniques to study recent and
historical data, thereby allowing analysts to make predictions
about the future." (Forbes)

Data Engineering and Data Science

Data-driven decision making (DDD): the practice of basing decisions on the
analysis of data rather than on intuition.
Data science: principles and techniques for understanding phenomena via the
analysis of data.
Data engineering: accessing and processing massive-scale data flexibly and
efficiently with Big Data technologies.
The data analysis is not testing a simple hypothesis; instead, the data are
explored with the hope that something useful will be discovered.

Big Data Technologies

Big Data = datasets that are too large for traditional
data processing systems.
Big Data 1.0: Firms are building the capabilities to
process large data, in support of their current
operations.
Big Data 2.0: Firms are asking "what can I now do
that I couldn't do before, or do better than before?"
It's the golden era of data science

Trends of data analytics

A Case Study in Consumer Credit Cards

In the 1980s, credit cards had uniform pricing
A small regional bank in Virginia, Signet Bank, started
thinking about modeling profitability
Offered different terms to different customers
Made better offers to the best customers

Signet Bank started to view data as an asset
worth investing in
But how?

Case Studies in Retail

Target
Targeted marketing: pregnancy prediction from changes
in buying behavior
Amazon
Rankings and recommendations
Method and system for anticipatory package
shipping
Jet
Membership fee business model: online version
of Costco

Target
Apply analytical skills to help the store enhance revenues,
forecast trends, improve processes and much more.

Target identified about 25 products that, when analyzed
together, suggested a shopper's pregnancy.
Target can also estimate the due date to within a small window
from purchasing behaviors.

Target inferred that a teenage girl in Minnesota was pregnant
(without her father knowing), based on a formula involving
Elevated rates of buying unscented lotion
Mineral supplements and cotton balls.
Target started sending her coupons for baby gear

Amazon

Recommendation system (brought 30% more business)
Data: shopping cart, wish list, previous purchases, items rated
and reviewed, geo-location, time-on-site, duration of views,
links clicked, telephone inquiries, responses to marketing
materials
Heavily customize the customer browsing experience and
perfect the art of cross-selling.

Anticipatory package shipping
Predict customer orders and ship goods
toward geographical locations without a specific
address.

Like Costco, Jet makes money on membership fees
Like eBay, Jet functions as a marketplace
Cuts logistics costs by buying locally and shipping multiple
items at once.
Introduces software that tracks everything, so customers
can watch their savings add up and judge whether it's
worth the annual fee.

Application of big data and data science in customer analytics

Problem: Customer Churn
Big Data Capabilities:
Include customer contact (e.g. call center transcripts) and social media data
Analyze customer sentiment
Model and score churn propensity
Business Impact:
Timely prediction and reduction of churn

Problem: Cross- and Upselling
Big Data Capabilities:
Analyze and model response behavior
Select campaign addresses based on micro-segmentation
Business Impact:
Efficient and precisely targeted marketing
Increase cross- and up-selling

Problem: Segmentation
Big Data Capabilities:
Advanced analytics to enhance client lifestyle analysis and profiling
Predictive analysis of spend
Business Impact:
Client lifestyle analysis and spend prediction
Increase customer satisfaction

Problem: Risk Profiling
Big Data Capabilities:
Refine risk profiling models frequently to adapt to the dynamic business environment
Business Impact:
Comprehensive risk profiling
Improved risk evaluation

Case studies - Financial compliance

Classification - To identify irrelevant messages such as automated notifications or newsletters.
Graph analysis - To build communication profiles of individuals. This technique is often used in
security analytics and malware detection to identify anomalous behavior. Graph analysis can
establish hot spots of fraudulent activity based on who is talking to whom.
Text analytics - To identify the language behind fraud and determine the sentiment and certainty in the
language of a trader before and after executing trades.

Case studies on Telecom: Fraud Detection
Examples of fraud by types
International revenue share fraud
Premium rate service fraud
Interconnect bypass fraud

Examples of fraud by methods

Subscription fraud
PBX hacking
Wangiri fraud
Phishing
Abuse of service terms and conditions
SMS faking

Telecom Fraud Detection
From Traditional to Big Data Analytics

Traditional Approach
Silos of data across multiple systems
Batch approach using ETL/EDW
Rules based on old, known types of fraud
Business intelligence

Toward Big Data Analytics
Batch -> real-time
Thresholds -> anomaly detection
Rules -> machine learning
SQL -> SQL and graph analysis
Silos of data -> data lakes
Scale-up hardware -> commodity Hadoop architectures

Telecom Fraud Detection


Visualizing Wangiri Attack

Telecom Fraud Detection


Using Data Lake to Enhance the Accuracy
TD.35 + Billing + CRM

Machine Learning to Detect Wangiri Attack

Detecting the accomplice

Criminals typically need accomplices.
Using graph theory, we can discover the
accomplice.
Criminals may have made tens of thousands of
calls.
However, most of the numbers are called only
once or twice, and very few of those calls become a
conversation.
What we can see here is that the criminal speaks
primarily to four people, and one of them he not
only calls but also gets calls back from regularly.
This is his local accomplice.
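
The reasoning above can be sketched as a small graph query over call records. This is only an illustration under stated assumptions: the call list, the number names (`crim`, `A` to `D`, `victim…`), the function names, and the `min_calls` threshold are all hypothetical, and a real system would work on actual call detail records.

```python
from collections import Counter

# Hypothetical call records as (caller, callee) pairs; all data is made up.
calls = (
    [("crim", f"victim{i}") for i in range(1000)]  # numbers called only once
    + [("crim", "A")] * 12 + [("A", "crim")] * 9   # regular two-way contact
    + [("crim", "B")] * 5
    + [("crim", "C")] * 4
    + [("crim", "D")] * 3
)

def frequent_contacts(calls, suspect, min_calls=3):
    """Numbers the suspect calls at least min_calls times."""
    outgoing = Counter(callee for caller, callee in calls if caller == suspect)
    return {number: n for number, n in outgoing.items() if n >= min_calls}

def likely_accomplices(calls, suspect, min_calls=3):
    """Frequent contacts who also call the suspect back regularly."""
    incoming = Counter(caller for caller, callee in calls if callee == suspect)
    contacts = frequent_contacts(calls, suspect, min_calls)
    return sorted(number for number in contacts if incoming[number] >= min_calls)

print(frequent_contacts(calls, "crim"))   # the few numbers called repeatedly
print(likely_accomplices(calls, "crim"))  # contacts with regular call-backs
```

Filtering on repeated outgoing calls removes the mass of one-off victim numbers; requiring regular incoming calls as well isolates the two-way relationship described on the slide.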

Who is a data scientist?


Common Tasks in Data Science

Each data-driven business decision-making problem is
unique
But there are sets of common tasks that underlie
most problems.
Data scientist critical skills
Decompose a business problem into subtasks
Recognize familiar problems and their solutions (don't try
to re-invent the wheel)
Identify subtasks that have not been automated and use
creativity and intelligence to design a process

Possible Subtask Patterns (1)

1. Classification: predict, for each individual in a population, which
of a set of classes the individual belongs to. Categorical target.
Among the customers of Telco, which are likely to respond to a given offer?
(Classes: will respond, will not respond)

2. Regression: produce a model that, given an individual, estimates
the value of a particular variable specific to that individual.
Numeric target.
How much will a given customer use the service? (Variable: service usage)

3. Similarity matching: identify similar individuals based on data
known about them. Similarity underlies solutions to other tasks.
Finding people who are similar to you in terms of the products they have
purchased.
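
As a minimal sketch of similarity matching on purchase data, the example below scores customers by Jaccard similarity (shared items divided by all items either customer bought). Jaccard is one common choice and not necessarily the measure used in the course; the customer names, items, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two item collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical purchase histories.
purchases = {
    "you":   {"tent", "stove", "boots"},
    "alice": {"tent", "stove", "lamp"},
    "bob":   {"novel", "lamp"},
    "carol": {"tent", "boots", "stove", "map"},
}

def most_similar(target, purchases, k=2):
    """Return the k customers most similar to target, best first."""
    scores = [
        (jaccard(purchases[target], items), name)
        for name, items in purchases.items()
        if name != target
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest score, then name
    return [(name, round(score, 2)) for score, name in scores[:k]]

print(most_similar("you", purchases))
```

The same scores can feed other tasks, for example recommending items bought by the most similar customers.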

Possible Subtask Patterns (2)

4. Clustering: group individuals in a population by their
similarity (not driven by any specific purpose).
Do our customers form natural groups or segments?

5. Co-occurrence grouping: find associations between entities
based on transactions involving them.
What items are commonly purchased together?
Clustering looks at similarity between objects based on the objects'
attributes; co-occurrence grouping considers similarity of objects
based on their appearing together in transactions
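
Co-occurrence grouping can be illustrated by counting how often pairs of items appear in the same transaction. The sketch below computes the support of each pair (the fraction of baskets containing both items); the baskets and the function name are hypothetical, and production systems typically use a dedicated association-rule algorithm instead of enumerating all pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def pair_support(baskets):
    """Fraction of baskets in which each unordered item pair co-occurs."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}

support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

Note that the input is only which items appear together; no item attributes are used, which is the contrast with clustering drawn above.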

Possible Subtask Patterns (3)

6. Profiling: characterize the typical behavior of an
individual, group, or population.
What is the typical cell phone usage of this customer segment?
Used to establish behavior norms for anomaly detection (fraud
detection)

7. Link prediction: predict connections between data items
(should a link exist, and at what strength?)
Social network based applications, such as friend suggestions and
recommendations.
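
To make the link between profiling and anomaly detection concrete, the sketch below builds a very simple profile (mean and standard deviation of usage) and flags values far from the norm. The usage figures, the function names, and the z-score threshold of 3 are assumptions chosen for illustration; real profiles are usually richer than two summary statistics.

```python
from statistics import mean, stdev

def build_profile(values):
    """Summarize typical behavior as (mean, sample standard deviation)."""
    return mean(values), stdev(values)

def is_anomalous(value, profile, z_threshold=3.0):
    """Flag a value lying more than z_threshold deviations from the mean."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical daily call minutes for one customer segment.
segment_minutes = [30, 35, 28, 32, 31, 29, 34, 33]
profile = build_profile(segment_minutes)

print(is_anomalous(33, profile))   # within the segment's normal range
print(is_anomalous(120, profile))  # far outside the established norm
```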

Possible Subtask Patterns (4)


8. Data Reduction: take a large set of data and replace it with
a smaller set that contains much of the important
information in the larger set.
9. Causal Modeling: help understand what events or actions
actually influence others.

Supervised vs. Unsupervised Methods

Supervised learning:
The training data is a set of examples (pairs of an input object and a desired
output value)
Analyzes the training data and produces an inferred function
Uses the function to map new examples
Classification, regression, causal modeling
Can we find groups of customers who are likely to cancel service after their
contracts expire?

Unsupervised learning:
Finds hidden structure in unlabeled data (no target values in the training data)
Clustering, co-occurrence grouping, profiling
Might include the same set of examples but would not include the target
information
Do our customers fall into different groups?
Similarity matching, link prediction, and data reduction can be either
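
The distinction can be sketched on one small dataset used two ways. In the supervised case a 1-nearest-neighbor rule learns from labeled examples; in the unsupervised case a basic k-means loop groups the same inputs without ever seeing the labels. Both algorithms are stand-ins chosen for brevity, and the customer features, labels, and starting centroids are hypothetical.

```python
# Hypothetical customers: (monthly_usage_hours, support_calls).
examples = [(40.0, 0), (38.0, 1), (42.0, 0), (5.0, 6), (7.0, 5), (4.0, 7)]
# Target values, available only in the supervised setting.
labels = ["stay", "stay", "stay", "leave", "leave", "leave"]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def predict_1nn(x, examples, labels):
    """Supervised: return the label of the closest training example."""
    nearest = min(range(len(examples)), key=lambda i: sq_dist(x, examples[i]))
    return labels[nearest]

def kmeans(points, centroids, iters=10):
    """Unsupervised: group points around centroids; no labels are used."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

print(predict_1nn((6.0, 5), examples, labels))
print(kmeans(examples, [examples[0], examples[1]]))
```

The supervised call answers a question about a specific target ("will this customer leave?"), while the clustering result only says which customers look alike; interpreting the groups is left to the analyst.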

The General Method

Decide whether the problem is supervised or unsupervised. If
supervised:
Produce a precise definition of a target variable.
Make sure we can obtain values of the target for some example data.

Mine the data to find patterns and build models.
Then use the resulting model to answer questions.

Customer churn:
Build a model with class probability estimation
Describe each existing customer using a set of characteristics
The model takes these characteristics from historical data as inputs
Use the model to predict which customers will leave (produce a score or
probability estimate)
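
The churn steps above can be sketched with a small logistic regression trained by gradient descent, which outputs a probability of leaving for each customer. Logistic regression is used here only as one familiar class-probability estimator; the two characteristics, the historical records, the learning rate, and the epoch count are all hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def churn_probability(x, w, b):
    """Estimated probability that a customer with characteristics x leaves."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical history: (support_calls_last_month, years_as_customer),
# with target 1 = churned and 0 = stayed.
X = [(5, 1), (4, 1), (6, 2), (0, 5), (1, 4), (0, 6), (3, 1), (1, 5)]
y = [1, 1, 1, 0, 0, 0, 1, 0]

w, b = train_logistic(X, y)
for customer in [(5, 1), (0, 6)]:
    print(customer, round(churn_probability(customer, w, b), 3))
```

Because the output is a probability and not a hard yes/no, customers can be ranked by churn risk and retention offers targeted at the highest scores.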

Pause and Think (1)


Will this customer purchase service S1 if given incentive I?
Which service package (S1, S2, or none) will a customer
likely purchase if given incentive I?
How much will this customer use the service?
Is a customer likely to continue to subscribe to the service?
With what probability?

Pause and Think (2)


What techniques have been used for
Amazon's recommendation system?
Amazon's anticipatory package shipping?
Signet's credit card pricing?
Target's pregnancy prediction?

Big Data Process


Extracting useful knowledge from
data to solve business problems can
be treated systematically by
following a process with reasonably
well-defined stages.

Iterative Process

The entire process is an exploration of the data,
repeated over and over

End of Module I
Questions?
