
Business Understanding and Data Collection
Session XII

© IBM 2020
Foundational Methodology for Data Science

Business Understanding

Every project starts with business understanding. A data science project is first and foremost a business project, so it must always be oriented toward business results and aligned with the overall business strategy.
The business sponsors who need the analytic solution play the most critical role in this stage.
Define the problem, project objectives, and solution requirements from a business perspective.
What is the goal? How do you define “success”, and how can you measure it?

Business Understanding

Business Understanding example:

Traffic Problem: Traffic congestion wastes time and money
Clear question: How can we optimize traffic light duration using data on
traffic patterns, weather, and pedestrian traffic?
Measurable outcomes:
- % decrease in commute time
- % decrease in length/duration of traffic jams
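
The measurable outcomes above come down to a before/after comparison. A minimal sketch, assuming commute times are averages in minutes; the function name `percent_decrease` and the sample figures are illustrative and do not come from the slides:

```python
def percent_decrease(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    # Multiply before dividing so simple inputs give exact results.
    return (before - after) * 100 / before


# Hypothetical figures: average commute drops from 40 to 34 minutes.
print(percent_decrease(40, 34))  # 15.0
```

The same function applies to the length or duration of traffic jams, as long as both measurements use the same unit.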

Analytic Approach

Once the business problem has been clearly stated, the data scientist can define the analytic approach to solving it.
This stage entails expressing the problem in the context of statistical and machine-learning techniques, so the organization can identify the most suitable ones for the desired outcome.
In brief, the analytic approach is how the problem is expressed in terms of statistical and machine-learning techniques.

Analytic Approach

“Predicting revenue in the next quarter?” → Regression
“Does this patient have cancer A, cancer B, or are they healthy?” → Classification
“Are there groups of users that seem to behave similarly to each other?” → Clustering
“How can I target discounts to specific customers?” → Recommendation/Personalization
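
The mapping above can be sketched as a rough rule of thumb based on what kind of outcome values are available. This is only an illustrative heuristic, not a method from the slides: the function `suggest_approach` is hypothetical, it ignores recommendation problems, and it would wrongly call integer class codes "regression".

```python
from numbers import Real
from typing import Optional, Sequence


def suggest_approach(target: Optional[Sequence[object]]) -> str:
    """Suggest a technique family from the kind of target values available."""
    if target is None:
        # No labelled outcome: look for structure in the data itself.
        return "clustering"
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in target):
        # Continuous numeric outcome, e.g. next-quarter revenue.
        return "regression"
    # Discrete categories, e.g. "cancer A", "cancer B", "healthy".
    return "classification"


print(suggest_approach([1.2e6, 1.4e6, 1.1e6]))
print(suggest_approach(["cancer A", "healthy", "cancer B"]))
print(suggest_approach(None))
```

In practice the choice also depends on the business question itself, so a check like this is a starting point for discussion rather than a decision procedure.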

Data Requirements

The chosen analytic approach determines the data requirements.
Specifically, the analytic methods to be used require certain data content, formats, and representations, guided by domain knowledge.
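
One way to make data requirements concrete is to write them down as expected fields and types, then check incoming records against them. A minimal sketch for the traffic example; the field names and the helper `find_problems` are assumptions made for illustration:

```python
from typing import Any, Dict, List

# Hypothetical requirements for the traffic example: field name -> expected type.
REQUIRED_FIELDS: Dict[str, type] = {
    "intersection_id": str,
    "vehicles_per_hour": int,
    "rainfall_mm": float,
}


def find_problems(record: Dict[str, Any]) -> List[str]:
    """List the ways a record fails the stated data requirements."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


# A record with a count stored as text and no rainfall reading.
record = {"intersection_id": "A12", "vehicles_per_hour": "850"}
print(find_problems(record))
```

An empty result means the record satisfies the requirements as written; the requirements themselves still have to come from domain knowledge.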

Data Collection

In the initial data collection stage, data scientists identify and gather the
available data resources—structured, unstructured and semi-structured—
relevant to the problem domain.

Key questions: Is the data available? Can it be obtained? Should the data requirements be revised, or should more data be collected?

Gathering Data

Data can be gathered from several sources, such as:

1. Internal company data (Excel files, internal databases, etc.)
2. Web APIs and web scraping
3. Datasets from public data
4. Datasets from open data
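
Whatever the source, gathered data usually ends up as a list of records. The sketch below uses only the Python standard library and inline sample text in place of a real spreadsheet export or HTTP response; the function names and the `"results"` key are assumptions, since real APIs each define their own payload layout:

```python
import csv
import io
import json
from typing import Dict, List


def rows_from_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text (e.g. a spreadsheet export) into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_api_payload(payload: str) -> List[Dict[str, object]]:
    """Parse a JSON payload assumed to hold a list of records under "results"."""
    return json.loads(payload)["results"]


# Inline samples stand in for a real file and a real HTTP response body.
csv_text = "store,revenue\nA,1200\nB,950\n"
api_text = '{"results": [{"store": "C", "revenue": 1430}]}'

print(rows_from_csv(csv_text))
print(rows_from_api_payload(api_text))
```

Note that CSV values arrive as strings while JSON keeps numbers as numbers, so combining the two sources needs a type conversion step.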

Open Data

Open Data is defined as structured data that is machine-readable and can be freely shared, used, and built on without restrictions. Put another way, knowledge is open if anyone is free to access, use, modify, and share it, subject at most to measures that preserve provenance and openness.

Open Data

The Open Definition provides a more detailed definition of Open Data. To summarize the most important points:
Availability and Access: the data must be available as a whole and at no
more than a reasonable reproduction cost, preferably by downloading
over the internet. The data must also be available in a convenient and
modifiable form.
Re-use and Redistribution: the data must be provided under terms that
permit re-use and redistribution including the intermixing with other
datasets.
Universal Participation: everyone must be able to use, re-use and
redistribute. There should be no discrimination against fields of endeavor
or against persons or groups. For example, 'non-commercial' restrictions
that would prevent 'commercial' use, or restrictions of use for certain
purposes (e.g. only in education), are not allowed.
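
As a rough illustration of the points above, a license identifier can be screened for restrictions that conflict with openness. This sketch assumes Creative Commons style identifiers such as "CC-BY-NC-4.0"; the function `is_open_license` is hypothetical, and a check on the identifier alone is no substitute for reading the license terms:

```python
def is_open_license(license_id: str) -> bool:
    """Screen a Creative Commons style identifier for non-open restrictions."""
    parts = license_id.upper().split("-")
    if not parts[0].startswith("CC"):
        raise ValueError(f"not a Creative Commons identifier: {license_id}")
    # "NC" (non-commercial) restricts fields of endeavor;
    # "ND" (no derivatives) prevents modification and intermixing.
    return "NC" not in parts and "ND" not in parts


for license_id in ["CC0-1.0", "CC-BY-4.0", "CC-BY-NC-4.0", "CC-BY-ND-4.0"]:
    print(license_id, is_open_license(license_id))
```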

Public Data vs Open Data

Open data and content can be freely used, modified, and shared by anyone and for any purpose. Public data, by contrast, can be defined as all information in the public domain, including information that is accessible only on request and is therefore less accessible.

Open Data Sources

There are several free Open Data sources anyone can use, such as:

1. World Bank Open Data https://data.worldbank.org/
2. Kaggle https://www.kaggle.com/datasets
3. UNICEF Dataset https://data.unicef.org/
4. WHO Open Data https://www.who.int/gho/database/en/
5. IBM Data Asset eXchange (DAX) https://developer.ibm.com/exchanges/data/

IBM Data Asset eXchange (DAX)

Online hub for developers and data scientists to find carefully curated free
and open datasets under open data licenses.
While there are many resources available online for finding open datasets,
DAX is unique in its high level of quality and curation.
Examples of the datasets being released are the Finance Proposition Bank and Contracts Proposition Bank datasets, which are part of an active research program at IBM Research.

Practice

Form a group of 3 people.
Choose one topic from the following Data Science projects.
Discuss the business problem and the solution you can provide for the project you choose.
Present your results to the class.

Practice – Topic 1

Prudential Life Insurance Assessment
Discuss the business problem and solution that you can provide with the dataset.

In a one-click shopping world with on-demand everything, the life insurance application process is antiquated. Customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams, a process that takes an average of 30 days.
The result? People are turned off. That’s why only 40% of U.S. households own individual life insurance. Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries.
By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly impact public perception of the industry.
The results will help Prudential better understand the predictive power of the data points in the existing assessment, enabling us to significantly streamline the process.
Download the dataset here

Practice – Topic 2
Restaurant Revenue Prediction
Discuss the business problem and solution that you can provide with the dataset.

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.
Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.
New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.
Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.
Download the dataset here

Practice – Topic 3

Airbnb New User Bookings
Discuss the business problem and solution that you can provide with the dataset.

Instead of waking to overlooked "Do not disturb" signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.
Download the dataset here

References

Foundational Methodology for Data Science – IBM Analytics White Paper
The Data Science Process by Polong Lin - https://www-01.ibm.com/events/wwe/grp/grp304.nsf/vLookupPDFs/Polong%20Lin%20Presentation/$file/Polong%20Lin%20Presentation.pdf
https://cognitiveclass.ai/courses/data-science-with-open-data
https://medium.com/enigma/what-is-public-data-938e086f363f
https://developer.ibm.com/blogs/ibm-data-asset-exchange-dax-free-open-data-ai/
https://www.techedgegroup.com/blog/data-science-process-problem-statement-definition
https://www.kaggle.com/c/prudential-life-insurance-assessment/overview
https://www.kaggle.com/c/restaurant-revenue-prediction/overview
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/overview

Thank You

© Copyright IBM Corporation 2020. All rights reserved. The information contained in these materials is provided for informational purposes only,
and is provided AS IS without warranty of any kind, express or implied. Any statement of direction represents IBM’s current intent, is subject to
change or withdrawal, and represents only goals and objectives. IBM, the IBM logo, and other IBM products and services are trademarks of the
International Business Machines Corporation, in the United States, other countries, or both. Other company, product, or service names may be
trademarks or service marks of others.
