
DATA SCIENCE

DATA SCIENCE PROCESS


Odd Semester 2020/2021
Course Learning Outcomes
CPMK - 3

Able to explain the stages of the data science process

• Able to explain the flow of the data science process
• Able to explain each stage of the data science process
Outline
DATA SCIENCE PROCESS
Process Overview
Data Science Process Overview
The Steps

• Setting the Research Goal
• Retrieving Data
• Data Preparation
• Data Exploration
• Data Modeling
• Presentation and Automation
Data Science Process Overview
The Tips

• Keep to the linear track
• Divide the work into smaller stages
• Work together as a team
• Different projects may call for different approaches
• Don’t be a slave to the process
Stage One
Setting the Research Goal
Setting the Research Goal
At a Glance

Understanding the what, the why, and the how of your project

Outcome:
✓ a clear research goal
✓ a good understanding of the context
✓ well-defined deliverables
✓ a plan of action with a timetable

• Define Research Goal
• Create Project Charter
Setting the Research Goal
Define Research Goal

Spend time
understanding the goals
and context of your
research
Setting the Research Goal
Create Project Charter

Create a Project Charter


■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or a proof of concept
■ Deliverables and a measure of success
■ A timeline
Stage Two
Retrieving Data
Retrieving Data
At a Glance

Acquiring All the Data You Need

• Internal Data
• External Data
Retrieving Data
Internal Data

Start with data stored


within the company
Retrieving Data
External Data

Don’t be afraid to
shop around
https://archive.ics.uci.edu/ml/index.php
https://www.kaggle.com/datasets
Retrieving Data
The Tip

Do data quality checks


now to prevent
problems later
Stage Three
Data Preparation
Data Preparation
At a Glance

Sanitize and Prepare Data (a diamond in the rough) for Use in the Next Phase

• Data Cleansing
• Data Transformation
• Combining Data
Data Preparation
Cleansing Data

focuses on removing
errors in your data
Data Preparation
Cleansing Data

Data Entry Errors


Data Preparation
Cleansing Data

Redundant Whitespace

• leading or trailing whitespace causes a mismatch of keys such as “FR ” – “FR”
• in Python you can use the strip() method to remove it
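A minimal sketch of the whitespace fix mentioned above. The key values are hypothetical; only `str.strip()` comes from the slide.

```python
# Hypothetical join keys; the first one carries a trailing space.
raw_keys = ["FR ", "FR", " NL"]

# str.strip() removes leading and trailing whitespace from each key.
clean_keys = [key.strip() for key in raw_keys]

print(raw_keys[0] == raw_keys[1])      # False: "FR " and "FR" do not match
print(clean_keys[0] == clean_keys[1])  # True: both keys are now "FR"
```

Cleaning the keys on both data sets before matching them avoids silently losing rows to mismatches like this one.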
Data Preparation
Cleansing Data

Impossible Values and Sanity Checks

• check each value against physically or theoretically impossible values
• for instance: people taller than 3 meters or someone with an age of 299 years
• Solution: check = 0 <= age <= 120
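The range check above can be applied to a whole column to flag suspicious records. This is a small sketch: the list of ages and the helper name `is_plausible_age` are made up for illustration, while the 0 to 120 bounds are the ones given on the slide.

```python
# Hypothetical ages; 299 is the impossible value used as an example above.
ages = [34, 299, 58, -1, 120]

def is_plausible_age(age):
    """Return True when the age falls inside the range 0..120 (inclusive)."""
    return 0 <= age <= 120

# Collect the values that fail the sanity check so they can be reviewed.
suspect = [age for age in ages if not is_plausible_age(age)]
print(suspect)  # [299, -1]
```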
Data Preparation
Cleansing Data

Outliers
Data Preparation
Cleansing Data

Missing Values
Data Preparation
Cleansing Data

Deviations from a Code Book

• A code book is a description of your data, a form of metadata
• For instance: “0” equals “negative”, “5” stands for “very positive”
• It also states the type of data you’re looking at: is it hierarchical, graph, or something else?
• To find deviations, you look at those values that are present in set A (the data) but not in set B (the code book)
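The "in set A but not in set B" check is a set difference. The sketch below assumes set A holds the values observed in the data and set B the codes listed in the code book. The observed values are invented, and the code book contains only the two codes named on the slide.

```python
# Code book with the two codes named above (any other codes are unknown here).
code_book = {"0": "negative", "5": "very positive"}

# Hypothetical values observed in a survey column.
observed = ["0", "5", "7", "5", "x"]

# Values present in the data (set A) but absent from the code book (set B).
deviations = set(observed) - set(code_book)
print(sorted(deviations))  # ['7', 'x']
```

The values that come out of the difference are the ones to inspect and correct.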
Data Preparation
Cleansing Data

Different Units of Measurement

• When integrating two data sets, you have to pay attention to their respective units of measurement
• For instance: the prices of gasoline (prices per gallon versus prices per liter)
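A small sketch of bringing the gasoline example to a single unit. It assumes the gallon in question is the US liquid gallon (the imperial gallon is larger), and the function name and sample price are hypothetical.

```python
# One US liquid gallon is defined as exactly 3.785411784 liters.
LITERS_PER_US_GALLON = 3.785411784

def price_per_liter(price_per_gallon):
    """Convert a price per US gallon into a price per liter."""
    return price_per_gallon / LITERS_PER_US_GALLON

# Hypothetical price of 3.50 per gallon, expressed per liter.
print(round(price_per_liter(3.50), 4))
```

Converting one data set to the other's unit before combining them keeps the prices comparable.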
Data Preparation
Cleansing Data

Different Levels of Aggregation

• Having different levels of aggregation is similar to having different types of measurement
• For instance: a data set containing data per week versus one containing data per work week
Data Preparation
Cleansing Data

find and identify data errors


Data Preparation
Data Transformation

Certain models
require their data to
be in a certain shape
Data Preparation
Data Transformation

Turning Variables
into Dummies
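Turning a variable into dummies means replacing one categorical column with one 0/1 indicator column per category. The sketch below does this in plain Python to stay dependency-free; the records, the column name, and the helper `to_dummies` are hypothetical. In practice a library routine such as `pandas.get_dummies` is a common choice.

```python
# Hypothetical observations with one categorical variable, "color".
records = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]

def to_dummies(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    result = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        encoded = {k: v for k, v in row.items() if k != column}
        for category in categories:
            encoded[f"{column}_{category}"] = int(row[column] == category)
        result.append(encoded)
    return result

for row in to_dummies(records, "color"):
    print(row)
```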
Data Preparation
Data Transformation

Reducing the Number of Variables

• Having too many variables in your model makes the model difficult to handle
• Certain techniques don’t perform well when you overload them with too many input variables
Data Preparation
Combining Data

focuses on integrating
data that comes from
different sources
Data Preparation
Combining Data

Joining Tables
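A minimal sketch of joining two tables on a shared key, written as an inner join in plain Python. The tables, the key name `client_id`, and the helper `inner_join` are all hypothetical; a data-frame library's merge operation would do the same job on larger data.

```python
# Hypothetical tables that share the key "client_id".
clients = [
    {"client_id": 1, "region": "NY"},
    {"client_id": 2, "region": "NC"},
]
orders = [
    {"client_id": 1, "amount": 100},
    {"client_id": 1, "amount": 50},
    {"client_id": 3, "amount": 75},
]

def inner_join(left, right, key):
    """Combine rows from both tables whose `key` values match."""
    # Index the right table by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

joined = inner_join(clients, orders, "client_id")
for row in joined:
    print(row)
```

Client 2 has no orders and the order for client 3 has no client record, so neither appears in the result; that loss of unmatched rows is the defining behavior of an inner join.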
Data Preparation
Combining Data

Appending Tables
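Appending stacks the rows of one table under another that has the same columns. The sketch below assumes two hypothetical monthly tables; the helper `append_tables` and its column check are illustrative choices, not something stated on the slide.

```python
# Hypothetical monthly tables with identical columns.
january = [{"client": "A", "sales": 10}]
february = [{"client": "A", "sales": 12}, {"client": "B", "sales": 7}]

def append_tables(first, second):
    """Stack the rows of `second` under those of `first`."""
    combined = first + second
    if not combined:
        return []
    # Refuse to append tables whose columns differ.
    columns = set(combined[0])
    for row in combined:
        if set(row) != columns:
            raise ValueError("tables must have the same columns to be appended")
    return combined

all_sales = append_tables(january, february)
print(len(all_sales))  # 3
```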
Data Preparation
Combining Data

Enriching Aggregated Measures
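The slide gives only the title here. One reading, assumed for this sketch, is adding a calculated column that relates each row to an aggregate, such as a product's share of the total sales in its class. The data and column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales per product, each belonging to a product class.
sales = [
    {"product": "A", "product_class": "sport", "sales": 60},
    {"product": "B", "product_class": "sport", "sales": 40},
    {"product": "C", "product_class": "shoes", "sales": 50},
]

# Aggregate: total sales per product class.
class_totals = defaultdict(int)
for row in sales:
    class_totals[row["product_class"]] += row["sales"]

# Enrich each row with its share of the class total.
enriched = [
    {**row, "share_of_class": row["sales"] / class_totals[row["product_class"]]}
    for row in sales
]
for row in enriched:
    print(row["product"], row["share_of_class"])
```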


Stage Four
Data Exploration
Data Exploration
At a Glance

Exploratory data analysis: take a deep dive into the data

• Graphical
• Link and Brush
• Non-Graphical
Data Exploration
Graphical
Data Exploration
Link and Brush

Link and brush allows you to select observations in one plot and highlight the same observations in the other plots
Data Exploration
Non-Graphical

Tabulation, clustering,
and building simple
models
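Of the non-graphical techniques listed above, tabulation is the simplest to sketch: count how often each value occurs. The example uses the standard-library `collections.Counter` on a hypothetical column of regions.

```python
from collections import Counter

# Hypothetical categorical column.
regions = ["north", "south", "north", "east", "north", "south"]

# Frequency table: how many observations fall in each category.
table = Counter(regions)
for region, count in table.most_common():
    print(f"{region:<6} {count}")
```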
Stage Five
Data Modeling
Data Modeling
At a Glance

The way you build your model depends on whether you go

• Selection
• Execution
• Diagnostic
Data Modeling
Model and Variable Selection

findings from exploratory
analysis give a fair idea of
which variables will help you
construct a good model
Data Modeling
Model and Variable Selection

choosing the right model


for a problem requires
judgment on your part
Data Modeling
Model Execution

Once you’ve chosen a model you’ll need to implement it in code
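As an illustration of implementing a chosen model in code, the sketch below fits a straight line by ordinary least squares in plain Python. The choice of model, the helper `fit_line`, and the data are assumptions made for this example; the slides do not prescribe a particular model or library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```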


Data Modeling
Model Diagnostics and Comparison
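The slide gives only the title. One way to compare models, assumed for this sketch, is to score each candidate on data it was not fitted on and keep the one with the lowest error. The two models, the holdout data, and the use of mean squared error are illustrative choices.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical holdout data that neither model has seen.
holdout_x = [5.0, 6.0]
holdout_y = [11.0, 13.0]

def model_a(x):
    return 2.0 * x + 1.0  # a linear model

def model_b(x):
    return 12.0           # a constant baseline

errors = {
    "model_a": mean_squared_error(holdout_y, [model_a(x) for x in holdout_x]),
    "model_b": mean_squared_error(holdout_y, [model_b(x) for x in holdout_x]),
}
best = min(errors, key=errors.get)
print(errors, best)
```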
Stage Six
Presentation and Automation
Presentation and Automation
At a Glance

Presenting findings and building applications on top of them

• Presenting
• Automating
SUMMARY
Summary of The Course

The data science process consists of six steps:


• Setting the research goal
• Retrieving data
• Data preparation
• Data exploration
• Data modeling
• Presentation and automation
Thank You
Credit:

Introducing Data Science, Chapter 2
