
DATA SCIENCE BOOTCAMP PROGRAM

Introduction to
Data Science

Azizur Rachman, S.T., B.Eng.


Hello, I’m Aziz!
Swiss German University
Mechatronics Engineering, Sarjana Teknik and Bachelor of Engineering
(Double Degree from Indonesia and Germany)

PT Mitra Bakti UT
(subsidiary of Yayasan Karya Bakti United Tractors)
1. Group Leader Mechanical Electrical
2. Account Manager

Peopleshift / Shift Academy


Data Tech Officer/ Data Science Tutor Coordinator and B2B Coordinator
Today’s Objective

1. What is Data Science?
2. Simple Study Case
3. General Flow Process of Data Science
WHAT IS DATA SCIENCE?
What is Data Science?
Simple Study Case

Roy, a bread seller, bakes his bread every day.

He sells different varieties of bread.

What problems arise, or what should he consider, when selling his bread?
Things to Consider:

1. How many loaves of bread does he need to bake every day?
2. How much does each variety of bread cost?
3. Which variety of bread is the most profitable?
4. Is there a particular variety of bread that is the most popular?
5. When is the right time to sell the bread?
6. Where would he sell the bread?
7. How would he sell his bread?
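Several of these questions can already be answered with simple data analysis. A minimal sketch with pandas, using made-up sales figures (the varieties, units, and profits below are hypothetical), shows how to find the most profitable variety:

```python
import pandas as pd

# Hypothetical daily sales records for Roy's bread varieties
sales = pd.DataFrame({
    "variety": ["white", "wheat", "sourdough", "white", "sourdough"],
    "units_sold": [30, 20, 15, 25, 18],
    "profit_per_unit": [0.5, 0.7, 1.2, 0.5, 1.2],
})

# Total profit per sale, then summed per variety
sales["profit"] = sales["units_sold"] * sales["profit_per_unit"]
profit_by_variety = sales.groupby("variety")["profit"].sum()

# Which variety is the most profitable overall?
print(profit_by_variety.idxmax())
```

The same grouped-aggregation pattern answers the other questions (most popular variety, best time to sell) once the relevant columns are recorded.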
What is Data Science?

Data Science is a multidisciplinary subject.
Steps in doing analytics
Data Science Career
Data Science Around Us
Learning Path of Data Science in General (Coding)

1. Python Introduction
2. Packages
3. Statistics
4. SQL
5. Data Visualization
6. ML: Regression
7. ML: Classification
8. ML: Clustering
9. Final Project
Cross-Industry Standard Process for Data Mining (CRISP-DM)
The Process

1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modeling to Evaluation
5. From Deployment to Feedback
Business Understanding
• What is the problem your company faces?
• What is the goal of the project/data analysis?
• What do you want to learn more about?

Example:
Business problem:
• Information overload, facts vs. hoaxes.

Data Science Project (Advanced Level):
• Fake news detection using Machine Learning
Analytic Approach
• Which analysis method best suits the problem?
• What algorithm should be used to address the problem?

Example:
Analysis Method:
• NLP (Natural Language Processing)
• Use supervised learning, specifically a classification model, to classify whether a news item is fake or not

Tools and Algorithms:
• NLTK (Natural Language Toolkit)
• Decision Tree
• Random Forest
• etc.
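The approach above can be sketched in a few lines with scikit-learn: turn the text into TF-IDF features and fit a decision tree classifier. The tiny in-line dataset and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical headlines with made-up labels
texts = [
    "government confirms new health policy",
    "scientists publish peer-reviewed study",
    "shocking miracle cure doctors hate",
    "you won't believe this one weird trick",
]
labels = ["real", "real", "fake", "fake"]

# TF-IDF features -> decision tree, chained in one pipeline
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)

print(model.predict(["one weird miracle trick"]))
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` (or adding NLTK preprocessing such as stemming and stop-word removal) fits the same pipeline shape.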
Data Collection
• Where is the data acquired from? Do we need a query language?
• What is the query to collect the data?

Example:
Data acquired from:
• A real news feed from the web; the data could be in SQL form
• Other datasets from online sources, used for learning purposes
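When the data lives in a SQL database, collection is a query plus a DataFrame load. A minimal sketch with an in-memory SQLite database; the table and column names (`news`, `title`, `label`) are hypothetical:

```python
import sqlite3

import pandas as pd

# Stand-in for the real news database (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (title TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?)",
    [("study released", "real"), ("miracle cure", "fake")],
)

# The collection query, loaded straight into a DataFrame
df = pd.read_sql("SELECT title, label FROM news WHERE label = 'fake'", conn)
print(df)
```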
Data Understanding
• What cleaning process should be involved later?
• What information can the data tell us?
• What can we do with the data?

Example
Key questions for general understanding:
• Are there missing values and/or outliers?
• Are there any incorrect data types?

Key questions for exploratory analysis:
• Was there any pattern to the fake news?
• Was there a particular time of publication?
• Are the sources reliable?
• Was there any attempt to spread the same news multiple times within a specific time interval?
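The general-understanding questions above map to one-line pandas checks. A sketch on a toy DataFrame (the columns and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "shares": [10, 12, 11, 500, np.nan],             # 500 looks like an outlier
    "published_hour": ["9", "14", "23", "3", "11"],  # wrong dtype: strings
})

# Are there missing values?
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Are there incorrect data types?
print(df.dtypes)  # published_hour is 'object' but should be numeric
df["published_hour"] = df["published_hour"].astype(int)
```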
Data Preparation
• What is the best approach to handle the raw data?
• Are there any new features that should be engineered?

Example
Approaches for handling missing values:
• Replace with the mean (if normally distributed)
• Replace with the mode (for discrete variables)
• Replace based on other features

Approaches for handling outliers:
• Dropping outliers
• Capping outliers (limiting the maximum range)
• Replacing with the median
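The imputation and capping approaches listed above, sketched on made-up data (the columns, values, and the 200-word cap are all assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "word_count": [100.0, 120.0, np.nan, 110.0, 9000.0],
    "category": ["politics", "politics", None, "health", "politics"],
})

# Mean imputation (if the column is roughly normally distributed)
df["word_count"] = df["word_count"].fillna(df["word_count"].mean())

# Mode imputation (for discrete/categorical variables)
df["category"] = df["category"].fillna(df["category"].mode()[0])

# Capping outliers: limit the maximum range at a chosen ceiling
df["word_count"] = df["word_count"].clip(upper=200.0)
```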
Exploratory Data Analysis (EDA)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, and check assumptions with the help of statistical summaries and graphical representations.
3 Parts of EDA
1. Cleaning
Checking for problems with the collected data, such as missing data, measurement error, column data types, etc.

2. Defining questions
Identifying relationships between variables that are particularly interesting or unexpected.

3. Visualizations
Using effective visualizations to communicate the results.
About Cleaning and Preprocessing the Dataset
Cleaning your data should be the first step in your Data Science (DS) or Machine Learning (ML) workflow.

Without cleaned data, you will have a much harder time seeing the parts that actually matter in your exploration.

According to CrowdFlower, data scientists spend 60% of their time organizing and cleansing data!
Why is Cleaning Data Important?

1. Data visualization and data analysis can be done easily using cleaned data.
2. Valid data interpretation / data insights.

Raw Data → Ready-to-use Data
Common Problem in Data Cleaning

1. Duplicate Dataset
2. Missing data
3. Outliers
4. Data Type
Handling missing data:
1. Identify the missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute the missing values
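The three-step flow above as a sketch: identify missing values, measure their proportion per column, then delete or impute. The 50% drop threshold below is an arbitrary choice for illustration, not a fixed rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],        # 25% missing -> impute
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing -> drop
})

# Step 1-2: identify missing values and their proportion per column
missing_ratio = df.isnull().mean()

# Step 3: drop columns that are mostly missing, impute the rest
to_drop = missing_ratio[missing_ratio > 0.5].index
df = df.drop(columns=to_drop)
df = df.fillna(df.mean())
```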
Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
3. Outliers
An outlier is a data point that lies an abnormal distance from other values in the data.

Basic outlier formula:
1. IQR = Q3 − Q1
2. Lower Bound = Q1 − 1.5 × IQR
3. Upper Bound = Q3 + 1.5 × IQR

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution.
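The IQR rule above, computed with NumPy on made-up data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100])  # 100 is a clear outlier

# Q1 and Q3, then the fences from the formula above
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```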
4. Data Type
Check that each column has the correct data type and convert where necessary.
Data Modeling, Evaluation, and Deployment
• How do we plan to manage fitting multiple models?
• Which metrics are used to evaluate model performance?

Example
Fitting and improving the model:
• Use Decision Tree, Random Forest, etc.
• Use RandomizedSearchCV for hyperparameter tuning

Metrics to evaluate the model:
• Confusion Matrix (Accuracy, Precision, Recall)

Model Deployment:
• Use Flask as the framework
• Create a web-based product for management
• e.g. Credit Fraud Detection/Prevention
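The modeling and evaluation steps above can be sketched with scikit-learn: fit a random forest, tune one hyperparameter with RandomizedSearchCV, and read accuracy, precision, and recall off the confusion matrix. The dataset is synthetic and the parameter grid is a toy example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter tuning with RandomizedSearchCV (toy search space)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 50, 100]},
    n_iter=3, cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Metrics from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, search.predict(X_test)).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Deployment (e.g. wrapping `search.best_estimator_` behind a Flask route) is a separate step outside this sketch.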
Thank You!
