
DATA SCIENCE BOOTCAMP PROGRAM

Introduction to
Data Science

Azizur Rachman, S.T., B.Eng.


Hello, I’m Aziz!
Swiss German University
Mechatronics Engineering, Sarjana Teknik and Bachelor of Engineering
(Double Degree from Indonesia and Germany)

PT Mitra Bakti UT
(subsidiary of Yayasan Karya Bakti United Tractors)
1. Group Leader Mechanical Electrical
2. Account Manager

Peopleshift / Shift Academy


Data Tech Officer/ Data Science Tutor Coordinator and B2B Coordinator
Today’s Objective

1. What is Data Science?
2. Simple Study Case
3. General Flow Process of Data Science
WHAT IS DATA SCIENCE?
What is Data Science?
Simple Study Case

Roy, a bread seller, bakes his bread every day.

He sells different varieties of bread.

What problems arise, or what should he consider, when selling his bread?
Things to Consider:

1. How many loaves of bread does he need to bake every day?
2. How much does each variety of bread cost?
3. Which variety of bread is the most profitable?
4. Is there a particular variety of bread that is the most popular?
5. When is the right time to sell the bread?
6. Where would he sell the bread?
7. How would he sell his bread?
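Several of these questions can already be answered with simple data analysis. A minimal sketch with pandas, using made-up sales figures (the varieties, units, and profits below are hypothetical), shows how to find the most profitable variety:

```python
import pandas as pd

# Hypothetical daily sales records for Roy's bread varieties
sales = pd.DataFrame({
    "variety": ["white", "wheat", "sourdough", "white", "sourdough"],
    "units_sold": [30, 20, 15, 25, 18],
    "profit_per_unit": [0.5, 0.7, 1.2, 0.5, 1.2],
})

# Total profit per sale, then summed per variety
sales["profit"] = sales["units_sold"] * sales["profit_per_unit"]
profit_by_variety = sales.groupby("variety")["profit"].sum()

# Which variety is the most profitable overall?
print(profit_by_variety.idxmax())
```

The same grouped-aggregation pattern answers the other questions (most popular variety, best time to sell) once the relevant columns are recorded.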
What is Data Science?

Data Science is a multidisciplinary subject.
Steps in doing analytics
Data Science Career
Data Science Around Us
Learning Path of Data Science in General (Coding)

1. Python Introduction
2. Packages
3. Statistics
4. SQL
5. Data Visualization
6. ML: Regression
7. ML: Classification
8. ML: Clustering
9. Final Project
Cross-Industry Standard Process for Data Mining (CRISP-DM)
The Process

1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modeling to Evaluation
5. From Deployment to Feedback
Business Understanding
• What is the problem your company faces?
• What is the goal of the project/data analysis?
• What do you want to learn more about?

Example:
Business problem:
• Information overload, facts vs. hoaxes.

Data Science Project (Advanced Level):
• Fake news detection using Machine Learning
Analytic Approach
• Which analysis method best suits the problem?
• What algorithm should be used to address the problem?

Example:
Analysis Method:
• NLP (Natural Language Processing)
• Use supervised learning, specifically a classification model, to classify whether a news item is fake or not

Tools and Algorithms:
• NLTK (Natural Language Toolkit)
• Decision Tree
• Random Forest
• etc.
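The approach above can be sketched in a few lines with scikit-learn: turn the text into TF-IDF features and fit a decision tree classifier. The tiny in-line dataset and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical headlines with made-up labels
texts = [
    "government confirms new health policy",
    "scientists publish peer-reviewed study",
    "shocking miracle cure doctors hate",
    "you won't believe this one weird trick",
]
labels = ["real", "real", "fake", "fake"]

# TF-IDF features -> decision tree, chained in one pipeline
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)

print(model.predict(["one weird miracle trick"]))
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` (or adding NLTK preprocessing such as stemming and stop-word removal) fits the same pipeline shape.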
Data Collection
• Where is the data acquired from? Do we need a query language?
• What is the query to collect the data?

Example:
Data acquired from:
• A real news feed from the web; the data could be in SQL form
• Other datasets from online sources, used for learning purposes
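When the data lives in a SQL database, collection is a query plus a DataFrame load. A minimal sketch with an in-memory SQLite database; the table and column names (`news`, `title`, `label`) are hypothetical:

```python
import sqlite3

import pandas as pd

# Stand-in for the real news database (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (title TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?)",
    [("study released", "real"), ("miracle cure", "fake")],
)

# The collection query, loaded straight into a DataFrame
df = pd.read_sql("SELECT title, label FROM news WHERE label = 'fake'", conn)
print(df)
```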
Data Understanding
• What cleaning process should be involved later?
• What information can the data tell us?
• What can we do with the data?

Example
Key questions for general understanding:
• Are there missing values and/or outliers?
• Are there any incorrect data types?

Key questions for exploratory analysis:
• Was there any pattern to the fake news?
• Was there a particular time of publication?
• Are the sources reliable?
• Was there any attempt to spread the same news multiple times within a specific time interval?
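The general-understanding questions above map to one-line pandas checks. A sketch on a toy DataFrame (the columns and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "shares": [10, 12, 11, 500, np.nan],             # 500 looks like an outlier
    "published_hour": ["9", "14", "23", "3", "11"],  # wrong dtype: strings
})

# Are there missing values?
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Are there incorrect data types?
print(df.dtypes)  # published_hour is 'object' but should be numeric
df["published_hour"] = df["published_hour"].astype(int)
```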
Data Preparation
• What is the best approach to handle the raw data?
• Are there any new features that should be engineered?

Example
Approaches for handling missing values:
• Replace with the mean (if normally distributed)
• Replace with the mode (for discrete variables)
• Replace based on other features

Approaches for handling outliers:
• Dropping outliers
• Capping outliers (limiting the maximum range)
• Replacing with the median
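The imputation and capping approaches listed above, sketched on made-up data (the columns, values, and the 200-word cap are all assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "word_count": [100.0, 120.0, np.nan, 110.0, 9000.0],
    "category": ["politics", "politics", None, "health", "politics"],
})

# Mean imputation (if the column is roughly normally distributed)
df["word_count"] = df["word_count"].fillna(df["word_count"].mean())

# Mode imputation (for discrete/categorical variables)
df["category"] = df["category"].fillna(df["category"].mode()[0])

# Capping outliers: limit the maximum range at a chosen ceiling
df["word_count"] = df["word_count"].clip(upper=200.0)
```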
Exploratory Data Analysis (EDA)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, and check assumptions with the help of statistical summaries and graphical representations.
3 Parts of EDA
1. Cleaning
Checking for problems with the collected data, such as missing data, measurement error, column data types, etc.

2. Defining questions
Identifying relationships between variables that are particularly interesting or unexpected.

3. Visualizations
Using effective visualizations to communicate the results.
About Cleaning and Preprocessing the Dataset
Cleaning your data should be the first step in your Data Science (DS) or Machine Learning (ML) workflow.

Without cleaned data, you will have a much harder time seeing the parts that actually matter in your exploration.

According to CrowdFlower, data scientists spend 60% of their time organizing and cleansing data!
Why is Cleaning Data Important?

1. Data visualization and data analysis can be done easily using cleaned data.
2. Valid data interpretation / data insights.

Raw Data → Ready-to-use Data
Common Problem in Data Cleaning

1. Duplicate Dataset
2. Missing data
3. Outliers
4. Data Type
Handling missing data:
1. Identify the missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute the missing values
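The three-step flow above as a sketch: identify missing values, measure their proportion per column, then delete or impute. The 50% drop threshold below is an arbitrary choice for illustration, not a fixed rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],        # 25% missing -> impute
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing -> drop
})

# Step 1-2: identify missing values and their proportion per column
missing_ratio = df.isnull().mean()

# Step 3: drop columns that are mostly missing, impute the rest
to_drop = missing_ratio[missing_ratio > 0.5].index
df = df.drop(columns=to_drop)
df = df.fillna(df.mean())
```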
Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
3. Outliers
An outlier is a data point that lies an abnormal distance from other values in the data.

Basic outlier formula:
1. IQR = Q3 − Q1
2. Lower Bound = Q1 − 1.5 × IQR
3. Upper Bound = Q3 + 1.5 × IQR

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution.
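The IQR rule above, computed with NumPy on made-up data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100])  # 100 is a clear outlier

# Q1 and Q3, then the fences from the formula above
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```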
4. Data Type
Check that each column has the correct data type and convert where necessary.
Data Modeling, Evaluation, and Deployment
• How do we plan to manage fitting multiple models?
• Which metrics are used to evaluate model performance?

Example
Fitting and improving the model:
• Use Decision Tree, Random Forest, etc.
• Use RandomizedSearchCV for hyperparameter tuning

Metrics to evaluate the model:
• Confusion Matrix (Accuracy, Precision, Recall)

Model Deployment:
• Use Flask as the framework
• Create a web-based product for management
• e.g. Credit Fraud Detection/Prevention
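The modeling and evaluation steps above can be sketched with scikit-learn: fit a random forest, tune one hyperparameter with RandomizedSearchCV, and read accuracy, precision, and recall off the confusion matrix. The dataset is synthetic and the parameter grid is a toy example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter tuning with RandomizedSearchCV (toy search space)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 50, 100]},
    n_iter=3, cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Metrics from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, search.predict(X_test)).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Deployment (e.g. wrapping `search.best_estimator_` behind a Flask route) is a separate step outside this sketch.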
Thank You!
