You are on page 1of 22

DATA

PREPROCESSING
WHY
PREPROCESS
THE DATA ?

2
Why Preprocess the Data?
⊷ databases are highly susceptible to noisy,
missing, and inconsistent data due to their
typically huge size (often several gigabytes
or more) and their likely origin from
multiple, heterogenous sources
⊷ Low-quality data will lead to low-quality
mining results.

3
Why Preprocess the Data?
DATA QUALITY
⊷ Data have quality if they satisfy the
requirements of the intended use
⊷ There are many factors comprising data
quality, including accuracy, completeness,
consistency, timeliness, believability, and
interpretability.

4
Why Preprocess the Data?

Lembaga
Perlindungan
Investor

Kustodian Kustodian

Pemodal Pemodal Pemodal Pemodal

5
Why Preprocess the Data?

Risiko Pemodal Risiko Kustodian

•Fluktuasi IHSG (daily) •Modal kerja bersih


•Nilai transaksi (daily) disesuaikan tidak cukup
•Volume transaksi (daily) (triwulan)
•Frekuensi transaksi (daily) •Profit margin (triwulan)
•Kapitalisasi Pasar •Debt to equity ratio
(triwulan)
•Nilai Aset Pemodal
(monthly) •Aset Pemodal (triwulan)
•Jumlah sub rekening efek
(Monthly)

6
Why Preprocess the Data?
DATA QUALITY
⊷ incomplete (lacking attribute values or certain attributes of
interest, or containing only aggregate data)
⊷ inaccurate or noisy (containing errors, or values that
deviate from the expected)
⊷ inconsistent (e.g., containing discrepancies in the
department codes used to categorize items)

Timeliness
Believability
Interpretability

7

⊷ How can the data be
preprocessed in order to help
improve the quality of the data
and, consequently, of the
mining results?
⊷ How can the data be
preprocessed so as to improve
the efficiency and ease of the
mining process?

8
Macam Data Preprocessing

9
DATA
CLEANING
Missing Value
Noisy Data

10
DATA CLEANING
Data cleaning routines work to “clean” the data by filling in missing
values, smoothing noisy data, identifying or removing outliers, and
resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of
any data mining that has been applied. Furthermore, dirty data can cause
confusion for the mining procedure, resulting in unreliable output. Although
most mining routines have some procedures for dealing with incomplete or
noisy data, they are not always robust. Instead, they may concentrate on
avoiding overfitting the data to the function being modeled. Therefore, a
useful preprocessing step is to run your data through some data cleaning
routines.

11
DATA CLEANING

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.

12
MISSING VALUES

13
MISSING VALUES

•Ignore the tuple


1
•Fill in the missing value manually
2
•Use a global constant to fill in the missing value
3
•Use a measure of central tendency for the attribute (e.g., the mean or median) to fill
4 in the missing value

•Use the attribute mean or median for all samples belonging to the same class as
5 •the given tuple

•Use the most probable value to fill in the missing value


6

14
MISSING VALUES

15
NOISY DATA

16
NOISY DATA

“What is noise?” Noise is a random error or variance


in a measured variable.

Data visualization can be used to identify outliers,


which may represent noise

data smoothing

17
NOISY DATA

1. Binning

2. Regression

3. Outlier Detection

18
NOISY DATA
1. Binning

smoothing by bin :
-Means
-Medians
-Boundaries

19
NOISY DATA
2. Regression

20
NOISY DATA
3. Outlier Detection

Cluster Analysis

21
DATA CLEANING
AS A PROCESS
Data cleaning routines work to “clean” the data by filling in missing
values, smoothing noisy data, identifying or removing outliers, and
resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of
any data mining that has been applied. Furthermore, dirty data can cause
confusion for the mining procedure, resulting in unreliable output. Although
most mining routines have some procedures for dealing with incomplete or
noisy data, they are not always robust. Instead, they may concentrate on
avoiding overfitting the data to the function being modeled. Therefore, a
useful preprocessing step is to run your data through some data cleaning
routines.

22

You might also like