
Chapter 2

Data Preprocessing
1) Define data preprocessing and give examples.
2) What are aggregation, sampling, and the curse of
dimensionality?
3) Attribute transformation, discretization &
binarization
4) Similarity & dissimilarity?

1) Define data preprocessing and give examples.

Data preprocessing: a set of processes for preparing
(cleaning and organizing) raw data to make it suitable for
training and building models.
"Data preprocessing is done before training."
Examples: data cleaning, aggregation, sampling, dimensionality
reduction, feature creation, attribute transformation, similarity &
dissimilarity measures.
2) What are aggregation, sampling, and the curse of
dimensionality?

- Aggregation: combining two or more attributes (or two or
more objects/records) into a single attribute (or record).

Aggregation of attributes: "country, capital, city" = address
"first_name, last_name" = name
Aggregation of objects/records: all records with
temperature = 37, 37.1, 37.2, 37.3, 37.4, 37.5 are combined into a
single record whose temperature = 37.25 (the average).
Another example: two records for the same patient ("liver
hospital record", "chest hospital record") are combined into a
single record that contains all attributes.
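
The three aggregation examples above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the names and values (person, liver_record, chest_record, and so on) are made up for the example and do not come from a real dataset.

```python
from statistics import mean

# Attribute aggregation: merge several attributes into one.
# The field names and values are illustrative assumptions.
person = {"first_name": "Ada", "last_name": "Lovelace"}
person_aggregated = {"name": f"{person['first_name']} {person['last_name']}"}

# Record aggregation: replace several readings by their average.
readings = [37, 37.1, 37.2, 37.3, 37.4, 37.5]
average_temperature = round(mean(readings), 2)

# Record aggregation: merge two records of the same patient
# (hypothetical attribute names) into one record with all attributes.
liver_record = {"patient_id": 1, "liver_enzymes": "normal"}
chest_record = {"patient_id": 1, "chest_xray": "clear"}
patient_record = {**liver_record, **chest_record}

print(person_aggregated)
print(average_temperature)
print(patient_record)
```

Merging the two dictionaries works here because both records share the same patient_id; if the records disagreed on a shared attribute, the value from the second record would silently win, so a real merge would need a rule for conflicts.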
- Sampling: a method that allows us to get information
about the population based on the statistics of a subset of
the population (the sample), without having to investigate
every individual. "The sample must be representative."

Why do we need sampling?

Collecting a sample takes less time than examining every
item in a population.
Sampling is cost-efficient.
Analysis of the sample is less cumbersome and more
practical than analysis of the entire population.
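
As a rough illustration of the idea, the sketch below draws a simple random sample from a synthetic population and compares the two means. The population (10,000 simulated temperatures), the sample size of 200, and the fixed seed are all assumptions made for the example.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated body temperatures near 37.0.
population = [random.gauss(37.0, 0.4) for _ in range(10_000)]

# Simple random sampling without replacement.
sample = random.sample(population, k=200)

population_mean = mean(population)
sample_mean = mean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

Because every individual has the same chance of being chosen, the sample tends to be representative, and its mean should land close to the population mean while requiring only 2% of the data.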
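- Curse of dimensionality: as the number of attributes
(dimensions) grows, the data become increasingly sparse in the
space they occupy, which makes analysis harder. Two common
remedies:
"PCA": principal component analysis, which replaces the
original attributes with a smaller number of new attributes
(components) that capture most of the variance.
"Feature selection": select, from all attributes, only those
that affect the training process.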
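
One very simple form of feature selection can be sketched as follows. The notes do not say which selection rule to use, so this example assumes a variance filter: an attribute that has the same value in every record cannot affect training and is dropped. The data and the helper name select_by_variance are hypothetical.

```python
from statistics import pvariance

# Toy data (made-up values): each row is a record, each column an attribute.
attribute_names = ["temperature", "unit_code", "heart_rate"]
rows = [
    [37.0, 1, 72],
    [37.4, 1, 80],
    [38.1, 1, 95],
    [36.8, 1, 68],
]

def select_by_variance(rows, names, threshold=0.0):
    """Keep the attributes whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into columns
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced_rows = [[row[i] for i in kept] for row in rows]
    return [names[i] for i in kept], reduced_rows

kept_names, reduced = select_by_variance(rows, attribute_names)
print(kept_names)
print(reduced)
```

Here unit_code is constant across all records, so it is removed and the data set shrinks from three dimensions to two. Practical feature selection usually also considers how each attribute relates to the target, which this sketch does not attempt.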

3) Attribute transformation, discretization &
binarization?
Attribute transformation: applying a function to the original
attribute to produce a new attribute that is more useful for
the analysis, e.g. "x -> x^2".
Discretization: converting a continuous attribute into a
categorical attribute.
ex: temperature 37.2 = moderate
Binarization: converting a continuous or discrete attribute
into one or more binary attributes.
ex: 37.4 -> moderate, or 37.4 -> 37 -> 100101 (37 written in binary)
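
The three operations can be sketched in Python as below. The cut points 36.5 and 37.5 and the category names "low" and "high" are assumptions chosen for illustration (the notes only give "moderate"), and the one-hot encoding in binarize_category is a common alternative way to binarize a categorical attribute rather than something stated in the notes.

```python
def transform(x):
    """Attribute transformation from the notes: x -> x^2."""
    return x ** 2

def discretize(temperature):
    """Map a continuous temperature to a category.

    The cut points 36.5 and 37.5 are assumptions for illustration.
    """
    if temperature < 36.5:
        return "low"
    if temperature <= 37.5:
        return "moderate"
    return "high"

def binarize_integer(value):
    """Write a non-negative integer as a string of binary digits."""
    return format(value, "b")

def binarize_category(category, categories=("low", "moderate", "high")):
    """One binary attribute per category (one-hot encoding)."""
    return [1 if category == c else 0 for c in categories]

print(transform(3))
print(discretize(37.2))
print(binarize_integer(37))
print(binarize_category("moderate"))
```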
4) Similarity & dissimilarity?
Similarity:
- a numerical measure of how alike two data objects are.
- is higher when the objects are more alike.
- often scaled to [0, 1]: similarity = 1 means alike,
  similarity = 0 means different.

Dissimilarity:
- a numerical measure of how different two data objects are.
- is higher when the objects are more different.
- when scaled to [0, 1]: dissimilarity = 1 means different,
  dissimilarity = 0 means alike.
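
As one concrete instance of such measures, the sketch below uses the simple matching coefficient for binary vectors, with dissimilarity taken as 1 minus similarity. The choice of this particular measure is an assumption for illustration; the notes do not name one, and numeric data would typically use a distance such as Euclidean distance instead.

```python
def simple_matching_similarity(a, b):
    """Fraction of positions where two equal-length binary vectors agree."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def simple_matching_dissimilarity(a, b):
    """Dissimilarity defined here as 1 - similarity."""
    return 1 - simple_matching_similarity(a, b)

p = [1, 0, 0, 1, 0, 1]  # the binary digits of 37
q = [1, 0, 0, 1, 0, 1]  # identical to p
r = [0, 1, 1, 0, 1, 0]  # differs from p in every position
s = [1, 0, 0, 1, 1, 0]  # agrees with p in 4 of 6 positions

for name, other in (("q", q), ("r", r), ("s", s)):
    sim = simple_matching_similarity(p, other)
    dis = simple_matching_dissimilarity(p, other)
    print(f"p vs {name}: similarity={sim:.2f}, dissimilarity={dis:.2f}")
```

Identical objects score similarity 1 and dissimilarity 0, fully different objects score the reverse, and partial agreement falls in between, which matches the scale described above.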

To measure similarity & dissimilarity between two data objects:
Anaconda: a full Python distribution that contains Spyder,
Jupyter, ...
- When you install Anaconda, Spyder and Jupyter are installed
automatically.

- Spyder, Jupyter: Spyder is an IDE ("integrated development
environment") and Jupyter is a notebook interface. Both give
a graphical interface (GUI) for Python instead of writing code
in cmd.

- Programming language: a language between human and
machine. A compiler (or, for Python, an interpreter) then
converts it to machine language, because the computer only
understands machine language (0, 1).
