OVER-SAMPLING/CASE-BASED SAMPLING/BIASED SAMPLING
Before:
10% target variable with 10,000 data sets (original fraction = 0.1)
90% non-target variable with 90,000 data sets
100% total with 100,000 data sets
So if the target makes up only 10% of the data at the start,
you draw a sample in which it is oversampled to a 50%/50%
distribution. That is quite easy: keep all of the target data
sets (the complete 10%) and add an equal number of randomly
drawn non-target data sets.
After:
50% target variable with 10,000 data sets (oversampled fraction = 0.5)
50% non-target variable with 10,000 data sets
100% total with 20,000 data sets
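The step from "Before" to "After" can be sketched in a few lines of Python. This is only a minimal illustration, assuming the data fits in a list and that a caller-supplied function tells targets from non-targets; the names balance_sample and is_target are hypothetical and do not come from any particular tool.

```python
import random


def balance_sample(records, is_target, seed=0):
    """Keep every target record and add an equal number of
    randomly drawn non-target records (50/50 distribution)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    targets = [r for r in records if is_target(r)]
    non_targets = [r for r in records if not is_target(r)]
    # Never try to draw more non-targets than actually exist.
    k = min(len(targets), len(non_targets))
    sampled = rng.sample(non_targets, k)  # drawn without replacement
    result = targets + sampled
    rng.shuffle(result)  # avoid all targets sitting at the top of the file
    return result


# Toy data with the figures used above: 1 = target, 0 = non-target.
records = [1] * 10_000 + [0] * 90_000
sample = balance_sample(records, lambda r: r == 1)
print(len(sample), sum(sample) / len(sample))  # 20000 0.5
```

The non-target records are drawn without replacement, so no record appears twice in the sample.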
Now you apply your usual data mining on this flatfile, and you

5-10% target variable with more than 10,000 data sets: 50/50 oversampling is OK.
<5% target variable with more than 10,000 data sets: 30/70 oversampling is recommended (30% target variable and 70% non-target variable).
<5% target variable with fewer than 10,000 data sets: the whole flatfile should not be smaller than 20,000 data sets. So do the oversampling in a way that maximizes your target variable fraction while still leaving more than 20,000 data sets in total.
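The three rules of thumb can be written down as a small helper that returns how many target and non-target data sets to put in the sample. This is a sketch under stated assumptions, not a definitive reading of the rules: it assumes that "10,000 data sets" refers to the number of target data sets, that exactly 10,000 counts as "more than", that all target data sets are always kept, and that 20,000 is the minimum total. The name plan_sample is hypothetical. Cases the rules do not mention raise an error.

```python
def plan_sample(n_target, n_total, threshold=10_000, min_total=20_000):
    """Return (targets_to_keep, non_targets_to_draw) following the
    rules of thumb above. Integer arithmetic avoids rounding issues."""
    n_available = n_total - n_target
    if n_target * 10 > n_total:
        raise ValueError("rules only cover target fractions up to 10%")
    below_5_percent = n_target * 20 < n_total

    if n_target >= threshold:
        # 50/50 for 5-10% targets, 30/70 for fewer than 5% targets.
        target_part, non_part = (3, 7) if below_5_percent else (1, 1)
        n_non_target = n_target * non_part // target_part
    elif below_5_percent:
        # Few targets: keep them all and fill up to the minimum total,
        # which gives the largest target fraction still allowed.
        n_non_target = min_total - n_target
    else:
        raise ValueError("case not covered by the rules of thumb")

    if n_non_target > n_available:
        raise ValueError("not enough non-target data sets available")
    return n_target, n_non_target


print(plan_sample(10_000, 100_000))  # (10000, 10000)
print(plan_sample(12_000, 400_000))  # (12000, 28000)
print(plan_sample(4_000, 200_000))   # (4000, 16000)
```

In the last call the target fraction ends up at 4,000 / 20,000 = 0.2, the highest value reachable without duplicating target data sets or dropping below 20,000 in total.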