Data Normalization using Median & Median Absolute Deviation (MMAD) based Z-Score for Robust Predictions vs. Min-Max Normalization
Sunil Kappal 

ABSTRACT
In data analytics, data normalization is not a new concept; it is a preprocessing stage of almost any
number-driven business problem. The goal of normalization is to change the values of numeric columns in
a dataset to a common scale without distorting differences in the ranges of values. A multitude of data
normalization techniques are available, including Min-Max normalization, Z-Score normalization and
coefficient based normalization. Data normalization may also vary with the level of measurement of the
variables: nominal scale, ordinal scale, interval scale, additive scale and so on. The scope of this
paper, however, is limited to continuous numeric data. It applies the proposed MMAD normalization
technique to standardize the values before creating a robust simple linear regression model. A secondary
aim of this paper is to compare the proposed MMAD normalization technique with the min-max
normalization method to assess its effectiveness and robustness.

Keywords: normalization, regression, median absolute deviation, MMAD.

Classification: FOR Code: 080199

Language: English

LJP Copyright ID: 925686


Print ISSN: 2631-8490
Online ISSN: 2631-8504

London Journal of Research in Science: Natural and Formal

Volume 19 | Issue 4 | Compilation 1.0 465U


 
© 2019. Sunil Kappal. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 4.0
Unported License (http://creativecommons.org/licenses/by-nc/4.0/), permitting all noncommercial use, distribution, and reproduction in
any medium, provided the original work is properly cited.

Author: Data Dojo, GH 5&7/775, Paschim Vihar, New Delhi - 110087.

I. INTRODUCTION

Min-max normalization is one of the most popular and most heavily used data normalization
techniques. It linearly transforms a variable x, where min(x) and max(x) are the minimum
and maximum of the observed values of x:

$$z = \frac{x - \min(x)}{\max(x) - \min(x)}$$

This scales the data so that the minimum value of x is mapped to 0 and the maximum value of x
is mapped to 1. The one major downside of min-max normalization is its inability to handle
outliers. For example, if you have 99 values between 0 and 40 and one value of 100, then the 99
values will all be scaled to between 0 and 0.4, and the data remain as squished as they were before.
In this paper I would therefore like to introduce a new technique called Median & Median
Absolute Deviation (MMAD) based normalization, which is simple and easy to use and can be
applied quickly to a data set of any size.

II. UNDERSTANDING MEDIAN & MEDIAN ABSOLUTE DEVIATION AS A STATISTIC

A median is easy and simple to calculate. It is the value separating the higher half of a data
sample from the lower half. The median can also be seen as another kind of average of the sample
data, found by sorting the number list from low to high and then taking the middle value of
the list.

Example: One number in the middle:

Number list = 3, 2, 4, 1, 1
Step 1: sort the number list low to high = 1, 1, 2, 3, 4
Step 2: find the middle value = 1, 1, [2], 3, 4
μ̂m (Median) = 2
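The two worked examples above (the min-max squeeze caused by a single outlier, and the median of a short list) can be checked with a minimal sketch. The helper name `min_max` and the evenly spaced data are assumptions made for illustration; they are not part of the paper's data set.

```python
from statistics import median


def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data: 99 values spread evenly over [0, 40] plus one outlier at 100.
data = [40 * i / 98 for i in range(99)] + [100.0]
scaled = min_max(data)

# The 99 ordinary values are confined to the bottom 40% of the [0, 1] range.
largest_ordinary = max(scaled[:-1])
print(round(largest_ordinary, 3))  # 0.4
print(scaled[-1])                  # 1.0

# Median of the number list used in the Section II example.
print(median([3, 2, 4, 1, 1]))     # 2
```

The outlier alone fixes the upper end of the scale, so the remaining values keep their relative spacing but lose most of the available range.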

© 2019 London Journals Press | Volume 19 | Issue 4 | Compilation 1.0


Absolute deviation from the median was (re-)discovered and popularized by Hampel (1974) (1),
who attributes the idea to Carl Friedrich Gauss (1777-1855). The median (M) is, like the
mean, a measure of central tendency, but unlike the mean it is not sensitive to outliers (2).

Similarly, it is very easy to calculate the MAD (Median Absolute Deviation) statistic, as it
relies purely on finding the median of the absolute deviations from the median. It can be
defined as:

$$\mathrm{MAD} = b \, M_i\left( \left| x_i - M_j(x_j) \right| \right)$$

where x_j are the n original observations and M_i is the median of the series. Usually,
b = 1.4826, a constant linked to the assumption of normality of the data, disregarding the
abnormality induced by outliers.

Example: Calculating the MAD

Observations = 2, 4, 5, 10, 9, 11, 10, 2000, which have a median of 9.5

Step 1: Subtract the median from each observation and take the absolute value:
|2 - 9.5|, |4 - 9.5|, |5 - 9.5|, |10 - 9.5|, |9 - 9.5|, |11 - 9.5|, |10 - 9.5|, |2000 - 9.5|,
that is 7.5, 5.5, 4.5, 0.5, 0.5, 1.5, 0.5, 1990.5

Step 2: When these values are ranked, we obtain: 0.5, 0.5, 0.5, 1.5, 4.5, 5.5, 7.5 and 1990.5

Step 3: Calculate the median of the ranked values, which is 3

III. PROPOSED METHOD – MMAD – MEDIAN & MEDIAN ABSOLUTE DEVIATION BASED NORMALIZATION

The introduction described certain limitations of the min-max normalization technique and
set up the expectation that there has to be another data normalization technique which is:

1. Robust to the outlier problem
2. Applicable to any data size (small, medium or large)
3. Easy and fast to implement

In this section we present the proposed data normalization technique and compare its results
with those of min-max normalization. To assess the effectiveness of the proposed technique,
we also create simple linear regression models and compare the R² value obtained on each
normalized data set.

The proposed normalization technique is given below with explanation:

$$z = \frac{\hat{\mu}_m - x_i}{\hat{\sigma}_m}$$

● z = Z score
● μ̂m = Median estimator
● σ̂m = Median Absolute Deviation

The proposed method can be used on data of any length with integer values. The proposed
normalization technique helps to:

1. Normalize the data so that it converges to a standard normal distribution with a median
of 0 and a median absolute deviation of 1
2. Eradicate multicollinearity while creating a multivariate linear regression model
3. Improve the performance of the model

The comparison study, carried out through data tabulation and graphical representation, is
described below. It helps to identify the effectiveness of MMAD based data normalization.
Table 1 and the box plots in Fig. 1 through Fig. 3 compare the proposed MMAD technique with
min-max normalization on highly non-normal customer sentiment data and on customer
conversation data named call duration.
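As a minimal sketch, the MAD calculation from Section II and the MMAD z-score from Section III can be written as follows. The function names `mad` and `mmad_z` are hypothetical, and applying the constant b = 1.4826 inside σ̂m is an assumption; the z-score keeps the sign convention of the formula as printed in this paper (median minus observation).

```python
from statistics import median

B = 1.4826  # constant quoted in Section II; assumes normally distributed data


def mad(values, b=B):
    """Median absolute deviation: b * median(|x_i - median(x)|)."""
    m = median(values)
    return b * median([abs(v - m) for v in values])


def mmad_z(values, b=B):
    """MMAD z-score as written in Section III: (median - x_i) / MAD."""
    m = median(values)
    s = mad(values, b)
    if s == 0:
        raise ValueError("MAD is zero; the data cannot be scaled")
    return [(m - v) / s for v in values]


# Observations from the MAD example in Section II.
obs = [2, 4, 5, 10, 9, 11, 10, 2000]

raw = mad(obs, b=1.0)  # median of the absolute deviations, before applying b
scaled = mad(obs)      # the same value multiplied by b = 1.4826
z = mmad_z(obs)

print(raw)             # 3.0
print(median(z))       # 0.0 (the normalized data are centred on the median)
```

With this choice of σ̂m, the normalized series has a median of 0 and a MAD of 1, while the outlier 2000 remains far from the rest of the values instead of compressing them.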
 
 


Table 1



(Note: The above table is a sample of the entire data set that was used to perform the
normalization and to produce the graphical representations and the linear regression models.)

Fig. 1 
The above graph shows the data in non-normalized form: the variables are on different scales,
with different centrality and data dispersion values.

After standardizing the values using the min-max data normalization technique, we saw some
improvement in the centrality and data dispersion values, but not to a great extent. The box
plot below shows the improvement as well as the opportunity to standardize the data further
with a better and more robust technique.
 



Fig. 2 ​ Fig. 3 
Figure 3 shows how effective the MMAD data normalization technique is:

● It brought the centrality of the two differently scaled data sets to almost the same value
● It also reduced the data dispersion for both sets; the minimum values of the two data sets
are almost identical
● The median lines are almost identical

Now that we have standardized the data using both min-max and the proposed MMAD based data
normalization techniques, we would like to see the effect of data normalization on the
accuracy of the linear regression models created from the two normalized data sets, i.e. the
min-max normalized data set and the MMAD normalized data set.

This exercise addresses the secondary aim of this analysis, i.e. whether the proposed method
(MMAD normalization) helps to generate a better simple linear regression model than min-max
normalization.
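The centrality and dispersion comparison read off the box plots can also be made numerically. The sketch below is illustrative only: the two columns are small hypothetical stand-ins for the sentiment and call duration variables (the paper's data set is not reproduced here), and the helpers `min_max`, `mmad` and `summary` are assumed names.

```python
from statistics import median


def min_max(values):
    """Min-max scaling to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mmad(values, b=1.4826):
    """MMAD z-score, (median - x_i) / MAD, with MAD scaled by the constant b."""
    m = median(values)
    s = b * median([abs(v - m) for v in values])
    return [(m - v) / s for v in values]


def summary(values):
    """Return (median, unscaled median absolute deviation) of a column."""
    m = median(values)
    return m, median([abs(v - m) for v in values])


# Hypothetical columns on very different scales, each with one outlier.
sentiment = [0.1, 0.3, 0.2, 0.4, 0.25, 0.35, 0.15, 5.0]
call_duration = [120, 300, 240, 180, 260, 200, 220, 3600]

for name, col in (("sentiment", sentiment), ("call_duration", call_duration)):
    print(name, "min-max:", summary(min_max(col)), "MMAD:", summary(mmad(col)))
```

After MMAD normalization both columns share the same median (0) and the same median absolute deviation by construction, whereas the min-max summaries depend on where each column's outlier sits.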

Fig. 4   


Figure 4 above shows an adjusted R-squared of 78.87% for the cubic linear regression model
created using min-max normalized data. Figure 5 (below) gives additional statistics on the
performance of this model and on the alternative models that can be created from the
min-max normalized data.



Fig. 5 

When we ran the regression on the data normalized with the proposed MMAD technique, we saw
little change for the cubic linear regression model, which had an R-squared value of 78.20%
(figure 6), close to the min-max based model's R-squared value.

However, the MMAD normalized data did outperform the min-max normalized data on the
alternative models: the linear model had an R-squared value of 54.25% (figure 7) compared
with 51.91% for min-max normalized data (figure 5), and the quadratic model had 64.04%
(figure 7) compared with 61.85% for min-max normalized data (figure 5).
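For readers who want to see how an R-squared value of this kind is obtained, here is a minimal sketch for the simple (straight-line) case. The data points and the helper names `r_squared` and `adjusted_r_squared` are hypothetical; the figures in this paper come from the author's own data and software output, which are not reproduced here.

```python
def r_squared(x, y):
    """R-squared of an ordinary least-squares straight-line fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares of the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - ss_res / syy


def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for p predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical normalized predictor and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

r2 = r_squared(x, y)
adj = adjusted_r_squared(r2, n=len(x), p=1)
print(round(r2, 4), round(adj, 4))
```

For the quadratic and cubic models, p would be 2 and 3 respectively, which is why the adjusted value is the fairer figure when models with different numbers of terms are compared.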

Fig. 6 

 


Fig. 7 

IV. CONCLUSION

As we have seen, the MMAD data normalization technique scales the data more effectively than
the min-max technique, which is sensitive to outliers and may not produce desirable results.
We also saw that the proposed MMAD normalized data performed comparably on the cubic linear
regression model and outperformed min-max normalized data on the accuracy of the alternative
(linear and quadratic) models.

The main feature of the Median and Median Absolute Deviation based data normalization
technique is its robustness to outliers, whereas most of the popular techniques, such as
mean and standard deviation based Z scores and Min-Max, are sensitive to outliers. Hence, I
would like to propose this new method of data normalization for use in today's dynamic and
demanding data analytics world.

REFERENCES

1. Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of
the American Statistical Association, 69(346), 383-393,
http://dx.doi.org/10.1080/01621459.1974.10482962.
2. Leys, C., et al. (2013). Detecting outliers: Do not use standard deviation around the
mean, use absolute deviation around the median. Journal of Experimental Social Psychology,
http://dx.doi.org/10.1016/j.jsep.2013.03.013
3. Huber, P. J. (1981). Robust Statistics. New York: John Wiley.
4. Casella, G. and Berger, R. L. (2002). Statistical Inference. Duxbury, Pacific Grove, CA,
second edition.
5. Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer, New York.
