
An evaluation of the empirical calculation methods of the Gini coefficient

I. Introduction, Aims and Rationale


Economic inequality is a persistent and pressing issue with the power to rouse
resentment amongst a nation’s population, give rise to social and economic upheavals and
provoke strong arguments about its magnitude, impacts and potential solutions. I became
interested in the issue of global inequality after witnessing varying degrees of poverty
within and between the places where I have spent most of my life: India (various
areas within) and Singapore. It is fascinating to see that such drastic inequalities can
exist within a small area of a city, as seen in the figure below.

Figure 1: India: Poverty and affluence in the same plot of land

I was curious as to how a measure of inequality that shapes various governmental policies
is calculated, given the income disparity within a geographical region and the
vast amount of data required for an accurate calculation in countries such as India. This led
to my basic research in the area, from which I discovered the prevalence of mathematics in
generalising formulae to represent economic inequality. Through lessons in school, I
was able to recognise the basic principles behind some of these formulae, which
prompted me to investigate further. An additional desire of mine was to apply our
in-depth study of calculus and series in school to something more tangible and real.

As such, to understand how mathematics applied to real-life socio-political situations is
made reliable and trustworthy, I decided to focus on income inequality,
comparing various ways of calculating the Gini coefficient (a global standard), particularly
that of India. Through the investigation, I aim to identify the reasons for unreliability (if any)
and to understand what a perfect measure of economic inequality would be.

II. Background Information


The Gini coefficient is the most renowned and widely employed measure of inequality, and
is a standard in governmental calculations. It is named after Corrado Gini, who
developed it in 1912. The value of a region’s Gini coefficient ranges between 0 and 1 and
is based on the net income of residents. Here, 0 represents perfect equality, with each
resident earning the same income, and 1 represents perfect inequality, where one person
earns all of the income (Bourne). As such, a higher Gini coefficient value means
greater disparity between the incomes of the richest and poorest earners in a particular
region.

There are a number of different ways to calculate the Gini coefficient. These include
graphical methods based on the Lorenz curve, which accumulate data points and
frequencies, and more theoretical ones such as Pareto’s distribution function.
In this investigation I analyse and compare three approaches: two based on the Lorenz
curve and one based on covariance.

The reliability of each method will be judged by how close its value is to the value
released by the Indian government for the year 2013, which was G = 0.510 (Nair).

Method 1: Using the Lorenz Curve: Trapezium Rule


The most common way of viewing the Gini coefficient is through the generalised Lorenz
curve.

Figure 2: The line of perfect equity, $L_0(x) = x$, and an arbitrary Lorenz curve, $L_1(x) = \frac{2^{10x} - 1}{1023}$ (horizontal axis: Cumulative Proportion of Population; vertical axis: Cumulative Proportion of Income)

In reference to Figure 2, this curve depicts the percentages of a defined population
arranged from the poorest to the richest on the horizontal ($x$) axis and the cumulative
percentage of income enjoyed by that segment of a nation’s population on the vertical axis.
For example, Quintile 3 shows the cumulative percentage of income or wealth earned by the
1st, 2nd and 3rd quintiles combined. Since 0% of the population have 0% of the income, the curve
passes through point A (0,0), and since 100% of the population enjoy all the income, the
curve passes through point B (1,1), as seen in the diagram. As such, a Lorenz curve runs
from one corner of the unit square to the diagonally opposite corner. The diagonal joining
these corners, $L_0(x)$, serves as the benchmark for a perfectly equal distribution of income.

Figure 2 displays an arbitrary yet possible Lorenz curve, $L_1(x) = \frac{2^{10x} - 1}{1023}$. The degree of
income inequality is defined by the deviation of the Lorenz curve from the line of perfect
equality. This deviation (the Gini coefficient) is measured using the area between the line of
perfect equality and the Lorenz curve, as we will observe.

With a Lorenz curve plot such as the one above, we can measure the Gini coefficient. The
general formula to be used in the investigation is represented by the following integral:

$$G = 2\int_0^1 \left( x - L(x) \right) dx$$

This calculates the area between the line of perfect equality and a Lorenz curve,
divided by the area under the line of perfect equality. In Figure 2, for example, the Gini
coefficient of $L_1(x)$ is measured as the area LA (Lorenz area) between the curve and
$L_0(x)$ divided by the area under $L_0(x)$, highlighted in magenta and orange respectively.
Since the coordinates of point B are (1,1), it forms a right-angled triangle with point A
and (1,0) as the other two vertices, which is highlighted in a light shade of orange.

Hence the area under the equity line is the area of a triangle, which is
$\frac{1}{2} \times 1 \times 1 = \frac{1}{2}$.
" "

As such, the Gini coefficient can be generally written as:

$$G = \frac{LA}{1/2} = 2 \cdot LA$$

where LA is the area between the two curves mentioned above and G is the Gini
coefficient of $L_1(x)$, with reference to Figure 2. However, the general formula is difficult to
employ in real-life situations. This is because nations collect raw data from their
populations in large quantities, which may be difficult to express as a generalised curve. I
will attempt to do this using the trapezium rule with a limited set of data acquired from the
official census data of India’s income brackets, as seen in the following table.

Point   Proportion of population, $x_i$      Proportion of income, $y_i$
        (converting % to decimals)         (converting % to decimals)
1       0                                  0
2       0.2 (first quintile)               0.061
3       0.4 (second quintile)              0.153
4       0.6 (third quintile)               0.279
5       0.8 (fourth quintile)              0.468
6       1 (fifth quintile)                 1.0

Table 1: Cumulative frequency table depicting India’s income in quintiles

Figure 3: India’s quintile income proportion scatter plot

The trapezium rule is a method of numerical integration that estimates the area under a
curve by dividing that area into a number of trapeziums, whose areas are then summed. To
find the Gini coefficient, the data points in Table 1 can be used to form a number of
trapeziums that represent an estimated Lorenz curve, as seen in the figure below:

Figure 4: Area under an estimated Lorenz curve, formulated with the trapezium rule (showing $L_0(x) = x$ and the estimated $L_1(x)$)

Here, the area of TR0 (in green) minus the summed area of the triangle TR1 and the
trapeziums T1, T2, T3 and T4 (in red) represents the area LA. The area of TR0, the
triangle below $L_0(x)$, is 1/2. Hence, in accordance with the formula stated above, the Gini
coefficient estimated using the trapezium rule will be:

$$G = \frac{0.5 - (0.01 + 0.02 + 0.04 + 0.07 + 0.15)}{0.5} = \frac{0.21}{0.5} = 0.420$$

This value is a considerable underestimate of the governmentally stated value of the
coefficient, G = 0.510. This suggests that the trapezium rule introduces a
negative bias into the calculation of the Gini coefficient, rendering it a largely ineffective
measure.
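
The trapezium-rule estimate can be reproduced with a short script. The following is a minimal sketch, assuming the six points of Table 1; the function names `trapezium_area` and `gini_from_area` are illustrative only. It keeps each trapezium’s area unrounded, so it returns roughly 0.416 rather than 0.420, the latter coming from rounding each area to two decimal places before summing.

```python
# Cumulative shares copied from Table 1 (population x, income y).
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.0, 0.061, 0.153, 0.279, 0.468, 1.0]


def trapezium_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs, ys)."""
    total = 0.0
    for i in range(len(xs) - 1):
        width = xs[i + 1] - xs[i]
        mean_height = (ys[i] + ys[i + 1]) / 2
        total += width * mean_height
    return total


def gini_from_area(area_under_lorenz):
    """G = (1/2 - area under the Lorenz curve) / (1/2)."""
    return (0.5 - area_under_lorenz) / 0.5


area = trapezium_area(x, y)
gini = gini_from_area(area)
print(f"area under estimated Lorenz curve: {area:.4f}")
print(f"Gini coefficient (trapezium rule): {gini:.4f}")
```

Because the straight segments lie on or above a convex Lorenz curve, the area returned here can only overstate the true area, which is why the estimate of G comes out low.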
 


Method 2: Using the Lorenz Curve: Polynomial Regression


To rectify this limitation and obtain a more accurate Lorenz curve, I will attempt to
fit a polynomial to the data using polynomial regression. This is a method of
curve fitting in which a set of data is approximated by a polynomial function of
the form $f(x) = C_0 + C_1 x + C_2 x^2 + \dots + C_n x^n$, where the $C$ terms are a set of
coefficients and $n$ is the degree of the polynomial function. Here, the difference between
the measured value of $y_i$ and the value of $y_i$ predicted by the polynomial is referred to
as the residual value R.

The general model for polynomial regression can be created using the method of least
squares. This method fits the curve to the data points as closely as possible by finding
the lowest sum of squared residuals. Since linear and polynomial regression models can
depict the data inappropriately, residuals are used to examine their accuracy. A residual
($e$) is the difference between the actual value of the dependent variable ($y$) and the
value predicted by the regression curve ($y_1$) (“Finding Residuals”).

This is shown graphically in the figure below:

Figure 5: Depiction of residual values, $e = y - y_1$

Here, the sum of squared residuals is represented by:

$$SSR = \sum_{i=1}^{n} \left[ y_i - \left( C_0 + C_1 x_i + \dots + C_n x_i^n \right) \right]^2$$

In order to minimise SSR (the sum of squared residuals), we take partial derivatives of this
function with respect to each of the constants ($C$) and equate each derivative to 0 to find
the lowest value of SSR. Partial derivatives refer to derivatives of a function with multiple
variables, where all the variables except the one being differentiated with respect to are
held fixed (Weisstein).

To find the Lorenz curve of India, I will restrict the investigation to quadratic regression,
where the general equation is:

$$y_i = C_2 x_i^2 + C_1 x_i + C_0$$

where:

$$SSR = \sum_{i=1}^{n} \left[ y_i - \left( C_0 + C_1 x_i + C_2 x_i^2 \right) \right]^2$$

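The partial derivatives for this quadratic function will be:

$$\frac{\partial (SSR)}{\partial C_0} = -2 \sum_{i=1}^{n} \left[ y_i - \left( C_0 + C_1 x_i + C_2 x_i^2 \right) \right] = 0$$

$$\frac{\partial (SSR)}{\partial C_1} = -2 \sum_{i=1}^{n} \left[ y_i - \left( C_0 + C_1 x_i + C_2 x_i^2 \right) \right] x_i = 0$$

$$\frac{\partial (SSR)}{\partial C_2} = -2 \sum_{i=1}^{n} \left[ y_i - \left( C_0 + C_1 x_i + C_2 x_i^2 \right) \right] x_i^2 = 0$$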

Dividing both sides by -2 and factoring out the constants, this leads us to the following
equations:

$$C_0 n + C_1 \sum_{i=1}^{n} x_i + C_2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i \qquad \text{equation (a)}$$

$$C_0 \sum_{i=1}^{n} x_i + C_1 \sum_{i=1}^{n} x_i^2 + C_2 \sum_{i=1}^{n} x_i^3 = \sum_{i=1}^{n} x_i y_i \qquad \text{equation (b)}$$

$$C_0 \sum_{i=1}^{n} x_i^2 + C_1 \sum_{i=1}^{n} x_i^3 + C_2 \sum_{i=1}^{n} x_i^4 = \sum_{i=1}^{n} x_i^2 y_i \qquad \text{equation (c)}$$

which can be expressed as the following:

$$\begin{pmatrix} n & \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 \\ \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 & \sum_{i=1}^{n} x_i^4 \end{pmatrix} \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} x_i^2 y_i \end{pmatrix} \qquad (1)$$

The creation of the matrix and its representation of the 3 equations above can be observed
by looking at the multiplication of the matrices on the left hand side of (1). To multiply two
matrices we take the dot product of each row of the first matrix with the only column
of the second matrix. This calculates the sum of all the products of matching members, as
seen below for the first row:

$$\begin{pmatrix} n & \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 \\ \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 & \sum_{i=1}^{n} x_i^4 \end{pmatrix} \times \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}$$

$$C_0 n + C_1 \sum_{i=1}^{n} x_i + C_2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i$$

As seen, finding the dot product of the first row of the first matrix and the second matrix

yields equation (a). Finding the dot product of the next two rows of the first matrix will
result in equation (b) and (c). Therefore, matrices can be used to represent equations (a),
(b) and (c).

We can determine the value of the constants by multiplying both sides of (1) by the
inverse of the first matrix:

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 \\ \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 & \sum_{i=1}^{n} x_i^4 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} x_i^2 y_i \end{pmatrix}$$

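To calculate the inverse matrix of a 3 × 3 matrix, we can use the following process.
Suppose a general matrix of the form $M = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}$, where each letter corresponds to a
real number. The inverse matrix will be:

$$M^{-1} = \frac{1}{|M|} \begin{pmatrix} \begin{vmatrix} e & f \\ h & i \end{vmatrix} & -\begin{vmatrix} d & f \\ g & i \end{vmatrix} & \begin{vmatrix} d & e \\ g & h \end{vmatrix} \\ -\begin{vmatrix} b & c \\ h & i \end{vmatrix} & \begin{vmatrix} a & c \\ g & i \end{vmatrix} & -\begin{vmatrix} a & b \\ g & h \end{vmatrix} \\ \begin{vmatrix} b & c \\ e & f \end{vmatrix} & -\begin{vmatrix} a & c \\ d & f \end{vmatrix} & \begin{vmatrix} a & b \\ d & e \end{vmatrix} \end{pmatrix}^{T}$$

where the superscript $T$ denotes the transpose (for a symmetric matrix such as the one in (1), the matrix of cofactors is equal to its own transpose), each arbitrary minor matrix is evaluated as:

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$

and what is known as the determinant is:

$$|M| = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix} = a(ei - fh) - b(di - fg) + c(dh - eg)$$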

A curve can then be generated for a quadratic function by solving for the coefficients in the
matrix. In the case of India, we have the information of the proportion of income earned by
each quintile of the population shown in Table 1.
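
The inversion process described above can be written out step by step. The following is a minimal sketch, assuming a 3 × 3 matrix stored as a list of rows; the helper names `det2`, `inverse_3x3` and `matmul`, and the example matrix, are illustrative and not part of the investigation. The example matrix is deliberately not symmetric, so that the transposition of the cofactors matters.

```python
def det2(a, b, c, d):
    """Determinant of the 2x2 minor [[a, b], [c, d]]."""
    return a * d - b * c


def inverse_3x3(m):
    """Invert a 3x3 matrix (list of rows) using cofactors and the determinant."""
    (a, b, c), (d, e, f), (g, h, i) = m
    # Cofactors: signed determinants of the 2x2 minors.
    cof = [
        [det2(e, f, h, i), -det2(d, f, g, i), det2(d, e, g, h)],
        [-det2(b, c, h, i), det2(a, c, g, i), -det2(a, b, g, h)],
        [det2(b, c, e, f), -det2(a, c, d, f), det2(a, b, d, e)],
    ]
    # Determinant by expansion along the first row.
    det = a * cof[0][0] + b * cof[0][1] + c * cof[0][2]
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    # Inverse = transposed cofactor matrix divided by the determinant.
    return [[cof[col][row] / det for col in range(3)] for row in range(3)]


def matmul(p, q):
    """Product of two 3x3 matrices, row-by-column dot products."""
    return [[sum(p[r][k] * q[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


# Hypothetical test matrix with determinant 1.
example = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]
example_inverse = inverse_3x3(example)
product = matmul(example, example_inverse)
print(example_inverse)
print(product)  # should be the identity matrix
```

Multiplying the matrix by its computed inverse and checking for the identity matrix is a quick way to catch a sign or transposition slip in the cofactors.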

Inputting the $x_i$ and $y_i$ values depicted in the table into matrix equation (1), we get the
following:

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} 6 & 3 & 2.29 \\ 3 & 2.29 & 1.8 \\ 2.29 & 1.8 & 1.5664 \end{pmatrix}^{-1} \begin{pmatrix} 1.961 \\ 1.6152 \\ 1.42668 \end{pmatrix}$$

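To solve for the inverse of the matrix, we must first find its determinant, which can be
calculated by summing the products of each entry of the first row and its respective signed
minor:

$$6 \begin{vmatrix} 2.29 & 1.8 \\ 1.8 & 1.5664 \end{vmatrix} - 3 \begin{vmatrix} 3 & 1.8 \\ 2.29 & 1.5664 \end{vmatrix} + 2.29 \begin{vmatrix} 3 & 2.29 \\ 2.29 & 1.8 \end{vmatrix}$$

$$= 6(3.59 - 3.24) - 3(4.70 - 4.12) + 2.29(5.4 - 5.24)$$

$$= 0.70775$$

where the final value is evaluated with unrounded products. The reciprocal of this can be
multiplied by the following matrix of cofactors to give us the inverse of the matrix:

$$\frac{1}{0.70775} \begin{pmatrix} \begin{vmatrix} 2.29 & 1.8 \\ 1.8 & 1.5664 \end{vmatrix} & -\begin{vmatrix} 3 & 1.8 \\ 2.29 & 1.5664 \end{vmatrix} & \begin{vmatrix} 3 & 2.29 \\ 2.29 & 1.8 \end{vmatrix} \\ -\begin{vmatrix} 3 & 2.29 \\ 1.8 & 1.5664 \end{vmatrix} & \begin{vmatrix} 6 & 2.29 \\ 2.29 & 1.5664 \end{vmatrix} & -\begin{vmatrix} 6 & 3 \\ 2.29 & 1.8 \end{vmatrix} \\ \begin{vmatrix} 3 & 2.29 \\ 2.29 & 1.8 \end{vmatrix} & -\begin{vmatrix} 6 & 2.29 \\ 3 & 1.8 \end{vmatrix} & \begin{vmatrix} 6 & 3 \\ 3 & 2.29 \end{vmatrix} \end{pmatrix}$$

$$= \frac{1}{0.70775} \begin{pmatrix} 0.347 & -0.577 & 0.156 \\ -0.577 & 4.154 & -3.93 \\ 0.156 & -3.93 & 4.74 \end{pmatrix}$$

$$= \begin{pmatrix} 0.490 & -0.815 & 0.220 \\ -0.815 & 5.869 & -5.553 \\ 0.220 & -5.553 & 6.697 \end{pmatrix}$$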
We can determine the value of the coefficients, $C_0$, $C_1$, $C_2$, by substituting this into the
original equation.

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} 0.490 & -0.815 & 0.220 \\ -0.815 & 5.869 & -5.553 \\ 0.220 & -5.553 & 6.697 \end{pmatrix} \begin{pmatrix} 1.961 \\ 1.6152 \\ 1.42668 \end{pmatrix}$$

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} -0.04139503523151611 \\ -0.04057415997524583 \\ 1.0179444066877004 \end{pmatrix}$$
This would give us the quadratic equation:

$$y = C_2 x^2 + C_1 x + C_0$$

$$y = 1.02x^2 - 0.041x - 0.041$$

In the equation above, the coefficients have been rounded for ease of observation. The
resultant Lorenz curve ($L_q$) amidst the scatter points of input can be seen below:

Figure 6: The resultant Lorenz curve from quadratic regression, showing $L_0(x) = x$ and $L_q(x) = 1.02x^2 - 0.041x - 0.041$ (horizontal axis: Cumulative Proportion of Population; vertical axis: Cumulative Proportion of Income)

From the nature of the curve, we can tell that it does not pass through the data points
listed in Table 1. This suggests that predicting the $y$ values for all $x$ from a
limited set of data does not accurately portray the income proportion of each segment of
the population of India. From the deviations of the data points (highlighted by the red
points in Figure 6) from the best fit curve, we can formulate a table to depict each
residual:

x       y       y1       e
0       0.00    -0.04    0.04
0.2     0.06    0.01     0.05
0.4     0.15    0.14     0.01
0.6     0.28    0.35     -0.07
0.8     0.47    0.64     -0.17
1.0     1.0     1.102    -0.102

Table 2: Residual Plot Data for Table 1


 


The residual sum of squares, as explained earlier, is a measure that indicates the degree
to which a statistical model is a good fit for a data set. The value of SSR in this case is
SSR = 0.048404, which suggests that although the quadratic curve is a suitable best fit
line, it does not perfectly represent the data. More significantly, it does not fulfil the
requirements of a Lorenz curve, which are that it passes through the origin and point B (1,1).
This was a limitation I recognised only after the computation of data, and drawing the
curve out using graphing software. I realised that using quadratic regression might not be
an appropriate method to sketch a Lorenz curve.
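
The whole quadratic fit can also be checked numerically. The following is a minimal sketch, assuming the points of Table 1 and solving equations (a), (b) and (c) by Gaussian elimination instead of an explicit inverse; `solve_linear`, `power_sum` and `fitted` are illustrative names. Note that the sketch builds every sum directly from Table 1, where the sum of $x_i^2$ evaluates to 2.2 rather than the 2.29 entered above, so the coefficients and SSR it prints differ from the quoted ones. The conclusion about the endpoints is unaffected: the fitted curve still misses both (0,0) and (1,1).

```python
# Cumulative shares copied from Table 1 (population x, income y).
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.0, 0.061, 0.153, 0.279, 0.468, 1.0]


def solve_linear(a, b):
    """Solve the square system a @ c = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def power_sum(p):
    """Sum of x_i ** p over the data points."""
    return sum(xi ** p for xi in x)


# Left-hand matrix and right-hand vector of equation (1).
normal_matrix = [[power_sum(row + col) for col in range(3)] for row in range(3)]
normal_rhs = [sum((xi ** row) * yi for xi, yi in zip(x, y)) for row in range(3)]
c0, c1, c2 = solve_linear(normal_matrix, normal_rhs)


def fitted(xi):
    """Value of the least-squares quadratic at xi."""
    return c0 + c1 * xi + c2 * xi ** 2


residuals = [yi - fitted(xi) for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in residuals)
print("C0, C1, C2:", c0, c1, c2)
print("SSR:", ssr)
print("curve at x = 0 and x = 1:", fitted(0.0), fitted(1.0))
```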

To combat this issue, I decided to use polynomial regression to define a polynomial of a
higher degree using the data points in Table 1.

Since we have 6 data points, a polynomial equation of the fifth degree can be constructed
to represent the Lorenz curve. I chose to use a fifth-degree polynomial here, with the
general equation $y_i = C_5 x_i^5 + C_4 x_i^4 + C_3 x_i^3 + C_2 x_i^2 + C_1 x_i + C_0$, since this is the maximum
order of polynomial that can be determined using 6 data points, and should presumably
result in the most accurate Lorenz curve possible. The aforementioned equation (1) can
alternatively be written as:

$$\begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix} \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

where $n$ refers to the number of $x$ and $y$ coordinates. The first matrix in the equation above
is known as a Vandermonde matrix, which is a type of matrix that arises in polynomial
least squares fitting (Weisstein). In the case of a polynomial of the fifth degree, using the
values from Table 1, this is represented as:

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0.2 & 0.04 & 0.008 & 0.0016 & 0.00032 \\ 1 & 0.4 & 0.16 & 0.064 & 0.0256 & 0.01024 \\ 1 & 0.6 & 0.36 & 0.216 & 0.1296 & 0.07776 \\ 1 & 0.8 & 0.64 & 0.512 & 0.4096 & 0.32768 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \end{pmatrix} = \begin{pmatrix} 0 \\ 0.061 \\ 0.153 \\ 0.279 \\ 0.468 \\ 1 \end{pmatrix}$$

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0.2 & 0.04 & 0.008 & 0.0016 & 0.00032 \\ 1 & 0.4 & 0.16 & 0.064 & 0.0256 & 0.01024 \\ 1 & 0.6 & 0.36 & 0.216 & 0.1296 & 0.07776 \\ 1 & 0.8 & 0.64 & 0.512 & 0.4096 & 0.32768 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 0.061 \\ 0.153 \\ 0.279 \\ 0.468 \\ 1 \end{pmatrix}$$
 


Performing the aforementioned steps of inversion and matrix multiplication using I.T. (a
calculator), due to the magnitude of the matrix, we get the following matrix for the
constants:

$$\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \end{pmatrix} = \begin{pmatrix} 0 \\ 0.363692946057596 \\ -1.30244640387085 \\ 6.54629149376606 \\ -10.1476054633385 \\ 5.54006742737785 \end{pmatrix}$$

From these values, the equation for the Lorenz curve of India in 2013 will be:

$$L(x) = 5.540x^5 - 10.148x^4 + 6.546x^3 - 1.302x^2 + 0.364x$$

seen as the Lorenz curve in the diagram below, with the various scatter points defining
India’s income quintiles from Table 1.

Figure 7: The resultant Lorenz curve from polynomial regression, showing $L_0(x) = x$, $L_q(x) = 1.02x^2 - 0.041x - 0.041$ and $L(x) = 5.540x^5 - 10.148x^4 + 6.546x^3 - 1.302x^2 + 0.364x$ (horizontal axis: Cumulative Proportion of Population; vertical axis: Cumulative Proportion of Income)

In comparison to the Lorenz curve derived from quadratic regression, it is observed that
using a polynomial of the 5th degree is more suitable to calculate the Lorenz curve, since it
goes through both the origin and point B.

The Gini coefficient using the integral formula according to our curve and data is:

$$G = 2\int_0^1 \left[ x - \left( 5.540x^5 - 10.148x^4 + 6.546x^3 - 1.302x^2 + 0.364x \right) \right] dx = 0.443$$

As can be seen, this Lorenz curve has no deviation from the data points, as it intersects all 6
of them seen in Table 1. Since there are no residuals, this suggests that it is a more
accurate depiction of India’s income distribution than $L_q$, obtained with quadratic
regression.

According to official data, the Gini coefficient of India in 2013 was G = 0.510, which is not
equivalent to the Gini coefficient calculated from the predicted Lorenz curve, $L$. This
might be a result of the limited range of data used, which reduces the socio-political
viability of the calculations and does not accurately estimate the Gini coefficient. In this
case, using polynomial regression to sketch a Lorenz curve would be more accurate with a
larger set of data.
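
The fifth-degree fit and the resulting integral can be cross-checked without a calculator. The following is a minimal sketch, assuming the six points of Table 1 and using Python’s exact fractions so that no rounding enters the solution; `solve_exact` and `lorenz` are illustrative names. Because the arithmetic is exact, the coefficients it produces need not match the calculator output quoted above, although the resulting Gini coefficient is very close.

```python
from fractions import Fraction

# Points of Table 1 as exact fractions (population x, income y).
x = [Fraction(k, 5) for k in range(6)]
y = [Fraction(s) for s in ("0", "0.061", "0.153", "0.279", "0.468", "1")]


def solve_exact(a, b):
    """Solve a square linear system by Gauss-Jordan elimination over fractions."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


# Vandermonde matrix: row i holds 1, x_i, x_i^2, ..., x_i^5.
vandermonde = [[xi ** p for p in range(6)] for xi in x]
coeffs = solve_exact(vandermonde, y)  # C0, C1, ..., C5


def lorenz(t):
    """Value of the interpolating polynomial at t."""
    return sum(c * t ** p for p, c in enumerate(coeffs))


# Integrate term by term on [0, 1]: the integral of C_p x^p is C_p / (p + 1).
area = sum(c / (p + 1) for p, c in enumerate(coeffs))
gini = 1 - 2 * area
print("coefficients C0..C5:", [float(c) for c in coeffs])
print("Gini coefficient:", float(gini))
```

Solving the square Vandermonde system is interpolation, not regression in the least-squares sense: with six unknowns and six points the residuals are forced to zero, which says nothing about how the curve behaves between the quintile points.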

Method 3: Using the Covariance formula


The calculation of the Gini coefficient using geometrical interpretations based on the
Lorenz curve is only one of the myriad ways in which the index can be calculated. An
alternative method is to represent the Gini index in terms of the covariance between
income levels and the cumulative distribution of income (the proportion of the population).
Knowing the general formula of the Gini coefficient using the Lorenz curve, we can rewrite
it as:

$$G = 2\int_0^1 \left( x - L(x) \right) dx = 1 - 2\int_0^1 L(x)\, dx$$

In this case, let’s assume that the cumulative distribution function $F(x)$ gives the proportion
of the population having an income level below or equal to $x$. This is a non-decreasing
function that represents the percentage of individuals with an income below $x$. Let’s call
this proportion $p$. Additionally, let’s assume that $F(x)$ is continuously differentiable, such that
the following density exists:

$$F'(x) = f(x)$$

where for a given value of $x$ the proportion $p$ can be alternatively defined as:

$$p = \int_0^x f(t)\, dt = F(x)$$
 


Using the geometrical representation of the aforementioned general formula for the Gini
coefficient, we can represent it in terms of the covariance between income levels and the
cumulative distribution of income (Lubrano).

$$G = 1 - 2\int_0^1 L(p)\, dp = \frac{2}{\mu} Cov(x, F(x))$$

where Cov is the covariance between income levels $x$ and the cumulative distribution of
the same income $F(x)$, and $\mu$ is the average income.

The table below represents household incomes for each of India’s quintiles, as an
extension of Table 1:

Point   Proportion of population, $x_i$      Proportion of income            Household income, $y_i$
        (converting % to decimals)         (converting % to decimals)      (Rs/annum)
1       0.2 (first quintile)               0.061                           19,041
2       0.4 (second quintile)              0.153                           29,353
3       0.6 (third quintile)               0.279                           41,220
4       0.8 (fourth quintile)              0.468                           65,235
5       1 (fifth quintile)                 1.0                             153,872

Table 3: Table of mean income levels corresponding to each population quintile in India

Using this, the cumulative distribution of income refers to the $x$ coordinates, while income
levels refer to the average income corresponding to each $x$ segment of the population. This
suggests that the Gini coefficient is proportional to the covariance between a variable and
its rank. The covariance of two variables indicates how they change together. As such, it
provides a measure of the degree of correlation between sets of random variables, with a
positive covariance value suggesting a positive relation and a negative value an inverse
relation.

Understanding the idea of covariance was especially challenging for me, since statistics
was one topic that was not visited in any of my math lessons. As such, as opposed to a
formulaic approach, I attempted to understand and explain the concept diagrammatically.
Using the paired data in Table 3, a scatter plot is seen below:

Figure 8: Diagrammatical representation of Covariance

In the diagram I drew all possible rectangles that could exist between the 5 data points,
colouring them red. Here, the covariance is represented by the net amount of red in the plot
(reflecting the average covariation between the variables), which would be roughly around
the middle due to the darker shades of red there. Mathematically, this is shown with the
formula:

$$Cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where:
$x$ = independent variable
$y$ = dependent variable
$n$ = number of data points
$\bar{x}$ = mean of independent variable, $x$
$\bar{y}$ = mean of dependent variable, $y$

Using the values in Table 3, we can calculate $\bar{x}$ and $\bar{y}$ first.

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{3}{5} = 0.6$$

$$\bar{y} = \frac{\sum_{i=1}^{5} y_i}{5} = \frac{308{,}721}{5} = 61{,}744.2 = \mu$$

Substituting these values into the aforementioned covariance formula we get:

$$Cov(x, y) = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})}{4} = \frac{17081.28 + 6478.24 + 0 + 698.16 + 36851.12}{4} = 15277.2$$

Multiplying this value by $\frac{2}{\mu}$, we can calculate the value of the Gini coefficient using the
covariance formula:

$$G = \frac{2}{61744.2} \times 15277.2 = 0.495$$

As can be seen, the value of G = 0.495 is not equivalent to the officially stated value for the
Gini coefficient of India in 2013 of G = 0.510, calculated and published by the Indian
government using their complete data. With only 5 generalised income levels used to
determine the covariance between income levels and proportions of population in India,
this is inevitable. By using a limited number of data points, I realised I am ignoring various
idiosyncrasies that may be present in the income distribution of each individual segment.
This has led to an underestimation of the Gini coefficient of India.

As with the result from the first method, the reason for the discrepancy most likely lies
primarily in the limited access a civilian has to national income data. This creates
challenges in observing the effectiveness of distinct methods to calculate the Gini
coefficient.
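
The covariance calculation of Method 3 can be reproduced in a few lines. The following is a minimal sketch, assuming the population proportions and household incomes of Table 3 and the same $n - 1$ divisor used above; `mean` and `sample_covariance` are illustrative names.

```python
# Population proportions (x) and household incomes in Rs/annum (y) from Table 3.
ranks = [0.2, 0.4, 0.6, 0.8, 1.0]
incomes = [19041, 29353, 41220, 65235, 153872]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Covariance with the n - 1 divisor, as in the formula of Method 3."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return total / (len(xs) - 1)


mu = mean(incomes)                      # average income
cov = sample_covariance(ranks, incomes)
gini = 2 * cov / mu                     # G = (2 / mu) * Cov(x, y)
print(f"mean income: {mu:.1f}")
print(f"covariance: {cov:.1f}")
print(f"Gini coefficient (covariance formula): {gini:.3f}")
```

The choice of divisor matters with so few points: dividing by $n$ instead of $n - 1$ would scale the covariance, and therefore G, by a factor of 4/5.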

Discussion and Analysis


In this investigation, I attempted to present an analysis of three formulaic methods to
calculate the Gini coefficient: two based on area ratios under a Lorenz curve and one
based on the covariance formula.

The numerical integration method of the trapezium rule is, in comparison to Method 2,
extremely unreliable, as it inevitably results in a positive bias for the area under the Lorenz
curve and a negative bias for the Gini coefficient. This is because the method creates the
curve with straight line segments that lie above the curved lines connecting the data
points (as seen in Method 2). This results in a greater area underneath the Lorenz curve
for Method 1, and thus a smaller Gini coefficient.

When comparing Methods 2 and 3, although the values of the Gini coefficient from both
methods were lower than the governmentally defined value of G = 0.510,
Method 2 seems to be less effective in accurately measuring the value, since the value it
predicted had a greater discrepancy from the actual value than the one predicted by the
covariance formula. A reason for this could be that formulating the Lorenz curve $L(x)$ from a
data set of size $n = 6$ results in a curve that estimates income proportions ($y$) for all the
unspecified segments / population proportions of Indian society ($x$). In the case of my
investigation, where the data points were restricted to quintile income data, this gives great
room for uncertainties and inaccurate estimations of the income disparity within these
quintiles. On the other hand, since the Gini coefficient based on the covariance formula
was solely derived from the relationship between the 5 $x$ and $y$ coordinates, its value of
G = 0.495 was closer to the actual value.

With empirical evidence from my investigation, the Gini coefficient seems to be most
accurately calculated using the covariance-based method. However, with better access to
a wide range of income proportions and data points, most governments opt to employ the
Lorenz curve to determine the coefficient instead (Method 2). The most distinct difference
between Method 2 and Method 3 is that the Lorenz curve is an extremely contextualised
and direct way to calculate the Gini coefficient. This is because it was created primarily to
act as a graph of the cumulative frequencies of income proportions and population
proportions, which, along with the well-defined rules for the curve, suggests that it was
meant solely for this purpose. On the other hand, the covariance formula in Method 3 is
used to infer the Gini coefficient, and more generally indicates the type of relationship
between two random variables. This allows Method 3 to provide measurements for various
other areas of interest, such as the magnitude of positive or negative correlation between
any two variables. This trait of Method 3 can be used to better understand the degree of
inequality in a country, by filling in gaps that may exist as a result of the coefficient being a
simplistic consideration of income distribution.

Overall, the Gini coefficient does have limitations as a measure of inequality. One of the
major ones is that the coefficient is not additive across various segments of a population
and fails to capture the nuances of income disparity that may exist within each segment. For
a better judgement of a nation’s degree of inequality, the coefficient is used in conjunction
with other indices of income inequality such as the Theil index, which is additive over
various population segments. It identifies the share of inequality attributable
to the between-region component, and is a measurement based on generalised entropy
formulae, mitigating some of the limitations of the Gini coefficient.

Assumptions and Limitations

In the investigation, the use of the Gini coefficient as a tool to compare income inequalities
of multiple countries was not explored. This could have been a possible extension of the
investigation, which could also have led to a deeper understanding of its relevance in modern
economic inequality and its reliability as such.

Additionally, the scope of the research was limited as a result of limited access to census
data regarding India’s income proportions. Yet, for the sake of comparison and
exploration, the results were assumed to be conclusive and were compared to the actual
value of the coefficient published by the Indian government to determine the reliability of
each method.

Conclusion
The investigation enabled me to determine the various implications and calculations of the
Gini coefficient, whose value may vary numerically depending on the nuances of each method.
Working with the Gini coefficient and with so many areas of mathematics that were novel
to me has allowed me to appreciate the idea of inequality, the sharing of monetary resources
and applied mathematics in the modern day. I was astonished at how drastic the difference
was between the lowest and highest quintiles of India’s earning population, an insight
which would not have been as revelatory without deriving it mathematically through
Lorenz curves. Quantitative and empirical analysis of social issues such as income
inequality allowed me to broaden my perspective on the implications and severity of this
prevalent issue.

Bibliography
Bourne, Murray. "The Gini Coefficient of Wealth Distribution." Intmathcom RSS. N.p., 24
Feb. 2010. Web. 07 Mar. 2017.
Nair, Remya. "IMF Warns of Growing Inequality in India and China." Http://www.livemint.com/.
Livemint, 03 May 2016. Web. 07 Mar. 2017.
"Finding Residuals." Interactivate: Finding Residuals. CSERD, n.d. Web. 23 Mar. 2017.
Weisstein, Eric W. "Vandermonde Matrix." From MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/VandermondeMatrix.html. 23 Mar. 2017.
Lubrano, Michael. "The Econometrics of Inequality and Poverty." (n.d.): n. pag.
Http://www.vcharite.univ-mrs.fr/PP/lubrano/cours/Lecture-4.pdf. Sept. 2016. Web. 24 Mar.
2017.
