
Proceedings of 13th International Conference on Computer and Information Technology (ICCIT 2010)

23-25 December, 2010, Dhaka, Bangladesh

Spatial Data Mining on Literacy Rates and Educational Establishments in Bangladesh

A. K. M. Zahiduzzaman, Mohammed Nahyan Quasem, Mridul Khan, Rashedur M Rahman
Department of Electrical Engineering and Computer Science, North South University,
Bashundhara, Dhaka, Bangladesh
thesetu@gmail.com, nahyan.quasem@gmail.com, mridul.khan@gmail.com, rashedurrahman@yahoo.com

Abstract

Data mining is the process of extracting non-trivial patterns from large volumes of data. It generates insight and turns the data into valuable information. A critical yet common flaw when performing data mining is to ignore the geographic locations from which the data is taken. When this geospatial attribute of the data is taken into consideration, the process is known as geospatial data mining. This task essentially deals with the detection of spatial patterns in the data, the formulation of hypotheses and the assessment of descriptive or predictive spatial models. Spatial data mining could provide interesting and useful information to government, environmentalists and relevant decision makers in the assessment of the relative performance of a particular geographic area. The results could also be used for causal analysis by domain experts. In our research we perform spatial data mining using literacy rates and the number of educational establishments. The data is from the 64 well-defined administrative units of Bangladesh known as Zilas. This paper contains a summary of the theory, the methodology and a detailed analysis of the results. We compare the results of the spatial model with those of the classical regression model. The results demonstrate that the spatial lag model outperforms the classical model in several respects.

Keywords: Data Mining, Exploratory Spatial Data Analysis, Geographic Information Systems, Spatial Autocorrelation, Spatial Regression.

I. INTRODUCTION

When a dataset contains information about its geographic origin we call it geospatial data. Spatial data mining is a sub-field of data mining that employs specialized techniques for dealing with geospatial data. The key goal is to find interesting patterns in these datasets. These patterns can answer questions directly or serve as the basis for research by domain experts. Spatial data mining basically performs Exploratory Spatial Data Analysis (ESDA) on large computerized data repositories. A comprehensive overview of this field can be found in [1]. At present ESDA is more important than ever before, due to the processing power of modern computers and the availability of massive amounts of geospatial data. However, examples of the application of this field are scarce and almost unheard of in Bangladesh. A motivating factor of our research is therefore to apply the ESDA techniques to Bangladesh geospatial data. Our objective is to check whether the geographic distributions of literacy rates and educational establishment counts are consistent with Waldo Tobler's first law of geography [2]: “Everything is related to everything else, but near things are more related than distant things.” Our research aims to find the nature of the spatial patterns of these variables. We are also interested to see whether they produce any interesting outcome under spatial autocorrelation. In addition, we investigate which spatial model would be more appropriate to fit the data we have. The contribution of this research could help the concerned authorities and decision makers alike.

The rest of the paper is organized as follows. Section II describes related work in the area of spatial mining. Section III presents the software package used in this research. Section IV outlines the methodologies used in this paper. Section V focuses on the sources from which the data is collected and the process of data acquisition. A detailed analysis of the research findings is presented in Section VI. Finally, Section VII concludes and gives directions for future research.

II. RELATED WORK

Spatial data mining and exploratory spatial data analysis are relatively new fields. The amount of research done in these fields is therefore far less than in other areas of data mining and applied statistics. In our research we look for spatial patterns in literacy rates and the number of educational establishments in the 64 Zilas of Bangladesh. To the best of our knowledge, this is the first such initiative taken in Bangladesh.

In [3] the authors found a spatial relationship between the proportion of the population with a standard of living below the poverty line and soil condition. de Dominicis et al. [4] compared traditional measurements of geographic concentration of economic activities with spatial data analysis techniques. They also presented a comprehensive analysis of the manufacturing industry and a set of hypotheses.

A research paper that discusses the applications of spatial data analysis in addition to its research findings is [5]. Here the authors elaborated on the potential application of spatial data analysis techniques for policy making
and discussed some interesting related projects before delving into their research. The research itself is quite similar to the previous one. It analysed the spatial properties of Turkey's manufacturing industry. At the end, the paper presented examples of the successful application of spatial analysis for generating clustering policies.

III. PRIMARY TOOL: OPENGEODA

The tool we use in this research is OpenGeoDa [6]. It is a software package for spatial data analysis that is being developed at the Center for Spatially Integrated Social Sciences (CSISS), which is funded by the United States' National Science Foundation (NSF). The development effort is led by Dr. Luc Anselin, who is currently one of the leading researchers in the field of spatial data analysis. OpenGeoDa is the improved and open-source version of GeoDa. GeoDa was developed at the Spatial Analysis Laboratory of the University of Illinois at Urbana-Champaign under the direction of Dr. Anselin. Even in their early days GeoDa and its predecessor DynESDA were considered powerful tools for exploratory spatial data analysis. With this background, OpenGeoDa supports most ESDA techniques. It supports basic geospatial visualization methods such as choropleth mapping, histograms and box plots, and can perform many advanced tasks such as generating the Moran scatter plot, calculating Local Indicators of Spatial Association, and building and analysing spatial regression models. Some of the advanced methods were developed by Luc Anselin; for these, the implementation in OpenGeoDa can be considered the authoritative one. OpenGeoDa is open-source software with its entire source code available for alteration. However, for our work we did not make any modifications. We used the binary release of version 0.9.8.14 alpha, available at [7].

IV. ALGORITHMS AND METHODOLOGY

Our research deals with spatial autocorrelation and spatial regression. For visualizing the data we used a choropleth map drawn using the equal interval classification scheme [9] and a Moran scatter plot [10]. We used the global Moran statistic to get a sense of the global spatial autocorrelation and then fitted the data to an appropriate spatial regression model.

A. Global Spatial Autocorrelation

Spatial autocorrelation is the correlation of a variable with itself across space. A common statistic used for this is the Moran’s I index [11]. It is calculated by the following formula:

$$I = \frac{N}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i} (y_i - \bar{y})^2}$$

Here, $I$ is the Moran’s I index. The sign and magnitude of this index give the nature of the correlation of the $N$ points, which are indexed by $i$ and $j$. $y_i$ is the point in the data set that is currently under consideration, and $\bar{y}$ is the mean of all the points in the data set. $w_{ij}$ is an individual element of the weights matrix: $w_{ij}$ is 1 if $i$ is a neighbour of $j$, otherwise it is 0.

The value of this index ranges from -1 to 1. A Moran’s I of 1 means perfect spatial correlation while a value of -1 means perfect dispersion. Moran’s I measures global spatial autocorrelation and is thus well suited for summarizing a variable’s spatial behaviour over the entire region of interest.

B. The Moran Scatter Plot

If we plot $Wy$, the standardized average of the neighbouring values of $y$ (sometimes called the spatial lag), against the standardized value of $y$, we get the Moran scatter plot. The slope of the regression line is the Moran’s I. This plot was developed in [10] and shown to be a good ESDA tool for studying how the local spatial behaviour of the variable builds up the global Moran’s I statistic. The scatter plot shows how spatially dispersed the data is. It also gives a hint as to where the potential spatial clusters and outliers lie. The points in the High-High and Low-Low quadrants (1st and 3rd quadrants) are potential clusters while the points in the other two quadrants are potential outliers. The statistical significance of these clusters and outliers can be measured using the Local Indicators of Spatial Association [12].

C. Spatial Regression

Traditional regression models do not take the spatial properties of data into consideration. This is why we need specialized regression techniques for dealing with geospatial data. OpenGeoDa supports the construction and analysis of spatial regression models. We have used these features to see which spatial regression model would best fit the data we have.

V. DATA ACQUISITION

Collecting geospatial data is not easy. It is more difficult in the context of a developing country like Bangladesh. On top of that, our research is about the geospatial analysis of statistical data, not geographic or topographic data. The data we collected therefore did not have any geographical metadata for Geographic Information Systems, such as longitudes and latitudes. We had to geo-reference the data ourselves by associating elements from the data set with geographic polygons in a Zila-level digital map of Bangladesh.

First we had to collect this digital map. This was relatively easy, since an online project called Global Administrative Areas (GADM) [13] had it and allowed free academic usage. The map was in ESRI shapefile format [14]. Our data analysis tool could read it without any problems. We had to edit a few Zila names to make them consistent with the ones used in our data.

We got our statistical data from the Bangladesh Bureau of Statistics (BBS), which is the Bangladesh government's official organization for the collection and dissemination of statistical data. This paper deals with the geospatial characteristics of Zila-level literacy rates and
the number of educational establishments. Those data are available in publications compiled by the BBS. Literacy rates are collected at 10-year intervals as part of the National Population Census. The data about the number of educational establishments is collected at a similar time interval as part of the National Economic Census. The literacy rates we have used are from Population Census 2001 [15]. The rates show what percentage of the population of age 7 and above in a Zila is literate. The data for the number of educational establishments in each Zila comes from Economic Census 2001 [16].

VI. RESULT ANALYSIS

A. Spatial Autocorrelation

Our analysis shows that literacy rates have a moderately high level of spatial correlation but the distribution of educational establishments is almost random. Fig. 1 is a choropleth map of the 64 Zilas coloured by 4 equal-width ranges of literacy rates. Even this simple visualization shows large spatial groups of Zilas belonging to the same class.

Fig. 1. A choropleth map showing 4 equal-width ranges of literacy rates, where spatial groups are clearly visible.

When we look at Table I we see that the Moran’s I value for literacy rate is 0.4465. This is a moderately high positive value, so there is a positive global spatial autocorrelation. In layman’s terms this means that a Zila with a high literacy rate can be expected to have Zilas with similar literacy rates around it. However, the Moran’s I value for the number of educational establishments is very close to zero. According to the definition this represents a random spatial pattern.

Table I Calculated Moran’s I Values
Variable                     Moran’s I
Literacy Rate of 2001        0.4465
Educational Establishments   -0.0676

Now look at the Moran scatter plots in Fig. 2 and Fig. 3. They clearly support our previous observations. In Fig. 2 the points are reasonably well distributed around the regression line, but in Fig. 3 we see a dense cluster of points and a few extreme outliers.

Fig. 2. Moran scatter plot of the literacy rates of the 64 Zilas. Standardized literacy rates along X and standardized average of neighbours’ literacy rates along Y. The regression line with Moran’s I as slope is reasonably accurate.

Fig. 3. Moran scatter plot of the number of educational establishments in the 64 Zilas. Standardized counts along X and standardized average of neighbours’ counts along Y. The points are not well distributed.

So, using spatial autocorrelation, we have found that in 2001 the literacy rates had a moderately high positive global spatial correlation, whereas the spatial distribution of educational establishments was apparently random.
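As an illustration of the global Moran’s I statistic reported in this subsection, the following minimal Python sketch computes the index from a list of values and a binary weights matrix. The function name `morans_i`, the four-region weights matrix and the example values are hypothetical and chosen only for illustration; they are not our Zila data, and OpenGeoDa's own implementation may differ in details such as the standardization of the weights.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix.

    weights[i][j] is 1 if unit i is a neighbour of unit j, otherwise 0.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]              # deviations from the mean
    w_total = sum(sum(row) for row in weights)  # sum of all weights
    # Cross-products of deviations, counted only for neighbouring pairs
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in z)                 # total squared deviation
    return (n / w_total) * (num / den)


# Hypothetical example: four regions in a row, each adjacent to the next one
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [10, 12, 30, 32]     # similar values sit next to each other
alternating = [10, 30, 10, 30]   # neighbours are always dissimilar

i_clustered = morans_i(clustered, w)      # positive: spatial clustering
i_alternating = morans_i(alternating, w)  # negative: spatial dispersion
```

In this toy example the clustered arrangement yields a positive index and the alternating arrangement a strongly negative one, which mirrors the interpretation of the sign of Moran’s I given in Section IV.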
B. Regression

In this research, we investigate the importance and necessity of including spatial characteristics in the analysis. We demonstrate this by building regression models with and without the spatial nature of the data. Our objective is to develop a model that can predict the literacy rate of 2001 on the basis of other explanatory variables. First, we fit a linear model to the data by ordinary least squares regression without considering the spatial characteristics. The model is given below:

$$lit_{01} = CONST + a_1\,lit_{91} + b_1\,ed_{01}$$

Here, $CONST$ is the constant term and $a_1$ and $b_1$ are the regression coefficients. The literacy rates of 2001 and 1991 are represented by $lit_{01}$ and $lit_{91}$ respectively, and $ed_{01}$ is the number of educational establishments in 2001. It would be better if we could use the number of educational establishments in 1991. However, this data was not collected in the Economic Census conducted in 1991.

Using Ordinary Least Squares (OLS) regression we found the following values for the coefficients and the corresponding statistical significances:

Table II Ordinary Least Squares Results
Variable   Coeff      Std. Err   t-statistic   Prob
CONST      12.99      1.9420     6.689         0.000
lit91      0.99       0.0784     12.721        0.000
ed01       -0.00011   0.0004     -0.2709       0.787

Both the constant and the coefficient of lit91 are significant, as their t-statistics show. However, the coefficient for the educational establishments is not significant, i.e., Prob>0.04. Therefore, we exclude this term from further investigation. The linear model is also a good fit with respect to the different metrics shown in Table III.

Table III Summary Results of Linear and Spatial Regression
Metric                  Linear Regression   Spatial Regression
R-squared               0.723010            0.752474
Sum squared residual    1009.9              860.22
Sigma-square            16.2887             14.1012
Log likelihood          -179.091            -176.08
Akaike Info Criterion   362.182             358.16
Schwarz Criterion       366.5               364.637

The R-squared for the linear regression is 0.72, which demonstrates the goodness of fit between the linear model and the data we have. As the data is fitted quite well by the linear regression model, we do not consider higher-order models in this research.

Before taking the spatial characteristics into consideration, we first test whether any spatial autocorrelation exists in the data. A total of five test statistics are reported in Table IV to test for spatial dependence.

Table IV Diagnostics for spatial dependence
TEST                MI/DF      VALUE       PROB
Moran's I (error)   0.131529   1.8524188   0.0639656
LM (lag)            1          5.4722674   0.0193205
Robust LM (lag)     1          3.2171014   0.0728726
LM (error)          1          2.2570978   0.1330031
Robust LM (error)   1          0.0019318   0.9649425

The first statistic is Moran’s I, which gives the same value as in Figure 2. As the Moran’s I statistic is significant, it suggests spatial dependence. However, this test cannot indicate whether the spatial lag model or the spatial error model would fit the data better. Four Lagrange Multiplier (LM) test statistics are used for this purpose. The workflow in Fig. 4 [8] is used to decide between the two alternatives, i.e., the spatial lag model and the spatial error model.

Fig. 4. Workflow for Spatial Regression Model Decision

From Table IV we observe that the LM (lag) statistic is the only significant one among the four. The significance is tested by the value of the probability (PROB). If the value is <0.04 then we consider it to be significant [8]. As the probability of the LM (lag) statistic is 0.02, we choose to fit the data with the spatial lag model, which is presented below:

$$lit_{01} = CONST + a_1\,lit_{91} + \beta\,W\,lit_{01}$$
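To make the spatially lagged term of a spatial lag model concrete, the following minimal Python sketch computes the lag of a variable as the average of its neighbours' values and then evaluates the right-hand side CONST + a1·lit91 + β·W lit01 for given coefficients. It assumes a row-standardized weights matrix, which matches the description of the spatial lag as an average of neighbouring values in Section IV. The function names, the neighbour lists, the data values and the coefficients are all hypothetical; the sketch does not estimate the coefficients and is not OpenGeoDa's implementation.

```python
def spatial_lag(values, neighbours):
    """Row-standardized spatial lag: the mean of each unit's neighbours' values.

    neighbours[i] lists the indices of the units adjacent to unit i.
    Units without neighbours get a lag of 0.0.
    """
    lag = []
    for i in range(len(values)):
        ids = neighbours[i]
        lag.append(sum(values[j] for j in ids) / len(ids) if ids else 0.0)
    return lag


def lag_model_rhs(const, a1, beta, lit91, w_lit01):
    """Evaluate CONST + a1 * lit91 + beta * (W lit01) for every unit."""
    return [const + a1 * x + beta * wl for x, wl in zip(lit91, w_lit01)]


# Hypothetical data for three units arranged in a row (0 - 1 - 2)
neighbours = {0: [1], 1: [0, 2], 2: [1]}
lit01 = [40.0, 50.0, 60.0]
lit91 = [30.0, 40.0, 50.0]

w_lit01 = spatial_lag(lit01, neighbours)   # average 2001 rate of the neighbours
rhs = lag_model_rhs(5.0, 0.9, 0.2, lit91, w_lit01)  # hypothetical coefficients
```

Because the dependent variable appears on both sides of the equation through the lagged term, such a model cannot simply be evaluated unit by unit in isolation; each unit's value depends on the values of its neighbours.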
In the spatial lag model, $W\,lit_{01}$ is the spatially lagged dependent variable for the weights matrix $W$, $lit_{91}$ is the literacy rate in 1991, $CONST$ is a constant term, and $a_1$ and $\beta$ are the coefficients. Running the spatial lag model we find the following values for the coefficients and the corresponding significances.

Table V Spatial Lag Model Results
Variable   Coeff    Std. Err   z-value   Prob
W_lit01    0.2731   0.0982     2.78009   0.0054
CONST      6.0409   3.0344     1.99075   0.0465
lit91      0.8634   0.0889     9.70776   0.0000

From Table V, we see that all the coefficients are significant, including the autoregressive coefficient, with a value of $\beta = 0.273$. There are some minor differences in the significance of the other regression coefficients between the spatial lag model and the classical OLS model (Table II). More importantly, the significance of CONST has changed from (p < 0.0000) to (p < 0.0465). The magnitude of all estimated coefficients is affected, showing a decrease in absolute value. This suggests that part of the explanatory power previously attributed to the independent variables is in fact due to the literacy rates of a Zila’s neighbouring locations. In the spatial lag model this is picked up by the autoregressive coefficient $\beta$.

Table III presents a relative performance measure between the classical and the spatial lag model. The R-squared value increases relative to the classical model, representing a better fit of the spatial lag model than of the classical model. In addition, the log likelihood increases from -179 in the classical model to -176 in the spatial model. Even after compensating the improved fit for the added variable (the spatially lagged dependent variable), the AIC and SC decrease relative to OLS.

Table VI presents the true literacy rate of 2001 (LR2001), the predicted literacy rate ($\widehat{LR}2001$), the residuals and the prediction errors for the first ten observations.

Table VI LM Lag Regression Result
Obs.   LR2001   Predicted   Residual   Prediction Error
1      46.27    46.6367     -0.7475    -0.3667
2      53.55    54.7018     -1.0345    -1.1518
3      55.22    33.1237     22.1847    22.0962
4      46.02    47.1306     -1.0006    -1.1106
5      41.81    42.0915     -2.1590    -0.2815
6      25.29    32.6394     -6.2549    -7.3494
7      46.23    45.4177     2.0399     0.8122
8      23.09    30.2760     -6.2931    -7.1860
9      30.53    32.6725     -1.8493    -2.1425
10     40.3     40.8180     -0.1013    -0.5180

The residuals are the estimates of the model error term, i.e., $(1 - \beta W)\,LR2001 - (CONST + a_1\,lit_{91})$, whereas the prediction error is $LR2001 - \widehat{LR}2001$. We also calculate the Moran’s I test statistic for the residuals. Figure 5 presents the scatter plot. The Moran’s I test statistic is 0.0146, or essentially zero. This indicates that including the spatially lagged term in the model essentially eliminates the spatial autocorrelation in the residuals, as it should.

Fig. 5. Moran Scatter Plot for spatial lag residuals

Finally, if we plot the prediction errors against the predicted values we get a line almost parallel to the x axis, with a slope of -0.0100 (Fig. 6). The slope is close to zero, indicating the goodness of fit of the data to the model. As a result, the prediction error is close to zero almost everywhere. Only one district, Bhola, has a large prediction error, of about 22. This error also contributes the most to the sum of squared residuals presented in Table III.

Fig. 6. Scatter Plot of Prediction Errors against Predicted Values

VII. CONCLUSIONS

The analysis presented in this paper clearly shows that there is some spatial consistency in the distribution of literacy rates and educational establishments in Bangladesh. Taking policy-level decisions with these spatial properties in mind can lead to uniform positive development throughout the country. Future work with more geospatial variables can be carried out to create a complete spatial profile of Bangladesh's developmental indicators.
VIII. ACKNOWLEDGMENT

We would like to thank the librarians at Bangladesh Bureau of Statistics for patiently guiding us through their data collection. We are also grateful to Dr. Luc Anselin and Arizona State University for making GeoDa freely available.

REFERENCES

[1] M. J. D. Smith, M. F. Goodchild and P. A. Longley, Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools, 3rd ed. Leicester, UK: Matador, 2009.
[2] W. Tobler, “A computer movie simulating urban growth in the Detroit region,” Economic Geography, 46(2), 234-240.
[3] F. J. Paraguas, A. A. Kamil, K. S. Pheng, M. M. Dey, and M. L. Bose, “Exploration and Visualization of Poverty-Environment Relationship Using Exploratory Spatial Data Analysis Techniques,” in Proc. IRCMSA 2005, 2005, p. 357.
[4] L. de Dominicis, G. Arbia, and H. L. F. de Groot, “The Spatial Distribution of Economic Activities in Italy,” Tinbergen Institute Discussion Papers, 07-094/3, Dec. 2007.
[5] B. Türkcan, E. T. Çalışkan, and A. A. Kaya, “Industrial Clusters as a Regional Development Tool: A Spatial Analysis on Turkey,” in Proc. Econ Anadolu 2009, 2009.
[6] L. Anselin and M. McCann, “OpenGeoDa, Open Source Software for the Exploration and Visualization of Geospatial Data,” in Proc. ACMGIS '09, 2009, pp. 550-551.
[7] GeoDaCenter, Retrieved 20th October, 2010, http://geodacenter.asu.edu/software/downloads
[8] L. Anselin, Exploring Spatial Data with GeoDa: A Workbook, Urbana, USA: CSISS, 2005.
[9] Choropleth Mapping with Exploratory Data Analysis, Retrieved 20th October, 2010, http://www.locationintelligence.net/articles/718.html
[10] L. Anselin, “The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association,” in GISDATA Specialist Meeting on GIS and Spatial Analysis, 1993, paper 9330.
[11] P. A. P. Moran, “Notes on Continuous Stochastic Phenomena,” Biometrika, vol. 37, pp. 17-33, 1950.
[12] L. Anselin, “Local Indicators of Spatial Association – LISA,” in GISDATA Specialist Meeting on GIS and Spatial Analysis, 1993, paper 9331.
[13] GADM Database for Global Administrative Areas, Retrieved 20th October, 2010, http://www.gadm.org/
[14] ESRI Shapefile Technical Description, ESRI, 1998, Retrieved 20th October, 2010, http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf
[15] Bangladesh Bureau of Statistics, Population Census 2001, National Series, Volume-I, Dhaka, Bangladesh: Bangladesh Bureau of Statistics.
[16] Bangladesh Bureau of Statistics, Economic Census 2001, National Report, Dhaka, Bangladesh: Bangladesh Bureau of Statistics.