
Geospatial Data Mining (E-Learning)

English e-learning course, corresponding to the contents of a course held at the Institute of Statistics and Information Management, Universidade Nova de Lisboa.

TUTORIAL

You can see this document as the road map for the practical work. This tutorial is separated into two exercises. The exercises are required: you have to complete them in order to receive credit. The first exercise is largely self-explanatory, to aid and simplify your understanding of some of the tools that we discuss in the course.

These exercises were developed to involve different tasks that in some way relate to GIScience. Our objective was to cover a wide range of examples, hoping that at least some of them would be of interest to your work. Even if none of these examples applies directly to your research interests, their diversity will provide a wide perspective on what you can expect from data mining tools.

The choice of software was made with one major concern: availability. The idea is for you to have the possibility of continuing to work with these tools (if you're interested) without having to buy expensive software packages. The downside of this option is that freely available tools usually have poor interfaces, which can be a bit more frustrating in the early stages of learning.

For Part 2, the software package available is SOM_PAK, a very good package that is very efficient and capable of processing very large datasets. It doesn't have a graphical interface, and all interaction is done through the DOS command line. This may be frightening for some of you, but let me assure you that after the first shock (and some experiments) it is relatively easy to use. The manual is available here; you should read it in order to be able to complete the proposed exercises.

Exercises The Self-Organizing Map

Fernando Lucas Bao

Exercise 1

Mission: Good afternoon Mr. Hunt. Your mission, should you choose to accept it, involves the development of a geodemographic classification. The basic idea is to create a geodemographic typology of the city of Lisbon. Obviously, you don't know anything about Lisbon and its socio-economic structure, so the idea is not to develop a perfect geodemographic typology. The aim is to provide you with basic knowledge of some tools you can use in your office to analyse data from your own city or country. Nevertheless, I expect that by the end of the project you will know a lot more about Lisbon than you do now.

To accomplish this task you have two data files and three software packages available. The data files are:

1. A .txt file (named tutorial1.txt) with the socio-economic variables you will use to develop the geodemographic typology. Each record in this file corresponds to an enumeration district (ED) and can be joined to the shapefile using the last variable in the set, which is a label (code). This file is ready to be processed by SOM_PAK (which only accepts .txt files). If you check the file you will notice two things. First, variables do not have headings. Second, the first row is composed of only one number (22); this value tells the software how many processing variables there are in the file. The file has 23 columns, but only 22 are for processing; the last variable is a code (a "label" in SOM_PAK lingo), which identifies each record and allows you to join this database with lisboa.shp in order to draw some maps.
2. A zipped shapefile (named lisboa.shp; a shapefile is ArcView's native format) of the enumeration districts (ED) in the city of Lisbon.

In terms of software, you can use Microsoft Excel to analyse and pre-process your data, ArcView to visualize results geographically, and SOM_PAK to develop the typologies (cluster the data).
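If you want to check the file outside Excel, the layout described above (a first row holding the number of variables, then one record per row ending in a label) can be read with a few lines of Python. This is only a minimal sketch: the function name read_som_pak and the sample values are made up for illustration, and the sample uses 3 variables instead of the 22 in tutorial1.txt.

```python
from io import StringIO


def read_som_pak(handle):
    """Parse text laid out as described above.

    First line: the number of processing variables.
    Every other line: that many numbers followed by a label (code).
    """
    dim = int(handle.readline().split()[0])
    vectors, labels = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        vectors.append([float(value) for value in fields[:dim]])
        # The label is whatever follows the numeric columns, if present.
        labels.append(fields[dim] if len(fields) > dim else None)
    return dim, vectors, labels


# Illustrative sample only: 3 variables per record, hypothetical ED codes.
sample = StringIO("3\n0.10 0.50 0.40 A001\n0.25 0.25 0.50 A002\n")
dim, vectors, labels = read_som_pak(sample)
print(dim, len(vectors), labels)
```

To read the real file you would pass open("tutorial1.txt") instead of the in-memory sample, assuming the file is in your working directory.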
To make your life easier I decided to throw in the somlx2.bat file, which basically does everything you need to do in SOM_PAK; you just have to run it. Make sure you read the file first and understand the instructions that it gives to the software (for this you need the SOM_PAK manual). You can change the parameter values, which will yield different clustering results. Beware of possible outliers present in the data!

The data set you will be using comprises 882 records corresponding to the 882 enumeration districts of Lisbon. Each unit contains information on about 600 people and 250 households. At the end of this document you can find a summary of the variables available. The data has already been normalized in order to avoid scale effects. In this particular case the normalization was achieved by computing ratios: as the data is organized by themes (dwellings, population age, education statistics, etc.), the individual variables were normalized with reference to the total of their theme, which reduces the scale effect.

The following steps constitute a guide to the tasks you are supposed to perform.

1. Getting acquainted with the data (optional). Make a map based on the data answering the following question: is there a pattern in the distribution of old people (above 65 years) in Lisbon?
2. Search and destroy (optional). Analyse the available data, trying to detect outliers. You can skip this step and the next one and go directly to step 4. As we discussed, the U-Matrix provided by the SOM can be of assistance if you are trying to identify outliers. So, once you have performed your classification, you have to access the file and analyse the U-Matrix looking for outliers. Clean the dataset, leaving out any unwanted records.
3. Build a geodemographic typology of the city with a SOM of 5*5 (you will get 25 clusters, but you are welcome to try any number of clusters you see fit). To do this you can use the somlx2.bat file already available for your use. Nevertheless, it is important that you make an effort to understand the instructions written in the somlx2.bat file.
You can only do it yourself if you are able to understand these instructions. You are welcome to change the parameters defined in the file and compare the different results.
4. Analyse the U-Matrix and define which clusters are present in the data. Once you've run the somlx2.bat file, a file is created; this is where you can find the visual representation of the U-Matrix. Clearly, 25 clusters are probably too many for our quick analysis; we would like to have a summary of the classification, reducing the 25 to a smaller number. Neurons which are connected by areas of light colour should be included in the same cluster. Are there particular clusters that can be isolated? Should we take out some of the records which are very different from the rest and perform a new clustering? Yes, probably we should. To do so, we have to identify the records that we want to exclude and remove them from the file. Remember that the numbering attributed by SOM_PAK starts at 0, so the 0,0 neuron is the first column of the first row, and it's located at the top-left corner of the U-Matrix. The best way to identify the records that need to be removed is to open the lx.m4 file (in Excel), open tutorial1.txt (in Excel), copy the contents of tutorial1.txt into lx.m4, and delete the records that are classified in the clusters which were considered outliers.
5. After removing the outliers you should return to the beginning of the process, rerun SOM_PAK and analyse the new U-Matrix.
6. To have a more accurate idea of what each cluster represents it is necessary to analyse the clusters. This can be achieved in several ways: use the planes program of SOM_PAK, or use Excel to sort the data by cluster and use the subtotal option to analyse the average of each cluster in each variable. This should give you an idea of the characteristic features of each cluster.

To make a map with your classification (optional)

7. In order to make a map of your clusters we have to be able to join the files tutorial1.txt (which provides the code to join to lisboa.shp) and lx.m4 (which provides the cluster in which each record is classified). This task is very simple. All you need to do is open both files in Excel, copy the contents of lx.m4 to tutorial1.txt (this must be done without any changes to the sorting of the records), and save the result as a hunt.dbf file.
At this point you have a file which contains your clusters and can be joined in ArcView with the geographic representation of Lisbon.
8. As you may have noticed, in order to define the cluster for each record you need to group the first two columns of lx.m4 into one. A record which has 1 and 4 in the first two columns (meaning that it was classified in column 1 and row 4 of the SOM, counting from 0) will be classified in cluster 14. Use Excel to concatenate (=CONCATENATE()) the columns and rows (in the hunt.dbf file) and redefine the clusters (you may decide that clusters 14 and 13 are very similar and you want to include their records in a single cluster). The result should be a .dbf file which can be joined to the shapefile.
9. Map the results.

Mr. Hunt, this isn't mission difficult, it's mission impossible. Difficult should be a walk in the park for you. As always, should any member of your team be caught or killed, the Secretary will disavow all knowledge of your actions. Good luck. This message will self destruct in 10 seconds.

Annexe: Definition of the Variables

1. Percentage of dwellings with one or two rooms
2. Percentage of dwellings with three or four rooms
3. Percentage of rented dwellings
4. Percentage of owner occupied dwellings
5. Percentage of vacant dwellings
6. Percentage of households without unemployed people
7. Percentage of households with people older than 65 years old
8. Percentage of households with people younger than 15 years old
9. Percentage of households with one or two people
10. Percentage of households with three or four people
11. Percentage of people looking for their first job
12. Percentage of unemployed people looking for a job
13. Percentage of employed people
14. Percentage of people without any economic activity
15. Percentage of retired people and pensioners
16. Percentage of people who do not know how to read or write
17. Percentage of people who completed the 1st stage of basic education

18. Percentage of people who completed the 2nd stage of basic education
19. Percentage of people who completed the 3rd stage of basic education
20. Percentage of people who completed high-school
21. Percentage of people who completed a specialization after high-school
22. Percentage of people who completed university
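Steps 6 to 8 of Exercise 1 (joining the classification to the record codes, concatenating column and row into a cluster code, merging similar clusters and averaging each variable per cluster) can also be sketched in Python instead of Excel. This is a hedged illustration, not the course's method: the function names, the ED codes and all values are hypothetical, and it assumes, as the text describes, that each classification row starts with the column and row of the neuron and that both files keep the same record order.

```python
from collections import defaultdict


def cluster_code(col, row):
    """Concatenate column and row indices, like =CONCATENATE() in Excel."""
    return f"{col}{row}"


def assign_clusters(units, labels, merge=None):
    """Pair each record label with its cluster code.

    Relies on both lists being in the same record order, as the tutorial
    requires. `merge` optionally maps one cluster code onto another.
    """
    merge = merge or {}
    if len(units) != len(labels):
        raise ValueError("both files must have the same number of records")
    assigned = {}
    for (col, row), label in zip(units, labels):
        code = cluster_code(col, row)
        assigned[label] = merge.get(code, code)
    return assigned


def cluster_means(assigned, vectors, labels):
    """Average every variable within each cluster (the Excel subtotal step)."""
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[assigned[label]].append(vector)
    return {
        code: [sum(column) / len(rows) for column in zip(*rows)]
        for code, rows in groups.items()
    }


# Made-up values: three records with two variables each.
units = [(1, 4), (1, 3), (0, 0)]       # (column, row) of each record's neuron
labels = ["A001", "A002", "A003"]      # hypothetical ED codes
vectors = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]

# Treat clusters 13 and 14 as one cluster, as suggested in step 8.
assigned = assign_clusters(units, labels, merge={"13": "14"})
means = cluster_means(assigned, vectors, labels)
for code in sorted(means):
    print(code, [round(value, 3) for value in means[code]])
```

Plain concatenation is unambiguous only while both indices are single digits, which holds for the 5*5 map used here; a larger map would need a separator between column and row.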

Exercise 2

Challenge: A good colleague of mine works with satellite images. Although he really doesn't know much about neural networks, he has recently heard that they can be very useful in image classification tasks. Knowing that I teach Data Mining, he approached me to see if we could make a small experiment with a dataset that he has available. This dataset has 3 target classes (urban, rural, forest) and each pattern is described by 4 variables, considered by my colleague to be the most important bands. As I'm a very busy guy, I would like you to try a SOM and see if you can appropriately classify the 3 different types of pixels. The idea is to see if we are able to isolate each particular land use in an area of the SOM. If we can do that we will be able, using the SOM, to automatically classify the rest of the satellite image. Report your findings and explain which classes you can distinguish easily and which you cannot.

Data file: satellite1.txt. The file is ready to be processed in SOM_PAK, with 4 attributes for each example (pixel) and a label (1=water, 2= urban, 3= other uses).

Software: SOM_PAK

Deliverables: an analysis of the segmentation that you were able to make. This should include a U-matrix with the labels of the training examples and a drawing scheme (over the U-Matrix) of where you expect the different types of land uses to be classified.
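One possible way to back up the analysis with numbers is to count, for every neuron, how many training pixels of each label it attracts, and then check how much of each class lands in neurons dominated by that class. The sketch below assumes you already have the neuron (column, row) of each pixel from your SOM_PAK run; the function names and the six sample pixels are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter, defaultdict


def neuron_label_counts(units, labels):
    """Count how many pixels of each label fall on each neuron."""
    counts = defaultdict(Counter)
    for unit, label in zip(units, labels):
        counts[unit][label] += 1
    return counts


def majority_labels(counts):
    """Give each neuron the label held by most of its pixels."""
    return {unit: c.most_common(1)[0][0] for unit, c in counts.items()}


def class_purity(counts):
    """For each label, the share of its pixels that sit on neurons
    where that label is the majority. Values near 1 suggest the class
    occupies its own area of the map; low values suggest confusion."""
    majority = majority_labels(counts)
    total, kept = Counter(), Counter()
    for unit, c in counts.items():
        for label, n in c.items():
            total[label] += n
            if majority[unit] == label:
                kept[label] += n
    return {label: kept[label] / total[label] for label in total}


# Made-up example: six pixels, labels as strings "1", "2", "3".
units = [(0, 0), (0, 0), (0, 1), (2, 2), (2, 2), (2, 2)]
labels = ["1", "1", "2", "2", "3", "3"]

counts = neuron_label_counts(units, labels)
majority = majority_labels(counts)
purity = class_purity(counts)
for label in sorted(purity):
    print(label, purity[label])
```

In this toy example one pixel of class 2 shares a neuron with class 3, so class 2 comes out as the class that is harder to isolate. The U-Matrix drawing is still needed to judge whether the neurons of a class form a contiguous area.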