
Data Wrangling
    Geographic information system basics
        Raster data models
        Vector data models
        Satellite Imagery and Aerial Photography
        Geospatial Data Acquisition
        Geospatial Database Management
        Data Quality
        Vector data: Single layer analysis
        Vector data: Multiple layer analysis
        Raster data: single and multilayer analysis
        Raster data: Spatial interpolation
        Raster data: Terrain mapping
        Point in Polygon & Intersect
        Spatial Join
        Nearest Neighbour
    LiDAR technology

Data Wrangling
Geographic information system basics
Raster data models
The raster data model consists of rows and columns of equally sized pixels interconnected to form a
planar surface. These pixels are used as building blocks for creating points, lines, areas, networks,
and surfaces. Because of the reliance on a uniform series of square pixels, the raster data model is
referred to as a grid-based system. Typically, a single data value is assigned to each cell; this value
represents the characteristic of the spatial phenomenon at the location denoted by the cell's row and
column, and its data type can be either integer or floating-point.

The area covered by each pixel determines the spatial resolution of the raster model. Specifically,
resolution is determined by measuring one side of the square pixel. A raster model with pixels
representing 10 m by 10 m (or 100 square meters) in the real world is said to have a spatial
resolution of 10 m; a raster model with pixels measuring 1 km by 1 km (1 square kilometer) in the
real world is said to have a spatial resolution of 1 km.

Care must be taken when determining the resolution of a raster because using an overly coarse pixel
resolution will cause a loss of information, whereas using overly fine pixel resolution will result in
significant increases in file size and computer processing requirements during display and/or analysis.

Imagery employing the raster data model must exhibit several properties:

- First, each pixel must hold at least one value, even if that data value is zero;
- Second, a cell can hold any alphanumeric index that represents an attribute:
o Quantitative, for example, elevation in meters
o Qualitative, for example, 1 = agriculture, 2 = grassland
- Third, points and lines “move” to the center of the cell. Therefore, the minimum width of a
line feature is one cell, regardless of the spatial resolution

There are several methods for encoding raster data from scratch: (1) cell-by-cell raster encoding,
where every cell in the raster gets a stored value (e.g., 100010001); (2) run-length raster encoding,
where consecutive cells sharing the same value are stored as value-and-count runs, so homogeneous
areas compress well; and (3) quad-tree raster encoding, where the raster is recursively subdivided
into quadrants until each quadrant is homogeneous.
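The difference between cell-by-cell and run-length storage can be illustrated with a minimal Python sketch (the function name and sample row are hypothetical):

```python
def run_length_encode(row):
    """Encode one raster row as (value, run_length) pairs."""
    runs = []
    for v in row:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

row = [0, 0, 0, 1, 1, 1, 1, 0, 0]    # cell-by-cell storage: 9 values
print(run_length_encode(row))        # [(0, 3), (1, 4), (0, 2)]
```

The more homogeneous the raster, the fewer runs are needed, which is why run-length encoding shrinks file size for rasters with large uniform areas.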

Raster data has several advantages and disadvantages:

Advantages
- The technology required to create raster graphics is inexpensive and ubiquitous. Nearly
everyone currently owns some sort of raster image generator, namely a digital camera, and
few cellular phones are sold today that don’t include such functionality.
- The underlying data structure is relatively simple: each grid location in the raster image
correlates to a single value. This simplicity also lends itself to easy interpretation and
maintenance of the graphics, relative to their vector counterpart.

Disadvantages
- Raster files are typically very large, particularly in the case of raster images built from
cell-by-cell encoding.
- The output images are less “pretty” than their vector counterparts. This is particularly
noticeable when the raster images are enlarged or zoomed.
- The geometric transformations that arise during map reprojection efforts can cause
problems for raster graphics.
- Raster data is not suitable for some types of spatial analysis. For example, difficulties arise
when attempting to overlay and analyze multiple raster graphics produced at differing scales
and pixel resolutions. Combining a raster image with 10 m spatial resolution and a raster
image with 1 km spatial resolution will most likely produce nonsensical output, as the scales
of analysis are far too disparate to result in meaningful and/or interpretable conclusions.
Vector data models
Vector data models use points and their associated X, Y coordinate pairs to represent the vertices of
spatial features, much as if they were being drawn on a map by hand. Three fundamental vector
types exist in geographic information systems (GISs): points, lines, and polygons.

Vector data models can be structured in many different ways:

- Spaghetti data model. In the spaghetti model, each point, line, and/or polygon feature is
represented as a string of X, Y coordinate pairs (or as a single X, Y coordinate pair in the case
of a vector image with a single point) with no inherent structure. Despite the location
designations associated with each line, or strand of spaghetti, spatial relationships are not
explicitly encoded within the spaghetti model; rather, they are implied by their location. This
results in a lack of topological information, which is problematic if the user attempts to take
measurements or perform analysis
- The topological data model is characterized by the inclusion of topological information within
the dataset, as the name implies. Topology is a set of rules that model the relationships
between neighboring points, lines, and polygons and determines how they share geometry.
Three basic topological precepts are necessary to understand the topological data
model:
o Connectivity describes the arc-node topology for the feature dataset.

o Area definition states that arcs that connect to surround an area define a
polygon, also called polygon-arc topology. This results in a reduction in the amount
of data stored and ensures that adjacent polygon boundaries do not overlap.
o Contiguity, the third topological precept, is based on the concept that polygons that
share a boundary are deemed adjacent. Specifically, polygon topology requires that
all arcs in a polygon have a direction (a from-node and a to-node), which allows
adjacency information to be determined.

Topological information allows for rapid error detection:

- Open or unclosed polygons when an arc does not completely loop back on itself;
- Unlabeled polygons when a polygon doesn’t contain an attribute;
- Slivers occur when the shared boundaries of two polygons do not meet exactly;
- When two lines do not meet perfectly at a node, the error is called an “undershoot” when
the lines do not extend far enough to meet each other, and an “overshoot” when a line
extends beyond the feature it should connect to

Vector data has several advantages and disadvantages:

Advantages
- Vector data models tend to be better representations of reality due to the accuracy and
precision of points, lines, and polygons over the regularly spaced grid cells of the raster
model. As a result, vector data also tend to be more aesthetically pleasing than raster data.
- Vector data provide an increased ability to alter the scale of observation and analysis: each
coordinate pair associated with a point, line, or polygon represents an infinitesimally exact
location, so zooming does not degrade the view of the graphic as it does with a raster.
- Vector data tend to be more compact in data structure, so file sizes are typically much
smaller than their raster counterparts.

Disadvantages
The data structure tends to be much more complex than the simple raster data model. As the
location of each vertex must be stored explicitly in the model, there are no shortcuts for storing
data like there are for raster models (e.g., the run-length and quad-tree encoding methodologies).
The implementation of spatial analysis can also be relatively complicated due to minor
differences in accuracy and precision between the input datasets. Similarly, the algorithms for
manipulating and analyzing vector data are complex and can lead to intensive processing
requirements, particularly when dealing with large datasets.

Satellite Imagery and Aerial Photography


Satellites can be active or passive.

- Active satellites make use of remote sensors that detect reflected responses from objects
that are irradiated from artificially generated energy sources. For example, active sensors
such as radars emit radio waves, laser sensors emit light waves, and sonar sensors emit
sound waves. In all cases, the sensor emits the signal and then calculates the time it takes for
the returned signal to “bounce” back from some remote feature. Knowing the speed of the
emitted signal, the time delay from the original emission to the return can be used to
calculate the distance to the feature.
- Passive satellites, alternatively, make use of sensors that detect the reflected or emitted
electromagnetic radiation from natural sources. This natural source is typically the energy
from the sun, but other sources can be imaged as well, such as magnetism and geothermal
activity. Using an example we’ve all experienced, taking a picture with a flash-enabled
camera would be active remote sensing, while using a camera without a flash (i.e., relying on
ambient light to illuminate the scene) would be passive remote sensing

The quality and quantity of satellite imagery are determined by its resolution, and improving one
type of resolution often necessitates a reduction in another. For example, an increase in spatial
resolution is typically associated with a decrease in spectral resolution. Similarly, geostationary
satellites (those that circle the earth proximal to the equator once each day) yield high temporal
resolution but low spatial resolution, while sun-synchronous satellites (those that synchronize a
near-polar orbit of the sensor with the sun’s illumination) yield low temporal resolution while
providing high spatial resolution:

- Spatial resolution: the ground area represented by each pixel, measured along one side of
the square cell (see above)
- Spectral resolution: The spectral resolution is determined by the interval size of the
wavelengths and the number of intervals being scanned. Multispectral and hyperspectral
sensors are those sensors that can resolve a multitude of wavelength intervals within the
spectrum. For example, the IKONOS satellite resolves images for bands at the blue (445–516
nm), green (506–595 nm), red (632–698 nm), and near-infrared (757–853 nm) wavelength
intervals on its 4-meter multispectral sensor
- Temporal resolution: the amount of time between image collection periods, determined by
the repeat cycle of the satellite’s orbit
- Radiometric resolution: refers to the sensitivity of the sensor to variations in brightness and
specifically denotes the number of grayscale levels that can be imaged by the sensor.
Typically, the available radiometric values for a sensor are 8-bit (yielding values that range
from 0–255, i.e., 256 unique values or 2^8 values); 11-bit (0–2,047); 12-bit (0–4,095); or 16-
bit (0–65,535).
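The number of grayscale levels follows directly from the bit depth (levels = 2^bits); a one-line check, with an illustrative function name:

```python
def radiometric_levels(bits):
    """Number of grayscale levels an n-bit sensor can record (2 ** bits)."""
    return 2 ** bits

# The standard bit depths from the notes above:
for bits in (8, 11, 12, 16):
    print(bits, "bit:", radiometric_levels(bits),
          "values, range 0 to", radiometric_levels(bits) - 1)
```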

Another form of remote sensing is aerial photography. While aerial photography connotes images
taken of the visible spectrum, sensors to measure bands within the nonvisible spectrum (e.g.,
ultraviolet, infrared, near-infrared) can also be fixed to aerial sources. However, care must be taken
with aerial photographs as the sensors used to take the images are similar to cameras in their use of
lenses. These lenses add a curvature to the images, which becomes more pronounced as one moves
away from the center of the photo.

Another source of potential error in an aerial photograph is relief displacement. This error arises
from the three-dimensional aspect of terrain features and is seen as apparent leaning away of
vertical objects from the center point of an aerial photograph. To imagine this type of error, consider
that a smokestack would look like a doughnut if the viewing camera was directly above the feature.
However, if this same smokestack was observed near the edge of the camera’s view, one could
observe the sides of the smokestack. This error is frequently seen with trees and multistory buildings
and worsens with increasingly taller features.

Orthophotos are vertical photographs that have been geometrically “corrected” to remove the
curvature and terrain-induced error from images. The most common orthophoto product is the
digital ortho quarter quadrangle (DOQQ)

Remote sensing and electromagnetic energy


A remote sensing system using electromagnetic energy (all energy that moves with the velocity of
light in a harmonic wave pattern, i.e., radiation) basically consists of four components:

- A source of electromagnetic energy; this is most often the sun’s energy reflected by the
surface of the studied object or the heat emitted by the earth itself, but it can also be an
artificial man-made source of energy such as microwave radar.
- Atmospheric interaction, electromagnetic energy passing through the atmosphere (sun to
earth’s surface and earth’s surface to the sensor) is distorted, absorbed and scattered.
- Earth’s surface interaction, the intensity and characteristics of electromagnetic radiation
reflected from or emitted by the earth’s surface is a function of the characteristics of the
objects at the surface and a function of wavelength.
- The sensor, a man-made sensor is used to register emitted or reflected electromagnetic
energy (camera, radiometer, CCD). The characteristics and the quality of the sensor
determine how well the radiation can be recorded.

Electromagnetic waves are described using three measurements:

- Wavelength (λ): the distance between successive wave peaks, often expressed in
micrometers or nanometers;
o Visible wavelengths (R, G, B) between 0.4 and 0.7 micrometer
o Near-infrared between 0.7 and 1.1 micrometer
o Short-wave infrared between 1.1 and 2.5 micrometer
o Thermal infrared between 3 and 14 micrometer
o Microwave region between 1 millimeter and 1 meter
- Frequency (f): the number of wave peaks passing a fixed point per unit time
- Velocity (c): the speed of light, a constant 3 × 10^8 m/s

Velocity, wavelength and frequency are related to each other by c = λ · f
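The relation c = λ · f can be rearranged to recover frequency from wavelength; a minimal sketch (the 0.55 micrometer green-light input is an assumed example):

```python
C = 3.0e8  # speed of light, m/s

def frequency_from_wavelength(wavelength_m):
    """f = c / lambda, with wavelength given in meters."""
    return C / wavelength_m

# Green light at 0.55 micrometer:
print(frequency_from_wavelength(0.55e-6))  # ~5.45e14 Hz
```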


Electromagnetic energy that encounters matter, whether solid, liquid or gas, is called incident
radiation. Interactions with matter can change the intensity, direction, wavelength, polarization and
phase of the incident radiation. The information of interest, for example, land cover or soil type, is
deduced from the changes caused by this radiation-matter interaction:

- Transmission
- Absorption
- Emission (radiation absorbed by the substance, converted to heat, and re-emitted)
- Scattering
- Reflection

Working with different wavelengths can give a different perspective on the land use of an area.
This is often done using multi-spectral scanners. Two advantages are:

- Because objects at the surface of the earth have varying reflection behaviour through the
optical spectrum, they can be recognized and/or identified more easily using several spectral
bands than using just one band.
- A large number of objects do not reflect radiation very well in the visible part of the
spectrum. Remote sensing observations outside the visible wavelengths or in combination
with observations in the visible spectrum produce a much more contrasting image, which is
helpful to identify objects or to determine their condition

Image correction and analysis


Radiometric correction is an image restoration operation. Image restoration techniques aim at
compensating for errors, noise and distortion introduced by scanning, transmitting and recording
image data. Sources of radiometric distortion are the atmosphere and the instruments used to
register the image data.

Generally, two steps are therefore necessary for radiometric correction: 1) correction for gain
and offset of the detector and 2) correction for the atmospheric conditions at the time of data
acquisition (correcting for distortion of the measured radiance by aerosols, moisture, gasses etc.
in the atmosphere).
Several analysis methods exist:

- Image enhancement; an image can be enhanced in contrast using its histogram, which
contains information on the brightness range in use. For example, if an image only occupies
the 60 to 158 value range, these image values can be stretched over the entire spectrum of
display values from 0 to 255.
- Digital filtering; Using a moving kernel for edge detection or smoothing of the image.
- Spectral ratioing; an enhancement technique to combine pixel values from different spectral
bands. Ratio images are prepared by dividing the digital numbers in one band by the
corresponding digital numbers in another band for each pixel, stretching the resulting values,
and plotting the new values as an image. For example, sunlit and shaded areas produce
different values in the original spectral bands, but have the same spectral brightness
after ratioing.
- Digital image transformation
o Principal component analysis; reduce the redundancy of information in spectral data
o Tasselled cap transformation; rotates the spectral bands into components such as
brightness, greenness and wetness
o Linear spectral unmixing; The reflectance at each pixel of the image is assumed to be
a linear combination of the reflectance of each material (or endmember) present
within the pixel. For example, if 25% of a pixel contains material A, 25% of the pixel
contains material B, and 50% of the pixel contains material C, the spectrum for that
pixel is a weighted average of 0.25 times the spectrum of material A, plus 0.25 times
the spectrum of material B, plus 0.5 times the spectrum of material C
- Image classification; the process of assigning pixels to classes. Usually each pixel is
treated as an individual unit composed of values in several spectral bands. Classification of
each pixel is based on the match of the spectral signature of that pixel with a set of reference
spectral signatures.
o Supervised classification; image analyst specifies numerical descriptors of the various
land cover types
o Unsupervised classification; an algorithm clusters pixels based on statistical
properties of the pixel values
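The linear spectral unmixing example above (25% material A, 25% B, 50% C) reduces to a weighted average of the endmember spectra; a minimal sketch in which the four-band endmember spectra are hypothetical values:

```python
# Hypothetical endmember reflectance spectra in four bands
material_a = [0.10, 0.20, 0.30, 0.40]
material_b = [0.50, 0.40, 0.30, 0.20]
material_c = [0.20, 0.20, 0.60, 0.80]

def mixed_pixel(fractions, spectra):
    """Forward linear mixture: the pixel spectrum is the fraction-weighted
    average of the endmember spectra, band by band."""
    n_bands = len(spectra[0])
    return [sum(f * s[band] for f, s in zip(fractions, spectra))
            for band in range(n_bands)]

pixel = mixed_pixel([0.25, 0.25, 0.50], [material_a, material_b, material_c])
print(pixel)  # ≈ [0.25, 0.25, 0.45, 0.55]
```

Unmixing proper runs this model in reverse, solving for the fractions given an observed pixel spectrum and known endmembers.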

Image interpretation
To interpret images, the spectral behaviour of four different surface types needs to be
analysed:

- Vegetation; The reflectance of vegetation in the visible wavelengths (0.43 - 0.66 μm) is
generally small and reflection in near infrared (0.7 - 1.1 μm) is generally large. Reflectance is
determined by pigmentation, physiological structure and water content.
- Soil; The most important properties for reflectance are moisture content, organic matter
content, texture, structure and mineral composition. The spectral curve is less peaky than
that of vegetation.
- Crops
- Water; Higher water content means less reflection and more absorption

Imaging spectroscopy refers to the acquisition of images in many, very narrow, contiguous spectral
bands in the optical part of the spectrum.

Geospatial Data Acquisition


The most common datatypes available are:
- Alphanumeric string (text)
- Numbers
o Integer is any number that doesn’t contain decimal digits, e.g., a short integer of
16 bits (2^16 values) or a long integer of 32 bits (2^32 values)
o Floating point is any number that contains decimal digits
- Boolean
- Date
- Binary

Data can be grouped within a measurement scale. These scales can be grouped into two categories:

- Categorical data
o Nominal (no scalar comparison between data, e.g., cities, eye color)
o Ordinal (describes the position, e.g., first, second, unsatisfied)
- Numeric data
o Interval (no meaningful zero value, e.g., temperature)
o Ratio (based around a meaningful zero value, e.g., population density)

Primary data capture


Vector data is acquired through GPS systems, where the X and Y coordinates are obtained and
supplemented with attribute descriptions, for example, telephone pole. The GPS points can be linked
together to form lines or polygons.

Raster data is acquired through remote sensing (satellites or aerial photography). Remote sensing
has the advantage of obviating the need for physical access to the area being imaged, and huge
tracts of land can be characterized with little to no additional time and labor by the researcher.
However, the sensor needs to be checked for proper functioning and must be properly calibrated.

Secondary data capture


Secondary data capture is an indirect methodology that utilizes the vast amount of existing
geospatial data available in both digital and hard-copy formats. There are three methods to digitize
hard-copy data:

- Tablet digitizing (holding the map behind a lit tablet and using a ‘puck’ as a mouse to add
attributes and coordinates);
- Heads-up digitizing (scanning the map and editing it digitally);
- The third, automated method of secondary data capture requires the user to scan a paper
map and vectorize the information therein. This vectorization method typically requires a
specific software package that can convert a raster scan to vector lines.

Geospatial Database Management


Databases exist in different types:

- Flat type database (for example, an Excel spreadsheet)


- Hierarchical database (for example a tree with branches and leaves)
- Network database (for example an organization where multiple tuples can be connected with
each other)
- Relational database where each table is linked to other tables through keys.
o Primary key; The primary key represents the attribute (column) whose value
uniquely identifies a particular record (row) in the relation (table).
o Foreign key; an attribute in a secondary table (and possibly third, fourth, fifth, etc.)
that corresponds to the primary key of another table is called a foreign key
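The primary/foreign key relationship can be sketched with Python's built-in sqlite3 module; the `owners` and `parcels` tables and their contents are hypothetical:

```python
import sqlite3

# In-memory sketch: parcels.owner_id is a foreign key referencing owners.id
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE owners  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parcels (id INTEGER PRIMARY KEY, area_m2 REAL,
                          owner_id INTEGER REFERENCES owners(id));
    INSERT INTO owners  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO parcels VALUES (10, 520.0, 1), (11, 310.0, 1), (12, 880.0, 2);
""")

# The key link lets us join the two tables: parcels per owner
rows = con.execute("""
    SELECT owners.name, COUNT(*) FROM parcels
    JOIN owners ON parcels.owner_id = owners.id
    GROUP BY owners.name ORDER BY owners.name
""").fetchall()
print(rows)  # [('Alice', 2), ('Bob', 1)]
```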

Data Quality
Two primary attributes characterize data quality:

- Accuracy describes how close a measurement is to its actual value and is often expressed as
a probability (e.g., 80 percent of all points are within +/- 5 meters of their true locations).
- Precision refers to the variance of a value when repeated measurements are taken. A watch
may be correct to 1/1000th of a second (precise) but may be 30 minutes slow (not
accurate).

Several types of error can arise when accuracy and/or precision requirements are not met during
data capture and creation. Positional accuracy is the probability of a feature being within +/− units of
either its true location on earth (absolute positional accuracy) or its location in relation to other
mapped features (relative positional accuracy)

The five types of error in a geospatial dataset are related to:

- Positional accuracy: Errors can arise while registering the map on the digitizing board. A
paper map can shrink, stretch, or tear over time, changing the dimensions of the scene
- Attribute accuracy: Attribute errors can occur when an incorrect value is recorded within the
attribute field or when a field is missing a value.
- Temporal accuracy: Temporal accuracy addresses the age or timeliness of a dataset. No
dataset is ever completely current. In the time it takes to create the dataset, it has already
become outdated.
- Logical consistency: Logical consistency requires that the data are topologically correct. For
example, do roadways connect at nodes?
- Data completeness: Are all of the counties in the state represented?

Vector data: Single layer analysis


Spatial analysis is a fundamental component of a GIS that allows for an in-depth
study of the topological and geometric properties of a dataset or datasets. There are several tools to
use for analyzing spatial data:

- Buffering; creating an output polygon layer containing a zone of a specified width around an
input point, line, or polygon feature. There are several buffering options available to
accomplish different goals:
o Variable width buffer;
o Dissolve, or not dissolve, the boundaries between buffers;
o Multiple ring buffers (for example, different security zones);
o Doughnut buffer (buffer area that does not include the area inside the buffered
polygon)
o Setback buffers (buffer the area inside the polygon);
o Line buffers can be placed either on the left or on the right of the line, with
different end caps, for example, a half circle or a rectangle
- Geoprocessing operations (manipulate GIS data). Several tools are available:
o Dissolve (combines adjacent polygon features in a single feature dataset if they
share a common attribute);
o Append (creates an output polygon layer by combining the spatial
extent of two or more layers)
o Select (creates an output layer based on a user-defined query that
selects particular features from the input layer)
o Merge (combines features within a point, line, or polygon
layer into a single feature with identical attribute information)
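A point buffer can be approximated as a polygon of vertices around the point; a minimal sketch (function names, the 64-vertex approximation, and the 100 m radius are assumptions, and real GIS packages use more general algorithms for lines and polygons):

```python
import math

def buffer_point(x, y, radius, n=64):
    """Approximate a circular buffer around a point as an n-vertex polygon."""
    return [(x + radius * math.cos(2 * math.pi * i / n),
             y + radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

ring = buffer_point(0.0, 0.0, 100.0)
print(polygon_area(ring))  # close to pi * r^2 as n grows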

Vector data: Multiple layer analysis


Among the most powerful and commonly used tools in a geographic information
system (GIS) is the overlay of cartographic information. In a GIS, an overlay is the
process of taking two or more different thematic maps of the same area and placing
them on top of one another to form a new map.

There are several vector overlay operations available in a GIS software package:

- point-in polygon (The point-in-polygon overlay operation requires a point input layer and a
polygon overlay layer. Upon performing this operation, a new output point layer is returned
that includes all the points that occur within the spatial extent of the overlay)
- polygon-on-point (In this case, the polygon layer is the input, while
the point layer is the overlay)
- line-on-line (A line-on-line overlay operation requires line features for both the input and
overlay layer. The output from this operation is a point or points located precisely
at the intersection(s) of the two linear datasets)
- line-in-polygon (In this case, each line that has any part of its extent within the overlay
polygon layer will be included in the output line layer, although these lines will be
truncated at the boundary of the overlay)
- polygon-on-line
- polygon-on-polygon (This is the most commonly used overlay operation. Using this
method, the polygon input and overlay layers are combined to create an output
polygon layer with the extent of the overlay)
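The point-in-polygon operation above can be sketched with the standard ray-casting test; the square polygon and sample points below are hypothetical:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: does point (x, y) fall inside the polygon?

    A ray cast from the point crosses the polygon boundary an odd
    number of times if and only if the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the point's y level
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(2, 2), (5, 1), (1, 3)]
print([p for p in points if point_in_polygon(*p, square)])  # [(2, 2), (1, 3)]
```

A point-in-polygon overlay applies this test to every point in the input layer against every polygon in the overlay layer.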

Depending on the operators utilized, the output layer will be either a union, intersect,
symmetrical difference, identity, clip, erase or split:

- Union; requires both the input and overlay layers to be polygons
- Intersect; the input is a point, line or polygon layer, while the overlay layer is a polygon
- Symmetrical difference; requires both layers to be polygons
- Identity; the input can be points, lines or polygons, but the overlay layer needs to be
polygon. This creates an output layer with the spatial extent of the input layer but includes
attribute information from the overlay

Two different types of error can occur when an overlay of two layers is performed:

- Slivers; a common error produced when two slightly misaligned vector layers are overlain,
for example, when two polygons have been created from field data, satellite imagery or
aerial photography at two different times.
- Error propagation; arises when inaccuracies present in the original input and overlay
layers are propagated through to the output layer

Raster data: single and multilayer analysis


One of the first steps performed on raster data is reclassification. This is the single layer
process of assigning a new class or range value to all pixels in the dataset based on their
original values. For example, an elevation grid commonly contains a different value for nearly every
cell within its extent. These values could be simplified by aggregating the pixel values into a few
discrete classes (e.g., 0–100 = “1,” 101–200 = “2,” 201–300 = “3,” etc.). This simplification results in
fewer unique values and smaller storage requirements.
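The reclassification scheme above (0–100 = 1, 101–200 = 2, 201–300 = 3) can be sketched as follows; the sample grid is hypothetical, and this sketch simply lumps everything above 200 into class 3 rather than continuing the ladder:

```python
def reclassify(elevation):
    """Map an elevation value (m) onto a discrete class code."""
    if elevation <= 100:
        return 1
    elif elevation <= 200:
        return 2
    else:
        return 3   # a real scheme would keep adding classes (301-400 = 4, ...)

grid = [[35, 120, 250],
        [90, 180, 305]]
print([[reclassify(v) for v in row] for row in grid])  # [[1, 2, 3], [1, 2, 3]]
```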

Another single layer analysis tool is creating a buffer around a single cell or a group of cells sharing
the same values. Whereas vector data allow a buffer at a precise distance from a given centre point,
raster data approximate that distance in whole-cell increments.

Raster data can also be analyzed using overlay methods:

- Clip: a raster layer is used as input in combination with a vector polygon layer to
create a clipped raster output layer.
- Mathematical raster overlay: two raster layers can be combined using mathematical
operations, for example, summation, multiplication, division, mean, etc., to create a new
raster output layer (often used in risk assessment)
- Boolean raster overlay: The output raster contains a 1 if true and a 0 if false
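Mathematical and Boolean overlays operate cell by cell on aligned rasters; a minimal sketch in which the two risk layers and the threshold of 4 are hypothetical:

```python
# Two aligned raster layers (same rows and columns), e.g. hazard scores
flood_risk = [[1, 3], [2, 4]]
slope_risk = [[2, 1], [4, 2]]

# Mathematical overlay: cell-by-cell summation of the two layers
combined = [[a + b for a, b in zip(r1, r2)]
            for r1, r2 in zip(flood_risk, slope_risk)]

# Boolean overlay: 1 where the combined risk exceeds a threshold, else 0
high_risk = [[1 if v > 4 else 0 for v in row] for row in combined]

print(combined)   # [[3, 4], [6, 6]]
print(high_risk)  # [[0, 0], [1, 1]]
```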

Raster analyses can be undertaken on four different scales of operation: local, neighborhood, zonal,
and global:

- Local: calculations are done on each individual cell. For example, if it is preferred to
represent elevations in meters rather than feet, a simple arithmetic transformation (original
elevation in feet * 0.3048 = new elevation in meters) of each cell value can be performed
locally to accomplish this task.
- Neighborhood: Neighborhood functions examine the relationship of an object with similar
surrounding objects. They can be performed on point, line, or polygon vector datasets as well
as on raster datasets. Raster analyses employ moving windows, also called filters or kernels,
to calculate new cell values for every location throughout the raster layer’s extent
o Neighborhood operations are commonly used for data simplification on raster
datasets. An analysis that averages neighborhood values would result in a smoothed
output raster with dampened highs and lows as the influence of the outlying data
values are reduced by the averaging process
- Zonal: A zonal operation is employed on groups of cells of similar value or like features,
not surprisingly called zones
- Global: Global raster operations examine the entire areal extent of the dataset
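The local (feet-to-meters) and neighborhood (moving-window average) operations above can be sketched as follows; the elevation grid is hypothetical:

```python
def feet_to_meters(raster):
    """Local operation: transform each cell independently."""
    return [[v * 0.3048 for v in row] for row in raster]

def mean_filter(raster):
    """Neighborhood operation: 3x3 moving-window average.

    Edge cells simply average over the cells that exist in the window.
    """
    rows, cols = len(raster), len(raster[0])
    out = []
    for i in range(rows):
        out.append([])
        for j in range(cols):
            window = [raster[a][b]
                      for a in range(max(0, i - 1), min(rows, i + 2))
                      for b in range(max(0, j - 1), min(cols, j + 2))]
            out[i].append(sum(window) / len(window))
    return out

elev_ft = [[100, 200, 100],
           [200, 900, 200],
           [100, 200, 100]]
print(mean_filter(elev_ft)[1][1])  # the 900 outlier is dampened to ~233.3
```

This shows the smoothing effect described above: averaging reduces the influence of outlying values.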

Raster data: Spatial interpolation


Interpolation is used to estimate the value of a variable at an unsampled location from measurements made at nearby or neighboring locales. To do so, surfaces are needed. One way to create surfaces from a point layer is using Thiessen polygons. Thiessen polygons are mathematically generated areas that define the sphere of influence around each point in the dataset relative to all other points. Boundary lines are drawn exactly halfway between each pair of neighboring points, and from these lines polygons are formed such that every location within a polygon is closer to that polygon's point than to any other point.
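The nearest-point rule behind Thiessen polygons can be sketched directly: assign every location to its closest sample point. The sketch below classifies a coarse grid against three hypothetical points; a true Thiessen (Voronoi) layer is the vector-polygon equivalent of this assignment.

```python
# Minimal sketch of the Thiessen idea: every location belongs to the zone
# of its nearest sample point. Points and grid are illustrative.
import math

points = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (2.0, 4.0)}

def nearest_point(x, y):
    # Return the label of the sample point closest to (x, y).
    return min(points, key=lambda k: math.dist((x, y), points[k]))

# Classify a coarse 5x5 grid of locations into zones.
zones = [[nearest_point(x, y) for x in range(5)] for y in range(5)]
print(zones[0])  # bottom row: left cells fall in A's zone, right cells in B's
```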

The three basic methods used to create interpolated surfaces are spline, inverse distance weighting (IDW), and trend surface. The spline interpolation method forces a smoothed curve through the set of known input points to estimate the unknown, intervening values. IDW interpolation estimates the values of unknown locations using the distances to proximal, known values. The weight placed on each proximal value is in inverse proportion to its spatial distance from the target locale; therefore, the farther the proximal point, the less weight it carries in defining the target point's value. Finally, trend surface interpolation is the most complex method, as it fits a multivariate statistical regression model to the known points, assigning a value to each unknown location based on that model.
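The IDW weighting scheme described above can be written out as a short sketch. The sample points and the power parameter (p = 2, a common default) are illustrative assumptions, not values from the text.

```python
# Sketch of inverse distance weighting (IDW): the estimate at an unsampled
# location is a weighted mean of the known values, with weights 1/d**p so
# that nearer points carry more influence. Data and p=2 are assumptions.
import math

samples = [((0.0, 0.0), 10.0), ((4.0, 0.0), 20.0), ((0.0, 4.0), 30.0)]

def idw(x, y, power=2):
    num = den = 0.0
    for (sx, sy), value in samples:
        d = math.dist((x, y), (sx, sy))
        if d == 0:                 # exactly on a sample point: return it
            return value
        w = 1.0 / d ** power       # weight falls off with distance
        num += w * value
        den += w
    return num / den

print(round(idw(1.0, 1.0), 2))  # → 14.29, dominated by the nearby value 10
```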

Raster data: Terrain mapping


Surface analysis is often referred to as terrain (elevation) analysis when information related to slope, aspect, viewshed, hydrology, volume, and so forth is calculated on raster surfaces such as DEMs (digital elevation models).

- Slope maps: They are typically created by fitting a planar surface to a 3-by-3 moving window around each target cell. By dividing the vertical distance within the window (measured as the difference between cell values) by the horizontal distance across the moving window (which is determined by the spatial resolution of the raster image), the slope is relatively easily obtained.
- Aspect maps: Any cell that exhibits a slope must, by definition, be oriented in a known direction. This orientation is referred to as aspect. It is coded either categorically (north, northeast, east, etc.) or as an azimuth between 1° and 360°.
- Hillshade map: represents the illumination of a surface from some hypothetical, user-defined
light source (presumably, the sun)
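The rise-over-run idea behind slope maps can be sketched for a single 3-by-3 window. This is a simplified illustration (real GIS packages fit a plane to all nine cells, e.g. via Horn's method); the window values and cell size are invented.

```python
# Simplified slope estimate for the central cell of a 3x3 window: take the
# elevation change per unit horizontal distance (rise over run) along x and
# y with central differences, then convert to degrees. Values are invented.
import math

window = [
    [100.0, 101.0, 102.0],
    [100.0, 101.0, 102.0],
    [100.0, 101.0, 102.0],
]
cell_size = 10.0  # raster resolution in meters

# Vertical change divided by horizontal distance, through the middle cell.
dz_dx = (window[1][2] - window[1][0]) / (2 * cell_size)
dz_dy = (window[2][1] - window[0][1]) / (2 * cell_size)

slope_deg = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
print(round(slope_deg, 2))  # → 5.71 (a 10% grade along x)
```

The same gradients (dz_dx, dz_dy) also yield the aspect, since their direction tells which way the cell faces.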

Point in Polygon & Intersect


Finding out whether a certain point is located inside or outside an area, or whether a line intersects another line or polygon, are fundamental geospatial operations that are often used, e.g., to select data based on location. Such spatial queries are among the typical first steps of a spatial analysis workflow. Performing a spatial join (introduced later) between two spatial datasets is one of the most typical applications of a point-in-polygon (PIP) query.
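A PIP query can be implemented with the classic ray-casting algorithm: cast a ray from the point to the right and count how many polygon edges it crosses; an odd count means the point is inside. The coordinates below are illustrative.

```python
# Point-in-polygon via ray casting: count edge crossings of a rightward ray.

def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point,
        # at an x-coordinate to the right of the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside   # odd number of crossings = inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # → True (inside)
print(point_in_polygon(5, 2, square))  # → False (outside)
```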

Spatial Join
Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis. Attribute tables can be extended with information from other layers. In order to do a spatial join, the two layers need to be in the same coordinate reference system (CRS).
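The attribute transfer can be sketched for the common point-to-polygon case: each point record receives an attribute from the polygon containing it. The layers and the "district" attribute are hypothetical, and both layers are assumed to already share one CRS; in practice a GIS library would do this with a spatially indexed join.

```python
# Sketch of a point-to-polygon spatial join using a point-in-polygon test
# (ray casting). Layers and the "district" attribute are invented.

def contains(polygon, x, y):
    # Ray-casting point-in-polygon test.
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

polygons = [
    {"district": "North", "geometry": [(0, 2), (4, 2), (4, 4), (0, 4)]},
    {"district": "South", "geometry": [(0, 0), (4, 0), (4, 2), (0, 2)]},
]
points = [{"id": 1, "xy": (1, 3)}, {"id": 2, "xy": (3, 1)}]

# The join: extend each point's attributes with the containing polygon's.
for pt in points:
    for poly in polygons:
        if contains(poly["geometry"], *pt["xy"]):
            pt["district"] = poly["district"]
            break

print(points)
```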

Nearest Neighbour
One commonly used GIS task is finding the nearest neighbour for an object or a set of objects. For instance, you might have a single Point object representing your home location and another set of locations representing, e.g., public transport stops. A typical question is then "which of the stops is the closest one to my home?" This is a typical nearest neighbour analysis, where the aim is to find the geometry closest to another geometry.
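The home/stops question can be answered with a brute-force search: compare the home location against every stop and keep the closest. The coordinates are illustrative (for real lat/lon data a geographic distance would be needed, and Euclidean distance would not be appropriate).

```python
# Brute-force nearest-neighbour search over a small set of assumed stops.
import math

home = (2.0, 2.0)
stops = {"stop_a": (0.0, 0.0), "stop_b": (3.0, 2.0), "stop_c": (5.0, 5.0)}

# Keep the stop whose Euclidean distance to home is smallest.
nearest = min(stops, key=lambda name: math.dist(home, stops[name]))
print(nearest)  # → stop_b
```

This scales linearly with the number of stops, which motivates the spatial indexes discussed next.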

The method of using centroids to determine the nearest neighbours is quite slow from a performance point of view. Luckily, there is an easy and widely used solution called a spatial index that can significantly boost the performance of spatial queries. Various techniques have been developed to speed up spatial queries, but one of the most popular and widely used is a spatial index based on the R-tree data structure. The core idea behind the R-tree is to form a tree-like data structure where nearby objects are grouped together and their geographical extent (minimum bounding box) is inserted into the data structure (i.e. the R-tree). This bounding box then represents the whole group of geometries as one level (typically called a "page" or "node") in the data structure. This process is repeated several times, producing a tree-like structure in which the different levels are connected to each other. This structure makes queries for a single object much faster, as the algorithm does not need to traverse all geometries in the data.
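The pruning idea can be made concrete with a single-level toy version: geometries are grouped, each group is summarized by its minimum bounding box, and a query first checks the group boxes so most geometries are never examined. A real R-tree nests such groups into multiple levels; all names and extents below are invented.

```python
# Toy illustration of the bounding-box pruning behind an R-tree (one level
# only). Groups, member names, and extents are hypothetical.

groups = {
    "west": {"bbox": (0, 0, 4, 4), "members": ["road_1", "road_2"]},
    "east": {"bbox": (10, 0, 14, 4), "members": ["road_3", "road_4"]},
}

def bbox_contains(bbox, x, y):
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax

def candidates(x, y):
    # Only members of groups whose bounding box contains the query point
    # need an exact (and more expensive) geometric test afterwards.
    result = []
    for group in groups.values():
        if bbox_contains(group["bbox"], x, y):
            result.extend(group["members"])
    return result

print(candidates(1, 1))   # → ['road_1', 'road_2']  (east group never touched)
print(candidates(12, 1))  # → ['road_3', 'road_4']
```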

LiDAR technology
Airborne light detection and ranging (LiDAR) has become an industry-standard tool for collecting accurate and dense topographic data at very high speed. A LiDAR sensor generates short laser pulses, transmits them to the ground, and receives the returned signal to measure the travel time.
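From the measured travel time, the range to the ground follows directly: the pulse covers the distance twice (down and back), so distance = speed of light × time / 2. The round-trip time below is an illustrative value.

```python
# Range from a LiDAR pulse's two-way travel time: d = c * t / 2.
# The 10-microsecond round-trip time is an assumed example value.

C = 299_792_458.0      # speed of light in m/s
round_trip_s = 10e-6   # measured two-way travel time in seconds

distance_m = C * round_trip_s / 2
print(round(distance_m, 1))  # → 1499.0 (roughly 1.5 km to the ground)
```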

LiDAR data, such as x, y, z coordinates, GNSS time, and associated color information (R, G, B), are stored in a .las file. The content of the data file is a geometric representation. The data accuracy is computed using the RMSE (root mean square error). This is used to determine the:

- Vertical accuracy: comparing LiDAR points with ground-surveyed points
- Planimetric (horizontal) accuracy
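The vertical-accuracy check amounts to an RMSE between LiDAR elevations and ground-surveyed control elevations at the same locations. The elevation values below are invented for illustration.

```python
# RMSE between LiDAR elevations and surveyed control elevations (meters);
# all values are invented example data.
import math

lidar_z =  [10.2, 15.1, 20.3, 25.0]
survey_z = [10.0, 15.0, 20.0, 25.2]

# Root of the mean squared difference over all checkpoints.
rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(lidar_z, survey_z))
                 / len(lidar_z))
print(round(rmse, 3))  # → 0.212
```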

LiDAR data can be visualized using several techniques:

- Projection of the 3D data to a 2D raster
- Visualization of the data in 2.5D using a DSM (digital surface model) and DEM (digital elevation model) generated from the LiDAR data
- Generating a 2D triangulation and 3D tetrahedralization using Delaunay triangulation methods and visualizing the data in 2.5D

The approaches aimed at identification, extraction, and classification of objects in LiDAR data are primarily based on exploiting geometric properties. This is done as follows:

- Literature suggests that most approaches begin with outlier detection and removal, followed by "ground filtering". The filtering process labels the dataset into terrain and non-terrain points.
- The next step is to identify the individual roof planes of buildings. Several methods are available, for example, the Hough transform, RANSAC (random sample consensus), and clustering of surface normals.
- After the planes are identified, 3D models of the buildings are reconstructed with polyhedral models, modelling using the ground planes and the intersection of adjacent roof planes.
- Tree classification
- Road detection
Spatial Interpolation (lectures)
Interpolation techniques
Spatial interpolation is used to predict the value of a variable at a certain point (x and y coordinate) that was not sampled at that specific location.

There are several interpolation techniques to predict this unsampled value:

- Delaunay triangulation: The three nearest points form a triangle containing the unknown location, and the slope across the triangle is used to determine the value. A disadvantage of this method is that values change gradually within each triangle but break abruptly at the triangle edges.
- Thiessen polygons: Divides the area into polygons, where each entire polygon gets the value of the point inside it.
- Trend surface using linear regression (a 1st, 2nd, 3rd, etc. order polynomial can be fitted to the values).
- Inverse distance weighting

Spatial dependence
The interpolation techniques don’t account for the spatial dependence. Therefore geostatistics are
used. The basic concept is with a regionalized variable, which is a spatial variable that is dependent
on the surrounding, for example, a mountain top (size, shape, orientation and spatial arrangement of
observations determine the characteristics of the regionalized variable). The variable is continuous in
space and often there is only a sample available of the spatial observation, for example, a soil sample
for an area.

Spatial continuity is measured with the semivariance:

o The semivariance is a measure of the degree of spatial dependence between observations along a specific support; it is a function of the difference over distance. The semivariance increases as the distance between the measured points increases. Through the calculated semivariance points, a continuous curve can be obtained by fitting a semivariogram. The semivariogram has a couple of characteristics:
 Range: distance up to which spatial dependence exists
 Sill: variance around the mean of the observations
 Nugget: semivariance at h = 0 (theoretically equal to zero)
o There are different semivariogram models:
 Spherical
 Exponential
 Linear
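The empirical semivariance behind these models can be computed directly from a sample. For a 1-D transect and lag h, it is half the mean squared difference over all point pairs separated by h; with spatially dependent data it typically rises with h up to the range. The transect values below are invented.

```python
# Empirical semivariance for a 1-D transect of samples:
# gamma(h) = sum((z[i] - z[i+h])**2) / (2 * number_of_pairs).
# Sample values are invented for illustration.

z = [5.0, 6.0, 5.5, 7.0, 8.0, 7.5]

def semivariance(values, h):
    pairs = [(values[i], values[i + h]) for i in range(len(values) - h)]
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

for h in (1, 2, 3):
    # Semivariance grows with lag, reflecting weakening spatial dependence.
    print(h, round(semivariance(z, h), 3))
```

Fitting a model curve (spherical, exponential, or linear) through these points yields the semivariogram used by kriging.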
Optimal interpolation: Kriging
