Toponym Recognition in Historical Maps with Area/Linear Feature Alignment

Toby Baratta and Bo Wang
Grinnell College, Grinnell, Iowa, USA

I. INTRODUCTION

Maps reflect the geography of our Earth. Historical maps display changes in the natural surroundings, the movement of people, and the alterations of the Earth's surface that humans have accomplished. Historical maps in digitized form are becoming more accessible to scholars and the general public; however, the information in these maps remains largely unsearchable. Only a very limited number of historical maps are annotated with geographic coordinates obtained by modern satellite technologies, and there is a good reason for this scarcity: manually verifying and annotating toponyms (place names) on historical maps is a tedious task. Unlike other text sources such as books, the location of strings on a map is constrained by geography and cartographic conventions, and is thus highly unpredictable. Non-text elements, such as boundary lines, roads, and natural geographic features, create significant noise for traditional OCR techniques. Yet a well-annotated and searchable database of historical maps would save researchers hours of visually scrutinizing old maps for a place name and direct them to the geographic information of that place. In order to build this database, we aim to create a program that automatically extracts and accurately recognizes toponyms on historical maps. Weinman's work (2013) on historical maps successfully improves single-point toponym recognition by gazetteer alignment. In this paper, we discuss our work of integrating area/linear features into the existing system in order to expand the range of geographic features that we can accurately recognize. Our research question is whether incorporating shape coordinates reduces the recognition error. Geospatial vector data specifying the shape and location of lakes and rivers are added to the gazetteer, and we use a modified joint probability model to calculate the probabilities of possible toponyms via alignment between the maps and the new gazetteer, which includes area/linear shape coordinates.

Part II reviews the related work and the cartographic background of our research. Part III specifies the system, task, and environment of our algorithm. Part IV describes our experiments and analysis. Part V concludes our project and points to future research.

II. BACKGROUND

A. Related Works

Much research has been dedicated to automatic text recognition and information integration on maps. Gelernter et al. aim to enrich the gazetteer with user-geocoded data collected online, and their research provides insight on correctly matching geographic features and integrating lake/area shape information with the original gazetteer. Chen et al. use road intersection points to conflate geospatial datasets from various sources in order to identify the area that a certain map section covers. However, these projects do not concern historical maps, and the integrated geographic information was not used to optimize toponym recognition.

Weinman (2013) proposed a joint probability model that recalculates the likelihood of each toponym candidate for all the strings found in a map, based on the alignment between the map coordinates and the projection of the global georectification. [?] Our project follows the same idea, but the probability model is modified to include non-point features.

Weinman's 2013 paper relates most closely to what we have done, as our work is largely an expansion of his. We build on his work by adding the area features, updating the gazetteer to the 2014 version, and working with Weinman to devise a probability model that handles area and linear features.

B. Cartography

In order to recalculate the probabilities of toponym candidates, we need to project the known geographic features onto the historical maps and measure the accordance between the projection and the string locations on the map. As stated in Weinman (2013), the specific projection of a given historical map is often unknown; therefore we choose the same projection family. For the sake of control and comparison, we continue to use the Hassler polyconic projection. [?]

Another issue of concern is the varying datums of the Geographic Coordinate Systems (GCS) used by different sources of data.
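For concreteness, the simple (American) polyconic projection of the sphere can be sketched as follows. This is an illustrative Python sketch under the assumption of a spherical Earth with coordinates in radians; the authors' implementation uses Matlab's mapping routines, and this is not their code.

```python
import math

def polyconic(lat, lon, lat0=0.0, lon0=0.0):
    """Simple (American) polyconic projection on a unit sphere.

    lat, lon, lat0, lon0 are in radians; returns planar (x, y).
    Illustrative sketch only, not the paper's Matlab implementation.
    """
    if abs(lat) < 1e-12:
        # the equator maps to a straight horizontal line
        return (lon - lon0, -lat0)
    E = (lon - lon0) * math.sin(lat)
    cot = 1.0 / math.tan(lat)
    x = cot * math.sin(E)
    y = (lat - lat0) + cot * (1.0 - math.cos(E))
    return (x, y)

# The projection origin maps to the image-plane origin.
print(polyconic(0.73, -1.62, lat0=0.73, lon0=-1.62))  # prints (0.0, 0.0)
```

A sanity property of any such projection is that the chosen origin (lat0, lon0) lands at (0, 0) in the plane, which the check above confirms.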
A GCS is a system based on a datum that defines locations on the surface of the Earth. A datum is defined by a reference ellipsoid, which models the shape of the Earth, together with its position relative to the Earth's center. All sources of data in our project use the GCS North American 1983, based on the NAD83 datum with the GRS80 ellipsoid; therefore we need not consider the possible errors resulting from different reference systems. [?] The problem we face is that Matlab does not specifically define any built-in GCS, though it does include various reference ellipsoids: "grs80", "wgs84", etc. We choose the "grs80" ellipsoid, since NAD83 is based on this model, and we assume that the reference ellipsoid has properties which accord with the NAD83 datum. Since Matlab uses longitude and latitude to express locations on the Earth, we also assume that it uses the most common geocentric coordinate system. Since we do not require an extremely high level of accuracy, these assumptions serve our purpose well.

C. Data

1. Maps

In order to compare our results with Weinman's, we continue to use the manually annotated maps from the David Rumsey Historical Map Collection [1] for our experiments. The annotations have been checked and improved for our research, with higher accuracy and more complete coverage. From prior experiments, training and testing data were already separated within this collection. We then manually went through the testing database to find maps with a comparably large number of lakes on them. These maps vary considerably in the accuracy of the alignment and the distribution of lake shapes, so we think this relatively small sample will give us enough data to analyze the efficiency of our system. As [?] stated, which remains true throughout our experiment:

    For our experiments, we have annotated 12 maps from the David Rumsey collection [1], with each map's approximate region noted (i.e., the primary state or counties and any surrounding states having toponyms represented). A simple bound on coordinate ranges, as given by some online map collections, would also suffice to restrict the gazetteer ultimately used for recognition. Annotations in a simple XML format include tight piecewise linear baseline guides, text height and ground truth string for each toponym. In addition to the regional map shown in Figure 1, many are railroad and public survey maps. The average text height is 25 ± 16 pixels.

Most maps in our collection are at the scale of a single state; a few include several states or only part of one state. Notes on which states appear on a given map are used to restrict the gazetteer to only the states in question. Names of lakes and rivers are annotated, but the corresponding map coordinates of these area/linear features are only the primary points.

2. Gazetteer

In our experiment, we continue to use the Geographic Names Information System (GNIS) from the U.S. Board on Geographic Names (USBGN) as our authoritative gazetteer. For our purpose, in addition to the categories of Civil and Populated Place, we also include the categories of Lake, Swamp, Reservoir and Bay. The gazetteer was updated in 2014, and we use that version for our gazetteer creation and posterior ranking. However, our parse scores in all cases were computed using the 2010 gazetteer dataset. Given more time to run our tests with 2014 parses, we might see a slightly improved difference in our results, although further testing would be needed to confirm this.

However, the GNIS contains only primary single-point coordinates for area/linear features. In order to answer our research question, we need to link coordinate data for the area/linear features to the gazetteer. We searched open resources of geospatial vector data for lakes and rivers on the Internet and found the following two databases: The National Atlas (TNA) [?] shapefiles of lakes, and the National Oceanic and Atmospheric Administration (NOAA) [?] shapefiles of rivers. Among a number of different resources, we chose these two databases because they cover the largest number of water bodies, include sufficient geographic information on the water bodies, and provide shapefiles that are easy to work with in Matlab. The TNA and NOAA datasets contain feature shape information, but they do not contain the historical names of lakes and rivers. Since we are working with historical maps, it is important that we link the GNIS gazetteer and the shapefile databases. We compare the edit distance between the names of each lake/river in GNIS and in TNA/NOAA, as well as the geographic distance between certain points on the features, to match the water bodies. Details are specified in [?]. With the linking algorithm, we successfully matched 9531 lake shapefiles to the gazetteer.

III. APPROACH

A. Algorithm

1. System

Our probability model is based on the probability model in Weinman (2013), modified to include area features. It is a Bayesian network. The difference from the earlier model is that we include toponyms of lakes and thus integrate the area shapefiles into the model.

The key variables (strings, points and alignment) are retained in the model. Toponyms are assigned to different branches of the model based on their category in our gazetteer. Instead of using RANSAC to search for an alignment that maximizes the probability of a toponym, we use the ground truth alignment obtained from the manual
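The linking procedure described above can be sketched as follows. This is a hypothetical Python illustration, not the cited Matlab implementation; the similarity and distance thresholds are assumed values, and the real algorithm's details are specified in the cited work.

```python
from difflib import SequenceMatcher
import math

def edit_similarity(a, b):
    # normalized similarity in [0, 1]; 1.0 means identical names
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def geo_distance_km(p, q):
    # rough equirectangular distance, adequate for nearby points
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371.0 * math.hypot(x, y)

def link(gnis_entry, shapes, name_thresh=0.85, dist_thresh_km=5.0):
    """Return the best-matching shapefile record, or None.

    gnis_entry: (name, (lat, lon)); shapes: list of the same form.
    Thresholds are illustrative assumptions, not the paper's values.
    """
    name, point = gnis_entry
    best, best_sim = None, name_thresh
    for shape_name, centroid in shapes:
        if geo_distance_km(point, centroid) > dist_thresh_km:
            continue  # too far away to be the same water body
        sim = edit_similarity(name, shape_name)
        if sim >= best_sim:
            best, best_sim = (shape_name, centroid), sim
    return best

shapes = [("Rainbow Lake", (46.10, -88.55)), ("Round Lake", (45.00, -89.00))]
print(link(("Rainbow Lake", (46.11, -88.54)), shapes))
# prints ('Rainbow Lake', (46.1, -88.55))
```

Requiring both a small name edit distance and a small geographic distance prevents two distinct lakes that share a common name from being linked to each other.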
annotations of the maps. The probability model applies to each word image and the corresponding image coordinates found on a map. A flow chart explaining our process is shown in Figure 1.

Figure 1. Flow Chart of our Experimental System

Given a map image of a string, the prior probability of the recognition candidates P(y|W) is given by the OCR. The initial steps of processing the prior probabilities remain the same as in Weinman (2013). We assign an outlier probability to non-geographical text such as legends. The bias term γ (0 < γ < 1) is decided for each map based on the alignment. The probability for inliers is still a simple ratio with no bias, and we likewise assume no bias among the points where the geographic feature is labeled.

In order to maximize the probability of the correct toponym match, we recalculate the probabilities by finding an alignment between the map and real-world geographic data retrieved from existing databases. With the alignment, we calculate the inferred locations of the toponyms on the map and further adjust the likelihood of the toponyms by their distance from the label point p. The former model used a Gaussian probability to include the area around single-point features, considering only a certain value of the standard deviation (σ). ([?], 4) The new modified model assigns a uniform likelihood within the shape of the lake and a normal (Gaussian) probability distribution around the lake with a certain scale factor. The functions regarding this are specified in [?].

2. Task

The initial input for the whole system is images of strings on a map, which is controlled not by us but by the future users of our system. It should be clarified that this image input is different from the information included in a certain map: what the users decide is which particular part of which maps they want to learn about. The input for our part of the system is the prior probability generated by the OCR. The OCR system gives a list of possible recognition hypotheses for each string detected; in later steps we only consider the hypotheses on this list. With sufficient training data, the OCR parses the strings with a geolexicon and a regular dictionary. The geolexicon we use is also GNIS. Whether to include the full geolexicon or the partial geolexicon with only civil and populated places is a factor that could influence the results of our experiment; we address this factor later in the experiments. As an example of the prior probabilities: given an image of the string "Grinnell", the prior probabilities (parse scores) might be "Grinnell - 43%", "Cornell - 52%", and "Corn hell - 5%". That would be the pure output of the OCR model.

3. Environment

The environment is made up of the variables not controlled in the system or as a task. For us, this includes the geography of the Earth, the information on the maps, and the speed of processing and computation. The geography of the actual world is what we use to get our alignment for the re-ranking, but it is entirely out of our system's control; it is part of the environment in which our system operates. The information on the historical maps is likewise out of our system's control as well as the user's control; the data on the maps was created by the original mapmakers and thus is static. The CPU and the amount of RAM with which our computers are equipped is another external factor. We strive to optimize our code by vectorization and simplification in order to finish the calculation in a reasonable amount of time.

IV. EXPERIMENTS

1. Method

We tested our system on eight test maps, each showing a single state. These maps stayed consistent throughout our several tests. We ran five cases over these maps: limited parse, limited gazetteer, point feature treatment only; limited parse, full gazetteer, point feature treatment only; full parse, limited gazetteer, point feature treatment only; full parse, full gazetteer, point feature treatment only; and full parse, full gazetteer, point feature and area feature treatment. These are detailed below.

1.1 Parameters

Our system loads the coordinates of the lake centroid points from the gazetteer and replaces the point coordinates with polygon coordinates for linked lakes. We use the polyconic projection and an affine alignment found from single control points to project the spherical coordinates of the polygons into the image plane. The inlier bias γ is set at 0.5, since we assume that the distribution is normal.

Since map-makers throughout history adopted different geodetic standards (for example, different models of the Earth and various sphere-to-plane projection algorithms), we should expect error in the alignment and projection process between the historical maps and the contemporary lake shape data. In prior research, the model defined a Gaussian probability around the single-point features, since the toponyms are not always found exactly over the point coordinates.
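One simple way to picture the re-ranking step for point features is the following Python sketch, which multiplies the OCR prior by a Gaussian likelihood on the pixel distance between a string's label and the projected gazetteer point. The combination rule, the σ value, and all function names are assumptions for illustration; the paper's full Bayesian network is more elaborate than this.

```python
import math

def location_likelihood(label_xy, projected_xy, sigma=25.0):
    # isotropic Gaussian penalty on the pixel distance between a
    # string's label location and the projected gazetteer point
    dx = label_xy[0] - projected_xy[0]
    dy = label_xy[1] - projected_xy[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def rerank(candidates, label_xy, projected, sigma=25.0):
    """Rescore OCR hypotheses by prior * location likelihood.

    candidates: {name: prior}; projected: {name: (x, y) on the map}.
    Names with no gazetteer projection keep only their prior.
    Illustrative sketch; not the paper's exact model.
    """
    scores = {}
    for name, prior in candidates.items():
        loc = projected.get(name)
        factor = location_likelihood(label_xy, loc, sigma) if loc else 1.0
        scores[name] = prior * factor
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# priors echo the "Grinnell" example from the text
priors = {"Grinnell": 0.43, "Cornell": 0.52, "Corn hell": 0.05}
projected = {"Grinnell": (102.0, 98.0), "Cornell": (900.0, 40.0)}
post = rerank(priors, label_xy=(100.0, 100.0), projected=projected)
print(max(post, key=post.get))  # prints Grinnell
```

Even though "Cornell" has the higher OCR prior, its projected gazetteer location is hundreds of pixels from the label, so the alignment evidence promotes "Grinnell" to the top rank.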
Similarly, we do not expect to always find lake names labeled inside the lake shapes, so we need to take the distance from a lake shape to its name label into consideration. Therefore we want to find reasonable scale factors, based on the above two sources of error, to assign probabilities around both the point coordinates (σp) and the lake polygon coordinates (σA). This way we can avoid false negatives: when a lake name lies close to, but not inside, the lake shape, we will not falsely penalize its probability. Within each lake, the probability is kept uniform, and the joint probabilities decrease steadily moving outwards. See Figure 2.

Figure 2. A visualization of the probability model around Rainbow Lake in Michigan. The colored contour circles represent decreasing probabilities as they get farther from the lake shape; the width between the contour circles is decided by the scale factor σA of 35 pixels.

However, for our purpose in this experiment, we made no change to the original code for the point feature scale factor (σp), which means we only considered the error from alignment for point features. For area features, since all four training maps focus on displaying counties and major towns and present few lakes, we only looked at the distance from the point indicating the location of a town to the label of the town's name. We randomly selected ten towns on each map and manually recorded the name-point distance in pixels. For now, we simply evaluated the arithmetic mean and median of the raw residual distances from alignment and of the name-point distances, and decided on a σA of 35 pixels. We realize that this simplified approach to finding the scale factors may have caused problems in our experiment.

1.2 Data Format

Our experiment relies on both the GNIS dataset updated in 2014 and The National Atlas database of US Waterbodies and Wetlands, also updated in 2014. The TNA data contain lake shapefiles consisting of polygons specified by the coordinates of their vertices. However, the TNA database does not include any historical names, and since we work on historical maps, we considered it important to link these two datasets.

The results of our five test cases are shown below. For each test case, we recorded the percent word error, the percent character error, the number of words in the gazetteer, and the harmonic mean of the correct words' ranks, laid out as shown in Table 1.

Table 1
Key to the Tables

Number of Words in the Gazetteer    Percent Word Error
Percent Character Error             Harmonic Mean of Correct Words' Ranks

In Table 2, we show the results when using a limited gazetteer, a limited parse, and point treatment only. Only Civil and Populated Place were included in the gazetteer, and the same categories were used in the parse scores: for the prior scores, only civil and populated places were allowed among the possibilities, and the same restriction applied to the posterior probabilities. All strings were treated as point features.

Table 2
Limited Gazetteer, Limited Parse, Point Treatment Only

1801     58.19
41.04    1.930

In Table 3, we show the results when using a limited gazetteer, an unlimited parse, and point treatment only. Only Civil and Populated Place were included in the gazetteer, but more gazetteer categories were used within the parse scores: for the prior scores, more possible entries were given to the system than were given to the posterior system. All strings were treated as point features.

Table 3
Limited Gazetteer, Unlimited Parse, Point Treatment Only

1801     57.85
40.90    1.924

In Table 4, we show the results when using an unlimited gazetteer, a limited parse, and point treatment only. Within the parse scores, only Civil and Populated Place were used; that means that for the prior scores, fewer entries were given than to the posterior system. For the posterior system, Reservoir, Lake, Swamp, and Bay were added to Civil and Populated Place in the gazetteer to generate the possible toponyms. All strings were treated as point features.
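The area-feature likelihood described above (uniform within the lake polygon, Gaussian falloff outside it) can be sketched as follows. This Python illustration uses σA = 35 pixels as in the text, but the exact normalization and implementation belong to the cited Matlab code, and the helper functions here are assumptions for illustration.

```python
import math

def _seg_dist(p, a, b):
    # distance from point p to the segment a-b
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def _inside(p, poly):
    # standard ray-casting point-in-polygon test
    x, y = p
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def area_likelihood(label_xy, polygon, sigma_a=35.0):
    """Uniform (maximal) likelihood inside the lake polygon; Gaussian
    falloff with the distance to the polygon boundary outside it.

    Illustrative sketch of the model described in the text.
    """
    if _inside(label_xy, polygon):
        return 1.0
    d = min(_seg_dist(label_xy, polygon[i], polygon[(i + 1) % len(polygon)])
            for i in range(len(polygon)))
    return math.exp(-d * d / (2.0 * sigma_a * sigma_a))

lake = [(0.0, 0.0), (100.0, 0.0), (100.0, 60.0), (0.0, 60.0)]
print(area_likelihood((50.0, 30.0), lake))       # label inside the lake
print(area_likelihood((50.0, 95.0), lake) < 1.0)  # label 35 px outside
```

A label exactly σA pixels outside the boundary receives exp(-1/2) ≈ 0.61 of the inside likelihood, so a name printed just beside its lake is only mildly penalized rather than rejected.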
Table 4
Limited Parse, Full Gazetteer, Point Treatment Only

1879     57.16
39.62    1.900

In Table 5, we show the results when using an unlimited gazetteer, an unlimited parse, and point treatment only. Within the parse scores, all categories were used, which means that for the prior scores a large number of entries were given, around the same number as in the posterior system. For the posterior system, Reservoir, Lake, Swamp, and Bay were added to Civil and Populated Place in the gazetteer to generate the possible toponyms. All strings were treated as point features. This gives us a baseline against which to see how our area method affects the results.

Table 5
Full Parse, Full Gazetteer, Point Treatment Only

1879     57.59
39.12    1.911

In Table 6, we show the results when using an unlimited gazetteer, an unlimited parse, and both point and area treatment. Within the parse scores, all categories were used, which means that for the prior scores a large number of entries were given, around the same number as in the posterior system. For the posterior system, Reservoir, Lake, Swamp, and Bay were added to Civil and Populated Place in the gazetteer to generate the possible toponyms. In this case, however, area features were not treated as point features. First, a gazetteer was created specifically for the map in question from all feature categories selected by the user. Next, we replaced the point coordinates of area features (lakes, swamps, reservoirs, bays) within our gazetteer with polygon coordinates. Point features (Civil and Populated Places) were then treated the same as in all other tables; for area features, however, we used the method detailed in [?].

Table 6
Full Parse, Full Gazetteer, Point Treatment and Area Feature Treatment

1879     57.81
39.53    1.917

In Table 7, the percentage change is computed across the tables; the equation for that is shown in Algorithm 1. Negative numbers show where our error went down, while positive numbers reflect where our error went up. Overall, our error did decrease across all of the changes to our system. However, in cases four and five the word error went up. That means that adding the full gazetteer plus the full parse hurt the system more than adding just one or the other's full data set. From that we conclude that too many extraneous possibilities were added into the system for it to accurately select the correct toponym. From T4 to T5, where the only difference is our new area feature model, we also see positive changes; from this, we can see that our area model increased error.

V. CONCLUSIONS

Moving forward, there are several changes to our system that might improve our results. For one, using parse scores computed from the 2014 GNIS data may improve our overall scores by giving us a more accurate dataset to work with. As the GNIS data changes every year and is updated frequently to handle errors or duplicate entries, we feel that this newer data would improve our accuracy. Also, the 2014 data is what forms the gazetteer and what was used in our entire linking process, so the mismatch with the 2010 parses might produce spurious non-matches. In addition, a better way of determining our scale factor may increase our accuracy. Our current system uses both the alignment on our training maps and the distance of point features to their labels on those maps; there may be a better way to generate that scale factor on a per-map basis, or a better way of incorporating the alignment and the distance of point features.

In addition, running area tests with limited gazetteers or limited parses alone might show interesting results, given our conclusion that adding both the full gazetteer and the full parse caused more errors. Once we have an optimized σA, we might find a better way of removing extraneous features and optimizing that part of the system. This would yield a best-case scenario rather than a likely-case scenario, but it would give us the full breadth of what our system could do.

Overall, we feel that our experiment showed that only very small changes occurred throughout our system. However, summing the changes from Table 2 through Table 5, all of our changes did end in a final negative result; thus, error did decrease overall. That decrease, however, seems to come more from adding the full gazetteer than from our area system. Due to the dependency on the parse scores and the scale factor, we feel that optimizing the scale factor and updating our parse scores could lead to an overall positive result in the future.
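The per-case entries of Table 7 can be reproduced from the error percentages in Tables 2 through 6. A small Python check, using the word-error and character-error values from Tables 2 and 3:

```python
def percent_change(base_error, test_error):
    # %Change = 100 * (100 - base) / (100 - test) - 100, where base is the
    # preceding case's error and test is the current case's error; this
    # reproduces the per-case entries of Table 7 (negative = error decreased)
    return 100.0 * (100.0 - base_error) / (100.0 - test_error) - 100.0

# word error: 58.19 (Table 2) -> 57.85 (Table 3)
print(round(percent_change(58.19, 57.85), 4))  # prints -0.8066
```

The same formula reproduces the character-error entry for the same step, round(percent_change(41.04, 40.90), 4) = -0.2369, matching Table 7.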

Algorithm 1

%Change = 100 × (100 − baseCasePercent) / (100 − testCasePercent) − 100

where baseCasePercent is the error of the preceding case and testCasePercent is the error of the current case; this form reproduces the entries of Table 7.

Table 7
Table of Parses and Change Throughout

                                  Table 1    Table 2    Table 3    Table 4    Table 5    Overall Change from T1 to T5
Word Error Percent Change            -       -0.8066    -1.6106     1.0139     0.5215    -0.008818
Character Error Percent Change       -       -0.2369    -2.1199    -0.8213     0.6780    -0.025001

REFERENCES

[1] Cartography Associates. David Rumsey Historical Map Collection.