to define locations on the surface of the Earth. A datum is defined by a reference ellipsoid, which models the shape of the Earth, and by its position relative to the Earth's center. All sources of data in our project use GCS North American 1983, based on the NAD83 datum with the GRS80 ellipsoid, so we need not consider possible errors resulting from differing reference systems. [?] The problem we face is that Matlab does not specifically define any built-in GCS, though it does include various reference ellipsoids: "grs80", "wgs84", etc. We choose the "grs80" ellipsoid since NAD83 is based on this model, and we assume that this reference ellipsoid has properties that accord with the NAD83 datum. Since Matlab uses longitude and latitude to express locations on the Earth, we also assume that it uses the most common geocentric coordinate system. Because we are not working at an extremely high level of accuracy, these assumptions serve our purpose well.

C. Data

1. Maps

In order to compare our results with Weinman's, we continue to use the manually annotated maps from the David Rumsey Historical Map Collection [1] for our experiments. The annotations were checked and improved for our research, giving higher accuracy and more complete coverage. Training and testing data were already separated within this collection in prior experiments. We then manually went through the testing set to find maps with a comparably large number of lakes on them. These maps vary considerably in alignment accuracy and in the distribution of lake shapes, so we believe this relatively small sample gives us enough data to analyze the effectiveness of our system. [?] stated in his paper, and it remains true throughout our experiment:

    For our experiments, we have annotated 12 maps from the David
    Rumsey collection [1], with each map's approximate region noted
    (i.e., the primary state or counties and any surrounding states
    having toponyms represented). A simple bound on coordinate
    ranges, as given by some online map collections, would also
    suffice to restrict the gazetteer ultimately used for
    recognition. Annotations in a simple XML format include tight
    piecewise linear baseline guides, text height and ground truth
    string for each toponym. In addition to the regional map shown
    in Figure 1, many are railroad and public survey maps. The
    average text height is 25 ± 16 pixels.

Most maps in our collection are at the scale of a single state; a few include several states or only part of one state. Notes on which states appear on a given map are used to restrict the gazetteer to only the states in question. Names of lakes and rivers are annotated, but the corresponding map coordinates of these area/linear features are only the primary points.

2. Gazetteer

In our experiment, we continue to use the Geographic Names Information System (GNIS) from the U.S. Board on Geographic Names (USBGN) as our authoritative gazetteer. For our purposes, in addition to the categories of Civil and Populated Place, we also include the categories of Lake, Swamp, Reservoir and Bay. The gazetteer was last updated in 2014, and we use that version for our gazetteer creation and posterior ranking. However, our parse scores in all cases were computed using the 2010 gazetteer dataset. We feel that, given more time to run our tests with 2014 parses, we might see slightly improved results, although further testing would be needed to confirm this.

However, the GNIS contains only primary single-point coordinates for area/linear features. In order to answer our research question, we need to link coordinate data for the area/linear features to the gazetteer. We searched open resources of geospatial vector data of lakes/rivers on the Internet and found the following two databases: The National Atlas [?] shapefiles of lakes, and the National Oceanic and Atmospheric Administration [?] shapefiles of rivers. Among a number of different resources, we chose these two databases because they cover the largest number of water bodies, include sufficient geographic information on the water bodies, and their shapefiles are easy to work with in Matlab. The TNA and NOAA datasets contain feature shape information, but they do not contain historical names of lakes and rivers. Since we are working with historical maps, it is important that we link the GNIS gazetteer and the shapefile databases. To match the water bodies, we compare the edit distance between the names of each lake/river in GNIS and in TNA/NOAA, as well as the geographic distance between certain points on the features. Details are specified in [?]. With the linking algorithm, we successfully matched 9531 lake shapefiles to the gazetteer.

III. APPROACH

A. Algorithm

1. System

Our probability model is based on the probability model in Weinman (2013), modified to include area features. It is a Bayesian network. The difference from the earlier model is that we include toponyms of lakes and thus integrate the area shapefiles into the model.

The key variables - strings, points and alignment - are retained in the model. Toponyms are assigned to different branches of the model based on their category in our gazetteer. Instead of using RANSAC to search for an alignment that maximizes the probability of a toponym, we use the ground truth alignment obtained by manual
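The linking criterion described above (name similarity plus spatial proximity) can be illustrated with a small sketch. The project's actual linking code is in Matlab and its details are in [?]; this Python version is only an illustration, and the function names and thresholds (`is_match`, `max_edits`, `max_km`) are hypothetical.

```python
import math

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def geo_distance_km(p, q):
    """Great-circle (haversine) distance between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def is_match(gnis_name, gnis_point, shp_name, shp_point,
             max_edits=2, max_km=5.0):
    """Link a GNIS entry to a shapefile feature when both the name and the
    location are close enough (thresholds here are illustrative only)."""
    return (edit_distance(gnis_name.lower(), shp_name.lower()) <= max_edits
            and geo_distance_km(gnis_point, shp_point) <= max_km)
```

Requiring both tests to pass guards against matching two different lakes that share a name, or one lake whose name was recorded differently in the two datasets.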
annotations of the maps. The probability model applies to each word image and the corresponding image coordinates found on a map. A flow chart explaining our process can be seen below.

Figure 1. Flow Chart of our Experimental System

Given a map image of a string, the prior probability of the recognition candidates P(y|W) is given by the OCR. The initial steps of processing the prior probabilities remain the same as in Weinman (2013). We assign an outlier probability to non-geographical text such as legends. The bias term γ (0 < γ < 1) is decided for each map based on the alignment. The probability for inliers is still a simple ratio with no bias. We also consider no bias among the points where the geographic feature is labeled.

In order to maximize the probability of the correct toponym match, we recalculate the probabilities by finding an alignment between the map and the real-life geographical data, which we retrieve from existing databases. With the alignment, we calculate the inferred locations of the toponyms on the map and further adjust the likelihood of the toponyms by their distance from the label point p. The former model uses a Gaussian probability to include the area around single point features, considering only a certain value of the standard deviation (σ) ([?], 4).

The new modified model assigns a uniform likelihood within the shape of the lake and a normal (Gaussian) probability distribution around the lake with a certain scale factor. The functions regarding this are found in [?].

2. Task

The initial input for the whole system is images of strings on a map, which is controlled not by us but by the future users of our system. It should be clarified that this image input is different from the information included in a given map: what the users decide is which particular part of which maps they want to learn about. The input for our part of the system is the prior probability generated by the OCR. The OCR system gives a list of possible recognition hypotheses for each string detected; in later steps we consider only the hypotheses on this list. With sufficient training data, the OCR parses the strings with a geolexicon and a regular dictionary. The geolexicon we use is also GNIS. Whether to include the full geolexicon or the partial geolexicon with only civil and populated places is a factor that could influence the results of our experiment; we address this factor later in the experiments. As an example of the prior probabilities, given an image with the string "Grinnell", the prior probabilities (parse scores) might be "Grinnell - 43%", "Cornell - 52%", and "Corn hell - 5%". That would be the pure output of the OCR model.

3. Environment

The environment is made up of the variables not controlled in the system or as a task. For us, this includes the geography of the Earth, the information on the maps, and the speed of processing and computation. The geography of the actual world is what we use to get our alignment for the re-ranking, but it is entirely out of our system's control; it is part of the environment in which our system operates. The information on the historical maps is out of our system's control as well as the user's control; the data on the maps was created by the mapmakers originally and is thus static. The CPU and the amount of RAM with which our computers are equipped are another external factor. We strive to optimize our code by vectorization and simplification in order to finish the calculation in a reasonable amount of time.

IV. EXPERIMENTS

1. METHOD

We tested our system on eight test maps, each showing a single state. These maps stayed consistent throughout our several tests. We ran five cases over these maps: limited parse, limited gazetteer, point feature treatment only; limited parse, full gazetteer, point feature treatment only; full parse, limited gazetteer, point feature treatment only; full parse, full gazetteer, point feature treatment only; and full parse, full gazetteer, point feature and area feature treatment. These are detailed further below.

1.1 Parameters

Our system loads the coordinates of the lake centroid points from the gazetteer and replaces the point coordinates with polygon coordinates for linked lakes. We use the polyconic projection and an affine alignment found by single control points to project the spherical coordinates of the polygons into the image plane. The inlier bias γ is set at 0.5 since we assume that the distribution is normal.

Since map-makers throughout history have adopted different geodetic standards - for example, different models of the Earth and various sphere-to-plane projection algorithms - we should expect error in the alignment and projection process between the historical maps and the contemporary lake shape data. In prior research, the model defined a Gaussian probability around the single point features
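The re-ranking step above - combining the OCR prior with a spatial likelihood at the aligned location - can be sketched as follows. This is a Python illustration rather than the project's Matlab code, and the exact mixture form (a Gaussian inlier term weighted by γ against a constant outlier density) is an assumption based on the description, not the paper's precise formula.

```python
import math

def gaussian(d, sigma):
    """Isotropic 2-D Gaussian density evaluated at distance d (pixels)."""
    return math.exp(-d * d / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)

def rerank(candidates, label_point, sigma=25.0, gamma=0.5, outlier=1e-6):
    """Re-rank OCR hypotheses by weighting each prior with a spatial
    likelihood: `candidates` maps name -> (prior, projected (x, y) location
    in the image plane), and closeness to the toponym's label point raises
    the posterior. Names and the outlier constant are illustrative."""
    scores = {}
    for name, (prior, (x, y)) in candidates.items():
        d = math.hypot(x - label_point[0], y - label_point[1])
        scores[name] = prior * (gamma * gaussian(d, sigma)
                                + (1 - gamma) * outlier)
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}
```

With the "Grinnell" example above, even though "Cornell" has the higher OCR prior (52% vs. 43%), a Grinnell whose projected gazetteer location lands near the label point overtakes a Cornell projected far away.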
since the toponyms are not always found exactly over the point coordinates. Similarly, we do not expect to always find lake names labeled inside the lake shapes, so we need to take the distance from a lake shape to its name label into consideration. Therefore we want to find a reasonable scale factor, based on the above two factors, to assign probabilities around both the point coordinates (σp) and the lake polygon coordinates (σA). This way, we can avoid false negatives where a lake name is actually quite close to the lake shape: we will not falsely penalize its probability just because the name is not inside the lake. Within each lake the probability is kept uniform, and the joint probabilities become smaller and smaller progressing outwards. See Figure 2.

[Figure 2 here: probability map around a lake shape in image coordinates; colorbar scale ×10⁻⁵.]

1.2 Data Format

Our experiment relies on both the GNIS dataset updated in 2014 and The National Atlas database of US Waterbodies and Wetlands, also updated in 2014. The TNA data contain lake shapefiles, which consist of polygons specified by the coordinates of their vertices. However, the TNA database does not include any historical names, and as we work on historical maps, we considered it important to link these two datasets.

The results of our five testing experiments can be seen below. For each testing case, we recorded the percent word error, the percent character error, the number of words in each gazetteer, and the harmonic mean of the correct words' ranks.

Table 1
Key to the Tables

Number of Words in the Gazetteer    Percent Word Error
Percent Character Error             Harmonic Mean of Correct Words' Ranks

In Table 2, we show the results when we used a limited gazetteer, a limited parse, and point treatment only. Civil and Populated Place were included solely in the gazetteer, with the same being used solely in the parse scores. For the prior scores, only civil and populated places were allowed within the possibilities, with the same being used solely for the posterior probabilities. All strings were treated as point features.
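The area-feature likelihood just described - uniform inside the lake polygon, falling off outside with scale σA - can be sketched as below. This Python illustration (the project itself uses Matlab) skips density normalization for clarity, and the helper names are hypothetical.

```python
import math

def point_in_polygon(pt, poly):
    """Ray-casting test: is pt inside the polygon (list of (x, y) vertices)?"""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal ray at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def dist_to_polygon(pt, poly):
    """Minimum distance from pt to the polygon boundary, edge by edge."""
    x, y = pt
    best = float("inf")
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        dx, dy = x2 - x1, y2 - y1
        denom = dx * dx + dy * dy
        t = 0.0 if denom == 0 else max(0.0, min(1.0, ((x - x1) * dx + (y - y1) * dy) / denom))
        best = min(best, math.hypot(x - (x1 + t * dx), y - (y1 + t * dy)))
    return best

def area_likelihood(pt, poly, sigma_a):
    """Uniform (maximal) likelihood inside the lake polygon; outside it,
    a Gaussian falloff in the distance to the boundary, scaled by sigma_a."""
    if point_in_polygon(pt, poly):
        return 1.0
    d = dist_to_polygon(pt, poly)
    return math.exp(-d * d / (2 * sigma_a * sigma_a))
```

A label printed just off the shore of a lake therefore keeps most of its likelihood, which is exactly the false-negative case the scale factor is meant to avoid.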
Lake, Swamp, and Bay were added to Civil and Populated Place to generate the possible toponyms. All strings were treated as point features.

Table 4
Limited Parse, Full Gazetteer, Point Treatment Only

1879     57.16
39.62    1.900

In Table 5, we show the results when we used an unlimited gazetteer, an unlimited parse, and point treatment only. Within the parse scores, all categories were used, which means that for the prior scores a large number of entries were given - around the same as in the posterior system. For the posterior system with the gazetteer, Reservoir, Lake, Swamp, and Bay were added to Civil and Populated Place to generate the possible toponyms. All strings were treated as point features. This gives us a real ground truth against which to see how our area method affects our results.

Table 5
Full Parse, Full Gazetteer, Point Treatment Only

1879     57.59
39.12    1.911

In Table 6, we show the results when we used an unlimited gazetteer and an unlimited parse, but in this case area features were not treated as point features. Within the parse scores, all categories were used, which means that for the prior scores a large number of entries were given - around the same as in the posterior system. For the posterior system with the gazetteer, Reservoir, Lake, Swamp, and Bay were added to Civil and Populated Place to generate the possible toponyms. First, a gazetteer was created specifically for the map in question from all feature categories selected by the user. Next, we replaced the points for area features (lakes, swamps, reservoirs, bays) within our gazetteer. Then, point features (Civil and Populated Places) were treated the same as in all other tables. For area features, however, we use the method detailed in [?].

Table 6
Full Parse, Full Gazetteer, Point Treatment and Area Feature Treatment

1879     57.81
39.53    1.917

In the table found at the top of the next page, the percentage change is taken across each of the tables; the equation for it is shown in Algorithm 1.

Table 7
Table of Parses and Change Throughout

Negative numbers show when our error went down, while positive numbers reflect when our error went up. Overall, you can see that our error did successfully decrease with all of the changes to our system. However, you can also see that in cases four and five, word error went up. That means that adding in the full gazetteer plus the full parse hurt the system more overall than adding in just one or the other's full data set. From that we can conclude that too many extraneous possibilities were added into the system for it to be able to accurately select the correct toponym. From T4 to T5 we also see positive changes, where the only difference is our new area feature model. From this, we can see that our area model increased error.

V. CONCLUSIONS

Moving forward, there are several changes to our system that might improve our results. For one, using parse scores for the 2014 GNIS data may improve our overall scores by giving us a more accurate dataset to work with. As the GNIS data changes every year and is updated frequently to handle errors or multiple entries, we feel that this new data would improve our accuracy. Also, the 2014 data is what is used to form the gazetteer later, as well as what was used in our entire linking process; thus the mismatch might make for spurious non-matches. In addition, a better way of determining our scale factor may increase our accuracy. Our current system uses both the alignment on our training maps and the distance of point features to their labels on those maps; there may be a better way to generate that scale factor on a per-map basis, or a better way of incorporating the alignment and the distance of point features.

In addition, doing area tests solely with limited gazetteers or limited parses might show some interesting results, given our conclusion that adding in both the full gazetteer and the full parse caused more errors. Once we have a σA that is maximized, we might find a better solution for removing extraneous features and optimizing that part of the system. This would give a best-case scenario rather than a likely-case scenario, but it would show the full breadth of what our system could do.

Overall, we feel that our experiment showed that only very small changes happened throughout our system. However, if you sum the changes from Table 2 to Table 5 overall, all of our changes did end in a final negative result - thus, error did decrease overall. However, that seems to come more from adding in the full gazetteer than from our area system, unfortunately. Due to the dependency on the parse scores and the scale factor, we feel that optimizing the scale factor and updating our parse scores could lead to an overall positive result in the future.
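The two summary statistics used in the tables - the harmonic mean of the correct words' ranks, and the percentage change between test cases - can be sketched as follows. Algorithm 1 itself is not reproduced here, so the percent-change formula below is the standard one and may differ from it in detail.

```python
def harmonic_mean_rank(ranks):
    """Harmonic mean of the correct words' ranks in the hypothesis lists.
    It is dominated by the best (rank-1) hits, so values near 1 mean the
    correct word usually ranks first."""
    return len(ranks) / sum(1.0 / r for r in ranks)

def percent_change(old, new):
    """Standard percentage change between two error rates; a negative
    value means the error decreased."""
    return 100.0 * (new - old) / old
```

For example, percent_change(57.59, 57.81) is about +0.38, matching the small word-error increase between the point-only case (Table 5) and the area-feature case (Table 6).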