Professional Documents
Culture Documents
Extraction of Address Data From Unstructured Text Using - Free Knowledge Resources
Extraction of Address Data From Unstructured Text Using - Free Knowledge Resources
Extraction of Address
Data from Unstructured Text using Free Knowledge Resources. In Proceedings of the 13th
Internationcal Conference on Knowledge Management and Knowledge Technologies, 2013.
Accepted for Publication.
5.2 Evaluation Results Figure 4: F-Measure for all attributes (address to-
Our approach identified 4,449 addresses which correspond gether with company name), only the address and
to an average of 2.8 addresses per Web site. In some cases, the three single attributes of the address
LOEWE – Landes-Offensive zur Entwicklung Wissenschaft-
Table 3: Results for Extended Blacklist lich-ökonomischer Exzellenz, Förderlinie 3: KMU-Verbund-
Complete Complete Address Address
vorhaben (State Offensive for the Development of Scientific
Standard Extended Standard Extended
and Economic Excellence).
Recall 0.7985 0.8326 0.9467 0.9563
We thank kimeta GmbH for the essential help assisting with
Precision 0.6145 0.6563 0.9312 0.9338
building the evaluation corpus.
F-Measure 0.6945 0.7340 0.9389 0.9449
8. REFERENCES
impact on the identification of the geographical information. [1] D. Ahlers and S. Boll. Retrieving Address-based
Locations from the Web. In Proceedings of the 2nd
5.3 Issues international workshop on Geographic information
We faced some important challenges that are interesting retrieval, GIR ’08, pages 27–34, New York, NY, USA,
for further work. Examining the extracted addresses, we re- 2008. ACM.
alized mainly two reasons for an incorrectly identified com- [2] S. Asadi, G. Yang, X. Zhou, Y. Shi, B. Zhai, and
pany name: W.-R. Jiang. Pattern-Based Extraction of Addresses
from Web Page Content. In Y. Zhang, G. Yu,
• Often, an incorrect company name is not totally wrong E. Bertino, and G. Xu, editors, Progress in WWW
but either too short or too long. A common reason for Research and Development, volume 4976 of Lecture
detecting only part of the company name is an un- Notes in Computer Science, pages 407–418. Springer
usual structure of the name, e.g. the company name Berlin Heidelberg, 2008.
“oberüber Agentur für digitale Wertschöpfung” has not
[3] W. Cai, S. Wang, and Q. Jiang. Address extraction:
been identified completely but only “Agentur für digi-
Extraction of location-based information from the
tale Wertschöpfung”. This has both an impact on pre-
web. In Y. Zhang, K. Tanaka, J. Yu, S. Wang, and
cision and recall.
M. Li, editors, Web Technologies Research and
• The name identified describes a company but not the Development - APWeb 2005, volume 3399 of Lecture
company which is situated at the assigned address. Notes in Computer Science, pages 925–937. Springer
This has mainly an impact on the precision. Berlin Heidelberg, 2005.
[4] F. Dawson and T. Howes. vCard MIME Directory
A manual analysis of the results revealed further that the Profile. RFC 2426, IETF, September 1998.
transformation of the Web site to a pure textual represen- [5] T. Kagehiro, M. Koga, H. Sako, and H. Fujisawa.
tation may come together with the loss of information. De- Address-block extraction by Bayesian rule. In Pattern
pending on the structure of the HTML, document elements Recognition, 2004. ICPR 2004. Proceedings of the 17th
that are found in close distance on the Web site are sepa- International Conference on, volume 2, pages 582–585
rated significantly in the textual representation.
Vol.2, Aug.
[6] T. Lee, J. Hendler, O. Lassila, et al. The semantic
6. CONCLUSIONS AND FUTURE WORK web. Scientific American, 284(5):34–43, 2001.
Within this paper we have presented an approach for the [7] B. Loos and C. Biemann. Supporting Web-based
identification of business address data in unstructured text. Address Extraction with Unsupervised Tagging. In
For this, we have developed a hybrid approach which makes C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and
use of patterns and gazetteers of information that can be R. Decker, editors, Data Analysis, Machine Learning
obtained from OpenStreetMaps and Wikipedia. Thus, our and Applications, Studies in Classification, Data
approach is independent of any commercial solution. We Analysis, and Knowledge Organization, pages
showed that we achieve an excellent accuracy for the iden- 577–584. Springer Berlin Heidelberg, 2008.
tification of the address information better than the other [8] A. Luberg, P. Järv, K. Schoefegger, and T. Tammet.
approaches we have examined and in most cases we were Context-aware and Multilingual Information
able to correctly assign the name of the company being sit- Extraction for a Tourist Recommender System. In
uated at the respective address which had not been focus in Proceedings of the 11th International Conference on
any other work we found. Knowledge Management and Knowledge Technologies,
In the future we will adapt and evaluate the feasibility of i-KNOW ’11, pages 13:1–13:8, New York, NY, USA,
the approach in other languages and other countries. For 2011. ACM.
this, patterns for identification of single address attributes [9] H. Mühleisen and C. Bizer. Web Data Commons -
need to be defined and country-specific gazetteers need to be Extracting Structured Data from Two Large Web
assembled. As OpenStreetMap is available with high cover- Corpora. In Proceedings of the 5th Workshop on
age in many countries and the format of address attributes Linked Data on the Web, 2012.
in different countries is quite regular, we expect good results. [10] Netcraft. April 2012 Web Server Survey. http://
Further, we will work on the accuracy of the company name news.netcraft.com/archives/2012/04/04/
identification by using structural features from the HTML april-2012-web-server-survey.html, 2012. [Online;
source since company names are often emphasized. accessed 27-February-2013].
[11] K. Zickuhr. Three-quarters of smartphone owners use
7. ACKNOWLEDGMENTS location-based services. Pew Internet & American Life
This project (HA project no. 292/11-37) is funded in the Project, 2012.
framework of Hessen ModellProjekte, financed with funds of