information, which the GUI displays. In the figure, the image on the right represents a satellite picture with its midpoint at coordinates (33.927, –118.406). The picture is 400 × 400 pixels, and each pixel equals 1 meter on the ground.
Unfortunately, Tigerline’s street information is often inaccurate and rarely aligns with the streets on the TerraService satellite imagery. Aligning two geospatial data sets is a difficult problem often referred to as conflation.
For the Building Finder application, we obtain conflated street information from the Tigerline files for the city of El Segundo, Calif., using our localized image-processing approach for automatically aligning vector data with satellite imagery.
Although Building Finder’s coverage is currently limited to El Segundo, it would be trivial to use this approach in the future to obtain conflated Tigerline files for all of the US.
There are three ways to use the application’s GUI:
• The user retrieves a specific image by filling in its coordinates and clicking “update.”
• The user navigates to nearby images by clicking one of four white arrows around the image.
• The user clicks on the box on the image (see Figure 2) that corresponds to a specific house the user wants information (owner and address) about.
The GUI in Figure 2 shows the results of running the user query. The boxes in the satellite image indicate building locations, and the lines represent streets.
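The image geometry described above (a 400 × 400 pixel tile, 1 meter per pixel, centered on a known coordinate) can be illustrated with a short Python sketch. This is not Building Finder’s code: the function names, the flat-earth approximation, the constant of roughly 111,320 meters per degree of latitude, and the assumption that each arrow moves the view by exactly one tile are all illustrative choices that the article does not specify.

```python
import math

# Assumption: approximate length of one degree of latitude, in meters.
METERS_PER_DEGREE_LAT = 111_320.0


def pixel_to_latlon(px, py, center_lat, center_lon, size=400, meters_per_pixel=1.0):
    """Map pixel (px, py), with the origin at the top-left corner of the
    tile, to an approximate (lat, lon) using a local flat-earth model."""
    dx_m = (px - size / 2) * meters_per_pixel  # meters east of the center
    dy_m = (size / 2 - py) * meters_per_pixel  # meters north (pixel y grows downward)
    lat = center_lat + dy_m / METERS_PER_DEGREE_LAT
    meters_per_degree_lon = METERS_PER_DEGREE_LAT * math.cos(math.radians(center_lat))
    lon = center_lon + dx_m / meters_per_degree_lon
    return lat, lon


def tile_center_after_move(center_lat, center_lon, direction, size=400, meters_per_pixel=1.0):
    """Center of the adjacent tile, assuming (hypothetically) that an
    arrow click shifts the view by one full tile in that direction."""
    half = size / 2
    # Pixel position of the neighboring tile's center, in this tile's frame.
    target_pixel = {
        "north": (half, -half),
        "south": (half, size + half),
        "east": (size + half, half),
        "west": (-half, half),
    }[direction]
    return pixel_to_latlon(*target_pixel, center_lat, center_lon, size, meters_per_pixel)


if __name__ == "__main__":
    center = (33.927, -118.406)
    print(pixel_to_latlon(0, 0, *center))           # top-left corner of the tile
    print(tile_center_after_move(*center, "east"))  # center of the tile to the east
```

At this scale the flat-earth model is adequate for a sketch; a production system would more likely use a proper map projection.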
Technologies for efficient data interaction
To efficiently integrate semantically heterogeneous information from multiple data sources, Building Finder uses several technologies:
• Machine-learning techniques for converting traditional legacy Web sources and databases into Web services
• A record linkage system for integrating data from multiple sources referring to a single entity
• A mediator system providing uniform access to data from various Web services
• An efficient execution system for information-gathering agents
• RDQL and RDF formalisms for representing queries and query results
Online source wrapping
Much information on the Web is formatted for human readers, not machines. Software wrappers let programs or agents retrieve and translate data from Web sources into a format the software can easily manipulate. Building Finder relies on wrappers to provide the needed data on a per-query basis because the breadth and depth of queries prevent us from storing or caching all data locally.
Figure 3 shows how the wrapper navigates and extracts information from the Yahoo White Pages. Given a name, a city, and a state, the wrapper queries the Yahoo site by binding its inputs to the corresponding variables in the search form on that site. Yahoo White Pages returns the results page, which lists names, addresses, and phone numbers, and the wrapper uses extraction rules to extract the data. If Yahoo returns more than one results page, the wrapper extracts the data from all pages. Building Finder filters the data, formats it into a table, and sends it to the mediator, which chooses how to integrate the data to answer queries formulated according to user inputs.
Because most Web pages are written in HTML, extracting data is a complicated task. HTML is a language that tells Web browsers how to format a Web page; it’s not intended to help locate specific data on a page. HTML tags have no semantics with respect to a page’s content. Consequently, a wrapper must learn how to extract each data key on each Web page by searching for an HTML tag pattern leading to the data. For example, if we want to extract the person’s name in Figure 4, we might notice that the tag always preceding the name data is
and the tag following the name data is
. Thus, we would create a rule for the wrapper to extract the keyword “A Smith” by starting extraction after the tag
and stopping at the tag
wrapper rule learning
, uses machine-learning techniques
for extracting data automatically from semi-structured Web pages. We used the Fetch Agent Platform (www.fetch.com) to create the wrappers that extract the data used in Building Finder.
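The start-and-stop rule described above can be sketched in a few lines of Python. The sketch applies a single hand-written rule; it does not learn rules from examples and is not the Fetch Agent Platform’s API. The sample markup, the landmark strings, and the function name `extract_between` are hypothetical, since the actual Yahoo White Pages tags are not reproduced in the text.

```python
def extract_between(page, start_landmark, end_landmark):
    """Return every substring of `page` that lies between an occurrence
    of start_landmark and the next occurrence of end_landmark."""
    results = []
    pos = 0
    while True:
        start = page.find(start_landmark, pos)
        if start == -1:
            break  # no further start landmark on the page
        start += len(start_landmark)
        end = page.find(end_landmark, start)
        if end == -1:
            break  # start landmark without a matching end landmark
        results.append(page[start:end].strip())
        pos = end + len(end_landmark)
    return results


# Hypothetical results-page fragment, invented for illustration only.
SAMPLE_PAGE = """
<tr><td class="name"><b>A Smith</b></td><td>123 Main St</td></tr>
<tr><td class="name"><b>B Jones</b></td><td>456 Oak Ave</td></tr>
"""

if __name__ == "__main__":
    # Rule: start extracting after the opening landmark, stop at the closing tag.
    names = extract_between(SAMPLE_PAGE, '<td class="name"><b>', "</b>")
    print(names)  # ['A Smith', 'B Jones']
```

A rule like this breaks as soon as the site changes its markup, which is one reason to learn rules from labeled example pages rather than write them by hand.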
Integrating information from multiple Web sites is complicated because the same data objects can exist in inconsistent text formats across sites or because a search for a particular object can return multiple results. This makes identifying matching objects using an exact text match difficult. As Figure 5 shows, for a particular people query, the Yahoo White Pages site returns a record for Brandy Smith. When we use this person’s address to query the property tax site, we receive more than 15 results. We must therefore use more than simple text-matching
Figure 1. The Building Finder GUI.