Robert Baumgartner, Sergio Flesca, Georg Gottlob

Overview
- Introduction and Motivation
- Wrapper Generation
- Extraction Language/Mechanisms
- Testing Lixto
- Results
- Strengths & Weakness
- Current/Future Work

HTML vs. XML
- HTML & XML represent semi-structured data
- HTML is mainly presentation oriented
- Web content is typically formatted in HTML
- HTML lacks data querying

XML Advantages
- XML separates structure from layout
- XML provides a suitable data representation
- XML sets act as a database
- XML sets can be queried via XML-GL, XML-QL, XQuery

eBay Example
- No data querying ability increases the cost and time to retrieve information from web pages
- Example: watch interesting eBay offers of notebooks
- Criteria:
  - Auction contains the word 'notebook'
  - Current value between GBP 1500 and 3000
  - Received at least 3 bids

eBay Problems
- eBay does not support complex queries
- Similar sites do not give restricted queries
- Large number of results returned with no possibility to further restrict the results
- Only one site can be queried at a time
- Results from different queries cannot be compiled into a single structured file

eBay Solution
- Lixto introduces new ideas and programming language concepts for wrapper generation
- Lixto translates HTML to XML
- Resulting XML can then be queried and further processed
- Wrappers are applied automatically to extract information from changing web pages

Lixto Advantages
- Easy to learn
- Full visual and interactive UI provided
- No fine tuning required
- No knowledge of the internal language necessary
- No knowledge of HTML necessary
- Graphical region marking and selection
- Works directly on browser-displayed pages, no additional view necessary

Lixto Advantages (continued)
- Extraction of target patterns based on:
  - Surrounding landmarks
  - Actual content
  - HTML attributes
  - Order of appearance
  - Semantic and syntactic concepts
- Extraction from flat strings possible
- Semi-automatic wrapper generation

Advanced Lixto Features
- Disjunctive pattern definitions
- Crawling page links during extraction
- Recursive wrapping
- Extracted data can have disjoint structure from HTML
  source page
- Internal data structure language Elog implemented

Lixto System Architecture and Implementation
- Lixto created with Java using Swing, OroMatcher and JDOM
- Lixto toolkit contains three modules:
  - Interactive Pattern Builder
  - Extractor
  - XML Generator

Creating Wrappers
- Lixto wrappers are created interactively using patterns in a hierarchical order
- Pattern names act as default XML elements: <item>, <price>
- Sub-patterns express 1:* relationships
- Each pattern characterizes one kind of information
- Each pattern is defined by one or more filters

Filter Creation
- User highlights desired target
- Internally, an Elog rule is created describing the filter
- Add restrictive conditions to the filter
- Goals are added to the Elog rule body
- Filter conditions:
  - Before/after
  - Not before/not after
  - Internal
  - Range

Pattern Creation Algorithm
- Loading the initial document creates a <document> pattern
- User highlights an instance of the pattern
- Lixto displays all matched instances of the pattern

Pattern Creation Algorithm (continued)
- User can add filters to limit the matched targets
- The set of filters is added to the <document> pattern
- Test if the <document> pattern extracts exactly the desired set of data
- If yes, save the pattern; if no, select a new instance of the pattern

Generation of a New Pattern

The Lixto Browser

Conditional Generation

Visual Interface
- Visual tree pattern construction
- Regular expression string patterns
- XML visualization tool
- Concept generator
  - Regular expression / database driven
  - Creates isCity, isDate
  - Requires no regular expression knowledge

Main Menu / Pattern Generation Menu

Elog
- Internal data storage language
- Datalog-like syntax and semantics
- Invisible to the user
- Specifically designed for hierarchical and modular data extraction
- Flexible, intuitive, easily extensible
- Patterns stored as narrowing (logical AND) and broadening (logical OR) steps
- Elog rules are implementations of the visually defined filters

Elog Extraction Program for eBay Example

Document Model
- Brackets specify character offsets
- Nodes numbered in depth-first, left-to-right fashion
- HTML
  tags refer to element sets containing attribute names and values
- <body> tag contains attributes (name, body), (bgcolor, FFFFFF), (elementtext, ...)

HTML Example Page

XML Translation

Extraction Mechanisms
- Tree extraction
  - Elements identified by tree path (*.table.*.tr)
  - Attribute constraints reduce matched elements
  - Element path definition (epd): tree path plus attribute constraints
- String extraction
  - Strings stored in 'context' nodes
  - Regular expression matching

HTML Tree Extraction

Lixto Test Sites

Results

Strengths & Weakness
- Intuitive UI (if it needs a manual, it's not a good program)
- Highly customizable
- Supports crawling across web sites
- No tree output after crawling
- Slow
- Extracts only one target type at a time

Current/Future Work
- Extend tree structure to support crawling across multiple sites (crawling is currently supported)
- Server-based Lixto system
- Automated heuristics
- Support for multiple example targets at once
- Embedding Lixto wrappers into information channel system
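
Appendix: tree extraction sketch

The tree extraction mechanism described on the Extraction Mechanisms slide (a tree path such as *.table.*.tr, narrowed by attribute constraints) can be illustrated with a small Python sketch. This is not Lixto's Elog implementation. The names Node, descendants_by_path, number_nodes and extract are hypothetical, "*" is assumed to match any chain of zero or more intermediate elements, and attribute constraints are assumed to be exact-value matches.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified document-tree node (hypothetical, not Lixto's model)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def descendants_by_path(node, steps):
    """Yield nodes reached from `node` by following `steps`.

    Each step is a tag name or "*". ASSUMPTION: "*" matches any chain
    of zero or more intermediate elements. May yield duplicates.
    """
    if not steps:
        yield node
        return
    head, rest = steps[0], steps[1:]
    for child in node.children:
        if head == "*":
            # "*" absorbs this child and stays active below it.
            yield from descendants_by_path(child, steps)
        elif child.name == head:
            yield from descendants_by_path(child, rest)
    if head == "*":
        # "*" matches zero elements; apply the remaining steps here.
        yield from descendants_by_path(node, rest)


def number_nodes(root):
    """Number nodes in depth-first, left-to-right (preorder) fashion."""
    order = {}
    stack = [root]
    while stack:
        node = stack.pop()
        order[id(node)] = len(order)
        stack.extend(reversed(node.children))
    return order


def extract(root, path, constraints=None):
    """Element path definition: tree path plus attribute constraints.

    ASSUMPTION: constraints are exact attribute-value matches.
    Returns each matching node once, in document order.
    """
    constraints = constraints or {}
    order = number_nodes(root)
    matched = {}
    for node in descendants_by_path(root, path.split(".")):
        if all(node.attrs.get(k) == v for k, v in constraints.items()):
            matched[id(node)] = node  # de-duplicate by identity
    return sorted(matched.values(), key=lambda n: order[id(n)])


# Toy page: two tables, three rows, two of them marked as offers.
page = Node("body", {"bgcolor": "FFFFFF"}, [
    Node("table", {}, [
        Node("tr", {"class": "offer"}, [Node("td")]),
        Node("tr", {"class": "header"}),
    ]),
    Node("div", {}, [
        Node("table", {}, [Node("tr", {"class": "offer"})]),
    ]),
])

rows = extract(page, "*.table.*.tr")
offers = extract(page, "*.table.*.tr", {"class": "offer"})
print(len(rows), len(offers))  # prints: 3 2
```

The tree path alone selects every table row, and the attribute constraint then reduces the matched elements, mirroring the narrowing step described for filters. Sorting by preorder number keeps results in the depth-first, left-to-right order used by the document model.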