
Visual Web Information

Extraction With Lixto


Robert Baumgartner
Sergio Flesca
Georg Gottlob
Overview
Introduction and Motivation
Wrapper Generation
Extraction Language/Mechanisms
Testing Lixto
Results
Strengths & Weaknesses
Current/Future Work
HTML vs. XML
HTML & XML represent semi-structured
data
HTML mainly presentation oriented
Web content typically formatted in HTML
HTML lacks data querying
XML Advantages
XML structure/layout separation
XML provides suitable data representation
XML sets act as database
XML sets queried via XML-GL, XML-QL,
XQuery
eBay Example
No data querying ability increases cost and
time to retrieve information from web pages
Example: watch interesting eBay offers of
notebooks
Criteria:
Auction contains the word 'notebook'
Current value between GBP 1500 and 3000
Received at least 3 bids
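The three criteria above amount to a simple conjunctive query once the auction data is available in structured form. A minimal sketch in Python, assuming a hypothetical Auction record and made-up sample data (neither comes from Lixto or eBay):

```python
from dataclasses import dataclass


@dataclass
class Auction:
    # Hypothetical record shape; field names are assumptions for illustration.
    title: str
    price_gbp: float
    bids: int


def is_interesting(a: Auction) -> bool:
    """All three criteria from the slide must hold."""
    return (
        "notebook" in a.title.lower()
        and 1500 <= a.price_gbp <= 3000
        and a.bids >= 3
    )


# Made-up sample data.
auctions = [
    Auction("Sony notebook, 15 inch", 1800.0, 5),
    Auction("Notebook bag", 20.0, 7),
    Auction("Desktop PC", 1600.0, 4),
    Auction("IBM Notebook", 2500.0, 2),
]

matches = [a for a in auctions if is_interesting(a)]
print([a.title for a in matches])
```

Only the first sample record satisfies all three conditions; the others each fail on the price, the title, or the bid count.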
eBay Problems
eBay does not support complex queries
Similar sites do not give restricted queries
Large number of results returned with no
possibility to further restrict the results
Only one site can be queried at a time
Results from different queries cannot be
compiled into a single structured file
eBay Solution
Lixto introduces new ideas and programming
language concepts for wrapper generation
Lixto translates HTML to XML
Resulting XML can then be queried and
further processed
Wrappers applied automatically to extract
information from changing web pages
Lixto Advantages
Easy to learn
Full visual and interactive UI provided
No fine tuning required
No knowledge of internal language necessary
No knowledge of HTML necessary
Graphical region marking and selection
Works directly on browser-display pages, no
additional view necessary
Lixto Advantages
Extraction of target patterns based on:
Surrounding landmarks
Actual content
HTML attributes
Order of appearance
Semantic and syntactic concepts
Extraction from flat strings possible
Semi-automatic wrapper generation
Advanced Lixto Features
Disjunctive pattern definitions
Crawling page links during extraction
Recursive wrapping
Extracted data can have disjoint structure
from HTML source page
Internal data structure language Elog
Implemented Lixto System
Architecture and Implementation
Lixto created with Java using Swing,
OroMatcher and JDOM
Lixto toolkit contains three modules:
Interactive Pattern Builder
Extractor
XML Generator
Creating Wrappers
Lixto wrappers created interactively using patterns
in a hierarchical order
Pattern names act as default XML elements
<Item>
<Price>
Sub patterns express 1:* relationships
Each pattern characterizes one kind of information
Each pattern is defined by one or more filters
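To illustrate how a pattern hierarchy maps onto XML, the following sketch turns already-extracted pattern instances into nested elements named after their patterns. The data layout, the Description sub-pattern, and the to_xml helper are assumptions made for this example; this is not Lixto's XML Generator.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction result: one dict per instance of the parent pattern.
# Each sub-pattern maps to a list because the relationship is 1:*.
extracted = [
    {"Description": ["Sony notebook"], "Price": ["1800"]},
    {"Description": ["IBM notebook"], "Price": ["2500", "2600"]},
]


def to_xml(records, parent_pattern="Item"):
    """Use pattern names as element names, nesting sub-patterns under the parent."""
    root = ET.Element("document")
    for record in records:
        parent = ET.SubElement(root, parent_pattern)
        for pattern_name, values in record.items():
            for value in values:
                ET.SubElement(parent, pattern_name).text = value
    return root


root = to_xml(extracted)
print(ET.tostring(root, encoding="unicode"))
```

The second record carries two Price values to show the 1:* relationship: one parent instance can hold any number of sub-pattern instances.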
Filter Creation
User highlights desired target
Internally Elog rule created describing filter
Add restrictive conditions to filter
Goals added to Elog rule body
Filter conditions:
Before/after
Not before/not after
Internal
Range
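The condition types listed above can be pictured as predicates over candidate targets. The sketch below models a page as a flat list of tokens and a filter as a conjunction of conditions; the function names, the token model, and the reading of "range" as a slice over the matches in order of appearance are assumptions for illustration, not Elog itself.

```python
import re
from typing import Callable, List, Optional

# A condition inspects the token list and the index of a candidate target.
Condition = Callable[[List[str], int], bool]


def before(pattern: str, max_distance: int = 1) -> Condition:
    """Some token within max_distance positions before the target must match."""
    rx = re.compile(pattern)

    def cond(tokens: List[str], i: int) -> bool:
        start = max(0, i - max_distance)
        return any(rx.fullmatch(t) for t in tokens[start:i])

    return cond


def not_before(pattern: str, max_distance: int = 1) -> Condition:
    """Negation of before(): no such token may precede the target."""
    inner = before(pattern, max_distance)
    return lambda tokens, i: not inner(tokens, i)


def internal(pattern: str) -> Condition:
    """The target itself must contain a match."""
    rx = re.compile(pattern)
    return lambda tokens, i: rx.search(tokens[i]) is not None


def apply_filter(tokens: List[str], conditions: List[Condition],
                 first: Optional[int] = None,
                 last: Optional[int] = None) -> List[int]:
    """Keep targets satisfying every condition, then apply the range."""
    hits = [i for i in range(len(tokens))
            if all(c(tokens, i) for c in conditions)]
    return hits[first:last]


tokens = ["Price:", "1800", "Bids:", "5", "Price:", "2500"]
is_number = internal(r"^\d+$")
print(apply_filter(tokens, [is_number, before(r"Price:")]))        # [1, 5]
print(apply_filter(tokens, [is_number, before(r"Price:")], 0, 1))  # [1]
print(apply_filter(tokens, [is_number, not_before(r"Price:")]))    # [3]
```

Each added condition can only shrink the set of matched targets, which is why conditions are described as restrictive.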
Pattern Creation Algorithm
Loading initial document creates a
<document> pattern
User highlights instance of the pattern
Lixto displays all matched instances of the
pattern
Pattern Creation Algorithm
User can add filters to limit the matched
targets
The set of filters is added to the
<document> pattern
Test if <document> pattern extracts exactly
the desired set of data
If yes, save the pattern; if no, select a new
instance of the pattern
Generation of a New Pattern
The Lixto Browser
Conditional Generation
Visual Interface
Visual tree pattern construction
Regular expression string patterns
XML visualization tool
Concept generator
Regular expression / database driven
Creates 'isCity', 'isDate'
Requires no regular expression knowledge
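The two kinds of concept can be illustrated with a pair of small predicates: one backed by a lookup table (standing in for a database) and one backed by a regular expression. The city table, the date format, and the function names are assumptions for this sketch; the slides do not show how Lixto defines these concepts internally.

```python
import re

# Hypothetical stand-in for a database table of known cities.
CITY_TABLE = {"vienna", "london", "rome"}

# Assumed date shape: day, month, year separated by ".", "/" or "-".
DATE_RX = re.compile(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}")


def is_city(text: str) -> bool:
    """Database-driven concept: membership in a table of known values."""
    return text.strip().lower() in CITY_TABLE


def is_date(text: str) -> bool:
    """Regular-expression-driven concept: the whole string has a date shape."""
    return DATE_RX.fullmatch(text.strip()) is not None


print(is_city("Vienna"), is_date("12.03.2001"), is_date("notebook"))
```

A user of the visual interface would only pick a concept by name; the table or expression behind it stays hidden.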
Main Menu /
Pattern Generation Menu
Elog
Internal data storage language
Datalog-like syntax and semantics
Invisible to the user
Specifically designed for hierarchical and modular
data extraction
Flexible, intuitive, easily extensible
Patterns stored as narrowing (logical and) and
broadening (logical or) steps
Elog rules are implementations of the visually
defined filters
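The narrowing and broadening steps can be read as a logical formula: conditions inside one filter are combined with "and", and the filters of one pattern are combined with "or". The sketch below shows that structure with plain Python predicates; it is an analogy under that assumption, not Elog syntax, and the price filters are invented for the example.

```python
from typing import Callable, List

Condition = Callable[[str], bool]
Filter = List[Condition]   # narrowing: every condition must hold (logical and)
Pattern = List[Filter]     # broadening: any filter may match (logical or)


def filter_matches(flt: Filter, candidate: str) -> bool:
    return all(cond(candidate) for cond in flt)


def pattern_matches(pattern: Pattern, candidate: str) -> bool:
    return any(filter_matches(flt, candidate) for flt in pattern)


def is_amount(s: str) -> bool:
    """Digits with at most one decimal point."""
    return s.replace(".", "", 1).isdigit()


# Two hypothetical filters for a "price" pattern.
price_in_gbp: Filter = [lambda s: s.startswith("GBP"),
                        lambda s: is_amount(s[3:].strip())]
price_in_usd: Filter = [lambda s: s.startswith("$"),
                        lambda s: is_amount(s[1:])]
price: Pattern = [price_in_gbp, price_in_usd]

candidates = ["GBP 1800.00", "$99.50", "GBP soon", "notebook"]
extracted = [c for c in candidates if pattern_matches(price, c)]
print(extracted)
```

Adding a condition to a filter can only remove matches, while adding a filter to a pattern can only add matches, which mirrors the narrowing and broadening vocabulary on the slide.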
Elog Extraction Program for
eBay Example
Document Model
Brackets specify character offsets
Nodes numbered in depth-first left-to-right
fashion
HTML tags refer to element sets containing
attribute names and values
<body> tag contains attributes
(name, body), (bgcolor, FFFFFF), (elementtext, ...)
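A small sketch of this document model, built on Python's standard html.parser: each start tag becomes a node, nodes are numbered in the order they open (which is depth-first, left-to-right), and every node carries an attribute set that includes name and elementtext. The class name and the sample page are assumptions, character offsets are omitted, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class NodeNumberer(HTMLParser):
    """Number elements depth-first, left-to-right, and collect attribute sets."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # index in this list = node number
        self.open = []    # stack of node numbers of currently open elements

    def handle_starttag(self, tag, attrs):
        node = dict(attrs)
        # Simplification: an HTML attribute literally called "name" or
        # "elementtext" would be overwritten here.
        node["name"] = tag
        node["elementtext"] = ""
        self.open.append(len(self.nodes))
        self.nodes.append(node)

    def handle_endtag(self, tag):
        names = [self.nodes[i]["name"] for i in self.open]
        if tag in names:
            # Close the innermost matching element and anything still open in it.
            last = len(names) - 1 - names[::-1].index(tag)
            del self.open[last:]

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for i in self.open:
            self.nodes[i]["elementtext"] += data


page = ('<body bgcolor="FFFFFF"><table><tr>'
        '<td>item</td><td>price</td></tr></table></body>')
parser = NodeNumberer()
parser.feed(page)
parser.close()

for number, node in enumerate(parser.nodes):
    print(number, node)
```

For the sample page, node 0 is body with bgcolor FFFFFF, followed by table, tr, and the two td elements as nodes 1 to 4.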
HTML Example Page
XML Translation
Extraction Mechanisms
Tree extraction
Elements identified by tree path (*.table.*.tr)
Attribute constraints reduce matched elements
Element path definition (epd): tree path +
attribute constraints
String extraction
Strings stored in 'context' nodes
Regular expression matching
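Both mechanisms can be sketched in a few lines. The example below selects elements by a dotted tree path plus attribute constraints, then applies a regular expression to the text of each selected element. It assumes that "*" in a tree path stands for any run of zero or more elements, uses xml.etree on a well-formed snippet instead of real HTML, and the helper names and sample page are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET


def path_regex(tree_path: str) -> "re.Pattern[str]":
    """Translate a dotted tree path into a regex over 'a.b.c.' style paths."""
    rx = ""
    for part in tree_path.split("."):
        if part == "*":
            rx += r"(?:[^.]+\.)*"   # assumption: any run of element names
        else:
            rx += re.escape(part) + r"\."
    return re.compile(rx)


def select(root, tree_path, constraints=None):
    """Tree extraction: elements whose path and attributes both match."""
    rx = path_regex(tree_path)
    constraints = constraints or {}
    hits = []

    def walk(elem, prefix):
        path = prefix + elem.tag + "."
        if rx.fullmatch(path) and all(
                elem.get(k) == v for k, v in constraints.items()):
            hits.append(elem)
        for child in elem:   # document order = depth-first, left-to-right
            walk(child, path)

    walk(root, "")
    return hits


page = ET.fromstring(
    '<body><table>'
    '<tr class="head"><td>Item</td><td>Price</td></tr>'
    '<tr class="row"><td>Sony notebook</td><td>GBP 1800.00</td></tr>'
    '<tr class="row"><td>IBM notebook</td><td>GBP 2500.00</td></tr>'
    '</table></body>'
)

rows = select(page, "*.table.*.tr", {"class": "row"})

# String extraction: a regular expression over the text of each row.
prices = [re.search(r"GBP\s*([\d.]+)", "".join(r.itertext())).group(1)
          for r in rows]
print(prices)
```

The attribute constraint removes the header row, so only the two data rows reach the string extraction step.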
HTML Tree Extraction
Lixto Test Sites
Results
Strengths & Weaknesses
Intuitive UI (if it needs a manual, it's not a
good program)
Highly customizable
Supports crawling across web sites
No tree output after crawling
Slow
Extracts only one target type at a time
Current/Future Work
Extend tree structure to support crawling across
multiple sites (crawling is currently supported)
Server based Lixto system
Automated heuristics
Support for multiple example targets at once
Embedding Lixto wrappers into information
channel system
