
Internet Data Mining

Sponsor: Ling Liu / David Buttler

{lingliu, buttler}
223 / 260 CCB
Systems and Databases
Related Projects

The explosive growth of the Internet has become an overused cliche, yet the problems of
information overload remain as real as ever. Web search engines provide one way to manage the
deluge of information on the Internet, but they have some serious drawbacks for many
applications. Common search engines do not index dynamic content; any URL with a '?' is
ignored. Nor do search engines return results at a finer granularity than a single HTML page. Their
design makes them unsuitable for comparison shopping or data integration.
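To make the '?' rule above concrete, here is a minimal sketch of the kind of filter it describes. The class name `DynamicUrlFilter` and the example URLs are assumptions made for illustration; real search engine crawlers are not documented here and may apply different rules.

```java
/** Illustrative only: mimics the "ignore any URL with a '?'" rule described above. */
final class DynamicUrlFilter {

    /** Returns true when the URL contains a '?', i.e. it looks dynamically generated. */
    static boolean isDynamic(String url) {
        return url.indexOf('?') >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical URLs, chosen only to show both outcomes.
        String[] urls = {
            "http://example.com/catalog.html",
            "http://example.com/search?q=java+books"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (isDynamic(url) ? "ignored" : "indexed"));
        }
    }
}
```

Under this rule the result page of every query form is invisible to the index, which is the gap a dynamic search engine would have to fill.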
The DISL group has constructed a powerful set of information extraction tools to address
some of these problems. Several research challenges remain, however. The
following figure presents a simple architecture for a dynamic search engine.

Within this framework there are several possible short projects suitable for a 7001 mini-project
or an extended Special Problems project.
1. Design and implement a robot crawler that discovers new dynamic search engines.
2. Design a technique to categorize a search engine by its contents (the pages that it
dynamically generates), the types of queries it responds to (query interface), or the
context of the search interface.
3. In conjunction with the categorization system, develop a user interface that assists users
in selecting the appropriate types of sources that are applicable to their query (see the
AQR project for an example static system).
4. Improve the automated object extraction system. This task may itself be broken down into
individual projects.

Currently, the automated object extraction system works in two phases: (1) identify the
region of a dynamically generated web page that contains data objects; (2) discover how
the objects are separated (e.g. is there a single tag that separates objects?), and use the
separator to split the data region into objects.
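The two phases can be sketched as follows. This is not the DISL implementation: the `Node` class is a stand-in for a parsed HTML tree, and both heuristics (highest fan-out for the data region, most frequent child tag for the separator) are assumptions chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ObjectExtractionSketch {

    /** Minimal stand-in for a node in a parsed HTML tag tree (hypothetical type). */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node k : kids) children.add(k);
        }
    }

    /** Phase 1 (assumed heuristic): pick the node with the most direct children. */
    static Node findDataRegion(Node root) {
        Node best = root;
        for (Node child : root.children) {
            Node candidate = findDataRegion(child);
            if (candidate.children.size() > best.children.size()) best = candidate;
        }
        return best;
    }

    /** Phase 2a (assumed heuristic): the most frequent child tag is the separator. */
    static String findSeparator(Node region) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Node child : region.children) {
            int n = counts.merge(child.tag, 1, Integer::sum);
            if (best == null || n > counts.get(best)) best = child.tag;
        }
        return best;
    }

    /** Phase 2b: every separator occurrence starts a new object; earlier nodes are skipped. */
    static List<List<Node>> splitObjects(Node region, String separator) {
        List<List<Node>> objects = new ArrayList<>();
        List<Node> current = null;
        for (Node child : region.children) {
            if (child.tag.equals(separator)) {
                current = new ArrayList<>();
                objects.add(current);
            }
            if (current != null) current.add(child);
        }
        return objects;
    }

    /** Hypothetical result page: a heading followed by three dt/dd pairs. */
    static Node samplePage() {
        Node list = new Node("dl",
            new Node("dt"), new Node("dd"),
            new Node("dt"), new Node("dd"),
            new Node("dt"), new Node("dd"));
        return new Node("html", new Node("body", new Node("h1"), list));
    }

    public static void main(String[] args) {
        Node region = findDataRegion(samplePage());
        String separator = findSeparator(region);
        List<List<Node>> objects = splitObjects(region, separator);
        System.out.println(region.tag + " / " + separator + " / " + objects.size() + " objects");
    }
}
```

On the sample tree the region is the `dl` element, the separator is `dt`, and the split yields three objects of two nodes each. Real pages are far messier, which is why each phase is a candidate for better heuristics.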

Mini-projects in this area may include the following:

○ Develop a new heuristic to identify where the data objects are; validate the
effectiveness of the heuristic
○ Develop a new heuristic to split the data region into data objects; validate the
effectiveness of the heuristic
○ Implement a more sophisticated technique to combine individual heuristics to
produce a better result, for either the data region identification heuristics or the
object separator discovery heuristics.
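As one possible starting point for combining heuristics, the sketch below uses a simple Borda-style rank vote. This technique is swapped in for illustration and is not necessarily what the existing system does; the class name `HeuristicCombiner` and the sample rankings are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeuristicCombiner {

    /**
     * Each inner list holds one heuristic's candidate tags, best first.
     * A candidate at position i in a list of size n earns (n - i) points;
     * the candidate with the highest total wins (first seen wins a tie).
     */
    static String combine(List<List<String>> rankings) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int i = 0; i < n; i++) {
                scores.merge(ranking.get(i), n - i, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) best = e.getKey();
        }
        return best;
    }

    public static void main(String[] args) {
        // Three hypothetical separator heuristics, each ranking its candidates.
        List<List<String>> rankings = Arrays.asList(
            Arrays.asList("tr", "p", "br"),
            Arrays.asList("p", "tr", "hr"),
            Arrays.asList("tr", "hr", "p"));
        System.out.println(combine(rankings)); // prints "tr" (8 points vs. 6 for "p")
    }
}
```

A more sophisticated combiner might weight each heuristic by its measured success rate instead of treating all votes equally.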
There are several interesting projects related to this topic. Please see either David or Prof. Ling
Liu to discuss other options.
Resources that may be helpful:
• Local Java code library (convert an HTML file into a tree, automatically extract textual
objects from a page, and more).
• A Java framework to automatically run a heuristic over a large set of test web pages
• A set of web pages to test solutions, plus a method to evaluate whether a data-region
heuristic or an object separator heuristic succeeded on a given web page.
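The following sketch shows the general shape of such an evaluation: run a heuristic over labeled test pages and report the fraction it gets right. It does not reflect the API of the local framework, which is not described here; `HeuristicEvaluation`, `toySeparatorHeuristic`, and the labeled pages are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class HeuristicEvaluation {

    /** Fraction of labeled pages on which the heuristic's answer matches the label. */
    static double successRate(Function<String, String> heuristic,
                              Map<String, String> labeledPages) {
        if (labeledPages.isEmpty()) return 0.0;
        int hits = 0;
        for (Map.Entry<String, String> e : labeledPages.entrySet()) {
            if (e.getValue().equals(heuristic.apply(e.getKey()))) hits++;
        }
        return (double) hits / labeledPages.size();
    }

    /** Toy heuristic (hypothetical): guess "tr", then "li", otherwise "p". */
    static String toySeparatorHeuristic(String page) {
        if (page.contains("<tr")) return "tr";
        if (page.contains("<li")) return "li";
        return "p";
    }

    public static void main(String[] args) {
        // Page source -> hand-labeled separator tag (made-up test data).
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("<table><tr></tr><tr></tr></table>", "tr");
        pages.put("<ul><li></li><li></li></ul>", "li");
        pages.put("<div><p></p><hr><p></p><hr></div>", "hr");
        pages.put("<dl><dt></dt><dd></dd></dl>", "dt");

        double rate = successRate(HeuristicEvaluation::toySeparatorHeuristic, pages);
        System.out.println("success rate: " + rate); // 2 of 4 pages correct: 0.5
    }
}
```

Reporting a single success rate over a fixed page set makes it straightforward to compare a new heuristic against existing ones.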

You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but
not required.

You are expected to submit a report describing the work you did and how you evaluated your results,
along with any source code you produced to accomplish your results.
You will be graded on the novelty and quality of your report and implementation.