
Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.

Types of Web Mining:


Web Usage Mining
Web Content Mining
Web Structure Mining

Web Content Mining

Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents.
Wrapper: a program for extracting structured data.

Extraction from page


A Web page can be seen as a sequence of tokens (e.g., words, numbers, and HTML tags). Extraction is guided by a tree structure called the EC tree (embedded catalog tree), which models how the data is embedded in an HTML page. Each node is extracted using two rules: the start rule identifies the beginning of the node, and the end rule identifies its end.
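To make the token view concrete, here is a minimal Python sketch that splits a page into words, numbers, punctuation, and HTML tags. The regular expression and the `tokenize` name are assumptions chosen for illustration; the source does not specify a tokenizer.

```python
import re

# Assumed token pattern (not from the source): an HTML tag, a run of
# word characters (words or numbers), or one punctuation character.
TOKEN_PATTERN = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def tokenize(page: str) -> list[str]:
    """Split a page into a flat sequence of tokens."""
    return TOKEN_PATTERN.findall(page)


# Sample HTML line from this section's phone-number example.
page = "Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>"
tokens = tokenize(page)
print(tokens)
```

With this pattern, each tag such as `<i>` or `</i>` becomes a single token, which is what lets a tag serve as a marker when locating an item.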

The extraction rules are based on the idea of landmarks. A landmark is a sequence of consecutive tokens that is used to locate the beginning or the end of a target item.

Sample
Extract the phone number from the following HTML code.
Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>

R1: SkipTo(<i>)
This rule means that the system should start from the beginning of the page and skip all tokens until it sees the first <i> tag. Here <i> is a landmark.

Similarly, to identify the end of the text to be extracted, we can use:
R2: SkipTo(</i>)
R1 is called the start rule and R2 is called the end rule.
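The two rules can be sketched in Python as follows. This is an illustrative sketch, not the original system's implementation: the names `tokenize`, `skip_to`, and `extract` are hypothetical, the tokenizer pattern is an assumption, and the end rule here scans forward from where the start rule stopped, which is a simplification the source does not confirm.

```python
import re

# Assumed tokenizer: HTML tags, runs of word characters, or single
# punctuation characters. Each token keeps its character offsets.
TOKEN_PATTERN = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def tokenize(page):
    """Return a list of (text, start_offset, end_offset) tuples."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(page)]


def skip_to(tokens, landmark, start=0):
    """Index of the first occurrence of the landmark (a list of
    consecutive token strings) at or after `start`, or None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if [text for text, _, _ in tokens[i:i + n]] == landmark:
            return i
    return None


def extract(page, start_landmark, end_landmark):
    """Apply a start rule and an end rule; return the text between them."""
    tokens = tokenize(page)
    s = skip_to(tokens, start_landmark)          # start rule, e.g. SkipTo(<i>)
    if s is None:
        return None
    first = s + len(start_landmark)              # first token after the landmark
    e = skip_to(tokens, end_landmark, first)     # end rule, e.g. SkipTo(</i>)
    if e is None:
        return None
    begin_char = tokens[first - 1][2]            # offset just past the start landmark
    end_char = tokens[e][1]                      # offset where the end landmark begins
    return page[begin_char:end_char].strip()


page = "Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>"
phone = extract(page, ["<i>"], ["</i>"])
print(phone)
```

Because landmarks are passed as lists, a multi-token landmark such as `["Phone", ":", "<i>"]` works the same way as the single-tag landmarks used in R1 and R2.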

