
1. Data Processing

To increase the commercial value and accessibility of their pages, most content sites publish pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. The purpose of mining the intrapage informative structure of news Web sites is to find and eliminate this redundant information. Note that the intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or the bodies of news articles. The method splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods.

2. Stemming Algorithm: Paice/Husk Stemming Algorithm

Stopwords (is, and, was, were, ...) are removed from the retrieved keyword list, and the Paice/Husk stemmer reduces the remaining keywords to their stems. (Each page may include many links, and all the keywords in each link are retrieved.)

3. Entropy Calculation

The entropy calculation is based on the weights of the terms that are most frequent and least frequent. (Based on a theorem.)
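The keyword filtering and entropy steps can be illustrated with a minimal Python sketch. The text does not state the theorem or the exact weighting, so the code assumes a Shannon-style entropy over a term's normalized frequency across pages, with the logarithm taken to base n (the number of pages) so that values fall between 0 and 1. The names `STOPWORDS`, `remove_stopwords`, and `term_entropy` are hypothetical, the stopword list holds only the four words named above, and Paice/Husk stemming itself is omitted.

```python
import math
from collections import Counter

# Assumed stopword list: only the four words named in the text.
STOPWORDS = {"is", "and", "was", "were"}


def remove_stopwords(keywords):
    """Drop stopwords from a list of retrieved keywords (case-insensitive)."""
    return [k for k in keywords if k.lower() not in STOPWORDS]


def term_entropy(term, pages):
    """Entropy of a term's distribution over pages (assumed formula).

    Each page is a list of keywords. The weight of the term on a page is
    its count on that page divided by its total count over all pages.
    The log base is the number of pages, so the result lies in [0, 1].
    """
    n = len(pages)
    counts = [Counter(page)[term] for page in pages]
    total = sum(counts)
    if total == 0 or n < 2:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            weight = count / total
            entropy -= weight * math.log(weight, n)
    return entropy


# Two toy pages; "home" appears on both, "news" on only one.
pages = [
    remove_stopwords(["news", "is", "home"]),
    remove_stopwords(["sports", "and", "home"]),
]
home_entropy = term_entropy("home", pages)
news_entropy = term_entropy("news", pages)
```

Under this assumed formula, a term spread evenly over all pages (such as one from a navigation panel) scores near 1, while a term concentrated on one page scores near 0. Whether high or low entropy marks a block as redundant depends on the theorem the text refers to, which is not given here.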
