(IJCSIS) International Journal of Computer Science and Information Security, Vol. 11, No. 1, January 2013
Web Content Extraction: A Heuristic Approach
Neetu Narwal, Asst. Prof., Maharaja Surajmal Institute, Affiliate College of GGSIP University; Research Scholar, Lingayas University
Dr. Mayank Singh, Ass. Prof., Research Guide, Lingayas University, mayanksingh2005@gmail.com
Abstract- The Internet comprises a huge volume of data, but the information it contains is highly unstructured. Many algorithms exist for identifying and extracting the main content of a web page, because alongside the main content there is irrelevant data that is of little significance to the user and is often distracting. A web page that contains only the relevant information is of greater value to the user. We propose an algorithm that extracts the main content of a web page and stores it in an XML file, where it can be put to further use.
Keywords- Web Data Extraction, DOM tree, Visual Cues, XML