
Web Page Information Extraction System Using Deep Learning
Abstract

Nowadays the Internet has become an indispensable source of information for daily life.
It holds a large amount of information hidden in web pages. A web page also contains
noisy data such as advertisements, page settings, navigation buttons and notices.
Useful data is often hidden inside images on the page, and extracting it is still
difficult. In this project, the hidden objects (data) in images are extracted from the
web page, and an NLP model is proposed to filter the data.
Requirement

1. Python
2. HTML
3. Django
4. NLP
5. PostgreSQL
Modules

1. HTML
2. Django
3. NLTK
4. Database
Project plan

The work proceeds from the frontend, through the middleware, to the database. Each part
is discussed in turn.
Django sample

Introduction:

Django is a popular Python web framework, meaning it is a third-party Python library used for
developing web applications.
Python installation
sudo add-apt-repository ppa:jonathonf/python-3.6

sudo apt-get update

sudo apt-get install python3.6


Run the executable installer

1. Double-click the Python .exe file
2. Select "Add Python 3.6 to PATH"
3. Click Next
4. Click Install
5. Click Close
6. Open the command prompt
7. Type python and press Enter
Django install
>> py --version

>> py -m pip install virtualenvwrapper-win

>> mkvirtualenv myproject

>> workon myproject

>> py -m pip install Django


Start frontend
>> django-admin startproject mysite

mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py
● The outer mysite/ root directory is a container for your project. Its name doesn’t
matter to Django; you can rename it to anything you like.
● manage.py: A command-line utility that lets you interact with this Django project in
various ways. You can read all the details about manage.py in django-admin and
manage.py.
● The inner mysite/ directory is the actual Python package for your project. Its name is
the Python package name you’ll need to use to import anything inside it (e.g.
mysite.urls).
● __init__.py: An empty file that tells Python that this directory should be considered
a Python package. If you’re a Python beginner, read more about packages in the official
Python docs.
● settings.py: Settings/configuration for this Django project. Django settings will tell
you all about how settings work.
● urls.py: The URL declarations for this Django project; a “table of contents” of your
Django-powered site. You can read more about URLs in URL dispatcher.
● asgi.py: An entry-point for ASGI-compatible web servers to serve your project. See
How to deploy with ASGI for more details.
● wsgi.py: An entry-point for WSGI-compatible web servers to serve your project. See
How to deploy with WSGI for more details.
myproject/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
● models.py - A model is a class that represents a table or collection in our DB, and
where every attribute of the class is a field of the table or collection. Models are
defined in app/models.py.
● views.py - Django's views are the information brokers of a Django application. A view
sources data from your database and delivers it to a template. For a web application
the view delivers webpage content and templates.
● urls.py - A URL is a web address. You can see a URL every time you visit a website –
it is visible in your browser's address bar.
manage.py

manage.py is automatically created in each Django project. It does the same thing as
django-admin but also sets the DJANGO_SETTINGS_MODULE environment variable so
that it points to your project’s settings.py file. The django-admin script should be on your
system path if you installed Django via pip.
Run Application

python manage.py runserver


output
Print text

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p>This is a paragraph.</p>

</body>
</html>
Home page
Registration page
Admin page
Login page
URL form page
output
Logout
NLTK

NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion forum.

>> pip install nltk


Get the data

Collect the raw data from the URLs.

The dataset used in this project is the Artificial intelligence Raw Dataset.
Requests

You're going to use requests to do this, one of the most popular and useful Python
packages out there.

>> import requests

Requests allows you to send HTTP/1.1 requests using Python. With
it, you can add content like headers, form data, multipart files, and
parameters via simple Python data structures, and access the
response data in the same way.
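
The download step could be sketched roughly as follows. This is only an illustration: the function name fetch_html, the optional session parameter, and the User-Agent value are hypothetical and do not come from the project. The calls used (Session.get, raise_for_status, and the text attribute) are part of the requests library.

```python
def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    if session is None:
        # Imported lazily so the function can also be exercised with a
        # stand-in session object (hypothetical design choice).
        import requests
        session = requests.Session()
    response = session.get(
        url,
        headers={"User-Agent": "webpage-extractor-demo/0.1"},  # hypothetical value
        timeout=timeout,
    )
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Passing a timeout keeps the call from hanging indefinitely on an unresponsive site, and raise_for_status() avoids silently parsing an error page as if it were the article.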
Get the text from the HTML

Here you will use the package Beautiful Soup. The package description says:

Beautiful Soup is a Python package for parsing HTML and XML
documents. It creates a parse tree for parsed pages that can be used to
extract data from HTML, which is useful for web scraping.

>> from bs4 import BeautifulSoup
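
To show what "getting the text" means without any third-party dependency, here is a minimal sketch built on Python's standard html.parser module. It is a stand-in for illustration, not the project's Beautiful Soup code; the names TextExtractor and html_to_text are hypothetical. It keeps visible text and skips the contents of script and style tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With Beautiful Soup installed, the usual equivalent is to build a BeautifulSoup object from the HTML and call its get_text() method, which also copes better with malformed markup.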


Extract words from your Text with NLP

● Tokenize the text


● Remove stopwords
Tokenization

You want to tokenize your text, that is, split it into a list of words. Essentially, you
want to split off the parts of the text that are separated by whitespace.

To do this, you're going to use a powerful tool called regular expressions. A regular
expression, or regex for short, is a sequence of characters that defines a search
pattern.
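
A minimal sketch of regex tokenization, assuming that lowercasing the text and dropping punctuation is acceptable for this project (the function name tokenize is hypothetical):

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    The pattern \\w+ matches runs of letters, digits and underscores, so
    whitespace and punctuation act as separators and are discarded.
    """
    return re.findall(r"\w+", text.lower())
```

Note that this pattern splits contractions such as "don't" into two tokens; a more careful pattern or an NLTK tokenizer may be preferable if that matters.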
Remove stop words

It is common practice to remove words that appear a lot in the English language,
such as 'the', 'of' and 'a' (known as stopwords), because they carry little
information about the content.
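
The filtering step might look like the sketch below. The short STOPWORDS set is a hand-picked assumption used only to keep the example self-contained; NLTK provides a fuller English list through nltk.corpus.stopwords.words("english") once the stopwords corpus has been downloaded.

```python
# Tiny illustrative list (an assumption); not the list used in the project.
STOPWORDS = frozenset({"the", "of", "a", "an", "and", "is", "in", "to"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [token for token in tokens if token.lower() not in stopwords]
```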
Create a histogram diagram

● You create a frequency distribution object using the function nltk.FreqDist().

● You use the plot() method of the resulting object.
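
As a dependency-free illustration of what a frequency distribution holds, the sketch below counts tokens with collections.Counter from the standard library. nltk.FreqDist is a subclass of Counter, so most_common() behaves the same way on it; the function name top_words is hypothetical, and plotting is left out because it requires matplotlib.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)
```

For example, top_words(["ai", "data", "ai"], 1) gives [("ai", 2)].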
Database connection

URLs -
https://www.zdnet.com/article/what-is-ai-everything-you-need-to-know-about-artificial-intelligence/
