Project Explanation PDF

Web page information extraction system by using Deep learning
Abstract
Nowadays the Internet has become an indispensable source of information for daily life activities.
It holds a large amount of information in web pages, but a web page also contains noisy data
such as advertisements, page settings, navigation buttons and notices. Some of the useful data
is hidden in images on the page, and extracting data hidden in an image is still difficult.
In this project, the hidden objects (data) in images are extracted from the web page.
An NLP model is proposed to filter the data.
Requirement
1. Python
2. HTML
3. Django
4. NLP
5. PostgreSQL
Modules
1. HTML
2. Django
3. NLTK
4. database
Project plan
The plan starts from the frontend, moves through the middleware and ends at the database. Each part is discussed in turn below.
Django sample
Introduction:
Django is a popular Python web framework, meaning it is a third-party Python library used for
developing web applications.
Python installation
sudo add-apt-repository ppa:jonathonf/python-3.6
sudo apt-get update
sudo apt-get install python3.6

Run the executable installer
1. Double-click the Python .exe file
2. Select "Add Python 3.6 to PATH"
3. Click Next
4. Click Install
5. Click Close
6. Open a command prompt
7. Type python and press Enter
Django install
>> py --version
>> py -m pip install virtualenvwrapper-win
>> mkvirtualenv myproject
>> workon myproject
>> py -m pip install Django

Start frontend
>> django-admin startproject mysite
mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py
● The outer mysite/ root directory is a container for your project. Its name doesn’t
matter to Django; you can rename it to anything you like.
● manage.py: A command-line utility that lets you interact with this Django project in
various ways. You can read all the details about manage.py in django-admin and
manage.py.
● The inner mysite/ directory is the actual Python package for your project. Its name is
the Python package name you’ll need to use to import anything inside it (e.g.
mysite.urls).
● __init__.py: An empty file that tells Python that this directory should be considered
a Python package. If you’re a Python beginner, read more about packages in the official
Python docs.
● settings.py: Settings/configuration for this Django project. Django settings will tell
you all about how settings work.
● urls.py: The URL declarations for this Django project; a “table of contents” of your
Django-powered site. You can read more about URLs in URL dispatcher.
● asgi.py: An entry-point for ASGI-compatible web servers to serve your project. See
How to deploy with ASGI for more details.
● wsgi.py: An entry-point for WSGI-compatible web servers to serve your project. See
How to deploy with WSGI for more details.
myproject/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
● models.py - A model is a class that represents a table or collection in our DB, and
where every attribute of the class is a field of the table or collection. Models are
defined in app/models.py.
● views.py - Django's views are the information brokers of a Django application. A view
sources data from your database and delivers it to a template. For a web application the
view delivers webpage content and templates.
● urls.py - A URL is a web address. You can see a URL every time you visit a website – it
is visible in your browser's address bar.
manage.py
manage.py is automatically created in each Django project. It does the same thing as
django-admin but also sets the DJANGO_SETTINGS_MODULE environment variable so
that it points to your project’s settings.py file. The django-admin script should be on your
system path if you installed Django via pip.
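What manage.py does with DJANGO_SETTINGS_MODULE can be sketched in a few lines of plain Python. The helper name point_to_settings and the module path mysite.settings are assumptions made for illustration (the path assumes the project package is called mysite); the generated manage.py makes an equivalent os.environ.setdefault call before handing over to Django.

```python
import os


def point_to_settings(env, module="mysite.settings"):
    """Set DJANGO_SETTINGS_MODULE in `env` unless it is already set.

    `env` is any dict-like mapping; pass os.environ to affect the real process.
    """
    # setdefault leaves an existing value untouched, so a value already
    # exported in the shell still wins over the default.
    env.setdefault("DJANGO_SETTINGS_MODULE", module)
    return env["DJANGO_SETTINGS_MODULE"]


if __name__ == "__main__":
    print(point_to_settings(os.environ))
```

Because setdefault is used, exporting DJANGO_SETTINGS_MODULE yourself before running a command is enough to switch to a different settings file.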
Run Application
>> python manage.py runserver
output
Print text
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
Home page
Registration page
Admin page
Login page
URL form page
output
Logout
NLTK
NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion forum.
>> pip install nltk

Get the data
Collect the raw data by using the URLs.
The dataset used in this project is the Artificial Intelligence raw dataset.
Request
You're going to use requests to do this, one of the most popular and useful Python
packages out there.
>> import requests
Requests will allow you to send HTTP/1.1 requests using Python. With
it, you can add content like headers, form data, multipart files, and
parameters as simple Python objects such as dictionaries. It also allows
you to access the response data in the same way.
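As a minimal sketch of this step, the function below downloads one page with requests.get and returns its HTML. The function name fetch_html, the User-Agent value and the 10-second timeout are assumptions chosen for illustration, not settings taken from the project.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    # The User-Agent header is optional; the value here is only a placeholder.
    headers = {"User-Agent": "webpage-extractor-demo/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html with one of the project URLs returns the raw HTML string that the following steps work on.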
Get the text from the Html
Here you will use the Beautiful Soup package. The package website says:
Beautiful Soup is a Python package for parsing HTML and XML
documents. It creates a parse tree for parsed pages that can be used to
extract data from HTML, which is useful for web scraping.
>> from bs4 import BeautifulSoup
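To show what "getting the text from the HTML" means without any extra install, the sketch below uses Python's standard-library html.parser as a stand-in for Beautiful Soup. The class name TextExtractor and the function html_to_text are hypothetical names for this illustration. With Beautiful Soup installed, BeautifulSoup(html, "html.parser").get_text() is the usual way to obtain the text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of `html` as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """<html><head><title>Page Title</title><style>p {color: red}</style></head>
<body><h1>This is a Heading</h1><p>This is a paragraph.</p></body></html>"""
print(html_to_text(sample))
```

Skipping script and style content matters for this project because those parts of a page are noise rather than readable text.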

Extract words from your Text with NLP
● Tokenize the text
● Remove stopwords
Tokenization
You want to tokenize your text, that is, split it into a list of words. Essentially, you
want to split off the parts of the text that are separated by whitespace.
To do this, you're going to use a powerful tool called regular expressions. A regular
expression, or regex for short, is a sequence of characters that defines a search
pattern.
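A minimal sketch of regex tokenization with Python's built-in re module follows. The pattern \w+ and the lowercasing step are assumptions for this illustration; note that \w+ also drops punctuation, which goes slightly beyond splitting on whitespace alone.

```python
import re


def tokenize(text):
    """Return the lowercase word tokens found in `text`."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # and whitespace both act as separators.
    return re.findall(r"\w+", text.lower())


print(tokenize("Artificial intelligence is everywhere; AI is here."))
# ['artificial', 'intelligence', 'is', 'everywhere', 'ai', 'is', 'here']
```

Lowercasing first means "AI" and "ai" are later counted as the same word.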
Remove stop words
It is common practice to remove words that appear a lot in the English language,
such as 'the', 'of' and 'a' (known as stopwords), because they carry little
information about the content.
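The filtering step can be sketched as a simple list comprehension. The small STOPWORDS set below is hand-written for this illustration only; NLTK provides a much longer English list through nltk.corpus.stopwords.words('english') once the stopwords corpus has been downloaded.

```python
# A tiny hand-written stopword set, used only for this illustration.
STOPWORDS = {"the", "of", "a", "an", "is", "and", "in", "to"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in `stopwords`."""
    return [token for token in tokens if token not in stopwords]


tokens = ["the", "future", "of", "artificial", "intelligence", "is", "a", "mystery"]
print(remove_stopwords(tokens))
# ['future', 'artificial', 'intelligence', 'mystery']
```

Using a set for the stopwords keeps each membership check fast even when the list grows to a few hundred words.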
Create a histogram diagram
● You create a frequency distribution object using the function nltk.FreqDist();
● You use the plot() method of the resulting object.
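The counting behind the histogram can be sketched with collections.Counter from the standard library, used here as a stand-in so the example runs without NLTK. nltk.FreqDist is a subclass of Counter, so nltk.FreqDist(tokens) accepts the same list and offers the same most_common() method, plus plot(), which needs matplotlib. The function name word_frequencies and the token list are made up for this illustration.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


freq = word_frequencies(["ai", "data", "ai", "model", "data", "ai"])
print(freq.most_common(2))
# [('ai', 3), ('data', 2)]
```

The most frequent words after stopword removal are the ones that would form the tallest bars in the histogram.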
Database connection
URLs -
https://www.zdnet.com/article/what-is-ai-everything-you-need-to-know-about-artificial-intelligence/
