Abstract
User comments posted on the YouTube video-sharing website are classified based
on their relevance to the video content, as given by the description associated
with the posted video. Comments are analysed for polarity and further
segregated as positive or negative. A comparative analysis of classifiers using
the Bag of Words and Association List approaches is presented. Online social
media sites such as YouTube, Twitter, Facebook, and LinkedIn are now very
popular. People turn to social media to interact with other people, gain
knowledge, share ideas, be entertained, and stay informed about events
happening in the rest of the world. Among these sites, YouTube has emerged as
the most popular website for sharing and viewing video content. However, this
success has also attracted malicious users, who aim to self-promote their
videos or disseminate viruses and malware. These spam videos may be unrelated
to their title or may contain irrelevant content. It is therefore very
important to find a way to detect these videos and report them. In this work,
we evaluate several top-performing classification techniques for this purpose.
Statistical analysis of the results indicates that the Multilayer Perceptron
and Support Vector Machine achieve good accuracy.
Introduction
Earlier work proposed a Spam Detection (SD) technique that can identify useless
and superfluous features in blogs, making significant stories more accessible
to regular stakeholders. The authors suggested extending their work to widen
the definition of spam, for example to URLs and short-message removal, and to
include adversary awareness and online deployment so that future comments can
be predicted. In another approach, crawlers are used to collect the wall posts
of specific Facebook users. The wall posts are then filtered, and only those
that contain URLs are kept. The method separates the text of each wall post
from the link mentioned in it, and groups posts that have similar text content
and point to the same destination URL. A post-similarity graph clustering
algorithm is used to identify the similarity between posts and URLs, and
malicious users and posts are identified on that basis. Other work uses the
concept of anomaly detection, in which the divergence from authentic emails
serves as a metric to classify emails as spam or ham. Better accuracy was
achieved owing to the limited training sets as seen in labelling-based systems.
DISADVANTAGES
Proposed system
The proposed system uses the YouTube Spam Collection Data Set, in which each
comment is labelled as spam or ham. This dataset is fed into a term
frequency-inverse document frequency (TF-IDF) vectorizer, which transforms
words into numerical features (NumPy arrays) for training and testing. The
transformed dataset is split into training and testing subsets and fed into
Multilayer Perceptron (MLP), Support Vector Machine (SVM), Naïve Bayes (NB),
Decision Tree (DT), Random Forest (RF), Logistic Regression (LG), and
k-Nearest Neighbor (kNN) pipelines respectively. All implementation was done in
Python, using scikit-learn for the machine learning classifiers.
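The pipeline described above can be sketched with scikit-learn roughly as
follows. This is a minimal illustration, not the implementation used in this
work: the ten comments and their labels are made up, and the hyperparameters
(linear SVM kernel, k = 3, max_iter=500, fixed random seeds) are assumptions
chosen only so that the sketch runs on a tiny sample.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the YouTube Spam Collection (1 = spam, 0 = ham).
comments = [
    "check out my channel and subscribe",
    "free gift cards click this link now",
    "subscribe to my channel for free money",
    "visit my website to win a free phone",
    "click here to get more subscribers",
    "this song is amazing",
    "love this video so much",
    "best performance I have seen this year",
    "the guitar solo at the end is great",
    "who is still listening to this song",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Transform the text into TF-IDF features, then split 80/20 as in the text.
features = TfidfVectorizer().fit_transform(comments)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# One estimator per technique named above (settings are assumptions).
classifiers = {
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "SVM": SVC(kernel="linear"),
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LG": LogisticRegression(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}

# Fit each classifier and record its accuracy on the held-out comments.
scores = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

The sketch vectorizes before splitting because that is the order described
above. Fitting the vectorizer on the training subset only is a common
alternative that keeps test-set vocabulary out of the features.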
ADVANTAGES
2. Naive Bayes requires only a limited amount of training data to estimate its
parameters, so the training period is short.
Data Collection
The YouTube Spam Collection Data Set was collected from data repositories. It
has five datasets composed of 1,956 real, non-encoded messages that were
labelled as legitimate (ham) or spam. Each sample represents a text comment
posted in the comments section of one of the selected videos. No preprocessing
technique was applied. Each sample was manually labelled as spam or legitimate
(ham) using a collaborative tagging tool developed for this purpose, called
Labelling. Each sample also has associated metadata, such as the author's name
and publication date, which has been preserved.
Dataset Information
The samples were extracted from the comments section of five videos that were
among the 10 most viewed on YouTube during the collection period. The table
below lists the datasets, the YouTube video ID, the number of samples in each
class and the total number of samples per dataset.
Attribute Information
The collection is composed of one CSV file per dataset, where each line has the
following attributes: COMMENT_ID, AUTHOR, DATE, CONTENT, CLASS.
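As an illustration of this layout, the sketch below reads comments and labels
with Python's csv module. The two rows are made up, the header spelling
(including CLASS in upper case) and the encoding 1 = spam, 0 = ham are
assumptions, and load_comments is a hypothetical helper name.

```python
import csv
import io

# Two made-up rows in the documented column layout; a real file would be opened
# with open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample_csv = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "id-001,alice,2015-05-01T10:00:00,Great song!,0\n"
    'id-002,bob,2015-05-01T10:05:00,"Subscribe to my channel, please",1\n'
)

def load_comments(handle):
    """Return (texts, labels) from a CSV file object with the columns above."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["CONTENT"])
        labels.append(int(row["CLASS"]))  # assumed encoding: 1 = spam, 0 = ham
    return texts, labels

texts, labels = load_comments(sample_csv)
print(texts, labels)
```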
Data Pre processing
In the pre-processing phase, the raw dataset is cleaned by tokenization,
stop-word removal, and stemming. The clean dataset is then used for the next
step, feature selection and extraction. The dataset is split into 80% for the
training set and the remaining 20% for the test set. In any text mining
problem, text cleaning is the first step: we remove from the document those
words that may not contribute to the information we want to extract. YouTube
comments may contain many undesirable elements, such as punctuation marks, stop
words, and digits, which may not be helpful in SD. After cleaning the text, we
feed the dataset into a term frequency-inverse document frequency (TF-IDF)
vectorizer, which transforms words into numerical features (NumPy arrays) for
training and testing.
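A minimal sketch of the cleaning and splitting steps, using only the standard
library. The stop-word list is a tiny illustrative assumption, stemming is left
out because the standard library has no stemmer, and clean and split_80_20 are
hypothetical helper names.

```python
import random
import re

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "my", "this", "of", "for"}

def clean(comment):
    """Lowercase, drop digits and punctuation, tokenize, remove stop words."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and return (train, test) in an 80/20 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

print(clean("Check out my channel!!! 100% FREE gifts"))
train, test = split_80_20(range(10))
print(len(train), len(test))
```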
TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF
weight is often used in information retrieval and text mining. This weight is a
statistical measure used to evaluate how important a word is to a document in a
collection or corpus. The importance increases proportionally to the number of
times a word appears in the document but is offset by the frequency of the word
in the corpus. Variations of the TF-IDF weighting scheme are often used by
search engines as a central tool in scoring and ranking a document's relevance
given a user query.
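To make the weighting concrete, the sketch below computes one common textbook
variant: term frequency as count divided by document length, and inverse
document frequency as the natural log of N divided by the number of documents
containing the term. The choice of variant is an assumption (library
implementations such as scikit-learn's apply smoothing and normalization and
give different numbers), and the three tokenized documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["free", "gift", "click", "link"],
    ["love", "this", "song"],
    ["click", "subscribe", "free", "free"],
]
for row in tf_idf(docs):
    print(row)
```

A word that occurs in every document gets weight zero under this variant,
which is the "offset by the frequency of the word in the corpus" effect.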
Feature Extraction
Once the dictionary is ready, we can extract a word count vector of 4454
dimensions for each YouTube comment in the dataset. Each word count vector
holds the frequency, within that comment, of each of the 4454 words found in
the whole dataset. The main advantage of using the words present in the dataset
is that it reduces uncertainty in the final predictions, because those words
differ markedly in how often they occur in spam and ham comments on YouTube.
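The word count vector can be sketched as follows. The three tokenized comments
are made up, so the dictionary here has 4 words rather than 4454, and
build_vocabulary and count_vector are hypothetical helper names.

```python
from collections import Counter

def build_vocabulary(tokenized_comments):
    """Map every distinct word in the dataset to a fixed column index."""
    words = sorted({w for comment in tokenized_comments for w in comment})
    return {word: index for index, word in enumerate(words)}

def count_vector(tokens, vocabulary):
    """Return the word-count vector of one comment over the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

comments = [["free", "free", "gift"], ["love", "song"], ["free", "song"]]
vocabulary = build_vocabulary(comments)
vectors = [count_vector(c, vocabulary) for c in comments]
print(vocabulary)
print(vectors)
```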
Classifier Techniques
After feature extraction, the transformed dataset is fed into any two of the
classifier techniques: Support Vector Machine (SVM), Naïve Bayes (NB), Decision
Tree (DT), Random Forest (RF), Logistic Regression (LG), and k-Nearest
Neighbour (kNN) pipelines respectively. This phase has a training process and a
testing process: 80% of the data is used for training and 20% for testing. Once
step iii is complete, the features that are considered to indicate spam are
available, and the models are trained on the dataset using machine learning
techniques. SVM is well suited to separating positive from negative examples in
problems such as spam detection. SVM is a supervised learning model that
analyzes data for classification and regression, and it is mostly used in
classification problems. Here SVM is used for a binary classification problem
and relies on kernel functions. K-NN is also a supervised learning method. In
the K-NN algorithm, data points are represented in a vector space. K-NN finds
the k training data points most similar to a testing data point. After
determining the k nearest neighbours, the algorithm combines the neighbours'
labels to decide the label of the testing data point. In this implementation,
the labels are combined by a simple majority vote.
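The majority-vote rule can be sketched in plain Python as below. Euclidean
distance and k = 3 are assumptions (the text does not state the distance
measure or the value of k), the five training vectors are made up, and
knn_predict is a hypothetical helper name. math.dist requires Python 3.8 or
later.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_vectors = [[2, 1, 0, 0], [1, 1, 0, 0], [3, 0, 0, 1], [0, 0, 1, 1], [0, 0, 2, 1]]
train_labels = ["spam", "spam", "spam", "ham", "ham"]
print(knn_predict(train_vectors, train_labels, [2, 0, 0, 0]))
print(knn_predict(train_vectors, train_labels, [0, 0, 1, 0]))
```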
Result Evaluation
We evaluate our results and define the evaluation criteria used to measure the
performance of our classification models.
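The specific criteria are not listed here, so the sketch below assumes the
usual ones for binary spam classification: accuracy, precision, recall, and F1,
with spam (label 1) as the positive class. The label lists are made up and
evaluate is a hypothetical helper name.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(evaluate(y_true, y_pred))
```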
HARDWARE SPECIFICATION: (MINIMUM REQUIREMENT)
INTERNAL MEMORY CA : 2 GB
SOFTWARE SPECIFICATION:
LANGUAGE : PYTHON
IDE : SPYDER
Python 3.0 was released on 3 December 2008. It was a major revision of the
language that is not completely backward-compatible. Many of its major
features were backported to Python 2.6.x and 2.7.x version series. Releases of
Python 3 include the 2to3 utility, which automates (at least partially) the
translation of Python 2 code to Python 3.
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out
of concern that a large body of existing code could not easily be forward-ported
to Python 3.
Python is a multi-paradigm programming language. Object-oriented
programming and structured programming are fully supported, and many of its
features support functional programming and aspect-oriented programming
(including by metaprogramming and metaobjects (magic methods)). Many other
paradigms are supported via extensions, including design by contract and logic
programming.
Python uses dynamic typing and a combination of reference counting and a
cycle-detecting garbage collector for memory management. It also features
dynamic name resolution (late binding), which binds method and variable
names during program execution.
Python's design offers some support for functional programming in the Lisp
tradition. It has filter, map, and reduce functions; list comprehensions,
dictionaries, sets, and generator expressions. The standard library has two
modules (itertools and functools) that implement functional tools borrowed
from Haskell and Standard ML.
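A short illustration of the functional tools mentioned above. In Python 3,
reduce lives in the functools module; the list of comment lengths is made up.

```python
from functools import reduce
from itertools import islice

comment_lengths = [12, 140, 7, 56, 230]

# filter and map return lazy iterators; reduce folds a sequence into one value.
long_ones = list(filter(lambda n: n > 50, comment_lengths))
doubled = list(map(lambda n: n * 2, comment_lengths))
total = reduce(lambda acc, n: acc + n, comment_lengths, 0)

# The same ideas expressed with a list comprehension and a generator expression.
squares = [n * n for n in comment_lengths if n < 20]
first_two_even = list(islice((n for n in comment_lengths if n % 2 == 0), 2))

print(long_ones, doubled, total, squares, first_two_even)
```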
Rather than having all of its functionality built into its core, Python was
designed to be highly extensible. This compact modularity has made it
particularly popular as a means of adding programmable interfaces to existing
applications. Van Rossum's vision of a small core language with a large
standard library and easily extensible interpreter stemmed from his frustrations
with ABC, which espoused the opposite approach.
MYSQL
MySQL Server is a powerful database management system with which the user can
create applications that require little or no programming. It is supported by
GUI tools such as phpMyAdmin, which can be used to develop richer and more
developed applications. There are quite a few reasons to use MySQL, the first
being that it is a feature-rich program that can handle any database-related
task you have. You can create places to store your data, build tools that make
it easy to read and modify your database contents, and ask questions of your
data. MySQL is a relational database, that is, a database that stores
information about related objects. In MySQL, a database means a collection of
tables that hold data. It collectively stores all the other related objects,
such as queries, forms, and reports, that are used to implement functions
effectively.
The MySQL database can act as a back-end database with PHP as the front end,
and it supports the user with its powerful database management functions. A
beginner can create his or her own database with just a few mouse clicks.
Another good reason to use MySQL as a back-end tool is that it is a component
of overwhelmingly popular open-source software.
MySQL is written in C and C++. Its SQL parser is written in yacc, but it uses a
home-brewed lexical analyzer.[15] MySQL works on many system platforms,
including AIX, BSDi, FreeBSD, HP-UX, eComStation, i5/OS, IRIX, Linux,
macOS, Microsoft Windows, NetBSD, Novell NetWare, OpenBSD,
OpenSolaris, OS/2 Warp, QNX, Oracle Solaris, Symbian, SunOS, SCO
OpenServer, SCO UnixWare, Sanos and Tru64. A port of MySQL to OpenVMS
also exists.
MySQL enables data to be stored and accessed across multiple storage engines,
including InnoDB, CSV, and NDB. MySQL is also capable of replicating data
and partitioning tables for better performance and durability. MySQL users
aren't required to learn new commands; they can access their data using
standard SQL commands.
The RDBMS supports large databases with millions of records and supports many
data types, including signed or unsigned integers 1, 2, 3, 4, and 8 bytes long;
FLOAT; DOUBLE; CHAR; VARCHAR; BINARY; VARBINARY; TEXT;
BLOB; DATE; TIME; DATETIME; TIMESTAMP; YEAR; SET; ENUM; and
OpenGIS spatial types. Fixed- and variable-length string types are also
supported.
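The standard SQL commands mentioned above can be tried out from Python. The
sketch below uses the built-in sqlite3 module as a stand-in for a MySQL server,
purely so that it runs without a database installation; the comments table and
its columns are hypothetical and are not the schema of this project.

```python
import sqlite3

# sqlite3 is used here only because it ships with Python; the SQL statements
# themselves are standard and would look the same against a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " comment_id VARCHAR(64) PRIMARY KEY,"
    " author VARCHAR(255),"
    " content TEXT,"
    " is_spam INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        ("id-001", "alice", "Great song!", 0),
        ("id-002", "bob", "Subscribe to my channel", 1),
    ],
)
spam_rows = conn.execute(
    "SELECT author, content FROM comments WHERE is_spam = 1"
).fetchall()
print(spam_rows)
conn.close()
```

With a MySQL driver the connection call differs, and the parameter placeholder
is typically %s rather than ?.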
HYPER TEXT MARKUP LANGUAGE (HTML)
Following the rigors of SGML, Tim Berners-Lee (TBL) brought HTML to the world in
1990. Since then, many of us have found it to be easy to use but sometimes
quite limiting. These limiting factors are being addressed by the World Wide
Web Consortium (W3C) at MIT. But HTML had to start somewhere, and its success
argues that it didn't start out too badly.
HyperText is the method by which you move around on the web — by clicking
on special text called hyperlinks which bring you to the next page. The fact that
it is hyper just means it is not linear — i.e. you can go to any place on the
Internet whenever you want by clicking on links — there is no set order to do
things in. Markup is what HTML tags do to the text inside them. They mark it
as a certain type of text (italicised text, for example). HTML is a Language, as it
has code-words and syntax like any other language.
HTML consists of a series of short codes, called tags, typed into a text file by
the site author. The text is then saved as an HTML file and viewed through a
browser, like Internet Explorer or Netscape Navigator. This browser reads the
file and translates the text into a visible form, hopefully rendering the page as
the author had intended. Writing your own HTML entails using tags correctly to
create your vision. You can use anything from a rudimentary text-editor to a
powerful graphical editor to create HTML pages.
The tags are what separate normal text from HTML code. You might know
them as the words between the <angle-brackets>. They allow all the cool stuff
like images and tables and stuff, just by telling your browser what to render on
the page. Different tags will perform different functions. The tags themselves
don’t appear when you view your page through a browser, but their effects do.
The simplest tags do nothing more than apply formatting to some text.
Web browsers receive HTML documents from a web server or from local
storage and render the documents into multimedia web pages. HTML describes
the structure of a web page semantically and originally included cues for the
appearance of the document.
HTML elements are the building blocks of HTML pages. With HTML
constructs, images and other objects such as interactive forms may be embedded
into the rendered page. HTML provides a means to create structured documents
by denoting structural semantics for text such as headings, paragraphs, lists,
links, quotes and other items. HTML elements are delineated by tags, written
using angle brackets. Tags such as <img /> and <input /> directly introduce
content into the page. Other tags such as <p> surround and provide information
about document text and may include other tags as sub-elements. Browsers do
not display the HTML tags, but use them to interpret the content of the page.
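To see the split between tags and displayed content, the sketch below feeds a
small made-up page to Python's built-in html.parser and records the start tags
separately from the text a browser would show. TagCollector is a hypothetical
class name.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the start tags and the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

page = (
    "<h1>Spam report</h1>"
    "<p>See the <a href='/stats'>statistics</a>.</p>"
    "<img src='chart.png' />"
)
collector = TagCollector()
collector.feed(page)
collector.close()
print(collector.tags)
print(collector.text)
```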
After the HTML and HTML+ drafts expired in early 1994, the IETF created an
HTML Working Group, which in 1995 completed "HTML 2.0", the first HTML
specification intended to be treated as a standard against which future
implementations should be based.
As making websites became more popular and needs increased, many other
supporting languages were created to allow new things to happen, and HTML
itself is modified every few years to make way for improvements. Cascading
Style Sheets are used to control how your pages are presented and to make pages
more accessible. Basic special effects and interaction are provided by
JavaScript, which adds a lot of power to basic HTML. Most of this advanced
material is for later down the road, but when all of these technologies are
used together, you have a lot of power at your disposal.
CSS
Cascading Style Sheets (CSS) is a style sheet language used for describing the
presentation of a document written in a markup language like HTML. CSS is a
cornerstone technology of the World Wide Web, alongside HTML and
JavaScript. CSS is designed to enable the separation of presentation and
content, including layout, colors, and fonts. This separation can improve content
accessibility, provide more flexibility and control in the specification of
presentation characteristics, enable multiple web pages to share formatting by
specifying the relevant CSS in a separate .css file, and reduce complexity and
repetition in the structural content.
Separation of formatting and content also makes it feasible to present the same
markup page in different styles for different rendering methods, such as on-
screen, in print, by voice (via speech-based browser or screen reader), and on
Braille-based tactile devices. CSS also has rules for alternate formatting if the
content is accessed on a mobile device. The name cascading comes from the
specified priority scheme to determine which style rule applies if more than one
rule matches a particular element. This cascading priority scheme is predictable.
The CSS specifications are maintained by the World Wide Web Consortium
(W3C). Internet media type (MIME type) text/css is registered for use with CSS
by RFC 2318 (March 1998). The W3C operates a free CSS validation service
for CSS documents. In addition to HTML, other markup languages support the
use of CSS including XHTML, plain XML, SVG, and XUL.
CSS has a simple syntax and uses a number of English keywords to specify the
names of various style properties. A style sheet consists of a list of rules. Each
rule or rule-set consists of one or more selectors, and a declaration block.
Before CSS, nearly all presentational attributes of HTML documents were
contained within the HTML markup. All font colors, background styles,
element alignments, borders and sizes had to be explicitly described, often
repeatedly, within the HTML. CSS lets authors move much of that information
to another file, the style sheet, resulting in considerably simpler HTML.
CSS stands for "Cascading Style Sheets." Cascading style sheets are used to format
the layout of Web pages. They can be used to define text styles, table sizes, and
other aspects of Web pages that previously could only be defined in a page's
HTML.
CSS helps Web developers create a uniform look across several pages of a Web
site. Instead of defining the style of each table and each block of text within a
page's HTML, commonly used styles need to be defined only once in a CSS
document. Once the style is defined in a cascading style sheet, it can be used by
any page that references the CSS file. Plus, CSS makes it easy to change styles
across several pages at once. For example, a Web developer may want to
increase the default text size from 10pt to 12pt for fifty pages of a Web site. If
the pages all reference the same style sheet, the text size only needs to be
changed on the style sheet and all the pages will show the larger text.
While CSS is great for creating text styles, it is helpful for formatting other
aspects of Web page layout as well. For example, CSS can be used to define the
cell padding of table cells, the style, thickness, and color of a table's border, and
the padding around images or other objects. CSS gives Web developers more
exact control over how Web pages will look than HTML does. This is why most
Web pages today incorporate cascading style sheets.
CSS is created and maintained through a group of people within the W3C called
the CSS Working Group. The CSS Working Group creates documents called
specifications. When a specification has been discussed and officially ratified
by the W3C members, it becomes a recommendation. These ratified
specifications are called recommendations because the W3C has no control over
the actual implementation of the language. Independent companies and
organizations create that software.
JAVASCRIPT
Client-side JavaScript is the most common form of the language. The script
should be included in or referenced by an HTML document for the code to be
interpreted by the browser. This means that a web page need not be static HTML,
but can include programs that interact with the user, control the browser, and
dynamically create HTML content. The JavaScript client-side mechanism
provides many advantages over traditional CGI server-side scripts. For
example, you might use JavaScript to check if the user has entered a valid e-
mail address in a form field. The JavaScript code is executed when the user
submits the form, and only if all the entries are valid, they would be submitted
to the Web Server. JavaScript can be used to trap user-initiated events such as
button clicks, link navigation, and other actions that the user initiates explicitly
or implicitly.
JavaScript can be implemented using JavaScript statements that are placed
within the <script>... </script> HTML tags in a web page.
You can place the <script> tags, containing your JavaScript, anywhere within
your web page, but it is normally recommended that you should keep it within
the <head> tags.
The <script> tag alerts the browser program to start interpreting all the text
between these tags as a script.
All modern browsers come with built-in support for JavaScript, although you may
occasionally need to enable or disable this support manually in browsers such
as Internet Explorer, Firefox, Chrome, and Opera.
Older releases of Flask required Python 2.6 or higher, and because many Flask
extensions did not properly support Python 3 at the time, installing Flask on
Python 2.7 used to be recommended. Current Flask releases support only Python
3. virtualenv is a virtual Python environment builder. It helps a user create
multiple Python environments side by side, and thereby avoids compatibility
issues between different versions of libraries. Installing virtualenv
system-wide with pip needs administrator privileges: add sudo before pip on
Linux/Mac OS, or log in as Administrator on Windows. On Ubuntu, virtualenv may
be installed using its package manager.
Importing the flask module in the project is mandatory. An object of the Flask
class is our WSGI application. The Flask constructor takes the name of the
current module (__name__) as its argument. The route() function of the Flask
class is a decorator, which tells the application which URL should call the
associated function. Its rule parameter represents the URL binding with the
function, and options is a list of parameters to be forwarded to the underlying
Rule object. Finally, the run() method of the Flask class runs the application
on the local development server.
A Flask application is started by calling the run() method. However, while the
application is under development, it should be restarted manually for each
change in the code. To avoid this inconvenience, enable debug support. The
server will then reload itself if the code changes. It will also provide a useful
debugger to track the errors if any, in the application. The Debug mode is
enabled by setting the debug property of the application object to True before
running or passing the debug parameter to the run() method.
Modern web frameworks use the routing technique to help a user remember
application URLs. It is useful to be able to access the desired page directly
without having to navigate from the home page. The route() decorator in Flask
is used to bind a URL to a function. For example, if the hello_world() function
is bound to the '/hello' rule and a user visits http://localhost:5000/hello,
the output of hello_world() is rendered in the browser. The add_url_rule()
function of an application object is also available to bind a URL to a
function, and can be used instead of route(). It is possible to build a URL
dynamically by adding variable parts to the rule parameter. A variable part is
marked as <variable-name> and is passed as a keyword argument to the function
with which the rule is associated. For example, if the rule parameter of the
route() decorator contains a <name> variable part attached to the URL '/hello',
then entering http://localhost:5000/hello/TutorialsPoint in the browser
supplies 'TutorialsPoint' to the hello() function as its argument.
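The routing rules described above can be sketched as a small Flask
application. The hello_world() and hello() names follow the text; the about()
function, the '/about' rule, and the returned strings are hypothetical.

```python
from flask import Flask
from markupsafe import escape

app = Flask(__name__)  # the Flask object is the WSGI application

@app.route("/hello")
def hello_world():
    return "Hello World"

@app.route("/hello/<name>")
def hello(name):
    # The <name> part of the URL arrives as a keyword argument.
    return f"Hello {escape(name)}!"

def about():
    return "Spam detection demo"

# Equivalent to decorating about() with @app.route("/about").
app.add_url_rule("/about", "about", about)

# To serve on http://localhost:5000 with auto-reload and the debugger:
#     app.run(debug=True)
```

The run() call is left as a comment so that importing the module does not
start a server. The routes can also be exercised without a browser through
app.test_client().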
An advantage of Flask is that the framework is light, so the risk of
encountering Flask security bugs is minimal. A drawback is that it takes some
effort on the part of the programmer to extend the list of dependencies through
plugins. A great thing about Flask is its template engine. Templates allow a
basic layout to be configured for web pages while marking which elements are
subject to change. You can therefore define a template once and keep it the
same across all the pages of a website. With the aid of a template engine, you
save a lot of time when setting up your application, and also when it comes to
updates or maintenance. Overall, Flask is easy to learn and manage as a
scalable tool. It allows any type of approach or programming technique, as it
places no restrictions on the app architecture or data abstraction layers. You
can even run it on embedded systems like a Raspberry Pi. Your web app can be
loaded on any device, including a mobile phone, a desktop PC, or even a TV.
Besides, it benefits from a community that offers support and suggested
solutions to a multitude of problems that programmers might face when using
Flask in Python. The core benefit of Flask is that the programmer controls
everything and gains a deeper understanding of how the internal mechanics of
frameworks function.
Werkzeug
Jinja
A framework "is a code library that makes a developer's life easier when
building reliable, scalable, and maintainable web applications" by providing
reusable code or extensions for common operations. There are a number of
frameworks for Python, including Flask, Tornado, Pyramid, and Django. Flask is
a Python web framework for building web applications. It was developed by Armin
Ronacher. Flask is more explicit than Django and is also easier to learn,
because it needs less base code to implement a simple web application. A web
application framework, or web framework, is a collection of modules and
libraries that helps the developer write applications without writing
low-level code for concerns such as protocols and thread management. Flask is
based on the Werkzeug WSGI (Web Server Gateway Interface) toolkit and the
Jinja2 template engine.
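To show what the WSGI interface underneath Flask looks like, here is a bare
WSGI application written without any framework. It is only an illustration of
the interface; the application and fake_start_response names and the response
text are hypothetical.

```python
def application(environ, start_response):
    """A bare WSGI app: receive the request environ, return the body bytes."""
    path = environ.get("PATH_INFO", "/")
    body = f"You requested {path}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# Call the app directly, the way a WSGI server would, with a fake request.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status

response = application({"PATH_INFO": "/hello"}, fake_start_response)
print(captured["status"], b"".join(response).decode("utf-8"))
```

The same callable could be served with the standard library's
wsgiref.simple_server.make_server. A framework such as Flask builds routing,
request objects, and templating on top of this interface.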
Why Flask?
easy to use.
built in development server and debugger
integrated unit testing support
RESTful request dispatching
uses Jinja2 templating
Database:
A database is simply a collection of useful data, just like a phone book. A MySQL
database includes such objects as tables, queries, forms, and more.
Tables:
In MySQL, tables are collections of similar data. Tables can be organized
differently and contain mostly different information, but they should all be in
the same database file. For instance, we may have a database file called video
store, containing tables named members, tapes, reservations, and so on. These
tables are stored in the same database file because they are often used
together to create reports or to help fill out on-screen forms.
Relational database:
MySQL is a relational database. Relational database tools like Access can help
us manage information in three important ways.
Reduce redundancy
Facilitate the sharing of information
Keep data accurate.
Fields
MySQL uses key fields and indexing to help speed up many database operations.
We can tell MySQL which fields should be key fields, or MySQL can assign them
automatically.
Controls and objects:
Queries are Access objects that let us display, print, and use our data.
Controls can be things like field labels that we drag around when designing
reports, or they can be pictures, titles for reports, or boxes containing the
results of calculations.
Forms:
Forms are on-screen arrangements that make it easy to enter and read data. We
can also print the forms if we want to. We can design forms ourselves or let
the Access AutoForm feature create them.
Reports:
Reports are paper copies of dynasets. We can also print reports to disk if we
like. Access helps us to create the reports, and there are even wizards for
complex printouts.
Properties:
- Logical design
- Physical design
Logical design reviews the present physical system, prepares input and output
specifications, and makes edit, security, and control specifications.
Physical design maps out the details of the physical system, plans the system
implementation, and devises a test and implementation plan.
DESIGN PROCESS
INPUT DESIGN
All the data entry screens are interactive in nature, so that the user can
directly enter data according to the prompted messages. The users are also
provided with the option of selecting an appropriate input from a list of
values. This reduces the number of errors that are otherwise likely to arise if
the values were entered by the users themselves.
Input design is one of the most important phases of system design. It is the
process in which the inputs received by the system are planned and designed so
as to get the necessary information from the user while eliminating information
that is not required. The aim of input design is to ensure the maximum possible
level of accuracy and to ensure that the input is accessible to and understood
by the user. Input design is a part of the overall system design that requires
very careful attention. If the data going into the system is incorrect, then
the processing and output will magnify the errors.
The first step is to draw a data flow diagram (DFD). The DFD was first
developed by Larry Constantine as a way of expressing system requirements in
graphical form.
A DFD, also known as a "bubble chart", has the purpose of clarifying system
requirements and identifying major transformations that will become programs in
system design. It is therefore the starting point of the design phase, which
functionally decomposes the requirements specifications down to the lowest
level of detail. A DFD consists of a series of bubbles joined by the data flows
in the system.
The purpose of data flow diagrams is to provide a semantic bridge between
users and systems developers. The diagram elements are:
External Entity
Process
A data flow shows the flow of information from its source to its destination. A
data flow is represented by a line, with arrowheads showing the direction of
flow. Information always flows to or from a process and may be written, verbal
or electronic. Each data flow may be referenced by the processes or data stores
at its head and tail, or by a description of its contents.
Data Store
Resource Flow
A resource flow shows the flow of any physical material from its source to its
destination. For this reason they are sometimes referred to as physical flows.
The physical material in question should be given a meaningful name. Resource
flows are usually restricted to early, high-level diagrams and are used when a
description of the physical flow of materials is considered to be important to
help the analysis.
OUTPUT DESIGN
The output of the system is presented either on screen or as hard copies.
Output design aims at communicating the results of processing to the users. The
reports are generated to suit the needs of the users and have to be generated
at appropriate levels. In our project, outputs are generated by ASP as HTML
pages. As this is a web application, the output is designed to be very
user-friendly and will be shown on screen most of the time.
CODE DESIGN
The main purpose of code design is to simplify the coding and to achieve better
performance and error-free quality. The code is written so that the internal
procedures are meaningful, and a validation manager is displayed for each
column. Variables are named so that someone other than the person who developed
the packages can understand their purpose.
To reduce the server load, the project is designed so that most field
validation is done on the client side, which is more effective.
DATABASE DESIGN
Database design involves the creation of tables, which are represented in the
physical database as stored files, each with its own existence. Each table
consists of rows and columns: each row can be viewed as a record of related
information, and each column as a field holding data of the same type. Tables
are also designed so that some columns can hold a null value.
The database of the project is designed so that values are stored without
redundancy and in normalized form.
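As an illustration of these points, the sketch below builds a small normalized
schema with Python's built-in sqlite3 module. The report does not give the
actual schema, so the table and column names here are assumptions, loosely
based on the AUTHOR, DATE, CONTENT and CLASS columns of the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE videos (
    video_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL
);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    video_id   TEXT NOT NULL REFERENCES videos(video_id),
    author     TEXT NOT NULL,
    posted_at  TEXT,            -- nullable: the date may be missing
    content    TEXT NOT NULL,
    is_spam    INTEGER          -- nullable: not yet classified
);
""")

# The video title is stored once and referenced by key, not repeated per comment.
conn.execute("INSERT INTO videos VALUES (?, ?)", ("v1", "Sample video"))
conn.execute(
    "INSERT INTO comments (video_id, author, content) VALUES (?, ?, ?)",
    ("v1", "alice", "Nice song"),
)

row = conn.execute(
    "SELECT v.title, c.author, c.posted_at, c.is_spam "
    "FROM comments c JOIN videos v USING (video_id)"
).fetchone()
print(row)  # ('Sample video', 'alice', None, None)
```

The two columns left out of the INSERT come back as None, which is how the
nullable fields appear to the application.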
DEVELOPMENT APPROACH
The importance of the new system is that it is user-friendly and offers a
better interface to the users working on it. It can overcome the problems of
the manual system and its security problems.
Testing is the process of exercising software with the intent of finding and
ultimately correcting errors. This fundamental philosophy does not change for
web applications. However, because web-based systems and applications reside
on a network and interoperate with many different operating systems, browsers,
hardware platforms and communication protocols, searching for errors is a
significant challenge for web applications.
Testing issues:
The testing phase is the development phase that validates the code against the
functional specifications. Testing is vital to the achievement of the system
goals. The objective of testing is to discover errors. To fulfil this
objective, a series of test steps (unit test, integration test, validation
test and system test) were planned and executed.
Unit testing
Here each program unit is tested individually so that any error in a unit is
debugged. Sample data are given for unit testing, and the unit test results
are recorded for further reference. During unit testing, the functions of the
program unit, its validation and its limitations are tested.
Unit testing tests the changes made in an existing or new program. This test
is carried out during programming, and each module was found to be working
satisfactorily. For example, in the registration form, after entering all the
fields we click the submit button. When the submit button is clicked, all the
data in the form are validated. Only after validation are the entries added to
the database.
The following tests are carried out:
1. Functional test
2. Performance test
3. Stress test
4. Structure test
Functional tests involve exercising the code with nominal input values for
which the expected results are known, as well as with boundary values and
special values.
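A functional test of this kind can be written with Python's unittest module.
The sketch below is illustrative only: validate_registration, its field names
and the length limits are hypothetical stand-ins for the registration form
validation described above.

```python
import unittest

MAX_NAME_LEN = 30   # assumed upper limit for the user name
MIN_PASS_LEN = 8    # assumed lower limit for the password


def validate_registration(form):
    """Return a list of error messages; an empty list means the form is valid."""
    errors = []
    name = form.get("username", "").strip()
    password = form.get("password", "")
    if not 1 <= len(name) <= MAX_NAME_LEN:
        errors.append("username length")
    if len(password) < MIN_PASS_LEN:
        errors.append("password length")
    return errors


class FunctionalTest(unittest.TestCase):
    def test_nominal_values(self):
        form = {"username": "asha", "password": "s3cretpass"}
        self.assertEqual(validate_registration(form), [])

    def test_boundary_values(self):
        at_limit = {"username": "a" * MAX_NAME_LEN, "password": "x" * MIN_PASS_LEN}
        over_limit = {"username": "a" * (MAX_NAME_LEN + 1), "password": "x" * MIN_PASS_LEN}
        self.assertEqual(validate_registration(at_limit), [])
        self.assertEqual(validate_registration(over_limit), ["username length"])

    def test_special_values(self):
        # Empty form and whitespace-only user name.
        self.assertEqual(validate_registration({}),
                         ["username length", "password length"])
        blank = {"username": "   ", "password": "x" * MIN_PASS_LEN}
        self.assertEqual(validate_registration(blank), ["username length"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FunctionalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method covers one of the three input classes named above: nominal,
boundary and special values.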
VALIDATION TESTING
OUTPUT TESTING
The output generated by the system under consideration is tested by asking the
users about the format they require. It can be checked in two ways: on screen
and in printed format. The output format on the screen was found to be
correct, as it matches the format designed in the system test.
SYSTEM TESTING
In system testing the whole system is tested: the interfaces between modules
and program units are tested and the results are recorded. This testing is
done with sample data. Security and the communication between interfaces are
also tested.
1. Integrated testing
2. Acceptance testing
Integrated testing
The objective is to take unit-tested modules and build the program structure
that has been dictated by the design.
Acceptance testing
Acceptance testing is the final stage and is carried out by the user: the
various possible data are entered and the results are tested.
Validation testing
Testing results
All tests should be traceable to customer requirements.
The focus of testing will shift progressively from programs.
Exhaustive testing is not possible.
To be more effective, testing should have a high probability of finding
errors.
QUALITY ASSURANCE
Reliability: The degree to which the system performs its intended functions
over time.
Maintainability: The ease with which program errors are located and corrected.
GENERIC RISKS
Risk identification is a systematic attempt to specify threats to the project
plan (estimates, schedule, resource loading, etc.). Identifying known and
predictable risks is the first step toward avoiding them when possible and
controlling them when necessary. There are two types of risk:
1. Generic risk
2. Product-specific risk
Generic risks are potential threats to every software project. Product-specific
risks can be identified only by those with a clear understanding of the
technology, the people and the environment specific to the project at hand. To
identify product-specific risks, the project plan and the software statement
of scope are examined, and an answer to the following question is developed:
What special characteristics of this product may threaten the project plan?
One method for identifying risks is to create a risk item checklist. The
checklist can be used for risk identification and focuses on some subset of
known and predictable risks in the following subcategories:
1. Product risk
2. Risk associated with the overall size of the software to be built or modified
3. Business impact
4. Risk associated with constraints imposed by management
5. Customer characteristics
Project Risks
Project risks identify potential budgetary, schedule, personnel (staffing and
organization), resource and customer requirement problems and their impact on
a software project.
Technical risks
Any system developed should be secured and protected against possible hazards.
Security measures are provided to prevent unauthorized access to the database
at various levels. Password protection and simple procedures for changing
passwords are provided to the users.
The user has to enter a user name and password, and only if they are validated
can the user participate in the auction. A new user must first register and
can then place an order.
On registration, users should provide proof of identity as JPG files (such as
a photocopy of a ration card or voter identity card). A multi-layer security
architecture comprising firewalls, filtering routers, encryption and digital
certification must be in place in this project so that order details are
protected from unauthorized access in real time.
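Password protection is usually implemented by storing a salted hash rather
than the password itself. The report does not say how passwords are stored, so
the following is only a sketch using the standard library; the function names
and the iteration count are assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for the deployment hardware


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected):
    """Check a login attempt against the stored salt and digest."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison


# At registration: store the salt and digest, never the plain password.
salt, stored = hash_password("correct horse")

# At login: recompute with the stored salt and compare.
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))    # False
```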
SYSTEM IMPLEMENTATION
Implementation is the stage in the project where the theoretical design is
turned into a working system. The most crucial part of this stage is achieving
a successful new system and giving the users confidence that the new system
will work efficiently and effectively. The stage consists of:
IMPLEMENTATION PROCEDURES
The implementation phase is less creative than system design. A system design
may be dropped at any time prior to implementation, although it becomes more
difficult when it goes to the design phase. The final report of the
implementation phase includes procedural flowcharts, record layouts, and a
workable plan for implementing the candidate system design into an operational
design.
USER TRAINING
It is designed to prepare the users for testing and converting the system.
There are several ways to train the users:
1) User manual
2) Help screens
3) Training demonstrations
1) User manual:
A summary of the important functions of the system and the software can be
provided as a document to the user.
1) Documentation tools:
Document production and desktop publishing tools support nearly every aspect
of software development. Most software development organizations spend a
substantial amount of time developing documents, and in many cases the
documentation process itself is quite inefficient. It is not unusual for a
large share of the software development effort to go to documentation. For
this reason, documentation tools provide an important opportunity to improve
productivity.
2) Document restructuring:
Creating documentation is far too time-consuming. If the system works, we will
live with what we have. In some cases this is the correct approach: it is not
possible to recreate documentation for hundreds of computer programs.
If the system is business-critical, it must be fully redocumented. Even in
this case, an intelligent approach is to pare documentation to an essential
minimum.
SYSTEM MAINTENANCE
1. Perfective maintenance
2. Preventive maintenance
Preventive maintenance:
Changes are made to the system to avoid future problems. Any such changes can
be made in the future, and our project can accommodate them.
CONCLUSION
YouTube is a website with social networking features and one of the largest
platforms for publishing video content. This project used four machine
learning models and tested their accuracy with different test-size
proportions. Among all the models, the Random classifier gave good accuracy,
95% on the standard datasets. Unlike other existing projects, this project has
the advantage of taking a YouTube video URL and classifying its comments as
spam or ham in real time. When a video has a very large number of comments,
processing takes more time and the machine may sometimes fail to yield
results.
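The comparison across test-size proportions can be sketched as follows. This
is an illustration under stated assumptions, not the project's evaluation
code: the twelve comments are toy data standing in for the CONTENT and CLASS
columns, the three proportions are arbitrary, and only the Multinomial Naive
Bayes model is shown. As in the sample code, the vocabulary is built on the
whole corpus before splitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the dataset (1 = spam, 0 = ham).
spam = ["check out my channel", "subscribe to my channel please",
        "free gift cards visit my site", "click this link to win money",
        "watch my new video and subscribe", "visit my site for free followers"]
ham = ["i love this song", "great video and great beat",
       "this song never gets old", "the chorus is so catchy",
       "best music video of the year", "still listening to this every day"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Bag-of-words features for the whole corpus.
X = CountVectorizer().fit_transform(texts)

accuracy_by_test_size = {}
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=test_size, random_state=42, stratify=labels)
    clf = MultinomialNB().fit(X_train, y_train)
    accuracy_by_test_size[test_size] = clf.score(X_test, y_test)

for test_size, acc in accuracy_by_test_size.items():
    print(f"test_size={test_size:.1f}  accuracy={acc:.2f}")
```

With so few samples the printed accuracies are not meaningful; the point is
the loop structure, which can be reused with the real dataset and the other
models.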
FUTURE WORK
In the future, the model can be modified so that more accurate results are
obtained in less processing time, and the size of the datasets can be
increased for better results.
REFERENCE
[1] P. Chopade, J. Zhan, and M. Bikdash. Node attributes and edge structure for
large-scale big data network analytics and community detection. In International
[3] P. Cui, Z. Wang, and Z. Su. What videos are similar with you?:Learning a
common attributed representation for video recommendation. In ACM
International Conference on Multimedia (MM),pages 597–606, 2014.
[Figure: data flow diagram. Admin: data collection, data acquisition, data
pre-processing, create model file, prediction using ANN, with Dataset.csv as
the data store. User: view prediction.]
[Screenshots: dataset, data import, X data split, Y data split, algorithm
accuracy graph, output comment.]
Sample code
Model.py
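import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the five per-video comment datasets
df1 = pd.read_csv("Youtube01-Psy.csv")
df1.head()
df2 = pd.read_csv("Youtube02-KatyPerry.csv")
df3 = pd.read_csv("Youtube03-LMFAO.csv")
df4 = pd.read_csv("Youtube04-Eminem.csv")
df5 = pd.read_csv("Youtube05-Shakira.csv")
frames = [df1, df2, df3, df4, df5]

# Merge them into a single frame
df_merged = pd.concat(frames)
df_merged
df_merged.shape

# Merge again with one key per video so each source can be selected by name
keys = ["Psy", "KatyPerry", "LMFAO", "Eminem", "Shakira"]
df_with_keys = pd.concat(frames, keys=keys)
df_with_keys
df_with_keys.loc['Shakira']
df_with_keys.to_csv("YoutubeSpamMergeddata.csv")
df = df_with_keys

# Inspect the merged data
df.size
df.columns
df.dtypes
df.isnull().sum()  # missing values per column
df["DATE"]
df.AUTHOR

# Keep only the comment text and its label
df_data = df[["CONTENT", "CLASS"]]
df_data.columns
df_x = df_data['CONTENT']
df_y = df_data['CLASS']

# Bag-of-words example on the first few comments
cv = CountVectorizer()
ex = cv.fit_transform(df_x.head())
ex.toarray()
cv.get_feature_names_out()

# Vectorize the full corpus and split it (default 75/25 split, fixed seed)
corpus = df_x
cv = CountVectorizer()
X = cv.fit_transform(corpus)
cv.get_feature_names_out()
X_train, X_test, y_train, y_test = train_test_split(X, df_y, random_state=42)
X_train

# Train and score the Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
clf_acc = clf.score(X_test, y_test) * 100
print("Accuracy of Model", clf_acc, "%")

# Accuracy bar chart; only the model trained in this listing is plotted
data = {"MultinomialNB": clf_acc}
courses = list(data.keys())
values = list(data.values())
plt.bar(courses, values, width=0.4)
plt.xlabel("Algorithm")
plt.ylabel("Accuracy")
plt.title("Accuracy of Algorithms")
plt.show()
clf.predict(X_test)

# Sample Prediction
comment = ["Check this out"]  # example comment text (assumed)
vect = cv.transform(comment).toarray()
clf.predict(vect)
class_dict = {'ham': 0, 'spam': 1}
class_dict.values()
if clf.predict(vect)[0] == 1:
    print("Spam")
else:
    print("Ham")

# Sample Prediction 2
comment1 = ["Great song Friend"]  # example comment text (assumed)
vect = cv.transform(comment1).toarray()
clf.predict(vect)
if clf.predict(vect)[0] == 1:
    print("Spam")
else:
    print("Ham")

# Save the trained classifier and load it back
with open("YtbSpam_model.pkl", "wb") as naivebayesML:
    pickle.dump(clf, naivebayesML)
with open("YtbSpam_model.pkl", "rb") as ytb_model:
    new_model = pickle.load(ytb_model)
new_model

# Sample Prediction 3
comment2 = ["Hey Music Fans I really appreciate all of you,but see this song too"]
vect = cv.transform(comment2).toarray()
new_model.predict(vect)
if new_model.predict(vect)[0] == 1:
    print("Spam")
else:
    print("Ham")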
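The pickled classifier on its own cannot score raw text: the fitted
CountVectorizer is needed as well. One way to keep the two together, shown
here as a sketch rather than as the project's method, is a scikit-learn
pipeline that is pickled as a single object. The six training comments are toy
data, assumed for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (1 = spam, 0 = ham).
texts = ["check out my channel", "subscribe to my channel please",
         "click this link to win money",
         "i love this song", "great video", "this song never gets old"]
labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier are fitted and stored as one object.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Round trip through pickle, as would happen between training and serving.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly.
comment = ["subscribe to my channel"]
print("Spam" if restored.predict(comment)[0] == 1 else "Ham")  # Spam
```

Because the vocabulary travels with the model, the serving code does not have
to rebuild the vectorizer from the training data.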
App.py
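import pandas as pd
import pickle
from flask import Flask, render_template, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

app = Flask(__name__)


@app.route('/')
def home():
    return render_template('home.html')


@app.route('/predict', methods=['POST'])
def predict():
    # Rebuild the vectorizer and classifier from the merged dataset written by Model.py
    df = pd.read_csv("YoutubeSpamMergeddata.csv")
    df_data = df[["CONTENT", "CLASS"]]
    df_x = df_data['CONTENT']
    df_y = df_data.CLASS
    corpus = df_x
    cv = CountVectorizer()
    X = cv.fit_transform(corpus)
    X_train, X_test, y_train, y_test = train_test_split(X, df_y, random_state=42)
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
    # Alternative: load a previously saved model
    # ytb_model = open("naivebayes_spam_model.pkl","rb")
    # clf = joblib.load(ytb_model)
    if request.method == 'POST':
        comment = request.form['comment']
        data = [comment]
        vect = cv.transform(data).toarray()
        my_prediction = clf.predict(vect)
    # 'result.html' is an assumed template name for displaying the prediction
    return render_template('result.html', prediction=my_prediction)


if __name__ == '__main__':
    app.run(debug=False)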
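Training the model inside the request handler repeats the same work for every
comment. A possible alternative, sketched below under stated assumptions, is
to build (or load) the model once at start-up and only call predict per
request. The toy training data and the inline template are assumptions made to
keep the example self-contained; a real deployment would load a saved model
and use the project's own templates.

```python
from flask import Flask, render_template_string, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (1 = spam, 0 = ham), fitted once when the module loads.
texts = ["check out my channel", "subscribe to my channel please",
         "click this link to win money",
         "i love this song", "great video", "this song never gets old"]
labels = [1, 1, 1, 0, 0, 0]
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

app = Flask(__name__)

# Inline template standing in for a result page.
RESULT = "{{ 'Spam' if prediction == 1 else 'Ham' }}"


@app.route("/predict", methods=["POST"])
def predict():
    comment = request.form["comment"]
    prediction = int(model.predict([comment])[0])  # no retraining per request
    return render_template_string(RESULT, prediction=prediction)


# Exercise the route with Flask's built-in test client.
client = app.test_client()
response = client.post("/predict", data={"comment": "subscribe to my channel"})
print(response.get_data(as_text=True))  # Spam
```

The test client posts a form field named comment, which is the same field the
handler reads from request.form.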