
MODULE DESCRIPTION

PREPROCESSING:

For the Twitter asynchronous system, a database is created from a dataset of actual ratings.
The validity of the results depends on the dataset used, so database creation is an important
step. Some websites provide datasets that include users and tweets with a significant rating
history, which makes it possible to obtain a sufficient number of reliably predicted tweets for
recommendation to each user. The data was gathered using Twitter's publicly available API.
Twitter periodically updates its list of the top ten trending topics. There is no information on
how a topic is chosen to appear in this list or how often the list is updated; however, one can
request up to 1,500 tweets for a given trending topic.

Two processes ran to collect this data. One process requested the list of trending topics from
Twitter every 30 seconds and maintained a unique list. Whenever a new trending topic was
detected, the other process requested a list of related tweets from Twitter using its search
API. After the data was collected, the trending topics were manually annotated into the following
three categories:

1. News

2. Meme

3. Ongoing Event

Three annotators were used to annotate the trending topics. Each of them looked at the
tweets related to a trending topic to assign a suitable category.

TWEETS RATING PREDICTION

This module applies the proposed Greedy & Dynamic Blocking algorithm techniques of the Twitter
asynchronous system. The greedy algorithm is a live content-based approach that recommends
tweets similar to those the user preferred in the past. The dynamic greedy approach suggests
tweets that users with similar preferences have liked in the past. The system can combine both
the content-based and collaborative filtering approaches. While giving suggestions to each user,
the Twitter asynchronous system performs the following two tasks.

First, based on the available information, the ratings of unrated tweets are predicted using a
recommendation algorithm. We take a new approach to classifying Twitter trends by adding a
layer of trend selection and hashtag ranking for the best topic tweets. A variety of feature
ranking algorithms, such as TF-IDF and bag-of-words, are used to facilitate feature selection.
This surfaces the important features while reducing the feature space, making the
classification process more efficient. Four Greedy & Dynamic Blocking text classifiers (one per
class), backed by these feature ranking and feature selection techniques, are used to
categorize Twitter trends. Using the bag-of-words and TF-IDF rankings, our approach improves
average class precision over current methodologies by 33.14% and 28.67% respectively.
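
As an illustration of the feature ranking step, the TF-IDF weight of a term can be computed as
in the following minimal Java sketch. The in-memory token-list representation of documents is
an assumption for illustration, not the system's actual data structure:

import java.util.*;

public class TfIdf {
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the corpus size and
    // df(t) is the number of documents containing term t. Assumes doc is one of
    // the corpus documents, so df is never zero.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : new HashSet<>(doc)) {
            double tf = Collections.frequency(doc, term) / (double) doc.size();
            int df = 0;
            for (List<String> d : corpus) {
                if (d.contains(term)) df++;
            }
            scores.put(term, tf * Math.log(corpus.size() / (double) df));
        }
        return scores;
    }
}

Terms with the highest scores are kept as features and the rest are discarded, which is what
shrinks the feature space.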

Second, based on the predicted ratings, the system finds relevant tweets and recommends them
to the user.
GREEDY & DYNAMIC BLOCKING ALGORITHMS TWEET BASED
COLLABORATIVE FILTERING

This module uses the set of tweets the active user has rated, calculates the similarity
between these tweets and the target tweet, and then selects the N most similar tweets. The
corresponding similarities of those tweets are also computed, and the prediction is computed
from the most similar tweets. The information filtering module is responsible for the actual
retrieval and selection of movies from the movie database; filtering is done based on the
knowledge gathered from the learning module.

After passing the test of user knowledge, the standardized ratings provided by the user are
stored in the rating database. Based on the data in the rating database, a film is recommended
to user ui using the following steps. Assume M = total number of users, N = total number of
films, and n = total number of films not rated by the user.

1) For each film F ∈ n not rated by user ui, find the correlation with each of the other (N-1)
films.

2) Based on the correlation coefficient values, select the S films most closely correlated
with F. This forms a group of S films similar to F.

3) Find the correlation of all users with the current user ui based on the ratings given by
every user to those similar films. Based on the correlation coefficient values, select the X
users most closely correlated with user ui. This forms a group of X users similar to user ui.
The correlation coefficient computation used in these steps is sketched below.
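
A minimal sketch of the correlation coefficient (assumed here to be Pearson's r) over two
co-rated rating vectors; the plain-array representation is a simplification for illustration:

// Pearson correlation between two co-rated rating vectors x and y.
static double pearson(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i];  sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double num = sxy - sx * sy / n;
    double den = Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    return den == 0 ? 0 : num / den;   // guard against zero variance
}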

TWEET SIMILARITY COMPUTATION:

In this module, the similarity between two tweets a (the target tweet) and b is computed by
first finding the users who have rated both tweets. There are a number of different ways to
compute similarity. The proposed system uses the adjusted cosine similarity method, which is
more beneficial because the corresponding user average is subtracted from each co-rated pair.
The similarity between tweets a and b is given by this adjusted cosine measure, sketched below.
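
A minimal sketch of the adjusted cosine computation, assuming the co-rated ratings and the
user averages have already been gathered into plain arrays (a simplification; the real module
reads them from the rating database):

// Adjusted cosine similarity between tweets a and b over the users who rated both.
// ra[u], rb[u]: user u's ratings of a and b; avg[u]: user u's average rating.
static double adjustedCosine(double[] ra, double[] rb, double[] avg) {
    double num = 0, da = 0, db = 0;
    for (int u = 0; u < ra.length; u++) {
        double x = ra[u] - avg[u];   // subtract the user average from each co-rated pair
        double y = rb[u] - avg[u];
        num += x * y;
        da += x * x;
        db += y * y;
    }
    return (da == 0 || db == 0) ? 0 : num / (Math.sqrt(da) * Math.sqrt(db));
}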
PREDICTION COMPUTATION MODULE:

In this module, the weighted sum approach is used to obtain predictions. Weighted sum computes
the prediction of a target tweet for a user u by summing the ratings given by the user on the
tweets similar to the target tweet, weighted by similarity; the prediction of a tweet a for
user u is sketched below. In the content-based technique, the utility for user u of tweet i is
estimated from the utilities assigned by user u to the set of all tweets similar to tweet i.
Only the tweets with a high degree of similarity to the user's preferences get recommended.
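
A minimal sketch of the weighted sum, assuming sim[i] and rating[i] hold the similarity of the
i-th neighbour tweet to the target and the user's rating of it (both supplied by the similarity
module):

// Weighted-sum prediction of user u's rating for target tweet a.
static double predict(double[] sim, double[] rating) {
    double num = 0, den = 0;
    for (int i = 0; i < sim.length; i++) {
        num += sim[i] * rating[i];   // similarity-weighted rating
        den += Math.abs(sim[i]);     // normalise by total similarity
    }
    return den == 0 ? 0 : num / den;
}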

TRENDING TWEETS RESULT ANALYSIS MODULE:


In the movie database creation module, information related to users, movies, and ratings is
stored in different tables. Thus the system can retrieve the data properly from the database
and also obtain movie ratings explicitly from the users. In the tweet-based collaborative
filtering technique, the tweet similarity computation and prediction computation modules have
been implemented. Recommended lists are generated from the non-purchased movies of the
logged-in user, so we compute system-predicted ratings for all of that user's non-purchased
movies. To calculate the system-predicted rating of a target movie, we first obtain the 5 most
similar tweets and then use the weighted sum approach for rating prediction. On the 5-star
rating scale, the predicted value lies between 1 and 5. We use the Mean Absolute Error (MAE)
accuracy metric to evaluate the accuracy of the ratings predicted by this module, shown in the
graph.
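
MAE is simply the average absolute difference between predicted and actual ratings; a minimal
sketch:

// MAE = (1/n) * sum_i |predicted[i] - actual[i]|
static double mae(double[] predicted, double[] actual) {
    double sum = 0;
    for (int i = 0; i < predicted.length; i++) {
        sum += Math.abs(predicted[i] - actual[i]);
    }
    return sum / predicted.length;
}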
3. SYSTEM ENVIRONMENT

3.1 HARDWARE REQUIREMENTS:

 System : Pentium Core 2 Duo

 Hard Disk : 80 GB

 RAM : 1 GB DDR2 RAM

 Key Board : LG 104 Key keyboard

 Mouse : Logitech Optical mouse

 Monitor : 15 inch TFT Monitor

3.2 SOFTWARE REQUIREMENTS:

 Operating System : Windows 7

 Front end : JDK 1.7 / NetBeans 8.0

 Coding Language : Java


3.3 FEATURES OF SOFTWARE

The software requirement specification is created at the end of the analysis task. The function
and performance allocated to software as part of system engineering are developed by
establishing a complete information report: a functional representation, a representation of
system behavior, an indication of performance requirements and design constraints, and
appropriate validation criteria.

3.3.1 FEATURES OF JAVA

The Java platform has two components:

 The Java Virtual Machine (Java VM)


 The Java Application Programming Interface (Java API)

The Java API is a large collection of ready-made software components that provide many useful
capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
libraries (packages) of related components.

The following figure depicts a Java program, such as an application or applet, running on the
Java platform. As the figure shows, the Java API and the Virtual Machine insulate the Java
program from hardware dependencies.

As a platform-independent environment, Java can be a bit slower than native code. However,
smart compilers, well-tuned interpreters, and just-in-time bytecode compilers can bring Java's
performance close to that of native code without threatening portability.
FACTORY METHODS:

The InetAddress class has no visible constructors. To create an InetAddress object, you use one
of the available factory methods. Factory methods are merely a convention whereby static
methods in a class return an instance of that class. This is done in lieu of overloading a
constructor with various parameter lists, since having unique method names makes the results
much clearer.

Three commonly used InetAddress factory methods are:

1. static InetAddress getLocalHost( ) throws UnknownHostException

2. static InetAddress getByName(String hostName) throws UnknownHostException

3. static InetAddress[ ] getAllByName(String hostName) throws UnknownHostException

The getLocalHost( ) method simply returns the InetAddress object that represents the local
host. The getByName( ) method returns an InetAddress for a host name passed to it. If these
methods are unable to resolve the host name, they throw an UnknownHostException.

On the Internet, it is common for a single name to be used to represent several machines. In
the world of web servers, this is one way to provide some degree of scaling. The
getAllByName( ) factory method returns an array of InetAddress objects that represent all of
the addresses that a particular name resolves to. It will also throw an UnknownHostException if
it can't resolve the name to at least one address. Java 2, version 1.4 also includes the
factory method getByAddress( ), which takes an IP address and returns an InetAddress object.
Either an IPv4 or an IPv6 address can be used.
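
A short demonstration of the three factory methods (the host name www.example.com is only an
illustrative value):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class InetAddressDemo {
    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();   // the local host
        System.out.println(local);

        InetAddress single = InetAddress.getByName("www.example.com");
        System.out.println(single);

        // All addresses the name resolves to, e.g. for load-balanced servers.
        for (InetAddress a : InetAddress.getAllByName("www.example.com")) {
            System.out.println(a);
        }
    }
}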

INSTANCE METHODS:

The InetAddress class also has several other methods, which can be used on the objects returned
by the methods just discussed. Here are some of the most commonly used (a combined usage sketch
follows the list):

1. boolean equals(Object other) - Returns true if this object has the same Internet address as
other.

2. byte[ ] getAddress( ) - Returns a byte array that represents the object's Internet address
in network byte order.

3. String getHostAddress( ) - Returns a string that represents the host address associated with
the InetAddress object.

4. String getHostName( ) - Returns a string that represents the host name associated with the
InetAddress object.

5. boolean isMulticastAddress( ) - Returns true if this Internet address is a multicast
address; otherwise, it returns false.

6. String toString( ) - Returns a string that lists the host name and the IP address for
convenience.
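
Taken together, these instance methods can be exercised as in this short sketch (again using an
illustrative host name):

InetAddress addr = InetAddress.getByName("www.example.com");
System.out.println(addr.getHostName());         // host name
System.out.println(addr.getHostAddress());      // dotted IP string
System.out.println(addr.isMulticastAddress());  // false for a normal unicast address
byte[] raw = addr.getAddress();                 // address bytes in network byte order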
TCP/IP CLIENT SOCKETS:

TCP/IP sockets are used to implement reliable, bidirectional, persistent, point-to-point,
stream-based connections between hosts on the Internet. A socket can be used to connect Java's
I/O system to other programs that may reside either on the local machine or on any other
machine on the Internet.

There are two kinds of TCP sockets in Java: one for servers and one for clients. The
ServerSocket class is designed to be a "listener," which waits for clients to connect before
doing anything. The Socket class is designed to connect to server sockets and initiate protocol
exchanges.

The creation of a Socket object implicitly establishes a connection between the client and
server. There are no methods or constructors that explicitly expose the details of establishing
that connection. Here are two constructors used to create client sockets:

Socket(String hostName, int port) - Creates a socket connecting the local host to the named
host and port; can throw an UnknownHostException or an IOException.

Socket(InetAddress ipAddress, int port) - Creates a socket using a preexisting InetAddress
object and a port; can throw an IOException.

A socket can be examined at any time for the address and port information associated with it,
by use of the following methods:

 InetAddress getInetAddress( ) - Returns the InetAddress associated with the Socket object.
 int getPort( ) - Returns the remote port to which this Socket object is connected.
 int getLocalPort( ) - Returns the local port to which this Socket object is bound.
Once the Socket object has been created, it can also be examined to gain access to the input
and output streams associated with it. Each of these methods can throw an IOException if the
socket has been invalidated by a loss of connection.

InputStream getInputStream( ) - Returns the InputStream associated with the invoking socket.

OutputStream getOutputStream( ) - Returns the OutputStream associated with the invoking socket.
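
A minimal client sketch that opens a Socket, obtains its streams, writes a line, and reads the
reply (the host and port are illustrative, and a server must already be listening there):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class ClientDemo {
    public static void main(String[] args) throws Exception {
        Socket sock = new Socket("localhost", 9999);   // connect to the server
        PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream()));
        out.println("hello");                          // send one request line
        System.out.println(in.readLine());             // print the reply
        sock.close();
    }
}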

TCP/IP SERVER SOCKETS:

Java has a different socket class that must be used for creating server applications. The
ServerSocket class is used to create servers that listen for either local or remote client
programs to connect to them on published ports. ServerSockets are quite different from normal
Sockets.

When the user creates a ServerSocket, it registers itself with the system as having an interest
in client connections.

 ServerSocket(int port) - Creates a server socket on the specified port with a queue length of
50.
 ServerSocket(int port, int maxQueue) - Creates a server socket on the specified port with a
maximum queue length of maxQueue.
 ServerSocket(int port, int maxQueue, InetAddress localAddress) - Creates a server socket on
the specified port with a maximum queue length of maxQueue. On a multihomed host, localAddress
specifies the IP address to which this socket binds.
 ServerSocket has a method called accept( ), which is a blocking call that waits for a client
to initiate communication and then returns a normal Socket used for communicating with the
client (a minimal server sketch follows this list).
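
A matching minimal server sketch built around accept( ); port 9999 is illustrative, and the
server echoes one line back to a single client:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class ServerDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9999);   // register interest in port 9999
        Socket client = server.accept();                // block until a client connects
        BufferedReader in = new BufferedReader(
                new InputStreamReader(client.getInputStream()));
        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        out.println("echo: " + in.readLine());          // echo the client's line back
        client.close();
        server.close();
    }
}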

URL:

The Web is a loose collection of higher-level protocols and file formats, all unified in a web
browser. One of the most important aspects of the Web is that Tim Berners-Lee devised a
scalable way to locate all of the resources of the Net. The Uniform Resource Locator (URL) is
used to name anything and everything reliably.

The URL provides a reasonably intelligible form to uniquely identify or address information on
the Internet. URLs are ubiquitous; every browser uses them to identify information on the Web.
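
For example, the java.net.URL class decomposes a URL string into its components (the URL shown
is illustrative):

import java.net.URL;

public class UrlDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com:80/index.html");
        System.out.println(url.getProtocol());   // http
        System.out.println(url.getHost());       // www.example.com
        System.out.println(url.getPort());       // 80
        System.out.println(url.getFile());       // /index.html
    }
}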
SYSTEM DESIGN

INPUT DESIGN

Input design is the process of converting user-originated input to a computer-based format. The
design decisions for handling input specify how data are accepted for computer processing.
Input design is a part of overall system design that needs careful attention.

The collection of input data is considered the most expensive part of the system design. Since
the inputs have to be planned so as to get the relevant information, extreme care is taken to
obtain the pertinent information. If the data going into the system is incorrect, then the
processing and outputs will magnify these errors. The goal of designing input data is to make
data entry as easy, logical, and error-free as possible.

The following are the objectives of input design:

 To produce a cost effective method of input.

 To make the input forms understandable to the end users.

 To ensure the validation of data inputs.

The nature of input data is determined partially during logical system design. However, the
nature of inputs is made more explicit during the physical design. The impact of inputs on the
system is also determined.

Effort has been made to ensure that input data remains accurate from the stage at which it is
recorded and documented to the stage at which it is accepted by the computer. Validation
procedures are also present to detect errors in data input that are beyond the reach of control
procedures. Validation procedures are designed to check each record, data item, or field
against certain criteria.

To address this problem, we classify Twitter Trending Topics into 18 general categories such as
sports, politics, technology, etc. We experiment with two approaches for topic classification:
(i) the well-known bag-of-words approach for text classification and (ii) network-based
classification. In the text-based classification method, we construct word vectors from the
trending topic definition and tweets, and the commonly used tf-idf weights are used to classify
the topics with a Naive Bayes Multinomial classifier. In the network-based classification
method, we identify the top 5 most similar topics for a given topic based on the number of
common influential users. The categories of the similar topics and the number of common
influential users between the given topic and its similar topics are used to classify the given
topic using a C5.0 decision tree learner. Experiments on a database of 768 randomly selected
trending topics (over 18 classes) show that classification accuracies of up to 65% and 70% can
be achieved using text-based and network-based classification modeling respectively.

OUTPUT DESIGN

The output is designed in such a way that it is attractive, convenient, and informative. Forms
are designed in JAVA with various features, which make the console output more pleasing.

As the outputs are the most important sources of information to the users, better design should
improve the system's relationship with them and help in decision-making. Form design elaborates
the way output is presented and the layout available for capturing information.

In this project, the output is designed in the form of reports. The system is designed to
generate various user-friendly reports to help the business process. The following are the
different reports supported:
The website What the Trend provides a regularly updated list of the ten most popular topics on
Twitter, called "trending topics." A trending topic may be a breaking news story, or it may be
about a recently aired TV show. The website also allows thousands of users across the world to
define, in a few short sentences, why a term is interesting or important to people, which we
refer to as the "trend definition" in this paper. The Twitter API allows high-throughput, near
real-time access to various subsets of public Twitter data. We downloaded trending topics and
definitions every 30 minutes from What the Trend, and all tweets that contain a trending topic
from Twitter while the topic is trending. All the tweets containing a trending topic constitute
a document. For example, while the topic "superbowl" is trending, we keep downloading all
tweets that contain the word "superbowl" from Twitter and save them in a document called
"superbowl". If a tweet contains more than one trending topic, the tweet is saved in all
relevant documents. For example, if a tweet contains the two trending topics "superbowl" and
"NFL", the same tweet is saved into two documents, "superbowl" and "NFL". From the 23,000+
trending topics we have downloaded since February 2010, we randomly selected 768 topics as our
dataset.

UML Diagrams

Use Case Diagram

[Figure: Use case diagram. The User actor interacts with the Tweet Segment system through the
use cases Enter Tweets, Preprocess Tweets, Find Possible Trends, Compute Score, POS Tagging,
Rumor Elimination, and Segment Tweets.]
Class Diagram

[Figure: Class diagram showing two classes. User: attributes twt, seg (String); operations
getTweets(), findSegment(). Tweets Segmentation: attributes twt, seg (String), scr (double);
operations preprocess(), PossibleSegment(), findscr(), getPOS(), getNE(), segmentTwt().]
Activity Diagram

[Figure: Activity diagram. Flow: Enter Tweets → Preprocessing → Find Possible Trends → Compute
Score → Get POS Tag → Rumor Elimination → Segmented Tweets.]
Sequence Diagram

[Figure: Sequence diagram between User and Tweet Segment: 1) Enter Tweets, 2) Preprocess
Tweets, 3) Find Possible Segment, 4) Compute Score, 5) Get POS Tag, 6) Find Rumor Elimination,
7) Trends Classified Tweets.]
Collaboration Diagram

[Figure: Collaboration diagram between User and Tweet Segment: 1) Enter Tweets, 2) Preprocess
Tweets, 3) Find Rumors Elimination, 4) Compute Score, 5) Get POS Tag, 6) Find Named Entities,
7) Trends Tweets.]

SYSTEM TESTING AND IMPLEMENTATION

SYSTEM TESTING

System testing is the stage of implementation, which is aimed at ensuring that the system
works accurately and efficiently before live operation commences. Testing is vital to the success
of the system. System testing makes a logical assumption that if all the parts of the system are
correct, the goal will be successfully achieved. The candidate system is subject to a variety of
tests.

A series of tests are performed for the proposed system before the system is ready for
user acceptance testing.
The testing steps are:

 Unit testing
 Integration testing
 Validation testing
 Output testing
 User acceptance testing

UNIT TESTING

Unit testing focuses verification efforts on the smallest unit of software design: the module.
This is also known as "module testing." The modules are tested separately. This testing is
carried out during the programming stage itself. In this testing step, each module is found to
be working satisfactorily with regard to the expected output from the module.

INTEGRATION TESTING

Data can be lost across an interface; one module can have an adverse effect on others; and
sub-functions, when combined, may not produce the desired major functions. Integration testing
is a systematic technique for constructing the program structure while at the same time
conducting tests to uncover errors associated with the interfaces. The objective is to take
unit-tested modules, combine them, and test them as a whole. Here, correction is difficult
because the vast expanse of the entire program complicates the isolation of causes. In this
integration-testing step, all the errors encountered are corrected before the next testing
step.

VALIDATION TESTING

Verification testing runs the system in a simulated environment using simulated data. This
simulated test is sometimes called alpha testing. It primarily looks for errors and omissions
regarding end-user and design specifications that were specified in the earlier phases but not
fulfilled during construction.

Validation refers to the process of using software in a live environment in order to find
errors. The feedback from the validation phase generally produces changes in the software to
deal with errors and failures that are uncovered. Then a set of user sites is selected that
puts the system into use on a live basis. These are called beta tests.

The beta test sites use the system in day-to-day activities. They process live transactions and
produce normal system output. The system is live in every sense of the word, except that the
users are aware they are using a system that can fail. The transactions that are entered and
the persons using the system are real. Validation may continue for several months. During the
course of validating the system, failures may occur and the software will be changed. Continued
use may produce additional failures and the need for still more changes.

OUTPUT TESTING

After validation, the next step is output testing of the proposed system, since no system can
be useful if it does not produce the required output in the specified format. The output
generated and displayed by the system under consideration is tested by asking the users about
the format they require. The output format is thus considered in two ways: on screen and in
printed format.

USER ACCEPTANCE TESTING

User acceptance of a system is the key factor for the success of any system. The system under
consideration is tested for user acceptance by constantly keeping in touch with the prospective
system users at the time of development and making changes whenever required. This is done in
regard to the following points:

An acceptance test has the objective of selling the user on the validity and reliability of the
system. It verifies that the system's procedures operate to system specifications and that the
integrity of important data is maintained. Performance of an acceptance test is actually the
user's show. User motivation is very important for the successful performance of the system.
After that, a comprehensive test report is prepared. This report shows the system's tolerance,
performance range, error rate, and accuracy.
SYSTEM MAINTENANCE

The objective of this maintenance work is to make sure that the system works at all times
without any bugs. Provision must be made for environmental changes which may affect the
computer or software system; this is called the maintenance of the system. Nowadays there is
rapid change in the software world, and due to this rapid change the system should be capable
of adapting to these changes. In this project, processes can be added without affecting other
parts of the system.

Maintenance plays a vital role. The system is liable to accept any modification after its
implementation. This system has been designed to accommodate all new changes; doing so will not
affect the system's performance or its accuracy.

Maintenance is necessary to eliminate errors in the system during its working life and to
tune the system to any variations in its working environment. It has been seen that there are
always some errors found in the system that must be noted and corrected. It also means the
review of the system from time to time.

The review of the system is done for:

 Knowing the full capabilities of the system.

 Knowing the required changes or the additional requirements.

 Studying the performance.

TYPES OF MAINTENANCE:

 Corrective maintenance

 Adaptive maintenance

 Perfective maintenance

 Preventive maintenance
CORRECTIVE MAINTENANCE

Corrective maintenance covers changes made to a system to repair flaws in its design, coding,
or implementation. The design of the software will be changed. Corrective maintenance is
applied to correct the errors that occur during operation. If the user enters an invalid file
type while submitting information in a particular field, corrective maintenance displays an
error message to the user in order to rectify the error.

Maintenance is a major income source. Nevertheless, even today many organizations assign
maintenance to unsupervised beginners and less competent programmers.

The user's problems are often caused by the individuals who developed the product, not the
maintainer. The code itself may be badly written, and maintenance is despised by many software
developers. Unless good maintenance service is provided, the client will take future
development business elsewhere. Maintenance is the most important phase of software production,
as well as the most difficult and most thankless.

ADAPTIVE MAINTENANCE:

Adaptive maintenance means changes made to a system to evolve its functionality to meet
changing business needs or technologies. If there is any modification in the modules, the
software will adopt those modifications. If the user changes the server, the project will adapt
to those changes; the modified server performs the same work as the existing one.
PERFECTIVE MAINTENANCE:

Perfective maintenance means changes made to a system to add new features or improve
performance. Perfective maintenance is done to take measures to maintain the system's special
features. It means enhancing the performance or modifying the programs to respond to the users'
needs or changing needs. This proposed system can easily be extended with additional
functionality. In this project, if the user wants to improve performance further, this software
can be easily upgraded.

PREVENTIVE MAINTENANCE:

Preventive maintenance involves changes made to a system to reduce the chance of future system
failure. The possible occurrences of errors are forecast and prevented with suitable preventive
measures. If the user wants to improve the performance of any process, new features can be
added to the system for this project.
EXPERIMENTAL SETUP

[Figure: Twitter trending topic classification pipeline — data collection (trending topics and
tweets/retweets), labeling of topics into categories (e.g., Lady Gaga → Music, Burberry →
Fashion, iPad → Technology, Top Story → TV & Movies, Super Bowl → Sports, Tornado → Other
News), data modeling with training/testing/validation splits, and machine learning with data
mining and Greedy & Dynamic Blocking optimization.]

For our experiments, we used popular tools such as WEKA and SPSS Modeler. WEKA is a widely used
machine learning tool that supports various modeling algorithms for data preprocessing,
clustering, classification, regression, and feature selection. SPSS Modeler is popular data
mining software with a unique graphical user interface and high prediction accuracy. It is
widely used in business marketing, resource planning, medical research, law enforcement, and
national security. In all experiments, 10-fold cross-validation was used to evaluate
classification accuracy. The ZeroR classifier, which simply predicts the majority class, was
used to obtain a baseline accuracy. A minimal sketch of this evaluation procedure follows.
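
The following sketch uses the WEKA Java API; the ARFF file name is an assumption, and ZeroR
supplies the majority-class baseline:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TopicClassificationEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("trending_topics.arff");  // assumed dataset file
        data.setClassIndex(data.numAttributes() - 1);               // class = last attribute

        // 10-fold cross-validation with Naive Bayes Multinomial.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayesMultinomial(), data, 10, new Random(1));
        System.out.println("NBM accuracy: " + eval.pctCorrect() + " %");

        // ZeroR baseline: always predicts the majority class.
        Evaluation base = new Evaluation(data);
        base.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.println("Baseline accuracy: " + base.pctCorrect() + " %");
    }
}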

Text-based classification

Using the Naive Bayes Multinomial (NBM), Naive Bayes (NB), and linear-kernel Support Vector
Machine (SVM-L) classifiers, we find that classification accuracy is a function of the number
of tweets and frequent terms. Fig. 6 presents a comparison of classification accuracy using
different classifiers for text-based classification. TD represents the trend definition.
Model(x,y) represents the classifier model used to classify topics, with x tweets per topic and
the y most frequent terms. For example, NB(100,1000) represents the accuracy of the NB
classifier with 100 tweets per topic and the 1000 most frequent terms (from the text-based
modeling result).

Network-based classification

This presents a comparison of classification accuracy using different classifiers for
network-based classification. Clearly, the C5.0 decision tree classifier gives the best
classification accuracy (70.96%), followed by k-Nearest Neighbor (63.28%), Support Vector
Machine (54.349%), and Logistic Regression (53.457%). The C5.0 decision tree classifier
achieves 3.68 times higher accuracy than the ZeroR baseline classifier. The 70.96% accuracy is
very good considering that we categorize topics into 18 classes. To the best of our knowledge,
the number of classes used in our experiment is much larger than the number of classes used in
any earlier research work (two-class classification is the most common).
CONCLUSION

In the last few decades, Twitter asynchronous systems have been used, among the many available
solutions, to mitigate the information and cognitive overload problem by suggesting related and
relevant tweets to users. In this regard, numerous advances have been made toward high-quality,
fine-tuned Twitter asynchronous systems. Nevertheless, designers face several prominent issues
and challenges.

In this work, we have touched on a variety of topics such as Natural Language Processing, text
classification, feature selection, and feature ranking. Each of these topics was used to
leverage the massive information flowing through Twitter. Understanding Twitter was as
important as knowing the topics in question. The results of the previous experiments led us to
the conclusion that feature selection is an absolute necessity in a text classification system.
This was demonstrated when we compared our results with a system that uses the exact same
dataset without feature selection: we were able to achieve 33.14% and 28.67% improvements with
the bag-of-words and TF-IDF scoring techniques respectively.

We also mentioned some opportunities that our work provides in the fields of news media,
marketing, and business in general. We hope that our work can provide a good foundation for the
future of text classification in social media and for the opportunities that come with it.
APPENDICES
SAMPLE SCREENS:
ENTER KEYWORD TO EXTRACT RAW TWEETS:
EXTRACTED TWEETS:
PRE-PROCESSING TWEETS:
TRENDS TAGS AND RETWEETS EXTRACTION
CALCULATE HASHTAG INFLUENCES:
RUMOURS ELIMINATION:
TRENDS TWEETS AND TOPICS AFTER BLOCKING RUMOURS:
SAMPLE CODINGS:

MAIN.JAVA
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package drimux;

import java.util.ArrayList;
import java.util.Collections;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

/**
*
* @author Elcot
*/
public class DRIMUX {

/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// TODO code application logic here
System.out.println("\t\t\t****************");
System.out.println("\t\t\t DRIMUX");
System.out.println("\t\t\t****************");

/*********************************************************************************/
System.out.println("===================================");
System.out.println("\t1) Extract Tweets");
System.out.println("===================================");

ArrayList<String> alltweets=new ArrayList<>();   // all unique tweet texts

String qry=JOptionPane.showInputDialog(new JFrame(),"Enter the Keyword: ");

try
{
String str="";
Twitter twitter1 = new TwitterFactory().getInstance();
Query query = new Query(qry);
query.setCount(200);
QueryResult result = twitter1.search(query);
for (Status status : result.getTweets())
{
String sg=status.getText().trim();
if(!alltweets.contains(sg.trim()))
{
String time=status.getCreatedAt().toString().trim();
System.out.println(sg.trim());
str=str+ time.trim() + " --> " +"@"+ status.getUser().getScreenName() + ":" +
status.getText()+"\n\n";
alltweets.add(status.getText().trim());
}
}
System.out.println("====================================================
=");
System.out.println("\t Extracted Tweets of Keyword - "+qry.trim());
System.out.println("====================================================
=");
System.out.println(str.trim());
}
catch(Exception e)
{
e.printStackTrace();
}
/*********************************************************************************/

System.out.println();
System.out.println("=========================================");
System.out.println("\t2) Extract Hashtags from Tweets");
System.out.println("=========================================");

ArrayList<String> allHashtags=new ArrayList<>();   // unique hashtags found in tweets


for(int i=0;i<alltweets.size();i++)
{
String s=alltweets.get(i).toString().trim();
String sp[]=s.trim().replaceAll("\n"," ").split(" ");
for(int j=0;j<sp.length;j++)
{
if(sp[j].trim().startsWith("#"))
{
if(!(allHashtags.contains(sp[j].trim())))
{
allHashtags.add(sp[j].trim());
System.out.println(sp[j].trim());
}
}
}
}

System.out.println();
System.out.println("=====================================");
System.out.println("\t3) Retweet Extraction");
System.out.println("=====================================");

ArrayList<String> allretweets=new ArrayList<>();   // tweets containing "RT"


for(int i=0;i<alltweets.size();i++)
{
String tweet=alltweets.get(i).toString().trim();
if(tweet.trim().contains("RT"))
{
allretweets.add(tweet.trim());
System.out.println(tweet.trim());
}
}

System.out.println();
System.out.println("===================================================")
;
System.out.println("\t4) Extract Hashtags from Retweet");
System.out.println("===================================================")
;

ArrayList<String> allHashtagsaftRe=new ArrayList<>();   // unique hashtags found in retweets

for(int i=0;i<allretweets.size();i++)
{
String s=allretweets.get(i).toString().trim();
String sp[]=s.trim().split(" ");
for(int j=0;j<sp.length;j++)
{
if(sp[j].trim().startsWith("#"))
{
String hashtag=sp[j].trim();
if(!(hashtag.trim().equals("")))
{
if(!(allHashtagsaftRe.contains(hashtag.trim())))
{
allHashtagsaftRe.add(hashtag.trim());
System.out.println(hashtag.trim());
}
}
}
}
}

System.out.println();
System.out.println("===================================================")
;
System.out.println("\t5) Calcuate Hastags Influence");
System.out.println("===================================================")
;

ArrayList<String> allWords=new ArrayList<>();   // all words from all tweets, lowercased


for(int i=0;i<alltweets.size();i++)
{
String s=alltweets.get(i).toString().trim();
String sp[]=s.trim().split(" ");
for(int j=0;j<sp.length;j++)
{
String word=sp[j].trim().replaceAll("[^\\w\\s]", "");
allWords.add(word.toLowerCase().trim());
}
}

ArrayList<Integer> allHashtagsinfluence=new ArrayList<>();   // influence count per hashtag


System.out.println("Hashtag"+"\t-->\t"+"Influence");
for(int i=0;i<allHashtagsaftRe.size();i++)
{
String hashtag=allHashtagsaftRe.get(i).toString().trim();
String topic=hashtag.trim().replaceAll("#","").replaceAll("[^\\w\\s]", "");
int influence=Collections.frequency(allWords,topic.toLowerCase().trim());
allHashtagsinfluence.add(influence);
System.out.println(hashtag.trim()+"\t-->\t"+influence);
}

System.out.println();
System.out.println("====================================================
=============================");
System.out.println("\t6) Greedy & Dynamic Blocking Algorithms (for Detect & Block
Rumours)");
System.out.println("====================================================
=============================");

long start=System.currentTimeMillis();
int Threshold=1;

ArrayList<String> secureTweets=new ArrayList<>(); // VB: tweets kept as secure (non-rumour)

for(int i=0;i<alltweets.size();i++) // Initial Edge Matrix A0


{
String tweet=alltweets.get(i).toString().trim().replaceAll("\n", " ");
//System.out.println("tweet: "+tweet);

ArrayList<String> availableHashtags=new ArrayList<>();   // hashtags in the current tweet


String sp[]=tweet.trim().split(" ");
for(int j=0;j<sp.length;j++)
{
if(sp[j].trim().contains("#"))
{
if(!(availableHashtags.contains(sp[j].trim())))
{
availableHashtags.add(sp[j].trim());
}
}
}

double val=0;
int sz=0;
for(int j=0;j<availableHashtags.size();j++)
{
String hash=availableHashtags.get(j).toString().trim();
int index=allHashtagsaftRe.indexOf(hash.trim());
if(index>=0)
{
String influ=allHashtagsinfluence.get(index).toString().trim();
double inf=Double.parseDouble(influ.trim());
if(inf>Threshold)
{
val=val+inf;
sz++;
}
}
}

//System.out.println("val: "+val);
//System.out.println("sz: "+sz);
//double totinfluence=val/(double)sz;
//System.out.println("totinfluence: "+totinfluence);
String mainResult="Secure";
if(val==0)
{
if(!(tweet.trim().contains("#")))
{
mainResult="Secure";
}
else
{
mainResult="Rumour";
}
}
//System.out.println("maniResult: "+maniResult);
System.out.println(tweet.trim()+" --> "+mainResult.trim());
if(mainResult.trim().equals("Secure"))
{
secureTweets.add(tweet.trim());
}
}

System.out.println();
System.out.println("===================================================")
;
System.out.println("\t7) Tweets after Blocking Rumours");
System.out.println("===================================================")
;

for(int i=0;i<secureTweets.size();i++)
{
String s=secureTweets.get(i).toString().trim();
System.out.println(s.trim());
}

long stop=System.currentTimeMillis();
long rumourBlockingtime=stop-start;
System.out.println();
System.out.println();
System.out.println("Rumour Blocking Time: "+rumourBlockingtime+" ms");

int rumourTweetsSize=alltweets.size()-secureTweets.size();
double infectionRatio=(double)((double)rumourTweetsSize/(double)alltweets.size())*100;
System.out.println("Infection Ratio: "+infectionRatio+" %");
}
}
